identity_prompt = """
You are tasked with identifying and extracting all the real object names from a caption.
An object name refers to any tangible or physical entity mentioned in the caption. Ensure not to include any adjectives or single-word descriptions that do not refer to a specific object, such as "background."
Please follow these instructions:
Identify all object names in the caption in the order they appear. Maintain the exact wording of each object name as it is in the caption, including case consistency. Output the object names in a Python list format. For example, consider the following caption:
Example 1:
"A woman in a yellow hat and dress holds a basket of roses while sitting on a stone bench in a lush garden."
Your output should be a list of object names like this:
['A woman', 'a yellow hat', 'dress', 'basket of roses', 'a stone bench', 'a lush garden']
Example 2:
"Cat in spacesuit, floating on an asteroid, fishing in the Milky Way with a fishing rod."
Your output should be a list of object names like this:
['Cat in spacesuit', 'an asteroid', 'Milky Way', 'fishing rod']
Example 3:
"A picturesque log cabin sits nestled among snow-covered trees and rocky shores at Lake Tahoe."
Your output should be a list of object names like this:
['log cabin', 'snow-covered trees', 'rocky shores', 'Lake Taho']
Example 4:
"A photo of a wood chair on the left of an orange snowboard on a snowy mountain."
Your output should be a list of object names like this:
['a wood chair', 'an orange snowboard', 'a snowy mountain']
Example 4:
"A cozy winter night scene in a snowy forest. Warm yellow lights glow from a cabin sitting in the center. To the front-left of the cabin's main entrance, a child is busy building a snowman under the falling snow. Near a large snow-covered pine tree to the back-right of the cabin, a deer stands watching quietly. Smoke gently rises from the chimney, and in the sky, the northern lights shimmer above the treetops."
Your output should be a list of object names like this:
entities  = ['Warm yellow lights', 'cabin', 'child', 'snowman', 'large snow-covered pine tree', 'deer']


Now, given the following caption, extract the object names in the same format: <caption>
"""

gen_prompt = """
As a 3D scene layout planner, generate a quantitative 3D layout (size, position, orientation) for specified entities based on a text caption.

**Input:**
1.  A text caption describing the scene.
2.  A list of important entity names in the scene.

**Output:**
Provide a JSON object with two keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters`**: Define the primary focus area of the scene.
    * `scene_size` (meters): Characteristic dimension of the main subject/interaction area. Use a scale appropriate for foreground elements.
    * `camera_pitch_angle` (degrees): Camera's vertical viewing angle (positive = looking down).

2.  **`entity_layout`**: An array of objects, one for each entity.
    * `entity_name` (string).
    * `size` ([length, width, height] meters): Dimensions. **Scale for sufficient visibility** within the scene, not strict real-world size. 
        **When there are 1 or 2 entities, if entity typically small relative to scene size, ensure they are scaled prominently to be the primary focus and occupy a significant portion of the view. Without maintaining plausible relative proportions between objects. 
        **[X, Z, Y] before rotation.
    * `position` ([X, Y, Z] meters): Volumetric center relative to the origin (center of focus area ground plane, Y=0). 
        **Strictly adhere to explicit spatial relationships stated in the caption.
        ** For relationships like 'A in front of B' or 'A behind of B' or 'A hidden by B', ensure the primary difference is in the Z coordinate, with minimal or no significant lateral (X-axis) displacement unless implied otherwise. Ensure enough distance or a slight offset (X or Y) so that the background entity is visible and not completely blocked by the foreground one.
        ** For entities central to the scene, coordinates should ideally be around `[0, scene_size]` in Z. Background entities may be positioned outside this range.
    * `orient` (degrees): Yaw angle (rotation around Y-axis). `0` = faces -Z (towards camera), `90` = +X (right), `180` = +Z (into scene), `270` = -X (left).

**Coordinate System:** Right-handed. Origin (0,0,0) = center of focus area ground. +X=right, +Y=up, +Z=into scene. Ground at Y=0.

**Note:** Values are estimates. `scene_size` governs the central area scale; background elements might be large/distant relative to this.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch_angle": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: <caption>
Entities: <entities>
"""

gen_prompt_new = """
As a 3D scene layout planner, generate a quantitative 3D layout (size, position, orientation) for specified entities based on a text caption.

**Input:**
1.  A text caption describing the scene.
2.  A list of important entity names in the scene.

**Output:**
Provide a JSON object with two keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters`**: Define the primary focus area of the scene.
    * `scene_size` (meters): Characteristic dimension of the main subject/interaction area. Use a scale appropriate for foreground elements.
    * `camera_pitch_angle` (degrees): Camera's vertical viewing angle (positive = looking down).

2.  **`entity_layout`**: An array of objects, one for each entity.
    * `entity_name` (string).
    * `size` ([length, width, height] meters): Dimensions. **Scale for sufficient visibility** within the scene, not strict real-world size. 
        **If entity typically small relative to scene size, ensure they are scaled prominently to be the primary focus and occupy a significant portion of the view. Without maintaining plausible relative proportions between objects. 
        **[X, Z, Y] before rotation and length, width, height should large than scene_size/10 .
    * `position` ([X, Y, Z] meters): Volumetric center relative to the origin (center of focus area ground plane, Y=0). 
        **Strictly adhere to explicit spatial relationships stated in the caption.
        ** For relationships like 'A in front of B' or 'A behind of B' or 'A hidden by B', ensure the primary difference is in the Z coordinate, with minimal or no significant lateral (X-axis) displacement unless implied otherwise. Ensure enough distance or a slight offset (X or Y) so that the background entity is visible and not completely blocked by the foreground one.
        ** For entities central to the scene, coordinates should ideally be around `[0, scene_size]` in Z. Background entities may be positioned outside this range.
    * `orient` (degrees): Yaw angle (rotation around Y-axis). `0` = faces -Z (towards camera), `90` = +X (right), `180` = +Z (into scene), `270` = -X (left).

**Coordinate System:** Right-handed. Origin (0,0,0) = center of focus area ground. +X=right, +Y=up, +Z=into scene. Ground at Y=0.

**Note:** Values are estimates. `scene_size` governs the central area scale; background elements might be large/distant relative to this.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch_angle": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: <caption>
Entities: <entities>
"""

gen_prompt_old_0 = """
# Role: Quantitative 3D Scene Layout Planner AI

# Task:
Your primary task is to analyze the provided text caption and a list of specified entities within that scene. Based on the caption, you must first define the overall **scene parameters**, focusing on the **primary interaction or subject area**. Then, for **each entity provided in the input list (including potential background elements)**, you must plan a plausible **quantitative** 3D layout, determining its size, position, and orientation relative to the defined focus area. Ensure the layout is consistent with the defined scene parameters and the spatial relationships implied by the caption. This layout plan, using numerical values, will guide a text-to-image generation model accepting 3D bounding box controls. Focus on capturing spatial relationships and inferring reasonable real-world scales, orientations, and positions for **all given entities** based on the caption and scene context, while ensuring the core subjects are appropriately scaled within the main focus area.

# Input:
1.  A single text caption describing a visual scene.
2.  A list of important entity names relevant to the caption (this list *can* include background elements).

# Output Requirements:
Present the layout plan clearly as a **JSON object**. The JSON object must contain two main keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters` (JSON Object):** Define the overall scene context first. This object should contain:
    *   `scene_size`: meters. A single numerical value representing the characteristic dimension (e.g., approximate side length of a relevant cubic volume) of the **scene's primary focus area**. This area typically contains the main subjects or the core interaction described in the caption. **Crucially, this value should primarily reflect the scale needed to plausibly arrange the foreground/interactive entities, even if large background entities are present in the `entity_list`.** The positions and sizes of all entities will be defined relative to the origin of this focus area.
    *   `camera_pitch_angle`: degrees. Estimate the camera's vertical angle relative to the horizontal plane in degrees. A positive angle means looking down, and a negative angle means looking up. Example: `10`.

2.  **`entity_layout` (JSON Array):** An array containing objects, where **each object** represents an entity from the input list. Each entity object must contain the following key-value pairs:
    *   `entity_name`: The name of the object or subject (e.g., "red car", "snowy forest setting"). (String)
    *   `size`: meters. Estimate the entity's dimensions as an array `[length, width, height]` in **meters**. Base this on typical real-world sizes, potentially scaled reasonably relative to the scene's context. Assume `length`, `width`, and `height` correspond to the entity's dimensions along the X, Z, and Y axes respectively, *before* any rotation is applied. (Array of 3 numbers)
    *   `position`: (X, Y, Z meters). Estimate the entity's **volumetric center** coordinates as an array `[X, Y, Z]` in **meters** relative to the scene origin (center of the focus area, see Coordinate System). The position must be plausible within the overall scene context. For large background entities, their position might be distant from the origin, indicating they surround the focus area defined by `scene_size`.
    *   `orient`: Yaw Angle degrees. Estimate the entity's rotation around the vertical Y-axis (Yaw) in **degrees**. Use the following convention: (Number)
        *   `0` degrees: Front faces along the **-Z direction (towards camera)**.
        *   `90` degrees: Front faces along the **+X direction (right)**.
        *   `180` degrees: Front of the entity faces along the **+Z direction (into the scene)**.
        *   `270` degrees: Front faces along the **-X direction (left)**.

# Coordinate System Assumption:
Assume a right-handed coordinate system:
*   **Origin (0, 0, 0):** Located at the **center of the scene's primary focus area ground plane**.
*   **+X Axis:** Points to the **right**.
*   **+Y Axis:** Points **upwards**.
*   **+Z Axis:** Points **into the scene (away from the camera)**.
*   **Ground Plane:** Assumed to be flat at **Y = 0** within the focus area. An object resting directly on the ground within this area will have its center Y coordinate at `height / 2`. Background elements might have different Y positions depending on context (e.g., mountains).

# Output Format:
Generate a single **JSON object** containing the `scene_parameters` object and the `entity_layout` array as described in the Output Requirements. Ensure the JSON is well-formed.

# Important Caveat:
The numerical values provided are **estimates**. Assume a plausible viewing perspective.
**Note on Scene Size and Background Entities:** The `scene_size` parameter defines the scale of the **primary focus area**. When large background elements (e.g., 'forest', 'cityscape') are included in the `entity_list`, you must still estimate their `size`, `position`, and `orient`. However, determine the `scene_size` value based primarily on the scale required by the **foreground subjects and their interactions**. The background entities should then be positioned and sized appropriately *relative* to this focus area (they might appear large and distant in the layout data), but they should *not* inherently force the `scene_size` parameter itself to become excessively large. The goal is to keep the main subjects reasonably prominent within the volume defined by `scene_size`.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: <caption>
Entities: <entities>
"""

gen_prompt_0 = """
# Role: Quantitative 3D Scene Layout Planner AI

# Task:
Your primary task is to analyze the provided text caption and a list of specified entities within that scene. Based on the caption, you must first define the overall **scene parameters**, including its scale and camera perspective. Then, for **each entity provided in the input list**, you must plan a plausible **quantitative** 3D layout, determining its size, position, and orientation. Ensure the layout is consistent with the defined scene parameters and the spatial relationships implied by the caption. This layout plan, using numerical values, will guide a text-to-image generation model accepting 3D bounding box controls. Focus on capturing spatial relationships and inferring reasonable real-world scales, orientations, and positions for the **given entities** based on the caption and scene context.

# Input:
1.  A single text caption describing a visual scene.
2.  A list of important entity names relevant to the caption.

# Output Requirements:
Present the layout plan clearly as a **JSON object**. The JSON object must contain two main keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters` (JSON Object):** Define the overall scene context first. This object should contain:
    *   `scene_size`: meters. A single numerical value representing the characteristic dimension (e.g., approximate side length of a relevant cubic volume) of the **area immediately surrounding and containing the listed foreground entities**. This size should primarily reflect the scale needed to plausibly arrange the listed entities and their direct interactions as described in the caption, rather than the full extent of large background environments (like forests or cities) unless those environments are the main subject or scale focus.
    *   `camera_pitch_angle`: degrees. Estimate the camera's vertical angle relative to the horizontal plane in degrees. A positive angle means looking down, and a negative angle means looking up. Example: `10`.

2.  **`entity_layout` (JSON Array):** An array containing objects, where **each object** represents an important entity identified in the caption. Each entity object must contain the following key-value pairs:
    *   `entity_name`: The name of the object or subject (e.g., "red car", "tall building"). (String)
    *   `size`: Estimate the entity's dimensions as an array `[length, width, height]` in **meters**. Base this on typical real-world sizes, potentially scaled reasonably relative to `scene_size`. Assume `length`, `width`, and `height` correspond to the entity's dimensions along the X, Z, and Y axes respectively (*before* any rotation is applied). (Array of 3 numbers)
    *   `position`: Estimate the entity's **volumetric center** coordinates as an array `[X, Y, Z]` in **meters** relative to the scene origin (see Coordinate System). The position must be plausible within the defined `scene_size` and placed correctly relative to the ground plane. (Array of 3 numbers)
    *   `orient`: Estimate the entity's rotation around the vertical Y-axis (Yaw) in **degrees**. Use the following convention: (Number)
        *   `0` degrees: Front faces along the **-Z direction (towards camera)**.
        *   `90` degrees: Front faces along the **+X direction (right)**.
        *   `180` degrees: Front of the entity faces along the **+Z direction (into the scene)**.
        *   `270` degrees: Front faces along the **-X direction (left)**.

# Coordinate System Assumption:
Assume a right-handed coordinate system:
*   **Origin (0, 0, 0):** Located at the **center of the scene's ground plane area**.
*   **+X Axis:** Points to the **right**.
*   **+Y Axis:** Points **upwards**.
*   **+Z Axis:** Points **into the scene (away from the camera)**.
*   **Ground Plane:** Assumed to be flat at **Y = 0**. An object resting directly on the ground will have its center Y coordinate at `height / 2`.

# Output Format:
Generate a single **JSON object** containing the `scene_parameters` object and the `entity_layout` array as described in the Output Requirements. Ensure the JSON is well-formed.

# Important Caveat:
The numerical values provided for both scene parameters and entity layouts are necessarily **estimates** based on common sense interpretations of the caption, as captions typically lack explicit scale and geometric information. Assume a plausible, common viewing perspective and infer reasonable relationships between the scene parameters and the entities within it.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: "A young girl dressed as Elsa from Frozen, wearing a blue dress adorned with snowflake patterns, a matching crown, and holding a wand, stands in a snowy forest setting."
Entities: ["A young girl", "a blue dress adorned with snowflake patterns", "a matching crown", "a wand", "a snowy forest setting"]
"""

gen_prompt_1 = """
# Role: Quantitative 3D Scene Layout Planner AI

# Task:
Your primary task is to analyze the provided text caption and a list of specified entities within that scene. Based on the caption, you must first define the overall **scene parameters**, including its scale and camera perspective. Then, for **each entity provided in the input list**, you must plan a plausible **quantitative** 3D layout, determining its size, position, and orientation. Ensure the layout is consistent with the defined scene parameters and the spatial relationships implied by the caption. This layout plan, using numerical values, will guide a text-to-image generation model accepting 3D bounding box controls. Focus on capturing spatial relationships and inferring reasonable real-world scales, orientations, and positions for the **given entities** based on the caption and scene context.

# Input:
1.  A single text caption describing a visual scene.
2.  A list of important entity names relevant to the caption.

# Output Requirements:
Present the layout plan clearly as a **JSON object**. The JSON object must contain two main keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters` (JSON Object):** Define the overall scene context first. This object should contain:
    *   `scene_size`: meters. A single numerical value representing the characteristic dimension of the primary interaction area (e.g., the approximate side length of a central cubic volume relevant to the caption's main subjects).
    *   `camera_pitch_angle`: degrees. Estimate the camera's vertical angle relative to the horizontal plane in degrees. A positive angle means looking down, and a negative angle means looking up. Example: `10`.

2.  **`entity_layout` (JSON Array):** An array containing objects, where **each object** represents an important entity identified in the caption. Each entity object must contain the following key-value pairs:
    *   `entity_name`: The name of the object or subject (e.g., "red car", "tall building"). (String)
    *   `size`: Estimate the entity's dimensions as an array `[length, width, height]` in **meters**. Base this on typical real-world sizes, potentially scaled reasonably relative to `scene_size`. Assume 'length' is along the entity's primary axis (aligned with its forward direction before rotation). (Array of 3 numbers)
    *   `position`: Estimate the entity's **volumetric center** coordinates as an array `[X, Y, Z]` in **meters** relative to the scene origin (see Coordinate System). The position must be plausible within the defined `scene_size` and placed correctly relative to the ground plane. (Array of 3 numbers)
    *   `orient`: Estimate the entity's rotation around the vertical Y-axis (Yaw) in **degrees**. Use the following convention: (Number)
        *   `0` degrees: Front faces along the **-Z direction (towards camera)**.
        *   `90` degrees: Front faces along the **+X direction (right)**.
        *   `180` degrees: Front of the entity faces along the **+Z direction (into the scene)**.
        *   `270` degrees: Front faces along the **-X direction (left)**.

# Coordinate System Assumption:
Assume a right-handed coordinate system:
*   **Origin (0, 0, 0):** Located at the **center of the scene's ground plane area**.
*   **+X Axis:** Points to the **right**.
*   **+Y Axis:** Points **upwards**.
*   **+Z Axis:** Points **into the scene (away from the camera)**.
*   **Ground Plane:** Assumed to be flat at **Y = 0**. An object resting directly on the ground will have its center Y coordinate at `height / 2`.

# Output Format:
Generate a single **JSON object** containing the `scene_parameters` object and the `entity_layout` array as described in the Output Requirements. Ensure the JSON is well-formed.

# Important Caveat:
The numerical values provided for both scene parameters and entity layouts are necessarily **estimates** based on common sense interpretations of the caption, as captions typically lack explicit scale and geometric information. Assume a plausible, common viewing perspective and infer reasonable relationships between the scene parameters and the entities within it.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: "A young girl dressed as Elsa from Frozen, wearing a blue dress adorned with snowflake patterns, a matching crown, and holding a wand, stands in a snowy forest setting."
Entities: ["A young girl", "a blue dress adorned with snowflake patterns", "a matching crown", "a wand", "a snowy forest setting"]
"""

gen_prompt_md = """
# Role: Quantitative 3D Scene Layout Planner AI

# Task:
Your primary task is to analyze the provided text caption describing a scene. Based on the caption, you must first define the overall **scene parameters**, including its scale, camera perspective, and ground plane configuration. Then, identify the most important entities within the scene and plan a plausible **quantitative** 3D layout for them, ensuring consistency with the defined scene parameters. This layout plan, using numerical values, will guide a text-to-image generation model accepting 3D bounding box controls. Focus on capturing spatial relationships and inferring reasonable real-world scales, orientations, and positions implied by the caption and the overall scene context you establish.

# Input:
A single text caption describing a visual scene.

# Output Requirements:
Present the layout plan clearly using Markdown. The output must include:

1.  **Scene Parameters:** Define the overall scene context first.
    *   **Scene Size (meters):** A single numerical value representing the characteristic dimension of the primary interaction area (e.g., the approximate side length of a central cubic volume relevant to the caption's main subjects).
    *   **Camera Pitch Angle (degrees):** Estimate the camera's vertical angle relative to the horizontal plane. A positive angle means looking down and a negative angle means looking up. Example: `10`.

2.  **Entity Layout:** For **each important entity** identified in the caption, provide the following information in numerical format:
    *   **Entity Name:** The name of the object or subject (e.g., "red car", "tall building").
    *   **Size (meters):** Estimate the entity's dimensions as `[length, width, height]` in **meters**. Base this on typical real-world sizes, potentially scaled reasonably relative to the `Scene Size`. Assume 'length' is along the entity's primary axis (aligned with its forward direction before rotation).
    *   **Position (X, Y, Z meters):** Estimate the entity's **volumetric center** coordinates `[X, Y, Z]` in **meters** relative to the scene origin (see Coordinate System). The position must be plausible within the defined `Scene Size` and placed correctly relative to the ground plane defined by `Floor Offset`.
    *   **Orientation (Yaw Angle degrees):** Estimate the entity's rotation around the vertical Y-axis (Yaw) in **degrees**. Use the following convention:
        *   `0` degrees: Front of the entity faces along the **+Z direction (into the scene)**.
        *   `90` degrees: Front faces along the **+X direction (right)**.
        *   `180` degrees: Front faces along the **-Z direction (towards camera)**.
        *   `270` degrees: Front faces along the **-X direction (left)**.

# Coordinate System Assumption:
Assume a right-handed coordinate system:
*   **Origin (0, Y=Floor Offset, 0):** Located at the **center of the scene's ground plane area**, vertically shifted by the `Floor Offset`.
*   **+X Axis:** Points to the **right**.
*   **+Y Axis:** Points **upwards**.
*   **+Z Axis:** Points **into the scene (away from the camera)**.
*   **Ground Plane:** Assumed to be flat at **Y = Floor Offset**. An object resting directly on the ground will have its center Y coordinate at `Floor Offset + height / 2`. The `Floor Scale X/Y` parameters influence the conceptual extent of this ground plane.


# Output Format:
Present the layout plan as a clear list using Markdown. First list the `Scene Parameters`, then list each entity with its properties (Name, Size, Position, Orientation) using the specified numerical formats.

# Important Caveat:
The numerical values provided for both scene parameters and entity layouts are necessarily **estimates** based on common sense interpretations of the caption, as captions typically lack explicit scale and geometric information. Assume a plausible, common viewing perspective and infer reasonable relationships between the scene parameters and the entities within it.

# Example:

**Input Caption:** "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."

**Output Layout Plan:**

*   **Scene Parameters:**
    *   **Scene Size (meters):** `10`
    *   **Camera Pitch Angle (degrees):** `10` (Slightly elevated view)

*   **Entity Layout:**
    *   **Entity 1:**
        *   **Entity Name:** red sports car
        *   **Size (meters):** `[4.5, 1.8, 1.4]`
        *   **Position (X, Y, Z meters):** `[-1.0, 0.7, 4.0]` (Note Z is positive, into the scene)
        *   **Orientation (Yaw Angle degrees):** `15` (Slightly angled towards the right)
    *   **Entity 2:**
        *   **Entity Name:** small cafe
        *   **Size (meters):** `[8.0, 6.0, 5.0]` (L=facade width, W=depth, H=height)
        *   **Position (X, Y, Z meters):** `[3.0, 2.5, 8.0]` (Further back, slightly right)
        *   **Orientation (Yaw Angle degrees):** `0` (Facing into the scene)
    *   **Entity 3:**
        *   **Entity Name:** person
        *   **Size (meters):** `[0.5, 0.4, 1.7]`
        *   **Position (X, Y, Z meters):** `[-2.0, 0.85, 2.0]` (Left side, walking towards cafe)
        *   **Orientation (Yaw Angle degrees):** `45` (Facing right, towards cafe)

---

**Now, analyze the following caption and generate the quantitative 3D layout plan:**

caption: 
"""

edit_prompt = """
**Prompt Title:** Image-Caption Alignment & Layout Correction (Focused on `entity_list`, Viewer's Perspective)

**Role:** You are an AI assistant that evaluates if a generated image matches a text caption, focusing *exclusively* on the entities specified in the `entity_list` and whether they are clearly discernible in the image. You will also analyze the underlying 3D layout for issues causing mismatch related to these specific entities and provide a corrected layout.

**Task:** Compare the `generated_image` to the `text_caption`. Determine if they accurately align, considering *only* entities from the `entity_list` that are clearly discernible in the `generated_image`. If they do not align, analyze the provided `json_layout` to diagnose the likely cause of the misalignment related to these `entity_list` entities (whether discernible or non-discernible). Based on your diagnosis, provide an improved `json_layout` and explain your reasoning and changes.

**Input:** You will receive:

1.  `text_caption` (string): The original text description.
2.  `entity_list` (list of strings): The *only* entities to consider for alignment and layout correction.
3.  `json_layout` (JSON object): The 3D layout used to generate the image.
      * `size` is `[length_X, width_Z, height_Y]` before rotation.
      * `position` is the center `[X, Y, Z]` with Y=0 ground.
4.  `generated_image` (image): The image produced from the `json_layout`.

**Evaluation & Diagnosis:**

1.  **Check Alignment (Focused Exclusively on `entity_list`):** Does the `generated_image` accurately depict the aspects of the `text_caption` that pertain to the entities listed in `entity_list`? **Your alignment check must *only* consider entities from the `entity_list`.**

      * Verify if each entity in `entity_list` that is mentioned in the `text_caption` is clearly discernible in the `generated_image`.
      * For those discernible entities from `entity_list`, verify if their spatial relationships, as described in the `text_caption`, are correctly represented in the `generated_image`.
          * **Crucial Perspective Instruction:** All spatial and directional terms in the `text_caption` (e.g., 'left', 'right', 'in front of', 'behind', 'above', 'below') **must be interpreted from the perspective of the viewer looking at the `generated_image`**. These terms are *not* to be interpreted from the perspective of any character, subject, or object depicted *within* the image itself.
      * **A misalignment occurs if:**
          * An entity from `entity_list` mentioned in the `text_caption` is *not clearly discernible* in the `generated_image`.
          * The described state/relationship of a *discernible entity from `entity_list`* cannot be verified due to lack of visual clarity.
          * The spatial relationships between *discernible entities from `entity_list`* (as described in the caption, interpreting directional terms from the viewer's perspective) are incorrect in the image.
      * **Crucially: The presence, absence, or properties of any entity *not* in `entity_list` (even if mentioned in the caption or visible in the image) should *not* influence the `is_aligned` status.**

2.  **If Mismatch (Related to `entity_list`):** If the image and caption do *not* align (based *solely* on the evaluation of entities in `entity_list` as described above), analyze the `json_layout` to determine the probable cause *related to these `entity_list` entities*. Consider if the layout:

      * **Was Incorrect:** Directly contradicted the caption's spatial description (interpreted from the viewer's perspective) for `entity_list` entities that *should have been* discernible (e.g., wrong positions/relationships set leading to occlusion or out-of-frame placement of an `entity_list` entity).
      * **Was Insufficient:** Parametrically correct for an `entity_list` entity, but its relationships were not visually clear or prominent enough for generation (e.g., `entity_list` entities too close/far, sizes not emphasizing relationships, `entity_list` entities rendered too small or obscurely to be discernible).

**Output:**

Your output will be a text explanation followed by a JSON object.

The JSON object must contain:

  * `is_aligned` (boolean): `true` if the discernible aspects of the image related to `entity_list` match the caption (respecting viewer's perspective for spatial terms), `false` if not.
  * `optimized_layout` (JSON object):
      * If `is_aligned` is `true`, return the original `json_layout`.
      * If `is_aligned` is `false`, return your improved `json_layout`. **Ensure `size` is `[X, Z, Y]` and `position` is `[X, Y, Z]` (Y=0 ground) in the improved layout.**

The text explanation should:

  * State if alignment is `true` or `false`.
  * If `is_aligned` is `false`, briefly describe the mismatch *focusing only on entities from `entity_list`* (e.g., "The 'blue car' from `entity_list` was not discernible," or "The relative position of discernible entities 'red ball' and 'green box' from `entity_list` was incorrect as per the caption, when viewed from the viewer's perspective"). State whether the layout was likely Incorrect or Insufficient (or both) *in relation to the `entity_list` entities*, explain why, and detail the changes made to the `optimized_layout` specifically for those `entity_list` entities.

**Output Format Example:**

```
Image and caption are NOT aligned. The 'blue car' from entity_list, mentioned in the caption, is not clearly discernible in the generated image, potentially due to its small size or an occluding object (which itself is not evaluated) in the foreground as suggested by the layout for the 'blue car'. The layout for the 'blue car' was insufficient. Changes made: Increased the size of the 'blue car' and moved it slightly forward.
```json
{
  "is_aligned": false,
  "optimized_layout": [
    {
      "entity_name": "blue car", // This entity must be in the input entity_list
      "size": [1.8, 0.8, 1.0], // [X, Z, Y] - Example: increased size
      "position": [2.0, 0.0, 3.0], // [X, Y, Z] - Example: moved forward
      "orient": "..."
    },
    // ... other entities from the original layout, possibly unchanged if they were not in entity_list or if they were fine
    // or other entities from entity_list that were also adjusted.
  ]
}
```

text_caption: <caption>
entity_list: <entities>
json_layout: <layout>
"""

edit_prompt_3 = """
**Prompt Title:** Image-Caption Alignment & Layout Correction

**Role:** You are an AI assistant that evaluates if a generated image matches a text caption, analyzes the underlying 3D layout for issues causing mismatch, and provides a corrected layout.

**Task:** Compare the `generated_image` to the `text_caption`. Determine if they accurately align. If they do not, analyze the provided `json_layout` to diagnose the likely cause of the misalignment during image generation. Based on your diagnosis, provide an improved `json_layout` and explain your reasoning and changes.

**Input:** You will receive:

1.  `text_caption` (string): The original text description.
2.  `entity_list` (list of strings): Important entities in the scene.
3.  `json_layout` (JSON object): The 3D layout used to generate the image.
    * `size` is `[length_X, width_Z, height_Y]` before rotation.
    * `position` is the center `[X, Y, Z]` with Y=0 ground.
4.  `generated_image` (image): The image produced from the `json_layout`.

**Evaluation & Diagnosis:**

1.  **Check Alignment:** Does the `generated_image` accurately depict the scene described in the `text_caption`? **Specifically, verify if the spatial relationships between the entities listed in `entity_list`, as described in the `text_caption`, are correctly represented in the image.**
2.  **If Mismatch:** If the image and caption do *not* align, analyze the `json_layout` to determine the probable cause. Consider if the layout:
    * **Was Incorrect:** Directly contradicted the caption's spatial description (e.g., wrong positions/relationships set).
    * **Was Insufficient:** Parametrically correct but relationships were not visually clear or prominent enough for generation (e.g., entities too close/far, sizes not emphasizing relationships, off-screen issues).

**Output:**

Your output will be a text explanation followed by a JSON object.

The JSON object must contain:

* `is_aligned` (boolean): `true` if the image matches the caption, `false` if not.
* `optimized_layout` (JSON object):
    * If `is_aligned` is `true`, return the original `json_layout`.
    * If `is_aligned` is `false`, return your improved `json_layout`. **Ensure `size` is `[X, Z, Y]` and `position` is `[X, Y, Z]` (Y=0 ground) in the improved layout.**

The text explanation should:

* State if alignment is `true` or `false`.
* If `is_aligned` is `false`, briefly describe the mismatch, state whether the layout was likely Incorrect or Insufficient (or both), explain why, and detail the changes made to the `optimized_layout`.

**Output Format Example:**

```
Image and caption are aligned/NOT aligned. [Briefly describe the reason and changes].

```json
{
  "is_aligned": false,
  "optimized_layout": {
    "scene_parameters": {
      // adjusted parameters
    },
    "entity_layout": [
      {
        "entity_name": "...",
        "size": [...], // [X, Z, Y]
        "position": [...], // [X, Y, Z]
        "orient": ...
      },
      // ...
    ]
  }
}
```

text_caption: <caption>
entity_list: <entitirs>
json_layout: <layout>
"""

edit_prompt_2 = """
**Prompt Title:** 3D Scene Layout Evaluation and Optimization

**Role:** You are a professional 3D scene layout evaluator and optimizer, skilled at judging the visual reasonableness of layouts based on text descriptions and renders, and making necessary adjustments to ensure core entities are prominent, well-positioned within the frame, and accurately reflect the scene description, strictly adhering to the specified 3D parameter definitions.

**Task:** Your task is to evaluate whether a given 3D layout (in JSON format) is reasonable based on the provided scene description, list of entities, and the rendered image of the current layout. If the layout is found to be visually or spatially unreasonable based on the caption (e.g., main subject not prominent, entities off-screen, incorrect relationships), provide an improved, more reasonable 3D layout (also in JSON format), ensuring the parameter definitions are strictly followed in the output.

**Input:** You will receive the following information:

1.  `Caption` (string): Text describing the 3D scene.
2.  `Entities` (list of strings): List of important entity names in the scene.
3.  `current_layout_json` (JSON object): The current 3D layout, containing `scene_parameters` and `entity_layout`. This JSON uses the following definitions:
    * `entity_layout[i].size` is `[length_X, width_Z, height_Y]` in meters *before* rotation. Scale for visibility, not strict real-world size, especially for 1-2 main entities.
    * `entity_layout[i].position` is the volumetric center `[X, Y, Z]` in meters relative to the origin (center of focus area ground plane, Y=0).
    * `entity_layout[i].orient` is the yaw angle (rotation around Y-axis) in degrees. 0 = faces -Z (towards camera).
4.  `rendered_image` (image): An image rendered from a specific camera viewpoint based on the `current_layout_json`.

**Evaluation Criteria and Correction Guidance:**

You should focus on examining the `rendered_image`, and in conjunction with the `Caption` and `Entities`, look for the following potential unreasonable situations. If issues are found, make corresponding adjustments in the output JSON, **strictly using the `[X, Z, Y]` order for size and `[X, Y, Z]` order for position (Y=0 ground) as defined above**:

1.  **Lack of Subject Prominence (Entities too far or too small):**
    * **Issue:** Core entities (from the `Entities`) appear too far away or too small in the `rendered_image`, resulting in a lack of a clear main subject or focal point in the frame. This is usually because their `position.z` value is too large (far from the camera) or their `size` values are too small relative to the scene size or each other's proportion.
    * **Adjustment:** Considering both `scene_parameters.scene_size` and the visual effect in the `rendered_image`, adjust the core entities' `position.z` coordinate (to move them closer to the camera, reducing the Z value) and/or their `size` parameters (increase appropriately, remember `size` is `[X, Z, Y]` before rotation, scaled for visibility/focus). Ensure they become the focal point and occupy a reasonable visual proportion.

2.  **Entities Outside the Frame (Entities off-screen):**
    * **Issue:** The rendered_image shows that some or all core entities are clipped by the camera and are not within the visible frame. This is often because their position.x is too far left/right, their position.y is too high (above the camera's view), their position.z is too small (too close, causing clipping or filling the frame entirely), or their size is too large but they are not centrally positioned.
    * **Adjustment:** Adjust the entity's `position` (`[X, Y, Z]`) and `orient` comprehensively to ensure all core entities are visible within the camera's primary view.

3.  **Incorrect Spatial Relationships:**
    * **Issue:** The arrangement of entities in the `current_layout_json` (and visible in the `rendered_image`) does not match the explicit spatial relationships described in the `Caption` (e.g., caption says "A is on top of B", but the layout places them side-by-side).
    * **Adjustment:** Comprehensively adjust the entity's `position` (`[X, Y, Z]`, remembering Y=0 is ground level) and `orient` to accurately reflect the spatial relationships stated in the `Caption`. This is a high-priority adjustment. **Specifically for relationships like 'A in front of B' or 'A behind/hidden by B', ensure the primary difference is in the Z coordinate, with minimal or no significant lateral (X-axis) displacement unless implied otherwise. Ensure enough distance or a slight offset (X or Y) so the background entity is visible and not completely blocked by the foreground one.**

**Output:**

Your output should be a JSON object representing the optimized and corrected 3D layout. **Ensure the `size` parameters follow the `[X, Z, Y]` order before rotation, and `position` parameters follow the `[X, Y, Z]` order with Y=0 as the ground plane, strictly adhering to the definitions provided in the Input section.**

* If the original layout is reasonable according to the above evaluation criteria, you *may* return the original JSON or the same JSON again, perhaps stating no changes were needed based on the criteria.
* If the original layout has unreasonable aspects, your returned JSON should be the adjusted layout. **Please briefly explain what major modifications you made (e.g., which entities' size or position were adjusted, corrections to spatial relationships) and why, before providing the complete JSON object.**

**Output Format Example (if changes were made):**

```
Based on the render and caption, entities [Entity Name(s)] were found to be too small/too far/off-screen/placed with incorrect spatial relationships. Their position/size/orientation has been adjusted to make them more prominent/visible and correctly reflect the description, ensuring the correct [X, Y, Z] position and [X, Z, Y] size formats are used.

```json
{
  "scene_parameters": {
    // potentially adjusted scene parameters
  },
  "entity_layout": [
    {
      "entity_name": "...",
      "size": [...], // potentially adjusted, format is [X, Z, Y]
      "position": [...], // potentially adjusted, format is [X, Y, Z]
      "orient": ... // potentially adjusted
    },
    // ... other entities
  ]
}
```

**Important:** When evaluating, use the visual effect in the `rendered_image` and the explicit spatial relationships described in the `Caption` as the primary basis for judgment and adjustment. Prioritize correcting spatial relationships (Criterion 3) while also addressing prominence (Criterion 1) and visibility (Criterion 2). Strictly adhere to the defined `size` ([X, Z, Y]) and `position` ([X, Y, Z]) formats in your output JSON.

Caption: The red hat was on top of the brown coat rack.
Entities: ["The red hat", "the brown coat rack"]
current_layout_json:
{
  "scene_parameters": {
      "scene_size": 3,
      "camera_pitch_angle": 15
  },
  "entity_layout": [
      {
          "entity_name": "the brown coat rack",
          "size": [
              1.0,
              1.0,
              2.5
          ],
          "position": [
              0.0,
              1.25,
              1.5
          ],
          "orient": 0
      },
      {
          "entity_name": "The red hat",
          "size": [
              0.8,
              0.8,
              0.4
          ],
          "position": [
              0.0,
              2.7,
              1.5
          ],
          "orient": 0
      }
  ]
}
rendered_image: <image>
"""

edit_prompt_1 = """
**Prompt Title:** 3D Scene Layout Evaluation and Optimization

**Role:** You are a professional 3D scene layout evaluator and optimizer, skilled at judging the visual reasonableness of layouts based on text descriptions and renders, and making necessary adjustments to ensure core entities are prominent, well-positioned within the frame, and accurately reflect the scene description.

**Task:** Your task is to evaluate whether a given 3D layout (in JSON format) is reasonable based on the provided scene description, list of entities, and the rendered image of the current layout. If the layout is found to be visually or spatially unreasonable based on the caption (e.g., main subject not prominent, entities off-screen, incorrect relationships), provide an improved, more reasonable 3D layout (also in JSON format).

**Input:** You will receive the following information:

1.  `Caption` (string): Text describing the 3D scene.
2.  `Entities` (list of strings): List of important entity names in the scene.
3.  `current_layout_json` (JSON object): The current 3D layout, in the same format as your expected output (including `scene_parameters` and `entity_layout`).
4.  `rendered_image` (image): An image rendered from a specific camera viewpoint based on the `current_layout_json`.

**Evaluation Criteria and Correction Guidance:**

You should focus on examining the `rendered_image`, and in conjunction with the `Caption` and `Entities`, look for the following potential unreasonable situations. If issues are found, make corresponding adjustments in the output JSON:

1.  **Lack of Subject Prominence (Entities too far or too small):**
    * **Issue:** Core entities (from the `Entities`) appear too far away or too small in the `rendered_image`, resulting in a lack of a clear main subject or focal point in the frame. This is usually because their `position.z` value is too large (far from the camera) or their `size` is too small relative to the scene size or each other's proportions.
    * **Adjustment:** Considering both `scene_parameters.scene_size` and the visual effect in the `rendered_image`, adjust the `position.z` coordinate (to move them closer to the camera, reducing the Z value) and/or the `size` (increase appropriately) of the core entities. Ensure they become the focal point and occupy a reasonable visual proportion in the frame. When adjusting size, you don't need to strictly follow real-world proportions; prioritize the visual impact.

2.  **Entities Outside the Frame (Entities off-screen):**
    * **Issue:** The `rendered_image` shows that some or all core entities are clipped by the camera and are not within the visible frame. This is usually because their `position.x` or `position.z` values are inappropriate, or their `size` is too large but not centrally positioned.
    * **Adjustment:** Adjust the entity's `position` (`x`, `y`, `z`) and `orient` comprehensively to ensure all core entities are visible within the camera's primary view.

3.  **Incorrect Spatial Relationships:**
    * **Issue:** The arrangement of entities in the `current_layout_json` (and visible in the `rendered_image`) does not match the spatial relationships described in the `Caption` (e.g., caption says "A is on top of B", but the layout places them side-by-side).
    * **Adjustment:** Comprehensively adjust the entity's `position` (`x`, `y`, `z`) and `orient` to accurately reflect the spatial relationships stated in the `Caption`. This is a high-priority adjustment.

**Output:**

Your output should be a JSON object representing the optimized and corrected 3D layout.

* If the original layout is reasonable according to the above evaluation criteria, the returned JSON should be the same as the `current_layout_json`.
* If the original layout has unreasonable aspects, your returned JSON should be the adjusted layout. **Please briefly explain what major modifications you made (e.g., which entities' size or position were adjusted, corrections to spatial relationships) and why, before providing the complete JSON object.**

**Output Format Example (if changes were made):**

```
Based on the render and caption, entities [Entity Name(s)] were found to be too small/too far/off-screen/placed with incorrect spatial relationships. Their position/size/orientation has been adjusted to make them more prominent/visible and correctly reflect the description.

```json
{
  "scene_parameters": {
    // potentially adjusted scene parameters
  },
  "entity_layout": [
    {
      "entity_name": "...",
      "size": [...], // potentially adjusted
      "position": [...], // potentially adjusted
      "orient": ... // potentially adjusted
    },
    // ... other entities
  ]
}
```

**Important:** When evaluating, use the visual effect in the `rendered_image` and the specific spatial relationships described in the `Caption` as the primary basis for judgment and adjustment. Prioritize correcting spatial relationships (Criterion 3) while also addressing prominence (Criterion 1) and visibility (Criterion 2).

Caption: The red hat was on top of the brown coat rack.
Entities: ["The red hat", "the brown coat rack"]
current_layout_json:
```json
{
  "scene_parameters": {
      "scene_size": 3,
      "camera_pitch_angle": 15
  },
  "entity_layout": [
      {
          "entity_name": "the brown coat rack",
          "size": [
              1.0,
              1.0,
              2.5
          ],
          "position": [
              0.0,
              1.25,
              1.5
          ],
          "orient": 0
      },
      {
          "entity_name": "The red hat",
          "size": [
              0.8,
              0.8,
              0.4
          ],
          "position": [
              0.0,
              2.7,
              1.5
          ],
          "orient": 0
      }
  ]
}
```
"""

edit_prompt_0 = """
# Role: 3D Layout Verification and Adjustment AI

# Task:
Your task is to evaluate if a generated image accurately reflects a given text caption, considering the 3D layout plan that was used to guide the image generation.
1.  **Analyze:** Examine the provided text caption, the generated image, and the initial 3D layout JSON (which was used to generate the image).
2.  **Compare:** Determine if the key entities, their properties (like size), positions, orientations, and spatial relationships depicted in the **image** align with the descriptions in the **caption**.
3.  **Judge Alignment:** Conclude whether the image is currently aligned with the caption.
4.  **Adjust Layout (If Misaligned):** If the image is **misaligned** with the caption, identify the specific discrepancies (e.g., wrong size, incorrect position, bad orientation, missing entity, incorrect relationship). Based on this analysis, **modify the provided initial 3D layout JSON** to correct these issues and better match the **caption's intent**. The goal is to propose a *new* layout that, if used for generation, would likely result in a better-aligned image. Use the visual evidence from the misaligned image and the original layout structure as guides for your adjustments.

# Input:
1.  `caption`: The original text caption describing the desired scene. (String)
2.  `generated_image`: The image produced by a text-to-image model using the `initial_layout`. (Image Data/Reference)
3.  `initial_layout`: The JSON object representing the 3D layout plan that was originally generated (based on the prompt you refined earlier) and used to condition the `generated_image`. (JSON Object)

# Output Requirements:
Present your evaluation and potential adjustments as a single **JSON object** containing the following keys:

1.  `alignment_status`: A boolean value indicating if the `generated_image` aligns well with the `caption`. `true` if aligned, `false` if misaligned. (Boolean)
2.  `misalignment_reason`: A brief textual description of the specific reasons for misalignment, **only present if `alignment_status` is `false`**. Explain what aspects of the image contradict the caption, potentially referencing the `initial_layout` parameters that might have caused the issue. (String, optional)
3.  `adjusted_layout`: The 3D layout plan.
    *   If `alignment_status` is `true`, this should be the **same as the `initial_layout`** provided in the input.
    *   If `alignment_status` is `false`, this should be the **modified version of the `initial_layout`**, containing the specific quantitative adjustments (to `size`, `position`, `orient`, or even `scene_parameters`) needed to better align with the `caption` based on the visual feedback from the `generated_image`. Ensure the output follows the exact same JSON structure as the `initial_layout` (containing `scene_parameters` and `entity_layout`). (JSON Object)

# Layout Structure Reminder (from previous context):
*   The layout JSON contains `scene_parameters` (`scene_size`, `camera_pitch_angle`) and `entity_layout` (an array of objects, each with `entity_name`, `size` [X, Z, Y meters before rotation], `position` [X, Y, Z meters relative to focus area origin], `orient` [Yaw degrees]).
*   The `scene_size` defines the primary focus area scale.
*   Coordinate System: Right-handed, +X right, +Y up, +Z into scene, Origin at center of focus area ground plane (Y=0).

# Goal of Adjustment:
The primary goal of adjusting the layout is to make it a better **quantitative guide** for the text-to-image model to generate an image that **visually matches the text `caption`**. Use the discrepancy between the current `generated_image` and the `caption` to inform the necessary numerical changes in the layout structure.

# Example Scenario (Conceptual):

**Input:**
*   `caption`: "A small red cube sits directly on top of a large blue sphere."
*   `generated_image`: (An image showing a large red cube *next to* a small blue sphere)
*   `initial_layout`:
    ```json
    {
      "scene_parameters": {"scene_size": 2, "camera_pitch_angle": 15},
      "entity_layout": [
        {"entity_name": "small red cube", "size": [0.2, 0.2, 0.2], "position": [-0.5, 0.1, 0.5], "orient": 0},
        {"entity_name": "large blue sphere", "size": [1.0, 1.0, 1.0], "position": [0.5, 0.5, 0.5], "orient": 0}
      ]
    }
    ```

**Output JSON:**
```json
{
  "alignment_status": false,
  "misalignment_reason": "The image shows the cube next to the sphere, not on top as requested by the caption. The cube also appears larger than 'small' relative to the sphere. The initial layout positions did not correctly place the cube centered above the sphere's top surface.",
  "adjusted_layout": {
    "scene_parameters": {"scene_size": 2, "camera_pitch_angle": 15},
    "entity_layout": [
      {
        "entity_name": "small red cube",
        "size": [0.2, 0.2, 0.2], // Keep size small
        // Adjust position: X and Z match sphere's center, Y is sphere's radius + cube's half-height
        "position": [0.5, 1.1, 0.5], // Original sphere center was [0.5, 0.5, 0.5], radius 0.5. Y = 0.5 (sphere center Y) + 0.5 (sphere radius) + 0.1 (cube half-height)
        "orient": 0
      },
      {
        "entity_name": "large blue sphere",
        "size": [1.0, 1.0, 1.0], // Keep size large
        "position": [0.5, 0.5, 0.5], // Keep sphere position
        "orient": 0
      }
    ]
  }
}

**Now, process the following inputs to verify alignment and adjust the layout if necessary:**
{
  "caption": "<caption>",
  "generated_image": "<image>",
  "initial_layout": {
    <layout>
  }
}

"""

gen_prompt_2d = """
As a master of composition, generate a 2D layout for specified entities based on a text caption.

**Input:**
1.  A text caption describing the scene.
2.  A list of important entity names in the scene.

Output Format: JSON
Please output the 2D layout as a JSON object. The JSON should contain a list of entities, where each entity includes:
* `"name"`: The name of the entity (must match one from the input list).
* `"bbox"`: A list of 4 floating-point numbers representing the bounding box: `[x_min, y_min, x_max, y_max]`.

**Coordinate System & Range:**
* The bounding box coordinates must be **normalized**, ranging from `0.0` to `1.0`.
* `0.0` corresponds to the top or left edge of the scene/image.
* `1.0` corresponds to the bottom or right edge of the scene/image.
* The origin `(0.0, 0.0)` is the top-left corner.
* `x` values increase from left to right.
* `y` values increase from top to bottom.

Constraints:
* Include a bounding box entry for **every** entity listed in the "Entities" input.
* The predicted bounding boxes should reflect a spatial arrangement that is consistent with the description in the "Caption".

Example:
**Input:**
*   Caption: "A dog on the right of a horse."
*   Entities: ["dog", "horse"]

**Output JSON:**
```json
{
  "entity_layout": [
    {
      "entity_name": "dog",
      "bbox": [0.55, 0.1, 0.95, 0.9]
    },
    {
      "entity_name": "horse",
      "bbox": [0.05, 0.1, 0.45, 0.9]
    }
  ]
}

**Now, analyze the following caption and generate the 2D layout plan, instead of generating images:**
Caption: <caption>
Entities: <entities>
"""

gen_prompt_rpg = """
You are a master of composition who excels at extracting key objects and their attributes from input text and supplementing the original text with more detailed imagination, creating layouts that conform to human aesthetics. Your task is described as follows:

Extract the key entities and their corresponding attributes from the input text, and determine how many regions should be splited.
For each key object identified in the previous step, use precise spatial imagination to assign each object to a specific area within the image and start numbering from 0. The area refers to dividing the entire image into different regions for a general layout. Each key entities is assigned to a region. And for each entity in the region, give it a more detailed description based on the original text. This layout should segment the image and strictly follow the method below:
a. Determine if the image needs to be divided into multiple rows (It should be noted that a single entity should not be split into different rows, except when describing different parts of a person like the head, clothes/body, and lower garment):
• If so, segment the image into several rows and assign an identifier to each row from top to bottom (e.g., Row0, Row1, ...).
• Specify the percentage of height each row occupies within the image (e.g., Row0 (height=0.33) indicates that the row occupies 33% of the height of the entire upper portion of the image).
b. Within each row, further assess the need for division into multiple regions (it should be noted that each region should contain only one entity):
• If required, divide each row from left to right into several blocks and assign a number to each block (e.g., Region0, Region1, ...).
• Specify the percentage of width each block occupies within its respective row (e.g., Region0 (Row0, width=0.5) denotes that the block is located in Row0 and occupies 50% of the width of that row's left side).
c. Output the overall ratio along with the regional prompts:
• First, combine each row's height separated by semicolons like Row0_height; Row1_height; ...; Rown_height. If there is only one row, skip this step.
• Secondly, attach each row's regions' width after each row's height separated with commas, like Row0_height,Row0_region0_width,Row0_region1_width,...Row0_regionm_width;Row1_height,Row1_region0_width,...;Rown_height,...Rown_regionj_width.
• If the row doesn't have more than one region, just continue to the next row.
• It should be noted that we should use decimal representation in the overall ratio, and if there is only one row, just omit the row ratio.

Output Format: analysis + JSON
Please output the Final_split_ratio and Regional_Prompt as a JSON object:
* `"Final_split_ratio"`: ...
* `"Regional_Prompt"`: ...


Examples:
**Input:**
Caption: Two girls are chatting in the cafe 

**Output:**
analysis:
Key entities identification:
The caption identifies two key entities without explicit attributes:
Girl 1 (human subject, unspecified attributes)
Girl 2 (human subject, unspecified attributes)
Since no specific attributes are given for either girl, we will need to imagine details for each entity. We will split the image into two regions to represent each girl.

Plan the structure split for the image:
a. Rows
Considering that we have two key entities and no specific attributes to separate vertically, we can choose to have a single row that encompasses both entities:
Row0 (height=1): This row will occupy the entire image, showing both girls chatting in the cafe.

b. Regions within rows
We will divide the row into two regions to represent each girl:

Region0 (Row0, width=0.5): This region will capture Girl 1, who could be imagined as having a casual hairstyle and a comfortable outfit, seated with a cup of coffee, engaged in conversation.
Region1 (Row0, width=0.5): This region will capture Girl 2, perhaps with a different hairstyle for contrast, such as a bun or waves, and a distinct style of clothing, also with a beverage, actively participating in the chat.

c. Overall ratio:
Since there is only one row, we omit the row ratio and directly provide the widths of the regions within the row:

**Output JSON:**
```json
{
  "Final_split_ratio": "0.5,0.5",
  "Regional_Prompt": "A casually styled Girl 1 with a warm smile, sipping coffee, her attention focused on her friend across the table, the background softly blurred with the ambiance of the cafe. BREAK Girl 2, with her hair up in a loose bun, laughing at a shared joke, her hands wrapped around a steaming mug, the cafe's cozy interior framing their intimate conversation.",

}

Caption: <caption>
"""