identity_prompt = """
You are tasked with identifying and extracting all the real object names from a caption.
An object name refers to any tangible or physical entity mentioned in the caption. Do not include adjectives or single-word descriptions that do not refer to a specific object, such as "background."
Please follow these instructions:
Identify all object names in the caption in the order they appear. Maintain the exact wording of each object name as it is in the caption, including case consistency. Output the object names in a Python list format. For example, consider the following caption:
Example 1:
"A woman in a yellow hat and dress holds a basket of roses while sitting on a stone bench in a lush garden."
Your output should be a list of object names like this:
['A woman', 'a yellow hat', 'dress', 'basket of roses', 'a stone bench', 'a lush garden']
Example 2:
"Cat in spacesuit, floating on an asteroid, fishing in the Milky Way with a fishing rod."
Your output should be a list of object names like this:
['Cat in spacesuit', 'an asteroid', 'Milky Way', 'fishing rod']
Example 3:
"A picturesque log cabin sits nestled among snow-covered trees and rocky shores at Lake Tahoe."
Your output should be a list of object names like this:
['log cabin', 'snow-covered trees', 'rocky shores', 'Lake Tahoe']
Example 4:
"A photo of a wood chair on the left of an orange snowboard on a snowy mountain."
Your output should be a list of object names like this:
['a wood chair', 'an orange snowboard', 'a snowy mountain']
Example 5:
"A cozy winter night scene in a snowy forest. Warm yellow lights glow from a cabin sitting in the center. To the front-left of the cabin's main entrance, a child is busy building a snowman under the falling snow. Near a large snow-covered pine tree to the back-right of the cabin, a deer stands watching quietly. Smoke gently rises from the chimney, and in the sky, the northern lights shimmer above the treetops."
Your output should be a list of object names like this:
['Warm yellow lights', 'cabin', 'child', 'snowman', 'large snow-covered pine tree', 'deer']


Now, given the following caption, extract the object names in the same format: <caption>
"""
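
# A minimal sketch of how the template above might be used: substitute the
# caption for the literal `<caption>` placeholder, then parse the model's reply
# back into a Python list. The helper names `fill_caption` and
# `parse_entity_list` are hypothetical and are not part of the original code.
# The parser assumes the reply contains exactly one list literal; a reply whose
# items contain unescaped apostrophes will raise an error from
# `ast.literal_eval`.

```python
import ast
from typing import List


def fill_caption(template: str, caption: str) -> str:
    """Substitute the caption for every literal `<caption>` placeholder."""
    return template.replace("<caption>", caption)


def parse_entity_list(reply: str) -> List[str]:
    """Extract the list literal from a reply and return it as a list of strings.

    Text around the brackets (for example a leading "entities = ") is ignored.
    """
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no Python list found in reply")
    # literal_eval only evaluates literals, so it never executes reply text.
    items = ast.literal_eval(reply[start:end + 1])
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("expected a list of strings")
    return items
```

# Usage (hypothetical): prompt = fill_caption(identity_prompt, caption), send
# `prompt` to the model, then call parse_entity_list on the reply text.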

gen_prompt = """
As a 3D scene layout planner, generate a quantitative 3D layout (size, position, orientation) for specified entities based on a text caption.

**Input:**
1.  A text caption describing the scene.
2.  A list of important entity names in the scene.

**Output:**
Provide a JSON object with two keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters`**: Define the primary focus area of the scene.
    * `scene_size` (meters): Characteristic dimension of the main subject/interaction area. Use a scale appropriate for foreground elements.
    * `camera_pitch_angle` (degrees): Camera's vertical viewing angle (positive = looking down).

2.  **`entity_layout`**: An array of objects, one for each entity.
    * `entity_name` (string).
    * `size` ([length, width, height] meters): Dimensions. **Scale for sufficient visibility** within the scene, not strict real-world size.
        ** When there are only 1 or 2 entities and an entity is typically small relative to the scene size, scale it prominently so that it is the primary focus and occupies a significant portion of the view. Plausible relative proportions between objects need not be maintained in this case.
        ** The values are extents along [X, Z, Y] (length along X, width along Z, height along Y) before rotation.
    * `position` ([X, Y, Z] meters): Volumetric center relative to the origin (center of focus area ground plane, Y=0). 
        ** Strictly adhere to explicit spatial relationships stated in the caption.
        ** For relationships like 'A in front of B', 'A behind B', or 'A hidden by B', ensure the primary difference is in the Z coordinate, with little or no lateral (X-axis) displacement unless implied otherwise. Leave enough distance or a slight offset (X or Y) so that the background entity is visible and not completely blocked by the foreground one.
        ** For entities central to the scene, the Z coordinate should ideally fall within `[0, scene_size]`. Background entities may be positioned outside this range.
    * `orient` (degrees): Yaw angle (rotation around Y-axis). `0` = faces -Z (towards camera), `90` = +X (right), `180` = +Z (into scene), `270` = -X (left).

**Coordinate System:** Left-handed. Origin (0,0,0) = center of focus area ground. +X=right, +Y=up, +Z=into scene. Ground at Y=0.

**Note:** Values are estimates. `scene_size` governs the central area scale; background elements might be large/distant relative to this.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch_angle": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0],
      "orient": 15
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0],
      "orient": 0
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0],
      "orient": 45
    }
  ]
}
```

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: <caption>
Entities: <entities>
"""
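
# A minimal sketch of filling the layout template and reading the reply back.
# The helper names `fill_layout_prompt` and `parse_layout` are hypothetical and
# are not part of the original code. The parser assumes the reply contains one
# JSON object, possibly wrapped in a Markdown fence or surrounded by prose, and
# checks only the keys that the prompt above asks for.

```python
import json
from typing import Any, Dict, List


def fill_layout_prompt(template: str, caption: str, entities: List[str]) -> str:
    """Fill the `<caption>` and `<entities>` placeholders of a layout template."""
    # json.dumps yields the double-quoted list style used in the prompt example.
    return (template.replace("<caption>", caption)
                    .replace("<entities>", json.dumps(entities)))


def parse_layout(reply: str) -> Dict[str, Any]:
    """Extract and lightly validate the JSON layout object from a reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    layout = json.loads(reply[start:end + 1])
    for key in ("scene_parameters", "entity_layout"):
        if key not in layout:
            raise ValueError(f"missing top-level key: {key}")
    for entity in layout["entity_layout"]:
        for field in ("size", "position"):
            if len(entity[field]) != 3:
                raise ValueError(f"{field} must have exactly three values")
    return layout
```

# Usage (hypothetical): prompt = fill_layout_prompt(gen_prompt, caption, entities),
# send `prompt` to the model, then call parse_layout on the reply text.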

gen_prompt_new = """
As a 3D scene layout planner, generate a quantitative 3D layout (size, position) for specified entities based on a text caption.

**Input:**
1.  A text caption describing the scene.
2.  A list of important entity names in the scene.

**Output:**
Provide a JSON object with two keys: `scene_parameters` and `entity_layout`.

1.  **`scene_parameters`**: Define the primary focus area of the scene.
    * `scene_size` (meters): Characteristic dimension of the main subject/interaction area. Use a scale appropriate for foreground elements.
    * `camera_pitch_angle` (degrees): Camera's vertical viewing angle (positive = looking down).

2.  **`entity_layout`**: An array of objects, one for each entity.
    * `entity_name` (string).
    * `size` ([length, width, height] meters): Dimensions. **Scale for sufficient visibility** within the scene, not strict real-world size.
        ** If an entity is typically small relative to the scene size, scale it prominently so that it is the primary focus and occupies a significant portion of the view. Plausible relative proportions between objects need not be maintained in this case.
        ** Length, width, and height should each be larger than `scene_size / 10`.
    * `position` ([X, Y, Z] meters): Volumetric center relative to the origin (center of focus area ground plane, Y=0). 
        ** Strictly adhere to explicit spatial relationships stated in the caption.
        ** For relationships like 'A in front of B', 'A behind B', or 'A hidden by B', ensure the primary difference is in the Z coordinate, with little or no lateral (X-axis) displacement unless implied otherwise. Leave enough distance or a slight offset (X or Y) so that the background entity is visible and not completely blocked by the foreground one.
        ** For entities central to the scene, the Z coordinate should ideally fall within `[0, scene_size]`. Background entities may be positioned outside this range.

**Coordinate System:** Left-handed. Origin (0,0,0) = center of focus area ground. +X=right, +Y=up, +Z=into scene. Ground at Y=0.

**Note:** Values are estimates. `scene_size` governs the central area scale; background elements might be large/distant relative to this.

# Example:

**Input:**
*   Caption: "A red sports car is parked on the street in front of a small cafe. A person is walking towards the cafe on the sidewalk."
*   Entities: ["red sports car", "small cafe", "person"]

**Output JSON:**
```json
{
  "scene_parameters": {
    "scene_size": 10,
    "camera_pitch_angle": 10
  },
  "entity_layout": [
    {
      "entity_name": "red sports car",
      "size": [4.5, 1.8, 1.4],
      "position": [-1.0, 0.7, 4.0]
    },
    {
      "entity_name": "small cafe",
      "size": [8.0, 6.0, 5.0],
      "position": [3.0, 2.5, 8.0]
    },
    {
      "entity_name": "person",
      "size": [0.5, 0.4, 1.7],
      "position": [-2.0, 0.85, 2.0]
    }
  ]
}
```

**Now, analyze the following caption and generate the quantitative 3D layout plan:**
Caption: <caption>
Entities: <entities>
"""
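
# A minimal sketch of checking a parsed layout against the size rule stated in
# `gen_prompt_new` (each dimension larger than scene_size / 10). The helper
# name `undersized_entities` is hypothetical and is not part of the original
# code; it only reports violations and leaves any rescaling policy to the
# caller.

```python
from typing import Any, Dict, List


def undersized_entities(layout: Dict[str, Any]) -> List[str]:
    """Return names of entities with any dimension not above scene_size / 10."""
    min_dim = layout["scene_parameters"]["scene_size"] / 10
    return [
        entity["entity_name"]
        for entity in layout["entity_layout"]
        if min(entity["size"]) <= min_dim
    ]
```

# Usage (hypothetical): call undersized_entities on the parsed model reply and
# re-prompt or rescale if the returned list is not empty.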
