---
license: mit
pipeline_tag: any-to-any
---

<p align="center">
    <img src="https://s21.ax1x.com/2025/06/03/pVCBdw8.png" width="200"/>
</p>
<h2 align="center"> 
  <a href="https://github.com/PKU-YuanGroup/UniWorld-V1/">
    UniWorld: High-Resolution Semantic Encoders for <br> Unified Visual Understanding and Generation
  </a>
</h2>



<h5 align="left">

[![arXiv](https://img.shields.io/badge/Arxiv-Report%20-b31b1b.svg?logo=arXiv)](https://github.com/user-attachments/files/20573816/report.pdf)
[![model](https://img.shields.io/badge/🤗-Model-blue.svg)](https://huggingface.co/LanguageBind/UniWorld-V1)
[![data](https://img.shields.io/badge/🤗-Dataset-blue.svg)](https://huggingface.co/datasets/LanguageBind/UniWorld-V1) 
[![License](https://img.shields.io/badge/License-Apache-yellow)](https://github.com/PKU-YuanGroup/UniWorld/blob/main/LICENSE)
[![Twitter](https://img.shields.io/badge/-Twitter@LinBin46984-black?logo=twitter&logoColor=1D9BF0)](https://x.com/LinBin46984/status/1929905024349679682) <br>
[![demo0](https://img.shields.io/badge/🤗-Demo0-blue.svg)](http://8.130.165.159:8800/)
[![demo0](https://img.shields.io/badge/🤗-Demo1-blue.svg)](http://8.130.165.159:8801/)
[![demo0](https://img.shields.io/badge/🤗-Demo2-blue.svg)](http://8.130.165.159:8802/)
[![demo0](https://img.shields.io/badge/🤗-Demo3-blue.svg)](http://8.130.165.159:8803/)
[![demo0](https://img.shields.io/badge/🤗-Demo4-blue.svg)](http://8.130.165.159:8804/)
[![demo0](https://img.shields.io/badge/🤗-Demo5-blue.svg)](http://8.130.165.159:8805/)
[![demo0](https://img.shields.io/badge/🤗-Demo6-blue.svg)](http://8.130.165.159:8806/)
[![demo0](https://img.shields.io/badge/🤗-Demo7-blue.svg)](http://8.130.165.159:8807/) <br>
[![GitHub repo stars](https://img.shields.io/github/stars/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Stars)](https://github.com/PKU-YuanGroup/UniWorld-V1/stargazers)&#160;
[![GitHub repo forks](https://img.shields.io/github/forks/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Forks)](https://github.com/PKU-YuanGroup/UniWorld-V1/network)&#160;
[![GitHub repo watchers](https://img.shields.io/github/watchers/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Watchers)](https://github.com/PKU-YuanGroup/UniWorld-V1/watchers)&#160;
[![GitHub repo size](https://img.shields.io/github/repo-size/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Repo%20Size)](https://github.com/PKU-YuanGroup/UniWorld-V1/archive/refs/heads/main.zip) <br>
[![GitHub repo contributors](https://img.shields.io/github/contributors-anon/PKU-YuanGroup/UniWorld-V1?style=flat&label=Contributors)](https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors) 
[![GitHub Commit](https://img.shields.io/github/commit-activity/m/PKU-YuanGroup/UniWorld-V1?label=Commit)](https://github.com/PKU-YuanGroup/UniWorld-V1/commits/main/)
[![Pr](https://img.shields.io/github/issues-pr-closed-raw/PKU-YuanGroup/UniWorld-V1.svg?label=Merged+PRs&color=green)](https://github.com/PKU-YuanGroup/UniWorld-V1/pulls)
[![GitHub issues](https://img.shields.io/github/issues/PKU-YuanGroup/UniWorld-V1?color=critical&label=Issues)](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/PKU-YuanGroup/UniWorld-V1?color=success&label=Issues)](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aissue+is%3Aclosed)
</h5>



# 📣 News

* **[2025.06.03]** 🤗 We release UniWorld, a unified framework for understanding, generation, and editing. All [data](https://huggingface.co/datasets/LanguageBind/UniWorld-V1), [models](https://huggingface.co/LanguageBind/UniWorld-V1), [training code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-training), and [evaluation code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-evaluation) are open-sourced. Check our [report](https://github.com/user-attachments/files/20573816/report.pdf) for more details, and feel free to **watch** 👀 this repository for the latest updates.


# 😍 Gallery

UniWorld shows excellent performance across **20+** tasks. 

UniWorld, trained on only 2.7M samples, consistently outperforms [BAGEL](https://github.com/ByteDance-Seed/Bagel) (trained on 2665M samples) on ImgEdit-Bench for image manipulation. It also surpasses the specialized image editing model [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit) across multiple dimensions of ImgEdit-Bench, including add, adjust, and extract. 

**Click to play**

<p align="left">
  <a href="https://www.youtube.com/watch?v=77U0PKH7uxs" target="_blank">
    <img src="https://github.com/user-attachments/assets/dbb2acf7-3a54-44b5-9bca-b30cb3385056" width="850" style="margin-bottom: 0.2;"/>
  </a>
</p>


<p align="left">
    <img src="https://s21.ax1x.com/2025/06/03/pVCB6ln.png" width="850" style="margin-bottom: 0.2;"/>
</p>

# 😮 Highlights

### 1. All Resources Fully Open-Sourced
- We fully open-source the models, data, and training and evaluation code to facilitate rapid community exploration of unified architectures. 

- We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, and segmentation. 

- We annotate 286K long-caption samples using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). We use GPT-4o to filter [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit), resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-sourced datasets. The details can be found [here](https://github.com/PKU-YuanGroup/UniWorld/tree/main?tab=readme-ov-file#data-details).

### 2. Contrastive Semantic Encoders as Reference Control Signals
- Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images. 

- For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.

### 3. Image Priors via VLM Encoding Without Learnable Tokens

- We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format `<instruction><image>` is particularly important.


<p align="left">
    <img src="https://s21.ax1x.com/2025/06/03/pVCB5Y4.jpg" width="850" style="margin-bottom: 0.2;"/>
</p>

# 🤗 Demo

### Gradio Web UI

We highly recommend trying out our web demo with the following command.

```bash
MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.gradio_web_server \
    --model_path ${MODEL_PATH} \
    --flux_path ${FLUX_PATH} \
    --siglip_path ${SIGLIP_PATH}
```

### CLI Inference

```bash
MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=1 python -m univa.serve.cli \
    --model_path ${MODEL_PATH} \
    --flux_path ${FLUX_PATH} \
    --siglip_path ${SIGLIP_PATH}
```

### ComfyUI

Coming soon...

# ⚙️ Requirements and Installation

1. Clone this repository and navigate to the UniWorld folder:
```
git clone https://github.com/PKU-YuanGroup/UniWorld
cd UniWorld
```
2. Install the required packages:

```
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt
```

# 🗝️ Training

### Data preparation

Download the data from [LanguageBind/UniWorld-V1](https://huggingface.co/datasets/LanguageBind/UniWorld-V1). The dataset consists of two parts: source images and annotation JSON files.
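
For example, you can fetch the dataset with `huggingface-cli` (a minimal sketch; the `data/UniWorld-V1` target directory is a placeholder, and since the full dataset is large you may want to restrict the download with `--include` patterns):

```bash
pip install -U "huggingface_hub[cli]"
# Download the dataset (add --include patterns to fetch only the subsets you need).
huggingface-cli download LanguageBind/UniWorld-V1 \
    --repo-type dataset \
    --local-dir data/UniWorld-V1
```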

Prepare a `data.txt` file in the following format:

1. The first column is the root path to the images.

2. The second column is the path to the corresponding annotation JSON file.

3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to `true` for editing data and `false` for the others.

```
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
```

We provide a simple online verification tool to check whether the paths in `data.txt` are set correctly.
```
python univa/serve/check_data.py
```

<p align="left">
    <img src="https://s21.ax1x.com/2025/05/30/pV9iP8f.png" width="850" style="margin-bottom: 0.2;"/>
</p>

### Data details

<details><summary>Text-to-Image Generation</summary><p>
    
- [BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k): We add text-to-image instructions to half of the data. [108 GB storage usage.]
- [OSP1024-286k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/OSP1024-286k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), with captions generated using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]

</p></details>

<details><summary>Image Editing</summary><p>
    
- [imgedit-724k](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main): Data is filtered using GPT-4o, retaining approximately half. [2.1 TB storage usage.]
- [OmniEdit-368k](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M): For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
- [SEED-Data-Edit-Part1-Openimages-65k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part1-Openimages): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [SEED-Data-Edit-Part2-3-12k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- [PromptfixData-18k](https://huggingface.co/datasets/yeates/PromptfixData): For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
- [StyleBooth-11k](https://huggingface.co/scepter-studio/stylebooth): For style transfer data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
- [Ghibli-36k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/Ghibli-36k): For style transfer data, images have a short side ≥ 1024 pixels. **Warning: This data has not been quality filtered.** [170 GB storage usage.]
</p></details>

<details><summary>Extract & Try-on</summary><p>
    
- [viton_hd-23k](https://huggingface.co/datasets/forgeml/viton_hd): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [deepfashion-27k](https://huggingface.co/datasets/lirus18/deepfashion): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- [shop_product-23k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/shop_product-23k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]

</p></details>

<details><summary>Image Perception</summary><p>
    
- [coco2017_caption_canny-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_canny): img->canny & canny->img [25 GB storage usage.]
- [coco2017_caption_depth-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_depth): img->depth & depth->img [8 GB storage usage.]
- [coco2017_caption_hed-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_hed): img->hed & hed->img [13 GB storage usage.]
- [coco2017_caption_mlsd-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_mlsd): img->mlsd & mlsd->img [ GB storage usage.]
- [coco2017_caption_normal-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_normal): img->normal & normal->img [10 GB storage usage.]
- [coco2017_caption_openpose-62k](https://huggingface.co/datasets/wangherr/coco2017_caption_openpose): img->pose & pose->img [2 GB storage usage.]
- [coco2017_caption_sketch-236k](https://huggingface.co/datasets/wangherr/coco2017_caption_sketch): img->sketch & sketch->img [15 GB storage usage.]
- [unsplash_canny-20k](https://huggingface.co/datasets/wtcherr/unsplash_10k_canny): img->canny & canny->img [2 GB storage usage.]
- [open_pose-40k](https://huggingface.co/datasets/raulc0399/open_pose_controlnet): img->pose & pose->img [4 GB storage usage.]
- [mscoco-controlnet-canny-less-colors-236k](https://huggingface.co/datasets/hazal-karakus/mscoco-controlnet-canny-less-colors): img->canny & canny->img [13 GB storage usage.]
- [coco2017_seg_box-448k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/coco2017_seg_box-448k): img->detection & img->segmentation (mask), instances with regions smaller than 1/100 were filtered out. We visualise masks on the original image as gt-image. [39 GB storage usage.]
- [viton_hd-11k](https://huggingface.co/datasets/forgeml/viton_hd): img->pose [1 GB storage usage.]
- [deepfashion-13k](https://huggingface.co/datasets/lirus18/deepfashion): img->pose [1 GB storage usage.]

</p></details>


### Training

#### Prepare pretrained weights
Download [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) to `$FLUX_PATH`.
Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) to `$QWENVL_PATH`. We also support other sizes of Qwen2.5-VL.
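
For example, a minimal download sketch with `huggingface-cli` (the local directories below are placeholders; FLUX.1-dev is a gated repository, so you may need to accept its license and run `huggingface-cli login` first):

```bash
FLUX_PATH="weights/FLUX.1-dev"
QWENVL_PATH="weights/Qwen2.5-VL-7B-Instruct"
# Download the two pretrained checkpoints used in the script below.
huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir ${FLUX_PATH}
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ${QWENVL_PATH}
```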

```
SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
    --origin_flux_ckpt_path $FLUX_PATH \
    --origin_qwenvl_ckpt_path $QWENVL_PATH \
    --save_path ${SAVE_PATH}
```

```
# stage1
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh
```

Download [flux-redux-siglipv2-512.bin](https://huggingface.co/LanguageBind/UniWorld-V1/resolve/main/flux-redux-siglipv2-512.bin?download=true) and set its path to `pretrained_siglip_mlp_path` in `stage2.yaml`. The weights are sourced from [ostris/Flex.1-alpha-Redux](https://huggingface.co/ostris/Flex.1-alpha-Redux); we simply re-organize them.
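
For example, a minimal download sketch (the `weights/` directory is a placeholder; the YAML line in the comment only illustrates where the path goes):

```bash
# Fetch the re-organized SigLIP MLP weights.
huggingface-cli download LanguageBind/UniWorld-V1 flux-redux-siglipv2-512.bin --local-dir weights
# Then point the config at it, e.g. in stage2.yaml:
#   pretrained_siglip_mlp_path: weights/flux-redux-siglipv2-512.bin
```
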
```
# stage2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh
```

# ⚡️ Evaluation

### Text-to-Image Generation

<details><summary>GenEval</summary><p>

```
cd univa/eval/geneval
# follow the instructions in univa/eval/geneval/README.md
```
</p></details>

<details><summary>WISE</summary><p>

```
cd univa/eval/wise
# follow the instructions in univa/eval/wise/README.md
```

</p></details>

<details><summary>GenAI-Bench</summary><p>

```
cd univa/eval/genai
# follow the instructions in univa/eval/genai/README.md
```

</p></details>

<details><summary>DPG-Bench</summary><p>

```
cd univa/eval/dpgbench
# follow the instructions in univa/eval/dpgbench/README.md
```

</p></details>

### Image Editing

<details><summary>ImgEdit</summary><p>

```
cd univa/eval/imgedit
# follow the instructions in univa/eval/imgedit/README.md
```

</p></details>

<details><summary>GEdit</summary><p>

```
cd univa/eval/gdit
# follow the instructions in univa/eval/gdit/README.md
```

</p></details>

# 📊 Benchmarks



<p align="left">
    <img src="https://s21.ax1x.com/2025/06/03/pVPFuTJ.png" width="850" style="margin-bottom: 0.2;"/>
</p>


# 💡 How to Contribute
We greatly appreciate your contributions to the UniWorld open-source community; they help make it even better than it is now!

For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).

# 👍 Acknowledgement and Related Work
* [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit): ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
* [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): An open‑source text-to-image/video foundation model, which provides a lot of caption data.
* [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit): A hybrid dataset for instruction-guided image editing.
* [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): The new flagship vision-language model of Qwen.
* [FLUX.1-Redux-dev](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev): Given an input image, FLUX.1 Redux can reproduce the image with slight variations, allowing a given image to be refined.
* [SigLIP 2](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md): New multilingual vision-language encoders.
* [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit): A state-of-the-art image editing model.
* [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o): A unified multimodal model that combines the reasoning and instruction-following strengths of autoregressive models with the generative power of diffusion models.
* [BAGEL](https://github.com/ByteDance-Seed/Bagel): An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.


# 🔒 License
* See [LICENSE](LICENSE) for details. The FLUX weights fall under the [FLUX.1 [dev] Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).


## ✨ Star History

[![Star History](https://api.star-history.com/svg?repos=PKU-YuanGroup/UniWorld)](https://star-history.com/#PKU-YuanGroup/UniWorld&Date)



# ✏️ Citing



```bibtex
@misc{lin2025uniworldhighresolutionsemanticencoders,
      title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation}, 
      author={Bin Lin and Zongjian Li and Xinhua Cheng and Yuwei Niu and Yang Ye and Xianyi He and Shenghai Yuan and Wangbo Yu and Shaodong Wang and Yunyang Ge and Yatian Pang and Li Yuan},
      year={2025},
      eprint={2506.03147},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.03147}, 
}
```


```bibtex
@article{niu2025wise,
  title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
  author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
  journal={arXiv preprint arXiv:2503.07265},
  year={2025}
}
```

```bibtex
@article{lin2024open,
  title={Open-Sora Plan: Open-Source Large Video Generation Model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}
```


# 🤝 Community contributors

<a href="https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/UniWorld-V1" />
</a>

This model is presented in the paper: [UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation](https://huggingface.co/papers/2506.03147)