Update README.md
README.md

Our code is based on LLaVA-NeXT; before running, please install it first:

```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
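
If the installation succeeds, the `llava` package shipped by LLaVA-NeXT should be importable. As a minimal, purely illustrative sanity check (not part of the original instructions):

```python
# Illustrative check: confirm that LLaVA-NeXT installed correctly and print the
# transformers version, since some of the errors below are version-related.
import llava
import transformers

print("llava imported from:", llava.__file__)
print("transformers version:", transformers.__version__)
```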

**Error Handling**

<details>
<summary>Click to unfold</summary>

* **Common error case 1:**
```shell
Exception: data did not match any variant of untagged enum ModelWrapper at line 757272 column 3
```
This is caused by the version of `transformers`; try updating it:
```shell
pip install -U transformers
```

* **Common error case 2:**
```shell
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```
This is a logical error encountered when loading the vision tower from a local path. To fix it, you can prepare the environment in either of the following ways.

**Option 1: Install from our fork of LLaVA-NeXT:**

```shell
pip install git+https://github.com/inst-it/LLaVA-NeXT.git
```

**Option 2: Install LLaVA-NeXT from source and manually modify its code:**
* step 1: clone the source code
```shell
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
* step 2: before installing LLaVA-NeXT, modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17):
```python
# Before modification:
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:

# After modification:
if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
```
* step 3: install LLaVA-NeXT from source:
```shell
cd LLaVA-NeXT
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

We recommend the first option because it is simpler.
</details>

</details>

**Load Model**
```python
from llava.model.builder import load_pretrained_model
from llava.constants import DEFAULT_IMAGE_TOKEN
# ...
```
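
Below is a minimal sketch of how `load_pretrained_model` is typically called in LLaVA-NeXT, assuming the imports above; the checkpoint path and model name are placeholders, and the keyword arguments are the usual LLaVA-NeXT ones rather than values taken from this README:

```python
from llava.model.builder import load_pretrained_model

# Placeholder checkpoint path -- substitute the Inst-IT checkpoint you want to load.
pretrained = "path/or/hub-id/of/the/checkpoint"
model_name = "llava_qwen"  # assumption: choose the name matching the checkpoint's base LLM

# Returns the tokenizer, model, image processor, and maximum sequence length.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,        # model_base: None when loading a full (non-LoRA) checkpoint
    model_name,
    device_map="auto",
)
model.eval()
```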

Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
# ...
```
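
For reference, here is a minimal sketch of the standard LLaVA-NeXT image-inference flow, assuming the `tokenizer`, `model`, and `image_processor` loaded above; the image URL, conversation template name, question, and generation settings are illustrative assumptions rather than values from this README:

```python
import copy

import requests
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Illustrative image URL -- replace with your own image.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image with the processor returned by load_pretrained_model.
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=model.device) for t in image_tensor]

# Build the prompt; the template name is an assumption -- use the one matching your checkpoint.
question = DEFAULT_IMAGE_TOKEN + "\nDescribe the image in detail."
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```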

You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code has no modifications except for the input image, which is visually prompted with Set-of-Marks.
Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.

```python
import torch
import requests
from PIL import Image
# ...
```

For the video, we organize each frame into a list. You can use the format \<t\> …

Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
# ...
```
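
As a small, purely illustrative sketch of what organizing frames into a list can look like (the frame directory, file pattern, and sampling below are assumptions, not values from this README):

```python
from pathlib import Path

from PIL import Image

# Illustrative: load pre-extracted frames (frame_000.jpg, frame_001.jpg, ...)
# from a placeholder directory and keep them as an ordered list of PIL images.
frame_dir = Path("demo/frames")
frame_paths = sorted(frame_dir.glob("*.jpg"))
video_frames = [Image.open(p).convert("RGB") for p in frame_paths]

# Optionally subsample to a fixed number of frames before preprocessing; the list
# can then go through the same preprocessing used for single images.
num_frames = 16
step = max(1, len(video_frames) // num_frames)
video_frames = video_frames[::step][:num_frames]
```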

You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code has no modifications except for the input video, which is visually prompted with Set-of-Marks.
Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.

```python
import torch
import requests
from PIL import Image
# ...
```

Feel free to contact us if you have any questions or suggestions.
- Email (Lingchen Meng): [email protected]

## Citation
```bibtex
@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},