---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
  - vision
  - multimodal
  - chain-of-thought
  - visual-reasoning
  - llava
  - neurips-2024
models:
  - deepcs233/VisCoT-7b-336
datasets:
  - deepcs233/Visual-CoT
---

# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2403.16999)
[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/deepcs233/Visual-CoT)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/deepcs233/Visual-CoT)
[![NeurIPS 2024](https://img.shields.io/badge/NeurIPS%202024-Spotlight-blue)](https://arxiv.org/abs/2403.16999)

</div>

## 📖 About

**Visual Chain-of-Thought (VisCoT)** equips multi-modal language models with a dataset and benchmark for visually grounded chain-of-thought reasoning. Our model can:

- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content

## 🎯 Key Features

### 📊 Dataset
- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
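
To browse the annotations locally, you can pull the released files from the Hugging Face Hub. The snippet below is a minimal sketch, assuming the annotation JSON can be read directly with the `datasets` library; the field names (`question`, `answer`, `bboxs`) are illustrative, so check the dataset card for the actual layout.

```python
# Minimal sketch: loading the Visual-CoT annotations with the `datasets` library.
# The split name and field names below are assumptions -- consult the dataset
# card at https://huggingface.co/datasets/deepcs233/Visual-CoT for the real layout.
from datasets import load_dataset

ds = load_dataset("deepcs233/Visual-CoT", split="train")  # may require a data_files= argument

sample = ds[0]
print(sample["question"])  # question about the image
print(sample["bboxs"])     # ground-truth region(s) of interest, [x1, y1, x2, y2]
print(sample["answer"])    # final answer text
```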

### 🏗️ Model Architecture
- Based on **LLaVA-1.5** with custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → Question answering

### 🚀 Demo Features
- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand ZeroGPU allocation

## 🎨 How to Use

### Interactive Demo
1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
   - Detected region of interest (bounding box)
   - Step-by-step reasoning
   - Final answer
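
The same three-step flow can also be driven programmatically. The snippet below is a sketch using `gradio_client`, assuming a recent client version (older releases pass plain file paths instead of `handle_file`); the Space ID and `api_name` are placeholders, so check the Space's "Use via API" panel for the actual endpoint and argument order.

```python
# Minimal sketch: querying the demo programmatically with gradio_client.
# The Space ID and api_name below are assumptions, not the confirmed endpoint.
from gradio_client import Client, handle_file

client = Client("deepcs233/Visual-CoT-Demo")           # hypothetical Space ID
result = client.predict(
    handle_file("street_sign.jpg"),                    # input image
    "What does the sign above the door say?",          # question
    api_name="/predict",                               # hypothetical endpoint name
)
print(result)  # bounding box, step-by-step reasoning, and final answer
```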

### Benchmark Explorer
- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions

## 📊 Performance

| Benchmark | Detection Acc | Answer Acc | Overall |
|-----------|--------------|------------|---------|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |

## 🔬 Technical Details

### Chain-of-Thought Pipeline

```
Input: Image + Question

Step 1: Detect Region of Interest (ROI)
    → Output: Bounding box [x1, y1, x2, y2]

Step 2: Answer with ROI Context
    → Output: Final answer
```
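
In code, the pipeline amounts to two generation calls over the same image. The sketch below is illustrative only: `generate()` stands in for the model's actual inference call (see the GitHub repo), and the prompt templates are assumptions rather than the exact ones used in training.

```python
# Minimal sketch of the two-step VisCoT prompting loop.
import re

def visual_cot(image, question, generate):
    # Step 1: ask the model to localise the region needed to answer.
    roi_prompt = (
        f"{question} Please provide the bounding box coordinates "
        "of the region that can help answer the question."
    )
    roi_text = generate(image, roi_prompt)             # e.g. "[0.12, 0.34, 0.56, 0.78]"
    bbox = [float(v) for v in re.findall(r"[\d.]+", roi_text)[:4]]

    # Step 2: answer the question with the detected region as extra context.
    answer_prompt = f"{question} Focus on the region {bbox}."
    answer = generate(image, answer_prompt)
    return bbox, answer
```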

### Model Specifications
- **Model**: VisCoT-7b-336
- **Parameters**: 7 Billion
- **Resolution**: 336×336
- **Context Length**: 2048 tokens
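
For local inference, the checkpoint is loaded the same way as a LLaVA-1.5 model. The snippet below is a sketch assuming the Visual-CoT repo is installed and keeps upstream LLaVA's package layout (`llava.model.builder`); argument names may differ between releases.

```python
# Minimal sketch: loading VisCoT-7b-336 with the LLaVA-style loader shipped in
# the Visual-CoT GitHub repo. Requires that repo (a LLaVA fork) to be installed.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "deepcs233/VisCoT-7b-336"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(context_len)  # 2048-token context, 336x336 input resolution
```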

## 📚 Citation

If you find our work useful, please cite:

```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```

## 🔗 Resources

- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)

## ⚖️ License

- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to LLaMA model license

## 🙏 Acknowledgements

This work builds upon:
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)

---

<div align="center">
Made with ❤️ by the Visual-CoT Team
</div>