---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
[📄 Paper](https://arxiv.org/abs/2403.16999) · [💻 Code](https://github.com/deepcs233/Visual-CoT) · [🤗 Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
## 📖 About
**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content
## 🎯 Key Features
### 📊 Dataset
- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
### 🏗️ Model Architecture
- Based on **LLaVA-1.5** with a custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → Question answering
### 🚀 Demo Features
- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
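Under the hood, ZeroGPU allocates a GPU only for the duration of each inference call through the `spaces` package. A minimal sketch of how an `app.py` can wrap its inference function this way is shown below; `run_viscot` and its dummy outputs are placeholders, not the actual code of this Space.

```python
import gradio as gr
import spaces  # Hugging Face ZeroGPU helper, available inside Spaces


@spaces.GPU  # a GPU is attached only while this function runs
def run_viscot(image, question):
    # Placeholder for the real two-step VisCoT inference.
    bbox = "[0.10, 0.20, 0.60, 0.70]"
    answer = "example answer"
    return bbox, answer


demo = gr.Interface(
    fn=run_viscot,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=[gr.Textbox(label="Detected ROI"), gr.Textbox(label="Answer")],
)

if __name__ == "__main__":
    demo.launch()
```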
## 🎨 How to Use
### Interactive Demo
1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
- Detected region of interest (bounding box)
- Step-by-step reasoning
- Final answer
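The demo can also be queried programmatically with `gradio_client`. The sketch below assumes a Space id of `deepcs233/Visual-CoT-Demo` and a single `/predict` endpoint; check this Space's "Use via API" panel for the actual id, endpoint name, and argument order.

```python
from gradio_client import Client, handle_file  # older clients accept a plain file path instead

# Space id and endpoint are assumptions -- see the "Use via API" panel for the real ones.
client = Client("deepcs233/Visual-CoT-Demo")

result = client.predict(
    handle_file("street_scene.jpg"),            # input image
    "What does the sign above the door say?",   # question
    api_name="/predict",
)
print(result)  # e.g. (detected ROI, step-by-step reasoning, final answer)
```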
### Benchmark Explorer
- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
## 📊 Performance
| Benchmark | Detection Acc | Answer Acc | Overall |
|-----------|--------------|------------|---------|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |
## 🔬 Technical Details
### Chain-of-Thought Pipeline
```
Input: Image + Question
↓
Step 1: Detect Region of Interest (ROI)
→ Output: Bounding box [x1, y1, x2, y2]
↓
Step 2: Answer with ROI Context
→ Output: Final answer
```
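A minimal sketch of the glue around this pipeline is given below, assuming the step-1 output contains a normalized `[x1, y1, x2, y2]` box and using a stand-in `viscot_generate(images, prompt)` callable in place of the actual model interface from the repo; the step-1 prompt wording is illustrative.

```python
import re
from PIL import Image


def parse_bbox(text):
    """Extract the first [x1, y1, x2, y2] box from the step-1 output."""
    nums = re.findall(r"[-+]?\d*\.?\d+", text)
    x1, y1, x2, y2 = map(float, nums[:4])
    return x1, y1, x2, y2


def crop_roi(image, bbox):
    """Crop the ROI, treating coordinates as normalized to [0, 1]."""
    w, h = image.size
    x1, y1, x2, y2 = bbox
    return image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))


def answer_with_cot(image_path, question, viscot_generate):
    """Two-step loop; `viscot_generate(images, prompt)` stands in for the
    real model call from the Visual-CoT codebase."""
    image = Image.open(image_path).convert("RGB")

    # Step 1: ask for the region that helps answer the question.
    step1 = viscot_generate(
        [image],
        f"{question} Please provide the bounding box coordinate of the "
        "region that can help you answer the question better.",
    )
    bbox = parse_bbox(step1)

    # Step 2: answer with both the full image and the ROI crop as context.
    roi = crop_roi(image, bbox)
    answer = viscot_generate([image, roi], question)
    return bbox, answer
```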
### Model Specifications
- **Model**: VisCoT-7b-336
- **Parameters**: 7 Billion
- **Resolution**: 336×336
- **Context Length**: 2048 tokens
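The released checkpoint can be pulled from the Hub with `huggingface_hub`. Loading then goes through the LLaVA-style builder shipped in the GitHub repo; the commented call below is an assumption based on LLaVA, not a verified entry point.

```python
from huggingface_hub import snapshot_download

# Download the VisCoT-7b-336 weights from the Hub.
model_path = snapshot_download(repo_id="deepcs233/VisCoT-7b-336")

# Loading (name/signature assumed from LLaVA -- check the repo for the exact call):
# from llava.model.builder import load_pretrained_model
# tokenizer, model, image_processor, ctx_len = load_pretrained_model(
#     model_path, model_base=None, model_name="VisCoT-7b-336")
```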
## 📚 Citation
If you find our work useful, please cite:
```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources
- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
## ⚖️ License
- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to LLaMA model license
## 🙏 Acknowledgements
This work builds upon:
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)
---
Made with ❤️ by the Visual-CoT Team