---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
[📄 Paper](https://arxiv.org/abs/2403.16999) · [💻 Code](https://github.com/deepcs233/Visual-CoT) · [🤗 Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
## 📖 About
**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content
## 🎯 Key Features
### 📊 Dataset
- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
### 🏗️ Model Architecture
- Based on **LLaVA-1.5** with a custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → Question answering
### 🚀 Demo Features
- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
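Under the hood, ZeroGPU allocates a GPU only for the duration of each inference call through the `spaces` package. A minimal sketch of how an `app.py` can wrap its inference function this way is shown below; `run_viscot` and its dummy outputs are placeholders, not the actual code of this Space.

```python
import gradio as gr
import spaces  # Hugging Face ZeroGPU helper, available inside Spaces


@spaces.GPU  # a GPU is attached only while this function runs
def run_viscot(image, question):
    # Placeholder for the real two-step VisCoT inference.
    bbox = "[0.10, 0.20, 0.60, 0.70]"
    answer = "example answer"
    return bbox, answer


demo = gr.Interface(
    fn=run_viscot,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=[gr.Textbox(label="Detected ROI"), gr.Textbox(label="Answer")],
)

if __name__ == "__main__":
    demo.launch()
```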
## 🎨 How to Use
### Interactive Demo
1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
- Detected region of interest (bounding box)
- Step-by-step reasoning
- Final answer
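The demo can also be queried programmatically with `gradio_client`. The sketch below assumes a Space id of `deepcs233/Visual-CoT-Demo` and a single `/predict` endpoint; check this Space's "Use via API" panel for the actual id, endpoint name, and argument order.

```python
from gradio_client import Client, handle_file  # older clients accept a plain file path instead

# Space id and endpoint are assumptions -- see the "Use via API" panel for the real ones.
client = Client("deepcs233/Visual-CoT-Demo")

result = client.predict(
    handle_file("street_scene.jpg"),            # input image
    "What does the sign above the door say?",   # question
    api_name="/predict",
)
print(result)  # e.g. (detected ROI, step-by-step reasoning, final answer)
```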
### Benchmark Explorer
- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
## 📊 Performance
| Benchmark | Detection Acc | Answer Acc | Overall |
|-----------|--------------|------------|---------|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |
## 🔬 Technical Details
### Chain-of-Thought Pipeline
```
Input: Image + Question
↓
Step 1: Detect Region of Interest (ROI)
→ Output: Bounding box [x1, y1, x2, y2]
↓
Step 2: Answer with ROI Context
→ Output: Final answer
```
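A minimal sketch of the glue around this pipeline is given below, assuming the step-1 output contains a normalized `[x1, y1, x2, y2]` box and using a stand-in `viscot_generate(images, prompt)` callable in place of the actual model interface from the repo; the step-1 prompt wording is illustrative.

```python
import re
from PIL import Image


def parse_bbox(text):
    """Extract the first [x1, y1, x2, y2] box from the step-1 output."""
    nums = re.findall(r"[-+]?\d*\.?\d+", text)
    x1, y1, x2, y2 = map(float, nums[:4])
    return x1, y1, x2, y2


def crop_roi(image, bbox):
    """Crop the ROI, treating coordinates as normalized to [0, 1]."""
    w, h = image.size
    x1, y1, x2, y2 = bbox
    return image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))


def answer_with_cot(image_path, question, viscot_generate):
    """Two-step loop; `viscot_generate(images, prompt)` stands in for the
    real model call from the Visual-CoT codebase."""
    image = Image.open(image_path).convert("RGB")

    # Step 1: ask for the region that helps answer the question.
    step1 = viscot_generate(
        [image],
        f"{question} Please provide the bounding box coordinate of the "
        "region that can help you answer the question better.",
    )
    bbox = parse_bbox(step1)

    # Step 2: answer with both the full image and the ROI crop as context.
    roi = crop_roi(image, bbox)
    answer = viscot_generate([image, roi], question)
    return bbox, answer
```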
### Model Specifications
- **Model**: VisCoT-7b-336
- **Parameters**: 7 Billion
- **Resolution**: 336×336
- **Context Length**: 2048 tokens
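The released checkpoint can be pulled from the Hub with `huggingface_hub`. Loading then goes through the LLaVA-style builder shipped in the GitHub repo; the commented call below is an assumption based on LLaVA, not a verified entry point.

```python
from huggingface_hub import snapshot_download

# Download the VisCoT-7b-336 weights from the Hub.
model_path = snapshot_download(repo_id="deepcs233/VisCoT-7b-336")

# Loading (name/signature assumed from LLaVA -- check the repo for the exact call):
# from llava.model.builder import load_pretrained_model
# tokenizer, model, image_processor, ctx_len = load_pretrained_model(
#     model_path, model_base=None, model_name="VisCoT-7b-336")
```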
## 📚 Citation
If you find our work useful, please cite:
```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources
- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
## ⚖️ License
- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to LLaMA model license
## 🙏 Acknowledgements
This work builds upon:
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)
---
Made with ❤️ by the Visual-CoT Team