---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
<div align="center">

[📄 Paper](https://arxiv.org/abs/2403.16999) ·
[💻 Code](https://github.com/deepcs233/Visual-CoT) ·
[🤗 Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)

</div>
## 📖 About
**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content
## 🎯 Key Features
### 📊 Dataset
- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
### 🏗️ Model Architecture
- Based on **LLaVA-1.5** with custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → Question answering
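How these pieces fit together at inference time, as a minimal sketch; the class, argument, and comment names below are illustrative placeholders, not the actual VisCoT module names:

```python
import torch
import torch.nn as nn


class VisCoTStyleModel(nn.Module):
    """Illustrative LLaVA-1.5-style wiring: vision encoder -> projector -> LLM.
    All names here are placeholders for the real implementation."""

    def __init__(self, vision_encoder, projector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. CLIP ViT-L/14 at 336x336
        self.projector = projector             # maps patch features into the LLM embedding space
        self.language_model = language_model   # 7B Vicuna-based decoder

    @torch.no_grad()
    def forward(self, pixel_values, text_embeds):
        image_features = self.vision_encoder(pixel_values)    # (B, num_patches, dim)
        image_tokens = self.projector(image_features)          # project into LLM space
        # Image tokens are prepended to the text embeddings before decoding.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```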
### 🚀 Demo Features
- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand GPU allocation
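With ZeroGPU, a GPU is attached only while a decorated function runs. A minimal sketch of that pattern using the `spaces` package; the model loader and inference helper are hypothetical placeholders, not this Space's actual `app.py` code:

```python
import spaces
import torch

model = load_viscot_model()  # hypothetical startup-time loader; runs once on CPU


@spaces.GPU  # ZeroGPU attaches a GPU only while this function executes
def run_inference(image, question):
    model.to("cuda")
    with torch.no_grad():
        return model.answer(image, question)  # hypothetical inference helper
```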
## 🎨 How to Use
### Interactive Demo
1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
- Detected region of interest (bounding box)
- Step-by-step reasoning
- Final answer
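You can also query the demo programmatically with `gradio_client`. The Space ID, endpoint name, and argument order below are assumptions; run `client.view_api()` to see the real signature:

```python
from gradio_client import Client, handle_file

# Space ID is assumed; replace it with the actual "<owner>/<space>" of this demo.
client = Client("deepcs233/Visual-CoT-Demo")

# Endpoint name and parameters are illustrative only.
result = client.predict(
    handle_file("example.jpg"),
    "What is written on the sign?",
    api_name="/predict",
)
print(result)
```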
### Benchmark Explorer
- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
## 📊 Performance
| Benchmark | Detection Acc | Answer Acc | Overall |
|-----------|--------------|------------|---------|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |
## 🔬 Technical Details
### Chain-of-Thought Pipeline
```
Input: Image + Question
↓
Step 1: Detect Region of Interest (ROI)
→ Output: Bounding box [x1, y1, x2, y2]
↓
Step 2: Answer with ROI Context
→ Output: Final answer
```
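A hedged Python sketch of this two-step loop; the prompt wording and the `model.generate` helper are placeholders for the actual VisCoT inference code in the GitHub repository:

```python
from PIL import Image

# Prompt wording is illustrative; the exact ROI prompt used by VisCoT lives in the repo.
ROI_PROMPT = "Please provide the bounding box of the region that helps answer the question."


def answer_with_viscot(model, image_path, question):
    """Two-step inference sketch: ask for an ROI first, then answer with that region as context.
    `model.generate` is a hypothetical helper, not the actual VisCoT API."""
    image = Image.open(image_path).convert("RGB")

    # Step 1: detect the region of interest for this question.
    roi_text = model.generate(image, f"{question} {ROI_PROMPT}")
    bbox = parse_bbox(roi_text)  # see the parsing sketch under Model Specifications below

    # Step 2: answer the question again, conditioning on the detected region
    # (the real pipeline feeds the ROI back to the model as extra visual context).
    answer = model.generate(image, question, roi=bbox)
    return bbox, answer
```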
### Model Specifications
- **Model**: VisCoT-7b-336
- **Parameters**: 7 Billion
- **Resolution**: 336×336
- **Context Length**: 2048 tokens
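Because the model reasons at 336×336 while users upload images of arbitrary size, the predicted box usually has to be mapped back to the original resolution before it is drawn. A small sketch, assuming the model emits the box as bracketed normalized coordinates (the exact output format may differ):

```python
import re


def parse_bbox(text):
    """Extract the first "[x1, y1, x2, y2]" group from the model's ROI answer."""
    match = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", text)
    if match is None:
        return None
    return tuple(float(v) for v in match.groups())


def to_pixel_box(bbox, width, height):
    """Map normalized (0-1) coordinates to pixel coordinates of the original image."""
    x1, y1, x2, y2 = bbox
    return (int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height))


# Example: a normalized box drawn on a 1280x960 upload.
print(to_pixel_box((0.12, 0.30, 0.58, 0.71), 1280, 960))  # (153, 288, 742, 681)
```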
## 📚 Citation
If you find our work useful, please cite:
```bibtex
@article{shao2024visual,
title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2403.16999},
year={2024}
}
```
## 🔗 Resources
- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
## ⚖️ License
- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to LLaMA model license
## 🙏 Acknowledgements
This work builds upon:
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)
---
<div align="center">
Made with ❤️ by the Visual-CoT Team
</div>