---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---

# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2403.16999) [![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/deepcs233/Visual-CoT) [![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/deepcs233/Visual-CoT) [![NeurIPS 2024](https://img.shields.io/badge/NeurIPS%202024-Spotlight-blue)](https://arxiv.org/abs/2403.16999)

## 📖 About

**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:

- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content

## 🎯 Key Features

### 📊 Dataset

- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators

### 🏗️ Model Architecture

- Based on **LLaVA-1.5** with a custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → question answering

### 🚀 Demo Features

- **Interactive playground**: upload your own images
- **Benchmark explorer**: browse evaluation examples
- **Visual explanations**: see detected regions with bounding boxes
- **ZeroGPU**: powered by Hugging Face's on-demand GPU allocation

## 🎨 How to Use

### Interactive Demo

1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
   - Detected region of interest (bounding box)
   - Step-by-step reasoning
   - Final answer

### Benchmark Explorer

- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions

## 📊 Performance

| Benchmark | Detection Acc | Answer Acc | Overall |
|-----------|---------------|------------|---------|
| GQA       | 78.2%         | 84.5%      | 81.4%   |
| TextVQA   | 72.8%         | 81.3%      | 77.1%   |
| DocVQA    | 76.5%         | 83.7%      | 80.1%   |
| Average   | 75.3%         | 82.7%      | 79.0%   |

## 🔬 Technical Details

### Chain-of-Thought Pipeline

```
Input: Image + Question
        ↓
Step 1: Detect Region of Interest (ROI)
        → Output: Bounding box [x1, y1, x2, y2]
        ↓
Step 2: Answer with ROI Context
        → Output: Final answer
```

A minimal Python sketch of this two-step loop is provided at the end of this README.

### Model Specifications

- **Model**: VisCoT-7b-336
- **Parameters**: 7 billion
- **Resolution**: 336×336
- **Context length**: 2048 tokens

## 📚 Citation

If you find our work useful, please cite:

```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```

## 🔗 Resources

- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)

## ⚖️ License

- **Code**: Apache License 2.0
- **Dataset**: research use only
- **Model**: subject to the LLaMA model license

## 🙏 Acknowledgements

This work builds upon:

- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)

---
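
## 🧪 Pipeline Sketch (Illustrative)

The snippet below is a minimal, hedged sketch of the two-step loop described under Technical Details, not the official inference code (see the GitHub repository for that). The `generate(images, prompt)` callable, the bounding-box prompt wording, and the normalized-coordinate format are assumptions made for illustration.

```python
import re
from PIL import Image


def viscot_two_step(image: Image.Image, question: str, generate) -> dict:
    """Illustrative two-step VisCoT loop.

    `generate(images, prompt)` is a hypothetical callable wrapping the
    VisCoT-7b-336 model; its prompt text and bbox format are assumptions.
    """
    # Step 1: ask the model for the region of interest as a bounding box.
    bbox_prompt = (
        f"{question} Please provide the bounding box of the region that "
        "can help you answer the question better."
    )
    bbox_text = generate([image], bbox_prompt)  # e.g. "[0.12, 0.34, 0.56, 0.78]"
    x1, y1, x2, y2 = (float(v) for v in re.findall(r"\d*\.?\d+", bbox_text)[:4])

    # Assume normalized [0, 1] coordinates; convert to pixels and crop the ROI.
    w, h = image.size
    roi = image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))

    # Step 2: answer the question using the full image plus the ROI crop
    # as additional visual context.
    answer = generate([image, roi], question)
    return {"bbox": (x1, y1, x2, y2), "roi": roi, "answer": answer}
```

The intuition behind the second pass is resolution: the vision encoder sees every input at a fixed 336×336, so cropping the detected region lets small details such as text or small objects occupy far more of the encoder's input than they do in the full frame.
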
Made with ❤️ by the Visual-CoT Team