---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
  - vision
  - multimodal
  - chain-of-thought
  - visual-reasoning
  - llava
  - neurips-2024
models:
  - deepcs233/VisCoT-7b-336
datasets:
  - deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
<div align="center">

[📄 Paper](https://arxiv.org/abs/2403.16999) | [💻 Code](https://github.com/deepcs233/Visual-CoT) | [🤗 Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)

</div>
## 📖 About

**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:

- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content
## 🎯 Key Features

### 📊 Dataset

- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
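The annotation files can be pulled straight from the Hub. A minimal sketch using `huggingface_hub`; the file layout inside the snapshot is an assumption, so consult the dataset card for the actual structure:

```python
# Minimal sketch: mirror the Visual-CoT dataset repository locally.
# NOTE: the internal file layout is an assumption -- see the dataset card
# at https://huggingface.co/datasets/deepcs233/Visual-CoT for the real structure.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="deepcs233/Visual-CoT", repo_type="dataset")
print("Dataset files downloaded to:", local_dir)
```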
### 🏗️ Model Architecture

- Based on **LLaVA-1.5** with a custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → question answering
### 🚀 Demo Features

- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
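For readers curious how ZeroGPU fits into a Gradio Space, here is a minimal sketch; the function body and component choices are placeholders, not the actual `app.py`:

```python
# Sketch of ZeroGPU usage in a Gradio Space (placeholder logic, not the real app.py).
import gradio as gr
import spaces  # available inside Hugging Face ZeroGPU Spaces


@spaces.GPU  # a GPU is attached only while this function executes
def answer_question(image, question):
    # Placeholder: the real app runs VisCoT inference here.
    return f"(demo) received question: {question}"


demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)

if __name__ == "__main__":
    demo.launch()
```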
## 🎨 How to Use

### Interactive Demo

1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
   - Detected region of interest (bounding box)
   - Step-by-step reasoning
   - Final answer
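The Space can also be queried from code with `gradio_client`. A hedged sketch; the Space id, endpoint name, and argument order below are assumptions, so check the Space's "Use via API" panel for the real signature:

```python
# Sketch: query the demo programmatically (requires a recent gradio_client).
# Space id, api_name, and argument order are assumptions -- verify them in the
# Space's "Use via API" panel.
from gradio_client import Client, handle_file

client = Client("deepcs233/Visual-CoT")      # hypothetical Space id
result = client.predict(
    handle_file("example.jpg"),              # image input
    "What is written on the sign?",          # question input
    api_name="/predict",                     # default endpoint; may differ
)
print(result)
```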
### Benchmark Explorer

- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
## 📊 Performance

| Benchmark | Detection Accuracy | Answer Accuracy | Overall |
|-----------|--------------------|-----------------|---------|
| GQA       | 78.2%              | 84.5%           | 81.4%   |
| TextVQA   | 72.8%              | 81.3%           | 77.1%   |
| DocVQA    | 76.5%              | 83.7%           | 80.1%   |
| **Average** | 75.3%            | 82.7%           | 79.0%   |
## 🔬 Technical Details

### Chain-of-Thought Pipeline

```
Input: Image + Question
        ↓
Step 1: Detect Region of Interest (ROI)
        → Output: Bounding box [x1, y1, x2, y2]
        ↓
Step 2: Answer with ROI Context
        → Output: Final answer
```
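In code, this amounts to two generation calls over the same image. The sketch below uses a hypothetical `generate(image, prompt)` helper standing in for the model's inference API, and the prompt wording is illustrative rather than the exact template from the repository:

```python
# Sketch of the two-step VisCoT flow. `generate` is a hypothetical callable that
# wraps model inference; the prompts are illustrative, not the official templates.
import re


def visual_cot(image, question, generate):
    # Step 1: ask for the region of interest as a bounding box.
    roi_reply = generate(
        image,
        f"{question} Please provide the bounding box of the region "
        "that helps answer the question.",
    )
    match = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", roi_reply)
    bbox = [float(v) for v in match.groups()] if match else None

    # Step 2: answer the question, conditioning on the detected region.
    answer = generate(image, f"{question} Focus on the region {bbox}.")
    return bbox, answer
```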
### Model Specifications

- **Model**: VisCoT-7b-336
- **Parameters**: 7 billion
- **Resolution**: 336×336
- **Context length**: 2048 tokens
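To reproduce the demo's visual explanations, the predicted box can be drawn back onto the input image. A small Pillow sketch, assuming the model returns `[x1, y1, x2, y2]` normalized to `[0, 1]` (the actual output format may differ):

```python
# Sketch: draw a predicted bounding box on the image with Pillow.
# Assumes [x1, y1, x2, y2] normalized to [0, 1]; adapt if the model returns
# absolute pixel coordinates instead.
from PIL import Image, ImageDraw


def draw_bbox(image: Image.Image, bbox, color="red", width=3):
    x1, y1, x2, y2 = bbox
    w, h = image.size
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(
        (x1 * w, y1 * h, x2 * w, y2 * h), outline=color, width=width
    )
    return annotated
```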
## 📚 Citation

If you find our work useful, please cite:

```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources

- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
## ⚖️ License

- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to the LLaMA model license

## 🙏 Acknowledgements

This work builds upon:

- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)
---

<div align="center">
Made with ❤️ by the Visual-CoT Team
</div>