---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo

## 📖 About
Visual Chain-of-Thought (VisCoT) advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
- 🎯 Identify key regions in images using bounding boxes
- 💭 Reason step-by-step with visual grounding
- 💡 Answer complex questions about visual content
## 🎯 Key Features

### 📊 Dataset
- 438K question-answer pairs with bounding box annotations
- 13 diverse benchmarks spanning multiple visual reasoning tasks
- High-quality annotations from expert annotators
### 🏗️ Model Architecture
- Based on LLaVA-1.5 with a custom visual reasoning pipeline
- CLIP ViT-L/14 vision encoder (loading sketch below)
- Two-step reasoning: ROI detection → question answering
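For orientation, here is a minimal sketch of instantiating that vision tower with 🤗 Transformers. It assumes the publicly released `openai/clip-vit-large-patch14-336` checkpoint (the one LLaVA-1.5 normally uses) and is not the exact code in `app.py`:

```python
from transformers import CLIPImageProcessor, CLIPVisionModel
from PIL import Image

# Assumption: the 336px CLIP ViT-L/14 checkpoint used by LLaVA-1.5-style models.
VISION_TOWER = "openai/clip-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(VISION_TOWER)
vision_tower = CLIPVisionModel.from_pretrained(VISION_TOWER)

image = Image.open("example.jpg").convert("RGB")   # any test image
pixel_values = processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 336, 336)

# Patch features that a LLaVA-style projector would map into the LLM's embedding space.
outputs = vision_tower(pixel_values, output_hidden_states=True)
patch_features = outputs.hidden_states[-2][:, 1:]  # drop the [CLS] token, keep 24x24 patches
print(patch_features.shape)  # torch.Size([1, 576, 1024])
```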
## 🚀 Demo Features
- Interactive playground: Upload your own images
- Benchmark explorer: Browse evaluation examples
- Visual explanations: See detected regions with bounding boxes
- Zero GPU: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
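As a rough illustration of the ZeroGPU pattern (not the actual `app.py`), the GPU-bound work is wrapped in a function decorated with `spaces.GPU`; the function body below is a placeholder where the real model inference would go:

```python
import gradio as gr
import spaces  # ZeroGPU helper package, preinstalled on Hugging Face Spaces

@spaces.GPU  # a GPU is attached only while this function runs, then released
def answer(image, question):
    # Placeholder: the real app runs the two-step VisCoT inference here
    # (ROI detection, then answering with the detected region as extra context).
    return f"(model output for: {question!r})"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Visual-CoT Demo",
)

if __name__ == "__main__":
    demo.launch()
```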
## 🎨 How to Use

### Interactive Demo
- Upload an image or choose an example
- Ask a question about the image
- Get results with:
  - Detected region of interest (bounding box, drawn as in the sketch below)
  - Step-by-step reasoning
  - Final answer
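A minimal sketch of how such a box overlay can be rendered with Pillow; the helper name and coordinates are illustrative, not the demo's actual code:

```python
from PIL import Image, ImageDraw

def draw_roi(image: Image.Image, box: list, width: int = 4) -> Image.Image:
    """Overlay an [x1, y1, x2, y2] bounding box (pixel coordinates) on a copy of the image."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=width)
    return annotated

# Example: highlight a region returned by the ROI step.
img = Image.new("RGB", (640, 480), "white")        # stand-in for an uploaded photo
img_with_box = draw_roi(img, [120, 80, 360, 300])  # hypothetical box
img_with_box.save("roi_preview.png")
```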
### Benchmark Explorer
- Browse examples from 13 different source datasets (fetching sketch below)
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
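To poke at the underlying data outside the demo, the annotation files can be fetched from the Hub; the file pattern below is an assumption, so check the dataset card for the actual repository layout:

```python
from huggingface_hub import snapshot_download

# Pull only the JSON annotation files from the Visual-CoT dataset repo.
# The "*.json" pattern is an assumption about the repo layout; adjust as needed.
local_dir = snapshot_download(
    repo_id="deepcs233/Visual-CoT",
    repo_type="dataset",
    allow_patterns=["*.json"],
)
print("Downloaded to:", local_dir)
```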
## 📊 Performance
| Benchmark | Detection Acc | Answer Acc | Overall |
|---|---|---|---|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |
## 🔬 Technical Details

### Chain-of-Thought Pipeline

```text
Input: Image + Question
        ↓
Step 1: Detect Region of Interest (ROI)
        → Output: Bounding box [x1, y1, x2, y2]
        ↓
Step 2: Answer with ROI Context
        → Output: Final answer
```
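Concretely, the loop can be thought of as two generation calls. The sketch below is hedged: `generate` is a hypothetical wrapper around VisCoT-7b, the prompt wording is paraphrased rather than copied from the official repo, and the coordinate convention (pixels vs. normalized) may differ:

```python
import re
from PIL import Image

def viscot_answer(generate, image: Image.Image, question: str):
    """Two-step Visual-CoT inference sketch.

    `generate(images, prompt) -> str` is a hypothetical wrapper around VisCoT-7b;
    the exact prompt templates and coordinate format live in the official repo.
    """
    # Step 1: ask for the region of interest as a bounding box [x1, y1, x2, y2].
    roi_text = generate([image], question + " Please provide the bounding box of the "
                                            "region that helps answer the question.")
    x1, y1, x2, y2 = (float(v) for v in re.findall(r"-?\d+\.?\d*", roi_text)[:4])

    # If the model emits normalized coordinates, scale them to pixel space first.
    if max(x1, y1, x2, y2) <= 1.0:
        w, h = image.size
        x1, y1, x2, y2 = x1 * w, y1 * h, x2 * w, y2 * h

    # Step 2: answer again, with the cropped ROI supplied as extra visual context.
    crop = image.crop((x1, y1, x2, y2))
    return (x1, y1, x2, y2), generate([image, crop], question)
```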
### Model Specifications
- Model: VisCoT-7b-336
- Parameters: 7 billion
- Resolution: 336×336
- Context Length: 2048 tokens
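Input images are brought to the 336×336 resolution before encoding. A common way to do this for LLaVA-1.5-based checkpoints is "pad" preprocessing; whether VisCoT keeps this exact setting is an assumption, so treat the snippet as a sketch:

```python
from PIL import Image

# LLaVA-1.5-style "pad" preprocessing: pad to a square (filled with the CLIP mean color),
# then resize to the model's 336x336 input.
def expand_to_square(img: Image.Image, fill=(122, 116, 104)) -> Image.Image:
    w, h = img.size
    if w == h:
        return img
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

img = Image.open("example.jpg").convert("RGB")   # any test image
model_input = expand_to_square(img).resize((336, 336))
```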
## 📚 Citation
If you find our work useful, please cite:
```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources
- 📄 Paper: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 Code: GitHub Repository
- 🤗 Dataset: [deepcs233/Visual-CoT](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 Project Page: https://hao-shao.com/projects/viscot.html
## ⚖️ License
- Code: Apache License 2.0
- Dataset: Research use only
- Model: Subject to LLaMA model license
## 🙏 Acknowledgements
This work builds upon:
- LLaVA-1.5
- CLIP (ViT-L/14)
- Gradio and Hugging Face ZeroGPU

Made with ❤️ by the Visual-CoT Team