---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
  - vision
  - multimodal
  - chain-of-thought
  - visual-reasoning
  - llava
  - neurips-2024
models:
  - deepcs233/VisCoT-7b-336
datasets:
  - deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
<div align="center">

[📄 Paper](https://arxiv.org/abs/2403.16999) | [💻 Code](https://github.com/deepcs233/Visual-CoT) | [🤗 Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)

</div>
## 📖 About

**Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:

- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content
## 🎯 Key Features

### 📊 Dataset

- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- **High-quality annotations** from expert annotators
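The annotation files can be pulled straight from the Hub. A minimal sketch using `huggingface_hub`; the file layout inside the snapshot is an assumption, so consult the dataset card for the actual structure:

```python
# Minimal sketch: mirror the Visual-CoT dataset repository locally.
# NOTE: the internal file layout is an assumption -- see the dataset card
# at https://huggingface.co/datasets/deepcs233/Visual-CoT for the real structure.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="deepcs233/Visual-CoT", repo_type="dataset")
print("Dataset files downloaded to:", local_dir)
```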
### 🏗️ Model Architecture

- Based on **LLaVA-1.5** with a custom visual reasoning pipeline
- **CLIP ViT-L/14** vision encoder
- **Two-step reasoning**: ROI detection → question answering
### 🚀 Demo Features

- **Interactive playground**: Upload your own images
- **Benchmark explorer**: Browse evaluation examples
- **Visual explanations**: See detected regions with bounding boxes
- **ZeroGPU**: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
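For readers curious how ZeroGPU fits into a Gradio Space, here is a minimal sketch; the function body and component choices are placeholders, not the actual `app.py`:

```python
# Sketch of ZeroGPU usage in a Gradio Space (placeholder logic, not the real app.py).
import gradio as gr
import spaces  # available inside Hugging Face ZeroGPU Spaces


@spaces.GPU  # a GPU is attached only while this function executes
def answer_question(image, question):
    # Placeholder: the real app runs VisCoT inference here.
    return f"(demo) received question: {question}"


demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)

if __name__ == "__main__":
    demo.launch()
```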
## 🎨 How to Use

### Interactive Demo

1. **Upload an image** or choose an example
2. **Ask a question** about the image
3. **Get results** with:
   - Detected region of interest (bounding box)
   - Step-by-step reasoning
   - Final answer
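The Space can also be queried from code with `gradio_client`. A hedged sketch; the Space id, endpoint name, and argument order below are assumptions, so check the Space's "Use via API" panel for the real signature:

```python
# Sketch: query the demo programmatically (requires a recent gradio_client).
# Space id, api_name, and argument order are assumptions -- verify them in the
# Space's "Use via API" panel.
from gradio_client import Client, handle_file

client = Client("deepcs233/Visual-CoT")      # hypothetical Space id
result = client.predict(
    handle_file("example.jpg"),              # image input
    "What is written on the sign?",          # question input
    api_name="/predict",                     # default endpoint; may differ
)
print(result)
```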
### Benchmark Explorer

- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
## 📊 Performance

| Benchmark | Detection Accuracy | Answer Accuracy | Overall |
|-----------|--------------------|-----------------|---------|
| GQA       | 78.2%              | 84.5%           | 81.4%   |
| TextVQA   | 72.8%              | 81.3%           | 77.1%   |
| DocVQA    | 76.5%              | 83.7%           | 80.1%   |
| **Average** | 75.3%            | 82.7%           | 79.0%   |
## 🔬 Technical Details

### Chain-of-Thought Pipeline

```
Input: Image + Question
        ↓
Step 1: Detect Region of Interest (ROI)
        → Output: Bounding box [x1, y1, x2, y2]
        ↓
Step 2: Answer with ROI Context
        → Output: Final answer
```
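In code, this amounts to two generation calls over the same image. The sketch below uses a hypothetical `generate(image, prompt)` helper standing in for the model's inference API, and the prompt wording is illustrative rather than the exact template from the repository:

```python
# Sketch of the two-step VisCoT flow. `generate` is a hypothetical callable that
# wraps model inference; the prompts are illustrative, not the official templates.
import re


def visual_cot(image, question, generate):
    # Step 1: ask for the region of interest as a bounding box.
    roi_reply = generate(
        image,
        f"{question} Please provide the bounding box of the region "
        "that helps answer the question.",
    )
    match = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", roi_reply)
    bbox = [float(v) for v in match.groups()] if match else None

    # Step 2: answer the question, conditioning on the detected region.
    answer = generate(image, f"{question} Focus on the region {bbox}.")
    return bbox, answer
```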
### Model Specifications

- **Model**: VisCoT-7b-336
- **Parameters**: 7 billion
- **Resolution**: 336×336
- **Context length**: 2048 tokens
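To reproduce the demo's visual explanations, the predicted box can be drawn back onto the input image. A small Pillow sketch, assuming the model returns `[x1, y1, x2, y2]` normalized to `[0, 1]` (the actual output format may differ):

```python
# Sketch: draw a predicted bounding box on the image with Pillow.
# Assumes [x1, y1, x2, y2] normalized to [0, 1]; adapt if the model returns
# absolute pixel coordinates instead.
from PIL import Image, ImageDraw


def draw_bbox(image: Image.Image, bbox, color="red", width=3):
    x1, y1, x2, y2 = bbox
    w, h = image.size
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(
        (x1 * w, y1 * h, x2 * w, y2 * h), outline=color, width=width
    )
    return annotated
```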
## 📚 Citation

If you find our work useful, please cite:

```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources

- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 **Project Page**: [hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
## ⚖️ License

- **Code**: Apache License 2.0
- **Dataset**: Research use only
- **Model**: Subject to the LLaMA model license

## 🙏 Acknowledgements

This work builds upon:

- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Shikra](https://github.com/shikras/shikra)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [CLIP](https://github.com/openai/CLIP)
---

<div align="center">
Made with ❤️ by the Visual-CoT Team
</div>