---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- vision
- multimodal
- chain-of-thought
- visual-reasoning
- llava
- neurips-2024
models:
- deepcs233/VisCoT-7b-336
datasets:
- deepcs233/Visual-CoT
---
# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo

## 📖 About
Visual Chain-of-Thought (VisCoT) advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
- 🎯 Identify key regions in images using bounding boxes
- 💭 Reason step-by-step with visual grounding
- 💡 Answer complex questions about visual content
## 🎯 Key Features

### 📊 Dataset
- 438K question-answer pairs with bounding box annotations
- 13 diverse benchmarks spanning multiple visual reasoning tasks
- High-quality annotations from expert annotators
### 🏗️ Model Architecture
- Based on LLaVA-1.5 with a custom visual reasoning pipeline
- CLIP ViT-L/14 vision encoder (loading sketch below)
- Two-step reasoning: ROI detection → question answering
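For orientation, here is a minimal sketch of instantiating that vision tower with 🤗 Transformers. It assumes the publicly released `openai/clip-vit-large-patch14-336` checkpoint (the one LLaVA-1.5 normally uses) and is not the exact code in `app.py`:

```python
from transformers import CLIPImageProcessor, CLIPVisionModel
from PIL import Image

# Assumption: the 336px CLIP ViT-L/14 checkpoint used by LLaVA-1.5-style models.
VISION_TOWER = "openai/clip-vit-large-patch14-336"

processor = CLIPImageProcessor.from_pretrained(VISION_TOWER)
vision_tower = CLIPVisionModel.from_pretrained(VISION_TOWER)

image = Image.open("example.jpg").convert("RGB")   # any test image
pixel_values = processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 336, 336)

# Patch features that a LLaVA-style projector would map into the LLM's embedding space.
outputs = vision_tower(pixel_values, output_hidden_states=True)
patch_features = outputs.hidden_states[-2][:, 1:]  # drop the [CLS] token, keep 24x24 patches
print(patch_features.shape)  # torch.Size([1, 576, 1024])
```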
## 🚀 Demo Features
- Interactive playground: Upload your own images
- Benchmark explorer: Browse evaluation examples
- Visual explanations: See detected regions with bounding boxes
- Zero GPU: Powered by Hugging Face's on-demand GPU allocation (see the sketch below)
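As a rough illustration of the ZeroGPU pattern (not the actual `app.py`), the GPU-bound work is wrapped in a function decorated with `spaces.GPU`; the function body below is a placeholder where the real model inference would go:

```python
import gradio as gr
import spaces  # ZeroGPU helper package, preinstalled on Hugging Face Spaces

@spaces.GPU  # a GPU is attached only while this function runs, then released
def answer(image, question):
    # Placeholder: the real app runs the two-step VisCoT inference here
    # (ROI detection, then answering with the detected region as extra context).
    return f"(model output for: {question!r})"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Visual-CoT Demo",
)

if __name__ == "__main__":
    demo.launch()
```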
## 🎨 How to Use

### Interactive Demo
- Upload an image or choose an example
- Ask a question about the image
- Get results with:
  - Detected region of interest (bounding box, drawn as in the sketch below)
  - Step-by-step reasoning
  - Final answer
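A minimal sketch of how such a box overlay can be rendered with Pillow; the helper name and coordinates are illustrative, not the demo's actual code:

```python
from PIL import Image, ImageDraw

def draw_roi(image: Image.Image, box: list, width: int = 4) -> Image.Image:
    """Overlay an [x1, y1, x2, y2] bounding box (pixel coordinates) on a copy of the image."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=width)
    return annotated

# Example: highlight a region returned by the ROI step.
img = Image.new("RGB", (640, 480), "white")        # stand-in for an uploaded photo
img_with_box = draw_roi(img, [120, 80, 360, 300])  # hypothetical box
img_with_box.save("roi_preview.png")
```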
### Benchmark Explorer
- Browse examples from 13 different source datasets (fetching sketch below)
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions
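To poke at the underlying data outside the demo, the annotation files can be fetched from the Hub; the file pattern below is an assumption, so check the dataset card for the actual repository layout:

```python
from huggingface_hub import snapshot_download

# Pull only the JSON annotation files from the Visual-CoT dataset repo.
# The "*.json" pattern is an assumption about the repo layout; adjust as needed.
local_dir = snapshot_download(
    repo_id="deepcs233/Visual-CoT",
    repo_type="dataset",
    allow_patterns=["*.json"],
)
print("Downloaded to:", local_dir)
```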
## 📊 Performance
| Benchmark | Detection Acc | Answer Acc | Overall |
|---|---|---|---|
| GQA | 78.2% | 84.5% | 81.4% |
| TextVQA | 72.8% | 81.3% | 77.1% |
| DocVQA | 76.5% | 83.7% | 80.1% |
| Average | 75.3% | 82.7% | 79.0% |
## 🔬 Technical Details

### Chain-of-Thought Pipeline

```text
Input: Image + Question
        ↓
Step 1: Detect Region of Interest (ROI)
        → Output: Bounding box [x1, y1, x2, y2]
        ↓
Step 2: Answer with ROI Context
        → Output: Final answer
```
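Concretely, the loop can be thought of as two generation calls. The sketch below is hedged: `generate` is a hypothetical wrapper around VisCoT-7b, the prompt wording is paraphrased rather than copied from the official repo, and the coordinate convention (pixels vs. normalized) may differ:

```python
import re
from PIL import Image

def viscot_answer(generate, image: Image.Image, question: str):
    """Two-step Visual-CoT inference sketch.

    `generate(images, prompt) -> str` is a hypothetical wrapper around VisCoT-7b;
    the exact prompt templates and coordinate format live in the official repo.
    """
    # Step 1: ask for the region of interest as a bounding box [x1, y1, x2, y2].
    roi_text = generate([image], question + " Please provide the bounding box of the "
                                            "region that helps answer the question.")
    x1, y1, x2, y2 = (float(v) for v in re.findall(r"-?\d+\.?\d*", roi_text)[:4])

    # If the model emits normalized coordinates, scale them to pixel space first.
    if max(x1, y1, x2, y2) <= 1.0:
        w, h = image.size
        x1, y1, x2, y2 = x1 * w, y1 * h, x2 * w, y2 * h

    # Step 2: answer again, with the cropped ROI supplied as extra visual context.
    crop = image.crop((x1, y1, x2, y2))
    return (x1, y1, x2, y2), generate([image, crop], question)
```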
### Model Specifications
- Model: VisCoT-7b-336
- Parameters: 7 billion
- Resolution: 336×336
- Context Length: 2048 tokens
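Input images are brought to the 336×336 resolution before encoding. A common way to do this for LLaVA-1.5-based checkpoints is "pad" preprocessing; whether VisCoT keeps this exact setting is an assumption, so treat the snippet as a sketch:

```python
from PIL import Image

# LLaVA-1.5-style "pad" preprocessing: pad to a square (filled with the CLIP mean color),
# then resize to the model's 336x336 input.
def expand_to_square(img: Image.Image, fill=(122, 116, 104)) -> Image.Image:
    w, h = img.size
    if w == h:
        return img
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

img = Image.open("example.jpg").convert("RGB")   # any test image
model_input = expand_to_square(img).resize((336, 336))
```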
## 📚 Citation
If you find our work useful, please cite:
```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```
## 🔗 Resources
- 📄 Paper: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
- 💻 Code: GitHub Repository
- 🤗 Dataset: [deepcs233/Visual-CoT](https://huggingface.co/datasets/deepcs233/Visual-CoT)
- 🌐 Project Page: https://hao-shao.com/projects/viscot.html
## ⚖️ License
- Code: Apache License 2.0
- Dataset: Research use only
- Model: Subject to LLaMA model license
## 🙏 Acknowledgements
This work builds upon:
- LLaVA-1.5
- CLIP (ViT-L/14)
- Gradio and Hugging Face ZeroGPU

Made with ❤️ by the Visual-CoT Team