---
title: Visual-CoT Demo
emoji: 🌋
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.16.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
  - vision
  - multimodal
  - chain-of-thought
  - visual-reasoning
  - llava
  - neurips-2024
models:
  - deepcs233/VisCoT-7b-336
datasets:
  - deepcs233/Visual-CoT
---

# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo

[Paper](https://arxiv.org/abs/2403.16999) | GitHub | [Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT) | NeurIPS 2024

## 📖 About

Visual Chain-of-Thought (VisCoT) advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:

- 🎯 **Identify key regions** in images using bounding boxes
- 💭 **Reason step-by-step** with visual grounding
- 💡 **Answer complex questions** about visual content

## 🎯 Key Features

### 📊 Dataset

- **438K** question-answer pairs with bounding box annotations
- **13 diverse benchmarks** spanning multiple visual reasoning tasks
- High-quality, expert-curated annotations
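
The annotation files live in the dataset repo on the Hugging Face Hub. A small, hedged sketch for inspecting that repo (whether it also loads directly via `datasets.load_dataset` depends on the repo layout, so this only lists the files):

```python
from huggingface_hub import list_repo_files

# List the files in the Visual-CoT dataset repo on the Hub.
files = list_repo_files("deepcs233/Visual-CoT", repo_type="dataset")
print(files[:10])
```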

### 🏗️ Model Architecture

- Based on LLaVA-1.5 with a custom visual reasoning pipeline
- CLIP ViT-L/14 vision encoder
- Two-step reasoning: ROI detection → question answering

### 🚀 Demo Features

- **Interactive playground**: upload your own images
- **Benchmark explorer**: browse evaluation examples
- **Visual explanations**: see detected regions with bounding boxes
- **Zero GPU**: powered by Hugging Face's on-demand GPU allocation (see the sketch below)
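
A minimal sketch of how a Zero GPU Space typically requests a device with the `spaces` package. The `@spaces.GPU` decorator is the real API; the handler body and interface wiring here are illustrative placeholders, not the demo's actual code.

```python
import gradio as gr
import spaces  # available on Hugging Face Zero GPU Spaces


@spaces.GPU  # a GPU is attached only while this function executes
def answer(image, question):
    # Placeholder: the real app would run the VisCoT pipeline here.
    return f"(demo) you asked: {question}"


demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)
demo.launch()
```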

## 🎨 How to Use

### Interactive Demo

1. Upload an image or choose an example
2. Ask a question about the image
3. Get results with:
   - Detected region of interest (bounding box)
   - Step-by-step reasoning
   - Final answer
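
The same flow can be scripted with `gradio_client` (a recent version that provides `handle_file`). The Space ID and endpoint signature below are assumptions for illustration; check the Space's "Use via API" panel for the actual values.

```python
from gradio_client import Client, handle_file

# Hypothetical Space ID and endpoint name; replace with the real ones.
client = Client("user/viscot-demo")
result = client.predict(
    handle_file("example.jpg"),       # uploaded image
    "What is written on the sign?",   # question
    api_name="/predict",              # assumed endpoint name
)
print(result)
```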

### Benchmark Explorer

- Browse examples from 13 different datasets
- See model performance on diverse visual reasoning tasks
- Compare ground truth with model predictions

## 📊 Performance

| Benchmark   | Detection Acc. | Answer Acc. | Overall   |
|-------------|---------------:|------------:|----------:|
| GQA         | 78.2%          | 84.5%       | 81.4%     |
| TextVQA     | 72.8%          | 81.3%       | 77.1%     |
| DocVQA      | 76.5%          | 83.7%       | 80.1%     |
| **Average** | **75.3%**      | **82.7%**   | **79.0%** |

## 🔬 Technical Details

### Chain-of-Thought Pipeline

```text
Input: Image + Question
    ↓
Step 1: Detect Region of Interest (ROI)
    → Output: bounding box [x1, y1, x2, y2]
    ↓
Step 2: Answer with ROI Context
    → Output: final answer
```
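
A hedged Python sketch of this two-step loop. `run_model` is a hypothetical stand-in for whatever inference call your VisCoT/LLaVA setup exposes, and the step-1 prompt and normalized box format are assumptions modeled on the paper, not a confirmed API.

```python
from PIL import Image


def run_model(image: Image.Image, prompt: str) -> str:
    """Stand-in for a VisCoT/LLaVA forward pass; not a real API."""
    raise NotImplementedError


def visual_cot(image_path: str, question: str) -> str:
    image = Image.open(image_path).convert("RGB")
    w, h = image.size

    # Step 1: ask the model for the region of interest as a bounding box.
    box_text = run_model(
        image,
        question
        + " Please provide the bounding box coordinate of the region"
        " that can help you answer the question better.",
    )
    # Assumes a reply like "[0.11, 0.25, 0.63, 0.74]" (normalized coords).
    x1, y1, x2, y2 = (float(t) for t in box_text.strip("[] \n").split(","))
    roi = image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))

    # Step 2: answer the question again with the cropped ROI as context
    # (the full pipeline conditions on both the original image and the crop).
    return run_model(roi, question)
```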

### Model Specifications

- **Model**: VisCoT-7b-336
- **Parameters**: 7 billion
- **Input resolution**: 336×336
- **Context length**: 2048 tokens
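
To fetch the published weights from the Hub, a minimal sketch (running inference additionally requires the Visual-CoT codebase):

```python
from huggingface_hub import snapshot_download

# Downloads config, tokenizer files, and weight shards to the local cache.
local_dir = snapshot_download("deepcs233/VisCoT-7b-336")
print(local_dir)
```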

## 📚 Citation

If you find our work useful, please cite:

```bibtex
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
```

## 🔗 Resources

- Paper: https://arxiv.org/abs/2403.16999
- Model: https://huggingface.co/deepcs233/VisCoT-7b-336
- Dataset: https://huggingface.co/datasets/deepcs233/Visual-CoT

## ⚖️ License

- **Code**: Apache License 2.0
- **Dataset**: research use only
- **Model**: subject to the LLaMA model license

## 🙏 Acknowledgements

This work builds upon:

- LLaVA-1.5
- CLIP
- LLaMA
Made with ❤️ by the Visual-CoT Team