---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
tags:
- visualization
- quality-assessment
- lora
- qwen2.5-vl
- visjudge
- aesthetics
- grpo
---

# VisJudge: Qwen2.5-VL-7B LoRA for Visualization Quality Assessment

[![arXiv](https://img.shields.io/badge/arXiv-2510.22373-b31b1b.svg)](https://arxiv.org/abs/2510.22373)
[![Dataset](https://img.shields.io/badge/GitHub-VisJudgeBench-blue)](https://github.com/HKUSTDial/VisJudgeBench)

**VisJudge** is a specialized model fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for visualization quality and aesthetics assessment. It significantly outperforms state-of-the-art multimodal large language models (MLLMs), including GPT-5, GPT-4o, and Claude-4-Sonnet, on visualization evaluation tasks.

📄 **Paper**: [VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations](https://arxiv.org/abs/2510.22373)

## 🎯 Model Overview

VisJudge addresses the significant gap between general MLLMs and human expert judgment in visualization quality assessment. Trained with **GRPO (Group Relative Policy Optimization)** on the **VisJudgeBench** dataset of 3,090 expert-annotated samples, VisJudge evaluates visualizations using the **Fidelity-Expressiveness-Aesthetics** framework.

### Key Features

- **🏆 State-of-the-Art Performance**: 19.8% MAE improvement over GPT-5
- **📊 Six-Dimensional Evaluation**: Data Fidelity, Semantic Readability, Insight Discovery, Design Style, Visual Composition, Color Harmony
- **🎨 Comprehensive Coverage**: Supports 32 visualization types, including single charts, multi-panel views, and dashboards
- **🔬 Expert-Level Assessment**: Achieves 0.681 correlation with human experts (vs. 0.429 for GPT-5)

## 🏆 Performance Benchmarks

### Overall Performance Comparison

| Model | MAE ↓ | MSE ↓ | Correlation ↑ |
| ----------------- | --------- | --------- | --------- |
| **VisJudge** | **0.442** | **0.306** | **0.681** |
| GPT-5 | 0.551 | 0.484 | 0.429 |
| GPT-4o | 0.609 | 0.575 | 0.482 |
| Claude-4-Sonnet | 0.618 | 0.596 | 0.470 |
| Gemini-2.0-Flash | 0.680 | 0.716 | 0.395 |
| Gemini-2.5-Pro | 0.661 | 0.674 | 0.266 |
| Claude-3.5-Sonnet | 0.823 | 1.006 | 0.395 |
| Qwen2.5-VL-7B | 1.048 | 1.502 | 0.322 |

**Key Achievements:**

- 🎯 **19.8% MAE improvement** over GPT-5 (0.551 → 0.442)
- 📈 **58.7% higher correlation** with human experts vs. GPT-5 (0.429 → 0.681)
- 🏅 **Outperforms all commercial MLLMs** across all metrics

### Performance by Evaluation Dimension (MAE ↓)

| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
| ----------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| **VisJudge** | **0.442** | **0.662** | **0.649** | **0.679** | **0.581** | **0.546** | **0.604** |
| GPT-5 | 0.551 | 0.861 | 0.780 | 0.776 | 0.648 | 0.698 | 0.682 |
| GPT-4o | 0.609 | 0.986 | 0.804 | 0.742 | 0.608 | 0.694 | 0.657 |
| Claude-4-Sonnet | 0.618 | 0.839 | 0.757 | 0.830 | 0.678 | 0.733 | 0.785 |
| Gemini-2.0-Flash | 0.680 | 0.828 | 0.910 | 0.818 | 0.637 | 0.728 | 0.798 |
| Gemini-2.5-Pro | 0.661 | 1.241 | 0.944 | 0.898 | 0.839 | 0.918 | 0.980 |
| Claude-3.5-Sonnet | 0.823 | 0.977 | 0.902 | 1.152 | 0.782 | 0.939 | 0.862 |
| Qwen2.5-VL-7B | 1.048 | 1.169 | 1.294 | 0.857 | 0.755 | 0.812 | 0.772 |

### Performance by Evaluation Dimension (Correlation ↑)

| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
| ----------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| **VisJudge** | **0.681** | **0.571** | **0.625** | **0.572** | **0.567** | **0.512** | **0.385** |
| GPT-5 | 0.429 | 0.256 | 0.438 | 0.383 | 0.463 | 0.277 | 0.295 |
| GPT-4o | 0.482 | 0.382 | 0.539 | 0.442 | 0.472 | 0.277 | 0.363 |
| Claude-4-Sonnet | 0.470 | 0.392 | 0.548 | 0.453 | 0.422 | 0.164 | 0.228 |
| Gemini-2.0-Flash | 0.395 | 0.371 | 0.458 | 0.418 | 0.460 | 0.157 | 0.209 |
| Gemini-2.5-Pro | 0.266 | 0.180 | 0.379 | 0.357 | 0.447 | 0.194 | 0.208 |
| Claude-3.5-Sonnet | 0.395 | 0.325 | 0.491 | 0.366 | 0.456 | 0.137 | 0.259 |
| Qwen2.5-VL-7B | 0.322 | 0.340 | 0.349 | 0.278 | 0.356 | 0.148 | 0.155 |

**Key Observations:**

- All models struggle most with the **Aesthetics dimensions** (Design Style, Visual Composition, Color Harmony)
- **Data Fidelity** is relatively easier but still challenging for most models
- **VisJudge consistently outperforms** the baseline models across all six dimensions

## 🔍 Evaluation Framework

VisJudge evaluates visualizations along three fundamental dimensions with six measurable metrics:

### 1. Fidelity - Data Accuracy and Truthfulness

- **Data Fidelity**: Ensures visual encodings accurately reflect the original data without misleading interpretations

### 2. Expressiveness - Information Clarity and Understandability

- **Semantic Readability**: Assesses clarity of information encoding and unambiguous decoding
- **Insight Discovery**: Evaluates effectiveness in revealing data patterns, trends, and outliers

### 3. Aesthetics - Visual Aesthetics and Refinement

- **Design Style**: Measures innovation and uniqueness of design elements
- **Visual Composition**: Focuses on spatial layout, balance, and element positioning
- **Color Harmony**: Assesses color coordination and functional effectiveness

## 🚀 Usage

### Installation

```bash
pip install transformers peft torch pillow
```

### Quick Start

```python
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load base model (Qwen2.5-VL uses the Qwen2_5_VL* classes, not Qwen2VL*)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load VisJudge LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "xypkent/visjudge-7b"
)

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare your visualization
image = Image.open("path/to/your/visualization.png")

# Evaluation prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": """You are a rigorous data visualization evaluation expert. Please evaluate this visualization based on the "Fidelity-Expressiveness-Aesthetics" framework.

The evaluation follows the "Fidelity-Expressiveness-Aesthetics" principle:
- Fidelity: Data accuracy and truthfulness
- Expressiveness: Information clarity and understandability
- Aesthetics: Visual aesthetics and refinement

For each evaluation dimension below, provide a score from 1 to 5 and reasoning based on the scoring criteria:
1. Data Fidelity: Does the visual encoding accurately reflect the data without distortion?
2. Semantic Readability: Is the information clearly encoded and easy to decode?
3. Insight Discovery: Does it effectively reveal patterns, trends, and insights?
4. Design Style: Is the design innovative and distinctive?
5. Visual Composition: Is the layout balanced and well-organized?
6. Color Harmony: Are colors coordinated and effective?

Return Format: JSON object with the following structure:
{
  "data_fidelity": {"score": 1-5, "reasoning": "Your explanation here."},
  "semantic_readability": {"score": 1-5, "reasoning": "Your explanation here."},
  "insight_discovery": {"score": 1-5, "reasoning": "Your explanation here."},
  "design_style": {"score": 1-5, "reasoning": "Your explanation here."},
  "visual_composition": {"score": 1-5, "reasoning": "Your explanation here."},
  "color_harmony": {"score": 1-5, "reasoning": "Your explanation here."},
  "average_score": "the average of the above six scores, rounded to 2 decimals"
}

Do not include any additional text, only the JSON object."""}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the generated assessment is decoded
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

### Example Output

```json
{
  "data_fidelity": {
    "score": 4,
    "reasoning": "The visual encoding accurately represents the data with appropriate scales and minimal distortion."
  },
  "semantic_readability": {
    "score": 5,
    "reasoning": "Clear labels, legend, and annotations make the information very easy to understand and decode."
  },
  "insight_discovery": {
    "score": 4,
    "reasoning": "The chart effectively reveals key trends and patterns, though some subtle insights could be more emphasized."
  },
  "design_style": {
    "score": 3,
    "reasoning": "Uses standard design elements without much innovation, but maintains professional appearance."
  },
  "visual_composition": {
    "score": 4,
    "reasoning": "Well-balanced layout with good spacing between elements and clear visual hierarchy."
  },
  "color_harmony": {
    "score": 4,
    "reasoning": "Color palette is well-coordinated and supports readability, with good contrast and consistency."
  },
  "average_score": 4.00
}
```

## 📊 Training Details

### Dataset

- **Name**: [VisJudgeBench](https://github.com/HKUSTDial/VisJudgeBench)
- **Size**: 3,090 expert-annotated visualization samples
- **Types**: Single visualizations, multi-panel views, dashboards
- **Coverage**: 32 chart types, including bar charts, line charts, heatmaps, Sankey diagrams, treemaps, dashboards, and more

### Training Method

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) + GRPO (Group Relative Policy Optimization)
- **LoRA Configuration**:
  - Rank: 128
  - Alpha: 256
  - Target Modules: All attention and MLP layers
- **Training Framework**: PEFT 0.14.0

### Key Improvements

✅ **Human-like Scoring**: Mean score μ=3.11 (vs. human μ=3.13), eliminating the score-inflation bias seen in other models

✅ **Balanced Assessment**: Avoids both the overly conservative (Gemini-2.5-Pro μ=3.02) and overly generous (Qwen2.5-VL-7B μ=3.89) biases

✅ **Complexity Handling**: Maintains performance across single visualizations (0.577), multi-panel views (0.565), and complex dashboards (0.375)

## 📈 Supported Visualization Types

### Single Visualizations (22 types)

Bar Chart, Pie Chart, Line Chart, Area Chart, Treemap, Sankey Diagram, Heatmap, Scatter Plot, Histogram, Donut Chart, Funnel Chart, Bubble Chart, Choropleth Map, Radar Chart, Network Graph, Candlestick Chart, Gauge Chart, Box Plot, Point Map, Word Cloud, Violin Plot, and more

### Multiple Visualizations (5 types)

Comparison Views, Small Multiples, Coordinated Views, Overview+Detail, and more

### Dashboards (5 types)

Analytical Dashboard, Operational Dashboard, Interactive Dashboard, Strategic Dashboard, and more

## ⚠️ Limitations

- Performance degrades as visualization complexity increases (dashboards are the most challenging)
- Best suited to visualization types seen during training
- Aesthetic dimensions (especially Visual Composition in complex dashboards) remain challenging
- Inherits any biases present in the base Qwen2.5-VL model
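Since the model is instructed to emit only a JSON object, downstream code usually wants the structured scores rather than raw text. The helper below is a minimal sketch of ours (the name `parse_visjudge_response` is not part of the release); it assumes the six dimension keys from the prompt, tolerates stray text around the JSON, and recomputes `average_score` rather than trusting the model's arithmetic:

```python
import json

DIMENSIONS = [
    "data_fidelity", "semantic_readability", "insight_discovery",
    "design_style", "visual_composition", "color_harmony",
]

def parse_visjudge_response(response: str) -> dict:
    """Extract and sanity-check the JSON assessment from a raw model response."""
    # Slice from the first '{' to the last '}' in case stray tokens surround the JSON.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")
    result = json.loads(response[start:end + 1])

    scores = [result[d]["score"] for d in DIMENSIONS]
    # Recompute the average instead of trusting the model-reported value.
    result["average_score"] = round(sum(scores) / len(scores), 2)
    return result
```

A missing dimension key raises `KeyError` and malformed JSON raises `json.JSONDecodeError`, which makes silent scoring failures easy to catch in batch evaluation.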
## 📝 Citation

If you use VisJudge in your research, please cite:

```bibtex
@misc{xie2025visjudge,
  title={VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations},
  author={Yupeng Xie and Zhiyang Zhang and Yifan Wu and Sirong Lu and Jiayi Zhang and Zhaoyang Yu and Jinlin Wang and Sirui Hong and Bang Liu and Chenglin Wu and Yuyu Luo},
  year={2025},
  eprint={2510.22373},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22373}
}
```

## 🔗 Resources

- 📄 **Paper**: [arXiv:2510.22373](https://arxiv.org/abs/2510.22373)
- 💻 **Dataset**: [VisJudgeBench on GitHub](https://github.com/HKUSTDial/VisJudgeBench)
- 📧 **Contact**: yxie740@connect.hkust-gz.edu.cn

## 📜 License

This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-VL model.

## 🙏 Acknowledgments

This model is built upon [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) by Alibaba Cloud. We thank the Qwen team for their excellent foundation model.

---

**Developed by**: Yupeng Xie and team at HKUST-GZ

**Framework Versions**: PEFT 0.14.0 | Transformers 4.x | PyTorch 2.x