VisJudge: Qwen2.5-VL-7B LoRA for Visualization Quality Assessment
VisJudge is a specialized model fine-tuned on Qwen2.5-VL-7B-Instruct for visualization quality and aesthetics assessment. It significantly outperforms state-of-the-art multimodal large language models (MLLMs) including GPT-5, GPT-4o, and Claude-4-Sonnet on visualization evaluation tasks.
Paper: VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Model Overview
VisJudge addresses the significant gap between general MLLMs and human expert judgment in visualization quality assessment. It was trained with GRPO (Group Relative Policy Optimization) on the VisJudgeBench dataset of 3,090 expert-annotated samples and evaluates visualizations along the Fidelity-Expressiveness-Aesthetics framework.
Key Features
- State-of-the-Art Performance: 19.8% MAE improvement over GPT-5
- Six-Dimensional Evaluation: Data Fidelity, Semantic Readability, Insight Discovery, Design Style, Visual Composition, Color Harmony
- Comprehensive Coverage: Supports 32 visualization types including single charts, multi-panel views, and dashboards
- Expert-Level Assessment: Achieves 0.681 correlation with human experts (vs. 0.429 for GPT-5)
Performance Benchmarks
Overall Performance Comparison
| Model | MAE ↓ | MSE ↓ | Correlation ↑ |
|---|---|---|---|
| VisJudge | 0.442 | 0.306 | 0.681 |
| GPT-5 | 0.551 | 0.484 | 0.429 |
| GPT-4o | 0.609 | 0.575 | 0.482 |
| Claude-4-Sonnet | 0.618 | 0.596 | 0.470 |
| Gemini-2.0-Flash | 0.680 | 0.716 | 0.395 |
| Gemini-2.5-Pro | 0.661 | 0.674 | 0.266 |
| Claude-3.5-Sonnet | 0.823 | 1.006 | 0.395 |
| Qwen2.5-VL-7B | 1.048 | 1.502 | 0.322 |
Key Achievements:
- 19.8% MAE improvement over GPT-5 (0.551 → 0.442)
- 58.7% higher correlation with human experts vs. GPT-5 (0.429 → 0.681)
- Outperforms all commercial MLLMs across all metrics (see the metric sketch below)
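For reference, a minimal sketch of how these metrics relate paired model scores to expert scores. The exact evaluation protocol, including which correlation coefficient is reported, follows the paper; both Pearson and Spearman are shown here only for illustration.

```python
# Sketch of the evaluation metrics in the tables above, computed from paired
# model scores and human expert scores. The correlation coefficient actually
# reported by the paper is not restated here, so both variants are shown.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def score_metrics(pred, human):
    pred = np.asarray(pred, dtype=float)
    human = np.asarray(human, dtype=float)
    return {
        "MAE": float(np.mean(np.abs(pred - human))),    # lower is better
        "MSE": float(np.mean((pred - human) ** 2)),     # lower is better
        "Pearson r": float(pearsonr(pred, human)[0]),   # higher is better
        "Spearman rho": float(spearmanr(pred, human)[0]),
    }

# Hypothetical scores on the 1-5 scale
print(score_metrics([3.5, 4.0, 2.5, 3.0], [3.7, 4.2, 2.0, 3.1]))
```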
Performance by Evaluation Dimensions (MAE ↓)
| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.442 | 0.662 | 0.649 | 0.679 | 0.581 | 0.546 | 0.604 |
| GPT-5 | 0.551 | 0.861 | 0.780 | 0.776 | 0.648 | 0.698 | 0.682 |
| GPT-4o | 0.609 | 0.986 | 0.804 | 0.742 | 0.608 | 0.694 | 0.657 |
| Claude-4-Sonnet | 0.618 | 0.839 | 0.757 | 0.830 | 0.678 | 0.733 | 0.785 |
| Gemini-2.0-Flash | 0.680 | 0.828 | 0.910 | 0.818 | 0.637 | 0.728 | 0.798 |
| Gemini-2.5-Pro | 0.661 | 1.241 | 0.944 | 0.898 | 0.839 | 0.918 | 0.980 |
| Claude-3.5-Sonnet | 0.823 | 0.977 | 0.902 | 1.152 | 0.782 | 0.939 | 0.862 |
| Qwen2.5-VL-7B | 1.048 | 1.169 | 1.294 | 0.857 | 0.755 | 0.812 | 0.772 |
Performance by Evaluation Dimensions (Correlation ↑)
| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.681 | 0.571 | 0.625 | 0.572 | 0.567 | 0.512 | 0.385 |
| GPT-5 | 0.429 | 0.256 | 0.438 | 0.383 | 0.463 | 0.277 | 0.295 |
| GPT-4o | 0.482 | 0.382 | 0.539 | 0.442 | 0.472 | 0.277 | 0.363 |
| Claude-4-Sonnet | 0.470 | 0.392 | 0.548 | 0.453 | 0.422 | 0.164 | 0.228 |
| Gemini-2.0-Flash | 0.395 | 0.371 | 0.458 | 0.418 | 0.460 | 0.157 | 0.209 |
| Gemini-2.5-Pro | 0.266 | 0.180 | 0.379 | 0.357 | 0.447 | 0.194 | 0.208 |
| Claude-3.5-Sonnet | 0.395 | 0.325 | 0.491 | 0.366 | 0.456 | 0.137 | 0.259 |
| Qwen2.5-VL-7B | 0.322 | 0.340 | 0.349 | 0.278 | 0.356 | 0.148 | 0.155 |
Key Observations:
- All models struggle most with Aesthetics dimensions (Design Style, Visual Composition, Color Harmony)
- Data Fidelity is relatively easier but still challenging for most models
- VisJudge consistently outperforms baseline models across all six dimensions
Evaluation Framework
VisJudge evaluates visualizations across three fundamental dimensions with six measurable metrics (a small aggregation sketch follows the list):
1. Fidelity - Data Accuracy and Truthfulness
- Data Fidelity: Ensures visual encodings accurately reflect original data without misleading interpretations
2. Expressiveness - Information Clarity and Understandability
- Semantic Readability: Assesses clarity of information encoding and unambiguous decoding
- Insight Discovery: Evaluates effectiveness in revealing data patterns, trends, and outliers
3. Aesthetics - Visual Aesthetics and Refinement
- Design Style: Measures innovation and uniqueness of design elements
- Visual Composition: Focuses on spatial layout, balance, and element positioning
- Color Harmony: Assesses color coordination and functional effectiveness
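When post-processing model outputs, the six metrics can be grouped under their parent dimensions, for example to report per-dimension averages. A minimal sketch; the grouping mirrors the framework above and the helper function is illustrative, not part of the released code.

```python
# Grouping of the six metrics under the Fidelity-Expressiveness-Aesthetics
# framework described above; the aggregation helper is illustrative.
FRAMEWORK = {
    "fidelity": ["data_fidelity"],
    "expressiveness": ["semantic_readability", "insight_discovery"],
    "aesthetics": ["design_style", "visual_composition", "color_harmony"],
}

def dimension_averages(scores):
    """Average the per-metric scores within each top-level dimension."""
    return {
        dim: sum(scores[m] for m in metrics) / len(metrics)
        for dim, metrics in FRAMEWORK.items()
    }

# Example with hypothetical per-metric scores on the 1-5 scale
print(dimension_averages({
    "data_fidelity": 4, "semantic_readability": 5, "insight_discovery": 4,
    "design_style": 3, "visual_composition": 4, "color_harmony": 4,
}))
```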
Usage
Installation
```bash
pip install transformers peft torch pillow accelerate
```
Quick Start
```python
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load base model
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load VisJudge LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"xypkent/visjudge-7b"
)
# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Prepare your visualization
image = Image.open("path/to/your/visualization.png")
# Evaluation prompt
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": """You are a rigorous data visualization evaluation expert. Please evaluate this visualization based on the "Fidelity-Expressiveness-Aesthetics" framework.
The evaluation follows the "Fidelity-Expressiveness-Aesthetics" principle:
- Fidelity: Data accuracy and truthfulness
- Expressiveness: Information clarity and understandability
- Aesthetics: Visual aesthetics and refinement
For each evaluation dimension below, provide a score from 1 to 5 and reasoning based on the scoring criteria:
1. Data Fidelity: Does the visual encoding accurately reflect the data without distortion?
2. Semantic Readability: Is the information clearly encoded and easy to decode?
3. Insight Discovery: Does it effectively reveal patterns, trends, and insights?
4. Design Style: Is the design innovative and distinctive?
5. Visual Composition: Is the layout balanced and well-organized?
6. Color Harmony: Are colors coordinated and effective?
Return Format: JSON object with the following structure:
{
"data_fidelity": {"score": 1-5, "reasoning": "Your explanation here."},
"semantic_readability": {"score": 1-5, "reasoning": "Your explanation here."},
"insight_discovery": {"score": 1-5, "reasoning": "Your explanation here."},
"design_style": {"score": 1-5, "reasoning": "Your explanation here."},
"visual_composition": {"score": 1-5, "reasoning": "Your explanation here."},
"color_harmony": {"score": 1-5, "reasoning": "Your explanation here."},
"average_score": "the average of the above six scores, rounded to 2 decimals"
}
Do not include any additional text, only the JSON object."""}
]
}
]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, excluding the prompt
generated_ids = [out[len(inp):] for inp, out in zip(inputs["input_ids"], outputs)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Example Output
```json
{
"data_fidelity": {
"score": 4,
"reasoning": "The visual encoding accurately represents the data with appropriate scales and minimal distortion."
},
"semantic_readability": {
"score": 5,
"reasoning": "Clear labels, legend, and annotations make the information very easy to understand and decode."
},
"insight_discovery": {
"score": 4,
"reasoning": "The chart effectively reveals key trends and patterns, though some subtle insights could be more emphasized."
},
"design_style": {
"score": 3,
"reasoning": "Uses standard design elements without much innovation, but maintains professional appearance."
},
"visual_composition": {
"score": 4,
"reasoning": "Well-balanced layout with good spacing between elements and clear visual hierarchy."
},
"color_harmony": {
"score": 4,
"reasoning": "Color palette is well-coordinated and supports readability, with good contrast and consistency."
},
"average_score": 4.00
}
```
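Because the model is prompted to return only a JSON object, the response from the Quick Start snippet can be parsed and sanity-checked directly. A minimal sketch, assuming the generation contains nothing but the JSON object as instructed in the prompt:

```python
import json

# Parse the decoded response and recompute the average over the six dimensions.
result = json.loads(response)

dimensions = ["data_fidelity", "semantic_readability", "insight_discovery",
              "design_style", "visual_composition", "color_harmony"]
scores = {d: result[d]["score"] for d in dimensions}
average = round(sum(scores.values()) / len(scores), 2)

print(scores)
print(f"Recomputed average: {average} | model-reported: {result['average_score']}")
```

If you prefer a single checkpoint for deployment, PEFT's `merge_and_unload()` can fold the adapter into the base weights, e.g. `merged = model.merge_and_unload(); merged.save_pretrained("visjudge-7b-merged")`.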
Training Details
Dataset
- Name: VisJudgeBench
- Size: 3,090 expert-annotated visualization samples
- Types: Single visualizations, multi-panel views, dashboards
- Coverage: 32 chart types including bar charts, line charts, heatmaps, sankey diagrams, treemaps, dashboards, and more
Training Method
- Base Model: Qwen2.5-VL-7B-Instruct
- Fine-tuning Method: LoRA (Low-Rank Adaptation) + GRPO (Group Relative Policy Optimization)
- LoRA Configuration (see the sketch after this list):
- Rank: 128
- Alpha: 256
- Target Modules: All attention and MLP layers
- Training Framework: PEFT 0.14.0
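For reference, the configuration above roughly corresponds to a PEFT `LoraConfig` along the following lines. This is a sketch derived from the listed hyperparameters, not the released training script; the concrete target-module names and the dropout value are assumptions.

```python
from peft import LoraConfig

# Approximate LoRA setup matching the hyperparameters listed above.
# Projection names and dropout are illustrative assumptions.
lora_config = LoraConfig(
    r=128,            # LoRA rank
    lora_alpha=256,   # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```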
Key Improvements
- Human-like Scoring: Mean score μ=3.11 (vs. human μ=3.13), eliminating the score-inflation bias seen in other models
- Balanced Assessment: Avoids both the overly conservative (Gemini-2.5-Pro, μ=3.02) and overly generous (Qwen2.5-VL-7B, μ=3.89) biases
- Complexity Handling: Maintains performance across single visualizations (0.577), multi-panel views (0.565), and complex dashboards (0.375)
Supported Visualization Types
Single Visualizations (22 types)
Bar Chart, Pie Chart, Line Chart, Area Chart, Treemap, Sankey Diagram, Heatmap, Scatter Plot, Histogram, Donut Chart, Funnel Chart, Bubble Chart, Choropleth Map, Radar Chart, Network Graph, Candlestick Chart, Gauge Chart, Box Plot, Point Map, Word Cloud, Violin Plot, and more
Multiple Visualizations (5 types)
Comparison Views, Small Multiples, Coordinated Views, Overview+Detail
Dashboards (5 types)
Analytical Dashboard, Operational Dashboard, Interactive Dashboard, Strategic Dashboard
Limitations
- Performance degrades with increasing visualization complexity (dashboards are most challenging)
- Best suited for visualization types seen during training
- Aesthetic dimensions (especially Visual Composition in complex dashboards) remain challenging
- Inherits any biases present in the base Qwen2.5-VL model
Citation
If you use VisJudge in your research, please cite:
```bibtex
@misc{xie2025visjudge,
title={VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations},
author={Yupeng Xie and Zhiyang Zhang and Yifan Wu and Sirong Lu and Jiayi Zhang and Zhaoyang Yu and Jinlin Wang and Sirui Hong and Bang Liu and Chenglin Wu and Yuyu Luo},
year={2025},
eprint={2510.22373},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.22373}
}
```
Resources
- Paper: arXiv:2510.22373
- Dataset: VisJudgeBench on GitHub
- Contact: [email protected]
License
This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-VL model.
Acknowledgments
This model is built upon Qwen2.5-VL-7B-Instruct by Alibaba Cloud. We thank the Qwen team for their excellent foundation model.
Developed by: Yupeng Xie and team at HKUST-GZ
Framework Versions: PEFT 0.14.0 | Transformers 4.x | PyTorch 2.x