VisJudge: Qwen2.5-VL-7B LoRA for Visualization Quality Assessment
VisJudge is a specialized model fine-tuned on Qwen2.5-VL-7B-Instruct for visualization quality and aesthetics assessment. It significantly outperforms state-of-the-art multimodal large language models (MLLMs) including GPT-5, GPT-4o, and Claude-4-Sonnet on visualization evaluation tasks.
Paper: VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Model Overview
VisJudge addresses the significant gap between general MLLMs and human expert judgment in visualization quality assessment. It was trained with GRPO (Group Relative Policy Optimization) on the VisJudgeBench dataset of 3,090 expert-annotated samples and evaluates visualizations along the Fidelity-Expressiveness-Aesthetics framework.
Key Features
- State-of-the-Art Performance: 19.8% MAE improvement over GPT-5
- Six-Dimensional Evaluation: Data Fidelity, Semantic Readability, Insight Discovery, Design Style, Visual Composition, Color Harmony
- Comprehensive Coverage: Supports 32 visualization types including single charts, multi-panel views, and dashboards
- Expert-Level Assessment: Achieves 0.681 correlation with human experts (vs. 0.429 for GPT-5)
Performance Benchmarks
Overall Performance Comparison
| Model | MAE ↓ | MSE ↓ | Correlation ↑ |
|---|---|---|---|
| VisJudge | 0.442 | 0.306 | 0.681 |
| GPT-5 | 0.551 | 0.484 | 0.429 |
| GPT-4o | 0.609 | 0.575 | 0.482 |
| Claude-4-Sonnet | 0.618 | 0.596 | 0.470 |
| Gemini-2.0-Flash | 0.680 | 0.716 | 0.395 |
| Gemini-2.5-Pro | 0.661 | 0.674 | 0.266 |
| Claude-3.5-Sonnet | 0.823 | 1.006 | 0.395 |
| Qwen2.5-VL-7B | 1.048 | 1.502 | 0.322 |
Key Achievements:
- 19.8% MAE improvement over GPT-5 (0.551 → 0.442)
- 58.7% higher correlation with human experts vs. GPT-5 (0.429 → 0.681)
- Outperforms all commercial MLLMs across all metrics (see the metric sketch below)
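For reference, a minimal sketch of how these metrics relate paired model scores to expert scores. The exact evaluation protocol, including which correlation coefficient is reported, follows the paper; both Pearson and Spearman are shown here only for illustration.

```python
# Sketch of the evaluation metrics in the tables above, computed from paired
# model scores and human expert scores. The correlation coefficient actually
# reported by the paper is not restated here, so both variants are shown.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def score_metrics(pred, human):
    pred = np.asarray(pred, dtype=float)
    human = np.asarray(human, dtype=float)
    return {
        "MAE": float(np.mean(np.abs(pred - human))),    # lower is better
        "MSE": float(np.mean((pred - human) ** 2)),     # lower is better
        "Pearson r": float(pearsonr(pred, human)[0]),   # higher is better
        "Spearman rho": float(spearmanr(pred, human)[0]),
    }

# Hypothetical scores on the 1-5 scale
print(score_metrics([3.5, 4.0, 2.5, 3.0], [3.7, 4.2, 2.0, 3.1]))
```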
Performance by Evaluation Dimensions (MAE ↓)
| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.442 | 0.662 | 0.649 | 0.679 | 0.581 | 0.546 | 0.604 |
| GPT-5 | 0.551 | 0.861 | 0.780 | 0.776 | 0.648 | 0.698 | 0.682 |
| GPT-4o | 0.609 | 0.986 | 0.804 | 0.742 | 0.608 | 0.694 | 0.657 |
| Claude-4-Sonnet | 0.618 | 0.839 | 0.757 | 0.830 | 0.678 | 0.733 | 0.785 |
| Gemini-2.0-Flash | 0.680 | 0.828 | 0.910 | 0.818 | 0.637 | 0.728 | 0.798 |
| Gemini-2.5-Pro | 0.661 | 1.241 | 0.944 | 0.898 | 0.839 | 0.918 | 0.980 |
| Claude-3.5-Sonnet | 0.823 | 0.977 | 0.902 | 1.152 | 0.782 | 0.939 | 0.862 |
| Qwen2.5-VL-7B | 1.048 | 1.169 | 1.294 | 0.857 | 0.755 | 0.812 | 0.772 |
Performance by Evaluation Dimensions (Correlation ↑)
| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.681 | 0.571 | 0.625 | 0.572 | 0.567 | 0.512 | 0.385 |
| GPT-5 | 0.429 | 0.256 | 0.438 | 0.383 | 0.463 | 0.277 | 0.295 |
| GPT-4o | 0.482 | 0.382 | 0.539 | 0.442 | 0.472 | 0.277 | 0.363 |
| Claude-4-Sonnet | 0.470 | 0.392 | 0.548 | 0.453 | 0.422 | 0.164 | 0.228 |
| Gemini-2.0-Flash | 0.395 | 0.371 | 0.458 | 0.418 | 0.460 | 0.157 | 0.209 |
| Gemini-2.5-Pro | 0.266 | 0.180 | 0.379 | 0.357 | 0.447 | 0.194 | 0.208 |
| Claude-3.5-Sonnet | 0.395 | 0.325 | 0.491 | 0.366 | 0.456 | 0.137 | 0.259 |
| Qwen2.5-VL-7B | 0.322 | 0.340 | 0.349 | 0.278 | 0.356 | 0.148 | 0.155 |
Key Observations:
- All models struggle most with Aesthetics dimensions (Design Style, Visual Composition, Color Harmony)
- Data Fidelity is relatively easier but still challenging for most models
- VisJudge consistently outperforms baseline models across all six dimensions
Evaluation Framework
VisJudge evaluates visualizations across three fundamental dimensions with six measurable metrics (a small aggregation sketch follows the list):
1. Fidelity - Data Accuracy and Truthfulness
- Data Fidelity: Ensures visual encodings accurately reflect original data without misleading interpretations
2. Expressiveness - Information Clarity and Understandability
- Semantic Readability: Assesses clarity of information encoding and unambiguous decoding
- Insight Discovery: Evaluates effectiveness in revealing data patterns, trends, and outliers
3. Aesthetics - Visual Aesthetics and Refinement
- Design Style: Measures innovation and uniqueness of design elements
- Visual Composition: Focuses on spatial layout, balance, and element positioning
- Color Harmony: Assesses color coordination and functional effectiveness
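When post-processing model outputs, the six metrics can be grouped under their parent dimensions, for example to report per-dimension averages. A minimal sketch; the grouping mirrors the framework above and the helper function is illustrative, not part of the released code.

```python
# Grouping of the six metrics under the Fidelity-Expressiveness-Aesthetics
# framework described above; the aggregation helper is illustrative.
FRAMEWORK = {
    "fidelity": ["data_fidelity"],
    "expressiveness": ["semantic_readability", "insight_discovery"],
    "aesthetics": ["design_style", "visual_composition", "color_harmony"],
}

def dimension_averages(scores):
    """Average the per-metric scores within each top-level dimension."""
    return {
        dim: sum(scores[m] for m in metrics) / len(metrics)
        for dim, metrics in FRAMEWORK.items()
    }

# Example with hypothetical per-metric scores on the 1-5 scale
print(dimension_averages({
    "data_fidelity": 4, "semantic_readability": 5, "insight_discovery": 4,
    "design_style": 3, "visual_composition": 4, "color_harmony": 4,
}))
```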
Usage
Installation
```bash
pip install transformers peft torch pillow accelerate
```
Quick Start
```python
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load base model
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load VisJudge LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"xypkent/visjudge-7b"
)
# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Prepare your visualization
image = Image.open("path/to/your/visualization.png")
# Evaluation prompt
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": """You are a rigorous data visualization evaluation expert. Please evaluate this visualization based on the "Fidelity-Expressiveness-Aesthetics" framework.
The evaluation follows the "Fidelity-Expressiveness-Aesthetics" principle:
- Fidelity: Data accuracy and truthfulness
- Expressiveness: Information clarity and understandability
- Aesthetics: Visual aesthetics and refinement
For each evaluation dimension below, provide a score from 1 to 5 and reasoning based on the scoring criteria:
1. Data Fidelity: Does the visual encoding accurately reflect the data without distortion?
2. Semantic Readability: Is the information clearly encoded and easy to decode?
3. Insight Discovery: Does it effectively reveal patterns, trends, and insights?
4. Design Style: Is the design innovative and distinctive?
5. Visual Composition: Is the layout balanced and well-organized?
6. Color Harmony: Are colors coordinated and effective?
Return Format: JSON object with the following structure:
{
"data_fidelity": {"score": 1-5, "reasoning": "Your explanation here."},
"semantic_readability": {"score": 1-5, "reasoning": "Your explanation here."},
"insight_discovery": {"score": 1-5, "reasoning": "Your explanation here."},
"design_style": {"score": 1-5, "reasoning": "Your explanation here."},
"visual_composition": {"score": 1-5, "reasoning": "Your explanation here."},
"color_harmony": {"score": 1-5, "reasoning": "Your explanation here."},
"average_score": "the average of the above six scores, rounded to 2 decimals"
}
Do not include any additional text, only the JSON object."""}
]
}
]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, excluding the prompt
generated_ids = [out[len(inp):] for inp, out in zip(inputs["input_ids"], outputs)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Example Output
```json
{
"data_fidelity": {
"score": 4,
"reasoning": "The visual encoding accurately represents the data with appropriate scales and minimal distortion."
},
"semantic_readability": {
"score": 5,
"reasoning": "Clear labels, legend, and annotations make the information very easy to understand and decode."
},
"insight_discovery": {
"score": 4,
"reasoning": "The chart effectively reveals key trends and patterns, though some subtle insights could be more emphasized."
},
"design_style": {
"score": 3,
"reasoning": "Uses standard design elements without much innovation, but maintains professional appearance."
},
"visual_composition": {
"score": 4,
"reasoning": "Well-balanced layout with good spacing between elements and clear visual hierarchy."
},
"color_harmony": {
"score": 4,
"reasoning": "Color palette is well-coordinated and supports readability, with good contrast and consistency."
},
"average_score": 4.00
}
```
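Because the model is prompted to return only a JSON object, the response from the Quick Start snippet can be parsed and sanity-checked directly. A minimal sketch, assuming the generation contains nothing but the JSON object as instructed in the prompt:

```python
import json

# Parse the decoded response and recompute the average over the six dimensions.
result = json.loads(response)

dimensions = ["data_fidelity", "semantic_readability", "insight_discovery",
              "design_style", "visual_composition", "color_harmony"]
scores = {d: result[d]["score"] for d in dimensions}
average = round(sum(scores.values()) / len(scores), 2)

print(scores)
print(f"Recomputed average: {average} | model-reported: {result['average_score']}")
```

If you prefer a single checkpoint for deployment, PEFT's `merge_and_unload()` can fold the adapter into the base weights, e.g. `merged = model.merge_and_unload(); merged.save_pretrained("visjudge-7b-merged")`.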
Training Details
Dataset
- Name: VisJudgeBench
- Size: 3,090 expert-annotated visualization samples
- Types: Single visualizations, multi-panel views, dashboards
- Coverage: 32 chart types including bar charts, line charts, heatmaps, sankey diagrams, treemaps, dashboards, and more
Training Method
- Base Model: Qwen2.5-VL-7B-Instruct
- Fine-tuning Method: LoRA (Low-Rank Adaptation) + GRPO (Group Relative Policy Optimization)
- LoRA Configuration (see the sketch after this list):
- Rank: 128
- Alpha: 256
- Target Modules: All attention and MLP layers
- Training Framework: PEFT 0.14.0
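For reference, the configuration above roughly corresponds to a PEFT `LoraConfig` along the following lines. This is a sketch derived from the listed hyperparameters, not the released training script; the concrete target-module names and the dropout value are assumptions.

```python
from peft import LoraConfig

# Approximate LoRA setup matching the hyperparameters listed above.
# Projection names and dropout are illustrative assumptions.
lora_config = LoraConfig(
    r=128,            # LoRA rank
    lora_alpha=256,   # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```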
Key Improvements
- Human-like Scoring: Mean score μ=3.11 (vs. human μ=3.13), eliminating the score-inflation bias seen in other models
- Balanced Assessment: Avoids both the overly conservative (Gemini-2.5-Pro, μ=3.02) and overly generous (Qwen2.5-VL-7B, μ=3.89) biases
- Complexity Handling: Maintains performance across single visualizations (0.577), multi-panel views (0.565), and complex dashboards (0.375)
Supported Visualization Types
Single Visualizations (22 types)
Bar Chart, Pie Chart, Line Chart, Area Chart, Treemap, Sankey Diagram, Heatmap, Scatter Plot, Histogram, Donut Chart, Funnel Chart, Bubble Chart, Choropleth Map, Radar Chart, Network Graph, Candlestick Chart, Gauge Chart, Box Plot, Point Map, Word Cloud, Violin Plot, and more
Multiple Visualizations (5 types)
Comparison Views, Small Multiples, Coordinated Views, Overview+Detail
Dashboards (5 types)
Analytical Dashboard, Operational Dashboard, Interactive Dashboard, Strategic Dashboard
Limitations
- Performance degrades with increasing visualization complexity (dashboards are most challenging)
- Best suited for visualization types seen during training
- Aesthetic dimensions (especially Visual Composition in complex dashboards) remain challenging
- Inherits any biases present in the base Qwen2.5-VL model
Citation
If you use VisJudge in your research, please cite:
```bibtex
@misc{xie2025visjudge,
title={VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations},
author={Yupeng Xie and Zhiyang Zhang and Yifan Wu and Sirong Lu and Jiayi Zhang and Zhaoyang Yu and Jinlin Wang and Sirui Hong and Bang Liu and Chenglin Wu and Yuyu Luo},
year={2025},
eprint={2510.22373},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.22373}
}
```
Resources
- Paper: arXiv:2510.22373
- Dataset: VisJudgeBench on GitHub
- Contact: [email protected]
License
This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-VL model.
Acknowledgments
This model is built upon Qwen2.5-VL-7B-Instruct by Alibaba Cloud. We thank the Qwen team for their excellent foundation model.
Developed by: Yupeng Xie and team at HKUST-GZ
Framework Versions: PEFT 0.14.0 | Transformers 4.x | PyTorch 2.x