arxiv:2510.22373

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Published on Oct 25
· Submitted by xypkent on Oct 29

AI-generated summary

VisJudge-Bench is a benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality, revealing gaps compared to human experts and demonstrating improvements with the VisJudge model.

Abstract

Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps in judgment compared to human experts, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
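The headline metrics are MAE and correlation between model-predicted scores and expert ratings. As a minimal sketch of how such a comparison could be computed, assuming per-sample scores on a shared rating scale and Spearman rank correlation (the abstract does not state which correlation coefficient the benchmark uses), the snippet below is illustrative only and not the benchmark's official evaluation code.

```python
import numpy as np
from scipy.stats import spearmanr


def evaluate_judge(model_scores, human_scores):
    """Compare model-predicted quality scores with expert ratings.

    model_scores, human_scores: 1-D sequences of per-sample scores on
    the same rating scale. Hypothetical helper for illustration, not
    the paper's released evaluation code.
    """
    model_scores = np.asarray(model_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)

    # Mean Absolute Error between model and expert scores.
    mae = np.mean(np.abs(model_scores - human_scores))
    # Rank correlation with expert ratings (Spearman assumed here).
    corr, _ = spearmanr(model_scores, human_scores)
    return mae, corr


# Toy usage with made-up scores (not benchmark data):
mae, corr = evaluate_judge([3.5, 2.0, 4.5, 3.0], [4.0, 2.5, 4.0, 3.5])
print(f"MAE = {mae:.3f}, correlation = {corr:.3f}")
```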

Community

Paper submitter

We introduce VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' capabilities in assessing visualization aesthetics and quality. Built on the "Fidelity-Expressiveness-Aesthetics" framework, our benchmark contains 3,090 expert-annotated samples covering 32 chart types across single visualizations, multiple visualizations, and dashboards. Extensive testing reveals that even GPT-5 shows significant gaps compared to human experts (MAE: 0.551, correlation: 0.429). To address this, we developed VisJudge, a specialized model that achieves a 19.8% MAE reduction (0.442) and a 58.7% correlation improvement (0.681) over GPT-5, demonstrating the necessity and effectiveness of domain-specific training for visualization quality assessment. Dataset and code are available at https://github.com/HKUSTDial/VisJudgeBench.
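As a rough illustration of how the "Fidelity-Expressiveness-Aesthetics" framing might be represented in evaluation code, the sketch below defines a hypothetical per-sample rating record with one score per dimension plus an overall score. The field names, 1-5 scale, and unweighted aggregation are assumptions for illustration, not the benchmark's actual annotation schema.

```python
from dataclasses import dataclass


@dataclass
class VisRating:
    """Hypothetical rating record for one visualization sample.

    Scores follow the Fidelity-Expressiveness-Aesthetics framing;
    the 1-5 scale and field names are assumptions, not the paper's schema.
    """
    fidelity: float        # is the data encoded faithfully?
    expressiveness: float  # is the information communicated clearly?
    aesthetics: float      # is the visual design well crafted?

    def overall(self) -> float:
        # Simple unweighted mean; the benchmark may aggregate differently.
        return (self.fidelity + self.expressiveness + self.aesthetics) / 3.0


# Toy usage with made-up scores:
sample = VisRating(fidelity=4.0, expressiveness=3.5, aesthetics=4.5)
print(f"overall = {sample.overall():.2f}")
```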
