GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Recent years have seen impressive advances in text-to-image generation, with both dedicated image generators and unified multimodal models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assesses color precision.
Color is fundamental to human visual perception and communication, and it is critical for applications ranging from art to design workflows that require brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting numerical RGB values or aligning with human color expectations.
To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in established color systems (the Munsell-based ISCC-NBS system and CSS3/X11 named colors) and including numerical color specifications that are absent from existing benchmarks. With over 44K color-focused prompts covering 400+ colors, it reveals models' true color generation capabilities via perceptual and automated assessments.
GenColorBench defines five evaluation tasks across two color systems, with an automated evaluation pipeline combining object detection, segmentation, and perceptual color metrics.
Our benchmark covers five color generation tasks: Color Name Accuracy (CNA), Color-Object Association (COA), Multi-Object Color Composition (MOC), Implicit Color Association (ICA), and Numeric Color Understanding (NCU). The evaluation pipeline uses VQA for object verification, GroundingDINO + SAM for mask generation, and the CIEDE2000 color difference (ΔE00) in CIELAB space for color matching.
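As a minimal sketch of the color-matching stage, assuming an object mask already produced by the detection/segmentation step: the check below compares the masked region's mean color against the target color in CIELAB via CIEDE2000. The function name, the use of a simple mean (rather than the dominant-color extraction described later), and the acceptance threshold are illustrative assumptions, not the benchmark's exact settings.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def color_match(image_rgb: np.ndarray, mask: np.ndarray,
                target_rgb: tuple, threshold: float = 20.0) -> bool:
    """Return True if the masked region's mean color lies within
    `threshold` CIEDE2000 units of the target color.

    Sketch only: the real pipeline may use a dominant color rather
    than a mean, and a different threshold.
    """
    # Convert the uint8 image to CIELAB (rgb2lab expects floats in [0, 1]).
    lab = rgb2lab(image_rgb.astype(np.float64) / 255.0)
    # Average the LAB values over the object pixels selected by the mask.
    mean_lab = lab[mask.astype(bool)].mean(axis=0)
    # Convert the target RGB triplet to LAB the same way.
    target_lab = rgb2lab(
        np.array(target_rgb, dtype=np.float64)[None, None] / 255.0)[0, 0]
    # Perceptual color difference between the two LAB triplets.
    delta_e = deltaE_ciede2000(mean_lab, target_lab)
    return float(delta_e) <= threshold
```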
GenColorBench is grounded in established color naming conventions, with extensive coverage analysis across training data and evaluation tasks.
Analysis of color term frequencies in large-scale image-text datasets across four categories: ISCC-NBS L2, CSS3/X11 named colors, modifiers (light/dark), and numerics (RGB, HEX). Frequencies range from common terms such as "black" (68M occurrences) to rare colors such as "linen" (2M).
We extract dominant colors using OneHue in CIE Luv space. After SAM segmentation, we compute the dominant hue via PCA on chromatic components (u*, v*), providing robust estimation for textured objects.
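One plausible reading of this step, sketched below: gather the (u*, v*) chroma of the masked pixels, take the leading eigenvector of their second-moment matrix (an uncentered PCA), orient it toward the mean chroma, and read off the hue angle. The function name and any specifics beyond the text's description are assumptions, not OneHue's actual implementation.

```python
import numpy as np
from skimage.color import rgb2luv

def dominant_hue(image_rgb: np.ndarray, mask: np.ndarray) -> float:
    """Estimate a dominant hue angle (radians) in the u*v* plane from
    the chroma of masked pixels, which downweights achromatic texture.

    Sketch of the idea only, not the benchmark's OneHue code.
    """
    # Convert to CIE Luv; channels are (L, u*, v*).
    luv = rgb2luv(image_rgb.astype(np.float64) / 255.0)
    # N x 2 matrix of chromatic components for the object pixels.
    uv = luv[mask.astype(bool)][:, 1:]
    # Uncentered PCA: leading eigenvector of the second-moment matrix
    # aligns with the dominant chroma direction.
    moment = uv.T @ uv / len(uv)
    _, vecs = np.linalg.eigh(moment)        # eigh sorts eigenvalues ascending
    axis = vecs[:, -1]                      # leading principal direction
    if axis @ uv.mean(axis=0) < 0:          # orient toward the mean chroma
        axis = -axis
    return float(np.arctan2(axis[1], axis[0]))
```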
Comprehensive evaluation across state-of-the-art T2I models. Results are reported at three color specification granularities: ISCC-NBS L1 (13 basic colors), ISCC-NBS L3 (267 specific colors), and CSS3/X11 (147 web-standard colors); a sketch of how granularity can change a verdict follows the tables.
| Model | Res. | Type | CNA L1 | CNA L3 | CNA CSS | COA L1 | COA L3 | COA CSS | MOC L1 | MOC L3 | MOC CSS | ICA L1 | ICA L3 | ICA CSS | NCU L1 | NCU L3 | NCU CSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pixart Alpha | 1024 | DM | 68.78 | 44.35 | 40.56 | 21.54 | 10.21 | 8.69 | 9.54 | 6.35 | 5.29 | 14.90 | 16.55 | 14.09 | 9.45 | 5.42 | 4.21 |
| SD 3.5 | 1024 | DM | 64.37 | 35.80 | 37.74 | 32.95 | 18.62 | 10.02 | 21.54 | 11.30 | 9.61 | 24.42 | 8.30 | 8.39 | 14.92 | 7.53 | 5.78 |
| Sana | 1024 | DM | 62.89 | 43.02 | 42.92 | 28.57 | 14.92 | 10.81 | 20.87 | 14.96 | 16.74 | 20.50 | 4.90 | 3.01 | 25.10 | 12.85 | 9.45 |
| Pixart Sigma | 1024 | DM | 62.14 | 46.21 | 39.02 | 25.40 | 14.16 | 10.42 | 11.21 | 12.91 | 10.17 | 19.67 | 14.51 | 15.48 | 10.21 | 5.11 | 4.09 |
| Bagel | 1024 | CoT | 60.08 | 26.85 | 35.68 | 31.57 | 11.12 | 9.09 | 16.81 | 15.70 | 13.12 | 25.37 | 21.66 | 17.02 | 22.51 | 11.62 | 9.73 |
| Flux | 1024 | DM | 58.25 | 31.25 | 23.87 | 31.74 | 14.13 | 11.54 | 19.86 | 6.59 | 5.02 | 27.19 | 23.36 | 16.92 | 13.82 | 7.74 | 5.86 |
| SD 3 | 1024 | DM | 58.07 | 39.88 | 44.13 | 29.41 | 17.45 | 20.49 | 18.24 | 4.23 | 3.75 | 18.34 | 7.74 | 6.79 | 11.80 | 5.92 | 4.63 |
| OmniGen2 | 512 | AR | 57.31 | 14.22 | 16.82 | 34.23 | 19.93 | 16.47 | 23.78 | 7.51 | 5.13 | 25.09 | 16.27 | 12.28 | 26.38 | 14.21 | 11.88 |
| Blip3o | 1024 | MM | 56.54 | 32.10 | 38.67 | 18.48 | 11.88 | 16.41 | 17.13 | 7.21 | 5.47 | 28.22 | 16.39 | 10.92 | 43.20 | 23.65 | 18.08 |
| Janus Pro | 384 | AR | 41.55 | 23.12 | 28.60 | 22.96 | 11.97 | 14.06 | 15.45 | 12.24 | 9.54 | 24.98 | 6.46 | 6.45 | 5.41 | 2.98 | 2.59 |
| CogView4 | 1024 | DM | 40.11 | 21.67 | 30.87 | 21.87 | 11.27 | 12.70 | 12.10 | 10.44 | 9.35 | 16.78 | 18.01 | 15.97 | 10.95 | 5.32 | 4.25 |
| Model | Res. | Type | CNA L1 | CNA L3 | CNA CSS | COA L1 | COA L3 | COA CSS | MOC L1 | MOC L3 | MOC CSS | ICA L1 | ICA L3 | ICA CSS | NCU L1 | NCU L3 | NCU CSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-Image-1.5 | 1024 | API | 76.63 | 52.47 | 50.82 | 45.72 | 36.65 | 38.38 | 42.84 | 34.56 | 39.72 | 35.28 | 28.47 | 23.65 | 51.82 | 48.12 | 49.91 |
| Pixart Alpha | 1024 | DM | 69.82 | 25.88 | 16.91 | 13.23 | 5.76 | 4.74 | 10.78 | 3.56 | 2.88 | 15.55 | 6.54 | 5.62 | 4.80 | 2.40 | 2.03 |
| SD 3.5 | 1024 | DM | 68.61 | 24.12 | 15.76 | 17.40 | 7.56 | 6.21 | 13.64 | 4.50 | 3.69 | 25.44 | 10.70 | 9.20 | 7.80 | 3.60 | 2.86 |
| Pixart Sigma | 1024 | DM | 68.05 | 27.48 | 16.15 | 15.00 | 6.48 | 5.31 | 12.66 | 4.81 | 3.94 | 14.98 | 6.30 | 5.42 | 5.40 | 2.40 | 1.94 |
| Sana | 1024 | DM | 67.31 | 23.68 | 15.48 | 17.04 | 7.41 | 6.09 | 13.22 | 4.38 | 3.50 | 20.64 | 8.68 | 7.46 | 9.00 | 4.20 | 3.44 |
| Bagel | 1024 | CoT | 67.05 | 22.59 | 14.76 | 19.44 | 4.62 | 3.81 | 12.04 | 4.69 | 3.81 | 26.29 | 11.50 | 9.80 | 10.80 | 5.40 | 2.92 |
| FLUX.2 | 1024 | DM | 66.42 | 38.67 | 32.45 | 35.82 | 17.34 | 13.21 | 24.67 | 10.82 | 8.43 | 28.94 | 24.18 | 17.65 | 48.72 | 39.29 | 42.67 |
| Flux | 1024 | DM | 65.10 | 21.93 | 14.33 | 15.78 | 6.87 | 5.64 | 12.24 | 4.06 | 3.25 | 28.27 | 11.89 | 10.22 | 4.80 | 2.40 | 1.88 |
| SD 3 | 1024 | DM | 65.10 | 23.72 | 16.30 | 16.20 | 7.41 | 6.45 | 12.58 | 2.69 | 2.19 | 18.94 | 7.97 | 6.85 | 6.00 | 3.00 | 2.24 |
| OmniGen2 | 512 | AR | 63.80 | 8.45 | 14.04 | 21.06 | 9.18 | 7.53 | 14.70 | 5.06 | 3.44 | 26.29 | 11.06 | 9.50 | 15.00 | 7.20 | 3.59 |
| Z-Image | 1024 | DM | 63.27 | 35.92 | 38.84 | 30.65 | 15.47 | 12.83 | 19.42 | 9.35 | 7.64 | 24.56 | 15.28 | 12.37 | 22.18 | 16.92 | 12.45 |
| Blip3o | 1024 | MM | 63.15 | 21.27 | 13.90 | 11.37 | 4.95 | 6.45 | 12.28 | 3.25 | 2.63 | 29.40 | 12.37 | 10.63 | 21.00 | 10.80 | 5.88 |
| Qwen-Image | 1024 | MM | 59.83 | 28.45 | 34.12 | 33.47 | 19.28 | 17.63 | 22.35 | 14.67 | 11.28 | 26.82 | 18.93 | 14.52 | 25.64 | 18.73 | 14.26 |
| Janus Pro | 384 | AR | 46.22 | 15.57 | 10.17 | 14.13 | 6.15 | 5.04 | 11.66 | 5.19 | 4.25 | 26.01 | 10.94 | 9.40 | 3.00 | 1.20 | 1.23 |
| CogView4 | 1024 | DM | 44.92 | 15.13 | 9.89 | 13.44 | 5.85 | 4.80 | 10.98 | 4.44 | 3.63 | 17.53 | 7.37 | 6.34 | 5.40 | 2.40 | 2.04 |
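To make the granularity effect concrete, here is a minimal sketch (not the benchmark's code) of nearest-name classification under a vocabulary: an extracted color is assigned to whichever entry is closest under CIEDE2000, so the same generation can pass at the coarse L1 level yet fail against the 267-entry L3 or 147-entry CSS3/X11 lists. The two-entry vocabulary and its RGB anchors below are placeholders.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

# Placeholder vocabulary; the benchmark uses the full ISCC-NBS / CSS3 lists.
VOCAB_RGB = {"red": (255, 0, 0), "pink": (255, 192, 203)}

def to_lab(rgb):
    """Convert an (R, G, B) triplet in [0, 255] to a CIELAB triplet."""
    return rgb2lab(np.array(rgb, dtype=np.float64)[None, None] / 255.0)[0, 0]

def nearest_name(color_rgb, vocab=VOCAB_RGB):
    """Return the vocabulary entry closest to `color_rgb` under CIEDE2000."""
    lab = to_lab(color_rgb)
    return min(vocab, key=lambda name: float(deltaE_ciede2000(lab, to_lab(vocab[name]))))

# A washed-out "red" generation resolves to "pink" in a fine vocabulary,
# while a coarse vocabulary without "pink" would still score it as "red".
print(nearest_name((250, 150, 160)))  # -> "pink"
```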
Detailed analysis of model behavior across color modifier types and object categories.
Radar chart comparing model accuracy across five color categories: Basic Colors, Intermediate Colors, Colors with Light Modifiers (e.g., "light blue"), Dark Modifiers (e.g., "dark green"), and "-ish" Modifiers (e.g., "reddish"). Models show consistent performance on basic colors but struggle with modified color terms.
Per-category color generation accuracy across 11 T2I models. Performance varies significantly by object type: "Clothes and Accessories" proves challenging, while "Fruits and Vegetables", whose members carry strong color priors, is easier.
Distribution of generated colors per object category, revealing model-specific biases. "Animals" shows heavy bias toward brown/black tones across all models, while "Fruits and Vegetables" exhibits expected yellow/green dominance. Models vary in how strongly they follow category color priors vs. prompt specifications.
@article{butt2025gencolorbench,
  author  = {Butt, Muhammad Atif and Gomez-Villa, Alexandra and Wu, Tao and
             Vazquez-Corral, Javier and Van De Weijer, Joost and Wang, Kai},
  title   = {GenColorBench: A Color Evaluation Benchmark for Text-to-Image
             Generation Models},
  journal = {arXiv preprint arXiv:2510.20586},
  year    = {2025},
}