CVPR 2026

GenColorBench

A Color Evaluation Benchmark for Text-to-Image Generation

1 Computer Vision Center, Spain · 2 Computer Science Dept., Universitat Autònoma de Barcelona, Spain · 3 Program of Computer Science, City University of Hong Kong (Dongguan) · 4 City University of Hong Kong

Can your T2I model generate the exact color you asked for?

Example color specifications: Crimson (CSS named color) · rgb(255,99,71) (RGB numeric) · #4B0082 (hex code) · Bluish Gray (ISCC-NBS L2)
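Each of these formats ultimately resolves to a concrete RGB target. As a rough illustration (not the benchmark's own tooling), the named and numeric forms can be parsed with standard libraries; ISCC-NBS names would additionally need a centroid lookup table, which is omitted here:

```python
# Sketch: resolve a prompt's color specification to an sRGB triple.
# matplotlib's color table covers CSS/X11 names and hex codes; the
# rgb(...) form is parsed by hand. ISCC-NBS handling is omitted.
import re
from matplotlib.colors import to_rgb

def spec_to_rgb(spec: str) -> tuple:
    m = re.fullmatch(r"rgb\((\d+),\s*(\d+),\s*(\d+)\)", spec.strip())
    if m:
        return tuple(int(v) for v in m.groups())
    r, g, b = to_rgb(spec)  # accepts "crimson", "#4B0082", ...
    return tuple(round(c * 255) for c in (r, g, b))

print(spec_to_rgb("crimson"))         # (220, 20, 60)
print(spec_to_rgb("rgb(255,99,71)"))  # (255, 99, 71)
print(spec_to_rgb("#4B0082"))         # (75, 0, 130)
```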

Abstract

Recent years have seen impressive advances in text-to-image generation, with both dedicated image-generation models and unified multimodal models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assesses color precision.

Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations.

To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation. It is grounded in established color systems, the Munsell-based ISCC-NBS and CSS3/X11, and includes numerical color specifications that are absent from existing benchmarks. With 44K+ color-focused prompts covering 400+ colors, it reveals models' true color-generation capabilities through perceptual and automated assessments.

Benchmark at a glance: 44K+ prompts (full benchmark) · <10K prompts (mini version) · 400+ colors · 5 tasks · 2 color systems

Method Overview

GenColorBench defines five evaluation tasks across two color systems, with an automated evaluation pipeline combining object detection, segmentation, and perceptual color metrics.

Figure 1: GenColorBench Tasks and Evaluation Pipeline

Our benchmark covers five color generation tasks: Color Name Accuracy (CNA), Color-Object Association (COA), Multi-Object Color Composition (MOC), Implicit Color Association (ICA), and Numeric Color Understanding (NCU). The evaluation pipeline uses VQA for object verification, GroundingDINO + SAM for mask generation, and CIEDE2000 (ΔE) in LAB space for color matching.
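To make the final matching step concrete, here is a minimal sketch using scikit-image; the function name and the match threshold are illustrative placeholders, not the benchmark's exact implementation:

```python
# Sketch: score an extracted dominant color against the prompt's target
# color with CIEDE2000 (Delta E) in CIELAB, as in the pipeline above.
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def ciede2000_match(rgb_pred, rgb_target, threshold=20.0):
    """Return (delta_e, is_match) for two sRGB triples in [0, 255].

    `threshold` is an illustrative tolerance, not the paper's value.
    """
    # rgb2lab expects floats in [0, 1]; reshape each color to a 1x1 image.
    to_lab = lambda c: rgb2lab(np.asarray(c, float).reshape(1, 1, 3) / 255.0)
    delta_e = float(deltaE_ciede2000(to_lab(rgb_pred), to_lab(rgb_target))[0, 0])
    return delta_e, delta_e <= threshold

# Example: distance between tomato rgb(255, 99, 71) and pure red.
print(ciede2000_match((255, 99, 71), (255, 0, 0)))
```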

Color Distribution Analysis

GenColorBench is grounded in established color naming conventions with extensive coverage analysis across training data and evaluation tasks.

Figure 2: Color Distribution in Training Data (ISCC-NBS, CSS3/X11, Modifiers, Numeric)

Analysis of color-term frequencies in large-scale image-text datasets across four categories: ISCC-NBS L2, CSS3/X11 named colors, modifiers (light/dark), and numerics (RGB, HEX). Frequencies range from common terms such as "black" (68M occurrences) to rare names such as "linen" (2M).

Figure 3: Dominant Color Extraction (masked object and CIE Luv distribution)

We extract dominant colors using OneHue in CIE Luv space. After SAM segmentation, we compute the dominant hue via PCA on chromatic components (u*, v*), providing robust estimation for textured objects.
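One plausible implementation of this step is sketched below, under our own assumptions; the paper's OneHue procedure may differ in details:

```python
# Sketch: estimate a masked object's dominant hue angle from the first
# principal axis of its chromatic (u*, v*) scatter in CIE Luv.
import numpy as np
from skimage.color import rgb2luv

def dominant_hue_deg(rgb_image, mask):
    """rgb_image: float array in [0, 1], shape (H, W, 3); mask: bool (H, W)."""
    luv = rgb2luv(rgb_image)
    uv = luv[mask][:, 1:]                 # keep chromatic components u*, v*
    _, _, vt = np.linalg.svd(uv - uv.mean(axis=0), full_matrices=False)
    axis = vt[0]                          # first principal direction
    if axis @ uv.mean(axis=0) < 0:        # orient toward the mean chroma
        axis = -axis
    return float(np.degrees(np.arctan2(axis[1], axis[0])) % 360.0)
```

Using the principal chromatic direction rather than a per-pixel mode keeps the estimate stable when texture and shading spread an object's pixels along a line in the (u*, v*) plane.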

Leaderboard

Comprehensive evaluation across state-of-the-art T2I models. Results are reported at three color-specification granularities: ISCC-NBS-L1 (13 basic colors), ISCC-NBS-L3 (267 specific colors), and CSS3/X11 (147 web-standard colors). Task columns use the abbreviations defined above (CNA, COA, MOC, ICA, NCU), each split into L1 / L3 / CSS scores.

| Model | Res. | Type | CNA (L1 / L3 / CSS) | COA (L1 / L3 / CSS) | MOC (L1 / L3 / CSS) | ICA (L1 / L3 / CSS) | NCU (L1 / L3 / CSS) |
|---|---|---|---|---|---|---|---|
| Pixart Alpha | 1024 | DM | 68.78 / 44.35 / 40.56 | 21.54 / 10.21 / 8.69 | 9.54 / 6.35 / 5.29 | 14.90 / 16.55 / 14.09 | 9.45 / 5.42 / 4.21 |
| SD 3.5 | 1024 | DM | 64.37 / 35.80 / 37.74 | 32.95 / 18.62 / 10.02 | 21.54 / 11.30 / 9.61 | 24.42 / 8.30 / 8.39 | 14.92 / 7.53 / 5.78 |
| Sana | 1024 | DM | 62.89 / 43.02 / 42.92 | 28.57 / 14.92 / 10.81 | 20.87 / 14.96 / 16.74 | 20.50 / 4.90 / 3.01 | 25.10 / 12.85 / 9.45 |
| Pixart Sigma | 1024 | DM | 62.14 / 46.21 / 39.02 | 25.40 / 14.16 / 10.42 | 11.21 / 12.91 / 10.17 | 19.67 / 14.51 / 15.48 | 10.21 / 5.11 / 4.09 |
| Bagel | 1024 | CoT | 60.08 / 26.85 / 35.68 | 31.57 / 11.12 / 9.09 | 16.81 / 15.70 / 13.12 | 25.37 / 21.66 / 17.02 | 22.51 / 11.62 / 9.73 |
| Flux | 1024 | DM | 58.25 / 31.25 / 23.87 | 31.74 / 14.13 / 11.54 | 19.86 / 6.59 / 5.02 | 27.19 / 23.36 / 16.92 | 13.82 / 7.74 / 5.86 |
| SD 3 | 1024 | DM | 58.07 / 39.88 / 44.13 | 29.41 / 17.45 / 20.49 | 18.24 / 4.23 / 3.75 | 18.34 / 7.74 / 6.79 | 11.80 / 5.92 / 4.63 |
| OmniGen2 | 512 | AR | 57.31 / 14.22 / 16.82 | 34.23 / 19.93 / 16.47 | 23.78 / 7.51 / 5.13 | 25.09 / 16.27 / 12.28 | 26.38 / 14.21 / 11.88 |
| Blip3o | 1024 | MM | 56.54 / 32.10 / 38.67 | 18.48 / 11.88 / 16.41 | 17.13 / 7.21 / 5.47 | 28.22 / 16.39 / 10.92 | 43.20 / 23.65 / 18.08 |
| Janus Pro | 384 | AR | 41.55 / 23.12 / 28.60 | 22.96 / 11.97 / 14.06 | 15.45 / 12.24 / 9.54 | 24.98 / 6.46 / 6.45 | 5.41 / 2.98 / 2.59 |
| CogView4 | 1024 | DM | 40.11 / 21.67 / 30.87 | 21.87 / 11.27 / 12.70 | 12.10 / 10.44 / 9.35 | 16.78 / 18.01 / 15.97 | 10.95 / 5.32 / 4.25 |
| Model | Res. | Type | CNA (L1 / L3 / CSS) | COA (L1 / L3 / CSS) | MOC (L1 / L3 / CSS) | ICA (L1 / L3 / CSS) | NCU (L1 / L3 / CSS) |
|---|---|---|---|---|---|---|---|
| GPT-Image-1.5 | 1024 | API | 76.63 / 52.47 / 50.82 | 45.72 / 36.65 / 38.38 | 42.84 / 34.56 / 39.72 | 35.28 / 28.47 / 23.65 | 51.82 / 48.12 / 49.91 |
| Pixart Alpha | 1024 | DM | 69.82 / 25.88 / 16.91 | 13.23 / 5.76 / 4.74 | 10.78 / 3.56 / 2.88 | 15.55 / 6.54 / 5.62 | 4.80 / 2.40 / 2.03 |
| SD 3.5 | 1024 | DM | 68.61 / 24.12 / 15.76 | 17.40 / 7.56 / 6.21 | 13.64 / 4.50 / 3.69 | 25.44 / 10.70 / 9.20 | 7.80 / 3.60 / 2.86 |
| Pixart Sigma | 1024 | DM | 68.05 / 27.48 / 16.15 | 15.00 / 6.48 / 5.31 | 12.66 / 4.81 / 3.94 | 14.98 / 6.30 / 5.42 | 5.40 / 2.40 / 1.94 |
| Sana | 1024 | DM | 67.31 / 23.68 / 15.48 | 17.04 / 7.41 / 6.09 | 13.22 / 4.38 / 3.50 | 20.64 / 8.68 / 7.46 | 9.00 / 4.20 / 3.44 |
| Bagel | 1024 | CoT | 67.05 / 22.59 / 14.76 | 19.44 / 4.62 / 3.81 | 12.04 / 4.69 / 3.81 | 26.29 / 11.50 / 9.80 | 10.80 / 5.40 / 2.92 |
| FLUX.2 | 1024 | DM | 66.42 / 38.67 / 32.45 | 35.82 / 17.34 / 13.21 | 24.67 / 10.82 / 8.43 | 28.94 / 24.18 / 17.65 | 48.72 / 39.29 / 42.67 |
| Flux | 1024 | DM | 65.10 / 21.93 / 14.33 | 15.78 / 6.87 / 5.64 | 12.24 / 4.06 / 3.25 | 28.27 / 11.89 / 10.22 | 4.80 / 2.40 / 1.88 |
| SD 3 | 1024 | DM | 65.10 / 23.72 / 16.30 | 16.20 / 7.41 / 6.45 | 12.58 / 2.69 / 2.19 | 18.94 / 7.97 / 6.85 | 6.00 / 3.00 / 2.24 |
| OmniGen2 | 512 | AR | 63.80 / 8.45 / 14.04 | 21.06 / 9.18 / 7.53 | 14.70 / 5.06 / 3.44 | 26.29 / 11.06 / 9.50 | 15.00 / 7.20 / 3.59 |
| Z-Image | 1024 | DM | 63.27 / 35.92 / 38.84 | 30.65 / 15.47 / 12.83 | 19.42 / 9.35 / 7.64 | 24.56 / 15.28 / 12.37 | 22.18 / 16.92 / 12.45 |
| Blip3o | 1024 | MM | 63.15 / 21.27 / 13.90 | 11.37 / 4.95 / 6.45 | 12.28 / 3.25 / 2.63 | 29.40 / 12.37 / 10.63 | 21.00 / 10.80 / 5.88 |
| Qwen-Image | 1024 | MM | 59.83 / 28.45 / 34.12 | 33.47 / 19.28 / 17.63 | 22.35 / 14.67 / 11.28 | 26.82 / 18.93 / 14.52 | 25.64 / 18.73 / 14.26 |
| Janus Pro | 384 | AR | 46.22 / 15.57 / 10.17 | 14.13 / 6.15 / 5.04 | 11.66 / 5.19 / 4.25 | 26.01 / 10.94 / 9.40 | 3.00 / 1.20 / 1.23 |
| CogView4 | 1024 | DM | 44.92 / 15.13 / 9.89 | 13.44 / 5.85 / 4.80 | 10.98 / 4.44 / 3.63 | 17.53 / 7.37 / 6.34 | 5.40 / 2.40 / 2.04 |

Analysis

Detailed analysis of model behavior across color modifier types and object categories.

Figure 4: Performance by Color Modifier Type

Radar chart comparing model accuracy across five color categories: Basic Colors, Intermediate Colors, Colors with Light Modifiers (e.g., "light blue"), Dark Modifiers (e.g., "dark green"), and "-ish" Modifiers (e.g., "reddish"). Models show consistent performance on basic colors but struggle with modified color terms.

Figure 5: Accuracy by Object Category

Per-category color generation accuracy across 11 T2I models. Performance varies significantly by object type: "Clothes and Accessories" proves challenging, while categories with strong color priors, such as "Fruits and Vegetables", are easier.

Figure 6: Model Color Bias Analysis

Distribution of generated colors per object category, revealing model-specific biases. "Animals" shows heavy bias toward brown/black tones across all models, while "Fruits and Vegetables" exhibits expected yellow/green dominance. Models vary in how strongly they follow category color priors vs. prompt specifications.

Citation

@article{butt2025gencolorbench,
  author  = {Butt, Muhammad Atif and Gomez-Villa, Alexandra and Wu, Tao and Vazquez-Corral, Javier and Van De Weijer, Joost and Wang, Kai},
  title   = {GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models},
  journal = {arXiv preprint arXiv:2510.20586},
  year    = {2025},
}