GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Recent years have seen impressive advances in text-to-image generation, with both dedicated image generators and unified multimodal models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assesses color precision.
Color is fundamental to human visual perception and communication, and it is critical for applications ranging from art to design workflows that require brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting numerical RGB values or aligning with human color expectations.
To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in established color systems (the Munsell-based ISCC-NBS system and CSS3/X11 named colors) and including numerical color specifications that are absent from existing benchmarks. With over 44K color-focused prompts covering 400+ colors, it reveals models' true color generation capabilities via perceptual and automated assessments.
GenColorBench defines five evaluation tasks across two color systems, with an automated evaluation pipeline combining object detection, segmentation, and perceptual color metrics.
Our benchmark covers five color generation tasks: Color Name Accuracy (CNA), Color-Object Association (COA), Multi-Object Color Composition (MOC), Implicit Color Association (ICA), and Numeric Color Understanding (NCU). The evaluation pipeline uses VQA for object verification, GroundingDINO + SAM for mask generation, and the CIEDE2000 color difference (ΔE00) in CIELAB space for color matching.
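As a minimal sketch of the color-matching stage, assuming an object mask already produced by the detection/segmentation step: the check below compares the masked region's mean color against the target color in CIELAB via CIEDE2000. The function name, the use of a simple mean (rather than the dominant-color extraction described later), and the acceptance threshold are illustrative assumptions, not the benchmark's exact settings.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def color_match(image_rgb: np.ndarray, mask: np.ndarray,
                target_rgb: tuple, threshold: float = 20.0) -> bool:
    """Return True if the masked region's mean color lies within
    `threshold` CIEDE2000 units of the target color.

    Sketch only: the real pipeline may use a dominant color rather
    than a mean, and a different threshold.
    """
    # Convert the uint8 image to CIELAB (rgb2lab expects floats in [0, 1]).
    lab = rgb2lab(image_rgb.astype(np.float64) / 255.0)
    # Average the LAB values over the object pixels selected by the mask.
    mean_lab = lab[mask.astype(bool)].mean(axis=0)
    # Convert the target RGB triplet to LAB the same way.
    target_lab = rgb2lab(
        np.array(target_rgb, dtype=np.float64)[None, None] / 255.0)[0, 0]
    # Perceptual color difference between the two LAB triplets.
    delta_e = deltaE_ciede2000(mean_lab, target_lab)
    return float(delta_e) <= threshold
```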
GenColorBench is grounded in established color naming conventions, with extensive coverage analysis across training data and evaluation tasks.
Analysis of color term frequencies in large-scale image-text datasets across four categories: ISCC-NBS L2, CSS3/X11 named colors, modifiers (light/dark), and numerics (RGB, HEX). Frequencies range from common terms such as "black" (68M occurrences) to rare colors such as "linen" (2M).
We extract dominant colors using OneHue in CIE Luv space. After SAM segmentation, we compute the dominant hue via PCA on chromatic components (u*, v*), providing robust estimation for textured objects.
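One plausible reading of this step, sketched below: gather the (u*, v*) chroma of the masked pixels, take the leading eigenvector of their second-moment matrix (an uncentered PCA), orient it toward the mean chroma, and read off the hue angle. The function name and any specifics beyond the text's description are assumptions, not OneHue's actual implementation.

```python
import numpy as np
from skimage.color import rgb2luv

def dominant_hue(image_rgb: np.ndarray, mask: np.ndarray) -> float:
    """Estimate a dominant hue angle (radians) in the u*v* plane from
    the chroma of masked pixels, which downweights achromatic texture.

    Sketch of the idea only, not the benchmark's OneHue code.
    """
    # Convert to CIE Luv; channels are (L, u*, v*).
    luv = rgb2luv(image_rgb.astype(np.float64) / 255.0)
    # N x 2 matrix of chromatic components for the object pixels.
    uv = luv[mask.astype(bool)][:, 1:]
    # Uncentered PCA: leading eigenvector of the second-moment matrix
    # aligns with the dominant chroma direction.
    moment = uv.T @ uv / len(uv)
    _, vecs = np.linalg.eigh(moment)        # eigh sorts eigenvalues ascending
    axis = vecs[:, -1]                      # leading principal direction
    if axis @ uv.mean(axis=0) < 0:          # orient toward the mean chroma
        axis = -axis
    return float(np.arctan2(axis[1], axis[0]))
```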
Comprehensive evaluation across state-of-the-art T2I models. Results are reported at three color specification granularities: ISCC-NBS L1 (13 basic colors), ISCC-NBS L3 (267 specific colors), and CSS3/X11 (147 web-standard colors); a sketch of how granularity can change a verdict follows the tables.
| Model | Res. | Type | CNA L1 | CNA L3 | CNA CSS | COA L1 | COA L3 | COA CSS | MOC L1 | MOC L3 | MOC CSS | ICA L1 | ICA L3 | ICA CSS | NCU L1 | NCU L3 | NCU CSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pixart Alpha | 1024 | DM | 68.78 | 44.35 | 40.56 | 21.54 | 10.21 | 8.69 | 9.54 | 6.35 | 5.29 | 14.90 | 16.55 | 14.09 | 9.45 | 5.42 | 4.21 |
| SD 3.5 | 1024 | DM | 64.37 | 35.80 | 37.74 | 32.95 | 18.62 | 10.02 | 21.54 | 11.30 | 9.61 | 24.42 | 8.30 | 8.39 | 14.92 | 7.53 | 5.78 |
| Sana | 1024 | DM | 62.89 | 43.02 | 42.92 | 28.57 | 14.92 | 10.81 | 20.87 | 14.96 | 16.74 | 20.50 | 4.90 | 3.01 | 25.10 | 12.85 | 9.45 |
| Pixart Sigma | 1024 | DM | 62.14 | 46.21 | 39.02 | 25.40 | 14.16 | 10.42 | 11.21 | 12.91 | 10.17 | 19.67 | 14.51 | 15.48 | 10.21 | 5.11 | 4.09 |
| Bagel | 1024 | CoT | 60.08 | 26.85 | 35.68 | 31.57 | 11.12 | 9.09 | 16.81 | 15.70 | 13.12 | 25.37 | 21.66 | 17.02 | 22.51 | 11.62 | 9.73 |
| Flux | 1024 | DM | 58.25 | 31.25 | 23.87 | 31.74 | 14.13 | 11.54 | 19.86 | 6.59 | 5.02 | 27.19 | 23.36 | 16.92 | 13.82 | 7.74 | 5.86 |
| SD 3 | 1024 | DM | 58.07 | 39.88 | 44.13 | 29.41 | 17.45 | 20.49 | 18.24 | 4.23 | 3.75 | 18.34 | 7.74 | 6.79 | 11.80 | 5.92 | 4.63 |
| OmniGen2 | 512 | AR | 57.31 | 14.22 | 16.82 | 34.23 | 19.93 | 16.47 | 23.78 | 7.51 | 5.13 | 25.09 | 16.27 | 12.28 | 26.38 | 14.21 | 11.88 |
| Blip3o | 1024 | MM | 56.54 | 32.10 | 38.67 | 18.48 | 11.88 | 16.41 | 17.13 | 7.21 | 5.47 | 28.22 | 16.39 | 10.92 | 43.20 | 23.65 | 18.08 |
| Janus Pro | 384 | AR | 41.55 | 23.12 | 28.60 | 22.96 | 11.97 | 14.06 | 15.45 | 12.24 | 9.54 | 24.98 | 6.46 | 6.45 | 5.41 | 2.98 | 2.59 |
| CogView4 | 1024 | DM | 40.11 | 21.67 | 30.87 | 21.87 | 11.27 | 12.70 | 12.10 | 10.44 | 9.35 | 16.78 | 18.01 | 15.97 | 10.95 | 5.32 | 4.25 |
| Model | Res. | Type | CNA L1 | CNA L3 | CNA CSS | COA L1 | COA L3 | COA CSS | MOC L1 | MOC L3 | MOC CSS | ICA L1 | ICA L3 | ICA CSS | NCU L1 | NCU L3 | NCU CSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-Image-1.5 | 1024 | API | 76.63 | 52.47 | 50.82 | 45.72 | 36.65 | 38.38 | 42.84 | 34.56 | 39.72 | 35.28 | 28.47 | 23.65 | 51.82 | 48.12 | 49.91 |
| Pixart Alpha | 1024 | DM | 69.82 | 25.88 | 16.91 | 13.23 | 5.76 | 4.74 | 10.78 | 3.56 | 2.88 | 15.55 | 6.54 | 5.62 | 4.80 | 2.40 | 2.03 |
| SD 3.5 | 1024 | DM | 68.61 | 24.12 | 15.76 | 17.40 | 7.56 | 6.21 | 13.64 | 4.50 | 3.69 | 25.44 | 10.70 | 9.20 | 7.80 | 3.60 | 2.86 |
| Pixart Sigma | 1024 | DM | 68.05 | 27.48 | 16.15 | 15.00 | 6.48 | 5.31 | 12.66 | 4.81 | 3.94 | 14.98 | 6.30 | 5.42 | 5.40 | 2.40 | 1.94 |
| Sana | 1024 | DM | 67.31 | 23.68 | 15.48 | 17.04 | 7.41 | 6.09 | 13.22 | 4.38 | 3.50 | 20.64 | 8.68 | 7.46 | 9.00 | 4.20 | 3.44 |
| Bagel | 1024 | CoT | 67.05 | 22.59 | 14.76 | 19.44 | 4.62 | 3.81 | 12.04 | 4.69 | 3.81 | 26.29 | 11.50 | 9.80 | 10.80 | 5.40 | 2.92 |
| FLUX.2 | 1024 | DM | 66.42 | 38.67 | 32.45 | 35.82 | 17.34 | 13.21 | 24.67 | 10.82 | 8.43 | 28.94 | 24.18 | 17.65 | 48.72 | 39.29 | 42.67 |
| Flux | 1024 | DM | 65.10 | 21.93 | 14.33 | 15.78 | 6.87 | 5.64 | 12.24 | 4.06 | 3.25 | 28.27 | 11.89 | 10.22 | 4.80 | 2.40 | 1.88 |
| SD 3 | 1024 | DM | 65.10 | 23.72 | 16.30 | 16.20 | 7.41 | 6.45 | 12.58 | 2.69 | 2.19 | 18.94 | 7.97 | 6.85 | 6.00 | 3.00 | 2.24 |
| OmniGen2 | 512 | AR | 63.80 | 8.45 | 14.04 | 21.06 | 9.18 | 7.53 | 14.70 | 5.06 | 3.44 | 26.29 | 11.06 | 9.50 | 15.00 | 7.20 | 3.59 |
| Z-Image | 1024 | DM | 63.27 | 35.92 | 38.84 | 30.65 | 15.47 | 12.83 | 19.42 | 9.35 | 7.64 | 24.56 | 15.28 | 12.37 | 22.18 | 16.92 | 12.45 |
| Blip3o | 1024 | MM | 63.15 | 21.27 | 13.90 | 11.37 | 4.95 | 6.45 | 12.28 | 3.25 | 2.63 | 29.40 | 12.37 | 10.63 | 21.00 | 10.80 | 5.88 |
| Qwen-Image | 1024 | MM | 59.83 | 28.45 | 34.12 | 33.47 | 19.28 | 17.63 | 22.35 | 14.67 | 11.28 | 26.82 | 18.93 | 14.52 | 25.64 | 18.73 | 14.26 |
| Janus Pro | 384 | AR | 46.22 | 15.57 | 10.17 | 14.13 | 6.15 | 5.04 | 11.66 | 5.19 | 4.25 | 26.01 | 10.94 | 9.40 | 3.00 | 1.20 | 1.23 |
| CogView4 | 1024 | DM | 44.92 | 15.13 | 9.89 | 13.44 | 5.85 | 4.80 | 10.98 | 4.44 | 3.63 | 17.53 | 7.37 | 6.34 | 5.40 | 2.40 | 2.04 |
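To make the granularity effect concrete, here is a minimal sketch (not the benchmark's code) of nearest-name classification under a vocabulary: an extracted color is assigned to whichever entry is closest under CIEDE2000, so the same generation can pass at the coarse L1 level yet fail against the 267-entry L3 or 147-entry CSS3/X11 lists. The two-entry vocabulary and its RGB anchors below are placeholders.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

# Placeholder vocabulary; the benchmark uses the full ISCC-NBS / CSS3 lists.
VOCAB_RGB = {"red": (255, 0, 0), "pink": (255, 192, 203)}

def to_lab(rgb):
    """Convert an (R, G, B) triplet in [0, 255] to a CIELAB triplet."""
    return rgb2lab(np.array(rgb, dtype=np.float64)[None, None] / 255.0)[0, 0]

def nearest_name(color_rgb, vocab=VOCAB_RGB):
    """Return the vocabulary entry closest to `color_rgb` under CIEDE2000."""
    lab = to_lab(color_rgb)
    return min(vocab, key=lambda name: float(deltaE_ciede2000(lab, to_lab(vocab[name]))))

# A washed-out "red" generation resolves to "pink" in a fine vocabulary,
# while a coarse vocabulary without "pink" would still score it as "red".
print(nearest_name((250, 150, 160)))  # -> "pink"
```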
Detailed analysis of model behavior across color modifier types and object categories.
Radar chart comparing model accuracy across five color categories: Basic Colors, Intermediate Colors, Colors with Light Modifiers (e.g., "light blue"), Dark Modifiers (e.g., "dark green"), and "-ish" Modifiers (e.g., "reddish"). Models show consistent performance on basic colors but struggle with modified color terms.
Per-category color generation accuracy across 11 T2I models. Performance varies significantly by object type: "Clothes and Accessories" proves challenging, while "Fruits and Vegetables", whose members carry strong color priors, is easier.
Distribution of generated colors per object category, revealing model-specific biases. "Animals" shows heavy bias toward brown/black tones across all models, while "Fruits and Vegetables" exhibits expected yellow/green dominance. Models vary in how strongly they follow category color priors vs. prompt specifications.
@article{butt2025gencolorbench,
  author  = {Butt, Muhammad Atif and Gomez-Villa, Alexandra and Wu, Tao and
             Vazquez-Corral, Javier and Van De Weijer, Joost and Wang, Kai},
  title   = {GenColorBench: A Color Evaluation Benchmark for Text-to-Image
             Generation Models},
  journal = {arXiv preprint arXiv:2510.20586},
  year    = {2025},
}