It's a normal day as an analyst at a social media company like Instagram or Rednote. You open your laptop to a million new images. Your task: figure out what content your users are uploading and what's trending. Usually, you tag images into known categories under predefined topics (Sports, Tech, Beauty...) to sort them. But is that enough? What about the emerging topics you haven't thought of yet?
Meet X-Cluster. It automatically explores massive, unstructured image collections to discover meaningful, interpretable grouping criteria and organizes the images for you: no predefined rules, no manual effort. It doesn't just sort the images; it identifies new, hidden groupings directly from the visual data. Just sit back and explore today's fresh insights, discover hidden opportunities, and stay ahead effortlessly.
Sounds great, right? Thanks!
In this work, we introduce and study the novel task of Open-ended Semantic Multiple Clustering (OpenSMC). Given a large, unstructured image collection, the goal is to automatically discover several diverse semantic clustering criteria (e.g., Activity or Location) from the images, and subsequently organize the images according to the discovered criteria, without requiring any human input. Our framework, X-Cluster: eXploratory Clustering, treats text as a reasoning proxy: it concurrently scans the entire image collection, proposes candidate criteria in natural language, and groups images into meaningful clusters per criterion. This radically differs from previous works, which either assume predefined clustering criteria or fixed cluster counts. To evaluate X-Cluster, we create two new benchmarks, COCO-4c and Food-4c, each annotated with four distinct grouping criteria and corresponding cluster labels. Experiments show that X-Cluster can effectively reveal meaningful partitions on several datasets. Finally, we apply X-Cluster to various real-world applications, including uncovering hidden biases in text-to-image (T2I) generative models and analyzing image virality on social media. Code and datasets will be open-sourced for future research.
We annotate and propose two new benchmarks for OpenSMC: Food-4c and COCO-4c. Food-4c is sourced from Food-101, which includes 101 Food types (original annotations), along with new annotations for 15 Cuisine types, 5 Course types, and 4 Diet preferences, for a total of four clustering criteria. Additionally, we introduce COCO-4c using images from COCO-val, annotated with four criteria of varying cluster counts: 64 Activity, 19 Location, 20 Mood, and 6 Time of day clusters.
X-Cluster consists of a Criteria Proposer and a Semantic Grouper. Given a set of images, the Proposer discovers and outputs a pool of grouping criteria in natural language. The Grouper subsequently extracts criterion-specific descriptions from images relevant to each criterion, discovers the underlying semantic clusters, and groups each image at three levels of semantic granularity. The figure shows an example of how an unstructured image collection can be grouped into clusters of different semantic granularity under the criterion "Location".
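The two-stage pipeline above can be sketched in a few lines. This is a toy illustration, not the actual implementation: the real system uses captioning models and LLM reasoning, whereas here `propose_criteria` and `group_by_criterion` are hypothetical stand-ins operating on given caption strings with simple keyword heuristics.

```python
# Toy sketch of the X-Cluster two-stage pipeline (Criteria Proposer -> Semantic Grouper).
# Captions stand in for images; keyword rules stand in for the model calls.

def propose_criteria(captions):
    """Criteria Proposer: scan the whole collection, emit candidate criteria."""
    criteria = set()
    for cap in captions:
        if any(w in cap for w in ("beach", "kitchen", "stadium")):
            criteria.add("Location")
        if any(w in cap for w in ("surfing", "cooking", "running")):
            criteria.add("Activity")
    return sorted(criteria)

def group_by_criterion(captions, criterion):
    """Semantic Grouper: assign each image (caption) to a cluster under one criterion."""
    lookup = {
        "Location": {"beach": "outdoor", "stadium": "outdoor", "kitchen": "indoor"},
        "Activity": {"surfing": "sports", "running": "sports", "cooking": "cooking"},
    }
    clusters = {}
    for i, cap in enumerate(captions):
        label = next((c for w, c in lookup[criterion].items() if w in cap), "other")
        clusters.setdefault(label, []).append(i)
    return clusters

captions = ["surfing at the beach", "cooking in a kitchen", "running in a stadium"]
for crit in propose_criteria(captions):
    print(crit, group_by_criterion(captions, crit))
```

The key structural point carried over from the method: criteria are proposed once over the whole collection, then grouping runs independently per criterion, so the same image lands in a different cluster under each criterion.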
We first conduct a comprehensiveness comparison of criteria proposers in the figure below. The true positive rate (TPR) of each proposer is evaluated against the Basic and Hard ground-truth criteria sets and visualized as a progress bar chart. Each block represents one ground-truth criterion: colored blocks indicate successfully discovered criteria, while gray blocks represent undiscovered ones. We observe that our Caption-based Proposer discovers the most comprehensive set of criteria, making it the closest to the human-annotated set among all methods.
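For concreteness, the TPR used here is simply the fraction of ground-truth criteria the proposer recovers. The sketch below assumes a naive normalized string match between free-form discovered criteria and ground-truth names; the actual evaluation may use a semantic match instead.

```python
# Hedged sketch: criteria-discovery TPR = (# ground-truth criteria recovered) / (# ground-truth criteria).
# Matching is a simple case-insensitive string comparison, assumed for illustration.

def criteria_tpr(discovered, ground_truth):
    norm = lambda s: s.strip().lower()
    found = {norm(d) for d in discovered}
    hits = sum(1 for g in ground_truth if norm(g) in found)
    return hits / len(ground_truth)

tpr = criteria_tpr(["activity", "Mood", "color theme"],
                   ["Activity", "Location", "Mood", "Time of day"])
print(tpr)  # 0.5
```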
Next, we study the impact of image quantity on criteria discovery. We evaluate the TPR performance of the Caption-based Proposer at different image scales against the Hard ground-truth criteria set. Interestingly, in object-centric benchmarks like Card-2c and Clevr-4c, satisfactory performance is achieved with just a few images. In fact, even a single image often suffices for reasonable criteria discovery, as object-centric datasets tend to have uniform structures, i.e., seeing one playing card is enough to suggest criteria like Suit. However, this does not hold for more complex datasets like COCO-4c, Food-4c, and Action-3c, which feature diverse and realistic scenarios. Here, reducing the number of images leads to a clear drop in TPR performance, as capturing intricate and varied thematic criteria requires a larger image set.
We conduct a comparison of different Semantic Groupers. We report CAcc, SAcc, and their Harmonic Mean (HM) for each Semantic Grouper on the Basic criteria across four benchmarks. CLIP zero-shot classification serves as an oracle, while KMeans with strong visual features is used as a CAcc baseline. The best performer for each criterion, determined by HM, is highlighted in green. Overall, our Caption-based Grouper performs best, ranking first in 10 of the 15 evaluated criteria.
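The HM ranking metric combines the two accuracies so that a grouper must do well on both to score high; a minimal sketch:

```python
# Harmonic Mean (HM) of clustering accuracy (CAcc) and semantic accuracy (SAcc).
# HM is dominated by the weaker of the two scores, so it penalizes lopsided methods.

def harmonic_mean(cacc, sacc):
    if cacc == 0 or sacc == 0:
        return 0.0
    return 2 * cacc * sacc / (cacc + sacc)

print(harmonic_mean(0.8, 0.6))  # ~0.686, below the arithmetic mean of 0.7
```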
Visualizations of clustering results on COCO-4c and Food-4c datasets, showing how images are grouped according to different discovered criteria.
Beyond controlled benchmarks, X-Cluster enables practical analyses on challenging real-world datasets. Below we highlight three application studies where automatically discovered criteria uncover actionable insights.
We probe DALLE-3 and SDXL generations with our inferred criteria to discover and quantify not only demographic skew, but also other types of previously unexplored biases. X-Cluster exposes dominant clusters (e.g., gender, race, grooming) together with bias intensity indicators, revealing systematic preferences in occupations such as nurse, firefighter, or CEO.
Applying X-Cluster to large-scale social media feeds, we group images by activity, color theme, clothing style, age group, and emotion. Popularity scores per cluster surface which visual traits drive virality versus mainstream appeal, while also flagging clusters that contain NSFW or harmful content.
On CelebA, the discovered clusters expose imbalanced correlations between attributes (e.g., blond hair & female). Incorporating these clusters into distributionally robust optimization improves fairness: debiasing methods guided by X-Cluster achieve better worst-group and average accuracy compared with baselines relying only on ground-truth annotations.
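Worst-group accuracy, the fairness metric reported above, is computed per group and then minimized across groups. A minimal sketch, with illustrative stand-in group labels rather than X-Cluster's actual discovered clusters:

```python
# Worst-group accuracy: compute accuracy within each group, report the minimum.
# A model that sacrifices a minority group (e.g., an attribute-correlated cluster)
# scores poorly here even if its average accuracy is high.

def worst_group_accuracy(preds, labels, groups):
    stats = {}
    for p, y, g in zip(preds, labels, groups):
        correct, total = stats.get(g, (0, 0))
        stats[g] = (correct + (p == y), total + 1)
    return min(correct / total for correct, total in stats.values())

preds  = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 0, 1, 1]
groups = ["blond_female", "blond_female", "other", "other", "other", "blond_female"]
print(worst_group_accuracy(preds, labels, groups))  # 1/3: the "blond_female" group drags it down
```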
@article{liu2024organizing,
  title   = {Organizing unstructured image collections using natural language},
  author  = {Liu, Mingxuan and Zhong, Zhun and Li, Jun and Franchi, Gianni and Roy, Subhankar and Ricci, Elisa},
  journal = {arXiv preprint arXiv:2410.05217},
  year    = {2024}
}