Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding

Shawn Huang¹, Brian Price², Yifei Fan², Bryan Morse¹

¹Brigham Young University ²Adobe
CVPR 2026 (Highlight)

The four AlbumBench album organization tasks: Intent Selection, Intent Rating, Group Labeling, and Group Clustering

Given an album, AlbumBench evaluates four album organization tasks: Intent Selection (select images matching a user intent), Intent Rating (rate images 0–3 on how well they match), Group Labeling (label each image by a given grouping), and Group Clustering (group images by a query, with no predefined labels).
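To make the four output shapes concrete, here is a minimal Python sketch of what a model's answers for a single album might look like, with a basic format check. The class name `AlbumTaskOutputs`, its field names, and the `validate` helper are hypothetical and are not taken from the AlbumBench code. The sketch also assumes that Group Clustering assigns each image to exactly one group, which the description above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class AlbumTaskOutputs:
    """Hypothetical container for one album's answers on the four tasks."""
    image_ids: list[str]
    # Intent Selection: the subset of images that match the user intent.
    selected: set[str] = field(default_factory=set)
    # Intent Rating: one integer score in 0..3 per image.
    ratings: dict[str, int] = field(default_factory=dict)
    # Group Labeling: one label per image, drawn from a given grouping.
    labels: dict[str, str] = field(default_factory=dict)
    # Group Clustering: groups of images, with no predefined labels.
    clusters: list[set[str]] = field(default_factory=list)


def validate(out: AlbumTaskOutputs, allowed_labels: set[str]) -> list[str]:
    """Return a list of format problems; an empty list means well-formed."""
    ids = set(out.image_ids)
    problems = []
    if not out.selected <= ids:
        problems.append("selection contains unknown images")
    bad_rating = any(
        not (isinstance(r, int) and 0 <= r <= 3) for r in out.ratings.values()
    )
    if set(out.ratings) != ids or bad_rating:
        problems.append("ratings must give each image an integer in 0..3")
    if set(out.labels) != ids or not set(out.labels.values()) <= allowed_labels:
        problems.append("labels must assign each image one allowed label")
    # Assumption: clusters form a partition (each image in exactly one group).
    clustered = [image for cluster in out.clusters for image in cluster]
    if sorted(clustered) != sorted(ids):
        problems.append("clusters must cover each image exactly once")
    return problems
```

The contrast between the last two fields mirrors the difference between the grouping tasks: Group Labeling checks answers against a known label set, while Group Clustering only constrains how the images are divided.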

Abstract

Automatic album organization has been studied extensively over the past decades due to significant progress in digital photography. Recent Vision-Language Models (VLMs) have shown strong performance on multi-image understanding, making them natural candidates for automating album organization workflows. While VLMs' abilities in multi-image understanding have been widely studied, their performance on album organization remains underexplored. To bridge this gap, we introduce AlbumBench, the first comprehensive benchmark for automatic album organization. Specifically, we (1) define album organization tasks as photo selection for album-specific user objectives, photo rating according to how well user intents are fulfilled, and album-specific photo grouping given a user query that requires contextual understanding of the album; (2) establish AlbumBench, a benchmark dataset containing 27,051 images across 641 albums with 5 annotations per image; and (3) evaluate mainstream open-source and proprietary VLMs on AlbumBench. We show that AlbumBench presents unique challenges compared to traditional multi-image understanding benchmarks due to its requirement for understanding album context and user intent. Our findings reveal a significant performance gap between open-source and proprietary VLMs on album organization tasks. Despite this gap, even the best-performing proprietary models sometimes struggle with tasks that humans find relatively easy. We hope that AlbumBench can serve as a foundation for unifying album organization research and motivate improvements in VLMs' performance on these tasks.

Task results when visual context is provided.

Task results when language context is provided.

Per-task performance of representative VLMs

Performance of representative models by task type. Proprietary models generally outperform open-source ones.

Spearman correlation between AlbumBench tasks and MMMU-val

Spearman correlations with MMMU-val show AlbumBench measures capabilities complementary to existing benchmarks.
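For readers unfamiliar with the statistic, the sketch below shows one way a Spearman rank correlation between two lists of per-model scores could be computed: rank each list (averaging ranks for ties), then take the Pearson correlation of the ranks. This is an illustrative implementation, not the authors' evaluation code, and the function names `average_ranks` and `spearman` are hypothetical. A value near 1 would mean two benchmarks order models almost identically, while a lower value suggests they measure different capabilities.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

In use, `x` would hold each model's score on one AlbumBench task and `y` the same models' MMMU-val scores, in the same model order.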

BibTeX

@InProceedings{Huang_2026_CVPR,
    author    = {Huang, Shawn and Price, Brian and Fan, Yifei and Morse, Bryan},
    title     = {Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {38564--38573}
}