Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

1Beijing University of Posts and Telecommunications, China 2Yanshan University, China

3National University of Singapore, Singapore 4Li Auto Inc., China 5SenseTime Research, China

*Equal contribution §Core contribution Project lead

MMRB Overview

Overview of the MMRB benchmark, which evaluates MLLMs on 92 multi-image-only sub-tasks annotated with reasoning steps.

🔔News

🚀 [2025-05-31]: We released MMRB, a benchmark for multimodal multi-image reasoning. 🥳

Abstract

With enhanced capabilities and widespread applications, Multi-modal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

Multimodal Multi-image Reasoning Benchmark

Overview

We first present an overview of our Multimodal Multi-image Reasoning Benchmark. The benchmark consists of 4,750 samples encompassing 68,882 reasoning steps across 92 sub-tasks, covering semantic, spatial, and temporal reasoning. Notably, each sample contains an average of 6.17 images and 1.93 distinct solutions. During annotation, we also corrected 355 incorrect ground-truth answers in the source datasets, which would have introduced up to a 14% deviation in our benchmark if left uncorrected.
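For concreteness, the minimal sketch below shows how the headline statistics above (average images per sample, average solutions per sample, total reasoning steps) could be recomputed from per-sample records. The field names (images, solutions, steps) and the toy example are illustrative assumptions, not the released data schema.

# Minimal sketch: recomputing MMRB-style corpus statistics.
# Field names ("images", "solutions", "steps") are assumed, not the official schema.
from statistics import mean

def corpus_stats(samples: list[dict]) -> dict:
    """Aggregate per-sample counts into benchmark-level statistics."""
    return {
        "num_samples": len(samples),
        "total_steps": sum(len(sol["steps"]) for s in samples for sol in s["solutions"]),
        "avg_images_per_sample": mean(len(s["images"]) for s in samples),
        "avg_solutions_per_sample": mean(len(s["solutions"]) for s in samples),
    }

# Toy usage with a single made-up sample (not real MMRB data):
example = {
    "images": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    "solutions": [
        {"steps": ["Task Understanding: ...", "Information Grounding: ...", "Drawing Conclusion: ..."],
         "answer": "B"},
        {"steps": ["Task Understanding: ...", "Logical Reasoning: ...", "Drawing Conclusion: ..."],
         "answer": "B"},
    ],
}
print(corpus_stats([example]))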

dataset-comparison

MMRB stands out as the largest benchmark in terms of sub-task count and image density, and it is the only one to offer multi-solution annotations.


Data Construction Pipeline

  1. Task Selection & Creation: We first surveyed 22 multi-image datasets and collected 242 tasks, then filtered and categorized them into semantic, temporal, and spatial reasoning types following the MMIU taxonomy. Using GPT-4o with chain-of-thought (CoT) prompting, we further selected 101 reasoning-focused tasks for annotation, excluding hard math problems to better target general multi-image understanding.
  2. Reasoning Steps Annotation: For each multi-image reasoning task, we prompt GPT-4o to generate three diverse reasoning trajectories, each composed of step-by-step explanations. These steps are categorized into six types based on refined cognitive operations: Task Understanding, Information Grounding, Commonsense Seeking, Logical Reasoning, Arithmetic Calculating, and Drawing Conclusion. This results in rich, multi-path annotated tasks that reflect varied reasoning strategies leading to the same answer (a schema sketch follows this list).
  3. Manual Inspection and Correction: To ensure annotation quality, we conducted a rigorous human verification process. A team of 17 trained annotators manually reviewed and corrected both reasoning steps and final answers generated by GPT-4o. In total, 25% of samples had at least one reasoning step revised, and 7.5% had their final answers corrected—highlighting the importance of human oversight in building a high-quality benchmark.
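To make the annotation schema from step 2 concrete, the sketch below encodes the six step categories and checks that every annotated step carries one of them. The per-step dictionary layout and the toy steps are assumptions for illustration, not the exact released format.

# Sketch of the six reasoning-step categories described above.
# The per-step dict layout ({"type": ..., "text": ...}) is an assumption for illustration.
from enum import Enum

class StepType(str, Enum):
    TASK_UNDERSTANDING = "Task Understanding"
    INFORMATION_GROUNDING = "Information Grounding"
    COMMONSENSE_SEEKING = "Commonsense Seeking"
    LOGICAL_REASONING = "Logical Reasoning"
    ARITHMETIC_CALCULATING = "Arithmetic Calculating"
    DRAWING_CONCLUSION = "Drawing Conclusion"

def validate_solution(steps: list[dict]) -> None:
    """Raise if any step is missing text or uses an unknown category label."""
    valid = {t.value for t in StepType}
    for i, step in enumerate(steps):
        if step.get("type") not in valid:
            raise ValueError(f"step {i}: unknown type {step.get('type')!r}")
        if not step.get("text"):
            raise ValueError(f"step {i}: empty step text")

# Toy annotated solution (made up for illustration):
validate_solution([
    {"type": "Task Understanding", "text": "The question asks which image shows the earliest event."},
    {"type": "Information Grounding", "text": "Image 2 shows a sunrise; image 4 shows midday."},
    {"type": "Drawing Conclusion", "text": "Therefore the answer is image 2."},
])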

Pipeline

Leaderboard

We select 40 MLLMs for our baseline experiments, including 10 commercial models and 24 open-source models ranging from 0.5B to 38B parameters, 9 of which are specifically designed for reasoning. We report the outcome score (rule-based) together with the process score and efficacy score (LLM-based) as metrics.
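The process score relies on the sentence-level matching framework mentioned in the abstract. The sketch below shows one plausible way such matching could work: the model's rationale is split into sentences, a judge decides for each annotated reference step whether any predicted sentence covers it, and the score is the fraction of reference steps covered. The judge interface, the toy keyword-overlap judge, and the scoring rule are assumptions, not the authors' exact protocol; a real judge would call an open-source LLM.

# Hedged sketch of sentence-level matching for a process score.
# The judge callable and the "fraction of reference steps covered" rule are assumptions.
import re
from typing import Callable

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def process_score(prediction: str,
                  reference_steps: list[str],
                  judge: Callable[[str, str], bool]) -> float:
    """Fraction of annotated reference steps matched by at least one predicted sentence."""
    pred_sentences = split_sentences(prediction)
    if not reference_steps:
        return 0.0
    matched = sum(
        1 for ref in reference_steps
        if any(judge(sent, ref) for sent in pred_sentences)
    )
    return matched / len(reference_steps)

def toy_judge(pred_sentence: str, ref_step: str) -> bool:
    """Toy stand-in for an LLM judge: keyword overlap above 50%."""
    norm = lambda t: {w.strip(".,!?").lower() for w in t.split()}
    ref_words = norm(ref_step)
    return len(norm(pred_sentence) & ref_words) / max(len(ref_words), 1) > 0.5

score = process_score(
    "The first image shows a sunrise. Therefore the earliest event is image 2.",
    ["Image 2 shows a sunrise, the earliest event.", "Therefore the answer is image 2."],
    toy_judge,
)
print(round(score, 2))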

Model categories: non-reasoning-specialized API models, reasoning-specialized API models, non-reasoning-specialized open-source models, and reasoning-specialized open-source models.
Model | Outcome Score | Outcome Score w/ CoT | Process Score | Efficacy Score

Overall results of different models on the MMRB leaderboard. The best-performing model in each category is shown in bold, and the second best is underlined.

Examples

BibTeX

@article{cheng2025evaluating,
  title={Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark},
  author={Cheng, Ziming and Xu, Binrui and Gong, Lisheng and Song, Zuhe and Zhou, Tianshuo and Zhong, Shiqi and Ren, Siyu and Chen, Mingxiang and Meng, Xiangchao and Zhang, Yuxin and others},
  journal={arXiv preprint arXiv:2506.04280},
  year={2025}
}