ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

National Taiwan University, Microsoft

Our benchmark specializes in evaluating reasoning across time, i.e., video QA where the question and its answer belong to different time spans. ReXTime poses difficulties even for frontier MLLMs, as indicated by the large gap to human-level accuracy.

Abstract

We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding of situations where the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.

ReXTime Benchmark

Overview

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have nearly matched human performance on various language and vision-language tasks. Notably, frontier MLLMs trained on web-scale proprietary datasets, such as GPT-4o, Gemini-1.5-Pro, and Claude 3, show impressive video understanding. However, unlike LLMs, which excel at text reasoning over long sequences, the cause-effect reasoning of MLLMs, especially over long video events, remains under-explored. In an initial study, we identified a common shortcoming of even the most advanced MLLMs: they struggle with video question answering when the question and answer correspond to different time segments. As shown in Figure 1, the question "How can we cut up the tomato efficiently?" and the answer "Hold up a plate and sharpen the knife on the plate." refer to separate segments. Surprisingly, a simple question like this can challenge leading MLLMs. There is therefore a pressing need for a benchmark that quantitatively assesses video temporal reasoning. To address this, we introduce ReXTime, a benchmark that evaluates Reasoning-Across-Time capabilities over video events.

To develop ReXTime, we propose an LLM-assisted data generation pipeline that minimizes human effort and cuts costs from $300 to $135 per 1,000 QA pairs. The benchmark includes 921 validation and 2,143 test samples, each rigorously curated by human annotators. Empirical evidence indicates that even proprietary frontier MLLMs fall short on temporal reasoning. For instance, humans achieve 88.0% accuracy on the VQA task, whereas the top-performing MLLM, OpenAI's GPT-4o, reaches only 73.7%, as shown in Figure 1. A new benchmark such as ReXTime has the potential to significantly propel advancements in this field: it effectively differentiates between model capabilities, and the state-of-the-art model has not yet saturated at human-level accuracy. An additional 9,695 unverified samples provide a training dataset that significantly boosts an academic MLLM's temporal reasoning skills, lowering the entry bar for future research. Furthermore, we confirmed that ReXTime primarily contains reasoning-across-time questions, with the lowest question-answer temporal overlap (QA-mIoU) among comparable video QA benchmarks.

Question Types in ReXTime

We present the three categories of questions we generate, along with their relationships and examples. "Having dinner / Watching TV" lacks strong causality and is classified as sequential, which typically yields before/after questions. "The girl falls down" shows strong causality with "The girl is crying" but lacks human intention, so it is classified as cause-effect. "Chopping tomato / Making a dish" not only exhibits a strong causal relation but also involves subjective deliberation, so it is classified as means-to-an-end.


Reasoning across time question-answer types and examples. The three categories are sequential, cause-effect, and means-to-an-end.
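
To make the taxonomy concrete, below is a minimal sketch of how an event pair could be mapped to one of the three question types. The `EventPair` fields and the rule itself are illustrative assumptions based on the description above, not the benchmark's actual annotation schema.

```python
# Minimal sketch of the three ReXTime question types as a data structure.
# The EventPair fields and the decision rule are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    SEQUENTIAL = "sequential"            # temporal order only, e.g. before/after
    CAUSE_EFFECT = "cause-effect"        # strong causality, no human intention
    MEANS_TO_AN_END = "means-to-an-end"  # causality plus deliberate intent


@dataclass
class EventPair:
    event_a: str
    event_b: str
    has_strong_causality: bool   # e.g. "girl falls down" -> "girl is crying"
    has_human_intention: bool    # e.g. "chopping tomato" in order to "make a dish"


def categorize(pair: EventPair) -> QuestionType:
    """Map an event pair to one of the three reasoning-across-time types."""
    if not pair.has_strong_causality:
        return QuestionType.SEQUENTIAL
    if pair.has_human_intention:
        return QuestionType.MEANS_TO_AN_END
    return QuestionType.CAUSE_EFFECT


print(categorize(EventPair("having dinner", "watching TV", False, False)))
print(categorize(EventPair("girl falls down", "girl is crying", True, False)))
print(categorize(EventPair("chopping tomato", "making a dish", True, True)))
```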

Data Generation Pipeline

To develop an efficient and effective pipeline, we have addressed two primary challenges: (1) the quality-diversity trade-off in LLM generation, and (2) the high cost of human labor for verification. Initially, prompting an (M)LLM to generate question-answer pairs often results in logically incorrect responses. While few-shot in-context learning enhances logical correctness, it reduces response diversity. We address this by moderating the MLLM with specific event attributes and temporal relations from a structured taxonomy. Additionally, although human verification is necessary to eliminate residual errors, we minimize costs by establishing criteria that allow the MLLM to self-assess the accuracy of its generated QAs. As a bonus feature, we evaluate video moment localization to assess whether an AI model accurately grounds its answers to the correct video segments. Please refer to our paper for more details.


In stage I, we collect event pairs from two video sources. In stage II, we score and categorize the event pairs into four relation types. In stage III, the (M)LLM generates question-answer pairs guided by our carefully written few-shot demonstrations. In stage IV, the LLM self-evaluates the generated samples to reduce the cost of human verification.
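
The sketch below illustrates how the four stages could be chained in code. The function names, prompts, data layout, and the generic `llm` callable are hypothetical placeholders; the actual pipeline relies on the structured taxonomy, scoring criteria, and few-shot demonstrations described in the paper.

```python
# Illustrative sketch of the four-stage generation pipeline described above.
# All names, prompts, and the `llm` callable are hypothetical placeholders.
from typing import Callable, Iterable


def stage1_collect_event_pairs(videos: Iterable[dict]) -> list[tuple[dict, dict]]:
    """Stage I: pair up annotated events that come from the same video."""
    pairs = []
    for video in videos:
        events = video["events"]  # assumed: [{"text": ..., "span": (start, end)}, ...]
        pairs += [(a, b) for a in events for b in events if a["span"] < b["span"]]
    return pairs


def stage2_categorize(pair, llm: Callable[[str], str]) -> str:
    """Stage II: score the pair and assign a relation type (the fourth type
    is assumed here to be 'no relation', used to filter out unusable pairs)."""
    return llm(f"Classify the relation between '{pair[0]['text']}' and "
               f"'{pair[1]['text']}': sequential, cause-effect, means-to-an-end, or none.")


def stage3_generate_qa(pair, relation: str, demos: str, llm: Callable[[str], str]) -> str:
    """Stage III: generate a QA pair, steered by few-shot demonstrations."""
    return llm(f"{demos}\nRelation: {relation}\n"
               f"Events: {pair[0]['text']} / {pair[1]['text']}\n"
               f"Write a question about one event whose answer is the other event.")


def stage4_self_evaluate(qa: str, llm: Callable[[str], str]) -> bool:
    """Stage IV: ask the LLM to verify its own output before human review."""
    verdict = llm(f"Is this QA pair logically correct and grounded in both events? "
                  f"Answer yes or no.\n{qa}")
    return verdict.strip().lower().startswith("yes")
```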

Comparisons with Existing Benchmarks

We compare ReXTime with two existing datasets, NExT-GQA and Ego4D-NLQ, on the number of reasoning-across-time samples, certificate length, and QA-mIoU. The average certificate length in our dataset is considerably longer than in existing tasks, suggesting that our task demands more advanced temporal reasoning abilities. The lower QA-mIoU in ReXTime indicates that an AI model must first locate the question event and then scan the remaining visual events in the video to reason about the correct answer. This is more challenging because the reasoning and the moment localization cannot be easily decomposed. In existing tasks, by contrast, a model mostly needs to localize the question event and then reason within roughly the same span, owing to the higher QA-mIoU.
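
For clarity, here is a minimal sketch of QA-mIoU as the mean temporal IoU between each question's time span and its answer's time span. The field names are assumptions about the data layout, not the benchmark's actual schema; a low QA-mIoU means questions and answers tend to occupy different segments.

```python
# Sketch of QA-mIoU: mean temporal IoU between question and answer spans.
def temporal_iou(span_a: tuple[float, float], span_b: tuple[float, float]) -> float:
    """IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - inter
    return inter / union if union > 0 else 0.0


def qa_miou(samples: list[dict]) -> float:
    """Average IoU between question span and answer span over a dataset."""
    ious = [temporal_iou(s["question_span"], s["answer_span"]) for s in samples]
    return sum(ious) / len(ious) if ious else 0.0


# Example: question and answer in disjoint segments -> IoU of 0.
print(qa_miou([{"question_span": (5.0, 15.0), "answer_span": (20.0, 30.0)}]))
```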

Experiment Results

Leaderboard

We evaluate various models, including open-source and proprietary ones, on the ReXTime benchmark. We consider two tasks: moment localization and video question answering (VQA). Our evaluation is conducted under a zero-shot setting and, where available, a fine-tuning setting.

This table shows the results of humans and the proprietary models on the 300-sample mini test set; the evaluation metrics are sketched in code after the table.

Model           | Moment Localization                     | VQA
                | mIoU    R@1 (IoU=0.3)    R@1 (IoU=0.5)  | Accuracy (%)    Accuracy (%) @ IoU ≥ 0.5
Human           | 61.11   74.30            62.85          | 87.98           58.51
GPT-4o          | 36.28   45.33            34.00          | 73.67           28.67
Claude3-Opus    | 23.61   30.67            17.67          | 68.67           13.67
Gemini-1.5-Pro  | 28.43   35.67            25.00          | 68.00           18.33
GPT-4V          | 26.74   33.33            22.00          | 63.33           16.67
Reka-Core       | 27.95   36.33            24.00          | 59.67           17.00
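
The following is a hedged sketch of the table's metrics, assuming one predicted span and one chosen answer per sample. In particular, reading "Accuracy (%) @ IoU ≥ 0.5" as "the answer is correct and the predicted span reaches IoU ≥ 0.5" is our interpretation; the interval-IoU helper mirrors the QA-mIoU sketch above.

```python
# Sketch of the leaderboard metrics under the assumptions stated above.
def interval_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """IoU of two [start, end] intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(pred_spans, gt_spans, threshold: float) -> float:
    """R@1 (IoU=t): share of samples whose top predicted span reaches IoU >= t."""
    hits = [interval_iou(p, g) >= threshold for p, g in zip(pred_spans, gt_spans)]
    return 100.0 * sum(hits) / len(hits)


def accuracy_at_iou(pred_spans, gt_spans, answer_correct, threshold: float = 0.5) -> float:
    """Joint metric: the answer is correct AND the predicted span reaches IoU >= threshold."""
    joint = [ok and interval_iou(p, g) >= threshold
             for p, g, ok in zip(pred_spans, gt_spans, answer_correct)]
    return 100.0 * sum(joint) / len(joint)
```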
