SciVideoBench

Benchmarking Scientific Video Reasoning in Large Multimodal Models

1University of Central Florida 2Amazon 3University of North Carolina at Chapel Hill 4Stanford University
1,000 Research-Level QAs · 240+ Experimental Videos · 25+ Subjects · 30+ Evaluated Models

🎬 Dataset Examples

Physics: Material Science, Microfluidics, Soft Robotics, Applied Physics
Medicine: Oncology, Immunology, Therapeutics, Radiopharmaceuticals
Chemistry: Electrochemistry, Photovoltaics, Analytical Chemistry, Physical Chemistry
Biology: Cell Biology, Neuroscience, Bioinformatics, Structural Biology

📝 Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning, leading to saturation and failing to effectively evaluate advanced multimodal cognitive skills.

To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities.

Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists.

We hope SciVideoBench fits the interests of the community and helps push the boundary of cutting-edge AI for science more broadly.

SciVideoBench Teaser Figure
Figure 1: Overview of SciVideoBench - A comprehensive benchmark for evaluating scientific video reasoning in Large Multimodal Models.

🔍 Overview

To construct a high-quality benchmark for advanced scientific reasoning, we collect 241 research-grade experimental videos from the Journal of Visualized Experiments (JoVE), a peer-reviewed platform publishing methodological videos across diverse scientific disciplines. These professionally produced and narratively structured videos clearly demonstrate laboratory protocols, scientific phenomena, and technical instrumentation, making them an ideal foundation for a benchmark grounded in authentic scientific practice.

Each video is accompanied by a peer-reviewed manuscript and synchronized audio narration: the manuscript details experimental protocols and results, while the narration provides temporally aligned explanations of each step as it unfolds. This tri-modal alignment—video, audio, and text—supports principled question generation and rigorous answer verification, ensuring that questions are both visually grounded and scientifically meaningful.

We focus on four foundational domains—physics, chemistry, biology, and medicine—covering a wide spectrum of procedural complexity and reasoning challenges. Videos are selected to include measurable variables (e.g., reaction time, temperature, applied force), observable causal relationships, and logical experimental sequences, thereby enabling conceptual, hypothetical, and quantitative reasoning.

This targeted curation ensures that each video in SciVideoBench provides rich multimodal cues essential for rigorous scientific reasoning and serves as an ideal testbed for evaluating LMMs.

SciVideoBench Overview Diagram
Figure 2: Examples of SciVideoBench, including videos from 4 disciplines (Physics, Biology, Chemistry, and Medicine), which involve 19 different subjects. The research-level QAs challenge LMMs in three aspects (Conceptual, Hypothetical, and Quantitative) that are of vital importance in scientific experiment video understanding.

📊 Statistics

We collected a total of 241 experimental videos spanning four major domains and covering more than 25 distinct scientific subjects, as illustrated in Figure 3. The average video duration is 484 seconds, which ensures that the benchmark reflects the complexity and extended reasoning often required in real-world scientific experiments.

Building upon these videos, we annotated a total of 1,000 challenging questions that demand research-level knowledge for both perception and reasoning. To further capture the nature of academic research and experimental analysis, we carefully designed three distinct question types (conceptual, hypothetical, and quantitative) that reflect common reasoning scenarios observed across the videos.

SciVideoBench Dataset Statistics
Figure 3: Distribution of videos across domains and scientific subjects in SciVideoBench, showing the diversity and scale of our benchmark.
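For concreteness, the sketch below shows one way a SciVideoBench-style multiple-choice item could be represented and scored. The field names and the ten-option format are illustrative assumptions (the 10.0% random-guess baseline in the leaderboard suggests ten options per question), not the released data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one SciVideoBench-style item.
# Field names are assumptions for illustration, not the official schema.
@dataclass
class MCQItem:
    video_id: str          # JoVE experimental video the question is grounded in
    discipline: str        # Physics / Chemistry / Biology / Medicine
    question_type: str     # conceptual / hypothetical / quantitative
    question: str
    options: List[str]     # candidate answers (ten options -> 10% random guess)
    answer: str            # correct option letter, e.g. "C"

def accuracy(items: List[MCQItem], predictions: List[str]) -> float:
    """Fraction of items whose predicted option letter matches the answer key."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)

# Toy usage with a single illustrative item.
item = MCQItem(
    video_id="jove_00001",
    discipline="Chemistry",
    question_type="quantitative",
    question="Given the flow rate shown in the video, how long does one titration cycle take?",
    options=[f"{chr(65 + i)}. option {i}" for i in range(10)],
    answer="C",
)
print(accuracy([item], ["C"]))  # 1.0
```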

🏆 Leaderboard

| Rank | Model | LLM | Size | Overall | Physics | Chemistry | Biology | Medicine | Conceptual | Hypothetical | Quantitative |
|------|-------|-----|------|---------|---------|-----------|---------|----------|------------|--------------|--------------|
| - | Random Guess | - | - | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% | 10.0% |
| - | Human (Graduate Students) | - | - | 17.4% | 18.1% | 18.7% | 15.9% | 21.2% | 16.1% | 21.2% | 18.9% |
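The leaderboard reports accuracy overall, per discipline, and per reasoning type. Below is a minimal sketch of that breakdown, assuming each record carries the (hypothetical) discipline, question_type, answer, and prediction fields used in the earlier example; it is not the official evaluation script.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def grouped_accuracy(
    records: List[Dict[str, str]],  # each with "discipline", "question_type", "answer", "prediction"
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    """Overall accuracy plus per-discipline and per-reasoning-type breakdowns."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        correct = int(r["prediction"] == r["answer"])
        for key in ("overall", f"disc:{r['discipline']}", f"type:{r['question_type']}"):
            hits[key] += correct
            totals[key] += 1
    acc = {k: hits[k] / totals[k] for k in totals}
    by_discipline = {k[5:]: v for k, v in acc.items() if k.startswith("disc:")}
    by_type = {k[5:]: v for k, v in acc.items() if k.startswith("type:")}
    return acc["overall"], by_discipline, by_type
```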

📊 Analysis

Performance Analysis Chart 1

The Impact of Chain-of-Thought

Chain-of-Thought performance gains across proprietary and open-source models. For proprietary models, a clear gain can be observed across all reasoning aspects; for open-source models, quantitative reasoning receives a notable boost, while the other two reasoning aspects are negatively affected. This again demonstrates that the quantitative questions in SciVideoBench require sophisticated multi-step reasoning that benefits substantially from chain-of-thought prompts.
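For illustration, the sketch below contrasts a direct-answer prompt with a chain-of-thought prompt for a video multiple-choice question. The template wording is an assumption for illustration; the exact prompts used in the benchmark's evaluation are not reproduced here.

```python
def build_prompt(question: str, options: list, use_cot: bool) -> str:
    """Assemble the text part of a video MCQ prompt; video frames go to the LMM separately."""
    option_block = "\n".join(options)
    if use_cot:
        instruction = (
            "Think step by step about what the experiment in the video shows, "
            "then give the final answer as a single option letter."
        )
    else:
        instruction = "Answer with a single option letter only."
    return f"{question}\n{option_block}\n{instruction}"
```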

Performance Analysis Chart 2

The Impact of Model Scaling

The impact of LLM backbones on performance. Larger models reliably boost conceptual and hypothetical reasoning, but quantitative gains remain weak and non-monotonic across model series.

📖 Citation

        @article{deng2025scivideobench,
          title={SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models},
          author={Andong Deng and Taojiannan Yang and Shoubin Yu and Lincoln Spencer and Mohit Bansal and Chen Chen and Serena Yeung-Levy and Xiaohan Wang},
          journal={arXiv preprint arXiv:2501.XXXX},
          year={2025}
        }