We introduce the "Long Video Haystack" task: finding a minimal set of relevant frames (one to five) from tens of thousands of frames for a given query. We provide LV-HAYSTACK, a new benchmark of 3,874 human-annotated instances with fine-grained metrics for keyframe search quality and efficiency. Current methods achieve only a 2.1% temporal F1 score on its LVBench subset, highlighting a large gap.
To address this, we propose T*, a lightweight keyframe search framework that recasts expensive temporal search as spatial search. T* leverages image-based visual localization and introduces adaptive zooming across temporal and spatial dimensions. Under a 32-frame inference budget, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's from 55.5% to 62.4% on the LongVideoBench XL subset.
T* is a temporal search framework designed to efficiently identify the keyframes relevant to a given query. By transforming temporal search into spatial search, T* leverages lightweight object detectors and the visual grounding capabilities of vision-language models (VLMs). It performs strongly both with and without additional training, making it a versatile drop-in component for a range of video-understanding applications. A minimal sketch of the core search loop is given below.
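As an illustration only, here is a minimal sketch of a coarse-to-fine keyframe search loop in the spirit of T*. The `score_frame_relevance` callback, the per-round sampling schedule, and the ±64-frame zoom window are placeholders we introduce for exposition; the actual T* implementation obtains its relevance signal from object detectors and VLM visual grounding and differs in its details.

```python
# Minimal sketch of a coarse-to-fine temporal search loop (not the official T* code).
# `score_frame_relevance` is a placeholder for a lightweight detector / VLM grounding signal.
from typing import Callable, Dict, List
import numpy as np


def coarse_to_fine_keyframe_search(
    num_frames: int,
    score_frame_relevance: Callable[[int], float],  # hypothetical scorer: frame index -> relevance in [0, 1]
    budget: int = 32,          # total number of frames we are allowed to score
    frames_per_round: int = 8, # frames sampled per zoom-in round
    top_k: int = 5,            # size of the returned keyframe set (1-5 in LV-HAYSTACK)
) -> List[int]:
    """Iteratively re-allocate the scoring budget toward temporally promising regions."""
    # Start with a uniform prior over the timeline.
    prior = np.ones(num_frames) / num_frames
    scored: Dict[int, float] = {}

    while len(scored) < budget:
        # Sample unseen frame indices according to the current prior.
        candidates = [i for i in range(num_frames) if i not in scored]
        if not candidates:
            break
        probs = prior[candidates]
        probs = probs / probs.sum()
        n = min(frames_per_round, budget - len(scored), len(candidates))
        picks = np.random.choice(candidates, size=n, replace=False, p=probs)

        for idx in picks:
            scored[int(idx)] = float(score_frame_relevance(int(idx)))

        # Zoom in: boost the prior around frames that scored well, so the next
        # round samples more densely near them.
        prior = np.full(num_frames, 1e-6)
        for idx, s in scored.items():
            lo, hi = max(0, idx - 64), min(num_frames, idx + 64)
            prior[lo:hi] += s + 1e-3
        prior = prior / prior.sum()

    # Return the highest-scoring frames as the predicted keyframes.
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

The point the sketch tries to capture is that frames are not scored uniformly: each round re-allocates the remaining budget toward temporal regions whose already-scored neighbors look relevant to the query.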
We disentangle the evaluation of temporal search from video understanding using six fine-grained search metrics; a sketch of one such metric follows.
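As a concrete, hedged illustration of what a keyframe search metric measures, the sketch below computes a frame-level temporal F1 between predicted and ground-truth keyframe indices. The tolerance-window matching rule is our own simplification; the exact metric definitions used by LV-HAYSTACK may differ.

```python
# Hedged sketch of a frame-level temporal F1: a predicted keyframe counts as correct
# if it falls within `tolerance` frames of some ground-truth keyframe.
from typing import Sequence


def temporal_f1(pred: Sequence[int], gt: Sequence[int], tolerance: int = 0) -> float:
    """F1 between predicted and ground-truth keyframe indices."""
    if not pred or not gt:
        return 0.0
    # Precision: fraction of predictions that land near some ground-truth frame.
    tp_pred = sum(any(abs(p - g) <= tolerance for g in gt) for p in pred)
    precision = tp_pred / len(pred)
    # Recall: fraction of ground-truth frames covered by some prediction.
    tp_gt = sum(any(abs(p - g) <= tolerance for p in pred) for g in gt)
    recall = tp_gt / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: two of three predictions land on ground-truth keyframes.
print(temporal_f1(pred=[120, 4500, 9000], gt=[118, 9001], tolerance=5))  # ≈ 0.8
```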
Videos: 988
Total length: 423 h (~25.7 min/video)
Frames: 45,700,000 (~46,300/video)
QA pairs: 15,092 (~15.3/video)
Keyframes: 28,300 (~1.9/question)

Videos: 114
Total length: 26.7 h (~14.1 min/video)
Frames: 2,200,000 (~19,100/video)
QA pairs: 342 (~3.0/video)
Keyframes: 496 (~1.5/question)
Explore examples from our LV-Haystack dataset. (Note: the videos shown in this interface are clipped to a short segment near the target frame rather than displayed at their original full length.)
This work is in part supported by ONR N00014-23-1-2355 and ONR MURI N00014-22-1-2740. We thank Anabella Isaro for valuable help in organizing the open-source code.
@misc{tstar,
  title={Re-thinking Temporal Search for Long-Form Video Understanding},
  author={Jinhui Ye and Zihan Wang and Haosen Sun and Keshigeyan Chandrasegaran and Zane Durante and Cristobal Eyzaguirre and Yonatan Bisk and Juan Carlos Niebles and Ehsan Adeli and Li Fei-Fei and Jiajun Wu and Manling Li},
  year={2025},
  eprint={2504.02259},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.02259}
}