Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang
NeurIPS 2025

Paper arXiv Code

Eyes Wide Open teaser

Abstract

Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego‑streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just‑in‑Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP‑Bench (Ego Streaming Proactive Benchmark) alongside the ESTP‑F1 metric—a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi‑stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks.

The Challenge: From Passive Observer to Proactive Assistant

Imagine an AI assistant that doesn't just see your world, but proactively thinks and helps when needed. Current video models are mostly passive observers, struggling with three core challenges in streaming video:

  1. Proactive Coherence: How to understand context and answer questions that depend on future events?

  2. Just-in-Time Responsiveness: How to respond precisely when evidence is sufficient—not too early, not too late?

  3. Synchronized Efficiency: How to generate answers without missing new visual information, keeping perception and reasoning perfectly aligned?

Figure 1. Comparison table.

Framework

To overcome these challenges, we developed VideoLLM-EyeWO, a complete technical pipeline featuring three key innovations:

  1. Data Engine: Automatically generates large-scale, diverse, and multi-turn proactive QA data to fuel model training.

  2. Multi-Stage Training Strategy: Progressively teaches the model to master response timing (“when to answer”) and accuracy (“how to answer”), ultimately achieving coherent, multi-turn dialogue.

  3. Proactive Dynamic Compression: Intelligently adjusts the information compression rate based on response likelihood, enabling efficient and synchronized perception and reasoning.
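To make the idea behind proactive dynamic compression concrete, here is a minimal, illustrative sketch (not the released implementation): visual tokens from each frame are kept at a rate that scales with the model's current response likelihood, so perception stays cheap while no answer is imminent and detailed when one is. The function name, the norm-based importance heuristic, and the keep-ratio bounds are all assumptions for illustration.

```python
import numpy as np

def dynamic_compress(frame_tokens, response_likelihood,
                     min_keep=0.1, max_keep=1.0):
    """Illustrative sketch: keep a fraction of visual tokens that
    grows with the estimated likelihood that a response is due.

    frame_tokens: (num_tokens, dim) array of visual tokens for one frame.
    response_likelihood: scalar in [0, 1], e.g. the model's probability
        that an answer should be emitted now.
    """
    keep_ratio = min_keep + (max_keep - min_keep) * float(response_likelihood)
    num_keep = max(1, int(round(keep_ratio * len(frame_tokens))))
    # Placeholder importance score: token L2 norm. A real system would
    # use a learned saliency or attention-based criterion instead.
    importance = np.linalg.norm(frame_tokens, axis=1)
    keep_idx = np.argsort(importance)[-num_keep:]
    # Preserve the original temporal/spatial ordering of kept tokens.
    return frame_tokens[np.sort(keep_idx)]
```

With a low response likelihood the frame is aggressively compressed; as the likelihood approaches 1, nearly all tokens are retained, keeping reasoning synchronized with incoming frames.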

Figure 2. Framework overview.

Results

ESTP results.
Evaluation of VideoLLM-EyeWO.

TODO

  • Publish on arXiv
  • Release full paper
  • Release VideoLLM-EyeWO code
  • Release ESTP-Bench and ESTP-IT
  • Release ESTP-Bench v2
  • Publish LiveCC-EyeWO results
  • Release LiveCC-EyeWO code

BibTeX