Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang
NeurIPS 2025

Paper arXiv Code

Eyes Wide Open teaser

Abstract

Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego‑streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just‑in‑Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP‑Bench (Ego Streaming Proactive Benchmark) alongside the ESTP‑F1 metric—a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi‑stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks.

The Challenge: From Passive Observer to Proactive Assistant

Imagine an AI assistant that doesn't just see your world, but proactively thinks and helps when needed. Current video models are mostly passive observers, struggling with three core challenges in streaming video:

  1. Proactive Coherence: How to understand context and answer questions that depend on future events?

  2. Just-in-Time Responsiveness: How to respond precisely when evidence is sufficient—not too early, not too late?

  3. Synchronized Efficiency: How to generate answers without missing new visual information, keeping perception and reasoning perfectly aligned?

Figure 1. Comparison table.

Framework

To overcome these challenges, we developed VideoLLM-EyeWO, a complete technical pipeline featuring three key innovations:

  1. Data Engine: Automatically generates large-scale, diverse, and multi-turn proactive QA data to fuel model training.

  2. Multi-Stage Training Strategy: Progressively teaches the model to master response timing (“when to answer”) and accuracy (“how to answer”), ultimately achieving coherent, multi-turn dialogue.

  3. Proactive Dynamic Compression: Intelligently adjusts the information compression rate based on response likelihood, enabling efficient and synchronized perception and reasoning.
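To make the idea behind proactive dynamic compression concrete, here is a minimal, illustrative sketch (not the released implementation): visual tokens from each frame are kept at a rate that scales with the model's current response likelihood, so perception stays cheap while no answer is imminent and detailed when one is. The function name, the norm-based importance heuristic, and the keep-ratio bounds are all assumptions for illustration.

```python
import numpy as np

def dynamic_compress(frame_tokens, response_likelihood,
                     min_keep=0.1, max_keep=1.0):
    """Illustrative sketch: keep a fraction of visual tokens that
    grows with the estimated likelihood that a response is due.

    frame_tokens: (num_tokens, dim) array of visual tokens for one frame.
    response_likelihood: scalar in [0, 1], e.g. the model's probability
        that an answer should be emitted now.
    """
    keep_ratio = min_keep + (max_keep - min_keep) * float(response_likelihood)
    num_keep = max(1, int(round(keep_ratio * len(frame_tokens))))
    # Placeholder importance score: token L2 norm. A real system would
    # use a learned saliency or attention-based criterion instead.
    importance = np.linalg.norm(frame_tokens, axis=1)
    keep_idx = np.argsort(importance)[-num_keep:]
    # Preserve the original temporal/spatial ordering of kept tokens.
    return frame_tokens[np.sort(keep_idx)]
```

With a low response likelihood the frame is aggressively compressed; as the likelihood approaches 1, nearly all tokens are retained, keeping reasoning synchronized with incoming frames.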

Figure 2. Framework overview.

Results

ESTP results.
Evaluation of VideoLLM-EyeWO.

TODO

  • Publish on arXiv
  • Release full paper
  • Release VideoLLM-EyeWO code
  • Release ESTP-Bench and ESTP-IT
  • Release ESTP-Bench v2
  • Publish LiveCC-EyeWO results
  • Release LiveCC-EyeWO code

BibTeX