Real-world videos are filled with routine activities punctuated by memorable, surprising events. Yet most Video-LLMs process videos by uniformly sampling frames, likely missing the critical moments that define a video's narrative.
We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream. By identifying moments where new observations conflict with prior expectations, SPIKE naturally discovers the most salient and informative frames. Since zero-shot Video-LLM beliefs are often suboptimal, we develop SPIKE-RL, which leverages Group Relative Policy Optimization (GRPO) to refine belief hypotheses, using caption quality as a reward signal.
SPIKE and SPIKE-RL enable query-agnostic, surprise-weighted frame sampling that allocates more frames to interesting moments. This strategy yields consistent performance gains across five downstream benchmarks compared to uniform sampling baselines.
Bayesian Belief Tracking in Action
At the core of SPIKE is a principled approach to modeling uncertainty and surprise. As the model processes each frame, it maintains explicit probability distributions over possible future events, represented as human-interpretable textual hypotheses. When new visual evidence arrives, these beliefs are updated according to Bayes' rule, and the magnitude of this update—measured via KL divergence—quantifies surprise.
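In symbols (our notation, which may differ slightly from the paper's exact formulation), let $H = \{h_1, \dots, h_K\}$ be the current set of textual hypotheses and $o_{1:t}$ the frames observed up to time $t$. The surprise signal is the divergence between the updated and prior beliefs over $H$:

$$
\mathrm{Surprise}_t \;=\; D_{\mathrm{KL}}\!\big(P(H \mid o_{1:t}) \,\|\, P(H \mid o_{1:t-1})\big)
\;=\; \sum_{k=1}^{K} P(h_k \mid o_{1:t}) \log \frac{P(h_k \mid o_{1:t})}{P(h_k \mid o_{1:t-1})}.
$$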
Method
SPIKE: Inference-Time Surprise Quantification
SPIKE operates through a four-stage pipeline at each timestep (see the code sketch after this list):
- Hypothesis generation: Given recent frames and historical context, generate a diverse set of hypotheses about potential future events
- Prior computation: Score each hypothesis before observing the next frame to obtain a prior belief distribution
- Posterior update: Re-score hypotheses after observing the new frame to compute posterior beliefs
- Surprise measurement: Calculate KL divergence between posterior and prior distributions as the surprise signal
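Below is a minimal sketch of one such timestep, assuming a generic Video-LLM wrapper with `generate_hypotheses` and `score_hypotheses` methods that return textual hypotheses and normalized belief scores; these names and the interface are illustrative placeholders, not the released SPIKE API.

```python
import numpy as np

def kl_divergence(posterior, prior, eps=1e-12):
    """KL(posterior || prior) between two discrete belief distributions."""
    p = np.asarray(posterior, dtype=np.float64) + eps
    q = np.asarray(prior, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def spike_step(model, context_frames, new_frame, history, num_hypotheses=8):
    """One SPIKE timestep: hypothesize, score prior, score posterior, measure surprise."""
    # 1. Hypothesis generation: textual guesses about what might happen next.
    hypotheses = model.generate_hypotheses(context_frames, history, n=num_hypotheses)
    # 2. Prior: belief over hypotheses before observing the next frame.
    prior = model.score_hypotheses(hypotheses, context_frames, history)
    # 3. Posterior: belief over the same hypotheses after observing the new frame.
    posterior = model.score_hypotheses(hypotheses, context_frames + [new_frame], history)
    # 4. Surprise: magnitude of the belief update.
    surprise = kl_divergence(posterior, prior)
    return surprise, hypotheses
```

Running this step over every frame yields a per-timestep surprise curve that can then drive frame selection.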
This formulation has several advantages. First, surprise is computed in a model-agnostic manner that doesn't require gradient access or architectural modifications. Second, the textual hypothesis space enables rich reasoning about complex events. Third, the probabilistic framework provides theoretical grounding and interpretability.
SPIKE-RL: Reinforcement Learning for Belief Optimization
A critical challenge is that zero-shot Video-LLMs often generate suboptimal belief hypotheses. To address this, SPIKE-RL employs GRPO to optimize the hypothesis generator. The key observation is that accurate intermediate beliefs are necessary for producing high-quality final captions. By using caption quality as a reward signal, we can implicitly supervise the belief tracking process without requiring ground-truth surprise annotations.
The training procedure samples multiple belief trajectories per video, computes rewards based on caption quality (using GPT-4 as an evaluator), and updates the generator to increase the probability of high-reward trajectories. This approach yields hypotheses that are both more diverse and better aligned with human judgments of surprise.
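As a rough illustration of the group-relative reward signal, the sketch below standardizes caption-quality rewards within the group of trajectories sampled for one video; the reward values are made up for the example, and GRPO details such as the clipped objective and the KL penalty to the reference policy are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize caption-quality rewards within the group
    of belief trajectories sampled for the same video, so above-average trajectories
    receive positive weight in the policy-gradient update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: caption-quality scores for 4 belief trajectories sampled for one video.
rewards = [0.62, 0.71, 0.40, 0.55]
advantages = group_relative_advantages(rewards)
# The generator's log-probabilities on each trajectory are reweighted by these
# advantages, increasing the likelihood of trajectories that led to better captions.
```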
Results
Surprise Localization
We evaluate SPIKE's ability to identify surprising moments on three benchmarks containing videos with unexpected events:
On the Oops! dataset, SPIKE approaches human-level performance at localizing failure events. On FunQA, our method significantly outperforms prior approaches. We also introduce a new Mr. Bean benchmark containing videos with multiple surprising moments, where SPIKE achieves strong performance despite the increased complexity.
Downstream Task Performance
To validate that surprise-weighted sampling improves general video understanding, we evaluate on five diverse benchmarks spanning different task types and domains. In each case, we compare uniform frame sampling against our surprise-weighted approach while keeping all other factors constant.
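One simple way to realize surprise-weighted sampling, shown here as our own minimal reading rather than the paper's exact allocation rule, is to draw the fixed frame budget with probability proportional to each timestep's surprise score:

```python
import numpy as np

def surprise_weighted_indices(surprise, budget, temperature=1.0, seed=0):
    """Select `budget` frame indices with probability proportional to surprise^(1/T),
    so more of the fixed budget lands on high-surprise regions of the video."""
    s = np.maximum(np.asarray(surprise, dtype=np.float64), 1e-6) ** (1.0 / temperature)
    probs = s / s.sum()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(probs), size=min(budget, len(probs)), replace=False, p=probs)
    return np.sort(chosen)

# The uniform baseline instead takes evenly spaced indices:
# np.linspace(0, len(surprise) - 1, budget).astype(int)
```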
The consistent improvements across all benchmarks indicate that surprise-weighted sampling exploits a general property of video structure rather than a quirk of any single task. Notably, gains are substantial even on tasks that don't explicitly focus on surprising events, suggesting that informative moments tend to be surprising moments. The particularly strong result on ExFunTube (+7.0%) highlights the importance of actually observing the unexpected events that require explanation.
These results validate a key hypothesis: by concentrating computational resources on high-information frames, we can achieve better video understanding without increasing the total number of frames processed. This has important implications for scaling video models to longer content.
Discussion
SPIKE represents a step toward more human-like video understanding, where models actively predict upcoming events and register surprise when predictions fail. This predictive processing framework has deep connections to theories of human cognition and provides a principled approach to attention allocation in sequential data.
Future work could extend SPIKE to longer videos, incorporate multimodal signals beyond vision, and explore whether surprise-based learning can improve model capabilities more broadly. The interpretable nature of textual belief hypotheses also opens opportunities for human-in-the-loop refinement and model debugging.