Real-world videos are filled with routine activities punctuated by memorable, surprising events. Yet most Video-LLMs process videos by uniformly sampling frames, likely missing the critical moments that define a video's narrative.
We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream. By identifying moments where new observations conflict with prior expectations, SPIKE naturally discovers the most salient and informative frames. Since zero-shot Video-LLM beliefs are often suboptimal, we develop SPIKE-RL, which leverages Group Relative Policy Optimization (GRPO) to refine belief hypotheses, using caption quality as a reward signal.
SPIKE and SPIKE-RL enable query-agnostic, surprise-weighted frame sampling that allocates more frames to interesting moments. This strategy yields consistent performance gains across five downstream benchmarks compared to uniform sampling baselines.
Bayesian Belief Tracking in Action
At the core of SPIKE is a principled approach to modeling uncertainty and surprise. As the model processes each frame, it maintains explicit probability distributions over possible future events, represented as human-interpretable textual hypotheses. When new visual evidence arrives, these beliefs are updated according to Bayes' rule, and the magnitude of this update—measured via KL divergence—quantifies surprise.
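In symbols (our notation, which may differ slightly from the paper's exact formulation), let $H = \{h_1, \dots, h_K\}$ be the current set of textual hypotheses and $o_{1:t}$ the frames observed up to time $t$. The surprise signal is the divergence between the updated and prior beliefs over $H$:

$$
\mathrm{Surprise}_t \;=\; D_{\mathrm{KL}}\!\big(P(H \mid o_{1:t}) \,\|\, P(H \mid o_{1:t-1})\big)
\;=\; \sum_{k=1}^{K} P(h_k \mid o_{1:t}) \log \frac{P(h_k \mid o_{1:t})}{P(h_k \mid o_{1:t-1})}.
$$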
Method
SPIKE: Inference-Time Surprise Quantification
SPIKE operates through a four-stage pipeline at each timestep (see the code sketch after this list):
- Hypothesis generation: Given recent frames and historical context, generate a diverse set of hypotheses about potential future events
- Prior computation: Score each hypothesis before observing the next frame to obtain a prior belief distribution
- Posterior update: Re-score hypotheses after observing the new frame to compute posterior beliefs
- Surprise measurement: Calculate KL divergence between posterior and prior distributions as the surprise signal
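Below is a minimal sketch of one such timestep, assuming a generic Video-LLM wrapper with `generate_hypotheses` and `score_hypotheses` methods that return textual hypotheses and normalized belief scores; these names and the interface are illustrative placeholders, not the released SPIKE API.

```python
import numpy as np

def kl_divergence(posterior, prior, eps=1e-12):
    """KL(posterior || prior) between two discrete belief distributions."""
    p = np.asarray(posterior, dtype=np.float64) + eps
    q = np.asarray(prior, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def spike_step(model, context_frames, new_frame, history, num_hypotheses=8):
    """One SPIKE timestep: hypothesize, score prior, score posterior, measure surprise."""
    # 1. Hypothesis generation: textual guesses about what might happen next.
    hypotheses = model.generate_hypotheses(context_frames, history, n=num_hypotheses)
    # 2. Prior: belief over hypotheses before observing the next frame.
    prior = model.score_hypotheses(hypotheses, context_frames, history)
    # 3. Posterior: belief over the same hypotheses after observing the new frame.
    posterior = model.score_hypotheses(hypotheses, context_frames + [new_frame], history)
    # 4. Surprise: magnitude of the belief update.
    surprise = kl_divergence(posterior, prior)
    return surprise, hypotheses
```

Running this step over every frame yields a per-timestep surprise curve that can then drive frame selection.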
This formulation has several advantages. First, surprise is computed in a model-agnostic manner that doesn't require gradient access or architectural modifications. Second, the textual hypothesis space enables rich reasoning about complex events. Third, the probabilistic framework provides theoretical grounding and interpretability.
SPIKE-RL: Reinforcement Learning for Belief Optimization
A critical challenge is that zero-shot Video-LLMs often generate suboptimal belief hypotheses. To address this, SPIKE-RL employs GRPO to optimize the hypothesis generator. The key observation is that accurate intermediate beliefs are necessary for producing high-quality final captions. By using caption quality as a reward signal, we can implicitly supervise the belief tracking process without requiring ground-truth surprise annotations.
The training procedure samples multiple belief trajectories per video, computes rewards based on caption quality (using GPT-4 as an evaluator), and updates the generator to increase the probability of high-reward trajectories. This approach yields hypotheses that are both more diverse and better aligned with human judgments of surprise.
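As a rough illustration of the group-relative reward signal, the sketch below standardizes caption-quality rewards within the group of trajectories sampled for one video; the reward values are made up for the example, and GRPO details such as the clipped objective and the KL penalty to the reference policy are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize caption-quality rewards within the group
    of belief trajectories sampled for the same video, so above-average trajectories
    receive positive weight in the policy-gradient update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: caption-quality scores for 4 belief trajectories sampled for one video.
rewards = [0.62, 0.71, 0.40, 0.55]
advantages = group_relative_advantages(rewards)
# The generator's log-probabilities on each trajectory are reweighted by these
# advantages, increasing the likelihood of trajectories that led to better captions.
```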
Results
Surprise Localization
We evaluate SPIKE's ability to identify surprising moments on three benchmarks containing videos with unexpected events:
On the Oops! dataset, SPIKE approaches human-level performance at localizing failure events. On FunQA, our method significantly outperforms prior approaches. We also introduce a new Mr. Bean benchmark containing videos with multiple surprising moments, where SPIKE achieves strong performance despite the increased complexity.
Downstream Task Performance
To validate that surprise-weighted sampling improves general video understanding, we evaluate on five diverse benchmarks spanning different task types and domains. In each case, we compare uniform frame sampling against our surprise-weighted approach while keeping all other factors constant.
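One simple way to realize surprise-weighted sampling, shown here as our own minimal reading rather than the paper's exact allocation rule, is to draw the fixed frame budget with probability proportional to each timestep's surprise score:

```python
import numpy as np

def surprise_weighted_indices(surprise, budget, temperature=1.0, seed=0):
    """Select `budget` frame indices with probability proportional to surprise^(1/T),
    so more of the fixed budget lands on high-surprise regions of the video."""
    s = np.maximum(np.asarray(surprise, dtype=np.float64), 1e-6) ** (1.0 / temperature)
    probs = s / s.sum()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(probs), size=min(budget, len(probs)), replace=False, p=probs)
    return np.sort(chosen)

# The uniform baseline instead takes evenly spaced indices:
# np.linspace(0, len(surprise) - 1, budget).astype(int)
```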
The consistent improvements across all benchmarks indicate that surprise-weighted sampling exploits a general property of video structure rather than a quirk of any single task. Notably, gains are substantial even on tasks that don't explicitly focus on surprising events, suggesting that informative moments tend to be surprising moments. The particularly strong result on ExFunTube (+7.0%) highlights the importance of actually observing the unexpected events that require explanation.
These results validate a key hypothesis: by concentrating computational resources on high-information frames, we can achieve better video understanding without increasing the total number of frames processed. This has important implications for scaling video models to longer content.
Discussion
SPIKE represents a step toward more human-like video understanding, where models actively predict upcoming events and register surprise when predictions fail. This predictive processing framework has deep connections to theories of human cognition and provides a principled approach to attention allocation in sequential data.
Future work could extend SPIKE to longer videos, incorporate multimodal signals beyond vision, and explore whether surprise-based learning can improve model capabilities more broadly. The interpretable nature of textual belief hypotheses also opens opportunities for human-in-the-loop refinement and model debugging.