FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

1NOVA LINCS, Lisboa 2DeepNeuronic, Covilhã 3University of Beira Interior, Covilhã 4FCT NOVA, Lisboa
FV-Score Evaluation Overview

FineVAU enables fine-grained evaluation of LVLMs on video anomaly understanding through What, Who, and Where dimensions, achieving superior alignment with human perception.

Abstract

Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fail to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter prioritizes language quality over factual relevance, often leading to subjective judgments misaligned with human perception.

In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained, and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding the key descriptive elements of anomalies in video: events (What), participating entities (Who), and location (Where).

Our benchmark introduces a) FV-Score, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a comprehensive dataset curated through a structured, fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information.

Human evaluation reveals that our proposed metric aligns with human perception of anomalies far better than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and events that exhibit strong visual cues.

FineW3 Dataset Examples

Our FineW3 dataset contains 1,544 videos with fine-grained annotations covering events (What), entities (Who), and location (Where). Below are examples demonstrating the granularity of our annotations across different anomaly categories.

Shoplifting Easy - Strong Visual Cues

📍 Events (What)
  • Woman and man walking between supermarket shelves
  • Woman stops, takes bags of items from shelf
  • Woman dumps some items on opposite shelf
  • Woman puts bags of items into her clothes (hiding them)
  • Woman walks forward and takes two more bags from shelf
  • Woman hides additional items in her clothes
  • Man picks up child and puts child down
  • Woman continues taking things from shelf
👥 Entities (Who)
  • Primary Actor: Woman who took bags from shelf and hid them in her clothes
  • Accomplice/Companion: Man who followed woman and interacted with child
  • Distraction: Child who was picked up and put down by man
  • Stolen Items: Multiple bags of merchandise hidden in clothing
🏢 Scene (Where)
  • Location: Supermarket interior between shelves
  • Key Objects: Shelves with merchandise, bags of items
  • Context: Retail theft with possible distraction tactics (child present)

Arrest Easy - Strong Visual Cues

📍 Events (What)
  • Man in black clothes and white shirt talks with man in white at door
  • Man in black pushes man in white from behind
  • Man in white is surrounded and subdued by two people
  • Man in white is dragged from left side of screen to table on right
  • During struggle, man is pulled from table to the right side of the screen
  • Man with black top and white pants picks up pager and speaks into it
  • Three women at door are instructed to leave room one by one
👥 Entities (Who)
  • Aggressor: Man in black who pushed man in white from behind
  • Target: Man in white who was surrounded, subdued, and dragged
  • Accomplices: Two men in black who dragged the man to the table
  • Authority Figure: Man in all black who gave instructions
  • Bystanders: Women at door, man with pager
🏢 Scene (Where)
  • Location: Indoor room with door and table
  • Key Objects: Door, table on the right, pager
  • Context: Forced removal/arrest situation with multiple participants

Robbery Difficult - Complex Sequence

📍 Events (What)
  • Man climbs over wall and walks to small door
  • Man opens door and enters
  • Another man comes out and fights with intruder
  • Defender pushes robber away from door and struggles with him
  • Robber stabs the defender several times with knife
  • Robber repeatedly hits defender's head with knife handle
  • Defender fights back with iron rod, hitting robber's shoulder and head
  • Robber jumps out window to escape
  • Defender returns to house and closes door
👥 Entities (Who)
  • Robber: Intruder who climbed wall, fought, stabbed, and escaped through window
  • Defender: Man who fought back, used iron rod, pulled robber out multiple times
🏢 Scene (Where)
  • Location: Residential property with wall and small door
  • Key Objects: Wall, small door, knife, iron rod, window
  • Context: Home invasion with violent confrontation and self-defense
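
The examples above all follow the same What/Who/Where structure. As an illustrative sketch (the field names are assumptions for exposition, not the dataset's actual serialization format), a single FineW3 record could be represented as follows:

from dataclasses import dataclass, field

@dataclass
class FineW3Record:
    # Illustrative What/Who/Where record; field names are assumed,
    # not FineW3's actual serialization format.
    video_id: str                 # hypothetical identifier
    category: str                 # e.g., "Robbery", "Shoplifting", "Arrest"
    difficulty: str               # e.g., "Easy", "Difficult"
    events: list[str] = field(default_factory=list)         # What: ordered events
    entities: dict[str, str] = field(default_factory=dict)  # Who: role -> description
    location: str = ""                                      # Where: scene location
    key_objects: list[str] = field(default_factory=list)    # Where: salient objects
    context: str = ""                                       # Where: scene-level context

# Example built from the "Robbery Difficult" annotations above:
record = FineW3Record(
    video_id="robbery_difficult_001",  # hypothetical ID
    category="Robbery",
    difficulty="Difficult",
    events=["Man climbs over wall and walks to small door",
            "Robber stabs the defender several times with knife"],
    entities={"Robber": "Intruder who climbed wall, fought, stabbed, and escaped",
              "Defender": "Man who fought back with an iron rod"},
    location="Residential property with wall and small door",
    key_objects=["wall", "small door", "knife", "iron rod", "window"],
    context="Home invasion with violent confrontation and self-defense",
)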

FV-Score vs. Other Metrics

Our FV-Score metric achieves superior correlation with human judgment (Pearson ρ: 0.61) compared to traditional n-gram metrics and LLM-based approaches. Below, we demonstrate how different metrics evaluate the same LVLM response, highlighting FV-Score's fine-grained and actionable feedback.

Ground truth annotations shown in dataset examples above

LVLM Response:

"A person approaches a residential property and enters through a door. Two individuals are present in the scene and appear to be engaged in some form of physical interaction near the entrance. One person is seen moving around the property, and there are objects visible including what appears to be a door and window. The scene takes place at a residential location during what appears to be daytime."

Comparison Across Evaluation Metrics

✓ FV-Score (Ours)

Overall Score: 3.5/15 | Correlation: ρ = 0.61


What (Events): 0.5/5

  • Missing: Climbing wall to break in
  • Missing: Stabbing with knife (multiple times)
  • Missing: Hitting head with knife handle
  • Missing: Fighting with iron rod
  • Missing: Escape through window
  • Vague: "physical interaction" vastly understates violent assault

Who (Entities): 1/6

  • Partial: Two individuals present (but roles completely missed)
  • Missing: Robber vs. Defender distinction
  • Missing: Knife and iron rod as weapons

Where (Location): 2/4

  • Correct: Residential property, door, window visible
  • Missing: Wall, small door specifications

Actionable: Model sanitized a violent assault into benign "physical interaction"; completely missed the home invasion, the weapons, and the life-threatening nature of the attack.
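
For reference, the aggregate score above follows from summing per-dimension points; a minimal sketch (assuming plain point summation, which matches the 0.5/5 + 1/6 + 2/4 = 3.5/15 arithmetic shown):

# Per-dimension (earned, available) points from the breakdown above.
dimensions = {
    "What": (0.5, 5),   # events: the violent actions were missed entirely
    "Who": (1.0, 6),    # entities: presence noted, but roles and weapons missed
    "Where": (2.0, 4),  # location: property/door/window correct, wall and small door missed
}

earned = sum(e for e, _ in dimensions.values())
total = sum(t for _, t in dimensions.values())
print(f"FV-Score: {earned}/{total}")        # FV-Score: 3.5/15
print(f"Normalized: {earned / total:.2f}")  # Normalized: 0.23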

Human Judgment

Quality Rating: 2/10


"This is a catastrophic failure. The description talks about 'physical interaction' when the video clearly shows a violent home invasion with stabbing, beating with weapons, and a struggle for survival. The model completely sanitized the extreme violence and missed all weapons. Calling this 'physical interaction' is like calling a fire 'warmth.' Completely useless for security monitoring."


Critical failure - Dangerous mischaracterization of violent crime

VAU-EVAL

Score: 6.8/10 | Correlation: ρ = 0.53


  • Classification Accuracy: 7/10
  • Key Concept Alignment: 6/10
  • Linguistic Fluency: 9/10
  • Informativeness: 5/10
  • Factual Consistency: 7/10

Issue: High score despite missing life-threatening violence. Rewards grammatical fluency while ignoring critical safety failure.

AnomEVAL

Score: 6.1/10 | Correlation: ρ = 0.42


  • Basic Reasoning: 6/10
  • Consistency: 8/10
  • Hallucination: High (severe sanitization)

Issue: Evaluates reasoning structure, not visual grounding. Completely misses that "physical interaction" is a euphemism for violent assault with weapons.

ROUGE-L

Score: 0.22 | Correlation: ρ = 0.47


Lexical overlap with reference: 22%

Shared words: "door", "window", "property", "two"


Issue: Only measures word overlap. "Physical interaction" vs. "stabbing with knife" treated as simple mismatch. Ignores semantic severity.
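
To make this limitation concrete, the following self-contained sketch implements a simplified ROUGE-L (LCS-based F1 over whitespace tokens; the real metric adds stemming and other preprocessing). A sanitized description still earns overlap credit from function words and scene nouns, and nothing in the score reflects the severity of what was omitted:

def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    # Simplified ROUGE-L: LCS-based F1 over lowercased whitespace tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "robber stabs the defender several times with knife near the door"
sanitized = "two people engage in physical interaction near the door"
accurate = "intruder stabs the defender with a knife near the door"

print(rouge_l(sanitized, reference))  # ~0.30, earned almost entirely from "near the door"
print(rouge_l(accurate, reference))   # ~0.76, higher only because more tokens overlap

The sanitized candidate earns a nontrivial score from incidental tokens alone; nothing in the metric registers that the decisive content word ("stabs") is absent.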

BLEU

Score: 0.15 | Correlation: ρ = 0.19


N-gram precision: 15%

1-gram matches: 4/18 words

2-gram matches: 0/11 bigrams


Issue: Worst correlation with humans. Completely ignores semantic meaning and danger level of events.

FV-Score provides the most interpretable and actionable feedback, clearly identifying what visual elements were missed.
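
The ρ values reported throughout this comparison are correlations between metric scores and human quality ratings, computed over a set of evaluated responses. A minimal sketch of that computation, assuming scipy is available (the score values below are placeholders, not our data):

from scipy.stats import pearsonr

# Placeholder scores for illustration only; not the benchmark's actual data.
human_ratings = [2.0, 7.5, 5.0, 9.0, 3.5, 6.0]    # human quality ratings per response
metric_scores = [3.5, 11.0, 7.0, 13.5, 5.0, 9.5]  # e.g., FV-Score (out of 15) per response

rho, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson rho = {rho:.2f} (p = {p_value:.3f})")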

FineVAU Leaderboard

Performance of state-of-the-art LVLMs on our FineVAU benchmark. Results are broken down by dimension (Location, Event, Entity, Attribute) and by attribute type within the Location and Entity dimensions. Higher scores indicate better performance.

Lighting/Env/Crowd/Time form the Location breakdown; Salient/Person/Vehicle/Others form the Entity breakdown.

| # | Model       | Location | Event | Entity | Attribute | All  | Lighting | Env  | Crowd | Time | Salient | Person | Vehicle | Others |
|---|-------------|----------|-------|--------|-----------|------|----------|------|-------|------|---------|--------|---------|--------|
| 1 | InternVL3   | 71.8     | 18.0  | 51.2   | 25.5      | 40.5 | 80.4     | 86.6 | 59.3  | 79.7 | 53.1    | 54.0   | 44.8    | 51.5   |
| 2 | Qwen2.5-VL  | 70.8     | 9.1   | 38.3   | 20.3      | 32.9 | 80.2     | 83.6 | 68.0  | 80.7 | 41.8    | 29.6   | 37.9    | 44.5   |
| 3 | LLaVA-VID   | 65.7     | 14.4  | 44.0   | 21.0      | 35.0 | 65.1     | 87.0 | 56.8  | 69.0 | 50.8    | 42.2   | 38.0    | 47.4   |
| 4 | LLaVA-OV    | 58.3     | 13.0  | 41.1   | 19.9      | 32.2 | 65.1     | 80.1 | 42.2  | 60.0 | 44.1    | 38.5   | 37.6    | 44.0   |
| 5 | VideoLLaMA3 | 40.3     | 6.5   | 24.3   | 10.2      | 19.3 | 44.1     | 64.7 | 30.0  | 35.4 | 27.4    | 20.8   | 22.2    | 27.5   |

Note: Results are based on zero-shot evaluation.

Want to submit your model? Follow the evaluation instructions in our GitHub repository.
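
For orientation only, a zero-shot evaluation loop might look like the following hypothetical sketch; the prompt wording and the query_lvlm call are assumptions for illustration, not our actual evaluation interface (see the repository for the real protocol):

PROMPT = (
    "Describe the anomaly in this video in three parts. "
    "What: the sequence of events, in order. "
    "Who: each participant and their role. "
    "Where: the location, key objects, and scene context."
)

def evaluate_zero_shot(model, videos):
    # Hypothetical loop: one query per video, no in-context examples (zero-shot).
    responses = {}
    for video in videos:
        # query_lvlm is a placeholder for the model's actual inference API.
        responses[video.id] = model.query_lvlm(frames=video.frames, prompt=PROMPT)
    return responses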

Related Work

FineVAU builds upon and advances several key areas in Video Anomaly Understanding and evaluation.

VAU Datasets and Benchmarks: UCA introduces dense captions for UCF-Crime videos. HAWK proposes synthetic descriptions and QA pairs, while Holmes-VAU provides multi-granularity annotations. ECVA introduces causal reasoning annotations. Our work differs by providing structured What/Who/Where annotations with human-aligned evaluation.

Evaluation Metrics: Traditional metrics like BLEU and ROUGE-L focus on lexical overlap, while LLM-based metrics like AnomEVAL and VAU-EVAL assess reasoning and fluency. FV-Score uniquely focuses on fine-grained visual grounding aligned with human perception.

Large Vision-Language Models: Recent LVLMs like InternVL3, Qwen2.5-VL, and LLaVA-Video show strong generalization on vision tasks. Our benchmark reveals their limitations in fine-grained temporal and spatial understanding of anomalies.

BibTeX

@inproceedings{pereira2026finevau,
  author    = {Pereira, Jo\~{a}o and Lopes, Vasco and Neves, Jo\~{a}o and Semedo, David},
  title     = {FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
}