FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

1NOVA LINCS, Lisboa 2DeepNeuronic, Covilhã 3University of Beira Interior, Covilhã 4FCT NOVA, Lisboa
FV-Score Evaluation Overview

FineVAU enables fine-grained evaluation of LVLMs on video anomaly understanding through What, Who, and Where dimensions, achieving superior alignment with human perception.

Abstract

Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fail to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter prioritizes language quality over factual relevance, often leading to subjective judgments misaligned with human perception.

In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained, and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding the key descriptive elements of anomalies in video: events (What), participating entities (Who), and location (Where).

Our benchmark introduces a) FV-Score, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a comprehensive dataset curated through a structured, fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information.

Human evaluation reveals that our proposed metric aligns with human perception of anomalies far better than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and events that exhibit strong visual cues.

FineW3 Dataset Examples

Our FineW3 dataset contains 1,544 videos with fine-grained annotations covering events (What), entities (Who), and location (Where). Below are examples demonstrating the granularity of our annotations across different anomaly categories.

Shoplifting Easy - Strong Visual Cues

📍 Events (What)
  • Woman and man walking between supermarket shelves
  • Woman stops, takes bags of items from shelf
  • Woman dumps some items on opposite shelf
  • Woman puts bags of items into her clothes (hiding them)
  • Woman walks forward and takes two more bags from shelf
  • Woman hides additional items in her clothes
  • Man picks up child and puts child down
  • Woman continues taking things from shelf
👥 Entities (Who)
  • Primary Actor: Woman who took bags from shelf and hid them in her clothes
  • Accomplice/Companion: Man who followed woman and interacted with child
  • Distraction: Child who was picked up and put down by man
  • Stolen Items: Multiple bags of merchandise hidden in clothing
🏢 Scene (Where)
  • Location: Supermarket interior between shelves
  • Key Objects: Shelves with merchandise, bags of items
  • Context: Retail theft with possible distraction tactics (child present)

Arrest Easy - Strong Visual Cues

📍 Events (What)
  • Man in black clothes and white shirt talks with man in white at door
  • Man in black pushes man in white from behind
  • Man in white is surrounded and subdued by two people
  • Man in white is dragged from left side of screen to table on right
  • During struggle, man is pulled from table to the right side of the screen
  • Man with black top and white pants picks up pager and speaks into it
  • Three women at door are instructed to leave room one by one
👥 Entities (Who)
  • Aggressor: Man in black who pushed man in white from behind
  • Target: Man in white who was surrounded, subdued, and dragged
  • Accomplices: Two men in black who dragged the man to the table
  • Authority Figure: Man in all black who gave instructions
  • Bystanders: Women at door, man with pager
🏢 Scene (Where)
  • Location: Indoor room with door and table
  • Key Objects: Door, table on the right, pager
  • Context: Forced removal/arrest situation with multiple participants

Robbery Difficult - Complex Sequence

📍 Events (What)
  • Man climbs over wall and walks to small door
  • Man opens door and enters
  • Another man comes out and fights with intruder
  • Defender pushes robber away from door and struggles with him
  • Robber stabs the defender several times with knife
  • Robber repeatedly hits defender's head with knife handle
  • Defender fights back with iron rod, hitting robber's shoulder and head
  • Robber jumps out window to escape
  • Defender returns to house and closes door
👥 Entities (Who)
  • Robber: Intruder who climbed wall, fought, stabbed, and escaped through window
  • Defender: Man who fought back, used iron rod, pulled robber out multiple times
🏢 Scene (Where)
  • Location: Residential property with wall and small door
  • Key Objects: Wall, small door, knife, iron rod, window
  • Context: Home invasion with violent confrontation and self-defense
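
The examples above all follow the same What/Who/Where structure. As an illustrative sketch (the field names are assumptions for exposition, not the dataset's actual serialization format), a single FineW3 record could be represented as follows:

from dataclasses import dataclass, field

@dataclass
class FineW3Record:
    # Illustrative What/Who/Where record; field names are assumed,
    # not FineW3's actual serialization format.
    video_id: str                 # hypothetical identifier
    category: str                 # e.g., "Robbery", "Shoplifting", "Arrest"
    difficulty: str               # e.g., "Easy", "Difficult"
    events: list[str] = field(default_factory=list)         # What: ordered events
    entities: dict[str, str] = field(default_factory=dict)  # Who: role -> description
    location: str = ""                                      # Where: scene location
    key_objects: list[str] = field(default_factory=list)    # Where: salient objects
    context: str = ""                                       # Where: scene-level context

# Example built from the "Robbery Difficult" annotations above:
record = FineW3Record(
    video_id="robbery_difficult_001",  # hypothetical ID
    category="Robbery",
    difficulty="Difficult",
    events=["Man climbs over wall and walks to small door",
            "Robber stabs the defender several times with knife"],
    entities={"Robber": "Intruder who climbed wall, fought, stabbed, and escaped",
              "Defender": "Man who fought back with an iron rod"},
    location="Residential property with wall and small door",
    key_objects=["wall", "small door", "knife", "iron rod", "window"],
    context="Home invasion with violent confrontation and self-defense",
)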

FV-Score vs. Other Metrics

Our FV-Score metric achieves superior correlation with human judgment (Pearson ρ: 0.61) compared to traditional n-gram metrics and LLM-based approaches. Below, we demonstrate how different metrics evaluate the same LVLM response, highlighting FV-Score's fine-grained and actionable feedback.

Ground truth annotations shown in dataset examples above

LVLM Response:

"A person approaches a residential property and enters through a door. Two individuals are present in the scene and appear to be engaged in some form of physical interaction near the entrance. One person is seen moving around the property, and there are objects visible including what appears to be a door and window. The scene takes place at a residential location during what appears to be daytime."

Comparison Across Evaluation Metrics

✓ FV-Score (Ours)

Overall Score: 3.5/15 | Correlation: ρ = 0.61


What (Events): 0.5/5

  • Missing: Climbing wall to break in
  • Missing: Stabbing with knife (multiple times)
  • Missing: Hitting head with knife handle
  • Missing: Fighting with iron rod
  • Missing: Escape through window
  • Vague: "physical interaction" vastly understates violent assault

Who (Entities): 1/6

  • Partial: Two individuals present (but roles completely missed)
  • Missing: Robber vs. Defender distinction
  • Missing: Knife and iron rod as weapons

Where (Location): 2/4

  • Correct: Residential property, door, window visible
  • Missing: Wall, small door specifications

Actionable: Model sanitized a violent assault into benign "physical interaction"; completely missed the home invasion, the weapons, and the life-threatening nature of the attack.
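
For reference, the aggregate score above follows from summing per-dimension points; a minimal sketch (assuming plain point summation, which matches the 0.5/5 + 1/6 + 2/4 = 3.5/15 arithmetic shown):

# Per-dimension (earned, available) points from the breakdown above.
dimensions = {
    "What": (0.5, 5),   # events: the violent actions were missed entirely
    "Who": (1.0, 6),    # entities: presence noted, but roles and weapons missed
    "Where": (2.0, 4),  # location: property/door/window correct, wall and small door missed
}

earned = sum(e for e, _ in dimensions.values())
total = sum(t for _, t in dimensions.values())
print(f"FV-Score: {earned}/{total}")        # FV-Score: 3.5/15
print(f"Normalized: {earned / total:.2f}")  # Normalized: 0.23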

Human Judgment

Quality Rating: 2/10


"This is a catastrophic failure. The description talks about 'physical interaction' when the video clearly shows a violent home invasion with stabbing, beating with weapons, and a struggle for survival. The model completely sanitized the extreme violence and missed all weapons. Calling this 'physical interaction' is like calling a fire 'warmth.' Completely useless for security monitoring."


Critical failure - Dangerous mischaracterization of violent crime

VAU-EVAL

Score: 6.8/10 | Correlation: ρ = 0.53


  • Classification Accuracy: 7/10
  • Key Concept Alignment: 6/10
  • Linguistic Fluency: 9/10
  • Informativeness: 5/10
  • Factual Consistency: 7/10

Issue: High score despite missing life-threatening violence. Rewards grammatical fluency while ignoring critical safety failure.

AnomEVAL

Score: 6.1/10 | Correlation: ρ = 0.42


  • Basic Reasoning: 6/10
  • Consistency: 8/10
  • Hallucination: High (severe sanitization)

Issue: Evaluates reasoning structure, not visual grounding. Completely misses that "physical interaction" is a euphemism for violent assault with weapons.

ROUGE-L

Score: 0.22 | Correlation: ρ = 0.47


Lexical overlap with reference: 22%

Shared words: "door", "window", "property", "two"


Issue: Only measures word overlap. "Physical interaction" vs. "stabbing with knife" treated as simple mismatch. Ignores semantic severity.
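
To make this limitation concrete, the following self-contained sketch implements a simplified ROUGE-L (LCS-based F1 over whitespace tokens; the real metric adds stemming and other preprocessing). A sanitized description still earns overlap credit from function words and scene nouns, and nothing in the score reflects the severity of what was omitted:

def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    # Simplified ROUGE-L: LCS-based F1 over lowercased whitespace tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "robber stabs the defender several times with knife near the door"
sanitized = "two people engage in physical interaction near the door"
accurate = "intruder stabs the defender with a knife near the door"

print(rouge_l(sanitized, reference))  # ~0.30, earned almost entirely from "near the door"
print(rouge_l(accurate, reference))   # ~0.76, higher only because more tokens overlap

The sanitized candidate earns a nontrivial score from incidental tokens alone; nothing in the metric registers that the decisive content word ("stabs") is absent.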

BLEU

Score: 0.15 | Correlation: ρ = 0.19


N-gram precision: 15%

1-gram matches: 4/18 words

2-gram matches: 0/11 bigrams


Issue: Worst correlation with humans. Completely ignores semantic meaning and danger level of events.

FV-Score provides the most interpretable and actionable feedback, clearly identifying what visual elements were missed.
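
The ρ values reported throughout this comparison are correlations between metric scores and human quality ratings, computed over a set of evaluated responses. A minimal sketch of that computation, assuming scipy is available (the score values below are placeholders, not our data):

from scipy.stats import pearsonr

# Placeholder scores for illustration only; not the benchmark's actual data.
human_ratings = [2.0, 7.5, 5.0, 9.0, 3.5, 6.0]    # human quality ratings per response
metric_scores = [3.5, 11.0, 7.0, 13.5, 5.0, 9.5]  # e.g., FV-Score (out of 15) per response

rho, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson rho = {rho:.2f} (p = {p_value:.3f})")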

FineVAU Leaderboard

Performance of state-of-the-art LVLMs on our FineVAU benchmark. Results are broken down by dimension (Location, Event, Entity, Attribute) and by attribute type within the Location and Entity dimensions. Higher scores indicate better performance.

Lighting/Env/Crowd/Time form the Location breakdown; Salient/Person/Vehicle/Others form the Entity breakdown.

| # | Model       | Location | Event | Entity | Attribute | All  | Lighting | Env  | Crowd | Time | Salient | Person | Vehicle | Others |
|---|-------------|----------|-------|--------|-----------|------|----------|------|-------|------|---------|--------|---------|--------|
| 1 | InternVL3   | 71.8     | 18.0  | 51.2   | 25.5      | 40.5 | 80.4     | 86.6 | 59.3  | 79.7 | 53.1    | 54.0   | 44.8    | 51.5   |
| 2 | Qwen2.5-VL  | 70.8     | 9.1   | 38.3   | 20.3      | 32.9 | 80.2     | 83.6 | 68.0  | 80.7 | 41.8    | 29.6   | 37.9    | 44.5   |
| 3 | LLaVA-VID   | 65.7     | 14.4  | 44.0   | 21.0      | 35.0 | 65.1     | 87.0 | 56.8  | 69.0 | 50.8    | 42.2   | 38.0    | 47.4   |
| 4 | LLaVA-OV    | 58.3     | 13.0  | 41.1   | 19.9      | 32.2 | 65.1     | 80.1 | 42.2  | 60.0 | 44.1    | 38.5   | 37.6    | 44.0   |
| 5 | VideoLLaMA3 | 40.3     | 6.5   | 24.3   | 10.2      | 19.3 | 44.1     | 64.7 | 30.0  | 35.4 | 27.4    | 20.8   | 22.2    | 27.5   |

Note: Results are based on zero-shot evaluation.

Want to submit your model? Follow the evaluation instructions in our GitHub repository.
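
For orientation only, a zero-shot evaluation loop might look like the following hypothetical sketch; the prompt wording and the query_lvlm call are assumptions for illustration, not our actual evaluation interface (see the repository for the real protocol):

PROMPT = (
    "Describe the anomaly in this video in three parts. "
    "What: the sequence of events, in order. "
    "Who: each participant and their role. "
    "Where: the location, key objects, and scene context."
)

def evaluate_zero_shot(model, videos):
    # Hypothetical loop: one query per video, no in-context examples (zero-shot).
    responses = {}
    for video in videos:
        # query_lvlm is a placeholder for the model's actual inference API.
        responses[video.id] = model.query_lvlm(frames=video.frames, prompt=PROMPT)
    return responses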

Related Work

FineVAU builds upon and advances several key areas in Video Anomaly Understanding and evaluation.

VAU Datasets and Benchmarks: UCA introduces dense captions for UCF-Crime videos. HAWK proposes synthetic descriptions and QA pairs, while Holmes-VAU provides multi-granularity annotations. ECVA introduces causal reasoning annotations. Our work differs by providing structured What/Who/Where annotations with human-aligned evaluation.

Evaluation Metrics: Traditional metrics like BLEU and ROUGE-L focus on lexical overlap, while LLM-based metrics like AnomEVAL and VAU-EVAL assess reasoning and fluency. FV-Score uniquely focuses on fine-grained visual grounding aligned with human perception.

Large Vision-Language Models: Recent LVLMs like InternVL3, Qwen2.5-VL, and LLaVA-Video show strong generalization on vision tasks. Our benchmark reveals their limitations in fine-grained temporal and spatial understanding of anomalies.

BibTeX

@inproceedings{pereira2026finevau,
  author    = {Pereira, Jo\~{a}o and Lopes, Vasco and Neves, Jo\~{a}o and Semedo, David},
  title     = {FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
}