Automated cricket scene classification using vision-language models
2026
Vision-Language Models (VLMs) have demonstrated impressive capabilities in general-
purpose multi-modal tasks, but their adaptation to specialized sports analysis remains
relatively unexplored. This paper bridges that gap by investigating the effectiveness of VLMs for
automated cricket scene classification, addressing critical bottlenecks in current workflows
that require 45-50 minutes of human intervention. We explore three distinct approaches—
zero-shot prompting, few-shot prompting, and Parameter Efficient Fine-Tuning (PEFT) with
LoRA—across three fundamental cricket tasks: event marker detection, start of delivery
identification, and scoreboard parsing. Our experiments use datasets
comprising 30,000 labeled high-resolution frames spanning 25 matches with balanced
distributions across diverse conditions and production styles. Fine-tuned models using PEFT
with LoRA achieve 90% accuracy in event marker detection, 98% accuracy in scoreboard
parsing, and 95% precision in delivery detection, while requiring significantly less labeled
data than traditional approaches. Notably, few-shot prompting approaches achieve
competitive performance (84-93% accuracy across tasks) without any training data. Our
findings establish a new benchmark for efficiency and accuracy in cricket scene analysis
while providing a scalable solution for real-time analysis.
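To illustrate the PEFT-with-LoRA approach named above, the sketch below shows the core idea in PyTorch: the pretrained weights of a layer are frozen, and only a low-rank residual update is trained. This is a minimal, self-contained illustration, not the paper's implementation; the layer names, rank `r=8`, and scaling `alpha=16` are assumptions for the example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with trainable low-rank adapters (LoRA).

    Output: base(x) + (alpha / r) * B(A(x)), where A and B are the only
    trainable parameters. B is zero-initialized, so training starts from
    the pretrained model's behavior.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Hypothetical example: adapt one projection layer of a vision backbone.
base = nn.Linear(768, 768)
adapted = LoRALinear(base, r=8, alpha=16)

trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable params: {trainable} / {total}")
```

Because only the rank-8 adapter matrices are updated (here roughly 2% of the layer's parameters), fine-tuning needs far less memory and labeled data than full fine-tuning, which is the efficiency property the abstract reports.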