Automated cricket scene classification using vision-language models
2026
Vision-Language Models (VLMs) have demonstrated impressive capabilities in general-
purpose multi-modal tasks, but their adaptation to specialized sports analysis remains
relatively unexplored. This paper bridges that gap by investigating the effectiveness of VLMs for
automated cricket scene classification, addressing critical bottlenecks in current workflows
that require 45-50 minutes of human intervention. We explore three distinct approaches—
zero-shot prompting, few-shot prompting, and Parameter Efficient Fine-Tuning (PEFT) with
LoRA—across three fundamental cricket tasks: event marker detection, start of delivery
identification, and scoreboard parsing. Our experiments use datasets
comprising 30,000 labeled high-resolution frames spanning 25 matches with balanced
distributions across diverse conditions and production styles. Fine-tuned models using PEFT
with LoRA achieve 90% accuracy in event marker detection, 98% accuracy in scoreboard
parsing, and 95% precision in delivery detection, while requiring significantly less labeled
data than traditional approaches. Notably, few-shot prompting approaches achieve
competitive performance (84-93% accuracy across tasks) without any training data. Our
findings establish a new benchmark for efficiency and accuracy in cricket scene analysis
while providing a scalable solution for real-time analysis.
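To illustrate the PEFT-with-LoRA approach named above, the sketch below shows the core idea in PyTorch: the pretrained weights of a layer are frozen, and only a low-rank residual update is trained. This is a minimal, self-contained illustration, not the paper's implementation; the layer names, rank `r=8`, and scaling `alpha=16` are assumptions for the example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with trainable low-rank adapters (LoRA).

    Output: base(x) + (alpha / r) * B(A(x)), where A and B are the only
    trainable parameters. B is zero-initialized, so training starts from
    the pretrained model's behavior.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Hypothetical example: adapt one projection layer of a vision backbone.
base = nn.Linear(768, 768)
adapted = LoRALinear(base, r=8, alpha=16)

trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable params: {trainable} / {total}")
```

Because only the rank-8 adapter matrices are updated (here roughly 2% of the layer's parameters), fine-tuning needs far less memory and labeled data than full fine-tuning, which is the efficiency property the abstract reports.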