Carnegie Mellon University

Deep Learning for Automated Video Assistant Referee System (VARS) in Football

poster
posted on 2024-07-03, 19:33 authored by Ahmad Alhallaq, Ahmed Issaoui

Football refereeing has changed drastically in recent years, most visibly through the introduction of the video assistant referee (VAR). VAR is a team of referees who review decisions made by the main referee by watching footage of the relevant incidents in a video operation room, where monitors display different camera angles. VAR is used in several situations, including offences leading to a goal, offside decisions, penalty decisions, and red card incidents. However, VAR reviews are carried out manually by match officials. As a result, decisions remain subjective, depending on how each referee interprets and applies the rules, and the game becomes slower and longer, because VAR decisions come long after the action has taken place. This creates a need for an automated system that supports match officials by building on advances in sports video understanding.
Sports video understanding has grown in popularity as a research topic in recent years. Deep learning techniques have shown impressive results on tasks such as player detection and tracking (Cioppa et al., 2020), tactics analysis (Suzuki et al., 2019), and prediction in football (Honda et al., 2022). Among these advances is a new Video Assistant Referee System (VARS), presented at the 9th International Workshop on Computer Vision in Sports (CVsports) at CVPR 2023. In the paper “VARS: Video Assistant Referee System for Automated Soccer Decision Making from Multiple Views”, the proposed system leverages the latest findings in multi-view video analysis to provide real-time feedback to the referee and help them make informed decisions that can affect the outcome of a game. It is intended to automatically predict all fouls and suggest appropriate sanctions for the players. It uses a video dataset of soccer fouls from multiple camera views, annotated with extensive foul descriptions by a professional soccer referee. The authors focus their analysis on classifying foul types and evaluating their severity to identify the sanction for the player (Held et al., 2023).
Specifically, the paper uses a dataset of 3,901 actions extracted from 500 soccer games from six main European leagues, covering three seasons from 2014 to 2017. Each action comes with multi-view clips of 5 seconds around the action, annotated by a professional referee. VARS takes multiple video clips as input, showing the same action from different views. The videos are fed into a video encoder that extracts a vector of spatiotemporal features. The multi-task VARS learns to leverage these features to improve its predictions for both tasks (Held et al., 2023).
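The multi-view, multi-task idea described above can be illustrated with a minimal sketch. It is not the architecture from Held et al. (2023): the element-wise pooling of per-view features, the layer sizes, the class counts, and all function names are assumptions made for illustration, and the weights are random rather than trained. The per-view vectors stand in for the output of a video encoder.

```python
import math
import random


def pool_views(view_features, mode="max"):
    """Combine per-view feature vectors element-wise into one shared vector."""
    columns = list(zip(*view_features))
    if mode == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]


def linear(features, weights, bias):
    """One dense layer: returns one logit per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_head(n_classes, n_features, rng):
    """Random, untrained weights; a real system would learn these."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_features)]
               for _ in range(n_classes)]
    return weights, [0.0] * n_classes


def predict(view_features, foul_head, severity_head):
    """Both task heads read the same pooled multi-view representation."""
    shared = pool_views(view_features)
    foul_probs = softmax(linear(shared, *foul_head))
    severity_probs = softmax(linear(shared, *severity_head))
    return foul_probs, severity_probs


rng = random.Random(0)
N_FEATURES, N_FOUL_TYPES, N_SEVERITIES = 8, 4, 3  # hypothetical sizes
foul_head = make_head(N_FOUL_TYPES, N_FEATURES, rng)
severity_head = make_head(N_SEVERITIES, N_FEATURES, rng)

# Stand-in for the encoder output of two camera views of the same action.
views = [[rng.random() for _ in range(N_FEATURES)] for _ in range(2)]
foul_probs, severity_probs = predict(views, foul_head, severity_head)
print(foul_probs, severity_probs)
```

Because both heads share one pooled vector, the number of camera views can vary per action without changing the heads.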
However, VARS achieves low accuracy on both tasks, foul classification and offence severity classification. This is mainly due to an insufficient amount of data and to the data being highly imbalanced. The aim of this independent study is to improve on the video assistant referee system proposed in the paper. This can be achieved in two ways. The first is to extend the existing data by automatically extracting foul-related sequences from video recordings of football games. This will be done using several methods based on the video and audio tracks, including detecting the referee's whistle to identify foul instances, and identifying points of contact through a measure of crowdedness in the images to determine ‘not foul’ occurrences. The automatic extraction of fouls will most likely produce a single-view dataset, as opposed to the multi-view samples used in the paper, although the two datasets might be integrated. Although the manual annotation method used in the paper is effective, it is very time-consuming and does not scale well, and hence does not allow a large training dataset to be created. The second way is to experiment with different techniques, including different deep learning architectures, loss functions, and feature engineering methods.
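As a rough illustration of the two extraction cues mentioned above, the sketch below flags audio frames dominated by a single whistle-like frequency and counts pairs of players standing close together. It is a sketch under stated assumptions, not the method used in this study: the 3500 Hz whistle frequency, the frame length, the thresholds, and the function names are all hypothetical, and real broadcast audio with crowd noise would need a more robust detector. Player positions are assumed to come from a separate detector.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Squared DFT magnitude at the bin nearest `freq` (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def whistle_frames(samples, sample_rate, freq=3500.0, frame_len=1024, ratio=0.3):
    """Indices of frames whose energy is concentrated near `freq`.

    `freq` and `ratio` are assumed values, not measured whistle properties.
    """
    hits = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        if energy == 0.0:
            continue  # silent frame
        # Normalised so that a pure tone exactly on the bin scores about 1.
        tonal = goertzel_power(frame, sample_rate, freq) / (energy * frame_len / 2)
        if tonal >= ratio:
            hits.append(start // frame_len)
    return hits


def crowdedness(centers, radius):
    """Number of player pairs whose centres are closer than `radius`."""
    close = 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < radius:
                close += 1
    return close


def tone(freq, n, sample_rate):
    """Synthetic sine wave used only to exercise the detector."""
    return [math.sin(2.0 * math.pi * freq * t / sample_rate) for t in range(n)]


SAMPLE_RATE = 16000
# Three synthetic frames: a low tone, a whistle-like tone, then silence.
audio = tone(440.0, 1024, SAMPLE_RATE) + tone(3500.0, 1024, SAMPLE_RATE) + [0.0] * 1024
print(whistle_frames(audio, SAMPLE_RATE))

# Hypothetical player centres in pixels: two close together, one far away.
print(crowdedness([(0, 0), (1, 0), (10, 10)], radius=2.0))
```

One possible reading of the approach is that a high crowdedness score with no whistle nearby marks a candidate ‘not foul’ contact, while a whistle marks a candidate foul.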
The goal of the research is to experiment with different approaches to increasing the accuracy of the existing system. The work will therefore proceed in parallel along two sets of contributions: one on the automatic extraction of foul sequences from videos, and the other on a better-performing learning architecture for automatic VAR. The two methods will be evaluated both separately and together in terms of the accuracy they achieve.
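Because the data is described as highly imbalanced, plain accuracy can look good even when rare classes are never predicted. The sketch below contrasts it with balanced accuracy (the mean of per-class recall). Using balanced accuracy is an assumption here, one possible way to read "accuracy" for this evaluation, and the class labels are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: a frequent class and a rare one.
y_true = ["tackle"] * 8 + ["elbow"] * 2
y_pred = ["tackle"] * 10  # a model that always predicts the frequent class

print(accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

In this toy case the always-majority model scores well on plain accuracy but only at chance level on balanced accuracy, which is the kind of gap an imbalanced foul dataset can hide.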

History

Date

2024-04-30

Academic Program

  • Computer Science

Advisor(s)

Gianni Di Caro
