
Proceedings of the 52nd Hawaii International Conference on System Sciences | 2019

Predicting the Outcome of a Football Game: A Comparative Analysis of Single and Ensemble Analytics Methods

Enes Eryarsoy, Istanbul Sehir University, [email protected]
Dursun Delen, Oklahoma State University, [email protected]

Abstract

As analytical tools and techniques advance, increasingly large numbers of researchers apply these techniques to a variety of different sports. With nearly 4 billion followers, association football, or soccer, is estimated to be the most popular sport for fans across the world by a large margin. The objective of this study is to develop a model to predict the outcomes of soccer (or association football) games (win-loss-draw), and to determine the factors that influence game outcomes. We used 10 years of comprehensive game-level data spanning the years 2007-2017 in the Turkish Super League, and tested a variety of classifiers to identify the most promising methods for outcome prediction.

1. Introduction

Value-generating management of both structured and unstructured data in sports falls into the vast field of sport analytics. Predictive analytics has been applied successfully to many different sports such as football [1], basketball [2, 3], cricket [4], rugby [5], and hockey [6]. Predicting the outcome of any sports game is naturally one of the most obvious objectives in sports analytics. However, outcome prediction for almost any sports game is a challenging task due to its dynamically changing and stochastic nature. Many stakeholders, such as odds traders, fans, or team managers, are interested in deploying such methods not merely to predict the outcomes but even more to understand the underlying driving factors behind success or failure.

Association football (for simplicity hereafter referred to as "soccer") is not only the world's most popular sport [7, 8, 9] but also, not surprisingly, the largest in the sports betting market. Due to advances in analytics, during recent years there has been a surge of interest in predicting the outcomes of a soccer game. Fans, sports gamblers, managers, researchers, and software vendors are all interested in accurately predicting soccer games.

Even though game prediction in soccer, in its most basic format, has three outcomes (win-loss-draw), there are many other outcomes that may be of interest for prediction. The most popular soccer wagers involve not only the outcome, but also the number of goals, the number of corners, free kicks, or even cards. Club managers are also interested in analytics in order to grow their fan bases or their game audience.

However, we believe that the number of studies on soccer game outcome prediction is low due primarily to the following reasons:
i) Lack of available data: the majority of papers make use of limited data.
ii) The sport's dynamic nature: soccer outcome prediction still remains in analytical backwaters because of the sport's highly dynamic nature. A "win-loss-draw" game is much more difficult to predict than a "win-loss" game. Also, each tournament - such as playoffs, regular seasons, or the European championship league - presents different modeling challenges.
iii) We believe that most of the "Moneyballing" efforts are made by club managers and sports gamblers who are not willing to publish their results.

The rest of this paper is organized as follows. In Section 2, we review the existing academic literature. In Section 3, we outline our methodology and our dataset. Section 4 is dedicated to our analysis and results.

2. Literature Review

Academic literature on soccer game outcome prediction can be roughly categorized based on i) the kind of data used, ii) the prediction stage (i.e. during or before the game, or even the season), iii) the type of outcome to be predicted, and iv) the technique employed.

i) Kind of data used: Most of the data used in sports analytics are structured data. The majority of studies use structured game/player data [10], or structured odds data based on past betting quotas [11]. However, there are also studies that make use of unstructured data, such as sentiment analysis of tweets [12] or Tumblr posts [13].

ii) Prediction stage: Studies also differ in terms of the prediction window used. While some studies focus on live data, such as player trajectories, to assess player performance [14, 15], others make use of data from the first half of a game to predict the outcome of the second half. This is due to the fact that the betting window is still open at halftime. Our study uses all data available up to the game week.

iii) The type of outcome to predict: Different studies attempt to predict different kinds of outcomes, such as the number of goals scored [16, 17, 18], the outcome of the game directly in terms of "win-loss-draw" [19], the efficiencies or inefficiencies of the betting market [20], or soccer tipsters' behavior and performance [21].

iv) Techniques used: Different studies used different techniques for outcome prediction. While the majority of earlier studies employed methods from statistics and probability, such as Poisson distributions [22], Markov Chain Monte Carlo iterative simulations [23], and discrete choice regression models for "win-loss-draw" scenarios [18, 19, 21], newer studies tend to use data mining methods such as Naïve Bayes [11, 24], Bayesian Belief Networks [25], Support Vector Machines [26], neural and genetic optimization [26], or combinations of various machine learning algorithms [26].

Following the classification above, this paper uses structured data to predict the outcome of a soccer game in terms of "win-loss-draw" before the game, employing a variety of machine learning algorithms.

3. Methodology and Dataset

In this research, we follow the most popular data mining framework, CRISP-DM [27]. CRISP-DM has six sequential steps: (i) understanding the domain and developing the goals for the study; (ii) identifying, accessing, and understanding the relevant data sources; (iii) pre-processing, cleaning, and transforming the relevant data; (iv) developing models using comparable analytical techniques; (v) evaluating and assessing the validity and utility of the models against each other and against the goals of the study; and (vi) deploying the models for use in the decision-making process. Following this framework, we were able to systematically conduct this study and improve the likelihood of accurate and reliable results.

In order to assess the predictive power of the different models we employed, we used cross-validation, a widely used statistical technique for comparing the accuracy of multiple models. Even though, given a sufficiently large dataset, a single random split into two (or three, for neural networks, for instance) subsets may give sufficient accuracy, we chose cross-validation because each record represents a game and may cover yet another valuable and different aspect. More details of this cross-validation are given in Section 3.3. A graphical depiction of the overall methodology employed in this study is shown in Figure 1.

[Figure 1. A Graphical Depiction of the Analytics Methodology: Sports Data → Data Preprocessing (merging, aggregating, cleaning, selecting, transforming) → Pre-processed Data → Training and Validation Datasets → Model Building (Naïve Bayes, Decision Trees, Ensembles: Random Forest, Boosted Trees) → Trained Models → Model Evaluation (prediction accuracy, sensitivity, specificity, explanation, variable importance), illustrated with a confusion matrix (TP, FP, FN, TN) and a variable importance chart.]
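To make the workflow in Figure 1 concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is an illustration under assumptions, not the authors' code: it assumes scikit-learn and pandas, a hypothetical games.csv file with a hypothetical "result" target column, and it simply drops incomplete rows even though the models used later in the paper treat missing values differently.

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def run_pipeline(csv_path="games.csv", target="result"):
    # Data preprocessing: the merge/aggregate/clean/select/transform steps of Figure 1
    # are collapsed here into reading one file and dropping incomplete rows.
    df = pd.read_csv(csv_path).dropna()
    X = df.drop(columns=[target]).select_dtypes("number").to_numpy()
    y = df[target].to_numpy()

    models = {
        "Naive Bayes": GaussianNB(),
        "Decision Tree (CART-style)": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
        "Boosted Trees": HistGradientBoostingClassifier(random_state=0),
    }
    # Model evaluation: stratified 10-fold cross-validation, as described in Section 3.4.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean CV accuracy = {scores.mean():.3f}")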

3.1. Data

The sample data for our study were collected from the Turkish Super League, using a variety of sources and means, including hand collection. The initial dataset included 3,060 game-level records from 33 teams, spanning 10 complete seasons (2007-2017).

The regular soccer season in the Turkish Super League lasts 34 weeks. As the top flight of the country's football system, it relegates the three lowest performers at the end of the season to the 1st League. The top 4-5 performers get to represent the country in the UEFA Europa League (formerly called the UEFA Cup), depending on the national team's performance in the UEFA League. Therefore, towards the end of the season, the teams at the bottom and top become more competitive while the teams in the mid-section may become reluctant, and prediction becomes more difficult. Also, during the first 5 weeks of the season, the transfer window is still open, and teams keep signing players. We therefore decided to filter out the first 5 and last 4 games of each season, and included only the games played during weeks 6 to 30. We also discarded data from 27 games where points were awarded by "decision of the referee".

The variables we used are provided in Table 1. Each record in Table 1 contains data about one team in one game.

Table 1. Description of the team- and game-based variables used in this study

Variable | Cat | Explanation
Season | ID | Which season?
Week | ID | Which week of the season?
Team | ID | Which team?
WeekID | ID | Game ID for that week (1-9) for each week
CoachPosition | TM | If the coach was a former player, this is his position
LeagueStanding | TM | The current ranking in the league
Exist10Year | TM | Has the team been in the Super League consistently?
CoachAge | TM | Age of the coach
CoachNative | TM | Is the coach local?
MgtCntH | TM | How many different management teams during the season?
FormationConsistency | TM | How many different formations have been used so far during the season
NumForeign | TM | Number of foreign players played in the team
TeamValue | TM | Team value in local currency
Attend | GM | Number of spectators for the home team
Hour | GM | Game starting time
Formation | GM | Game formation (i.e. 4-2-3-1, 4-3-3, 4-4-2, 4-5-1…)
Home | GM | Did the team play as the home team?
PrcPossbPntsEarned | GM | Percentage of maximum possible points earned during the season
FrgnPlayerConsistency | GM | The consistency in the number of foreign-national players participating in the game
Plyr30PlusConsistency | GM | Number of players aged 30+
CoachConsistInSeas | GM | Same coach up to this week?
MgtConsistInSeas | GM | Same club management up to this week?
30PlsSubsUsedPrev | GM | Number of 30+ substitutes from the bench included in the previous game
CoachCountInSeas | GM | How many different coaches up to that week?
PassesAttempted | GM | Weighted moving average of passes attempted during recent games
PassesComplete | GM | Weighted moving average of passes completed during recent games
Opportunities | GM | Weighted moving average of number of opportunities during recent games
PercOpportunitiesScored | GM | Weighted moving average of percent opportunities used during recent games
Assists | GM | Weighted moving average of assists during recent games
ShotsOnTarget | GM | Weighted moving average of shots on target during recent games
Shots | GM | Weighted moving average of number of shots during recent games
Cross | GM | Weighted moving average of number of crosses made during recent games
CrossComplete | GM | Weighted moving average of number of crosses completed during recent games
Reception | GM | Weighted moving average of number of receptions during recent games
Intercepts | GM | Weighted moving average of number of intercepts during recent games
PosessLost | GM | Weighted moving average of possessions lost during recent games
Corners | GM | Weighted moving average of corner kicks awarded during recent games
Offside | GM | Weighted moving average of number of offsides during recent games
FoulsCommitted | GM | Weighted moving average of fouls committed during recent games
FoulsAwarded | GM | Weighted moving average of fouls awarded during recent games
FreeKick | GM | Weighted moving average of number of free kicks awarded during recent games
YellowCard | GM | Weighted moving average of yellow cards during recent games
RedCard | GM | Weighted moving average of red cards during recent games
PassIncomplete | GM | Weighted moving average of incomplete passes during recent games
PercShotsOnTarget | GM | Weighted moving average of percentage of shots on target during recent games
PercClearCross | GM | Weighted moving average of percentage of clear crosses during recent games
PercPassComplete | GM | Weighted moving average of percentage of passes completed during recent games
AveragePlayerAges | GM | Weighted moving average of average player ages during recent games
NumPlayers30plus | GM | Weighted moving average of number of 30+ players during recent games
SubValuePrevWeek | GM | Total value of the players on the bench during the previous game
Scored | O1 | Goals scored
Conceded | O1 | Goals conceded
Result | O2 | Win, Loss, or Draw

ID: identifier variables; GM: game-related variables that change for each game; TM: team-level variables that usually remain fixed throughout the season; O1: output variable for regression models; O2: output variable for classification models.

3.2. Data Preprocessing

In the formulation of the dataset in Table 1, each row (or tuple) represents a team's performance during a game. Therefore, there are two rows in each game, one for the home team and one for the away team. We converted and reformulated Table 1, the "team"-based dataset, into a "game"-based dataset by representing each game with one record. When doing so, we calculated and used the differences between the measures of the home and away teams for each game. We then set our target variables to "Score Difference" and "Result" for the regression and classification tasks, respectively.

The original dataset (Table 1) contained a significantly high amount of missing values (about 11%). These were mainly data that we were not able to collect, due to unavailability, and they typically belonged to teams that were promoted to the top-flight league or relegated to the 1st League (lower league). Converting this "team"-based dataset into a "game"-based dataset for analysis increased the number of missing values (to 17%), as some of the missing values entered into the difference calculations.

We concluded that the missing data are not missing at random (MAR), and the number of complete instances (only 240 rows) was not sufficiently large to perform a reliable imputation. We therefore, for the sake of simplicity, did not perform imputation.
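As an illustration of this reformulation, the sketch below pairs the home and away rows of each game and keeps their feature differences plus the two target variables. It assumes pandas and hypothetical column names (game_id, home, scored, and a list of numeric feature columns); it is not the authors' actual preprocessing code. Note that a missing value on either side makes the difference missing as well, which is why the missing-value rate rises after the conversion.

import pandas as pd

def team_rows_to_game_rows(team_df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    # One row per team per game in team_df; one row per game in the result.
    home = team_df[team_df["home"] == 1].set_index("game_id")
    away = team_df[team_df["home"] == 0].set_index("game_id")

    # Home-minus-away differences of the team measures (NaN minus anything stays NaN).
    game_df = (home[feature_cols] - away[feature_cols]).add_suffix("_diff")

    # Targets: score difference for regression, categorical result for classification.
    game_df["score_diff"] = home["scored"] - away["scored"]

    def to_result(diff):
        if pd.isna(diff):
            return None
        return "Win" if diff > 0 else ("Loss" if diff < 0 else "Draw")

    game_df["result"] = game_df["score_diff"].map(to_result)
    return game_df.reset_index()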
3.3. Methods

In this study, we mainly used three popular prediction techniques and compared their performances against each other: Naïve Bayes, Decision Trees, and ensemble models. Due to the high amount of missing values in our dataset, we could not employ some techniques such as Artificial Neural Networks, k-Nearest Neighbor, Logistic Regression, or Support Vector Machines. Because of the non-random nature of the missing values in several of the critical independent variables, we could neither impute the missing values nor exclude them from the dataset. Therefore, we chose to use predictive analytics methods that allow and properly handle missing values in both the training and testing sections of the final dataset.

Naïve Bayes
Naïve Bayes is one of the simplest and also one of the fastest techniques used in data mining and often performs surprisingly well [28]. However, it is known that the algorithm may underperform when its underlying independence assumption is not satisfied. The algorithm works well with almost any kind of dataset.
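One common way a Naïve Bayes classifier can tolerate missing inputs, and the reason it fits the constraint described above, is simply to leave the missing attributes out of the likelihood product for a given record. The sketch below shows this for Gaussian likelihoods; it is an illustrative convention, not necessarily the exact variant used in this study, and the parameter dictionaries are assumed to have been estimated beforehand.

import numpy as np

def nb_predict_with_missing(x, priors, means, variances):
    # x: 1-D feature vector that may contain NaNs.
    # priors/means/variances: dicts mapping class label -> prior and per-feature arrays.
    observed = ~np.isnan(x)                     # skip missing attributes entirely
    best_label, best_logp = None, -np.inf
    for label, prior in priors.items():
        m = means[label][observed]
        v = variances[label][observed]
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * v) + (x[observed] - m) ** 2 / v)
        logp = np.log(prior) + log_likelihood
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label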

Decision Trees
The technique recursively separates observations into branches in order to construct a tree that achieves the highest possible prediction accuracy. In the recursive separation, the technique may use different mathematical splitting criteria such as information gain, the Gini index, or Chi-square statistics. We chose to use classification and regression trees (CART), which were initially developed by Breiman, Friedman, Olshen, and Stone [29]. This specific decision tree algorithm is capable of modeling both classification- and regression-type problems and also performs well even with missing values.

Ensemble Methods
We used two ensemble methods, Gradient Boosted Trees [30] and Random Forest [29]. Both methods rely on a decision tree algorithm, CART, in their standard setups. They handle missing data internally and automatically learn the best imputation value for missing values based on the training loss. They also almost consistently outperform single decision trees. While Gradient Boosted Trees uses very shallow regression trees and boosting to create the ensemble, Random Forest creates a number of decision trees where each tree model is learned on different subsets of records and columns.

3.4. Evaluation Criteria

A common way of evaluation is to split the data into two subsets, one for training and one for testing (sometimes a third subset is used, as in the case of Neural Networks or Genetic Algorithms). Typically, two-thirds of the dataset is used for model building and one-third is left for testing. However, such single splits often yield results that are prone to sampling bias regardless of the sampling technique. In order to minimize this bias, many data scientists opt for cross-validation. In k-fold cross-validation, the dataset is split into k randomly assigned, mutually exclusive subsets of roughly the same size. Each model is trained and tested k times: each time, (k-1) folds are used for learning and the remaining fold is used for testing. The cross-validation estimate of the accuracy is calculated by taking the average of the k individual accuracy measures. This usually increases the total running time; however, the result is often less prone to overfitting and randomness. The overall accuracy (CV) of the cross-validation is calculated by taking the average of the accuracies of each fold (A_i), as in (1):

$CV = \frac{1}{k}\sum_{i=1}^{k} A_i$    (1)

As cross-validation accuracy is influenced by the randomness in assigning records to folds, a common practice is to stratify the folds so that they do not suffer from the class imbalance problem. In our study, we set k = 10.

We then used three performance criteria to compare the prediction performances of the selected models, where TP, TN, FP, and FN denote true positives (accurate predictions of wins), true negatives (accurate predictions of losses), false positives (losses or draws falsely predicted as wins), and false negatives (wins falsely predicted as losses or draws). We compute accuracy, sensitivity, and specificity as in Equations 2, 3, and 4, respectively.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$    (2)

$Sensitivity = \frac{TP}{TP + FN}$    (3)

$Specificity = \frac{TN}{TN + FP}$    (4)

4. Results

Outcome prediction for soccer games presents more difficulties than for most other sports, where there is only one winner, i.e. two classes ("win" and "loss"). After the initial analysis, we decided to analyze the data in two different ways:
i) Win/Loss/Draw prediction
ii) Points/No Point prediction.
The outcome frequencies for the 10 seasons are given in Table 2.

Table 2. Win/Loss/Draw and Points/No Point frequencies

Win or Loss: 1636    Draw: 541
Points: 1558         No Point: 619

4.1. Win/Loss/Draw Prediction

Due to the dynamic nature of soccer, it is difficult to separate the classes, especially when a game is not dominated by one team; the game could easily end with any of the three outcomes. However, as the middle ground between "win" and "loss", "draw" is intuitively the most difficult to predict. Moreover, as Table 2 shows, the least frequent outcome is the draw, making it even more difficult to learn (significantly fewer data points for learning). In order to remedy this, we used an oversampling method to enrich our training data [31]. We also performed missing value imputation (single imputation, using the mean and mode) to be able to run Neural Networks (2 hidden layers, 10 nodes per layer), SVMs (with an RBF kernel), and kNN (k = 5). Table 3 summarizes our results.
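A sketch of that preparation step is shown below. It uses scikit-learn's SimpleImputer for the single mean imputation and imbalanced-learn's SMOTE implementation of the oversampling technique cited as [31]; both are stand-ins for whatever tooling the authors actually used, and both should be fitted on the training fold only so that no information leaks into the test fold.

from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer

def prepare_training_fold(X_train, y_train, random_state=0):
    # Single imputation with the feature mean (strategy="most_frequent" would give the
    # mode for categorical columns), then synthetic minority oversampling so that the
    # under-represented "Draw" class is enriched before training NN/SVM/kNN models.
    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_train)
    X_balanced, y_balanced = SMOTE(random_state=random_state).fit_resample(X_imputed, y_train)
    return X_balanced, y_balanced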
Table 3. Prediction results for the direct classification methodology with three classes. Each entry shows the method's confusion matrix over the Loss/Win/Draw classes, followed by its accuracy and its per-class sensitivity and specificity (in Loss/Win/Draw order).

Naïve Bayes*
  Confusion matrix (Loss / Win / Draw):
    Loss:  611   365    41
    Win:   207   783    27
    Draw:  340   545   132
  Accuracy: 51.90%   Sensitivity: 60.10% / 77.00% / 13.00%   Specificity: 72.90% / 55.00% / 96.60%

Decision Trees (CART)*
  Confusion matrix (Loss / Win / Draw):
    Loss:  675   124   218
    Win:   117   622   278
    Draw:  214   278   525
  Accuracy: 59.70%   Sensitivity: 66.40% / 61.20% / 51.60%   Specificity: 83.70% / 80.20% / 75.60%

Neural Nets*†
  Confusion matrix (Loss / Win / Draw):
    Loss:  658    74   285
    Win:   101   616   300
    Draw:  258   327   432
  Accuracy: 55.91%   Sensitivity: 64.70% / 60.57% / 42.48%   Specificity: 82.35% / 80.29% / 71.24%

SVMs*†
  Confusion matrix (Loss / Win / Draw):
    Loss:  663    69   285
    Win:   108   618   291
    Draw:  267   237   513
  Accuracy: 58.80%   Sensitivity: 65.19% / 60.77% / 50.44%   Specificity: 81.56% / 84.96% / 71.68%

kNN (k=5)*†
  Confusion matrix (Loss / Win / Draw):
    Loss:  695   122   200
    Win:   306   330   381
    Draw:  199   116   702
  Accuracy: 58.80%   Sensitivity: 68.34% / 32.45% / 69.03%   Specificity: 75.17% / 88.30% / 71.44%

Gradient Boosted Tree*
  Confusion matrix (Loss / Win / Draw):
    Loss:  809    50   158
    Win:    70   744   203
    Draw:  126   172   719
  Accuracy: 74.50%   Sensitivity: 79.50% / 73.20% / 70.70%   Specificity: 90.40% / 89.10% / 82.30%

Random Forest*
  Confusion matrix (Loss / Win / Draw):
    Loss:  827    98    92
    Win:   100   774   143
    Draw:  131   211   675
  Accuracy: 74.60%   Sensitivity: 81.30% / 76.10% / 66.40%   Specificity: 88.60% / 84.80% / 88.40%

* Performed after synthetic minority class oversampling
† Performed after imputation

The prediction results of the modeling techniques suggest that the ensemble tree models performed significantly better. However, the "draw" class proved to be more difficult to predict. Figure 2 visualizes this phenomenon using the Gradient Boosted Trees regression results.
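For reference, the per-class sensitivity and specificity values reported in Table 3 follow from each confusion matrix in the usual one-vs-rest reading of Equations (2)-(4); for example, the Random Forest sensitivity for "Loss" is 827 out of the 1017 observed losses, i.e. 81.3%. A small sketch of that computation, assuming only numpy, is given below.

import numpy as np

def per_class_metrics(cm, labels=("Loss", "Win", "Draw")):
    # cm: square confusion matrix with observed classes as rows, predicted classes as columns.
    cm = np.asarray(cm, dtype=float)
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp          # observed as `label`, predicted as something else
        fp = cm[:, i].sum() - tp          # predicted as `label`, observed as something else
        tn = cm.sum() - tp - fn - fp
        metrics[label] = {
            "sensitivity": tp / (tp + fn),    # Equation (3)
            "specificity": tn / (tn + fp),    # Equation (4)
        }
    metrics["accuracy"] = np.trace(cm) / cm.sum()   # Equation (2), pooled over all classes
    return metrics

Calling per_class_metrics([[827, 98, 92], [100, 774, 143], [131, 211, 675]]) reproduces the Random Forest row of Table 3 up to rounding.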

Figure 2. The highlighted area in "blue" represents games with a "draw" prediction.

4.2. Points/NoPoint Prediction

In soccer, it is also important to predict whether a team will earn a point during a game. This makes the prediction task easier by reducing the possible number of outcomes to two: "Points" and "No Points". This prediction task is also relevant for soccer betting. Following the same procedure (i.e. minority oversampling and cross-validation), we report the results for our top two performing algorithms in Table 4.

Table 4. Results for the two-class prediction task

Prediction method        Acc.(%)   Sens.(%)       Spec.(%)
Random Forest*           86.3      88.1 / 84.4    84.4 / 88.1
Gradient Boosted Tree*   86.4      89.0 / 83.7    83.7 / 89.0
* With minority class oversampling

As the results indicate, ensemble-type prediction methods performed better. Among the four data mining methods, Naïve Bayes was the lowest performer. This may be due to dependencies among the variables. Using a t-test, we also found that the Gradient Boosting and Random Forest methods both outperformed the Decision Tree (CART) and Naïve Bayes methods. However, the performance difference between the Gradient Boosting and Random Forest ensemble methods was not found to be statistically significant at the 0.05 alpha level for either the "Win/Loss/Draw" or the "Points/NoPoint" problem (p = 0.112 and p = 0.12, respectively).
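The statistical comparison mentioned above can be reproduced, for instance, with a paired t-test over the per-fold accuracies of two models evaluated on the same 10 folds. The sketch below uses scipy and is only an illustration of the procedure, since the paper does not publish its fold-level scores.

from scipy import stats

def compare_fold_accuracies(fold_acc_a, fold_acc_b, alpha=0.05):
    # Paired t-test over accuracies measured on the same cross-validation folds,
    # e.g. compare_fold_accuracies(gbt_folds, rf_folds) with two lists of 10 values.
    t_stat, p_value = stats.ttest_rel(fold_acc_a, fold_acc_b)
    significant = p_value < alpha
    return t_stat, p_value, significant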

4.3. Sensitivity Analysis

Algorithms are good at capturing non-trivial relationships and establishing a relationship between input and output. While some algorithms, such as Neural Networks, are known as black-box algorithms, others, such as decision trees, are transparent. However, even when a transparent algorithm is used, its output (for example the tree structure of a decision tree) cannot always be easily interpreted. In the context of machine learning, sensitivity analysis refers to an exclusive experimentation process to establish a possible cause-and-effect relationship between the input and the output variables [32].

Our sensitivity analysis is based on our best performing algorithm (Random Forest). Random Forest is an algorithm based on decision trees. In a decision tree, variables that are used in earlier splits are considered more important. We use this characteristic of decision trees/random forests and compute the number of level-0 splits for each variable. This way we can observe the impact of each variable on performance. The relative importance values are then tabulated, normalized, and graphically presented in Figure 3 for the top 20 variables that were chosen for the first split. As expected, the exclusion of these variables results in significant drops in classification performance.

Figure 3. Variable importance values
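A sketch of this importance heuristic is given below: it counts, across the trees of a fitted random forest, how often each variable is chosen for the level-0 (root) split and normalizes the counts. The scikit-learn attribute access is a stand-in for whatever implementation the authors used, and feature_names is a placeholder for the Table 1 variable names.

from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def root_split_importance(forest: RandomForestClassifier, feature_names):
    # Count how often each variable provides the root (level-0) split of a tree.
    counts = Counter()
    for tree in forest.estimators_:
        root_feature = tree.tree_.feature[0]   # index of the feature used at the root node
        if root_feature >= 0:                  # a negative value would mean a leaf-only tree
            counts[feature_names[root_feature]] += 1
    total = sum(counts.values()) or 1
    # Normalized counts, largest first (the top 20 correspond to Figure 3).
    return dict(sorted(((name, n / total) for name, n in counts.items()),
                       key=lambda item: item[1], reverse=True))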

5. Discussion, Conclusion, and Future Research

The results of this study once again show that the prediction of soccer outcomes is not straightforward. We were able to achieve over 74% accuracy for the "Win/Loss/Draw" and over 86% accuracy for the "Points/NoPoint" classification problems. Perhaps one of the shortcomings of this study was the significant amount of missing values (which were not missing at random). However, for the sake of including other methods (Neural Networks, kNN, and SVMs), we also performed missing value imputation in its most basic form (mean imputation). As expected, these algorithms were not able to outperform the ensemble methods. This limited our options for algorithms. We believe more comprehensive datasets may yield even better results. The ensemble methods outperformed the other two methods we used. Within the ensemble trees, gradient boosted trees slightly outperformed random forests; however, the difference was not significant and should not be generalized beyond the scope of the study.

In soccer games, there are many different aspects to be studied by different stakeholders. A coach, given the week and the opponent, may try to evaluate the impact of different formations or of different players included in the game. A team manager may try to understand what kind of coach would be suitable for his team, or what drives fans to stadiums. A sports gambler can attack the same problem with different target variables, such as the number of cards, corners, etc. We also believe that a more carefully collected dataset (i.e. with fewer missing values), as well as a richer dataset in terms of variables, will help in predicting outcomes more accurately.

6. References

[1] D. Delen, D. Cogdell, N. Kasap, "A comparative analysis of data mining methods in predicting NCAA bowl outcomes", International Journal of Forecasting, Vol. 28, Issue 2 (2012), pp. 543-552.
[2] M.J. Lopez, G.J. Matthews, "Building an NCAA men's basketball predictive model and quantifying its success", Journal of Quantitative Analysis in Sports, Vol. 11, Issue 1 (2015), pp. 5-12.
[3] P. Vračar, E. Štrumbelj, I. Kononenko, "Modeling basketball play-by-play data", Expert Systems with Applications, Vol. 44 (2016), pp. 58-66.
[4] M. Asif, I.G. McHale, "In-play forecasting of win probability in One-Day International cricket: A dynamic logistic regression model", International Journal of Forecasting, Vol. 32, Issue 1 (2016), pp. 34-43.
[5] J. Carbone, T. Corke, F. Moisiadis, "The rugby league prediction model: using an ELO-based approach to predict the outcome of the National Rugby League (NRL) matches", International Education Scientific Research Journal, Vol. 2, Issue 5 (2016), pp. 26-31.
[6] G. Wei, L.T. Saaty, R. Whitaker, "Expert System for Ice Hockey Game Prediction: Data Mining with Human Judgment", International Journal of Information Technology & Decision Making, Vol. 15, Issue 4, pp. 763-789.
[7] E.G. Dunning, A. Joseph, R.E. Maguire, "The Sports Process: A Comparative and Developmental Approach" (1996), p. 129, Champaign: Human Kinetics.
[8] E. Dunning, "Sports Matters: Sociological Studies of Sport, Violence and Civilisation" (1999), London: Routledge.
[9] A.C. Constantinou, N.E. Fenton, M. Neil, "Profiting from an inefficient association football gambling market: Prediction, risk and uncertainty using Bayesian networks", Knowledge-Based Systems, Vol. 50 (2013), pp. 60-86.
[10] R. Baboota, H. Kaur, "Predictive analysis and modeling football results using machine learning approach for English Premier League", International Journal of Forecasting (2018), Forthcoming.
[11] S. Dobrovec, "Predicting sports results using latent features: A case study", 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 25-29 May 2015, pp. 1267-1272.
[12] R.P. Schumaker, A.T. Jarmoszko, C.S. Labedz, "Predicting wins and spread in the Premier League using a sentiment analysis of Twitter", Decision Support Systems, Vol. 88 (2016), pp. 76-84.
[13] V. Radosavljevic et al., "Large-scale World Cup 2014 outcome prediction based on Tumblr posts", KDD Workshop on Large-Scale Sports Analytics, Sydney, Australia, 2014.
[14] T. D'Orazio, C. Guaragnella, M. Leo, A. Distante, "A new algorithm for ball recognition using circle Hough transform and neural classifier", Pattern Recognition (2004), Vol. 37, pp. 393-408.
[15] Y.L. Kang, J.H. Lim, M.S. Kankanhalli, C.S. Xu, Q. Tian, "Goal detection in soccer video using audio/visual keywords", IEEE International Conference on Image Processing (ICIP), Singapore, 24-27 October 2004, pp. 1629-1632.
[16] M.J. Maher, "Modelling association football scores", Statistica Neerlandica (1982), Vol. 36, pp. 109-118.
[17] B. Gianluca, M. Blangiardo, "Bayesian hierarchical model for the prediction of football results", Journal of Applied Statistics (2010), Vol. 37, Issue 2, pp. 253-264.
[18] D. Karlis, I. Ntzoufras, "Bayesian modelling of football outcomes: Using the Skellam's distribution for the goal difference", IMA Journal of Management Mathematics (2008), pp. 229-244.
[19] J. Goddard, "Regression models for forecasting goals and match results in association football", International Journal of Forecasting (2005), Vol. 21, Issue 2, pp. 331-340.
[20] M.J. Dixon, S.G. Coles, "Modelling association football scores and inefficiencies in the football betting market", Journal of the Royal Statistical Society: Series C (Applied Statistics) (1997), Vol. 46, pp. 265-280.
[21] D. Forrest, R. Simmons, "Forecasting sport: the behaviour and performance of football tipsters", International Journal of Forecasting (2000), Vol. 16, Issue 3, pp. 317-331.
[22] S.J. Koopman, R. Lit, "A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League", Journal of the Royal Statistical Society: Series A, Vol. 178, Issue 1, pp. 167-186.
[23] M. Crowder, M. Dixon, A. Ledford, M. Robinson, "Dynamic modelling and prediction of English football league matches for betting", The Statistician (2002), Vol. 51, pp. 157-168.
[24] F. Godin et al., "Beating the bookmakers: leveraging statistics and Twitter microposts for predicting soccer results", KDD Workshop on Large-Scale Sports Analytics, Sydney, Australia, 2014.
[25] M. Byungho, J. Kim, C. Choe, H. Eom, R.I. McKay, "A compound framework for sports results prediction: A football case study", Knowledge-Based Systems, Vol. 21, pp. 551-562.
[26] A.P. Rotshtein, M. Posner, A.B. Rakityanskaya, "Football Predictions Based on a Fuzzy Model with Genetic and Neural Tuning", Cybernetics and Systems Analysis (2005), Vol. 41, Issue 4, pp. 619-630.
[27] C. Shearer, "The CRISP-DM model: the new blueprint for data mining", Journal of Data Warehousing (2000), Vol. 5, pp. 13-22.
[28] D.J. Hand, K. Yu, "Idiot's Bayes—not so stupid after all?", International Statistical Review (2001), Vol. 69, pp. 385-398.
[29] L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees" (1984), Wadsworth.
[30] J.H. Friedman, "Greedy function approximation: a gradient boosting machine", Technical Report, Department of Statistics, Stanford University (1999).
[31] N. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research (2002), Vol. 16, pp. 321-357.
[32] G. Davis, "Sensitivity analysis in neural net solutions", IEEE Transactions on Systems, Man, and Cybernetics (1989), Vol. 19, pp. 1078-1082.

