Original Research Article

Analysis and Prediction of Soccer Games: An Application to the Kaggle European Soccer Database
Wuhuan Deng*, Eric Zhong
Chengdu Foreign Language School, Chengdu 611731, Sichuan, China. E-mail: [email protected]

Abstract: The study of soccer game data has many applications for both fans and teams. Effective analysis can help teams improve their offensive and defensive skills and strategies, and can also assist fans in placing bets. In this work, the authors apply statistical methods to analyze the game data in the European League Dataset. Moreover, machine learning techniques are designed to predict game results based on in-game performance and pre-game odds provided by bookmakers. With rational feature engineering and model selection, our model achieves an overall accuracy of 95%.
Keywords: Soccer; Python; Data Science; Artificial Neural Network; Statistics; Poisson Distribution

1. Introduction

Soccer is one of the most popular sports in the world, especially in Europe and South America, and it attracts an enormous following: more than a billion people watched the final match of the 2018 World Cup. Soccer is also widely used for betting at different levels. The outcome of a soccer game depends on many factors, such as the home-away ground effect and the physical and psychological condition of key players. For each team, in turn, a good understanding and analysis of its past games helps to improve skills, strategies and training methodologies.

In the literature, Balogun O. and Ogunseye AA[1] used an Artificial Neural Network (ANN) to predict the scoreline of Manchester United matches against opposing teams in the English Premier League in 2019. Their data span a period of nine years, from 2009 to 2018; 331 items of the data set were used as training data, while 12 were used for validation. Their artificial neural network model has 6 input layers, 5 hidden layers, and 2 output layers, and it gave an accuracy of 73.27% for goals scored by Manchester United. Yoel F. Alfredo and Sani M. Isa[2] published a similar research paper in 2019. Their data come from football-data.co.uk, a data set commonly used in research on football match prediction. The data contain 71 attributes, which can be divided into two categories: football match statistics and bookmaker odds predictions. The authors selected only the 14 attributes that they considered to have a good impact on prediction and result accuracy. They used 3 models to predict the result: a C5.0 model, a Random Forest model, and an Extreme Gradient Boosting model. Each model achieved a high prediction accuracy.

2. Dataset description

The open dataset studied in this work was obtained from www.kaggle.com. Built by Hugo Mathien, it includes soccer match data of 11 European countries with their

Copyright © 2020 Wuhuan Deng and Eric Zhong


doi: 10.18282/i-s.v3i1.332
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium,
provided the original work is properly cited.

Insight - Statistics Volume 3 Issue 1 | 2020 | 1


lead championships for the seasons 2008 to 2016. The database covers 24,637 matches and over 10,000 players. Betting odds from up to 10 bookmakers are included. Detailed match events such as goal types, possession, corners, crosses, fouls and cards are also recorded in the database. Moreover, the dataset integrates player and team attribute ratings sourced from EA Sports’ FIFA video game series.

The dataset is an SQL database file containing 7 tables: “Country”, “League”, “Match”, “Player”, “Player_Attribute”, “Team” and “Team_Attributes”. For the match table, 134 attributes are recorded, some of which are significant for predicting game results, while appropriate feature engineering based on relevant soccer knowledge may also be conducted in the later prediction process.

3. Statistical analysis

3.1 Best offensive and defensive teams

Match data provide supporting evidence of the offensive and defensive performance of each team. Hence, the first investigation conducted on this dataset is a study of the game performance of each team from 2008 to 2016 based on goal events. The attributes used include “home_team_goal”, “away_team_goal” and two additional designed features, “shot_efficiency” and “goal_efficiency”. The feature “shot_efficiency” is the ratio of the number of shots on target to the total number of shots, and “goal_efficiency” is the ratio of the number of goals to the number of shots on target. With this simple feature engineering, the two new attributes capture how efficiently a team shoots and scores, which is significant for evaluating a team's offensive capabilities.

By averaging the attributes from 2008 to 2016 for all teams, the top away and home teams are obtained, as shown in Figure 1 and Figure 2.
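The averaging and ranking described in Section 3.1 can be illustrated with a minimal sketch. It assumes a simplified stand-in for the “Match” table held in an in-memory SQLite database; the columns shot_on and shot_total, the team names and all values are hypothetical, and averaging the per-match ratios is only one possible reading of how the efficiencies are aggregated.

```python
import sqlite3

# Toy stand-in for the "Match" table. The columns shot_on and shot_total are
# hypothetical names for this sketch; all rows are made-up illustrative values.
ROWS = [
    # away_team, away_team_goal, home_team_goal, shot_on, shot_total
    ("Team A", 2, 1, 6, 10),
    ("Team A", 3, 0, 6, 8),
    ("Team B", 1, 1, 4, 10),
    ("Team B", 0, 2, 2, 10),
    ("Team C", 1, 0, 5, 5),
]


def build_demo_db():
    """Create an in-memory SQLite database holding the toy Match table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Match (away_team TEXT, away_team_goal INTEGER, "
        "home_team_goal INTEGER, shot_on INTEGER, shot_total INTEGER)"
    )
    conn.executemany("INSERT INTO Match VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def away_team_stats(conn):
    """Average the per-match attributes for every away team."""
    query = """
        SELECT away_team,
               AVG(away_team_goal),
               AVG(home_team_goal),
               AVG(1.0 * shot_on / NULLIF(shot_total, 0)),
               AVG(1.0 * away_team_goal / NULLIF(shot_on, 0))
        FROM Match
        GROUP BY away_team
    """
    keys = ("goals", "conceded", "shot_efficiency", "goal_efficiency")
    return {row[0]: dict(zip(keys, row[1:])) for row in conn.execute(query)}


def top_list(stats, key, n=10, lowest_first=False):
    """Rank teams by one averaged attribute and keep the first n."""
    ranked = sorted(
        stats.items(), key=lambda item: item[1][key], reverse=not lowest_first
    )
    return [(team, round(values[key], 2)) for team, values in ranked[:n]]


stats = away_team_stats(build_demo_db())
print(top_list(stats, "goals"))                         # offensive ranking
print(top_list(stats, "conceded", lowest_first=True))   # defensive ranking
print(top_list(stats, "shot_efficiency"))
print(top_list(stats, "goal_efficiency"))
```

For the defensive ranking the list is sorted in ascending order, since fewer goals conceded indicates a better defence. NULLIF turns a zero denominator into NULL, which AVG then ignores, so a match without any shots does not distort the averaged ratio.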

Away team goal performance top list

Rank  away_team          away_team_goal
1     FC Barcelona       2.33
2     Real Madrid CF     2.22
3     FC Zurich          2.13
4     Ajax               2.11
5     PSV                2.07
6     Celtic             2.01
7     FC Bayern Munich   1.99
8     SL Benfica         1.99
9     FC Porto           1.98
10    Rangers            1.93

Away team defensive performance top list

Rank  away_team                 home_team_goal
1     Grasshopper Club Zurich   0.63
2     Rangers                   0.75
3     FC Porto                  0.77
4     SL Benfica                0.84
5     Juventus                  0.84
6     FC Bayern Munich          0.85
7     FC Barcelona              0.86
8     Celtic                    0.88
9     Sporting CP               0.9
10    Legia Warszawa            0.97

Away team shot efficiency top list

Rank  away_team             away_team_shot_efficiency
1     Valenciennes FC       80%
2     Karlsruher SC         75%
3     FC Utrecht            69%
4     Novara                68%
5     SC Freiburg           68%
6     Vitesse               67%
7     FC Barcelona          66%
8     Legia Warszawa        65%
9     Paris Saint-Germain   65%
10    SC Heerenveen         64%

Away team goal efficiency top list

Rank  away_team               away_team_goal_efficiency
1     DSC Arminia Bielefeld   32%
2     Valenciennes FC         25%
3     CD Numancia             22%
4     SV Darmstadt 98         22%
5     ADO Den Haag            22%
6     Rangers                 21%
7     Xerez Club Deportivo    20%
8     Zaglebie Lubin          20%
9     Paris Saint-Germain     20%
10    AS Monaco               19%

Figure 1. Away team attributes ranking.



Home team goal performance top list

Rank  home_team          home_team_goal
1     BSC Young Boys     3.38
2     Real Madrid CF     3.32
3     FC Barcelona       3.26
4     FC Bayern Munich   2.81
5     PSV                2.72
6     Ajax               2.65
7     SL Benfica         2.59
8     Celtic             2.56
9     Manchester City    2.4
10    FC Porto           2.38

Home team defensive performance top list

Rank  home_team          away_team_goal
1     FC Zurich          0.13
2     FC Porto           0.52
3     Ajax               0.57
4     Celtic             0.58
5     SL Benfica         0.66
6     FC Barcelona       0.66
7     FC Bayern Munich   0.71
8     FC Vaduz           0.71
9     Legia Warszawa     0.73
10    Rangers            0.74

Home team shot efficiency top list

Rank  home_team             home_team_shot_efficiency
1     Paris Saint-Germain   67%
2     FC Barcelona          66%
3     Monchengladbach       66%
4     FC Twente             65%
5     N.E.C.                65%
6     Bournemouth           65%
7     Real Madrid CF        65%
8     Ajax                  65%
9     Hercules Club         64%
10    Wolfsburg             64%

Home team goal efficiency top list

Rank  home_team             home_team_goal_efficiency
1     Korona Kielce         25%
2     Paris Saint-Germain   23%
3     FC Barcelona          23%
4     Livorno               22%
5     Ajax                  22%
6     Celtic                22%
7     Real Madrid CF        21%
8     Hercules Club         21%
9     Wolfsburg             20%
10    Reading               20%

Figure 2. Home team attributes ranking.

The above two figures show that FC Barcelona appears 7 times and Real Madrid CF appears 5 times in the top lists, which matches the actual performance of these two teams during that period.

3.2 Goal distribution


The numbers of goals scored by the away team and the home team in each match are collected in order to study the distribution of goals. Two Poisson distribution models are built to fit the goal distributions. As demonstrated in Figure X to Figure X, the Poisson distribution accurately models the true distributions of home team goals and away team goals. Therefore, as the two variables, home goals and away goals, follow Poisson distributions, their difference, the net goal, should follow the Skellam distribution, provided the two variables are treated as independent. Figure X shows that the Skellam distribution accurately fits the true distribution of net goals, calculated by taking the difference between home and away goals for each match. The total goal and net goal distributions are shown in Figure 3 and Figure 4.
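The relationship between the two Poisson models and the Skellam distribution of net goals can be sketched as follows. This is a minimal illustration using only the Python standard library; the goal counts are made-up values, the function names are hypothetical, and home and away goals are assumed to be independent. The Skellam probability is obtained by summing the product of the two Poisson probabilities over all scorelines with the same goal difference.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam); zero for negative k."""
    if k < 0:
        return 0.0
    return math.exp(-lam) * lam**k / math.factorial(k)


def fit_poisson_rate(goals):
    """Maximum-likelihood estimate of the Poisson rate: the sample mean."""
    return sum(goals) / len(goals)


def skellam_pmf(k, lam_home, lam_away, max_goals=50):
    """P(H - A = k) for independent H ~ Poisson(lam_home), A ~ Poisson(lam_away).

    Sums over away-goal counts a with home goals h = a + k; the infinite
    series is truncated at max_goals, which is ample for soccer scores.
    """
    return sum(
        poisson_pmf(a + k, lam_home) * poisson_pmf(a, lam_away)
        for a in range(max_goals + 1)
    )


def outcome_probabilities(lam_home, lam_away, max_diff=20):
    """Home-win, draw and away-win probabilities from the net-goal distribution."""
    home = sum(skellam_pmf(k, lam_home, lam_away) for k in range(1, max_diff + 1))
    draw = skellam_pmf(0, lam_home, lam_away)
    away = sum(skellam_pmf(k, lam_home, lam_away) for k in range(-max_diff, 0))
    return home, draw, away


# Made-up goal counts for eight matches (illustrative only).
home_goals = [1, 2, 0, 3, 1, 2, 1, 0]
away_goals = [1, 0, 1, 2, 0, 1, 1, 2]

lam_home = fit_poisson_rate(home_goals)
lam_away = fit_poisson_rate(away_goals)

for k in range(-3, 4):
    print(f"P(net goal = {k:+d}) = {skellam_pmf(k, lam_home, lam_away):.4f}")
print(outcome_probabilities(lam_home, lam_away))
```

Comparing these model probabilities with the observed relative frequencies of each net-goal value is one simple way to check the quality of the fit.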

Figure 3. Goal distribution of away team and home team.



Figure 4. Net goal distribution.

4. Game result predictions

4.1 Description

The prediction on this dataset is slightly different from the usual prediction of match results. Normally, prediction focuses on the results of matches that have not happened yet, based on pre-game features such as the squad and historical match results. The prediction in this work is to infer the match results from pre-game betting odds and in-game performance. The objective is to analyze the key features that most affect the final match results, which could be useful for improving team training and strategy making.

4.2 Feature engineering

There are 144 attributes in total in the match table of the database file, 28 of which are selected, including betting odds provided by different bookmakers and in-game features such as possession (the percentage of time in control of the ball), shot efficiency (shots on target/total shots), goal efficiency (goals/shots on target), etc.

Feature engineering is a significant and necessary process for improving the prediction accuracy. Sufficient background knowledge of the practical application is essential for understanding the prediction problem. For soccer games, offensive efficiency plays an important role in evaluating game performance and is strongly related to the final results. Therefore, additional features, namely shot efficiency and goal efficiency, are designed to characterize the offensive efficiency.

4.3 Models

Four models are built in this work: Logistic Regression, Decision Tree, Random Forest and Deep Neural Network. The Logistic Regression model is appropriate for this problem since the dependent variable, win-draw-lose, is categorical. It is easy to implement and computationally efficient. In the Decision Tree model, each branch represents a possible decision, outcome, or reaction, and the last branch of the tree represents the final result. In our Decision Tree model, the splitting criterion is information entropy: a mathematical measure of the degree of randomness in a set of data, with greater randomness implying higher entropy and greater predictability implying lower entropy. The advantage of the Decision Tree is that data preprocessing is simple or unnecessary. However, as the number of feature types increases, the accuracy tends to decrease because of over-fitting. The Random Forest model, in theory, gives a higher accuracy than the Decision Tree model since the problem of over-fitting is ameliorated, at the cost of increased computational complexity. The last model we build is the Deep Neural Network model, which is suitable for describing the non-linear relationship between the match features and results. In this work, a neural network with 5 fully connected dense layers, 5 activation layers, 2 dropout layers and 1 batch normalization layer was designed. A softmax function is applied in the model, and the loss of the model is based on sparse categorical crossentropy. The batch size is 32, and the number of epochs is set to 500.

4.4 Results

The four models are all used to predict the match results based on the same training and testing data. A comparison of the prediction results of the four models is provided in Figure 5.
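Two of the quantities named in Section 4.3, the information entropy used as the Decision Tree criterion and the softmax output combined with a sparse categorical cross-entropy loss, can be illustrated with a minimal sketch in plain Python. The labels, logits and function names below are hypothetical and serve only as an illustration; this is not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


def softmax(logits):
    """Turn raw scores into class probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_class, logits):
    """Negative log-probability of the true class, given as an integer index."""
    return -math.log(softmax(logits)[true_class])


# Hypothetical match results reaching one tree node, and one candidate split.
parent = ["win", "win", "draw", "lose", "lose", "lose"]
left, right = ["win", "win", "draw"], ["lose", "lose", "lose"]
print(f"entropy before split: {entropy(parent):.3f} bits")
print(f"information gain:     {information_gain(parent, left, right):.3f} bits")

# Hypothetical network outputs for the classes 0 = win, 1 = draw, 2 = lose.
logits = [2.0, 0.5, -1.0]
print("class probabilities:", [round(p, 3) for p in softmax(logits)])
print("loss if the true class is win: ",
      round(sparse_categorical_crossentropy(0, logits), 3))
print("loss if the true class is lose:",
      round(sparse_categorical_crossentropy(2, logits), 3))
```

A split that separates the classes well yields a large information gain, and the loss is small when the network assigns a high probability to the true class.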



Figure 5. Four models’ prediction result.

Among them, the Deep Neural Network model gives the highest accuracy, 0.99. The Logistic Regression, Decision Tree and Random Forest models reach 0.95, 0.91 and 0.84, respectively. Moreover, we apply the best model exclusively to the 5 top leagues of Europe, namely the English Premier League (England), La Liga (Spain), Bundesliga (Germany), Serie A (Italy) and Ligue 1 (France); the results are shown in Figure 6.

Figure 6. Top 5 leagues’ prediction result.

To analyze the most important features for soccer match prediction, the 4 most important features, namely possession, total shots, shot efficiency and goal efficiency, are compared. Each of the 4 features is applied on its own to the prediction for the 5 leagues. The results in Figure 7 to Figure 10 indicate that possession, total shots and shot efficiency are most important for La Liga, and goal efficiency is most significant for Ligue 1.

Figure 7



Figure 8

Figure 9

Figure 10

5. Conclusion

In this work, a novel analysis and prediction on an open soccer game database are provided. By using simple statistical methods, the best offensive and defensive teams are successfully identified, and the Poisson and Skellam distributions are verified to fit the goals. Moreover, feature engineering and four prediction models are applied to predict the match outcomes. The results indicate that our feature engineering and the designed Neural Network are effective for match result prediction.

References

1. Ogunseye AA. Artificial neural network approach to football score prediction. Journal of Artificial Intelligence 2019; 1.
2. Alfredo YF, Sani MI. Football match prediction with tree based model classification. International Journal of Intelligent Systems and Applications 2019; 11(7): 20–28.
