Analysis and Prediction of Soccer Games - An Application To The Kaggle European Soccer Database
Abstract: The study of soccer game data has many applications for both fans and teams. Effective analysis can not
only help teams improve their offensive and defensive skills and strategies, but can also assist fans in placing
bets. In this work, we analyze the game data in the European League Dataset with statistical methods. Moreover,
machine learning techniques are designed to predict the game results based on in-game performance and pre-game
odds provided by bookmakers. With rational feature engineering and model selection, our model achieves an overall
accuracy of 95%.
Keywords: Soccer; Python; Data Science; Artificial Neural Network; Statistics; Poisson Distribution
[Figure: top lists of away-team goal performance, away-team defensive performance, away-team shot efficiency, away-team goal efficiency, home-team shot efficiency and home-team goal efficiency]
4. Game result predictions
4.1 Description
The prediction task on this dataset differs slightly from the usual prediction of match results. Normally, the
goal is to predict the results of matches that have not yet been played, based on pre-game features such as the
squad and historical match results. In this work, match results are predicted from pre-game betting odds and
in-game performance. The objective is to identify the key features that most affect the final match results,
which could help improve team training and strategy making.
4.2 Feature engineering
There are 144 attributes in total in the match table of the database file, 28 of which are selected, including
betting odds provided by different bookmakers and in-game features such as possession (the percentage of time in
control of the ball), shot efficiency (shots on target / total shots), goal efficiency (goals / shots on target),
etc.
Feature engineering is a significant and necessary process for improving the prediction accuracy. Sufficient
background knowledge of the practical application is essential to understand the prediction problem. For soccer
games, offensive efficiency plays an important role in evaluating game performance and is strongly related to the
final results. Therefore, additional features, including shot efficiency and goal efficiency, are designed to
characterize the offensive efficiency.
4.3 Models
Four models are built in this work: Logistic Regression, Decision Tree, Random Forest and Deep Neural Network.
The Logistic Regression model is appropriate for this problem since the dependent variable (win, draw or lose) is
categorical. It is easy to implement and computationally efficient. In the Decision Tree model, each branch
represents a possible decision, outcome or reaction, and the last branch of the tree represents the final result.
In our Decision Tree model, the criterion is information entropy: a mathematical measure of the degree of
randomness in a set of data, with greater randomness implying higher entropy and greater predictability implying
lower entropy. An advantage of the Decision Tree is that it needs little or no data preprocessing. However, as the
number of data types increases, the accuracy decreases because of over-fitting. The Random Forest model generally
gives a higher accuracy than the Decision Tree model since the over-fitting problem is ameliorated, at the cost of
increased computational complexity. The last model is the Deep Neural Network, which is suitable for describing
the non-linear relationship between the match features and the results. In this work, a neural network with 5
fully connected dense layers, 5 activation layers, 2 dropout layers and 1 batch normalization layer is designed.
A softmax function is applied in the model, and the loss is sparse categorical crossentropy. The batch size is 32,
and the model is trained for 500 epochs.
4.4 Results
The four models are all used to predict the match results based on the same training and testing data. A
comparison of the prediction results for the four models is provided in Figure 5.
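To make the entropy criterion of the Decision Tree in Section 4.3 concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names `entropy` and `information_gain` and the sample outcome lists are assumptions chosen for illustration; the source does not state which library or split-selection details were used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into `children` subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical match outcomes, for illustration only (not from the dataset).
mixed = ["win", "draw", "lose", "win", "draw", "lose"]  # evenly mixed: high entropy
pure = ["win", "win", "win", "win"]                     # fully predictable: zero entropy

print(entropy(mixed), entropy(pure))
```

A tree that uses this criterion prefers the split with the largest information gain, that is, the split whose child nodes are the most predictable.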
Among the four models, the Deep Neural Network gives the highest accuracy, 0.99. The Logistic Regression,
Decision Tree and Random Forest models achieve 0.95, 0.91 and 0.84, respectively. Moreover, we apply the best
model exclusively to the 5 top leagues of Europe, namely the English Premier League (England), La Liga (Spain),
Bundesliga (Germany), Serie A (Italy) and Ligue 1 (France), and the results are shown in Figure 6.
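The softmax function and sparse categorical crossentropy loss named for the Deep Neural Network in Section 4.3 can be sketched in plain Python as follows. This is an illustrative sketch for a single sample, not the authors' code: the class order in `CLASSES` and the raw scores in `logits` are assumptions, and the source does not say which framework was used.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(class_index, probs):
    """Loss for one sample whose true class is given as an integer index."""
    return -math.log(probs[class_index])

# Hypothetical class encoding and network outputs, for illustration only.
CLASSES = ["win", "draw", "lose"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

loss_if_win = sparse_categorical_crossentropy(0, probs)   # true class matches the top score
loss_if_lose = sparse_categorical_crossentropy(2, probs)  # true class has the lowest score
```

"Sparse" here means that the true label is stored as an integer class index rather than a one-hot vector; the loss is small when the predicted probability of the true class is high.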
To identify the most important features for soccer match prediction, the 4 most important features (possession,
total shots, shot efficiency and goal efficiency) are compared. Each of the 4 features is used alone to predict
the results for the 5 leagues. The results in Figure 7 to Figure 10 indicate that possession, total shots and
shot efficiency are most important for La Liga, and goal efficiency is most significant for Ligue 1.
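Two of the features compared here, shot efficiency and goal efficiency, are defined in Section 4.2 as simple ratios. A minimal sketch of their computation is given below. The function names, the argument names and the choice to return 0.0 when a denominator is zero are assumptions; the source does not describe how matches without shots are handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def efficiency_features(goals, shots_on_target, total_shots):
    """Compute the two offensive-efficiency features for one team in one match."""
    return {
        # shot efficiency = shots on target / total shots
        "shot_efficiency": safe_ratio(shots_on_target, total_shots),
        # goal efficiency = goals / shots on target
        "goal_efficiency": safe_ratio(goals, shots_on_target),
    }

# Hypothetical match statistics, for illustration only (not from the dataset).
features = efficiency_features(goals=2, shots_on_target=5, total_shots=10)
print(features)
```

With these made-up numbers, half of the shots are on target and two of the five on-target shots become goals.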
[Figure 7, Figure 9, Figure 10]
5. Conclusion