
A Novel Approach to Predicting the Results of NBA Matches

Omid Aryan                        Ali Reza Sharafat
Stanford University               Stanford University
[email protected]                  [email protected]

Abstract—This paper presents a novel approach to predicting the results of basketball matches played in the NBA. As opposed to previous work, which takes into account only a team's own performance and statistics when predicting the outcome of its match, our approach also uses the data known about the opposing team. Our findings show that utilizing information about the opponent improves the error rate observed in these predictions.

I. INTRODUCTION

Statistics and data have become an ever more attractive component of professional sports in recent years, particularly in the National Basketball Association (NBA). Massive amounts of data are collected on each of the teams in the NBA, from the number of wins and losses of each team, to the number of field goals and three-point shots of each player, to the average number of minutes each player plays. All of this data is truly intriguing, and given that it is usually interpreted by people in the sports profession alone, it is a fantastic and largely untapped field for people in machine learning (and other data-mining sciences) to apply their techniques and draw out interesting results. Thus, we plan to do just that and utilize the algorithms we learn in CS229 to predict NBA matches.

This paper provides a novel approach to predicting these matches. As opposed to previous work in this field, where the available data is fed directly to the algorithms, we intend to find a relationship between the data sets provided for the two teams in a match and modify the data accordingly before feeding it to our algorithms. Section II describes the source and format of the data set we use to train our models. Section III presents an overview of the model we intend to implement, along with a description of each of its components. Section IV presents the results of our implementation, and Section V concludes the paper.

II. DATA SET

We collect our data from basketball-reference.com. For each game we get a set of 30 features for each team (e.g., http://www.basketball-reference.com/boxscores/201411160LAL.html). The features are listed in Table I.

TABLE I: Features given for each team and each game in the season

Minutes Played
Field Goals                Blocks
Field Goal Attempts        3-Point Field Goals
Field Goal Percentage      3-Point Field Goal Attempts
Turnovers                  3-Point Field Goal Percentage
Personal Fouls             True Shooting Percentage
Points                     Effective Field Goal Percentage
Free Throws                Offensive Rebound Percentage
Free Throw Attempts        Defensive Rebound Percentage
Free Throw Percentage      Total Rebound Percentage
Offensive Rebounds         Assist Percentage
Defensive Rebounds         Steal Percentage
Total Rebounds             Block Percentage
Assists                    Turnover Percentage
Steals                     Usage Percentage
Offensive Rating           Defensive Rating

We created a crawler that scrapes the website, collects the entire data set for all games of the past five seasons (2008-2013), and stores them in one CSV file per season. We call this data set the game data. Each team plays on average 82 games per season, so we have about 2600 data points for each season. We primarily use this data set to predict the features of teams playing an upcoming game.

We also collect seasonal data for each team (e.g., http://www.basketball-reference.com/leagues/NBA_2013.html). Here, there are five tables of features for each of the teams in the league. We combine four of these tables (Team Stats, Opponent Stats, Team Shooting, Opponent Shooting) into one table, resulting in 98 features per team. We call this data set the team data. We primarily use it to cluster teams in order to make more accurate feature predictions for upcoming games.
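For illustration, a minimal version of such a crawler might look like the sketch below, assuming Python with the pandas library; the URL pattern follows the box-score example cited above, while the function name and overall structure are our own, not the authors' implementation.

    # Hypothetical sketch of the box-score crawler (assumes pandas).
    # The URL pattern (date + home-team code) follows the example in the text.
    import pandas as pd

    BOX_URL = "http://www.basketball-reference.com/boxscores/{date}0{home}.html"

    def fetch_box_score_tables(date, home):
        # pandas.read_html parses every <table> on the page into a DataFrame
        return pd.read_html(BOX_URL.format(date=date, home=home))

    # Example: the Lakers home game of 2014-11-16 cited in Section II
    tables = fetch_box_score_tables("20141116", "LAL")
    print(len(tables), "stat tables found")

In practice, the per-game rows scraped this way would be appended to the season's CSV file to build the game data set described above.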

III. MODEL OVERVIEW

The previous section described the data set with which we intend to train our models. Each training example is of the form (x, y), corresponding to the statistics and outcome of a team in a particular match: x is an N-dimensional vector containing the input variables, and y indicates whether the team won (y = 1) or lost (y = 0) that match.

Given such a data set, a naive approach would be to feed the training examples directly to existing machine learning algorithms (e.g., logistic regression, support vector machines, neural networks) to predict the outcome of a match. This approach is undertaken in the referenced work. However, this method merely predicts whether a team wins or loses, regardless of which team it plays against.

In this paper we tackle this intrinsic problem of the aforementioned method and take a different approach. Instead of directly using the training examples of Section II to train our parameters, we modify them based on the matches they were obtained from as well as the relationships that the different features have with one another. For example, the statistics that a team is expected to have in each match depend on the team it is playing against. We take this into account by clustering the teams into different groups and predicting each team's expected statistics based on: a) the relationship that the two clusters corresponding to the two teams have with one another, and b) their average statistics from the previous matches. Moreover, once we have a prediction of how each team will perform with respect to each of the features, we congregate the two feature sets of the two teams into a single feature set based on how those features relate to one another. Ultimately, this single feature set is fed to our algorithms in order to make a prediction.

An overview of our model is depicted in Figure 1. It is composed of the following three components:

1) Statistic Prediction: The two teams to play in a match are given as input (e.g., the Los Angeles Lakers and the Boston Celtics), and the expected statistics of each team (xA and xB) in that match are predicted.
2) Feature Formation: This component forms a single set of input features x from xA and xB, based on how the individual features (the xi's) are related to one another.
3) Result Prediction: This component contains the different machine learning algorithms we intend to utilize to predict the final outcome based on the input feature set x from the previous component.

Fig. 1: Model Overview

A. Statistic Prediction

We use several different predictive models to predict the features for an upcoming game, using all the data points from previous games. We describe these models below.

1) Running average: In this model, the feature values of a team in an upcoming game are predicted using the running average of that team's features over the previous games it has played in the season. This is the simplest prediction method one can use. Since it relies on the team having played at least one game already, we do not make a prediction for a team's first game of the season.
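As a concrete sketch (our own illustration, not the authors' code), the running-average predictor takes only a few lines of Python with NumPy:

    import numpy as np

    def running_average(history):
        # history: list of feature vectors for a team's previous games,
        # earliest first. Returns None for the first game of the season,
        # mirroring the paper's choice not to predict season openers.
        if not history:
            return None
        return np.mean(np.array(history), axis=0)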
2) Exponentially decaying average: This is similar to the previous model, with the difference that previous games are weighted by a factor 0 < α < 1. That is, if a given team has already played n games, with g_i representing its features in the i-th game, our prediction for the (n+1)-st game is

g_{n+1} = \frac{1-\alpha}{1-\alpha^n} \sum_{i=1}^{n} \alpha^{n-i} g_i

The point of using a decaying average is to test the hypothesis that results have temporal dependence, i.e., that more recent results are more relevant than older ones. The lower the value of α, the higher the importance of the recent results.
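A direct implementation of this formula might look as follows (a sketch of ours, assuming NumPy). Note that the (1 − α)/(1 − α^n) factor normalizes the weights so that they sum to one:

    import numpy as np

    def decaying_average(history, alpha):
        # history: feature vectors g_1 .. g_n of the games played so far;
        # alpha: decay factor with 0 < alpha < 1.
        n = len(history)
        if n == 0:
            return None
        # weights alpha^(n-i) for i = 1..n, normalized to sum to one
        weights = alpha ** (n - np.arange(1, n + 1))
        weights *= (1 - alpha) / (1 - alpha ** n)
        return weights @ np.array(history)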
3) Home and away average: The hypothesis we aim to test with this predictor is whether teams perform differently at home and away. We keep two running averages for each team, for home and away games respectively. If the upcoming game is a home game for a given team, the running home-game average is our prediction; similarly, we use the away average if the game is away from home.

4) Cluster-based prediction: The aim of this method is to see whether we can predict the behavior of a team based on the type of team it plays. To cluster the teams, we use our team data set, in which each team has 98 features. We first perform PCA to reduce the number of features that go into clustering to n. Then we perform k-means clustering to get k clusters, and keep a running average of each team's features against teams of each cluster; that is, we keep k running averages for each team. Then, if the upcoming game of a given team X is against a team belonging to cluster i, our prediction of X's features is the running average of X's features against teams in cluster i.
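Under the stated design (PCA down to n components, then k-means), the clustering step could be sketched as follows with scikit-learn. This is our illustration, not the authors' code; the parameter names mirror the n and k above:

    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def cluster_teams(team_data, n_components, n_clusters):
        # team_data: one row of 98 seasonal features per NBA team
        reduced = PCA(n_components=n_components).fit_transform(team_data)
        # labels[t] is the cluster index of team t; each team's k per-cluster
        # running averages are then keyed by the opponent's label
        return KMeans(n_clusters=n_clusters).fit_predict(reduced)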
B. Feature Formation

A distinctive aspect of our approach, relative to previous work, is that we take into account the statistics of the opposing team before making a prediction. In other words, we create a training example (to be fed to our algorithms) for each match rather than for each team. This training example is the output of this component, whose input is the two (expected) feature vectors of the two teams (each of which the previous component derives with knowledge of the other team). Hence, the task of the Feature Formation component is to form an input variable vector for the match based on the expected input variables of each team.

For our implementation, we have used the simplest method of comparison, which is to take the difference of the predicted feature values of the two teams to form the final feature set, i.e.,

x = xA − xB

An output of y = 1 would then indicate team A's victory, while y = 0 would indicate team B's victory (basketball matches have no draws, so one team must win).

C. Result Prediction

Once we have the predicted feature set of a match, we can predict the outcome of that match by means of any machine learning algorithm. The algorithms we utilized are linear regression, logistic regression, and support vector machines (SVM). For linear regression, the score difference of the two teams was used as the target for training and testing.
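These two components can be sketched together as follows. The difference x = xA − xB is the paper's feature formation rule; the rest (function names, the choice of scikit-learn's LogisticRegression) is our own hypothetical illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def form_features(x_a, x_b):
        # Feature Formation: one match vector from the two teams'
        # predicted statistics
        return np.asarray(x_a) - np.asarray(x_b)

    # Result Prediction: X holds one difference vector per match,
    # y = 1 if team A won and y = 0 if team B won
    def train_result_predictor(X, y):
        return LogisticRegression().fit(X, y)  # SVM and linear regression analogous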
IV. IMPLEMENTATION AND RESULTS

The data was trained and tested for each season individually; five seasons, from 2008 to 2013, were used. The results below are averages over those seasons. Hold-out cross-validation was used for training and testing: 70% of the matches in a season were used to train the algorithms, while the remaining matches were used for testing. The reported error rates were computed by averaging 100 iterations of this procedure with different training and testing sets. Furthermore, the forward-search feature selection algorithm was used to choose the most effective features for each of the algorithms.

Here we break down our results by the statistic prediction method used.
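A minimal sketch of this evaluation protocol (ours, assuming scikit-learn) is given below; the 70/30 split and the 100-iteration averaging come from the text, while the helper name is hypothetical:

    # Average test error over 100 random 70/30 hold-out splits.
    # X and y are the match features and labels from Feature Formation.
    import numpy as np
    from sklearn.model_selection import train_test_split

    def holdout_error(clf, X, y, n_iters=100, test_size=0.3):
        errors = []
        for _ in range(n_iters):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size)
            clf.fit(X_tr, y_tr)
            errors.append(1.0 - clf.score(X_te, y_te))  # misclassification rate
        return np.mean(errors)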

1) Running average: The running average is the simplest predictor of the features of an upcoming game. We simply train our classifiers on a 70/30 training/testing split. The results are shown in Table II. We see that linear regression turns out to be the best classifier in this case. We note that training and test errors were generally very similar; one possible reason is that we used predicted statistics for both runs. The fact that we train the classifier on predicted statistics means that the noise from those predictions cannot be subsumed during training.

TABLE II: Training and test errors when using running averages as predictors

                      Training error   Test error
Linear regression     0.3231           0.3198
Logistic regression   0.3281           0.3251
SVM                   0.3317           0.3304

2) Exponentially decaying average: We feed the decaying averages into our three classifiers to predict the outcomes of the matches in the training and test sets, experimenting with various values of the decay parameter α. The results are shown in Figure 2. Similar to the previous part, linear regression performs best amongst the classifiers. Both training and test errors decrease as the value of α increases, which signifies that there is no temporal relationship between the results (that is, results from early in the season are as important as the latest results). The training and test errors for all values of α remain higher than those from running averages.

Fig. 2: Test and training errors as a function of α (x-axis), for linear regression, logistic regression, and SVM.

3) Home and away average: The classification part is very similar to the two previous sections. The training and test errors are shown in Table III. We see that linear regression performs better in training and is marginally better in testing; the other two classifiers perform worse, however. This refutes our hypothesis that teams perform differently at home and away from home. We see that the running average (surprisingly) is still our best predictor.

TABLE III: Training and test errors when using home vs. away running averages as predictors

                      Training error   Test error
Linear regression     0.314            0.3174
Logistic regression   0.3371           0.3327
SVM                   0.3412           0.3397

4) Cluster-based prediction: The classification part is similar to the previous parts. We experimented with different numbers of clusters and different numbers of PCA components during the prediction phase. The results are shown in Figure 3. We find that, in general, the training and test errors increase with the number of clusters and the number of PCA components. We find that logistic regression performs best amongst our classifiers here, just slightly better than linear regression performed on the running averages.

Fig. 3: Training and test error rates for various numbers of PCA components (4, 10, 20, 30, from top to bottom) and numbers of clusters (2, 3, 5, 10, x-axis).
V. CONCLUSION

We used a variety of predictors to estimate the features of teams in an upcoming basketball game. We then used those predictions to form a single feature set corresponding to a single match, and used that feature set to predict the outcome of the corresponding match. We found that the running average of the features was consistently the best predictor. We also found that linear regression and logistic regression performed best at classification, with SVM a distant third; the reason for the poor performance of SVM is that our data was not nicely separable. Our training and test errors are in line with the literature referenced in the bibliography.

REFERENCES

[1] Beckler, Matthew, Hongfei Wang, and Michael Papamichael. "NBA Oracle." 2008-2009.
[2] Bhandari, Inderpal, et al. "Advanced Scout: Data mining and knowledge discovery in NBA data." Data Mining and Knowledge Discovery 1.1 (1997): 121-125.
[3] Cao, Chenjie. "Sports data mining technology used in basketball outcome prediction." 2012.
[4] Miljkovic, Dejan, et al. "The use of data mining for basketball matches outcomes prediction." 2010 8th International Symposium on Intelligent Systems and Informatics (SISY), IEEE, 2010.
