
Olympic History: Athletes and Results Data Analysis

Jake Haas
Computer Science
California State University, Sacramento
Sacramento, CA, USA

Manuel Herrera
Computer Science
California State University, Sacramento
Sacramento, CA, USA

ABSTRACT

The Olympics is an international sporting event. Participation
has expanded from 241 athletes at the first modern Games to about
11,500 at the most recent Olympics [1]. Given the historical data
throughout the Olympics, the odds of winning a medal (gold,
silver, or bronze) could perhaps be estimated from a few
biological attributes of the athletes. Therefore, we decided to do
exploratory data analysis so we may visualize patterns within the
dataset. Furthermore, we wanted to predict if an athlete would win
a medal based on those few attributes. The dataset was provided by
the Kaggle user 'rgriffin' under "120 years of Olympic History:
Athletes and Results."

1 Introduction

The Olympic Games have expanded steadily, which can be seen in the
records of the nations participating. The number has grown from 14
nations in 1896 in Athens to 207 nations in 2016 at the Rio
Olympics [1]. This international sporting event, where thousands
of athletes from various countries compete in various sports every
four years, has experienced enough growth that we can begin to ask
questions about the evolution of the Olympics based on gender
participation, or about athletes' performance and results based on
basic biological information [2].

2 Design and Methodology

The dataset provided consists of 271,116 records (one per athlete
per event) with 15 attributes.

The dataset was obtained by the user 'rgriffin' by scraping and
wrangling the data from a website dedicated to the collection of
sport statistics. The collection includes all games from Athens
1896 to Rio 2016. Another file called "noc_regions.csv" was
provided as well; however, we made the decision to drop the file.
The National Olympic Committee (NOC) regions file simply defines
which region each
NOC is associated with; however, the original "athlete_events.csv"
file already contains a column for the NOC that each athlete is
associated with.

We did further preprocessing of the data by selecting attributes
we deemed relevant, such as Sex, Age, Weight, Height, Sport, and
Medal. We made the decision to remove ID, Name, Games, City, and
Event. These were removed based on the idea that personal
identifying information would not be useful in any predictions or
data analysis. The Event column was removed because it splits the
Sport column into the specific events within each sport. For
example, the swimming tag would be represented as Swimming Men's
200 Meter Breaststroke, Swimming Men's 400 Meter Breaststroke, and
so forth in the Event column. Therefore, we made the decision to
drop the column.

The dataset came with null values that had to be resolved. We
identified them by checking for null values in each column of the
dataset. Our results were 9,474 for Age, 60,171 for both Weight
and Height, 3,578 for Sport, and 231,333 for Medal. The Medal
column returned so many null values because the dataset uses the
tags Gold, Silver, and Bronze for medalists and a null tag for
non-medalists. The decision was made to give non-medalists the
"NoMedal" string value to make further data analysis easier.

At the end of our data preprocessing step, we came out with a
total of 206,165 records with the attributes Sex, Age, Weight,
Height, NOC, Year, Season, Sport, and Medal.

For predictions, we wanted to plug our data into several different
algorithms after a train/test split. We chose RandomForest,
Logistic Regression, Support Vector Machine (SVM), and K-Nearest
Neighbor (KNN). Because non-medalists far outnumber medalists, we
under-sampled to help with our predictions.
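
The preprocessing described in this section (select columns, count nulls, relabel null medals, resolve the remaining nulls) can be sketched in plain Python. This is a minimal sketch, not the project's actual code: the three sample rows are made up, `count_nulls` and `preprocess` are hypothetical helpers, and dropping every row that still has a missing value is an assumption about how the remaining nulls were resolved.

```python
# Columns kept after preprocessing; ID, Name, Games, City, and Event are dropped.
KEPT_COLUMNS = ["Sex", "Age", "Weight", "Height", "NOC",
                "Year", "Season", "Sport", "Medal"]

# Made-up rows shaped like athlete_events.csv (None marks a null value).
SAMPLE_ROWS = [
    {"ID": 1, "Name": "Athlete A", "Sex": "M", "Age": 24, "Height": 180,
     "Weight": 80, "NOC": "USA", "Games": "1992 Summer", "Year": 1992,
     "Season": "Summer", "City": "Barcelona", "Sport": "Swimming",
     "Event": "Swimming Men's 200 Meter Breaststroke", "Medal": "Gold"},
    {"ID": 2, "Name": "Athlete B", "Sex": "F", "Age": 22, "Height": 168,
     "Weight": 60, "NOC": "GBR", "Games": "2012 Summer", "Year": 2012,
     "Season": "Summer", "City": "London", "Sport": "Athletics",
     "Event": "Athletics Women's 100 Meter", "Medal": None},
    {"ID": 3, "Name": "Athlete C", "Sex": "M", "Age": None, "Height": 175,
     "Weight": 70, "NOC": "FRA", "Games": "1994 Winter", "Year": 1994,
     "Season": "Winter", "City": "Lillehammer", "Sport": "Biathlon",
     "Event": "Biathlon Men's 10 Kilometer Sprint", "Medal": None},
]


def count_nulls(rows, columns):
    """Count missing (None) values per column."""
    return {col: sum(1 for row in rows if row[col] is None) for col in columns}


def preprocess(rows):
    """Keep the selected columns, label non-medalists, drop incomplete rows."""
    cleaned = []
    for row in rows:
        kept = {col: row[col] for col in KEPT_COLUMNS}
        # A null Medal means the athlete did not win one.
        if kept["Medal"] is None:
            kept["Medal"] = "NoMedal"
        # Assumption: any row still missing a value is dropped.
        if any(value is None for value in kept.values()):
            continue
        cleaned.append(kept)
    return cleaned


null_counts = count_nulls(SAMPLE_ROWS, KEPT_COLUMNS)
cleaned_rows = preprocess(SAMPLE_ROWS)
print(null_counts)
print(len(cleaned_rows), "rows kept")
```

Relabeling Medal before dropping rows matters: otherwise every non-medalist would be discarded along with the genuinely incomplete rows.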

3 Results

Before we began the predictions, we did exploratory data analysis
of the dataset. We wanted to view the distribution of different
kinds of sport records across the whole dataset.

The distribution clearly shows many more records of 'athletic'
events, followed by 'swimming' and 'gymnastics', with far smaller
amounts for the rest of the categories. Furthermore, we wanted to
view the number of athletes per NOC tag.

This distribution shows that the United States had more attending
athletes than any other country over time, but many of the other
countries are somewhat similar to one another. We also wanted to
view a quick descriptive printout for the numerical attributes.
The 25th and 50th percentiles show that most athletes, at least
overall, are around 21-24 years of age, 168-175 centimeters, and
about 60-70 kilograms.

We also wanted to view the distribution of ages for males and
females over the years through their quartiles. As seen on the box
and whisker plot, the average age is centered around the early
20s.

Furthermore, we wanted to view the attendance of males and females
over the years. It's clear there is an imbalance in male to female
attendance for both the summer and winter events. Therefore, we
decided to split the data for the two sexes to see if it makes a
difference in our predictions.
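
The summaries used in this exploratory step (records per sport, athletes per NOC, attendance by sex per year, and quartiles of a numeric attribute) come down to counting and quantiles. The sketch below uses made-up records and only the Python standard library. The record layout and variable names are assumptions for illustration, not the project's code.

```python
from collections import Counter
from statistics import quantiles

# Made-up records for illustration: (Sex, Age, Year, NOC, Sport).
RECORDS = [
    ("M", 21, 2012, "USA", "Athletics"),
    ("F", 24, 2012, "USA", "Swimming"),
    ("M", 23, 2012, "GBR", "Athletics"),
    ("F", 22, 2016, "USA", "Gymnastics"),
    ("M", 27, 2016, "FRA", "Athletics"),
    ("M", 30, 2016, "GBR", "Swimming"),
]

# Number of records per sport and per NOC tag.
sport_counts = Counter(sport for sex, age, year, noc, sport in RECORDS)
noc_counts = Counter(noc for sex, age, year, noc, sport in RECORDS)

# Attendance per (year, sex), the numbers behind a male/female comparison.
attendance = Counter((year, sex) for sex, age, year, noc, sport in RECORDS)

# 25th, 50th, and 75th percentiles of age, with linear interpolation.
ages = [age for sex, age, year, noc, sport in RECORDS]
q1, median, q3 = quantiles(ages, n=4, method="inclusive")

print(sport_counts.most_common())
print(noc_counts.most_common())
print(sorted(attendance.items()))
print(q1, median, q3)
```

The same quartiles are what a box and whisker plot draws as the edges and center line of each box.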

We believe Height and Weight played a vital role as well, so we
wanted to see if a trend existed between them in our data. Our
visualization showed there was a trend, but it was not too
extreme.
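
One way to put a number on such a trend is the Pearson correlation coefficient, r = cov(x, y) / (std(x) * std(y)), which ranges from -1 to 1. The sketch below computes it by hand for a few made-up height and weight pairs. The data and the `pearson` helper are illustrative assumptions, and the result is not meant to match the values reported in this paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up measurements for five athletes.
heights_cm = [160, 168, 175, 182, 190]
weights_kg = [55, 70, 62, 85, 80]

r = pearson(heights_cm, weights_kg)
print(round(r, 2))
```

A value near 1 would mean weight rises almost linearly with height; a moderate value indicates a visible but loose trend.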

For our predictions, we wanted to compare the results from the
various algorithms we mentioned. The first run includes both males
and females. We plugged the data into the algorithms after a
train/test split. The scores were low except for KNN, which
managed to get an average score of 59%. We also included the
correlation matrix, in which age doesn't correlate much.

The confusion matrix did turn out better and showed that the
ability to predict differs per medal, or lack thereof.

Next, we wanted to analyze the predictions for males only. The
results show a slightly lower average score for KNN, but they are
fairly close to being the same as before. The correlation between
Height and Weight also remains very similar to the last attempt,
at about 0.66 compared to 0.65.

Furthermore, the results are reasonably similar for the confusion
matrix, with "non-medalists" still being under 50%.
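
The prediction workflow used in this section (under-sample the majority class, split into train and test sets, fit a classifier, then read a confusion matrix) can be sketched with a hand-written K-Nearest Neighbor classifier. Everything here is an assumption for illustration: the data is made up and deliberately easy to separate, the helper names are hypothetical, and k=3 with a 25% test split are arbitrary choices. It stands in for only one of the four algorithms named above, and its score says nothing about the real dataset.

```python
import random
from collections import Counter
from math import dist


def undersample(samples, rng):
    """Randomly shrink every class to the size of the smallest class."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append((features, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def train_test_split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split off a test portion."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=3):
    """Majority label among the k training points nearest to `features`."""
    nearest = sorted(train, key=lambda s: dist(s[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def confusion_matrix(train, test, k=3):
    """Count (actual, predicted) label pairs over the test set."""
    matrix = Counter()
    for features, actual in test:
        matrix[(actual, knn_predict(train, features, k))] += 1
    return matrix


# Made-up (height_cm, weight_kg) samples: 40 non-medalists, 10 medalists.
no_medal = [((160 + i % 10, 55 + i % 7), "NoMedal") for i in range(40)]
medal = [((185 + i % 10, 85 + i % 7), "Medal") for i in range(10)]

rng = random.Random(0)  # fixed seed so the run is repeatable
balanced = undersample(no_medal + medal, rng)
train, test = train_test_split(balanced, 0.25, rng)
matrix = confusion_matrix(train, test)
accuracy = sum(n for (actual, predicted), n in matrix.items()
               if actual == predicted) / len(test)

print(Counter(label for _, label in balanced))
print(dict(matrix))
print(accuracy)
```

Under-sampling before the split keeps the classifier from simply predicting the majority class, at the cost of discarding most non-medalist records.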

Next, the female-only results proved to be the best, but only by a
small margin. There's a slightly higher average score for KNN, and
the confusion matrix results were all above 50% for each category.

Finally, the top NOCs by athlete count had lower results. It is
apparent that there was not a lot of variety in the results so
far, so our attempts to split the data to get different results
didn't help much. In this section we wanted to see the top 10 NOCs
by total number of athletes over the years, which are the most
active committees. The printout of the results was as follows:

Overall, the results were lower than they were for the unmodified
dataset, and the model was even worse at identifying those who
wouldn't get a medal, leading to a lot of false positives for the
medal categories.
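
The subsets compared in this section (males only, females only, and athletes from the most active committees) are simple filters over the cleaned records. The sketch below shows one way to build them. The records are made up and the helper names are hypothetical. The paper keeps the top 10 NOCs; the example keeps the top 2 only because the sample is tiny.

```python
from collections import Counter

# Made-up cleaned records for illustration.
RECORDS = [
    {"Sex": "M", "NOC": "USA", "Medal": "Gold"},
    {"Sex": "F", "NOC": "USA", "Medal": "NoMedal"},
    {"Sex": "F", "NOC": "USA", "Medal": "Silver"},
    {"Sex": "M", "NOC": "GBR", "Medal": "NoMedal"},
    {"Sex": "F", "NOC": "GBR", "Medal": "Bronze"},
    {"Sex": "M", "NOC": "FRA", "Medal": "NoMedal"},
]


def subset_by_sex(records, sex):
    """Keep only the records of one sex ("M" or "F")."""
    return [r for r in records if r["Sex"] == sex]


def top_nocs(records, n):
    """Return the n NOC tags with the most records, most active first."""
    counts = Counter(r["NOC"] for r in records)
    return [noc for noc, _ in counts.most_common(n)]


def subset_by_nocs(records, nocs):
    """Keep only the records whose NOC is in the given collection."""
    wanted = set(nocs)
    return [r for r in records if r["NOC"] in wanted]


males = subset_by_sex(RECORDS, "M")
females = subset_by_sex(RECORDS, "F")
most_active = top_nocs(RECORDS, 2)
top_noc_records = subset_by_nocs(RECORDS, most_active)

print(len(males), len(females))
print(most_active, len(top_noc_records))
```

Each subset can then go through the same under-sampling, train/test split, and scoring steps as the full dataset, so the results stay comparable.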

4 Related Works

The paper "Analyzing Sports Training Data with Machine Learning
Techniques" by Purdue University students dove deeper into the
machine learning aspect to improve the training and coaching of
their Women's Soccer Team [4]. Their data preprocessing included
making the data anonymous and rearranging the data according to
players versus according to training drills. There was player data
corresponding to unique individuals and drill data that was
averaged out across all players. Furthermore, they normalized
features into a common range.

5 Conclusion

Overall, it was a learning experience doing hands-on exploratory
data analysis. Various useful data visualization and machine
learning libraries were used. For instance, we had to learn more
about a useful visualization library named Seaborn to supplement
our exploratory analysis section. There is no doubt the knowledge
learned during this project will be incredibly useful later on. It
should be noted that there are many different combinations of
features that could be used in this project and that may give
better predictions. For now, the experience attained during this
project was a pleasant one.

6 References

[1] "Rio 2016." International Olympic Committee, 17 Apr. 2018,
www.olympic.org/rio-2016.
[2] Rgriffin. "120 Years of Olympic History: Athletes and
Results." Kaggle, 15 June 2018,
www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results.
[3] VanderPlas, Jake. "Visualization with Seaborn." Python Data
Science Handbook,
jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html.
[4] Mahfuz, Rehana, et al. "Analyzing Sports Training Data with
Machine Learning Techniques." Semantic Scholar,
www.semanticscholar.org/paper/Analyzing-Sports-Training-Data-with-Machine-Mahfuz-Mourad/8a7a774e2aa3410575e0137071ed591fd65d1f78.

7 Appendix

Final Project Code ........ A
