Olympic History: Athletes and Results Data Analysis
ABSTRACT
The Olympics is an international sporting event. Participation in the event has grown from 241 athletes to 11,500 as of the most recent Olympics [1]. Given the historical data from across the Olympics, the odds of winning a medal (gold, silver, or bronze) could perhaps be estimated from a few biological attributes of the athletes. Therefore, we decided to do exploratory data analysis so that we could visualize patterns within the dataset. Furthermore, we wanted to predict whether an athlete would win a medal based on those few attributes. The dataset was provided by the Kaggle user ‘rgriffin’ under “120 years of Olympic History: Athletes and Results.”
1 Introduction
2 Design and Methodology
The dataset provided consists of 271,116 unique athletes with 15 attributes.
[...] NOC is associated with; however, the original “athlete_events.csv” file already contains a column for the NOC that the athletes are associated with.
[...] Silver, and Bronze for medalists and a null tag for non-medalists. We decided to give non-medalists the “NoMedal” string value to make further data analysis easier.
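A minimal sketch of this relabeling step, assuming the data are held in a pandas DataFrame. The toy frame and the variable names are illustrative and are not taken from the original analysis; only the Medal column and the “NoMedal” label come from the text above.

```python
import pandas as pd

# Toy stand-in for a few rows of the dataset (athlete names are hypothetical).
events = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Medal": ["Gold", None, "Bronze", None],  # None plays the role of the null tag
})

# Give non-medalists an explicit string label instead of a null value.
events["Medal"] = events["Medal"].fillna("NoMedal")

# Count how many rows fall under each label.
medal_counts = events["Medal"].value_counts().to_dict()
print(medal_counts)
```

With an explicit label, non-medalists show up in counts and group-by operations rather than being silently skipped as missing values.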
We did further preprocessing of the data by selecting the attributes we deemed relevant, such as Sex, Age, Weight, Height, Sport, and Medal. We decided to remove ID, Name, Games, City, and Event. These were removed on the grounds that personally identifying information would not be useful in any predictions or data analysis. The Event column was removed because it subdivides the Sport column into the specific events within each sport. For example, swimming would appear in the Event column as Swimming Men’s 200 Meter Breaststroke, Swimming Men’s 400 Meter Breaststroke, and so forth. Therefore, we decided to drop the column.
At the end of our data preprocessing step, we were left with a total of 206,165 unique athletes with the attributes Sex, Age, Weight, Height, NOC, Year, Season, Sport, and Medal.
For predictions, we wanted to try several algorithms on our data after a train/test split. We chose Random Forest, Logistic Regression, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). Because of the imbalance between medalists and non-medalists, we under-sampled to help with our predictions.
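The column removal, under-sampling, train/test split, and the four classifiers described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the original pipeline: the data are synthetic, the feature set (Sex, Age, Height, Weight), the binary medalist target, the 75/25 split, the choice to under-sample before splitting, and all hyperparameters are assumptions, and scikit-learn is assumed as the modelling library.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the dataset; every value here is made up.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "ID": np.arange(n),
    "Name": [f"Athlete {i}" for i in range(n)],
    "Games": "2016 Summer",
    "City": "Rio de Janeiro",
    "Event": "Swimming Men's 200 Meter Breaststroke",
    "Sex": rng.choice(["M", "F"], size=n),
    "Age": rng.integers(16, 40, size=n),
    "Height": rng.normal(175, 10, size=n),
    "Weight": rng.normal(70, 12, size=n),
    "Medal": rng.choice(["Gold", "Silver", "Bronze", "NoMedal"],
                        size=n, p=[0.05, 0.05, 0.05, 0.85]),
})

# Drop the columns that the text removes.
data = data.drop(columns=["ID", "Name", "Games", "City", "Event"])

# Assumed binary target: 1 for any medal, 0 for "NoMedal".
data["Medalist"] = (data["Medal"] != "NoMedal").astype(int)

# Random under-sampling: keep all medalists and an equally sized
# random sample of non-medalists.
medalists = data[data["Medalist"] == 1]
non_medalists = data[data["Medalist"] == 0].sample(n=len(medalists), random_state=0)
balanced = pd.concat([medalists, non_medalists])

# Numeric features, with Sex encoded as a 0/1 indicator.
X = balanced[["Age", "Height", "Weight"]].assign(
    IsMale=(balanced["Sex"] == "M").astype(int)
)
y = balanced["Medalist"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training split and record its test accuracy.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
print(scores)
```

Because the synthetic features are random, the accuracies printed here carry no meaning; the sketch only shows the order of the steps. An alternative design is to under-sample the training split only, so that the test split keeps the original class balance.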
Before we began the predictions, we did exploratory data analysis of the dataset. We wanted to view the distribution of different kinds of sport records across the whole dataset.
We also wanted to view the age distribution of male and female athletes over the years through their quartiles. As seen in the box-and-whisker plot, the average age is centered around the early 20s.
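The two summaries described above, record counts per sport and age quartiles by sex and year, can be computed along these lines. This is a sketch assuming pandas; the toy records and variable names are illustrative only, and the quartile table holds the numbers that a box-and-whisker plot would draw rather than the plot itself.

```python
import pandas as pd

# Toy records; all values are made up for illustration.
records = pd.DataFrame({
    "Sex":   ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Year":  [2012, 2012, 2016, 2016, 2012, 2012, 2016, 2016],
    "Sport": ["Swimming", "Athletics", "Swimming", "Athletics",
              "Swimming", "Swimming", "Athletics", "Rowing"],
    "Age":   [20, 24, 22, 26, 19, 23, 21, 25],
})

# Distribution of records across sports.
sport_counts = records["Sport"].value_counts()

# Age quartiles for each (Year, Sex) group: one row per group,
# one column per quantile (0.25, 0.5, 0.75).
age_quartiles = (
    records.groupby(["Year", "Sex"])["Age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(sport_counts)
print(age_quartiles)
```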
4 Related Works
5 Conclusion
6 References