
Mini Project Documentation


Table of contents:

CONTENTS

Ch.0 ABSTRACT
Ch.1 INTRODUCTION
1.1 Problem statement
1.2 Objectives
1.3 Motivations
1.4 Existing systems
1.5 Proposed system
1.6 Scope
Ch.2 LITERATURE SURVEY
Ch.3 SYSTEM REQUIREMENT SPECIFICATION
3.1 Hardware Requirements
3.2 Software Requirements
Ch.4 ARCHITECTURE OF PROPOSED SYSTEM
Ch.5 IMPLEMENTATION
5.1 Algorithm
5.2 Required Modules/Libraries/Framework
5.3 Installation
5.4 Datasets
Ch.6 APPLICATIONS OVERVIEW AND RESULTS
6.1 Home
6.2 Exploratory Data Analysis
6.3 Data Preprocessing
6.4 Trends
6.5 Prediction
Ch.7 CONCLUSIONS AND FUTURE SCOPE
Ch.8 REFERENCES
APPENDIX
Ch.0 Abstract
The Olympics is one of the leading sporting events, and this project revolves around
performing careful data analytics operations on the data collected from it. For
this objective, two datasets that contain information about the various events
and the participating athletes have been analyzed. This project is based on
Descriptive and Predictive forms of Analytics.

Ch.1 Introduction
The modern Olympic Games, or Olympics, are leading international sports events
featuring summer and winter sports competitions in which thousands
of athletes from around the world participate in a variety of competitions. The
‘modern Olympics’ comprises all the Games from Athens 1896 to Rio 2016. The
Olympic Games are considered the world’s foremost sports competition with more
than 200 nations participating.

The Olympics is more than just a quadrennial multi-sport world championship. It is a
lens through which to understand global history, including shifting geopolitical
power dynamics, women’s empowerment, and the evolving values of society.
Therefore, it is essential to analyze the Olympics data to determine various
relationships between the athletes involved and their participation in the events.

1.1 Problem Statement


The problem statement revolves around identifying the trends and relationships
between attributes of the participating athletes. For this purpose, the Athlete Events
dataset, containing a total of 15 attributes describing each athlete, has
been considered. The project also includes a basic predictive analysis of the BMI values of athletes
and their sports. For this purpose the Athlete BMI dataset has been imported.
The results must be embedded into an application (or) interface.

1.2 Objectives
The objective of our analysis is to answer the following questions, among others:
1. How does the weight of an athlete depend on his/her height?
2. What is the total number of medals won by male and female athletes?
3. What is the participation trend in the Summer and Winter Seasons?
4. Which countries have the most medals?
5. Which athletes have the most medals?
6. Which countries won the most gold medals in a specific year?
7. Predict the weight of an athlete, given the height.
8. Predict the sport a person is apt for, depending on his/her BMI value.
9. Analyze women's participation over the years.

1.3 Motivation
The motivation for this project lies in our curiosity about the field of Data
Analytics. Also, previously documented analyses of the Olympics
data have continuously motivated us to perform better.

1.4 Existing System


• Usually, the analytics part is done using Jupyter Notebook, Google Colab and
other tools.
• Those environments present cells containing code that
produces output when run.
• A serious drawback is the lack of a proper interface that would make the results
more appealing to look at.

• Existing System Drawbacks:
1. No easy interface.
2. Code visibility.
3. Not appealing to the users.

1.5 Proposed system


• A simple application (or) Graphical User Interface that would act as an
intermediary between the results and the users.
• The coding is done in Python: Streamlit has been used for building the
application, and libraries such as NumPy, pandas, and matplotlib
have been used for the analytics part.

1.6 Scope
• This project is an interactive application that helps users learn more about the
Olympics.
• It's especially useful to the Olympics managing authorities.
• It can lay the foundations for building more Data Analytics applications that
present results easily and are user-friendly.

Ch.2 Literature Survey
1)
Ø Title : Performance Analysis in Olympic Games using Exploratory Data
Analysis Techniques

Ø Authors: Yamunathangam.D, Kirthicka.G, Shahanas Parveen

Ø Year : Jan 2019

Ø Observations : Application of Excellent visualization techniques


Ø Limitations : Lack of Interface

2)
Ø Title : 120 years of Olympic Games — How to analyze and visualize the
history with R

Ø Author : Saul Buentello

Ø Year: Aug 1 2021


Ø Observations : Usage of R programming, appealing visualization charts.
Ø Limitations: No interface, data preprocessing, or documentation.

3)
Ø Title : Analyzing Evolution of the Olympics by Exploratory Data Analysis
using R
Ø Author(s): Rahul Pradhan, Karthik Agrawal, Anubhav Bag
Ø Year: March 2021
Ø Observations: Neat presentation, proper vision, excellent findings and
documentation.
Ø Limitations: Lack of Interface, unappealing Visualization.

Ch.3 System Requirement specification
3.1 Hardware requirements are:

Ø Processor: Pentium V (or) higher.

Ø RAM: 1GB

Ø Space on Hard disk: minimum 512MB

3.2 Software requirements are:

Ø Web browser/engine: Google Chrome (or) IE.

Ø Python libraries (Anaconda environment): matplotlib, plotly, numpy, pandas, re, sklearn, seaborn

Ø PC running Windows 7 (or) later

Ø Streamlit framework

Ch.4 Architecture of Proposed system

• The proposed system would have five main components.

• First consider the basic layout

• The five components are: Home, Exploratory Data Analysis, Data
preprocessing, Trends and Prediction.

• Consider the UML notation, with datasets:

Ch.5 Implementation

5.1 Algorithm

Linear regression
o Linear Regression is a machine learning algorithm based on supervised
learning. It performs a regression task.
o Regression models a target prediction value based on independent variables.
o It is mostly used for finding out the relationship between variables and
forecasting.
o Different regression models differ based on – the kind of relationship between
dependent and independent variables they are considering, and the number
of independent variables getting used.

o Linear regression performs the task to predict a dependent variable value (y)
based on a given independent variable (x).
o So, this regression technique finds out a linear relationship between x (input)
and y(output). Hence, the name is Linear Regression.
o In the figure above, X (input) is the work experience and Y (output) is the
salary of a person.
o The regression line is the best fit line for our model.
Hypothesis function for Linear Regression : y = θ1 + θ2·x

While training the model we are given :


x: input training data (univariate – one input variable(parameter))
y: labels to data (supervised learning)

When training the model – it fits the best line to predict the value of y for a given
value of x. The model gets the best regression fit line by finding the best θ1 and
θ2 values.
θ1: intercept
θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best fit line. So when we are
finally using our model for prediction, it will predict the value of y for the input
value of x.

How to update θ1 and θ2 values to get the best fit line ?

Cost Function (J):


By achieving the best-fit regression line, the model aims to predict the y value such
that the error between the predicted value and the true value is minimum. So,
it is very important to update the θ1 and θ2 values to reach the values that
minimize the error between the predicted y value (pred) and the true y value (y).

The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE)
between the predicted y value (pred) and the true y value (y) over the n training examples:
J = sqrt( (1/n) · Σ (pred_i − y_i)² )

Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE
value) and achieve the best-fit line, the model uses Gradient Descent. The idea is
to start with random θ1 and θ2 values and then iteratively update them until
the minimum cost is reached.
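
The update procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch and not the project's code: the function name fit_line, the learning rate, the iteration count and the zero starting point are all assumptions. It minimizes the mean squared error, which has the same minimizer as the RMSE. Note also that scikit-learn's LinearRegression, which the application uses, solves the least-squares problem directly instead of running gradient descent.

```python
import numpy as np

def fit_line(x, y, lr=0.05, n_iter=5000):
    """Fit y ~ theta1 + theta2 * x by gradient descent on the mean squared error.

    lr and n_iter are illustrative defaults, not values from the project.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta1, theta2 = 0.0, 0.0              # arbitrary starting point (assumption)
    n = len(x)
    for _ in range(n_iter):
        pred = theta1 + theta2 * x         # hypothesis: intercept + slope * x
        error = pred - y
        grad1 = (2.0 / n) * error.sum()        # d(MSE)/d(theta1)
        grad2 = (2.0 / n) * (error * x).sum()  # d(MSE)/d(theta2)
        theta1 -= lr * grad1               # step against the gradient
        theta2 -= lr * grad2
    return theta1, theta2

# Toy data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
theta1, theta2 = fit_line(x, y)
```

On this toy data the recovered intercept and slope approach 1 and 2. A learning rate that is too large for the scale of x makes the updates diverge, which is one reason features are often rescaled first.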

5.2 Required Modules/Libraries/Framework


The Required Modules/Libraries/Framework are:

Numpy:
• NumPy arrays offer efficient mathematical operations on large
amounts of data.
• NumPy makes the execution of such projects much easier and hassle-free.

• NumPy provides masked arrays along with general array objects.

• It also comes with functionality such as shape manipulation,
discrete Fourier transforms, general linear algebra, and more.

• When you change the shape of an N-dimensional array, NumPy returns a
view of the same data where possible and copies the data only when it has to.

• The package provides useful tools for integration. You can easily
integrate NumPy with code written in C, C++, and Fortran.

• NumPy provides functionality comparable to MATLAB. Both
allow users to work faster with array operations.

Pandas:

• Pandas provides the Series and DataFrame data structures. It allows you to easily
organize, explore, represent, and manipulate data.

• Smart alignment and indexing in Pandas help with
organizing and labeling data.

• Pandas has features for handling missing data or
values appropriately.

• The package allows such clean code that even people with little or no
knowledge of programming can work with it.

• It provides a collection of built-in tools for reading and writing
data in different file formats, data structures, and databases.

• Pandas supports JSON, Excel, CSV, HDF5, and many other formats. You
can also merge data from different sources with Pandas.
Streamlit:

• Streamlit is a free and open-source framework to rapidly build and share
beautiful machine learning and data science web apps.

• It is a Python-based library specifically designed for machine learning
engineers.

• Data scientists or machine learning engineers are not web developers and
they're not interested in spending weeks learning to use web frameworks to
build web apps.

• Instead, they want a tool that is easier to learn and to use, as long as it can
display data and collect needed parameters for modeling.

• Streamlit allows you to create a stunning-looking application with only a few
lines of code.

Matplotlib:

• Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python.

• Matplotlib makes easy things easy and hard things possible. With it you can:

o Create publication-quality plots.
o Make interactive figures that can zoom, pan, and update.
o Customize visual style and layout.
o Export to many file formats.
o Embed in JupyterLab and Graphical User Interfaces.
o Use a rich array of third-party packages built on Matplotlib.

Plotly:

• The Plotly Python library is an open-source library that can be used for data
visualization and for understanding data simply and easily.
• Plotly supports various types of plots like line charts, scatter plots, histograms,
box plots, etc.
• Plotly has hover tool capabilities that allow us to detect outliers or
anomalies in a large number of data points.
• Its visually attractive output appeals to a wide range of audiences.
• It allows extensive customization of graphs, which makes plots
more meaningful and understandable for others.

Seaborn:

• Seaborn is a visualization library for statistical graphics plotting in
Python.
• It provides beautiful default styles and color palettes to make statistical plots
more attractive.
• It is built on top of the matplotlib library and is closely integrated with the
data structures from pandas.
• Seaborn aims to make visualization the central part of exploring and
understanding data.
• It provides dataset-oriented APIs, so that we can switch between different
visual representations of the same variables for a better understanding of the dataset.

Scikit-Learn:

• This module is a simple and efficient tool for predictive data analysis.

• It is accessible to everybody and reusable in various contexts.

• The library is focused on modeling data. It is not focused on loading,
manipulating and summarizing data. For these features, refer to NumPy and
Pandas.

• The functionality that scikit-learn provides includes:

o Regression, including Linear and Logistic Regression
o Classification, including K-Nearest Neighbors
o Clustering, including K-Means and K-Means++
o Model selection
o Preprocessing, including Min-Max Normalization

5.3 Installation

1. Anaconda Navigator:
Anaconda Navigator is installed as follows:

§ Anaconda is open-source software that contains Jupyter, Spyder, etc., which
are used for large-scale data processing, data analytics, and heavy scientific computing.
§ Anaconda works for the R and Python programming languages. Spyder (a
sub-application of Anaconda) is used for Python.
§ OpenCV for Python will work in Spyder. Package versions are managed by the
package management system called conda.
§ To begin working with Anaconda, one must get it installed first.
§ Follow the instructions below to download and install Anaconda on your
system:
Download and install Anaconda:
Head over to anaconda.com and install the latest version of Anaconda.
Make sure to download the “Python 3.7 Version” for the appropriate
architecture.

• Begin with the installation process:


Ø Getting Started:

Ø Getting through the License Agreement:

Ø Select Installation Type: Select Just Me if you want the software
to be used by a single User

Ø Choose Installation Location:

Ø Advanced Installation Option:

Ø Getting through the Installation Process:

Ø Recommendation to Install Pycharm:

Ø Finishing up the Installation:

• Working with Anaconda:
Once the installation process is done, Anaconda can be used to
perform multiple operations. To begin using Anaconda, search for
Anaconda Navigator from the Start Menu in Windows

2. Setting up new Environment in Conda:

• Open Home of Anaconda Navigator

• Go to Environments

• Click on create at the bottom, and name the environment as shown

• A new environment will be created with the name given

• Click "Open Terminal" to install any modules or packages

3. Download matplotlib:

• To download Matplotlib click on 'Open Terminal' as shown above, and
enter the command: pip install matplotlib
• The library will be successfully installed.

4. Download seaborn:
• To download Seaborn click on 'Open Terminal' as shown above, and
enter the code: pip install seaborn
• The library will be successfully installed.

5. Download sklearn:
• To download scikit-learn (imported as sklearn) click on 'Open Terminal' as shown above, and
enter the command: pip install scikit-learn

6. Download Streamlit:
• To download streamlit click on 'Open Terminal' as shown above, and
enter the code: pip install streamlit
• The framework will be successfully installed.

• Similarly, plotly can be installed using pip install plotly

7. Running the Code:


• First, write the code (provided in the Appendix) in any text editor (like
Visual Studio, Atom, or Notepad).
• Then save the code with a .py extension, like file.py.
• Then open the terminal as shown above and type streamlit run file.py
• Then a new browser window opens that runs the application.

5.4 Datasets
• Datasets are imported in CSV file format.
• Two datasets are imported.
• One is the athlete_events CSV file, the other the athlete_BMI file.
• Athlete Events dataset
• It has been imported as a CSV file from the source
https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
• It contains a total of 271116 rows (tuples), mapped over 15 attributes or
columns.
• Each row corresponds to an individual athlete competing in an individual
Olympic event (athlete-events). The columns are:
ID - Unique number for each athlete
Name - Athlete's name
Sex - M or F
Age - Integer
Height - In centimeters
Weight - In kilograms
Team - Team name
NOC - National Olympic Committee 3-letter code
Games - Year and season

Year - Integer
Season - Summer or Winter
City - Host city
Sport - Sport
Event - Event
Medal - Gold, Silver, Bronze, or NA

• Athlete BMI dataset


• This dataset has been created by us by taking three attributes:
o Athlete name
o Athlete BMI
o Athlete's concerned sport (represented as integers)
• The sports and their integer values are:
o Marathon - 1
o Basketball - 2
o Rugby - 3
o Shot put - 4
• The dataset contains twenty tuples; it is given in the
Appendix.

Ch.6 Applications Overview and Results

The Application or the interface contains five pages, namely:


1. Home
2. Exploratory Data Analysis
3. Data Preprocessing
4. Trends
5. Prediction

6.1 Home:
• The Home page of the data application contains a basic introduction to the Olympics
application, along with its logo of five interlocking rings.
• The Home page also contains a navigation bar that links
all five pages.

• Moreover, it helps in identifying the current page and in navigating
to other pages.

• There are two Load buttons:


1. Load Athlete Events dataset –
• Which on click loads the Athlete Events dataset and prints the
message 'Loaded Successfully'.
• This button is also helpful to present the users with a flowchart that
guides them during the exploration of the application.

2. Load Athlete BMI dataset -
• Body mass index (BMI) is a value derived from the mass (weight)
and height of a person.
• The BMI is defined as the body mass divided by the square of
the body height, and is expressed in units of kg/m², resulting from
mass in kilograms and height in metres.
• BMI = weight in kg / (height in m)^2
• These BMI values of athletes, and the corresponding sports, are used to
predict the apt sport for a test case.
• The dataset concerned is given in the Appendix.
• This button, on click, gives a flowchart that guides users to the
prediction page.
• Note that this dataset doesn't require preprocessing.
• The result is shown below:

• The code to create the Home page is given in the Appendix.
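
As a worked illustration of the BMI formula given above, here is a minimal sketch. The function name bmi and the sample values are hypothetical. The height is taken in centimetres, as in the Athlete Events dataset, and converted to metres before squaring.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kilograms and height in centimetres."""
    height_m = height_cm / 100.0       # the formula expects metres
    return weight_kg / height_m ** 2

# A 70 kg athlete who is 175 cm tall: 70 / 1.75^2
value = bmi(70.0, 175.0)
```

For these sample values the result is about 22.9 kg/m².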

6.2 Exploratory Data Analysis:


• Exploratory Data Analysis (EDA) is an approach to analyzing data using
visual techniques.
• It is used to discover trends and patterns, or to check assumptions, with the
help of statistical summaries and graphical representations.
• The next two pages, Data Preprocessing and Trends, are also part of
Exploratory Data Analysis.
• For simplicity, let's assume this page just explores the
dataset (Athlete Events) and provides statistical information if asked by the
user.
• This page contains five Exploration checkboxes, which the user can click to see
the results.
• Every checkbox has a query embedded into it, which makes it more
appealing to the user.
• Here is the screenshot of the five checkboxes:

• The first checkbox, Show Dataset, when clicked presents the user the
entire dataset with the help of scroll bar.

• The second checkbox, First 5 values of the dataset, returns the head of the
dataset. For this purpose df.head() has been used.

• The third checkbox, Get the total number of Rows and Columns, uses the
shape attribute to return the total number of data tuples and attributes.

• The fourth checkbox, Show the Statistical Information of the
Columns/Attributes, when clicked uses the df.describe() function to
present the user with statistical information such as Count, Mean,
Standard deviation, 25%, 50%, 75%, Max and Min of the numeric
attributes in the dataset.
• In this dataset, numeric attributes are ID, Weight, Height, Age, and Year.

• The fifth and final checkbox, Null values in Columns, gives the total NaN
values in the dataset in each column.

• For this purpose, df.isnull().sum() operation has been performed.

• Again, note that these operations are only performed on the Athlete Events
dataset; the BMI dataset is only used for prediction.
• The code for this is given in the Appendix.
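
The pandas calls behind the checkboxes above can be tried on a small stand-in table. This is a sketch only: the four rows below are invented for illustration and are not taken from the Athlete Events dataset, although the column names match it.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Athlete Events table (invented values).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [24.0, np.nan, 31.0, 27.0],
    "Height": [180.0, 170.0, np.nan, 165.0],
    "Medal": ["Gold", np.nan, np.nan, "Bronze"],
})

first_rows = df.head()              # first 5 rows (here: all 4)
n_rows, n_cols = df.shape           # number of tuples and attributes
summary = df.describe()             # count, mean, std, quartiles of numeric columns
null_counts = df.isnull().sum()     # NaN count per column
```

describe() skips the non-numeric Name and Medal columns by default, which is why the report lists only numeric attributes for that checkbox.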

6.3 Data Preprocessing:


• From the above exploration, it was clear that the dataset has these issues:
a. Presence of NaN values in the columns Age, Height, Weight and Medal,
which preclude efficient data analysis.
b. The Medal column should be converted to a numeric type for the analysis
part.
c. The attribute Games should be deleted on account of redundancy, since it is a
concatenation of the Year and Season columns.

• From these observations, it's clear that data cleaning should be performed.

• Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.
• Here, its application is limited to filling in the NaN values and removing
one attribute.
• The various methods for handling the problem of missing values in data
tuples include:

A. Ignoring the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This
method is not very effective unless the tuple contains several attributes
with missing values.

B. Manually filling in the missing value: In general, this approach is
time-consuming and may not be a reasonable task for large data sets with
many missing values, especially when the value to be filled in is not
easily determined.

C. Using a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown,”
or −∞. If missing values are replaced by, say, “Unknown,” then the
mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.”
Hence, although this method is simple, it is not recommended.
D. Using the attribute mean for quantitative (numeric) values or attribute
mode for categorical (nominal) values, can be a feasible approach.

• In this project, we replaced NaN with mean values for Age,
Height and Weight.
• NaN values of Medal are replaced by 0, which indicates no medal.
• Also, the Medal column has been converted to numeric form, with 1
representing Gold, 2 representing Silver and 3 representing Bronze, using the replace
function.
• On this page a total of 7 queries have been answered. The first three
checkboxes remove NaN values in the Age, Height and Weight columns and
output the dataset with the new values.
• The fourth checkbox performs the operation discussed above for the Medal
column.

• One can check the updated NaN/Null values by checking the fifth checkbox,
Updated Null Values.

• The Remove Redundant column checkbox outputs the dataset with the
Games column removed, using the drop() function. There's no Games column here:

• The polished dataset can be viewed with the aid of the last button, "Final
Dataset".
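
The cleaning steps on this page can be sketched on a small invented table. The rows below are illustrative and are not dataset values. One deliberate difference from the application code: this sketch encodes medals with Series.map, whereas the application uses replace; both yield 1, 2, 3 for Gold, Silver, Bronze, and 0 for a missing medal once NaN is filled.

```python
import numpy as np
import pandas as pd

# Invented rows with the same column names as the Athlete Events dataset.
df = pd.DataFrame({
    "Age": [24.0, np.nan, 30.0],
    "Height": [180.0, 170.0, np.nan],
    "Weight": [np.nan, 60.0, 80.0],
    "Medal": ["Gold", np.nan, "Bronze"],
    "Games": ["2016 Summer", "2016 Summer", "2014 Winter"],
})

# Fill numeric gaps with the column mean.
for col in ["Age", "Height", "Weight"]:
    df[col] = df[col].fillna(df[col].mean())

# Encode medals as integers; a missing medal becomes 0.
medal_codes = {"Gold": 1, "Silver": 2, "Bronze": 3}
df["Medal"] = df["Medal"].map(medal_codes).fillna(0).astype(int)

# Games duplicates Year + Season, so drop it.
df = df.drop(columns=["Games"])
```

Mean imputation keeps every row but pulls the filled values toward the centre of the distribution, which is worth remembering when reading the Height versus Weight trend.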

6.4 Trends:
• Trends determine the relationships between attributes and are helpful in
answering the questions presented in the Objectives section.
• The main purpose of this project is to find answers to those questions.
1) Consider the query "Analyze the relationship between the Height and
Weights of an athlete". For this purpose a simple scatter plot is used to
check the correlation between those attributes.

Ø The scatter plot's code is given in the Appendix. The user is also presented
with the option of selecting a histogram to find the relationship between
the attributes. Consider these outputs:

Ø Result: The height and weight are positively correlated.

2) Consider the query "Approximate number of males and females
participated in the Olympics".
Ø The answer is found using a bar graph from matplotlib, with the x-axis
representing Gender and the y-axis representing Count.

Ø Result: it can be inferred that approximately 200k men and 75k women
have participated in the Olympics so far.

3) Consider "Determining the participation trend in the Summer and Winter
Seasons".
Ø The answer is found using a histogram from plotly express,
with the x-axis representing Season and the y-axis representing Count.

Ø Result : Summer Olympics see more participation than Winter.

4) Consider the query "women participation over the years".

Ø The answer is illustrated in the form of a histogram from plotly
express,
with the x-axis representing the Year and the y-axis representing the count of
women.

Ø Result: from the graph it can be clearly inferred that women's
participation is increasing over the years.

5) Consider "The total number of medals won by Male and female
athletes", which is found using a histogram from plotly express.

Ø Consider the image, showing the medal count at each level (0 being no
medal, 1 being Gold, 2 being Silver, 3 being Bronze).

Ø Result : Men have won 9625 Gold, 9381 Silver and 9524 Bronze, while
women have won 3747 Gold, 3735 Silver and 3771 Bronze.

6) Consider the query "Athletes with most medals."

Ø The answer is found using the groupby() and
get_dummies() functions of pandas.
Ø Here is the list of the top 10 athletes with the most medals.
Ø Each athlete's share is represented in the form of a pie chart, as shown below,
using the matplotlib function pie().

Ø Result: Top 10 athletes with Most medals are found.
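
One way get_dummies() and groupby() can be combined for this query is sketched below. The athlete names and rows are invented, and the exact steps in the project's code may differ. The sketch assumes the raw Medal column, where a missing value means no medal.

```python
import pandas as pd

# Invented athlete-event rows; None means no medal in that event.
df = pd.DataFrame({
    "Name":  ["Ann", "Ann", "Bob", "Ann", "Bob", "Cy"],
    "Medal": ["Gold", "Silver", "Gold", "Gold", None, "Bronze"],
})

# One indicator column per medal type; rows without a medal are all zeros.
medal_flags = pd.get_dummies(df["Medal"])

# Sum the indicators per athlete, then add a total across medal types.
per_athlete = medal_flags.groupby(df["Name"]).sum()
per_athlete["Total"] = per_athlete.sum(axis=1)

# Keep the ten athletes with the most medals.
top = per_athlete.sort_values("Total", ascending=False).head(10)
```

The Total column of top is what a pie chart of each athlete's share would be drawn from.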

7) Consider the query "Countries with most medals". The answer can be
found using the get_dummies() function from pandas. (The code for
this is given in the Appendix.)
Ø Given below is the list of ten countries with the most medals.
Ø This checkbox is also used to find whether a country is in the zero
medal list.
Ø A zero medal list is the list of countries with zero Olympic
medals.
Ø It asks the user to enter a country name and to check the result. If the
country is in the zero medal list it outputs "sorry to break it to u, your
country is in the zero medal list".
Ø If the country is not in the zero medal list then it outputs "your country
has won atleast one medal so chill".
Ø If the country is not in the data set then it outputs "country not listed
in the data set so be optimistic about your country winning a medal".

Ø Result: The countries with the most medals are determined.
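
The three-way check described above can be sketched as a small function. This is an assumption-laden illustration: the function name medal_status and its return labels are hypothetical, it matches the country against the Team column (the project might use NOC instead), and it expects the raw Medal column where a missing value means no medal.

```python
import pandas as pd

def medal_status(df, country):
    """Classify a team as 'not listed', 'zero medals' or 'has medals'."""
    rows = df[df["Team"] == country]
    if rows.empty:
        return "not listed"                 # country absent from the dataset
    if rows["Medal"].notna().any():
        return "has medals"                 # at least one medal row exists
    return "zero medals"                    # listed, but never won a medal

# Invented rows for illustration.
sample = pd.DataFrame({
    "Team":  ["India", "India", "Nepal"],
    "Medal": ["Gold", None, None],
})

statuses = [medal_status(sample, c) for c in ["India", "Nepal", "Atlantis"]]
```

Each label would then be mapped to one of the three messages the application prints.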

8) Consider the query, "Countries winning the most gold medals in a specific
year."
Ø Here, an Olympic year is asked from the user, and then the year is fed
to the function, that outputs bar plot from the highest to lowest gold
winning countries.
Ø This bar plot is derived from barplot() of seaborn.

Ø Result: The countries winning the most gold medals in a specific year are
determined. For example, in 1896 Germany won the most Gold medals.
Ø An Observations checkbox is provided to note down any results acquired.
Ø The code is given in the Appendix.
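
The per-year gold ranking can be sketched with a filter followed by value_counts(). The helper name gold_by_country and the sample rows are hypothetical, and the sketch assumes the raw Medal column with the string "Gold". Because each row of the dataset is one athlete in one event, a team event contributes one gold per team member to the count.

```python
import pandas as pd

def gold_by_country(df, year):
    """Count Gold medal rows per team for one Olympic year, highest first."""
    gold = df[(df["Year"] == year) & (df["Medal"] == "Gold")]
    return gold["Team"].value_counts()

# Invented rows for illustration.
sample = pd.DataFrame({
    "Year":  [1896, 1896, 1896, 1896, 1900],
    "Team":  ["Germany", "Germany", "Greece", "Germany", "France"],
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Gold"],
})

counts = gold_by_country(sample, 1896)
```

The resulting Series, already sorted from highest to lowest, can be passed to a seaborn barplot with the team names on one axis and the counts on the other.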

6.5 Prediction:
• In this part, simple predictions are made using the Linear Regression algorithm.
• The first prediction finds the Weight of an athlete given
his/her height.
• The other prediction finds an apt sport for a person based
on his/her BMI value.
• The first prediction is done by constructing a model based on the Height and
Weight of athletes from the Athlete Events dataset.
• The second is done by constructing a model based on BMI
and the corresponding sport from the Athlete BMI dataset.
• The accuracy of the Height vs Weight model is 62 percent, while that of the BMI model is 89.87 percent.

• Consider the output for Height 178.2 cm and BMI 34.60

• The Show athlete BMI dataset button displays the corresponding dataset.
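
The BMI-to-sport prediction can be sketched with scikit-learn's LinearRegression on the twenty rows listed in the Appendix. Several choices here are assumptions rather than the project's method: the model is fitted on all rows without a train/test split, the continuous prediction is rounded to the nearest sport code and clipped to the range 1 to 4, and the helper name predict_sport is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# (BMI, sport code) pairs copied from the Athlete BMI dataset in the Appendix.
BMI_ROWS = [
    (40.0, 4), (24.0, 2), (35.9, 4), (30.6, 3), (23.8, 2),
    (17.3, 1), (23.8, 2), (22.6, 3), (21.1, 1), (28.1, 3),
    (24.9, 2), (18.6, 1), (32.2, 4), (27.8, 3), (35.1, 4),
    (20.1, 1), (30.4, 4), (25.2, 3), (19.8, 1), (24.7, 2),
]
SPORTS = {1: "Marathon", 2: "Basketball", 3: "Rugby", 4: "Shot put"}

X = np.array([[b] for b, _ in BMI_ROWS])      # one feature: BMI
y = np.array([code for _, code in BMI_ROWS])  # target: integer sport code
model = LinearRegression().fit(X, y)

def predict_sport(bmi_value):
    """Round the regression output to the nearest valid sport code."""
    raw = model.predict(np.array([[bmi_value]]))[0]
    code = int(np.clip(np.rint(raw), 1, 4))
    return SPORTS[code]

sport = predict_sport(34.60)
```

Treating the sport codes as a numeric target only works because the codes happen to be ordered by typical BMI; a classifier would avoid relying on that ordering.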

Ch.7 Conclusions and Future Scope

7.1 Conclusions:

We were able to build an interactive application that performs analytical operations
on the Olympic datasets and presents results in an easy and appealing manner. Once
the user clicks on a load dataset option, he/she is presented with a flowchart that
guides them through the data analytics journey. Therefore, we conclude that
the interface/application has been built successfully.

7.2 Future Scope:

• Improving the accuracy of prediction.

• Including more operations under each component.

• Option to contribute to the datasets.

Ch.8 References
Ø http://www.researchgate.net/publication/330847008_Performance_analysis_in_olympic_games_using_exploratory_data_analysis_techniques

Ø https://www.researchgate.net/publication/265033380_Data_mining_of_sports_performance_data

Ø https://www.researchgate.net/publication/23756788_Economics_and_Olympics_An_Efficiency_Analysis

Ø https://ieeexplore.ieee.org/abstract/document/9725496

Ø https://docs.streamlit.io/

Appendix

v CSV file of Athlete BMI dataset:


"Athlete","BMI","Sport"
"Joe Kovacs","40","4"
"Patty Mills","24","2"
"Ryan Crouser","35.9","4"
"Richie Mccaw","30.6","3"
"Goran Dragic","23.8","2"
"Brigid Kosgei","17.3","1"
"Seth Curry","23.8","2"
"Heather Moyce","22.6","3"
"Zerseney Tadese","21.1","1"
"Fernando Portugal","28.1","3"
"Kyrie Irving","24.9","2"
"Eliud Kipchoge","18.6","1"
"Valerie Adams","32.2","4"
"David Harvey","27.8","3"
"Tom Walsh","35.1","4"
"Lelisa Desisa","20.1","1"
"Ivanka Khristovia","30.4","4"
"Santiago Gomez","25.2","3"
"Abdi Nageeye","19.8","1"
"Bruce Brown","24.7","2"

v Code for the application:

import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

45
import time

st.title("Olympics Dataset Analytics")#need to change the font

nav = st.sidebar.radio("Navigation",['Home','Exploratory Data Analysis','Data


Preprocessing','Trends','Prediction'])

if nav == 'Home':
st.image('https://cdn.pixabay.com/photo/2013/02/15/10/58/blue-
81847__340.jpg',width=800)
st.write("The main theme of this app is to perform data analytics on
Olympic datasets(namely Athlete Events and Athlete BMI datasets). The
Athlete Events dataset contains historic data ranging from the Athens
Olympics 1896 to Rio 2016, while the other dataset contains data of athletes
with their BMI values and their concerned sport.")
if st.button('Load Main/Athlete Events dataset'):
st.write('Loaded successfully! Can perform analysis on it by following this
flowchart')
st.graphviz_chart("""
digraph{
Home -> ExploratoryDataAnalysis
ExploratoryDataAnalysis -> DataPreprocessing
DataPreprocessing -> Trends
Trends -> Prediction
DataPreprocessing -> Prediction
}
""")
if st.button('Load Athlete BMI dataset'):
st.write('Loaded successfully! Can perform predictions on it')
st.graphviz_chart("""
digraph{
Home -> Prediction
}
""")

data = pd.read_csv(r"C:\Users\chand\Downloads\athlete_events.csv
(1)\athlete_events.csv")
p = pd.DataFrame(data)
if nav == 'Exploratory Data Analysis':
    st.header('Exploratory Data Analysis')
    if st.checkbox("Show Dataset"):
        st.dataframe(p)
        st.write("Note that the Winter and Summer Games were held in the same "
                 "year up until 1992. After that, they staggered them such that Winter Games "
                 "occur on a four year cycle starting with 1994, then Summer in 1996, then "
                 "Winter in 1998, and so on. A common mistake people make when analyzing "
                 "this data is to assume that the Summer and Winter Games have always been "
                 "staggered.")
    if st.checkbox("Show first 5 values of the Dataset"):
        st.dataframe(p.head())
    if st.checkbox("Get the total number of Rows and Columns"):
        st.write(p.shape)
    if st.checkbox("Show the Statistical Information Of the Columns/Attributes"):
        st.write(p.describe())
    if st.checkbox("Null Values in columns"):
        d = pd.DataFrame(p.isnull().sum()).transpose()
        st.write(d)
if nav == 'Data Preprocessing':
    st.header('Data Preprocessing')
    # Missing Age, Height and Weight values are filled with the column mean.
    if st.checkbox("Remove Null Values in Age Column"):
        p['Age'] = p['Age'].fillna(p.Age.mean())
        st.dataframe(p)
    if st.checkbox("Remove Null Values in Height Column"):
        p['Height'] = p['Height'].fillna(p.Height.mean())
        st.dataframe(p)
    if st.checkbox("Remove Null Values in Weight Column"):
        p['Weight'] = p['Weight'].fillna(p.Weight.mean())
        st.dataframe(p)
    if st.checkbox('Convert Medals to Numeric datatype and Remove Null Values'):
        # Gold=1, Silver=2, Bronze=3, no medal=0
        p['Medal'] = p.Medal.replace({'Gold': 1, 'Silver': 2, 'Bronze': 3})
        p['Medal'] = p['Medal'].fillna(0)
        st.write(p)
    if st.checkbox(" Updated Null Values"):
        d = pd.DataFrame(p.isnull().sum()).transpose()
        st.write(d)
    if st.checkbox("Remove redundant column"):
        p = p.drop(['Games'], axis=1)
        st.write(p)
    if st.button("Final Dataset"):
        st.write(p)

if nav == 'Trends':
    st.header('Trends')
    if st.checkbox("Analyze the relationship between the Height and Weights of an athlete"):
        graph = st.selectbox("What kind of Plot do you want?", ['Scatter Plot', 'Histogram'])
        if graph == 'Histogram':
            fig = px.histogram(p, x=p.Height, color=p.Weight)
            st.write(fig)
        if graph == 'Scatter Plot':
            plt.scatter(p['Height'], p['Weight'])
            plt.xlabel('Height')
            plt.ylabel('Weight')
            plt.title('Height Vs Weight')
            st.set_option('deprecation.showPyplotGlobalUse', False)
            st.pyplot()
    if st.checkbox('Approximate Number of Males And Females Participated in the Olympics'):
        # value_counts() already yields one bar per sex; no extra argument is needed
        p['Sex'].value_counts().plot.bar()
        st.set_option('deprecation.showPyplotGlobalUse', False)
        plt.grid()
        st.pyplot()
    if st.checkbox("Determine the Participation trend in the Summer and Winter Seasons"):
        fig = px.histogram(p, x=p.Season, color=p.Sex, barmode="group")
        st.write(fig)
    if st.checkbox('Women Participation over the years'):
        # Filter the rows first so that the Year values come from the same subset
        women = p[p['Sex'] == 'F']
        fig = px.histogram(women, x='Year')
        st.write(fig)
    if st.checkbox('Number of Medals Won by M and F'):
        p['Medal'] = p.Medal.replace({'Gold': 1, 'Silver': 2, 'Bronze': 3})
        p['Medal'] = p['Medal'].fillna(0)
        fig = px.histogram(p, x=p.Sex, color=p.Medal)
        st.write(fig)
    if st.checkbox("Athletes with Most Medals"):
        p['Medal'] = p.Medal.replace({'Gold': 1, 'Silver': 2, 'Bronze': 3})
        p['Medal'] = p['Medal'].fillna(0)
        # One indicator column per medal type; column 0 (no medal) is dropped
        df = p[['Medal']]
        df = pd.get_dummies(df.Medal)
        df = df.drop([0], axis=1)
        df['Name'] = p['Name']
        df['Total'] = df[1] + df[2] + df[3]
        f = df.groupby(df['Name'])['Total'].sum().sort_values(ascending=False).head(10)
        x = pd.DataFrame(f)
        st.write(x)
        # Pie chart of the medal share among the top 10 athletes
        fig = plt.figure()
        ax = fig.add_axes([0, 0, 1, 1])
        ax.axis('equal')
        Team = list(f.index.values)
        Count_of_Medal = f
        ax.pie(Count_of_Medal, labels=Team, autopct='%1.2f%%')
        st.pyplot(fig)
    if st.checkbox("Countries with Most Medals"):
        p['Medal'] = p.Medal.replace({'Gold': 1, 'Silver': 2, 'Bronze': 3})
        p['Medal'] = p['Medal'].fillna(0)
        df = p[['Medal']]
        df = pd.get_dummies(df.Medal)
        df = df.drop([0], axis=1)
        df['Team'] = p['Team']
        df['Total'] = df[1] + df[2] + df[3]
        f = df.groupby(df['Team'])['Total'].sum().sort_values(ascending=False)
        k = f.head(10)
        k = list(k.index.values)
        k = pd.DataFrame(k)
        st.write(k)

        st.subheader("Find whether your country is in the Zero-Medal list?")

        x = st.text_input('Enter')
        if st.checkbox('Show'):
            if x not in f.index.values:
                st.write('Country not listed in the dataset, so be optimistic about '
                         'your country winning a medal')
            else:
                if f[x] != 0:
                    st.write('Your country has won at least 1 Medal, so chill')
                else:
                    st.write('Sorry to break it to you, your country is in the Zero-Medal '
                             'list')
    if st.checkbox("Countries winning the most Gold medals in a specific year"):
        number = st.number_input('Insert the Olympic Year', 1896.00, 2016.00, step=4.00)
        st.write('The current Year is ', int(number))
        max_year = int(number)
        # `in` on a Series tests the index labels, so compare against the values
        if max_year not in p.Year.values:
            st.write("Enter a valid year")
        else:
            if st.button('Show'):
                # Medal may already be numeric (Gold=1) if an earlier option converted it
                team_list = p[(p.Year == max_year) & (p.Medal.isin(['Gold', 1]))].Team
                if len(team_list) != 0:
                    sns.barplot(x=team_list.value_counts().head(),
                                y=team_list.value_counts().head().index)
                    st.set_option('deprecation.showPyplotGlobalUse', False)
                    st.pyplot()
                else:
                    st.write('Enter valid year')
    if st.checkbox('Observations'):
        a = st.text_area("Observations")
        st.write(a)
if nav == 'Prediction':
    st.header('Prediction')
    st.subheader('Predict the Weight of an athlete with his/her Height')
    model = LinearRegression()
    p['Height'] = p['Height'].fillna(p.Height.mean())
    p['Weight'] = p['Weight'].fillna(p.Weight.mean())
    x = p['Height']
    x = np.array(x).reshape(-1, 1)
    y = p['Weight']
    y = np.array(y).reshape(-1, 1)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    model.fit(x_train, y_train)
    t = st.number_input('Enter the Height')
    t = np.array(t).reshape(-1, 1)
    d = model.predict(t)
    if st.button('Predict Weight'):
        st.write(d)
    st.subheader('Predict the suitable Sport with the BMI values')
    data = pd.read_csv(r"C:\Users\chand\Prediction.csv")
    k = pd.DataFrame(data)
    if st.checkbox("Show Athlete BMI Dataset"):
        st.dataframe(k)
    st.write("Note:")
    q = {"Value": ['1', '2', '3', '4'],
         "Corresponding Sport": ['Marathon', 'Basketball', 'Rugby', 'Shot Put']}
    m = pd.DataFrame(q)
    st.write(m)
    model1 = LinearRegression()
    j = k['BMI']
    j = np.array(j).reshape(-1, 1)
    l = k['Sport']
    l = np.array(l).reshape(-1, 1)
    j_train, j_test, l_train, l_test = train_test_split(j, l, test_size=0.3)
    model1.fit(j_train, l_train)
    t = st.number_input('Enter BMI')
    t = np.array(t).reshape(-1, 1)
    d = model1.predict(t)
    d = d * 10
    # Reduce the (1, 1) prediction array to a plain int for the range checks below
    d = int(np.round(d).item())

    if st.button('Results'):
        my_bar = st.progress(0)
        for percent_complete in range(100):
            time.sleep(0.001)
            my_bar.progress(percent_complete + 1)
        if d in range(0, 12):
            st.write('Definitely Marathon')
        elif d in range(12, 19):
            st.write('Marathon, But also suitable for Basketball')
        elif d in range(19, 23):
            st.write('Definitely Basketball')
        elif d in range(23, 27):
            st.write('Basketball, But also suitable for Rugby')
        elif d in range(27, 33):
            st.write('Definitely Rugby')
        elif d in range(33, 38):
            st.write('Rugby, can also try shot put')
        else:
            st.write('Opt for Shot Put')
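
The banding at the end of the Prediction page could also be expressed as a small lookup, which keeps the score-to-message mapping separate from the Streamlit calls. The following is only a sketch under stated assumptions: the function name recommend_sport is hypothetical and not part of the app, and the score is assumed to be the rounded model output multiplied by 10, as in the listing above.

```python
def recommend_sport(score: int) -> str:
    """Map a rounded score (model prediction * 10) to a recommendation.

    The bands and messages are copied from the listing; the function
    itself is a hypothetical helper used only for illustration.
    """
    bands = [
        (range(0, 12), 'Definitely Marathon'),
        (range(12, 19), 'Marathon, But also suitable for Basketball'),
        (range(19, 23), 'Definitely Basketball'),
        (range(23, 27), 'Basketball, But also suitable for Rugby'),
        (range(27, 33), 'Definitely Rugby'),
        (range(33, 38), 'Rugby, can also try shot put'),
    ]
    for band, message in bands:
        if score in band:
            return message
    # Any score outside the bands (including negatives) falls through here,
    # matching the final else branch of the listing.
    return 'Opt for Shot Put'


if __name__ == '__main__':
    for sample in (5, 20, 40):
        print(sample, '->', recommend_sport(sample))
```

Inside the app, the result would be shown with st.write(recommend_sport(d)), and the function can be checked without starting Streamlit.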
