Descriptive statistics
Data science lets practitioners apply a range of mathematical operations to data in order to extract
insight and reach a desired objective, and Python makes these operations convenient to carry out.
In mathematical terms, central tendency refers to the center of a distribution: it gives an idea of the
typical or average value, and is usually read alongside a measure of how widely the values are spread.
There are three main measures of central tendency, each of which can be calculated with the pandas
library in Python, namely:
Mean
Median
Mode
Mean can be defined as the average of the observations, calculated by adding up all the values in the
data and dividing the sum by the total number of values. The mean is preferred when the data is
approximately normally distributed.
Mean = x̄ = ∑x / N
Median can be defined as the middle value in a given set of observations, found by arranging the data
in ascending or descending order and taking the middle observation. The median is best used when the
data is skewed.
Median = ((n + 1) / 2)th observation if the total number of observations n is odd.
Mode can be defined as the value that occurs with the highest frequency in a given set of observations.
If every value occurs only once, there is no single mode.
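To make the three definitions concrete, here is a minimal sketch that computes them by hand for a small list of marks. The helper names are hypothetical, and for an even number of observations the sketch averages the two middle values, which is the usual convention (an assumption here, since the formula in the text covers only the odd case).

```python
from collections import Counter

def mean(values):
    # Sum of all observations divided by the number of observations
    return sum(values) / len(values)

def median(values):
    # Sort the data, then take the middle observation
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation, counting from 1
        return ordered[mid]
    # Even n (assumed convention): average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    # Every value that reaches the highest frequency
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

marks = [98, 87, 76, 88, 96]
print(mean(marks))                   # 89.0
print(median(marks))                 # 88
print(modes([88, 52, 88, 79, 80]))   # [88]
```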
* We are going to use the well-known pandas library to explore descriptive statistics.
For more details, please refer to the following links: [Link]
Import Pandas:
You need to import Pandas into your Python script so that you can use its functionality. You can do
this by adding the following line at the beginning of your script:
import pandas as pd
Load Data:
Pandas can work with various types of data, such as CSV files, Excel files, SQL databases, or even
web URLs. To load a dataset, you use functions like pd.read_csv(), pd.read_excel(), pd.read_sql(), or
pd.read_html(), depending on the data source.
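As a minimal sketch of loading data, the example below reads CSV text with pd.read_csv(). The CSV content and the variable names are hypothetical; an in-memory buffer stands in for a real file so that the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or a URL
csv_text = "name,marks\nUpendra,98\nJayesh,88\nRahul,90\n"

# pd.read_csv accepts a path, a URL, or any file-like object
students = pd.read_csv(io.StringIO(csv_text))

print(students.head())   # first rows, to inspect the structure
print(students.dtypes)   # column types inferred by pandas
```

With a real file, the call would take the path instead of the buffer, for example pd.read_csv("marks.csv"), where the file name is again only an illustration.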
Basic data structures in pandas
Pandas provides two types of classes for handling data:
Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python
objects etc.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table
with rows and columns.
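The difference between the two structures can be sketched as follows. The index labels ("test1", "test2", "test3") and the values are hypothetical and serve only to show label-based access.

```python
import pandas as pd

# Series: one-dimensional, each value carries an index label
marks = pd.Series([98, 87, 76], index=["test1", "test2", "test3"], name="Upendra")

# DataFrame: two-dimensional, its columns are Series sharing one index
table = pd.DataFrame({"Upendra": [98, 87, 76], "Rahul": [90, 92, 71]},
                     index=["test1", "test2", "test3"])

print(marks["test2"])         # label-based access: 87
print(table.shape)            # (3, 2): three rows, two columns
print(type(table["Rahul"]))   # selecting one column gives a Series
```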
Creating the dataset
In [1]: import pandas as pd
# Creating the dataframe of student's marks
df = pd.DataFrame({"Upendra": [98, 87, 76, 88, 96],
                   "Jayesh": [88, 52, 69, 79, 80],
                   "Rahul": [90, 92, 71, 60, 64],
                   "Puja": [88, 85, 79, 81, 91]})
# Printing the dataframe
df
Out[1]: Upendra Jayesh Rahul Puja
0 98 88 90 88
1 87 52 92 85
2 76 69 71 79
3 88 79 60 81
4 96 80 64 91
The data frame has been created using pd.DataFrame and is stored in the df variable. The values are
then displayed as output.
Now let's calculate the mean
In [2]: df.mean(axis = 0)
Out[2]: Upendra 89.0
Jayesh 73.6
Rahul 75.4
Puja 84.8
dtype: float64
Now, let's calculate the median
In [3]: df.median(axis = 0)
Out[3]: Upendra 88.0
Jayesh 79.0
Rahul 71.0
Puja 85.0
dtype: float64
Now, we will find the mode
In [4]: df.mode()
Out[4]: Upendra Jayesh Rahul Puja
0 76 52 60 79
1 87 69 64 81
2 88 79 71 85
3 96 80 90 88
4 98 88 92 91
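In the output above every mark in a column occurs exactly once, so all values tie for the highest frequency and mode() returns every one of them in sorted order. A short sketch with hypothetical marks that do repeat shows the more typical results:

```python
import pandas as pd

# Hypothetical marks in which 88 occurs twice
marks = pd.Series([88, 52, 88, 79, 80])
print(marks.mode().tolist())    # [88]

# Two values tied for the highest frequency: both are returned, sorted
tied = pd.Series([60, 60, 71, 90, 90])
print(tied.mode().tolist())     # [60, 90]

# All values unique: every value is returned
unique = pd.Series([90, 92, 71, 60, 64])
print(unique.mode().tolist())   # [60, 64, 71, 90, 92]
```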
Measures Of Spread
Measures of spread tell how spread out the data points are. Some examples of measures of spread are
quantiles, variance, standard deviation and mean absolute deviation.
Quantiles
Quantiles are values that split sorted data or a probability distribution into equal parts. There are
several different types of quantiles; here are some examples:
Quartiles - Divides the data into 4 equal parts.
Quintiles - Divides the data into 5 equal parts.
Deciles - Divides the data into 10 equal parts
Percentiles - Divides the data into 100 equal parts
In [5]: import numpy as np
import pandas as pd
In [6]: # let's make the dataset
data=({'hour':[2.5,5.1,3.2,8.5,3.5,1.5,9.2,5.5,8.3,2.7,7.7,
5.9,4.5,3.3,1.1,8.9,2.5,1.9,6.1,7.4,2.7,4.8,3.8,6.9,7.8],
'Scores':[21,47,27,75,30,20,88,60,81,25,85,62,41,42,17,95,30,24,67,69,30,54,35,76,86]})
In [7]: df = pd.DataFrame(data)
print(df)
hour Scores
0 2.5 21
1 5.1 47
2 3.2 27
3 8.5 75
4 3.5 30
5 1.5 20
6 9.2 88
7 5.5 60
8 8.3 81
9 2.7 25
10 7.7 85
11 5.9 62
12 4.5 41
13 3.3 42
14 1.1 17
15 8.9 95
16 2.5 30
17 1.9 24
18 6.1 67
19 7.4 69
20 2.7 30
21 4.8 54
22 3.8 35
23 6.9 76
24 7.8 86
Let's calculate the quartiles for the scores. These are the 5 cut points (the minimum, the three
quartiles and the maximum) that divide the sorted scores into 4 equal parts.
In [8]: print(np.quantile(df['Scores'], [0, 0.25, 0.5, 0.75, 1]))
[17. 30. 47. 75. 95.]
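pandas offers the same computation as a method on a Series. The sketch below repeats the quartile calculation with both libraries on the scores from the dataset; the variable names are illustrative. Both calls use linear interpolation between neighbouring values by default, so they agree.

```python
import numpy as np
import pandas as pd

scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

probs = [0, 0.25, 0.5, 0.75, 1]
q_pandas = scores.quantile(probs)      # Series indexed by probability
q_numpy = np.quantile(scores, probs)   # plain NumPy array

print(q_pandas)
print(q_numpy)
```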
Quantiles using linspace( )
It can become quite tedious to list all the points when getting quantiles, more so in cases of higher
quantiles such as deciles and percentiles. For such cases we can make use of np.linspace( ), which
returns a given number of evenly spaced values between a start and an end point.
Let's get the quartiles of the scores
In [9]: print(np.quantile(df['Scores'], np.linspace(0, 1, 5)))
[17. 30. 47. 75. 95.]
Let's get the quintiles
In [10]: print(np.quantile(df['Scores'], np.linspace(0, 1, 6)))
[17. 26.6 38.6 60.8 77. 95. ]
Let's get the deciles
In [11]: print(np.quantile(df['Scores'], np.linspace(0, 1, 11)))
[17. 22.2 26.6 30. 38.6 47. 60.8 68.6 77. 85.6 95. ]
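The same pattern extends to percentiles, which need 101 cut points. The sketch below is one way to do it under the same assumptions as the cells above; np.percentile( ) is shown as an alternative that takes the probability on a 0 to 100 scale.

```python
import numpy as np

scores = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
          42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86]

# 101 evenly spaced probabilities: 0.00, 0.01, ..., 1.00
probs = np.linspace(0, 1, 101)
percentiles = np.quantile(scores, probs)

print(len(percentiles))   # 101 cut points
print(percentiles[50])    # 50th percentile, i.e. the median

# np.percentile expects 90 rather than 0.9 for the 90th percentile
p90 = np.percentile(scores, 90)
print(p90)
```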
Interquartile Range (IQR)
This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half
of the data.
Let's get the IQR for the scores
In [12]: IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
print(IQR)
45.0
Another way we can get the IQR is by using iqr( ) from the scipy.stats module of the scipy library
In [13]: from scipy.stats import iqr
IQR = iqr(df['Scores'])
print(IQR)
45.0
Outliers
These are data points that are unusually different or detached from the rest of the data points.
A data point is an outlier if:
data < 1st quartile − 1.5 * IQR
or
data > 3rd quartile + 1.5 * IQR
Let's get the outliers in the scores
In [14]: # first get the iqr (stored under a new name so the iqr() function is not overwritten)
iqr_value = iqr(df['Scores'])
# then get lower & upper threshold
lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * iqr_value
upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * iqr_value
# then find outliers
outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
print(outliers)
Empty DataFrame
Columns: [hour, Scores]
Index: []
The thresholds are 30 − 67.5 = −37.5 and 75 + 67.5 = 142.5, and every score lies between 17 and 95,
so by this rule the scores contain no outliers.
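The 1.5 * IQR rule stated above can also be sketched on a hypothetical set of scores in which the last value is deliberately extreme, so that the rule has something to flag. The data and the variable names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical scores: the last value is deliberately extreme
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 250])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr_value = q3 - q1   # named iqr_value so it cannot shadow scipy's iqr()

# Thresholds: 1.5 * IQR below the 1st quartile and above the 3rd quartile
lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

outliers = scores[(scores < lower_threshold) | (scores > upper_threshold)]
print(lower_threshold, upper_threshold)   # -52.0 156.0
print(outliers.tolist())                  # [250]
```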
Variance
Variance is the average of the squared distance between each data point and the mean of the data.
Let's calculate the variance of the scores. We will use np.var( )
In [15]: print(np.var(df['Scores'], ddof=1))
639.4266666666666
With 'ddof=1' included, the result is the sample variance (the sum of squared distances is divided by
n - 1); if it is excluded, np.var( ) returns the population variance (divided by n).
Let's see that below.
In [16]: print(np.var(df['Scores']))
613.8496
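To show where the two numbers come from, here is a minimal sketch that computes both variances directly from the definition and compares them with np.var( ). The variable names are illustrative.

```python
import numpy as np

scores = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                   42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

n = len(scores)
# Squared distance of each score from the mean
squared_dists = (scores - scores.mean()) ** 2

population_var = squared_dists.sum() / n     # matches np.var(scores), i.e. ddof=0
sample_var = squared_dists.sum() / (n - 1)   # matches np.var(scores, ddof=1)

print(population_var, np.var(scores))
print(sample_var, np.var(scores, ddof=1))
```

Note that the defaults differ between libraries: np.var( ) uses ddof=0, while the pandas method Series.var( ) uses ddof=1, so df['Scores'].var() returns the sample variance.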
Standard deviation
This is the square root of the variance.
Let's get the standard deviation of the scores
In [17]: print(np.sqrt(np.var(df['Scores'], ddof=1)))
25.28688724747802
Another way we can get the standard deviation is by using np.std( )
Let's use that
In [18]: print(np.std(df['Scores'], ddof=1))
25.28688724747802
Mean Absolute Deviation
This is the average of the absolute distance between each data point and the mean of the data.
Let's find the mean absolute deviation of the scores
In [19]: # first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# then take the mean of the absolute distances
print(np.mean(np.abs(dists)))
22.4192
describe( ) method
The pandas describe( ) method can be used to calculate summary statistics of a dataframe or of a
single column. By default it summarizes the numerical columns, reporting the count, mean, standard
deviation, minimum, quartiles and maximum.
We can make use of it to get some of the measurements that have been mentioned above.
In [20]: df['Scores'].describe()
Out[20]: count 25.000000
mean 51.480000
std 25.286887
min 17.000000
25% 30.000000
50% 47.000000
75% 75.000000
max 95.000000
Name: Scores, dtype: float64
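describe( ) can also be called on a whole dataframe, in which case it returns one column of statistics per numerical column. The sketch below uses the first five rows of the dataset above to keep the numbers easy to check by hand; the variable name summary is illustrative.

```python
import pandas as pd

# First five rows of the hour / Scores dataset
df = pd.DataFrame({"hour": [2.5, 5.1, 3.2, 8.5, 3.5],
                   "Scores": [21, 47, 27, 75, 30]})

summary = df.describe()   # rows are statistics, columns are the numeric columns
print(summary)

# Individual statistics can be read by row label and column name
print(summary.loc["mean", "Scores"])   # 40.0
print(summary.loc["50%", "hour"])      # 3.5
```

The std row reported by describe( ) is the sample standard deviation (ddof=1), which is why it matches the value from np.std( ) with ddof=1 rather than the population value.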
Assignment Questions
1. Load or define a dataset using pandas and display the first 5 rows to inspect its structure. Also
calculate the mean, median, and mode of a specific column in the dataset.
2. Determine the range of values in a particular column of the dataset. Also compute the variance
and standard deviation of a numeric column in the dataset.
3. Identify any missing values in the dataset and count the total number of missing entries.
4. Determine the top 5 highest values in a specific column.
5. Find the interquartile range (IQR) for a numeric column in the dataset.
6. Use Pandas and Matplotlib for any real-life application and demonstrate it with an example.