Unit 5 Descriptive Statistics

python pandas

Uploaded by

upendra maurya

Descriptive statistics

Data science lets practitioners apply a range of mathematical operations to data in order to understand
it and to reach a desired output. Python makes these operations convenient to carry out. In mathematical
terms, central tendency describes the center of a distribution: it gives an idea of the typical (average)
value, and it is usually reported together with a measure of how widely the values are spread. There are
three main measures of central tendency, all of which can be calculated using the pandas library in
Python, namely:

Mean
Median
Mode

Mean is the average of the observations, calculated by adding up all the values in the data and dividing
the sum by the number of values. The mean is preferred when the data is roughly symmetric, for example
normally distributed.

Mean = x̄ = (∑x) / N

Median is the middle value of a set of observations, found by arranging the data in ascending (or
descending) order and taking the value in the middle. The median is best used when the data is skewed.

Median = ((n + 1)/2)th observation if the number of observations n is odd. If n is even, the median is
the average of the (n/2)th and (n/2 + 1)th observations.

Mode is the value that occurs most frequently in a set of observations. If every value occurs exactly
once, no value is more frequent than the others, so the data has no meaningful mode.
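
As a quick check of the three definitions, here is a minimal sketch on a small set of marks. The values are hypothetical, chosen so that the count is even and one value repeats; they are not part of the datasets used later.

```python
import pandas as pd

# Hypothetical marks: six values (even count), with 82 occurring twice
marks = pd.Series([55, 70, 78, 82, 82, 95])

mean_value = marks.sum() / len(marks)   # sum of the values / number of values
median_value = marks.median()           # even count: average of the two middle values (78 and 82)
mode_values = marks.mode().tolist()     # most frequent value(s)

print(mean_value)    # 77.0
print(median_value)  # 80.0
print(mode_values)   # [82]
```

Here the mean, median and mode all differ, which is common when the data is not symmetric.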

We are going to use the widely used pandas library to explore descriptive statistics. For more details,
please refer to the following link: [Link]

Import Pandas:
You need to import pandas into your Python script so you can use its functionality. You can do this by
adding the following line at the beginning of your script:

import pandas as pd

Load Data:
Pandas can work with various types of data, such as CSV files, Excel files, SQL databases, or
even web URLs. To load a dataset, you use functions like pd.read_csv(), pd.read_excel(),
pd.read_sql(), or pd.read_html(), depending on the data source.
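
A minimal sketch of loading CSV data is shown below. To keep it runnable without any file on disk, the CSV text is held in a string and wrapped in io.StringIO; the names and marks in it are made up for illustration. With a real file you would pass its path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a .csv file
csv_text = """name,marks
Asha,91
Ravi,78
Meena,85
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())   # first rows, to inspect the structure
print(df_csv.shape)    # (rows, columns)
```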

Basic data structures in pandas

Pandas provides two types of classes for handling data:

Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python
objects etc.

DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table
with rows and columns.
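
The two structures can be sketched as follows. The labels test1 to test3 and the subject names are assumptions made for this illustration only.

```python
import pandas as pd

# A Series: one-dimensional, every value has a label (the index)
s = pd.Series([98, 87, 76], index=["test1", "test2", "test3"], name="marks")

# A DataFrame: two-dimensional, labeled rows and labeled columns
table = pd.DataFrame(
    {"maths": [98, 87, 76], "physics": [88, 52, 69]},
    index=["test1", "test2", "test3"],
)

# Selecting a single column of a DataFrame gives back a Series
maths_column = table["maths"]

print(s["test2"])                     # access a Series value by label
print(type(maths_column).__name__)    # Series
print(table.shape)                    # (3, 2): 3 rows, 2 columns
```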

Creating the dataset

In [1]: import pandas as pd

        # Create a DataFrame of students' marks (one column per student)
        df = pd.DataFrame({"Upendra": [98, 87, 76, 88, 96],
                           "Jayesh": [88, 52, 69, 79, 80],
                           "Rahul": [90, 92, 71, 60, 64],
                           "Puja": [88, 85, 79, 81, 91]})

        # Display the DataFrame
        df

Out[1]:    Upendra  Jayesh  Rahul  Puja
        0       98      88     90    88
        1       87      52     92    85
        2       76      69     71    79
        3       88      79     60    81
        4       96      80     64    91

The DataFrame is created with pd.DataFrame and stored in the variable df. Its values are then
displayed as output.

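Now let's calculate the mean of each column. With axis=0 the values are aggregated down the rows,
which gives one result per column.

In [2]: df.mean(axis=0)

Out[2]: Upendra    89.0
        Jayesh     73.6
        Rahul      75.4
        Puja       84.8
        dtype: float64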
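
The effect of the axis argument can be sketched by computing the mean in both directions on the same marks. Reading each row as a separate test is an assumption made here for illustration; the source only says the columns are students.

```python
import pandas as pd

# Same marks as in the example: one column per student
marks = pd.DataFrame({"Upendra": [98, 87, 76, 88, 96],
                      "Jayesh": [88, 52, 69, 79, 80],
                      "Rahul": [90, 92, 71, 60, 64],
                      "Puja": [88, 85, 79, 81, 91]})

column_means = marks.mean(axis=0)  # aggregate down the rows: one mean per student
row_means = marks.mean(axis=1)     # aggregate across the columns: one mean per row

print(column_means)
print(row_means.tolist())  # [91.0, 79.0, 73.75, 77.0, 82.75]
```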

Now, let's calculate the MEDIAN

In [3]: df.median(axis=0)

Out[3]: Upendra    88.0
        Jayesh     79.0
        Rahul      71.0
        Puja       85.0
        dtype: float64

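Now, we will find the MODE. Every mark in each column occurs exactly once, so all values tie as the
most frequent and mode( ) returns all of them, sorted in ascending order.

In [4]: df.mode()

Out[4]:    Upendra  Jayesh  Rahul  Puja
        0       76      52     60    79
        1       87      69     64    81
        2       88      79     71    85
        3       96      80     90    88
        4       98      88     92    91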


Measures Of Spread
Measures of spread describe how widely the data points are dispersed. Some examples of measures of
spread are quantiles, variance, standard deviation and mean absolute deviation.

Quantiles

Quantiles are values that split sorted data or a probability distribution into equal parts. There
are several different types of quantiles; here are some examples:

Quartiles - Divide the data into 4 equal parts.
Quintiles - Divide the data into 5 equal parts.
Deciles - Divide the data into 10 equal parts.
Percentiles - Divide the data into 100 equal parts.

In [5]: import numpy as np
        import pandas as pd

In [6]: # Let's make a dataset of study hours and scores
        data = {'hour': [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7,
                         5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8],
                'Scores': [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95, 30, 24,
                           67, 69, 30, 54, 35, 76, 86]}

In [7]: df = pd.DataFrame(data)
        print(df)

hour Scores
0 2.5 21
1 5.1 47
2 3.2 27
3 8.5 75
4 3.5 30
5 1.5 20
6 9.2 88
7 5.5 60
8 8.3 81
9 2.7 25
10 7.7 85
11 5.9 62
12 4.5 41
13 3.3 42
14 1.1 17
15 8.9 95
16 2.5 30
17 1.9 24
18 6.1 67
19 7.4 69
20 2.7 30
21 4.8 54
22 3.8 35
23 6.9 76
24 7.8 86

Let's calculate the quartiles of the scores. Together with the minimum and maximum, these are the 5
values that divide the sorted scores into 4 equal parts.

In [8]: print(np.quantile(df['Scores'], [0, 0.25, 0.5, 0.75, 1]))

[17. 30. 47. 75. 95.]

Quantiles using linspace( )

Listing every cut point by hand becomes tedious, especially for finer quantiles such as deciles and
percentiles. In such cases we can use NumPy's linspace( ) function, which returns evenly spaced
values over an interval.

Let's get the quartiles of the scores

In [9]: print(np.quantile(df['Scores'], np.linspace(0, 1, 5)))

[17. 30. 47. 75. 95.]

Let's get the quintiles

In [10]: print(np.quantile(df['Scores'], np.linspace(0, 1, 6)))

[17. 26.6 38.6 60.8 77. 95. ]

Let's get the deciles

In [11]: print(np.quantile(df['Scores'], np.linspace(0, 1, 11)))

[17. 22.2 26.6 30. 38.6 47. 60.8 68.6 77. 85.6 95. ]
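
The same cut points can also be obtained with the pandas Series.quantile( ) method. The sketch below rebuilds the scores as a standalone Series (the variable names are chosen here for illustration) and checks that pandas and NumPy agree, since both use linear interpolation by default.

```python
import numpy as np
import pandas as pd

# The 25 scores from the dataset above, as a standalone Series
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95,
                    30, 24, 67, 69, 30, 54, 35, 76, 86], name="Scores")

cut_points = np.linspace(0, 1, 11)            # 0.0, 0.1, ..., 1.0

deciles_pd = scores.quantile(cut_points)      # Series indexed by the cut points
deciles_np = np.quantile(scores, cut_points)  # plain NumPy array

print(deciles_pd)
print(np.allclose(deciles_pd.to_numpy(), deciles_np))  # True
```

The pandas result is indexed by the requested fractions, which makes it easier to read off, for example, the 0.9 quantile by label.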

Interquartile Range (IQR)

This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half
of the data.

Let's get the IQR for the scores

In [12]: IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
         print(IQR)

45.0

Another way to get the IQR is to use iqr( ) from the SciPy library

In [13]: from scipy.stats import iqr

         IQR = iqr(df['Scores'])
         print(IQR)

45.0

Outliers

These are data points that are unusually different or detached from the rest of the data points.

A data point is an outlier if:

data < 1st quartile − 1.5 * IQR
or
data > 3rd quartile + 1.5 * IQR

Let's get the outliers in the scores

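In [14]: # first get the IQR (stored under a new name so the iqr function is not overwritten)
         scores_iqr = iqr(df['Scores'])
         # then get the lower & upper thresholds: 1.5 * IQR beyond the quartiles
         lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * scores_iqr
         upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * scores_iqr
         # then find the outliers
         outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
         print(outliers)

Empty DataFrame
Columns: [hour, Scores]
Index: []

With an IQR of 45, the thresholds are 30 − 67.5 = −37.5 and 75 + 67.5 = 142.5. Every score lies
between them, so this dataset contains no outliers.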
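
To see the 1.5 * IQR rule actually flag a value, here is a minimal sketch on a hypothetical sample that contains one extreme value. The helper name iqr_outliers and the sample are assumptions made for this illustration, not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    spread = q3 - q1                    # interquartile range
    lower = q1 - 1.5 * spread
    upper = q3 + 1.5 * spread
    return values[(values < lower) | (values > upper)]

# Hypothetical sample: six similar values and one extreme value
sample = pd.Series([10, 12, 13, 15, 16, 18, 95])

print(iqr_outliers(sample).tolist())  # [95]
```

For this sample the quartiles are 12.5 and 17, so the thresholds are 5.75 and 23.75, and only 95 falls outside them.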

Variance

Variance is the average of the squared distances between each data point and the mean of the data.

Let's calculate the variance of the scores. We will use np.var( )

In [15]: print(np.var(df['Scores'], ddof=1))

639.4266666666666

With 'ddof=1' included, the sum of squared distances is divided by n − 1, which gives the sample
variance. If it is omitted, np.var( ) uses ddof=0 and divides by n, which gives the population
variance.

Let's see that below.

In [16]: print(np.var(df['Scores']))

613.8496
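
The role of ddof can be sketched by computing both variances by hand on a small array. The eight values are hypothetical and were picked so that the arithmetic is easy to follow. The last line also shows that the pandas var( ) method, unlike np.var( ), defaults to ddof=1.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

squared_dev = (values - values.mean()) ** 2               # squared distance from the mean

population_var = squared_dev.sum() / len(values)          # divide by n      (ddof=0)
sample_var = squared_dev.sum() / (len(values) - 1)        # divide by n - 1  (ddof=1)

print(population_var, np.var(values))                     # both 4.0
print(np.isclose(sample_var, np.var(values, ddof=1)))     # True
print(np.isclose(sample_var, pd.Series(values).var()))    # True: pandas defaults to ddof=1
```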

Standard deviation

This is the square root of the variance.

Let's get the standard deviation of the scores

In [17]: print(np.sqrt(np.var(df['Scores'], ddof=1)))

25.28688724747802

Another way to get the standard deviation is np.std( )

Let's use that

In [18]: print(np.std(df['Scores'], ddof=1))

25.28688724747802

Mean Absolute Deviation

This is the average of the absolute distances between each data point and the mean of the data.

Let's find the mean absolute deviation of the scores

In [19]: # first find the distance between each data point and the mean
         dists = df['Scores'] - np.mean(df['Scores'])
         # then take the mean of the absolute distances
         print(np.mean(np.abs(dists)))

22.4192
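
The same calculation can be written with pandas methods only. The sketch below uses five hypothetical values so the result is easy to verify by hand. Older pandas releases had a Series.mad( ) shortcut, but it was removed in pandas 2.0, so the explicit form is the portable one.

```python
import pandas as pd

# Hypothetical values with mean 6
values = pd.Series([2, 4, 6, 8, 10])

# absolute distances from the mean are 4, 2, 0, 2, 4; their average is 12 / 5
mean_abs_dev = (values - values.mean()).abs().mean()

print(mean_abs_dev)
```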

describe( ) method

The pandas describe( ) method calculates summary statistics for a DataFrame or a Series. By default
it summarizes only the numeric columns of a DataFrame.

We can use it to get several of the measurements mentioned above. Note that the std it reports is the
sample standard deviation (ddof=1).

In [20]: df['Scores'].describe()

Out[20]: count    25.000000
         mean     51.480000
         std      25.286887
         min      17.000000
         25%      30.000000
         50%      47.000000
         75%      75.000000
         max      95.000000
         Name: Scores, dtype: float64
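
describe( ) can also be called on a whole DataFrame, and its percentiles parameter controls which quantiles are reported. A minimal sketch, using only the first five rows of the dataset to keep it short (the variable names are chosen here for illustration):

```python
import pandas as pd

# First five rows of the hours/scores dataset
frame = pd.DataFrame({"hour": [2.5, 5.1, 3.2, 8.5, 3.5],
                      "Scores": [21, 47, 27, 75, 30]})

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles
summary = frame.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
print(summary.loc["mean", "Scores"])  # 40.0
```

The result is itself a DataFrame, so individual statistics can be picked out with .loc as in the last line.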


Assignment Questions
1. Load or define a dataset using pandas and display the first 5 rows to inspect its structure. Also
calculate the mean, median, and mode of a specific column in the dataset.
2. Determine the range of values in a particular column of the dataset. Also compute the variance
and standard deviation of a numeric column in the dataset.
3. Identify any missing values in the dataset and count the total number of missing entries.
4. Determine the top 5 highest values in a specific column.
5. Find the interquartile range (IQR) for a numeric column in the dataset.
6. Use pandas and Matplotlib for any real-life application and demonstrate it with an example.
