
Descriptive_Statistics

Statistics is a mathematical discipline focused on data collection, analysis, interpretation, and presentation, divided into descriptive and inferential statistics. Learning statistics is essential for informed decision-making, understanding patterns, and conducting scientific research in a data-driven world. Key concepts include measures of central tendency, dispersion, and data visualization techniques to effectively communicate insights from data.

Uploaded by

vanshlakhotya
Copyright
© All Rights Reserved

STATISTICS

Swapnil Desai
(Senior Data Scientist)
What is Statistics?

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation,
presentation, and organization of data. It provides methods for designing experiments and
surveys, collecting data, analyzing it, drawing meaningful conclusions, and making decisions
based on data analysis.

In general, its investigations and analyses fall into two broad categories called descriptive and inferential
statistics.

E.g.: We extract information and knowledge from raw data.


Why We Need to Learn Statistics:

1. Decision Making: Statistics helps in making informed decisions in various fields like
business, science, government, healthcare, etc., by providing a way to understand and
interpret data.

2. Understanding Patterns: It helps in understanding and interpreting patterns and
trends in data, which is crucial in many fields.

3. Scientific Research: Statistics is fundamental in scientific research for designing
experiments, testing hypotheses, and validating results.

4. Data-Driven World: In our increasingly data-driven world, statistics is essential for
making sense of the large amounts of data generated daily.
Developing Statistical Thinking

Statistics include numerical facts and figures. For instance:

• The largest earthquake measured 9.2 on the Richter scale.


• Men are at least 10 times more likely than women to commit murder.
• One in every 8 Americans is COVID positive.

The study of statistics involves math and relies upon calculations of numbers. But it also relies
heavily on how the numbers are chosen and how the statistics are interpreted. For example,
consider some scenarios and the interpretations based upon the presented statistics.

1. A new advertisement for Amul’s ice cream introduced in late May of last year resulted in a
30% increase in ice cream sales for the following three months. Thus, the advertisement was
effective.

2. The more liquor shops a city has, the more crime there is. Thus, liquor shops lead to crime.
1. Flaw: A major flaw is that ice cream consumption generally increases in the months of
June, July, and August regardless of advertisements. This effect is called a history effect
and leads people to interpret outcomes as the result of one variable when another
variable (in this case, one having to do with the passage of time) is actually
responsible.

2. Flaw: A major flaw is that both increased liquor shops and increased crime rates can be
explained by larger populations. In bigger cities, there are both more liquor shops and
more crime. This is known as the third-variable problem: a third variable can cause both
situations, yet people erroneously infer a causal relationship between the two primary
variables instead of recognizing the common cause.

Hence, correct interpretation of the numbers is necessary!


Types of Statistics:

1. Descriptive statistics deals with the processing of data without attempting to draw any
inferences from it. The characteristics of the data are described in simple terms. Events that are
dealt with include everyday happenings such as accidents, prices of goods, business, incomes,
epidemics, sports data, population data.

2. Inferential statistics is a scientific discipline that uses mathematical tools to make forecasts
and projections by analysing the given data. This is of use to people employed in such fields as
engineering, economics, biology, the social sciences, business, agriculture and
communications.
Population?
Refers to the total amount of things.

Sample?
Small part of the population that is used for study.

Variable?
What we are studying. (Measurable, Countable, Categorized)

Sample Size?
Total amount of things in a sample.
QUANTITATIVE VARIABLE

INTERVAL
An interval scale is one where there is order and the difference between two values is meaningful.
Ex: Temperature (Fahrenheit), Temperature (Celsius), pH

RATIO
A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.
Ex: Weight, length, temperature in Kelvin

• The difference between interval and ratio scales comes from their ability to dip below zero. Interval scales
hold no true zero and can represent values below zero. For example, you can measure temperature below
0 degrees Celsius, such as -10 degrees.
• Ratio variables, on the other hand, never fall below zero. Height and weight measure from 0 and above,
but never fall below it.
Data Visualization
Basics
What is data Visualization?

Data visualization is the representation of data or information in a graph, chart, or other
visual format. It communicates relationships of the data with images. This is important
because it allows trends and patterns to be more easily seen.

What is its importance?

We need data visualization because a visual summary of information makes it easier to identify
patterns and trends than looking through thousands of rows on a spreadsheet. It’s the way the
human brain works.

Since the purpose of data analysis is to gain insights, data is much more valuable when it is
visualized.

Even if a data analyst can pull insights from data without visualization, it will be more difficult
to communicate the meaning without visualization.
Line Chart.

A line chart is, as one can imagine, a line or multiple lines showing how single, or multiple
variables develop over time.

Pie Chart.

A pie chart is a circular graph divided into slices. The larger a slice is the bigger portion of
the total quantity it represents.

Bar Graph.

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent. The bars can be
plotted vertically or horizontally. A bar chart can show one variable or several.

Histogram

A series of bins showing us the frequency of observations of a given variable.

Scatter Plots
A scatter plot is a great indicator that allows us to see whether there is a pattern to be found
between two variables. E.g. : Positive or negative relationship.
Descriptive Statistics

• Descriptive statistics are a set of techniques and measures used to summarize, organize,
and describe the main features of a dataset.
• These statistics provide a way to understand the essential characteristics of the data
without necessarily making inferences or drawing conclusions about a larger population.
• Events that are dealt with include everyday happenings such as accidents, prices of goods, business,
incomes, epidemics, sports data, population data.

When we describe data, the descriptions fall into 3 kinds:

1. Measures of Central Tendency – Mean, Median and Mode
2. Measures of Dispersion – Standard Deviation, Variance, Range, IQR (Inter Quartile Range)
3. Measure of Symmetricity/Shape – Skewness and Kurtosis
1. Measure of Central Tendency
A measure of central tendency is a summary statistic that represents the centre point or typical value of
a dataset. These measures indicate where most values in a distribution fall and are also referred to as
the central location of a distribution.

1. Mean
The mean is the average of all the values in a dataset. It's calculated by summing up all the values and
then dividing by the total number of values. The mean is the number around which the whole
dataset is spread out. It is denoted by µ for the population mean and x̄ for the sample mean.

Example: Find the mean of 5,5,2,6,3,8,9?


A: Mean is (5+5+2+6+3+8+9) / 7 = 38/7 = 5.43

2. Median
The median is the middle value of a dataset when it's arranged in ascending or descending order. If
there's an even number of values, the median is the average of the two middle values.
(Note: If you sort data in descending order, it won’t affect median but IQR will be negative. IQR will be discussed in
next slide.)

Example: Find the Median of 5,5,2,6,3,8,9?


A: Putting it in ascending order = 2,3,5,5,6,8,9. Hence, Median = Mid Number = 5.
(Note: Median of an even set of numbers can be found by taking the average of the 2 middle numbers.
E.g. Median of 2,3,4,7 = average of (3 and 4 ) = 3.5)

3. Mode
The mode is the value that appears most frequently in a dataset, i.e. the term that has the highest frequency.

Example: Find the Mode of 5,5,2,6,3,8,9?

A: 5 appears twice and every other value appears once. Hence, Mode = 5.
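
The three measures of central tendency can be checked with Python's standard `statistics` module. This is a minimal sketch using the example values from the slides; the variable names are illustrative only.

```python
import statistics

values = [5, 5, 2, 6, 3, 8, 9]

mean_value = statistics.mean(values)      # sum of the values / number of values
median_value = statistics.median(values)  # middle value once the data is sorted
mode_value = statistics.mode(values)      # value with the highest frequency

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
```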


Let’s build some concepts before going ahead.

1. What is Minimum and Maximum value?

They are the smallest and largest values of the dataset, respectively.

2. What is 1st and 3rd Quartile? – Also called the lower and upper quartile
respectively.

When we divide the dataset into two groups while calculating median (sorted in
ascending order), then the median of first half is 1st Quartile and median of second half is
3rd Quartile.

3. Then where is the 2nd Quartile?

Your median is the 2nd Quartile ;-)


Q. Given are the ages of people registered for a webinar. Calculate the 5-point summary (5-number summary) of
the ages of the participants.

19, 26, 25, 37, 32, 28, 22, 23, 29, 34, 39, 31

1) Sort the data

19, 22, 23, 25, 26, 28, 29, 31, 32, 34, 37, 39

2) Find the median value (Q2)
Q2 (50%) = (28 + 29)/2 = 28.5

3) Find Q1 (25%), the median of the lower half
19, 22, 23, 25, 26, 28
Q1 = (23 + 25)/2 = 24

4) Find Q3 (75%), the median of the upper half
29, 31, 32, 34, 37, 39
Q3 = (32 + 34)/2 = 33

5) Min and Max
Min = 19
Max = 39
Range: Max – Min = 39 - 19 = 20
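
The 5-number summary can also be sketched in Python. The helper below is hypothetical (it is not part of any library) and assumes the median-of-halves rule described earlier, where the middle value is left out of both halves when the number of observations is odd. Library functions such as `numpy.percentile` use interpolation rules by default and may return slightly different quartiles.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]                  # values before the median position
    upper_half = ordered[len(ordered) - half:]   # values after the median position
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )

ages = [19, 26, 25, 37, 32, 28, 22, 23, 29, 34, 39, 31]
minimum, q1, q2, q3, maximum = five_number_summary(ages)
print(minimum, q1, q2, q3, maximum)
```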
2. Measure of Spread / Dispersion

1. Standard deviation
Standard deviation measures the typical distance between each data point and the
mean. That is, how data is spread out from mean. A low standard deviation indicates that the
data points tend to be close to the mean of the data set, while a high standard deviation
indicates that the data points are spread out over a wider range of values.

SD of Population (σ):

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

SD of Sample (s):

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

In Python (statistics module):
Population STD = pstdev()
Sample STD = stdev()

2. Variance

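Variance is the average of the squared distances between each data point and the mean. That is, it is the
square of the standard deviation.

Population Variance (σ²):

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Sample Variance (s²):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

In Python (statistics module): Population Var = pvariance(), Sample Variance = variance()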
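
To make the difference between the population and sample versions concrete, the sketch below computes both by hand and compares them with the `statistics` functions named above. The data is the small example list used earlier for the mean; the variable names are illustrative.

```python
import math
import statistics

values = [5, 5, 2, 6, 3, 8, 9]
n = len(values)
mean_value = sum(values) / n
squared_deviations = [(x - mean_value) ** 2 for x in values]

population_variance = sum(squared_deviations) / n     # divide by N
sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Compare the manual results with the standard library.
print(population_variance, statistics.pvariance(values))
print(sample_variance, statistics.variance(values))
print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))
```

The sample versions are always a little larger than the population versions because they divide by n - 1 instead of n.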

3. Range
The range is a measure of the spread or dispersion of a set of data points. It is one of the simplest
techniques of descriptive statistics: the difference between the highest and lowest values.

Range = Maximum - Minimum

4. IQR (Interquartile Range): IQR = Q3 - Q1


The IQR is a measure of statistical dispersion, which is the spread of the data points in a dataset. It is the
difference between the third quartile (Q3) and the first quartile (Q1) in a dataset.
Here's how to calculate it:
1. Sort the data from smallest to largest.
2. Find the median (the middle value) of the dataset. This divides the dataset into two halves.
3. Find the median of the first half of the data. This is Q1, or the first quartile.
4. Find the median of the second half of the data. This is Q3, or the third quartile.
5. Subtract Q1 from Q3: IQR = Q3 - Q1.
The IQR is useful because it gives us a measure of the middle 50% of the data. It is not affected by outliers
or extreme values as much as the range is.
Use of IQR:
Outlier Detection
Summarizing Data
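
The claim that the IQR is less affected by extreme values than the range can be illustrated with a short sketch. The helper `spread` is hypothetical, it uses the median-of-halves rule from the steps above, and the value 390 is an invented data-entry error (39 typed as 390) added purely for illustration.

```python
import statistics

def spread(data):
    """Return the range and the IQR (median-of-halves rule) of the data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    return {"range": ordered[-1] - ordered[0], "iqr": q3 - q1}

ages = [19, 22, 23, 25, 26, 28, 29, 31, 32, 34, 37, 39]
ages_with_typo = ages[:-1] + [390]  # hypothetical error: 39 entered as 390

clean = spread(ages)
dirty = spread(ages_with_typo)
print(clean)
print(dirty)
```

The range jumps when the extreme value is introduced, while the IQR stays the same because the middle 50% of the data has not moved.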

Steps to find out the IQR

1. Order the data from least to greatest (sort in ascending order)
2. Find the median (Q2)
3. The data to the left of the median (Q2) is the lower half and the data to the right is the upper half.
4. Calculate the median of both the lower and upper half of the data (called Q1 and Q3
respectively)
5. The IQR is the difference between the upper and lower medians (Q3 - Q1)

(Note: When we write down Minimum, Maximum, Q1, Q2 (Median) and Q3, this is
called 5-point summary or 5 number summary)

Let’s solve some questions to find IQR.


Transforming Data
Look at below Question.

1. Below are the weights of 5 persons. Calculate Mean, Standard Deviation :

105, 156, 145, 172, 100

2. Suppose each one of them gained extra 5 Kg. weight during winters. Can you calculate
the new Mean and Standard deviation?
**Original Data:**
Weights of 5 persons: 105, 156, 145, 172, 100

1. **Calculating Mean:**
Mean (Average) = (Sum of all weights) / (Number of weights)

Mean = (105 + 156 + 145 + 172 + 100) / 5


= 678 / 5 = 135.6

2. **Calculating Standard Deviation:**


To calculate the standard deviation, first, calculate the mean of the data. Then, for each data point,
subtract the mean, square the result, sum up all the squared differences, divide by the number of data
points, and finally, take the square root.

Calculating squared differences:


(105 - 135.6)^2 = 936.36
(156 - 135.6)^2 = 416.16
(145 - 135.6)^2 = 88.36
(172 - 135.6)^2 = 1324.96
(100 - 135.6)^2 = 1267.36

Mean of squared differences = (936.36 + 416.16 + 88.36 + 1324.96 + 1267.36) / 5 = 4033.20 / 5 = 806.64

Standard Deviation = √(Mean of squared differences) = √(806.64) ≈ 28.40

**After Winter Weight Gain (Adding 5 Kg to each weight):**
New Weights: 110, 161, 150, 177, 105

1. **Calculating New Mean:**

New Mean = (Sum of new weights) / (Number of weights)

New Mean = (110 + 161 + 150 + 177 + 105) / 5 = 703 / 5 = 140.6

2. **Calculating New Standard Deviation:**

Calculating squared differences for the new weights (each difference from the mean is the same as before):
(110 - 140.6)^2 = 936.36
(161 - 140.6)^2 = 416.16
(150 - 140.6)^2 = 88.36
(177 - 140.6)^2 = 1324.96
(105 - 140.6)^2 = 1267.36

Mean of squared differences = (936.36 + 416.16 + 88.36 + 1324.96 + 1267.36) / 5 = 806.64

New Standard Deviation = √(Mean of squared differences) = √(806.64) ≈ 28.40

So, after each person gains 5 kg during the winter, the new mean weight is 140.6 kg (the old mean plus 5),
while the standard deviation stays at approximately 28.40 kg. Adding the same constant to every value
shifts the mean by that constant but leaves the spread unchanged.
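
The effect of adding a constant can be verified numerically. The sketch below uses `statistics.pstdev` (population standard deviation, dividing by the number of data points as in the worked steps) on the original and the shifted weights.

```python
import math
import statistics

weights = [105, 156, 145, 172, 100]
shifted_weights = [w + 5 for w in weights]  # everyone gains 5 kg

original_mean = statistics.mean(weights)
shifted_mean = statistics.mean(shifted_weights)
original_sd = statistics.pstdev(weights)
shifted_sd = statistics.pstdev(shifted_weights)

print(f"mean: {original_mean:.1f} -> {shifted_mean:.1f}")
print(f"population SD: {original_sd:.2f} -> {shifted_sd:.2f}")
```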
3. Measure of Symmetricity & Shape – Skewness and Kurtosis
1. Skewness
Skewness is usually described as a measure of a dataset’s symmetry – or lack of
symmetry. A perfectly symmetrical data set will have a skewness of 0. The normal
distribution has a skewness of 0. Skewness is calculated as:

import numpy as np
from scipy.stats import skew
x = np.random.normal(0, 2, 10000) # create random values based on a normal distribution
print(skew(x))

Mathematically:

$$\text{skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{s}\right)^{3}$$

where n is the sample size, Xi is the ith X value, X-Bar is the average and s is the
sample standard deviation. Note the exponent in the summation. It is “3”. The
skewness is referred to as the “third standardized central moment for the
probability model.”
Skewness

So, when is the skewness too much? The rule of thumb seems to be:

1. If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
2. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately
skewed.
3. If the skewness is less than -1 or greater than 1, the data are highly skewed.
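
The rule of thumb above can be applied in code. The sketch below is an illustration, not a library API: `skewness` computes the moment-based (population) skewness, which is the quantity `scipy.stats.skew` returns with its default `bias=True`, and `describe_skew` is a hypothetical helper that applies the three thresholds.

```python
def skewness(data):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(data)
    mean_value = sum(data) / n
    m2 = sum((x - mean_value) ** 2 for x in data) / n
    m3 = sum((x - mean_value) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def describe_skew(value):
    """Apply the rule-of-thumb thresholds (0.5 and 1) to a skewness value."""
    size = abs(value)
    if size <= 0.5:
        return "fairly symmetrical"
    if size <= 1:
        return "moderately skewed"
    return "highly skewed"

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]  # one large value stretches the right tail

for sample in (symmetric, right_tailed):
    value = skewness(sample)
    print(sample, round(value, 3), describe_skew(value))
```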

Importance of Skewness:

Measures of asymmetry like skewness are the link between central tendency
measures and probability theory, which ultimately allows us to get a more complete
understanding of the data we are working with.

Knowing that the market has a 70% probability of going up and a 30% probability of going
down may appear helpful if you rely on normal distributions. However, if you were told
that if the market goes up, it will go up 2% and if it goes down, it will go down 10%, then
you could see the skewed returns and make a better informed decision.

E(r) = 0.7*0.02 + 0.3*(-0.1) = 0.014 - 0.030 = -0.016


2. Kurtosis
Kurtosis is all about the tails of the distribution – not the peakedness or flatness.
It measures the tail-heaviness of the distribution. Kurtosis is calculated as:

import numpy as np
from scipy.stats import kurtosis
x = np.random.normal(0, 2, 10000) # create random values based on a normal distribution
print(kurtosis(x))

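Mathematically:

$$\text{kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{s}\right)^{4} - \frac{3(n-1)^{2}}{(n-2)(n-3)}$$

where n is the sample size, Xi is the ith X value, X-Bar is the average and s is the sample standard deviation.
Note the exponent in the summation. It is “4”. The kurtosis is referred to as the “fourth standardized central
moment for the probability model.” (Because of the subtracted term, this sample form reports excess kurtosis.)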
What does the value of kurtosis tell us about the shape?
The reference standard is a normal distribution, which has a kurtosis of 3. Because
of this, the excess kurtosis is often presented instead: excess kurtosis is
simply kurtosis − 3. For example, the “kurtosis” reported by Excel and by many
statistical libraries is actually the excess kurtosis.

1. A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0).
Any distribution with kurtosis ≈3 (excess ≈0) is called mesokurtic.

2. A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic.
Compared to a normal distribution, its tails are shorter and thinner,
and often its central peak is lower and broader.

3. A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic.
Compared to a normal distribution, its tails are longer and fatter,
and often its central peak is higher and sharper.
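
The three labels can be attached to data with a small sketch. `excess_kurtosis` below computes the moment-based (population) value, which is what `scipy.stats.kurtosis` returns with its defaults (`fisher=True`, `bias=True`). The helper `kurtosis_label` and its `tolerance` parameter are assumptions made for illustration; the slides do not say how close to 0 counts as mesokurtic.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2^2 - 3."""
    n = len(data)
    mean_value = sum(data) / n
    m2 = sum((x - mean_value) ** 2 for x in data) / n
    m4 = sum((x - mean_value) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

def kurtosis_label(excess, tolerance=0.1):
    """Name the shape; `tolerance` is an arbitrary cut-off chosen for this sketch."""
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"

thin_tails = [-1, 1, -1, 1]              # all values sit at the same distance from the mean
heavy_tails = [0] * 8 + [-10, 10]        # most values at the centre, two far out

for sample in (thin_tails, heavy_tails):
    value = excess_kurtosis(sample)
    print(sample, value, kurtosis_label(value))
```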
Uses of Kurtosis:

1. Depicts the shape of the distribution, especially the tails.

2. Outlier Detection: Large kurtosis suggests there could be outliers in the data.

3. With high kurtosis, there is a chance of high variance, and hence a test on the mean could lead to bad results.
Hence, in that case, we would need to choose a more robust option, like a test on the median.

4. Financial Risk: E.g., the return of your asset can be farther from the mean (than predicted using a normal
distribution).
Outliers
What is outlier?
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population. In a sense, this definition leaves it up to the analyst
to decide what will be considered abnormal.

Common Causes of Outliers


1. Data entry errors (human errors)
2. Measurement errors (instrument errors)
3. Experimental errors (data extraction or
experiment planning/executing errors)
4. Intentional (dummy outliers made to test detection
methods)
5. Data processing errors (data manipulation or data set
unintended mutations)
6. Sampling errors (extracting or mixing data from
wrong or various sources)
7. Natural (not an error, novelties in data)
Common methods of determining an Outlier
1. Sort the data and look for extreme values
2. Plotting – Boxplot, Scatterplot
3. IQR Method
4. Z-Score Method

Why do we need to treat outliers?


Outliers can impact the results of our analysis and statistical modelling in a drastic way.
IQR Method
Q. Can you identify the outliers from the below dataset, using the IQR method?

26.0 ℃ , 15.0 ℃ , 20.5 ℃ , 31 ℃ , -350.0 ℃ , 31.0 ℃ , 30.5 ℃


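Outliers are values < Q1 – 1.5 (IQR) or > Q3 + 1.5 (IQR).

Worked example for a dataset with Q1 = 25, Q3 = 36 and IQR = 11:
Lower limit = 25 – 1.5 (11) = 8.5
Upper limit = 36 + 1.5 (11) = 52.5
Hence, we can say that 59, which lies above 52.5, is the only outlier in that dataset.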
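
The IQR method can be applied to the temperature readings from the question with a short sketch. The function `iqr_outliers` is hypothetical and assumes the median-of-halves rule for Q1 and Q3 (with the middle value left out of both halves for an odd number of readings); other quartile conventions can shift the limits slightly.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return (outliers, lower_limit, upper_limit) using Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return outliers, lower_limit, upper_limit

temperatures = [26.0, 15.0, 20.5, 31.0, -350.0, 31.0, 30.5]
outliers, lower_limit, upper_limit = iqr_outliers(temperatures)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```

Under these assumptions only the -350.0 reading falls outside the limits.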
Z-Score Method
What is Z-Score?
A z-score measures exactly how many standard deviations above or below the mean a data point
is.
Z-scores are the number of standard deviations above and below the mean that each value falls,
assuming a Normal distribution.

For example, a Z-score of positive 2 indicates that an observation is two standard deviations
above the average while a Z-score of -2 signifies it is two standard deviations below the mean.

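Z-Score Formula?

$$z = \frac{x - \mu}{\sigma}$$

where x is the data point, µ is the mean and σ is the standard deviation.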

Here are some important facts about z-scores:


•A positive z-score says the data point is above average.
•A negative z-score says the data point is below average.
•A z-score close to 0 says the data point is close to average.
•A data point can be considered unusual if its z-score is above 3 or below -3.
Example: Consider the below dataset. Find out the outlier using Z-score method.

1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2

Solution:

Mean = 2.66
Std = 3.36
Z score formula: Z(x) = (x – Mean) / Std
Z (1) = (1 – 2.66)/3.36 = -0.49405
Z(2) = (2 – 2.66)/3.36 = -0.19643
Z(3) = (3 – 2.66)/3.36 = 0.10119
Z(15) = (15 – 2.66)/3.36 = 3.67262

We will term the point outlier if it has a z-score of 3 or above (in any side - positive or negative).
Hence, here the outlier is 15.
Assignment 3: Write a Python code to detect outlier using Z Score Method
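
One possible starting point for the z-score method is sketched below. It is a minimal example, not a reference solution: the function name `z_score_outliers` and the `threshold` parameter are illustrative, and it uses the population standard deviation (`statistics.pstdev`), which is the choice that reproduces Std = 3.36 for the example dataset.

```python
import statistics

def z_score_outliers(data, threshold=3.0):
    """Return the values whose |z-score| is at or above the threshold."""
    mean_value = statistics.mean(data)
    std = statistics.pstdev(data)  # population standard deviation
    return [x for x in data if abs((x - mean_value) / std) >= threshold]

dataset = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
found = z_score_outliers(dataset)
print("outliers:", found)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in very small datasets the z-score of a genuine outlier may not reach 3.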
Covariance
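Covariance: Covariance measures how two variables/features move with respect to each other
and is an extension of the concept of variance (which tells us how a single
variable varies).
It can take any value from -∞ to +∞.

In Python: use the cov() function (for example numpy.cov() or pandas DataFrame.cov())

Sign of the deviations from the means:

Case                 (x − μx)   (y − μy)
x > μx, y > μy          +          +
x < μx, y < μy          -          -
x > μx, y < μy          +          -
x < μx, y > μy          -          +

When both deviations have the same sign, their product is positive and pushes the covariance up;
when the signs differ, the product is negative and pushes it down.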
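
The sign logic above can be seen in a hand-written covariance. The sketch below computes the sample covariance (dividing by n - 1, which is also the default of `numpy.cov`); the function name and the tiny datasets are made up for illustration.

```python
def sample_covariance(xs, ys):
    """Average product of deviations from the means, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

x = [1, 2, 3]
y_same_direction = [2, 4, 6]      # rises when x rises
y_opposite_direction = [6, 4, 2]  # falls when x rises

positive_cov = sample_covariance(x, y_same_direction)
negative_cov = sample_covariance(x, y_opposite_direction)
print(positive_cov, negative_cov)
```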


Correlation
What is Correlation?

Correlation is a statistical technique to depict the relationship between 2 variables –
strength and direction. We measure the correlation with the help of Correlation
Coefficient.

For example, height and weight are related; taller people tend to be heavier than
shorter people.

What is Correlation Coefficient?

The Pearson’s correlation coefficient (r) is a measure that determines the degree to
which the movement of two variables is associated. The value of Correlation Coefficient
lies between -1 and 1.

Formula: (Pearson’s Correlation Coefficient) - Standard Formula

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,S_x S_y}$$

In Python: DataFrame.corr(method='pearson')

(n = sample size, and Sx, Sy are the standard deviations of samples x and y. X-bar and Y-bar
are the respective means of the x and y samples, whereas Xi and Yi are sample points of X and Y.)
Positive and Negative Correlation:

1. A Correlation Coefficient greater than zero indicates a positive relationship,

2. a value less than zero signifies a negative relationship,

3. and a value of zero indicates no linear relationship between the two variables being compared.
Strong and Weak Correlation:

Kind of correlation = depicted by the sign of the correlation coefficient
How strong = depicted by the magnitude of the correlation coefficient
Rule of thumb: Any relationship with magnitude of r greater than 0.75 can be
considered to be a strong correlation.
E.g.: -0.84 is a strong Negative correlation and 0.90 is a strong positive correlation.
Question: The local ice cream shop keeps track of how much ice cream they sell versus the
temperature on that day, here are their figures for the last 12 days. Can you tell if Ice cream
sales are correlated to that of temperature? Find out the nature and strength of correlation.

Temperature    Ice Cream Sales
14.2°          $215
16.4°          $325
11.9°          $185
15.2°          $332
18.5°          $406
22.1°          $522
19.4°          $412
25.1°          $614
23.4°          $544
18.1°          $421
22.6°          $445
17.2°          $408
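
The question can be worked through with a small hand-written Pearson coefficient. The function `pearson_r` is an illustrative sketch; it divides the sum of products of deviations by the square root of the product of the sums of squared deviations, which is algebraically the same as dividing by (n - 1) times the two sample standard deviations.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r = pearson_r(temperature, sales)
nature = "positive" if r > 0 else "negative"
strength = "strong" if abs(r) > 0.75 else "not strong"
print(f"r = {r:.4f} ({strength} {nature} correlation)")
```

For these figures r comes out at roughly 0.96, so by the 0.75 rule of thumb the sales are strongly and positively correlated with temperature.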
Spearman Rank Correlation

• Used when the relationship between the variables is monotonic but not necessarily linear

• Spearman Corr Coeff = Pearson Corr Coeff computed on the rank variables

• In Python: DataFrame.corr(method='spearman')

• Denoted by rho (ρ).
Steps for Spearman Correlation Coefficient

1. Create a new column for rank(x) and assign the rank of each variable.
2. Assign the rank of 2nd variable in a new column rank(y).
3. Calculate the difference in rank of both the variables = d.
4. Calculate the d-squared.
5. Add up d-squared score.
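6. Put in the formula provided:

$$\rho = 1 - \frac{6\sum d_i^{2}}{n(n^{2}-1)}$$

where d is the difference in ranks for each pair and n is the number of pairs.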
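
The six steps can be sketched in Python as follows. Both helper functions are hypothetical, the scores are invented for illustration (they are not the students' scores from the question below), and the simple ranking used here assumes there are no tied values; with ties, average ranks and the Pearson-on-ranks form would be needed instead.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x = ranks(xs)                                   # step 1
    rank_y = ranks(ys)                                   # step 2
    d = [rx - ry for rx, ry in zip(rank_x, rank_y)]      # step 3
    d_squared_total = sum(di ** 2 for di in d)           # steps 4 and 5
    return 1 - 6 * d_squared_total / (n * (n ** 2 - 1))  # step 6

# Hypothetical scores for five students in two subjects.
subject_a = [10, 20, 30, 40, 50]
subject_b = [12, 11, 35, 30, 60]

rho = spearman_rho(subject_a, subject_b)
print(rho)
```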
Question: The scores for 10 students in English and Maths are as follows:

Compute the Spearman rank correlation.


Solution:

Step 1,2,3 and 4:


Solution Contd.

Step 5:

Step 6:

Hence, the Spearman Rank Coefficient is 0.67.


Recap – Descriptive Statistics

• Statistics? Its Importance. Population vs Sample.


• Types of variable – (Quantitative, Categorical), - (Ordinal, Nominal), (discrete, continuous).
• Types of charts – Pie, Donut, Line, Scatterplot, Histogram, Bar
chart, Box-plot
• Descriptive Stats:

Measure of central tendency -- Mean , Median & Mode


Measures of Dispersion/spread – Standard Deviation, Variance, range & IQR
Measures of symmetricity – Skewness and Kurtosis

• 5 Number Summary – Box Plot (Box and Whiskers)


• Effect of transformation on central tendency and spread.
• Outliers? How to detect? Modified Boxplot. (IQR Method, Z-Score
Method)
• Covariance, Correlation. Pearson’s Correlation Coefficient. Nature
& Strength of Correlation. How to calculate Pearson’s Correlation
Coefficient and Spearman’s Rank Correlation Coefficient.
