Descriptive Statistics
Swapnil Desai
(Senior Data Scientist)
What is Statistics?
Statistics is the science of collecting, analysing and interpreting data. In general, its investigations and analyses fall into two broad categories: descriptive and inferential statistics.
1.Decision Making: Statistics helps in making informed decisions in various fields like
business, science, government, healthcare, etc., by providing a way to understand and
interpret data.
2.Data-Driven World: In our increasingly data-driven world, statistics are essential for
making sense of the large amounts of data generated daily.
Developing Statistical Thinking
The study of statistics involves math and relies upon calculations with numbers. But it also relies
heavily on how the numbers are chosen and how the statistics are interpreted. For example,
consider the following scenarios and the interpretations based upon the presented statistics.
1. A new advertisement for Amul’s ice cream introduced in late May of last year resulted in a
30% increase in ice cream sales for the following three months. Thus, the advertisement was
effective.
2. The more liquor shops there are in a city, the more crime there is. Thus, liquor shops lead to crime.
1. Flaw: A major flaw is that ice cream consumption generally increases in the months of
June, July, and August regardless of advertisements. This effect is called a history effect
and leads people to interpret outcomes as the result of one variable when another
variable (in this case, one having to do with the passage of time) is actually
responsible.
2. Flaw: A major flaw is that both increased liquor shops and increased crime rates can be
explained by larger populations. In bigger cities, there are both more liquor shops and
more crime. This is known as the third-variable problem: a third variable
can cause both situations; however, people erroneously believe that there is a causal
relationship between the two primary variables rather than recognizing that a third
variable can cause both.
1. Descriptive statistics deals with the processing of data without attempting to draw any
inferences from it. The characteristics of the data are described in simple terms. Events that are
dealt with include everyday happenings such as accidents, prices of goods, business, incomes,
epidemics, sports data, population data.
2. Inferential statistics is a scientific discipline that uses mathematical tools to make forecasts
and projections by analysing the given data. This is of use to people employed in such fields as
engineering, economics, biology, the social sciences, business, agriculture and
communications.
Population? Sample?
INTERVAL
An interval scale is one where there is order and the difference between two values is meaningful.
RATIO
A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.
• The difference between interval and ratio scales comes from their ability to dip below zero. Interval scales
hold no true zero and can represent values below zero. For example, you can measure temperature below
0 degrees Celsius, such as -10 degrees.
• Ratio variables, on the other hand, never fall below zero. Height and weight measure from 0 and above,
but never fall below it.
Data Visualization
Basics
What is Data Visualization?
We need data visualization because a visual summary of information makes it easier to identify
patterns and trends than looking through thousands of rows on a spreadsheet. It’s the way the
human brain works.
Since the purpose of data analysis is to gain insights, data is much more valuable when it is
visualized.
Even if a data analyst can pull insights from data without visualization, it will be more difficult
to communicate the meaning without visualization.
Line Chart.
A line chart is, as one can imagine, a line or multiple lines showing how a single variable, or
multiple variables, develop over time.
Pie Chart.
A pie chart is a circular graph divided into slices. The larger a slice is, the bigger the portion of
the total quantity it represents.
Bar Graph.
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent. The bars can be
plotted vertically or horizontally, and the chart can show one variable or many variables.
Histogram
A histogram shows the distribution of a numerical variable by splitting its range into bins and drawing a bar whose height is the count of values falling in each bin.
Scatter Plots
A scatter plot is a great indicator that allows us to see whether there is a pattern to be found
between two variables, e.g., a positive or negative relationship.
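To make these chart types concrete, here is a minimal sketch using matplotlib; the library choice and the toy data are assumptions for illustration only, not part of the original slides.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)                    # e.g. months 1..10
y = x + np.random.normal(0, 1, 10)      # toy values with an upward trend

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes[0, 0].plot(x, y)                                          # line chart: development over time
axes[0, 0].set_title("Line chart")
axes[0, 1].pie([40, 30, 20, 10], labels=["A", "B", "C", "D"])  # pie chart: slices of a total
axes[0, 1].set_title("Pie chart")
axes[0, 2].bar(["A", "B", "C", "D"], [5, 7, 3, 8])             # bar graph: categorical data
axes[0, 2].set_title("Bar graph")
axes[1, 0].hist(np.random.normal(0, 1, 500), bins=20)          # histogram: distribution of one variable
axes[1, 0].set_title("Histogram")
axes[1, 1].scatter(x, y)                                       # scatter plot: relationship between two variables
axes[1, 1].set_title("Scatter plot")
axes[1, 2].axis("off")                                         # unused panel
plt.tight_layout()
plt.show()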
Descriptive Statistics
• Descriptive statistics are a set of techniques and measures used to summarize, organize,
and describe the main features of a dataset.
• These statistics provide a way to understand the essential characteristics of the data
without necessarily making inferences or drawing conclusions about a larger population.
• Events that are dealt with include everyday happenings such as accidents, prices of goods, business,
incomes, epidemics, sports data, population data.
1. Mean
The mean is the average of all the values in a dataset. It is calculated by summing up all the values and
then dividing by the total number of values. The mean is a number around which the whole data is
spread out. It is denoted by µ for the population mean and x̄ for the sample mean.
2. Median
The median is the middle value of a dataset when it's arranged in ascending or descending order. If
there's an even number of values, the median is the average of the two middle values.
(Note: Sorting the data in descending order does not affect the median, but the two quartiles swap, so the IQR
computed as Q3 − Q1 would come out negative. The IQR is discussed on the next slide.)
3. Mode
The mode is the value that appears most frequently in a dataset, i.e., the term with the highest frequency.
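A small sketch of these three measures of central tendency using Python's built-in statistics module; the sample data is made up for illustration.

import statistics

data = [3, 7, 7, 2, 9, 7, 4]
print(statistics.mean(data))    # arithmetic mean: sum of the values / number of values
print(statistics.median(data))  # middle value of the sorted data
print(statistics.mode(data))    # most frequent value (7 here)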
What are the 1st and 3rd Quartiles? – Also called the lower and upper quartile,
respectively.
When we divide the dataset into two groups while calculating median (sorted in
ascending order), then the median of first half is 1st Quartile and median of second half is
3rd Quartile.
Example data: 19, 26, 25, 37, 32, 28, 22, 23, 29, 34, 39, 31
Sorted (ascending): 19, 22, 23, 25, 26, 28, 29, 31, 32, 34, 37, 39
Exercise: Find Q3 (75%) and the median of the data above.
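A sketch that computes the quartiles of the data above using the median-of-halves approach described on this slide. Note that library functions such as numpy.percentile use interpolation conventions that can give slightly different quartile values.

import statistics

data = sorted([19, 26, 25, 37, 32, 28, 22, 23, 29, 34, 39, 31])
n = len(data)
lower_half = data[: n // 2]          # first half (n is even here)
upper_half = data[n // 2:]           # second half
q1 = statistics.median(lower_half)   # 1st quartile = median of the first half
q2 = statistics.median(data)         # median of the whole data set
q3 = statistics.median(upper_half)   # 3rd quartile = median of the second half
print(q1, q2, q3)                    # 24, 28.5, 33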
2. Measure of Spread / Dispersion
1. Standard deviation
Standard deviation is the measurement of average distance between each quantity and
mean. That is, how data is spread out from mean. A low standard deviation indicates that the
data points tend to be close to the mean of the data set, while a high standard deviation
indicates that the data points are spread out over a wider range of values.
In Python, the statistics module provides pstdev() for the standard deviation of a population
and stdev() for the standard deviation of a sample.
2. Variance
Variance is the average of the squared distances between each quantity and the mean; equivalently, it is the
square of the standard deviation.
3. Range
The range is a measure of the spread or dispersion of a set of data points and is one of the simplest
techniques of descriptive statistics. It is the difference between the lowest and the highest value.
(Note: When we write down Minimum, Maximum, Q1, Q2 (Median) and Q3, this is
called 5-point summary or 5 number summary)
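A combined sketch of these spread measures and the 5-number summary, using the statistics module and numpy; the data is illustrative only, and np.percentile's default interpolation may differ slightly from the median-of-halves quartiles shown earlier.

import statistics
import numpy as np

data = [4, 8, 15, 16, 23, 42]

print(statistics.pstdev(data))     # population standard deviation
print(statistics.stdev(data))      # sample standard deviation
print(statistics.pvariance(data))  # population variance (square of pstdev)
print(statistics.variance(data))   # sample variance (square of stdev)
print(max(data) - min(data))       # range = highest value - lowest value

# 5-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(min(data), q1, q2, q3, max(data))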
2. Suppose each one of them gained extra 5 Kg. weight during winters. Can you calculate
the new Mean and Standard deviation?
**Original Data:**
Weights of 5 persons: 105, 156, 145, 172, 100
1. **Calculating Mean:**
Mean (Average) = (Sum of all weights) / (Number of weights) = (105 + 156 + 145 + 172 + 100) / 5 = 678 / 5 = 135.6 kg
2. **Calculating Standard Deviation:**
Population standard deviation of the original weights ≈ 28.40 kg.
3. **After the 5 kg gain:**
Adding the same constant to every value shifts the mean by that constant but leaves the spread unchanged.
So, after each person gains 5 kg during the winter, the new mean weight is 135.6 + 5 = 140.6 kg, and
the new standard deviation is still approximately 28.40 kg.
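A quick sketch that verifies this with the statistics module: adding a constant moves the mean by that constant but leaves the standard deviation unchanged. The weights are the ones from the example above.

import statistics

weights = [105, 156, 145, 172, 100]
shifted = [w + 5 for w in weights]   # every person gains 5 kg

print(statistics.mean(weights), statistics.mean(shifted))      # 135.6 -> 140.6
print(statistics.pstdev(weights), statistics.pstdev(shifted))  # ~28.40 for both (population SD)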
3. Measure of Symmetricity & Shape – Skewness and Kurtosis
1. Skewness
Skewness is usually described as a measure of a dataset’s symmetry – or lack of
symmetry. A perfectly symmetrical data set will have a skewness of 0. The normal
distribution has a skewness of 0. Skewness is calculated as:
import numpy as np
from scipy.stats import skew
x = np.random.normal(0, 2, 10000) # create random values based on a normal distribution
print(skew(x))
Mathematically:
Skewness = (1/n) * Σ [(Xi − X̄) / s]^3
where n is the sample size, Xi is the ith X value, X̄ is the average and s is the
sample standard deviation. Note the exponent in the summation: it is “3”. The
skewness is referred to as the “third standardized central moment” for the probability model.
Skewness
So, when is the skewness too much? The rule of thumb seems to be:
1. If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
2. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately
skewed.
3. If the skewness is less than -1 or greater than 1, the data are highly skewed.
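A short sketch applying this rule of thumb to a right-skewed sample; the exponential distribution is used here purely as an illustration.

import numpy as np
from scipy.stats import skew

x = np.random.exponential(scale=2, size=10000)  # right-skewed data (skewness roughly 2)
s = skew(x)

# classify the sample according to the rule of thumb above
if -0.5 <= s <= 0.5:
    label = "fairly symmetrical"
elif -1 <= s <= 1:
    label = "moderately skewed"
else:
    label = "highly skewed"
print(s, label)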
Importance of Skewness:
Measures of asymmetry like skewness are the link between central tendency
measures and probability theory, which ultimately allows us to get a more complete
understanding of the data we are working with.
Knowing that the market has a 70% probability of going up and a 30% probability of going
down may appear helpful if you rely on normal distributions. However, if you were told
that if the market goes up, it will go up 2% and if it goes down, it will go down 10%, then
you could see the skewed returns and make a better informed decision.
2. Kurtosis
Kurtosis is a measure of the “tailedness” of a distribution, i.e., how heavy its tails are relative to a normal distribution. Kurtosis is calculated as:
import numpy as np
from scipy.stats import kurtosis
x = np.random.normal(0, 2, 10000) # create random values based on a normal distribution
print(kurtosis(x))
Mathematically:
Kurtosis = (1/n) * Σ [(Xi − X̄) / s]^4
where n is the sample size, Xi is the ith X value, X̄ is the average and s is the sample standard deviation.
Note the exponent in the summation: it is “4”. The kurtosis is referred to as the “fourth standardized central
moment for the probability model.”
What does the value of kurtosis tell us about the shape?
The reference standard is a normal distribution, which has a kurtosis of 3. Because of this,
the excess kurtosis is often reported instead: excess kurtosis is simply kurtosis − 3. For example,
the “kurtosis” reported by Excel and by many statistical libraries (scipy.stats.kurtosis, by default)
is actually the excess kurtosis; a short sketch after the list below illustrates this.
2. Outlier Detection : Large Kurtosis suggests there could be outliers in the data.
3. With high kurtosis, there is a chance of high variance, and hence a test on the mean could lead to bad results.
In that case, we would need to choose a more robust option, like a test on the median.
4. Financial Risk: e.g., the return of your asset can be farther from the mean (than predicted using a normal
distribution).
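A sketch illustrating the kurtosis vs. excess-kurtosis distinction mentioned above: scipy.stats.kurtosis returns the excess kurtosis by default, and passing fisher=False gives the plain kurtosis.

import numpy as np
from scipy.stats import kurtosis

x = np.random.normal(0, 2, 10000)
print(kurtosis(x))                # excess kurtosis (Fisher definition), close to 0 for normal data
print(kurtosis(x, fisher=False))  # plain kurtosis (Pearson definition), close to 3 for normal data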
Outliers
What is outlier?
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population. In a sense, this definition leaves it up to the analyst
to decide what will be considered abnormal.
For example, a Z-score of positive 2 indicates that an observation is two standard deviations
above the average while a Z-score of -2 signifies it is two standard deviations below the mean.
Z-Score Formula: z = (x − µ) / σ, where x is the observation, µ is the mean and σ is the standard deviation.
1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
Solution:
We will term a point an outlier if its z-score is 3 or more in absolute value (on either side, positive or negative).
Hence, here the outlier is 15.
Assignment 3: Write a Python code to detect outlier using Z Score Method
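A minimal sketch for this assignment, run on the example data from the previous slide; the |z| >= 3 cutoff follows the solution given above.

import statistics

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
mu = statistics.mean(data)       # mean of the data
sigma = statistics.pstdev(data)  # population standard deviation

# flag every point whose absolute z-score is 3 or more
outliers = [x for x in data if abs((x - mu) / sigma) >= 3]
print(outliers)                  # [15]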
Covariance
Covariance: Covariance measures how two variables/features move with respect to each other
and is an extension of the concept of variance (which describes how a single variable varies).
It can take any value from -∞ to +∞.
For example, height and weight are related; taller people tend to be heavier than
shorter people.
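A small sketch of covariance with numpy; the height and weight numbers below are made up for illustration, and np.cov returns the sample covariance matrix.

import numpy as np

height = [160, 165, 170, 175, 180, 185]  # cm (illustrative values)
weight = [55, 60, 66, 70, 78, 84]        # kg (illustrative values)

cov_matrix = np.cov(height, weight)      # 2x2 sample covariance matrix
print(cov_matrix[0, 1])                  # covariance of height and weight (positive: they move together)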
The Pearson’s correlation coefficient (r) is a measure that determines the degree to
which the movement of two variables is associated. The value of Correlation Coefficient
lies between -1 and 1.
r = Σ (Xi − X̄)(Yi − Ȳ) / ((n − 1) · Sx · Sy)
(n = sample size, Sx and Sy are the standard deviations of samples x and y, X̄ and Ȳ are the
respective means of the x and y samples, and Xi and Yi are the sample points of X and Y.)
Positive and Negative Correlation:
1. A positive value of r indicates that the two variables move in the same direction,
2. a negative value indicates that they move in opposite directions,
3. and a value of zero indicates no linear relationship between the two variables being compared.
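A sketch of the Pearson correlation coefficient using scipy, reusing the illustrative height/weight values from the covariance sketch.

from scipy.stats import pearsonr

height = [160, 165, 170, 175, 180, 185]
weight = [55, 60, 66, 70, 78, 84]

r, p_value = pearsonr(height, weight)  # r lies between -1 and 1
print(r)                               # close to +1 here: a strong positive correlation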
Strong and Weak Correlation:
The closer |r| is to 1 (positive or negative), the stronger the relationship; values of r near 0 indicate a weak relationship.
Spearman Correlation Coefficient
• A rank-based correlation coefficient, denoted by rho (ρ).
Steps for Spearman Correlation Coefficient
1. Create a new column for rank(x) and assign the rank of each variable.
2. Assign the rank of 2nd variable in a new column rank(y).
3. Calculate the difference in rank of both the variables = d.
4. Calculate the d-squared.
5. Add up d-squared score.
6. Put the values into the formula: ρ = 1 − (6 Σd²) / (n(n² − 1)).
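A sketch of the Spearman rank correlation with scipy on made-up English/Maths scores; these numbers are placeholders, not the table from the original slide. With no tied ranks, the formula ρ = 1 − (6 Σd²) / (n(n² − 1)) gives the same value as spearmanr.

from scipy.stats import spearmanr

english = [62, 70, 55, 80, 65, 58, 75, 68, 60, 72]  # placeholder scores
maths = [60, 72, 50, 78, 66, 55, 70, 64, 58, 74]    # placeholder scores

rho, p_value = spearmanr(english, maths)  # ranks the scores internally and correlates the ranks
print(rho)                                # between -1 and 1; strongly positive here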
Question: The scores for 10 students in English and Maths are as follows: