Statistics ClassNotes - 2

Numerical Summary Measures

Numerical summary measures are used to identify the location and dispersion of the variables. They provide a concise representation of the data through quantities such as the mean, median, mode, variance, and standard deviation.
Measure of central tendency
One of the most commonly investigated characteristics of a set of data is its central location, or the point about which the observations tend to cluster. The most commonly used measures of central tendency are:

Mean
● Most frequently used measure of central tendency.
● In general, the mean refers to the arithmetic mean.
● Also known as the average.
● Not appropriate for nominal or ordinal data.
● The mean is calculated by adding up all the values in a dataset and dividing by
the total number of values. Mathematically, it can be represented as:
Mean = (sum of all values) / (total number of values)
Represented as x̄ = (x₁ + x₂ + ... + xₙ) / n, where n is the total number of observations.

Disadvantage:
Extremely sensitive to outliers, hence not a good choice when outliers are present or the data is skewed.

Example: Let's take an example dataset of 5 numbers: 10, 15, 20, 25, and 30. We can
calculate the mean as follows:
Mean = (10 + 15 + 20 + 25 + 30) / 5
Mean = 100 / 5
Mean = 20
So, the mean of this dataset is 20.
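The same calculation in Python (a minimal sketch; the variable names are illustrative):

# Mean of the example dataset, computed two ways
import statistics

values = [10, 15, 20, 25, 30]

mean_manual = sum(values) / len(values)    # (sum of all values) / (total number of values)
mean_builtin = statistics.mean(values)     # same result via the standard library

print(mean_manual, mean_builtin)           # 20.0 20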

● Median
● Not sensitive to outliers, hence a more robust measure than the mean.
● Defined as the 50th percentile of a set of measurements, when the observed measurements are ranked from smallest to largest.
● How the median is identified depends on whether the number of observations (n) is odd or even.

Median for Odd number of observations


● First example: [3, 4, 5, 6, 7], total number of observations (n) is 5.

▪ (n + 1)/2
▪ (5 + 1)/2
▪ 6/2
▪ 3rd position
● 5 is located at the 3rd position, hence 5 is the median.

Median for Even number of observations


● Second example: [3, 4, 5, 6, 7, 8], total number of observations (n) is 6.
▪ Average of the (n/2)th and ((n/2) + 1)th observations
▪ 6/2, (6/2) + 1
▪ 3, 3 + 1
▪ 3, 4

● n is 6, so the positions are the 3rd and the 4th. The median is therefore the average of 5 and 6, i.e. 5.5.
● The median is not sensitive to outliers, hence it is more robust and a better choice than the mean when outliers are present or the data is skewed.

● For example, if the data points are:


● 1, 2, 3, 4, 90
● The mean is 20, which is heavily influenced by the outlier 90 in this dataset.
● The median is 3, unaffected by the outlier 90.
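A quick check of the same point in Python (a sketch, using the small dataset above):

import statistics

data = [1, 2, 3, 4, 90]               # dataset with an outlier (90)
print(statistics.mean(data))          # 20  -> pulled up by the outlier
print(statistics.median(data))        # 3   -> unaffected by the outlier

even_data = [3, 4, 5, 6, 7, 8]        # even n: average of the 3rd and 4th ordered values
print(statistics.median(even_data))   # 5.5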

● Mode
● Can be used as a summary measure for all types of data.
● The observation that occurs most frequently.
● Is always a value that actually occurs in the observations (unlike the mean).
● Not always well defined, since there may not be a single value that occurs most frequently.
● For example:

● 18, 19, 20, 21, 22 (no mode)


● 18, 19, 20, 19, 21 (the mode is 19, as it occurs twice; this is an example of a unimodal dataset)
● 18, 19, 20, 20, 19, 21, 22 (bimodal, as 19 and 20 occur equally often)
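A small sketch in Python of how the mode(s) could be found; the modes helper below is illustrative, not a standard function:

from collections import Counter

def modes(observations):
    # Return every value that occurs with the highest frequency
    counts = Counter(observations)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([18, 19, 20, 21, 22]))          # all values tie at one occurrence -> no meaningful mode
print(modes([18, 19, 20, 19, 21]))          # [19]      (unimodal)
print(modes([18, 19, 20, 20, 19, 21, 22]))  # [19, 20]  (bimodal)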

The Best approach for measures of Central Tendency


Although there are different measures of central tendency, which one is best depends on the distribution of the variable.

● If the data is not symmetric or outliers are present, the median is the best choice.
● If the data is symmetric, the mean is the best choice.

The quantiles (quartiles, deciles, percentiles, etc.) are the (k - 1) values of the variable which divide the sample into k equal parts when the sample values are arranged in increasing order. They identify locations in the distribution other than the center.

Measures of dispersion/variability

Knowing the central location is not enough. What else would it be useful to know? A key issue is how alike or "unlike" each other the individual observations are, and how we can measure this "unlikeness". Measures of dispersion or variability are used to describe the spread or distribution of data points in a dataset. Consider the scores of two batsmen in their last 10 matches:

Batsman A scored: 25, 20, 45, 93, 8, 14, 32, 87, 72, 4
Batsman B scored: 33, 50, 47, 38, 45, 40, 36, 48, 37, 26
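A quick check in Python (a sketch; the list names are illustrative) confirms that both averages are identical:

batsman_a = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]
batsman_b = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]

# Both means come out to 40, even though the spreads clearly differ
print(sum(batsman_a) / len(batsman_a))   # 40.0
print(sum(batsman_b) / len(batsman_b))   # 40.0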

The mean for both batsmen is 40, but the scores suggest significant differences. Who is more consistent (i.e., has less variability), given that the averages are the same? Batsman B's runs are grouped around the mean, whereas Batsman A's are much more scattered. Hence, measures of central tendency are not sufficient to give a complete picture; we also have to examine the spread in the data. There are many ways to examine the spread. A few are:

● Range
● The range is defined as the difference between the largest observation and the smallest.
● Its usefulness is limited.

● It considers only the extreme values in the dataset, rather than the majority of observations.
● Like the mean, it is highly sensitive to outliers.
Range = Maximum - Minimum
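For instance, computed for the two batsmen above (a sketch; the helper name is illustrative):

def value_range(observations):
    # Range = maximum - minimum
    return max(observations) - min(observations)

print(value_range([25, 20, 45, 93, 8, 14, 32, 87, 72, 4]))    # Batsman A: 93 - 4  = 89
print(value_range([33, 50, 47, 38, 45, 40, 36, 48, 37, 26]))  # Batsman B: 50 - 26 = 24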

● Interquartile Range (IQR)


● One of the most important measures of variability; it is not influenced by extreme values, hence robust.
● Rather than the minimum and maximum values, it uses quartiles to examine the spread.
● The quartiles divide the distribution into four equal groups, hence there are k - 1 = 3 quartile values.
● The IQR is calculated by subtracting the 25th percentile (Q1) from the 75th percentile (Q3).

● It encompasses the middle 50% of the observations.


● The IQR is also represented in a box-and-whisker plot, which is drawn from the quartiles.

IQR = Q3 - Q1

Once the data is ordered:


● Q1 is known as the first quartile or lower quartile. 25% of the data lies between the minimum and Q1.
● Q2 is known as the median. 50% of the data lies between the minimum and Q2.
● Q3 is known as the third quartile or upper quartile. 75% of the data lies between the minimum and Q3.
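A sketch of the quartile and IQR calculation in Python, using numpy's default (linear interpolation) percentile convention; other quartile conventions give slightly different values:

import numpy as np

scores = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]   # Batsman B

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

print(q1, median, q3)   # 36.25 39.0 46.5 (with linear interpolation)
print(iqr)              # 10.25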

Outlier detection using IQR


There are many ways to identify outliers. One of the most common approaches uses the IQR.

Lower Bound: Q1 - 1.5 * IQR


Upper Bound: Q3 + 1.5 * IQR
Anything below the lower bound (Q1 - 1.5 * IQR) or above the upper bound (Q3 + 1.5 * IQR) is considered an outlier under the IQR approach. A fence of ±1 * IQR would be too narrow and flag too many observations as outliers, while ±2 * IQR would be too wide and flag too few. Hence, ±1.5 * IQR is a compromise, proposed by John Tukey with the Gaussian (normal) distribution in mind. You will learn about the Gaussian distribution (normal distribution) when we cover probability concepts.

For a normal distribution, the 3rd quartile (Q3) is positioned at 0.675 SD (standard deviations, sigma) above the mean. The IQR (Q3 - Q1) therefore spans 2 x 0.675 SD = 1.35 SD. The upper outlier fence is obtained by adding 1.5 x IQR to Q3, i.e., 0.675 SD + 1.5 x 1.35 SD = 2.7 SD. This level would declare about 0.7% of the measurements to be outliers.
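A sketch of this rule in Python (the iqr_outliers helper is illustrative, not a library function):

import numpy as np

def iqr_outliers(observations, k=1.5):
    # Flag observations outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)
    q1, q3 = np.percentile(observations, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in observations if x < lower or x > upper]

print(iqr_outliers([1, 2, 3, 4, 90]))   # [90] -> only the extreme value is flagged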

Extreme Outliers

Data points that fall below Q1 - 3 * IQR or above Q3 + 3 * IQR are considered extreme outliers. Most of the time, small deviations from normality are not an issue, but extreme outliers or extreme skewness may be. This applies to features as well as to residuals; residuals will be discussed later during inferential statistics.
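Using the same hypothetical iqr_outliers helper from above with k = 3 gives the extreme-outlier fences:

print(iqr_outliers([1, 2, 3, 4, 90], k=3))   # [90] -> 90 is extreme even under the wider fences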
● Variance

● One of the most commonly used measures of dispersion is variance.


● The variance quantifies the amount of variability, or spread, around the
mean of the measurements.
● If we simply averaged the deviations of the individual observations from the mean, the summation would always be mathematically zero.
● Hence, we square the deviations from the mean and then find their average.

● Used for symmetric data only


● For asymmetric data, a large variance may simply reflect one extreme value rather than genuine spread, hence variance is not a good measure for asymmetric data.
● However, the mathematical calculation is the same either way.
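A sketch of the calculation in Python, using Batsman A's scores from earlier (the variable names are illustrative):

import statistics

data = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]   # Batsman A
mean = sum(data) / len(data)

print(sum(x - mean for x in data))   # 0.0 -> raw deviations always cancel out

pop_var = sum((x - mean) ** 2 for x in data) / len(data)           # population variance
sample_var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)  # sample variance (n - 1)

print(pop_var, statistics.pvariance(data))     # 975.2 both ways
print(sample_var, statistics.variance(data))   # about 1083.56 both ways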

● Standard deviation

● Standard deviation (SD) is the positive square root of the variance

● SD is used more frequently than variance.


● SD has the same unit as the mean.
● When comparing two groups of data, the group with the smaller SD has the more homogeneous observations.
● The group with the larger SD exhibits a greater amount of variability.
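For example, the two batsmen from earlier (a sketch; pstdev gives the population standard deviation):

import statistics

batsman_a = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]
batsman_b = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]

# Same mean (40), very different spread: the smaller SD marks the more consistent batsman
print(statistics.pstdev(batsman_a))   # about 31.2 runs
print(statistics.pstdev(batsman_b))   # about 7.2 runs
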
Population Parameter vs Sample Statistics

Different symbols are used for population parameters and sample statistics: the population mean μ versus the sample mean x̄, the population variance σ² versus the sample variance s², the population standard deviation σ versus the sample standard deviation s, and the population size N versus the sample size n.

Measures of Shape

Skewness
Skewness is a measure of the asymmetry of the probability distribution of a dataset. It
indicates the direction and degree to which the data is skewed or biased towards one tail of
the distribution. Skewness can help in understanding the shape and symmetry of a dataset.

● When the data is not symmetric, we say the data is skewed.


● The normal distribution has skewness zero.
● When there is a long tail towards the right, the data is right skewed.
● When the tail is long towards the left, the data is left skewed.

● For symmetric or normally distributed data, mean = median (nearly equal).


● For right-skewed (positively skewed) data, mean > median.
● For left-skewed (negatively skewed) data, mean < median.
- For symmetric data, skewness lies between -0.5 and 0.5.
- For moderately skewed data, it lies between -1 and -0.5 (left skewed) or between 0.5 and 1 (right skewed).
- For highly skewed data, it is less than -1 (left skewed) or greater than +1 (right skewed).
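As a small sketch (assuming scipy is available; the example dataset is illustrative):

import statistics
from scipy.stats import skew

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # long tail towards the right

print(skew(right_skewed))                # positive value -> right skewed
print(statistics.mean(right_skewed))     # 4.7 -> mean is pulled towards the tail
print(statistics.median(right_skewed))   # 3.0 -> mean > median, as expected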
