Statistics ClassNotes - 2
Numerical summary measures describe the location and dispersion of a variable. They provide
a concise representation of the data through quantities such as the mean, median, mode,
variance, and standard deviation.
Measure of central tendency
The most commonly investigated characteristic of a set of data is its central location, that is,
the point about which the observations tend to cluster. The most commonly used measures of
central tendency are:
● Mean
● Most frequently used measure of central tendency.
● Unless stated otherwise, "mean" refers to the arithmetic mean
● Also known as average
● Not appropriate for nominal or ordinal data
● The mean is calculated by adding up all the values in a dataset and dividing by
the total number of values. Mathematically, it can be represented as:
Mean = (sum of all values) / (total number of values)
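Represented as: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $x_i$ are the observed
values and $n$ is the total number of values.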
Disadvantage:
The mean is extremely sensitive to outliers and skewed data, so it is not a good summary
when outliers are present or the data is skewed.
Example: Let's take an example dataset of 5 numbers: 10, 15, 20, 25, and 30. We can
calculate the mean as follows:
Mean = (10 + 15 + 20 + 25 + 30) / 5
Mean = 100 / 5
Mean = 20
So, the mean of this dataset is 20.
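The same calculation can be sketched in Python with the standard library's `statistics.mean`. The second list is a hypothetical variation (not from the notes) that replaces 30 with an outlier of 300, to illustrate how strongly a single extreme value pulls the mean.

```python
from statistics import mean

# Dataset from the worked example above.
values = [10, 15, 20, 25, 30]
print(mean(values))  # 20

# Hypothetical variation: the largest value is replaced by an outlier.
with_outlier = [10, 15, 20, 25, 300]
print(mean(with_outlier))  # 74, far from where most of the values lie
```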
● Median
● Not sensitive to outliers, hence a more robust measure than the mean
● Defined as the 50th percentile of a set of measurements, when observed
measurements are ranked from smallest to largest.
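● The position of the median depends on the number of observations (n), once the
observations are ranked.
▪ If n is odd, the median is the value at position (n + 1)/2.
▪ For n = 5: (5 + 1)/2 = 6/2 = 3, so the median is at the 3rd position.
● If the value 5 is located at the 3rd position, then 5 is the median.
● If n is even, the median is the average of the values at positions n/2 and n/2 + 1.
For n = 6 these are the 3rd and 4th positions; if the values there are 5 and 6, the
median is (5 + 6)/2 = 5.5.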
● Because the median is not sensitive to outliers, it is the better choice when
outliers are present or the data is skewed.
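The position rule can be sketched as a small function. The name `median_by_position` and the sample values are assumptions chosen for illustration so that 5 lands at the 3rd position; they are not the notes' original data. The result is compared with the standard library's `statistics.median`.

```python
from statistics import median


def median_by_position(observations):
    """Median via the position rule: rank the data, then pick the middle."""
    ranked = sorted(observations)
    n = len(ranked)
    if n % 2 == 1:
        # Odd n: value at position (n + 1)/2 (1-based), so index (n + 1)//2 - 1.
        return ranked[(n + 1) // 2 - 1]
    # Even n: average of the values at positions n/2 and n/2 + 1 (1-based).
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


odd_sample = [9, 1, 5, 3, 6]        # hypothetical, ranked: 1, 3, 5, 6, 9
even_sample = [9, 1, 5, 3, 6, 10]   # hypothetical, ranked: 1, 3, 5, 6, 9, 10

print(median_by_position(odd_sample))   # 5
print(median_by_position(even_sample))  # 5.5

# Replacing the largest value with an outlier leaves the median unchanged.
print(median_by_position([900, 1, 5, 3, 6]))  # 5
```

Sorting first matters: the position rule applies to the ranked observations, not to the order in which they were recorded.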
● Mode
● Can be used as a summary measure for all types of data.
● The observation that occurs most frequently.
● Always a value that actually occurs in the observations, unlike the mean, which
need not be an observed value
● Not always well-defined since there may not be one value that occurs most
frequently.
● For example:
● If the data is not symmetric or outliers are present, the median is the better
measure of central location.
● If the data is symmetric, the mean is the better measure.
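A minimal sketch of the mode, assuming Python 3.8 or later for `statistics.multimode`. All three lists are hypothetical examples: one nominal variable, one with a tie, and one with no repeated value, which shows why the mode is not always well defined.

```python
from statistics import mode, multimode

# The mode works for nominal data, where a mean or median has no meaning.
colours = ["red", "blue", "red", "green"]
print(mode(colours))  # red

# Two values tie for most frequent, so there is no single mode.
tied = [1, 2, 2, 3, 3]
print(multimode(tied))  # [2, 3]

# No value repeats: every observation is equally "most frequent".
no_repeats = [4, 7, 9]
print(multimode(no_repeats))  # [4, 7, 9]
```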
The quantiles (quartiles, deciles, percentiles, etc.) are the (k - 1) values of the variable that
divide the sample into k equal parts when the sample values are arranged in increasing order:
k = 4 for quartiles, k = 10 for deciles, and k = 100 for percentiles.
They identify locations other than the center.
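As a sketch of the "(k - 1) cut points" idea, the standard library's `statistics.quantiles` returns exactly k - 1 values for `n=k`. The sample is hypothetical, and the function's default interpolation rule is only one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import median, quantiles

# Hypothetical sample, not from the notes.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]

quartiles = quantiles(sample, n=4)   # 3 cut points: Q1, Q2, Q3
deciles = quantiles(sample, n=10)    # 9 cut points

print(len(quartiles), len(deciles))  # 3 9

# The second quartile is the 50th percentile, i.e. the median.
print(quartiles[1], median(sample))  # 7.5 7.5
```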
Measures of dispersion/variability
Knowing the central location is not enough. What else would be useful to know? A key
issue is how alike or “unlike” each other the individual observations are. How can we
measure this “unlikeness”? Measures of dispersion, or variability, describe the spread of the
data points in a dataset.
The figures above give the scores of two batsmen in their last 10 matches.
Batsman A scored: 25, 20, 45, 93, 8, 14, 32, 87, 72, 4
Batsman B scored: 33, 50, 47, 38, 45, 40, 36, 48, 37, 26
The mean score for both batsmen is 40, but the scores themselves show significant differences.
Since the averages are the same, who is more consistent (that is, who has less variability)?
Batsman B's runs are grouped around the mean, whereas Batsman A's are more scattered.
Hence, measures of central tendency are not sufficient to give a complete picture; we also
have to examine the spread of the data. There are many ways to do so. A few are:
● Range
● Range is defined as the difference between the largest observation and the
smallest.
● Its usefulness is limited.
● It considers only the extreme values in the dataset, rather than the majority of
observations.
● Like the mean, it is highly sensitive to outliers.
Range = Maximum - Minimum
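Applied to the two batsmen's scores listed earlier, the range can be sketched as follows; `value_range` is a helper name introduced here for illustration.

```python
batsman_a = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]
batsman_b = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]


def value_range(scores):
    """Range = Maximum - Minimum."""
    return max(scores) - min(scores)


print(value_range(batsman_a))  # 89
print(value_range(batsman_b))  # 24
```

Both batsmen average 40, yet Batsman A's range is more than three times Batsman B's. Note that only two scores per batsman enter each result.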
Interquartile range: IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles.
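For a normal distribution, the 3rd quartile (Q3) lies 0.675 SD (standard deviations, sigma)
above the mean, and Q1 lies 0.675 SD below it. The IQR (Q3 - Q1) therefore spans
2 x 0.675 SD = 1.35 SD. The upper outlier fence is obtained by adding 1.5 x IQR to Q3, i.e.,
0.675 SD + 1.5 x 1.35 SD = 2.7 SD above the mean; by symmetry, the lower fence
Q1 - 1.5 x IQR lies 2.7 SD below the mean. For normally distributed data, these fences would
declare about 0.7% of the measurements to be outliers.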
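These figures can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8 or later). This is only a sketch for a standard normal distribution (mean 0, SD 1), so all quantities are in SD units.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = standard_normal.inv_cdf(0.25)   # about -0.674
q3 = standard_normal.inv_cdf(0.75)   # about  0.674
iqr = q3 - q1                        # about  1.349

upper_fence = q3 + 1.5 * iqr         # about  2.698
lower_fence = q1 - 1.5 * iqr         # about -2.698

# Probability mass outside the two fences.
outlier_fraction = standard_normal.cdf(lower_fence) + (
    1 - standard_normal.cdf(upper_fence)
)

print(round(q3, 3), round(iqr, 3), round(upper_fence, 3))
print(f"{100 * outlier_fraction:.2f}% flagged as outliers")  # about 0.7%
```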
Extreme Outliers
Data points that are more extreme than the outer fences, that is, below Q1 - 3 * IQR or above
Q3 + 3 * IQR. Most of the time, small deviations from normality are not an issue, but extreme
outliers or extreme skewness may be. This applies to both features and residuals. Residuals
will be discussed later, under inferential statistics.
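A minimal sketch of fence-based outlier detection, assuming the conventional fences: 1.5 x IQR beyond Q1 and Q3 for ordinary outliers, and 3 x IQR beyond Q1 and Q3 for extreme ones. The function name, the sample data, and the choice of the "inclusive" quartile method are all assumptions for illustration; other quartile conventions can shift the fences slightly.

```python
from statistics import quantiles


def classify_outliers(data):
    """Split outliers into mild (beyond 1.5 x IQR) and extreme (beyond 3 x IQR)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

    mild, extreme = [], []
    for x in data:
        if x < extreme_low or x > extreme_high:
            extreme.append(x)
        elif x < mild_low or x > mild_high:
            mild.append(x)
    return {"mild": mild, "extreme": extreme}


# Hypothetical sample: Q1 = 13.25, Q3 = 17.75, IQR = 4.5,
# so the inner fences are 6.5 and 24.5 and the outer fences are -0.25 and 31.25.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
print(classify_outliers(sample))  # {'mild': [30], 'extreme': [60]}
```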
● Variance
● Standard deviation
These are symbols used to represent either population parameters or sample statistics
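As a sketch of how variance and standard deviation separate the two batsmen, the example below assumes the usual definitions: variance is the average squared deviation from the mean, and standard deviation is its square root. The standard library distinguishes the population versions (`pvariance`, `pstdev`, dividing by n) from the sample versions (`variance`, `stdev`, dividing by n - 1).

```python
from statistics import pstdev, pvariance, stdev, variance

batsman_a = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]
batsman_b = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]

for name, scores in (("A", batsman_a), ("B", batsman_b)):
    print(
        f"Batsman {name}: "
        f"population variance = {pvariance(scores):.1f}, "
        f"population SD = {pstdev(scores):.1f}, "
        f"sample variance = {variance(scores):.1f}, "
        f"sample SD = {stdev(scores):.1f}"
    )
```

Batsman A's population variance (975.2) is about nineteen times Batsman B's (51.2), which quantifies the observation that B's runs are grouped around the mean while A's are scattered.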
Measures of Shape
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a dataset. It
indicates the direction and degree to which the data is skewed or biased towards one tail of
the distribution. Skewness can help in understanding the shape and symmetry of a dataset.
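The notes do not give a formula here, so the sketch below assumes one common definition, the moment coefficient of skewness: the third central moment divided by the second central moment raised to the power 1.5. The function name and the sample lists are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(data)
    centre = sum(data) / n
    m2 = sum((x - centre) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]       # one large value stretches the right tail
left_tailed = [-20, -4, -3, -2, -1]   # mirror image of right_tailed

print(skewness(symmetric))     # 0.0
print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```

Under this definition, a value of zero indicates symmetry about the mean, a positive value indicates a longer right tail, and a negative value a longer left tail.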