
Session: Descriptive Statistics

Definition

Descriptive Statistics refers to the methods used to summarize and organize data in a way that
provides a clear understanding of its main characteristics. These methods involve calculating
measures such as mean, median, mode, variance, standard deviation, and range, as well as creating
visualizations like histograms, box plots, and scatter plots.

Key Concepts:

• Measures of Central Tendency:

o Example: The mean (average) income of a sample population provides an idea of the
typical income level.

• Measures of Dispersion:

o Example: The standard deviation of test scores in a class shows how spread out the
scores are around the mean.

• Data Distribution:

o Example: A histogram can illustrate the distribution of ages in a survey sample,
revealing patterns such as skewness or symmetry.

• Shape of Distribution:

o Example: A skewed distribution might indicate that the data has a long tail on one
side, suggesting the presence of outliers.

Importance in Predictive Modelling

Role of Descriptive Statistics: Descriptive statistics play a crucial role in predictive modelling by
providing foundational insights into the data, which helps in making informed decisions throughout
the modelling process. These statistics allow analysts to understand the data's structure, detect
anomalies, and identify patterns that inform the choice of models and preprocessing techniques.

Key Importance:

1. Understanding Data Characteristics:

o Example: Before building a predictive model for housing prices, descriptive statistics
like the mean and median price, along with the standard deviation, help in
understanding the typical price range and the variability in the market.

2. Data Cleaning and Preprocessing:

o Example: By analysing the distribution of features (e.g., the number of rooms in
houses), one can detect and handle outliers, missing values, and incorrect entries,
ensuring that the data fed into the model is of high quality.

3. Feature Engineering:
o Example: Descriptive statistics can reveal correlations between variables, such as a
strong correlation between square footage and price, guiding the creation of new
features that enhance model performance.

4. Model Selection:

o Example: If descriptive statistics show that the data is normally distributed, a model
that assumes normality (e.g., Linear Regression) might be preferred. On the other
hand, if the data is highly skewed, a model that handles skewed distributions better
(e.g., a decision tree) might be chosen.

5. Evaluating Model Assumptions:

o Example: Descriptive statistics can help check the assumptions of models, such as
linearity in Linear Regression, by examining scatter plots and correlations between
variables.

6. Communication and Reporting:

o Example: Descriptive statistics are often used to present the findings of a predictive
model to stakeholders in a clear and concise manner, providing a summary of the
data that supports the model's decisions.

Example Application: In an e-commerce setting, descriptive statistics might be used to summarize
customer data, such as the average number of purchases per month, the distribution of purchase
amounts, and the variance in customer spending. These insights can then inform the development of
a predictive model to forecast future purchasing behaviour, identify high-value customers, and tailor
marketing strategies.
Measures of Central Tendency

Definition

Measures of Central Tendency are statistical metrics that describe the center or typical value of a
dataset. These measures provide a single value that represents the most common or average
characteristic of the data, helping to summarize and understand the distribution of the data. The
three primary measures of central tendency are the mean, median, and mode.

Mean

Definition:
The mean is the arithmetic average of a set of numbers, calculated by summing all the values in the
dataset and then dividing by the number of values. The mean is often used when the data is
symmetrically distributed without significant outliers.

Formula:

Mean = (x₁ + x₂ + … + xₙ) / n

Example:
Suppose you have the test scores of five students: 85, 90, 78, 92, and 88. The mean score would be
calculated as:

Mean = (85+90+78+92+88)/5 = 433/5 = 86.6

This mean score represents the average performance of the students.

Considerations:

• The mean is sensitive to outliers.

o Example: If a sixth student scored 50, the new mean would drop to 80.5, even
though most students scored much higher.

Median

Definition:
The median is the middle value in a dataset when the values are arranged in ascending or
descending order. If the dataset has an odd number of observations, the median is the central value.
If it has an even number of observations, the median is the average of the two central values. The
median is particularly useful in skewed distributions or when outliers are present.

Example:
Consider the same test scores: 85, 90, 78, 92, and 88. First, arrange them in ascending order: 78, 85,
88, 90, 92. The median score, being the middle value, is 88.
Even Number of Observations Example:
If you add another score of 94, the scores become 78, 85, 88, 90, 92, 94. The median is calculated as:

Median = (88+90)/2 = 89

This median represents the central tendency of the dataset, unaffected by any extreme values.

Considerations:

• The median is robust against outliers.

o Example: If a score of 50 is added to the dataset, the median only shifts to 86.5
(the average of 85 and 88), while the mean drops to 80.5, so the median reflects the
central tendency better than the mean.

Mode

Definition:
The mode is the value that appears most frequently in a dataset. A dataset can have one mode
(unimodal), more than one mode (bimodal or multimodal), or no mode at all if all values are unique.
The mode is useful for categorical data or when identifying the most common value in a dataset.

Example:
In a survey of favourite colours among 10 people, the responses are: Blue, Blue, Red, Green, Green,
Blue, Yellow, Red, Blue, Green. The mode is "Blue," as it appears most frequently (four times).

Multiple Modes Example:

If one "Blue" response is changed to "Yellow", both "Blue" and "Green" appear three times,
making the dataset bimodal.

Considerations:

• The mode is most useful for categorical data.

o Example: In analysing the most common category of products sold in a store, the
mode could reveal that "Electronics" is the most sold category.
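
These definitions map directly onto Python's standard statistics module. A minimal sketch, using the test scores and colour survey from the examples above:

```python
import statistics

# Test scores from the examples above
scores = [85, 90, 78, 92, 88]

mean = statistics.mean(scores)      # (85+90+78+92+88)/5 = 86.6
median = statistics.median(scores)  # middle value of 78, 85, 88, 90, 92 -> 88

# Favourite-colour survey: the mode works on categorical data too
colours = ["Blue", "Blue", "Red", "Green", "Green",
           "Blue", "Yellow", "Red", "Blue", "Green"]
mode = statistics.mode(colours)     # "Blue" appears four times

print(mean, median, mode)           # 86.6 88 Blue
```

Note that statistics.mode returns a single mode; statistics.multimode returns all of them, which is the appropriate call for bimodal or multimodal data.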
Measures of Dispersion

Definition

Measures of Dispersion are statistical metrics that describe the spread or variability within a dataset.
They provide insights into how much the data values deviate from the central tendency (e.g., mean,
or median). Understanding the dispersion helps in assessing the reliability and consistency of the
data, as well as in identifying outliers and patterns of distribution.

Range

Definition:
The range is the simplest measure of dispersion, calculated as the difference between the maximum
and minimum values in a dataset. It provides a quick sense of the spread but does not account for
how the data is distributed between these extremes.

Formula:

Range = Maximum Value−Minimum Value

Example:
Consider the test scores: 78, 85, 88, 90, 92. The range would be:

Range = 92−78 = 14

This indicates that the test scores vary by 14 points from the lowest to the highest score.

Considerations:

• The range is sensitive to outliers.

o Example: If a score of 50 is included in the dataset, the range becomes 42, which
may overstate the overall variability.

Variance

Definition:
Variance measures the average squared deviation of each data point from the mean. It gives an idea
of how much the data points vary from the mean, with higher variance indicating greater spread.

Formula:
For a sample, the variance is calculated as:

s² = Σ(xᵢ − x̄)² / (n − 1)

where x̄ is the sample mean and n is the number of observations.

Example:
Using the test scores 78, 85, 88, 90, and 92, with a mean of 86.6, the squared deviations sum to
119.2, so the sample variance is:

s² = 119.2 / (5 − 1) = 29.8

This variance indicates the average squared deviation from the mean, reflecting the spread of the
scores.

Considerations:

• Variance is in squared units, which can make it harder to interpret compared to standard
deviation.

Standard Deviation

Definition:
The standard deviation is the square root of the variance, providing a measure of dispersion in the
same units as the data. It indicates how much the data values typically differ from the mean.

Formula:
For a sample, the standard deviation is calculated as:

s = √( Σ(xᵢ − x̄)² / (n − 1) )

Example:
Continuing with the test scores example, the standard deviation is:

s = √29.8 ≈ 5.46

This means that, on average, the test scores differ from the mean by approximately 5.46 points.

Considerations:

• Standard deviation is widely used because it is in the same units as the original data, making
it easier to interpret.

o Example: In financial data, a standard deviation of returns helps investors
understand the typical fluctuation in stock prices.

Interquartile Range (IQR)


Definition:
The Interquartile Range (IQR) measures the spread of the middle 50% of the data, calculated as the
difference between the third quartile (Q3) and the first quartile (Q1). It is a robust measure of
dispersion, particularly useful in identifying and managing outliers.

Formula:

IQR = Q3−Q1

Example:
For a dataset with quartiles Q1 = 85 and Q3 = 92, the IQR would be:

IQR = 92−85 = 7

This IQR indicates that the middle 50% of the data values are spread across 7 units.

Considerations:

• The IQR is not affected by extreme values or outliers.

o Example: In a salary dataset where most employees earn between $50,000 and
$70,000, the IQR would effectively summarize the central spread, ignoring any
extremely high or low salaries.
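
The same test scores can be used to compute all four dispersion measures with the standard statistics module. One caveat: several quartile-interpolation conventions exist, so Q1, Q3, and hence the IQR can differ slightly between tools; the sketch below uses the "inclusive" method.

```python
import statistics

scores = [78, 85, 88, 90, 92]

data_range = max(scores) - min(scores)  # 92 - 78 = 14
variance = statistics.variance(scores)  # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)      # square root of the sample variance

# Quartiles via linear interpolation over the sorted data ("inclusive" method)
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, variance, round(std_dev, 2), iqr)
```

With this convention, Q1 = 85 and Q3 = 90, giving an IQR of 5 for these five scores.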
Measures of Shape

Definition

Measures of Shape are statistical metrics that describe the distribution pattern of data in terms of its
symmetry and peakedness. These measures help in understanding the underlying characteristics of
the data distribution, particularly how it deviates from a normal distribution. The two primary
measures of shape are skewness and kurtosis.

Skewness

Definition:
Skewness measures the asymmetry of the data distribution around its mean. It indicates whether
the data points tend to lean more towards the left or the right of the mean. Skewness can be
positive, negative, or zero:

• Positive Skewness: The tail on the right side of the distribution is longer or fatter than the
left side.

• Negative Skewness: The tail on the left side of the distribution is longer or fatter than the
right side.

• Zero Skewness: The distribution is perfectly symmetrical.


Formula:
Skewness can be calculated as the third standardized moment:

Skewness = [ Σ(xᵢ − x̄)³ / n ] / s³

where x̄ is the mean and s is the standard deviation.

Example:
Consider a dataset of exam scores where most students scored between 70 and 90, but a few scored
significantly higher, around 100. This dataset would exhibit positive skewness, as the distribution has
a longer tail on the right side.

Visual Example:
A histogram showing housing prices might reveal that most houses are priced between $200,000 and
$400,000, but a few luxury homes priced over $1,000,000 create a rightward skew in the distribution.

Considerations:

• Skewness helps identify potential issues in data that may affect statistical analyses, such as
when using models that assume normality.

o Example: In regression analysis, a skewed dependent variable may violate model
assumptions, requiring transformation or a different modelling approach.

Kurtosis

Definition:
Kurtosis measures the "tailedness" or peakedness of a data distribution compared to a normal
distribution. It indicates whether the data have heavier or lighter tails than the normal distribution:

• Leptokurtic (Positive Kurtosis): The distribution has heavier tails and a sharper peak than a
normal distribution.

• Platykurtic (Negative Kurtosis): The distribution has lighter tails and a flatter peak than a
normal distribution.

• Mesokurtic (Zero Kurtosis): The distribution is like a normal distribution in terms of tail
weight and peak.
Formula:
Kurtosis (in its excess form, under which a normal distribution scores zero) can be calculated as:

Kurtosis = [ Σ(xᵢ − x̄)⁴ / n ] / s⁴ − 3

Example:
A dataset of stock market returns that shows frequent small fluctuations with occasional extreme
returns would have high kurtosis, indicating the presence of outliers or heavy tails.

Visual Example:
In a dataset of daily returns for a stock, high kurtosis might manifest as a few days with very large
positive or negative returns, creating heavy tails and a sharp peak in the distribution.

Considerations:

• High kurtosis can indicate potential risk or outliers in the data, which may require careful
treatment in predictive models.

o Example: In risk management, understanding kurtosis is crucial for assessing the
likelihood of extreme events, such as market crashes, which are not well-captured by
models assuming normality.
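
Both shape measures are ratios of central moments, which makes them easy to sketch in pure Python. This uses the population-moment form shown above; library implementations (e.g., scipy.stats.skew and scipy.stats.kurtosis) use the same moments, sometimes with a bias correction.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)**k."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    # Third standardized moment: positive -> longer right tail,
    # negative -> longer left tail, zero -> symmetric
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Fourth standardized moment minus 3, so a normal distribution scores ~0
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [70, 72, 75, 78, 80, 82, 85, 100]  # one high score drags the tail right

print(skewness(symmetric))          # 0.0 (perfectly symmetrical)
print(skewness(right_skewed) > 0)   # True
```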
Measures of Association

Definition

Measures of Association are statistical tools used to quantify the relationship between two or more
variables. These measures help in understanding whether and how strongly variables are related,
which is crucial in data analysis and predictive modelling. The two primary measures of association
are correlation and covariance.

Correlation

Definition:
Correlation measures the strength and direction of a linear relationship between two variables. The
correlation coefficient, typically denoted as r, ranges from -1 to 1:

• r = 1: Perfect positive linear relationship (as one variable increases, the other also increases).

• r = −1: Perfect negative linear relationship (as one variable increases, the other decreases).

• r = 0: No linear relationship.

Formula:
The Pearson correlation coefficient is calculated as:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Example:
Consider the relationship between hours studied and exam scores. If students who study more tend
to score higher, the correlation might be r = 0.85, indicating a strong positive relationship.

Considerations:

• Correlation does not imply causation.

o Example: A high correlation between the number of firefighters at a fire and the
amount of damage does not mean firefighters cause the damage; rather, larger fires
require more firefighters.

• Nonlinear Relationships:

o Example: If the relationship between variables is curved rather than linear, the
Pearson correlation may not capture the strength of the association accurately.

Covariance

Definition:
Covariance measures the directional relationship between two variables. Unlike correlation,
covariance does not normalize the relationship, meaning its value depends on the units of the
variables. A positive covariance indicates that as one variable increases, the other tends to increase
as well, and a negative covariance indicates that as one variable increases, the other tends to
decrease.

Formula:
Covariance between two variables X and Y is calculated as:

Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Example:
If you are analysing the relationship between the number of hours worked and income, a positive
covariance suggests that individuals who work more hours tend to have higher incomes.

Interpretation Example:
Suppose the covariance between height and weight in a sample population is 25. This positive value
indicates that taller individuals tend to weigh more, but the magnitude of 25 is dependent on the
units of measurement used for height and weight.

Considerations:

• Units:

o Example: The covariance of height (in inches) and weight (in pounds) might differ
significantly in magnitude from the covariance of height (in meters) and weight (in
kilograms), even if the relationship is the same.

• Interpretation:

o Unlike correlation, covariance does not provide a standardized measure of
relationship strength, making it harder to compare across different datasets.
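
The relationship between the two measures is easy to see in code: correlation is just covariance normalized by the two standard deviations, which removes the units and confines the result to [-1, 1]. A pure-Python sketch (the hours/scores data is an illustrative assumption):

```python
import math

def sample_covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    # Same numerator as covariance, divided by the spread of each variable
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 80, 94]  # hypothetical exam scores

print(sample_covariance(hours, scores))  # positive: more hours, higher scores
print(pearson_r(hours, scores))          # close to 1: strong positive relationship
```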
Visualization of Descriptive Statistics

Introduction

Visualizing descriptive statistics is crucial for understanding the distribution, spread, and relationships
within data. Graphical representations such as histograms, box plots, and scatterplots provide
intuitive insights that complement numerical summaries, making patterns, trends, and anomalies
easier to identify.

Histograms

Definition:
A histogram is a graphical representation of the distribution of a dataset. It consists of bars where
each bar represents the frequency (or count) of data points falling within a specific range (bin).
Histograms are used to visualize the shape, spread, and central tendency of continuous data.

Key Features:

• Bins: Intervals into which data is divided. The height of each bar shows the number of data
points within that bin.

• Skewness: The histogram shape can indicate if the data is skewed (left or right) or
symmetrical.

Example:
Consider a dataset of students' test scores out of 100. A histogram of these scores might reveal that
most students scored between 70 and 90, with fewer students scoring very low or very high.

Interpretation:
Histograms are ideal for quickly assessing the shape of the data distribution, identifying modes
(peaks), and detecting potential outliers or gaps in the data.
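
Plotting libraries draw the bars, but the computation behind a histogram is just counting data points per bin. A pure-Python sketch; the scores and the width-10 bin edges are illustrative assumptions for a 0-100 test:

```python
# Hypothetical test scores out of 100
scores = [55, 62, 68, 71, 74, 75, 78, 80, 82, 83, 85, 88, 91, 95]

counts = {}
for lo in range(50, 100, 10):
    hi = lo + 10
    # Each bin is [lo, hi); the last bin also includes 100 so no score is dropped
    in_bin = [s for s in scores if lo <= s < hi or (hi == 100 and s == 100)]
    counts[(lo, hi)] = len(in_bin)

# A crude text histogram: one '#' per score in the bin
for (lo, hi), c in counts.items():
    print(f"{lo:3d}-{hi:3d}: {'#' * c}")
```

The output shows most scores falling in the 70-90 range, the pattern described in the example above.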
Box Plots

Definition:
A box plot (or box-and-whisker plot) is a standardized way of displaying the distribution of data
based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum. It is particularly useful for identifying outliers and comparing distributions across multiple
groups.

Key Features:

• Median Line: The line inside the box represents the median (Q2).

• Interquartile Range (IQR): The box length represents the IQR, which is the range between
Q1 and Q3.

• Whiskers: Lines extending from the box to the minimum and maximum values within 1.5
times the IQR from Q1 and Q3.

• Outliers: Data points outside the whiskers are considered outliers and are often plotted as
individual points.

Example:
In comparing the test scores of two different classes, a box plot might reveal that one class has a
wider spread of scores (indicating more variability) and a higher median score.

Interpretation:
Box plots provide a clear visual summary of data distribution, highlighting the central tendency,
spread, and potential outliers. They are particularly useful for comparing distributions between
groups.
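
The five-number summary and the 1.5 × IQR whisker rule described above can be computed directly. A sketch using the "inclusive" quartile method (other conventions give slightly different quartiles, and therefore slightly different fences):

```python
import statistics

def box_plot_summary(data):
    """Five-number summary plus Tukey's 1.5 * IQR outlier fences."""
    s = sorted(data)
    q1, q2, q3 = statistics.quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are plotted individually as outliers
    outliers = [x for x in s if x < lower_fence or x > upper_fence]
    return {"min": s[0], "q1": q1, "median": q2, "q3": q3,
            "max": s[-1], "outliers": outliers}

# The test scores from earlier sections, with one unusually low score added
summary = box_plot_summary([78, 85, 88, 90, 92, 50])
print(summary)
```

Here the score of 50 falls below the lower fence and would be drawn as an individual point rather than included in the whisker.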

Scatterplots

Definition:
A scatterplot is a graphical representation of the relationship between two continuous variables.
Each point on the scatterplot represents an observation, with its position determined by the values
of the two variables being plotted.

Key Features:
• Correlation: The pattern of points can reveal the strength and direction of the relationship
(positive, negative, or none).

• Clusters: Scatterplots can identify groups or clusters within the data.

• Outliers: Points that do not fit the general pattern may be outliers.

Example:
A scatterplot showing the relationship between hours studied and exam scores might reveal a
positive correlation, where more hours studied are associated with higher scores.

Interpretation:
Scatterplots are essential for exploring potential relationships between two variables, identifying
trends, and detecting outliers. They provide a foundation for further statistical analysis, such as
correlation and regression.
