
Session: Descriptive Statistics

Definition

Descriptive Statistics refers to the methods used to summarize and organize data in a way that
provides a clear understanding of its main characteristics. These methods involve calculating
measures such as mean, median, mode, variance, standard deviation, and range, as well as creating
visualizations like histograms, box plots, and scatter plots.

Key Concepts:

• Measures of Central Tendency:

o Example: The mean (average) income of a sample population provides an idea of the
typical income level.

• Measures of Dispersion:

o Example: The standard deviation of test scores in a class shows how spread out the
scores are around the mean.

• Data Distribution:

o Example: A histogram can illustrate the distribution of ages in a survey sample,
revealing patterns such as skewness or symmetry.

• Shape of Distribution:

o Example: A skewed distribution might indicate that the data has a long tail on one
side, suggesting the presence of outliers.

Importance in Predictive Modelling

Role of Descriptive Statistics: Descriptive statistics play a crucial role in predictive modelling by
providing foundational insights into the data, which helps in making informed decisions throughout
the modelling process. These statistics allow analysts to understand the data's structure, detect
anomalies, and identify patterns that inform the choice of models and preprocessing techniques.

Key Importance:

1. Understanding Data Characteristics:

o Example: Before building a predictive model for housing prices, descriptive statistics
like the mean and median price, along with the standard deviation, help in
understanding the typical price range and the variability in the market.

2. Data Cleaning and Preprocessing:

o Example: By analysing the distribution of features (e.g., the number of rooms in
houses), one can detect and handle outliers, missing values, and incorrect entries,
ensuring that the data fed into the model is of high quality.

3. Feature Engineering:
o Example: Descriptive statistics can reveal correlations between variables, such as a
strong correlation between square footage and price, guiding the creation of new
features that enhance model performance.

4. Model Selection:

o Example: If descriptive statistics show that the data is normally distributed, a model
that assumes normality (e.g., Linear Regression) might be preferred. On the other
hand, if the data is highly skewed, a model that handles skewed distributions better
(e.g., a decision tree) might be chosen.

5. Evaluating Model Assumptions:

o Example: Descriptive statistics can help check the assumptions of models, such as
linearity in Linear Regression, by examining scatter plots and correlations between
variables.

6. Communication and Reporting:

o Example: Descriptive statistics are often used to present the findings of a predictive
model to stakeholders in a clear and concise manner, providing a summary of the
data that supports the model's decisions.

Example Application: In an e-commerce setting, descriptive statistics might be used to summarize
customer data, such as the average number of purchases per month, the distribution of purchase
amounts, and the variance in customer spending. These insights can then inform the development of
a predictive model to forecast future purchasing behaviour, identify high-value customers, and tailor
marketing strategies.
Measures of Central Tendency

Definition

Measures of Central Tendency are statistical metrics that describe the center or typical value of a
dataset. These measures provide a single value that represents the most common or average
characteristic of the data, helping to summarize and understand the distribution of the data. The
three primary measures of central tendency are the mean, median, and mode.

Mean

Definition:
The mean is the arithmetic average of a set of numbers, calculated by summing all the values in the
dataset and then dividing by the number of values. The mean is often used when the data is
symmetrically distributed without significant outliers.

Formula:

Mean = (x₁ + x₂ + … + xₙ) / n

Example:
Suppose you have the test scores of five students: 85, 90, 78, 92, and 88. The mean score would be
calculated as:

Mean = (85+90+78+92+88)/5 = 433/5 = 86.6

This mean score represents the average performance of the students.

Considerations:

• The mean is sensitive to outliers.

o Example: If a sixth student scored 50, the new mean would drop to 80.5, even
though most students scored much higher.

Median

Definition:
The median is the middle value in a dataset when the values are arranged in ascending or
descending order. If the dataset has an odd number of observations, the median is the central value.
If it has an even number of observations, the median is the average of the two central values. The
median is particularly useful in skewed distributions or when outliers are present.

Example:
Consider the same test scores: 85, 90, 78, 92, and 88. First, arrange them in ascending order: 78, 85,
88, 90, 92. The median score, being the middle value, is 88.
Even Number of Observations Example:
If you add another score of 94, the scores become 78, 85, 88, 90, 92, 94. The median is calculated as:

Median = (88+90)/2 = 89

This median represents the central tendency of the dataset, unaffected by any extreme values.

Considerations:

• The median is robust against outliers.

o Example: If a score of 50 is added to the dataset, the median only shifts to 86.5
(the average of 85 and 88), while the mean drops to 80.5, so the median reflects the
central tendency better than the mean.

Mode

Definition:
The mode is the value that appears most frequently in a dataset. A dataset can have one mode
(unimodal), more than one mode (bimodal or multimodal), or no mode at all if all values are unique.
The mode is useful for categorical data or when identifying the most common value in a dataset.

Example:
In a survey of favourite colours among 10 people, the responses are: Blue, Blue, Red, Green, Green,
Blue, Yellow, Red, Blue, Green. The mode is "Blue," as it appears most frequently (four times).

Multiple Modes Example:

If one "Blue" response is changed to "Yellow", both "Blue" and "Green" appear three times,
making the dataset bimodal.

Considerations:

• The mode is most useful for categorical data.

o Example: In analysing the most common category of products sold in a store, the
mode could reveal that "Electronics" is the most sold category.
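
These definitions map directly onto Python's standard statistics module. A minimal sketch, using the test scores and colour survey from the examples above:

```python
import statistics

# Test scores from the examples above
scores = [85, 90, 78, 92, 88]

mean = statistics.mean(scores)      # (85+90+78+92+88)/5 = 86.6
median = statistics.median(scores)  # middle value of 78, 85, 88, 90, 92 -> 88

# Favourite-colour survey: the mode works on categorical data too
colours = ["Blue", "Blue", "Red", "Green", "Green",
           "Blue", "Yellow", "Red", "Blue", "Green"]
mode = statistics.mode(colours)     # "Blue" appears four times

print(mean, median, mode)           # 86.6 88 Blue
```

Note that statistics.mode returns a single mode; statistics.multimode returns all of them, which is the appropriate call for bimodal or multimodal data.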
Measures of Dispersion

Definition

Measures of Dispersion are statistical metrics that describe the spread or variability within a dataset.
They provide insights into how much the data values deviate from the central tendency (e.g., mean,
or median). Understanding the dispersion helps in assessing the reliability and consistency of the
data, as well as in identifying outliers and patterns of distribution.

Range

Definition:
The range is the simplest measure of dispersion, calculated as the difference between the maximum
and minimum values in a dataset. It provides a quick sense of the spread but does not account for
how the data is distributed between these extremes.

Formula:

Range = Maximum Value−Minimum Value

Example:
Consider the test scores: 78, 85, 88, 90, 92. The range would be:

Range = 92−78 = 14

This indicates that the test scores vary by 14 points from the lowest to the highest score.

Considerations:

• The range is sensitive to outliers.

o Example: If a score of 50 is included in the dataset, the range becomes 42, which
may overstate the overall variability.

Variance

Definition:
Variance measures the average squared deviation of each data point from the mean. It gives an idea
of how much the data points vary from the mean, with higher variance indicating greater spread.

Formula:
For a sample, the variance is calculated as:

s² = Σ(xᵢ − x̄)² / (n − 1)

where x̄ is the sample mean and n is the number of observations.

Example:
Using the test scores 78, 85, 88, 90, and 92, with a mean of 86.6, the squared deviations sum to
119.2, so the sample variance is:

s² = 119.2 / (5 − 1) = 29.8

This variance indicates the average squared deviation from the mean, reflecting the spread of the
scores.

Considerations:

• Variance is in squared units, which can make it harder to interpret compared to standard
deviation.

Standard Deviation

Definition:
The standard deviation is the square root of the variance, providing a measure of dispersion in the
same units as the data. It indicates how much the data values typically differ from the mean.

Formula:
For a sample, the standard deviation is calculated as:

s = √( Σ(xᵢ − x̄)² / (n − 1) )

Example:
Continuing with the test scores example, the standard deviation is:

s = √29.8 ≈ 5.46

This means that, on average, the test scores differ from the mean by approximately 5.46 points.

Considerations:

• Standard deviation is widely used because it is in the same units as the original data, making
it easier to interpret.

o Example: In financial data, a standard deviation of returns helps investors
understand the typical fluctuation in stock prices.

Interquartile Range (IQR)


Definition:
The Interquartile Range (IQR) measures the spread of the middle 50% of the data, calculated as the
difference between the third quartile (Q3) and the first quartile (Q1). It is a robust measure of
dispersion, particularly useful in identifying and managing outliers.

Formula:

IQR = Q3−Q1

Example:
For a dataset with quartiles Q1 = 85 and Q3 = 92, the IQR would be:

IQR = 92−85 = 7

This IQR indicates that the middle 50% of the data values are spread across 7 units.

Considerations:

• The IQR is not affected by extreme values or outliers.

o Example: In a salary dataset where most employees earn between $50,000 and
$70,000, the IQR would effectively summarize the central spread, ignoring any
extremely high or low salaries.
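
The same test scores can be used to compute all four dispersion measures with the standard statistics module. One caveat: several quartile-interpolation conventions exist, so Q1, Q3, and hence the IQR can differ slightly between tools; the sketch below uses the "inclusive" method.

```python
import statistics

scores = [78, 85, 88, 90, 92]

data_range = max(scores) - min(scores)  # 92 - 78 = 14
variance = statistics.variance(scores)  # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)      # square root of the sample variance

# Quartiles via linear interpolation over the sorted data ("inclusive" method)
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, variance, round(std_dev, 2), iqr)
```

With this convention, Q1 = 85 and Q3 = 90, giving an IQR of 5 for these five scores.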
Measures of Shape

Definition

Measures of Shape are statistical metrics that describe the distribution pattern of data in terms of its
symmetry and peakedness. These measures help in understanding the underlying characteristics of
the data distribution, particularly how it deviates from a normal distribution. The two primary
measures of shape are skewness and kurtosis.

Skewness

Definition:
Skewness measures the asymmetry of the data distribution around its mean. It indicates whether
the data points tend to lean more towards the left or the right of the mean. Skewness can be
positive, negative, or zero:

• Positive Skewness: The tail on the right side of the distribution is longer or fatter than the
left side.

• Negative Skewness: The tail on the left side of the distribution is longer or fatter than the
right side.

• Zero Skewness: The distribution is perfectly symmetrical.


Formula:
Skewness can be calculated as the third standardized moment:

Skewness = [ Σ(xᵢ − x̄)³ / n ] / s³

where x̄ is the mean and s is the standard deviation.

Example:
Consider a dataset of exam scores where most students scored between 70 and 90, but a few scored
significantly higher, around 100. This dataset would exhibit positive skewness, as the distribution has
a longer tail on the right side.

Visual Example:
A histogram showing housing prices might reveal that most houses are priced between $200,000 and
$400,000, but a few luxury homes priced over $1,000,000 create a rightward skew in the distribution.

Considerations:

• Skewness helps identify potential issues in data that may affect statistical analyses, such as
when using models that assume normality.

o Example: In regression analysis, a skewed dependent variable may violate model
assumptions, requiring transformation or a different modelling approach.

Kurtosis

Definition:
Kurtosis measures the "tailedness" or peakedness of a data distribution compared to a normal
distribution. It indicates whether the data have heavier or lighter tails than the normal distribution:

• Leptokurtic (Positive Kurtosis): The distribution has heavier tails and a sharper peak than a
normal distribution.

• Platykurtic (Negative Kurtosis): The distribution has lighter tails and a flatter peak than a
normal distribution.

• Mesokurtic (Zero Kurtosis): The distribution is like a normal distribution in terms of tail
weight and peak.
Formula:
Kurtosis (in its excess form, under which a normal distribution scores zero) can be calculated as:

Kurtosis = [ Σ(xᵢ − x̄)⁴ / n ] / s⁴ − 3

Example:
A dataset of stock market returns that shows frequent small fluctuations with occasional extreme
returns would have high kurtosis, indicating the presence of outliers or heavy tails.

Visual Example:
In a dataset of daily returns for a stock, high kurtosis might manifest as a few days with very large
positive or negative returns, creating heavy tails and a sharp peak in the distribution.

Considerations:

• High kurtosis can indicate potential risk or outliers in the data, which may require careful
treatment in predictive models.

o Example: In risk management, understanding kurtosis is crucial for assessing the
likelihood of extreme events, such as market crashes, which are not well-captured by
models assuming normality.
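
Both shape measures are ratios of central moments, which makes them easy to sketch in pure Python. This uses the population-moment form shown above; library implementations (e.g., scipy.stats.skew and scipy.stats.kurtosis) use the same moments, sometimes with a bias correction.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)**k."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    # Third standardized moment: positive -> longer right tail,
    # negative -> longer left tail, zero -> symmetric
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Fourth standardized moment minus 3, so a normal distribution scores ~0
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [70, 72, 75, 78, 80, 82, 85, 100]  # one high score drags the tail right

print(skewness(symmetric))          # 0.0 (perfectly symmetrical)
print(skewness(right_skewed) > 0)   # True
```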
Measures of Association

Definition

Measures of Association are statistical tools used to quantify the relationship between two or more
variables. These measures help in understanding whether and how strongly variables are related,
which is crucial in data analysis and predictive modelling. The two primary measures of association
are correlation and covariance.

Correlation

Definition:
Correlation measures the strength and direction of a linear relationship between two variables. The
correlation coefficient, typically denoted as r, ranges from -1 to 1:

• r = 1: Perfect positive linear relationship (as one variable increases, the other also increases).

• r = −1: Perfect negative linear relationship (as one variable increases, the other decreases).

• r = 0: No linear relationship.

Formula:
The Pearson correlation coefficient is calculated as:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Example:
Consider the relationship between hours studied and exam scores. If students who study more tend
to score higher, the correlation might be r = 0.85, indicating a strong positive relationship.

Considerations:

• Correlation does not imply causation.

o Example: A high correlation between the number of firefighters at a fire and the
amount of damage does not mean firefighters cause the damage; rather, larger fires
require more firefighters.

• Nonlinear Relationships:

o Example: If the relationship between variables is curved rather than linear, the
Pearson correlation may not capture the strength of the association accurately.

Covariance

Definition:
Covariance measures the directional relationship between two variables. Unlike correlation,
covariance does not normalize the relationship, meaning its value depends on the units of the
variables. A positive covariance indicates that as one variable increases, the other tends to increase
as well, and a negative covariance indicates that as one variable increases, the other tends to
decrease.

Formula:
Covariance between two variables X and Y is calculated as:

Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Example:
If you are analysing the relationship between the number of hours worked and income, a positive
covariance suggests that individuals who work more hours tend to have higher incomes.

Interpretation Example:
Suppose the covariance between height and weight in a sample population is 25. This positive value
indicates that taller individuals tend to weigh more, but the magnitude of 25 is dependent on the
units of measurement used for height and weight.

Considerations:

• Units:

o Example: The covariance of height (in inches) and weight (in pounds) might differ
significantly in magnitude from the covariance of height (in meters) and weight (in
kilograms), even if the relationship is the same.

• Interpretation:

o Unlike correlation, covariance does not provide a standardized measure of
relationship strength, making it harder to compare across different datasets.
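
The relationship between the two measures is easy to see in code: correlation is just covariance normalized by the two standard deviations, which removes the units and confines the result to [-1, 1]. A pure-Python sketch (the hours/scores data is an illustrative assumption):

```python
import math

def sample_covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    # Same numerator as covariance, divided by the spread of each variable
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 80, 94]  # hypothetical exam scores

print(sample_covariance(hours, scores))  # positive: more hours, higher scores
print(pearson_r(hours, scores))          # close to 1: strong positive relationship
```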
Visualization of Descriptive Statistics

Introduction

Visualizing descriptive statistics is crucial for understanding the distribution, spread, and relationships
within data. Graphical representations such as histograms, box plots, and scatterplots provide
intuitive insights that complement numerical summaries, making patterns, trends, and anomalies
easier to identify.

Histograms

Definition:
A histogram is a graphical representation of the distribution of a dataset. It consists of bars where
each bar represents the frequency (or count) of data points falling within a specific range (bin).
Histograms are used to visualize the shape, spread, and central tendency of continuous data.

Key Features:

• Bins: Intervals into which data is divided. The height of each bar shows the number of data
points within that bin.

• Skewness: The histogram shape can indicate if the data is skewed (left or right) or
symmetrical.

Example:
Consider a dataset of students' test scores out of 100. A histogram of these scores might reveal that
most students scored between 70 and 90, with fewer students scoring very low or very high.

Interpretation:
Histograms are ideal for quickly assessing the shape of the data distribution, identifying modes
(peaks), and detecting potential outliers or gaps in the data.
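
Plotting libraries draw the bars, but the computation behind a histogram is just counting data points per bin. A pure-Python sketch; the scores and the width-10 bin edges are illustrative assumptions for a 0-100 test:

```python
# Hypothetical test scores out of 100
scores = [55, 62, 68, 71, 74, 75, 78, 80, 82, 83, 85, 88, 91, 95]

counts = {}
for lo in range(50, 100, 10):
    hi = lo + 10
    # Each bin is [lo, hi); the last bin also includes 100 so no score is dropped
    in_bin = [s for s in scores if lo <= s < hi or (hi == 100 and s == 100)]
    counts[(lo, hi)] = len(in_bin)

# A crude text histogram: one '#' per score in the bin
for (lo, hi), c in counts.items():
    print(f"{lo:3d}-{hi:3d}: {'#' * c}")
```

The output shows most scores falling in the 70-90 range, the pattern described in the example above.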
Box Plots

Definition:
A box plot (or box-and-whisker plot) is a standardized way of displaying the distribution of data
based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum. It is particularly useful for identifying outliers and comparing distributions across multiple
groups.

Key Features:

• Median Line: The line inside the box represents the median (Q2).

• Interquartile Range (IQR): The box length represents the IQR, which is the range between
Q1 and Q3.

• Whiskers: Lines extending from the box to the minimum and maximum values within 1.5
times the IQR from Q1 and Q3.

• Outliers: Data points outside the whiskers are considered outliers and are often plotted as
individual points.

Example:
In comparing the test scores of two different classes, a box plot might reveal that one class has a
wider spread of scores (indicating more variability) and a higher median score.

Interpretation:
Box plots provide a clear visual summary of data distribution, highlighting the central tendency,
spread, and potential outliers. They are particularly useful for comparing distributions between
groups.
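
The five-number summary and the 1.5 × IQR whisker rule described above can be computed directly. A sketch using the "inclusive" quartile method (other conventions give slightly different quartiles, and therefore slightly different fences):

```python
import statistics

def box_plot_summary(data):
    """Five-number summary plus Tukey's 1.5 * IQR outlier fences."""
    s = sorted(data)
    q1, q2, q3 = statistics.quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are plotted individually as outliers
    outliers = [x for x in s if x < lower_fence or x > upper_fence]
    return {"min": s[0], "q1": q1, "median": q2, "q3": q3,
            "max": s[-1], "outliers": outliers}

# The test scores from earlier sections, with one unusually low score added
summary = box_plot_summary([78, 85, 88, 90, 92, 50])
print(summary)
```

Here the score of 50 falls below the lower fence and would be drawn as an individual point rather than included in the whisker.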

Scatterplots

Definition:
A scatterplot is a graphical representation of the relationship between two continuous variables.
Each point on the scatterplot represents an observation, with its position determined by the values
of the two variables being plotted.

Key Features:
• Correlation: The pattern of points can reveal the strength and direction of the relationship
(positive, negative, or none).

• Clusters: Scatterplots can identify groups or clusters within the data.

• Outliers: Points that do not fit the general pattern may be outliers.

Example:
A scatterplot showing the relationship between hours studied and exam scores might reveal a
positive correlation, where more hours studied are associated with higher scores.

Interpretation:
Scatterplots are essential for exploring potential relationships between two variables, identifying
trends, and detecting outliers. They provide a foundation for further statistical analysis, such as
correlation and regression.
