
Statistics 091147

The document provides an overview of statistics, including its definition, types, and the importance of data collection and analysis for decision-making. It covers various sampling techniques, descriptive and inferential statistics, measures of central tendency and dispersion, and methods for calculating percentiles and detecting outliers. Additionally, it discusses the Pearson and Spearman correlation coefficients, Z-scores, and their significance in understanding relationships between variables.

Uploaded by

ARSH SINHA

Statistics

Outline
• Statistics
• Types of statistics
• Population and sample
• Types of sampling
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Convenience sampling
Statistics

• Statistics is the science of collecting, organizing, and analyzing data.

• Data: Facts or pieces of information


• Example:
• Height of students in a class
• Gender of a person visiting a doctor
Why statistics?
• Decision makers use statistics to:
• Present and describe business data and information
properly
• Draw conclusions about large groups of individuals or
items, using information collected from subsets of
individuals or items
• Make reliable forecasts about business activity
• Improve business processes
Sources of Data
• Primary sources: The data collector is the one using the data for
analysis
• Data from a political survey
• Data collected from experiment
• Observed data

• Secondary sources: The person performing data analysis is not the
data collector
• Analyzing census data
• Examining data from print journals or data published on the internet
Types of Variable
• A variable is a characteristic of an item or individual.
• E.g., height of students
Types of Statistics
• Statistics: The branch of mathematics that transforms
data into useful information for decision makers.

• Descriptive statistics: Collecting, summarizing, and
describing data
• Inferential Statistics: Drawing conclusions and/or making
decisions concerning a population based only on sample
data
Inferential statistics
• Estimation
• E.g., Estimate the population mean weight using the sample mean weight

• Hypothesis testing
• E.g., Test the claim that the population mean weight is 120 pounds

• Note: Inferential statistics draws conclusions about a large group of
individuals based on a subset of that group
Sampling
• Sampling is the process of selecting a subset of individuals, items, or
observations from a larger group or population in order to gather
information or draw conclusions about the entire population.

• Sampling allows researchers to obtain insights from a smaller,
manageable subset of the population, while still aiming to represent
the characteristics and variability present in the larger population.
Benefits of sampling
• Less time
• Less expensive
• Practicality
• Accuracy and Precision
Population vs. sample

                     Population                       Sample
Definition           Complete enumeration of items    Part of the population
                     is considered                    chosen for study
Size                 Population size (N)              Sample size (n)
Mean                 Population mean (μ)              Sample mean (x̄)
Standard deviation   Population standard              Sample standard
                     deviation (σ)                    deviation (s)
Sampling techniques
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster Sampling
• Multistage Sampling
• Stratified Cluster Sampling
• Judgmental (Purposive) Sampling
• Snowball Sampling
Sampling techniques
• Simple random sampling: Each individual in the population has an equal chance of
being selected for the sample.
• Stratified sampling: The population is divided into distinct subgroups or strata based
on certain characteristics. A random sample is then taken from each stratum in
proportion to its size.
• Systematic sampling: A starting point is randomly selected, and then every nth
individual is selected from the population.
• Cluster sampling: The population is divided into clusters or groups, often based on
geographical or organizational divisions. A random sample of clusters is selected, and
then all individuals within the selected clusters are included in the sample.
• Multistage sampling: A combination of several sampling techniques applied in
successive stages.
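The definitions above can be sketched in a few lines of Python. This is only an illustration using the standard `random` module; the function names, the `fraction` parameter, and the step size `k` are assumptions made for the example and do not come from the slides.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Every individual has the same chance of being chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, k, seed=None):
    """Pick a random starting point, then take every k-th individual."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(population, stratum_of, fraction, seed=None):
    """Sample the same fraction from each stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep everyone inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


# Hypothetical population of 100 numbered individuals.
population = list(range(100))
srs = simple_random_sample(population, 10, seed=1)
systematic = systematic_sample(population, 10, seed=1)
stratified = stratified_sample(population, lambda x: x % 2, 0.2, seed=1)
clustered = cluster_sample({"a": [1, 2], "b": [3], "c": [4, 5, 6]}, 2, seed=1)
```

Here the strata are simply odd and even numbers; in practice a stratum would be a characteristic such as gender or age group.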
• Judgmental (Purposive) Sampling: In this method, the
researcher uses their judgment to select individuals
who they believe are most relevant or representative
of the population.
• Snowball Sampling: Commonly used in social sciences
and when studying hard-to-reach populations, this
method starts with one or a few participants who are
then asked to refer other potential participants.
Descriptive statistics
• Descriptive statistics are the tabular, graphical, and numerical
methods used to summarize data.
• Measure of central tendency
• Measure of dispersion
• Correlation
• Covariance
• Histogram
• Distribution
• Gaussian distribution
• Binomial distribution
• Log normal distribution
• Power law distribution
• Standard normal distribution
Central tendency
• It provides a way to summarize or describe the typical or
central value around which data points tend to cluster.
• Central tendency measures are used to understand the
general location of the data and to make comparisons
between different sets of data.
• There are three main measures of central tendency:
• Mean
• Median
• Mode
Example
1, 1, 2, 2, 3, 3, 4, 5, 5, 6

Mean = 32 / 10 = 3.2
Median = (3 + 3) / 2 = 3
Mode = 1, 2, 3 and 5 (each occurs twice)
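The three measures for the example data can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` requires Python 3.8 or later and is used because the data set has more than one mode.

```python
import statistics

data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]

mean = statistics.mean(data)         # sum of values / number of values
median = statistics.median(data)     # average of the two middle values
modes = statistics.multimode(data)   # every value with the highest count

print(mean, median, modes)  # 3.2 3.0 [1, 2, 3, 5]
```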
Central tendency
Mean: It is influenced by extreme values (outliers).
Median: It is not affected by extreme values (outliers) and is a
useful measure for skewed distributions.
Mode: The mode is particularly useful for categorical or discrete
data.

Note: The choice of which measure to use depends on the
characteristics of the data and the specific insights you want to
gain from it.
Measure of dispersion
A measure of dispersion, also known as a measure of
variability or spread, is a statistical metric that quantifies the
extent to which individual data points in a dataset vary or
deviate from the central tendency.

It provides information about how much the data points are
spread out from the average or central value.
Measure of dispersion
Range: The range is the simplest measure of dispersion and is
calculated by subtracting the minimum value from the
maximum value in a dataset. It is highly affected by outliers.
Variance: Variance measures the average squared difference
between each data point and the mean. It gives a
comprehensive understanding of the overall variability in the
data.
Measure of dispersion
Standard Deviation: The standard deviation is the square root
of the variance.
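Written as formulas, the two definitions look as follows. The notation is an assumption for this sketch: N and μ are the population size and mean, n and x̄ are the sample size and mean. The sample version divides by n − 1 rather than n, which is the usual convention when the variance is estimated from a sample.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```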
Measure of dispersion
• Mean Absolute Deviation (MAD): MAD is the average absolute
difference between each data point and the mean.
• Coefficient of Variation (CV): The CV is the ratio of the standard
deviation to the mean, expressed as a percentage. It's used to
compare the variability of different datasets with varying means.
• Interquartile Range (IQR): The IQR is the range between the first
quartile (25th percentile) and the third quartile (75th percentile) of
the data. It measures the spread of the middle 50% of the data and is
less sensitive to outliers.
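The measures of dispersion described above can be computed as in the sketch below. It reuses the ten values from the central tendency example and treats them as a whole population, so the variance divides by N. The quartiles are taken as the medians of the lower and upper halves, which is one of several conventions; the helper name `median_of_halves_iqr` is made up for this example.

```python
import statistics

data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]


def median_of_halves_iqr(values):
    """IQR with Q1 and Q3 taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    # When n is odd, the middle value is left out of both halves.
    q3 = statistics.median(ordered[half + n % 2:])
    return q3 - q1


mean = statistics.mean(data)
value_range = max(data) - min(data)              # maximum - minimum
variance = statistics.pvariance(data)            # average squared deviation
std_dev = statistics.pstdev(data)                # square root of the variance
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
cv_percent = std_dev / mean * 100                # coefficient of variation in %
iqr = median_of_halves_iqr(data)

print(value_range, variance, round(std_dev, 3), round(mad, 2), iqr)
```

For sample data, `statistics.variance` and `statistics.stdev` divide by n − 1 instead.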
How to calculate percentile
For example: Imagine you have the marks of 20 students.
Now, try to calculate the 90th percentile
How to calculate percentile
Step 1: Arrange the scores in ascending order.
How to calculate percentile
Step 2: Plug the values into the formula to find n.

P90 = 94 means that 90% of students scored less than 94 and
10% of students scored more than 94.
How to calculate percentile
Suppose you want to find the percentile rank of a mark of 78 in
the data set.

Step 1: Sort the marks in ascending order.
How to calculate percentile

P = 60 means that a mark of 78 sits at the 60th percentile of the data set.
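Both directions of the calculation can be sketched in Python. The two formulas used here are common textbook conventions and are assumptions for this example: the position of the p-th percentile is taken as p/100 × (n + 1) in the ordered list, and the percentile rank of a value is the share of values below it. The marks are hypothetical and are not the 20 marks used in the example above.

```python
def percentile_value(values, p):
    """Value at the p-th percentile, using position = p/100 * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = p / 100 * (n + 1)   # 1-based position in the ordered list
    lower = int(position)
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    # Interpolate between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


# Hypothetical marks of 10 students.
marks = [35, 42, 50, 55, 61, 67, 72, 78, 84, 90]

median_mark = percentile_value(marks, 50)   # halfway between 61 and 67
rank_of_78 = percentile_rank(marks, 78)     # 7 of the 10 marks are below 78
```

Other conventions (for example counting values less than or equal to x) give slightly different results on small data sets.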


Five number summary
• The five number summary consists of the
minimum, lower quartile, median, upper quartile
and the maximum.

• The minimum is the smallest number, the
maximum is the largest, the median is in the
middle, Q1 is the median of the first half of the data
and Q3 is the median of the second half.
Five number summary
• For the set of data: 1, 3, 5, 6, 6, 7, 9 the five number
summary is:
Five number summary
• The median (Q2) indicates the central value of the data set
• The range = maximum – minimum indicates the spread of the whole data
set
• The interquartile range = Q3 – Q1 indicates the spread of the middle 50%
of the data set

Note: The range describes the spread of the whole set of data, whilst the
interquartile range describes the spread of the middle 50% of the data.
The range is greatly affected by outliers (extreme results in the data), whereas the
interquartile range is not.
Boxplot
• The minimum is found at the position of the first line at 5
• The maximum is found at the position of the last line at 25
• The lower quartile (Q1) is found at the position of the start of the
box at 10
• The upper quartile (Q3) is found at the position of the end of the
box at 20
• The median (Q2) is found at the position of the line inside the box
at 18
Summary
• Step 1. Put the numbers in order from smallest to largest
• Step 2. The minimum is the smallest number in the list
• Step 3. The maximum is the largest number in the list
• Step 4. The median is found in the middle of the list
• Step 5. The lower quartile is the median of the first half of the data.
• Step 6. The upper quartile is the median of the second half of the
data
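The six steps can be sketched as a short Python function, applied to the data set 1, 3, 5, 6, 6, 7, 9 used earlier. One assumption is made: when the number of values is odd, the median itself is left out of both halves. Some textbooks and libraries include it, which can give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum of a list of numbers."""
    ordered = sorted(values)                 # Step 1: order the numbers
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + n % 2:]      # skip the middle value if n is odd
    return {
        "min": ordered[0],                          # Step 2
        "q1": statistics.median(lower_half),        # Step 5
        "median": statistics.median(ordered),       # Step 4
        "q3": statistics.median(upper_half),        # Step 6
        "max": ordered[-1],                         # Step 3
    }


summary = five_number_summary([1, 3, 5, 6, 6, 7, 9])
print(summary)  # {'min': 1, 'q1': 3, 'median': 6, 'q3': 7, 'max': 9}
```

From this summary the range is 9 - 1 = 8 and the interquartile range is 7 - 3 = 4.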
Outlier detection
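A widely used rule for outlier detection, assumed here as an illustration, builds on the interquartile range: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as an outlier. The sketch below uses the median-of-halves quartiles from the five number summary; the data set and the function name `iqr_outliers` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the fences and the values lying outside them."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half + n % 2:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


# Hypothetical data: the value 30 lies far from the rest.
lower_fence, upper_fence, outliers = iqr_outliers([1, 3, 5, 6, 6, 7, 9, 30])
print(lower_fence, upper_fence, outliers)  # -2.0 14.0 [30]
```

The multiplier 1.5 is a convention, not a law; a larger value such as 3 flags only the most extreme points.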
Covariance
• Covariance is a statistical concept that measures the degree to which
two random variables change together.
• It's often used to understand the direction of the linear relationship
between two variables.

• Positive covariance: the variables move in the same direction
• Negative covariance: the variables move in opposite directions
• Zero covariance: the covariance is close to zero, so there is no linear
relationship. A covariance of zero doesn't necessarily mean there is no
relationship at all, as non-linear relationships might still exist.
Covariance

Age (Year)    Weight (Kg)
20            75
18            63
15            45
14            41
25            78
Covariance

Note: Covariance has no restricted range. Its value can be any real number
and depends on the units of the two variables.
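For the age and weight table, the covariance can be worked out as in the sketch below. It assumes the five rows are a sample, so the sum of products is divided by n - 1; dividing by n instead gives the population covariance. The function name and the `sample` flag are made up for this example.

```python
def covariance(xs, ys, sample=True):
    """Average product of the deviations of x and y from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


age = [20, 18, 15, 14, 25]
weight = [75, 63, 45, 41, 78]

sample_cov = covariance(age, weight)                    # about 69.05
population_cov = covariance(age, weight, sample=False)  # about 55.24
```

The result is positive, so age and weight move in the same direction in this table. Its size (in year × kg) has no natural scale, which is why a standardized measure is often preferred.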


Pearson Correlation Coefficient
• The Pearson correlation coefficient is a statistical measure that
quantifies the strength and direction of a linear relationship between
two continuous variables.

• It's a standardized version of the covariance that accounts for the
scales of the variables.

• The Pearson correlation coefficient ranges from -1 to 1.

• It is often denoted by r
Pearson Correlation Coefficient
• Positive Correlation (r>0): A positive value of r indicates a positive
linear relationship between the variables. As one variable increases,
the other tends to increase as well. The closer r is to 1, the stronger
the positive correlation.
• Negative Correlation (r<0): A negative value of r indicates a negative
linear relationship between the variables. As one variable increases,
the other tends to decrease. The closer r is to -1, the stronger the
negative correlation.
• No Correlation (r≈0): A correlation coefficient close to 0 suggests little
to no linear relationship between the variables.
Pearson Correlation Coefficient

Note: The Pearson coefficient cannot capture non-linear relationships.
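Because r is the covariance divided by the product of the two standard deviations, it can be computed directly from the deviations. The sketch below applies this to the age and weight table from the covariance section; the function name `pearson_r` is an assumption for the example.

```python
import math


def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    # The 1/(n - 1) factors of covariance and variances cancel out.
    return sum_xy / math.sqrt(sum_xx * sum_yy)


age = [20, 18, 15, 14, 25]
weight = [75, 63, 45, 41, 78]

r = pearson_r(age, weight)
print(round(r, 2))  # 0.93
```

A value of about 0.93 indicates a strong positive linear relationship in this small table.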


Spearman's rank correlation coefficient
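Spearman's coefficient applies the correlation idea to the ranks of the values instead of the values themselves, so it measures how consistently one variable increases with the other, whether or not the relationship is linear. A minimal sketch follows, using the standard formula for data without tied values, 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of each observation. The helper names are made up, and the age and weight table from the covariance section is reused.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rank correlation for data without ties."""
    n = len(xs)
    rank_x = ranks(xs)
    rank_y = ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


age = [20, 18, 15, 14, 25]
weight = [75, 63, 45, 41, 78]

rho = spearman_rho(age, weight)
print(rho)  # 1.0
```

The ranks of age and weight are identical in this table, so rho is exactly 1 even though the Pearson coefficient for the same data is below 1.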
Z-score
• Z-score is a statistical measurement that describes a value's
relationship to the mean of a group of values.
• Z-score is measured in terms of standard deviations from the mean.
• If a Z-score is 0, it indicates that the data point's score is identical
to the mean score.
Z-Scores vs. Standard Deviation
• In most large data sets (assuming a normal distribution of data), about 99.7% of
values lie within 3 standard deviations of the mean, about 95% within 2
standard deviations, and about 68% within 1 standard deviation.

• Standard deviation indicates the amount of variability (or dispersion)
within a given data set.

• A distribution curve has a negative and a positive side, so z-scores can be
negative or positive. The standard deviation itself is never negative.

• A negative z-score means the value is to the left of the mean, and a positive
z-score indicates it is to the right.
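A z-score is computed as (value - mean) / standard deviation. The sketch below uses a small hypothetical data set chosen so that the mean is 5 and the population standard deviation is 2; the function name `z_score` is an assumption for the example.

```python
import statistics


def z_score(x, values):
    """Number of standard deviations that x lies above or below the mean."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)   # population standard deviation
    return (x - mean) / std_dev


# Hypothetical data with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, data))   # 2.0  -> two standard deviations right of the mean
print(z_score(5, data))   # 0.0  -> identical to the mean
print(z_score(1, data))   # -2.0 -> two standard deviations left of the mean
```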
