Statistics 091147

The document provides an overview of statistics, including its definition, types, and the importance of data collection and analysis for decision-making. It covers various sampling techniques, descriptive and inferential statistics, measures of central tendency and dispersion, and methods for calculating percentiles and detecting outliers. Additionally, it discusses the Pearson and Spearman correlation coefficients, Z-scores, and their significance in understanding relationships between variables.

Uploaded by

ARSH SINHA


Statistics

Outline
• Statistics
• Types of statistics
• Population and sample
• Types of sampling
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Convenience sampling
Statistics
• Statistics is the science of collecting, organizing, and analyzing data.
• Data: Facts or pieces of information
• Example:
• Height of students in a class
• Gender of a person visiting a doctor
Why statistics?
• Decision makers use statistics to:
• Present and describe business data and information properly
• Draw conclusions about large groups of individuals or items, using information collected from subsets of individuals or items
• Make reliable forecasts about a business activity
• Improve business processes
Sources of Data
• Primary sources: The data collector is the one using the data for analysis
• Data from a political survey
• Data collected from an experiment
• Observed data
• Secondary sources: The person performing the data analysis is not the data collector
• Analyzing census data
• Examining data from print journals or data published on the internet
Types of Variable
• A variable is a characteristic of an item or individual.
• E.g., height of students
Types of Statistics
• Statistics: The branch of mathematics that transforms data into useful information for decision makers.
• Descriptive statistics: Collecting, summarizing, and describing data
• Inferential statistics: Drawing conclusions and/or making decisions concerning a population based only on sample data
Inferential statistics
• Estimation
• E.g., Estimate the population mean weight using the sample mean weight
• Hypothesis testing
• E.g., Test the claim that the population mean weight is 120 pounds
• Note: Both draw conclusions about a large group of individuals based on a subset of that group
Sampling
• Sampling is the process of selecting a subset of individuals, items, or
observations from a larger group or population in order to gather
information or draw conclusions about the entire population.

• Sampling allows researchers to obtain insights from a smaller, manageable subset of the population, while still aiming to represent the characteristics and variability present in the larger population.
Benefits of sampling
• Less time
• Less expensive
• Practicality
• Accuracy and Precision
Population vs. sample

             Population                            Sample
Definition   Complete enumeration of items         Part of the population
             is considered                         chosen for study
Symbols      Population size (N)                   Sample size (n)
             Population mean (μ)                   Sample mean (x̄)
             Population standard deviation (σ)     Sample standard deviation (s)
Sampling techniques
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster Sampling
• Multistage Sampling
• Stratified Cluster Sampling
• Judgmental (Purposive) Sampling
• Snowball Sampling
Sampling techniques
• Simple random sampling: Each individual in the population has an equal chance of
being selected for the sample.
• Stratified sampling: The population is divided into distinct subgroups or strata based
on certain characteristics. A random sample is then taken from each stratum in
proportion to its size.
• Systematic sampling: A starting point is randomly selected, and then every k-th
individual is selected from the population, where k is the sampling interval.
• Cluster sampling: The population is divided into clusters or groups, often based on
geographical or organizational divisions. A random sample of clusters is selected, and
then all individuals within the selected clusters are included in the sample.
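• Multistage sampling: Sampling is carried out in successive stages that combine several
techniques, for example selecting clusters at random and then drawing a random sample
within each selected cluster.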
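The four probability-based techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of 100 numbered individuals, the two strata "A" and "B", and the ten clusters are all hypothetical, and the function names are made up for this example.

```python
import random

def simple_random_sample(population, n, rng):
    # Every individual has the same chance of being selected.
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    # Random starting point, then every step-th individual.
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    # Draw the same fraction from each stratum (proportional allocation).
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    # Pick whole clusters at random and keep everyone inside them.
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)                      # fixed seed for repeatability
population = list(range(1, 101))             # hypothetical individuals 1..100
strata = {"A": population[:60], "B": population[60:]}                 # hypothetical strata
clusters = {c: population[c * 10:(c + 1) * 10] for c in range(10)}    # hypothetical clusters

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
clustered = cluster_sample(clusters, 2, rng)
```

With a 10% fraction, the stratified sample takes 6 individuals from stratum A and 4 from stratum B, mirroring the 60/40 split of the population.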
Sampling techniques
• Judgmental (Purposive) Sampling: In this method, the
researcher uses their judgment to select individuals
who they believe are most relevant or representative
of the population.
• Snowball Sampling: Commonly used in social sciences
and when studying hard-to-reach populations, this
method starts with one or a few participants who are
then asked to refer other potential participants.
Descriptive statistics
• Descriptive statistics are the tabular, graphical, and numerical
methods used to summarize data.
• Measures of central tendency
• Measures of dispersion
• Correlation
• Covariance
• Histogram
• Distribution
• Gaussian distribution
• Binomial distribution
• Log normal distribution
• Power law distribution
• Standard normal distribution
Central tendency
• Central tendency provides a way to summarize or describe the typical or
central value around which data points tend to cluster.
• Central tendency measures are used to understand the
general location of the data and to make comparisons
between different sets of data.
• There are three main measures of central tendency:
• Mean
• Median
• Mode
Example
1, 1, 2, 2, 3, 3, 4, 5, 5, 6

Mean = 32 / 10 = 3.2
Median = (3 + 3) / 2 = 3
Mode = 1, 2, 3, and 5 (each occurs twice)
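As a quick check, the three measures for the example values can be computed with Python's standard `statistics` module. This is a small sketch; `multimode` is used rather than `mode` because several values tie for the highest frequency.

```python
import statistics

data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]  # example values from the slide

mean = statistics.mean(data)         # sum of values divided by the count
median = statistics.median(data)     # average of the two middle values (even count)
modes = statistics.multimode(data)   # every value that occurs most often

print(mean, median, modes)
```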
Central tendency
Mean: The mean is influenced by extreme values (outliers).
Median: The median is not affected by extreme values (outliers) and is a
useful measure for skewed distributions.
Mode: The mode is particularly useful for categorical or discrete
data.

Note: The choice of which measure to use depends on the
characteristics of the data and the specific insights you want to
gain from it.
Measure of dispersion
A measure of dispersion, also known as a measure of
variability or spread, is a statistical metric that quantifies the
extent to which individual data points in a dataset vary or
deviate from the central tendency.

It provides information about how much the data points are
spread out from the average or central value.
Measure of dispersion
Range: The range is the simplest measure of dispersion and is
calculated by subtracting the minimum value from the
maximum value in a dataset. It is highly affected by outliers.
Variance: Variance measures the average squared difference
between each data point and the mean, so it reflects the
variability of every value in the data.
Measure of dispersion
Standard Deviation: The standard deviation is the square root
of the variance.
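Written out, the two definitions take the following standard form. The symbols follow the usual convention (N, μ, σ for a population; n, x̄, s for a sample); dividing by n − 1 for sample data is the common textbook choice and is an assumption here, since the text only describes the "average squared difference".

```latex
% Population variance and standard deviation
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```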
Measure of dispersion
• Mean Absolute Deviation (MAD): MAD is the average absolute
difference between each data point and the mean.
• Coefficient of Variation (CV): The CV is the ratio of the standard
deviation to the mean, expressed as a percentage. It's used to
compare the variability of different datasets with varying means.
• Interquartile Range (IQR): The IQR is the range between the first
quartile (25th percentile) and the third quartile (75th percentile) of
the data. It measures the spread of the middle 50% of the data and is
less sensitive to outliers.
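The measures of dispersion described above can be computed side by side. The sketch below reuses the values 1, 1, 2, 2, 3, 3, 4, 5, 5, 6 from the central tendency example and makes two assumptions: variance and standard deviation use the population form (divide by N), and quartiles are taken as the medians of the lower and upper halves. Other quartile conventions give slightly different IQR values.

```python
import statistics

data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]  # values from the central tendency example

def quartiles_by_halves(values):
    # Q1 and Q3 as medians of the lower and upper halves;
    # the middle value is left out when the count is odd.
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2 :]
    return statistics.median(lower), statistics.median(upper)

mean = statistics.mean(data)
value_range = max(data) - min(data)
variance = statistics.pvariance(data)                # population variance
std_dev = statistics.pstdev(data)                    # square root of the variance
mad = sum(abs(x - mean) for x in data) / len(data)   # mean absolute deviation
cv_percent = std_dev / mean * 100                    # coefficient of variation, in %
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print(value_range, variance, std_dev, mad, cv_percent, iqr)
```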
How to calculate percentile
For example: Imagine you have the marks of 20 students and want to calculate the 90th percentile.
Step 1: Arrange the scores in ascending order.
Step 2: Plug the values into the formula to find n.
P90 = 94 means that 90% of students got less than 94 and 10% of students got more than 94.
How to calculate percentile
Suppose you want to find the percentile of a mark of 78 in the data set.
Step 1: Sort the marks in ascending order.
P = 60 means that a mark of 78 sits at the 60th percentile of the data set.
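Both directions of the calculation can be sketched in Python. The 20 marks below are hypothetical and are not the data behind the figures quoted above. The nearest-rank rule used for `percentile_value` is one of several conventions in use, so other formulas can return a slightly different value for the same data.

```python
import math

def percentile_value(values, p):
    # Nearest-rank rule: the value at position ceil(p/100 * N) in sorted order.
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based position
    return ordered[rank - 1]

def percentile_rank(values, x):
    # Percentage of values strictly below x.
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical marks of 20 students
marks = [35, 42, 48, 51, 55, 58, 60, 63, 66, 70,
         72, 75, 78, 80, 83, 85, 88, 91, 94, 97]

p90 = percentile_value(marks, 90)      # value at the 18th position
rank_of_78 = percentile_rank(marks, 78)

print(p90, rank_of_78)
```

In this made-up data set, 12 of the 20 marks are below 78, so 78 has a percentile rank of 60.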

Five number summary
• The five number summary consists of the minimum, lower quartile (Q1), median, upper quartile (Q3), and the maximum.
• The minimum is the smallest number, the maximum is the largest, the median is in the middle, Q1 is the median of the first half of the data, and Q3 is the median of the second half.
• For the set of data 1, 3, 5, 6, 6, 7, 9, the five number summary is: minimum = 1, Q1 = 3, median = 6, Q3 = 7, maximum = 9. Here the halves are the values on either side of the median: 1, 3, 5 and 6, 7, 9.
Five number summary
• The median (Q2) indicates the average
• The range = maximum – minimum indicates the spread of the whole data
set
• The interquartile range = Q3 – Q1 indicates the spread of the middle 50%
of the data set

Note: The range describes the spread of the whole set of data, whilst the
interquartile range describes the spread of the middle 50% of the data.
The range is greatly affected by outliers (extreme results in the data), whereas the
interquartile range is not.
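A short Python sketch of the five number summary follows, using the "median of each half" rule described above. It assumes that, when the number of values is odd, the middle value is left out of both halves. The function name is illustrative.

```python
import statistics

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]         # values below the median position
    upper_half = ordered[(n + 1) // 2 :]   # values above the median position
    return (
        ordered[0],                        # minimum
        statistics.median(lower_half),     # Q1
        statistics.median(ordered),        # median (Q2)
        statistics.median(upper_half),     # Q3
        ordered[-1],                       # maximum
    )

summary = five_number_summary([1, 3, 5, 6, 6, 7, 9])
minimum, q1, median, q3, maximum = summary

print(summary)
print("range =", maximum - minimum, "IQR =", q3 - q1)
```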
Boxplot
• The minimum is found at the position of the first line at 5
• The maximum is found at the position of the last line at 25
• The lower quartile (Q1) is found at the position of the start of the
box at 10
• The upper quartile (Q3) is found at the position of the end of the
box at 20
• The median (Q2) is found at the position of the line inside the box
at 18
Summary
• Step 1. Put the numbers in order from smallest to largest
• Step 2. The minimum is the smallest number in the list
• Step 3. The maximum is the largest number in the list
• Step 4. The median is found in the middle of the list
• Step 5. The lower quartile is the median of the first half of the data.
• Step 6. The upper quartile is the median of the second half of the
data
Outlier detection
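One widely used rule, assumed here because the text does not spell out a method, flags a value as an outlier when it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles from the boxplot example (Q1 = 10, Q3 = 20); the list of values being checked is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    # Lower and upper cut-offs at k interquartile ranges beyond the quartiles.
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]

low, high = iqr_fences(10, 20)                 # quartiles from the boxplot example
values = [5, 10, 18, 20, 25, 40]               # hypothetical observations
outliers = find_outliers(values, 10, 20)

print(low, high, outliers)
```

With these quartiles the fences sit at −5 and 35, so the boxplot's minimum (5) and maximum (25) are not outliers, while the hypothetical value 40 is.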
Covariance
• Covariance is a statistical concept that measures the degree to which
two random variables change together.
• It is often used to understand the direction of the linear relationship
between two variables.
• Positive covariance: The variables tend to move in the same direction.
• Negative covariance: The variables tend to move in opposite directions.
• Zero covariance: A covariance close to zero indicates no linear relationship.
It does not necessarily mean there is no relationship at all, as non-linear
relationships might still exist.
Covariance

Age (years)   Weight (kg)
20            75
18            63
15            45
14            41
25            78

Note: Covariance has no restriction on its value; it can be any real number.
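The covariance of the age and weight data can be worked out directly from the definition: the average product of each pair's deviations from the two means. The sketch below shows both the sample form (divide by n − 1) and the population form (divide by n), since the text does not say which is intended. The function name is illustrative.

```python
def covariance(xs, ys, sample=True):
    # Average product of paired deviations from the two means.
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

ages = [20, 18, 15, 14, 25]      # years, from the table
weights = [75, 63, 45, 41, 78]   # kg, from the table

sample_cov = covariance(ages, weights)                    # divides by n - 1
population_cov = covariance(ages, weights, sample=False)  # divides by n

print(sample_cov, population_cov)
```

Both values are positive (about 69.05 and 55.24), so age and weight move in the same direction in this table. The magnitude depends on the units, which is why covariance alone does not indicate how strong the relationship is.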

Pearson Correlation Coefficient
• The Pearson correlation coefficient is a statistical measure that
quantifies the strength and direction of a linear relationship between
two continuous variables.

• It is a standardized version of the covariance that accounts for the
scales of the variables.
• The Pearson correlation coefficient ranges from -1 to 1.
• It is often denoted by r.
Pearson Correlation Coefficient
• Positive Correlation (r>0): A positive value of r indicates a positive
linear relationship between the variables. As one variable increases,
the other tends to increase as well. The closer r is to 1, the stronger
the positive correlation.
• Negative Correlation (r<0): A negative value of r indicates a negative
linear relationship between the variables. As one variable increases,
the other tends to decrease. The closer r is to -1, the stronger the
negative correlation.
• No Correlation (r≈0): A correlation coefficient close to 0 suggests little
to no linear relationship between the variables.
Pearson Correlation Coefficient

Note: The Pearson coefficient does not capture non-linear relationships.
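As a standardized covariance, r can be computed as the sum of paired deviation products divided by the square root of the product of the two sums of squared deviations. The sketch below applies this to the age and weight values from the covariance table. The function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    # r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = [20, 18, 15, 14, 25]      # years
weights = [75, 63, 45, 41, 78]   # kg

r = pearson_r(ages, weights)
print(round(r, 2))
```

The result is roughly 0.93, a strong positive linear relationship. Unlike the covariance, this value would not change if weight were recorded in pounds instead of kilograms.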

Spearman's rank correlation coefficient
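Spearman's coefficient is commonly computed as the Pearson correlation of the ranks of the data, which lets it pick up monotonic relationships that are not linear. The sketch below follows that approach, with tied values sharing the average of their ranks, and applies it to the age and weight values from the covariance table. The helper names are illustrative.

```python
import math

def average_ranks(values):
    # 1-based ranks in ascending order; tied values share the mean of their positions.
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    # Pearson correlation applied to the ranks instead of the raw values.
    return pearson_r(average_ranks(xs), average_ranks(ys))

ages = [20, 18, 15, 14, 25]      # years
weights = [75, 63, 45, 41, 78]   # kg

rho = spearman_rho(ages, weights)
print(rho)
```

In this table the oldest person is also the heaviest, the second oldest the second heaviest, and so on, so the ranks agree perfectly and the coefficient is 1 even though the Pearson value is below 1.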
Z-score
• A Z-score is a statistical measurement that describes a value's
relationship to the mean of a group of values.
• A Z-score is measured in terms of standard deviations from the mean.
• If a Z-score is 0, it indicates that the data point's score is identical to the mean.
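In symbols, z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the group. The sketch below uses a small hypothetical data set whose mean is 5 and whose population standard deviation is 2, so the results are easy to check by hand.

```python
import statistics

def z_score(x, mean, std_dev):
    # Number of standard deviations that x lies above (+) or below (-) the mean.
    return (x - mean) / std_dev

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical values
mu = statistics.mean(data)               # 5
sigma = statistics.pstdev(data)          # population standard deviation, 2.0

z_scores = [z_score(x, mu, sigma) for x in data]
print(z_scores)
```

The value 9 gets a Z-score of 2 (two standard deviations above the mean), the value 2 gets −1.5, and the two values equal to the mean get 0.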
Z-Scores vs. Standard Deviation
• For normally distributed data, about 68% of values lie within 1 standard
deviation of the mean (Z-scores between -1 and 1), about 95% within 2
standard deviations, and about 99.7% within 3 standard deviations.
• Standard deviation indicates the amount of variability (or dispersion)
within a given data set.
• A distribution curve has a side below the mean and a side above it, so
Z-scores can be negative or positive. The standard deviation itself is never negative.
• A negative Z-score means the value is to the left of the mean, and a positive
Z-score means it is to the right.
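The 68%, 95%, and 99.7% figures can be reproduced from the normal distribution itself: for a normally distributed variable, the probability of falling within k standard deviations of the mean equals erf(k / √2). A small sketch using only the standard library:

```python
import math

def share_within(k):
    # Probability that a normal variable lies within k standard deviations of its mean.
    return math.erf(k / math.sqrt(2))

shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
print(shares)
```

These shares hold only when the data are close to normal; for skewed data the proportions can differ noticeably.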
