Unit II: Basic Data Analytic Methods

This document provides an overview of basic data analytic methods and statistical concepts. It covers hypothesis testing, different statistical tests, clustering methods, descriptive and inferential statistics, variables and data organization, measures of central tendency and variation, and the normal distribution. The key ideas are that statistics is used to analyze and draw conclusions from data, and that descriptive and inferential methods exist for summarizing, visualizing, and making inferences about populations based on samples.

Uploaded by

vaibhavbdx

Unit II

Basic Data Analytic Methods

Points to be covered
• Statistical Methods for Evaluation:
• Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA.

• Advanced Analytical Theory and Methods:

• Clustering: overview; k-means: use cases, overview of the method, determining the number of clusters, diagnostics, reasons to choose and cautions.
What is statistics?
• Statistics is a very broad subject, with applications in a vast number of different fields.
• It is a methodology for collecting, analyzing, interpreting and drawing conclusions from information.
• It is the methodology that scientists and mathematicians have developed for interpreting and drawing conclusions from collected data.
• Definition (Statistics): Statistics consists of a body of methods for collecting and analyzing data.
What is statistics?
Statistical methods can be used to answer questions such as:
• What kind of data, and how much, needs to be collected?
• How should we organize and summarize the data?
• How can we analyse the data and draw conclusions from it?
• How can we assess the strength of the conclusions and evaluate their uncertainty?
Statistics provides methods for
1. Design: Planning and carrying out research studies.
2. Description: Summarizing and exploring data.
3. Inference: Making predictions and generalizing about phenomena represented by the data.
Statistics in practice
• Agricultural problem: Is a new grain seed or fertilizer more productive?
• Medical problem: What is the right dosage of a drug for treatment?
• Political science: How accurate are Gallup and other opinion polls?
• Economics: What will the unemployment rate be next year?
• Technical problem: How can the quality of a product be improved?

Population and Sample
• A population can be characterized as the set of individual persons or objects in which an investigator is primarily interested during his or her research problem.
• Definition: A population is the collection of all individuals or items under consideration in a statistical study.
• In practice, measurements are usually taken on only a subset of the individuals in that population.
• Definition: A sample is that part of the population from which information is collected.
Descriptive and Inferential Statistics
• The branch of statistics devoted to the summarization and description of data is called descriptive statistics.
• Descriptive statistics includes the construction of graphs, charts, and tables, and the calculation of various descriptive measures such as averages, measures of variation, and percentiles.
• The branch of statistics concerned with using sample data to make an inference about a population is called inferential statistics.
• Inferential statistics includes methods such as point estimation, interval estimation and hypothesis testing, which are all based on probability theory.
Example: Descriptive and Inferential Statistics
• Consider rolling a die. The die is rolled 100 times, and the results form the sample data.
• Descriptive statistics is used to group and summarize the sample data.
• Inferential statistics can then be used to assess whether the die is fair.
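The die example can be sketched in code. The slides do not name a specific test for fairness; the sketch below assumes a chi-square goodness-of-fit test, which is one common choice. The rolls are simulated (not real data), and `chi_square_uniform` is a hypothetical helper name.

```python
import random
from collections import Counter

def chi_square_uniform(counts, faces=6):
    """Chi-square goodness-of-fit statistic against equally likely faces."""
    total = sum(counts.values())
    expected = total / faces
    return sum((counts.get(face, 0) - expected) ** 2 / expected
               for face in range(1, faces + 1))

rng = random.Random(0)                           # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(100)]  # simulated rolls, not real data

# Descriptive step: group the sample into a frequency table.
counts = Counter(rolls)
for face in range(1, 7):
    print(face, counts[face])

# Inferential step: compare the statistic with the tabulated critical value
# (chi-square table, 5 degrees of freedom, significance level 0.05).
CRITICAL_5PCT_DF5 = 11.07
stat = chi_square_uniform(counts)
print("chi-square =", round(stat, 2))
if stat > CRITICAL_5PCT_DF5:
    print("evidence against fairness at the 5% level")
else:
    print("no evidence against fairness at the 5% level")
```

The frequency table is the descriptive part; the comparison against the critical value is the inferential part.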
Statistical data analysis
Variables and organization of the data
• A characteristic that varies from one person or thing to another is called a variable.
• Examples of variables for humans are height, weight, marital status and eye colour.
• Quantitative variables take numerical values.
• Qualitative variables take categorical values.
• Quantitative variables are of two types: discrete or continuous.
• Discrete variables take only a countable number of distinct possible values (for example, the number of children in a family).
• Continuous variables: quantities such as length, weight or temperature can in principle be measured arbitrarily accurately. There is no indivisible unit; weight, for example, may be measured to the nearest gram, or more finely still.
Organization of the data
• Each individual piece of data is called an observation.
• The collection of all observations for particular variables is called a data set.
• The number of observations that fall into a particular class (or category) of a qualitative variable is called the frequency (or count) of that class.
• A table listing all classes and their frequencies is called a frequency distribution.
• Qualitative data are presented graphically either as a pie chart or as a horizontal or vertical bar graph.
Example

• Suppose the blood types of 40 people are as follows:

O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A O O A A
A O A O O AB

[Table: Frequency distribution of blood types]
[Figure: Graphical presentation of the data]
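The frequency distribution for this example can be rebuilt from the 40 observations listed above. A minimal sketch using Python's `collections.Counter` (the variable names are illustrative):

```python
from collections import Counter

# The 40 blood types exactly as listed in the example.
blood_types = (
    "O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A O O A A "
    "A O A O O AB"
).split()

counts = Counter(blood_types)
n = len(blood_types)

# Frequency distribution: class, frequency (count) and relative frequency.
print(f"{'Type':<5}{'Frequency':>10}{'Relative':>10}")
for blood_type, freq in counts.most_common():
    print(f"{blood_type:<5}{freq:>10}{freq / n:>10.3f}")
```

For this data the counts are A: 18, O: 16, B: 4 and AB: 2, which sum to 40. The same counts could feed a pie chart or bar graph.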

Measures of centre and variation
• Mean, median, mode, range, standard deviation
• Mean: the sum of the numbers in the data set divided by the number of values in the data set.
• Mean = (sum of all values) ÷ (number of values in the set).
• Median: the midpoint or middle value of the sorted data set (for an even number of values, the average of the two middle values).
• Mode: the most common value or values in the data set. Depending on the data, there might be one or more modes, or no mode at all.
Measures of centre and variation
• Range: the distance between the lowest and highest values in the data set.
• Range = (max value − min value)
• Standard deviation: measures the variability of the data set. As with the range, a smaller standard deviation indicates less variability.
• Standard deviation: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}}$
• where $\sum$ means sum,
• $X$ represents each value in the data set,
• $\bar{X}$ represents the mean value,
• $N$ is the number of values in the data set.
Example
• Dataset: 20, 24, 25, 36, 25, 22, 23
• Mean: (20 + 24 + 25 + 36 + 25 + 22 + 23) / 7 = 175 / 7 = 25
• Median: sort the data set: 20, 22, 23, 24, 25, 25, 36
• Middle value: 24
• Mode: sort the data set: 20, 22, 23, 24, 25, 25, 36
• Repeated value: 25
• Range: 36 − 20 = 16
• Standard deviation:
$$s = \sqrt{\frac{(20-25)^2 + (24-25)^2 + (25-25)^2 + (36-25)^2 + (25-25)^2 + (22-25)^2 + (23-25)^2}{7 - 1}}$$
$$s = \sqrt{\frac{25 + 1 + 0 + 121 + 0 + 9 + 4}{6}} = \sqrt{\frac{160}{6}} = \sqrt{26.67} \approx 5.16$$
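The worked example can be checked with Python's standard `statistics` module. This is a minimal sketch: `statistics.stdev` is the sample standard deviation (divisor N − 1), which matches the divisor of 6 used in the calculation above.

```python
import statistics

data = [20, 24, 25, 36, 25, 22, 23]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)   # a list, since a data set can have several modes
value_range = max(data) - min(data)
sample_sd = statistics.stdev(data)   # sample standard deviation, divides by N - 1

print(mean, median, modes, value_range, round(sample_sd, 2))
# prints: 25 24 [25] 16 5.16
```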
Example
• Dataset : 57, 64, 43, 67, 49, 59, 44, 47, 61, 59

• Mean
• Median
• Mode
• Range
• Standard Deviation
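One way to check answers to this exercise is to code the definitions directly rather than call a library. The sketch below assumes the sample standard deviation (divisor N − 1); `describe` is a hypothetical helper name.

```python
from collections import Counter
from math import sqrt

def describe(values):
    """Return mean, median, modes, range and sample standard deviation."""
    n = len(values)
    ordered = sorted(values)
    mean = sum(ordered) / n
    mid = n // 2
    # Odd count: take the middle value; even count: average the two middle values.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    counts = Counter(ordered)
    top = max(counts.values())
    # If no value repeats, report no mode.
    modes = [v for v, c in counts.items() if c == top] if top > 1 else []
    spread = ordered[-1] - ordered[0]
    sd = sqrt(sum((x - mean) ** 2 for x in ordered) / (n - 1))
    return mean, median, modes, spread, sd

mean, median, modes, spread, sd = describe([57, 64, 43, 67, 49, 59, 44, 47, 61, 59])
print(mean, median, modes, spread, round(sd, 2))
```

With ten values, the median is the average of the fifth and sixth sorted values, which is why the even-count branch matters here.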
Measures of Centre
• Of the measures of centre and variation, the sample mean x̄ and the sample standard deviation s are the most commonly reported. Since their values depend on the sample selected, they vary from sample to sample. In this sense, they are random variables.
• A variable that can take at least two different numerical values in a long run of repeated observations is called a random variable.
Normal Distribution
• The normal distribution describes data whose distribution follows a bell-shaped curve.
• It is important because:
• Many variables are distributed approximately normally.
• The normal distribution is easy to work with, and many kinds of statistical tests can be derived for it. These tests often work well even if the data are only approximately normally distributed.
Properties of the normal curve
• The curve has a peak at its center and decreases on either side, which shows that extreme values are infrequent.
• Probabilities associated with the normal curve correspond to areas under the curve. The total area under the curve is 1 (i.e. 100%). The probability of any single point is 0, since there is no area above a point.
• Most of the values lie towards the center of the curve, with the arithmetic mean, median, and mode lying at its very center.
• The curve is symmetric about a vertical line through the mean (plotted on the horizontal axis), so the area under the curve on each side of the line is 0.5.
• Deviations from the mean are equally probable in either direction, because the curve is symmetric.
Standard normal distribution
• The standard normal distribution is the particular normal distribution with $\mu = 0$ and $\sigma = 1$. Probabilities for a normal variable $X$ are usually found by converting to z-values and then using a standard normal table or an appropriate Excel function:
$$z = \frac{X - \mu}{\sigma}$$
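As an alternative to a printed table or a spreadsheet function, the conversion to a z-value and the table lookup can be sketched with `statistics.NormalDist` from Python's standard library. The values of μ, σ and X below are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Convert a raw value to a z-value: z = (X - mu) / sigma."""
    return (x - mu) / sigma

mu, sigma = 100.0, 15.0   # hypothetical population mean and standard deviation
x = 130.0                 # hypothetical observed value

z = z_score(x, mu, sigma)

# The standard normal distribution has mean 0 and standard deviation 1.
standard_normal = NormalDist()
p_below = standard_normal.cdf(z)   # area under the curve to the left of z

print(z, round(p_below, 4))
```

Here `cdf(z)` plays the role of the standard normal table. Calling `NormalDist(mu, sigma).cdf(x)` on the raw value gives the same probability without the explicit conversion.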
Hypothesis Testing
• A hypothesis is a statement about some characteristic of a variable or
a collection of variables.
• Hypotheses arise from the theory that drives the research
• A significance test is a way of statistically testing a hypothesis by
comparing the data to values predicted by the hypothesis.
• Data that fall far from the predicted values provide evidence against
the hypothesis
• A significance test considers two hypotheses about the value of a
population parameter: the null hypothesis and the alternative
hypothesis
Hypothesis Testing
• The null hypothesis H0 is the hypothesis that is directly tested.
• This is usually a statement that the parameter has a value corresponding to, in some sense, no effect.
• The alternative hypothesis HA is a hypothesis that contradicts the null hypothesis.
• This hypothesis states that the parameter falls in some alternative set of values to what the null hypothesis specifies.
Hypothesis Testing
• The basic concept is to form an assertion and test it with data.
• The common (default) assumption is that there is no difference between samples.
• Statisticians refer to this as the null hypothesis (H0).
• The alternative hypothesis (HA) is that there is a difference between samples.
Hypothesis Testing
Identify the effect of drug A compared to drug B on patients:
• H0: Drug A and drug B have the same effect on patients.
• HA: Drug A has a greater effect than drug B on patients.
Identify whether advertising Campaign C is effective in reducing customer churn:
• H0: Campaign C does not reduce customer churn better than the current campaign method.
• HA: Campaign C does reduce customer churn better than the current campaign method.
Hypothesis Testing
• Once a model is built over the training data, it needs to be evaluated
over the testing data to see if the proposed model predicts better
than the existing model currently being used.
• The null hypothesis is that the proposed model does not predict
better than the existing model.
• The alternative hypothesis is that the proposed model indeed
predicts better than the existing model
Hypothesis Testing
Difference of Means
• Hypothesis testing is a common approach to drawing inferences on whether or not two populations are different from each other.
• If the sample means X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and X̄2 overlap substantially and the null hypothesis is supported.
• A large observed difference between the sample means indicates that the null hypothesis should be rejected.
Difference of Mean
• The difference of means can be tested using one of two tests:
• Student's t-test
• Assumes two normally distributed populations that have equal variance.
• Welch's t-test
• Assumes two normally distributed populations that do not necessarily have equal variance.
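The practical difference between the two tests lies in how the standard error and the degrees of freedom are computed. The sketch below uses the usual textbook formulas, which are not spelled out in the slides: pooled variance for Student's test, and the Welch-Satterthwaite approximation for Welch's degrees of freedom. The two samples are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, n1 + n2 - 2

def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof

# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 4.8, 4.4, 5.0, 4.1, 4.6]

t_student, dof_student = student_t(group_a, group_b)
t_welch, dof_welch = welch_t(group_a, group_b)
print(round(t_student, 3), dof_student)
print(round(t_welch, 3), round(dof_welch, 2))
```

When the two groups have the same size, the two t statistics coincide and only the degrees of freedom differ. With unequal sizes and unequal variances, the statistics themselves diverge.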
Student’s t-test
• The t-test (also called Student's t-test) compares two averages (means) and helps decide whether they are different from each other.
• The t-test also tells you how significant the differences are.
• The t-score is the ratio of the difference between the two groups to the difference within the groups.
• A large t-score (in absolute value) suggests that the groups are different.
• A small t-score suggests that the groups are similar.
• The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.
There are two types of t-test:
• Matched pairs and independent pairs
• If there is some link between the data, use the matched pairs test (e.g. a 'before' and 'after' measurement).
• If there is no link between the data, use the independent pairs test.
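• Mean of the differences: $\bar{D} = \dfrac{\sum D}{N}$
• $N$ is the number of pairs and $D$ is the difference between the paired samples.
• Standard deviation of the differences: $s_D = \sqrt{\dfrac{\sum (D - \bar{D})^2}{N - 1}}$
• Standard error: $SE = \dfrac{s_D}{\sqrt{N}}$
• Value of t: $t = \dfrac{\bar{D}}{SE}$
• Degrees of freedom:
• DoF = n − 1 (where n is the number of items in your set)
• DoF (two samples) = (N1 + N2) − 2.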
• Question
• In an investigation to determine the effectiveness of sequencing fingerprint enhancement treatments, 10 prints are taken and enhanced with DFO and then with ninhydrin. The points of detail at each stage are recorded. Is there a difference at the 95% confidence level?
t-test for matched pairs
1. Set up the null and alternative hypotheses
• H0: there is no difference in the number of minutiae when using ninhydrin.
• HA: more minutiae are observed after enhancement with ninhydrin.
• This is a one-tailed test.
• We are testing at the 95% confidence level, i.e. the 5% (0.05) significance level.
t-test for matched pairs
1. Set up the null and alternative hypotheses
2. Calculate the differences between the pairs in the sample
3. Calculate the mean of the differences
4. Calculate the standard deviation of the differences
5. Calculate the standard error
6. Calculate the value of t
7. Calculate the number of degrees of freedom (DoF)
DoF = number of pairs of data − 1 = N − 1 = 10 − 1 = 9
8. Find the critical value
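Steps 2 to 8 can be sketched in a few lines of Python. The minutiae counts below are hypothetical placeholders, since the data table for the 10 prints is not reproduced in this text. The critical value 1.833 is the standard t-table entry for a one-tailed test at the 0.05 level with 9 degrees of freedom.

```python
from math import sqrt

# Hypothetical minutiae counts for 10 prints (placeholders, not the original data).
dfo       = [9, 12, 7, 10, 8, 11, 6, 9, 10, 8]
ninhydrin = [11, 13, 9, 10, 10, 14, 8, 10, 12, 9]

diffs = [after - before for before, after in zip(dfo, ninhydrin)]  # step 2
n = len(diffs)
mean_d = sum(diffs) / n                                            # step 3
sd_d = sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))       # step 4
se = sd_d / sqrt(n)                                                # step 5
t = mean_d / se                                                    # step 6
dof = n - 1                                                        # step 7

T_CRITICAL = 1.833  # step 8: t-table, one-tailed, alpha = 0.05, 9 degrees of freedom

print("t =", round(t, 3), "DoF =", dof)
if t > T_CRITICAL:
    print("reject H0 at the 5% level")
else:
    print("do not reject H0 at the 5% level")
```

Because the alternative hypothesis is one-sided (more minutiae after ninhydrin), only a large positive t counts as evidence against H0.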
