
Unit II: Basic Data Analytic Methods

This document provides an overview of basic data analytic methods and statistical concepts. It covers hypothesis testing and related statistical tests, clustering methods, descriptive and inferential statistics, variables and data organization, measures of central tendency and variation, and the normal distribution. The key ideas are that statistics is used to analyze and draw conclusions from data, and that descriptive and inferential methods summarize, visualize, and make inferences about populations based on samples.

Uploaded by

vaibhavbdx
Copyright
© All Rights Reserved

Unit II

Basic Data Analytic Methods


Points to be covered
• Statistical Methods for Evaluation:
• Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA.

• Advanced Analytical Theory and Methods:
• Clustering: overview, K-means use cases, overview of the method, determining the number of clusters, diagnostics, reasons to choose and cautions.
What is statistics?
• Statistics is a very broad subject, with applications in a vast number of different fields.
• It is a methodology for collecting, analyzing, interpreting and drawing conclusions from information.
• It is the methodology that scientists and mathematicians have developed for interpreting and drawing conclusions from collected data.
• Definition (Statistics): Statistics consists of a body of methods for collecting and analyzing data.
What is statistics?
Statistical methods can be used to answer questions such as:
• What kind of data, and how much, needs to be collected?
• How should we organize and summarize the data?
• How can we analyse the data and draw conclusions from it?
• How can we assess the strength of the conclusions and evaluate their uncertainty?
Statistics provides methods for:
1. Design: Planning and carrying out research studies.
2. Description: Summarizing and exploring data.
3. Inference: Making predictions and generalizing about phenomena represented by the data.
Statistics in practice
• Agricultural problem: Is a new grain seed or fertilizer more productive?
• Medical problem: What is the right dosage of a drug for treatment?
• Political science: How accurate are Gallup and other opinion polls?
• Economics: What will the unemployment rate be next year?
• Technical problem: How can the quality of a product be improved?


Population and Sample
• A population can be characterized as the set of individual persons or objects in which an investigator is primarily interested for a research problem.
• Definition: A population is the collection of all individuals or items under consideration in a statistical study.
• In practice, measurements are usually taken on only some of the individuals in the population.
• Definition: A sample is that part of the population from which information is collected.
Descriptive and Inferential Statistics
• The branch of statistics devoted to the summarization and description of data is called descriptive statistics.
• Descriptive statistics includes the construction of graphs, charts, and tables, and the calculation of various descriptive measures such as averages, measures of variation, and percentiles.
• The branch of statistics concerned with using sample data to make an inference about a population is called inferential statistics.
• Inferential statistics includes methods such as point estimation, interval estimation and hypothesis testing, which are all based on probability theory.
Example: Descriptive and Inferential Statistics
• Consider the experiment of rolling a die. The die is rolled 100 times and the results form the sample data.
• Descriptive statistics is used to group and summarize the sample data.
• Inferential statistics can then be used to test whether the die is fair.
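The die example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts for the 100 rolls are hypothetical, and the chi-square goodness-of-fit test is one possible inferential method for checking fairness, not necessarily the one the slides have in mind. The critical value 11.070 is the standard table value for 5 degrees of freedom at the 0.05 significance level.

```python
from collections import Counter

# Hypothetical outcome of 100 rolls, stored as face -> count for brevity.
observed_counts = {1: 18, 2: 15, 3: 17, 4: 16, 5: 14, 6: 20}
rolls = [face for face, n in observed_counts.items() for _ in range(n)]

# Descriptive step: group the raw sample into a frequency table.
frequency = Counter(rolls)
n_rolls = len(rolls)

# Inferential step: chi-square goodness-of-fit statistic against a fair die,
# for which every face is expected n_rolls / 6 times.
expected = n_rolls / 6
chi_square = sum(
    (frequency[face] - expected) ** 2 / expected for face in range(1, 7)
)

# Critical value of the chi-square distribution with 5 degrees of freedom
# at the 0.05 significance level (from a standard table).
CRITICAL_VALUE = 11.070
reject_fairness = chi_square > CRITICAL_VALUE

print(dict(sorted(frequency.items())))
print(round(chi_square, 3), reject_fairness)
```

With these made-up counts the statistic stays well below the critical value, so the sample gives no evidence against the die being fair.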
Statistical data analysis
Variables and organization of the data
• A characteristic that varies from one person or thing to another is called a variable.
• Examples of variables for humans are height, weight, marital status and eye colour.
• Quantitative variables take numerical values.
• Qualitative variables take categorical values.
• Quantitative variables are of two types: discrete or continuous.
• Discrete variables can take only a countable number of distinct values (such as whole-number counts).
• Continuous variables are quantities such as length, weight or temperature that can in principle be measured to arbitrary accuracy. There is no indivisible unit; weight, for example, may be measured to the nearest gram or finer.
Organization of the data
• Each individual piece of data is called an observation.
• The collection of all observations for particular variables is called a data set.
• The number of observations that fall into a particular class (or category) of a qualitative variable is called the frequency (or count) of that class.
• A table listing all classes and their frequencies is called a frequency distribution.
• Qualitative data are presented graphically either as a pie chart or as a horizontal or vertical bar graph.
Example
• The blood types of 40 persons are as follows:

O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A O O A A A O A O O AB

[Figure: frequency distribution of blood types and graphical presentation of the data]
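The frequency distribution for the 40 blood types listed above can be tallied with a short Python sketch. The relative-frequency column is an addition for illustration; the slides only define the count.

```python
from collections import Counter

# The 40 blood types listed in the example.
blood_types = (
    "O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A O O A A "
    "A O A O O AB"
).split()

# Frequency (count) of each class.
frequency = Counter(blood_types)

# Relative frequency: the share of observations in each class.
total = len(blood_types)
for blood_type, count in frequency.most_common():
    print(f"{blood_type:>2}  {count:>2}  {count / total:.3f}")
```

The resulting counts are the numbers a pie chart or bar graph of this data would display.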


Measures of centre
• Mean, median and mode measure the centre; range and standard deviation measure variation.
• Mean: The mean equals the sum of the numbers in the data set divided by the number of values in the data set.
• Mean = (sum of all values) ÷ (number of values in the set).
• Median: The median identifies the midpoint or middle value of a sorted set of numbers.
• Mode: The mode identifies the most common value or values in the data set. Depending on the data, there might be one or more modes, or no mode at all.
Measures of centre
• Range: The range is the distance between the lowest and highest values in the data set.
• Range = (Max value − Min value)
• Standard Deviation: The standard deviation measures the variability of the data set. As with the range, a smaller standard deviation indicates less variability.
• Standard Deviation: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}}$
• where ∑ means sum
• X represents each value in the data set
• X̄ represents the mean value
• N is the number of values in the data set
Example
• Dataset: 20, 24, 25, 36, 25, 22, 23
• Mean: (20 + 24 + 25 + 36 + 25 + 22 + 23) / 7 = 175 / 7 = 25
• Median: Sort the data set: 20, 22, 23, 24, 25, 25, 36
• Middle value: 24
• Mode: Sort the data set: 20, 22, 23, 24, 25, 25, 36
• Repeated value: 25
• Range: (36 − 20) = 16
• Standard Deviation:
= √{ [ (20−25)² + (24−25)² + (25−25)² + (36−25)² + (25−25)² + (22−25)² + (23−25)² ] / (7 − 1) }
= √[ (25 + 1 + 0 + 121 + 0 + 9 + 4) / 6 ] = √(160 / 6) = √26.67 ≈ 5.16
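The worked example above can be checked with Python's standard `statistics` module. This is a minimal sketch; `statistics.stdev` is the sample standard deviation, which divides by N − 1 as in the calculation above.

```python
import statistics

data = [20, 24, 25, 36, 25, 22, 23]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)          # most common value
value_range = max(data) - min(data)
std_dev = statistics.stdev(data)      # sample standard deviation (divides by N - 1)

print(mean, median, mode, value_range, round(std_dev, 2))
```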
Example
• Dataset: 57, 64, 43, 67, 49, 59, 44, 47, 61, 59
• Calculate the following for this dataset:
• Mean
• Median
• Mode
• Range
• Standard Deviation
Measures of Centre
• Of the measures of centre and variation, the sample mean X̄ and the sample standard deviation s are the most commonly reported. Since their values depend on the sample selected, they vary from sample to sample. In this sense, they are random variables.
• A variable that can take at least two different numerical values in a long run of repeated observations is called a random variable.
Normal Distribution
• The normal distribution describes a set of data whose distribution follows a bell-shaped curve.
• It is important because:
• Many variables are approximately normally distributed.
• It is easy to work with, as many kinds of statistical tests can be derived for it. These tests work well even if the data are only approximately normally distributed.
Properties of the normal curve
• The curve has a peak at its center and decreases on either side, which shows that extreme values are infrequent.
• Probabilities associated with the normal curve correspond to areas under the curve. The total area under the curve is 1 (i.e. 100%). The probability of any single point is 0 (since there is no area above a point).
• Most of the values lie towards the center of the curve, with the arithmetic mean, median, and mode lying at its very center.
• The curve is symmetric about a line that extends up from the mean (plotted on the axis), so the area under the curve on each side of the line is 0.5.
• Because the curve is symmetric, deviations from the mean are equally probable in either direction.
Standard normal distribution
• The standard normal distribution is the particular normal distribution with μ = 0 and σ = 1. Probabilities for a normal variable X with mean μ and standard deviation σ are usually found by converting to z-values and then using a standard normal table or an appropriate Excel function:
$z = \dfrac{X - \mu}{\sigma}$
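The conversion to z-values can be illustrated with `statistics.NormalDist` from the Python standard library, used here in place of a printed table or spreadsheet function. The mean, standard deviation and observed value are hypothetical numbers chosen only for the sketch.

```python
from statistics import NormalDist

# Hypothetical normal variable: mean 25, standard deviation 5.
mu, sigma = 25.0, 5.0
x = 32.0

# Convert X to a z-value.
z = (x - mu) / sigma

# P(X <= x) read from the standard normal distribution (mean 0, sd 1) ...
p_from_z = NormalDist().cdf(z)
# ... agrees with the probability computed directly on the original scale.
p_direct = NormalDist(mu, sigma).cdf(x)

print(round(z, 2), round(p_from_z, 4))
```

Because both routes give the same probability, a single standard normal table is enough for any normal variable.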
Hypothesis Testing
• A hypothesis is a statement about some characteristic of a variable or
a collection of variables.
• Hypotheses arise from the theory that drives the research
• A significance test is a way of statistically testing a hypothesis by
comparing the data to values predicted by the hypothesis.
• Data that fall far from the predicted values provide evidence against
the hypothesis
• A significance test considers two hypotheses about the value of a
population parameter: the null hypothesis and the alternative
hypothesis
Hypothesis Testing
• The null hypothesis H0 is the hypothesis that is directly tested.
• This is usually a statement that the parameter has a value corresponding to, in some sense, no effect.
• The alternative hypothesis HA is a hypothesis that contradicts the null hypothesis.
• This hypothesis states that the parameter falls in some alternative set of values to those the null hypothesis specifies.
Hypothesis Testing
• Basic concept is to form an assertion and test it with data
• Common assumption is that there is no difference between samples
(default assumption)

• Statisticians refer to this as the null hypothesis (H0)


• The alternative hypothesis (HA) is that there is a difference between
samples
Hypothesis Testing
Identify the effect of drug A compared to drug B on patients:
• H0: Drug A and drug B have the same effect on patients.
• HA: Drug A has a greater effect than drug B on patients.
Identify whether advertising Campaign C is effective in reducing customer churn:
• H0: Campaign C does not reduce customer churn better than the current campaign method.
• HA: Campaign C does reduce customer churn better than the current campaign method.
Hypothesis Testing
• Once a model is built over the training data, it needs to be evaluated
over the testing data to see if the proposed model predicts better
than the existing model currently being used.
• The null hypothesis is that the proposed model does not predict
better than the existing model.
• The alternative hypothesis is that the proposed model indeed
predicts better than the existing model
Hypothesis Testing
Difference of Means
• Hypothesis testing is a common approach for drawing inferences about whether or not two populations differ from each other.
• If the sample means X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and X̄2 overlap substantially and the null hypothesis is supported.
• A large observed difference between the sample means indicates that the null hypothesis should be rejected.
Difference of Means
• A difference of means can be tested using two tests:
• Student’s t-test
• Assumes two normally distributed populations with equal variance
• Welch’s t-test
• Assumes two normally distributed populations that do not necessarily have equal variance
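The difference between the two tests can be sketched with the standard textbook formulas: Student's t-test pools the two sample variances, while Welch's t-test keeps them separate and uses the Welch-Satterthwaite approximation for the degrees of freedom. Both samples below are hypothetical values chosen for illustration, and only the test statistics are computed, not p-values.

```python
import math
import statistics

# Hypothetical samples from two populations (illustrative values only).
sample_1 = [20, 24, 25, 36, 25, 22, 23]
sample_2 = [27, 30, 26, 35, 31, 29, 32]

n1, n2 = len(sample_1), len(sample_2)
mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)

# Student's t-test: pool the two sample variances (assumes equal variance).
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
t_student = (mean_1 - mean_2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
dof_student = n1 + n2 - 2

# Welch's t-test: keep the variances separate (equal variance not assumed).
se_welch = math.sqrt(var_1 / n1 + var_2 / n2)
t_welch = (mean_1 - mean_2) / se_welch
# Welch-Satterthwaite approximation for the degrees of freedom.
dof_welch = (var_1 / n1 + var_2 / n2) ** 2 / (
    (var_1 / n1) ** 2 / (n1 - 1) + (var_2 / n2) ** 2 / (n2 - 1)
)

print(round(t_student, 3), dof_student)
print(round(t_welch, 3), round(dof_welch, 2))
```

With equal sample sizes, as here, the two t statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics themselves diverge.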
Student’s t-test
• The t-test (also called Student’s t-test) compares two averages (means) and tells you whether they are different from each other.
• The t-test also tells you how significant the differences are.
• The t-score is the ratio of the difference between the two groups to the difference within the groups.
• A large t-score tells you that the groups are different.
• A small t-score tells you that the groups are similar.
• The number of degrees of freedom is the number of values in the
final calculation of a statistic that are free to vary.
There are two types of t-test:
• Matched pairs and independent samples
• If there is some link between the data, use the matched pairs test (e.g. a "before" and "after" measurement)
• If there is no link between the data, use the independent samples test
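• Mean of the differences: $\bar{D} = \dfrac{\sum D}{N}$
• N is the number of pairs, D is the difference between the paired values
• Standard deviation of the differences: $s_D = \sqrt{\dfrac{\sum (D - \bar{D})^2}{N - 1}}$
• Standard error: $SE = \dfrac{s_D}{\sqrt{N}}$
• Value of t: $t = \dfrac{\bar{D}}{SE}$
• Degrees of freedom
• DoF = n − 1 (where “n” is the number of items in your set)
• DoF (two samples) = (N1 + N2) − 2.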
• Question
• In an investigation to determine the effectiveness of sequencing of fingerprints, 10 prints are taken and enhanced first with DFO and then with ninhydrin. The points of detail at each stage are recorded. Is there a difference at the 95% confidence level?
t-test for matched pairs
1. Set up the null and alternative hypotheses
• H0: there is no difference in the number of minutiae when using ninhydrin
• HA: more minutiae are observed after enhancement with ninhydrin
• This is a one-tailed test.
• We are testing at the 95% confidence level, i.e. the 5% (0.05) significance level
t-test for matched pairs
1. Set up the null and alternative hypotheses
2. Calculate the difference between the pairs in the sample
3. Calculate the mean of the differences
4. Calculate the standard deviation of the differences
5. Calculate the standard error
6. Calculate the value of t
7. Calculate the number of degrees of freedom (DoF)
DoF = number of pairs of data − 1 = N − 1 = 10 − 1 = 9
8. Find the critical value
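The steps above can be sketched in Python. The minutiae counts are hypothetical, since the actual measurements for the 10 prints are not reproduced in the text, so the resulting t value illustrates the procedure only. The critical value 1.833 is the standard one-tailed t table value for 9 degrees of freedom at the 0.05 level.

```python
import math
import statistics

# Hypothetical minutiae counts for 10 prints (the source data is not shown).
after_dfo = [8, 7, 6, 9, 5, 8, 7, 6, 9, 7]
after_ninhydrin = [10, 8, 7, 11, 6, 9, 9, 7, 10, 9]

# Step 2: difference within each pair.
differences = [b - a for a, b in zip(after_dfo, after_ninhydrin)]
n_pairs = len(differences)

# Steps 3-5: mean, sample standard deviation and standard error of the differences.
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)
standard_error = sd_diff / math.sqrt(n_pairs)

# Steps 6-7: t value and degrees of freedom.
t_value = mean_diff / standard_error
dof = n_pairs - 1

# Step 8: one-tailed critical value for 9 degrees of freedom at the 0.05 level,
# taken from a standard t table.
T_CRITICAL = 1.833
reject_h0 = t_value > T_CRITICAL

print(round(t_value, 3), dof, reject_h0)
```

If the computed t exceeds the critical value, H0 is rejected in favour of HA at the 5% significance level.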
