
Quantitative Methods

Data is a valuable resource in today's ever-changing marketplace. For business professionals, knowing how to interpret and communicate data is an indispensable skill that can inform sound decision-making.



Data and Classification of Data
Data are any number of related observations. A single observation is known as a data point; a collection of observations is a data set.
Data can be qualitative or quantitative:
• Quantitative data are numerical data that can be measured.
• Qualitative data are data for which the measurement scale is categorical.
Classification of data:
• Qualitative
• Quantitative: discrete or continuous
Processing Data
1. Initial observation (research question)
2. Generate a hypothesis from theory (literature) and identify the variables
3. Collect data to test the predictions (measure the variables)
4. Analyze the data (fit a model)
Making Sense from Data
• The growing amount of data and improved computational capacity are altering how businesses collect data, process information, and make choices.
• Analytics is a scientific approach that uses large amounts of data to find models that inform judgements and actions, making data visualization an integral part of business intelligence.
• It is applied in operations, marketing, finance, and strategic planning, among other functions.



Management based on data and models
• Data and its analysis help in searching for possible solutions and evaluating them.
• Data insights can systematically test the robustness of solutions w.r.t. changes in the business environment (what-if and sensitivity analysis).
• Data help generate insights, which leads to better decision making.
• Data analytical tools and their repeated use harness naïve intuition into an instinct.
Implementation
• Improving Productivity and Collaboration at Microsoft
• At technology giant Microsoft, collaboration is key to a productive, innovative work
environment.
• Microsoft’s Workplace Analytics team hypothesized that moving the 1,200-person
group from five buildings to four could improve collaboration by increasing the number
of employees per building and reducing the distance that staff needed to travel for
meetings.
• In an article for the Harvard Business Review, the company’s analytics team shared the
outcomes they observed as a result of the relocation.
• Through looking at metadata attached to employee calendars, the team found that the
move resulted in a 46 percent decrease in meeting travel time. This translated into a
combined 100 hours saved per week across all relocated staff members and an
estimated savings of $520,000 per year in employee time.
HBR article: Matt Gavin
Targeting Consumers at PepsiCo
• To ensure the right quantities and types of products are available to consumers
in certain locations, PepsiCo uses big data analytics.
• PepsiCo created a cloud-based data and analytics platform called Pep Worx to
make more informed decisions regarding product merchandising.
• With Pep Worx, the company identifies shoppers in the United States who are
likely to be highly interested in a specific PepsiCo brand or product.
• For example, Pep Worx enabled PepsiCo to distinguish 24 million households
from its dataset of 110 million US households that would be most likely to be
interested in Quaker Overnight Oats.
• The company then identified specific retailers that these households might
shop at and targeted their unique audiences.
• Ultimately, these customers drove 80 percent of the product’s sales growth in
its first 12 months after launch.

HBR article: Matt Gavin


1854 Cholera Outbreak - Snow's Map
(Diagnostic Analysis)

John Snow created a map depicting where cases of cholera occurred in London's West End and found them to be clustered around a water pump on Broad Street.
Analytics

Source: Prof. David Simchi Levi


Descriptive Statistics
• Descriptive analytics examines what happened in the past. We
are utilizing descriptive analytics when we examine past data
sets for patterns and trends.
• Descriptive analytics functions by identifying what metrics you
want to measure, collecting that data, and analyzing it.
• It turns the stream of facts the business has collected into
information we can act on, plan around, and measure.
• Examples of descriptive analytics include:
• Annual revenue reports
• Year-over-year sales reports
Decision making approach

• What is the best estimate of the population of Sri Lanka?

A. 50 million
B. 52 million
C. 22 million
D. 49 million
Scale of Measurement
Likely to encounter these terms:
▪ Data are the facts and figures that are collected, summarized, analyzed,
and interpreted
▪ Elements are the entities on which data are collected
▪ A variable is a characteristic of interest for the elements
▪ A data set with n elements contains n observations
▪ Predictor variable: A variable thought to predict an outcome variable. This term is basically another way of saying 'independent variable' or 'cause'.
▪ Outcome variable: A variable thought to change as a function of changes in a predictor variable ('dependent variable' or 'effect').
▪ Variables are measured constructs that vary across entities in the sample.
▪ In contrast, parameters are not measured and are (usually) constants believed to represent some fundamental truth about the relations between variables in the model (e.g., the mean, the median, and correlation and regression coefficients).
For instance: [table showing the elements of a data set (named in rows) and the variables measured on each element (in columns)]
Types of Measurement scale
• Variables can be split into categorical and continuous, and within these types
there are different levels of measurement:
• Categorical (entities are divided into distinct categories):
• Binary variable: There are only two categories (e.g., dead or alive).
• Nominal variable: There are more than two categories (e.g., whether someone is an
omnivore, vegetarian, vegan, or fruitarian).
• Ordinal variable: The same as a nominal variable but the categories have a logical
order (e.g., whether people got a fail, a pass, a merit or a distinction in their exam)
• Continuous or Quantitative (entities get a distinct score):
• Interval variable: Equal intervals on the variable represent equal differences in the
property being measured (e.g., the difference between 6 and 8 is equivalent to the
difference between 13 and 15).
• Ratio variable: The same as an interval variable, but the ratios of scores on the scale
must also make sense (e.g., a score of 16 on an anxiety scale means that the person is,
in reality, twice as anxious as someone scoring 8). For this to be true, the scale must
have a meaningful zero point.
What is the level of measurement of the following variables?

• The number of downloads of different bands’ songs on iTunes

• The phone numbers that the bands obtained during registration

• The gender of the people giving the bands their phone numbers

• The instruments played by the band members

• The time they had spent learning to play their instruments


Analysing data
• We collect data from a smaller subset of the population known as a sample and
use these data to infer things about the population as a whole. The bigger the
sample, the more likely it is to reflect the whole population
• The final stage of the research process is to analyse the data you have collected.
• The statistical analysis appropriate for a particular variable depends upon
whether the variable is categorical or quantitative.
• We can summarize categorical data by counting the number of observations in
each category or by computing the proportion of the observations in each
category.
• When the data are quantitative this involves both looking at your data graphically
to see what the general trends in the data are, and also fitting statistical models
to the data.
Caselet: How much students expect to make?
• Ashnaa (hypothetical name), an aspiring MBA applicant, was particularly
interested in starting salaries of graduates. She found a dataset from a
prominent MBA school in Germany.
• The data were from the class of 2023, who were surveyed about their satisfaction
with the MBA program and their starting salaries. The survey responses were
linked to existing data on the graduates, including age, sex, work experience,
GMAT scores, MBA averages, quartile ranking, and native language.
• Ashnaa was pleased to find this data and hoped it could answer her key
questions about starting salaries, the impact of gender and age, student
satisfaction, and the influence of GMAT scores. Since her native language was
not English, she had a relatively low GMAT score.
Field descriptions:
• age - age in years
• sex - 1 = Male; 2 = Female
• gmat_tot - total GMAT score
• gmat_qpc - quantitative GMAT percentile
• gmat_vpc - verbal GMAT percentile
• qmat_tpc - overall GMAT percentile
• s_avg - spring MBA average
• f_avg - fall MBA average
• quarter - quartile ranking (1st is top, 4th is bottom)
• work_yrs - years of work experience
• frstlang - first language (1 = English; 2 = other)
• salary - starting salary
• satis - degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)
• ..\..\Downloads\QM1 case desc.xlsx
Basic ways to look at and summarize the data
you have collected.
• Frequency distributions: a graph of how many times each score occurs.
• Frequency distributions can be very useful for assessing properties of the distribution of scores. They come in many different shapes and sizes.
• In an ideal world our data would be distributed symmetrically around the centre of all scores. This is known as a normal distribution and is characterized by the bell-shaped curve with which you might already be familiar.
• We often use a normal distribution with a mean of 0 and a standard deviation of 1 as a standard.
• There are two main ways in which a distribution can deviate from normal: (1) lack of symmetry (called skew) and (2) pointiness (called kurtosis). In a normal distribution the values of skew and kurtosis are 0.
Statistics of rolling dice

https://academo.org/demos/dice-roll-statistics/#:~:text=If%20you%20roll%20a%20fair,%22roll%20automatically%22%20button%20above.

https://www.youtube.com/watch?v=zeJD6dqJ5lo
Descriptive Statistics
▪ Numerical Measures
▪ Measures of Location
▪ Measures of Variability
Measures of Location
▪Mean
▪Median
▪Mode
▪Percentiles
▪Quartiles
The measure of central tendency
• We can calculate where the centre of a frequency distribution lies using three commonly used measures: the mean, the mode and the median.
• The mean is the sum of all scores divided by the number of scores.
The value of the mean can be influenced quite heavily by extreme
scores. (The mean provides a measure of central location)
• The median is the middle score when the scores are placed in
ascending order. It is not as influenced by extreme scores as the
mean.
• The mode is the score that occurs most frequently.
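As a quick illustration, here is a minimal Python sketch (standard library only, with made-up scores) of how the three measures respond differently to an extreme value:

```python
import statistics

scores = [2, 3, 3, 4, 5, 6, 25]  # hypothetical scores; 25 is an extreme value

print(statistics.mean(scores))    # ~6.86 - pulled upwards by the extreme score
print(statistics.median(scores))  # 4 - the middle score, barely affected
print(statistics.mode(scores))    # 3 - the most frequent score
```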
Business Scenario: Mean
• Suppose you want to run a campaign to advertise the racing bikes /
latest fashion trend at a location.

• The average age of people living in that area is 39 Years.

• Will you run the event?


Business Scenario: Mean
• The intuitive answer is No! Simply because the 'average' age of the people living in that area is 39 years, it will not make sense to sell them these items as products.
• After canceling the event you looked at the age data closely and found something like this:

Age: 25, 22, 23, 22, 19, 24, 26, 22, 21, 20, 20, 121, 24, 180, 16

• Most of the population is young! But there are 2 cases (121 and 180) which are abruptly different from all the other ages. Most probably this is a data error.
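A small sketch, using the age data above, showing how the two suspect values drag the mean to 39 while the median stays at a typical age:

```python
import statistics

ages = [25, 22, 23, 22, 19, 24, 26, 22, 21, 20, 20, 121, 24, 180, 16]

print(statistics.mean(ages))    # 39.0 - inflated by the two suspect values
print(statistics.median(ages))  # 22 - matches a typical resident

# Dropping the two likely data errors brings the mean back in line:
cleaned = [a for a in ages if a < 100]
print(round(statistics.mean(cleaned), 1))  # ~21.8
```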
Limitations of using Mean
• Mean is affected by outliers ( extreme values)
• And this is exactly why you cannot always trust the mean for approximating the average trend, and one must always doubt statements like the ones below.
• 'Average placement salary of students from our institute is 15 Lac per annum.'
• 'The mileage of our bike is 60 kmpl.'
• Outliers could be exploited in these examples to inflate the figures, to catch attention and influence decision making.
• Tip: This can introduce bias, since the mean value does not always represent the central tendency or the general pattern of the data. An alternative is to use the median value, which is a better indicator of central tendency.
Practical use- Sales Data: Use the mean for overall average sales but consider the median to understand typical sales
figures if there are outliers.
How and where to use median
• Recall the previous example

• Whenever a data set has extreme values, the median is the preferred
measure of central location. The median of a data set is the value in
the middle when the data items are arranged in ascending order.

• As a general rule, use median when you want to get the average of a
vector that includes a more uneven data set.
Median
• With an odd number of scores, the median is the single middle score; with an even number of scores, it is the average of the two middle scores.
The mean annual loan amount of the population is 13,50,000 INR. But this amount is higher than that taken by 80% of the population.

Customer   Loan Amount (in Rs)
1          8,00,000
2          8,50,000
3          9,00,000
4          9,50,000
5          32,50,000

The median amount is 9,00,000 INR: 50% of the population has a lower loan than this amount, and 50% has a higher loan. So the median represents the "average" concept in a better way.

However, the median is not an impeccable statistic. There are several things that we should consider when using it for communicating statistical information.

Practical use - Salary Analysis: Use the median to report typical salaries when there are a few extremely high or low salaries that could skew the mean.
Limitations of using Median
• The median does not convey information about the minimum and maximum values.
• The median may lead to a false impression of the data.
• The median is not good for planning (for example, totals cannot be recovered from it).
The mode
• The mode is the score that occurs most frequently in the data set.
• This is easy to spot in a frequency distribution because it will be the tallest bar.

Practical use - Customer Preferences: Use the mode to identify the most preferred product or service feature.
Dispersion of distribution



The dispersion in a distribution
• The deviance or error is the distance of each score from the mean.
• The sum of squared errors is the total amount of error in the mean. The
errors/deviances are squared before adding them up.
• The variance is the average squared distance of scores from the mean. It is the sum of squares divided by the number of scores (or by n − 1 when estimating from a sample). It tells us how widely dispersed scores are around the mean.
• The standard deviation is the square root of the variance. It is the variance
converted back to the original units of measurement of the scores used to
compute it. Large standard deviations relative to the mean suggest data are
widely spread around the mean, whereas small standard deviations
suggest data are closely packed around the mean.
• The range is the distance between the highest and lowest score.
• The interquartile range is the range of the middle 50% of the scores.
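A minimal sketch computing these dispersion measures in Python (standard library only; the scores are made up for illustration):

```python
import statistics

scores = [1, 3, 4, 6, 7, 10, 11, 13]  # hypothetical scores

mean = statistics.mean(scores)
ss = sum((x - mean) ** 2 for x in scores)   # sum of squared errors
variance = statistics.variance(scores)      # sample variance (divides by n - 1)
sd = statistics.stdev(scores)               # standard deviation = sqrt(variance)
data_range = max(scores) - min(scores)      # range: highest minus lowest score

q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
iqr = q3 - q1                               # range of the middle 50% of scores

print(ss, variance, sd, data_range, iqr)
```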
• outcome_i = b_0 + error_i
• Given that our model is defined by parameters, this amounts to saying
that we’re not interested in the parameter values in our sample, but we
care about the parameter values in the population.
• We can use the sample data to estimate what the population parameter
values are likely to be. That’s why we use the word ‘estimate’, because
when we calculate parameters based on sample data they are only
estimates of what the true parameter value is in the population.
The mean as a statistical model
▪ For example, if we took five of you and measured the number of
friends that you have, we might find the following data: 1, 2, 3, 3 and
4.
▪ Mean number of friends = (1 + 2 + 3 + 3 + 4)/5 = 2.6.
▪ But 2.6 friends?
▪ So the mean value is a hypothetical value: it is a model created to summarize the data, and there will be error in prediction.
▪ outcome_i = b_0 + error_i, where b_0 is the mean of the outcome.
• The important thing is that we can use the value of the mean (or any
parameter) computed in our sample to estimate the value in the
population (which is the value in which we’re interested)
• So like for first person has 1 friend:
• 1=2.6+error
• error=1-2.6=-1.6
• You might notice that all we have done here is calculate the deviance. The deviance is another word for error. A more general way to think of the deviance or error is by rearranging the equation into:
• deviance_i = outcome_i − model_i
• We know the accuracy or ‘fit’ of the model for a particular person having
1 friend, but we want to know the fit of the model overall.
• we can’t add deviances because some errors are positive and others
negative and so we’d get a total of zero.
• One way around this problem is to square the errors.

• SS=5.20
• This equation shows how something we have used before (the sum of
squares) can be used to assess the total error in any model (not just the
mean).
• Although, the sum of squared errors (SS) is a good measure of the
accuracy of our model, it depends upon the quantity of data that has
been collected – the more data points, the higher the SS.
• To estimate the mean error in the population we need to divide not by the
number of scores contributing to the total, but by the degrees of freedom
(df), which is the number of scores used to compute the total adjusted for
the fact that we’re trying to estimate the population value

• Our model is the mean, so let's replace the 'model' with the mean (x̄), and the 'outcome' with the letter x (to represent a score on the outcome):

variance = SS/df = Σ(x_i − x̄)² / (n − 1)

• The mean squared error is also known as the variance. For the friends data, variance = 5.20/4 = 1.30.


• Both measures give us an idea of how well a model fits the data: large values
relative to the model indicate a lack of fit.
• The standard deviation is the square root of the variance

• The sum of squares, variance and standard deviation are all


measures of the dispersion or spread of data around the
mean.
• A small standard deviation (relative to the value of the mean
itself) indicates that the data points are close to the mean.
• A large standard deviation (relative to the mean) indicates
that the data points are distant from the mean.
• A standard deviation of 0 would mean ?
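A sketch reproducing the friends example from this section, showing the chain from deviances to SS = 5.20 to the variance and standard deviation:

```python
friends = [1, 2, 3, 3, 4]  # number of friends for five people (from the text)

mean = sum(friends) / len(friends)        # b0 = 2.6, the model
deviances = [x - mean for x in friends]   # error of the model for each person
ss = sum(d ** 2 for d in deviances)       # sum of squared errors = 5.20
variance = ss / (len(friends) - 1)        # divide by df = n - 1, giving 1.30
sd = variance ** 0.5                      # ~1.14, back in the original units

print(mean, round(ss, 2), variance, round(sd, 2))
```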
68–95–99.7 rule
• In statistics, the 68–95–99.7
rule, also known as
the empirical rule, in
a normal distribution:
approximately 68%, 95%,
and 99.7% of the values lie
within one, two, and
three standard deviations of
the mean, respectively.
For example, suppose the height of students in a college has a mean of 5.5 ft and a standard deviation of 0.5 ft.
Ques: A normal distribution has a standard deviation of 10 and mean 70. Approximately what area is contained between 70 and 90?
Ques: For a normal distribution with mean 0 and standard deviation 1, what area is contained between −2 and 1?
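A sketch answering the two questions above with the normal CDF from scipy (assuming scipy is available):

```python
from scipy.stats import norm

# Area between 70 and 90 when the mean is 70 and the sd is 10
# (i.e., between 0 and +2 standard deviations):
print(norm.cdf(90, loc=70, scale=10) - norm.cdf(70, loc=70, scale=10))  # ~0.477

# Area between -2 and +1 on the standard normal (mean 0, sd 1):
print(norm.cdf(1) - norm.cdf(-2))  # ~0.819
```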
Percentile, Decile and Quartile
• Are used to identify the position of the observation in the data set.
• Percentile is used to identify the position of the student in the group.
• Denoted as Px , represent x percentage of data lie below that value.
• For Ex- P95 denotes the value below which 95 percentage of data lies.
• Position corresponds to Px ≈ x(n+1)/100
• Decile corresponds to special values of percentile that divides the data into
10 equal parts.
• Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = Median
• Third Quartile = 75th Percentile
For Instance:
Time between failures (in hours) of a wire cutter (to cut the dough into
cookies) used in cookie manufacturing oven is observed.
• Calculate the mean, median, and mode of time between failures of wire-cuts.
• The company would like to know by what time 10% (ten percentile or P10)
and 90% (ninety percentile or P90) of the wire-cuts will fail.
2 22 32 39 46 56 76 79 88 93

3 24 33 44 46 66 77 79 89 99

5 24 34 45 47 67 77 86 89 99

9 26 37 45 55 67 78 86 89 99

21 31 39 46 56 75 78 87 90 102
Source: Business Analytics course by U Dinesh Kumar
Solution:
• Mean = 57.64, median = 56, and mode = 46, 89 and 99
• Note that the data in Table is arranged in increasing order in columns. The position
of P10 = 10 × (51)/100 = 5.1
• Value at position 5.1 is approximated as 21 + 0.1 × (value at 6th position — value at
5th position) = 21 + 0.1(1) = 21.1. That is, by 21 hours, 10% of the wire-cuts will fail.
In asset management (and reliability theory), this value is called P10 life.
• Position corresponding to P90 = 90 × 51/100 = 45.9 The value at position 45 is 90
and the value at position 45.9 is 90 + 0.9 (value at 46th position — value at 45th
position) = 90 + 0.9 × (3) = 92.7
• That is, 90% of the wire-cuts will fail by 92.7 hours.
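A sketch verifying these percentiles with numpy; the "weibull" method implements the same x(n+1)/100 position rule used above (available in numpy ≥ 1.22):

```python
import numpy as np

tbf = [2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
       3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
       5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
       9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
       21, 31, 39, 46, 56, 75, 78, 87, 90, 102]

print(np.percentile(tbf, 10, method="weibull"))  # 21.1 hours (the P10 life)
print(np.percentile(tbf, 90, method="weibull"))  # 92.7 hours
```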
Spread of Data Scores
• To calculate the range while excluding values at the extremes of the distribution, one convention is to cut off the top and bottom 25% of scores and calculate the range of the middle 50% of scores, known as the interquartile range.
Box plot
• A box plot is a graphical summary of data that is based on a five-number
summary.
• A key to the development of a box plot is the computation of the median and
the quartiles Q1 and Q3
• Box plots provide another way to identify outliers



• Example (comparing the rating graphs of two lecturers): the ratings for lecturer 1 were consistently close to
the mean rating, indicating that the mean is a good representation of the
observed data – it is a good fit. The ratings for lecturer 2, however, were
more spread out from the mean: for some lectures, she received very high
ratings, and for others her ratings were terrible. Therefore, the mean is not
such a good representation of the observed scores – it is a poor fit.
Practice
Problems



Practice Problem: CGPA sheet
• The cumulative grade point averages (CGPA) of 40 students are given in the sheet.
• Calculate the mean, median and mode. Calculate the standard deviation.
• Calculate the 90th and 95th percentiles of CGPA.
• Calculate the interquartile range (IQR).
• The Dean of students believes that the CGPA distribution is right-tailed; is there evidence for this?
• Create a histogram of the data.
Sol.
• a) Mean = 2.26, median = 1.98, mode = 1.48, and standard deviation = 0.74.
• b) Arrange the data of table in increasing order.
• The position of P90 = 90 × (41)/100 = 36.90.
• Value at 36th position is 3.36.
• Value at position 36.90 is approximated as 3.36 + 0.9 × (value at 37th position – value at 36th
position) = 3.36 + 0.9(0.07) = 3.423
• Similarly, P95 = 95 × 41/100 = 38.95. The value at position 38 is 3.67 and value at position
38.95 is 3.67 + 0.95 × (0.07) = 3.74
• c) The position of P25 = 25 × 41/100 = 10.25, so P25 = 1.66 + 0.25 (0.02) ≈ 1.67
• P75 = 2.84 + 0.75 (value at 31st position – value at 30th position) = 2.84 + 0.75 (0.02) = 2.85
• Thus the interquartile range IQR = Q3 – Q1 = 2.85 – 1.67 = 1.18
• The outlier fences are (Q1 – 1.5 IQR, Q3 + 1.5 IQR) = (-0.0995, 4.6277)
• d) Histogram (6 bins)
Practice Problem: Bank sheet
• The Bank captured KYC data, about the customer applying for home
loan and home improvement loan application which is given in sheet.
• Develop appropriate charts for the variables and discuss the insights from them.
• Calculate a descriptive statistics summary of monthly salary and balance in savings account.
• Use box plots to check for outliers among the variables loan amount requested, down payment, and EMI.
• Which continuous variable is highly skewed?
Histogram
• Number of bins: N = (Xmax − Xmin) / (width of bin interval).
• Sturges' rule assumes the sample size is covered by powers of 2 (each additional bin doubles the number of data points covered).
• Sturges (1926) proposed: number of bins N = 1 + 3.322 × log10(n).
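A sketch of Sturges' rule, rounding to the nearest whole number of bins:

```python
import math

def sturges_bins(n: int) -> int:
    """Suggested number of histogram bins for a sample of size n (Sturges, 1926)."""
    return round(1 + 3.322 * math.log10(n))

print(sturges_bins(40))   # 6 bins, matching the CGPA exercise above
print(sturges_bins(100))  # 8 bins
```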
Coefficient of variation (CV)
• If the absolute dispersion is defined as the standard deviation, and the
average is the mean, the relative dispersion is called the coefficient of
variation (CV) or coefficient of dispersion.

• The coefficient of variation expresses the standard deviation as a fraction of


the mean. We can use it to compare variation in different data sets of
different scales or units.

• Coefficient of variation = (Standard Deviation / Mean) × 100.
• In symbols: CV = (σ/μ) × 100%.
• However, it has appropriate meaning only if the data achieve ratio scale.

• The coefficient of variation can be plotted as a graph to compare data.

• A CV exceeding say about 30 percent is often indicative of problems in the


data or that the experiment is out of control.

• Variates with a mean less than unity (close to zero) also produce spurious results: the coefficient of variation becomes very large and often meaningless.
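A sketch comparing the relative variation of two hypothetical suppliers' delivery times with the CV:

```python
import statistics

def cv_percent(data):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

delivery_days_a = [5, 6, 5, 7, 6, 5]    # supplier A (hypothetical)
delivery_days_b = [2, 9, 4, 12, 3, 8]   # supplier B (hypothetical)

print(round(cv_percent(delivery_days_a)))  # ~14% - consistent delivery times
print(round(cv_percent(delivery_days_b)))  # ~62% - much more variable
```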
Significance of the coefficient of variation:
• Relative Measure of Variability: The CV provides a relative measure of variability that allows for the
comparison of dispersion between data sets. It is particularly valuable when dealing with data sets that have
different units or scales. The CV enables meaningful comparisons between data sets with different ranges or
magnitudes by expressing the variability as a percentage of the mean.

• Comparison of Variability: The CV is useful when comparing the variability of different groups or populations.
For example, it can be employed to assess the dispersion of financial returns of different investment
portfolios, the volatility of stock prices across companies, or the variation in test scores among students in
various schools.

• Decision Making: The CV can be helpful in decision-making processes. A lower CV may indicate greater
consistency or stability in a data set in certain situations, making it more desirable. For instance, if you are
comparing the CV of two suppliers' delivery times, a lower CV would imply that the delivery times are more
consistent, which might be advantageous for supply chain planning.
Applications
Finance:
• Investment Comparison: Investors can use CV to compare the risk (volatility) of different
investments. A lower CV indicates a more stable investment relative to its expected return.
Example: Comparing two stocks to decide which one offers a more stable return relative to its
volatility.
Quality Control:
• Process Variation: Manufacturers can use CV to compare the consistency of different production
processes.
Example: Comparing the variability in the diameter of produced parts from two different machines.
Healthcare:
• Medical Measurements: CV can be used to compare the reliability of different diagnostic tests or
instruments.
Example: Comparing the variability in blood pressure readings from two different blood pressure
monitors.
The Coefficient of Variation
• To get a feeling for the coefficient of variation, let's compare a few data sets: which set has the highest relative variation?
The Coefficient of Variation
• Because the coefficient of variation
has no units, we can use it to
compare different kinds of data sets
and find out which data set is most
variable in this relative sense.

• The coefficient of variation describes


the standard deviation as a fraction
of the mean, giving you a standard
measure of variability.
Sampling and Estimation

INFERENTIAL STATISTICS
"To clarify, add data." —Edward R. Tufte

• Sampling is necessary when it is difficult or expensive to collect data on the entire population. The inference about the population is made based on the sample that was collected; an incorrect sample may lead to incorrect inferences about the population.
SAMPLING
• The process of identifying a subset from a population of
elements (aka observations or cases) is called sampling process
or simply sampling.
• The following steps are used in any sampling process:
• Identification of target population that is important for a given
problem under study
• Decide the sampling frame
• Determine the sample size
• Choose the sampling method: probabilistic or non-probabilistic sampling.
Probabilistic Sampling
A good sample is representative of entire population

• In a probability sampling, the individual observations in the sample are selected


according to a probability distribution
• Random Sampling: Random sampling is ideal when the population is
homogeneous. In random sampling, every subject in the population has equal
probability of selection in the sample.
Patient:  1   2   3   4   5   6   7   8   9  10
LoS:      4  20  12  13  15  17  16  20   9  17
• Stratified Sampling: The population can be divided into mutually exclusive groups
using some factor (for example, age, gender, marital status, income, geographical
regions, etc.). The groups thus formed are called strata. It is important that the
groups are mutually exclusive and exhaustive of the population.
• Efficacy of a drug among different age groups. Age group can be classified into categories such
as less than 40, between 41 and 60, and over 60 years of age.
Probabilistic Sampling
• Cluster Sampling: The population is divided into mutually exclusive clusters. For example, assume
that a researcher is interested in analyzing life of smart phone batteries from a specific
manufacturer. The manufacturer may have different models (each model in this case will be a
cluster).
• Note that stratified sampling and cluster sampling are similar. The major difference is that in a
stratified sample, all strata will be represented in the sample, whereas in a cluster sampling, not
all clusters will be represented.
• Cluster sampling is used when the clusters are large in number. For example, assume that we are interested in the impact of demonetization on Indian industry. There are a large number of industrial sectors. Analyzing the impact on all clusters will be expensive and time consuming, so in such cases a few clusters (such as healthcare and manufacturing) may be used for the study.
• Bootstrap Aggregating (known as Bagging) is sampling with replacement used in machine
learning algorithms, especially the random forest algorithm (Breiman, 1996). In Bagging, several
samples (with replacement) are generated from the population and analytical models are
developed using each sample
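A sketch of the resampling step behind bagging, drawing samples of size n with replacement, using the LoS values from the earlier slide:

```python
import random

population = [4, 20, 12, 13, 15, 17, 16, 20, 9, 17]  # LoS values from the slide

random.seed(42)  # fixed seed so the sketch is reproducible
samples = [random.choices(population, k=len(population)) for _ in range(3)]

for sample in samples:
    # Each bootstrap sample repeats some observations and omits others;
    # in bagging, a separate model would be fitted to each sample.
    print(sample, sum(sample) / len(sample))
```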
Non-Probability Sampling
• In a non-probability sampling, the selection of sample units from the population does not follow
any probability distribution. Sample units are selected based on convenience and/or on
voluntary basis/ judgement/quota.
• Assume that a data scientist is interested in studying attrition and the factors influencing it. For this study, they may collect data from friends and colleagues, which may not be a true representation of the population.

• Convenience Sampling: a non-probability technique in which sample units are selected based on ease of access rather than according to a probability distribution.

• Voluntary Sampling: the data is collected from people who volunteer for such data collection.
For example, customer feedbacks in many contexts fall under this sampling procedure
SAMPLING DISTRIBUTION
• Sampling distribution refers to the probability distribution of a statistic such as
sample mean and sample standard deviation computed from several random
samples of same size.

• Understanding the sampling distribution is important for hypothesis testing.


• For example, consider a population of 10 observations. We can derive
several samples of various sizes from this population of 10 units.
S. No.:  1   2   3   4   5   6   7   8   9  10
Value:   5  10  15  20  25  30  35  40  45  50
Translation to the Standardized Normal Distribution

• Translate from X to the standardized normal (the “Z” distribution) by


subtracting the mean of X and dividing by its standard deviation:

Z = (X − μ) / σ

where:
X = value of the random variable with which we are concerned
μ = mean of the distribution of this random variable
σ = standard deviation of the distribution
Z = number of standard deviations from X to the mean of the distribution

The Z distribution always has mean = 0 and standard deviation = 1.


For a normal distribution, we usually refer
to the number of standard deviations we
must move away from the mean to cover a
particular probability as "z", or the "z-
value."

For any value of z, there is a specific


probability of being within z standard
deviations of the mean.

For example, for a z-value of 1, the


probability of being within z standard
deviations of the mean is about 68%, the
probability of being between -1 and +1 on a
standard normal curve.
Analogy

A good way to think about what the z-statistic


can do is this analogy: if a giant tells you his
house is four steps to the north, and you want
to know how many steps it will take you to get
there, what else do you need to know?

You would need to know how much bigger his


stride is than yours. Four steps could be a
really long way.
The same is true of a standard
deviation. To know how far you
must go from the mean to cover a
certain area under the curve, you
have to know the standard
deviation of the distribution.
Standard Normal Distribution
AREA UNDER THE NORMAL CURVE
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.5 .69146 .69497 .69847 .70194 .70540 .70884 .71226 .71566 .71904 .72240
0.6 .72575 .72907 .73237 .73536 .73891 .74215 .74537 .74857 .75175 .75490
0.7 .75804 .76115 .76424 .76730 .77035 .77337 .77637 .77935 .78230 .78524
0.8 .78814 .79103 .79389 .79673 .79955 .80234 .80511 .80785 .81057 .81327
0.9 .81594 .81859 .82121 .82381 .82639 .82894 .83147 .83398 .83646 .83891
1.0 .84134 .84375 .84614 .84849 .85083 .85314 .85543 .85769 .85993 .86214
1.1 .86433 .86650 .86864 .87076 .87286 .87493 .87698 .87900 .88100 .88298
1.2 .88493 .88686 .88877 .89065 .89251 .89435 .89617 .89796 .89973 .90147
1.3 .90320 .90490 .90658 .90824 .90988 .91149 .91309 .91466 .91621 .91774
1.4 .91924 .92073 .92220 .92364 .92507 .92647 .92785 .92922 .93056 .93189
1.5 .93319 .93448 .93574 .93699 .93822 .93943 .94062 .94179 .94295 .94408
Using the Standard Normal Table
• For μ = 100, σ = 15, find the probability that X is less than 130

Z = (X − μ)/σ = (130 − 100)/15 = 30/15 = 2 standard deviations

[Figure: normal distribution showing the relationship between Z values and X values]
Using the Standard Normal Table

AREA UNDER THE NORMAL CURVE


Z 0.00 0.01 0.02 0.03
1.8 0.96407 0.96485 0.96562 0.96638
1.9 0.97128 0.97193 0.97257 0.97320
2.0 0.97725 0.97784 0.97831 0.97882
2.1 0.98214 0.98257 0.98300 0.98341
2.2 0.98610 0.98645 0.98679 0.98713

For Z = 2.00
P(X < 130) = P(Z < 2.00) = 0.97725
P(X > 130) = 1 − P(X ≤ 130) = 1 − P(Z ≤ 2)
= 1 − 0.97725 = 0.02275
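The printed table lookup can also be replaced by the normal CDF; a sketch with scipy:

```python
from scipy.stats import norm

mu, sigma = 100, 15
z = (130 - mu) / sigma     # 2.0 standard deviations

print(norm.cdf(z))         # P(X < 130) = P(Z < 2) ~ 0.97725
print(1 - norm.cdf(z))     # P(X > 130) ~ 0.02275
```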
Haynes Construction Company

• Builds three- and four-unit apartment


buildings
• Total construction time follows a
normal distribution
• For triplexes, μ = 100 days
and σ = 20 days
• Contract calls for completion in 125
days
• Late completion will incur a severe
penalty fee
• Probability of completing in 125 days?
Haynes Construction Company
• Compute Z:

Z = (X − μ)/σ = (125 − 100)/20 = 25/20 = 1.25

• From the table, for Z = 1.25, area = 0.89435.
• The probability is about 0.89 that Haynes will not violate the contract.

[Figure 2.10: Normal distribution for Haynes Construction]
Haynes Construction Company
• If finished in 75 days or less, bonus = $5,000
• Probability of bonus?

Z = (X − μ)/σ = (75 − 100)/20 = −25/20 = −1.25

• Because the distribution is symmetrical, this is equivalent to Z = 1.25, so the area to the right of Z = −1.25 is 0.89435.
• P(X < 75) = P(X > 125) = 1.0 − P(X ≤ 125) = 1.0 − 0.89435 = 0.10565
• The probability of completing the contract in 75 days or less is about 11%.

[Figure 2.11: Probability that Haynes will receive the bonus by finishing in 75 days or less]
Haynes Construction Company
• Probability of completing between 110 and 125 days?
• P(110 < X < 125) = P(X ≤ 125) − P(X < 110)
• P(X ≤ 125) = 0.89435

Z = (X − μ)/σ = (110 − 100)/20 = 10/20 = 0.5

• For Z = 0.5, area = 0.69146.
• P(110 < X < 125) = 0.89435 − 0.69146 = 0.20289
• The probability of completing between 110 and 125 days is about 20%.

[Figure 2.12: Probability that Haynes will complete in 110 to 125 days]
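A sketch verifying all three Haynes probabilities at once with scipy:

```python
from scipy.stats import norm

mu, sigma = 100, 20  # mean and sd of construction time, in days

print(norm.cdf(125, mu, sigma))   # finishes within 125 days: ~0.89435
print(norm.cdf(75, mu, sigma))    # earns the bonus (<= 75 days): ~0.10565
print(norm.cdf(125, mu, sigma) - norm.cdf(110, mu, sigma))  # 110-125 days: ~0.20289
```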
ESTIMATION FOR MEAN OF THE POPULATION
• From the central limit theorem, we know that the sampling distribution of the mean follows a normal distribution with mean μ and standard deviation σ/√n (the standard error). The standard normal variate of the sampling distribution of the mean is then given by

Z = (X̄ − μ) / (σ/√n)
• Point Estimate: Point estimate of a population parameter is the single value (or specific value)
calculated from sample (thus called statistic). Sample mean and variance are estimates of
population mean and variance. Similarly, sample proportion is an estimate of population proportion.

• Interval Estimate: Instead of a specific value of the parameter, in an interval estimate the parameter
is said to lie in an interval (say between points a and b) with certain probability (or confidence).
Confidence Interval

"Confidence comes not from always being right but from not fearing to be wrong." —Peter McIntyre
Point and Interval Estimates
• A point estimate is a single number.
• A confidence interval provides additional information
about the variability of the estimate.

[Diagram: a point estimate with lower and upper confidence limits; the distance between the limits is the width of the confidence interval]

➢ Confidence intervals are constructed using the point estimate, standard error, and a chosen confidence level.
➢ For example, a 95% confidence interval for the population mean provides a range of values within which we are
95% confident the true population mean lies.
Confidence Intervals
• When there is an uncertainty around measuring the value of an
important population parameter, it is advisable to find the range in which
the value of the parameter is likely to fall rather than predicting a single
estimate (point estimate).

• Confidence interval is the range in which the value of a population


parameter is likely to lie with certain probability

• The objective of confidence interval is to provide both location and


precision of population parameters.
Significance level- α, confidence level (1-)
• Confidence level, usually written as (1 − α)
100%, on the interval estimate of a
population parameter is the probability
that the interval estimate will contain the
true population parameter. When α = 0.05,
95% is the confidence level and 0.95 is the
probability that the interval estimate will
have the population parameter.
• The value of α is called significance.

Confidence interval for the population mean:

X̄ − Z_{α/2} · σ/√n  ≤  μ  ≤  X̄ + Z_{α/2} · σ/√n

Intervals and Level of Confidence
• The sampling distribution of the mean has area α/2 in each tail and 1 − α in the middle.
• Intervals extend from X̄ − Z_{α/2} · σ/√n to X̄ + Z_{α/2} · σ/√n.
• (1 − α)·100% of intervals constructed this way contain μ; α·100% do not.
Confidence Intervals
• For an error of, say, 5%, it's important to emphasize: we are not saying that 95% of the time our sample mean is the population mean; we are saying that 95% of the time a range extending about two standard errors on either side of the sample mean contains the population mean.
What does this insight mean for us as managers? When we set a confidence level of 95%, we
are agreeing to an approach that 1 out of 20 times will give us an interval that does not
contain the true population mean. If we aren't comfortable with those odds, we should raise
the confidence level.

If we increase the confidence level to 98%, we have only a 1 out of


50 chance of obtaining an interval that does not contain the true
population mean. However, this higher confidence comes at a
cost. If we keep the same sample size, then the confidence
interval will widen, thereby decreasing the accuracy of our
estimate. Alternatively, to keep the same interval width, we can
increase our sample size.
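A sketch of this trade-off: holding the sample fixed and raising the confidence level widens the interval (the sample summary is made up for illustration):

```python
from scipy.stats import norm

x_bar, sigma, n = 100.0, 15.0, 64    # hypothetical sample summary
se = sigma / n ** 0.5                # standard error of the mean

for conf in (0.90, 0.95, 0.98):
    z = norm.ppf(1 - (1 - conf) / 2)            # Z_{alpha/2} for this level
    print(conf, (round(x_bar - z * se, 2), round(x_bar + z * se, 2)))
```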
Interval estimates and confidence intervals
➢ The confidence interval is the range of the estimate we are making.

• For Example: If we report that we are 90% confident that the mean of the population
of income of people in a certain community will lie between $8000 and $24000,
then the range $8000-$24000 is our confidence interval.

• Often, however, we express the confidence interval in standard errors rather than in numerical values. Thus, we will often express confidence intervals like:

• x̄ + 1.64 σ_x̄ = upper limit of the confidence interval
• x̄ − 1.64 σ_x̄ = lower limit of the confidence interval


Confidence Intervals
• Population Mean: σ known / σ unknown
• Population Proportion
Confidence Interval for μ
(σ Known)
• Assumptions:
• Population standard deviation σ is known.
• Population is normally distributed.
• If population is not normal, use large sample (n > 30).

• Confidence interval estimate: the formula for a confidence interval for a population mean μ when the population standard deviation is known is:

X̄ ± Z_{α/2} · σ/√n

➢ Here, X̄ is the sample mean, σ is the population standard deviation, and n is the sample size.
➢ The term Z_{α/2} · (σ/√n) is the margin of error of the estimate.
Problem 1

• A sample of 100 patients was chosen to estimate the length of stay


(LoS) at a hospital.

• The sample mean was 4.5 days and the population standard deviation
was known to be 1.2 days.
(a) Calculate the 95% confidence interval for the population mean.

(b) What is the probability that the population mean is greater than 4.73 days?
Solution
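The worked solution did not survive on this slide; here is a sketch of the computation under the stated numbers, interpreting (b) via the sampling distribution of the mean:

```python
from scipy.stats import norm

x_bar, sigma, n = 4.5, 1.2, 100
se = sigma / n ** 0.5                     # standard error = 0.12 days

# (a) 95% confidence interval for the population mean LoS:
z = norm.ppf(0.975)                       # ~1.96
print(x_bar - z * se, x_bar + z * se)     # ~ (4.26, 4.74) days

# (b) probability that the mean exceeds 4.73 days:
print(1 - norm.cdf((4.73 - x_bar) / se))  # ~0.028
```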
Problem 2

Suppose we have a population of weights of bags (rice or flour) filled in

a factory. The population standard deviation (sigma) is known to be 2.5

kg. We take a sample of 100 bags and find their average weight (X_bar)

to be 50 kg. We want to estimate a 95% confidence interval for the true

average weight of bags filled in the factory.


• First, we need to find the value of Z_(alpha/2).
• For a 95% confidence interval, as previously mentioned, Z_(alpha/2) is
approximately 1.96.
Next, we plug the values into the confidence interval formula X̄ ± Z_{α/2} · (σ/√n):
▪ This becomes: 50 ± 1.96 × (2.5 / √100)
▪ which simplifies to: 50 ± 1.96 × (2.5 / 10)
▪ = 50 ± 1.96 × 0.25
▪ Finally, we calculate the confidence interval as: 50 ± 0.49
• So, the 95% confidence interval for the true average weight of the bags is (49.51

kg, 50.49 kg). We are 95% confident that the true average weight of the bags filled

in the factory is between 49.51 kg and 50.49 kg.

• Keep in mind that this interpretation is based on the long-term behavior of the

method. That is, if we repeatedly took samples and calculated confidence intervals

in this way, about 95% of them would capture the true average weight.
Confidence Intervals
• Population Mean: σ known / σ unknown
• Population Proportion
Do You Ever Truly Know σ?
• Probably not!

• In virtually all real-world business situations, σ is not known.

• If there is a situation where σ is known, then µ is also known (since to


calculate σ you need to know µ).

• If you truly know µ there would be no need to gather a sample to


estimate it.
The case when the population standard is
unknown
• If the population standard deviation is not known, it's more

appropriate to use the t-distribution instead of the z-distribution,

especially for small sample sizes.


We assumed in our confidence limit calculations that the sample size was
at least 30. What if it isn't? What if we have only a small sample? Let's
consider a different survey, one that concerns a delicate matter.
• The business manager of a large ocean liner, asks for our help. She wants
us to find out the value of her guests' belongings. She needs this value
to determine the correct insurance protection in case guest belongings
disappear from their cabins, are destroyed in a fire, or sink with the ship.

• She has no idea how valuable her guests' belongings are,


but she feels uneasy asking them for this information.

• She is willing to ask only 16 guests to estimate the


total value of the belongings in their cabins. From this
sample, we need to prepare an estimate.
• First, with a small sample, the consequences of the Central Limit Theorem
are not assured, so we cannot be sure that the sample means follow a
normal distribution.
• Second, with a small sample, we can't be sure that the sample standard
deviation is a good estimate of the population standard deviation.
• Due to these additional uncertainties, we cannot use z-values to construct
confidence intervals. Using a z-value would overstate our confidence in
our estimate.
• It depends: if we don't know anything about the underlying

population, we cannot create a confidence interval with fewer than

30 data points. However, if the underlying population is normally

distributed — or even roughly normally distributed — we can use a

confidence interval to estimate the population mean.

In practice, as long as we are sure the underlying population is not

highly skewed or extremely bimodal, we can construct a confidence

interval, even when we have a small sample. However, we do need

to modify our approach slightly.


• To estimate the population mean with a small sample, we use a t-
distribution, which was discovered in the early 20th century at the
Guinness Brewing Company in Ireland.

• If the population standard deviation σ is unknown, we can substitute


the sample standard deviation, S.
• This introduces extra uncertainty since S is variable from sample to
sample.
• So we use the t distribution instead of the normal distribution.
A t-distribution gives us t-values in much the
same way as a normal distribution gives us z-
values. What is the difference between the
normal distribution and the t-distribution?

A t-distribution looks similar to a normal


distribution, but is not as tall in the center and
has thicker tails, because it is more likely than
the normal distribution to have values fall
farther away from the mean.
Therefore, the normal distribution's "rules of thumb"
for 68% and 95% probabilities no longer hold. For
example, we must go more than 2 standard
deviations on either side of the mean to capture 95%
of the probability for a t-distribution.
Thus, to achieve the same level of confidence, a
confidence interval based on a t-distribution will be
wider than one based on a normal distribution. This
reinforces our intuition: we have less certainty about
our estimate with a smaller sample, so we need a
wider interval to achieve a given level of confidence.
The t-distribution also changes with the sample size: the smaller the sample size n, the shorter the height and the thicker the tails of the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence.
On the other hand, as the sample size increases, the shape of the t-distribution becomes more
and more like the shape of a normal distribution. Once we reach a sample size of 30, the t-
distribution becomes virtually identical to the z-distribution, so t-values and z-values can be
used interchangeably.

Incidentally, we can use the t-distribution even for sample sizes larger than 30. However,
most people use the z-distribution for larger samples, partially out of habit and partially
because it's easier, since the z-value doesn't vary based on the sample size.
Student's t Distribution
Note: t → Z as n increases.

[Figure: the standard normal curve (t with df = ∞) overlaid with t (df = 13) and t (df = 5), centred at 0]

t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal.
Confidence Interval for μ (σ Unknown)
• Assumptions:
• Population standard deviation is unknown.
• Population is normally distributed.
• Use Student's t distribution.
• Confidence interval estimate:

X̄ ± t_{α/2} · S/√n

(where t_{α/2} is the critical value of the t distribution with n − 1 degrees of freedom and an area of α/2 in each tail)
• The t is a family of distributions.
• The tα/2 value depends on degrees of freedom (d.f.).
• Number of observations that are free to vary after sample mean has been calculated.
d.f. = n - 1
Student's t Table

Upper Tail Area
df     .10    .05    .025
1    3.078  6.314  12.706
2    1.886  2.920   4.303
3    1.638  2.353   3.182

The body of the table contains t values, not probabilities.

Example: let n = 3, so df = n − 1 = 2. For α = 0.10 (α/2 = 0.05 in each tail), the critical value is t = 2.920.
Selected t distribution values, with comparison to the Z value

Confidence   t          t          t          Z
Level        (10 d.f.)  (20 d.f.)  (30 d.f.)  (∞ d.f.)
0.80         1.372      1.325      1.310      1.28
0.90         1.812      1.725      1.697      1.645
0.95         2.228      2.086      2.042      1.96
0.99         3.169      2.845      2.750      2.58

Note: t → Z as n increases.
Suppose you have a sample of 20 observations and want to calculate a 95% confidence interval for the population mean. The sample mean is 65, and the sample standard deviation is 8.

1. Determine the degrees of freedom (df). In this case, df = n − 1 = 20 − 1 = 19.

2. Find the critical t-value corresponding to a 95% confidence level and 19 degrees of freedom. You can refer to the t-table: the critical t-value is 2.093.

3. Calculate the margin of error:
   Margin of error = t-value × (standard deviation / √(sample size)) = 2.093 × (8 / √20) ≈ 3.74

4. Calculate the confidence interval:
   Confidence interval = sample mean ± margin of error = 65 ± 3.74

Therefore, the 95% confidence interval for the population mean is (61.26, 68.74).
We are 95% confident that the true population mean falls within the calculated confidence interval of (61.26, 68.74). This means that if we were to repeatedly sample from the population and calculate confidence intervals in the same way, approximately 95% of those intervals would contain the true population mean.
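A sketch verifying the interval with scipy's t distribution:

```python
from scipy.stats import t

x_bar, s, n = 65.0, 8.0, 20
se = s / n ** 0.5                    # estimated standard error

t_crit = t.ppf(0.975, df=n - 1)      # ~2.093 for 19 degrees of freedom
print(x_bar - t_crit * se, x_bar + t_crit * se)  # ~ (61.26, 68.74)
```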
Returning to the good ship example, let's determine an estimate of the average value of
passengers' belongings. The manager samples 16 guests, and reports that they have an
average of $10,200 worth of clothing, jewelry, and personal effects in their cabins. From her
survey numbers, we calculate a standard deviation of $4,800.

We need to double check that the distribution isn't too skewed,


which we might expect, since some of the passengers are quite
wealthy. The manager explains that the insurance policy has a
limited liability clause that limits a passenger's maximum claim to
$20,000.

We sketch a graph of the 16 values that confirms that the


distribution is not too asymmetric, so we feel comfortable
using the t-distribution.
• Since we have a sample of 16 passengers, there are 15 degrees of
freedom. The Excel function =TINV(0.05,15) tells us that the
appropriate t-value is 2.131.

• we can report that we are 95% confident that


the average value of passengers' belongings is
between $7,643 and $12,757.
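A sketch of the same calculation in Python; t.ppf(0.975, 15) is the two-sided counterpart of Excel's =TINV(0.05,15):

```python
from scipy.stats import t

x_bar, s, n = 10_200.0, 4_800.0, 16
se = s / n ** 0.5                    # 4800 / 4 = 1200

t_crit = t.ppf(0.975, df=n - 1)      # ~2.131, matching =TINV(0.05,15)
print(x_bar - t_crit * se, x_bar + t_crit * se)  # ~ (7643, 12757) dollars
```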
Hypothesis Testing

“Beware of the problem of testing too many hypotheses, the more you torture the data, the more likely they are to confess,
but confessions obtained under duress may not be admissible in the court of scientific opinion.” —Stephen M Stigler
What is a hypothesis?
• Hypothesis testing begins with an assumption, called a hypothesis that
we make about a population parameter.
A hypothesis is a claim (assertion) about a population parameter:
population mean:
Example: The mean monthly cell phone bill in this city is μ = 800 INR

Example: A new sales force bonus plan is developed in an attempt to


increase sales.
• Alternative Hypothesis: The new bonus plan increases sales.
• Null Hypothesis: The new bonus plan does not increase sales.
What is Hypothesis Testing
❖Hypothesis testing is a statistical method used to make inferences or draw
conclusions about a population based on sample data. It involves formulating two
competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha),
and analyzing the sample data to determine which hypothesis is supported by the
evidence.

• The null hypothesis (H0) is a statement of no effect or no difference. It represents


the status quo or the assumption that there is no meaningful relationship or effect
in the population.

• The alternative hypothesis (Ha) is the statement that contradicts or opposes the
null hypothesis. It represents the claim or belief that there is a specific effect or
relationship in the population.
Null and Alternative Hypotheses
• The Null and Alternative Hypotheses are mutually exclusive. Only one
of them can be true.

• The Null and Alternative Hypotheses are collectively exhaustive. They


are stated to include all possibilities

• The Null Hypothesis is assumed to be true.

• The burden of proof falls on the Alternative Hypothesis. Anything we want to prove goes in the alternative hypothesis.
Example:
➢Null Hypothesis (H0): The defendant is presumed innocent until proven guilty.
➢ This is the default assumption or the null hypothesis, similar to assuming no effect or no
difference in statistical hypothesis testing.

➢Alternative Hypothesis (Ha): The alternative hypothesis represents the claim or


belief that the defendant is guilty.
➢The prosecution presents evidence to support the alternative hypothesis, which is equivalent
to providing evidence against the null hypothesis in statistical hypothesis testing.
The Null Hypothesis, H0
• States the claim or assertion to be tested.
Example: The mean diameter of a manufactured bolt is 30 mm (H0: μ = 30).
• It is always about a population parameter, not about a sample statistic:
H0: μ = 30 (correct), not H0: X̄ = 30.
• Let us look at examples:
• A realtor claims that the average price of an apartment in a locality in a metropolitan city is more than INR 50 lakhs.
• The hypotheses in this scenario will be: null hypothesis H0: μ ≤ 50 lakhs, and alternative hypothesis H1: μ > 50 lakhs.
• One more example: the average age of active global Instagram users is 29 years.
• The hypotheses to test this claim will be: null hypothesis H0: μ = 29 years, and alternative hypothesis H1: μ ≠ 29 years.
In general, a hypothesis test about the value of a population mean μ must take one of the following three forms (where μ0 denotes the hypothesized value):
• Lower-tail test: H0: μ ≥ μ0 versus Ha: μ < μ0
• Upper-tail test: H0: μ ≤ μ0 versus Ha: μ > μ0
• Two-tailed test: H0: μ = μ0 versus Ha: μ ≠ μ0
Null and Alternative Hypotheses

• Example: Metro EMS


• A major west coast city provides one of the most comprehensive
emergency medical services in the world. Operating in a multiple
hospital system with approximately 20 mobile medical units, the
service goal is to respond to medical emergencies with a mean
time of 12 minutes or less.
• The director of medical services wants to formulate a hypothesis
test that could use a sample of emergency response times to
determine whether or not the service goal of 12 minutes or less
is being achieved.
• H0 : μ ≤ 12 (the emergency service is meeting the response goal; no follow-up action is necessary)
• Ha : μ > 12 (the emergency service is not meeting the response goal; appropriate follow-up action is necessary)
• where μ = mean response time for the population of medical emergency requests
Rejection and Nonrejection Regions
• Statistical outcomes that result in the rejection of the null hypothesis lie in
what is termed the rejection region. Statistical outcomes that fail to result in
the rejection of the null hypothesis lie in what is termed the nonrejection
region.
• As an example, consider a flour-packaging manufacturer.
• The null hypothesis is that the average fill for the population of packages is 40 ounces.
• Suppose a sample of 100 such packages is randomly selected, and a sample
mean of 40.01 ounces is obtained.
• Because this mean is not 40 ounces, should the business researcher decide
to reject the null hypothesis?
• It makes sense that in taking random samples from a population with a
mean of 40 ounces not all sample means will equal 40 ounces. In fact, the
central limit theorem states that for large sample sizes, sample means are
normally distributed around the population mean.
• Thus, even when the population mean is 40 ounces, a sample mean might
still be 40.1, 38.6, or even 44.2.
• However, suppose a sample mean of 50 ounces is obtained for 100
packages. This sample mean may be so far from what is reasonable to
expect for a population with a mean of 40 ounces that the decision is
made to reject the null hypothesis.
• This prompts the question: when is the sample mean so far away from the population mean that the null hypothesis is rejected?
• For a two-tailed test, there is a critical value in each end (tail) of the distribution. The regions beyond the critical values, in each direction, are the rejection regions. Any sample mean that falls in a rejection region will lead the business researcher to reject the null hypothesis, as in the sketch below.
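A minimal Python sketch of this decision rule. The 40-ounce figure and n = 100 come from the example above; σ = 2.5 ounces and α = 0.05 are assumed here purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

mu_0 = 40.0    # hypothesized population mean (ounces)
sigma = 2.5    # population standard deviation (assumed for this sketch)
n = 100        # sample size
alpha = 0.05   # significance level (assumed)

# Critical value for a two-tailed test: z cut-off in each tail
z_crit = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05

def in_rejection_region(x_bar):
    """Return True if the sample mean falls in a rejection region."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    return abs(z) > z_crit

for x_bar in (40.01, 40.1, 50.0):
    print(x_bar, "reject H0" if in_rejection_region(x_bar) else "do not reject H0")
```

With these assumed values, 40.01 and 40.1 fall in the nonrejection region, while 50 falls far into the rejection region, matching the intuition in the text.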
Type I Error in Hypothesis Testing
• In hypothesis testing, a Type I error occurs when the null hypothesis H0 is true, but we incorrectly reject it.
• In other words, we mistakenly conclude that there's a significant effect (or difference) when there actually isn't one in the population.
• The probability of making a Type I error is denoted by the Greek letter α and is also known as the significance level.
• Example: Let's consider a trial for a new drug.
• Null Hypothesis H0: The new drug has no effect on a disease (i.e., it's no better than the current standard treatment or a placebo).
• Alternative Hypothesis Ha: The new drug has an effect on the disease (i.e., it's either better or worse than the current standard treatment or a placebo).
• Imagine that, in reality, the new drug is just as effective as the current treatment; it genuinely has no special effect.
• However, due to random variability in your sample or other unforeseen circumstances (maybe the patients in the trial who received the new drug by chance had milder forms of the disease or other beneficial conditions), the results show that the new drug appears significantly better.
• If you then reject the null hypothesis based on these results, you are concluding that the new drug is effective when, in reality, it's not. This is a Type I error.
• In this context:
• The consequence of a Type I error might be that the pharmaceutical company spends a lot of money marketing a drug that isn't truly better.
• Patients might also be given a treatment that isn't any more effective than the standard one.
• In many testing scenarios, particularly in fields like medicine or justice, the consequences of a Type I error can be quite severe, which is why researchers choose a significance level (α) that reflects the risks they're willing to take with making this kind of error. Common choices for α include 0.05, 0.01, and 0.10, but the appropriate level depends on the specific field and context of the test. (The simulation sketched below shows α in action.)
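A small simulation makes the definition of α concrete: when H0 is true, a test at level α = 0.05 rejects in roughly 5% of repeated samples. This is only a sketch; every parameter value below is invented for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
mu_0, sigma, n, alpha = 100.0, 15.0, 50, 0.05   # all values assumed
z_crit = norm.ppf(1 - alpha / 2)

# Simulate many studies in which H0 is actually true (true mean == mu_0)
trials = 100_000
rejections = 0
for _ in range(trials):
    sample = rng.normal(mu_0, sigma, n)          # H0 is true by construction
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(n))
    rejections += abs(z) > z_crit                # Type I error when this is True

print(f"Empirical Type I error rate: {rejections / trials:.3f}")   # ~0.05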
What now?
• Reducing the risk of committing a Type I error involves decreasing the significance level, α. However, there's a trade-off to consider. When you decrease α to reduce the Type I error probability, you increase the risk of committing a Type II error (failing to reject a false null hypothesis). This is because reducing α makes the test more conservative, demanding stronger evidence to reject the null hypothesis.
Type II Error
• A Type II error occurs when you fail to reject the null hypothesis even though the alternative hypothesis is true. In other words, it's a false negative.
Context: A pharmaceutical company has developed a new drug that they believe reduces blood pressure more effectively than the current standard medication. They decide to conduct clinical trials to determine its effectiveness.
Scenario:
• Imagine the new drug truly is more effective than the standard medication, but when the clinical trials are conducted, the data fails to show a significant difference between the two treatments.
• Perhaps this failure occurs because the sample size was too small, the trial duration was too short, or there was significant variability in the responses. As a result, the pharmaceutical company concludes that there's insufficient evidence to suggest the new drug is more effective, even though it genuinely is.
Type II Error:
• This scenario represents a Type II error. The company failed to detect the true effect of the new drug, which means they might not bring to market a treatment that's genuinely better for patients.
• Consequence: Patients might miss out on a more effective treatment for high blood pressure.
• Significance: Depending on the magnitude of the improvement, this could mean a significant health benefit that goes unrealized.
• The probability of committing a Type II error is often denoted by β.
• The power of a test, which is 1 − β, is the probability of correctly rejecting a false null hypothesis.
• In practice, researchers aim to design their experiments to minimize both α (the probability of a Type I error) and β, but there's often a trade-off between the two. (A sketch of computing β and power follows.)
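For a one-sample z-test with σ known, β and power have a closed form. A sketch, reusing the Metro EMS figures (μ0 = 12, σ = 3.2, n = 40) with an assumed true mean of 13 minutes chosen purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Upper-tail z-test, H0: mu <= mu_0; the true mean under Ha is assumed
mu_0 = 12.0      # hypothesized mean
mu_true = 13.0   # assumed true mean (illustrative only)
sigma, n, alpha = 3.2, 40, 0.05

se = sigma / sqrt(n)
z_crit = norm.ppf(1 - alpha)        # one-tailed critical value, ~1.645
x_bar_crit = mu_0 + z_crit * se     # reject H0 if the sample mean exceeds this

# beta = P(fail to reject H0 | true mean = mu_true)
beta = norm.cdf((x_bar_crit - mu_true) / se)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")   # ~0.370 and ~0.630
```

Raising n shrinks the standard error, which lowers β (raises power) at a fixed α, which is exactly the sample-size remark in the slide after this one.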
Type I & II Error Relationship
▪ Type I and Type II errors cannot happen at the same time.
▪ A Type I error can only occur if H0 is true.
▪ A Type II error can only occur if H0 is false.
▪ If the Type I error probability (α) increases, then the Type II error probability (β) decreases; the two are inversely related.
Type I and Type II Errors
• Type I Error
o Rejecting a true null hypothesis
o The probability of committing a Type I error is called α, the level of significance
o Analogous to when an innocent person is sent to jail
• Type II Error
o Failing to reject a false null hypothesis
o The probability of committing a Type II error is called β
o Analogous to when a guilty person is declared innocent
• Power (= 1 − β)
o The probability of a statistical test rejecting the null hypothesis when the null hypothesis is false
• Generally, α and β are inversely related; increasing the size of the sample reduces both α and β
Hypothesis Testing Procedures
• Step 1. Establish the null and alternative hypotheses.
• Step 2. Determine the appropriate statistical test.
• Step 3. Specify the level of significance, α (the acceptable Type I error rate).
• Step 4. Establish the decision rule.
• Step 5. Collect the sample data.
• Step 6. Analyze the data: compute the value of the test statistic.
• p-Value Approach
• Use the value of the test statistic to compute the p-value.
• Reject H0 if p-value < α.
• Critical Value Approach
• Use the level of significance to determine the critical value and the rejection rule.
• Use the value of the test statistic and the rejection rule to determine whether to reject H0.
• Step 7. Reach a statistical conclusion.
• Step 8. Make a business decision. (A helper implementing these steps is sketched below.)
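The steps above can be collected into a small helper. This sketch (not from the source slides) implements the two-tailed case with σ known and reports both decision approaches:

```python
from math import sqrt
from scipy.stats import norm

def z_test_two_tailed(x_bar, mu_0, sigma, n, alpha=0.05):
    """Two-tailed z-test for a population mean with sigma known.

    Covers Steps 4-7 of the procedure above: compute the test statistic,
    the critical value, the p-value, and the reject/do-not-reject decision.
    """
    z = (x_bar - mu_0) / (sigma / sqrt(n))   # Step 6: test statistic
    z_crit = norm.ppf(1 - alpha / 2)         # Step 4: critical value
    p_value = 2 * (1 - norm.cdf(abs(z)))     # p-value approach
    reject = abs(z) > z_crit                 # equivalent to p_value < alpha
    return z, p_value, reject
```

Both approaches always agree: |z| exceeds the critical value exactly when the p-value falls below α.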
Testing Hypotheses about a Population Mean
Using the z Statistic (σ known)
• For instance: A survey of CPAs across the United States found that the average net income for sole-proprietor CPAs is $98,500. Because this survey is over a decade old, an accounting analyst wants to test whether the net income figure has changed or not. A random sample of 112 CPAs produced a mean salary of $102,220. Assume that the population standard deviation of salaries is σ = $14,530.
• Step 1: Establish the hypotheses
• The analyst wants to know whether the mean has changed, so a two-tailed test is used:
• H0 : μ = $98,500
• Ha : μ ≠ $98,500
• Step 2: Determine the appropriate statistical test
• The z-statistic can be used when the following conditions are met:
o The data are a random sample from the population
o The population standard deviation (σ) is known
o At least one of the following is met: the sample size (n) is at least 30, or the underlying distribution is normal
• The test statistic is z = (x̄ − μ) / (σ / √n)
• Step 3: Set the value of α, the Type I error rate
• The value of α, 0.05, is specified in the problem
• Step 4: Establish the decision rule
• For a two-tailed test with α = 0.05, the rejection region will be in the two tails, with an area of 0.025 in each tail, so zα/2 = 1.96
• Decision rule: Reject H0 if z > 1.96 or z < −1.96
• Step 5: Gather the sample data
• Suppose that in the sample of 112 CPAs who respond to the survey, the sample mean is $102,220
• Step 6: Analyze the data
• z = (102,220 − 98,500) / (14,530 / √112) = 2.71
• Step 7: Reach a statistical conclusion
• Since the test statistic (or "observed value"), z = 2.71, is greater than the critical value of 1.96, reject H0
• Step 8: Make a business decision
• Statistically, the analyst has enough evidence to reject the figure of $98,500 as the true average salary for CPAs
• Based on the evidence gathered, it suggests that the average has increased over the 15-year period
• For a manager, this may indicate that hiring costs for CPAs have increased
• For an accountant, this may mean the potential for greater earning power (the arithmetic is verified in the sketch below)
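A few lines of Python (a sketch using scipy) reproduce z = 2.71 and the corresponding p-value for this example:

```python
from math import sqrt
from scipy.stats import norm

# CPA net-income example: two-tailed test at alpha = 0.05
x_bar, mu_0, sigma, n, alpha = 102_220, 98_500, 14_530, 112, 0.05
z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# z = 2.71, p-value ~ 0.0067 < 0.05, so reject H0
```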
Example: Metro EMS
• A major west coast city provides one of the most comprehensive emergency medical services in
the world. Operating in a multiple hospital system with approximately 20 mobile medical
units, the service goal is to respond to medical emergencies with a mean time of 12 minutes or
less.
• The director of medical services wants to formulate a hypothesis test that could use a sample
of emergency response times to determine whether or not the service goal of 12 minutes or
less is being achieved.
• The response times for a random sample of 40 medical emergencies were tabulated. The
sample mean is 13.25 minutes. The population standard deviation is believed to be 3.2 minutes.
• The EMS director wants to perform a hypothesis test, with a .05 level of significance, to
determine whether the service goal of 12 minutes or less is being achieved.
One-Tailed Tests About a Population Mean: σ Known
• Step 1. Develop the null and alternative hypotheses: H0 : μ ≤ 12 and Ha : μ > 12.
• Step 2. Specify the level of significance: α = 0.05.
• Step 3. Collect the sample data and compute the value of the test statistic.
• Critical Value Approach:
• Step 4. Determine the critical value and rejection rule.
• Step 5. Determine whether to reject H0.
• Thus, there is sufficient statistical evidence to infer that Metro EMS is not meeting the response goal of 12 minutes. The computation behind this conclusion is sketched below.
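A Python sketch using the sample figures given above (x̄ = 13.25, σ = 3.2, n = 40, α = 0.05):

```python
from math import sqrt
from scipy.stats import norm

# Metro EMS: H0: mu <= 12, Ha: mu > 12 (upper-tail test)
x_bar, mu_0, sigma, n, alpha = 13.25, 12.0, 3.2, 40, 0.05
z = (x_bar - mu_0) / (sigma / sqrt(n))   # ~2.47
z_crit = norm.ppf(1 - alpha)             # ~1.645
print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
# z = 2.47 > 1.645, so reject H0: the 12-minute goal is not being met
```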
p-Value Approach
• Step 4. Use the value of the test statistic to compute the p-value.
• Step 5. Determine whether to reject H0: reject H0 if p-value < α.
• There is sufficient statistical evidence to infer that Metro EMS is not meeting the response goal of 12 minutes.
p-value approach
• The p-value is another way to reach a statistical conclusion in hypothesis testing: it is the smallest value of α for which the null hypothesis can be rejected.
• The p-value is the probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis.
• If the p-value is less than the level of significance α, the value of the test statistic is in the rejection region:
• p-value < α → reject H0
• p-value ≥ α → do not reject H0
• (The sketch below applies these rules to the Metro EMS figures.)
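Applying the p-value approach to the same Metro EMS figures (a sketch):

```python
from math import sqrt
from scipy.stats import norm

# Metro EMS again, p-value approach for the upper-tail test
x_bar, mu_0, sigma, n, alpha = 13.25, 12.0, 3.2, 40, 0.05
z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 1 - norm.cdf(z)                # upper-tail area beyond z
print(f"p-value = {p_value:.4f}")        # ~0.0068 < 0.05, so reject H0
```

Note that 0.0068 is also the smallest α at which H0 could be rejected here, which is exactly the definition above.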
Problem
• Consider the case of a wholesaler that buys lightbulbs from the manufacturer. The wholesaler buys the bulbs in large lots and does not want to accept a lot of bulbs unless the mean life is at least 1,000 hours. As each shipment arrives, the wholesaler tests a sample to determine whether it should accept the shipment or not. The company will reject the shipment only if it feels that the mean life is below 1,000 hours.
• They collected a random sample of 40 lightbulbs. The mean life is 992.6 hours. The population standard deviation is believed to be 32 hours. The wholesaler wants to perform a hypothesis test, with a 0.10 level of significance. One way to work the problem is sketched below.
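The source states the data but not the solution; one way to work it, as a lower-tail test sketched in Python:

```python
from math import sqrt
from scipy.stats import norm

# H0: mu >= 1000 (accept the shipment), Ha: mu < 1000 (lower-tail test)
x_bar, mu_0, sigma, n, alpha = 992.6, 1000.0, 32.0, 40, 0.10
z = (x_bar - mu_0) / (sigma / sqrt(n))   # ~ -1.46
p_value = norm.cdf(z)                    # lower-tail area, ~0.072
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# p-value < 0.10, so reject H0: there is evidence the mean life is
# below 1,000 hours, and the wholesaler should reject the shipment
```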
Example: Two-Tailed Test About a Population Mean: σ Known
• Glow Toothpaste:
• The production line for Glow toothpaste is designed to fill tubes with a mean weight of 6 oz. Periodically, a sample of 30 tubes will be selected in order to check the filling process.
• Quality assurance procedures call for the continuation of the filling process if the sample results are consistent with the assumption that the mean filling weight for the population of toothpaste tubes is 6 oz.; otherwise the process will be adjusted.
• Assume that a sample of 30 toothpaste tubes provides a sample mean of 6.1 oz. The population standard deviation is believed to be 0.2 oz. Perform a hypothesis test, at the .03 level of significance, to help determine whether the filling process should continue operating or be stopped and corrected. A sketch of the computation follows.
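A sketch of the two-tailed computation at α = 0.03:

```python
from math import sqrt
from scipy.stats import norm

# Glow toothpaste: H0: mu = 6 oz, Ha: mu != 6 oz, two-tailed at alpha = 0.03
x_bar, mu_0, sigma, n, alpha = 6.1, 6.0, 0.2, 30, 0.03
z = (x_bar - mu_0) / (sigma / sqrt(n))   # ~2.74
z_crit = norm.ppf(1 - alpha / 2)         # ~2.17
p_value = 2 * (1 - norm.cdf(abs(z)))     # ~0.0062
print(f"z = {z:.2f}, critical = {z_crit:.2f}, p-value = {p_value:.4f}")
# |z| > 2.17 (equivalently, p-value < 0.03), so reject H0:
# stop the filling process and adjust it
```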