Quantitative Methods 3
Qualitative data are data for which the measurement scale is categorical
Classification of Data
▪ Qualitative
▪ Quantitative: Discrete, Continuous
Processing Data
He created a map depicting where cases of cholera occurred in London’s West End and found them to be clustered
around a water pump on Broad Street.
Analytics
A. 50 million
B. 52 million
C. 22 million
D. 49 million
Scale of Measurement
Likely to encounter these terms:
▪ Data are the facts and figures that are collected, summarized, analyzed,
and interpreted
▪ Elements are the entities on which data are collected
▪ A variable is a characteristic of interest for the elements
▪ A data set with n elements contains n observations
▪ Predictor variable: A variable thought to predict an outcome variable. This
term is basically another way of saying ‘independent variable or cause’.
▪ Outcome variable: A variable thought to change as a function of changes in
a predictor variable. This is another way of saying ‘dependent variable or effect’.
▪ Variables are measured constructs that vary across entities in the sample.
▪ In contrast, parameters are not measured and are (usually) constants
believed to represent some fundamental truth about the relations
between variables in the model (e.g., the mean, the median, and
correlation and regression coefficients).
For instance:
[Table: example data set labelling the names of the elements and the variables]
Types of Measurement scale
• Variables can be split into categorical and continuous, and within these types
there are different levels of measurement:
• Categorical (entities are divided into distinct categories):
• Binary variable: There are only two categories (e.g., dead or alive).
• Nominal variable: There are more than two categories (e.g., whether someone is an
omnivore, vegetarian, vegan, or fruitarian).
• Ordinal variable: The same as a nominal variable but the categories have a logical
order (e.g., whether people got a fail, a pass, a merit or a distinction in their exam)
• Continuous or Quantitative (entities get a distinct score):
• Interval variable: Equal intervals on the variable represent equal differences in the
property being measured (e.g., the difference between 6 and 8 is equivalent to the
difference between 13 and 15).
• Ratio variable: The same as an interval variable, but the ratios of scores on the scale
must also make sense (e.g., a score of 16 on an anxiety scale means that the person is,
in reality, twice as anxious as someone scoring 8). For this to be true, the scale must
have a meaningful zero point.
What is the level of measurement of the following variables?
• The gender of the people giving the bands their phone numbers
https://academo.org/demos/dice-roll-statistics/#:~:text=If%20you%20roll%20a%20fair,%22roll%20automatically%22%20button%20above.
https://www.youtube.com/watch?v=zeJD6dqJ5lo
Descriptive Statistics
▪ Numerical Measures
▪ Measures of Location
▪ Measures of Variability
Measures of Location
▪Mean
▪Median
▪Mode
▪Percentiles
▪Quartiles
The measure of central tendency
• We can calculate where the centre of a frequency distribution lies
using three measures commonly used: the mean, the mode and the
median.
• The mean is the sum of all scores divided by the number of scores.
The value of the mean can be influenced quite heavily by extreme
scores. (The mean provides a measure of central location)
• The median is the middle score when the scores are placed in
ascending order. It is not as influenced by extreme scores as the
mean.
• The mode is the score that occurs most frequently.
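The three measures can be compared on a small data set. The sketch below uses Python's standard `statistics` module; the salary figures are hypothetical and chosen only to show how one extreme score pulls the mean but not the median.

```python
from statistics import mean, median, multimode

# Hypothetical monthly salaries (in thousands); the last value is an extreme score.
salaries = [30, 32, 32, 35, 38, 40, 250]

salary_mean = mean(salaries)         # sum of all scores / number of scores
salary_median = median(salaries)     # middle score once the data are sorted
salary_modes = multimode(salaries)   # most frequent score(s), returned as a list

print("mean:", round(salary_mean, 2))
print("median:", salary_median)
print("mode(s):", salary_modes)
```

Removing the extreme value changes the mean considerably while the median barely moves, which is the behaviour described in the bullets above.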
Business Scenario: Mean
• Suppose you want to run a campaign to advertise the racing bikes /
latest fashion trend at a location.
• Whenever a data set has extreme values, the median is the preferred
measure of central location. The median of a data set is the value in
the middle when the data items are arranged in ascending order.
• As a general rule, use the median rather than the mean when the data set
is skewed or contains extreme values.
Median
• Odd number of scores: the median is the middle score.
• Even number of scores: the median is the average of the two middle scores.
The mean annual loan amount of the population is 13,50,000 INR. But this amount is higher than
the loan amount of 80% of the population.
Customer    Loan Amount (in Rs)
1           8,00,000
4           9,50,000
5           32,50,000
However, the median is not an impeccable statistic. There are several things that we should
consider when using it for communicating statistical information.
Practical use- Salary Analysis: Use the median to report typical salaries when there are a few extremely high or low
salaries that could skew the mean.
Limitations of using Median
Practical use- Customer Preferences: Use the mode to identify the most preferred product or service feature.
Dispersion of distribution
• SS=5.20
• This equation shows how something we have used before (the sum of
squares) can be used to assess the total error in any model (not just the
mean).
• Although the sum of squared errors (SS) is a good measure of the
accuracy of our model, it depends upon the quantity of data that has
been collected: the more data points, the higher the SS.
• To estimate the mean error in the population we need to divide not by the
number of scores contributing to the total, but by the degrees of freedom
(df), which is the number of scores used to compute the total adjusted for
the fact that we’re trying to estimate the population value.
• Our model is the mean, so let’s replace the ‘model’ with the mean (x̄), and
the ‘outcome’ with the letter x (to represent a score on the outcome).
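The bullets above can be followed step by step in code. This is a minimal sketch: the five scores are hypothetical, picked so that the sum of squares comes out at 5.20, and the function name `dispersion` is illustrative only.

```python
from math import sqrt

def dispersion(scores):
    """Return (sum of squares, variance, standard deviation) around the mean."""
    n = len(scores)
    x_bar = sum(scores) / n                      # the model: the mean
    ss = sum((x - x_bar) ** 2 for x in scores)   # total squared error
    variance = ss / (n - 1)                      # divide by df = n - 1, not n
    return ss, variance, sqrt(variance)

scores = [1, 2, 3, 3, 4]  # hypothetical scores
ss, variance, sd = dispersion(scores)
print(round(ss, 2), round(variance, 2), round(sd, 2))
```

Dividing by the degrees of freedom rather than by the number of scores gives an average error that no longer grows simply because more data were collected.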
3 24 33 44 46 66 77 79 89 99
5 24 34 45 47 67 77 86 89 99
9 26 37 45 55 67 78 86 89 99
21 31 39 46 56 75 78 87 90 102
Business Analytics course by U Dinesh kumar
Solution:
• Mean = 57.64, median = 56, and mode = 46, 89 and 99
• Note that the data in the table are arranged in increasing order in columns. The position
of P10 = 10 × (51)/100 = 5.1
• The value at position 5.1 is approximated as 21 + 0.1 × (value at 6th position − value at
5th position) = 21 + 0.1(1) = 21.1. That is, by 21.1 hours, 10% of the wire-cuts will fail.
In asset management (and reliability theory), this value is called P10 life.
• Position corresponding to P90 = 90 × 51/100 = 45.9. The value at position 45 is 90
and the value at position 45.9 is 90 + 0.9 × (value at 46th position − value at 45th
position) = 90 + 0.9 × (3) = 92.7
• That is, 90% of the wire-cuts will fail by 92.7 hours.
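The position rule used in the solution, position = p × (n + 1)/100 followed by linear interpolation between neighbouring values, can be sketched as below. The function name and the short list of failure times are hypothetical; the full data table is not reproduced here.

```python
def percentile(sorted_values, p):
    """Percentile via position p * (n + 1) / 100 with linear interpolation."""
    n = len(sorted_values)
    position = p * (n + 1) / 100       # 1-based position in the sorted data
    whole = int(position)              # position of the value just below
    fraction = position - whole        # how far to move toward the next value
    if whole < 1:                      # position falls before the first value
        return sorted_values[0]
    if whole >= n:                     # position falls at or after the last value
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + fraction * (above - below)

failure_hours = [12, 15, 17, 20, 22, 25, 28, 31, 35]  # hypothetical, already sorted
p25 = percentile(failure_hours, 25)
p50 = percentile(failure_hours, 50)
print(p25, p50)
```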
Spread of Data Scores
• The range can also be calculated excluding values at the extremes of the
distribution. One convention is to cut off the top and bottom 25% of scores and
calculate the range of the middle 50% of scores, known as the interquartile range.
Box plot
• A box plot is a graphical summary of data that is based on a five-number
summary.
• A key to the development of a box plot is the computation of the median and
the quartiles Q1 and Q3
• Box plots provide another way to identify outliers
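A five-number summary can be computed with the standard library. The sketch below assumes the common 1.5 × IQR fence rule for flagging outliers, which is a convention and not something stated in these slides; the data values are hypothetical.

```python
from statistics import quantiles

data = sorted([2, 4, 5, 7, 8, 9, 11, 12, 40])  # hypothetical scores

# Quartiles using the (n + 1) position method ("exclusive").
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

five_number_summary = (data[0], q1, q2, q3, data[-1])

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", five_number_summary)
print("IQR:", iqr, "outliers:", outliers)
```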
• In symbols: $\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%$, where $s$ is the standard deviation and $\bar{x}$ is the mean.
• However, it is meaningful only if the data are measured on a ratio scale.
• Variates with a mean less than unity also give spurious results: the
coefficient of variation will be very large and often meaningless.
Significance of the coefficient of variation:
• Relative Measure of Variability: The CV provides a relative measure of variability that allows for the
comparison of dispersion between data sets. It is particularly valuable when dealing with data sets that have
different units or scales. The CV enables meaningful comparisons between data sets with different ranges or
magnitudes by expressing the variability as a percentage of the mean.
• Comparison of Variability: The CV is useful when comparing the variability of different groups or populations.
For example, it can be employed to assess the dispersion of financial returns of different investment
portfolios, the volatility of stock prices across companies, or the variation in test scores among students in
various schools.
• Decision Making: The CV can be helpful in decision-making processes. A lower CV may indicate greater
consistency or stability in a data set in certain situations, making it more desirable. For instance, if you are
comparing the CV of two suppliers' delivery times, a lower CV would imply that the delivery times are more
consistent, which might be advantageous for supply chain planning.
Applications
Finance:
• Investment Comparison: Investors can use CV to compare the risk (volatility) of different
investments. A lower CV indicates a more stable investment relative to its expected return.
Example: Comparing two stocks to decide which one offers a more stable return relative to its
volatility.
Quality Control:
• Process Variation: Manufacturers can use CV to compare the consistency of different production
processes.
Example: Comparing the variability in the diameter of produced parts from two different machines.
Healthcare:
• Medical Measurements: CV can be used to compare the reliability of different diagnostic tests or
instruments.
Example: Comparing the variability in blood pressure readings from two different blood pressure
monitors.
The Coefficient of Variation
• To get a feeling for the coefficient of variation, let's compare a few data
sets.
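As one possible comparison, the sketch below computes the CV for two hypothetical suppliers with the same mean delivery time but different spread. The data and the function name are illustrative assumptions.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical delivery times in days; both suppliers average 5 days.
supplier_a = [4, 5, 5, 6, 5]
supplier_b = [2, 9, 4, 8, 2]

cv_a = coefficient_of_variation(supplier_a)
cv_b = coefficient_of_variation(supplier_b)
print(f"Supplier A: {cv_a:.2f}%  Supplier B: {cv_b:.2f}%")
```

The supplier with the lower CV has the more consistent delivery times relative to its mean.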
INFERENTIAL STATISTICS
“To Clarify *add* data.” —Edward R Tufte
• Convenience Sampling: a non-probability sampling technique in which the sample units are
not selected according to a probability distribution.
• Voluntary Sampling: the data are collected from people who volunteer for such data collection.
For example, customer feedback in many contexts falls under this sampling procedure.
SAMPLING DISTRIBUTION
• Sampling distribution refers to the probability distribution of a statistic, such as the
sample mean or the sample standard deviation, computed from several random
samples of the same size.
$Z = \frac{X - \mu}{\sigma}$
where μ = mean of the distribution, σ = standard deviation, and
Z = number of standard deviations from X to the mean of the distribution.

$Z = \frac{X - \mu}{\sigma} = \frac{130 - 100}{15} = \frac{30}{15} = 2$ std dev

For Z = 2.00
P(X < 130) = P(Z < 2.00) = 0.97725
P(X > 130) = 1 − P(X ≤ 130) = 1 − P(Z ≤ 2)
= 1 − 0.97725 = 0.02275
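The table lookup above can be checked with the standard normal CDF. This sketch uses `statistics.NormalDist` from the Python standard library with the same mean (100), standard deviation (15), and value (130).

```python
from statistics import NormalDist

mu, sigma, x = 100, 15, 130

z = (x - mu) / sigma                 # number of standard deviations from the mean
p_below = NormalDist().cdf(z)        # P(Z < z) under the standard normal
p_above = 1 - p_below                # P(Z > z)

print(f"z = {z:.2f}, P(X < {x}) = {p_below:.5f}, P(X > {x}) = {p_above:.5f}")
```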
Haynes Construction Company
$Z = \frac{X - \mu}{\sigma} = \frac{125 - 100}{20} = \frac{25}{20} = 1.25$
$Z = \frac{X - \mu}{\sigma} = \frac{110 - 100}{20} = \frac{10}{20} = 0.5$
• Point Estimate: A point estimate of a population parameter is the single value (or specific value)
calculated from a sample (and therefore called a statistic). Sample mean and variance are estimates of
population mean and variance. Similarly, sample proportion is an estimate of population proportion.
• Interval Estimate: Instead of a specific value of the parameter, in an interval estimate the parameter
is said to lie in an interval (say between points a and b) with a certain probability (or confidence).
Problem
Confidence Interval
“Confidence comes not from always being right but from not fearing to be wrong ”. —Peter McIntyre
Point and Interval Estimates
• A point estimate is a single number.
• A confidence interval provides additional information
about the variability of the estimate.
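[Figure: a confidence interval running from the Lower Confidence Limit to the Upper Confidence Limit, with the Point Estimate between them; the distance between the limits is the width of the confidence interval.]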
➢ Confidence intervals are constructed using the point estimate, standard error, and a chosen confidence level.
➢ For example, a 95% confidence interval for the population mean provides a range of values within which we are
95% confident the true population mean lies.
Confidence Intervals
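• When there is an uncertainty around measuring the value of an
important population parameter, it is advisable to find the range in which
the value of the parameter is likely to fall rather than predicting a single
estimate (point estimate).
[Figure: sampling distribution of the mean, centred at $\mu_{\bar{x}} = \mu$, with area $1 - \alpha$ in the middle and $\alpha/2$ in each tail; sample means $\bar{x}_1$ and $\bar{x}_2$ each give their own interval.]
• Intervals extend from $\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ to $\bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$.
• $(1 - \alpha)100\%$ of intervals constructed contain μ; $(\alpha)100\%$ do not.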
Confidence Intervals
Intervals and Level of Confidence
• With an error level of, say, 5%, it is important to emphasize: we are not saying that 95% of the
time our sample mean is the population mean. We are saying that 95% of the time a range
extending about two standard errors on either side of the sample mean contains the
population mean.
What does this insight mean for us as managers? When we set a confidence level of 95%, we
are agreeing to an approach that 1 out of 20 times will give us an interval that does not
contain the true population mean. If we aren't comfortable with those odds, we should raise
the confidence level.
• For Example: If we report that we are 90% confident that the mean of the population
of income of people in a certain community will lie between $8000 and $24000,
then the range $8000-$24000 is our confidence interval.
• Often, however, we express the confidence interval in standard error rather than in
numerical values. Thus, we will often express confidence intervals like:
Confidence Intervals
▪ Population Mean: σ Known, σ Unknown
▪ Population Proportion
Confidence Interval for μ
(σ Known)
• Assumptions:
• Population standard deviation σ is known.
• Population is normally distributed.
• If population is not normal, use large sample (n > 30).
• Confidence interval estimate: The formula for a confidence interval for a population
mean (μ) when the population standard deviation is known is:
$\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
➢ σ is the population standard deviation, and n is the sample size.
➢ The term $Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ is the margin of error of the estimate.
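The interval formula for known σ can be sketched as a small function. The function name and the input values (sample mean 20.0, σ = 3.0, n = 36) are hypothetical and used only to illustrate the calculation.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value Z_(alpha/2)
    margin = z * sigma / sqrt(n)              # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs for illustration only.
lower, upper = z_confidence_interval(sample_mean=20.0, sigma=3.0, n=36)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level increases the critical value and therefore widens the interval; increasing n narrows it.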
Problem 1
• The sample mean was 4.5 days and the population standard deviation
was known to be 1.2 days.
(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than 4.73 days?
Solution
Problem 2
kg. We take a sample of 100 bags and find their average weight (X_bar)
kg, 50.49 kg). We are 95% confident that the true average weight of the bags filled
• Keep in mind that this interpretation is based on the long-term behavior of the
method. That is, if we repeatedly took samples and calculated confidence intervals
in this way, about 95% of them would capture the true average weight.
Confidence Intervals
▪ Population Mean: σ Known, σ Unknown
▪ Population Proportion
Do You Ever Truly Know σ?
• Probably not!
Incidentally, we can use the t-distribution even for sample sizes larger than 30. However,
most people use the z-distribution for larger samples, partially out of habit and partially
because it's easier, since the z-value doesn't vary based on the sample size.
Student’s t Distribution
Note: t → Z as n increases
[Figure: the Standard Normal curve (t with df = ∞) plotted with t (df = 13) and t (df = 5), centred at 0.]
t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.
Confidence Interval for μ
(σ Unknown)
• Assumptions:
• Population standard deviation is unknown.
• Population is normally distributed.
• Use Student’s t Distribution.
• Confidence Interval Estimate:
$\bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}}$
(where $t_{\alpha/2}$ is the critical value of the t distribution with n − 1 degrees of freedom and
an area of α/2 in each tail.)
• The t is a family of distributions.
• The $t_{\alpha/2}$ value depends on degrees of freedom (d.f.).
• Degrees of freedom: the number of observations that are free to vary after the sample mean has been calculated.
d.f. = n − 1
Student’s t Table
Confidence Level | t (10 d.f.) | t (20 d.f.) | t (30 d.f.) | Z (∞ d.f.)
Note: t → Z as n increases
Suppose you have a sample of 20 observations and want to calculate a
95% confidence interval for the population mean. The sample mean is
65 and the sample standard deviation is 8.
1. Determine the degrees of freedom (df).
▪ In this case, df = n - 1 = 20 - 1 = 19.
2. Find the critical t-value corresponding to a 95% confidence level and 19 degrees of freedom.
You can refer to the t-table. The critical t-value is 2.093.
3. Calculate the margin of error using the formula:
Margin of Error = t-value * (standard deviation / sqrt(sample size))
Margin of Error = 2.093 * (8 / sqrt(20)) ≈ 3.74
4. Calculate the confidence interval using the formula:
Confidence Interval = sample mean ± margin of error
Confidence Interval = 65 ± (2.093 * (8 / sqrt(20)))
Therefore, the 95% confidence interval for the population mean is approximately (61.26, 68.74).
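The arithmetic of this example can be checked with a few lines of code. In this sketch the critical value 2.093 is taken from the t-table as in the example, not computed; the function name is illustrative.

```python
from math import sqrt

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for the mean when sigma is unknown.

    t_critical is looked up in a t-table for n - 1 degrees of freedom.
    """
    margin = t_critical * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the example: n = 20, mean = 65, s = 8, t (19 d.f., 95%) = 2.093.
lower, upper = t_confidence_interval(65, 8, 20, t_critical=2.093)
print(round(lower, 2), round(upper, 2))
```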
“Beware of the problem of testing too many hypotheses, the more you torture the data, the more likely they are to confess,
but confessions obtained under duress may not be admissible in the court of scientific opinion.” —Stephen M Stigler
What is a hypothesis?
• Hypothesis testing begins with an assumption, called a hypothesis, that
we make about a population parameter.
A hypothesis is a claim (assertion) about a population parameter:
population mean:
Example: The mean monthly cell phone bill in this city is μ = 800 INR
• The alternative hypothesis (Ha) is the statement that contradicts or opposes the
null hypothesis. It represents the claim or belief that there is a specific effect or
relationship in the population.
Null and Alternative Hypotheses
• The Null and Alternative Hypotheses are mutually exclusive. Only one
of them can be true.
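• A hypothesis is stated about a population parameter, not a sample statistic: H0 : μ = 30 is a valid null hypothesis, whereas H0 : X̄ = 30 is not.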
• Let us look at an example:
• A realtor claims that the average price of an apartment in a locality in
a Metropolitan City is more than INR 50 Lakhs.
• The null hypothesis is H0 : μ ≤ 50 Lakhs (the mean price μ is less than or
equal to INR 50 Lakhs). The alternative hypothesis is H1 : μ > 50 Lakhs
(μ is greater than INR 50 Lakhs).
• The emergency service is not meeting the response goal; appropriate follow-
up action is necessary.
• In hypothesis testing, a Type I error occurs when the null hypothesis H0 is true, but we reject it.
• In other words, we mistakenly conclude that there's a significant effect (or difference) when there is none.
• The probability of making a Type I error is denoted by the Greek letter α and is also called the significance level.
• Example:
• Let's consider a trial for a new drug.
Null Hypothesis H0: The new drug has no effect on a disease (i.e., it's no better than the current treatment).
Alternative Hypothesis Ha: The new drug has an effect on the disease (i.e., it's either better or worse).
• Imagine that, in reality, the new drug is just as effective as the current
treatment—it genuinely has no special effect.
• If you then reject the null hypothesis based on these results, you are concluding
that the new drug is effective when, in reality, it's not. This is a Type I error.
• In this context:
• The consequence of a Type I error might be that the pharmaceutical company spends
a lot of money marketing a drug that isn't truly better.
• Patients might also be given a treatment that isn't any more effective than the
standard one.
In many testing scenarios, particularly in fields like medicine or justice, the consequences
of a Type I error can be quite severe, which is why researchers choose a significance level
(α) that reflects the risks they're willing to take with making this kind of error. Common
choices for α include 0.05, 0.01, and 0.10, but the appropriate level depends on the specific
field and context of the test.
What now?
• Reducing the risk of committing a Type I error involves decreasing the significance
level, α. However, there's a trade-off to consider. When you decrease α to reduce the
Type I error probability, you increase the risk of committing a Type II error (failing to
reject a false null hypothesis). This is because reducing α makes it harder to reject the
null hypothesis, whether or not it is true.
• A Type II error occurs when you fail to reject the null hypothesis
when it is actually false. It is also known as a false negative.
Context: A pharmaceutical company has developed a new drug that
they believe reduces blood pressure more effectively than the current
its effectiveness.
Scenario:
• Imagine the new drug truly is more effective than the standard medication, but
when the clinical trials are conducted, the data fails to show a significant difference
• Perhaps this failure occurs because the sample size was too small, the trial duration
was too short, or there was significant variability in the responses. As a result, the
new drug, which means they might not bring to market a treatment that's genuinely better
for patients.
• Consequence: Patients might miss out on a more effective treatment for high blood pressure.
• Significance: Depending on the magnitude of the improvement, this could mean a significant
• The probability of committing a Type II error is often denoted by β.
Type I & II Error Relationship
• A survey of CPAs across the United States found that the average net
income for sole-proprietor CPAs is $98,500. Because this survey is over a
decade old, an accounting analyst wants to test whether the net income
figure has changed or not. A random sample of 112 CPAs produced a
mean salary of $102,220. Assume that the population standard deviation
of salaries is σ = $14,530.
• Step 1: Establish the hypothesis
• Analyst wants to know whether the mean has changed
• Two-tailed test
• H0 : μ = $98,500
• Ha : μ ≠ $98,500
• Step 2: Determine the appropriate statistical test
• The z-statistic can be used when the following three conditions are met:
o The data are a random sample from the population
o The population standard deviation (σ) is known
o At least one of the following conditions is met:
• The sample size (n) is at least 30 OR
• the underlying distribution is normal
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• Step 3: Set the value of α, the Type I error rate
• The value of α, 0.05, is specified in the problem
• Step 4: Establish the decision rule
• For a two-tailed test with α = 0.05, the rejection region will be in the two tails,
with an area of 0.025 in each tail, so $z_{\alpha/2} = \pm 1.96$
• Decision rule: Reject H0 if $z > 1.96$ or $z < -1.96$.
• Step 5: Gather the sample data
• Suppose that in the sample of 112 CPAs who respond to the survey, the
sample mean is $102,220
• Step 6: Analyze the data
$z = \frac{102{,}220 - 98{,}500}{14{,}530 / \sqrt{112}} = 2.71$
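The steps above can be traced in code. This sketch uses the CPA figures from the problem and compares the z-statistic with the two-tailed critical value; the function name `z_statistic` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(x_bar, mu_0, sigma, n):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (x_bar - mu_0) / (sigma / sqrt(n))

alpha = 0.05
z = z_statistic(x_bar=102_220, mu_0=98_500, sigma=14_530, n=112)
z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value

# Decision rule: reject H0 if z falls in either tail.
reject_h0 = abs(z) > z_critical
print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}, reject H0: {reject_h0}")
```

Because the statistic lies beyond the critical value, the decision rule from Step 4 leads to rejecting H0 at the 0.05 level.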
• A major west coast city provides one of the most comprehensive emergency medical services in
the world. Operating in a multiple hospital system with approximately 20 mobile medical
units, the service goal is to respond to medical emergencies with a mean time of 12 minutes or
less.
• The director of medical services wants to formulate a hypothesis test that could use a sample
of emergency response times to determine whether or not the service goal of 12 minutes or
less is being achieved.
• The response times for a random sample of 40 medical emergencies were tabulated. The
sample mean is 13.25 minutes. The population standard deviation is believed to be 3.2 minutes.
• The EMS director wants to perform a hypothesis test, with a .05 level of significance, to
determine whether the service goal of 12 minutes or less is being achieved.
One-Tailed Tests About a Population Mean: σ Known
• Step 1. Develop the null and alternative hypotheses.
• Step 3. Collect the sample data and compute the value of the test statistic.
• Critical Value Approaches:
• Step 4. Determine the critical value and rejection rule.
• There is sufficient statistical evidence to infer that Metro EMS is not meeting the
response goal of 12 minutes.
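The critical value approach for this problem can be sketched as follows. It assumes the hypotheses H0: μ ≤ 12 and Ha: μ > 12, which is one reading of the service goal of "12 minutes or less"; the numbers are those given in the problem.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 12.0      # service goal in minutes (assumed H0: mu <= 12)
x_bar = 13.25    # sample mean response time
sigma = 3.2      # population standard deviation
n = 40
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value

reject_h0 = z > z_critical                     # upper-tailed rejection rule
print(f"z = {z:.2f}, critical value = {z_critical:.3f}, reject H0: {reject_h0}")
```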
p value approach
• p-value: another way to reach a statistical conclusion in hypothesis
testing. It defines the smallest value of α for which the null
hypothesis can be rejected.
• The p-value is the probability, computed using the test statistic, that
measures the support (or lack of support) provided by the sample for
the null hypothesis.
• If the p-value is less than the level of significance α, the value of
the test statistic is in the rejection region.
• p-value < α: reject H0
• p-value ≥ α: do not reject H0
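For a z-test, the p-value is a tail area of the standard normal distribution. The helper below is a sketch; the name `p_value` and the `tail` labels are illustrative, and the example statistic 2.71 is the two-tailed CPA result from earlier.

```python
from statistics import NormalDist

def p_value(z, tail):
    """Tail area for a z-statistic; tail is 'upper', 'lower', or 'two'."""
    cdf = NormalDist().cdf
    if tail == "upper":
        return 1 - cdf(z)
    if tail == "lower":
        return cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

alpha = 0.05
p = p_value(2.71, "two")
reject_h0 = p < alpha          # p-value < alpha: reject H0
print(f"p-value = {p:.4f}, reject H0: {reject_h0}")
```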
Problem
• Consider the case of a wholesaler that buys lightbulbs from the manufacturer.
The wholesaler buys the bulbs in large lots and does not want to accept a lot
of bulbs unless the mean life is at least 1000 hours. As each shipment arrives,
the wholesaler tests a sample to determine whether it should accept the
shipment or not. The company will reject the shipment only if it feels that the
mean life is below 1000 hours.
• They collected a random sample of 40 lightbulbs. The mean life is 992.6 hours.
The population standard deviation is believed to be 32 hours. The wholesaler
wants to perform a hypothesis test, with a 0.10 level of significance.
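One way to set this problem up is as a lower-tailed test with H0: μ ≥ 1000 and Ha: μ < 1000. That formulation is an assumption based on the wording above; the sketch is not a worked solution from the slides.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 1000.0    # required mean life in hours (assumed H0: mu >= 1000)
x_bar = 992.6    # sample mean life
sigma = 32.0     # population standard deviation
n = 40
alpha = 0.10

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(alpha)   # lower-tail critical value (negative)
p = NormalDist().cdf(z)                    # lower-tail p-value

reject_h0 = z < z_critical                 # same decision as p < alpha
print(f"z = {z:.2f}, critical value = {z_critical:.2f}, p-value = {p:.4f}")
print("reject H0 (reject the shipment):", reject_h0)
```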
Example: Two-Tailed Tests About a Population Mean: σ Known
• Glow Toothpaste:
• The production line for Glow toothpaste is designed to fill tubes with a mean
weight of 6 oz. Periodically, a sample of 30 tubes will be selected in order to
check the filling process.
• Quality assurance procedures call for the continuation of the filling process if
the sample results are consistent with the assumption that the mean filling
weight for the population of toothpaste tubes is 6 oz.; otherwise the process
will be adjusted.
• Assume that a sample of 30 toothpaste tubes provides a sample mean of
6.1 oz. The population standard deviation is believed to be 0.2 oz. Perform a
hypothesis test, at the .03 level of significance, to help determine whether
the filling process should continue operating or be stopped and corrected
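The Glow example can be sketched as a two-tailed test with H0: μ = 6 and Ha: μ ≠ 6, using the figures given above. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 6.0      # target mean filling weight (oz)
x_bar = 6.1     # sample mean
sigma = 0.2     # population standard deviation
n = 30
alpha = 0.03

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # area alpha/2 in each tail

reject_h0 = abs(z) > z_critical
print(f"z = {z:.2f}, critical value = ±{z_critical:.2f}, reject H0: {reject_h0}")
```

If H0 is rejected, the sample is not consistent with a mean filling weight of 6 oz, which under the stated procedure means the process should be adjusted.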