COMM 215 Notes
1.1 Data
Element: a person, object, or other entity about which we wish to draw a conclusion.
Variable: a characteristic of a population or sample element.
● Quantitative: a variable having values that are numbers representing quantities.
Eg. Selling price, temperature, car mileage
● Qualitative: a variable having values that indicate into which of several categories a
population element belongs.
Eg. weather, gender, car color
Data set: facts and figures, taken together, that are collected for a statistical study.
● Cross-sectional data: data collected at the same point of time.
Eg. cell phone costs of different employees in June.
● Time series data: data collected over different time periods.
Eg. Temperature of each month
Traditional statistics
Traditional statistics consists of a set of concepts and techniques that are used to describe
populations and samples and to make statistical inferences about populations by using
samples.
Note: much of this book is devoted to traditional stats, but traditional stats is
sometimes not sufficient to analyze big data.
2 related extensions to help (chapter 1.5):
1. Business analytics: the use of traditional and newly developed stats methods,
advances in information systems, and techniques from management science to
continuously and iteratively explore and investigate past business
performance, with the purpose of gaining insight and improving business planning
and operations.
2. Data mining: the process of discovering useful knowledge in extremely large data
sets.
Processes
Sometimes we are interested in studying the population of all of the elements that will be or
could potentially be produced by a process.
Process: a sequence of operations that takes inputs (labor, materials, machines...) and
turns them into output (products, services, ...)
● Finite population: a population that contains a finite number of elements.
● Infinite population: a population that is defined so that there is no limit to the number
of elements that could potentially belong to the population.
Probability sampling
Probability sampling: sampling where we know the chance that each element in the
population will be included in the sample.
Note: if we employ probability sampling, the sample obtained can be used to make valid
stat inferences about the sampled population.
Non-probability sampling
● Convenience sampling: sampling where we select elements because they are easy
or convenient to sample.
● Voluntary response sample: sampling in which the sample participants self-select
(eg. polls used by television and radio). This type of sample overrepresents people with strong
opinions.
● Judgement sampling: sampling where an expert selects population elements that
he/she feels are representative of the population.
Qualitative variables
1. Ordinal: qualitative variable for which there is a meaningful ordering, or ranking of
the categories. Ordinal variables can be numerical or nonnumerical.
Eg. satisfaction rating from 0 to 5, or from “not satisfied” to “very satisfied”
2. Nominative: qualitative variable for which there is no meaningful ordering, or
ranking, of the categories.
Eg. colors of car, gender
Types of surveys
1. Phone survey (low response rate)
2. Mail survey (low response rate)
3. Web survey (low response rate)
4. Personal interview survey (high response rate)
Eg. Mall survey
Pie chart: a graphical display of data in categories made up of pie slices representing the
frequency, relative frequency, or percentage frequency of items in its corresponding
category.
Pareto charts: a bar chart of the frequencies or percentages for various types of defects.
These are used to identify opportunities for improvement.
Note: Pareto charts are sometimes plotted with a cumulative percentage line (up to 100%).
Frequency polygons: graphical display in which we plot points representing each class
frequency above their corresponding class midpoints and connect the points with lines.
Quantitative data
● Histogram
● Frequency polygons
● Stem-and-leaf
● Dot plot
● Ogive plot (cumulative)
● Bullet graph
Assignment 2
1. Pareto charts are frequently used to identify the most common types of defects.
2. A stem-and-leaf is best used to display the shape of the distribution.
3. 30 items are rejected daily by a manufacturer because of defects for the last 30 days.
How many classes should be used in constructing a histogram?
5
4. What would be the first class interval for the frequency histogram?
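The answer of 5 classes in question 3 is consistent with the 2^k rule (pick the smallest k with 2^k ≥ n). The notes do not say which rule was used, so that rule is an assumption here, and `num_classes` is a hypothetical helper name. A minimal sketch:

```python
def num_classes(n: int) -> int:
    """Smallest k such that 2**k >= n (the 2^k rule, assumed here)."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

# 30 daily measurements: 2**4 = 16 is too small, 2**5 = 32 is enough.
print(num_classes(30))
```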
Exam: among mean, median and mode, which is the best to use?
Median
For a positively skewed distribution, the mean will always be the highest estimate of
central tendency and the mode will always be the lowest estimate of central
tendency (assuming that the distribution has only one mode).
Empirical Rule
Tolerance interval: an interval of numbers that contains a specified percentage of the
individual measurements in a population.
Under a normal distribution:
● µ ± σ: 68.26%
● µ ± 2σ: 95.44%
● µ ± 3σ: 99.73%
Chebyshev’s Theorem
It allows us to find an interval that contains a specified percentage of the individual
measurements in the population.
Chebyshev’s theorem:
Consider any population that has mean µ and standard deviation σ. Then for any value of k
greater than 1, at least $100\left(1 - \frac{1}{k^2}\right)\%$ of the population measurements lie in the
interval [µ ± kσ].
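To see how conservative Chebyshev's bound is next to the Empirical Rule percentages, the bound can be evaluated for a few values of k. This is only a sketch; `chebyshev_min_percent` is a hypothetical name.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (2, 3):
    # k = 2 gives at least 75%, k = 3 gives at least about 88.89%
    print(k, round(chebyshev_min_percent(k), 2))
```

Both bounds are lower than the 95.44% and 99.73% of the Empirical Rule because Chebyshev's theorem holds for any distribution shape, not only the normal one.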
Z-score
Z-score (aka. Standardized value): the number of standard deviations that a measurement
is from the mean. The quantity indicates the relative location of a measurement within its
distribution.
● Positive z-score: x is above the mean
● Negative z-score: x is below the mean
Note: z-scores put measurements from distributions with different means and standard
deviations on a common scale, which makes them easier to compare.
Eg. Class A has an average of 65 and a standard deviation of 10; Class B has an average
of 80 and a standard deviation of 5. A student in Class A who scores 85 has the same relative
standing as a student in Class B who scores 90, because their z-scores are equal:
(85-65)/10 = 2 and (90-80)/5 = 2.
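The class comparison above can be checked with a short sketch (`z_score` is a hypothetical helper name):

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_a = z_score(85, mean=65, sd=10)  # student in Class A
z_b = z_score(90, mean=80, sd=5)   # student in Class B
print(z_a, z_b)  # 2.0 2.0
```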
Percentiles
● First quartile (Q1): 25th percentile
● Second quartile (median): 50th percentile
● Third quartile (Q3): 75th percentile
● Interquartile range (IQR) = Q3-Q1
1. Draw a box that extends from Q1 to Q3 and draw a vertical line at the median
2. Determine the values of the lower and upper limits.
a. Lower limit: Q1 - 1.5 IQR
b. Upper limit: Q3 + 1.5 IQR
3. Draw whiskers as dashed lines that extend below Q1 and above Q3.
a. Draw one whisker from Q1 to the smallest number that is between the lower
and upper limits.
b. Draw one whisker from Q3 to the largest number that is between the lower
and upper limits.
4. Any number that is less than the lower limit or greater than the upper limit is an
outlier. Plot each outlier with “*”.
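The limits, whisker ends, and outliers from the steps above can be sketched as follows. The data set and the quartile values passed in are hypothetical, and Q1 and Q3 are taken as given because different textbooks and software compute quartiles slightly differently.

```python
def box_plot_summary(data, q1, q3):
    """Limits, whisker ends, and outliers for a box plot, given Q1 and Q3."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    inside = [x for x in data if lower <= x <= upper]
    outliers = [x for x in data if x < lower or x > upper]
    return {
        "lower_limit": lower,
        "upper_limit": upper,
        "whisker_low": min(inside),   # smallest value between the limits
        "whisker_high": max(inside),  # largest value between the limits
        "outliers": outliers,
    }

data = [1, 5, 6, 7, 8, 9, 10, 25]  # hypothetical measurements
summary = box_plot_summary(data, q1=5.5, q3=9.5)
print(summary)
```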
Correlation coefficient: a measure of the strength of the linear relationship between x and y.
It lies between -1 and 1 and is independent of the units of x and y.
Least squares line: the line that minimizes the sum of the squared vertical differences
between points on a scatter plot and the line.
● Slope $b_1 = \frac{s_{xy}}{s_x^2}$
● Y-intercept $b_0 = \bar{y} - b_1\bar{x}$
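A minimal sketch of the least squares line, computing the slope as the sample covariance divided by the sample variance of x and the intercept from the two sample means. The data points are hypothetical and lie exactly on y = 1 + 2x, so the fitted line should recover that intercept and slope.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # hypothetical data on the line y = 1 + 2x
b0, b1 = least_squares_line(xs, ys)
print(round(b0, 6), round(b1, 6))
```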
Weighted mean = $\frac{\sum w_i x_i}{\sum w_i}$, where $x_i$ = the value of the ith measurement and $w_i$ = the weight applied to it
Assignment 3
Population or sample:
The question will specify it. If it says “the numbers are collected from a larger group”, then it
is a sample. If not specified, it is a population.
Histogram, standard deviation, box plot are a must for the exam.
Chapter 4 Probability and probability models
Probability models
Definition: a mathematical representation of a random phenomenon.
Types of random phenomenon:
● Experiment (Chap 4)
The probability model describing an experiment consists of
○ The sample space of the experiment
○ Procedure for calculating probabilities concerning the sample space
outcomes
● Random variable (Chap 6,7): a variable whose value is numeric and is determined
by the outcome of an experiment
The probability model describing a random variable is called probability
distribution, and consists of
○ Specification of the possible values of the random variable
○ Table, graph, or formula that can be used to calculate probabilities concerning
the values that the random variable might equal
Independent events
2 events A and B are independent iff:
1. 𝑃(𝐴 | 𝐵) = 𝑃(𝐴) or, equivalently,
2. P(B | A) = P(B)
Assume that P(A) and P(B) are greater than 0.
$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$
Contingency table:
● Marginal probability
Probability of the occurrence of 1 event.
● Joint probability
Probability of the occurrence of 2 or more events together.
Practice 4
1. A manager has just received the expense checks for 6 of her employees. She
randomly distributes the checks to the 6 employees. What is the probability
that exactly 5 of them will receive the correct checks?
0. If 5 employees receive their correct checks, the 6th must receive the correct check
as well. So the probability that exactly 5 receive the correct checks (and the 6th
receives a wrong one) is 0.
2. A group has 12 men and 4 women. If 3 people are selected at random from the
group, what is the probability that they are all men?
Probability = $\frac{{}_{12}C_3}{{}_{16}C_3} = \frac{220}{560} = 0.3929$
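The counting in question 2 can be checked with Python's built-in `math.comb`; the variable name is only illustrative.

```python
from math import comb

# Ways to pick 3 of the 12 men, over ways to pick any 3 of the 16 people
p_all_men = comb(12, 3) / comb(16, 3)
print(comb(12, 3), comb(16, 3), round(p_all_men, 4))  # 220 560 0.3929
```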
Quiz 4
1. Container 1 has 8 items, 3 of which are defective. Container 2 has 5
items, 2 of which are defective. If one item is drawn from each container,
what is the probability that only one of the items is defective?
$\frac{3}{8} \times \frac{3}{5} + \frac{5}{8} \times \frac{2}{5} = 0.475$
2. A family has two children. What is the probability that both are girls,
given that at least one is a girl?
Sample set = {BB,BG,GB,GG}
At least 1 girl: P(G1) = ¾
Both girls: P(GG)=¼
$P(GG \mid G1) = \frac{P(GG \cap G1)}{P(G1)} = \frac{1/4}{3/4} = \frac{1}{3}$
3. A lot contains 12 items, and 4 are defective. If three items are drawn at
random from the lot, what is the probability they are not defective?
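$\frac{\binom{8}{3}}{\binom{12}{3}} = \frac{8 \times 7 \times 6}{12 \times 11 \times 10} = 0.2545$; the number drawn (3) appears in both binomial
coefficients, so the 3! factors cancel.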
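The three quiz answers can be double-checked with exact fractions. This is a sketch, not the method used in class; question 2 is solved by listing the four equally likely families, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Q1: exactly one defective = (defective, good) or (good, defective)
p_one_defective = Fraction(3, 8) * Fraction(3, 5) + Fraction(5, 8) * Fraction(2, 5)

# Q2: enumerate BB, BG, GB, GG and condition on "at least one girl"
families = list(product("BG", repeat=2))
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]
p_both_given_one = Fraction(len(both_girls), len(at_least_one_girl))

# Q3: all 3 drawn items come from the 8 non-defective ones
p_none_defective = Fraction(comb(8, 3), comb(12, 3))

print(float(p_one_defective), p_both_given_one, round(float(p_none_defective), 4))
```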
Discrete random variable: when the possible values of a random variable can be counted
or listed by a finite number of possible values or by a countably infinite list.
Eg. The number of cars sold next month, x = 0,1,2,3…
Continuous random variable: when a random variable may assume any numerical value
in one or more intervals on the real number line. Not countable.
Eg. interest rate (%), time (s), temperature (F), weight (kg), car mileage (km/l)
2. $\sum_{\text{all } x} p(x) = 1$
$\mu_x = \sum_{\text{all } x} x\,p(x)$
Variance of DRV
$\sigma_x^2 = \sum_{\text{all } x} (x - \mu_x)^2\,p(x)$
Equivalent shortcut form: $\sigma_x^2 = \sum_{\text{all } x} x^2\,p(x) - \left(\sum_{\text{all } x} x\,p(x)\right)^2$
Standard deviation of DRV
$\sigma_x = \sqrt{\sigma_x^2}$
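A small sketch of the mean, variance, and standard deviation of a discrete random variable. The distribution used (number of heads in three fair coin tosses) is a hypothetical example, not one from the course.

```python
from math import sqrt

# Hypothetical distribution: x = number of heads in three fair coin tosses
dist = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

total = sum(dist.values())  # the probabilities must sum to 1
mean = sum(x * p for x, p in dist.items())
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
# Shortcut form: sum of x^2 p(x) minus the squared mean
shortcut = sum(x ** 2 * p for x, p in dist.items()) - mean ** 2
sd = sqrt(variance)

print(total, mean, variance, shortcut, round(sd, 4))
```

Both variance formulas give the same number, which is a quick way to check a hand calculation.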
Binomial tables: show the probability of x successes in n trials, with success rate p.
Mean: µ𝑥 = 𝑛𝑝
Variance: $\sigma_x^2 = npq$, where q = 1 - p
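The entries of a binomial table can be reproduced from the binomial formula $p(x) = \frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}$. The sketch below uses n = 5 and p = 0.1 purely for illustration; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.1
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = n * p                 # np
variance = n * p * (1 - p)   # npq

for x, prob in enumerate(probs):
    print(x, round(prob, 4))
```

With a small p, most of the probability sits at x = 0 and x = 1 and the tail stretches to the right.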
Assume:
1. The probability of the event’s occurrence is the same for any 2 intervals of equal
length
2. Whether the event occurs in any interval is independent of whether the event occurs
in any other non overlapping interval
The probability that the event will occur x times in a specified interval is:
$p(x) = \frac{e^{-\mu}\mu^x}{x!}$, where µ is the mean (or expected) number of occurrences of the
event in the specified interval, and e = 2.71828... is the base of Napierian (natural) logarithms.
Mean: $\mu_x = \mu$
Variance: $\sigma_x^2 = \mu$
Standard deviation: $\sigma_x = \sqrt{\mu}$
Where µ is the mean number of occurrences of an event over the specified interval of
time or space of interest.
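The probability formula above (the Poisson distribution) can be evaluated directly. In this sketch µ = 2 occurrences per interval is a hypothetical value and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial, sqrt

def poisson_pmf(x: int, mu: float) -> float:
    """Probability of exactly x occurrences when the mean number is mu."""
    return exp(-mu) * mu ** x / factorial(x)

mu = 2.0  # hypothetical mean number of occurrences per interval
p0 = poisson_pmf(0, mu)
# Summing far enough into the tail approximates the full distribution
total = sum(poisson_pmf(x, mu) for x in range(50))
mean = sum(x * poisson_pmf(x, mu) for x in range(50))
sd = sqrt(mu)

print(round(p0, 4), round(total, 6), round(mean, 6), round(sd, 4))
```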
Where:
$\binom{r}{x}$ is the number of ways x successes can be selected from the total of r successes
in the population.
$\binom{N-r}{n-x}$ is the number of ways n-x failures can be selected from the total of N-r failures
in the population.
$\binom{N}{n}$ is the number of ways a sample of size n can be selected from a population of size N.
Mean: $\mu_x = n\left(\frac{r}{N}\right)$
Variance: $\sigma_x^2 = n\left(\frac{r}{N}\right)\left(1 - \frac{r}{N}\right)\left(\frac{N-n}{N-1}\right)$
Note: if the population size N is “much larger” than the sample size n (at least 20
times larger), then making selections will not substantially change the probability of a
success. We can assume that the probability of a success stays essentially constant from
selection to selection, and the different selections are essentially independent of each other.
In this case, we can approximate the hypergeometric distribution by using the
binomial distribution:
$p(x) = \frac{n!}{x!\,(n-x)!}\,p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\left(\frac{r}{N}\right)^x \left(1 - \frac{r}{N}\right)^{n-x}$
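To see how close the binomial approximation gets, the two distributions can be compared side by side. This sketch assumes the standard hypergeometric formula $p(x) = \binom{r}{x}\binom{N-r}{n-x} / \binom{N}{n}$, built from the three counts described above; the values N = 1000, r = 300, n = 5 are hypothetical, with N equal to 200 times n.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, r: int, n: int) -> float:
    """P(x successes) when sampling n items without replacement."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x successes) in n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

N, r, n = 1000, 300, 5  # hypothetical population and sample
hyper = [hypergeom_pmf(x, N, r, n) for x in range(n + 1)]
binom = [binomial_pmf(x, n, r / N) for x in range(n + 1)]
max_gap = max(abs(h - b) for h, b in zip(hyper, binom))

for x in range(n + 1):
    print(x, round(hyper[x], 4), round(binom[x], 4))
```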
Assignment 6
1. A total of 50 raffle tickets are sold for a contest to win a car. If you
purchase one ticket, what are your odds against winning?
49 to 1
2. If p = .1 and n = 5, then the corresponding binomial distribution is:
Right skewed
3. If you were asked to play a game in which you tossed a fair coin three
times and were given $2 for every head you threw, how much would you
expect to win on average?
$3. The expected number of heads is E(x) = np = 3 × 0.5 = 1.5.
Money earned on average = 1.5 × $2 = $3
4. For a random variable X, the mean value of the squared deviations of its
values from their expected value is called its ________.
Variance
5. Which one of the following statements is not an assumption of the
binomial distribution?
Sampling with replacement
6. Which of the following is a valid probability value for a discrete random
variable?
0.2. (Between 0 and 1)
7. An insurance company will insure a $75,000 particular automobile make
and model for its full value against theft at a premium of $1500 per year.
Suppose that the probability that this particular make and model will be
stolen is .0075. Find the premium that the insurance company should
charge if it wants its expected net profit to be $2000.
-$75,000 x 0.0075 + Premium = $2000
$2562.5
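The premium in question 7 follows from setting expected net profit = premium - expected payout. A short check, with illustrative variable names:

```python
car_value = 75_000
p_theft = 0.0075
target_profit = 2_000

expected_payout = car_value * p_theft      # average loss per policy
premium = target_profit + expected_payout  # solve: premium - payout = target

print(round(expected_payout, 2), round(premium, 2))
```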
Chapter 7 Continuous random variables
Property of a CPD:
1. f(x) ≥ 0 for any value of x
2. The total area under the curve f(x) is equal to 1
If c and d are numbers on the real line with c < d, the equation describing the uniform distribution is
$f(x) = \frac{1}{d-c}$ for c ≤ x ≤ d
$f(x) = 0$ otherwise
Mean: $\mu_x = \frac{c+d}{2}$
Standard deviation: $\sigma_x = \frac{d-c}{\sqrt{12}}$
Eg. imagine the waiting time for an elevator is uniformly distributed between 0 and 4
minutes. The uniform distribution is f(x)=¼ for 0 ≤ 𝑥 ≤ 4, having the shape of a rectangle
with base 4-0 and height ¼ .
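For the elevator example, probabilities are just rectangle areas. A minimal sketch; `uniform_prob` is a hypothetical helper, and the 1 to 3 minute interval is chosen only for illustration.

```python
from math import sqrt

c, d = 0, 4             # waiting time is uniform between 0 and 4 minutes
height = 1 / (d - c)    # f(x) on [c, d]
mean = (c + d) / 2
sd = (d - c) / sqrt(12)

def uniform_prob(a: float, b: float) -> float:
    """P(a <= X <= b): base times height, clipped to the interval [c, d]."""
    lo, hi = max(a, c), min(b, d)
    return max(hi - lo, 0) * height

p_1_to_3 = uniform_prob(1, 3)  # chance of waiting between 1 and 3 minutes
print(height, mean, round(sd, 4), p_1_to_3)
```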
µ and σ are the mean and standard deviation of the population. e = 2.71828...
Note: We use a normal curve table to find areas (and thus probabilities) under the normal
curve.
Properties of the normal curve:
1. The shape of each normal distribution is determined by its mean and its standard
deviation.
2. The highest point on the normal curve is located at the mean µ , which is also
Note: Exponential and related Poisson distributions are useful in analyzing waiting lines or
queues.
Eg. Queuing theory attempts to determine the number of servers that strikes an optimal
balance between the time customers wait for service and the cost of providing service.
Quiz 4
1. Consider a normal population with a mean of 10 and a variance of 4.
Find P(X > 18).
≈ 0
z = (18-10)/2 = 4, since σ = √4 = 2. A z-value of 4 lies beyond the end of the normal
table, so the probability is approximately 0.
2. The relationship between the standard normal random variable, z, and
normal random variable, X, is that
the standard normal variable z counts the number of standard deviations that
the value of the normal random variable X is away from its mean.
3. The weight of a product is normally distributed with a mean of 5 ounces.
A randomly selected unit of this product weighs 7.1 ounces. The
probability of a unit weighing more than 7.1 ounces is .0014. The
production supervisor has lost files containing various pieces of
information regarding this process, including the standard deviation.
Determine the value of the standard deviation for this process.
P(x > 7.1) = 0.0014, so P(x ≤ 7.1) = 1 - 0.0014 = 0.9986.
Look at the normal table to find $z_{.0014} = 2.98$. Then σ = (7.1 - 5)/2.98 ≈ 0.70
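Questions 1 and 3 can be checked without a table using `statistics.NormalDist` from the Python standard library (Python 3.8+). This is a sketch; software gives a slightly more precise z than the table value of 2.98, so the last digits can differ.

```python
from statistics import NormalDist

# Q1: mean 10, variance 4, so the standard deviation is 2
X = NormalDist(mu=10, sigma=2)
p_above_18 = 1 - X.cdf(18)  # tiny but not exactly 0

# Q3: find z with area 0.9986 to its left, then solve z = (7.1 - 5) / sigma
z = NormalDist().inv_cdf(1 - 0.0014)
sigma = (7.1 - 5) / z

print(p_above_18, round(z, 2), round(sigma, 2))
```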
Midterm questions
1. From a population of size 2,000, a random sample of 200 items is selected. The
mean of the sample:
Can be larger, smaller or equal to the population mean
2. When a class interval is expressed as: 100 to under 200, it implies that:
An observation with a value of 100 belongs to this class, while an observation with a value of 200 does not
3. Consider a statistic defined as the distance between the 33rd percentile and
67th percentile. This statistic would give us information concerning:
Variability
4. Long question:
Let s = the sum of the returns from 2 projects, find $p(s \ge 18{,}000 \mid s \ge 12{,}000)$.
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000)}{p(s \ge 12{,}000)} = \frac{p(s \ge 18{,}000)}{p(s \ge 12{,}000)}$,
because if s ≥ 18,000 then s ≥ 12,000, so
$p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000) = p(s \ge 18{,}000)$
$p(s \ge 12{,}000) = p(6)p(6) + p(18)p(18) + p(6)p(18) + p(18)p(6) = 0.7569$
$p(s \ge 18{,}000) = p(18)p(18) + p(18)p(6) + p(6)p(18) = 0.3344$
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{0.3344}{0.7569} = 0.4418$
5. For a positively skewed distribution, the mean will always be the highest estimate of
central tendency and the mode will always be the lowest estimate of central
tendency (assuming that the distribution has only one mode).
In a right skewed distribution: Mode -> median -> mean
Chapter 8 Sampling distributions
Note: one purpose of the sampling distribution of $\bar{x}$ is to tell how accurate the sample mean is likely to be as a point
estimate of the population mean. But when the population is large, it is hard to tell.
2. Mean $\mu_{\bar{x}} = \mu$: the sampling distribution of $\bar{x}$ has a mean $\mu_{\bar{x}}$ equal to the population
mean.
3. Standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, if the sampled population is infinite or at least 20 times the
sample size.
Note: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ means that if the sample size n > 1, the SD of the sampling distribution
is smaller than the SD of the population. See the spread of the graph below.
The larger the sample size n, the smaller the spread of the sampling distribution around
the population mean µ, so it is more likely to obtain a sample mean that is near the
population mean.
Note:
● the larger the sample size n is, the more nearly normally distributed is the
population of all possible sample means.
● The more skewed the probability distribution of the sampled population, the
larger the sample size must be for the population of all possible sample means to be
approx. normally distributed.
● As the sample size increases, the spread of the distribution of all possible sample
means decreases (ie. the spread is measured by $\sigma_{\bar{x}}$, so $\sigma_{\bar{x}}$ decreases as well)
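The shrinking spread can be seen by evaluating σ/√n for a few sample sizes. The population standard deviation of 10 is a hypothetical value and `standard_error` is a hypothetical helper name.

```python
from math import sqrt

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
sizes = (1, 4, 25, 100)
errors = [standard_error(sigma, n) for n in sizes]

for n, se in zip(sizes, errors):
    print(n, se)
```

Quadrupling the sample size only halves the spread, because n enters through its square root.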
3. Has standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
Note: n should be considered large if both np and n(1-p) are at least 5.
Chapter 9 Confidence intervals
Confidence level: the percentage of time that a confidence interval would contain a
population parameter if all possible samples were used to calculate the interval.
Margin of error: the quantity that is added to and subtracted from a point estimate of a pop
parameter to obtain a confidence interval for the parameter.
Eg. [𝑥 ± 𝑚𝑎𝑟𝑔𝑖𝑛 𝑜𝑓 𝑒𝑟𝑟𝑜𝑟]
When the sample size ≥ 30, you are safe to use t-table.
Note: z-table and t-table values are nearly the same when the number of degrees of freedom (df) is large. It is reasonable to
approximate the value of $t_\alpha$ by $z_\alpha$ when df is greater than 100.
Note: if both np and n(1-p) are larger than 5, you can use z-table.
Quiz 5
1. The width of a confidence interval will be
a. Narrower for 99% confidence than 95% confidence
b. Wider for a sample size of 100 than for size of 50
c. narrower for 90% confidence than 95% confidence
d. Wider when the sample s is small than when s is large
2. The internal auditing staff of a local manufacturing company performs a sample audit
each quarter to estimate the proportion of accounts that are current (between 0 and 60
days after billing). The historical records show that over the past 8 years 70 percent of
the accounts have been current. Determine the sample size needed in order to be
95% confident that the sample proportion of the current customer accounts is
within .03 of the true proportion of all current accounts for this company.
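$n = \frac{(z_{\alpha/2})^2\,p(1-p)}{E^2} = \frac{(z_{0.025})^2 \times 0.7 \times 0.3}{0.03^2} = \frac{1.96^2 \times 0.21}{0.0009} \approx 896.4$, rounded up to 897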
3. In the case where E is not given: if the interval is [100, 200], E = (200 - 100)/2 = 50, since the
point estimate sits at the midpoint of the interval.
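The sample size in question 2 and the margin of error in question 3 can be reproduced as follows. This is a sketch: `sample_size_for_proportion` is a hypothetical helper, and it takes the z-value from `statistics.NormalDist` rather than from the table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(p: float, E: float, confidence: float) -> int:
    """Smallest n so the sample proportion is within E of p at the given confidence."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for 95%
    return ceil(z ** 2 * p * (1 - p) / E ** 2)  # always round up

n = sample_size_for_proportion(p=0.7, E=0.03, confidence=0.95)

# Margin of error recovered from an interval: half of its width
E_from_interval = (200 - 100) / 2

print(n, E_from_interval)
```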
Chapter 10 Hypothesis testing
P-value
● If p-value < α, the test statistic z is in the rejection area (reject $H_0$)
● If p-value > α, the test statistic z is not in the rejection area (do not reject $H_0$)
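A sketch of the decision rule for a right-tailed z test. The notes do not say which kind of test is meant, so the right tail, α = 0.05, and the two z-values below are all assumptions made for illustration.

```python
from statistics import NormalDist

def right_tail_p_value(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 1 - NormalDist().cdf(z)

alpha = 0.05  # assumed significance level

for z in (2.0, 1.0):  # hypothetical test statistics
    p_value = right_tail_p_value(z)
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(z, round(p_value, 4), decision)
```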
Chapter 13 Chi-square tests
Goodness of fit:
Condition: E=np > 5.
If np < 5, we need to use a bigger sample size.
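$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed frequency and $E_i$ the expected frequency of category i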
4. Reject or don’t reject 𝐻0
Note: Never put the numbers collected from samples in the hypotheses! The sample
numbers go in the observed data. So don’t put P1 = 253/1200 in the hypothesis; use the
hypothesized values given in the question text instead:
$H_0$: P1 = P2 = P3 = P4 = P5 = 0.2
The rejection area is on the right side; for anything to its left, $H_0$ is not rejected.
The chi-square statistic basically measures how far the observed frequencies are from the
frequencies expected under the hypothesis.
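A goodness-of-fit sketch for the 5-category hypothesis above, using the standard statistic (sum of (observed - expected)² / expected). Only the first count (253 out of 1200) comes from the notes; the other four counts are hypothetical, and α = 0.05 is assumed. The critical value 9.488 is the chi-square table value for 4 degrees of freedom at that level.

```python
observed = [253, 230, 245, 232, 240]  # only 253 is from the notes; rest hypothetical
hypothesized = [0.2] * 5              # H0: P1 = P2 = P3 = P4 = P5 = 0.2

n = sum(observed)
expected = [n * p for p in hypothesized]
condition_ok = all(e > 5 for e in expected)  # E = np > 5 for every category

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 9.488  # chi-square table, df = 5 - 1 = 4, alpha = 0.05 (assumed)
reject_h0 = chi_square > critical_value  # rejection area is on the right

print(n, round(chi_square, 4), reject_h0)
```

Note that the hypothesized proportions (0.2) go into the expected counts, while the sample counts appear only in `observed`.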
Note: if the 2 variables are independent, the proportions should be approximately evenly
distributed:
Age A B C
0-10 ≅33%
10-30 ≅33%
>30 ≅33%
Chapter 13
Nov 16 Class
Coefficient slope B1
Chapter 15
For multiple tests, use the F-test.