Statistics
- a science of conducting studies to collect, organize, summarize, analyze, present, interpret, and draw conclusions from data
- used to analyze the results of surveys and as a tool in scientific research to make decisions based on controlled experiments. Other uses of statistics include operations research, quality control, estimation, and prediction
4 Types of Measurement Scales (continued)
2. Ordinal - classifies data into categories that can be ranked, but precise differences between the ranks do not exist; judging (1st, 2nd), rating scale (poor, excellent)
3. Interval - ranks data, and precise differences between units of measure exist; there is no meaningful zero (IQ, temperature)
4. Ratio - possesses all the characteristics of interval measurement, and a true zero exists. True ratios exist when the same variable is measured on two different members of the population; height, weight, time, salary, age
Variable - a characteristic or attribute under study that can assume different values
Random Variables - variables whose values are determined by chance
Data - values (observations or measurements) that the variables can assume
Data Set - a collection of observations (data values) on one or more variables
Population - consists of all subjects (human, etc.) that are being studied
Sample - a group of subjects selected from a population
2 Main Areas of Statistics
1. Descriptive statistics - the collection, organization, summarization, and presentation of data. Tables, charts or graphs are used to organize and present data. Descriptive values such as the average score are used to summarize data.
2. Inferential statistics - generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. Make inferences from samples to populations.
Hypothesis testing - a decision-making process for evaluating claims about a population, based on information obtained from samples
Data Collection Methods
1. Surveys
a) Telephone - less costly, more candid, not face-to-face. Disadvantages: some don't have phones or will not answer, unlisted numbers
b) Mailed Questionnaires - less expensive to conduct, respondents can remain anonymous. Disadvantages: low number of responses, inappropriate answers to questions; some people may have difficulty reading or understanding the questions
c) Personal Interview - obtain in-depth responses. Disadvantages: interviewers must be trained in asking questions and recording responses; interviewer may be biased
2. Surveying records
3. Direct observation of situations
Reasons for Using Samples
1. Saves time and money
2. Enables the researcher to get information that he or she might not be able to obtain otherwise
3. Enables the researcher to get more detailed information about a particular subject
Classifications of Variables
1. Quantitative - numerical and can be ordered or ranked (age, heights, weights, body temperatures)
a) Discrete - values that can be counted
b) Continuous - assume an infinite number of values between any two specific values; obtained by measuring and often include fractions and decimals
2. Qualitative - variables that can be placed into distinct categories, according to some characteristic or attribute (gender, religion, geographic locations)
Measuring Variables - to establish relationships between variables; observe the variables and measure/record their observations.
Scale of measurement - a set of categories for measuring a variable and a process that classifies each individual into one category
4 Types of Measurement Scales
1. Nominal level of measurement - classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data; gender, zip code, eye color, nationality, religion
4 Basic Sampling Techniques
1. Random sampling - subjects are selected by random numbers from calculators, computers, or tables; for a sample of size n, all possible samples of this size have an equal chance of being selected from the population.
Limitation: if the population is extremely large, it is time consuming to number and select the sample elements
Methods for Random Sampling
a) Fish bowl - number each element of the population, place the numbers on cards in a hat or fishbowl, mix them, and select the sample by drawing the cards
b) Random numbers - number the elements of the population sequentially and then select each element by using random numbers
2. Systematic random sampling - selecting every kth subject after the first subject is selected at random from 1 through k. The advantage of systematic sampling is the ease of selecting the sample elements.
3. Stratified random sampling - dividing the population into subgroups, called strata, and randomly selecting subjects within each group; ensures representation of all population subgroups that are important to the study. Disadvantages:
if there are many variables of interest, dividing a large population into representative subgroups requires a great deal of effort.
4. Cluster sampling - subjects are selected by using an intact group (cluster) that is representative of the population.
Advantages: a cluster sample can reduce costs, simplify fieldwork, and is convenient.
Disadvantage: the elements within a cluster may be homogeneous
Frequency Distribution and Graphs
Constructing a frequency distribution - most convenient method of organizing data
Frequency distribution - organization of raw data in table form, using classes and frequencies; way of presenting a summary of the data that shows
a) possibility of seeing patterns or relationships in data
b) how many times each data point (observation/outcome) occurs in a data set
Components of frequency distribution table
Class - quantitative/qualitative category into which each raw data value is placed
Tally - data recorded in the sequence in which they are collected, before they are processed/ranked
Frequency - number of data values contained in a specific class
1. Qualitative variable (ordinal/nominal data)
a) Class, tally, frequency, percent
2. Quantitative variable (numerical data)
a) Class limit, class boundaries - numbers used to separate the classes so there are no gaps in the frequency distribution; tally, frequency
Basic Rules: Constructing “Class” in the Frequency Distribution (rules 4 to 6)
4. Classes must be mutually exclusive - non-overlapping class limits so that data cannot be placed into two classes
5. Classes must be continuous - no gaps in frequency distribution
6. Classes must be exhaustive - enough to accommodate all the data
Reasons for constructing a frequency distribution
1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread
4. To enable the researcher to draw charts and graphs to present data
5. To enable the reader to compare different data sets
Types of Frequency Distribution
1. Categorical Frequency Distribution - used for data that can be placed in specific categories, such as nominal/ordinal level data.
2. Grouped Frequency Distributions - used when the range of the data is large; the data must be grouped into classes that are more than one unit in width.
3. Ungrouped Frequency Distribution - used when the range of the data values is relatively small; a frequency distribution can be constructed using single data values for each class
4. Cumulative Frequency Distribution - gives total # of values that fall below the upper boundary of each class. Values are found by adding the frequencies of classes less than or equal to the upper class boundary of a specific class (ascending cumulative frequency)
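As a rough illustration of the ungrouped and cumulative frequency distributions described above, the sketch below builds a small table of class, frequency, percent, and ascending cumulative frequency. The function name `frequency_table` and the quiz scores are hypothetical, chosen only for this example.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (class, frequency, percent, cumulative frequency).

    Each distinct data value is its own class (ungrouped distribution),
    and rows are sorted so the cumulative frequency is ascending.
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cls in sorted(counts):
        freq = counts[cls]
        cumulative += freq  # running total of frequencies so far
        rows.append((cls, freq, 100 * freq / total, cumulative))
    return rows


# Hypothetical quiz scores (assumed data, not from the notes)
scores = [3, 5, 4, 3, 5, 5, 2, 4, 5, 3]
for row in frequency_table(scores):
    print(row)
```

The last cumulative frequency should always equal the number of data values, which is a quick check that no value was left out of a class.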
[Table: Sample of Frequency Distribution Table]
Constructing statistical charts and graphs - most useful method of presenting the data
Uses of graphs in statistics
1. Convey data to viewers in pictorial form
2. Useful in getting the audience's attention in a presentation
3. Describe/analyze data set
4. Discuss an issue, reinforce a critical point, summarize data set
5. Discover trends/patterns in a situation
Frequency Distribution Graphs
• X axis - score categories (X values)
• Y axis - frequencies
• Histogram or a polygon - when the score categories have numerical scores from an interval or ratio scale
Basic Rules: Constructing “Class” in the Frequency Distribution
1. There should be 5-20 classes
2. Class limits should have the same decimal place value as the data
a) Class boundaries should have one additional place value and end in a 5
Lower limit - 0.5 = lower boundary
Upper limit + 0.5 = upper boundary
3. Classes must be equal in width - found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class, or the corresponding boundaries if boundaries are given. Find the class width by dividing the range by the number of classes
* don't subtract the limits of a single class; this gives an incorrect answer
* the researcher decides how many classes to use and the width of each class
Sturges' Rule - determines the number of classes to use in a histogram or frequency distribution table
$k = 1 + 3.322 \log_{10} n$
k = number of classes
n = size of the data
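To make Sturges' rule and the class-width rule concrete, here is a minimal sketch. Rounding both results up to a whole number is an assumption of this example (the notes do not say how to round), and the function names are hypothetical.

```python
import math


def sturges_classes(n):
    """Number of classes from Sturges' rule, k = 1 + 3.322 * log10(n).

    Rounded up to a whole class (rounding choice is an assumption).
    """
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(values, k):
    """Class width: range divided by the number of classes, rounded up."""
    data_range = max(values) - min(values)
    return math.ceil(data_range / k)


# Hypothetical data set of 50 values whose lowest value is 12 and highest is 95
k = sturges_classes(50)
width = class_width([12, 95], k)
print(k, width)
```

Under these assumptions a data set of 50 values gets 7 classes, and a range of 83 gives a class width of 12.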
3. Mode (continued) - reported along with the mean or the median
Modal class - the mode for grouped data; the class with the largest frequency
1. Unimodal - a data set that has only one value that occurs with the greatest frequency
2. Bimodal - a data set that has two values that occur with the same greatest frequency; both values are considered to be the mode
3. Multimodal - a data set that has more than two values that occur with the same greatest frequency; each value is used as the mode
Central Tendency and the Shape of the Distribution
1. Symmetrical (Normal) Distribution - the data values are evenly distributed on both sides of the mean. When the distribution is unimodal, the mean, median, and mode are the same and are at the center of the distribution
Commonly Used Graphs
1. Histogram - contiguous vertical bars of various heights (frequencies)
2. Frequency polygon - lines that connect points plotted for the frequencies
3. Ogive or Cumulative Frequency graph - represents the cumulative frequencies; visually represents how many values are below a certain upper class boundary
Constructing Statistical Graphs
1. Draw and label x and y axes
2. Choose a suitable scale and label it on the y axis
3. Represent the class boundaries on the x axis
4. Plot the points and draw the bars or lines
Relative Frequency Graphs - used when the proportion of data values is more important than the actual number of data values
To convert a frequency into a proportion or relative
frequency, divide the frequency for each class by the total
of the frequencies. The sum of the relative frequencies
will always be 1
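The conversion from frequencies to relative frequencies can be sketched as follows. The class labels and counts are hypothetical, and `relative_frequencies` is a name chosen for this example.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total of the frequencies."""
    total = sum(frequencies.values())
    return {cls: freq / total for cls, freq in frequencies.items()}


# Hypothetical blood type counts (assumed data), total of 25
freqs = {"A": 5, "B": 7, "O": 9, "AB": 4}
rel = relative_frequencies(freqs)
for cls, proportion in rel.items():
    print(cls, proportion)

# The proportions add up to 1, apart from floating-point rounding
print(sum(rel.values()))
```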
Other Types of Graph
1. Bar graph - vertical or horizontal bars whose heights or lengths represent the frequencies of the data
2. Pareto chart - frequency distribution for a categorical variable; frequencies are displayed by vertical bars, arranged in order from highest to lowest
3. Time series graph - represents data that occur over a specific period of time; look for trends/patterns
4. Pie graph - circle divided into sections or wedges according to the percentage of frequencies; used for nominal/categorical data
Central Tendency and the Shape of the Distribution (continued)
2. Positively Skewed or Right-skewed Distribution - majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the “tail” is to the right. The mean is to the right of the median, and the mode is to the left of the median
3. Negatively Skewed or Left-skewed Distribution - majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left. The mean is to the left of the median, and the mode is to the right of the median
* When a distribution is extremely skewed, the value of the mean will be pulled toward the tail
Data Distribution
Measures of Central Tendency
Central tendency - descriptive statistical measure that determines a single value that best describes the center and represents the entire distribution; condenses a large set of data into a single value
- goal is to identify the single value that is the best representative for the entire set of data
Statistic - a characteristic or measure obtained by using the data values from a sample
Parameter - a characteristic or measure obtained by using all the data values from a specific population
1. Mean - most commonly used measure of central tendency; balance point of the distribution; sum of the values divided by the total number of values
2. Median - midpoint of the list where scores in a distribution are listed from smallest to largest; a more appropriate measure of central tendency than the mean when the distribution is skewed; divides the scores so that 50% of the scores in the distribution have values that are equal to or less than the median
3. Mode - most frequently occurring category or score in the distribution or in the data set; peak or high point of the distribution
Central Tendency and Variability - two primary values that are used to describe a distribution of scores
Central tendency - the central point of the distribution
Variability - descriptive statistic that describes how the scores are scattered around that central point; determined by measuring distance
- inferential statistic that describes how accurately any individual score or sample represents the entire population
3. Quartiles (continued) - Q2 is the same as the 50th percentile, or the median; Q3 corresponds to the 75th percentile
4. Interquartile Range (IQR) - difference between Q3 and Q1 (IQR = Q3 - Q1); the range of the middle 50% of the data; used to identify outliers, and as a measure of variability in exploratory data analysis (EDA)
5. Deciles - divide the distribution into 10 groups, denoted by D1, D2, etc. Deciles can be found by using the formulas given for percentiles
Relationships Among Percentiles, Deciles, and Quartiles
• Deciles are denoted by D1, D2, D3, ... and they correspond to P10, P20, P30, ...
• Quartiles are denoted by Q1, Q2, Q3 and they correspond to P25, P50, P75
• The median is the same as P50 or Q2 or D5
Measures of Variation
1. Range - total distance covered by the distribution, from the highest score to the lowest score
R = highest value - lowest value
2. Variance ($\sigma^2$ or $s^2$) - average of the squares of the distance each value is from the mean
Population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
X = individual value
$\bar{X}$ = sample mean
$\mu$ = population mean
n = sample size
N = population size
3. Standard Deviation ($\sigma$ or $s$) - standard distance between a score and the mean; square root of the variance
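The measures of central tendency and of variation defined in these notes can be checked on a small data set with Python's standard `statistics` module. This is only a sketch: the scores are hypothetical, and treating them once as a population and once as a sample is done purely to show the N versus n - 1 divisor.

```python
import statistics

# Hypothetical scores (assumed data)
scores = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(scores)            # sum of values / number of values
median = statistics.median(scores)        # midpoint of the sorted list
modes = statistics.multimode(scores)      # all values with the greatest frequency

value_range = max(scores) - min(scores)   # R = highest value - lowest value
pop_var = statistics.pvariance(scores)    # population variance, divides by N
samp_var = statistics.variance(scores)    # sample variance, divides by n - 1
pop_sd = statistics.pstdev(scores)        # square root of the population variance
samp_sd = statistics.stdev(scores)        # square root of the sample variance

print(mean, median, modes)
print(value_range, pop_var, samp_var, pop_sd, samp_sd)
```

The sample variance comes out larger than the population variance for the same numbers because the sum of squared distances is divided by n - 1 instead of N.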
Uses of Variance and Standard Deviation
1. To determine the spread of the data.
2. To determine the consistency of a variable
3. To determine the number of data values that fall within a specified interval in a distribution
4. Used quite often in inferential statistics.
Coefficient of Variation (CVar) - statistic that allows comparison of standard deviations when the units are different; the standard deviation divided by the mean, with the result expressed as a percentage
For samples: $CVar = \frac{s}{\bar{X}} \cdot 100\%$
For populations: $CVar = \frac{\sigma}{\mu} \cdot 100\%$
Exploratory (Descriptive) Data Analysis, EDA - to examine data to find out what information can be discovered about the data such as the center and the spread
Stem-and-Leaf Plot - data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes. Leading digit (stem), trailing digit (leaf), frequency
Boxplot (Box and Whisker Plot) - graph of a data set obtained by drawing: the lowest value of the data set (minimum), Q1, the median, Q3, the highest value of the data set (maximum)
Comparing Boxplots for Two or More Data Sets - to compare the averages, use the location of the medians. To compare the variability, use the interquartile range or the length of the boxes.
Probability and Counting Rules
Probability - the chance of an event occurring
Basic Concepts of Probability
1. Probability Experiments - a chance process that generates a set of data or well-defined results called outcomes
2. Outcome - the result of a single trial of a probability experiment
3. Sample space (S) - set of all possible outcomes of a statistical experiment
Tree Diagram - used to determine all possible outcomes of a probability experiment
Measures of Position - used to locate the relative position of a data value in the data set
1. Standard score (z-score) - tells how many standard deviations a data value is above or below the mean for a specific distribution of values
a) if the z-score is 0, the data value is the same as the mean
b) if the z-score is (+), the score is above the mean
c) if the z-score is (-), the score is below the mean
When all data for a variable are transformed into z-scores, the resulting distribution will have a mean of 0 and a standard deviation of 1
$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
2. Percentile - divides the data set into 100 equal groups
$\text{percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
3. Quartiles - divide the distribution into four groups, separated by Q1, Q2, Q3
Q1 is the same as the 25th percentile
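The z-score and percentile formulas from the Measures of Position notes can be tried out with a short sketch. The data are hypothetical and are treated as a whole population (so the population standard deviation is used), which is an assumption of this example; the function names are also made up for illustration.

```python
import statistics


def z_score(value, mean, sd):
    """z = (value - mean) / standard deviation."""
    return (value - mean) / sd


def percentile_rank(data, x):
    """percentile = ((number of values below x) + 0.5) / total * 100."""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) / len(data) * 100


# Hypothetical scores, treated as a population (assumption)
scores = [2, 4, 4, 5, 7, 8]
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)

z_scores = [z_score(x, mu, sigma) for x in scores]
print(z_scores)

# After the transformation the z-scores have mean 0 and standard deviation 1
print(statistics.mean(z_scores), statistics.pstdev(z_scores))

print(percentile_rank(scores, 7))
```

A negative z-score marks a value below the mean and a positive one a value above it, matching cases b) and c) in the notes.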
Classifications of Events
Event (E) - consists of a set of outcomes of a probability experiment
1. Independent - the first event does not affect the probability of the next event occurring
2. Dependent - the probability of the second event occurring depends on the first event
3. Complementary event ($\bar{E}$) - set of outcomes in the sample space that are not included in the outcomes of event E; E and $\bar{E}$ are mutually exclusive
$P(\bar{E}) = 1 - P(E)$
$P(E) + P(\bar{E}) = 1$
Multiplication Rule and Conditional Probability (continued)
a) Independent Events - the probability of both occurring is P(A and B) = P(A) x P(B)
b) Dependent Events - conditional probability P(B|A)
- the probability of both occurring is P(A and B) = P(A) x P(B|A)
Conditional Probability
The probability that event B occurs given that event A has already occurred:
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
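The complement rule, the multiplication rules, and the conditional probability formula can be verified on a tiny sample space. The sketch below assumes one roll of a fair six-sided die; the events and the helper `prob` are hypothetical and only serve the illustration. Exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (assumed example)
SAMPLE_SPACE = frozenset(range(1, 7))


def prob(event):
    """Outcomes in the event divided by outcomes in the sample space."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
low = {1, 2}

# Complement: P(not E) = 1 - P(E)
p_not_even = 1 - prob(even)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_gt3_given_even = prob(even & greater_than_3) / prob(even)

# Dependent events: P(A and B) = P(A) x P(B|A)
p_both_dependent = prob(even) * p_gt3_given_even

# Independent events: P(A and B) = P(A) x P(B) holds for "even" and "low"
independent = prob(even & low) == prob(even) * prob(low)

print(p_not_even, p_gt3_given_even, p_both_dependent, independent)
```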
Three Basic Interpretations of Probability
1. Classical Probability - relies on the sample space; assumes all outcomes are equally likely to occur; actual performance of experiment is not necessary; outcomes are obtained by observation and tree diagram
$P(E) = \frac{n(E)}{n(S)} = \frac{\text{number of outcomes in } E}{\text{total number of outcomes}}$
2. Empirical Probability - uses frequency distribution; outcomes are based on the frequency distribution and observation
$P(E) = \frac{f}{n} = \frac{\text{frequency for class}}{\text{total frequencies}}$
3. Subjective Probability - researcher makes an educated guess about the chance of an event occurring; experiment performance not needed; based on educated personal judgment/estimate, opinions and inexact information
Determination of the Number of Outcomes of Events
1. Fundamental Counting Rule - multiply the number of possibilities for each event ($k_1 \cdot k_2 \cdot k_3 \cdots k_n$)
2. Permutation - arrangement of n objects in a specific order
Permutation Rule - # of permutations of n objects taking r objects at a time; order is important
${}_nP_r = \frac{n!}{(n - r)!}$ where n! = n factorial
3. Combination - selection of distinct objects without order
Combination Rule - # of combinations of r objects selected from n objects; order is not important
${}_nC_r = \frac{n!}{(n - r)!\, r!}$
Probability Distribution - a relative frequency distribution of all possible outcomes of an experiment
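The fundamental counting rule and the permutation and combination rules can be sketched directly from their factorial formulas. The helper names and the numbers used are hypothetical; `math.perm` and `math.comb` (Python 3.8+) are printed alongside as a cross-check.

```python
import math


def permutations_count(n, r):
    """nPr = n! / (n - r)!  (order is important)."""
    return math.factorial(n) // math.factorial(n - r)


def combinations_count(n, r):
    """nCr = n! / ((n - r)! r!)  (order is not important)."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))


# Fundamental counting rule: multiply the possibilities for each event.
# Hypothetical example: 3 shirts, 2 pairs of pants, 4 hats.
outfits = math.prod([3, 2, 4])

print(outfits)
print(permutations_count(5, 2), math.perm(5, 2))
print(combinations_count(5, 2), math.comb(5, 2))
```

Each combination of 2 objects can be arranged in 2! orders, which is why the permutation count is the combination count multiplied by r!.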
Different Types of Probability Distribution
1. Probability Distribution of Discrete Variables - binomial, Poisson distribution
2. Probability Distribution of Continuous Variables - uniform, normal distribution
Random Variables - characteristic that varies from one component of a population to another; its values vary randomly or by chance
1. Discrete Random Variables - has a finite or countable number of values (0, 1, 2…)
2. Continuous Random Variables - has infinitely many values associated with measurements on a continuous scale where there are no gaps or interruptions (5, 5.1, 6.2…)
Four Basic Probability Rules
Probability Rule 1 - probability of any event is a number (fraction/decimal) between and including 0 and 1
$0 \le P(E) \le 1$
Probability Rule 2 - if event E can't occur, probability is 0
Probability Rule 3 - if event E is certain, probability is 1
Probability Rule 4 - sum of the probabilities of all outcomes in the sample space is 1
* Probability values range from 0 to 1
* When probability is near 0, occurrence is highly unlikely
* When probability is near 0.5, there is a 50-50 chance
* When probability is near 1, event is likely to occur
* When probability of an event/complement is known, the other can be found by subtracting the probability from 1
Rules in Solving Probability of Compound Events (2 or more)
1. Addition Rule
a) Mutually Exclusive Events - when two events A and B are mutually exclusive, P(A or B) = P(A) + P(B)
b) Non-mutually Exclusive - if A and B are not mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)
2. Multiplication Rule and Conditional Probability
Discrete Probability Distribution - table, graph, or mathematical expression that specifies all possible values (outcomes) of a random variable with their probabilities. It should satisfy the criteria:
1. $\sum P(x) = 1$
2. $0 \le P(x) \le 1$ for every value of x
where x is a discrete variable and P(x) is the probability of x
Mean of a Probability Distribution - expected value; typical value that represents the central location of a probability distribution
$\mu = \sum x P(x)$
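As a small check of the two criteria for a discrete probability distribution and of the expected-value formula, the sketch below uses the number of heads in two tosses of a fair coin. The example distribution and the function names are assumptions made for illustration; exact fractions avoid rounding issues when testing that the probabilities sum to 1.

```python
from fractions import Fraction


def is_valid_distribution(dist):
    """Check 0 <= P(x) <= 1 for every x and that the P(x) sum to 1."""
    probabilities = dist.values()
    return all(0 <= p <= 1 for p in probabilities) and sum(probabilities) == 1


def expected_value(dist):
    """Mean of a probability distribution: sum of x * P(x)."""
    return sum(x * p for x, p in dist.items())


# Hypothetical example: number of heads in two fair coin tosses
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print(is_valid_distribution(heads))
print(expected_value(heads))
```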
Variance and Standard Deviation of a Probability Distribution - measures the amount of spread in a distribution
$\sigma^2 = \sum [(x - \mu)^2 P(x)]$; the standard deviation $\sigma$ is the square root of the variance
Hypergeometric Random Variable - the number X of successes of a hypergeometric experiment
Probability mass function (pmf)
$P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$
where N = population size
K = # of success states in the population
n = # of draws
k = # of observed successes
$\binom{a}{b}$ is a binomial coefficient
pmf is positive when $\max(0, n + K - N) \le k \le \min(K, n)$
pmf satisfies a recurrence relation with starting value
$P(X = 0) = \frac{\binom{N - K}{n}}{\binom{N}{n}}$
Binomial Distribution - with parameters n and p, is the discrete probability distribution of the # of successes in a sequence of n independent experiments
4 Properties of Binomial Distribution
1. Fixed Number of Trials (n)
2. Two outcomes in a trial, success or failure
3. Trials are independent
4. Probability of success p remains constant
General Formula
$P(X = r) = {}_nC_r\, p^r q^{n - r}$
X ~ B(n, p)
X = random variable
n = # of trials
r = # of successes
p = probability of success
q = probability of failure
Mean and Variance
X ~ B(n, p)
mean: $\mu = E(X) = np$
variance: $\sigma^2 = \mathrm{Var}(X) = npq$
where $q = 1 - p$
Mode - of a binomial B(n, p) distribution
$\lfloor (n + 1)p \rfloor$ if $(n + 1)p$ is 0 or a noninteger
$(n + 1)p$ and $(n + 1)p - 1$ if $(n + 1)p \in \{1, \dots, n\}$
$n$ if $(n + 1)p = n + 1$
Median - no formula to find the median for a binomial distribution
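The binomial general formula and the mean, variance, and mode results can be checked numerically. In this sketch n = 10 and p = 0.5 are hypothetical values, and `binomial_pmf` is a name chosen for the example; the mean and variance are computed from the probabilities themselves so they can be compared with np and npq.

```python
import math


def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, r) * p**r * q ** (n - r)


# Hypothetical parameters: 10 independent trials, success probability 0.5
n, p = 10, 0.5
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(probs))                    # compare with n * p
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(probs))  # compare with n * p * q
mode = max(range(n + 1), key=lambda r: probs[r])                    # compare with floor((n + 1) * p)

print(binomial_pmf(5, n, p))
print(mean, n * p)
print(variance, n * p * (1 - p))
print(mode, math.floor((n + 1) * p))
```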
Multinomial Distribution - used to compute probabilities in situations that have more than 2 possible outcomes
1. Statistical experiment with k outcomes
2. Repeated independently n times
$P = \frac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$
where P = probability
n = total # of events
n1 = # of times outcome 1 occurs
n2 = # of times outcome 2 occurs
nk = # of times outcome k occurs
p1 = probability of outcome 1
p2 = probability of outcome 2
pk = probability of outcome k
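A minimal sketch of the multinomial formula follows. The three outcome probabilities and the counts are hypothetical, and `multinomial_probability` is a name made up for this example.

```python
import math


def multinomial_probability(counts, probs):
    """P = n! / (n1! n2! ... nk!) * p1^n1 * p2^n2 * ... * pk^nk."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for c in counts:
        coefficient //= math.factorial(c)  # stays an exact integer at each step
    probability = float(coefficient)
    for c, p in zip(counts, probs):
        probability *= p**c
    return probability


# Hypothetical experiment: 5 trials, three outcomes with p = 0.5, 0.3, 0.2,
# observed 2, 2, and 1 times respectively
result = multinomial_probability([2, 2, 1], [0.5, 0.3, 0.2])
print(result)
```

With only two outcomes the same function reduces to the binomial general formula, since n! / (r! (n - r)!) is nCr.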
Hypergeometric Distribution - discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a population of size N that contains exactly K objects classified as successes, wherein each draw is either a success or a failure
Conditions Characterizing Hypergeometric Distribution
1. The result of each draw can be classified into one of two mutually exclusive categories (Pass/Fail, True/False)
2. The probability of a success changes on each draw, as each draw decreases the population (sampling without replacement from a finite population)
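The hypergeometric probability of k successes in n draws can be sketched with binomial coefficients, using P(X = k) = C(K, k) C(N - K, n - k) / C(N, n) and returning 0 outside the range max(0, n + K - N) <= k <= min(K, n). The population numbers below are hypothetical, and the function name and argument order are choices made for this example.

```python
import math


def hypergeometric_pmf(k, N, K, n):
    """Probability of k successes in n draws without replacement.

    N = population size, K = number of successes in the population.
    """
    if k < max(0, n + K - N) or k > min(K, n):
        return 0.0  # outside the range where the pmf is positive
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)


# Hypothetical population: 10 items, 4 of them defective, 3 drawn
N, K, n = 10, 4, 3
for k in range(n + 1):
    print(k, hypergeometric_pmf(k, N, K, n))
```

Because the draws are made without replacement, the probability of a success changes from draw to draw, which is what separates this case from the binomial distribution.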