
STK11O - Chapter 1-7 Notes

The document provides an overview of statistics, including definitions of key concepts such as data, variables, and observations, as well as different scales of measurement. It discusses methods for collecting data, summarizing it through various displays, and analyzing it using numerical measures and probability. Additionally, it covers graphical representations and measures of variability, location, and relationships between variables.

Uploaded by

u24594726

Chapter 1 | Data & Statistics

↪ What is statistics?
-​ The art and science of COLLECTING, ANALYZING, PRESENTING and INTERPRETING data

** stats makes SENSE of numbers + information **

> data = facts & figures


> data set = facts & figures collected for a study

a. Elements

- The entities on which data are collected

-​ From WHO are we collecting data


-​ From WHAT are we collecting data

b. Variables

-​ A characteristic of interest for the elements

-​ WHAT information are we collecting

c. Observations

-​ The set of measurements obtained for a particular element

-​ EVERYTHING collected from a single element


> an entire row of data

** the number of observations will be EQUIVALENT to the number of elements **

↪ Scale of Measurement
-​ This determines the amount of information contained in the data, indicating the most appropriate
form of summarization + analysis

- Involves nominal, ordinal, interval + ratio scales ( N.O.I.R )


Quantitative Data
- Numerical data ( CAN calculate )

> Continuous data
- this deals with FRACTIONS & DECIMALS ( measurable )

> Discrete data
- this deals with INTEGERS & WHOLE NUMBERS ( can be counted )

> quantitative data are measured on an Interval scale or a Ratio scale

Interval scale
- has NO true zero
E.g. 0°C doesn’t mean there is no temperature, it is only VERY COLD

Ratio scale
- has a TRUE ZERO ( zero means NONE of the quantity ), so ratios are meaningful
E.g. height, weight, cost

Qualitative / Categorical Data
- Non-numerical data ( CANNOT calculate )

Nominal scale
- there is NO order, just labelling items
E.g. yes / no , a / b

Ordinal scale
- has a meaningful rank
E.g. excellent, average, poor
-​ summarized data can be displayed using the ‘pivot table’

↪ Collecting Data

a. Cross-sectional data

-​ Data collected at the SAME time or same POINT in time

-​ Data collected once

b. Time series data

-​ Data collected over several periods of time

> where are data collected from?

1.​ Population
-​ A set of all elements of interest in a particular study

-​ A Large group of elements

2.​ Sample
-​ A subset of the population

-​ A smaller group of the population

> how can data be collected?

Descriptive Statistics / Analytics
- take data to DESCRIBE / ILLUSTRATE it

1. Queries
2. Reports
3. Data visualization
E.g. pie charts, bar graphs, frequency tables

Inferential Statistics / Analytics
- take data to PREDICT / CONCLUDE / ESTIMATE

1. Census
- A survey to collect data on an ENTIRE POPULATION
2. Sample survey
- A survey to collect data on a SAMPLE

↪ Data Sets

a. Big data

- Data that CANNOT be managed, processed or analyzed with available software in a reasonable amount of time

-​ Large amounts / variety of data

b. Data mining

-​ Process of using STATS + COMPUTER SCIENCE to obtain useful information from LARGE
databases
Chapter 2 | Tabular & Graphical Displays
↪ Summarizing Data

a. Frequency Distribution

- A tabular summary of data, showing the NUMBER of observations in each of several NON-OVERLAPPING categories / classes

E.g. different types of SOFT DRINKS, different types of SOUR SWEETS

b. Relative Frequency

- A tabular summary of data, showing the fraction / PROPORTION of observations in each of several NON-OVERLAPPING classes

> calculation = class frequency ➗ number of observations ( ‘n’ )


c. Percentage Frequency

- A tabular summary of data, showing the PERCENTAGE of observations in each of several NON-OVERLAPPING classes

> calculation = ( class frequency ➗ number of observations ) ✖️ 100
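The three calculations above can be sketched in a few lines of Python. This is a minimal illustration only; the list `purchases` is a made-up sample, not data from the notes.

```python
from collections import Counter

# Hypothetical sample: soft drink chosen by each of 10 customers
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Fanta", "Coke", "Sprite", "Pepsi"]

n = len(purchases)                                        # number of observations
frequency = Counter(purchases)                            # class -> count
relative = {c: f / n for c, f in frequency.items()}       # class frequency / n
percent = {c: 100 * f / n for c, f in frequency.items()}  # relative frequency x 100

for c in frequency:
    print(f"{c:7} {frequency[c]:3} {relative[c]:6.2f} {percent[c]:6.1f}%")
```

The relative frequencies always add up to 1 and the percentage frequencies to 100, which is a quick check on a hand calculation.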


↪ Data Visualization

a. Bar Chart

- A graphical device for DEPICTING categorical data that has been summarized

-​ Data can be made more obvious with the use of SORTED bar charts
-​ HOWEVER, does not provide exact information of data

b. Pie Chart

-​ A graphical device for PRESENTING summaries based on a subdivision of a circle into sectors,
corresponding to the relative frequency for each class
-​ Does provide a VISUAL REPRESENTATION
- HOWEVER, people do have difficulty perceiving differences in areas ( bar chart recommended )

↪ Quantitative Variable Summarization

There are steps necessary to define the classes of a frequency distribution with quantitative data

1.​ Determine the number of NON-OVERLAPPING classes

- these classes will SUMMARIZE data into groups


-​ should usually be between 5 - 20 classes

> a LARGER number of data ( more than 30 ) should use a LARGER number of classes
> a SMALLER number of data ( less than 30 ) will use 5 - 6 classes to summarize data

2.​ Determine the WIDTH of each class

- the width of each class should be the SAME

- can be rounded ( usually UP ) to a convenient WHOLE NUMBER

> calculation = ( largest data value ➖ smallest data value ) ➗ number of classes
3.​ Determine the class LIMIT

-​ Put in place so that each data items BELONGS to only ONE class

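> the LOWER CLASS LIMIT is the smallest possible data value assigned to the class

E.g. for the class 10 - 15, the lower class limit is 10

> the UPPER CLASS LIMIT is the largest possible data value assigned to the class

E.g. for the class 10 - 15, the upper class limit is 15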

4.​ Determine the class MIDPOINT


-​ The value halfway between the LOWER and UPPER class limits

> needs to be put in ascending order

E.g. the class midpoint of the class ‘10 - 15’ is 12.5
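The four steps can be followed with a short Python sketch. The data values are hypothetical, and the choices to round the width up and to start the first class at the smallest value are assumptions made for the illustration, not rules stated in the notes.

```python
import math

# Hypothetical data set of 20 whole-number values
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
num_classes = 5                                   # step 1 ( fewer than 30 values )

# step 2: approximate width, rounded UP to a whole number
width = math.ceil((max(data) - min(data)) / num_classes)

# step 3: class limits as ( lower, upper ) pairs for whole-number data
lower = min(data)
classes = [(lower + i * width, lower + (i + 1) * width - 1)
           for i in range(num_classes)]

# step 4: class midpoints, plus the frequency of each class
midpoints = [(lo + hi) / 2 for lo, hi in classes]
freq = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

for (lo, hi), m, f in zip(classes, midpoints, freq):
    print(f"{lo} - {hi}   midpoint {m}   frequency {f}")
```

Every data value falls in exactly one class, so the frequencies add up to the number of observations.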

This data can be represented with different data visualizations

a. Dot Plot

- A graphical device that summarizes data by the number of dots above each data value on the horizontal axis

b. Histogram

- A graphical display of a frequency distribution, relative frequency distribution or percent frequency distribution of quantitative data, constructed by:

> placing the class intervals on the horizontal axis


> placing the frequencies, relative frequency or percentage frequencies on the vertical axis

> has NO spaces between the bars, unlike a bar chart

> important for providing information about the data DISTRIBUTION

1.​ Can be skewed to the left


-​ The tail extends farther to the left

2.​ Can be skewed to the right


-​ The tail extends farther to the right

3.​ Can be symmetric


-​ The LEFT tail mirrors the shape of the RIGHT tail
↪ Cumulative Distributions

-​ A tabular summary of quantitative data showing the NUMBER of data values that are less than or
equal to the upper class limit of each class

> uses number of classes, class widths and class limits

a.​ Cumulative relative frequency distribution - a cumulative distribution showing the FRACTION
/ PROPORTION of data values

-​ SUM the relative frequencies


-​ DIVIDE the cumulative frequencies by TOTAL ITEMS

b. Cumulative percent frequency distribution - a cumulative distribution showing the PERCENTAGE of data values

- MULTIPLY the cumulative relative frequencies by 100
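A minimal sketch of the three cumulative distributions, assuming a hypothetical frequency distribution with five classes (the numbers are invented for the example):

```python
from itertools import accumulate

# Hypothetical frequency distribution, one frequency per class
upper_limits = [16, 21, 26, 31, 36]
freq = [7, 7, 3, 2, 1]
n = sum(freq)                               # total items

cum_freq = list(accumulate(freq))           # running total of the frequencies
cum_rel = [cf / n for cf in cum_freq]       # divide cumulative frequency by total items
cum_pct = [100 * r for r in cum_rel]        # multiply by 100

for ul, cf, cr, cp in zip(upper_limits, cum_freq, cum_rel, cum_pct):
    print(f"<= {ul}: {cf:2}  {cr:.2f}  {cp:.0f}%")
```

The last class always has a cumulative frequency equal to the total number of items, a cumulative relative frequency of 1 and a cumulative percentage of 100.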

↪ Stem and Leaf Display

-​ A graphical display used to show simultaneously the rank order and shape of a distribution of data

> it has 2 primary advantages

1.​ It is EASIER to construct by hand


2.​ Within a class interval, it provides more information than the HISTOGRAM ( shows actual
data )
↪ Summarizing data for TWO variables

a. Crosstabulation

-​ A tabular summary of data for two variables. One class variable is represented by the ROWS, the
other class variable is represented by the COLUMNS

> difficult to decide the number of classes to use when drawing up crosstabulation


1. Finds the relative frequency
- Frequency ➗ total data values

2. Finds the percentage frequency
- ( Frequency ➗ total data values ) ✖️ 100

> a.i. Crosstabulations can result in SIMPSON'S PARADOX

-​ Conclusions drawn from two or more separate crosstabulations that can be reversed when the
data are aggregated into a SINGLE crosstabulation

> two or more SEPARATE crosstabulations can be reversed when aggregated into a single
tabulation

> by UN-AGGREGATING the data - it can show who has a greater record, who has a greater
percentage total etc

- Investigating BOTH the aggregated and un-aggregated forms of the crosstabulations provides a BETTER insight + conclusion

( would a hidden variable affect the results - possibly providing a different or better insight and conclusion ? )
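Simpson's paradox is easier to see with numbers. The sketch below uses invented counts for two hypothetical salespeople, "A" and "B", split by a hidden variable (an easy region and a hard region). It is only an illustration of how a conclusion can reverse after aggregation.

```python
# Hypothetical ( successes, attempts ) for two salespeople, split by region
results = {
    "A": {"easy": (9, 10), "hard": (30, 100)},
    "B": {"easy": (80, 100), "hard": (2, 10)},
}

def rate(successes, attempts):
    return successes / attempts

# un-aggregated: one success rate per person per region
by_region = {p: {r: rate(*sa) for r, sa in regions.items()}
             for p, regions in results.items()}

# aggregated: both regions pooled into a single crosstabulation
overall = {p: rate(sum(s for s, _ in regions.values()),
                   sum(a for _, a in regions.values()))
           for p, regions in results.items()}

print(by_region)   # A has the higher rate in BOTH regions
print(overall)     # yet B has the higher rate overall
```

The reversal happens because A made most attempts in the hard region while B made most attempts in the easy region, so the region acts as the hidden variable.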

b. Scatter diagram
- A graphical display of the relationship between two quantitative variables

> uses a trendline - providing an approximation of the relationship between the two variables

> the stronger the relationship , the closer the dots to the trend line

c. Side-by-side Bar Charts

-​ A graphical display for depicting multiple bar charts on the same display

> shows the qualitative data of each class ( good, very good, excellent )

d. Stacked Bar Chart

-​ A bar chart broken into rectangular segments of a different color, showing the relative frequency of
each class
> shows a clear relationship between the variables

> shows the qualitative data of each class ( good, very good, excellent )

↪ Choosing the Type of Graphical Display

> in order to obtain an accurate DISPLAY and SUMMARY

1.​ For Distribution of Data

Bar chart
Pie chart
Dot plot
Histogram
Stem and leaf

2.​ For Comparisons

Side by side bar chart Stacked bar chart

3.​ To Show Relationships

Scatter diagram Trendline

> for a COLLECTIVE display, data dashboards are used


( a set of visual displays that ORGANIZE and PRESENT information to monitor the
performance of a company )
Chapter 3 | Numerical Measures
-​ Involves the 5 number summary

Smallest value, Q1, Median, Q3, Largest value

↪ Measures of Location

a. Mean

-​ A measure of central location computed by summing the data values + dividing by the number of
observations

> denoted by x̄ for samples


> denoted by μ for population

> affected by OUTLIERS ( skews data to left or right )

1)​ Outlier less than mean, skewed to the left ( negative value )
2)​ Outlier more than mean, skewed to the right ( positive value )

3)​ A small amount is SLIGHTLY SKEWED


4)​ A large amount is HEAVILY SKEWED

a i. Weighted Mean

-​ The mean obtained by assigning each observation a WEIGHT that reflects its importance

> sum of ( weight ✖️ x value ) ➗ sum of the weights ( frequencies can be used as the weights )


a ii. Geometric Mean

-​ A measure of location that is calculated by finding the nth root of the PRODUCT of n values
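The three means above can be compared with a small Python sketch. All the numbers are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

values = [10, 12, 14, 16, 98]                   # 98 is an outlier
mean_with_outlier = statistics.mean(values)     # pulled up towards the outlier
mean_without = statistics.mean(values[:-1])     # mean of the other four values

# weighted mean: sum( weight x value ) / sum( weights )
prices = [3.00, 3.40, 2.80]                     # hypothetical cost per unit
weights = [1200, 500, 2750]                     # hypothetical units bought
weighted_mean = sum(w * x for w, x in zip(weights, prices)) / sum(weights)

# geometric mean: nth root of the product of n values
growth_factors = [1.10, 0.95, 1.20]             # hypothetical yearly growth factors
geo_mean = statistics.geometric_mean(growth_factors)

print(mean_with_outlier, mean_without, round(weighted_mean, 2), round(geo_mean, 4))
```

The geometric mean is the natural choice for growth factors because the factors multiply rather than add.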

b. Median

-​ A measure of central location provided by the value in the MIDDLE when the data is arranged in
ascending order
> if there is an EVEN number of observations, average the two middle values

c. Mode

- A measure of location, defined as the value that occurs with the GREATEST frequency

d. Percentiles

- A value that provides information about how the data are spread over the interval from the SMALLEST to the LARGEST value

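Lp = ( p / 100 ) ✖️ ( n + 1 )
> this gives you the POSITION of the pth percentile in the data arranged in ascending order
> can be a decimal

percentile = value at position 1 ➕ decimal part ✖️ ( value at position 2 ➖ value at position 1 )
> position 1 is the whole-number part of Lp and position 2 is the next position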


> the 50th percentile is the MEDIAN

-​ Finds quartiles ( dividing data in four parts )

> includes Q1 , Q2 and Q3
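The location method for percentiles can be written as a short function. This is a sketch under stated assumptions: the function name `percentile`, the sample data and the clamping of positions that fall outside 1 to n are choices made for the illustration.

```python
def percentile(data, p):
    """p-th percentile using the position Lp = ( p / 100 ) x ( n + 1 )."""
    ordered = sorted(data)                     # ascending order
    n = len(ordered)
    position = (p / 100) * (n + 1)             # 1-based position, may be a decimal
    position = min(max(position, 1), n)        # assumption: clamp to a valid position
    whole = int(position)                      # position 1
    decimal = position - whole                 # decimal part of the position
    if whole == n:                             # no "position 2" beyond the last value
        return ordered[-1]
    low, high = ordered[whole - 1], ordered[whole]
    return low + decimal * (high - low)

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]   # hypothetical, n = 11

q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3, q3 - q1)
```

With n = 11 the quartile positions are 3, 6 and 9, so Q1, the median and Q3 are simply the 3rd, 6th and 9th ordered values.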

↪ Measures of Variability

-​ How is data SPREAD?

a. Range

-​ A measure of variability, defined to be the largest value MINUS the smallest value

b. Interquartile range

-​ A measure of variability, defined as the difference between the third and the first quartiles

> quartiles are percentiles ( e.g. Q1 is the 25th percentile, Q3 is the 75th percentile )

c. Variance

-​ A measure of variability based on the squared deviations of the data values about the mean

> standard deviation is the SQUARE ROOT of the variance

ci ) Coefficient of variation

-​ A measure of variability computed by dividing the standard deviation by the mean, multiplied by 100
> tells you that the standard deviation is x% of the value of the mean
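A minimal sketch of variance, standard deviation and coefficient of variation for a hypothetical sample. Note that `statistics.variance` computes the SAMPLE variance, dividing by n - 1.

```python
import statistics

sample = [46, 54, 42, 46, 32]               # hypothetical sample
mean = statistics.mean(sample)
variance = statistics.variance(sample)      # sum of squared deviations / ( n - 1 )
std_dev = statistics.stdev(sample)          # square root of the variance
cv = std_dev / mean * 100                   # standard deviation as a % of the mean

print(mean, variance, std_dev, round(cv, 1))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by n instead.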
d. z-Scores

- A standardized value, denoting the number of standard deviations a data value is FROM the mean

> z = ( x ➖ mean ) ➗ standard deviation

> a z-score of z means the data value is z standard deviations BELOW or ABOVE the mean

> negative score is BELOW the mean


> positive score is ABOVE the mean

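- Can find OUTLIERS

> a z-score below -3 or above +3 is usually treated as an outlier

> outliers can also be found with limits based on the IQR

> the first is the lower limit ( LL = Q1 - 1.5 ( IQR ) )

> the second is the upper limit ( UL = Q3 + 1.5 ( IQR ) )

> any value below LL or above UL is an outlier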

-​ Deals with CHEBYSHEV'S THEOREM

–> At least ( 1 - 1 / z^2 ) of the data values must be within z standard deviations of the mean, where z is any value greater than 1

a) AT LEAST 75% or 0.75 of data must be within 2 standard deviations

b) AT LEAST 89% or 0.89 of data must be within 3 standard deviations
c) AT LEAST 94% or 0.94 of data must be within 4 standard deviations
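The z-score and Chebyshev calculations can be checked with two small functions. The function names and the example mean and standard deviation are hypothetical.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations ( z > 1 )."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2

# hypothetical data set with mean 44 and standard deviation 8
print(z_score(32, 44, 8))                    # -1.5, so 1.5 standard deviations BELOW the mean
for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 2))   # 0.75, 0.89, 0.94
```

Chebyshev's theorem gives a MINIMUM that holds for any distribution shape, so the actual proportion is often higher.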

↪ the Empirical rule

- Used to find the percentage of data values that lie within 1, 2 and 3 standard deviations of the mean, when the data have a BELL-SHAPED distribution

> about 68% within 1, about 95% within 2 and almost all ( 99.7% ) within 3 standard deviations

> interval = mean ± n ( standard deviation )


↪ Boxplot

-​ A graphical summary of data based on a five number summary

↪ Measures of Two Variables

a. Covariance

-​ A measure of linear association between two variables. Shows a positive relationship or a negative
relationship

> shows the DIRECTION + LINEAR RELATIONSHIP


> unbounded ( can be any number )

> positive value = positive relationship


> negative value = negative relationship

ai ) Correlation Coefficient

- A measure of linear association between two variables that takes values between -1 and +1

> values NEAR +1 show a strong positive relationship

> values NEAR -1 show a strong negative relationship

> values NEAR 0 show a weak relationship

> an exact value of -1 is a PERFECT negative relationship


> an exact value of +1 is a PERFECT positive relationship

> a BOUNDED value – showing direction + strength
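A minimal sketch of the sample covariance and correlation coefficient, written out by hand so each step is visible. The data and function names are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of ( x - x_bar )( y - y_bar ) divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sx = math.sqrt(sample_covariance(x, x))    # covariance of x with itself = variance
    sy = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sx * sy)

x = [1, 2, 3, 4, 5]                            # hypothetical paired data
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))                 # positive, so a positive relationship
print(round(correlation(x, y), 4))             # between -1 and +1

# changing the units of x changes the covariance but NOT the correlation
x_scaled = [10 * a for a in x]
print(sample_covariance(x_scaled, y), round(correlation(x_scaled, y), 4))
```

This is why covariance is unbounded and only shows direction, while the correlation coefficient is bounded and also shows strength.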


Chapter 4 | Probability
-​ The likelihood + chance of an event or experiment occurring

↪ Experiments

-​ A process that generates well defined outcomes

E.g. heads, tails, heads, heads, tails

> shows an empirical probability ( the REAL LIFE situations - sample space )
> shows theoretical probability ( ½ chance to get heads on a coin )

-​ Analyzes the LONG TERM pattern

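> ‘S’ is a sample point ( one outcome in the sample space )

- Important conditions

a) 0 ≤ P(S) ≤ 1 for EVERY sample point

b) Σ P(S) = 1 ( the probabilities of ALL sample points add up to 1 )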

↪ Counting techniques

a. Multiple-Step experiments

- An experiment that can be described as a sequence of k steps

- The total number of experimental outcomes is the PRODUCT of the number of outcomes at each step

> n = n1 x n2 x n3 … x nk

b. Factorial

- The number of ways of arranging ALL n objects in order

> n! = n x ( n - 1 ) x … x 2 x 1
c. Combination

- The number of ways k objects may be selected from among n objects WITHOUT regard to a specific order

> from n choose / select k

> nCk = n! ➗ ( k! ( n - k )! )

d. Permutations

- The number of ways k objects may be selected from among n objects WITH a specified order

> from n choose / select k AND arrange k

> the order matters

> nPk = n! ➗ ( n - k )!
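The four counting techniques map directly onto functions in Python's `math` module (Python 3.8 or later). The scenarios in the comments are hypothetical examples.

```python
import math

# a. multiple-step experiment: toss a coin ( 2 outcomes ), then roll a die ( 6 outcomes )
outcomes = math.prod([2, 6])            # 2 x 6 = 12

# b. factorial: ways to arrange ALL 5 objects in order
arrangements = math.factorial(5)        # 5! = 120

# c. combination: choose 2 objects from 5, order does NOT matter
combos = math.comb(5, 2)                # 10

# d. permutation: choose 2 objects from 5 AND arrange them, order matters
perms = math.perm(5, 2)                 # 20

print(outcomes, arrangements, combos, perms)
```

Each combination of 2 objects can be arranged in 2! ways, which is why there are 2! times as many permutations as combinations here.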
↪ Assigning probability

a. Classical method

-​ A method of assigning probabilities that is appropriate when ALL the experimental outcomes are
equally likely

> equal probability

b. Relative frequency method

- A method of assigning probabilities that is appropriate when data are available to estimate the proportion of the time the experimental outcome will occur if the experiment is repeated a large number of times

> shows probability outcomes in the form of a PERCENTAGE or FRACTION of

c. Subjective method

-​ A guesstimate of an outcome

> not reliable - not used often

↪ Venn diagrams

-​ A graphical representation for showing symbolically the sample space + operations involving
events

a. Union of A and B

-​ The event containing all sample points belonging to A or B or both

> A U B / P(A U B) = P(A) + P(B) - P(A ∩ B)

b. Intersection of A and B

- The event containing the sample points belonging to both A and B

> A ∩ B
c. Mutually exclusive events

-​ Events that have no sample points in common

> A ∩ B is empty
> P(A ∩ B) = 0

d. Conditional probability

-​ The probability of an event given that another event already occurred

> probability of A happening, given that you know B has occurred

> P(A|B) = P(A ∩ B) ➗ P(B)

> if independent P(A|B) = P(A) or P(B|A) = P(B)

e. Mutually disjoint

- Events do NOT overlap ( another name for mutually exclusive )

> P(A ∩ B) = 0

↪ Events

> a collection of sample points

a. Independent events
-​ Two events that have NO influence on each other

> P(A|B) = P(A)


> P(B|A) = P(B)

b. Dependent events

-​ Where two events are influenced / affected by one another

> P(A|B) ≠ P(A)


> P(B|A) ≠ P(B)
↪ Probability rules

a. Complement of an event

- The event consisting of all sample points that are NOT in the event

> P(A) + P(Ac) = 1

> P(Ac) = 1 - P(A)

b. Addition rule

-​ P(A U B) = P(A) + P(B) - P(A ∩ B)

> P(A ∩ B) is taken twice, so must be removed once

c. Multiplication law

-​ Used to compute the probability of the intersection of two events

> P(A ∩ B) = P(B)P(A | B)


> P(A ∩ B) = P(A)P(B | A)

> For independent events it reduces to P(A ∩ B) = P(A)P(B).

↪ Tree Diagram

-​ A graphical representation that helps in visualizing a multiple-step experiment


IMPORTANT

A and B are mutually exclusive if and ONLY if P(A ∩ B) = 0

A and B are independent if and ONLY if P(A ∩ B) = P(A) x P(B), or, P(A|B) = P(A), or, P(B|A) = P(B)
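These rules can be verified on a small sample space. The sketch below assumes one roll of a fair die, with events A, B and C chosen purely for illustration; `Fraction` keeps the probabilities exact.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}          # one roll of a fair die
A = {2, 4, 6}                              # even number
B = {1, 2, 3, 4}                           # four or less
C = {1, 3, 5}                              # odd number

def P(event):
    """Classical method: favourable sample points / total sample points."""
    return Fraction(len(event & sample_space), len(sample_space))

def P_given(a, b):
    """Conditional probability P( a | b ) = P( a and b ) / P( b )."""
    return P(a & b) / P(b)

print(P(A | B) == P(A) + P(B) - P(A & B))      # addition rule holds
print(P(sample_space - A) == 1 - P(A))         # complement rule holds
print(P(A & B) == P(A) * P(B))                 # True: A and B are independent
print(P(A & C))                                # 0: A and C are mutually exclusive
print(P_given(A, C), P(A))                     # different: A and C are dependent
```

Note that mutually exclusive events with non-zero probabilities are never independent: knowing that C occurred tells you that A did not.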

Chapter 5 | Probability Distribution


↪ Random Variables

-​ A numerical description / qualifier of the outcome of an experiment

> QUANTITATIVE in nature

> it is random as it is unknown


> it is a variable as it varies

E.g. amount of people, height, weight

a. Can be a Discrete Random Variable

- A random variable that may assume either a finite number of values or an infinite SEQUENCE of values ( e.g. 0, 1, 2, … )

> can COUNT

b. Can be a Continuous Random Variable

-​ A random variable that may assume any numerical value in an interval / collection of intervals

> when you can MEASURE

IMPORTANT TO NOTE

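At least ( ≥ )
At most ( ≤ )
Less than ( < )
Greater than ( > )

> *different for discrete* - for a DISCRETE variable ‘at least’ ( ≥ ) and ‘greater than’ ( > ) give different probabilities, as do ‘at most’ ( ≤ ) and ‘less than’ ( < )
> for a CONTINUOUS variable they give the same probability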

↪ Probability Distribution

-​ A description of HOW the probabilities are distributed over the values of the random variable

> denoted as f(x)

a. Discrete uniform probability distribution

-​ A probability distribution for which each possible value of the random variable has the SAME
probability

> f(x) = 1/n


> chance is the same
i) Expected value

- The measure of central location

> the MEAN

ii) Variance

- A measure of variability of a random variable

> finds standard deviation ( the SQUARE ROOT of this )
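The notes do not write out the formulas, so the sketch below assumes the standard definitions E(x) = Σ x f(x) and Var(x) = Σ ( x - μ )² f(x), applied to a discrete uniform distribution (one roll of a fair die, chosen as a hypothetical example).

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]                            # one roll of a fair die
f = {x: Fraction(1, len(values)) for x in values}      # f(x) = 1/n for every x

expected = sum(x * p for x, p in f.items())                     # E(x) = sum of x * f(x)
variance = sum((x - expected) ** 2 * p for x, p in f.items())   # sum of (x - mu)^2 * f(x)
std_dev = float(variance) ** 0.5                                # square root of the variance

print(expected, variance, round(std_dev, 4))
```

The same two sums work for any discrete probability distribution, as long as the values of f(x) add up to 1.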

↪ Binomial Probability Distribution

a. Binomial experiment

- An experiment having four key properties:

1) Experiment consists of a FIXED NUMBER of identical trials ( n )

2) Each trial has a SUCCESS or FAILURE
3) The probabilities stay the SAME from trial to trial
> success = p
> failure = 1 - p
4) Trials are INDEPENDENT

-​ Can be used to calculate with a TREE DIAGRAM

> each trial only has two branches as it is a ‘BI’nomial distribution

i) Expected value

- The average / mean

ii) Variance
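The notes stop before giving the formulas, so this sketch assumes the standard binomial results: f(x) = nCx p^x ( 1 - p )^( n - x ), E(x) = np and Var(x) = np( 1 - p ). The values n = 3 and p = 0.3 are hypothetical.

```python
import math

def binomial_pmf(x, n, p):
    """P( exactly x successes in n independent trials )."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.3                                # hypothetical: 3 trials, P( success ) = 0.3

mean = n * p                                 # expected value
variance = n * p * (1 - p)
std_dev = math.sqrt(variance)

p_exactly_2 = binomial_pmf(2, n, p)
p_at_least_1 = 1 - binomial_pmf(0, n, p)     # complement of "no successes"

print(round(p_exactly_2, 3), round(p_at_least_1, 3), round(mean, 2), round(variance, 2))
```

Adding `binomial_pmf(x, n, p)` over x = 0, 1, ..., n gives 1, the same total obtained by adding every path of the tree diagram.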


Chapter 6 | Continuous Probability
- a continuous variable can take ANY value in an interval, so values can lie directly after each other ( e.g. 1.000 … 001 )
a. Probability density function
-​ A function used to compute probabilities for a continuous random variable

> the probability is the AREA under the graph of the function over an interval

b. Uniform probability distribution

-​ A continuous probability distribution for which the probability that the random variable will ASSUME
a value is the same for each interval of equal length

> distributed with a MINIMUM and a MAXIMUM amount

i) Expected value ii) Variance

c. Normal probability distribution

-​ A continuous probability distribution, with its density function as BELL SHAPED and determined
by mean and standard deviation

-​ Has several characteristics


1) The highest point is at the MEAN ( which is also the median and mode )
2) The normal distribution is symmetric - the tails extend to INFINITY in both directions
3) Standard deviation determines how FLAT + WIDE the curve is
4) Probabilities are areas under the curve - the TOTAL area = 1

> has parameters MEAN and VARIANCE

c.i. Standard Normal distribution

-​ A normal distribution with a mean of 0 and standard deviation of 1

i) standard normal random variable

-​ Also known as the z-score ( its standard normal distribution is z ~ N(0,1) )
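Normal probabilities can be looked up with `statistics.NormalDist` (Python 3.8 or later) instead of a z-table. The mean of 100 and standard deviation of 15 are hypothetical values for the illustration.

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)          # hypothetical normal distribution
std_normal = NormalDist()                      # z ~ N( 0, 1 )

# convert x = 130 to a z-score, then use the standard normal distribution
z = (130 - scores.mean) / scores.stdev         # 2.0
p_below = std_normal.cdf(z)                    # P( x <= 130 ), the area to the left of z
p_above = 1 - p_below                          # area to the right

# areas within 1, 2 and 3 standard deviations of the mean
within = [std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)]

print(z, round(p_below, 4), round(p_above, 4))
print([round(w, 4) for w in within])
```

Working with `scores.cdf(130)` directly gives the same probability as converting to z first, because standardizing does not change the area.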

d. Exponential probability distribution

-​ A continuous probability distribution that is useful in computing probabilities for the time taken to
complete a task

i) cumulative probability
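The cumulative probability is not written out in the notes, so this sketch assumes the standard exponential result P( x ≤ x0 ) = 1 - e^( -x0 / μ ), where μ is the mean. The mean of 15 minutes is a hypothetical value.

```python
import math

def exponential_cdf(x0, mean):
    """P( x <= x0 ) for an exponential distribution with the given mean."""
    return 1 - math.exp(-x0 / mean)

mean_time = 15                                   # hypothetical: 15 minutes on average per task

p_within_6 = exponential_cdf(6, mean_time)       # task takes 6 minutes or less
p_between = exponential_cdf(18, mean_time) - exponential_cdf(6, mean_time)   # between 6 and 18
p_more_than_18 = 1 - exponential_cdf(18, mean_time)

print(round(p_within_6, 4), round(p_between, 4), round(p_more_than_18, 4))
```

The probability for an interval is the difference between two cumulative probabilities, which is the area under the density curve between the two values.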

Chapter 7 | Random Variables


- Different samples from the same population give different x-bar + p-bar values, so these are RANDOM VARIABLES

> being a random variable, each has its own probability distribution ( the sampling distribution )

↪ Sampling techniques

a. SRS - simple random sampling


- each element of the population has an equal chance of being chosen for the sample

> can use the random number function in Excel

b. systematic random sampling

-​ Randomly selected sample at a set interval

> e.g. 1 in every 5

c. stratified random sampling

- SPLIT population into sub-groups ( strata ), then take a random sample from EACH sub-group
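The three sampling techniques can be sketched with Python's `random` module in place of Excel. The population of 100 element IDs, the sample sizes and the two sub-groups are hypothetical, and sampling within each sub-group is an assumption about how the stratified sample is completed.

```python
import random

random.seed(1)                                   # fixed seed so the sketch is repeatable
population = list(range(1, 101))                 # hypothetical population of 100 element IDs

# a. simple random sample: every element has an equal chance of being chosen
srs = random.sample(population, 10)

# b. systematic sample: random starting point, then every 10th element
k = 10
start = random.randrange(k)
systematic = population[start::k]

# c. stratified sample: split into sub-groups, then sample within EACH sub-group
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(srs)
print(systematic)
print(stratified)
```

`random.sample` draws without replacement, so no element can appear twice in the same sample.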
