Statistics 101 Study Notes

Uploaded by

ninalee246
Copyright
© © All Rights Reserved

Lecture #1:

Statistics:

- Definition: the science of learning from data.


- The term comes from the information needed to govern a state; the collection of such
data intensified in the 18th century.

Individuals, Variables, and Data-sets:

- Data-set: information about some group of individuals


- Individuals: objects described in the data-set.
- Variables: a variable is any characteristic of an individual. It can take different values for
different individuals.

Types of Data:

- Quantitative Data:
● Numbers that have cardinal meaning
● Examples: income, age, etc.
● Makes sense to use in calculations such as adding and finding an average.

- Categorical Variables:
● Numbers don’t have cardinal meaning.
● They just identify groups.
Lecture #2:

Variable:

- Acts as a placeholder
- Is represented with a graph or figure

Key Statistics

1) Centrality:

- To measure an average (mean): add up all the values and divide by the number of observations:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

- To measure a median:

First, order all numbers from least to greatest.

● If the number of values is odd and the list is short, split the observations in half and
choose the number in the middle.

● If the number of values is odd and the list is long, use the formula $(N + 1)/2$,
where N is the number of observations, to find the rank of the median (its position in the
ordered list, not the median itself).
● If the number of values is even, take the mean of the two middle numbers.
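
The median rules above can be sketched in a few lines of Python. This is only an illustration: the two data sets and the helper name `median_rank` are made up for the example, and the rank formula is assumed to be the usual (N + 1) / 2.

```python
from statistics import mean, median

def median_rank(n):
    """Rank (1-based position in the ordered list) of the median: (N + 1) / 2."""
    return (n + 1) / 2

odd_values = sorted([7, 1, 9, 3, 5])   # hypothetical data, odd count
even_values = sorted([8, 2, 6, 4])     # hypothetical data, even count

odd_rank = median_rank(len(odd_values))      # position of the median, not its value
odd_median = odd_values[int(odd_rank) - 1]   # value sitting at that position
even_median = median(even_values)            # mean of the two middle numbers (4 and 6)
average = mean(odd_values)                   # sum of the values divided by their count
```

For the odd list the rank is a whole number and points directly at one observation; for an even list the rank ends in .5, which is why the two middle numbers are averaged.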

2) Measuring Spread:

Quartiles:

- Q1 and Q3 are the medians of the lower and upper halves of the ordered data.


- Q1: the median of the observations whose positions in the order list are to the left of the
median.
- Q3: the median of the observations whose positions in the order list are to the right of the
median.
- IQR (interquartile range): the distance between the first and third quartiles (Q3 - Q1).
- Percentile: the p-th percentile is the value that has p percent of the observations at or below it.
To find ranks of the Quartiles:

Examples:

Steps (when number of observations is odd):

1) Find median: 5 (Q2)


2) Which values are before it?: 1,3
3) Find the median of 1 and 3: even number of observations.
4) Find mean: (1+3)/2 = 2 (Q1)
5) Do the same for the third quartile: (7+9)/2 = 8 (Q3)
Steps (when number of observations is even):

1. Find median: it falls between two numbers, so (6+8)/2 = 7 (Q2)


2. Which values are before it?: 2,4
3. Find the rank of Q1: it falls between two numbers (2 and 4).
4. Find their mean: (2+4)/2 = 3 (Q1)
5. Do the same for the third quartile: (10+12)/2 = 11 (Q3)

Steps (when number of observations is odd):

1. Find median: 12 (Q2)


2. Which values are before it?: 10, 12, 12
3. Find the median of 10, 12, 12: 12 (Q1)
4. Which values are after the central median?: 14, 15, 20.
5. Find the median of 14, 15, 20: 15 (Q3)
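
The "median of each half" method can be sketched as follows. The helper `quartiles` is a made-up name, and the two data sets are reconstructed from the worked steps above (the original figures are not available), so treat them as assumptions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    With an odd number of observations the overall median is left out
    of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                # observations to the left of the median
    upper = data[len(data) - half:]    # observations to the right of the median
    return median(lower), median(data), median(upper)

# Data sets reconstructed from the steps above (an assumption).
q1, q2, q3 = quartiles([1, 3, 5, 7, 9])
iqr = q3 - q1                          # interquartile range: Q3 - Q1
second = quartiles([10, 12, 12, 12, 14, 15, 20])
```

Other conventions (for example, finding the quartiles by rank) can give slightly different values for the same data, so it helps to state which method is being used.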

3) Dispersion:

- How spread out observations are.

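How to measure Variance:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

- where n is the number of observations, $x_i$ is the value of the i-th observation, and $\bar{x}$ is the mean.
- We divide by 'n-1' because dividing by 'n' gives a biased (too small) estimate in samples.
- We square the deviations in the numerator so that positive and negative deviations do not cancel out to zero.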
How to measure Standard Deviation:

$s = \sqrt{v}$

- Where 'v' is the variance.
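
A minimal sketch of the variance and standard deviation calculation, assuming the sample version (dividing by n - 1). The data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [1, 3, 5, 7, 9]                  # hypothetical sample
x_bar = mean(data)

deviations = [x - x_bar for x in data]  # these always add up to zero
raw_dev_sum = sum(deviations)

# Squaring stops positive and negative deviations from cancelling out.
v = sum(d ** 2 for d in deviations) / (len(data) - 1)
sd = sqrt(v)

# The standard library gives the same results.
matches_library = abs(v - variance(data)) < 1e-9 and abs(sd - stdev(data)) < 1e-9
```

Here the squared deviations are 16, 4, 0, 4, and 16, so the variance is 40 / 4 = 10.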

Graphically representing data:

- Bars
- Pie
- Histograms
● Represent the distribution of numerical data
● Estimates probability distribution of a continuous variable
● Relates only one variable
- Boxplots
● An x marks the mean

What causes representation to be misleading:

- A truncated y-axis: one that does not start at zero.


- Outliers (values more than 1.5 times the IQR below Q1 or above Q3) distorting the mean, standard deviation, and variance.
- Confounding variables:
● E.g: measuring earth’s temperature at different times, but not taking into account
seasonal changes.
Additional Vocabulary:

Shape:

- Symmetric: the parts to the left and right of the center are mirror images.
- Unimodal: has only one peak
- Bimodal: has two peaks
- Skewed: has a long tail on one side; most observations are bunched at one end, with
some extreme small or extreme big values stretching out the other end.
Lecture #3:

Linear Transformations of one variable:

- Linear transformation: the change of an original variable, x, into a new variable, y,
following the operation: y = ax + b

- When x is multiplied by a constant a bigger than 0:

● Mean, median, quartiles, SD, and IQR are all multiplied by a; the variance (V) is multiplied by a².

- When a constant b is added to x:

● The constant is also added to the mean, median, and quartiles; SD, V, and IQR do not change.
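
These rules can be checked numerically. The sketch below uses hypothetical data and hypothetical constants a = 2 and b = 3.

```python
from statistics import mean, median, stdev, variance

x = [1, 3, 5, 7, 9]             # hypothetical data
a, b = 2, 3                     # hypothetical constants, with a > 0
y = [a * xi + b for xi in x]    # linear transformation y = ax + b

mean_ok = mean(y) == a * mean(x) + b                        # scaled by a, shifted by b
median_ok = median(y) == a * median(x) + b                  # scaled by a, shifted by b
sd_ok = abs(stdev(y) - a * stdev(x)) < 1e-9                 # scaled by a, not shifted
var_ok = abs(variance(y) - a ** 2 * variance(x)) < 1e-9     # scaled by a squared
```

Adding b moves every observation by the same amount, so it changes where the data are centered but not how spread out they are.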

Percentage Inc./Dec. Formula:

Increase:

number(1 + perc./100)

Decrease:

number(1 - perc./100)
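
As a quick check of the two formulas, here is a small sketch with made-up numbers (200 changed by 15%). The function names are only for illustration.

```python
def increase(number, perc):
    """Increase a number by perc percent: number(1 + perc/100)."""
    return number * (1 + perc / 100)

def decrease(number, perc):
    """Decrease a number by perc percent: number(1 - perc/100)."""
    return number * (1 - perc / 100)

raised = increase(200, 15)    # about 230
lowered = decrease(200, 15)   # about 170
```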
Density Curves:

- Density curve: is a curve that is always on or above the horizontal axis and has an area
of exactly 1 underneath it. A density curve describes the overall pattern of distribution.

- The area under the curve for any range of values is the proportion of all observations
that fall in that range.

- A density curve that is right-skewed (positively skewed) typically has mean > median > mode.

- Left-skewed (negatively skewed) curves typically have mean < median < mode.

- Symmetric curves have median = mean = mode.

Five-number Summaries:

1. Minimum
2. Q1
3. Median
4. Q3
5. Maximum
- If two distributions have the same five-number summary, their curves may still
look different, because the summary does not describe the shape of the distribution
between those five values.
Normal Curves:

- A type of density curve that is symmetric, bell-shaped, and unimodal.


- They describe normal distributions.

Normal Distributions:

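- The standard Normal distribution has a mean of zero and a standard deviation of 1.


- Height is given by:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$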

- Points at which curvature changes are located at a distance σ on either side of the
mean µ.
- Good description of many real data sets and good approximation to many kinds of chance
outcomes
- Many statistical inference procedures based on the Normal distribution work well for
roughly symmetrical distributions
Z-scores (used to adjust location (mean) and curve shape (S.D) of a distribution):

- The numbers associated with the standard normal distribution


- Looking up a z-score in the table tells you the percentage of observations to the left; 1 minus
that is the percentage of observations to the right.
- Measured by $z = (x - \text{mean})/\text{S.D.}$
- Find the z-score (to one decimal place) in the first column and the second decimal in the first
row, then trace your finger to the cell where they meet, which will give you the percentage of
observations to the left of (below) that value. E.g. z-score 1.00 = 0.8413, about 84%.

Example where
mean = 3, SD = 3,
and X=1,6.
Two more examples:

c) What grade do I need to get to be among the best 10%?

d) What grade do I need to be above the lowest 30%?

c) 6.84

d) 1.44
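
The same example can be checked without a table. This sketch assumes Python 3.8 or later (for `statistics.NormalDist`) and uses the mean of 3 and SD of 3 given above.

```python
from statistics import NormalDist

grades = NormalDist(mu=3, sigma=3)        # mean = 3, SD = 3, as in the example

z_low = (1 - 3) / 3                       # z-score of X = 1
z_high = (6 - 3) / 3                      # z-score of X = 6
share_below_6 = grades.cdf(6)             # proportion of observations to the left of 6

top_10_cutoff = grades.inv_cdf(0.90)      # c) grade with 90% of observations below it
bottom_30_cutoff = grades.inv_cdf(0.30)   # d) grade with 30% of observations below it
```

The exact cutoffs differ slightly from table-based answers because a printed table rounds the z-score to two decimals (for example, z = -0.52 for the lowest 30%).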
Lecture 4
Associations:

- Two variables are said to be associated when knowing something about one tells you
something about the other.
- Can be positively associated, negatively associated, or not associated.

Negatively: having more of one gives you less of the other.

Positively: having more of one gives you more of the other.


Scatterplots:

- Graphs that represent a relationship between two quantitative variables measured on
the same individuals.
- Two examples above.

Two-way tables:

- Show relationships between two categorical variables.

Marginal two-way tables show the distribution of one variable:

Joint two-way tables show the distributions of two variables:

How to interpret a value:

0.39 means that 39% of adults in the European Union are both male and employed.

Correlation:

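- If the correlation coefficient is zero, there is no linear association.


- The correlation does not change under linear transformations of the variables (only its sign flips if one variable is multiplied by a negative constant).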
To standardize a variable, find its z-score: $z_i = (x_i - \bar{x})/s_x$

What happens when we standardize a perfect correlation:

What happens when we standardize an imperfect correlation:


Covariance:

- Covariance is the correlation without full standardization.


- The covariance does not vary between -1 and 1 as it depends on the scales of x and y.
- Note that Cov(X,X) is just the variance!
- Measure correlation using covariance: $r = \frac{\mathrm{Cov}(X, Y)}{s_x s_y}$
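
A minimal sketch of the link between covariance and correlation, assuming the sample covariance (dividing by n - 1). The data and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation: covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
var_x = covariance(x, x)                                 # Cov(X, X) is the variance of X
r = correlation(x, y)
r_rescaled = correlation([10 * xi + 7 for xi in x], y)   # rescaling x leaves r unchanged
```

Rescaling x changes the covariance (it depends on the scale), but dividing by the standard deviations removes the scale again, which is why r stays between -1 and 1.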
Lecture 5
Least Square Regressions:

Description of a Straight Line:

- Equation: $y_i = a + b x_i + \epsilon_i$


- Find slope (b): $b = r \cdot s_y / s_x$

- Find y-intercept (a) (the y and x below are the means): $a = \bar{y} - b\bar{x}$

- Find error term (є), which captures errors and confounding variables.

Least Square Regression:

- Chooses a & b to minimize the sum of the squared errors.


- Graphically, it minimizes the squared vertical distances between the data points and
the regression line.

- Why do we square it?


● If we did not square them, positive and negative errors would cancel out and their sum would add up to zero.
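
The sketch below fits a least-squares line to hypothetical data and shows that the raw errors cancel out while the squared errors do not. It assumes the usual formulas: the slope is the sum of the products of deviations over the sum of squared x-deviations, and the intercept is the mean of y minus the slope times the mean of x.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(x), mean(y)

# Slope: sum of products of deviations over sum of squared x-deviations.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar                             # intercept uses the two means

predicted = [a + b * xi for xi in x]              # y hat for each observation
errors = [yi - yh for yi, yh in zip(y, predicted)]

error_sum = sum(errors)                           # cancels out to (about) zero
squared_error_sum = sum(e ** 2 for e in errors)   # the quantity least squares minimizes
```

For this data the line is y = 2.2 + 0.6x; any other choice of a and b would give a larger sum of squared errors.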

Language for Least Square Regression:


How to Calculate Error Term:

- Difference between actual y and predicted y.

- Graphically (Error for first and Last):


Comments on Regression Analysis (Why it might not work):

● It assumes linearity, but not all relationships are linear.


● It assumes causality, but the independent variable does not always cause the dependent
variable (like the above height difference between siblings).
● The outliers distort data such as S.D and mean, both of which are used to come up with
the equation and draw the line of best fit, so they may distort the regression analysis as
well.
Lecture #6

R²:

- In the equation above, y hat is the predicted value in the line of best fit.
- In the left graph, the y line is the average, the height of the red squares is the difference
between actual y and average of y, the black dots are the data points, and the area of
the squares is (yi - y average)².
- In the right graph, the height of the squares is the difference between yi and y hat, and
the area of the squares is (y hat - yi)².
- The equation of r² can be rewritten as:
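
One common way to compute R² from the two sets of squares described above is one minus the ratio of the squared errors around the line to the squared differences around the average. The sketch below assumes that form and uses hypothetical data; the values of a and b are the least-squares intercept and slope for that data.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6       # least-squares intercept and slope for this data

y_bar = mean(y)
y_hat = [a + b * xi for xi in x]   # predicted values on the line of best fit

total_squares = sum((yi - y_bar) ** 2 for yi in y)                # squares around the average
error_squares = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # squares around the line
r_squared = 1 - error_squares / total_squares
```

Here the line leaves 2.4 of the original 6 units of squared variation unexplained, so R² is 0.6, meaning the line accounts for 60% of the variation in y.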
Simpson’s Paradox:

- Happens because confounding variables (lurking) can strongly influence the relationship
between variables.
- Example:
Lecture #7

Causality:

- When one variable is causing the presence of another.


- Requirements for it:
1. Consistent statistical association (correlation has to be present)
2. Temporal order (if x causes y, then x has to have happened first)
3. Ruling out alternative explanations

- The dash line means correlation and the non-dash line means causation.
- In (a) x causes y.
- In (b) x does not cause y, but z causes both x and y, so they have a correlation.
- In (c) we don’t know whether x or z cause y. There can be multiple zs.

How to evaluate causal relationships:

- Think of confounding variables


- Experimentation
Lecture #8

Basic elements of an ideal experiment:

- A dependent variable yi (attention)


- An independent variable xi (coffee)
- Subjects belong to clear treatment and control groups related to xi
● Control - not given the independent variable
● Treatment - given the independent variable
- Subjects are selected into treatment and control groups in a random way
● This makes the assignment independent of any external or internal factor, so both
groups will be similar on average (which avoids, e.g., bias).
- A hypothesis: there is an impact of xi on yi

What we need to take into account:

- We need to take sample size into account.


● If it is larger, there will be a lower chance that the groups differ.
- Confounding variable
● Such as the placebo effect with the treatment group.
- To have a reference point
● It is a good idea to take two measurements (one before administering the independent
variable and one after).

Vocabulary:

- Experimental units: basic units that you can manipulate in experiments.


- Subjects: experimental units when they are people.
- Factors: all other characteristics that can affect outcomes. Experiments try to hold them
constant.
- Treatments: what is being done to subjects (x variable)
- Outcomes/Response: the y-variable (dependent variable)
- Randomization: randomly assigning experimental units to control and treatment
groups.
- Pre-test: same subjects, but you measure before the actual experiment.
- Post-test: same subjects, but you measure after the actual experiment.
- Sampling: the technique used to select the units on which data are collected.

Other Data Sources:

- Anecdotal data
- sample surveys
- process-produced data
- social media

However, experiments are considered the ‘gold standard’ because, by randomizing, they
balance all other variables across the groups on average:

- Limitation of experiments:
● Lab experiments are criticized as being unrealistic environments.
● And, sometimes, you cannot do experiments.

Ethics:

- Subjects cannot be treated with something that might be harmful.


- Committees are there to evaluate whether an experiment is ethical.
- Alternate methods are used in that case.

Sampling:

- Done because it is infeasible to collect data on the whole population, and because it is
more money, effort, and time-efficient.
- Choosing subjects of the population.
- We sample by randomization:
● Everyone must have the same chance of showing up in the data.
- Concerns of sampling using surveys:
● Undercoverage: not enough respondents from certain groups.
● Nonresponse: try to survey people and they don’t respond.
Lecture #9

Probabilities and Proportions:

- Probability is used to evaluate the sample in reference to the population.

Confidence intervals:

- A confidence interval is a range of values that is likely to contain the average of the
actual population.
- Our confidence intervals get narrower as we get more information about the average of the
population (as the sample size gets bigger), which helps us detect small effects.

Sample space:

- Collection of all outcomes that could happen.


- Notation: S = {outcome#1, outcome#2, etc.}
- Examples (third isn’t a range because we can’t get 1.5):
Rules of probability:

1) Any probability must be between 0 and 1.


2) All possible outcomes together must have a probability of 1.
● The sum of the probabilities of all outcomes in the sample space = 1.
3) If two events have no outcomes in common, the probability that one or the other occurs
is the sum of their individual probabilities.
● E.g. in a coin flip, there is no common outcome since the outcome cannot be
both heads and tails.
● The probability in that case is 0.50 (probability of getting heads) + 0.50
(probability of getting tails).
4) The probability that an event does not occur is 1 minus the probability that the event
does occur.
● E.g. the probability that we don’t get tails is 1 - (probability that we do get tails).
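
The four rules can be checked on the coin-flip example. This sketch assumes a fair coin and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# Sample space for one flip of a fair coin (an assumed model).
S = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

rule1 = all(0 <= p <= 1 for p in S.values())   # 1) every probability is between 0 and 1
rule2 = sum(S.values()) == 1                   # 2) all outcomes together have probability 1
heads_or_tails = S["heads"] + S["tails"]       # 3) no common outcomes, so probabilities add
not_tails = 1 - S["tails"]                     # 4) complement: 1 minus P(tails)
```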
Lecture #10

Venn Diagrams - Type#1:

- The whole area (S) represents the set of possible outcomes.


- The area A and B represent particular events.
- P(all the outcomes) = 1.
- Sample questions (bearing in mind that P(S) = 1 and that A & B are not ALL the potential
outcomes in S).

Venn Diagrams - Type#2:

- Aᶜ (A-complement) is the set of all outcomes (the sum of them) that aren’t A.
Venn Diagrams - Type#3:
Dependent and Independent events:

- Survey questions are not independent (almost always).


- e.g. educational level and age.
Lecture #11

The Multiplication Rule:

- Tells you the probability of two events happening together: P(A and B) = P(A)P(B|A).


- P(B|A) is the probability of B given the condition that event A has occurred.
- Example: if we are talking about flipping a coin and we define event A as getting two
heads in a row, then the events are no longer independent since getting two heads in a
row is conditional on getting the first flip to be heads.

The conditional probability:

- Tells you the probability of event B conditional on A.


- The following is the rule for the conditional probability when P(A) > 0:

$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$

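- The Venn diagram shows two overlapping events. Overlap alone does not show dependence:
independent events can overlap, and events that never overlap are in fact dependent (if
one happens, the other cannot).
- If we want to find the probability of both A and B happening, it is the intersection of the
two circles. The probability of A or B is the sum of both circles minus the intersection (so
that it is not counted twice).
- If we are asking for the probability of B conditional on A, P(B|A), we want P(A∩B)/P(A), the
probability of both B and A over the probability of A.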
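
These relationships can be sketched with one roll of a fair die. The events A and B and the helper `prob` are made up for the example.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (hypothetical example)
A = {2, 4, 6}            # event A: the roll is even
B = {4, 5, 6}            # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

p_a_and_b = prob(A & B)                        # intersection of the two circles
p_a_or_b = prob(A) + prob(B) - p_a_and_b       # both circles, overlap counted once
p_b_given_a = p_a_and_b / prob(A)              # P(B|A) = P(A and B) / P(A)
independent = p_a_and_b == prob(A) * prob(B)   # False here, so A and B are dependent
```

Knowing that the roll is even raises the probability that it is greater than 3 from 1/2 to 2/3, which is what dependence means.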
Example#1:
Final answer: P(A and B and C) = P(A)P(B|A)P(C|A and B) OR P(A and B and C) = 0.25 ×
0.02 × 0.60 = 0.003 (which is 0.3%)
Example#2:
Bayes Rule:

- All of them are independent of each other.

Example#1:

- A factory production line is manufacturing bolts using three machines, A, B and C. Of the
total output, machine A is responsible for 25%, machine B for 35% and machine C for
the rest. It is known from previous experience with the machines that 5% of the output
from machine A is defective, 4% from machine B and 2% from machine C. A bolt is
chosen at random from the production line and found to be defective. What is the
probability that it came from (a) machine A (b) machine B (c) machine C?

So:

The state space:

1. D = bolt is defective
2. A = bolt is from A
3. B = bolt is from B
4. C = bolt is from C

- We have P(A) = 0.25, P(B) = 0.35, P(C) = 0.4


- We have P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02
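
With these numbers, the three answers can be worked out as in the sketch below. It assumes the standard form of Bayes' rule: the probability of a machine given a defect is P(machine) times P(D | machine), divided by the total probability of a defect.

```python
prior = {"A": 0.25, "B": 0.35, "C": 0.40}         # P(machine)
defect_rate = {"A": 0.05, "B": 0.04, "C": 0.02}   # P(D | machine)

# Total probability of a defective bolt: sum of P(machine) * P(D | machine).
p_defective = sum(prior[m] * defect_rate[m] for m in prior)

# Bayes' rule: P(machine | D) = P(machine) * P(D | machine) / P(D).
posterior = {m: prior[m] * defect_rate[m] / p_defective for m in prior}

for machine, p in posterior.items():
    print(f"P({machine}|D) = {p:.4f}")
```

Machine B is the most likely source of a defective bolt (about 41%), even though machine A has the highest defect rate, because B produces a larger share of the output than A.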
