Statistics 101 Study Notes
Statistics:
Types of Data:
- Quantitative Data:
● Numbers that have cardinal meaning
● Examples: income, age, etc.
● Makes sense to use in calculations such as adding and finding an average.
- Categorical Data:
● Numbers don’t have cardinal meaning.
● They just identify groups.
Lecture #2:
Variable:
- Acts as a placeholder
- Is represented with a graph or figure
Key Statistics
1) Centrality:
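- To measure an average (mean): add up all the values and divide by the number of
observations, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.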
- To measure a median:
● If the number of observations is odd and the list is short, sort the values, split the
list in half, and choose the number in the middle.
● If the number of observations is odd and the list is long, use the formula $(N + 1)/2$,
where N is the number of observations, to find the rank (the position in the sorted
observations) of the median (not the median itself).
● If the number of observations is even, take the mean of the two middle numbers.
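The median rules above can be sketched in a few lines of Python. The function name `median_by_rank` and the two small data sets are made up for illustration; this is only one way to write it.

```python
def median_by_rank(values):
    """Return (rank, median) following the odd/even rules in the notes."""
    ordered = sorted(values)              # the median needs sorted observations
    n = len(ordered)
    if n % 2 == 1:
        rank = (n + 1) // 2               # 1-based rank of the middle value
        return rank, ordered[rank - 1]
    # Even count: the median sits halfway between the two middle values
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (n + 1) / 2, (lower + upper) / 2


print(median_by_rank([7, 1, 5, 3, 9]))    # (3, 5)
print(median_by_rank([4, 8, 6, 2]))       # (2.5, 5.0)
```

For the even case the "rank" 2.5 is not a real position; it just signals that the median lies between the 2nd and 3rd sorted values.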
2) Measuring Spread:
Quartiles:
Examples:
3) Dispersion:
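- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where n is the number of
observations, $x_i$ is the value of the i-th observation, and $\bar{x}$ is the sample mean.
- We use ‘n-1’ because dividing by ‘n’ gives a biased estimate in samples (it tends to
underestimate the population variance).
- We square the deviations in the numerator so that positive and negative deviations do
not cancel out to zero.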
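A small sketch of the dispersion calculation, assuming a made-up sample. It shows that the raw deviations sum to zero (which is why they are squared) and checks the hand calculation against Python's standard `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical sample

n = len(data)
mean = sum(data) / n

# Raw deviations cancel out, so they cannot measure spread on their own
raw_deviation_sum = sum(x - mean for x in data)

# Squared deviations do not cancel; divide by n - 1 for a sample
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_sd = math.sqrt(sample_variance)    # SD is the square root of the variance

print(raw_deviation_sum)                  # 0.0
print(round(sample_variance, 3))          # 4.571
print(round(sample_sd, 3))                # 2.138
print(math.isclose(sample_variance, statistics.variance(data)))  # True
```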
How to measure Standard Deviation:
- Take the square root of the variance: $s = \sqrt{s^2}$.
Graphs:
- Bars
- Pie
- Histograms
● Represent the distribution of numerical data
● Estimates probability distribution of a continuous variable
● Shows only one variable
- Boxplots
● An x marks the mean
Shape:
- Symmetric: the parts above and below the mean look the same.
- Unimodal: has only one peak
- Bimodal: has two peaks
- Skewed: has more of the observations on one end than the other (many extreme small
or many extreme big values).
Lecture #3:
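● When every observation is multiplied by the same positive constant, the mean, median,
SD, and IQR are all multiplied by that constant. The variance (V) is multiplied by the
square of the constant.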
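A quick check of this rescaling rule, as a sketch with a made-up sample and a made-up constant `c = 3`. The helper `iqr` is hypothetical and uses `statistics.quantiles`, whose quartile convention may differ slightly from the one used in class.

```python
import math
import statistics


def iqr(values):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical sample
c = 3                                     # hypothetical positive constant
scaled = [c * x for x in data]

# Ratio of each statistic after scaling to the same statistic before scaling
ratios = {
    "mean": statistics.mean(scaled) / statistics.mean(data),
    "median": statistics.median(scaled) / statistics.median(data),
    "sd": statistics.stdev(scaled) / statistics.stdev(data),
    "iqr": iqr(scaled) / iqr(data),
    "variance": statistics.variance(scaled) / statistics.variance(data),
}

for name, ratio in ratios.items():
    print(name, round(ratio, 6))          # every ratio is 3 except variance, which is 9
```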
Increase:
number × (1 + perc./100)
Decrease:
number × (1 - perc./100)
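The two formulas translate directly into code. The function names and the numbers below are made up for illustration.

```python
def increase(number, percent):
    """Raise number by percent, e.g. percent=15 means +15%."""
    return number * (1 + percent / 100)


def decrease(number, percent):
    """Lower number by percent, e.g. percent=15 means -15%."""
    return number * (1 - percent / 100)


print(round(increase(200, 15), 2))        # 230.0
print(round(decrease(200, 15), 2))        # 170.0

# A 10% increase followed by a 10% decrease does not return to the start
print(round(decrease(increase(100, 10), 10), 2))   # 99.0
```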
Density Curves:
- Density curve: a curve that is always on or above the horizontal axis and has an area
of exactly 1 underneath it. A density curve describes the overall pattern of a distribution.
- The area under the curve for any range of values is the proportion of all observations
that fall in that range.
- A density curve that is right-skewed (positive skew) typically has mean > median > mode.
- Left-skewed curves (negative skew) typically have mean < median < mode.
Five-number Summaries:
1. Minimum
2. Q1
3. Median
4. Q3
5. Maximum
- If two distributions have the same five-number summary, their curves may still
look different because the summary says nothing about how the observations are spread
between those five points.
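A minimal sketch of a five-number summary. It assumes the quartile convention "Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the count is odd"; other textbooks and software use slightly different rules. The function name and both data sets are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle observation when n is odd
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clumped = [1, 2, 3, 5, 5, 5, 7, 8, 9]

print(five_number_summary(evenly_spread))   # (1, 2.5, 5, 7.5, 9)
print(five_number_summary(clumped))         # (1, 2.5, 5, 7.5, 9)
```

The two data sets give the same summary even though the second one piles up at 5, which is the point made in the last bullet.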
Normal Curves:
Normal Distributions:
- Points at which curvature changes (inflection points) are located at a distance σ on
either side of the mean µ.
- Good description of many real data sets and a good approximation to many kinds of
chance outcomes
- Many statistical inference procedures based on the Normal distribution work well for
roughly symmetrical distributions
Z-scores (standardize a value using the location (mean µ) and spread (SD σ) of its
distribution): $z = \frac{x - \mu}{\sigma}$
Example where mean = 3, SD = 3, and X = 1, 6.
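Working the example through, assuming the usual definition z = (x - mean) / SD. The function name `z_score` is made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


for x in (1, 6):
    print(x, round(z_score(x, mean=3, sd=3), 2))
# 1 -0.67
# 6 1.0
```

So X = 1 lies about two thirds of an SD below the mean, and X = 6 lies exactly one SD above it.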
Two more examples:
c) 6.84
d) 1.44
Lecture #4:
Associations:
- Two variables are said to be associated when knowing something about one tells you
something about the other.
- Can be positively associated, negatively associated, or not associated.
Two-way tables:
0.39 = the proportion (39%) of adults in the European Union who are both male and employed.
Correlation:
R2:
$R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2}$
- In the equation above, $\hat{y}$ is the predicted value on the line of best fit.
- In the left graph, the horizontal line is the average $\bar{y}$, the height of the red
squares is the difference between the actual $y_i$ and $\bar{y}$, the black dots are the
data points, and the area of the squares is $(y_i - \bar{y})^2$.
- In the right graph, the height of the squares is the difference between $y_i$ and
$\hat{y}_i$, and the area of the squares is $(\hat{y}_i - y_i)^2$.
- The equation of $R^2$ can be rewritten as:
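The two sets of squares described above can be added up directly. This is a sketch with made-up data, assuming a simple least-squares line and the common definition R² = 1 - (sum of squares around the fitted line) / (sum of squares around the average). It is not necessarily the exact form written in the lecture.

```python
xs = [1, 2, 3, 4, 5]                      # hypothetical x values
ys = [2, 4, 5, 4, 5]                      # hypothetical y values

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares line of best fit: y_hat = intercept + slope * x
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
y_hats = [intercept + slope * x for x in xs]

# Left graph: squares between each point and the average of y
ss_total = sum((y - y_bar) ** 2 for y in ys)
# Right graph: squares between each point and the fitted line
ss_residual = sum((y_hat - y) ** 2 for y_hat, y in zip(y_hats, ys))

r_squared = 1 - ss_residual / ss_total
print(round(slope, 2), round(intercept, 2))   # 0.6 2.2
print(round(r_squared, 2))                    # 0.6
```

Here the line removes 60% of the squared area around the average, so R² is 0.6.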
Simpson’s Paradox:
- Happens because lurking (confounding) variables can strongly influence the relationship
between variables: a trend that appears within each group can reverse when the groups
are combined.
- Example:
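A hypothetical illustration (all counts are made up, not the lecture's example): treatment A has the higher success rate within both the easy and the hard cases, yet B looks better overall because B was mostly given to easy cases. Case difficulty is the lurking variable.

```python
# (successes, trials) per treatment and case type; all counts are invented
results = {
    "A": {"easy": (9, 10), "hard": (30, 90)},
    "B": {"easy": (80, 90), "hard": (3, 10)},
}


def rate(successes, trials):
    return successes / trials


def overall_rate(groups):
    """Pool all groups of one treatment and compute a single success rate."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials


for group in ("easy", "hard"):
    a = rate(*results["A"][group])
    b = rate(*results["B"][group])
    print(group, round(a, 3), round(b, 3))    # A is higher in both groups

print("overall", overall_rate(results["A"]), overall_rate(results["B"]))
# overall 0.39 0.83  -> B is higher once the groups are combined
```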
Lecture #7
Causality:
- The dashed line means correlation and the solid line means causation.
- In (a) x causes y.
- In (b) x does not cause y, but z causes both x and y, so x and y are correlated.
- In (c) we don’t know whether x or z causes y. There can be multiple zs.
Vocabulary:
- Anecdotal data
- Sample surveys
- Process-produced data
- Social media
However, experiments are considered the ‘gold standard’ because randomizing balances
all other variables across the groups, on average:
- Limitations of experiments:
● Lab experiments are criticized as being unrealistic environments.
● And, sometimes, you cannot do experiments.
Ethics:
Sampling:
- Done because it is usually infeasible to collect data on the whole population, and
because sampling saves money, effort, and time.
- Means choosing a subset of subjects from the population.
- We sample by randomization:
● Everyone must have the same chance of showing up in the data.
- Concerns of sampling using surveys:
● Undercoverage: certain groups of the population are left out or underrepresented.
● Nonresponse: try to survey people and they don’t respond.
Lecture #9
Confidence intervals:
- A confidence interval is a range of values that is likely to contain the mean of the
actual population.
- Our confidence intervals get narrower as we get more information about the mean of the
population (as the sample size gets bigger), which helps us detect small effects.
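A rough sketch of why the interval narrows. It assumes the common large-sample 95% interval, mean ± 1.96 × SD / √n, which is not spelled out in these notes. The sample mean and SD are made up.

```python
import math


def ci_95(sample_mean, sample_sd, n):
    """Approximate 95% confidence interval for a mean (large-sample formula)."""
    margin = 1.96 * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Same hypothetical mean and SD, growing sample size
widths = []
for n in (25, 100, 400):
    low, high = ci_95(sample_mean=50, sample_sd=10, n=n)
    widths.append(high - low)
    print(n, round(low, 2), round(high, 2))
# 25 46.08 53.92
# 100 48.04 51.96
# 400 49.02 50.98
```

Quadrupling the sample size halves the width of the interval, because the margin shrinks with the square root of n.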
Sample space:
- $A^c$ (A-complement) is the set of all outcomes that aren’t in A, so $P(A^c) = 1 - P(A)$.
Venn Diagrams - Type#3:
Dependent and Independent events:
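- The Venn diagram shows two overlapping events, so they are not mutually exclusive
(disjoint events would not intersect). Overlap alone does not show dependence: A and B
are independent only if P(A and B) = P(A)P(B).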
- If we want to find the probability of both A and B happening, it is the intersection
only. The probability of A or B is the area covered by both circles, counting the
intersection once: P(A or B) = P(A) + P(B) - P(A and B).
- If we are asking for the probability of B conditional on A, P(B|A), we want
P(A∩B)/P(A): the probability of both B and A over the probability of A.
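A small check of these rules on one roll of a fair die. The events A (even number) and B (greater than 3) are made up for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}             # sample space for one fair die
A = {2, 4, 6}                             # hypothetical event: roll is even
B = {4, 5, 6}                             # hypothetical event: roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


p_a_and_b = prob(A & B)                   # intersection only
p_a_or_b = prob(A) + prob(B) - p_a_and_b  # union, intersection counted once
p_b_given_a = p_a_and_b / prob(A)         # conditional probability

print(p_a_and_b, p_a_or_b, p_b_given_a)   # 1/3 2/3 2/3
print(p_a_and_b == prob(A) * prob(B))     # False, so A and B are dependent
```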
Example#1:
Final answer: P(A and B and C) = P(A)P(B|A)P(C|A and B), so P(A and B and C) = 0.25 ×
0.02 × 0.60 = 0.003 (which is 0.3%)
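The multiplication in the final answer can be checked in code. The helper name `joint_probability` is made up, and the three inputs are the values used in the answer above.

```python
import math


def joint_probability(*factors):
    """Multiply P(A), P(B|A), P(C|A and B), ... to get P(A and B and C ...)."""
    return math.prod(factors)


p_all_three = joint_probability(0.25, 0.02, 0.60)
print(round(p_all_three, 4))              # 0.003
print(f"{p_all_three:.1%}")               # 0.3%
```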
Example#2:
Bayes Rule:
$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)}$
Example#1:
- A factory production line is manufacturing bolts using three machines, A, B and C. Of the
total output, machine A is responsible for 25%, machine B for 35% and machine C for
the rest. It is known from previous experience with the machines that 5% of the output
from machine A is defective, 4% from machine B and 2% from machine C. A bolt is
chosen at random from the production line and found to be defective. What is the
probability that it came from (a) machine A (b) machine B (c) machine C?
So:
1. D = bolt is defective
2. A = bolt is from A
3. B = bolt is from B
4. C = bolt is from C
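One way to work the bolt example, as a sketch. It assumes machine C makes the remaining 40% of output (100% - 25% - 35%) and uses the standard form of Bayes' rule, P(machine | D) = P(D | machine) × P(machine) / P(D). The variable names are made up.

```python
# Share of total output per machine (C is "the rest": 1 - 0.25 - 0.35)
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
# P(D | machine): share of each machine's output that is defective
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}

# Total probability that a randomly chosen bolt is defective
p_defective = sum(priors[m] * defect_rates[m] for m in priors)

# Bayes' rule for each machine
posteriors = {m: priors[m] * defect_rates[m] / p_defective for m in priors}

print(round(p_defective, 4))              # 0.0345
for machine, p in posteriors.items():
    print(machine, round(p, 4))
# A 0.3623
# B 0.4058
# C 0.2319
```

Machine B is the most likely source of a defective bolt even though machine A has the highest defect rate, because B produces more of the output than A.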