Midterm 2
Chapter 7: Simple regression
Regression basics
• Regression analysis is a method that uncovers the average value of a variable y for
different values of variable x
• The model for conditional mean shows the mean value of y for various values of x
• E[y | x] = f(x)
• y^E = f(x)
o it shows the result of plugging a specific value of x into the model
• y: dependent variable, left-hand-side variable
• x: explanatory variable, right-hand-side variable
• Regression analysis may reveal that average y tends to be higher at higher values of x
o pattern of positive mean-dependence/association
o Linear patterns: positive (negative) association - average y tends to be higher
(lower) at higher values of x.
o Non-linear patterns: association may be non-monotonic - y tends to be higher
for higher values of x in a certain range of the x variable and lower for higher
values of x in another range of the x variable
o No association or relationship
o goal: uncover this relationship
• Non-parametric regression
o describe the y^E = f(x) pattern without imposing a specific functional form
on f
o Let the data dictate what that function looks like, at least approximately.
o Can spot (any) patterns well
o When x has few values and there are many observations in the data, the best
and most intuitive non-parametric regression for y^E = f(x) shows average y
for each and every value of x.
o With many values of x: two ways
▪ bin scatters: show the average value of y that corresponds to each bin
created from the values of x, in an x-y coordinate system
▪ smoothing: shows the average y over the entire range of bins
▪ smoothed conditional means plots: lowess is one of them – we
interpret these graphs in a qualitative way
▪ lowess: locally weighted scatterplot smoothing – a smooth curve fit
around a bin scatter
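As a minimal illustration (not from the course material), the sketch below runs a lowess smoother on simulated data with an assumed nonlinear pattern, using statsmodels; the frac parameter is an arbitrary tuning choice.

```python
# Minimal sketch (simulated data): lowess = locally weighted scatterplot smoothing.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = np.sin(x) + rng.normal(0, 0.3, 500)   # nonlinear pattern plus noise

# lowess returns the smoothed conditional mean of y evaluated at the sorted x values;
# frac controls how much of the data each local fit uses (a tuning choice).
smoothed = lowess(y, x, frac=0.3)
print(smoothed[:5])   # columns: x (sorted), smoothed average y
```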
• Binary explanatory variable
o β is the difference in average y between observations with x = 1 and
observations with x = 0
o Graphically, the regression line of linear regression goes through two points:
average y when x is zero (α) and average y when x is one (α + β).
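A quick check of this (hypothetical simulated data, statsmodels assumed): with a binary x, the OLS intercept is the mean of y in the x = 0 group and the slope is the difference in group means.

```python
# Minimal sketch: binary explanatory variable, slope = difference in means.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 1000)                # binary explanatory variable
y = 2.0 + 1.5 * x + rng.normal(0, 1, 1000)  # true alpha = 2, beta = 1.5

fit = sm.OLS(y, sm.add_constant(x)).fit()
alpha_hat, beta_hat = fit.params
print(alpha_hat, y[x == 0].mean())                     # alpha = mean y when x = 0
print(beta_hat, y[x == 1].mean() - y[x == 0].mean())   # beta = difference in means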
• Coefficient formula
o Calculated estimates: α̂ and β̂ (use the data to calculate the statistics)
o β̂ = Cov[x, y] / Var[x]
o The formula of the intercept reveals that the regression line always goes
through the point of average x and average y
o α̂ = ȳ − β̂ x̄
o OLS gives the best-fitting linear regression line
o OLS method finds the values of the coefficients of the linear regression that
minimize the sum of squares of the difference between actual y values and
their values implied by the regression, α̂ + β̂x
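A minimal sketch of the coefficient formulas above (simulated data, numpy and statsmodels assumed): computing β̂ and α̂ by hand and checking them against a library fit.

```python
# Minimal sketch: beta_hat = Cov[x,y]/Var[x], alpha_hat = mean(y) - beta_hat*mean(x).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)

beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(alpha_hat, beta_hat)   # by hand
print(fit.params)            # statsmodels gives the same numbers
```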
Residuals and predicted values
• Predicted values
o The predicted value of the dependent variable = best guess for its average
value if we know the value of the explanatory variable, using our model
o The predicted values of the dependent variable are the points of the regression
line itself.
o ŷ = α̂ + β̂x
• Residuals
o The residual is the difference between the actual value of the dependent
variable for an observation and its predicted value :
o eᵢ = yᵢ − ŷᵢ
o The residual is meaningful only for actual observations
▪ While we can have predicted values for any x, actual y values are only
available for the observations in our data
o above regression line: positive
o under regression line: negative
o Residuals sum to zero if a linear regression is fitted by OLS
▪ Sum is zero –> average of the residuals is zero, too.
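A quick check of these properties (simulated data, statsmodels assumed): predicted values lie on the fitted line and OLS residuals sum to zero.

```python
# Minimal sketch: predicted values and residuals from an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = fit.fittedvalues          # predicted values: alpha_hat + beta_hat * x
e = fit.resid                     # residuals: y - y_hat
print(np.allclose(e.sum(), 0))    # True: residuals sum (and average) to zero under OLS
print(np.allclose(y_hat, fit.params[0] + fit.params[1] * x))   # True: points on the line
```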
• Interpolation
o predicting y for x values that are not in the data but lie in between x values in the data
• Extrapolation
o predicting y for x not in the data and outside the range of x in the data
R-squared
• Fit of a regression captures how predicted values compare to the actual values
• gives the goodness of fit
• R² = Var[ŷ] / Var[y] = 1 − Var[e] / Var[y]
• can be defined for parametric and non-parametric regressions
• always between 0 and 1
o 1: if the regression fits perfectly the data
o 0: all of the predicted y values are equal to the overall average value of y in the
data regardless of the value of the explanatory variable x – regression line is
completely flat
• Fit depends on (1): how well the particular version of the regression captures the actual
function f in y^E = f(x)
• Fit depends (2): how far actual values of y are spread around what would be predicted
using the actual function f
• R-squared may help in choosing between different versions of regression for the same
data
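A minimal sketch of the two R-squared formulas above (simulated data, statsmodels assumed): both agree with the library's reported value.

```python
# Minimal sketch: R-squared computed from predicted values and from residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 300)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 300)

fit = sm.OLS(y, sm.add_constant(x)).fit()
r2_from_pred = np.var(fit.fittedvalues) / np.var(y)    # Var[y_hat] / Var[y]
r2_from_resid = 1 - np.var(fit.resid) / np.var(y)      # 1 - Var[e] / Var[y]
print(r2_from_pred, r2_from_resid, fit.rsquared)        # all (about) the same
```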
Causation
• R-squared of the simple linear regression is the square of the correlation coefficient
o R² = (Corr[y, x])²
o So the R-squared is yet another measure of the association between the two
variables.
• reverse regression
o x^E = γ + δy (regressing x on y)
o the two slopes are not the same unless the variances of x and y are equal
o but they always have the same sign
o both are larger in magnitude the larger the covariance
o R² for the simple linear regression and the reverse regression is the same
• Slope of the y^E = f(x) regression is not zero in our data
o Several reasons, not mutually exclusive:
▪ x causes y
▪ y causes x
▪ A third variable causes both x and y (or many such variables do)
Chapter 8: Complicated patterns and messy data
The shape of association
• When it is important whether the shape of a regression is linear or not:
o we want to make a prediction or analyze residuals - better fit
o we want to go beyond the average pattern of association - good reason for
complicated patterns
o all we care about is the average pattern of association, but the linear regression
gives a bad approximation to that - linear approximation is bad
• When potential nonlinear shapes don’t matter:
o all we care about is the average pattern of association
o linear regression is good approximation to the average pattern
Logs
• Frequent nonlinear patterns are often better approximated with y or x transformed by
taking logs, which amounts to working with relative differences
• Log differences works because differences in natural logs approximate percentage
differences!
• we usually use ln
• In cross-sectional data usually there is no natural base for comparison
• Log transformation allows for comparison in relative terms – percentages
o Log transformation allows for comparison in relative terms (percentages),
because:
o ln(x + Δx) − ln(x) = ln(1 + Δx/x) ≈ Δx/x (for small differences)
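A quick numeric check of the approximation above (made-up numbers): for small relative differences the log difference is close to the percentage difference, and the two drift apart as the difference grows.

```python
# Minimal sketch: log differences approximate relative (percentage) differences.
import numpy as np

x = 100.0
for dx in (1.0, 5.0, 20.0):                 # 1%, 5%, 20% differences
    log_diff = np.log(x + dx) - np.log(x)
    rel_diff = dx / x
    print(rel_diff, log_diff)               # close for small dx, less so for large dx
```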
• when to take logs?
o Percentage differences
o relative comparisons are free from the units of measurement of the variables, which are
often different across time and space and are arbitrary to begin with
o economic decisions
o relative differences are likely to be more stable across time
• The distribution of many important economic variables is skewed with a long right
tail and is reasonably well approximated by a log-normal distribution.
• Logs cannot be taken of non-positive values, so variables with zero or negative values
need special treatment (see the book for more on when to take logs)
• Log-level
o ln(y)^E = α + βx
o α is average ln(y) when x is zero. (Often meaningless.)
o β: y is β × 100 percent higher, on average, for observations with one unit higher x
• Level-log
o y^E = α + β ln(x)
o α: average y when ln(x) is zero (and thus x is one)
o β: y is β/100 units higher, on average, for observations with one percent higher x
• Log-log
o ln(y)^E = α + β ln(x)
o α: is average ln(y) when ln(x) is zero. (Often meaningless.)
o β: y is β percent higher on average for observations with one percent higher x.
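A minimal sketch of the log-log case (simulated data with an assumed true elasticity of 0.8, statsmodels assumed): the slope is read as the percent difference in y per one percent difference in x.

```python
# Minimal sketch: log-log regression and its elasticity interpretation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = np.exp(rng.normal(0, 1, 500))                              # positive, right-skewed x
y = np.exp(0.5 + 0.8 * np.log(x) + rng.normal(0, 0.2, 500))    # true elasticity 0.8

fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print(fit.params)   # slope ~ 0.8: y is about 0.8 percent higher for 1 percent higher x
```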
• per capita measures
o Most often: per capita: GDP/capita, revenues/employee, sales/shop
o can take logs easily
Polynomials
• polynomials do not require the analyst to specify where the pattern may change
• Quadratic
o Technically: quadratic function is not a linear function (a parabola, not a line)
o Handles only nonlinearity, which can be captured by a parabola
o y^E = β₀ + β₁x + β₂x²
o if β₂ is positive: convex relationship
o if β₂ is negative: concave relationship
o we can get the slope of the function by taking the derivative: the slope at x is β₁ + 2β₂x
o We can compare two observations, denoted by j and k, that differ in x by one unit so
that xₖ = xⱼ + 1. y is higher by approximately β₁ + 2β₂xⱼ units for observation k than
for observation j
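A minimal sketch of a quadratic fit (simulated data with made-up coefficients, statsmodels assumed), evaluating the slope β₁ + 2β₂x at a few x values.

```python
# Minimal sketch: quadratic regression and its local slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 400)
y = 1.0 + 2.0 * x - 0.15 * x**2 + rng.normal(0, 1, 400)   # concave true pattern

X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
b0, b1, b2 = fit.params
for xj in (2.0, 5.0, 8.0):
    print(xj, b1 + 2 * b2 * xj)   # local slope; switches sign past the parabola's peak
```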
Other
• robustness checks: running several different regressions and comparing results
• Influential observations
o the slope of the regression is different when we include them in the data from
the slope when we exclude them from the data
o extreme values
o why the values are extreme for the influential observations
o what the question of the analysis is
• Measurement error in variables
o such errors may arise due to technical or fundamental reasons
o latent variables: Latent variables are unobservable variables that are inferred
from observable variables in a statistical model.
o proxy variables: observed variables that we use instead of latent variables
o classical measurement error: an error that is zero on average and is
independent of all other relevant variables, including the error-free variable
o noise-to-signal ratio: the importance of measurement error
o attenuation bias: the effect of classical measurement error in the explanatory
variable
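A minimal simulation of attenuation bias (hypothetical numbers, statsmodels assumed): classical measurement error in the explanatory variable pulls the estimated slope toward zero.

```python
# Minimal sketch: classical measurement error in x attenuates the slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x_true = rng.normal(0, 1, 5000)
y = 1.0 + 1.0 * x_true + rng.normal(0, 1, 5000)     # true slope = 1
x_noisy = x_true + rng.normal(0, 1, 5000)           # classical measurement error in x

slope_clean = sm.OLS(y, sm.add_constant(x_true)).fit().params[1]
slope_noisy = sm.OLS(y, sm.add_constant(x_noisy)).fit().params[1]
print(slope_clean)   # close to 1
print(slope_noisy)   # close to 0.5 here: Var[x] / (Var[x] + Var[error]) = 1/2
```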
• Using weights
o one way to compensate for unequal sampling and biased coverage
o use it with aggregate data
▪ countries, firms, families
Chapter 9: Generalizing results of a regression
Generalizing linear regression coefficients
• Question: Is the pattern we see in our data
o True in general?
o or is it just a special feature of the particular data we see?
• inference: the act of generalizing results
• statistical inference: generalizing from the data to the population, or general pattern, it
represents
• external validity
o if it is low we may consider wider ranges
o Beyond (other dates, countries, people, firms)
CI and SE of regression coefficients
• true value: the true value (of beta) in the population, or general pattern, represented by
the data
• β̂: the average difference in y corresponding to a one-unit difference in x in the data
o the question of statistical inference is what the true value of β is
• ŷᵢ: best guess for the expected value (average) of the dependent variable for
observation i with value xᵢ for the explanatory variable in the dataset
• CI
o confidence interval of the regression coefficient
o can tell us where the true value of beta is with 95% likelihood
o the smaller the SE, the narrower the CI and the more precise the coefficient estimate
o 95% CI(β̂) = [β̂ − 2SE(β̂), β̂ + 2SE(β̂)]
• SE
o measures the spread of the values of the statistic across hypothetical repeated
samples drawn from the same population, or general pattern, that our data
represents
o In the context of linear regression, the standard error is used to quantify the
precision of the regression coefficients. Each coefficient has its standard error,
and the ratio of the coefficient estimate to its standard error is used to assess
statistical significance.
• simple SE formula
o Simple SE formula is not correct in general
o assumes that variables are independent across observations
o Homoskedasticity assumption
o SE(β̂) = Std[e] / (√n × Std[x])   (e: regression residuals)
o smaller:
▪ smaller the standard deviation of the residual
▪ larger the standard deviation of the explanatory variable
▪ more observations are in the data
• homoskedasticity
o the assumption that the fit of the regression line is the same across the entire
range of the x variable
• heteroskedasticity
o the fit may differ at different values of x, in which case the spread of actual y
around the regression line is different for different values of x
• robust SE formula
o In statistics, a robust statistical method is one that remains valid and effective
even when the assumptions of the method are not perfectly met. Robust
statistical techniques are less sensitive to outliers or deviations from the
expected data distribution.
o Same properties as the simple formula: smaller when Std[e] is small, Std[x] is
large and n is large
o allows for heteroskedasticity (it does not assume homoskedasticity)
o sometimes the simple and robust SEs are the same, sometimes the robust one is larger
o Coefficient estimates, R squared etc. remain the same
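A minimal sketch comparing the two SE formulas (simulated heteroskedastic data, statsmodels assumed; "HC1" is one of the robust covariance options).

```python
# Minimal sketch: simple vs. heteroskedasticity-robust standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 1000)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x, 1000)   # spread of y grows with x

X = sm.add_constant(x)
fit_simple = sm.OLS(y, X).fit()                  # default: simple SE formula
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust SE

print(fit_simple.params, fit_robust.params)  # coefficient estimates are identical
print(fit_simple.bse)                        # simple SEs
print(fit_robust.bse)                        # robust SEs (here somewhat different)
```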
Intervals for predicted values
• How can we quantify and present the uncertainty corresponding to a specific
predicted value ŷⱼ?
• the CI of the predicted value/CI of the regression line
o The CI of the predicted value combines the CI for α̂ and the CI for β̂
o 95% CI(ŷⱼ) = ŷⱼ ± 2SE(ŷⱼ)
o It answers the question of where we can expect y^E to lie if we know the value
of x and we have estimates of the coefficients α̂ and β̂ from the data
o SE(ŷⱼ) can be estimated using bootstrap or an appropriate formula
o SE(ŷⱼ) = Std[e] × √( 1/n + (xⱼ − x̄)² / (n × Var[x]) )
o the SE of the predicted value for a particular observation is small if the SEs of
the coefficient estimates are small and the particular observation has an x
value close to its average
o The second part means that predictions for observations with more extreme x
values are bound to have larger standard errors and thus wider confidence
intervals
o Can be used for any model
o In general, the CI for the predicted value is an interval that tells where to
expect average y given the value of x in the population, or general pattern,
represented by the data
• Prediction interval
o The prediction interval for yⱼ starts from the CI for ŷⱼ and adds the extra
uncertainty due to the fact that the actual yⱼ will be somewhere around ŷⱼ.
o 95% PI(yⱼ) = ŷⱼ ± 2SPE(ŷⱼ)
o Standard prediction error:
▪ SPE(ŷⱼ) = Std[e] × √( 1 + 1/n + (xⱼ − x̄)² / (n × Var[x]) )
o In the formula, all elements get very small as n gets large, except for the new
element (the 1 under the square root)
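A minimal sketch of both intervals (simulated data, statsmodels assumed): get_prediction reports the CI of the predicted value and the wider prediction interval for a few chosen x values.

```python
# Minimal sketch: CI of the regression line vs. prediction interval for actual y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(5, 2, 300)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 300)

fit = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([1.0, 5.0, 9.0]), has_constant="add")
pred = fit.get_prediction(new_x).summary_frame(alpha=0.05)
# mean_ci_*: CI for the predicted value (narrow, widens away from average x)
# obs_ci_*: prediction interval for an actual y (much wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```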
Testing hypotheses
• whether the true value of β, meaning its value in the population or general pattern
represented by the data, is zero
• H₀: β_true = 0
• H_A: β_true ≠ 0
• t-statistic
o the t-statistic is best viewed as a measure of distance: how far the value of the
statistic (β̂) in the data is from its value hypothesized by the null (zero)
o A t-statistic of 0 means that the value of β̂ is exactly what’s in the
null hypothesis (zero distance)
o A t-statistic of 1 means that the value of β̂ is exactly one SE larger than the
value in the null hypothesis
o t = (β̂ − c) / SE(β̂), where c is the value hypothesized by the null (zero here)
o The t-statistic for the intercept coefficient is analogous
o critical value: approximately ±2 at the 5% significance level. We reject the null if the
t-statistic is larger than 2 or smaller than −2, and we don’t reject the null if it’s in-between
• p-value
o the p-value is the smallest significance level at which we can reject the null
o We have to simply look at the p-value and decide if it is larger or smaller than
the level of significance that we set for ourselves in advance
o x is said to be statistically significant at 5%
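A minimal sketch (simulated data, statsmodels assumed): the t-statistic for the null of zero is the coefficient divided by its SE, and the reported p-value is compared with the chosen significance level.

```python
# Minimal sketch: t-statistic and p-value for the slope coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
x = rng.normal(0, 1, 500)
y = 1.0 + 0.3 * x + rng.normal(0, 1, 500)

fit = sm.OLS(y, sm.add_constant(x)).fit()
beta_hat, se_beta = fit.params[1], fit.bse[1]
print(beta_hat / se_beta, fit.tvalues[1])   # t-statistic computed two ways
print(fit.pvalues[1])                       # reject H0: beta = 0 at 5% if below 0.05
```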
• proof of concept
o example: cross-country data
• proof beyond reasonable doubt
Other
• Usually, one star is attached to coefficients that are significant at 5% (we can reject
that they are zero at the 5% level), and two stars are attached to coefficients that are
significant at 1%
• As external validity is about generalizing beyond what our data represents, we can’t
assess it using our data
o analyzing other data may help
Chapter 10: Multiple linear regression
Multiple linear regression
• Multiple regression analysis uncovers average y as a function of more than one x
variable: y^E = f(x₁, x₂, ...).
• y^E = β₀ + β₁x₁ + β₂x₂
• The slope coefficient on x₁ shows the difference in average y across observations
with different values of x₁ but with the same value of x₂
• This way, multiple regression with two explanatory variables compares observations
that are similar in one explanatory variable to see the differences related to the other
explanatory variable
• x-x regression
o whether x₁ and x₂ are related
o δ would tell us how much the two x variables tend to move together
o x₂^E = γ + δx₁
o The slope of x₁ in a simple regression is different from its slope in the
multiple regression, the difference being the product of its slope in the
regression of x₂ on x₁ and the slope of x₂ in the multiple regression.
• omitted variable bias
o the slope in simple regression is different from the slope in multiple regression
by the slope in the x−x regression times the slope of the other x in the multiple
regression
o Corresponding differences in y may be due to differences in x₁ but also due to
differences in x₂
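A minimal simulation of the decomposition above (hypothetical numbers, statsmodels assumed): the simple-regression slope equals the multiple-regression slope plus the x–x slope times the slope of the other x.

```python
# Minimal sketch: omitted variable bias decomposition.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x1 = rng.normal(0, 1, 2000)
x2 = 0.6 * x1 + rng.normal(0, 1, 2000)              # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(0, 1, 2000)

b_simple = sm.OLS(y, sm.add_constant(x1)).fit().params[1]
b_multi = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params
delta = sm.OLS(x2, sm.add_constant(x1)).fit().params[1]      # x-x regression slope

print(b_simple)                          # slope from the simple regression
print(b_multi[1] + delta * b_multi[2])   # equals it (up to rounding) by the decomposition
```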
Multiple Linear Regression Terminology