Unit1 - Data Science - SPPU
From this, it can readily be seen that the "linear" aspect of the model
means the following: the mean of the response Y is a straight-line
function of X, i.e. $E(Y) = \beta_0 + \beta_1 X$.
Residuals
A residual is a measure of how far a point lies vertically from the
regression line. Simply put, it is the error between a predicted value
$\hat{y}$ and the observed actual value $y$: residual $= y - \hat{y}$.
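As a minimal sketch in Python (the data values below are purely
hypothetical), the residuals are simply the observed responses minus the
fitted values:

```python
import numpy as np

# Hypothetical data: x is the explanatory variable, y the observed response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit y_hat = b0 + b1 * x (polyfit returns [b1, b0]).
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# The residual is the vertical gap between each observed point and the line.
residuals = y - y_hat
print(residuals)
```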
Regression Inference
There are four assumptions associated with a linear regression model:
1. Linearity: The relationship between X and the mean of Y is linear.
2. Independence: The observations are independent of one another; no
observation in the data should depend on or influence any other.
3. Normality: For any fixed value of X, Y is normally distributed.
Equivalently, the residuals should follow a Normal distribution
centered at 0. In other words, sometimes the regression model will
make positive errors ($y - \hat{y} > 0$) and other times it will make
negative errors ($y - \hat{y} < 0$), but on average the errors should
equal 0 and their distribution should be bell-shaped.
4. Equality or Homoscedasticity: The variance of the residuals is the
same for any value of X. That is, the residuals should exhibit equal
variance across all values of the explanatory variable x: neither their
center nor their spread should depend on the value of x.
Conditions L, N, and E can be verified through what is known as a residual
analysis. Condition I can only be verified through an understanding of
how the data was collected.
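A minimal residual-analysis sketch, reusing the hypothetical data from
above, might check the L, N, and E conditions visually (the plot layout
here is one reasonable choice, not a prescribed one):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Reuse the hypothetical data and residuals from the earlier sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# L and E: residuals vs x should show no pattern and a constant spread.
axes[0].scatter(x, residuals)
axes[0].axhline(0, color="red")
axes[0].set(title="Residuals vs x", xlabel="x", ylabel="residual")

# N: residuals should look roughly bell-shaped and centred at 0.
axes[1].hist(residuals, bins=5)
axes[1].set(title="Histogram of residuals")

# N: points on a normal Q-Q plot should lie close to a straight line.
stats.probplot(residuals, dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
```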
First, the Independence condition. If dependencies exist among the
observations, they must be addressed. In more advanced statistics
courses, you'll learn how to incorporate such dependencies into your
regression models; one such technique is called hierarchical/multilevel
modelling.
Second, when conditions L, N, and E are not met, it often means there is
a shortcoming in our model. For example, it may be that a single
explanatory variable is insufficient, and we need to incorporate more
explanatory variables in a multiple regression model, apply a
transformation to one or more of the variables, or use an entirely
different modelling technique.
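As a hedged illustration of one such remedy, the sketch below fits a
multiple regression with a second, purely hypothetical explanatory
variable using statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with a second explanatory variable x2 added to the model.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column plus both explanatory variables.
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
print(results.params)  # estimates of beta0, beta1, beta2
```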
Confidence Intervals for Regression Slope and Intercept
A level C confidence interval for the parameters $\beta_0$ and $\beta_1$
may be computed from the estimates $b_0$ and $b_1$ using their estimated
standard errors and the appropriate critical value $t^*$ from the
$t(n-2)$ distribution: $b_i \pm t^* \,\mathrm{SE}(b_i)$.
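A minimal sketch of this computation in Python, on the same hypothetical
data as above; statsmodels' conf_int is shown alongside the hand-rolled
formula to confirm they agree:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

results = sm.OLS(y, sm.add_constant(x)).fit()

# Critical value t* from t(n-2) for a level C = 0.95 interval.
n, C = len(x), 0.95
t_star = stats.t.ppf(1 - (1 - C) / 2, df=n - 2)

# b_i ± t* SE(b_i), for the intercept b0 and the slope b1.
lower = results.params - t_star * results.bse
upper = results.params + t_star * results.bse
print(np.column_stack([lower, upper]))

# statsmodels computes the same intervals directly:
print(results.conf_int(alpha=1 - C))
```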
Confidence Intervals for Mean Response
The mean response for any specific value of x, say $x^*$, is given by
$\mu_y = \beta_0 + \beta_1 x^*$. Substituting the fitted estimates $b_0$
and $b_1$ gives the estimate $\hat{y} = b_0 + b_1 x^*$. A confidence
interval for the mean response is calculated as
$\hat{y} \pm t^* s_{\hat{y}}$, where the fitted value $\hat{y}$ is the
estimate of the mean response and $s_{\hat{y}}$ is its standard error.
The value $t^*$ is the upper $(1 - C)/2$ critical value for the
$t(n-2)$ distribution.
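A minimal sketch, again on the hypothetical data, where the chosen
$x^* = 3.5$ is arbitrary; statsmodels' get_prediction returns the
mean-response interval directly:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
results = sm.OLS(y, sm.add_constant(x)).fit()

# Hypothetical new value x* = 3.5; the row [1, x*] matches the design matrix.
X_star = np.array([[1.0, 3.5]])
pred = results.get_prediction(X_star)

# Confidence interval y_hat ± t* s_(y_hat) for the mean response at x*.
print(pred.predicted_mean)        # y_hat = b0 + b1 * x*
print(pred.conf_int(alpha=0.05))  # level C = 0.95 interval
```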
For example, a Poisson regression models the log of the expected count
$\lambda$ as a linear function of the explanatory variables:
$\log(\lambda) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
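A minimal sketch of fitting such a model, on simulated data whose
coefficients (0.5, 0.8, -0.3) are chosen purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical Poisson counts whose log-rate is linear in x1 and x2.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
lam = np.exp(0.5 + 0.8 * x1 - 0.3 * x2)  # log(lambda) = b0 + b1*x1 + b2*x2
counts = rng.poisson(lam)

# A generalized linear model with a Poisson family uses the log link by default.
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.params)  # should recover roughly (0.5, 0.8, -0.3)
```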
GLM vs GLiM (General Linear Model vs Generalized Linear Model)
Interpretation
According to the table above, individuals with endometrial cancer are
4.42 times more likely to have been exposed to estrogen than those
without endometrial carcinoma.
Learning point: It is not appropriate to interpret this as 'individuals
with estrogen exposure are 4.42 times more likely to develop endometrial
cancer than those without exposure.' The reason is that a case-control
study begins from the outcome, i.e. it selects a sample based on the
outcome of interest, which in this case is endometrial cancer.
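As a worked arithmetic sketch, the counts below are hypothetical and
chosen only so that the odds ratio comes out to the 4.42 quoted above;
they are not the study's actual data:

```python
# Hypothetical 2x2 case-control counts, chosen only so the arithmetic
# reproduces the 4.42 quoted above (not the study's actual data):
#                          exposed (estrogen)   unexposed
#   cases (cancer)               a = 221            b = 50
#   controls                     c = 100            d = 100
a, b, c, d = 221, 50, 100, 100

# Odds of exposure among cases divided by odds of exposure among controls.
odds_ratio = (a / b) / (c / d)
print(odds_ratio)  # 4.42
```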