Unit 1: Regression Models

 Overview of statistical linear models, residuals, regression inference, Generalized linear models
 Logistic regression, Interpretation of odds and odds ratios, Maximum likelihood estimation in logistic regression, Poisson regression, Examples, Interpreting logistic regression, Visualizing fitting logistic regression curves

Regression is a technique for investigating the relationship between independent variables (features) and a dependent variable (outcome).
It is used as a method for predictive modelling in machine learning, in which an algorithm is used to predict continuous outcomes.
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Overview of statistical linear models


Linear regression models
For the regression case, the statistical model is as follows. Given a (random) sample (Yi, Xi1, ..., Xip), i = 1, ..., n, the relation between the observations Yi and the independent variables Xij is formulated as

Yi = β0 + β1 φ1(Xi1) + ... + βp φp(Xip) + εi,   i = 1, ..., n,

where φ1, ..., φp may be nonlinear functions. In the above, the quantities εi are random variables representing errors in the relationship. The "linear" part of the designation relates to the appearance of the regression coefficients βj in a linear way in the above relationship. Alternatively, one may say that the predicted values corresponding to the above model, namely

Ŷi = β0 + β1 φ1(Xi1) + ... + βp φp(Xip),

are linear functions of the βj.


Given that estimation is undertaken on the basis of a least squares analysis, estimates of the unknown parameters βj are determined by minimising a sum of squares function

S = Σi (Yi − β0 − β1 φ1(Xi1) − ... − βp φp(Xip))².

From this, it can readily be seen that the "linear" aspect of the model means the following:

 the function to be minimised is a quadratic function of the βj, for which minimisation is a relatively simple problem;
 the derivatives of the function are linear functions of the βj, making it easy to find the minimising values;
 the minimising values are linear functions of the observations Yi;
 the minimising values are linear functions of the random errors εi, which makes it relatively easy to determine the statistical properties of the estimated values of βj.
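A minimal sketch of this least-squares fit in Python (NumPy assumed; the data below are simulated purely for illustration):

import numpy as np

# Illustrative data: y depends roughly linearly on x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=x.size)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Minimise the sum of squared residuals; lstsq returns the beta
# that makes X @ beta closest to y in the least-squares sense.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope:", beta)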
Time series models
An example of a linear time series model is an autoregressive moving average (ARMA) model. Here the model for values {Xt} in a time series can be written in the form

Xt = c + εt + φ1 Xt−1 + ... + φp Xt−p + θ1 εt−1 + ... + θq εt−q,

where again the quantities εt are random variables representing innovations, which are new random effects that appear at a certain time but also affect values of X at later times. In this instance the use of the term "linear model" refers to the structure of the above relationship in representing Xt as a linear function of past values of the same time series and of current and past values of the innovations.[1] This particular aspect of the structure means that it is relatively simple to derive relations for the mean and covariance properties of the time series. Note that here the "linear" part of the term "linear model" is not referring to the coefficients φi and θi, as it would be in the case of a regression model, which looks structurally similar.
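As a sketch, a model of this form can be fitted with statsmodels (assumed available); the series below is simulated, and an ARIMA model with d = 0 is used as an ARMA(p, q) fit:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1)-like series purely for illustration.
rng = np.random.default_rng(1)
eps = rng.normal(size=200)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + eps[t]

# ARMA(1, 1) fit: order=(p, d, q) with d=0 means no differencing.
model = ARIMA(x, order=(1, 0, 1))
result = model.fit()
print(result.summary())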

Residuals
A residual is a measure of how far away a point is vertically from the
regression line. Simply, it is the error between a predicted value and the
observed actual value.

Residuals can be visualized by plotting the data together with the line of best fit; the vertical lines from each observed point to the fitted line are the residuals.

Residual analysis is a useful class of techniques for the evaluation of the goodness of a fitted model. Checking the underlying assumptions is
important since most linear regression estimators require a correctly
specified regression function and independent and identically distributed
errors to be consistent.
A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
A few characteristics of a good residual plot are as follows:
1. It has a high density of points close to the origin and a low density of
points away from the origin
2. It is symmetric about the origin
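A minimal sketch of such a residual plot in Python (NumPy and matplotlib assumed; the data are simulated):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 80)
y = 1.0 + 2.5 * x + rng.normal(scale=2.0, size=x.size)

# Fit a straight line and compute residuals = observed - predicted.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
residuals = y - y_hat

# Residuals on the y-axis against the independent variable on the x-axis.
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual (y - y_hat)")
plt.title("Residual plot")
plt.show()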

Regression inference
There are four assumptions associated with a linear regression model:
1. Linearity: The relationship between X and the mean of Y is linear.
2. Independence: Observations are independent of each other. In other
words, the different observations in our data must be independent of
one another.
3. Normality: For any fixed value of X, Y is normally distributed. That is, the residuals should follow a Normal distribution whose centre is 0. In other words, sometimes the regression model will make positive errors (y − ŷ > 0) and other times negative errors (y − ŷ < 0), but on average the errors should equal 0 and their distribution should be roughly bell-shaped.
4. Equality or Homoscedasticity: The variance of residual is the same
for any value of X. The fourth and final condition is that the residuals
should exhibit Equal variance across all values of the explanatory
variable x. In other words, the value and spread of the residuals
should not depend on the value of the explanatory variable x.
Conditions L, N, and E can be verified through what is known as a residual
analysis. Condition I can only be verified through an understanding of
how the data was collected.
First, the Independence condition: if dependencies exist among the observations, they must be addressed. In more advanced statistics courses, you will learn how to incorporate such dependencies into your regression models. One such technique is called hierarchical/multilevel modelling.
Second, when conditions L, N, and E are not met, it often means there is a shortcoming in our model. For example, a single explanatory variable may be insufficient: we may need to incorporate more explanatory variables in a multiple regression model, transform one or more of the variables, or use an entirely different modelling technique.
Confidence Intervals for Regression Slope and Intercept

A level C confidence interval for the parameters β0 and β1 may be computed from the estimates b0 and b1 using their estimated standard errors and the appropriate critical value t* from the t(n − 2) distribution, i.e. b0 ± t* SE(b0) and b1 ± t* SE(b1).
Confidence Intervals for Mean Response

The mean of the response y for any specific value of x, say x*, is given by μy = β0 + β1x*. Substituting the fitted estimates b0 and b1 gives the estimate ŷ = b0 + b1x*. A confidence interval for the mean response is calculated as ŷ ± t* SE(ŷ), where the fitted value ŷ is the estimate of the mean response. The value t* is the upper (1 − C)/2 critical value for the t(n − 2) distribution.
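A hedged sketch of both intervals with statsmodels (assumed available; the data and the choice x* = 5 are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 4.0 + 1.8 * x + rng.normal(scale=2.0, size=x.size)

X = sm.add_constant(x)            # adds the intercept column
fit = sm.OLS(y, X).fit()

# 95% confidence intervals for the intercept b0 and slope b1.
print(fit.conf_int(alpha=0.05))

# Confidence interval for the mean response at x* = 5.
x_star = np.array([[1.0, 5.0]])   # [intercept, x*] row
pred = fit.get_prediction(x_star)
print(pred.conf_int(alpha=0.05))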

Generalized linear models


Generalized Linear Model (GLiM, or GLM) is an advanced statistical
modelling technique formulated by John Nelder and Robert Wedderburn
in 1972. It is an umbrella term that encompasses many other models,
which allows the response variable y to have an error distribution other
than a normal distribution. The models include Linear Regression, Logistic
Regression, and Poisson Regression.
Linear Regression model is not suitable if,
 The relationship between X and y is not linear. There exists some
non-linear relationship between them. For example, y increases
exponentially as X increases.
 Variance of the errors in y (homoscedasticity in Linear Regression) is not constant, and varies with X.
 The response variable is not continuous, but discrete/categorical. Linear Regression assumes a normal distribution of the response variable, which applies only to continuous data. If we try to build a linear regression model on a discrete/binary y variable, the model may predict values outside the valid range (for example, negative values) for the response variable, which is inappropriate.
Assumptions
There are some basic assumptions for Generalized Linear Models as well.
Most of the assumptions are similar to Linear Regression models, while
some of the assumptions of Linear Regression are modified.
 Data should be independent and random (Each Random variable
has the same probability distribution).
 The response variable y does not need to be normally distributed, but
the distribution is from an exponential family (e.g. binomial, Poisson,
multinomial, normal)
 The original response variable need not have a linear relationship
with the independent variables, but the transformed response
variable (through the link function) is linearly dependent on the
independent variables
 Feature engineering on the independent variables can be applied, i.e. instead of using the original raw independent variables, transformed versions (such as a log transformation, squaring the variables, or taking their reciprocal) can be used to build the GLM.
 Homoscedasticity (i.e. constant variance) need not be satisfied; the error variance of the response can increase or decrease with the independent variables.
 Errors are independent but need not be normally distributed.
Components
There are 3 components in GLM.
 Systematic Component/Linear Predictor:
It is just the linear combination of the Predictors and the regression
coefficients.
β0+β1X1+β2X2
 Link Function:
Represented as η or g(μ), it specifies the link between the random and systematic components. It indicates how the expected/predicted value of the response relates to the linear combination of predictor variables.
 Random Component/Probability Distribution:
It refers to the probability distribution, from the family of distributions, of
the response variable.
The family of distributions, called the exponential family, includes the normal, binomial and Poisson distributions.
Models
 Linear Regression, for continuous outcomes with normal distribution:
 Simple Linear Regression: y = β0 + β1X1
 Multiple Linear Regression: y = β0 + β1X1 + β2X2
The response is continuous. Predictors can be continuous or categorical, and can also be transformed. Errors are normally distributed and the variance is constant.

 Binary Logistic Regression, for dichotomous or binary outcomes with binomial distribution:
Here the log odds is expressed as a linear combination of the explanatory variables. Logit is the link function. The logistic (sigmoid) function returns a probability as the output, which varies between 0 and 1.

Log odds = β0 + β1X1 + β2X2

The response variable has only 2 outcomes. Predictors can be continuous or categorical, and can also be transformed.

 Poisson Regression, for count-based outcomes with Poisson distribution:
Here the logarithm of the expected count is expressed as a linear combination of the explanatory variables. The log link is the link function.

log(λ) = β0 + β1X1 + β2X2,

where λ is the average value of the count variable. The response variable is a count per unit of time or space. Predictors can be continuous or categorical, and can also be transformed.
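A minimal sketch of how these three models map onto statsmodels GLM families (the data are simulated; the family and link choices follow the descriptions above):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)

# Linear regression: Gaussian family, identity link.
y_cont = 1.0 + 2.0 * x + rng.normal(size=n)
print(sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit().params)

# Binary logistic regression: Binomial family, logit link.
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y_bin = rng.binomial(1, p)
print(sm.GLM(y_bin, X, family=sm.families.Binomial()).fit().params)

# Poisson regression: Poisson family, log link.
lam = np.exp(0.2 + 0.8 * x)
y_count = rng.poisson(lam)
print(sm.GLM(y_count, X, family=sm.families.Poisson()).fit().params)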

GLM vs GLiM

General Linear Models, also abbreviated GLM, are a special case of Generalized Linear Models (GLiM). General Linear Models refers to normal linear regression models with a continuous response variable. It includes many statistical models such as simple linear regression, multiple linear regression, ANOVA, ANCOVA, MANOVA, MANCOVA, the t-test and the F-test. General Linear Models assume that the residuals/errors follow a normal distribution. Generalized Linear Models, on the other hand, allow the residuals to have other distributions from the exponential family of distributions.
Logistic regression
In statistics, the logistic model (or logit model) is a statistical model that
models the probability of an event taking place by having the log-odds for
the event be a linear combination of one or more independent variables.
In regression analysis, logistic regression[1] (or logit regression)
is estimating the parameters of a logistic model (the coefficients in the
linear combination). Formally, in binary logistic regression there is a
single binary dependent variable, coded by an indicator variable, where
the two values are labelled "0" and "1", while the independent variables can
each be a binary variable (two classes, coded by an indicator variable) or
a continuous variable (any real value). The corresponding probability of the
value labelled "1" can vary between 0 (certainly the value "0") and 1
(certainly the value "1"), hence the labelling;[2] the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
Analysis of the hypothesis
The output from the hypothesis is the estimated probability. This is used to infer how confident we can be that the predicted value is the actual value for a given input X. Consider the example below:
X = [x0 x1] = [1 IP-Address]
Based on the x1 value, let's say we obtain an estimated probability of 0.8. This tells us that there is an 80% chance that the email will be spam. Mathematically this can be written as

hθ(x) = P(y = 1 | x; θ) = 0.8

This justifies the name 'logistic regression': the data are fit by a linear regression model, which is then acted upon by a logistic function to predict the target categorical dependent variable.
Types of Logistic Regression
1. Binary Logistic Regression
The categorical response has only two possible outcomes. Example: Spam or Not Spam
2. Multinomial Logistic Regression
Three or more categories without ordering. Example: Predicting which
food is preferred more (Veg, Non-Veg, Vegan)
3. Ordinal Logistic Regression
Three or more categories with ordering. Example: Movie rating from 1 to 5
Decision Boundary
To predict which class a data point belongs to, a threshold can be set. Based on this threshold, the estimated probability is mapped to a class. Say, if predicted_value ≥ 0.5, then classify the email as spam, else as not spam, as in the sketch below. The decision boundary can be linear or non-linear; the polynomial order of the features can be increased to obtain a more complex decision boundary.
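A minimal sketch of this thresholding with scikit-learn (assumed available; the toy data and the 0.5 cut-off are illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
# Illustrative labels: class depends on a linear combination of the features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Estimated probability of class 1 for each point.
proba = clf.predict_proba(X)[:, 1]

# Apply the threshold: >= 0.5 -> spam (1), otherwise not spam (0).
threshold = 0.5
predictions = (proba >= threshold).astype(int)
print(predictions[:10])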

Interpretation of odds and odds ratios


Odds and odds ratios are hard for many clinicians to understand. Odds are
the probability of an event occurring divided by the probability of the
event not occurring. An odds ratio is the odds of the event in one group,
for example, those exposed to a drug, divided by the odds in another
group not exposed. Odds ratios always exaggerate the true relative risk to
some degree. When the probability of the disease is low (for example, less
than 10%), the odds ratio approximates the true relative risk. As the event
becomes more common, the exaggeration grows, and the odds ratio no
longer is a useful proxy for the relative risk. Although the odds ratio is
always a valid measure of association, it is not always a good substitute for
the relative risk. Because of the difficulty in understanding odds ratios,
their use should probably be limited to case-control studies and logistic
regression, for which odds ratios are the proper measures of association.

The odds of an event happening are defined as the likelihood that the event will occur, expressed as a proportion of the likelihood that the event will not occur. Therefore, if A is the probability of subjects affected and B is the probability of subjects not affected, then odds = A/B.
For example, the odds of rolling a four on a die are 1/5, an implied probability of 20%. (Note this is not the same as the probability, which would be 1/6 = 16.67%.)
Odds Ratio (OR) is a measure of association between exposure and an
outcome. The OR represents the odds that an outcome will occur given a
particular exposure, compared to the odds of the outcome occurring in
the absence of that exposure.
Important points about Odds ratio:

 Calculated in case-control studies, as the incidence of the outcome is not known
 OR >1 indicates increased occurrence of an event
 OR <1 indicates decreased occurrence of an event (protective
exposure)
 Look at the confidence interval and p-value for the statistical significance of the estimate
 In rare outcomes OR ≈ RR (RR = Relative Risk). This applies when the incidence of the disease is < 10%. A worked calculation with hypothetical counts is sketched below.
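As a worked sketch with hypothetical counts (not from any real study), where a, b, c, d label the cells of a generic exposure-by-outcome two-by-two table:

# Hypothetical 2x2 table (counts are illustrative only):
#                  outcome present   outcome absent
# exposed                 a = 40           b = 60
# not exposed             c = 10           d = 90
a, b, c, d = 40, 60, 10, 90

odds_exposed = a / b        # odds of the outcome among the exposed
odds_unexposed = c / d      # odds of the outcome among the unexposed
odds_ratio = odds_exposed / odds_unexposed   # equivalently (a * d) / (b * c)
print(odds_ratio)           # 6.0 here: outcome odds are 6 times higher with exposure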

CLINICAL EXAMPLE AND CALCULATION


In a study examining the association between estrogen (exposure) and endometrial carcinoma (outcome), a two-by-two table of exposure against outcome gave an odds ratio of 4.42.
Learning point: In a two-by-two table, for ease of calculation ensure that the outcome of interest is always at the top and the exposure on the left.

Interpretation
According to this odds ratio, individuals with endometrial cancer are 4.42 times more likely to have been exposed to estrogen than those without endometrial carcinoma.
Learning point: It is not appropriate to interpret this as ‘Individuals with
estrogen exposure are 4.42 times more likely to develop Endometrial
cancer than those without exposure.’ The reason is that a case-control
study begins from outcome i.e. selection of a sample with the outcome of
interest which in this case is endometrial cancer.

Maximum likelihood estimation in logistic regression


A common modeling problem involves how to estimate a joint probability
distribution for a dataset.
For example, we may have a sample of observations (X) from a domain (x1, x2, x3, …, xn), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).
Density estimation involves selecting a probability distribution function
and the parameters of that distribution that best explain the joint
probability distribution of the observed data (X).
There are many techniques for solving this problem, although two
common approaches are:
 Maximum a Posteriori (MAP), a Bayesian method.
 Maximum Likelihood Estimation (MLE), frequentist method.
The main difference is that MLE assumes that all solutions are equally
likely beforehand, whereas MAP allows prior information about the form
of the solution to be harnessed.
Maximum Likelihood Estimation
One solution to probability density estimation is referred to as Maximum
Likelihood Estimation, or MLE for short.
Maximum Likelihood Estimation involves treating the problem as an
optimization or search problem, where we seek a set of parameters that
results in the best fit for the joint probability of the data sample (X).
First, it involves defining a parameter called theta that defines both the
choice of the probability density function and the parameters of that
distribution. It may be a vector of numerical values whose values change
smoothly and map to different probability distributions and their
parameters.
In Maximum Likelihood Estimation, we wish to maximize the probability
of observing the data from the joint probability distribution given a
specific probability distribution and its parameters, stated formally as:
 P(X | theta)
This conditional probability is often stated using the semicolon (;) notation
instead of the bar notation (|) because theta is not a random variable, but
instead an unknown parameter. For example:
 P(X ; theta)
or
 P(x1, x2, x3, …, xn ; theta)
This resulting conditional probability is referred to as the likelihood of
observing the data given the model parameters and written using the
notation L() to denote the likelihood function. For example:
 L(X ; theta)
The objective of Maximum Likelihood Estimation is to find the set of
parameters (theta) that maximize the likelihood function, e.g. result in the
largest likelihood value.
 maximize L(X ; theta)
We can unpack the conditional probability calculated by the likelihood
function.
Given that the sample comprises n examples, we can frame this as
the joint probability of the observed data samples x1, x2, x3, …, xn in X given
the probability distribution parameters (theta).
 L(x1, x2, x3, …, xn ; theta)
The joint probability distribution can be restated as the multiplication of
the conditional probability for observing each example given the
distribution parameters.
 product i to n P(xi ; theta)
Multiplying many small probabilities together can be numerically
unstable in practice, therefore, it is common to restate this problem as the
sum of the log conditional probabilities of observing each example given
the model parameters.
 sum i to n log(P(xi ; theta))
Where log with base e, called the natural logarithm, is commonly used.
Specifically, the choice of model and model parameters is referred to as a
modeling hypothesis h, and the problem involves finding h that best
explains the data X. We can, therefore, find the modeling hypothesis that
maximizes the likelihood function.
 maximize sum i to n log(P(xi ; h))
Supervised learning can be framed as a conditional probability problem of
predicting the probability of the output given the input:
 P(y | X)
As such, we can define conditional maximum likelihood estimation for
supervised machine learning as follows:
 maximize sum i to n log(P(yi|xi ; h))
Now we can replace h with our logistic regression model.
In order to use maximum likelihood, we need to assume a probability
distribution. In the case of logistic regression, a Binomial probability
distribution is assumed for the data sample, where each example is one
outcome of a Bernoulli trial. The Bernoulli distribution has a single
parameter: the probability of a successful outcome (p).
 P(y=1) = p
 P(y=0) = 1 – p
The expected value (mean) of the Bernoulli distribution can be calculated
as follows:
 mean = P(y=1) * 1 + P(y=0) * 0
Or, given p:
 mean = p * 1 + (1 – p) * 0
This calculation may seem redundant, but it provides the basis for the
likelihood function for a specific input, where the probability is given by
the model (yhat) and the actual label is given from the dataset.
 likelihood = yhat * y + (1 – yhat) * (1 – y)
This function will always return a large probability when the model is close
to the matching class value, and a small value when it is far away, for
both y=0 and y=1 cases.
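A minimal sketch of this log-likelihood in code (NumPy assumed; the labels and predicted probabilities are illustrative):

import numpy as np

def log_likelihood(y, yhat, eps=1e-12):
    # Sum of log Bernoulli likelihoods: log(yhat) when y = 1, log(1 - yhat) when y = 0.
    yhat = np.clip(yhat, eps, 1 - eps)   # avoid log(0)
    return np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# Illustrative values: true labels and model-predicted probabilities.
y = np.array([1, 0, 1, 1, 0])
yhat = np.array([0.9, 0.2, 0.8, 0.6, 0.1])
print(log_likelihood(y, yhat))   # maximising this is maximum likelihood estimation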
Poisson regression, Examples
In statistics, Poisson regression is a generalized linear model form
of regression analysis used to model count data and contingency tables.
Poisson regression assumes the response variable Y has a Poisson
distribution, and assumes the logarithm of its expected value can be
modeled by a linear combination of unknown parameters. A Poisson
regression model is sometimes known as a log-linear model, especially
when used to model contingency tables.
Negative binomial regression is a popular generalization of Poisson
regression because it loosens the highly restrictive assumption that the
variance is equal to the mean made by the Poisson model. The traditional
negative binomial regression model is based on the Poisson-gamma
mixture distribution. This model is popular because it models the Poisson
heterogeneity with a gamma distribution.
Poisson regression models are generalized linear models with the
logarithm as the (canonical) link function, and the Poisson
distribution function as the assumed probability distribution of the
response.
If x ∈ Rⁿ is a vector of independent variables, then the model takes the form

log(E[Y | x]) = α + β′x,

where α ∈ R and β ∈ Rⁿ. Sometimes this is written more compactly as

log(E[Y | x]) = θ′x,

where x is now an (n + 1)-dimensional vector consisting of the n independent variables concatenated to the number one, and θ is simply α concatenated to β.
Thus, when given a Poisson regression model θ and an input vector x, the predicted mean of the associated Poisson distribution is given by

E[Y | x] = e^(θ′x).
If Yi are independent observations with corresponding values xi of the predictor variables, then θ can be estimated by maximum likelihood. The maximum-likelihood estimates lack a closed-form expression and must be found by numerical methods. The log-likelihood surface for maximum-likelihood Poisson regression is always concave, making Newton–Raphson or other gradient-based methods appropriate estimation techniques.

Examples of Poisson regression


Example 1. The number of persons killed by mule or horse kicks in the
Prussian army per year. Ladislaus Bortkiewicz collected data from 20
volumes of Preussischen Statistik. These data were collected on 10 corps
of the Prussian army in the late 1800s over the course of 20 years.
Example 2. The number of people in line in front of you at the grocery
store. Predictors may include the number of items currently offered at a
special discounted price and whether a special event (e.g., a holiday, a big
sporting event) is three or fewer days away.
Example 3. The number of awards earned by students at one high school.
Predictors of the number of awards earned include the type of program in
which the student was enrolled (e.g., vocational, general or academic) and
the score on their final exam in math.
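A hedged sketch loosely modelled on Example 3 (the counts and predictors below are simulated, not the original high-school data; statsmodels assumed):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
math = rng.normal(50, 10, size=n)
prog = rng.choice(["general", "academic", "vocational"], size=n)

# True rates chosen only so the simulation produces plausible counts.
lam = np.exp(-3.0 + 0.05 * math + 0.4 * (prog == "academic"))
awards = rng.poisson(lam)
df = pd.DataFrame({"awards": awards, "math": math, "prog": prog})

# Poisson regression with the canonical log link:
# log(E[awards]) = b0 + b1*math + program effects.
fit = smf.glm("awards ~ math + C(prog)", data=df, family=sm.families.Poisson()).fit()
print(fit.summary())

# Coefficients are on the log scale; exponentiate for multiplicative rate effects.
print(np.exp(fit.params))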

Interpreting logistic regression


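Because the fitted model expresses the log odds as a linear combination of the predictors (Log odds = β0 + β1X1 + …), exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of the outcome for a one-unit increase in that predictor, holding the others fixed. A minimal sketch with simulated data (all names and numbers are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))      # true log odds = -0.5 + 1.2*x
y = rng.binomial(1, p)

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit()

# Coefficients are on the log-odds scale; exp(coef) is an odds ratio:
# each one-unit increase in x multiplies the odds of y = 1 by exp(b1).
print(fit.params)
print(np.exp(fit.params))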

Visualizing fitting logistic regression curves.


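A common way to visualize the fit is to plot the observed 0/1 outcomes against a single predictor and overlay the fitted logistic (sigmoid) curve. A hedged sketch with simulated data (matplotlib and statsmodels assumed):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(-3, 3, size=300)
p = 1 / (1 + np.exp(-(0.3 + 1.5 * x)))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit()

# Evaluate the fitted curve on a fine grid of x values.
grid = np.linspace(-3, 3, 200)
curve = fit.predict(sm.add_constant(grid))

plt.scatter(x, y, alpha=0.3, label="observed 0/1 outcomes")
plt.plot(grid, curve, color="red", label="fitted logistic curve")
plt.xlabel("x")
plt.ylabel("P(y = 1)")
plt.legend()
plt.show()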
