
Analysis Problem of Multicollinearity to Identify Effect of Independent Variables

CHAPTER 1
INTRODUCTION
1.1 Data analysis
Data analysis is the practice of gathering data and interpreting what it means. When
conducting data analysis, experts collect raw data and apply a variety of methods to interpret
the information it contains. There are five main types of data analysis, each describing how
data can be used to reach conclusions and make decisions:
 Descriptive analysis: Descriptive analysis determines what happened in a certain
situation. This type of analysis typically involves ordering and adjusting data from
different sources to interpret its meaning.
 Exploratory analysis: This type of analysis explores relationships between specific
data points or sets. When engaging in exploratory analysis, you can find connections
between pieces of information and create hypotheses to determine why they might
relate to each other.
 Predictive analysis: Predictive analysis refers to developing a prediction for what
might happen. This can involve considering results from an earlier analysis and
exploring trends and patterns to make an estimation about what might occur in the
future.
 Diagnostic analysis: Diagnostic analysis considers why something happened. When
using diagnostic analysis, you can explore events that occur and the context that
surrounds them to reach a solution as to why they might arise.
 Prescriptive analysis: This type of analysis attempts to determine how to make something
happen. Prescriptive analysis considers data about trends or patterns and recommends the
actions most likely to produce a certain expected result.
1.2 Methods of data analysis
Data analysis can be especially important for companies that encounter high volumes of data
and use it to inform future business decisions. One situation where data analysis can be crucial
is in market research as experts can analyze market data to develop strategies for future
marketing campaigns based on public responses. Data analysis can also be important because it
can tell a business about the specific demographics it serves by exploring data about customers'
habits, interests and behaviors. Another instance where data analysis can be important is in
developing a protocol for a workplace. This is because those in management roles can interpret
data about their company's performance to inform decisions about where to invest capital, how
to grow their company and what might happen to their business in the future.
Here are 6 methods you can use for data analysis:
1. Factor analysis
Factor analysis considers all the potential variables that might arise in or affect a particular data
set. Experts also sometimes refer to factor analysis as dimension reduction because a factor
analysis views data points in terms of their dimensions. For example, a business might use
factor analysis to learn about how customers view a particular product by asking multiple
customers to describe the product and noting all characteristics they identify, such as color,
material and usability. The business can then use this information when developing new
products to ensure they include details that interest customers.
2. Regression analysis
Regression analysis uses historical data to observe how changing one or more independent
variables might affect an established dependent variable. Experts often use regression analysis
to find relationships between certain variables and make predictions about potential outcomes.
This can be beneficial for businesses that sell products, as they can use regression analysis to
identify which elements of their products and sales strategies might be most effective, such as
product quality, marketing initiatives and customer engagement.
For example, a business might find through regression analysis that the sale of their top
product, the dependent variable, depends primarily on the independent variables of customer
accessibility and publicity. The decision-makers can then use this information to redistribute
funds from other areas of production to improve their publicity and customer accessibility,
which might help to increase sales.
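As a minimal illustration of this method (not part of the thesis), the following Python sketch fits such a regression with statsmodels; the sales, publicity and accessibility figures are invented for the example.

import numpy as np
import statsmodels.api as sm

# Hypothetical monthly data: units sold (in thousands), publicity spend and a
# customer-accessibility score. All values are made up for illustration.
publicity     = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
accessibility = np.array([3.0, 4.0, 4.0, 5.0, 6.0, 6.0, 7.0, 8.0])
sales         = np.array([20.0, 24.0, 27.0, 31.0, 36.0, 38.0, 43.0, 47.0])

# Dependent variable: sales; independent variables: publicity and accessibility.
X = sm.add_constant(np.column_stack([publicity, accessibility]))
model = sm.OLS(sales, X).fit()
print(model.params)   # intercept plus one coefficient per independent variable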
3. Cluster analysis
Cluster analysis involves grouping items in data sets with other items that have similar
properties. This process produces a number of groups that each contain items that are similar to
each other, which can help identify relationships between values in a data set
that aren't initially presented as being related. This method can be particularly effective in the
marketing industry, as marketing specialists can use cluster analysis to find similarities among
a company's customer base that can inform which aspects of the company they might market to
appeal to any identified common interests.
4. Text analysis
Text analysis explores large sets of data in text form and rearranges them to make the
information more accessible and simpler to organize. This method of analysis can be valuable
in considering sources of data like articles, survey responses and product reviews, as it can
group similar texts together by their content, tone or intention. For example, a business might
learn about their customers' reception of a certain product by using text analysis to group like-
minded comments or reviews about the product together, showcasing which responses are most
common.
5. Neural networks
Neural networks act as the foundation for algorithms that can conduct specific tasks for data
analysis automatically. As neural networks reflect how people might process certain forms of
data, they typically learn from each data transaction they interact with by identifying new
patterns, predicting new values and processing different forms of data. One common use of
neural networks is in predictive data analysis, as neural networks can automatically generate
and display high volumes of predictions across data sets.
6. Data Mining
Data mining is the process of using metrics to find relationships, patterns and trends in large
sets of data that experts can use to inform business decisions. When a company uses data
mining, it can collect high volumes of information and automatically determine where different
pieces of information connect. This can be especially helpful for businesses that experience a
lot of online engagement, as data mining can inform them about their customers' interests and
purchasing habits.
1.3 Introduction to Multicollinearity
Multicollinearity occurs when two or more independent variables are highly correlated with
one another in a regression model. This means that an independent variable can be predicted
from another independent variable in a regression model. For example, height and weight,
household income and water consumption, mileage and price of a car, study time and leisure
time, etc. Let us take an example from everyday life to explain this. Colin loves watching
television while munching on chips. The more television he watches, the more chips he eats,
and the happier he gets! Now, if we could quantify happiness and measure Colin’s happiness
while he is busy doing his favorite activities, which one do we think would have a greater
impact on his happiness?
Having chips or watching television? That’s difficult to determine because the moment we try
to measure Colin’s happiness from eating chips, he starts watching television. And the moment
we try to measure his happiness from watching television, he starts eating chips. Eating chips
and watching television are highly correlated in the case of Colin and we cannot individually
determine the impact of the individual activities on his happiness. This is the multicollinearity
problem. Multicollinearity can be a problem in a regression model because we would not be
able to distinguish between the individual effects of the independent variables on the dependent
variable. For example, consider the following linear equation:
Y = W0 + W1X1 + W2X2
Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant. But
since X1 and X2 are highly correlated, changes in X1 would also cause changes in X2, and we
would not be able to see their individual effect on Y. This makes the effect of X1 on Y
difficult to distinguish from the effect of X2 on Y. Multicollinearity may not affect the
accuracy of the model much, but we might lose reliability in determining the effects of
individual features in the model, and that can be a problem when it comes to interpretability.
Multicollinearity is the presence of high correlations between two or more independent
variables (predictors). It is basically a phenomenon where independent variables are correlated.
Let us first understand what the term correlation means.
Correlation is the association between variables; it measures the extent to which two variables
are related to each other. Two variables can have a positive correlation (the two variables tend
to move in the same direction), a negative correlation (they tend to move in opposite
directions), or no correlation. It is easy
to remember these terms if we keep some examples in our minds.

1. A simple example of positive correlation can be weight and height. The taller you are, the
heavier you weigh (this is considered a general trend if we leave the exceptional cases aside).
2. A simple example of negative correlation can be altitude and oxygen level. The higher you
go, the lower the oxygen level is.
3. A simple example of no correlation can be the depth of the sea and the number of apples
bought from the store. Neither of them is related to the other.
Simply put, we can say that multicollinearity occurs when two or more predictors in regression
analysis are highly related to one another. For example, the level of education and annual
income. It is generally considered that the more educated you are, the more you earn. Thus, one
variable can be easily predicted using another variable. If we keep both these variables in our
analysis, it can cause problems for our model.
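As a quick illustration (not part of the thesis), the pairwise correlations between candidate predictors can be inspected before fitting a model; the column names and values below are hypothetical.

import pandas as pd

# Hypothetical predictors: education and income should be strongly related,
# while shoe size should be unrelated to both.
data = pd.DataFrame({
    "education_years": [10, 12, 12, 14, 16, 16, 18, 20],
    "annual_income":   [28, 35, 33, 42, 55, 52, 68, 80],   # in thousands
    "shoe_size":       [9, 7, 8, 10, 6, 9, 8, 7],
})

# A Pearson correlation close to +1 or -1 between two predictors is a first
# warning sign of multicollinearity.
print(data.corr().round(2))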
1.4 Types of Multicollinearity
There are two basic kinds of multicollinearity:
1. Structural multicollinearity: This type of multicollinearity is caused by the researchers
(people like us) who create new predictors using the given predictors in order to solve the
problem. For example, the creation of variable x² from the predictor variable x. Thus, this
type of multicollinearity is a byproduct of the model we specify and not present in the data
itself.
2. Data multicollinearity: This type of multicollinearity is the result of poorly designed
experiments or of purely observational data. Thus, it is present in the data itself and has not
been specified/designed by us.
In a few places, you might come across terms like perfect multicollinearity and high
multicollinearity as the two different types of multicollinearities.
Perfect multicollinearity occurs when two or more independent predictors in a regression
model exhibit a perfectly predictable (exact or no randomness) linear relationship. The
correlation, in this case, is equal to +1 or -1. For example, weight in pounds and weight in
kilograms. However, we rarely face issues of perfect multicollinearity in a dataset.
High/Imperfect/Near multicollinearity occurs when two or more independent predictors are
approximately linearly related. This is a common type and is problematic to us. All our analyses
are based on detecting and dealing with this type of multicollinearity.

1.5 Perfect Multicollinearity


In statistics, multicollinearity occurs when two or more predictor variables are highly
correlated with each other, such that they do not provide unique or independent information in
the regression model. If the degree of correlation is high enough between variables, it can
cause problems when fitting and interpreting the regression model. The most extreme case of
multicollinearity is known as perfect multicollinearity. This occurs when at least two predictor
variables have an exact linear relationship between them. For example, suppose we have the
following dataset.
Table 1.1 Simple data set with two independent variables
y x1 x2
6 2 4
6 2 4
8 2 4
12 3 6
13 4 8
14 5 10
15 5 10
15 7 14
13 9 18
17 10 20

Notice that the values for predictor variable x2 are simply the values of x1 multiplied by 2.
This is an example of perfect multicollinearity. When perfect multicollinearity is present in a
dataset, the method of ordinary least squares is unable to produce estimates for regression
coefficients. This is because it’s not possible to estimate the marginal effect of one predictor
variable (x1) on the response variable (y) while holding another predictor variable (x2)
constant, because x2 always moves in exact proportion to x1.

Table 1.2 Simple data set with perfect collinearity

y x1 x2
6 2 2 * 2
6 2 2 * 2
8 2 2 * 2
12 3 3 * 2
13 4 4 * 2
14 5 5 * 2
15 5 5 * 2
15 7 7 * 2
13 9 9 * 2
17 10 10 * 2

The simplest way to handle perfect multicollinearity is to drop one of the variables that has an
exact linear relationship with another variable. For example, in our previous dataset we could
simply drop x2 as a predictor variable.
Table 1.3 Simple data set after dropping x2 as a predictor variable
y x1
6 2
6 2
8 2
12 3
13 4
14 5
15 5
15 7
13 9
17 10
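The effect of this exact linear relationship can be verified numerically. The following sketch (an illustration, not part of the thesis methodology) rebuilds Tables 1.1 and 1.3 with NumPy and shows that the full design matrix is rank-deficient, while the reduced one is not.

import numpy as np

y  = np.array([6, 6, 8, 12, 13, 14, 15, 15, 13, 17], dtype=float)
x1 = np.array([2, 2, 2, 3, 4, 5, 5, 7, 9, 10], dtype=float)
x2 = 2 * x1                     # x2 is an exact linear function of x1

# Design matrix with an intercept column and both predictors (Table 1.1).
X_full = np.column_stack([np.ones_like(x1), x1, x2])

# The matrix has 3 columns but rank 2, so X'X is singular and ordinary least
# squares has no unique solution.
print(np.linalg.matrix_rank(X_full), X_full.shape[1])    # prints: 2 3
try:
    np.linalg.inv(X_full.T @ X_full)
except np.linalg.LinAlgError as err:
    print("Cannot invert X'X:", err)

# Dropping x2 (Table 1.3) restores a full-rank design matrix, and least
# squares now gives unique coefficient estimates.
X_reduced = np.column_stack([np.ones_like(x1), x1])
beta, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
print("Intercept and slope for x1:", beta)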

1.6 The Dummy Variable Trap


Another scenario where perfect multicollinearity can occur is known as the dummy variable
trap. This arises when we want to use a categorical variable in a regression model and convert
it to one or more “dummy variables” that take on values of 0 or 1. For example, suppose we
would like to use the predictor variables “age” and “marital status” to predict income:

Table 1.4 Simple data set with income, age and marital status
Income Age Marital Status
$45,000 23 Single
$48,000 25 Single
$54,000 24 Single
$57,000 29 Single
$65,000 38 Married
$69,000 36 Single
$78,000 40 Married
$83,000 59 Divorced
$98,000 56 Divorced
$104,000 64 Married
$107,000 53 Married

To use “marital status” as a predictor variable, we need to first convert it to a dummy variable.
To do so, we can let “Single” be our baseline value since it occurs most often and assign values
of 0 or 1 to “Married” and “Divorced” as follows:
Table 1.5 Simple data set after converting marital status into dummy variables
Income Age Married Divorced
$45,000 23 0 0
$48,000 25 0 0
$54,000 24 0 0
$57,000 29 0 0
$65,000 38 1 0
$69,000 36 0 0
$78,000 40 1 0
$83,000 59 0 1
$98,000 56 0 1
$104,000 64 1 0
$107,000 53 1 0

A common mistake would be to create three new dummy variables, one for each category, as
shown in Table 1.6.

In this case, the variable “Single” is an exact linear function of the “Married” and
“Divorced” variables (Single = 1 - Married - Divorced). This is an example of perfect
multiple linear regression model in R using this dataset, we won’t be able to produce a
coefficient estimate for every predictor variable:
Table 1.6 New dummy variables for Single, Married and Divorced
Income Age Single Married Divorced
$45,000 23 1 0 0
$48,000 25 1 0 0
$54,000 24 1 0 0
$57,000 29 1 0 0
$65,000 38 0 1 0
$69,000 36 1 0 0
$78,000 40 0 1 0
$83,000 59 0 0 1
$98,000 56 0 0 1
$104,000 64 0 1 0
$107,000 53 0 1 0
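A small sketch (not from the thesis) of how this looks in pandas: one-hot encoding marital status produces the three columns of Table 1.6, and dropping one category as the baseline gives the safe coding of Table 1.5.

import pandas as pd

# The data from Table 1.4.
df = pd.DataFrame({
    "Income": [45000, 48000, 54000, 57000, 65000, 69000,
               78000, 83000, 98000, 104000, 107000],
    "Age": [23, 25, 24, 29, 38, 36, 40, 59, 56, 64, 53],
    "MaritalStatus": ["Single", "Single", "Single", "Single", "Married",
                      "Single", "Married", "Divorced", "Divorced",
                      "Married", "Married"],
})

# Full one-hot encoding creates Single, Married and Divorced columns
# (Table 1.6). Because Single + Married + Divorced = 1 in every row, the
# three columns are perfectly collinear with the intercept: the dummy
# variable trap.
all_dummies = pd.get_dummies(df["MaritalStatus"], dtype=int)

# Dropping one category ("Single" as the baseline, as in Table 1.5) removes
# the exact linear dependence and avoids the trap.
encoded = pd.concat(
    [df[["Income", "Age"]], all_dummies.drop(columns=["Single"])],
    axis=1,
)
print(encoded)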

1.7 Causes of Multicollinearity
Multicollinearity could occur due to the following problems:
1. Multicollinearity could exist because of problems in the dataset at the time of its creation.
These problems could be because of poorly designed experiments, purely observational
data, or the inability to manipulate the data. For example, consider determining the
electricity consumption of a household from the household income and the number of
electrical appliances. Here, we know that the number of electrical appliances in a household
will increase with household income, but this correlation cannot be removed from the dataset.
2. Multicollinearity could also occur when new variables are created which are dependent on
other variables. For example, creating a variable for BMI from the height and weight
variables would include redundant information in the model.
3. Including identical variables in the dataset. For example, including variables for
temperature in Fahrenheit and temperature in Celsius.

4. Inaccurate use of dummy variables can also cause a multicollinearity problem. This is
called the Dummy variable trap. For example, consider a dataset containing a marital status
variable with two unique values: ‘married’ and ‘single’. Creating dummy variables for both of
them would include redundant information. We can make do with only one variable
containing 0/1 for ‘married’/’single’ status.
5. Insufficient data in some cases can also cause multicollinearity problems.
Some more reasons why multicollinearity can occur when developing a regression model are:
1. Inaccurate use of different types of variables
2. Poor selection of questions or null hypothesis
3. The selection of a dependent variable
4. Variable repetition in a linear regression model
5. A high correlation between variables, where one variable could be derived from another
variable used in the model
6. Poor usage and choice of dummy variables
1.8 How Multicollinearity affects the Interpretation
Consider the following regression model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
In this model we can clearly see that there are four independent variables, X1 to X4, and the
corresponding coefficients are given as β1 to β4. Now consider a situation where all the
variables are independent of one another except X3 and X4; in other words, X3 and X4 have a
significant correlation between them.

Now to estimate the β coefficient of each independent variable with respect to Y, we observe
the change in the magnitude of Y variable when we slightly change the magnitude of any one
independent variable at a time.
Case 1:
Considering the variables X1 and X2, they are independent of every other variable. If we try
to change the magnitude of either X1 or X2, they will not cause any other independent variable
to change its value, or only by some negligible amount. As a result, we can clearly observe the
influence of the independent variables X1 and X2 on Y.
Case 2:
In the case of variables X3 and X4, they are significantly correlated. What happens if we
apply the same procedure as in Case 1?

If we try to change the magnitude of X3 to observe the change in Y, there will also be a
significant change in the value of X4. As a result, the change that we observe in Y is due to
the change in both X3 and X4 together, and the resultant change in Y is greater than the change
that X3 alone would have produced. We therefore cannot separate the individual effects of X3
and X4 on Y.
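A small synthetic experiment (not from the thesis) illustrates this: when X4 is almost identical to X3, repeated samples give wildly different estimates for β3 and β4 individually, even though their sum stays stable.

import numpy as np

rng = np.random.default_rng(42)
true_beta = np.array([1.0, 2.0, 0.5, 1.5, -1.0])    # beta0, beta1, ..., beta4

for trial in range(3):
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = rng.normal(size=n)
    x4 = x3 + rng.normal(scale=0.01, size=n)         # X4 is almost equal to X3
    X = np.column_stack([np.ones(n), x1, x2, x3, x4])
    y = X @ true_beta + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # beta3 and beta4 swing wildly between trials, but their sum is stable.
    print(f"trial {trial}: beta3 = {beta_hat[3]:+.2f}, beta4 = {beta_hat[4]:+.2f}, "
          f"beta3 + beta4 = {beta_hat[3] + beta_hat[4]:+.2f}")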
1.9 How to detect multicollinearity
To detect multicollinearity and identify the variables involved, linear regressions must be
carried out on each of the variables as a function of the others. We then calculate:
 The R² of each of these models. If the R² is 1, then there is an exact linear relationship
between the dependent variable of the model (the Y) and the explanatory variables (the Xs).
 The tolerance for each of the models. The tolerance is (1 - R²). It is used in several
methods (linear regression, logistic regression, discriminant analysis) as a criterion for
filtering variables. If a variable has a tolerance less than a fixed threshold (the tolerance is
calculated by taking into account variables already used in the model), it is not allowed to
enter the model, as its contribution is negligible and it risks causing numerical problems.
 The VIF (Variance Inflation Factor) for each of the models. The VIF is equal to the
inverse of the tolerance, VIF = 1 / (1 - R²); a common rule of thumb is that a VIF above 5
or 10 points to problematic multicollinearity. A small sketch of these calculations follows.
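The sketch below is illustrative only (not the thesis implementation); the predictors x1, x2 and x3 are synthetic, with x3 constructed to be nearly a linear combination of x1 and x2 so that it shows a low tolerance and a high VIF.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is close to 0.7*x1 + 0.3*x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Add an intercept column so each auxiliary regression has a constant term.
X_const = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):
    vif = variance_inflation_factor(X_const.values, i)
    tolerance = 1.0 / vif            # tolerance = 1 - R²
    r_squared = 1.0 - tolerance      # R² of predictor i regressed on the others
    print(f"{name}: R² = {r_squared:.3f}, tolerance = {tolerance:.3f}, VIF = {vif:.1f}")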

1.10 How to Deal with Multicollinearity


If we only want to predict the value of a dependent variable, we may not have to worry about
multicollinearity. Multiple regression can produce a regression equation that predicts well
even when independent variables are highly correlated.
The problem arises when we want to assess the relative importance of an independent variable
Xk whose regression on the other predictors has a high R²k (or, equivalently, a high VIFk). In
this situation, try the following:
 Redesign the study to avoid multicollinearity. If you are working on a true experiment,
the experimenter controls treatment levels. Choose treatment levels to minimize or
eliminate correlations between independent variables.
 Increase sample size. Other things being equal, a bigger sample means reduced sampling
error. The increased precision may overcome potential problems from multicollinearity.
 Remove one or more of the highly correlated independent variables. Then, define a new
regression equation, based on the remaining variables. Because the removed variables
were redundant, the new equation should be nearly as predictive as the old equation; and
coefficients should be easier to interpret because multicollinearity is reduced.
 Define a new variable equal to a linear combination of the highly correlated variables.
Then, define a new regression equation, using the new variable in place of the old highly
correlated variables (see the sketch below).
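A minimal sketch of this idea (illustrative only; the height and weight values are hypothetical stand-ins for two highly correlated predictors):

import numpy as np

# Hypothetical, strongly correlated predictors.
height_cm = np.array([160, 165, 170, 172, 178, 181, 185, 190], dtype=float)
weight_kg = np.array([55, 60, 65, 68, 75, 80, 84, 92], dtype=float)

def standardize(v):
    # Put both variables on a common scale before combining them.
    return (v - v.mean()) / v.std()

# A single combined variable replaces height and weight in the regression,
# removing the collinear pair while keeping most of their information.
body_size = (standardize(height_cm) + standardize(weight_kg)) / 2.0
print(body_size.round(2))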
1.11 Organization of the thesis
Chapter 1:- Data analysis, Methods of data analysis, Introduction to Multicollinearity, Types of
Multicollinearity, Perfect Multicollinearity, The Dummy Variable Trap, Causes of
Multicollinearity, How Multicollinearity affects the Interpretation, How to detect
multicollinearity, How to Deal with Multicollinearity, Organization of the thesis

Chapter 2:- Why is collinearity an issue, Reasons for multicollinearity, Tips on how to fix
collinearity, Problem statement, Objectives

Chapter 3:- This chapter contains a detailed study of various research papers, including
Michael Olusegun Akinwande et al., “Variance Inflation Factor: As a Condition for the
Inclusion of Suppressor Variable(s) in Regression Analysis”; Ahmad A. Suleiman et al.,
“Analysis of Multicollinearity In Multiple Regressions”; Kristina Vatcheva et al.,
“Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies”; Sudhanshu K.
Mishra et al., “Shapley value regression and the resolution of multicollinearity”; Christopher
Winship et al., “Multicollinearity and Model Misspecification”; Hanan Duzan et al., who
proposed “Solution to the Multicollinearity Problem by adding some Constant to the
Diagonal”; Jamal I. Daoud et al., “Multicollinearity and Regression Analysis”; Bager, Ali et al.,
who proposed “Addressing multicollinearity in regression models: a ridge regression
application”; Gary H. McClelland et al., “Multicollinearity is a red herring in the search for
moderator variables”; Dr Manoj Kumar Mishra et al., “A Study of Multicollinearity in
Estimation of Coefficients in Ridge Regression”; Neeraj Tiwari et al., “Diagnostics of
Multicollinearity in Multiple Regression Model for Small Area Estimation”; Yunus Kologlu et
al., “A Multiple Linear Regression Approach For Estimating the Market Value of Football
Players in Forward Position”; N S M Shariff et al., “A Comparison of OLS and Ridge
Regression Methods in the Presence of Multicollinearity Problem in the Data”; Alhassan Umar
et al., “Detection of Collinearity Effects on Explanatory Variables and Error Terms in Multiple
Regressions”; Katerina M. Marcoulides et al., “Evaluation of Variance Inflation Factors in
Regression Models Using Latent Variable Modeling Methods”; N. A. M. R. Senaviratna et al.,
“Diagnosing Multicollinearity of Logistic Regression Model”; Jong Hae Kim et al.,
“Multicollinearity and misleading statistical results”; Noora Shrestha et al., “Detecting
Multicollinearity in Regression Analysis”; and Shishodiya Ghanshyam Singh et al., “Dealing
with Multicollinearity Problem in Analysis of Side Friction Characteristics under Urban
Heterogeneous Traffic Conditions”

Chapter 4:- Basic Concept, Causes of Multicollinearity, How to Fix Multicollinearity in a
Regression Model, Ways to detect multicollinearity, Proposed Approach with VIF and Low
Tolerance, Outline of the proposed approach, Illustration with an example

Chapter 5:- Implementation Environment, Description of the Dataset, Loading the library and
data set, Separating predictor variables from the data set, Separating the response variable from
the data set, Describing the data set, Calculating correlation between predictor variables,
Creating a scatter plot between Weight and BP, Creating a scatter plot between Weight and
BSA, Creating a scatter plot between BSA and BP, Calculating the VIF value for all predictor
variables, Calculating the VIF value after deleting the variable with a high VIF value,
Calculating the R squared value for predictor variables

Chapter 6:- Identify Relationship between predictor Weight and response BP, Identify
Relationship between predictor BSA and predictor Weight, Identify Relationship between
predictor BP and predictor BSA, Correlation between Predictor variables, Variance Inflation
Factor between Predictor variables, Variance Inflation Factor after removing multicollinearity,
R squared Value for Predictors with respect to the response variable

Chapter 7:- Conclusion, Limitations and future work
