INSY446 - 02 - Linear Model Part 1
2
Types of Model
§ Descriptive Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is unknown
§ Explanatory Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is known
§ Predictive Modeling
– Predicting the outcome value for new records,
given their input values
3
Linear Regression
§ The relationship between the predictors (x) and the
outcome variable (y) is approximated by a line
(2-D), a plane (3-D), or a hyperplane (multiple
dimensions)
§ The specification takes the following form:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$
4
Stats Issues in Linear Regression
§ Multicollinearity
§ Inference
– Correlation vs. Causation
§ Model selection
§ Interpretation
– Coefficient value
– Statistical significance
§ T-test
§ F-test
5
Linear Regression Operations
6
Linear Regression in Python
§ Statistics Perspective
– statsmodels package
– obtain stats-related results (t-value, p-value, etc.)
§ Data Mining Perspective
– sklearn package
– results are compatible with standard sklearn
functions (cross-validation, MSE calculation, etc.)
7
Example 1
statsmodels
# Load libraries
import statsmodels.api
# Run Regression
# View results
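A minimal sketch of what this example might look like; the toy data and variable names below are assumptions, not the actual in-class code.

import numpy
import statsmodels.api

# Toy data: one predictor and one outcome (assumed for illustration)
x = numpy.array([1, 2, 3, 4, 5])
y = numpy.array([2.1, 3.9, 6.2, 8.1, 9.8])

# statsmodels does not add an intercept automatically, so add a constant column
X = statsmodels.api.add_constant(x)

# Run Regression: ordinary least squares
model = statsmodels.api.OLS(y, X).fit()

# View results: coefficients, t-values, p-values, R-squared, F-statistic
print(model.summary())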
8
Example 2
sklearn
# Load libraries
from sklearn.linear_model import LinearRegression
# View results
# Add constant
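A minimal sketch of the same regression from the data mining perspective; the toy data is assumed. Note that sklearn's LinearRegression includes the constant (intercept) by default, so no separate add-constant step is required.

import numpy
from sklearn.linear_model import LinearRegression

# Toy data: sklearn expects a 2-D array of predictors (assumed for illustration)
X = numpy.array([[1], [2], [3], [4], [5]])
y = numpy.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the model; fit_intercept=True (the default) adds the constant
model = LinearRegression()
model.fit(X, y)

# View results: intercept and slope coefficient
print(model.intercept_)
print(model.coef_)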
9
Example 3
Real-world dataset (ToyotaCorolla.csv)
# Load Libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import Data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# View results
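A minimal sketch continuing from the lines above; the column choices mirror Example 5 and assume the selected columns are numeric.

# Construct variables (same selection as Example 5)
X = usedcar_df.iloc[:, 3:]       # predictors: all columns from the fourth onward
y = usedcar_df['Price']          # outcome: selling price

# Fit the regression on the full dataset
model = LinearRegression()
model.fit(X, y)

# View results: intercept and one coefficient per predictor
print(model.intercept_)
print(model.coef_)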
10
Model Objectives
11
Cross Validation
12
Cross Validation
§ Essentially, even though we do not know the
“future,” we can artificially create it
§ The idea is to separate some of the data out
before building the model
§ Since the model never sees the separated data
during training, it is (somewhat) similar to the future
§ Hence, we will test the performance of the
model with the separated dataset
§ This is similar to the concept of out-of-sample
testing in statistics
13
Cross Validation
§ The simplest way to perform cross validation
is to split the data into two parts: training
dataset and test dataset
§ With this approach, we train the model using
the training dataset
§ Then, the model would be evaluated based on
the test dataset
14
Cross Validation
§ Alternatively, we can use an approach called
k-fold cross validation
§ With this approach, the original data is
partitioned into k independent and similarly
sized subsets (folds); each fold takes a turn as
the test data while the remaining k - 1 folds are
used for training (see the sketch below)
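A minimal sketch of k-fold cross validation using sklearn's KFold; the toy data and k = 5 are assumptions.

import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Toy data (assumed for illustration)
X = numpy.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + numpy.random.normal(0, 1, 20)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_index, test_index in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])    # train on the other k - 1 folds
    y_pred = model.predict(X[test_index])        # predict on the held-out fold
    scores.append(mean_squared_error(y[test_index], y_pred))

# Average MSE across the k folds
print(sum(scores) / len(scores))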
15
Cross Validation
§ An alternative cross-validation approach is
called Leave-p-out cross-validation where p
observations are separated out as the test data
while the rest are used as the training data
§ The simplest form of Leave-p-out cross-validation
is Leave-one-out cross-validation (LOOCV), where
p = 1 (see the sketch below)
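A minimal sketch of LOOCV, which is equivalent to k-fold cross validation with k = n; the toy data and the scoring choice are assumptions.

import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data (assumed for illustration)
X = numpy.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + numpy.random.normal(0, 1, 10)

# One model is fit per observation; each observation serves as the test set once
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')

# sklearn reports negative MSE, so flip the sign before averaging
print(-scores.mean())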
16
Cross Validation
§ The concept of cross-validation can be
implemented in multiple ways
§ Any approach is acceptable as long as the data
used to build the model (“training data”) and the
data used to measure the performance of the
model (“test data”) are different
§ Some prefer to separate another set of data
called “validation dataset” to tune model
parameters
17
Cross Validation
§ From the coding perspective, there are three
common ways to perform cross-validation in
Python
1. Physically separate the data file, use one file to train the
model and another file to test the model
2. Manually separate the input data using the
train_test_split function
3. Perform the cross-validation using the cross_val_score
function
18
Performance Measure
§ The primary performance measure that is
calculated using the test dataset is mean
squared error (MSE). This measure is scale
dependent.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
§ An alternative measure is called Mean
Absolute Percentage Error (MAPE). This
measure is scale independent.
$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
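A minimal sketch of both measures computed with NumPy; the actual and predicted values are assumed toy numbers.

import numpy

y_actual = numpy.array([100.0, 150.0, 200.0])   # assumed actual values
y_pred = numpy.array([110.0, 140.0, 190.0])     # assumed predicted values

mse = numpy.mean((y_actual - y_pred) ** 2)                           # scale dependent
mape = 100 * numpy.mean(numpy.abs((y_actual - y_pred) / y_actual))   # scale independent, in percent

print(mse)    # 100.0
print(mape)   # about 7.22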
19
Example 4
Cross Validation (1)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Using the model to predict the results based on the test dataset
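A minimal sketch of approach 1 (two physically separate files); the file names train.csv and test.csv are hypothetical placeholders, and the column choices follow the ToyotaCorolla examples.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas

# Import data from two separate files (hypothetical file names)
train_df = pandas.read_csv("train.csv")
test_df = pandas.read_csv("test.csv")

# Construct variables
X_train, y_train = train_df.iloc[:, 3:], train_df['Price']
X_test, y_test = test_df.iloc[:, 3:], test_df['Price']

# Train the model using only the training file
model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))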
20
Example 5
Cross Validation (2)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
# Using the model to predict the results based on the test dataset
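A minimal sketch of approach 2, continuing from the lines above; the 70/30 split and the random_state are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 30% of the rows as the test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train the model using only the training dataset
model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))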
21
Example 6
Cross Validation (3)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
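A minimal sketch of approach 3, continuing from the lines above; the number of folds and the scoring string are assumptions.

from sklearn.model_selection import cross_val_score

# cross_val_score handles the splitting, fitting, and scoring in one call
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# One (negative) MSE per fold; flip the sign and average
print(-scores.mean())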
22
Overfitting
§ Overfitting occurs when the provisional model
tries to account for every possible trend or
structure in the training set
23
Bias vs. Variance
§ “Bias” represents the error between the
predicted value and the actual value when the
training dataset is used for evaluation
§ “Variance” represents the difference in model
performance when the training dataset is used
for evaluation versus when the test dataset is
used for evaluation
§ Generally, there is a tradeoff between bias and
variance, but it is possible that both values
increase or decrease together
§ Our job is to find a model that balances the
bias and variance
24
Example 7
Train/test accuracy
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
# Using the model to predict the results based on the test dataset
# Using the model to predict the results based on the training dataset
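A minimal sketch, continuing from the lines above; the split fraction and the random_state are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
test_mse = mean_squared_error(y_test, model.predict(X_test))

# Using the model to predict the results based on the training dataset
train_mse = mean_squared_error(y_train, model.predict(X_train))

# A training MSE far below the test MSE suggests overfitting
print(train_mse, test_mse)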
25
Exercise #1
26
Exercise #2