INSY446 - 02 - Linear Model Part 1


INSY 446 – Winter 2023

Data Mining for Business Analytics

Session 2 – Linear Model Part 1


January 16, 2023
Dongliang Sheng
Supervised Learning Model

§ A machine learning model is considered a
supervised model when it is trained on data that
contain both the “x” (i.e., input) variables and the
“y” (i.e., target) variable
§ An example of a supervised model is the
linear regression model

Types of Model

§ Descriptive Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is unknown
§ Explanatory Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is known
§ Predictive Modeling
– Predicting the outcome value for new records,
given their input values

Linear Regression
§ The relationship between the predictors (x) and the
outcome variable (y) is approximated by a line
(2-D), a plane (3-D), or a hyperplane (higher
dimensions)
§ The specification takes the following form:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$$
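
To make the specification concrete, the sketch below evaluates the formula for a single record. The function name `predict` and the coefficient values are made up for illustration; they do not come from any fitted model in these slides.

```python
def predict(intercept, coefs, x):
    """Return y-hat = b0 + b1*x1 + ... + bm*xm for one record."""
    if len(coefs) != len(x):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

# Hypothetical coefficients for a model with m = 2 predictors
b0 = 10.0
b = [2.0, -0.5]

# 10.0 + 2.0 * 3.0 + (-0.5) * 4.0
y_hat = predict(b0, b, [3.0, 4.0])
print(y_hat)  # 14.0
```

Each coefficient is the change in the predicted outcome for a one-unit increase in its predictor, holding the other predictors fixed.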

Statistical Issues in Linear Regression
§ Multicollinearity
§ Inference
– Correlation vs. Causation
§ Model selection
§ Interpretation
– Coefficient value
– Statistical significance
§ T-test
§ F-test

Linear Regression Operations

§ Find the line that minimizes the sum of
squared errors (SSE)
§ This operation can be done efficiently because
ordinary least squares has a closed-form solution
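
As a rough illustration of the operation described above, the sketch below finds the SSE-minimizing coefficients with NumPy's least-squares solver. The helper names `fit_least_squares` and `sse` and the tiny dataset are assumptions made for this example; they are not part of the course material.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X])

def fit_least_squares(X, y):
    """Return [b0, b1, ..., bm], the coefficients that minimize the SSE."""
    coefs, *_ = np.linalg.lstsq(design_matrix(X), np.asarray(y, dtype=float), rcond=None)
    return coefs

def sse(X, y, coefs):
    """Sum of squared errors between the actual and the fitted values."""
    residuals = np.asarray(y, dtype=float) - design_matrix(X) @ coefs
    return float(residuals @ residuals)

# Made-up data lying exactly on the line y = 1 + 2x
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [1.0, 3.0, 5.0, 7.0]

coefs = fit_least_squares(X_demo, y_demo)
print(coefs)                       # approximately [1. 2.]
print(sse(X_demo, y_demo, coefs))  # approximately 0
```

Any other choice of coefficients gives a larger SSE on the same data, which is what "minimizes" means here.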

Linear Regression in Python

§ Statistics Perspective
– statsmodels package
– obtain stats-related results (t-value, p-value, etc.)
§ Data Mining Perspective
– sklearn package
– results are compatible with standard sklearn
functions (cross-validation, MSE calculation, etc.)

Example 1
statsmodels

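# Load libraries
import statsmodels.api as sm

# Load built-in dataset
# (load_boston was removed in scikit-learn 1.2, so this requires an earlier version)
from sklearn.datasets import load_boston
boston = load_boston()

# Explore the data
print(boston.keys())
print(boston.data.shape)
print(boston.feature_names)
print(boston.DESCR)

# Set up dependent and independent variables
y = boston.target
X = boston.data[:,0:2]

# Run regression
# (statsmodels does not add an intercept automatically)
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# View results
print(model.summary())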

Example 2
sklearn

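# Load libraries
from sklearn.linear_model import LinearRegression

# Load built-in dataset
# (load_boston was removed in scikit-learn 1.2, so this requires an earlier version)
from sklearn.datasets import load_boston
boston = load_boston()

# Set up dependent and independent variables
y = boston.target
X = boston.data[:,0:2]

# Run linear regression
lm = LinearRegression()
model = lm.fit(X, y)

# View results
print(model.intercept_)
print(model.coef_)

# Add constant
# (LinearRegression fits an intercept by default, fit_intercept=True,
# so no constant column is added to X here)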

Example 3
Real-world dataset (ToyotaCorolla.csv)

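# Load libraries
from sklearn.linear_model import LinearRegression
import pandas

# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")

# Set up dependent and independent variables
# (same column selection as the later examples that use this file)
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']

# Run linear regression
lm = LinearRegression()
model = lm.fit(X, y)

# View results
print(model.intercept_)
print(model.coef_)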

Model Objectives

§ In descriptive and explanatory modeling, the
objective of the model is to identify the
relationships between X and y
§ Therefore, the primary interest is the
statistical significance of the coefficients of the
model
§ What is the objective of the predictive model?

Cross Validation

§ The most important measure of a predictive
model is its predictive performance
§ In other words, how accurate is the model in
predicting the future?
§ But how can we calculate the prediction
accuracy when we do not know the future?
§ We use a technique called cross-validation

Cross Validation
§ Essentially, even though we do not know the
“future,” we can artificially create it
§ The idea is to separate some of the data out
before building the model
§ Since the model has never seen the separated data,
it is (somewhat) similar to data from the future
§ Hence, we will test the performance of the
model with the separated dataset
§ This is similar to the concept of out-of-sample
testing in statistics

Cross Validation
§ The simplest way to perform cross validation
is to split the data into two parts: training
dataset and test dataset
§ With this approach, we train the model using
the training dataset
§ Then, the model would be evaluated based on
the test dataset

Cross Validation
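§ Alternatively, we can use an approach called
k-fold cross validation
§ With this approach, the original data is
partitioned into k independent and similar
subsets
§ Each subset is used once as the test data while
the remaining k − 1 subsets are used as the
training data, and the k results are averaged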
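
The partitioning idea can be sketched with scikit-learn's `KFold`. The synthetic data, the choice of k = 5, and the random seeds below are assumptions for illustration only; in practice, X and y would come from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (illustrative): y depends linearly on two predictors plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Partition the 100 records into k = 5 subsets (folds)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Train on k - 1 folds, then evaluate on the fold that was left out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], y_pred))

print(fold_mse)
print(np.mean(fold_mse))  # average test MSE across the 5 folds
```

Every record ends up in the test data exactly once, so the averaged score uses all of the data without ever testing a model on records it was trained on.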

Cross Validation
§ An alternative cross-validation approach is
called Leave-p-out cross-validation where p
observations are separated out as the test data
while the rest are used as the training data
§ The simplest form of the Leave-p-out cross-
validation is Leave-1-out cross-validation
(LOOCV)
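
A minimal LOOCV sketch, assuming scikit-learn's `LeaveOneOut` splitter and a small synthetic dataset (both are choices made for this example, not something the slides prescribe):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (illustrative): 30 records, one predictor
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1))
y = 2.0 + 4.0 * X[:, 0] + rng.normal(scale=0.2, size=30)

# Each of the 30 records is held out once as a one-record test set.
# scikit-learn reports negated MSE, so the sign is flipped afterwards.
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)
loocv_mse = -scores.mean()

print(len(scores))  # 30, one score per held-out record
print(loocv_mse)
```

LOOCV fits the model once per record, so it becomes expensive on large datasets; k-fold with a small k is the usual compromise.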

Cross Validation
§ The concept of cross-validation can be
implemented in multiple ways
§ Any approach is acceptable
as long as the data used to build the model
(“training data”) and the data used to measure
the performance of the model (“test data”) are
different
§ Some practitioners also set aside a third subset,
called the “validation dataset,” to tune model
parameters
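
One way to obtain three separate subsets is to call `train_test_split` twice. The 60/20/20 proportions and the placeholder arrays below are assumptions for illustration; the slides do not specify a ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (illustrative): 100 records, 2 predictors
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of all records as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: hold out 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Model parameters are tuned against the validation set, and the test set is touched only once, for the final performance estimate.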

Cross Validation
§ From the coding perspective, there are three
common ways to perform cross-validation in
Python
1. Physically separate the data file, use one file to train the
model and another file to test the model
2. Manually separate the input data using the
train_test_split function
3. Perform the cross-validation using the cross_val_score
function

Performance Measure
§ The primary performance measure that is
calculated using the test dataset is mean
squared error (MSE). This measure is scale
dependent.
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
§ An alternative measure is called Mean
Absolute Percentage Error (MAPE). This
measure is scale independent.
$$MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
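
The two measures can be computed directly from their definitions. The sketch below uses hypothetical helper names (`mse`, `mape`) and made-up actual and predicted values, and it also shows what scale dependence means in practice.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Made-up actual and predicted values
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(mse(y_true, y_pred))   # (100 + 400 + 0) / 3, about 166.67
print(mape(y_true, y_pred))  # 100 * (0.1 + 0.1 + 0) / 3, about 6.67

# Re-expressing the same values in units 1000 times smaller multiplies
# the MSE by 1000**2 but leaves the MAPE unchanged.
y_true_scaled = [v * 1000 for v in y_true]
y_pred_scaled = [v * 1000 for v in y_pred]
print(mse(y_true_scaled, y_pred_scaled))
print(mape(y_true_scaled, y_pred_scaled))
```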

Example 4
Cross Validation (1)
# Load libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas

# Import training data
usedcar_df_train = pandas.read_csv("C:\\...\\ToyotaCorolla_train.csv")

# Construct variables for training data
X_train = usedcar_df_train.iloc[:,3:]
y_train = usedcar_df_train['Price']

# Run linear regression on training data
lm = LinearRegression()
model = lm.fit(X_train, y_train)

# Import test data
usedcar_df_test = pandas.read_csv("C:\\...\\ToyotaCorolla_test.csv")

# Construct variables for test data
X_test = usedcar_df_test.iloc[:,3:]
y_test = usedcar_df_test['Price']

# Use the model to predict the results based on the test dataset
y_test_pred = model.predict(X_test)

# Calculate the mean squared error of the prediction
print(mean_squared_error(y_test, y_test_pred))

Example 5
Cross Validation (2)

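# Load libraries
from sklearn.linear_model import LinearRegression
import pandas

# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")

# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']

# Separate the data
# (test_size and random_state are illustrative choices)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Run linear regression
lm = LinearRegression()
model = lm.fit(X_train, y_train)

# Use the model to predict the results based on the test dataset
y_test_pred = model.predict(X_test)

# Calculate the mean squared error of the prediction
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_test_pred))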

Example 6
Cross Validation (3)

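# Load libraries
from sklearn.linear_model import LinearRegression
import pandas

# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")

# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']

# Develop linear regression model
lm = LinearRegression()

# Calculate the mean squared error of the prediction
# (cv=5 is an illustrative choice; sklearn reports negated MSE, so flip the sign)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lm, X, y, scoring='neg_mean_squared_error', cv=5)
print(-scores.mean())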

Overfitting
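§ Overfitting occurs when the provisional model
tries to account for every possible trend or
structure in the training set, including random noise
§ An overfitted model performs well on the training
data but poorly on new data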
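
A small experiment can make overfitting visible. The sketch below is only an illustration under assumed conditions: synthetic data with a truly linear relationship, and a degree-6 polynomial standing in for an overly flexible model. Neither comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Draw n noisy records from the true relationship y = 1 + 2x."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # unseen records

results = {}
for degree in (1, 6):
    # Least-squares polynomial fit; degree 6 has enough freedom to chase noise
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(degree, train_mse, test_mse)
```

The degree-6 fit can never have a higher training MSE than the straight line, because the line is a special case of it. Its test MSE, however, is typically worse, and that gap is the signature of overfitting.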

Bias vs. Variance
§ “Bias” represents the error between the
predicted value and the actual value when the
training dataset is used for evaluation
§ “Variance” represents the difference in model
performance when the training dataset is used
for evaluation versus when the test dataset is
used for evaluation
§ Generally, there is a tradeoff between bias and
variance, but it is possible that both values
increase or decrease together
§ Our job is to find a model that balances the
bias and variance

Example 7
Train/test accuracy
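# Load libraries
from sklearn.linear_model import LinearRegression
import pandas

# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")

# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']

# Separate the data
# (test_size and random_state are illustrative choices)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Run linear regression
lm = LinearRegression()
model = lm.fit(X_train, y_train)

# Use the model to predict the results based on the test dataset
y_test_pred = model.predict(X_test)

# Calculate the mean squared error of the prediction
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_test_pred))

# Use the model to predict the results based on the training dataset
y_train_pred = model.predict(X_train)

# Calculate the mean squared error of the prediction
print(mean_squared_error(y_train, y_train_pred))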

Exercise #1

§ Use the nutrition.csv dataset
§ Use CALORIES as the target variable and
other variables as predictors
§ Construct a linear regression model
§ Print all coefficients (including intercept).
You do not have to format the results

Exercise #2

§ Use the same dataset as in Exercise #1
§ Use CALORIES as the target variable and
PROTIEN and FAT as predictors
§ Split the data into a test (30%) and training
(70%) dataset
§ Run the linear regression model based on the
training dataset and perform cross-validation
on the test dataset. Print the mean squared
error of your model
§ Predict the calories of a food item that has
PROTIEN = 20 and FAT = 10. Print the
prediction
