INSY446 - 02 - Linear Model Part 1
2
Types of Model
§ Descriptive Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is unknown
§ Explanatory Modeling
– Quantify the average effect of inputs on an
outcome
– Causal structure is known
§ Predictive Modeling
– Predicting the outcome value for new records,
given their input values
3
Linear Regression
§ The relationship between the predictors (x) and the
outcome variable (y) is approximated by a line
(2-D), a plane (3-D), or a hyperplane (multiple
dimensions)
§ The specification takes the following form:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$
4
Stats Issues in Linear Regression
§ Multicollinearity
§ Inference
– Correlation vs. Causation
§ Model selection
§ Interpretation
– Coefficient value
– Statistical significance
§ T-test
§ F-test
5
Linear Regression Operations
6
Linear Regression in Python
§ Statistics Perspective
– statsmodels package
– obtain stats-related results (t-value, p-value, etc.)
§ Data Mining Perspective
– sklearn package
– results are compatible with standard sklearn
functions (cross-validation, MSE calculation, etc.)
7
Example 1
statsmodels
# Load libraries
import statsmodels.api
# Run Regression
# View results
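A minimal sketch of what this example might look like; the toy data and variable names below are assumptions, not the actual in-class code.

import numpy
import statsmodels.api

# Toy data: one predictor and one outcome (assumed for illustration)
x = numpy.array([1, 2, 3, 4, 5])
y = numpy.array([2.1, 3.9, 6.2, 8.1, 9.8])

# statsmodels does not add an intercept automatically, so add a constant column
X = statsmodels.api.add_constant(x)

# Run Regression: ordinary least squares
model = statsmodels.api.OLS(y, X).fit()

# View results: coefficients, t-values, p-values, R-squared, F-statistic
print(model.summary())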
8
Example 2
sklearn
# Load libraries
from sklearn.linear_model import LinearRegression
# View results
# Add constant
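A minimal sketch of the same regression from the data mining perspective; the toy data is assumed. Note that sklearn's LinearRegression includes the constant (intercept) by default, so no separate add-constant step is required.

import numpy
from sklearn.linear_model import LinearRegression

# Toy data: sklearn expects a 2-D array of predictors (assumed for illustration)
X = numpy.array([[1], [2], [3], [4], [5]])
y = numpy.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the model; fit_intercept=True (the default) adds the constant
model = LinearRegression()
model.fit(X, y)

# View results: intercept and slope coefficient
print(model.intercept_)
print(model.coef_)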
9
Example 3
Real-world dataset (ToyotaCorolla.csv)
# Load Libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import Data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# View results
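A minimal sketch continuing from the lines above; the column choices mirror Example 5 and assume the selected columns are numeric.

# Construct variables (same selection as Example 5)
X = usedcar_df.iloc[:, 3:]       # predictors: all columns from the fourth onward
y = usedcar_df['Price']          # outcome: selling price

# Fit the regression on the full dataset
model = LinearRegression()
model.fit(X, y)

# View results: intercept and one coefficient per predictor
print(model.intercept_)
print(model.coef_)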
10
Model Objectives
11
Cross Validation
12
Cross Validation
§ Essentially, even though we do not know the
“future,” we can artificially create it
§ The idea is to separate some of the data out
before building the model
§ Since the model never sees the separated data
during training, it is (somewhat) similar to the future
§ Hence, we will test the performance of the
model with the separated dataset
§ This is similar to the concept of out-of-sample
testing in statistics
13
Cross Validation
§ The simplest way to perform cross validation
is to split the data into two parts: training
dataset and test dataset
§ With this approach, we train the model using
the training dataset
§ Then, the model would be evaluated based on
the test dataset
14
Cross Validation
§ Alternatively, we can use an approach called
k-fold cross validation
§ With this approach, the original data is
partitioned into k independent and similarly
sized subsets (folds); each fold takes a turn as
the test data while the remaining k - 1 folds are
used for training (see the sketch below)
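A minimal sketch of k-fold cross validation using sklearn's KFold; the toy data and k = 5 are assumptions.

import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Toy data (assumed for illustration)
X = numpy.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + numpy.random.normal(0, 1, 20)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_index, test_index in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])    # train on the other k - 1 folds
    y_pred = model.predict(X[test_index])        # predict on the held-out fold
    scores.append(mean_squared_error(y[test_index], y_pred))

# Average MSE across the k folds
print(sum(scores) / len(scores))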
15
Cross Validation
§ An alternative cross-validation approach is
called Leave-p-out cross-validation where p
observations are separated out as the test data
while the rest are used as the training data
§ The simplest form of Leave-p-out cross-validation
is Leave-one-out cross-validation (LOOCV), where
p = 1 (see the sketch below)
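A minimal sketch of LOOCV, which is equivalent to k-fold cross validation with k = n; the toy data and the scoring choice are assumptions.

import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data (assumed for illustration)
X = numpy.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + numpy.random.normal(0, 1, 10)

# One model is fit per observation; each observation serves as the test set once
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')

# sklearn reports negative MSE, so flip the sign before averaging
print(-scores.mean())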
16
Cross Validation
§ The concept of cross-validation can be
implemented in multiple ways
§ Any approach is acceptable as long as the data
used to build the model (“training data”) and the
data used to measure the performance of the
model (“test data”) are different
§ Some prefer to separate another set of data
called “validation dataset” to tune model
parameters
17
Cross Validation
§ From the coding perspective, there are three
common ways to perform cross-validation in
Python
1. Physically separate the data file, use one file to train the
model and another file to test the model
2. Manually separate the input data using the
train_test_split function
3. Perform the cross-validation using the cross_val_score
function
18
Performance Measure
§ The primary performance measure that is
calculated using the test dataset is mean
squared error (MSE). This measure is scale
dependent.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
§ An alternative measure is called Mean
Absolute Percentage Error (MAPE). This
measure is scale independent.
$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
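A minimal sketch of both measures computed with NumPy; the actual and predicted values are assumed toy numbers.

import numpy

y_actual = numpy.array([100.0, 150.0, 200.0])   # assumed actual values
y_pred = numpy.array([110.0, 140.0, 190.0])     # assumed predicted values

mse = numpy.mean((y_actual - y_pred) ** 2)                           # scale dependent
mape = 100 * numpy.mean(numpy.abs((y_actual - y_pred) / y_actual))   # scale independent, in percent

print(mse)    # 100.0
print(mape)   # about 7.22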
19
Example 4
Cross Validation (1)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Using the model to predict the results based on the test dataset
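A minimal sketch of approach 1 (two physically separate files); the file names train.csv and test.csv are hypothetical placeholders, and the column choices follow the ToyotaCorolla examples.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas

# Import data from two separate files (hypothetical file names)
train_df = pandas.read_csv("train.csv")
test_df = pandas.read_csv("test.csv")

# Construct variables
X_train, y_train = train_df.iloc[:, 3:], train_df['Price']
X_test, y_test = test_df.iloc[:, 3:], test_df['Price']

# Train the model using only the training file
model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))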
20
Example 5
Cross Validation (2)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
# Using the model to predict the results based on the test dataset
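A minimal sketch of approach 2, continuing from the lines above; the 70/30 split and the random_state are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 30% of the rows as the test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train the model using only the training dataset
model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))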
21
Example 6
Cross Validation (3)
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
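A minimal sketch of approach 3, continuing from the lines above; the number of folds and the scoring string are assumptions.

from sklearn.model_selection import cross_val_score

# cross_val_score handles the splitting, fitting, and scoring in one call
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# One (negative) MSE per fold; flip the sign and average
print(-scores.mean())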
22
Overfitting
§ Overfitting occurs when the provisional model
tries to account for every possible trend or
structure in the training set
23
Bias vs. Variance
§ “Bias” represents the error between the
predicted value and the actual value when the
training dataset is used for evaluation
§ “Variance” represents the difference in model
performance when the training dataset is used
for evaluation versus when the test dataset is
used for evaluation
§ Generally, there is a tradeoff between bias and
variance, but it is possible that both values
increase or decrease together
§ Our job is to find a model that balances the
bias and variance
24
Example 7
Train/test accuracy
# Load libraries
from sklearn.linear_model import LinearRegression
import pandas
# Import data
usedcar_df = pandas.read_csv("C:\\...\\ToyotaCorolla.csv")
# Construct variables
X = usedcar_df.iloc[:,3:]
y = usedcar_df['Price']
# Using the model to predict the results based on the test dataset
# Using the model to predict the results based on the training dataset
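A minimal sketch, continuing from the lines above; the split fraction and the random_state are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
test_mse = mean_squared_error(y_test, model.predict(X_test))

# Using the model to predict the results based on the training dataset
train_mse = mean_squared_error(y_train, model.predict(X_train))

# A training MSE far below the test MSE suggests overfitting
print(train_mse, test_mse)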
25
Exercise #1
26
Exercise #2