Lec-01-Introduction to Statistical Learning

The document provides an introduction to statistical learning, covering key concepts such as supervised and unsupervised learning, regression and classification problems, and the differences between statistical learning and machine learning. It discusses various methods for modeling, including parametric and non-parametric approaches, and highlights the importance of model flexibility, interpretability, and the bias-variance trade-off. Additionally, it addresses the objectives of predicting outcomes and understanding relationships between variables in the context of statistical modeling.
Introduction to Statistical Learning
Dr. Sayak Roychowdhury
Department of Industrial and Systems Engineering,
IIT Kharagpur
References
Statistical Learning
• IBM Watson wins Jeopardy! in 2011 (a game show)
  https://www.youtube.com/watch?v=P18EdAKuC1U
• Predicting elections

Source: The Guardian
Source: IBM Research
What determines wage?

Regression problem
Stock Market Prediction

Classification Problem
Gene Expression Data

The NCI60 data set consists of 6,830 gene expression measurements for
each of 64 cancer cell lines (unsupervised learning using PCA).
Intrusion Detection Systems

Source: Transactions on Emerging Telecommunications Technologies, Volume 32,
Issue 1, first published 16 October 2020, DOI: 10.1002/ett.4150
Quality Engineering
Gene Classification

Mahendran, N., Durai Raj Vincent, P. M., Srinivasan, K., & Chang, C. Y. (2020).
Machine learning based computational gene selection models: a survey,
performance evaluation, open issues, and future research directions.
Frontiers in Genetics, 11, 603808.
Mining and Drilling Borehole

Source: Dindaroğlu, T. (2014). The use of the GIS Kriging technique to determine
the spatial changes of natural radionuclide concentrations in soil and forest
cover. Journal of Environmental Health Science and Engineering, 12(1), 1-11.
Source: Mining Weekly
Supervised Learning
• Outcome measurement 𝑌, (dependent variable, response, target)
• Vector of 𝑝 predictors 𝑋 (inputs, factors, regressors, covariates,
features, independent variables)
• In regression problems, 𝑌 is quantitative (e.g. price, pressure, length,
etc.)
• In classification problems, 𝑌 takes values in a finite unordered set
(survived/died, spam/not spam, cancerous/benign)
• Training data: (𝑥1, 𝑦1), …, (𝑥𝑁, 𝑦𝑁), observed pairs of these
measurements
Objective of supervised learning
• Accurately predict unseen test cases
• Understand which inputs affect outcome
• Assess quality of predictions and inferences
Unsupervised Learning
• No outcome variable, just a set of
predictors measured on a set of
samples
• Objective: find groups of samples
that behave similarly, group objects
with similar features
• Difficult to measure accuracy
• Can be used as a preprocessing
step for supervised learning
Source: https://www.thimbletoys.com/
Statistical Learning Vs Machine Learning
• Machine learning -> subfield of artificial intelligence (esp. neural
networks)
• Statistical learning -> subfield in statistics
• There is much overlap
• Machine learning is more utilized in large-scale applications and cares more
about prediction accuracy
• Statistical learning puts more emphasis on models and interpretability,
precision and uncertainty
• The distinction has become increasingly blurred these days
Statistical Modeling

𝑆𝑎𝑙𝑒𝑠 ≈ 𝑓(𝑇𝑉, 𝑅𝑎𝑑𝑖𝑜, 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟)


Statistical Modeling
• 𝑌 = 𝑓(𝑋) + 𝜖 (model form)
• 𝜖 ~ error term (measurement error, discrepancies, etc.)
• 𝑌 = 𝑆𝑎𝑙𝑒𝑠
• 𝑋1 = 𝑇𝑉, 𝑋2 = 𝑅𝑎𝑑𝑖𝑜, 𝑋3 = 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟
• 𝑋 = (𝑋1, 𝑋2, 𝑋3)ᵀ
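As an illustrative sketch (not part of the slides), the advertising-style model above could be fit by ordinary least squares in Python; the file name advertising.csv and its column names are assumptions made for illustration.

# Minimal sketch: fit Sales ≈ f(TV, Radio, Newspaper) with a linear model (data file assumed).
import numpy as np
import pandas as pd

ads = pd.read_csv("advertising.csv")              # assumed columns: TV, Radio, Newspaper, Sales
X = np.column_stack([np.ones(len(ads)), ads[["TV", "Radio", "Newspaper"]].to_numpy()])
y = ads["Sales"].to_numpy()

beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares estimates of beta_0..beta_3
y_hat = X @ beta                                  # fitted values, an estimate of f(TV, Radio, Newspaper)
print(beta)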
Statistical Modeling

What to do with 𝑓(𝑥)? Why estimate 𝑓(𝑥) with 𝑓̂(𝑥)?
Prediction
• With a good model of 𝑓(𝑥), we can make predictions for 𝑌 at an
unobserved point 𝑋 = 𝑥.
Inference
• Understand which features among 𝑋 = (𝑋1 , … 𝑋𝑝 ) are important for
the variation in 𝑌.
• How each component 𝑋𝑗 affects 𝑌, depending on the complexity of
𝑓(𝑥)
Inference
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Is the relationship linear, or is a more complicated model required?
Statistical Modeling
What is a good value of 𝑓(𝑥)?
• 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥)
• This means the ideal value of 𝑓(𝑥) is the expected (average) value of 𝑌 at
𝑋 = 𝑥, e.g. 𝑓(5) = 𝐸(𝑌|𝑋 = 5)
Regression Function
• The ideal 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥) is called the regression function
• It is also defined for a vector 𝑋, e.g.
𝑓(𝑥) = 𝐸(𝑌|𝑋1 = 𝑥1, 𝑋2 = 𝑥2)
• 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥) is called the optimal predictor because it minimizes
the mean squared error 𝐸[(𝑌 − 𝑞(𝑋))² | 𝑋 = 𝑥] over all functions 𝑞 at 𝑋 = 𝑥
• 𝜖 = 𝑌 − 𝑓(𝑥) is the irreducible error
• For any estimate 𝑓̂(𝑥) of 𝑓(𝑥), the expected prediction error is
𝐸[(𝑌 − 𝑓̂(𝑥))² | 𝑋 = 𝑥] = [𝑓(𝑥) − 𝑓̂(𝑥)]² + Var(𝜖)
where [𝑓(𝑥) − 𝑓̂(𝑥)]² is the reducible error and Var(𝜖) the irreducible error
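A short supporting derivation (standard, not on the slide): expanding the square, treating 𝑓̂(𝑥) as fixed and using E[𝜖] = 0 gives the reducible/irreducible split.

\begin{aligned}
E\big[(Y-\hat f(x))^2 \mid X=x\big]
 &= E\big[(f(x)+\epsilon-\hat f(x))^2 \mid X=x\big] \\
 &= [f(x)-\hat f(x)]^2 + 2\,[f(x)-\hat f(x)]\,E[\epsilon] + E[\epsilon^2] \\
 &= \underbrace{[f(x)-\hat f(x)]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}
\end{aligned}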
How to estimate 𝑓̂
• No observation at 𝑋 = 12.5
• One way to approximate:
  • select a neighbourhood 𝒩(𝑥)
  • 𝑓̂(𝑥) = 𝐴𝑣𝑔(𝑌|𝑋 ∈ 𝒩(𝑥))
• This is the essence of nearest neighbour methods
How to estimate 𝑓̂
• Nearest neighbour methods work well in low dimensions (𝑝 ≤ 4) and when
the number of points 𝑁 is large(ish); a small sketch follows below
• Smoothing methods such as kernels and splines may also work
• Nearest neighbour methods break down when 𝑝 is large, because of the
curse of dimensionality: nearest neighbours can be far away in high dimensions
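A minimal sketch (toy data and k = 5 are assumptions) of the local-averaging idea 𝑓̂(𝑥) = 𝐴𝑣𝑔(𝑌|𝑋 ∈ 𝒩(𝑥)) using a k-nearest-neighbour window in one dimension:

# Minimal sketch of nearest-neighbour averaging: average Y over the k training points nearest x0.
import numpy as np

def knn_regress(x0, X, y, k=5):
    idx = np.argsort(np.abs(X - x0))[:k]   # neighbourhood N(x0): the k closest X values
    return y[idx].mean()                   # f_hat(x0) = Avg(Y | X in N(x0))

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, 200)                # toy predictor values
y = np.sin(X) + rng.normal(scale=0.3, size=200)
print(knn_regress(12.5, X, y))             # estimate at X = 12.5, even with no observation there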
Curse of Dimensionality

Expected edge length of a sub-cube capturing a fraction 𝑟 of the data:
𝑒𝑝(𝑟) = 𝑟^(1/𝑝)
𝑟 = fraction of volume
𝑝 = number of dimensions

Source: The Elements of Statistical Learning
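A quick numerical check of the edge-length formula (the values below are computed here, not taken from the slide): to capture 10% of a uniform sample in a unit hypercube, the required edge length 𝑒𝑝(0.1) = 0.1^(1/𝑝) approaches 1 as 𝑝 grows, so the "neighbourhood" stops being local.

# e_p(r) = r**(1/p): expected edge length of a sub-cube covering a fraction r of a unit hypercube.
for p in (1, 2, 5, 10, 100):
    print(p, round(0.1 ** (1 / p), 3))   # 0.1, 0.316, 0.631, 0.794, 0.977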


Supervised Learning: example methods

Parametric:
• Linear Regression
• Logistic Regression
• ANN

Non-parametric:
• K nearest neighbours
• Thin-plate spline
• SVM
• Tree-based methods
Parametric Methods
• Parametric methods take a model-based approach
• Step 1: Assume a functional form or shape of 𝑓, e.g. a linear model
𝑓(𝑋) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝
To estimate 𝑓, we only need to estimate the 𝑝 + 1 parameters 𝛽0, 𝛽1, …, 𝛽𝑝
• Step 2: Train the model by estimating the parameters, e.g. by ordinary
least squares or maximum likelihood estimation
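A minimal sketch of the two-step parametric recipe on simulated data (the true coefficients below are assumptions): Step 1 fixes the linear form, Step 2 estimates 𝛽0, …, 𝛽𝑝 by ordinary least squares via the normal equations.

# Minimal sketch: assume a linear form, then estimate the p + 1 coefficients by least squares.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)  # assumed true model

Xd = np.column_stack([np.ones(n), X])             # design matrix with an intercept column
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)   # normal equations: (X'X) beta = X'y
print(beta_hat)                                   # p + 1 = 4 estimated parameters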
Parametric Methods
Advantage:
• Only need to estimate a set of parameters, rather than to fit an
arbitrary function 𝑓
Disadvantage:
• The chosen model usually doesn't match the true (unknown) form of 𝑓
• Fitting a more flexible model means estimating a greater number of
parameters
• More complex models can lead to overfitting the data
• E.g. Linear regression, logistic regression, generalized linear models
Non-parametric Methods
• They do not explicitly assume any functional form of 𝑓
• Seek an estimate of 𝑓 that gets as close to the datapoints as possible
Advantage:
• They avoid the danger of assuming a wrong model form
• Non-parametric methods have the potential to accurately fit a wider range of
possible shapes of 𝑓
Disadvantage:
• A very large number of observations is required
• Overfitting
e.g. Thin-plate spline, decision trees, random forest etc.
Non-parametric Models

A rough (highly flexible) thin-plate spline fit to the Income data makes zero
errors on the training data.
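As a minimal sketch (toy two-predictor data standing in for the Income example), a thin-plate spline with zero smoothing interpolates the training points exactly, reproducing the zero-training-error behaviour described above:

# Minimal sketch: an unsmoothed thin-plate spline fit makes (numerically) zero training errors.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 2))                                   # toy predictors
y = 20 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=2, size=50)   # toy response

spline = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=0.0)
print(np.max(np.abs(spline(X) - y)))   # ~0: the surface passes through every training point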
Flexibility and Interpretability
Linear regression is a relatively inflexible approach, because it can only
generate linear functions.

Methods such as thin-plate splines are more flexible and can generate a
wider range of shapes.

Then why use linear models? Because restrictive methods like linear models
are more interpretable.
Trade-offs
• Prediction accuracy vs interpretability
• Linear models are easy to interpret; ANNs and thin-plate splines are not
• Good fit vs over-fit or under-fit
• Parsimony vs black-box
• A simpler model with fewer variables is preferred to a black-box model with
all variables
Model Flexibility, Training and Test Errors
Model Accuracy
• Suppose a model 𝑓̂(𝑥) is trained over a dataset 𝑇𝑟: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑁
• The training error can be calculated using:
𝑀𝑆𝐸𝑡𝑟 = (1/𝑁) Σ𝑖=1..𝑁 (𝑦𝑖 − 𝑓̂(𝑥𝑖))²
• But using 𝑀𝑆𝐸𝑡𝑟 as the only accuracy metric may give an advantage to
models that overfit.
• One way to mitigate this is to have a test dataset 𝑇𝑒: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑀
• Calculate the test error:
𝑀𝑆𝐸𝑡𝑒 = (1/𝑀) Σ𝑖=1..𝑀 (𝑦𝑖 − 𝑓̂(𝑥𝑖))²
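A minimal sketch (toy sinusoidal data and an assumed degree-10 polynomial fit) of computing 𝑀𝑆𝐸𝑡𝑟 and 𝑀𝑆𝐸𝑡𝑒:

# Minimal sketch: training MSE vs test MSE for a flexible polynomial fit.
import numpy as np

rng = np.random.default_rng(3)
x_tr, x_te = rng.uniform(0, 5, 100), rng.uniform(0, 5, 50)
y_tr = np.sin(x_tr) + rng.normal(scale=0.3, size=100)
y_te = np.sin(x_te) + rng.normal(scale=0.3, size=50)

coefs = np.polyfit(x_tr, y_tr, deg=10)                     # flexible fit, prone to overfitting
mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)    # (1/N) * sum of squared training errors
mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)    # (1/M) * sum of squared test errors
print(mse_tr, mse_te)                                      # test MSE is typically the larger one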
Bias Variance Trade-off
• Suppose the true model is 𝑌 = 𝑓(𝑥) + 𝜖
• Suppose a model 𝑓̂(𝑥) is trained over a dataset 𝑇𝑟: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑁
• Let (𝑥0, 𝑦0) be a test observation from the population
• 𝐸[(𝑦0 − 𝑓̂(𝑥0))²] = 𝑉𝑎𝑟(𝑓̂(𝑥0)) + [𝐵𝑖𝑎𝑠(𝑓̂(𝑥0))]² + 𝑉𝑎𝑟(𝜖)
• 𝐵𝑖𝑎𝑠(𝑓̂(𝑥0)) = 𝐸[𝑓̂(𝑥0)] − 𝑓(𝑥0)
• The expectation averages over the variability of 𝑦0 as well as the variability
in the training dataset
• Typically, as 𝑓̂ becomes more flexible, the variance increases and the bias
decreases
• Choosing flexibility based on average test error amounts to a bias-variance
trade-off
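A small simulation sketch (sinusoidal truth and polynomial fits are assumptions) makes the trade-off concrete: refit 𝑓̂ on many training sets and compare variance and squared bias at a test point 𝑥0 for an inflexible and a flexible model.

# Minimal sketch: estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 by refitting over many training sets.
import numpy as np

rng = np.random.default_rng(4)
f = np.sin                                    # assumed true regression function
x0, sigma, n, reps = 2.0, 0.3, 50, 500
preds = {1: [], 10: []}                       # degree-1 (inflexible) vs degree-10 (flexible) fits

for _ in range(reps):
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    for deg in preds:
        preds[deg].append(np.polyval(np.polyfit(x, y, deg), x0))

for deg, p in preds.items():
    p = np.array(p)
    print(deg, "variance:", p.var(), "squared bias:", (p.mean() - f(x0)) ** 2)
# The flexible fit typically shows higher variance and lower bias; Var(eps) = sigma**2 is irreducible.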
Bias Variance Trade-off

Source: The Elements of Statistical Learning


Classification
• Here the response variable 𝑌 is qualitative/categorical/discrete
• E.g. 𝒞 = {𝑠𝑝𝑎𝑚, ℎ𝑎𝑚}; 𝒞 = {0, 1, …, 9}; etc.
• Objective is to build a classifier 𝐶(𝑋) that assigns a class label from 𝒞
for a new observation 𝑋
• Assess the uncertainty in each classification
• Understand the roles of the different predictors 𝑋 = (𝑋1 , . . 𝑋𝑝 )
• Is there an ideal 𝐶(𝑋)?
Classification
• Is there an ideal 𝐶(𝑋)?
• For the 𝐾 elements in 𝒞, numbered 1, 2, …, 𝐾
• Let 𝑝𝑘(𝑥) = 𝑃(𝑌 = 𝑘 | 𝑋 = 𝑥), ∀𝑘 = 1, …, 𝐾
• 𝑝𝑘(𝑥) is the conditional probability that the response 𝑌 is in class 𝑘 when
the feature 𝑋 takes the value 𝑥
• The Bayes optimal classifier at 𝑋 = 𝑥 is the class 𝑘 ∈ 𝒞 with the maximum
value of 𝑝𝑘(𝑥)
• Nearest neighbour averaging can also be done for classification
problems
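A minimal sketch (two assumed one-dimensional Gaussian classes with equal priors) of the Bayes optimal classifier: evaluate 𝑝𝑘(𝑥) for each class and return the argmax.

# Minimal sketch: Bayes classifier C(x) = argmax_k p_k(x) for two assumed Gaussian classes.
import numpy as np
from scipy.stats import norm

priors = {0: 0.5, 1: 0.5}                                           # assumed class priors
dens = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=1.0, scale=1.0)}  # assumed class-conditional densities

def bayes_classify(x):
    post = {k: priors[k] * dens[k].pdf(x) for k in priors}  # proportional to p_k(x)
    return max(post, key=post.get)                          # class with the largest conditional probability

print(bayes_classify(0.3))                                  # assigns the class whose mean is nearer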
Accuracy in Classification
• To measure accuracy, the misclassification error rate is used for
classification problems
• For a test dataset 𝑇𝑒:
𝐸𝑟𝑟𝑇𝑒 = 𝐴𝑣𝑔𝑖∈𝑇𝑒 𝐼(𝑦𝑖 ≠ 𝐶̂(𝑥𝑖))
• The Bayes classifier, which uses the true 𝑝𝑘(𝑥), has the smallest error
• Logistic regression, support vector machines, and generalized additive
models are other methods for classification
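The test error rate above is simply the fraction of misclassified test observations; a one-line sketch with assumed label arrays:

# Minimal sketch: Err_Te = average of the indicators I(y_i != C_hat(x_i)) over the test set.
import numpy as np

y_te = np.array([0, 1, 1, 0, 1])     # assumed true test labels
y_hat = np.array([0, 1, 0, 0, 1])    # assumed classifier predictions C_hat(x_i)
err_te = np.mean(y_te != y_hat)      # misclassification error rate
print(err_te)                        # 0.2 for these assumed labels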
