Lec-01-Introduction to Statistical Learning

The document provides an introduction to statistical learning, covering key concepts such as supervised and unsupervised learning, regression and classification problems, and the differences between statistical learning and machine learning. It discusses various methods for modeling, including parametric and non-parametric approaches, and highlights the importance of model flexibility, interpretability, and the bias-variance trade-off. Additionally, it addresses the objectives of predicting outcomes and understanding relationships between variables in the context of statistical modeling.
Introduction to Statistical Learning
Dr. Sayak Roychowdhury
Department of Industrial and Systems Engineering,
IIT Kharagpur
References
Statistical Learning
• IBM Watson wins Jeopardy! in 2011 (a game show)
  https://www.youtube.com/watch?v=P18EdAKuC1U
• Predicting elections

Source: The Guardian
Source: IBM Research
What determines wage?

Regression problem
Stock Market Prediction

Classification Problem
Gene Expression Data

The NCI60 data set consists of 6,830 gene expression measurements for
each of 64 cancer cell lines (unsupervised learning using PCA).
Intrusion Detection Systems

Source: Transactions on Emerging Telecommunications Technologies, Volume 32,
Issue 1, first published 16 October 2020, DOI: 10.1002/ett.4150
Quality Engineering
Gene Classification

Mahendran, N., Durai Raj Vincent, P. M., Srinivasan, K., & Chang, C. Y. (2020).
Machine learning based computational gene selection models: a survey,
performance evaluation, open issues, and future research directions.
Frontiers in Genetics, 11, 603808.
Mining and Drilling Borehole

Source: Dindaroğlu, T. (2014). The use of the GIS Kriging technique to determine
the spatial changes of natural radionuclide concentrations in soil and forest
cover. Journal of Environmental Health Science and Engineering, 12(1), 1-11.
Source: Mining Weekly
Supervised Learning
• Outcome measurement 𝑌, (dependent variable, response, target)
• Vector of 𝑝 predictors 𝑋 (inputs, factors, regressors, covariates,
features, independent variables)
• In regression problems, 𝑌 is quantitative (e.g. price, pressure, length,
etc.)
• In classification problems, 𝑌 takes values in a finite unordered set
(survived/died, spam/not spam, cancerous/benign)
• Training data: (𝑥1, 𝑦1), …, (𝑥𝑁, 𝑦𝑁), observed pairs of these
measurements
Objective of supervised learning
• Accurately predict unseen test cases
• Understand which inputs affect outcome
• Assess quality of predictions and inferences
Unsupervised Learning
• No outcome variable, just a set of
predictors measured on a set of
samples
• Objective: find groups of samples
that behave similarly, group objects
with similar features
• Difficult to measure accuracy
• Can be used as a preprocessing
step for supervised learning
Source: https://www.thimbletoys.com/
Statistical Learning Vs Machine Learning
• Machine learning -> subfield of artificial intelligence (esp. neural
networks)
• Statistical learning -> subfield in statistics
• There is much overlap
• Machine learning is more utilized in large-scale applications and cares more
about prediction accuracy
• Statistical learning puts more emphasis on models and interpretability,
precision and uncertainty
• The distinction has become increasingly blurred these days
Statistical Modeling

𝑆𝑎𝑙𝑒𝑠 ≈ 𝑓(𝑇𝑉, 𝑅𝑎𝑑𝑖𝑜, 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟)


Statistical Modeling
• 𝑌 = 𝑓(𝑋) + 𝜖 (model form)
• 𝜖 ~ error term (measurement error, discrepancies, etc.)
• 𝑌 = 𝑆𝑎𝑙𝑒𝑠
• 𝑋1 = 𝑇𝑉, 𝑋2 = 𝑅𝑎𝑑𝑖𝑜, 𝑋3 = 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟
• 𝑋 = (𝑋1, 𝑋2, 𝑋3)ᵀ
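As an illustrative sketch (not part of the slides), the advertising-style model above could be fit by ordinary least squares in Python; the file name advertising.csv and its column names are assumptions made for illustration.

# Minimal sketch: fit Sales ≈ f(TV, Radio, Newspaper) with a linear model (data file assumed).
import numpy as np
import pandas as pd

ads = pd.read_csv("advertising.csv")              # assumed columns: TV, Radio, Newspaper, Sales
X = np.column_stack([np.ones(len(ads)), ads[["TV", "Radio", "Newspaper"]].to_numpy()])
y = ads["Sales"].to_numpy()

beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares estimates of beta_0..beta_3
y_hat = X @ beta                                  # fitted values, an estimate of f(TV, Radio, Newspaper)
print(beta)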
Statistical Modeling

What to do with 𝑓(𝑥)? Why estimate 𝑓(𝑥) with 𝑓̂(𝑥)?
Prediction
• With a good model of 𝑓(𝑥), we can make predictions for 𝑌 at an
unobserved point 𝑋 = 𝑥.
Inference
• Understand which features among 𝑋 = (𝑋1 , … 𝑋𝑝 ) are important for
the variation in 𝑌.
• How each component 𝑋𝑗 affects 𝑌, depending on the complexity of
𝑓(𝑥)
Inference
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Is the relationship linear, or is a more complicated model required?
Statistical Modeling
What is a good value of 𝑓(𝑥)?
• 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥)
• This means the ideal value of 𝑓(𝑥) is the expected (average) value of 𝑌 at
𝑋 = 𝑥, e.g. 𝑓(5) = 𝐸(𝑌|𝑋 = 5)
Regression Function
• The ideal 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥) is called the regression function
• It is also defined for a vector 𝑋, e.g.
𝑓(𝑥) = 𝐸(𝑌|𝑋1 = 𝑥1, 𝑋2 = 𝑥2)
• 𝑓(𝑥) = 𝐸(𝑌|𝑋 = 𝑥) is called the optimal predictor because it minimizes
the mean squared error 𝐸[(𝑌 − 𝑞(𝑋))² | 𝑋 = 𝑥] over all functions 𝑞 at 𝑋 = 𝑥
• 𝜖 = 𝑌 − 𝑓(𝑥) is the irreducible error
• For any estimate 𝑓̂(𝑥) of 𝑓(𝑥), the expected prediction error is
𝐸[(𝑌 − 𝑓̂(𝑥))² | 𝑋 = 𝑥] = [𝑓(𝑥) − 𝑓̂(𝑥)]² + Var(𝜖)
where [𝑓(𝑥) − 𝑓̂(𝑥)]² is the reducible error and Var(𝜖) the irreducible error
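A short supporting derivation (standard, not on the slide): expanding the square, treating 𝑓̂(𝑥) as fixed and using E[𝜖] = 0 gives the reducible/irreducible split.

\begin{aligned}
E\big[(Y-\hat f(x))^2 \mid X=x\big]
 &= E\big[(f(x)+\epsilon-\hat f(x))^2 \mid X=x\big] \\
 &= [f(x)-\hat f(x)]^2 + 2\,[f(x)-\hat f(x)]\,E[\epsilon] + E[\epsilon^2] \\
 &= \underbrace{[f(x)-\hat f(x)]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}
\end{aligned}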
How to estimate 𝑓̂
• No observation at 𝑋 = 12.5
• One way to approximate:
  • select a neighbourhood 𝒩(𝑥)
  • 𝑓̂(𝑥) = 𝐴𝑣𝑔(𝑌|𝑋 ∈ 𝒩(𝑥))
• This is the essence of nearest neighbour methods
How to estimate 𝑓̂
• Nearest neighbour methods work well in low dimensions (𝑝 ≤ 4) and when
the number of points 𝑁 is large(ish); a small sketch follows below
• Smoothing methods such as kernels and splines may also work
• Nearest neighbour methods break down when 𝑝 is large, because of the
curse of dimensionality: nearest neighbours can be far away in high dimensions
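A minimal sketch (toy data and k = 5 are assumptions) of the local-averaging idea 𝑓̂(𝑥) = 𝐴𝑣𝑔(𝑌|𝑋 ∈ 𝒩(𝑥)) using a k-nearest-neighbour window in one dimension:

# Minimal sketch of nearest-neighbour averaging: average Y over the k training points nearest x0.
import numpy as np

def knn_regress(x0, X, y, k=5):
    idx = np.argsort(np.abs(X - x0))[:k]   # neighbourhood N(x0): the k closest X values
    return y[idx].mean()                   # f_hat(x0) = Avg(Y | X in N(x0))

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, 200)                # toy predictor values
y = np.sin(X) + rng.normal(scale=0.3, size=200)
print(knn_regress(12.5, X, y))             # estimate at X = 12.5, even with no observation there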
Curse of Dimensionality

Expected edge length of a sub-cube capturing a fraction 𝑟 of the data:
𝑒𝑝(𝑟) = 𝑟^(1/𝑝)
𝑟 = fraction of volume
𝑝 = number of dimensions

Source: The Elements of Statistical Learning
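A quick numerical check of the edge-length formula (the values below are computed here, not taken from the slide): to capture 10% of a uniform sample in a unit hypercube, the required edge length 𝑒𝑝(0.1) = 0.1^(1/𝑝) approaches 1 as 𝑝 grows, so the "neighbourhood" stops being local.

# e_p(r) = r**(1/p): expected edge length of a sub-cube covering a fraction r of a unit hypercube.
for p in (1, 2, 5, 10, 100):
    print(p, round(0.1 ** (1 / p), 3))   # 0.1, 0.316, 0.631, 0.794, 0.977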


Supervised Learning: example methods

Parametric:
• Linear Regression
• Logistic Regression
• ANN

Non-parametric:
• K nearest neighbours
• Thin-plate spline
• SVM
• Tree-based methods
Parametric Methods
• Parametric methods take a model-based approach
• Step 1: Assume a functional form or shape of 𝑓, e.g. a linear model
𝑓(𝑋) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝
To estimate 𝑓, we only need to estimate the 𝑝 + 1 parameters 𝛽0, 𝛽1, …, 𝛽𝑝
• Step 2: Train the model by estimating the parameters, e.g. by ordinary
least squares or maximum likelihood estimation
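A minimal sketch of the two-step parametric recipe on simulated data (the true coefficients below are assumptions): Step 1 fixes the linear form, Step 2 estimates 𝛽0, …, 𝛽𝑝 by ordinary least squares via the normal equations.

# Minimal sketch: assume a linear form, then estimate the p + 1 coefficients by least squares.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)  # assumed true model

Xd = np.column_stack([np.ones(n), X])             # design matrix with an intercept column
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)   # normal equations: (X'X) beta = X'y
print(beta_hat)                                   # p + 1 = 4 estimated parameters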
Parametric Methods
Advantage:
• Only need to estimate a set of parameters, rather than to fit an
arbitrary function 𝑓
Disadvantage:
• The chosen model usually doesn't match the true (unknown) form of 𝑓
• Fitting a more flexible model means estimating a greater number of
parameters
• More complex models can lead to overfitting the data
• E.g. Linear regression, logistic regression, generalized linear models
Non-parametric Methods
• They do not explicitly assume any functional form of 𝑓
• Seek an estimate of 𝑓 that gets as close to the datapoints as possible
Advantage:
• They avoid the danger of assuming a wrong model form
• Non-parametric methods have the potential to accurately fit a wider range of
possible shapes of 𝑓
Disadvantage:
• A very large number of observations is required
• Overfitting
e.g. Thin-plate spline, decision trees, random forest etc.
Non-parametric Models

A rough (highly flexible) thin-plate spline fit to the Income data makes zero
errors on the training data.
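As a minimal sketch (toy two-predictor data standing in for the Income example), a thin-plate spline with zero smoothing interpolates the training points exactly, reproducing the zero-training-error behaviour described above:

# Minimal sketch: an unsmoothed thin-plate spline fit makes (numerically) zero training errors.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 2))                                   # toy predictors
y = 20 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=2, size=50)   # toy response

spline = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=0.0)
print(np.max(np.abs(spline(X) - y)))   # ~0: the surface passes through every training point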
Flexibility and Interpretability
Linear regression is a relatively inflexible approach, because it can only
generate linear functions.

Methods such as thin-plate splines are more flexible and can generate a
wider range of shapes.

Then why use linear models? Because restrictive methods like linear models
are more interpretable.
Trade-offs
• Prediction accuracy vs interpretability
• Linear models are easy to interpret; ANNs and thin-plate splines are not
• Good fit vs over-fit or under-fit
• Parsimony vs black-box
• A simpler model with fewer variables is preferred to a black-box model with
all variables
Model Flexibility, Training and Test Errors
Model Accuracy
• Suppose a model 𝑓̂(𝑥) is trained over a dataset 𝑇𝑟: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑁
• The training error can be calculated using:
𝑀𝑆𝐸𝑡𝑟 = (1/𝑁) Σ𝑖=1..𝑁 (𝑦𝑖 − 𝑓̂(𝑥𝑖))²
• But using 𝑀𝑆𝐸𝑡𝑟 as the only accuracy metric may give an advantage to
models that overfit.
• One way to mitigate this is to have a test dataset 𝑇𝑒: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑀
• Calculate the test error:
𝑀𝑆𝐸𝑡𝑒 = (1/𝑀) Σ𝑖=1..𝑀 (𝑦𝑖 − 𝑓̂(𝑥𝑖))²
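A minimal sketch (toy sinusoidal data and an assumed degree-10 polynomial fit) of computing 𝑀𝑆𝐸𝑡𝑟 and 𝑀𝑆𝐸𝑡𝑒:

# Minimal sketch: training MSE vs test MSE for a flexible polynomial fit.
import numpy as np

rng = np.random.default_rng(3)
x_tr, x_te = rng.uniform(0, 5, 100), rng.uniform(0, 5, 50)
y_tr = np.sin(x_tr) + rng.normal(scale=0.3, size=100)
y_te = np.sin(x_te) + rng.normal(scale=0.3, size=50)

coefs = np.polyfit(x_tr, y_tr, deg=10)                     # flexible fit, prone to overfitting
mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)    # (1/N) * sum of squared training errors
mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)    # (1/M) * sum of squared test errors
print(mse_tr, mse_te)                                      # test MSE is typically the larger one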
Bias Variance Trade-off
• Suppose the true model is 𝑌 = 𝑓(𝑥) + 𝜖
• Suppose a model 𝑓̂(𝑥) is trained over a dataset 𝑇𝑟: {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑁
• Let (𝑥0, 𝑦0) be a test observation from the population
• 𝐸[(𝑦0 − 𝑓̂(𝑥0))²] = 𝑉𝑎𝑟(𝑓̂(𝑥0)) + [𝐵𝑖𝑎𝑠(𝑓̂(𝑥0))]² + 𝑉𝑎𝑟(𝜖)
• 𝐵𝑖𝑎𝑠(𝑓̂(𝑥0)) = 𝐸[𝑓̂(𝑥0)] − 𝑓(𝑥0)
• The expectation averages over the variability of 𝑦0 as well as the variability
in the training dataset
• Typically, as 𝑓̂ becomes more flexible, the variance increases and the bias
decreases
• Choosing flexibility based on average test error amounts to a bias-variance
trade-off
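A small simulation sketch (sinusoidal truth and polynomial fits are assumptions) makes the trade-off concrete: refit 𝑓̂ on many training sets and compare variance and squared bias at a test point 𝑥0 for an inflexible and a flexible model.

# Minimal sketch: estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 by refitting over many training sets.
import numpy as np

rng = np.random.default_rng(4)
f = np.sin                                    # assumed true regression function
x0, sigma, n, reps = 2.0, 0.3, 50, 500
preds = {1: [], 10: []}                       # degree-1 (inflexible) vs degree-10 (flexible) fits

for _ in range(reps):
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    for deg in preds:
        preds[deg].append(np.polyval(np.polyfit(x, y, deg), x0))

for deg, p in preds.items():
    p = np.array(p)
    print(deg, "variance:", p.var(), "squared bias:", (p.mean() - f(x0)) ** 2)
# The flexible fit typically shows higher variance and lower bias; Var(eps) = sigma**2 is irreducible.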
Bias Variance Trade-off

Source: The Elements of Statistical Learning


Classification
• Here the response variable 𝑌 is qualitative/categorical/discrete
• E.g. 𝒞 = {𝑠𝑝𝑎𝑚, ℎ𝑎𝑚}; 𝒞 = {0, 1, …, 9}; etc.
• Objective is to build a classifier 𝐶(𝑋) that assigns a class label from 𝒞
for a new observation 𝑋
• Assess the uncertainty in each classification
• Understand the roles of the different predictors 𝑋 = (𝑋1 , . . 𝑋𝑝 )
• Is there an ideal 𝐶(𝑋)?
Classification
• Is there an ideal 𝐶(𝑋)?
• For the 𝐾 elements in 𝒞, numbered 1, 2, …, 𝐾
• Let 𝑝𝑘(𝑥) = 𝑃(𝑌 = 𝑘 | 𝑋 = 𝑥), ∀𝑘 = 1, …, 𝐾
• 𝑝𝑘(𝑥) is the conditional probability that the response 𝑌 is in class 𝑘 when
the feature 𝑋 takes the value 𝑥
• The Bayes optimal classifier at 𝑋 = 𝑥 is the class 𝑘 ∈ 𝒞 with the maximum
value of 𝑝𝑘(𝑥)
• Nearest neighbour averaging can also be done for classification
problems
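A minimal sketch (two assumed one-dimensional Gaussian classes with equal priors) of the Bayes optimal classifier: evaluate 𝑝𝑘(𝑥) for each class and return the argmax.

# Minimal sketch: Bayes classifier C(x) = argmax_k p_k(x) for two assumed Gaussian classes.
import numpy as np
from scipy.stats import norm

priors = {0: 0.5, 1: 0.5}                                           # assumed class priors
dens = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=1.0, scale=1.0)}  # assumed class-conditional densities

def bayes_classify(x):
    post = {k: priors[k] * dens[k].pdf(x) for k in priors}  # proportional to p_k(x)
    return max(post, key=post.get)                          # class with the largest conditional probability

print(bayes_classify(0.3))                                  # assigns the class whose mean is nearer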
Accuracy in Classification
• To measure accuracy, the misclassification error rate is used for
classification problems
• For a test dataset 𝑇𝑒:
𝐸𝑟𝑟𝑇𝑒 = 𝐴𝑣𝑔𝑖∈𝑇𝑒 𝐼(𝑦𝑖 ≠ 𝐶̂(𝑥𝑖))
• The Bayes classifier, which uses the true 𝑝𝑘(𝑥), has the smallest error
• Logistic regression, support vector machines, and generalized additive
models are other methods for classification
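The test error rate above is simply the fraction of misclassified test observations; a one-line sketch with assumed label arrays:

# Minimal sketch: Err_Te = average of the indicators I(y_i != C_hat(x_i)) over the test set.
import numpy as np

y_te = np.array([0, 1, 1, 0, 1])     # assumed true test labels
y_hat = np.array([0, 1, 0, 0, 1])    # assumed classifier predictions C_hat(x_i)
err_te = np.mean(y_te != y_hat)      # misclassification error rate
print(err_te)                        # 0.2 for these assumed labels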
