
EE2211 Introduction to

Machine Learning
Lecture 1

Wang Xinchao
[email protected]

Office Hour: Monday 9:30 – 10:30 AM


(Week 2-4, Week 10-12)
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
World’s Largest Selfie

3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
World’s Largest Selfie

4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Recent Advances

Sora

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing
neon and animated city signage. She wears a black leather jacket, a long
red dress, and black boots, and carries a black purse. She wears sunglasses
and red lipstick. She walks confidently and casually. The street is damp and
reflective, creating a mirror effect of the colorful lights. Many pedestrians walk
about.

5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• What is machine learning?
– Three Definition(s)
• When do we need machine learning?
– Sometimes we need, sometimes we don’t
• Applications of machine learning
• Types of machine learning
– Supervised, Unsupervised, Reinforcement Learning
• Walking through a toy example on classification
• Inductive vs. Deductive Reasoning

6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
What is machine learning?

Learning is any process by which a system


improves performance from experience.
- Herbert Simon

A computer program is said to learn


- from experience E
- with respect to some class of tasks T
- and performance measure P,
if its performance at tasks in T, as measured
by P, improves with experience E.
- Tom Mitchell
7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Machine Learning (Supervised Learning)
Learned
Data
Computer Function 𝑓 .
Output

Data Output

Cat
𝑓( ) = ‘cat’
𝑓(. ) such that

𝑓( )= ‘dog’
Dog

8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Machine Learning (Supervised Learning)
Learned
Data
Computer Function 𝑓 .
Output

Data Output

When applied
Cat

𝑓(. ) 𝑓( ) Cat!

New image
Dog

Machine Learning: field of study that gives computers the


ability to learn without being explicitly programmed
- Arthur Samuel

9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
AI, Machine Learning, and
Deep Learning

Example of AI but not ML: Deductive Reasoning

NUS is in Singapore, Singapore is in Asia -> NUS is in Asia

10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
When do we need machine learning?

ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning is not always useful:
• There is no need to “learn” to calculate payroll!
  My Salary = Days_of_work * Daily Salary + Bonus

Based on slide by E. Alpaydin

11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning
A classic example of a task that requires machine learning:
it is very hard to say what makes a “2”.
Task T, Performance P, Experience E

T: Digit Recognition
P: Classification Accuracy
E: Labelled Images

“four”
“three”

Labels -> Supervision!


Slide credit: Geoffrey Hinton

12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning

Task T, Performance P, Experience E

T: Email Categorization
P: Classification Accuracy
E: Email Data, Some Labelled

13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning

Task T, Performance P, Experience E

T: Playing Go Game
P: Chances of Winning
E: Records of Past Games

14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning

Task T, Performance P, Experience E

T: Identifying Covid-19 Clusters


P: Small Internal Distances
Larger External Distances
E: Records of Patients

15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Web Search Engine Product Recommendation Language Translation

Photo Tagging Virtual Personal Assistant Portfolio Management

Traffic Prediction Medical Diagnosis


Algorithmic Trading
16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning

Supervised Learning
  Input: 1) Training Samples, 2) Desired Output Samples (Teacher/Supervision)
  Output: A rule that maps input to output

Unsupervised Learning
  Input: Training Samples
  Output: Underlying patterns in data

Reinforcement Learning
  Input: Sequence of States, Actions, and Delayed Rewards
  Output: Action Strategy: a rule that maps the environment to action

17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning

Supervised Learning Unsupervised Learning Reinforcement Learning


Supervised Learning
  Input: 1) Training Samples, 2) Desired Output Samples (Teacher/Supervision)
  Output: A rule that maps input to output

Unsupervised Learning
  Input: Training Samples
  Output: Underlying patterns in data

Reinforcement Learning
  Input: Sequence of States, Actions, and Delayed Rewards
  Output: Action Strategy: a rule that maps the environment to action

18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Supervised Data Output

Learning 𝑥 Regression 𝑦 (Continuous)

𝑥 Classification 𝑦 (Categorical)

• Given (𝐱1, 𝑦1), (𝐱2, 𝑦2), …, (𝐱N, 𝑦N)


Regression • Learn a function 𝑓 𝐱 to predict real-valued 𝑦 given 𝐱

(Figure: Arctic Sea Ice Extent in January (in million sq km), 𝑦, plotted against year, 𝑥;
 the right panel adds 𝑓(𝑥): the line that best aligns with the samples)

19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Supervised Data Output

Learning 𝑥 Regression 𝑦 (Continuous)

𝑥 Classification 𝑦 (Categorical)

• Given (𝐱1, 𝑦1), (𝐱2, 𝑦2), …, (𝐱N, 𝑦N)


Classification • Learn a function 𝑓 𝐱 to predict categorical 𝑦 given 𝐱

(Figure: feature space with axes lightness and width; samples labeled 𝑦 = Sea Bass or 𝑦 = Salmon;
 the right panel adds 𝑓(𝑥): the line that separates the two classes, and a new query point 𝐱 marked “?”)
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning

Supervised Learning
  Input: 1) Training Samples, 2) Desired Output Samples (Teacher/Supervision)
  Output: A rule that maps input to output

Unsupervised Learning
  Input: Training Samples
  Output: Underlying patterns in data

Reinforcement Learning
  Input: Sequence of States, Actions, and Delayed Rewards
  Output: Action Strategy: a rule that maps the environment to action

Key difference w.r.t. supervised learning:


No Label/Supervision is given!

21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Unsupervised Learning

• Given 𝐱! , 𝐱 " , …, 𝐱 # , without labels


Clustering • Output Hidden Structure Behind

(Figure: the same lightness–width feature space; the right panel shows the samples grouped into clusters without any labels)

No Label/Supervision is given!
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning

Supervised Learning
  Input: 1) Training Samples, 2) Desired Output Samples (Teacher/Supervision)
  Output: A rule that maps input to output

Unsupervised Learning
  Input: Training Samples
  Output: Underlying patterns in data

Reinforcement Learning
  Input: Sequence of States, Actions, and Delayed Rewards
  Output: Action Strategy: a rule that maps the environment to action

23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Reinforcement Learning
Breakout Game

Initial Performance Training 15 minutes Training 30 minutes

24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Reinforcement Learning
• Given sequence of states 𝑺 and actions 𝑨 with (delayed)
rewards 𝑹
• Output a policy 𝜋(𝑎, 𝑠), to guide us what action 𝑎 to take in
state 𝑠

𝑺: Ball Location,
Paddle Location, Bricks
𝑨: left, right

𝑹:
positive reward
Knocking a brick,
clearing all bricks

negative reward
Missing the ball

zero reward
Cases in between

25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Quiz Time!
Supervised, Unsupervised, or Reinforcement?

A classic example of a task that requires machine learning:
it is very hard to say what makes a “2”.

(Quiz images; answer labels on the slide: Supervised, Unsupervised, Supervised, Reinforcement)

Slide credit: Geoffrey Hinton

26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification

? Yes

?
Yes No

Step 1: Feature Extraction – Extract Attributes of Samples
Step 2: Sample Classification – Decide Label for a Sample

27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification
B,L,Ring R,L,Triangle
B,L,Rectangle
Y,S,Arrow

? Yes

G,S,Circle
G,S,Diamond ?
R,L,Circle
Y,L,Triangle
O,L,Diamond No
Yes

Step 1: Feature Extraction


Color, Size, Shape

28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification

Feature Extraction
Color Size Shape Label
Blue Large Ring Yes
Red Large Triangle Yes
Orange Large Diamond Yes
Green Small Circle Yes
Yellow Small Arrow No
Blue Large Rectangle No
Red Large Circle No
Green Small Diamond No
Yellow Large Triangle ?

29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification

Feature Extraction
Color Size Shape Label
Blue Large Ring Yes
Red Large Triangle Yes
Orange Large Diamond Yes
Green Small Circle Yes
Yellow Small Arrow No
Blue Large Rectangle No
Red Large Circle No
Green Small Diamond No

30
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification

Feature Extraction Similarity


Color Size Shape Label Color Size Shape Total
Blue Large Ring Yes 0 1 0 1
Red Large Triangle Yes 0 1 1 2
Orange Large Diamond Yes 0 1 0 1
Green Small Circle Yes 0 0 0 0
Yellow Small Arrow No 1 0 0 1
Blue Large Rectangle No 0 1 0 1
Red Large Circle No 0 1 0 1
Green Small Diamond No 0 0 0 0

31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification

Similarity
Color Size Shape Total
Nearest Neighbor Classifier:
0 1 0 1
0 1 1 2 1) Find the “nearest
0 1 0 1 neighbor” of a sample in
the feature space
0 0 0 0
1 0 0 1 2) Assign the label of the
0 1 0 1 nearest neighbor to the
sample
0 1 0 1
0 0 0 0
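Below is a minimal Python sketch of this 1-nearest-neighbor rule on the toy table. The tuple encoding and the similarity helper are illustrative choices, not part of the lecture slides.

```python
# Toy 1-nearest-neighbor on the token table (illustrative sketch).
train = [
    (("Blue",   "Large", "Ring"),      "Yes"),
    (("Red",    "Large", "Triangle"),  "Yes"),
    (("Orange", "Large", "Diamond"),   "Yes"),
    (("Green",  "Small", "Circle"),    "Yes"),
    (("Yellow", "Small", "Arrow"),     "No"),
    (("Blue",   "Large", "Rectangle"), "No"),
    (("Red",    "Large", "Circle"),    "No"),
    (("Green",  "Small", "Diamond"),   "No"),
]

def similarity(a, b):
    """Count how many attributes (color, size, shape) match."""
    return sum(int(ai == bi) for ai, bi in zip(a, b))

query = ("Yellow", "Large", "Triangle")
# Pick the training sample with the highest similarity and copy its label.
best_features, best_label = max(train, key=lambda fl: similarity(query, fl[0]))
print(best_features, "->", best_label)   # (Red, Large, Triangle) -> Yes
```

Running it reproduces the similarity column of the slide: the (Red, Large, Triangle) token is the nearest neighbor (similarity 2), so the query is labeled “Yes”.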

32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Inductive vs. Deductive Reasoning
• Main Task of Machine Learning: to make inference
Two Types of Inference

Inductive
• To reach probable conclusions
• Not all needed information is available, causing uncertainty
• Probability and Statistics

Deductive
• To reach logical conclusions deterministically
• All information that can lead to the correct conclusion is available
• Rule-based reasoning

NUS is in Singapore, Singapore is in Asia ->


NUS is in Asia

33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Inductive Reasoning
Note: humans use inductive reasoning all the time and
not in a formal way like using probability/statistics.

Ref: Gardner, Martin (March 1979). “Mathematical Games: On the fabric of inductive logic, and some probability paradoxes” (PDF). Scientific American, 234.
34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary by Quick Quiz

Three Components in ML Definition
  Task T, Performance P, Experience E

Two Types of Supervised Learning
  Classification, Regression

Three Types of Learning in ML
  Supervised Learning, Unsupervised Learning, Reinforcement Learning

One Type of Unsupervised Learning
  Clustering

Inductive and Deductive Reasoning
  Inductive: Probable; Deductive: Rule-based

Example of a Classifier Model
  Nearest Neighbor Classifier

35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)
Which of the following statements is true?
A. Nearest Neighbor Classifier is an example of
unsupervised learning
B. Nearest Neighbor Classifier is an example of deductive
learning
C. Nearest Neighbor Classifier is an example of feature
selection
D. None of the above is correct.

36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NUS
National University
of Singapore

THANK YOU.

37
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
EE2211 Introduction to
Machine Learning
Lecture 2

Wang Xinchao
[email protected]

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary of Lec 1

Three Components in ML Definition
  Task T, Performance P, Experience E

Two Types of Supervised Learning
  Classification, Regression

Three Types of Learning in ML
  Supervised Learning, Unsupervised Learning, Reinforcement Learning

One Type of Unsupervised Learning
  Clustering

Inductive and Deductive Reasoning
  Inductive: Probable; Deductive: Rule-based

Example of a Classifier Model
  Nearest Neighbor Classifier

3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline

• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Data

5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Levels/Scales of Measurement
NOIR (lowest → highest level of measurement):
• Nominal: Named
• Ordinal: Named + Ordered
• Interval: Named + Ordered + Equal Interval
• Ratio: Named + Ordered + Equal Interval + Has “True” Zero

7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
A Quick Recap: Mean, Median, Mode
• If we are given a sequence of numbers:
1, 3, 4, 6, 6, 7, 8
Mean: computing the average
(1+3+4+6+6+7+8)/7 = 5

Median: number in the middle (after sorting)


1, 3, 4, 6, 6, 7, 8
*In case of an even number of elements:
1, 3, 4, 6, 7, 8
(4+6)/2 = 5

Mode: number with highest frequency
1, 3, 4, 6, 6, 7, 8
Mode = 6

(Frequency distribution bar chart over the values 1–8)
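A quick way to check these three statistics in Python (a small sketch using the standard statistics module):

```python
from statistics import mean, median, mode

data = [1, 3, 4, 6, 6, 7, 8]
print(mean(data))    # 5
print(median(data))  # 6 (middle value after sorting)
print(mode(data))    # 6 (most frequent value)

# Even number of elements: the median averages the two middle values.
print(median([1, 3, 4, 6, 7, 8]))  # (4+6)/2 = 5.0
```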

8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Nominal Data
• Lowest Level of Measurement
• Discrete Categories
• NO natural order
• Estimating a mean, median, or standard deviation, would be
meaningless.
• Possible Measure: mode, frequency distribution

• Example:

Gender Occupation

man woman Doctor Police Teacher

9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ordinal Data
• Ordered Categories
• Relative Ranking
• Unknown “distance” between categories: orders matter but not the
difference between values
• Possible Measure: mode, frequency distribution + median

• Example:
– Evaluate the difficulty level of an exam
• 1: Very Easy, 2: Easy, 3: About Right, 4: Difficult, 5: Very Difficult

10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Interval Data
• Ordered Categories
• Well-defined “unit” measurement:
– Distances between points on the scale are measurable and well-defined
– Can measure differences!
• Equal Interval (between two consecutive unit points)
• Zero is arbitrary (not absolute), in many cases human-defined
– If the variable equals zero, it does not mean there is none of that variable
• Ratio is meaningless
• Possible Measure: mode, frequency distribution + median + mean,
standard deviation, addition/subtraction
• Example:
– Temperature measured in Celsius
• For instance: 10 degrees C, 28 degrees C
– Year of someone’s birth
• For instance: 1990, 2005, 2010, 2022

11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ratio Data
• Most precise and highest level of measurement
• Ordered
• Equal Intervals
• Natural Zeros:
– If the variable equals zero, it means there is none of that variable
– Not arbitrary
• Possible Measure: mode, frequency distribution + median + mean, standard
deviation, addition/subtraction + multiplication and division (ratio)

• Example:
– Weights
• 10 KG, 20 KG, 30 KG

– Time
• 10 Seconds, 1 Hour, 1 Day

12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR

We can estimate Nominal Ordinal Interval Ratio

Frequency
Distribution
Yes Yes Yes Yes

Median No Yes Yes Yes

Add or subtract No No Yes Yes

Mean, standard
deviation
No No Yes Yes

Ratios No No No Yes

13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR decision flow:
• Ordered? No → Nominal
• Ordered? Yes → Equally split? No → Ordinal
• Equally split? Yes → Zero means none? No → Interval
• Zero means none? Yes → Ratio

14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
• Which level of measurement?
Nominal, Ordinal, Interval, Ratio

1. Favorite Restaurant
   • McDonald’s, Burger King, Subway, KFC, …
2. Weight of luggage measured in KG
3. SAT Scores: note that the SAT range is [400, 1600]
4. Size of Packed Eggs in supermarkets
   • Small, Medium, Large, Extra Large, …
5. Military rank
   • General, Major, Captain, …
6. Number of people in a household
   • 1, 2, 3, 4, 5, …
7. Credit Score in United States: the range is [300, 850]
8.

15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Numerical or Categorical

Variable
• Categorical (qualitative)
  – Nominal (Unordered categories)
  – Ordinal (Ordered categories)
• Numerical (quantitative)
  – Discrete (Whole numerical values); Example: outcome of tossing a die
  – Continuous (Can take any value within a range); Example: temperature in a day

OR

Numerical (quantitative)
• Interval (may compute difference but no absolute zero)
• Ratio (may compute difference, real zero exists)

17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Missing Data
• Missing data: data that is missing and you do not know
the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline

• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
• Data wrangling
– The process of transforming and mapping data from one "raw" data
form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes
such as analytics.
– In short, transforms data to gain insight
– It is a general process!

Credit:https://en.wikipedia.org/wiki/Data_wrangling
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
General steps:
1. Determine your goal
2. Make the most of the dataset (example: use data from a social network stored as graphs)
3. Remove invalid data
4. Ensure data correctness
5. Use data (feature extraction / training / test)

Example: Collect Human Face Images for a Face Detector
1. Unify format to .png
2. Remove noisy samples
3. Check that all images have labels
4. Use data for feature extraction and training your face detector!

Credit: https://understandingdata.com/what-is-data-wrangling/
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Formatting Data

• Binary Coding to convert categories into binary form


– One-hot encoding: unify several entities within one vector
• Example: the color of a pixel can be red, yellow, or green
• Very common in classification tasks!

• Normalization
– Linear Scaling:
scale each variable to [0 1]

– Z-score standardization:
each independent dimension
of data is normally distributed
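A small Python sketch of these formatting steps; the category list and sample values below are invented for illustration:

```python
import numpy as np

# One-hot encoding: map each category to a binary indicator vector.
colors = ["red", "yellow", "green"]
onehot = {c: np.eye(len(colors))[i] for i, c in enumerate(colors)}
print(onehot["yellow"])          # [0. 1. 0.]

x = np.array([2.0, 5.0, 9.0, 11.0])

# Linear (min-max) scaling to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation per dimension.
x_z = (x - x.mean()) / x.std()

print(x_minmax, x_z)
```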

23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example
• Scaling to a range
  – When the bounds or range of each independent dimension of data is known,
    a common normalization technique is min-max.
• Feature clipping
  – Clipping outliers
https://developers.google.com/machine-learning/data-prep/transform/normalization

Data Cleaning
• The process of detecting and correcting (or removing) corrupt or inaccurate
  records from a record set, table, or database.
• Example:
  – Clipping outliers
  – Handling missing features (e.g., entries marked NA)

24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features

1. Removing the examples with missing features from the


dataset
– Can be done if the dataset is big enough so we can sacrifice some
training examples

2. Using a learning algorithm that can deal with missing


feature values
– Example: random forest

3. Using a data imputation technique

25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features: Imputation

• Method 1. Replace the missing value of a feature by an


average value of this feature in the dataset:

• Method 2. Highlight the missing value


– Replace the missing value with a value outside the normal range of
values.
– For example, if the normal range is [0, 1], then you can set the
missing value to −1.
– Enforce the learning algorithm to learn what is best to do when the
feature has a value significantly different from regular values.
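A small sketch of both imputation methods on one feature column with NA values; the sentinel value −1 follows the slide’s example of a value outside the normal range [0, 1]:

```python
import numpy as np

x = np.array([0.2, np.nan, 0.7, 0.4, np.nan])   # NA entries stored as NaN

# Method 1: replace missing values with the average of the observed values.
x_mean_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# Method 2: highlight missing values with a value outside the normal range.
x_flagged = np.where(np.isnan(x), -1.0, x)

print(x_mean_imputed)   # NaNs replaced by the feature mean (~0.433)
print(x_flagged)        # NaNs replaced by -1.0
```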

26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline

• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity
• Data integrity is the maintenance and the assurance of data accuracy
and consistency;
– A critical aspect to the design, implementation, and usage of any system that stores,
processes, or retrieves data.
– Very broad concept!
• Example:
• In a dataset, numeric columns/cells should not accept alphabetic data.
• A binary entry should only allow binary inputs

We can only select


one of these

28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity

29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Visualization

Graphical
Representation
of data!

30
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example: Showing Distribution

Visualization: Distribution / Visualization: Bars
(Figures: (a) a probability mass function, (b) a probability density function)


31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Visualization: Boxplots

Maximum (100th percentile)

y-axis Third Quartile (75th percentile)

Median (50th percentile)

First Quartile (25th percentile)

Minimum (0th percentile)

• The first quartile (Q1) is defined as the middle number between the smallest number (i.e.,
Minimum) and the median of the data set.

• The third quartile (Q3) is the middle number between the median and the highest value (i.e.,
Maximum) of the data set.

32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Why Visualization is Necessary

Four datasets with


identical means,
variances and
regression lines!

Hence, we need
visualization to show
their difference!

33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• Types of data
– NOIR
• Data wrangling and cleaning

• Data integrity and visualization


– Integrity: Design
– Visualization: Graphical Representation

34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)

What are the NOIR data types of


color, size, and shape in the table?

What about their label, yes/no?

35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NUS
National University
of Singapore

THANK YOU.

36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
EE2211 Introduction to
Machine Learning
Lecture 3

Wang Xinchao
[email protected]

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary of Lec 2
• Types of data
– NOIR
• Data wrangling and cleaning

• Data integrity and visualization


– Integrity: Design
– Visualization: Graphical Representation

3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Yueming’s part will follow up
• Causality and Simpson’s paradox
– Understanding at intuitive level is sufficient
• Random Variable, Bayes’ Rule

4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
(Very Gentle) Introduction to Linear Algebra

• A scalar is a simple numerical value, like 15 or −3.25


– Focus on real numbers
• Variables or constants that take scalar values are
denoted by an italic letter, like x or a

5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• A vector is an ordered list of scalar values
– Denoted by a bold character, e.g. x or a

• In many books, vectors are written column-wise:

• The three vectors above are two-dimensional, or have


two elements

6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• We denote an entry or attribute of a vector as an italic value with an index, e.g. 𝑎^(j) or 𝑥^(j).
  – The index j denotes a specific dimension of the vector, the position of an attribute in the list

• Note:
  – 𝑥^(j) is not to be confused with the power operation, e.g., 𝑥² (squared)
  – Square of an indexed attribute of a vector is denoted as (𝑥^(j))².

7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• Vectors can be visualized as, in a multi-dimensional space,
– arrows that point to some directions, or
– points

8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• A matrix is a rectangular array of numbers arranged in rows and
columns
– Denoted with bold capital letters, such as X or W
– An example of a matrix with two rows
and three columns:

• A set is an unordered collection of unique elements


– When an element 𝑥 belongs to a set 𝑺, we write 𝑥 ∈ 𝑺.
– A special set denoted R includes all real numbers from minus infinity to
plus infinity

• Note:
– For elements in matrix X, we shall use the indexing 𝑥𝑖,𝑗 where the first and
second indices indicate the row and the column position.
– Usually, for input data, rows represent samples and columns represent
features
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices

10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems of Linear Equations
Linear dependence and independence

Note: If all rows or columns of a square matrix X are linearly


independent, then X is invertible.

11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems of Linear Equations

Geometry of dependency and independency

(Left figure: vectors 𝒂 and 𝒃 in the 𝑥1–𝑥2 plane, linearly dependent: 𝛽1𝒂 + 𝛽2𝒃 = 0)
(Right figure: vectors 𝒂, 𝒃, 𝒄 in 𝑥1–𝑥2–𝑥3 space, linearly independent: 𝛽1𝒂 + 𝛽2𝒃 ≠ 𝛽3𝒄)

12
© Copyright EE, NUS. All Rights Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems of Linear Equations

13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Exercises
• The principled way for computing rank is to do Echelon Form
– https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
• For small-size matrices, however, the rank is in many cases easy to
estimate

• What is the rank of


1 2
2 1

1 1
100 100

1 −2 3
0 −3 3
1 1 0
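These small cases can be checked directly with NumPy (a sketch; numpy.linalg.matrix_rank uses the SVD rather than echelon form, but returns the same rank):

```python
import numpy as np

print(np.linalg.matrix_rank(np.array([[1, 2], [2, 1]])))        # 2
print(np.linalg.matrix_rank(np.array([[1, 1], [100, 100]])))    # 1 (rows are proportional)
print(np.linalg.matrix_rank(np.array([[1, -2, 3],
                                      [0, -3, 3],
                                      [1,  1, 0]])))            # 2 (row1 - row2 = row3)
```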
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule

15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• Causality, or causation is:
– The influence by which one event or process (i.e., cause)
contributes to another (i.e. effect),
– The cause is partly responsible for the effect, and the effect is
partly dependent on the cause

• Causality relates to an extremely wide domain of subjects:


philosophy, science, management, humanity.

• Causality research is extremely complex


– Researcher can never be completely certain that there are no
other factors influencing the causal relationship,
– In most cases, we can only say “probably” causal.

16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• (Probable) causal relations or non-causal?
– New web design implemented ? Web page traffic increased
– Your height and weight ? Gets A in EE2211
– Uploaded new app store images ? Downloads increased by 2X
– One works hard and attends lectures/tutorials ? Gets A in EE2211
– Your favorite color ? Your GPA in NUS

17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• One popular way to causal data analysis is Randomized
Controlled Trial (RCT)
– A study design that randomly assigns participants into an experimental
group or a control group.
– As the study is conducted, the only expected difference between two
groups is the outcome variable being studied.

• Example:
– To decide whether smoking and lung cancer has a causal relation, we put
participants into experimental group (people who smoke) and control group
(people who don’t smoke), and check whether they develop lung cancer
eventually.

• RCT is sometimes infeasible to conduct, and also has moral


issues.

18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality is a statistical relationship

• Decades of data show a clear causal relationship between smoking and cancer.
• If one smokes, it is a sure thing that his/her risk of cancer will increase.
• But it is not a sure thing that one will get cancer.
• The causal effect is real, but it is an effect on the average risk; the relationship is not deterministic.

19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation (vs Causality)
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.

20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation (vs Causality)

• In statistics, correlation is any statistical relationship, whether causal or not, between two random variables.
• Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
• Linear correlation coefficient, r, which is also known as the Pearson Coefficient.
  – The same holds for negative values.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html

21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation does not imply causation!

• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are clearly not causally related appear at http://tylervigen.com/ (see figure below).
(See figure below).

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.

22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Simpson’s paradox
• Simpson's paradox is a phenomenon in probability and
statistics, in which a trend appears in several different
groups of data but disappears or reverses when these
groups are combined.

The same set of samples!


23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example
• Batting Average in professional baseball game
• Two well-known players, Derek Jeter and David Justice

#of wins #of games

24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule

25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Probability
• We describe a random experiment by describing its procedure and observations of its outcomes.
• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment.
  – This also means an outcome is not decomposable.
  – All unique outcomes form a sample space.
• A subset of the sample space 𝑆, denoted as 𝐴, is an event in a random experiment, 𝐴 ⊂ 𝑆, that is meaningful to an application.
  – Example of an event: faces with numbers no greater than 3

26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Axioms of Probability

https://en.wikipedia.org/wiki/Union_(set_theory)
https://en.wikipedia.org/wiki/Intersection_(set_theory)

27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random event.

• There are two types of random variables: discrete and


continuous.

28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations
• Some books used P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.

• We shall use Pr(·) for both the above cases

29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete random variable

• A discrete random variable (DRV) takes on only a countable number of distinct values such as red, yellow, blue or 1, 2, 3, . . ..
• The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values.
• This list of probabilities is called a probability mass function (pmf).
  – Like a histogram, except that here the probabilities sum to 1.

(Figure: a probability mass function)

Ref: Book 1, Chapter 2.2.
30
© Copyright EE, NUS. All Rights Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete random variable

31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete random variable

32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random variable

• A continuous random variable (CRV) takes an infinite number of possible values in some interval.
  – Examples include height, weight, and time.
  – Because the number of values of a continuous random variable X is infinite, the probability Pr(X = c) for any c is 0.
  – Therefore, instead of the list of probabilities, the probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf).
  – The pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1.

(Figure: a probability density function)
EE2211 Introduction to Machine Learning 30
Copyright EE, NUS. All Rights Reserved.
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random variable

34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Mean and Standard Deviation of a Gaussian Distribution

(Figure: a 2-D Gaussian centered at the mean 𝜇 in the 𝑥1–𝑥2 plane, with contours enclosing 90% and 95% of the probability mass)

35
© Copyright EE, NUS. All Rights Reserved.
Example 1
• Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)? Assuming a coin has two sides, H=head
and T=Tail
– Pr(x=H, y=H) = Pr(x=H)Pr(y=H) = (1/2)(1/2) = 1/4

36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 2
• Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first drawing B and then R? Assuming
we are drawing the balls without replacement.

• The space of outcomes of taking two balls sequentially


without replacement:
B–R, R–B
– Thus having B-R is 1⁄2 .

• Mathematically:
– Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1×(1/2) = 1/2
Conditional Probability

37
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 3
• Dependent random variables
• Given 3 balls with different colors (R,G,B), and we draw 2
balls. What is the probability of first having B and then G,
if we draw without replacement?
• The space of outcomes of taking two balls sequentially
without replacement:
R–G | G–B | B–R
R–B | G–R | B–G Thus, Pr(y=G, x=B) = 1/6
• Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B)
= (1/2) × (1/3)
= 1/6

38
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Two Basic Rules
• Sum Rule
  Pr(𝑋 = 𝑥) = ∑ⱼ Pr(𝑋 = 𝑥, 𝑌 = 𝑦ⱼ)

• Product Rule
  Pr(𝑋 = 𝑥, 𝑌 = 𝑦) = Pr(𝑌 = 𝑦 | 𝑋 = 𝑥) Pr(𝑋 = 𝑥)

39
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Bayes’ Rule
• The conditional probability Pr(𝑌 = 𝑦|𝑋 = 𝑥) is the
probability of the random variable 𝑌 to have a specific
value 𝑦, given that another random variable 𝑋 has a
specific value of 𝑥.

• The Bayes’ Rule (also known as the Bayes’ Theorem):


  Pr(𝑌 = 𝑦 | 𝑋 = 𝑥) = Pr(𝑋 = 𝑥 | 𝑌 = 𝑦) Pr(𝑌 = 𝑦) / Pr(𝑋 = 𝑥)

  (posterior = likelihood × prior / evidence)

40
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example

• Drawing a sample of fruit from a box
  – First pick a box, and then draw a sample of fruit from it
  – B: random variable for the box picked, B = {blue(b), red(r)}
  – F: random variable for the identity of the fruit, F = {apple(a), orange(o)}
• Events are mutually exclusive and include all possible outcomes
  – Their probabilities must sum to 1, e.g. Pr(B=r) = 0.4 and Pr(B=b) = 0.6

• Pr(B=r) = 0.4            prior
• Pr(F=o | B=r) = 0.75     likelihood
• Pr(F=o) = 0.45           evidence

• Pr(B=r | F=o) = Pr(F=o | B=r) · Pr(B=r) / Pr(F=o)
                = 0.75 × 0.4 / 0.45 = 2/3      posterior
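The same computation in a couple of lines of Python, just to verify the arithmetic:

```python
p_B_r = 0.4              # prior       Pr(B=r)
p_Fo_given_Br = 0.75     # likelihood  Pr(F=o | B=r)
p_Fo = 0.45              # evidence    Pr(F=o)

p_Br_given_Fo = p_Fo_given_Br * p_B_r / p_Fo   # Bayes' rule
print(p_Br_given_Fo)     # 0.666... = 2/3 (posterior)
```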

41
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule

42
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)

Suppose the random variable X has the following probability


mass function (pmf) listed in the table below. k is unknown.

X 1 2 3 4 5
Pr[X] 0.1 0.05 0.05 0.6 k

What is the probability that X takes a value of odd numbers?

43
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NUS
National University
of Singapore

THANK YOU.

44
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
EE2211 Introduction to Machine
Learning
Lecture 4
Semester 2
2024/2025

Yueming Jin
[email protected]

Electrical and Computer Engineering Department


National University of Singapore

© Copyright EE, NUS. All Rights Reserved.


Welcome to EE2211
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability, Statistics, and Matrix
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Welcome to EE2211
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest

• 3 Assignments
– Assignment 1: released on Week 4 Friday, due on Week 6 Friday
– Assignment 2: released on Week 6 Friday, due on Week 9 Wednesday
– Assignment 3: released on Week 9 Friday, due on Week 13 Friday
• Office hour via zoom: Monday 9:30-10:30am (Week 5-10)

3
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

4
© Copyright EE, NUS. All Rights Reserved.
Fundamental ML Algorithms:
Linear Regression
References for Lectures 4-6:
Main
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019.
(read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017

Supplementary
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for people
who want to analyze data”, Lean Publishing, 2015.
• [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied
Linear Algebra”, Cambridge University Press, 2018 (available online)
http://vmls-book.stanford.edu/
• [Ref 5] Professor Vincent Tan’s notes (chapters 4-6): (useful)
https://vyftan.github.io/papers/ee2211book.pdf

5
© Copyright EE, NUS. All Rights Reserved.
Recap on Notations, Vectors, Matrices
Scalar          Numerical value                          15, −3.5
Variable        Takes scalar values                      x or a
Vector          An ordered list of scalar values         x or 𝐚
                Attributes of a vector                   𝐚 = [𝑎1, 𝑎2]ᵀ = [2, 3]ᵀ
Matrix          A rectangular array of numbers           𝐗 = [[2, 4], [21, −6]]
                arranged in rows and columns
Capital Sigma   ∑_{i=1}^{m} 𝑥𝑖 = 𝑥1 + 𝑥2 + … + 𝑥𝑚−1 + 𝑥𝑚
Capital Pi      ∏_{i=1}^{m} 𝑥𝑖 = 𝑥1 · 𝑥2 · … · 𝑥𝑚−1 · 𝑥𝑚
6
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: summation and subtraction

𝑥1 𝑦1 𝑥1 + 𝑦1
𝐱+𝐲= 𝑥 + 𝑦 = 𝑥 +𝑦
2 2 2 2

𝑥1 𝑦1 𝑥1 − 𝑦1
𝐱−𝐲= 𝑥 − 𝑦 = 𝑥 −𝑦
2 2 2 2

7
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: scalar

𝑥1 𝑎𝑥1
𝑎 𝐱 = 𝑎 𝑥 = 𝑎𝑥
2 2

1
𝑥1 𝑥
1 1 𝑎 1
𝑎
𝐱=
𝑎 𝑥2 = 1
𝑥
𝑎 2

8
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix or Vector Transpose:
𝑥1
𝐱= 𝑥 , 𝐱𝑇 = 𝑥1 𝑥2
2
𝑥1,1 𝑥1,2 𝑥1,3 𝑥1,1 𝑥2,1 𝑥3,1
𝐗 = 𝑥2,1 𝑥2,2 𝑥2,3 , 𝐗 𝑇 = 𝑥1,2 𝑥2,2 𝑥3,2
𝑥3,1 𝑥3,2 𝑥3,3 𝑥1,3 𝑥2,3 𝑥3,3

Python demo 1

9
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Dot Product or Inner Product of Vectors:
𝐱 · 𝐲 = 𝐱ᵀ𝐲 = [𝑥1 𝑥2] [𝑦1, 𝑦2]ᵀ = 𝑥1𝑦1 + 𝑥2𝑦2

Geometric definition:
𝐱 · 𝐲 = ‖𝐱‖ ‖𝐲‖ cos𝜃

where 𝜃 is the angle between 𝐱 and 𝐲,
and ‖𝐱‖ = √(𝐱 ⋅ 𝐱) is the Euclidean length of vector 𝐱

E.g. 𝐚 = [2, 3]ᵀ, 𝐜 = [1, 0]ᵀ  ➔  𝐚 · 𝐜 = 2*1 + 3*0 = 2
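The same operations in NumPy (a sketch; np.dot and the @ operator both compute the inner product):

```python
import numpy as np

a = np.array([2, 3])
c = np.array([1, 0])

print(a @ c)                 # 2 (dot / inner product)
print(np.linalg.norm(a))     # Euclidean length ||a|| = sqrt(13)

# Cosine of the angle between a and c, from the geometric definition.
cos_theta = (a @ c) / (np.linalg.norm(a) * np.linalg.norm(c))
print(cos_theta)
```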
10
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Vector Product

𝑤1,1 𝑤1,2 𝑤1,3 𝑥1


𝐖𝐱 = 𝑤2,1 𝑤2,2 𝑤2,3 𝑥2
𝑥3

𝑤1,1 𝑥1 + 𝑤1,2 𝑥2 + 𝑤1,3 𝑥3


= 𝑤2,1 𝑥1 + 𝑤2,2 𝑥2 + 𝑤2,3 𝑥3

11
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Vector-Matrix Product

𝑤1,1 𝑤1,2 𝑤1,3


𝐱 𝑇 𝐖 = 𝑥1 𝑥2 𝑤2,1 𝑤2,2 𝑤2,3
= (𝑥1 𝑤1,1 + 𝑥2 𝑤2,1) (𝑥1𝑤1,2 + 𝑥2 𝑤2,2 ) (𝑥1𝑤1,3 + 𝑥2𝑤2,3 )

12
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Matrix Product

𝑥1,1 … 𝑥1,𝑑 𝑤1,1 … 𝑤1,ℎ


𝐗𝐖 = ⁞ ⋱ ⁞ ⁞ ⋱ ⁞
𝑥𝑚,1 … 𝑥𝑚,𝑑 𝑤𝑑,1 … 𝑤𝑑,ℎ

(𝑥1,1 𝑤1,1 + ⋯ + 𝑥1,𝑑 𝑤𝑑,1 ) … (𝑥1,1 𝑤1,ℎ + ⋯ + 𝑥1,𝑑 𝑤𝑑,ℎ )


= ⁞ ⋱ ⁞
(𝑥𝑚,1 𝑤1,1 + ⋯ + 𝑥𝑚,𝑑 𝑤𝑑,1 ) … (𝑥𝑚,1 𝑤1,ℎ + ⋯ + 𝑥𝑚,𝑑 𝑤𝑑,ℎ )

∑𝑑𝑖=1 𝑥1,𝑖 𝑤𝑖,1 … ∑𝑑𝑖=1 𝑥1,𝑖 𝑤𝑖,ℎ


= ⁞ ⋱ ⁞
∑𝑑𝑖=1 𝑥𝑚,𝑖 𝑤𝑖,1 … ∑𝑑𝑖=1 𝑥𝑚,𝑖 𝑤𝑖,ℎ

If X is m x d and W is d x h, then the outcome is a m x h matrix


13
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix inverse

Definition:
A d-by-d square matrix A is invertible (also nonsingular)
if there exists a d-by-d square matrix B such that
𝐀𝐁 = 𝐁𝐀 = 𝐈 (identity matrix)

1 0…0 0
0 1 0 0
𝐈= ⁞ ⋱ ⁞ d-by-d dimension
0 0 1 0
0 0…0 1

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

14
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix inverse computation
−1
1
𝐀 = adj(𝐀)
det 𝐀
• det 𝐀 is the determinant of 𝐀
• adj(𝐀) is the adjugate or adjoint of 𝐀

Determinant computation
Example: 2x2 matrix
𝑎 𝑏
𝐀=
𝑐 𝑑

𝑎 𝑏
det 𝐀 = |𝐀| = = 𝑎𝑑 − 𝑏𝑐
𝑐 𝑑
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

15
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
• adj(𝐀) is the adjugate or adjoint of 𝐀
• adj(𝐀) is the transpose of the cofactor matrix 𝐂 of 𝐀 → adj(A)= CT
• Minor of an element in a matrix 𝐀 is defined as the determinant
obtained by deleting the row and column in which that element lies
𝑎11 𝑎12 𝑎13 𝑎21 𝑎23
A= 𝑎 𝑎
21 22 𝑎23 Minor of a12 is 𝑀12 =
𝑎31 𝑎33
𝑎31 𝑎32 𝑎33

• The (𝑖, 𝑗) entry of the cofactor matrix 𝐂 is the minor of the (𝑖, 𝑗) element times a sign factor:
  Cofactor 𝐶𝑖𝑗 = (−1)^(𝑖+𝑗) 𝑀𝑖𝑗

• The determinant of 𝐀 can also be defined by minors as
  det(𝐀) = ∑_{j=1}^{k} 𝑎𝑖𝑗 𝐶𝑖𝑗 = ∑_{j=1}^{k} (−1)^(𝑖+𝑗) 𝑎𝑖𝑗 𝑀𝑖𝑗

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

16
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Recap: Minor of 𝑎12 is 𝑀12 = det([[𝑎21, 𝑎23], [𝑎31, 𝑎33]]);  Cofactor 𝐶𝑖𝑗 = (−1)^(𝑖+𝑗) 𝑀𝑖𝑗;
adj(𝐀) = 𝐂ᵀ;  det(𝐀) = ∑_{j=1}^{k} (−1)^(𝑖+𝑗) 𝑎𝑖𝑗 𝑀𝑖𝑗

• E.g. 𝐀 = [[𝑎, 𝑏], [𝑐, 𝑑]]
  𝐂 = [[𝑑, −𝑐], [−𝑏, 𝑎]]
• adj(𝐀) = 𝐂ᵀ = [[𝑑, −𝑏], [−𝑐, 𝑎]],   det(𝐀) = |𝐀| = 𝑎𝑑 − 𝑏𝑐
  𝐀⁻¹ = (1/det 𝐀) adj(𝐀) = (1/(𝑎𝑑−𝑏𝑐)) [[𝑑, −𝑏], [−𝑐, 𝑎]]
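A quick NumPy check of the 2x2 formula (a sketch; the numbers are arbitrary):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]        # ad - bc = 10
adj = np.array([[ A[1, 1], -A[0, 1]],
                [-A[1, 0],  A[0, 0]]])             # adjugate [[d, -b], [-c, a]]

print(adj / det)            # inverse from the formula
print(np.linalg.inv(A))     # same result from NumPy
```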
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

17
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Determinant computation:  det(𝐀) = ∑_{j=1}^{k} (−1)^(𝑖+𝑗) 𝑎𝑖𝑗 𝑀𝑖𝑗

Example: 3x3 matrix 𝐀 = [[a, b, c], [d, e, f], [g, h, i]], use the first row (i = 1):
det(𝐀) = a(ei − fh) − b(di − fg) + c(dh − eg)

Python demo 2 Ref: https://en.wikipedia.org/wiki/Determinant

18
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎22 𝑎23
The minor of 𝑎11 = 𝑎 𝑎33
32

Ref: https://en.wikipedia.org/wiki/Determinant

19
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎21 𝑎23
The minor of 𝑎12 = 𝑎 𝑎33
31

Ref: https://en.wikipedia.org/wiki/Determinant

20
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎11 𝑎13
The minor of 𝑎22 = 𝑎 𝑎33
31

adj(A)= CT

𝑖 +𝑗
det(A)= ∑𝑘𝑗=1 = 𝑎𝑖𝑗Cij = −1 𝑎𝑖𝑗𝑀𝑖𝑗
.
1
𝐀−1 = adj(𝐀)
det 𝐀

Ref: https://en.wikipedia.org/wiki/Determinant

21
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Example
Find the cofactor matrix of 𝐀 given that 𝐀 = [[1, 2, 3], [0, 4, 5], [1, 0, 6]].
Solution:
𝑎11 ⇒ det[[4, 5], [0, 6]] = 24,    𝑎12 ⇒ −det[[0, 5], [1, 6]] = 5,    𝑎13 ⇒ det[[0, 4], [1, 0]] = −4,
𝑎21 ⇒ −det[[2, 3], [0, 6]] = −12,  𝑎22 ⇒ det[[1, 3], [1, 6]] = 3,     𝑎23 ⇒ −det[[1, 2], [1, 0]] = 2,
𝑎31 ⇒ det[[2, 3], [4, 5]] = −2,    𝑎32 ⇒ −det[[1, 3], [0, 5]] = −5,   𝑎33 ⇒ det[[1, 2], [0, 4]] = 4.

The cofactor matrix 𝐂 is thus [[24, 5, −4], [−12, 3, 2], [−2, −5, 4]].

Ref: https://www.mathwords.com/c/cofactor_matrix.htm

22
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

23
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

• Consider a system of m linear equations with d


variables or unknowns 𝑤1, … , 𝑤𝑑 :

𝑥1,1 𝑤1 + 𝑥1,2 𝑤2 + ⋯ + 𝑥1,𝑑 𝑤𝑑 = 𝑦1


𝑥2,1 𝑤1 + 𝑥2,2 𝑤2 + ⋯ + 𝑥2,𝑑 𝑤𝑑 = 𝑦2

𝑥𝑚,1 𝑤1 + 𝑥𝑚,2 𝑤2 + ⋯ + 𝑥𝑚,𝑑 𝑤𝑑 = 𝑦𝑚 .

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

24
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

Note:
• The data matrix 𝐗 ∈ 𝓡𝑚×𝑑 and the target vector 𝐲 ∈ 𝓡𝑚 are given
• The unknown vector of parameters 𝐰 ∈ 𝓡𝑑 is to be learnt

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

25
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
A set of linear equations can have no solution, one
solution, or multiple solutions:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

𝐗 is Square Even-determined m=d Equal number of equations


and unknowns
𝐗 is Tall Over-determined m>d More number of equations
than unknowns
𝐗 is Wide Under-determined m<d Fewer number of equations
than unknowns
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to
Applied Linear Algebra”, (Chp8.3 & 11) & [Ref 5] Tan’s notes, (Chp 4)

26
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
1. Square or even-determined system: 𝒎 = 𝒅
- Equal number of equations and unknowns, i.e., 𝐗 ∈ 𝓡𝑑×𝑑
- One unique solution if 𝐗 is invertible or all rows/columns of 𝐗 are
linearly independent
- If all rows or columns of 𝐗 are linearly independent, then 𝐗 is
invertible.

Solution:
If 𝐗 is invertible (or 𝐗 −1 𝐗 = 𝐈 ), then pre-multiply both sides by 𝐗 −1
𝐗 −1 𝐗 𝐰 = 𝐗 −1 𝐲
⇒ ෝ = 𝐗 −1 𝐲
𝐰
(Note: we use a hat on top of 𝐰 to indicate that it is a specific point in the space of 𝐰)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11)

27
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 1:  𝑤1 + 𝑤2 = 4    (1)    Two unknowns
            𝑤1 − 2𝑤2 = 1   (2)    Two equations

𝐗 = [[1, 1], [1, −2]],  𝐰 = [𝑤1, 𝑤2]ᵀ,  𝐲 = [4, 1]ᵀ

ŵ = 𝐗⁻¹𝐲 = [[1, 1], [1, −2]]⁻¹ [4, 1]ᵀ
  = (−1/3) [[−2, −1], [−1, 1]] [4, 1]ᵀ = [3, 1]ᵀ

(Recall: 𝐀⁻¹ = (1/det 𝐀) adj(𝐀),  adj(𝐀) = 𝐂ᵀ = [[𝑑, −𝑏], [−𝑐, 𝑎]],  det(𝐀) = 𝑎𝑑 − 𝑏𝑐)

Python demo 3
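Example 1 in NumPy. The slides reference a Python demo here; this is only a sketch of one way it could be computed, using either the explicit inverse or np.linalg.solve:

```python
import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -2.0]])
y = np.array([4.0, 1.0])

w_hat = np.linalg.inv(X) @ y      # w_hat = X^{-1} y
print(w_hat)                      # [3. 1.]
print(np.linalg.solve(X, y))      # same answer, numerically preferred
```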
28
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
2. Over-determined system: 𝒎 > 𝒅
– More equations than unknowns
– 𝐗 is non-square (tall) and hence not invertible
– Has no exact solution in general *
– An approximated solution is available using the left inverse
If the left-inverse of 𝐗 exists such that 𝐗 †𝐗 = 𝐈, then pre-multiply both
sides by 𝐗 † results in
𝐗†𝐗 𝐰 = 𝐗†𝐲
⇒𝐰ෝ = 𝐗†𝐲
Definition:
A matrix B that satisfies 𝑩𝒅 𝒙 𝒎𝑨𝒎 𝒙 𝒅 = 𝐈 is called a left-inverse of 𝐀.
The left-inverse of 𝐗: 𝐗 †= (𝐗 𝑇 𝐗)−𝟏 𝐗 𝑇 given 𝐗 𝑇 𝐗 is invertible.
Note: * exception: when rank(𝐗) = rank([𝐗,𝐲]), there is a solution.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

29
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 2:  𝑤1 + 𝑤2 = 1    (1)    Two unknowns
            𝑤1 − 𝑤2 = 0    (2)    Three equations
            𝑤1 = 2         (3)

𝐗 = [[1, 1], [1, −1], [1, 0]],  𝐰 = [𝑤1, 𝑤2]ᵀ,  𝐲 = [1, 0, 2]ᵀ

No exact solution; approximated solution:
ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [[3, 0], [0, 2]]⁻¹ [[1, 1, 1], [1, −1, 0]] [1, 0, 2]ᵀ = [1, 0.5]ᵀ
(𝐗ᵀ𝐗 is invertible)

Python demo 4
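Example 2 in NumPy. Again a sketch of what the referenced Python demo might look like; the left-inverse formula and np.linalg.lstsq agree on this over-determined system:

```python
import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 0.0, 2.0])

w_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # (X^T X)^{-1} X^T y
print(w_hat)                                  # [1.  0.5]

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solver
print(w_lstsq)                                    # [1.  0.5]
```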
30
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
3. Under-determined system: 𝒎 < 𝒅
– More unknowns than equations
– Infinite number of solutions in general *

If the right-inverse of 𝐗 exists such that 𝐗𝐗 † = 𝐈, then the 𝑑-vector


𝐰 = 𝐗 †𝐲 (one of the infinite cases) satisfies the equation 𝐗𝐰 = 𝐲, i.e.,
𝐗𝐰 = 𝐲 ⇒ 𝐗𝐗 †𝐲 = 𝐲
⇒ 𝐈𝐲 = 𝐲
Definition:
A matrix B that satisfies 𝐀𝒎 𝒙 𝒅𝐁𝒅 𝒙 𝒎 = 𝐈 is called a right-inverse of 𝐀.
The right-inverse of 𝐗: 𝐗 †= 𝐗 𝑇 (𝐗𝐗𝑇 )−1 given 𝐗𝐗𝑇 is invertible.
If 𝐗 is right−invertible, we can find a unique constrained solution.

Note: * exception: no solution if the system is inconsistent rank(𝐗) < rank([𝐗,𝐲]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

31
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
3. Under-determined system: 𝒎 < 𝒅
Derivation:
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

A unique solution is yet possible by constraining the search using


𝐰 = 𝐗𝑇 𝐚

If 𝐗𝐗 𝑇 is invertible, let 𝐰 = 𝐗 𝑇 𝐚, then


𝐗𝐗 𝑇 𝐚 = 𝐲
⇒ 𝐚ො = (𝐗𝐗𝑇 )−1 𝐲
⇒𝐰ෝ = 𝐗 𝑇 𝐚ො = 𝐗 𝑇 (𝐗𝐗𝑇 )−1 𝐲

𝐗†
right-inverse

32
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 3:  𝑤1 + 2𝑤2 + 3𝑤3 = 2    (1)    Three unknowns
            𝑤1 − 2𝑤2 + 3𝑤3 = 1    (2)    Two equations

𝐗 = [[1, 2, 3], [1, −2, 3]],  𝐰 = [𝑤1, 𝑤2, 𝑤3]ᵀ,  𝐲 = [2, 1]ᵀ

Infinitely many solutions along the intersection line; here 𝐗𝐗ᵀ is invertible.

ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲
  = [[1, 1], [2, −2], [3, 3]] [[14, 6], [6, 14]]⁻¹ [2, 1]ᵀ
  = [0.15, 0.25, 0.45]ᵀ      (Constrained solution)
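Example 3 in NumPy (a sketch; the right-inverse gives the constrained, minimum-norm solution, which is the same one np.linalg.pinv returns):

```python
import numpy as np

X = np.array([[1.0,  2.0, 3.0],
              [1.0, -2.0, 3.0]])
y = np.array([2.0, 1.0])

w_hat = X.T @ np.linalg.inv(X @ X.T) @ y      # X^T (X X^T)^{-1} y
print(w_hat)                                  # [0.15 0.25 0.45]
print(np.linalg.pinv(X) @ y)                  # same constrained (min-norm) solution
```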

33
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Example 4:  𝑤1 + 2𝑤2 + 3𝑤3 = 2    (1)    Three unknowns
            3𝑤1 + 6𝑤2 + 9𝑤3 = 1   (2)    Two equations

𝐗 = [[1, 2, 3], [3, 6, 9]],  𝐰 = [𝑤1, 𝑤2, 𝑤3]ᵀ,  𝐲 = [2, 1]ᵀ

Both 𝐗𝐗ᵀ and 𝐗ᵀ𝐗 are not invertible!
There is no solution for the system.
34
© Copyright EE, NUS. All Rights Reserved.
Quick check 3*questions - Poll on PollEv.com/ymjin

Just “skip” if you are required to do registration

35
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

36
© Copyright EE, NUS. All Rights Reserved.
Notations: Set
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character e.g., 𝓢, 𝓡, 𝓝 etc
– When an element 𝑥 belongs to a set 𝑺, we write 𝑥 ∈ 𝓢
• A set of numbers can be finite - include a fixed amount of values
– Denoted using accolades, e.g. {1, 3, 18, 23, 235} or {𝑥1, 𝑥2, 𝑥3 , 𝑥4, . . . , 𝑥𝑑 }
• A set can be infinite and include all values in some interval
– If a set of real numbers includes all values between a and b, including a and
b, it is denoted using square brackets as [a, b]
– If the set does not include the values a and b, it is denoted using
parentheses as (a, b)
• Examples:
– The special set denoted by 𝓡 includes all real numbers from minus infinity
to plus infinity
– The set [0, 1] includes values like 0, 0.0001, 0.25, 0.9995, and 1.0

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).

37
© Copyright EE, NUS. All Rights Reserved.
Notations: Set operations

• Intersection of two sets:


𝓢3 ← 𝓢1 ∩ 𝓢2
Example: {1,3,5,8} ∩ {1,8,4} = {1,8}

• Union of two sets:


𝓢3 ← 𝓢1 ∪ 𝓢2
Example: {1,3,5,8} ∪ {1,8,4} = {1,3,4,5,8}

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).
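As a quick aside, Python's built-in set type supports both operations directly (a small illustration of my own, not from the slides):

```python
# Set intersection and union in Python.
S1 = {1, 3, 5, 8}
S2 = {1, 8, 4}

print(S1 & S2)   # intersection: {1, 8}      (display order may vary)
print(S1 | S2)   # union: {1, 3, 4, 5, 8}
```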

38
© Copyright EE, NUS. All Rights Reserved.
Functions
• A function is a relation that associates each element 𝑥 of a set 𝓧,
the domain of the function, to a single element 𝑦 of another set 𝓨,
the codomain of the function
• If the function is called f, this relation is denoted 𝑦 = 𝑓(𝑥)
– The element 𝑥 is the argument or input of the function
– 𝑦 is the value of the function or the output
• The symbol used for representing the input is the variable of the
function
– 𝑓(𝑥): 𝑓 is a function of the variable 𝑥; 𝑓(𝑥, 𝑤): 𝑓 is a function of the variables 𝑥 and 𝑤

[Diagram: a mapping from the domain 𝓧 = {1,2,3,4} into the codomain 𝓨 = {1,2,3,4,5,6};
the range (or image) of the function is {3,4,5,6}]
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6 of chp2). 39
© Copyright EE, NUS. All Rights Reserved.
Functions

• A scalar function can have vector argument


– E.g. 𝑦 = 𝑓(𝐱) = 𝑥1 + 𝑥2 + 2𝑥3
• A vector function, denoted as 𝐲 = 𝐟(𝐱), is a function
that returns a vector 𝐲
– Input argument can be a vector 𝐲 = 𝐟(𝐱) or a scalar 𝐲 = 𝐟(𝑥)
– E.g. $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} -x_1 \\ x_2 \end{bmatrix}$ (vector input)
– E.g. $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} -2x_1 \\ 3x_1 \end{bmatrix}$ (scalar input)

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p7 of chp2).

40
© Copyright EE, NUS. All Rights Reserved.
Functions
• The notation 𝑓: 𝓡𝑑 → 𝓡 means that 𝑓 is a function that
maps real d-vectors to real numbers
– i.e., 𝑓 is a scalar-valued function of d-vectors
• If 𝐱 is a d-vector argument, then 𝑓 𝐱 denotes the value
of the function 𝑓 at 𝐱
– i.e., 𝑓 𝐱 = 𝑓 𝑥1 , 𝑥2 , … , 𝑥𝑑 , 𝐱 ∈ 𝓡𝑑 , 𝑓 𝐱 ∈ 𝓡

• Example: we can define a function 𝑓: 𝓡4 → 𝓡 by


$f(\mathbf{x}) = x_1 + x_2 - x_4^2$

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, 2018 (Ch 2, p29)

41
© Copyright EE, NUS. All Rights Reserved.
Functions

The inner product function


• Suppose 𝒂 is a d-vector. We can define a scalar valued function 𝑓 of d-
vectors, given by
$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} = a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$     (1)
for any d-vector 𝐱
• The inner product of its d-vector argument 𝐱 with some (fixed) d-vector 𝒂
• We can also think of 𝑓 as forming a weighted sum of the elements of 𝐱;
the elements of 𝒂 give the weights
[Figure: geometric illustration of the inner product, with angle 𝜃 between 𝒂 and 𝐱 and a projected length proportional to cos 𝜃]
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

42
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity
• For any d-vector 𝐱 and any scalar 𝛼, 𝑓(𝛼𝐱) = 𝛼𝑓(𝐱)
• Scaling the (vector) argument is the same as scaling the
function value

• Additivity
• For any d-vectors 𝐱 and 𝐲, 𝑓(𝐱 + 𝐲) = 𝑓(𝐱) + 𝑓(𝐲)
• Adding (vector) arguments is the same as adding the function
values

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

43
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions
Superposition and linearity
• The inner product function 𝑓(𝐱) = 𝒂𝑇𝐱 defined in equation (1)
(slide 42) satisfies the property
𝑓(𝛼𝐱 + 𝛽𝐲) = 𝒂𝑇(𝛼𝐱 + 𝛽𝐲)
= 𝒂𝑇(𝛼𝐱) + 𝒂𝑇(𝛽𝐲)
= 𝛼(𝒂𝑇𝐱) + 𝛽(𝒂𝑇𝐲)
= 𝛼𝑓(𝐱) + 𝛽𝑓(𝐲)
for all d-vectors 𝐱, 𝐲, and all scalars 𝛼, 𝛽.

• This property is called superposition, which consists of


homogeneity and additivity
• A function that satisfies the superposition property is called linear

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

44
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

• If a function 𝑓 is linear, superposition extends to linear
combinations of any number of vectors:
𝑓(𝛼1𝐱1 + ⋯ + 𝛼𝑘𝐱𝑘) = 𝛼1𝑓(𝐱1) + ⋯ + 𝛼𝑘𝑓(𝐱𝑘)
for any d-vectors 𝐱1, … , 𝐱𝑘, and any scalars 𝛼1, … , 𝛼𝑘.

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

45
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear and Affine Functions

A linear function plus a constant is called an affine function

A function 𝑓: 𝓡𝑑 → 𝓡 is affine if and only if it can be
expressed as 𝑓(𝐱) = 𝒂𝑇𝐱 + 𝑏 for some d-vector 𝒂 and scalar 𝑏,
which is called the offset (or bias)

Example:
𝑓 𝐱 = 2.3 − 2𝑥1 + 1.3𝑥2 − 𝑥3

This function is affine, with 𝑏 = 2.3, 𝒂𝑇 = [−2, 1.3, −1].

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p32)

46
© Copyright EE, NUS. All Rights Reserved.
Functions

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p33)

47
© Copyright EE, NUS. All Rights Reserved.
Summary
• Operations on Vectors and Matrices Assignment 1 (week 6 Fri)
• Dot-product, matrix inverse Tutorial 4
• Systems of Linear Equations 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Set and Functions
𝐗 is square (even-determined, m = d): one unique solution in general, $\hat{\mathbf{w}} = \mathbf{X}^{-1}\mathbf{y}$
𝐗 is tall (over-determined, m > d): no exact solution in general; an approximated solution $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (left-inverse)
𝐗 is wide (under-determined, m < d): infinite number of solutions in general; a unique constrained solution $\hat{\mathbf{w}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$ (right-inverse)

• Scalar and vector functions
• Inner product function
• Linear and affine functions

Python package numpy: inverse numpy.linalg.inv(X); transpose X.T

48
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 5
Semester 2
2024/2025

Yueming Jin
[email protected]

Electrical and Computer Engineering Department


National University of Singapore

© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Recap: Linear and Affine Functions

Linear Functions
A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱 Scaling
• Additivity 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲 Adding

Inner product function


𝑓 𝐱 = 𝒂𝑇 𝐱 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ 𝑎𝑑 𝑥𝑑

Affine function
𝑓 𝐱 = 𝒂𝑇 𝐱 + 𝑏 scalar 𝑏 is called the offset (or bias)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

4
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Functions: Maximum and Minimum
A local and a global minima of a function

• 𝑓(𝑥) has a local minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for every 𝑥 in some open interval around 𝑥 = 𝑐

• 𝑓(𝑥) has a global minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for all 𝑥 in the domain of 𝑓

[Figure: a curve plotted over an interval a < x ≤ b, with a local minimum and a global minimum marked]
Note: An interval is a set of real numbers with the property that any number that lies
between two numbers in the set is also included in the set.
An open interval does not include its endpoints and is denoted using parentheses. E.g.
(0, 1) means “all numbers greater than 0 and less than 1”.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

5
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values 𝓐 = {𝑎1 , 𝑎2 , … , 𝑎𝑚 },
• The operator $\max_{a\in\mathcal{A}} f(a)$ returns the highest value 𝑓(𝑎) over all elements in the set 𝓐
• The operator $\arg\max_{a\in\mathcal{A}} f(a)$ returns the element of the set 𝓐 that maximizes 𝑓(𝑎)
• When the set is implicit or infinite, we can write $\max_{a} f(a)$ or $\arg\max_{a} f(a)$
E.g. 𝑓(𝑎) = 3𝑎, 𝑎 ∈ [0,1] → $\max_{a} f(a) = 3$ and $\arg\max_{a} f(a) = 1$

Min and Arg Min operate in a similar manner

Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

6
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
• The derivative 𝒇′ of a function 𝒇 is a function that
describes how fast 𝒇 grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function 𝑓 grows (or decreases) constantly at any point x of its domain
– When the derivative 𝑓′ is a function
• If 𝑓′ is positive at some x, then the function 𝑓 grows at this point
• If 𝑓′ is negative at some x, then the function 𝑓 decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal
(e.g. maximum or minimum points)

• The process of finding a derivative is called differentiation.

• Gradient is the generalization of derivative for functions that


take several inputs (or one input in the form of a vector or some
other complex structure).
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).

7
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If 𝑓(𝐱) is a scalar function of d variables and 𝐱 is a d×1 vector,
then differentiation of 𝑓(𝐱) w.r.t. 𝐱 results in a d×1 vector

$\frac{d f(\mathbf{x})}{d\mathbf{x}} = \begin{bmatrix} \partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_d \end{bmatrix}$

This is referred to as the gradient of 𝑓(𝐱) and often written as $\nabla_{\mathbf{x}} f$.

E.g. $f(\mathbf{x}) = a x_1 + b x_2 \;\Rightarrow\; \nabla_{\mathbf{x}} f = \begin{bmatrix} a \\ b \end{bmatrix}$    Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
8
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If 𝐟(𝐱) is a vector function of size h×1 and 𝐱 is a d×1 vector,
then differentiation of 𝐟(𝐱) w.r.t. 𝐱 results in an h×d matrix

$\frac{d\mathbf{f}(\mathbf{x})}{d\mathbf{x}} = \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_d \\ \vdots & \ddots & \vdots \\ \partial f_h/\partial x_1 & \cdots & \partial f_h/\partial x_d \end{bmatrix}$
The matrix is referred to as the Jacobian of 𝐟(𝐱)

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

9
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient

Some Vector-Matrix Differentiation Formulae

$\frac{d(\mathbf{A}\mathbf{x})}{d\mathbf{x}} = \mathbf{A}$

$\frac{d(\mathbf{b}^T\mathbf{x})}{d\mathbf{x}} = \mathbf{b}$        $\frac{d(\mathbf{y}^T\mathbf{A}\mathbf{x})}{d\mathbf{x}} = \mathbf{A}^T\mathbf{y}$

$\frac{d(\mathbf{x}^T\mathbf{A}\mathbf{x})}{d\mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$

Recall the inner product function $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} = a_1 x_1 + a_2 x_2 + \cdots + a_d x_d$
Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
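One way to sanity-check these formulae is a finite-difference test; the sketch below (my own illustration, not from the slides) verifies d(xᵀAx)/dx = (A + Aᵀ)x numerically:

```python
# Numerical check of the gradient formula for f(x) = x^T A x using central differences.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

analytic = (A + A.T) @ x          # the formula from the slide

eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    e = np.zeros(3)
    e[i] = eps
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```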

10
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Poll on PollEv.com/ymjin
Just “skip” if you are required to do registration

11
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
When poll is active, respond at PollEv.com/ymjin

Suppose g(x) is a scalar function of d variables where x is a d×1 vector; then the
outcome of differentiation of g(x) w.r.t. x is a d×1 vector.

True / False

12
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression

• Linear regression is a popular regression learning


algorithm that learns a model which is a linear
combination of features of the input example.

𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

$\mathbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,d} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).

13
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Problem Statement: To predict the unknown 𝑦 for a given 𝐱 (testing)
𝑚
• We have a collection of labeled examples (training) {(𝐱𝑖 , y𝑖 )}𝑖=1
– 𝑚 is the size of the collection
– 𝐱𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚 (input)
– y𝑖 is a real-valued target (1-D)
– Note:
• when y𝑖 is continuous valued, it is a regression problem
• when y𝑖 is discrete valued, it is a classification problem

• We want to build a model 𝑓𝐰,𝑏 (𝐱) as a linear combination of features of


example 𝐱: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏
where 𝐰 is a d-dimensional vector of parameters and 𝑏 is a real number.
• The notation 𝑓𝐰,𝑏 means that the model 𝑓 is parametrized by two values: w
and b
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

14
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression

Learning objective function


• To find the optimal values w* and b* that minimize the following expression:

$\frac{1}{m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i\right)^2$
• In mathematics, the expression we
minimize or maximize is called an
objective function, or, simply, an
objective

(𝑓𝐰 𝐱𝑖 − y𝑖 )2 is called the loss function: a measure of the difference


between 𝑓𝐰 𝐱𝑖 and y𝑖 or a penalty for misclassification of example i.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

15
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values w* that minimize the following expression:

$\sum_{i=1}^{m}\left(f_{\mathbf{w}}(\mathbf{x}_i) - y_i\right)^2$
with $f_{\mathbf{w}}(\mathbf{x}_i) = \mathbf{x}_i^T\mathbf{w}$,
where we define 𝐰 = [𝑏, 𝑤1 , … 𝑤𝑑 ]𝑇 = [𝑤0 , 𝑤1 , … 𝑤𝑑 ]𝑇 ,
and 𝐱 𝑖 = [1, 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 = [𝑥𝑖,0 , 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 , 𝑖 = 1, … , 𝑚

• This particular choice of the loss function is called squared


error loss
Note: The normalization factor 1/m can be omitted as it does not affect the optimization.

17
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression    $\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i\right)^2$
• All model-based learning algorithms have a loss function
• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)

• In linear regression, the cost function is given by the average


loss, also called the empirical risk because we do not have
all the data (e.g. testing data)
– The average of all penalties is obtained by applying the
model to the training data
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

18
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning (Training)
• Consider the set of feature vectors 𝐱_𝑖 and target outputs 𝑦_𝑖,
indexed by 𝑖 = 1, … , 𝑚; a linear model 𝑓_𝐰(𝐱) = 𝐱^𝑇𝐰 can be stacked as

Learning model:  $\boldsymbol{f}_{\mathbf{w}}(\mathbf{X}) = \mathbf{X}\mathbf{w} = \begin{bmatrix} \mathbf{x}_1^T\mathbf{w} \\ \vdots \\ \mathbf{x}_m^T\mathbf{w} \end{bmatrix}$        Learning target vector:  $\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

where $\mathbf{x}_i^T\mathbf{w} = [1, x_{i,1}, \ldots, x_{i,d}]\,[b, w_1, \ldots, w_d]^T$
Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.

19
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression

Least Squares Regression


In vector-matrix notation, the minimization of the objective
function can be written compactly using 𝐞 = 𝐗𝐰 − 𝐲 :
J(𝐰) = 𝐞𝑇 𝐞
= (𝐗𝐰 − 𝐲)𝑇 (𝐗𝐰 − 𝐲)
= (𝐰 𝑇 𝐗 𝑇 − 𝐲 𝑇 )(𝐗𝐰 − 𝐲)
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 𝐰 𝑇 𝐗 𝑇 𝐲 − 𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 2𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲.
Note: when 𝒇𝐰 𝐗 = 𝐗𝐰, then
𝑚
෌𝑖=1(𝑓𝐰 𝐱 𝑖 − y𝑖 )2 = (𝐗𝐰 − 𝐲)𝑇 (𝐗𝐰 − 𝐲).

20
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression

Differentiating J(𝐰) with respect to 𝐰 and setting the


result to 0:  $\frac{\partial}{\partial\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$
$\frac{\partial}{\partial\mathbf{w}}\left(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y}\right) = \mathbf{0}$
⇒ $2\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{X}^T\mathbf{y} = \mathbf{0}$
⇒ $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$
⇒ Any minimizer $\hat{\mathbf{w}}$ of J(𝐰) must satisfy $\mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y}) = \mathbf{0}$.
If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐰ෝ = (𝐗 𝑇 𝐗)−𝟏 𝐗 𝑇 𝐲
Prediction/testing: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰 ෝ

21
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Example 1   Training set {(𝑥_𝑖, 𝑦_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥 = −9} → {𝑦 = −6},  {𝑥 = −7} → {𝑦 = −6},  {𝑥 = −5} → {𝑦 = −4},
{𝑥 = 1} → {𝑦 = −1},  {𝑥 = 5} → {𝑦 = 1},  {𝑥 = 9} → {𝑦 = 4}

$\mathbf{X}\mathbf{w} = \mathbf{y}: \quad \begin{bmatrix} 1 & -9 \\ 1 & -7 \\ 1 & -5 \\ 1 & 1 \\ 1 & 5 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix}$

This set of linear equations has no exact solution.
However, $\mathbf{X}^T\mathbf{X}$ is invertible, so the least squares approximation is

$\hat{\mathbf{w}} = \mathbf{X}^{\dagger}\mathbf{y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 6 & -6 \\ -6 & 262 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -9 & -7 & -5 & 1 & 5 & 9 \end{bmatrix} \begin{bmatrix} -6 \\ -6 \\ -4 \\ -1 \\ 1 \\ 4 \end{bmatrix} = \begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$
22
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}} = \mathbf{X}\begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix}$, i.e., the fitted line is  y = −1.4375 + 0.5625x

Prediction:
Test set  {𝑥 = −1} → {𝑦 = ?}

$\hat{y} = \begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} -1.4375 \\ 0.5625 \end{bmatrix} = -2$

[Figure: Linear Regression on one-dimensional samples]

Python demo 1
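A minimal sketch of what Python demo 1 might contain (my own illustration; variable names are mine):

```python
# One-dimensional linear regression: learn w from the training set, then predict x = -1.
import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-6.0, -6.0, -4.0, -1.0, 1.0, 4.0])

X = np.column_stack([np.ones_like(x), x])   # prepend the bias column
w = np.linalg.inv(X.T @ X) @ X.T @ y        # learning: approx. [-1.4375, 0.5625]

X_new = np.array([[1.0, -1.0]])             # test sample x = -1, with bias
print(X_new @ w)                            # prediction: [-2.]
```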
23
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Example 2   Training set {(𝐱_𝑖, 𝑦_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦 = 1}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦 = 0}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦 = 2}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦 = −1}

$\mathbf{X}\mathbf{w} = \mathbf{y}: \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix}$

This set of linear equations has no exact solution.
However, $\mathbf{X}^T\mathbf{X}$ is invertible, so the least squares approximation is

$\hat{\mathbf{w}} = \mathbf{X}^{\dagger}\mathbf{y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 2 \\ -1 \end{bmatrix} = \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix}$

24
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Prediction:
Test set
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦 = ?}

$\hat{\mathbf{y}} = \hat{\boldsymbol{f}}_{\mathbf{w}}(\mathbf{X}_{new}) = \mathbf{X}_{new}\hat{\mathbf{w}} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.7500 \\ 0.1786 \\ 0.9286 \end{bmatrix} = \begin{bmatrix} 7.7500 \\ -1.6786 \end{bmatrix}$

[Figure: the four linear equations]

25
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning of Vectored Function (Multiple Outputs)
For one sample: a linear model 𝐟_𝐰(𝐱) = 𝐱^𝑇𝐖 (a vector function)

For m samples: 𝐅_𝐰(𝐗) = 𝐗𝐖 = 𝐘

$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_m^T \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}$ (rows are samples 1 to m), $\quad \mathbf{W} = \begin{bmatrix} w_{0,1} & \cdots & w_{0,h} \\ w_{1,1} & \cdots & w_{1,h} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,h} \end{bmatrix}$

$\mathbf{Y} = \begin{bmatrix} y_{1,1} & \cdots & y_{1,h} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \cdots & y_{m,h} \end{bmatrix}$ (rows are sample 1's output to sample m's output)

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


26
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Objective: $\sum_{i=1}^{m}\left\|\mathbf{f}_{\mathbf{w}}(\mathbf{x}_i) - \mathbf{y}_i\right\|^2$, written compactly via $\mathbf{E} = \mathbf{X}\mathbf{W} - \mathbf{Y}$ as $\mathrm{trace}(\mathbf{E}^T\mathbf{E})$

Least Squares Regression of Multiple Outputs


In matrix notation, the sum of squared errors cost
function can be written compactly using 𝐄 = 𝐗𝐖 − 𝐘:

J(𝐖) = trace(𝐄 𝑇 𝐄)
= trace[(𝐗𝐖 − 𝐘)𝑇 (𝐗𝐖 − 𝐘)]

If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐖 ෡ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐘
Prediction/testing: 𝐅෠𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐖 ෡

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3.2.4)

27
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression

Least Squares Regression of Multiple Outputs

$J(\mathbf{W}) = \mathrm{trace}(\mathbf{E}^T\mathbf{E}) = \mathrm{trace}\left(\begin{bmatrix} \mathbf{e}_1^T \\ \vdots \\ \mathbf{e}_h^T \end{bmatrix}[\mathbf{e}_1 \;\; \mathbf{e}_2 \;\; \cdots \;\; \mathbf{e}_h]\right) = \mathrm{trace}\begin{bmatrix} \mathbf{e}_1^T\mathbf{e}_1 & \mathbf{e}_1^T\mathbf{e}_2 & \cdots & \mathbf{e}_1^T\mathbf{e}_h \\ \mathbf{e}_2^T\mathbf{e}_1 & \mathbf{e}_2^T\mathbf{e}_2 & \cdots & \mathbf{e}_2^T\mathbf{e}_h \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{e}_h^T\mathbf{e}_1 & \mathbf{e}_h^T\mathbf{e}_2 & \cdots & \mathbf{e}_h^T\mathbf{e}_h \end{bmatrix} = \sum_{k=1}^{h}\mathbf{e}_k^T\mathbf{e}_k$

28
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Training set {(𝐱_𝑖, 𝐲_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦1 = 1, 𝑦2 = 0}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦1 = 0, 𝑦2 = 1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦1 = 2, 𝑦2 = −1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦1 = −1, 𝑦2 = 3}

$\mathbf{X}\mathbf{W} = \mathbf{Y}: \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix}$   (the first column of 𝐗 acts as the bias)

This set of linear equations has NO exact solution.
$\mathbf{X}^T\mathbf{X}$ is invertible, so the least squares approximation is

$\hat{\mathbf{W}} = \mathbf{X}^{\dagger}\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & -1 \\ -1 & 3 \end{bmatrix} = \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix}$

29
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Prediction:
Test set: two new samples
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦1 = ?, 𝑦2 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦1 = ?, 𝑦2 = ?}

$\hat{\mathbf{Y}} = \mathbf{X}_{new}\hat{\mathbf{W}} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -0.75 & 2.25 \\ 0.1786 & 0.0357 \\ 0.9286 & -1.2143 \end{bmatrix} = \begin{bmatrix} 7.75 & -7.25 \\ -1.6786 & 3.4643 \end{bmatrix}$

Python demo 2
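A minimal sketch of what Python demo 2 might contain for this example (my own illustration; variable names are mine):

```python
# Multi-output linear regression: one column of weights per output.
import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0],
              [-1.0, 3.0]])

W = np.linalg.inv(X.T @ X) @ X.T @ Y        # learning

X_new = np.array([[1.0, 6.0, 8.0],
                  [1.0, 0.0, -1.0]])
print(X_new @ W)                            # approx. [[ 7.75, -7.25], [-1.6786, 3.4643]]
```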

30
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 4
The values of feature x and their corresponding values of the multiple-output
target y are shown in the table below.

Based on least squares regression, what are the values of W?
Based on the learned mapping, when x = 2, what is the value of y?

x | [3]    | [4]      | [10]    | [6]      | [7]
y | [0, 5] | [1.5, 4] | [-3, 8] | [-4, 10] | [1, 6]

Learning (with a bias column prepended to X):
$\hat{\mathbf{W}} = \mathbf{X}^{\dagger}\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \begin{bmatrix} 1.9 & 3.6 \\ -0.4667 & 0.5 \end{bmatrix}$   (Python demo 3)

Prediction:
$\hat{\mathbf{Y}}_{new} = \mathbf{X}_{new}\hat{\mathbf{W}} = \begin{bmatrix} 1 & 2 \end{bmatrix}\hat{\mathbf{W}} = \begin{bmatrix} 0.9667 & 4.6 \end{bmatrix}$

31
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Dot-product, matrix inverse
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
• Inner product, linear/affine functions
• Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
• Objective function, loss function
• Least square solution, training/learning and testing/prediction
• Linear regression with multiple outputs
Learning/training:  $\hat{\mathbf{w}} = (\mathbf{X}_{train}^T\mathbf{X}_{train})^{-1}\mathbf{X}_{train}^T\mathbf{y}_{train}$
Prediction/testing:  $\mathbf{y}_{test} = \mathbf{X}_{test}\hat{\mathbf{w}}$
• Classification Python packages: numpy, pandas, matplotlib.pyplot,
• Ridge Regression numpy.linalg, and sklearn.metrics (for
• Polynomial Regression mean_squared_error), numpy.linalg.pinv
33
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 6
Semester 2
2024/2025

Yueming Jin
[email protected]

Electrical and Computer Engineering Department


National University of Singapore

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao) Mid-term: Lecture 1 to 6
– Performance Issues Trial quiz
– K-means Clustering Assignment 1 & 2
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression & Polynomial
Regression

Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Scalar Function (Single Output)
For one sample: a linear model 𝑓_𝐰(𝐱) = 𝐱^𝑇𝐰 (a scalar function)
For m samples: 𝒇_𝐰(𝐗) = 𝐗𝐰 = 𝐲

$\mathbf{y} = \begin{bmatrix} \mathbf{x}_1^T\mathbf{w} \\ \vdots \\ \mathbf{x}_m^T\mathbf{w} \end{bmatrix}$, where $\mathbf{x}_i^T = [1, x_{i,1}, \ldots, x_{i,d}]$

$\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_d \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$

Objective: $\sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})$
Learning/training (when $\mathbf{X}^T\mathbf{X}$ is invertible), least squares solution: $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Prediction/testing: $\mathbf{y}_{new} = \hat{\boldsymbol{f}}_{\mathbf{w}}(\mathbf{X}_{new}) = \mathbf{X}_{new}\hat{\mathbf{w}}$

4
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)
𝐅_𝐰(𝐗) = 𝐗𝐖 = 𝐘

$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_m^T \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,d} \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} w_{0,1} & \cdots & w_{0,h} \\ w_{1,1} & \cdots & w_{1,h} \\ \vdots & \ddots & \vdots \\ w_{d,1} & \cdots & w_{d,h} \end{bmatrix}, \quad \mathbf{Y} = \begin{bmatrix} y_{1,1} & \cdots & y_{1,h} \\ \vdots & \ddots & \vdots \\ y_{m,1} & \cdots & y_{m,h} \end{bmatrix}$

Least Squares Regression: if $\mathbf{X}^T\mathbf{X}$ is invertible, then
Learning/training: $\hat{\mathbf{W}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$
Prediction/testing: $\hat{\mathbf{F}}_{\mathbf{w}}(\mathbf{X}_{new}) = \mathbf{X}_{new}\hat{\mathbf{W}}$

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


5
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


• We have a collection of labeled examples
• 𝑚 is the size of the collection
• 𝐱𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚
• 𝑦𝑖 is discrete target label (e.g., 𝑦𝑖 ∈ {−1, +1} or {0, 1} for
binary classification problems)
• Note:
• when 𝑦𝑖 is continuous valued → a regression problem
• when 𝑦𝑖 is discrete valued →a classification problem
• Linear model: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏 or in compact form 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰
(having the offset term absorbed into the inner product)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

6
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


Binary Classification:
If 𝐗 𝑇 𝐗 is invertible, then
Learning: $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, with $y_i \in \{-1, +1\}$, $i = 1, \ldots, m$
Prediction: $\hat{f}^c_{\mathbf{w}}(\mathbf{x}_{new}) = \mathrm{sign}(\mathbf{x}_{new}^T\hat{\mathbf{w}})$ for each row $\mathbf{x}_{new}^T$ of $\mathbf{X}_{new}$

[Figure: the sign function, sign(a) = +1 for a > 0 and −1 for a < 0]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

7
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1   Training set {(𝑥_𝑖, 𝑦_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥 = −9} → {𝑦 = −1},  {𝑥 = −7} → {𝑦 = −1},  {𝑥 = −5} → {𝑦 = −1},
{𝑥 = 1} → {𝑦 = +1},  {𝑥 = 5} → {𝑦 = +1},  {𝑥 = 9} → {𝑦 = +1}

$\mathbf{X}\mathbf{w} = \mathbf{y}: \quad \begin{bmatrix} 1 & -9 \\ 1 & -7 \\ 1 & -5 \\ 1 & 1 \\ 1 & 5 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$   (the first column of 𝐗 is the bias)

This set of linear equations has NO exact solution, but $\mathbf{X}^T\mathbf{X}$ is invertible, so the least squares approximation is

$\hat{\mathbf{w}} = \mathbf{X}^{\dagger}\mathbf{y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 6 & -6 \\ -6 & 262 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -9 & -7 & -5 & 1 & 5 & 9 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.1406 \\ 0.1406 \end{bmatrix}$
8
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1

$\hat{\mathbf{y}} = \mathrm{sign}(\mathbf{X}\hat{\mathbf{w}}) = \mathrm{sign}\left(\mathbf{X}\begin{bmatrix} 0.1406 \\ 0.1406 \end{bmatrix}\right)$, i.e., the fitted decision value is  y' = 0.1406 + 0.1406x

Prediction:
Test set  {𝑥 = −2} → {𝑦 = ?}

$y_{new} = \hat{f}^c_{\mathbf{w}}(\mathbf{x}_{new}) = \mathrm{sign}(\mathbf{x}_{new}^T\hat{\mathbf{w}}) = \mathrm{sign}\left(\begin{bmatrix} 1 & -2 \end{bmatrix}\begin{bmatrix} 0.1406 \\ 0.1406 \end{bmatrix}\right) = \mathrm{sign}(-0.1406) = -1$

[Figure: Linear Regression for one-dimensional classification]   (Python demo 1)
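A minimal sketch of the classification version of the demo (my own illustration, not the official demo code):

```python
# Binary classification via linear regression + sign: learn w, then classify x = -2.
import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

X = np.column_stack([np.ones_like(x), x])   # bias column followed by the feature
w = np.linalg.inv(X.T @ X) @ X.T @ y        # approx. [0.1406, 0.1406]

X_new = np.array([[1.0, -2.0]])             # test sample x = -2, with bias
print(np.sign(X_new @ w))                   # [-1.]
```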
9
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Linear Methods for Classification

Multi-Category Classification:

If 𝐗 𝑇 𝐗 is invertible, then

Learning: $\hat{\mathbf{W}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, with $\mathbf{Y} \in \mathbf{R}^{m\times C}$
Prediction: $\hat{f}^c_{\mathbf{w}}(\mathbf{x}_{new}) = \arg\max_{k=1,\ldots,C}\, \mathbf{x}_{new}^T\hat{\mathbf{W}}(:,k)$ for each row $\mathbf{x}_{new}^T$ of $\mathbf{X}_{new}$

Each row (of 𝑖 = 1, … , 𝑚) in 𝐘 has a one-hot encoding/assignment:
e.g., the target for class-1 is labelled as $\mathbf{y}_i^T = [1, 0, 0, \ldots, 0]$ for the ith sample,
the target for class-2 is labelled as $\mathbf{y}_j^T = [0, 1, 0, \ldots, 0]$ for the jth sample, and
the target for class-C is labelled as $\mathbf{y}_m^T = [0, 0, \ldots, 0, 1]$ for the mth sample (each row has C entries).
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.4)

10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2   Three-class classification
Training set {(𝐱_𝑖, 𝐲_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1} → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = −1, 𝑥2 = 1} → {𝑦1 = 0, 𝑦2 = 1, 𝑦3 = 0}   Class 2
{𝑥1 = 1, 𝑥2 = 3} → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = 1, 𝑥2 = 0} → {𝑦1 = 0, 𝑦2 = 0, 𝑦3 = 1}   Class 3

$\mathbf{X}\mathbf{W} = \mathbf{Y}: \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & 3 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$   (the first column of 𝐗 is the bias)

This set of linear equations has NO exact solution, but $\mathbf{X}^T\mathbf{X}$ is invertible, so the least squares approximation is

$\hat{\mathbf{W}} = \mathbf{X}^{\dagger}\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \begin{bmatrix} 4 & 2 & 5 \\ 2 & 4 & 3 \\ 5 & 3 & 11 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.2857 & -0.5 & 0.2143 \\ 0.2857 & 0 & -0.2857 \end{bmatrix}$
11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2   Prediction
Test set $\mathbf{X}_{new}$:
{𝑥1 = 6, 𝑥2 = 8} → {class 1, 2, or 3?}
{𝑥1 = 0, 𝑥2 = −1} → {class 1, 2, or 3?}

$\hat{\mathbf{Y}} = \mathbf{X}_{new}\hat{\mathbf{W}} = \begin{bmatrix} 1 & 6 & 8 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.2857 & -0.5 & 0.2143 \\ 0.2857 & 0 & -0.2857 \end{bmatrix}$

Category prediction:
$\hat{\boldsymbol{f}}^c_{\mathbf{w}}(\mathbf{X}_{new}) = \arg\max_{k=1,\ldots,C}(\hat{\mathbf{Y}}(:,k)) = \arg\max_{k=1,\ldots,C}\left(\begin{bmatrix} 4 & -2.50 & -0.50 \\ -0.2857 & 0.50 & 0.7857 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$  → Class 1 and Class 3

For each row of $\hat{\mathbf{Y}}$, the column position of the largest number (across all columns for that row)
determines the class label. E.g. in the first row, the maximum number is 4, which is in column 1;
therefore, the resulting predicted class is 1.   (Python demo 2)
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
Recall Linear regression
Objective: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 = \arg\min_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})$
The learning computation: $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
We cannot guarantee that the matrix 𝐗 𝑇 𝐗 is invertible

Ridge regression: shrinks the regression coefficients w by imposing a


penalty on their size
Objective: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{m}(f_{\mathbf{w}}(\mathbf{x}_i) - y_i)^2 + \lambda\sum_{j=1}^{d} w_j^2 = \arg\min_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\mathbf{w}^T\mathbf{w}$

Here λ ≥ 0 is a complexity parameter that controls the amount of


shrinkage: the larger the value of λ, the greater the amount of shrinkage.

Note: m samples & d parameters


13
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
Using a linear model:
$\min_{\mathbf{w}} (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\mathbf{w}^T\mathbf{w}$
Solution:
$\frac{\partial}{\partial\mathbf{w}}\left((\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\mathbf{w}^T\mathbf{w}\right) = \mathbf{0}$
⇒ 2𝐗 𝑇 𝐗𝐰 − 2𝐗 𝑇 𝐲 + 2λ𝐰 = 𝟎
⇒ 𝐗 𝑇 𝐗𝐰 + λ𝐰 = 𝐗 𝑇 𝐲
⇒ (𝐗 𝑇 𝐗 + λ𝐈)𝐰 = 𝐗 𝑇 𝐲
where I is the dxd identity matrix
From here on, we shall focus on a single column of output 𝐲 in the derivations.
Learning: $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3)

14
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Primal Form (when m > d)

(𝐗 𝑇 𝐗 + λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = (𝐗 𝑇 𝐗 +λ𝐈)−1 𝐗 𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3)

15
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Dual Form (when m < d)

(𝐗𝐗 𝑇 +λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = 𝐗 𝑇 (𝐗𝐗 𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ

Derivation as homework (see tutorial 6).


Hint: start off with (𝐗 𝑇 𝐗 + 𝜆𝐈)𝐰 = 𝐗 𝑇 𝐲 and make use of 𝐰 = 𝐗 𝑇 𝐚 and
$\mathbf{a} = \lambda^{-1}(\mathbf{y} - \mathbf{X}\mathbf{w})$, $\lambda > 0$
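The equivalence of the two forms is easy to check numerically; the sketch below (my own illustration, with randomly generated data) confirms that the primal and dual ridge solutions coincide for λ > 0:

```python
# Primal vs dual ridge regression: both give the same weight vector.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # m = 5 samples, d = 3 features
y = rng.standard_normal(5)
lam = 0.1

m, d = X.shape
w_primal = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y
w_dual = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m)) @ y

print(np.allclose(w_primal, w_dual))   # True
```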

16
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g. when the input dimension is d=2,
a polynomial function of degree = 2 is:
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2$.

XOR problem

𝑓𝐰 𝐱 = 𝑥1 𝑥2

17
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Polynomial Expansion
• The linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 can be written as
𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰
= σ𝑑𝑖=0 𝑥𝑖 𝑤𝑖 , 𝑥0 = 1
= 𝑤0 + σ𝑑𝑖=1 𝑥𝑖 𝑤𝑖 .

• By including additional terms involving the products of


pairs of components of 𝐱, we obtain a quadratic model:
$f_{\mathbf{w}}(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$.

2nd order: $f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2$

3rd order: $f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2 + \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk} x_i x_j x_k$, with $d = 2$   Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

18
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function
• In general:
𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1 𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

Weierstrass Approximation Theorem: Every continuous function defined on a


closed interval [a, b] can be uniformly approximated as closely as desired by
a polynomial function.
- Suppose f is a continuous real-valued function defined on the real interval [a, b].
- For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have| f (x) − p(x)| < ε.
(Ref: https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

Notes:
• For high dimensional input features (large d value) and high polynomial order, the
number of polynomial terms becomes explosive! (i.e., grows exponentially)
• For high dimensional problems, polynomials of order larger than 3 is seldom used.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5) online

19
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function

𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

$\boldsymbol{f}_{\mathbf{w}}(\mathbf{X}) = \mathbf{P}\mathbf{w} = \begin{bmatrix} \mathbf{p}_1^T\mathbf{w} \\ \vdots \\ \mathbf{p}_m^T\mathbf{w} \end{bmatrix}$   ( Note: 𝐏 ≜ 𝐏(𝐗) for symbol simplicity )

where $\mathbf{p}_l^T = [1, x_{l,1}, \ldots, x_{l,d}, \ldots, x_{l,i}x_{l,j}, \ldots, x_{l,i}x_{l,j}x_{l,k}, \ldots]$ and
$\mathbf{w} = [w_0, w_1, \ldots, w_d, \ldots, w_{ij}, \ldots, w_{ijk}, \ldots]^T$,

$l = 1, \ldots, m$; d denotes the dimension of input features; m denotes the number of samples.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

20
© Copyright EE, NUS. All Rights Reserved.
Example 3
Training set {(𝐱_𝑖, 𝑦_𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
{𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
{𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
{𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model (Python demo 3):

$f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2 + w_{11} x_1^2 + w_{22} x_2^2 = [1 \;\; x_1 \;\; x_2 \;\; x_1 x_2 \;\; x_1^2 \;\; x_2^2]\,[w_0, w_1, w_2, w_{12}, w_{11}, w_{22}]^T$

Stack the 4 training samples as a matrix:

$\mathbf{P} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} & x_{1,1}^2 & x_{1,2}^2 \\ 1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} & x_{2,1}^2 & x_{2,2}^2 \\ 1 & x_{3,1} & x_{3,2} & x_{3,1}x_{3,2} & x_{3,1}^2 & x_{3,2}^2 \\ 1 & x_{4,1} & x_{4,2} & x_{4,1}x_{4,2} & x_{4,1}^2 & x_{4,2}^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$
21
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

Ridge Regression in Primal Form (m > d)


For λ > 0,
Learning: 𝐰ෝ = (𝐏𝑇 𝐏 +λ𝐈)−1 𝐏𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰ෝ

Ridge Regression in Dual Form (m < d)


For λ > 0,
Learning: 𝐰ෝ = 𝐏𝑇 (𝐏𝐏𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

Note: Change X to P with reference to slides 15/16; m & d refers to the size of P (not X)
22
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

For Regression Applications


• Learn continuous valued 𝑦 using either primal form or dual form
• Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

For Classification Applications


• Learn discrete valued 𝑦 (𝑦 ∈ {−1, +1}) or 𝐘 (one-hot) using either primal
form or dual form
• Binary Prediction: 𝒇෠ 𝑐𝐰 𝐏(𝐗𝑛𝑒𝑤 ) = sign(𝐏𝑛𝑒𝑤 𝐰)

• Multi-Category Prediction: $\hat{\boldsymbol{f}}^c_{\mathbf{w}}(\mathbf{P}(\mathbf{X}_{new})) = \arg\max_{k=1,\ldots,C}(\mathbf{P}_{new}\hat{\mathbf{W}}(:,k))$

23
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont'd)
Training set:
{𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
{𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
{𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
{𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model:

$\mathbf{P} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} & x_{1,1}^2 & x_{1,2}^2 \\ 1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} & x_{2,1}^2 & x_{2,2}^2 \\ 1 & x_{3,1} & x_{3,2} & x_{3,1}x_{3,2} & x_{3,1}^2 & x_{3,2}^2 \\ 1 & x_{4,1} & x_{4,2} & x_{4,1}x_{4,2} & x_{4,1}^2 & x_{4,2}^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$

Since 𝐏 has m = 4 rows and 6 columns (m < d for 𝐏), use the dual (right-inverse) form (Python demo 3):

$\hat{\mathbf{w}} = \mathbf{P}^T(\mathbf{P}\mathbf{P}^T)^{-1}\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 6 & 3 & 3 \\ 1 & 3 & 3 & 1 \\ 1 & 3 & 1 & 3 \end{bmatrix}^{-1} \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \\ 1 \\ -4 \\ 1 \\ 1 \end{bmatrix}$
24
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont'd)   Prediction
Test set:
Test point 1: {𝑥1 = 0.1, 𝑥2 = 0.1} → {𝑦 = class −1 or +1?}
Test point 2: {𝑥1 = 0.9, 𝑥2 = 0.9} → {𝑦 = class −1 or +1?}
Test point 3: {𝑥1 = 0.1, 𝑥2 = 0.9} → {𝑦 = class −1 or +1?}
Test point 4: {𝑥1 = 0.9, 𝑥2 = 0.1} → {𝑦 = class −1 or +1?}

With each test row expanded as $[1 \;\; x_1 \;\; x_2 \;\; x_1 x_2 \;\; x_1^2 \;\; x_2^2]$:

$\hat{\mathbf{y}} = \mathbf{P}_{new}\hat{\mathbf{w}} = \begin{bmatrix} 1 & 0.1 & 0.1 & 0.01 & 0.01 & 0.01 \\ 1 & 0.9 & 0.9 & 0.81 & 0.81 & 0.81 \\ 1 & 0.1 & 0.9 & 0.09 & 0.01 & 0.81 \\ 1 & 0.9 & 0.1 & 0.09 & 0.81 & 0.01 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \\ 1 \\ -4 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.82 \\ -0.82 \\ 0.46 \\ 0.46 \end{bmatrix}$

$\hat{\boldsymbol{f}}^c_{\mathbf{w}}(\mathbf{P}(\mathbf{X}_{new})) = \mathrm{sign}(\hat{\mathbf{y}}) = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$  → Class −1, Class −1, Class +1, Class +1
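A minimal sketch of what Python demo 3 might contain, using sklearn's PolynomialFeatures (note: its column ordering is 1, x1, x2, x1², x1·x2, x2², which differs from the slide's ordering but yields the same predictions; variable names are my own):

```python
# Polynomial regression for the XOR-style example, learned in the dual form.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

poly = PolynomialFeatures(degree=2)
P = poly.fit_transform(X)                   # shape (4, 6): more columns than samples

w = P.T @ np.linalg.inv(P @ P.T) @ y        # dual (right-inverse) solution

X_test = np.array([[0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]])
P_test = poly.transform(X_test)
print(P_test @ w)                           # approx. [-0.82 -0.82  0.46  0.46]
print(np.sign(P_test @ w))                  # [-1. -1.  1.  1.]
```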

25
© Copyright EE, NUS. All Rights Reserved.
Poll on PollEv.com/ymjin
Just “skip” if you are required to do registration

26
© Copyright EE, NUS. All Rights Reserved.
Mid-term: Lecture 1 to 6
Summary Trial quiz
• Notations, Vectors, Matrices Assignment 1 & 2
• Operations on Vectors and Matrices
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored function, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary

Primal form:  Learning $\hat{\mathbf{w}} = (\mathbf{P}^T\mathbf{P} + \lambda\mathbf{I})^{-1}\mathbf{P}^T\mathbf{y}$;  Prediction $\hat{\boldsymbol{f}}_{\mathbf{w}}(\mathbf{P}(\mathbf{X}_{new})) = \mathbf{P}_{new}\hat{\mathbf{w}}$

Dual form:  Learning $\hat{\mathbf{w}} = \mathbf{P}^T(\mathbf{P}\mathbf{P}^T + \lambda\mathbf{I})^{-1}\mathbf{y}$;  Prediction $\hat{\boldsymbol{f}}_{\mathbf{w}}(\mathbf{P}(\mathbf{X}_{new})) = \mathbf{P}_{new}\hat{\mathbf{w}}$
Hint: python packages: sklearn.preprocessing (PolynomialFeatures), np.sign,
sklearn.model_selection (train_test_split), sklearn.preprocessing (OneHotEncoder)
27
© Copyright EE, NUS. All Rights Reserved.
