
Regression

Dr Vidhya Kamakshi
Assistant Professor
National Institute of Technology Calicut
[email protected]



Data from a film maker

[Figure: the film maker's data relating expenses to earnings]



Notations
• Training data = $\{(x_i, y_i)\}_{i=1}^{N}$
• Goal: learn a function $f: \mathbb{R} \to \mathbb{R}$ such that $f(x) \approx y$
• Representation - Here the function is assumed to be a straight line
• So, $f(x) = w_0 + w_1 x$
• Aim is to find the parameters $w_0$ and $w_1$ that best fit the given data
  (Predicted value $f(x)$ is as close as possible to the true value $y$)
• $w = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$ is the vector of learnable parameters



Evaluation
• 𝑓(𝑥) has to be as close as possible to the true value 𝑦
• Evaluation function to drive this objective
• Least Squares Error = $(f(x) - y)^2$

• For a best fit the least squares error has to be low for most if not all
data points
• Objective/Loss function is thus (a small NumPy sketch follows):
  $J(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i; w) - y_i \right)^2$
  where $f(x_i; w)$ is to be read as $f(x_i)$ parameterized by $w$
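As an illustration (not on the original slide), here is a minimal NumPy sketch of this objective; the data points and parameter values are made up:

```python
import numpy as np

def predict(x, w0, w1):
    # Straight-line representation f(x) = w0 + w1 * x
    return w0 + w1 * x

def loss(x, y, w0, w1):
    # J(w) = (1/2N) * sum_i (f(x_i; w) - y_i)^2
    residuals = predict(x, w0, w1) - y
    return np.mean(residuals ** 2) / 2.0

# Hypothetical (expense, earning) pairs, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(loss(x, y, w0=0.0, w1=0.5), loss(x, y, w0=0.0, w1=1.0))
```

A lower value of the loss indicates a better fit; here the second parameter setting should give the smaller value.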
[Figure: candidate straight-line fits for different parameter settings, e.g. $w_0 = 0, w_1 = 0.5$ and $w_0 = 0, w_1 = 2$, ...]



Plot the error surface for the given data



Optimization
• To get the best fit, the overall loss must be minimized
• Done using Gradient Descent
• Basic Principle:
• Start with an initial estimate of 𝑤
• Change 𝑤 iteratively such that the loss function J(𝑤) is minimized
• Stop the iterative process if either of the following convergence conditions is met:
• Minimum value of J(𝑤) is hit
• No significant improvement observed in J(𝑤) values between two successive iterations
• Gradient – gives the direction of steepest increase of the function
• Perform Descent (opposing the gradient direction) in order to minimize the
loss function J(𝑤)



Parameter updates
• Update the parameters $w$ (weight $w_1$ & bias $w_0$) of the linear regressor as:
  $w^{new} = w^{old} - \alpha \, \nabla_w J(w)$
• Repeat until convergence
• Here,
  • $\alpha$ is called the learning rate
  • $\nabla_w J(w)$ is the gradient (partial derivative of the loss function $J(w)$ with respect to the learnable parameter $w$)



Importance of learning rate
• Critical hyper-parameter to be decided by the machine learning engineer
• Too low an $\alpha$ may lead to slower convergence
• Too high an $\alpha$ may lead to oscillations
• Typically $\alpha = 10^{-3}$ is recommended for most cases (no theoretical proof of why this works is available, but it seems to work in practice for most practitioners)



Computing Gradients
• $\nabla_w J(w) = \begin{pmatrix} \nabla_{w_0} J(w) \\ \nabla_{w_1} J(w) \end{pmatrix}$
• $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)^2$
• $\nabla_w J(w) = \begin{pmatrix} \frac{1}{N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i) \\ \frac{1}{N} \sum_{i=1}^{N} x_i \, (w_0 + w_1 x_i - y_i) \end{pmatrix}$
• $\nabla_w J(w) = \begin{pmatrix} \frac{1}{N} \sum_{i=1}^{N} (f(x_i; w^{old}) - y_i) \\ \frac{1}{N} \sum_{i=1}^{N} x_i \, (f(x_i; w^{old}) - y_i) \end{pmatrix}$
Parameter updates - Expanded
• Repeat until convergence:
  $w_0^{new} = w_0^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i; w^{old}) - y_i \right)$
  $w_1^{new} = w_1^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} x_i \left( f(x_i; w^{old}) - y_i \right)$
• The factor $\frac{1}{N}$ is mostly ignored in analysis but helps practically
• *Practical implementations use batch mode processing (a training-loop sketch follows)
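As a concrete (non-authoritative) illustration of these updates, here is a minimal batch gradient-descent loop in NumPy; the data, learning rate, iteration budget and stopping tolerance are all invented for the sketch:

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-2, iters=20000, tol=1e-12):
    # Univariate linear regressor f(x) = w0 + w1*x trained by batch gradient descent
    w0, w1 = 0.0, 0.0                      # initial estimate of the parameters
    prev_loss = np.inf
    for _ in range(iters):
        err = (w0 + w1 * x) - y            # f(x_i; w_old) - y_i
        w0 -= alpha * err.mean()           # (1/N) * sum_i (f(x_i; w_old) - y_i)
        w1 -= alpha * (x * err).mean()     # (1/N) * sum_i x_i (f(x_i; w_old) - y_i)
        loss = (err ** 2).mean() / 2.0
        if abs(prev_loss - loss) < tol:    # no significant improvement -> stop
            break
        prev_loss = loss
    return w0, w1

# Made-up data lying roughly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 6.8, 9.1])
print(gradient_descent(x, y))              # expected to approach (~1, ~2)
```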



Real world data – Multivariate
• In the real world, many factors influence an outcome
• Extending the film maker data example

• Number of input features = D



Multivariate Linear Regression – Extension
from Univariate Regression
• Representation – $f(x; w) = w_0 + w_1 x_1 + \dots + w_D x_D$
• Evaluation – $J(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i; w) - y_i \right)^2$
• Analogous to the simpler univariate case,
• Parameter update (see the vectorized sketch below):
  $w_j^{new} = w_j^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} x_{ij} \left( f(x_i; w^{old}) - y_i \right)$
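A minimal vectorized sketch of the same update, written against made-up data with D = 2 features (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def multivariate_gd(X, y, alpha=1e-2, iters=20000):
    # X: (N, D) feature matrix, y: (N,) targets.
    # A column of ones is prepended so that w[0] plays the role of the bias w0.
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])   # shape (N, D+1)
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        err = Xb @ w - y                   # f(x_i; w_old) - y_i for all i
        w -= alpha * Xb.T @ err / N        # simultaneous update of every w_j
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])       # exactly 1 + 2*x1 + 3*x2
print(multivariate_gd(X, y))               # expected to approach (~1, ~2, ~3)
```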



Potential issues
• Different scales of features

• Transform to a common scale


• Standardization – Mean = 0, Variance = 1
• Normalization – All feature values mapped into the [-1, 1] or [0, 1] range (a sketch of both follows)
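For concreteness (not from the slides), minimal NumPy versions of the two rescalings; the feature matrix is made up, and the code assumes no constant columns (otherwise the divisions would be by zero):

```python
import numpy as np

def standardize(X):
    # Per-feature standardization: subtract the mean, divide by the standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize(X):
    # Per-feature min-max normalization into the [0, 1] range
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

# Two invented features on very different scales (e.g. budget vs. runtime)
X = np.array([[100.0, 120.0], [500.0, 150.0], [900.0, 90.0]])
print(standardize(X))
print(normalize(X))
```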
Multivariate Linear Regression – Matrix Vector Interpretation
• Weights $w = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix} \in \mathbb{R}^{D+1}$
• $w_0$ is the bias term
• $w_1, w_2, \dots, w_D$ are the weights corresponding to the D input features
• Data point / Instance $x = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_D \end{pmatrix} \in \mathbb{R}^{D+1}$
• $x_0 = 1$ is included to account for the bias term
• Representation – $f(x; w) = w^T x = w_0 + w_1 x_1 + \dots + w_D x_D$
Multivariate Linear Regression – Matrix Vector Interpretation (contd.)
• Data Matrix $X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1D} \\ 1 & x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \dots & x_{ND} \end{pmatrix} \in \mathbb{R}^{N \times (D+1)}$
• Representation: $f(X; w) = Xw$
• Target Vector $y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^{N}$



Multivariate Linear Regression – Matrix Vector Interpretation (contd.)
• Evaluation: $J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( f(x_i; w) - y_i \right)^2$
• Corresponding to the matrix vector notation:
• $J(w) = \frac{1}{2} \| f(X; w) - y \|_2^2$
• $J(w) = \frac{1}{2} \| Xw - y \|_2^2$
• $J(w) = \frac{1}{2} (Xw - y)^T (Xw - y)$
• $J(w) = \frac{1}{2} (w^T X^T - y^T)(Xw - y)$
• $J(w) = \frac{1}{2} (w^T X^T X w - w^T X^T y - y^T X w + y^T y)$



Gradient Computations in Vector notations
• Consider 2 vectors $p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}, \; q = \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{pmatrix} \in \mathbb{R}^n$
• Dot Product $f: \mathbb{R}^n \to \mathbb{R}$, $f = p^T q = p_1 q_1 + p_2 q_2 + \dots + p_n q_n$
• $\nabla_p f = \begin{pmatrix} \nabla_{p_1} f \\ \nabla_{p_2} f \\ \vdots \\ \nabla_{p_n} f \end{pmatrix} \in \mathbb{R}^n$
• $\nabla_p f = \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{pmatrix} \in \mathbb{R}^n$
• $\nabla_p (p^T q) = q$; similarly $\nabla_q (p^T q) = p$



Back to Multivariate Linear Regression –
Matrix Vector Interpretation
• Find $w$ that minimizes
• $J(w) = \frac{1}{2} (w^T X^T X w - w^T X^T y - y^T X w + y^T y)$
• Standard idea from Calculus:
  • Equate the gradient to 0, i.e. $\nabla_w J(w) = 0$
• $\frac{1}{2} \left( X^T X w + (w^T X^T X)^T - X^T y - (y^T X)^T + 0 \right) = 0$
  (product rule applied when differentiating $\nabla_w (w^T X^T X w)$)
• $\frac{1}{2} \left( X^T X w + X^T X w - X^T y - X^T y \right) = 0$
• $\frac{1}{2} \left( 2 X^T X w - 2 X^T y \right) = 0$
• $X^T X w - X^T y = 0$
• $X^T X w = X^T y$
• The necessary parameters (weights) are $w = (X^T X)^{-1} X^T y$
• This is called the analytical solution that can be obtained in one shot (a numerical sketch follows)
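A small numerical sketch of this one-shot solution (the data is the same invented example used earlier; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability, and it assumes $X^T X$ is invertible):

```python
import numpy as np

def fit_normal_equation(X, y):
    # Prepend the bias column of ones, then solve X^T X w = X^T y for w
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])    # exactly 1 + 2*x1 + 3*x2
print(fit_normal_equation(X, y))        # ~ [1. 2. 3.]
```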
Polynomial Regression
• Consider a function $f(x) = \sin(2\pi x)$
• By adding random noise, samples $(x, y)$ are generated
• Given $\{(x, y)\}$, find $f$ such that $f(x) \approx y$
• Representation: Polynomial
  $f(x; w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$
• Loss function (Evaluation): Sum of squares error
  $\mathcal{L}(w) = \frac{1}{2} \sum_{i=1}^{N} \left( f(x_i; w) - y_i \right)^2$
• (A data-generation and fitting sketch follows)
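An illustrative sketch (sample size, noise level and the degrees tried are all arbitrary choices) that generates noisy samples of $\sin(2\pi x)$ and fits polynomials of a few degrees by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x)
N = 15
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

def fit_polynomial(x, y, M):
    # Design matrix with columns 1, x, x^2, ..., x^M, then a least-squares fit
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

for M in (1, 3, 9):
    w = fit_polynomial(x, y, M)
    Phi = np.vander(x, M + 1, increasing=True)
    sse = 0.5 * np.sum((Phi @ w - y) ** 2)   # the sum-of-squares loss above
    print(M, round(sse, 4))                  # training loss keeps falling as M grows
```

The training loss alone is misleading: as the next slides show, a very high degree can over-fit even though its training loss is the lowest.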



Choices of degree M

[Figure: polynomial fits for different degrees M – under fit, good fit, over fit]



Caveat
• Aim is to generalize on unseen data and not memorize train data
[Figure: an over-fit curve tuned to noise in the training data]



Back to Linear Regression
• Overfitting is a problem in Linear Regression as well.
• Higher-dimensional fits are hard to visualize, which limits diagnosis by inspection
• Hence a detour through univariate polynomial regression was taken
• To mitigate overfitting and enhance generalization, regularization strategies that constrain the weights from exploding in either direction may be adopted



Regularization
• Constrain the weights from exploding in either direction
• Achieved by finding the parameters (weights) $w$ that minimize
  $J(w) + \lambda \sum_{j=1}^{D} |w_j|^q$
• This formulation constrains a generic $L_q$ norm
• [Figure annotation: sweet spot to choose $\lambda$]
• In practical cases, typically for q = 1 with a high $\lambda$, the weights become sparse
• Most used is q = 2, called Ridge Regression, formulated as:
  $J(w) + \frac{\lambda}{2} \sum_{j=1}^{D} w_j^2$
• (A closed-form sketch follows)
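Ridge regression also has a closed-form solution, obtained by adding $\lambda I$ to $X^T X$ before solving. A minimal sketch, assuming (as is common but not stated on the slide) that the bias term is left unpenalized:

```python
import numpy as np

def fit_ridge(X, y, lam=0.1):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                     # do not penalize the bias w0 (assumption)
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])
print(fit_ridge(X, y, lam=0.1))             # weights shrink slightly towards 0 as lam grows
```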



Basics of Probability
• S – Sample space (set of all possible outcomes of an experiment)
• Event – Subset of S
• Probability of any event A satisfies the following axioms:
  • $0 \le P(A) \le 1$
  • $P(\emptyset) = 0$ [Null event]
  • $P(S) = 1$ [Certain event]
  • $P(A') = P(S) - P(A) = 1 - P(A)$ [Negated event]
Independence, Conditional Probability
• Two events A & B are independent if $P(A \cap B) = P(A)\,P(B)$
• Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
• $P(A \mid B)$ is read as the probability of A given B
• E.g. What is the probability of a brain tumour?
  • Usually low
• What is the probability of a brain tumour given that the patient has reported severe headaches for the past month?
  • The probability (sense of belief) increases due to the evidence observed (a month of headaches)
2 views of Probability
• Frequentist view
  • Repeat the experiment N times
  • Let F be the number of times the event A occurred
  • Compute $P(A) = \frac{F}{N}$
  • Limitation – cannot be used for non-repeating events
    • E.g. Compute the probability of rain at 10 AM this coming Friday
• Bayesian view
  • Uses a degree of belief
  • Looks at other factors and evidence to predict the probability
  • Can incorporate domain expertise, often reported in books as subjective bias
Bayes Theorem
• $P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$
• $P(X \cap Y) = P(X \mid Y)\,P(Y)$ ---- (1)
• $P(Y \mid X) = \frac{P(Y \cap X)}{P(X)}$
• $P(X \cap Y) = P(Y \mid X)\,P(X)$ ------ (2)
• From (1) & (2):
• $P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)$
• $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} = \frac{P(X \mid Y)\,P(Y)}{\sum_Y P(X \mid Y)\,P(Y)}$
• Here $P(Y \mid X)$ is called the posterior probability
• $P(X \mid Y)$ is called the likelihood
• $P(Y)$ is called the prior probability
• $P(X) = \sum_Y P(X \mid Y)\,P(Y)$ is called the total probability
Connecting to the Previous Example
• $P(Y = BrainTumour \mid X = headache)$
  $= \frac{P(X = headache \mid Y = BrainTumour)\,P(Y = BrainTumour)}{P(X = headache \mid Y = BrainTumour)\,P(Y = BrainTumour) + P(X = headache \mid Y = Migraine)\,P(Y = Migraine) + \dots}$
• Here $P(Y = BrainTumour \mid X = headache)$ is called the posterior probability (predicting a brain tumour given that the patient reported a headache)
• $P(X = headache \mid Y = BrainTumour)$ is called the likelihood (probability that a patient diagnosed with a brain tumour would report headache as a symptom)
• $P(Y)$ is called the prior probability (probability that an individual may be affected by a brain tumour)
• $P(X) = \sum_Y P(X \mid Y)\,P(Y)$ is called the total probability (considering all possible conditions where headache is a symptom)
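To make the mechanics concrete, a toy numerical check with invented probabilities (the actual medical numbers are not given on the slides, and only two candidate causes are considered):

```python
# Made-up prior probabilities of two possible causes and made-up likelihoods
# P(headache | cause); real values would come from domain data.
prior = {"tumour": 0.001, "migraine": 0.999}
likelihood = {"tumour": 0.9, "migraine": 0.3}

total = sum(likelihood[c] * prior[c] for c in prior)              # P(headache)
posterior = {c: likelihood[c] * prior[c] / total for c in prior}  # Bayes theorem
print(posterior)   # the tumour posterior rises above its tiny prior but stays small
```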
Machine Learning View
• Y is the target attribute to predict, X is the input
• P(Y|X) = Posterior Probability that the learner predicts Y on observing
input X
• P(X|Y) = Likelihood that an instance of target type Y exhibits input
features X
• P(Y) = Prior Probability of target type Y in a given dataset
• Usually obtained through counting (frequentist view)
• To increase accuracy,
  • Maximize the posterior probability
  • Since the posterior is proportional to the likelihood (times the prior), this amounts to maximizing the likelihood
Revisiting Normal (Gaussian) Distribution
• Most Common distribution
• E.g. Height of individuals, Marks of students, etc.
• Parameters
• Mean 𝜇
• Variance 𝜎 2
• Bell shaped density function given by
  $p(x) = \frac{1}{(2\pi)^{1/2}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• Central limit theorem
  • The (suitably normalized) sum of n independent random variables, drawn from essentially any distribution with finite mean and variance, approaches a normal distribution as n approaches infinity (see the simulation sketch below)
• Think about its relevance to Machine Learning!!
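A quick empirical illustration (my own sketch, with an arbitrary number of replications): averages of n uniform draws, which are far from Gaussian for n = 1, match the normal prediction for the spread of the sample mean as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (1, 5, 50):
    # 10,000 replications of the mean of n Uniform(0,1) samples
    means = rng.uniform(0.0, 1.0, size=(10000, n)).mean(axis=1)
    # Uniform(0,1) has variance 1/12, so the sample mean should have std sqrt(1/(12n))
    print(n, round(means.std(), 4), round((1.0 / (12 * n)) ** 0.5, 4))
```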



Probabilistic View of Linear Regression
• Observed true value
  $y = f(x) + \epsilon$
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the noise term sampled from a Gaussian (normal) distribution – Why?
  • 3$\sigma$ rule
• [Figure: observed values y scattered around the regression line $f(x)$]
• $P(\epsilon) = \frac{1}{(2\pi)^{1/2}\,\sigma} \, e^{-\frac{\epsilon^2}{2\sigma^2}}$



Probabilistic View of Linear Regression
(contd.)
• $P(y_i \mid x_i)$, which drives the prediction, can be assumed to follow any distribution family
• Each assumption yields a variant of regression
• Consistent with the discussion so far,
• Assumption: $P(y_i \mid x_i)$ follows a Gaussian distribution $\mathcal{N}(f(x_i; w), \sigma^2)$
  $P(y_i \mid x_i) = \frac{1}{(2\pi)^{1/2}\,\sigma} \, e^{-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}}$
• Considering all data points,
  $P(y_1, y_2, \dots, y_N \mid x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i)$
  [i.i.d. assumption, i.e. independent and identically distributed]



Maximum Likelihood Estimation
• $\max \; \prod_{i=1}^{N} P(y_i \mid x_i)$
• Standard idea from calculus
• Differentiate and equate the derivative to 0
• Product terms are involved, making differentiation a bit challenging
• Can we modify to sum?
• Yes using log
• Is using log fine?
• Yes because it is a monotonically increasing function
• Maximizing log likelihood is equivalent to maximizing likelihood itself
• So for ease of differentiation, we will maximize log likelihood having
ensured its validity
Maximum Likelihood Estimation (contd.)
• $\max \; \log \prod_{i=1}^{N} P(y_i \mid x_i)$
• $\max \; \sum_{i=1}^{N} \log P(y_i \mid x_i)$
• $\max \; \sum_{i=1}^{N} \log \left[ \frac{1}{(2\pi)^{1/2}\,\sigma} \, e^{-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}} \right]$
• $\max \; \sum_{i=1}^{N} \left( -\log\left[(2\pi)^{1/2}\,\sigma\right] - \frac{(y_i - f(x_i; w))^2}{2\sigma^2} \right)$
• As a – sign is present inside the max expression,
• $\min \; \sum_{i=1}^{N} \left( \log\left[(2\pi)^{1/2}\,\sigma\right] + \frac{(y_i - f(x_i; w))^2}{2\sigma^2} \right)$

The first term amounts to minimizing the noise; the second term is analogous to minimizing the objective function – the Least Squares Error.



What if the multivariate data is inherently
non-linear?
• The sin() function in univariate polynomial regression was inherently non-linear, so the best fit with a straight line was impossible
• What if multivariate regression data is similarly non-linear?
• We cannot visualize it, but a high loss $J(w)$ is observed with the weights obtained from the analytical solution
• In this case the data is transformed using non-linear functions (called kernel functions) and the regression is formulated as a linear combination of these non-linear functions



Non-linear regression
• Representation - $f(x) = w_0 + \sum_{j=1}^{D} w_j \, \phi_j(x)$
• Transformed Data Matrix $\phi(X) = \begin{pmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \dots & \phi_D(x_1) \\ 1 & \phi_1(x_2) & \phi_2(x_2) & \dots & \phi_D(x_2) \\ \vdots & & & & \vdots \\ 1 & \phi_1(x_N) & \phi_2(x_N) & \dots & \phi_D(x_N) \end{pmatrix}$
• Evaluation - $J(w) = \frac{1}{2} \| \phi(X) w - y \|_2^2$
• It can be observed that the closed form analytical solution is:
  $w = \left( \phi(X)^T \phi(X) \right)^{-1} \phi(X)^T y$
• (A sketch with one possible choice of $\phi$ follows)
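A minimal sketch of this idea; the choice of Gaussian bumps as the $\phi_j$, their centres and width, and the toy data are all assumptions for illustration (the slide leaves the non-linear functions unspecified). `np.linalg.lstsq` is used rather than the explicit inverse for numerical stability:

```python
import numpy as np

def gaussian_basis(x, centers, width=0.1):
    # phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2)) for each centre c_j
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def fit_basis_regression(x, y, centers):
    # Columns: 1, phi_1(x), ..., phi_D(x); then a least-squares solve for w
    Phi = np.hstack([np.ones((len(x), 1)), gaussian_basis(x, centers)])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, 30)
centers = np.linspace(0.0, 1.0, 9)          # D = 9 basis functions
print(fit_basis_regression(x, y, centers))  # D + 1 = 10 weights, including the bias
```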



Multivariate Linear Regression with Multiple
Outputs
• Target matrix $Y = \begin{pmatrix} y_{11} & y_{12} & \dots & y_{1K} \\ y_{21} & y_{22} & \dots & y_{2K} \\ \vdots & & & \vdots \\ y_{N1} & y_{N2} & \dots & y_{NK} \end{pmatrix}$
• Weight matrix $W = \begin{pmatrix} w_{10} & w_{20} & \dots & w_{K0} \\ w_{11} & w_{21} & \dots & w_{K1} \\ \vdots & & & \vdots \\ w_{1D} & w_{2D} & \dots & w_{KD} \end{pmatrix}$
• Analytical solution would be $W = (X^T X)^{-1} X^T Y$
• Equivalent to performing K single-output multivariate regressions



References
• The contents of the slides are adapted from the following resources
• PRML by Bishop (Ch 1)
• ESLII by Hastie, Tibshirani & Friedman (Ch 3)
• Introduction to Machine Learning by Alpaydin (Ch 4,5)
• Machine Learning: A Probabilistic Perspective by Murphy (Ch 7)
• Statistical Data Analysis by Cowan (Ch 2)
• Keng’s Blog
• Mathematics for Machine Learning (Ch 5)

