ML807_Distributed_and_Federated_Learning_Slides_1
1. Introduction
2. Motivation
3. Challenges
4. Basics
Statistics
Calculus
1
Introduction
About Instructors: Samuel Horvath
Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2022 - Present
• MS and PhD @ KAUST, Statistics: 2017 - 2022
• BS @ Comenius University: 2014 - 2017
• Industry experience: Meta AI Research, Samsung AI, Amazon AWS
Research
• Federated Learning
• Distributed Optimization
• Efficient Deep Learning
• Large-scale Machine Learning
• Optimization for Machine Learning
2
About Instructors: Eduard Gorbunov
Brief Bio
• PostDoc @ MBZUAI, Machine Learning, 2022 - Present
• PostDoc @ MILA: 2022
• Junior Researcher @ MIPT: 2020 - 2022
• BS, MS, and PhD @ MIPT, Applied Mathematics, 2014 - 2021
• Industry experience: Yandex, Huawei
Research
• Stochastic Optimization for Machine Learning
• Distributed Optimization
• Derivative-Free Optimization
• Variational Inequalities
3
About Instructors: Praneeth Vepakomma
Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2023 - Present
• PhD @ MIT, Statistics: 2018 - 2023
• Industry experience: Meta, Amazon, Apple, Motorola
Solutions, Corning, and several startups
Research
• Distributed ML
• Distributed Statistical Inference
• Privacy
• Statistics
• Randomized Algorithms
4
ML807: Federated Learning
Prerequisites
• MTH 701
• Good knowledge of multivariate calculus, linear algebra,
optimization, probability, and algorithms.
• Proficiency in some ML framework, e.g., PyTorch and TensorFlow.
5
ML807: Federated Learning
Goals
• Understand the key concepts, challenges, and issues behind FL,
• Get familiar with key advances and results in FL,
• Prepare and deliver several 45-minute presentations about the
selected FL papers,
• Write a final project in the form of a paper/report with original
applied or theoretical research in the area of FL.
Assessment
• Quizzes – 30%
• Paper presentation – 30%
• Project – 40%
• Active Participation - 5% (Bonus): asking questions and
participating in class discussions. Please be interactive!
6
Course Organization
Weeks 7 (8)–11
In weeks 7-11, students will present, discuss, and summarize papers
on the latest research in federated learning.
Weeks 12–15
In the last 4 weeks, students will present, discuss, and summarize
progress on their course projects, chosen during the course.
7
Weeks 1-6
Covered Topics
• Introduction to Distributed and Federated
Learning
• Communication Bottleneck
• Computational and Data Heterogeneity
• Client Selection and Importance Sampling
• Decentralized Learning
• Fairness
• Robustness
• Personalized and Multi-task Learning
• Differential Privacy in FL
8
Quizzes
9
Weeks 7-11
10
Weeks 12-15
Project
• Students select a topic for the project (related to the covered materials).
• The topic must be approved by the instructor (relevant and doable).
• Students' responsibilities: complete the project, present their
findings and current progress.
• Main criterion: a substantial literature review (this is an advanced
research course).
• Evaluation: clarity, writing, and presentation quality.
• Novelty is not required but preferred (a bonus score).
• A successful project does not need to be of a publishable
nature, but this is strongly encouraged.
11
Project Criteria
• Introduction (5%)
• Review of related work (30%)
• Data exploration and preprocessing (10%)
• Methods (25%)
• Discussion of experiments and results (25%)
• Conclusion (5%)
• Novelty – Bonus 10% (Capped at 100%)
• Individual Projects (Possible exceptions)
12
Lectures & Office Hours
• Instructors:
Samuel Horvath ([email protected])
Eduard Gorbunov ([email protected])
Praneeth Vepakomma ([email protected])
• Lectures:
Monday/Tuesday (17:30 - 18:50), Friday (16:00 - 17:20)
Demanding!
• Office hours:
Right after the class or by email appointment
13
Useful Resources: Reading
Review articles
• Advances and open problems in federated learning
[Kairouz et al., 2021]
• A field guide to federated optimization [Wang et al., 2021]
• Federated learning: Challenges, methods, and future directions
[Li et al., 2020]
• Federated learning challenges and opportunities: An outlook
[Ding et al., 2022]
Books
• Federated Learning: A Comprehensive Overview of Methods and
Applications [Ludwig and Baracaldo, 2022]
14
Useful Resources: Software
FL Simulators
• FLSim: github.com/facebookresearch/FLSim
• FL_Pytorch: github.com/burlachenkok/flpytorch
• Simple FL:
github.com/SamuelHorvath/Simple_FL_Simulator
15
Useful Resources: FLOW Seminar
Stats
• Link: https://sites.google.com/view/one-world-seminar-series-flow
• Talks: 60+
• Participants: 1200+
16
Questions?
16
Introduce yourself + Why this course?
16
Motivation
The Past: A Brief History of Machine Learning
17
The Present: Mainstream Success
ML did not enter the mainstream until the 2000s, due to the lack of
computing power needed to train large neural network models and
the scarcity of training data.
18
The Present: Distributed Machine Learning
ML Pillars:
1. Large-scale training infrastructures
2. The vast amounts of training data
We have to go distributed.
19
The Present: Types of Distributed ML
20
The Present: Types of Distributed ML
21
Issues with Distributed ML
21
The Potential Future: Federated Learning
22
Federated Learning
Definition
Federated Learning is a machine learning setting in which multiple
parties collaborate to solve a machine learning problem under the
coordination of a central server. The core idea of FL is decentralized
learning: the user data is never sent to the server, transferred, or
exchanged. It is kept locally; instead, the learning objective is
achieved using focused updates intended for immediate aggregation,
to avoid privacy leaks.
23
Applications of Federated Learning
24
FL settings: Cross-silo Federated Learning
25
FL settings: Cross-device Federated Learning
26
Challenges
Training Loop for Distributed ML
27
Challenge 1: Communication Bottleneck (Data Center)
Issue
• In typical distributed computing
architectures, communication is
much slower than computation.
• This is a major limiting factor for
many applications.
28
Challenge 1: Communication Bottleneck (FL)
Issue
• Wireless links and other end-user
internet connections are slower than
datacenter links.
• Even for data centres,
communication is already a
bottleneck.
Remedies
• Employing communication
compression
• More local work, e.g., local updates,
increase mini-batch size
• Limit the number of clients to
communicate (partial participation)
29
Challenge 2: Systems Heterogeneity
Issue
• Clients (GPUs, mobile devices, etc.) can be unreliable and/or
heterogeneous.
• Vast heterogeneity of devices is one of the key challenges for
deploying and training models in the “wild.”
Remedies
• Straggler-resilient algorithms
• Asynchronous updates
• Drop low-tier devices (FL)
• Limit the global model’s size
(FL)
30
Challenge 3: Client Availability
Issue
• The standard consideration is that
clients are only available when they
are on a fast network and connected
to a charger.
• This might cause undesired
variations that optimization
algorithms may need to address.
• It contradicts iid sampling (clients
are typically available only during
night time).
31
Challenge 4: On-device Training
Issue
• On-device inference has been happening for a while now;
on-device training is still at a very early stage.
• It requires considerably more compute:
• forward and backward propagation,
• memory for gradients, activations, optimizers.
32
Challenge 5: Limited labels
Issue
• Devices have a lot of data, but
only a few labels
• Next word prediction is
popular (labels for free).
Remedies
• Semi-supervised learning
• Incentives for clients to label
their data
33
Challenge 6: Personalization
Issue
• Objective: Is a single global
model optimal?
• Next word prediction: clearly
not.
Remedies
• Meta learning
• New personalized objectives:
clustering, local optima, etc.
34
Challenge 7: Privacy Guarantees
Issue
• Users’ data do not leave the
device, but updates do.
• Privacy leaks are possible.
Remedies
• Cryptography: homomorphic
encryption, trusted execution
environments (TEEs)
• Differential privacy
35
Challenge 8: Adversarial Attacks
Issue
• Federated Learning =
collaboration among
mutually untrusted clients
• Easy target for poisoning
• Heterogeneous datasets: hard
to identify malicious clients
Remedies
• Defensive mechanisms: proof
of computations
• Byzantine-robust learning
36
Challenge 9: Incentives to Participate
Issue
• Local Motivation (Reward vs. Cost):
• What is my motivation to
participate? Or to label data?
• What is my local cost (energy,
privacy, . . .)?
• Do I benefit from the federation?
• Local Contribution:
• How much do I contribute to the
federation?
• Ownership of model:
• Community or Enterprise (Google,
Amazon)
Remedies
• Model FL as a market
37
Challenges: Summary
New Challenges
• Unique challenges that require devising new system-aware
efficient optimization techniques
• This involves optimization theory and networking/scheduling
techniques
Goals
• First 11 weeks: Understand techniques to tackle these challenges
• Last 4 weeks: Explore, analyse, and validate new/existing
solutions
• Ultimate goal: Enabling efficient collaborative (distributed) ML
38
Basics
Statistics
Expectation
Let X ∈ Rd be a random variable and g : Rd → R be any function.
The expectation of g(X) is
E [g(X)] = Σ_x g(x) Pr [X = x] (discrete X), or
E [g(X)] = ∫ g(x) p(x) dx (continuous X with density p).
Properties of Expectation
Let X, Y ∈ Rd be random variables. Expectation is linear: for any a, b ∈ R,
E [aX + bY] = a E [X] + b E [Y].
40
Statistics
Example
• Let X be a Bernoulli random variable, i.e.,
X = 1 w.p. p, X = 0 otherwise.
E [X] = p · 1 + (1 − p) · 0 = p.
41
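As a quick sanity check, E [X] = p can be estimated by simulation; a minimal sketch in Python (numpy and the value p = 0.3 are illustrative choices, not part of the slides):

```python
import numpy as np

# Monte Carlo check of E[X] = p for a Bernoulli random variable:
# the sample mean of many 0/1 draws should be close to p.
rng = np.random.default_rng(0)
p = 0.3
samples = rng.binomial(n=1, p=p, size=200_000)
estimate = samples.mean()
print(estimate)  # close to 0.3
```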
Statistics
Variance
Let X ∈ Rd be a random variable. Its variance is defined as
V [X] = E [(X − E [X])⊤ (X − E [X])] = E [∥X − E [X]∥²] = E [∥X∥²] − ∥E [X]∥². (1)
42
Statistics
Exercise
Show that (1) holds.
Exercise
Show that
E [∥X − b∥²] = V [X] + ∥E [X] − b∥².
Exercise
Show that the following equality holds. For any independent random
variables X1, X2, . . . , Xn ∈ Rd, we have
E [∥ (1/n) Σ_{i=1}^n (Xi − E [Xi]) ∥²] = (1/n²) Σ_{i=1}^n E [∥Xi − E [Xi]∥²].
43
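The identity E [∥X − b∥²] = V [X] + ∥E [X] − b∥² can be verified numerically; on an empirical sample it holds exactly with the sample mean in place of E [X]. A sketch (the Gaussian data and the vector b are arbitrary illustrative choices):

```python
import numpy as np

# Numerical check of E||X - b||^2 = V[X] + ||E[X] - b||^2.
# With expectations replaced by sample averages, both sides
# coincide exactly (up to floating-point rounding).
rng = np.random.default_rng(1)
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=2.0, size=(500_000, 3))
b = np.array([0.3, 0.7, -1.1])

lhs = np.mean(np.sum((X - b) ** 2, axis=1))
mean = X.mean(axis=0)
variance = np.mean(np.sum((X - mean) ** 2, axis=1))
rhs = variance + np.sum((mean - b) ** 2)
print(lhs, rhs)  # the two sides agree
```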
Statistics
Tower Property
Let X, Y ∈ Rd be random variables. The law of total expectation
states that
E [X] = E [E [X | Y]].
Example
Suppose that only two factories supply light bulbs to the market.
Factory X’s bulbs work for an average of 5000 hours, whereas factory
Y’s bulbs work for an average of 4000 hours. It is known that factory
X supplies 60% of the total bulbs available. What is the expected
length of time that a purchased bulb will work for?
Solution
We condition on the factory and use the tower property:
E [T] = 0.6 · 5000 + 0.4 · 4000 = 4600 hours.
44
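The light-bulb example can also be checked by simulation; the sketch below assumes exponential lifetimes purely for illustration (the tower property only needs the conditional means):

```python
import numpy as np

# Simulate the light-bulb example: factory X (60%, mean 5000 h),
# factory Y (40%, mean 4000 h). Exponential lifetimes are an
# assumption made only for this simulation.
rng = np.random.default_rng(2)
n = 1_000_000
from_x = rng.random(n) < 0.6
lifetime = np.where(from_x,
                    rng.exponential(5000.0, n),
                    rng.exponential(4000.0, n))
print(lifetime.mean())  # close to 0.6*5000 + 0.4*4000 = 4600
```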
Statistics
Markov’s Inequality
Let X ∈ R be a non-negative random variable. Then, for any a ∈ R+,
Pr [X > a] ≤ E [X] / a.
Example
Let X1, X2, . . . , Xt, . . . ∈ R be a sequence of non-negative random
variables that converges to 0 in expectation, i.e., lim_{t→∞} E [Xt] = 0.
Show that {Xt}_{t=1}^∞ also converges in probability, i.e.,
lim_{t→∞} Pr [Xt > a] = 0, ∀a > 0.
Solution
Let a > 0 be fixed. Then
lim_{t→∞} Pr [Xt > a] ≤ lim_{t→∞} E [Xt] / a = (1/a) lim_{t→∞} E [Xt] = 0.
45
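Markov's inequality is easy to check empirically; a sketch on exponential samples (the distribution and the values of a are illustrative choices):

```python
import numpy as np

# Check Markov's inequality Pr[X > a] <= E[X]/a on exponential
# samples with E[X] = 2; the bound is loose but always holds.
rng = np.random.default_rng(3)
X = rng.exponential(scale=2.0, size=100_000)
for a in [1.0, 2.0, 5.0, 10.0]:
    empirical = np.mean(X > a)
    bound = X.mean() / a
    print(a, empirical, bound)  # empirical <= bound
```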
Statistics
Useful Inequalities
Let X, Y ∈ Rd be random variables.
46
Statistics
Exercise
Prove the Cauchy-Schwarz inequality:
E [|X⊤Y|] ≤ √(E [∥X∥²] E [∥Y∥²]).
Exercise
Show that the covariance can be upper-bounded as follows:
Cov(X, Y) ≤ √(V [X] V [Y]), where Cov(X, Y) := E [(X − E [X])⊤ (Y − E [Y])].
Exercise
Show that if g : Rd → R is a concave function, then
E [g(X)] ≤ g(E [X]) (Jensen’s inequality).
47
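The covariance bound Cov(X, Y) ≤ √(V [X] V [Y]) can be checked on simulated data; a sketch (the construction of correlated X and Y is an arbitrary illustration):

```python
import numpy as np

# Check Cov(X, Y) <= sqrt(V[X] V[Y]) on correlated Gaussian vectors.
rng = np.random.default_rng(4)
Z = rng.normal(size=(200_000, 3))
X = Z
Y = 0.8 * Z + 0.6 * rng.normal(size=(200_000, 3))  # correlated with X

mx, my = X.mean(axis=0), Y.mean(axis=0)
cov = np.mean(np.sum((X - mx) * (Y - my), axis=1))
vx = np.mean(np.sum((X - mx) ** 2, axis=1))
vy = np.mean(np.sum((Y - my) ** 2, axis=1))
print(cov, np.sqrt(vx * vy))  # cov <= sqrt(vx * vy)
```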
Convexity
Definition
A differentiable function f : Rd → R is convex if
f(x) ≥ f(y) + ⟨∇f(y), x − y⟩, ∀x, y ∈ Rd.
48
Gradient
Gradient
Consider a multivariate function f : Rd → R. The gradient of f is the
vector of partial derivatives
∇f(x) = [∂f(x)/∂x1, . . . , ∂f(x)/∂xd]⊤, [∇f(x)]i = ∂f(x)/∂xi ∀i ∈ [d] = {1, 2, . . . , d}.
49
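A gradient computed by hand can be validated with central finite differences; a sketch on a simple quadratic (the test function and evaluation point are illustrative choices):

```python
import numpy as np

# Finite-difference check of the gradient of
# f(x) = (x1 - 3)^2 + 2*(x2 - 2)^2, whose gradient is
# [2*(x1 - 3), 4*(x2 - 2)].
def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] - 2.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] - 2.0)])

x = np.array([1.0, -1.0])
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)  # central difference
    for e in np.eye(2)
])
print(numeric, grad_f(x))  # both close to [-4, -12]
```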
Hessian
Hessian
Consider a multivariate function f : Rd → R. The Hessian of f is the
d × d matrix of second-order partial derivatives
[Hf(x)]ij = ∂²f(x)/∂xi∂xj ∀i, j ∈ [d],
with entries ranging from ∂²f/∂x1∂x1 in the top-left corner to
∂²f/∂xd∂xd in the bottom-right corner.
50
Hessian
Clairaut’s Theorem
If the second-order partial derivatives of f : Rd → R are continuous
at a point x, then
∂²f/∂xi∂xj = ∂²f/∂xj∂xi ∀i, j ∈ [d].
A twice-differentiable f is convex if and only if its Hessian is positive
semi-definite:
y⊤Hf(x)y ≥ 0, ∀x, y ∈ Rd.
51
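Clairaut's theorem predicts a symmetric Hessian; a finite-difference sketch on an illustrative smooth function (the function and evaluation point are arbitrary choices):

```python
import numpy as np

# Finite-difference Hessian of f(x) = x1^2 * x2 + sin(x2).
# Clairaut's theorem predicts the mixed partials agree,
# so the matrix should come out symmetric.
def f(x):
    return x[0] ** 2 * x[1] + np.sin(x[1])

def hessian_fd(f, x, eps=1e-5):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            # 4-point central difference for d^2 f / dx_i dx_j
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian_fd(f, np.array([1.0, 2.0]))
print(H)  # H[0, 1] == H[1, 0] up to finite-difference error
```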
Smoothness
Smoothness
A differentiable function f : Rd → R is L-smooth if
∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥, ∀x, y ∈ Rd.
52
Bregman Divergence
Bregman Divergence
The Bregman divergence associated with a differentiable function
f : Rd → R is defined as
Df(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.
53
Bregman Divergence
54
Strong Convexity
Strong Convexity
We say that a differentiable function f : Rd → R is µ-strongly
convex (or µ-convex) if there exists µ > 0 such that
(µ/2) ∥x − y∥² ≤ Df(x, y), ∀x, y ∈ Rd.
55
Cheat-sheet
56
Disclaimer
Neural Networks
• Neural Networks (NNs) are neither L-smooth nor µ-convex
globally.
• However, such properties usually hold locally.
• Assuming global properties facilitates analysis.
• NNs (NNs proxy) Optimization: [Du et al., 2018a, Du et al., 2018b,
Arora et al., 2019, Ye and Du, 2021]
Exercise
Show that NNs are non-smooth and non-convex.
Hint: Consider a NN with one hidden layer, with no bias and no activation.
57
Questions?
57
Optimizing Convex Objectives
• How to find x⋆?
• Find stationary points of f(x), that is, x⋆ that satisfies ∇f(x⋆) = 0.
• Since f(x) is convex, the stationary point(s) minimize f(x).
Example
• Question: Find x that minimizes f(x1, x2) = (x1 − 3)² + 2(x2 − 2)²
• Answer: ∇f(x) = [2(x1 − 3), 4(x2 − 2)]⊤ = 0
• =⇒ x⋆ = [3, 2]⊤
58
Gradient Descent (GD)
GD update with learning rate (step size) η > 0:
xt+1 = xt − η∇f(xt).
Intuition
• A person standing on a hill covered in fog wants to find their way
down the hill. They can feel the slope with their feet and move
along the direction of steepest descent.
59
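The GD update can be run on the quadratic f(x1, x2) = (x1 − 3)² + 2(x2 − 2)² from the earlier example; a minimal sketch (the learning rate η = 0.1 and iteration count are illustrative, not tuned):

```python
import numpy as np

# Gradient descent on f(x1, x2) = (x1 - 3)^2 + 2*(x2 - 2)^2
# with a constant learning rate; iterates approach x* = [3, 2].
def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] - 2.0)])

eta = 0.1
x = np.zeros(2)
for _ in range(200):
    x = x - eta * grad_f(x)  # x_{t+1} = x_t - eta * grad f(x_t)
print(x)  # close to [3, 2]
```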
Gradient Descent (GD)
60
Concept Check Exercise
61
SGD in Linear Regression
Example: Predicting house prices
62
Linear Regression
Setup
• Input: x ∈ Rd (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w0 + w⊤ x.
• w: weights, parameters, or parameter vector.
• w0 : bias.
• Sometimes, we also call ŵ = [w0 , w⊤ ]⊤ parameters.
• Training data: D = {(xi , yi ), i ∈ [n]}
63
A simple case: x is just one-dimensional (d = 1)
64
A simple case: x is just one-dimensional (d = 1)
Figure 1: Left: Loss (L) is the sum of squares of the dashed lines. Middle:
Adjust (w0, w1) to reduce loss. Right: Loss minimized at (w⋆0, w⋆1).
65
A simple case: x is just one-dimensional (d = 1)
Stationary points
Take derivatives with respect to the parameters and set them to zero:
∂L(w)/∂w0 = 0 =⇒ 2 Σ_{i=1}^n (w0 + w1 xi − yi) = 0,
∂L(w)/∂w1 = 0 =⇒ 2 Σ_{i=1}^n (w0 + w1 xi − yi) xi = 0.
66
A simple case: x is just one-dimensional (d = 1)
Solving the two stationarity equations gives
w1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)², w0 = ȳ − w1 x̄,
where x̄ = (1/n) Σ_{i=1}^n xi and ȳ = (1/n) Σ_{i=1}^n yi.
67
Least Mean Squares when x is d-dimensional
L(w) in matrix form: L(w) = Σ_{i=1}^n (w⊤xi − yi)², where we have
redefined some variables (by augmenting):
xi ← [1, xi⊤]⊤, w ← [w0, w⊤]⊤.
Compact expression:
L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const,
where X has rows xi⊤ and y = [y1, . . . , yn]⊤.
68
Least-Squares Solution
Compact expression:
L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const
Useful identities:
∇x(b⊤x) = b,
∇x(x⊤Ax) = 2Ax (symmetric A).
Normal equation
∇L(w) = 2X⊤Xw − 2X⊤y = 0 =⇒ X⊤Xw = X⊤y
=⇒ w⋆ = (X⊤X)⁻¹X⊤y (when X⊤X is invertible).
70
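The normal equation X⊤Xw = X⊤y can be solved directly and compared with numpy's least-squares solver; a sketch on synthetic data (sizes and noise level are illustrative choices):

```python
import numpy as np

# Solve the normal equation X^T X w = X^T y and compare with
# numpy's least-squares routine.
rng = np.random.default_rng(6)
n, d = 200, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # augmented with 1s column
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_normal)  # agrees with w_lstsq
```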
Computational complexity
• Forming X⊤X: O(nd²)
• Forming X⊤y: O(nd)
• Solving the d × d linear system: O(d³)
In total:
O(nd² + d³) – impractical for very large d or n.
70
Alternative method: Batch Gradient Descent
71
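Batch gradient descent for L(w) = ∥Xw − y∥² uses the gradient 2X⊤(Xw − y); a minimal sketch (the step size 1/(2∥X∥₂²) and iteration count are illustrative choices, not a tuned recommendation):

```python
import numpy as np

# Batch GD on L(w) = ||Xw - y||^2 with gradient 2 X^T (Xw - y).
# Noiseless data, so the true weights are the exact minimizer.
rng = np.random.default_rng(7)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

w = np.zeros(d)
eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / (2 * largest eigenvalue of X^T X)
for _ in range(2000):
    w = w - eta * 2.0 * X.T @ (X @ w - y)
print(w)  # recovers w_star
```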
Why would this work?
The objective L(w) is convex: its Hessian is 2X⊤X, and for any v ∈ Rd,
v⊤X⊤Xv = ∥Xv∥² ≥ 0,
so the Hessian is positive semi-definite and any stationary point is a
global minimizer.
72
Stochastic gradient descent (SGD)
Write L(w) = (1/n) Σ_{i=1}^n ℓi(w) with ℓi(w) = (w⊤xi − yi)²
(rescaling by 1/n does not change the minimizer). At each step t,
sample it ∈ [n] uniformly at random and update
gt = ∇ℓit(wt), wt+1 = wt − η gt.
Exercise
• Show that gt is an unbiased estimator of the gradient, i.e.,
E [gt] = ∇L(wt).
73
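Both the unbiasedness claim and the SGD recursion can be checked numerically; a sketch on noiseless synthetic data, assuming L(w) is the average of the per-example losses (step size and iteration counts are illustrative):

```python
import numpy as np

# SGD on the average loss L(w) = (1/n) sum_i (w^T x_i - y_i)^2.
rng = np.random.default_rng(8)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_star = np.array([2.0, -1.0, 0.5])
y = X @ w_star  # noiseless, so w_star is the exact minimizer

# Unbiasedness check at w = 0: the average of per-example
# gradients equals the full gradient of L.
w = np.zeros(d)
per_example = 2.0 * (X @ w - y)[:, None] * X      # row i = grad ell_i(w)
full_grad = 2.0 * X.T @ (X @ w - y) / n           # grad L(w)
print(np.max(np.abs(per_example.mean(axis=0) - full_grad)))  # ~0

# SGD run with a small constant step size.
for t in range(20_000):
    i = rng.integers(n)
    g = 2.0 * (X[i] @ w - y[i]) * X[i]
    w = w - 0.01 * g
print(w)  # close to w_star
```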
SGD versus Batch GD
74
SGD versus Batch GD
75
Mini-batch SGD
• Each update uses a mini-batch of b samples
for back-propagation.
• Small b saves per-iteration computing cost but increases noise
in the gradients and yields worse error convergence.
• Large b reduces gradient noise and gives better error
convergence but increases computing cost per iteration.
76
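The batch-size trade-off can be made concrete by measuring the squared error of the mini-batch gradient estimator for several values of b; a sketch (all sizes and the number of trials are illustrative choices):

```python
import numpy as np

# Empirical gradient noise vs. batch size b: the variance of the
# mini-batch gradient estimator shrinks roughly like 1/b.
rng = np.random.default_rng(9)
n, d = 10_000, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = rng.normal(size=d)

full_grad = 2.0 * X.T @ (X @ w - y) / n

def minibatch_grad(b):
    idx = rng.integers(n, size=b)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / b

variances = {}
for b in [1, 10, 100]:
    errs = [np.sum((minibatch_grad(b) - full_grad) ** 2) for _ in range(2000)]
    variances[b] = np.mean(errs)
print(variances)  # variance drops as b grows
```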
How to Choose Mini-batch Size
77
How to Choose Learning Rate
78
Questions?
78
References i
Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019).
Fine-grained analysis of optimization and generalization for
overparameterized two-layer neural networks.
In International Conference on Machine Learning, pages 322–332.
PMLR.
Ding, J., Tramel, E., Sahu, A. K., Wu, S., Avestimehr, S., and Zhang, T.
(2022).
Federated learning challenges and opportunities: An outlook.
In ICASSP 2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8752–8756. IEEE.
Du, S. S., Hu, W., and Lee, J. D. (2018a).
Algorithmic regularization in learning deep homogeneous
models: Layers are automatically balanced.
Advances in Neural Information Processing Systems, 31.
References ii
Sapio, A., Canini, M., Ho, C.-Y., Nelson, J., Kalnis, P., Kim, C.,
Krishnamurthy, A., Moshref, M., Ports, D. R., and Richtárik, P. (2019).