
ML807: Distributed and Federated Learning

Course Overview and Basics

Instructors: Samuel Horvath, Eduard Gorbunov, Praneeth Vepakomma


Semester: Spring 2024

Machine Learning, Mohamed bin Zayed University of Artificial Intelligence


Outline

1. Introduction

2. Motivation

3. Challenges

4. Basics
Statistics
Calculus

5. SGD in Linear Regression

1
Introduction
About Instructors: Samuel Horvath

Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2022 - Present
• MS and PhD @ KAUST, Statistics: 2017 - 2022
• BS @ Comenius University: 2014 - 2017
• Industry experience: Meta AI Research, Samsung AI, Amazon AWS

Research
• Federated Learning
• Distributed Optimization
• Efficient Deep Learning
• Large-scale Machine Learning
• Optimization for Machine Learning

2
About Instructors: Eduard Gorbunov

Brief Bio
• PostDoc @ MBZUAI, Machine Learning, 2022 - Present
• PostDoc @ MILA: 2022
• Junior Researcher @ MIPT: 2020 - 2022
• BS, MS, and PhD @ MIPT, Applied Mathematics: 2014 - 2021
• Industry experience: Yandex, Huawei

Research
• Stochastic Optimization for Machine Learning
• Distributed Optimization
• Derivative-Free Optimization
• Variational Inequalities

3
About Instructors: Praneeth Vepakomma

Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2023 - Present
• PhD @ MIT, Statistics: 2018 - 2023
• Industry experience: Meta, Amazon, Apple, Motorola Solutions, Corning, and several startups

Research
• Distributed ML
• Distributed Statistical Inference
• Privacy
• Statistics
• Randomized Algorithms

4
ML807: Federated Learning

About this class


• This is a graduate course in a new branch of machine learning:
federated learning (FL).
• This course aims for students to become familiar with the field’s
key developments and practices.

Prerequisites
• MTH 701
• Good knowledge of multivariate calculus, linear algebra,
optimization, probability, and algorithms.
• Proficiency in some ML framework, e.g., PyTorch and TensorFlow.

5
ML807: Federated Learning

Goals
• Understand the key concepts, challenges, and issues behind FL,
• Get familiar with key advances and results in FL,
• Prepare and deliver several 45-minute presentations on selected FL papers,
• Write a final project in the form of a paper/report with original applied or theoretical research in the area of FL.

Assessment
• Quizzes – 30%
• Paper presentation – 30%
• Project – 40%
• Active Participation - 5% (Bonus): asking questions and participating in class discussions. Please be interactive!
6
Course Organization

Weeks 1–6 (7)


The first 6 weeks of the course will consist of lectures taught by the instructors.

Weeks 7 (8)–11
In weeks 7-11, students will present, discuss, and summarize papers
on the latest research in federated learning.

Weeks 12–15
In the last 4 weeks, students will present, discuss, and summarize their progress on their course project, chosen during the course.

7
Weeks 1-6

Covered Topics
• Introduction to Distributed and Federated
Learning
• Communication Bottleneck
• Computational and Data Heterogeneity
• Client Selection and Importance Sampling
• Decentralized Learning
• Fairness
• Robustness
• Personalized and Multi-task Learning
• Differential Privacy in FL

8
Quizzes

• There will be three quizzes


• Each quiz accounts for 10% of the overall evaluation
• Quizzes are designed to test the knowledge
of the covered material
• Quizzes consist of 2 types of questions:
• covered theoretical results,
• general understanding: FL system design,
i.e., how to design efficient FL systems,
challenges one needs to consider,
evaluation, etc.

9
Weeks 7-11

Paper Presentation (30%)


• Students’ ability to independently apply the topics learned in class to advanced FL problems will be tested through paper presentations.
• Each student will present 2-3 papers from
the provided list or approved by the
instructor.
• Topics will include all areas of FL, including
but not limited to the material covered
during the first five weeks.
• Evaluation criteria: clarity and coherence of
the content, and clarity of the presentation.

10
Weeks 12-15

Project
• Students select a topic for the project (from the covered materials).
• The topic must be approved by the instructor (relevant and doable).
• Students’ responsibilities: complete the project, present their findings and current progress.
• Main criterion: a substantial literature review (this is an advanced research course).
• Evaluation: clarity, writing, and presentation quality.
• Novelty is not required but preferred (a bonus score).
• A successful project does not need to be of a publishable nature, but it is strongly encouraged.

11
Project Criteria

• Introduction (5%)
• Review of related work (30%)
• Data exploration and preprocessing (10%)
• Methods (25%)
• Discussion of experiments and results (25%)
• Conclusion (5%)
• Novelty – Bonus 10% (Capped at 100%)
• Individual Projects (Possible exceptions)

12
Lectures & Office Hours

• Instructors:
Samuel Horvath ([email protected])
Eduard Gorbunov ([email protected])
Praneeth Vepakomma ([email protected])
• Lectures:
Monday/Tuesday (17:30 - 18:50), Friday (16:00 - 17:20)
Demanding!
• Office hours:
Right after the class or by email appointment

13
Useful Resources: Reading

Review articles
• Advances and open problems in federated learning
[Kairouz et al., 2021]
• A field guide to federated optimization [Wang et al., 2021]
• Federated learning: Challenges, methods, and future directions
[Li et al., 2020]
• Federated learning challenges and opportunities: An outlook
[Ding et al., 2022]

Books
• Federated Learning: A Comprehensive Overview of Methods and
Applications [Ludwig and Baracaldo, 2022]

14
Useful Resources: Software

FL Simulators
• FLSim: github.com/facebookresearch/FLSim
• FL_Pytorch: github.com/burlachenkok/flpytorch
• Simple FL:
github.com/SamuelHorvath/Simple_FL_Simulator

Beyond Simulations–Real-world Deployments


• Flower: https://flower.dev/
• FedML: https://github.com/FedML-AI

15
Useful Resources: FLOW Seminar

Federated Learning One World (FLOW) seminar


• FLOW provides a global online forum for the dissemination of
the latest scientific research results in all aspects of FL:
distributed optimization, learning algorithms, privacy,
cryptography, personalization, . . .
• Focus: the theoretical foundations of the field, applications,
datasets, benchmarking, software, hardware, and systems.

Stats
• Link: https://sites.google.com/view/one-world-seminar-series-flow
• Talks: 60+
• Participants: 1200+

16
Questions?

16
Introduce yourself + Why this course?

16
Motivation
The Past: A Brief History of Machine Learning

• Machine learning is not new. It can be traced as far as the 1950s.


• 1950: Alan Turing posed the idea of a thinking machine.
• 1957: Rosenblatt proposed the Perceptron, precursor to neural
networks
• 1986: Hinton popularised the use of the backpropagation
algorithm for efficiently training neural networks.
• 1998: LeCun et al. created the MNIST dataset and used an SVM to
achieve < 1% classification error.
• Useful link: Jürgen Schmidhuber AI blog

17
The Present: Mainstream Success

ML did not come into the mainstream until the 2000s due to lack of
computing power to train large neural network models and the
scarcity of training data.

• Early 2000s: Amazon launched AWS, which made cloud computing flexible, scalable, and inexpensive.
• 2009–2010: Fei-Fei Li et al. created the massive ImageNet dataset and launched the ILSVRC challenge.
• 2012: Krizhevsky et al. proposed AlexNet, a convolutional neural network-based model to classify ImageNet.

After these advances, ML took off and achieved widespread success in a plethora of applications.

18
The Present: Distributed Machine Learning

ML Pillars:
1. Large-scale training infrastructures
2. The vast amounts of training data

Moore’s law (the performance of processors doubles every 18 months) no longer powers the advancements in Machine Learning / Artificial Intelligence. Over the last couple of years, the computational resources needed to train state-of-the-art models have grown 10−35× every 18 months.

We have to go distributed.

19
The Present: Types of Distributed ML

• Data-parallel Distributed ML: The training dataset is shuffled and split across multiple nodes (servers in the cloud, often equipped with GPUs).
• Model-parallel Distributed ML: Large NNs are split across nodes, and the model parameters are synchronized across them.

We need a system-aware research approach, incorporating concepts from both optimization theory and networking/scheduling.
20
Issues with Distributed ML

• Most data collection happens at edge devices such as cellphones
• It is expensive to send all this data to the cloud for training
• Negative privacy implications of data collection
• Privacy Initiatives:
• GDPR (European Commission)
• Learning with Privacy at Scale (Apple)
• Large carbon footprint of training

What can we do?

21
The Potential Future: Federated Learning

• Federated Learning (FL) brings training to the edge (decentralized)
• Obeys the data locality paradigm, i.e., data are processed where they were collected
• Possibility to lower the carbon footprint [Qiu et al., 2020]
• Originally introduced by Google researchers in 2016
• First works on FL:
• [Konečný et al., 2016]
• [Konečný et al., 2016b]
• [Konečný et al., 2016a]
• [McMahan et al., 2017]

22
Federated Learning

Definition
Federated Learning is a Machine Learning setting in which multiple parties collaborate in solving a machine learning problem, with a central server coordinating the collaboration. The core idea of FL is decentralized learning: the user data is never sent to the server, transferred, or exchanged; it is kept locally. Instead, the learning objective is achieved using focused updates intended for immediate aggregation, to limit privacy leaks.

FL needs an even more specialized, system-aware research approach due to new challenges (more on this later).

23
Applications of Federated Learning

• Commercial applications already in production:
• Apple: “Hey Siri”, QuickType
• Google: “Hey Google”, Gboard
• Next game changer for:
• domains with historically low amounts of data due to privacy constraints
• Smart health applications: Medical Research and Diagnosis
• Fintech applications: Fraud Detection (WeBank)

24
FL settings: Cross-silo Federated Learning

• Clients: Institutions (hospitals, banks, companies, etc.)


• Number of clients: Up to 100
• Amount of data: moderate (can train a meaningful local model)

25
FL settings: Cross-device Federated Learning

• Clients: IoT edge devices (mobile phones, tablets, laptops, smart watches, etc.)
• Number of clients: Up to 10¹⁰
• Amount of data: few data points

26
Challenges
Training Loop for Distributed ML

Repeat until convergence (conceptual sketch):

1. The global model is sent to clients (GPUs, devices, etc.).
2. Clients update local models based on local data (e.g., local steps).
3. Updates are synchronized (aggregation).
4. The global model is updated.

A minimal simulation of this loop is sketched below.

27
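A minimal NumPy sketch of this loop for a linear least-squares model, roughly in the spirit of FedAvg [McMahan et al., 2017]. The synthetic local datasets, the helper names (local_sgd, num_rounds), and all constants are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)
d, num_clients, num_rounds, local_steps, lr = 5, 10, 50, 5, 0.1

# Synthetic local datasets: each client holds (X_k, y_k) from a shared linear model.
w_true = rng.normal(size=d)
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(20, d))
    y = X @ w_true + 0.1 * rng.normal(size=20)
    clients.append((X, y))

def local_sgd(w, X, y, steps, lr):
    """Run a few local gradient steps on one client's least-squares loss."""
    w = w.copy()
    for _ in range(steps):
        g = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * g
    return w

w_global = np.zeros(d)
for r in range(num_rounds):
    # Steps 1-2: broadcast the global model; clients update it locally.
    local_models = [local_sgd(w_global, X, y, local_steps, lr) for X, y in clients]
    # Steps 3-4: aggregate (simple average) and update the global model.
    w_global = np.mean(local_models, axis=0)

print("distance to w_true:", np.linalg.norm(w_global - w_true))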
Challenge 1: Communication Bottleneck (Data Center)

Issue
• In typical distributed computing
architectures, communication is
much slower than computation
• This is a major limiting factor for
many applications

Source: [Sapio et al., 2019] 28


Challenge 1: Communication Bottleneck (FL)

Issue
• Wireless links and other end-user
internet connections are slower than
datacenter links.
• Even for data centres,
communication is already a
bottleneck.

Remedies
• Employing communication compression (a minimal sketch follows below)
• More local work, e.g., local updates, increased mini-batch size
• Limit the number of clients that communicate (partial participation)
29
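As one illustration of communication compression, here is a hedged NumPy sketch of top-k gradient sparsification: only the k largest-magnitude coordinates of an update are transmitted. This is a generic technique sketch; the function name and the choice k = 50 are illustrative, not prescribed by the slides.

import numpy as np

def topk_compress(g, k):
    """Keep the k largest-magnitude entries of g; zero out the rest.

    The sender only needs to transmit k (index, value) pairs instead of the
    full d-dimensional vector, reducing communication per round.
    """
    idx = np.argpartition(np.abs(g), -k)[-k:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse

g = np.random.default_rng(1).normal(size=1_000)
g_hat = topk_compress(g, k=50)   # ~5% of the coordinates are sent
print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))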
Challenge 2: Systems Heterogeneity

Issue
• Clients (GPUs, mobile devices, etc.) can be unreliable and/or
heterogeneous.
• Vast heterogeneity of devices is one of the key challenges for deploying and training models in the “wild.”

Remedies
• Straggler-resilient algorithms
• Asynchronous updates
• Drop low-tier devices (FL)
• Limit the global model’s size
(FL)

30
Challenge 3: Client Availability

Issue
• The standard consideration is that clients are only available when they are on a fast network and connected to a charger.
• This might cause undesired variations that optimization algorithms may need to address.
• It contradicts iid sampling (clients are typically available only during night time).

Credits: Google AI blog 31


Challenge 4: On-device AI

Issue
• On-device inference has been happening for a while now; on-device training is still at a very early stage.
• It requires considerably more compute:
• forward and backward propagation,
• memory for gradients, activations, optimizers.

32
Challenge 5: Limited labels

Issue
• Devices have a lot of data, but
only a few labels
• Next word prediction is
popular (labels for free).

Remedies
• Semi-supervised learning
• Incentives for clients to label
their data

33
Challenge 6: Personalization

Issue
• Objective: Is a single global
model optimal?
• Next word prediction: clearly
not.

Remedies
• Meta learning
• New personalized objectives:
clustering, local optima, etc.

34
Challenge 7: Privacy Guarantees

Issue
• Users’ data do not leave
device, but updates do
• Privacy leaks are possible.

Remedies
• Cryptography: homomorphic encryption, trusted execution environments (TEEs)
• Differential privacy

35
Challenge 8: Adversarial Attacks

Issue
• Federated Learning =
collaboration among
mutually untrusted clients
• Easy target for poisoning
• Heterogeneous datasets: hard to identify malicious clients

Remedies
• Defensive mechanisms: proof
of computations
• Byzantine-robust learning

36
Challenge 9: Incentives to Participate

Issue
• Local Motivation (Reward vs. Cost):
• What is my motivation to
participate? Or to label data?
• What is my local cost (energy,
privacy, . . .)?
• Do I benefit from the federation?
• Local Contribution:
• How much do I contribute to the
federation?
• Ownership of model:
• Community or Enterprise (Google,
Amazon)

Remedies
• Model FL as a market

37
Challenges: Summary

New Challenges
• Unique challenges that require devising new system-aware
efficient optimization techniques
• This involves optimization theory and networking/scheduling
techniques

Goals
• First 11 weeks: Understand techniques to tackle these challenges
• Last 4 weeks: Explore, analyse, and validate new/existing
solutions
• Ultimate goal: Enabling efficient collaborative (distributed) ML

38
Basics
Statistics
Expectation
Let X ∈ R^d be a random variable and g : R^d → R be any function.

1. If X is discrete, then the expectation of g(X) is defined as

   E[g(X)] = Σ_{x ∈ X} g(x) p(x),

   where p is the probability mass function of X and X is the support of X.

2. If X is continuous, then the expectation of g(X) is defined as

   E[g(X)] = ∫_{R^d} g(x) p(x) dx,

   where p is the probability density function of X.

One sometimes writes E_X[·] to emphasize that the expectation is taken with respect to a particular random variable X.
39
Statistics

Properties of Expectation
Let X, Y ∈ R^d be random variables.

1. Linearity: For any constants A, B ∈ R^{m×d} and C ∈ R^m,

   E[AX + BY + C] = A E[X] + B E[Y] + C.

2. Decomposition: If X and Y are independent, then

   E[X⊤Y] = E[X]⊤ E[Y].

   In addition, U = g(X) and V = h(Y) are also independent for any functions g and h.

40
Statistics

Example
• Let X be a Bernoulli random variable, i.e., X = 1 with probability p and X = 0 otherwise.

• What is its expectation?

• Solution:

  E[X] = p · 1 + (1 − p) · 0 = p

41
Statistics

Variance
Let X ∈ R^d be a random variable. Its variance is defined as

  V[X] = E[(X − E[X])⊤(X − E[X])] = E[∥X − E[X]∥²].

Second Moment Decomposition
Let X ∈ R^d be a random variable. Its second moment can be decomposed as

  E[∥X∥²] = V[X] + ∥E[X]∥².    (1)
42
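As a quick sanity check of (1), here is a small NumPy sketch that estimates both sides by Monte Carlo for a Gaussian vector; the sample size and the choice of distribution and mean are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# X ~ N(mu, I): draw many samples and compare both sides of (1).
mu = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d)) + mu

second_moment = np.mean(np.sum(X**2, axis=1))            # E[||X||^2]
variance = np.mean(np.sum((X - X.mean(0))**2, axis=1))   # V[X] = E[||X - E[X]||^2]
print(second_moment, variance + np.sum(mu**2))           # the two should be close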
Statistics

Exercise
Show that (1) holds.

Exercise
Show that

  E[∥X − b∥²] = V[X] + ∥E[X] − b∥².

Exercise
Show that the following equality holds. For any independent random variables X_1, X_2, . . . , X_n ∈ R^d, we have

  E[∥(1/n) Σ_{i=1}^n (X_i − E[X_i])∥²] = (1/n²) Σ_{i=1}^n E[∥X_i − E[X_i]∥²].
43
Statistics
Tower Property
Let X, Y ∈ R^d be random variables. The law of total expectation states that

  E_X[X] = E_Y[E_X[X | Y]].

Example
Suppose that only two factories supply light bulbs to the market. Factory X’s bulbs work for an average of 5000 hours, whereas factory Y’s bulbs work for an average of 4000 hours. It is known that factory X supplies 60% of the total bulbs available. What is the expected length of time that a purchased bulb will work for?

Solution
We condition on the factory and use the tower property:

  E[L] = 5000 · 0.6 + 4000 · 0.4 = 4600


44
Statistics
Markov’s Inequality
Let X ∈ R be a non-negative random variable. Then, for any a ∈ R_+,

  Pr[X > a] ≤ E[X] / a.

Example
Let X_1, X_2, . . . , X_t, . . . ∈ R be a sequence of non-negative random variables that converges to 0 in expectation, i.e., lim_{t→∞} E[X_t] = 0. Show that {X_t}_{t≥1} also converges in probability, i.e., lim_{t→∞} Pr[X_t > a] = 0, ∀a > 0.

Solution
Let a > 0 be fixed. Then

  lim_{t→∞} Pr[X_t > a] ≤ lim_{t→∞} E[X_t]/a = (1/a) lim_{t→∞} E[X_t] = 0.

45
Statistics

Useful Inequalities
Let X, Y ∈ R^d be random variables.

• Jensen’s Inequality: If g : R^d → R is a convex function, then

  E[g(X)] ≥ g(E[X]).

• Hölder’s Inequality: Let p, q ∈ (1, ∞) satisfy 1/p + 1/q = 1. Then,

  E[|X⊤Y|] ≤ E[∥X∥_p^p]^{1/p} E[∥Y∥_q^q]^{1/q}.

• Minkowski’s Inequality: Let p ∈ (1, ∞). Then,

  E[∥X + Y∥_p^p]^{1/p} ≤ E[∥X∥_p^p]^{1/p} + E[∥Y∥_p^p]^{1/p}.

46
Statistics

Exercise
Prove the Cauchy-Schwarz inequality:

  E[|X⊤Y|] ≤ √(E[∥X∥²] E[∥Y∥²]).

Exercise
Show that the covariance can be upper-bounded as follows:

  Cov(X, Y) ≤ √(V[X]) √(V[Y]), where Cov(X, Y) := E[(X − E[X])⊤(Y − E[Y])].

Exercise
Show that if g : R^d → R is a concave function, then

  E[g(X)] ≤ g(E[X]).

47
Convexity

• Convex set: A set X ⊆ R^d is convex if and only if

  αx + (1 − α)y ∈ X,  ∀x, y ∈ X, ∀α ∈ [0, 1].

• Convex function: A function f : R^d → R is convex if and only if

  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),  ∀x, y ∈ dom f, ∀α ∈ [0, 1].

48
Gradient

Gradient
Consider a multivariate function f : R^d → R. The gradient of f is

  ∇f(x) = [∂f/∂x_1, . . . , ∂f/∂x_d]⊤,   [∇f(x)]_i = ∂f(x)/∂x_i  ∀i ∈ [d] = {1, 2, . . . , d}.

∇f(x) points in the direction of the steepest ascent from x (Why?).

Example
• Question: find the gradient of f(x) = x_1² + 3x_1x_2.
• Answer: ∇f(x) = [2x_1 + 3x_2, 3x_1]⊤.

49
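A small, hedged NumPy sketch that checks the gradient from the example numerically via central finite differences; the step size and test point are arbitrary illustrative choices.

import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    # Analytical gradient from the example: [2x1 + 3x2, 3x1]
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.5, -2.0])
print(grad_f(x), numerical_grad(f, x))  # the two should agree closely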
Hessian

Hessian
Consider a multivariate function f : R^d → R. The Hessian of f is the d × d matrix with entries

  [H_f(x)]_{ij} = ∂²f(x)/(∂x_i ∂x_j)  ∀i, j ∈ [d].

H_f(x) captures the curvature of the function at x (Why?).

Example
• Question: find the Hessian of f(x) = x_1² + 3x_1x_2.
• Answer: H_f(x) = [[2, 3], [3, 0]].

50
Hessian

Clairaut’s Theorem
If the second-order partial derivatives of f : R^d → R are continuous at a point x, then

  ∂²f/(∂x_i ∂x_j) = ∂²f/(∂x_j ∂x_i)  ∀i, j ∈ [d].

The Hessian of a convex function is positive semi-definite, that is,

  y⊤ H_f(x) y ≥ 0,  ∀x, y ∈ R^d.

H_f(x) is positive semi-definite if and only if all its eigenvalues are non-negative.

51
Smoothness

Smoothness
A differentiable function f : R^d → R is L-smooth if

  ∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥,  ∀x, y ∈ R^d.

• Smoothness directly leads to the following quadratic upper bound:

  f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ∥y − x∥²,  ∀x, y ∈ R^d.

• Example: Show that the objective f(x) = ∥Ax − b∥² is L-smooth.
• Solution: Exercise (a sketch is given below).

52
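A hedged sketch of one standard argument for the exercise (not necessarily the intended one): the gradient ∇f(x) = 2A⊤(Ax − b) is linear in x, so for any x, y ∈ R^d,

  ∥∇f(x) − ∇f(y)∥ = ∥2A⊤A(x − y)∥ ≤ 2λ_max(A⊤A) ∥x − y∥,

hence f is L-smooth with L = 2λ_max(A⊤A).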
Bregman Divergence
Bregman Divergence
The Bregman divergence associated with f : R^d → R is defined as

  D_f(y, x) = f(y) − f(x) − ⟨∇f(x), y − x⟩.

• The Bregman divergence is not symmetric.

53
Bregman Divergence

• For convex functions, we have

  D_f(x, y) ≥ 0,  ∀x, y ∈ R^d.

• For L-smooth functions, we have

  D_f(x, y) ≤ (L/2) ∥x − y∥²,  ∀x, y ∈ R^d.

54
Strong Convexity
Strong Convexity
We say that a differentiable function f : R^d → R is µ-strongly convex (or µ-convex) if there exists µ > 0 such that

  (µ/2) ∥x − y∥² ≤ D_f(x, y),  ∀x, y ∈ R^d.

• Putting µ-convexity and L-smoothness together leads to

  (µ/2) ∥x − y∥² ≤ D_f(x, y) ≤ (L/2) ∥x − y∥².

55
Cheat-sheet

• A very useful cheat-sheet: https://fa.bianp.net/blog/2017/optimization-inequalities-cheatsheet/ by Fabian Pedregosa (Google)

56
Disclaimer

Neural Networks
• Neural Networks (NNs) are neither L-smooth nor µ-convex globally.
• However, such properties usually hold locally.
• Assuming global properties facilitates analysis.
• NNs (NNs proxy) Optimization: [Du et al., 2018a, Du et al., 2018b, Arora et al., 2019, Ye and Du, 2021]

Exercise
Show that NNs are non-smooth and non-convex.
Hint: Consider an NN with one hidden layer with no bias and activation.

57
Questions?

57
Optimizing Convex Objectives

• Consider a convex objective function f : R^d → R, which depends on a parameter vector x ∈ R^d.
• Our goal is to find x⋆ that minimizes this function, that is,

  x⋆ ∈ arg min_{x ∈ R^d} f(x).

• How to find x⋆?
• Find stationary points of f(x), that is, x⋆ that satisfy ∇f(x⋆) = 0.
• Since f(x) is convex, the stationary point(s) minimize f(x).

Example
• Question: Find x that minimizes f(x_1, x_2) = (x_1 − 3)² + 2(x_2 − 2)².
• Answer: ∇f(x) = [2(x_1 − 3), 4(x_2 − 2)]⊤ = 0
• =⇒ x⋆ = [3, 2]⊤

58
Gradient Descent (GD)

• In many problems, it is difficult to solve for x⋆ that satisfies ∇f(x) = 0.
• Gradient descent (GD) is an iterative method that can help find x⋆. It uses the fact that −∇f(x) is the direction of the steepest descent of f(x). Starting from a random initial point x_0:

  x_{t+1} = x_t − η∇f(x_t),

  for a small learning rate η > 0.

Intuition
• A person standing on a hill covered in fog wants to find their way down the hill. They can feel the slope with their feet and move along the direction of steepest descent.

59
Gradient Descent (GD)

• GD starts from a random initial point x_0 and updates the current iterate x_t as follows:

  x_{t+1} = x_t − η∇f(x_t),

  for a small learning rate η > 0.
• For convex f and small enough η, GD is guaranteed to converge to an optimal x⋆ (more on this later; a small sketch follows below).
• But for non-convex functions, it can get stuck at local minima or saddle points.

60
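A minimal, hedged NumPy sketch of gradient descent on the earlier example f(x_1, x_2) = (x_1 − 3)² + 2(x_2 − 2)²; the learning rate and iteration count are arbitrary illustrative choices.

import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = (x1 - 3)^2 + 2*(x2 - 2)^2
    return np.array([2 * (x[0] - 3), 4 * (x[1] - 2)])

x = np.zeros(2)       # arbitrary initial point x0
eta = 0.1             # small learning rate
for t in range(200):
    x = x - eta * grad_f(x)

print(x)  # should be close to the minimizer [3, 2]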
Concept Check Exercise

• Question: Write the gradient descent update rule for f(x) = 2x_1² + x_2².
• Answer:

  x^{t+1} = x^t − η [4x_1^t, 2x_2^t]⊤.

61
SGD in Linear Regression
Example: Predicting house prices

Sale price ≈ (price per square footage) × (square footage) + (fixed expense)

62
Linear Regression

Setup
• Input: x ∈ R^d (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w_0 + w⊤x.
• w: weights, parameters, or parameter vector.
• w_0: bias.
• Sometimes, we also call ŵ = [w_0, w⊤]⊤ the parameters.
• Training data: D = {(x_i, y_i), i ∈ [n]}

Minimize the empirical risk, which for linear regression is the residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w⊤x_i − y_i)²

63
A simple case: x is just one-dimensional (d = 1)

Residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w_1 x_i − y_i)²

Figure 1: Left: Loss (L) is the sum of squares of the dashed lines. Middle: Adjust (w_0, w_1) to reduce the loss. Right: Loss minimized at (w_0⋆, w_1⋆).

65
A simple case: x is just one-dimensional (d = 1)

Residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w_1 x_i − y_i)²

Stationary points
Take the derivative with respect to each parameter and set it to zero:

  ∂L(w)/∂w_0 = 0 =⇒ 2 Σ_{i=1}^n (w_0 + w_1 x_i − y_i) = 0,
  ∂L(w)/∂w_1 = 0 =⇒ 2 Σ_{i=1}^n (w_0 + w_1 x_i − y_i) x_i = 0.

66
A simple case: x is just one-dimensional (d = 1)

Simplify these expressions to get the “Normal Equations”:

  Σ_{i=1}^n y_i = n w_0 + w_1 Σ_{i=1}^n x_i,
  Σ_{i=1}^n y_i x_i = w_0 Σ_{i=1}^n x_i + w_1 Σ_{i=1}^n x_i².

Solving the system, we obtain the least squares coefficient estimates:

  w_1⋆ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²   and   w_0⋆ = ȳ − w_1⋆ x̄,

where x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/n) Σ_{i=1}^n y_i.

67
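A small NumPy sketch that computes these closed-form estimates on synthetic one-dimensional data; the data-generating choices (true coefficients, noise level, sample size) are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true w0 = 1.5, w1 = 2.0

# Closed-form least squares estimates from the normal equations.
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)  # should be close to (1.5, 2.0)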
Least Mean Squares when x is d-dimensional

L(w) in matrix form: L(w) = Σ_{i=1}^n (w⊤x_i − y_i)², where we have redefined some variables (by augmenting)

  x = [1, x_1, . . . , x_{d−1}]⊤,   w = [w_0, w_1, . . . , w_{d−1}]⊤.

Design matrix and target vector:

  X = [x_1⊤; . . . ; x_n⊤] ∈ R^{n×d},   y = [y_1, . . . , y_n]⊤ ∈ R^n.

Compact expression:

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const
68
Least-Squares Solution

Compact expression:

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const

Gradients of linear and quadratic functions:

  ∇_x(b⊤x) = b,
  ∇_x(x⊤Ax) = 2Ax   (for symmetric A).

Normal equation:

  ∇_w L(w) = 2X⊤Xw − 2X⊤y = 0.

This leads to the least-mean-squares (LMS) solution

  w_LMS = (X⊤X)^{−1} X⊤y
69
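A minimal NumPy sketch of this closed-form solution on synthetic data; in practice one would typically solve the normal equations (or use np.linalg.lstsq) rather than form an explicit inverse, and the data-generating choices here are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])   # augment with a ones column for the bias
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form LMS solution via the normal equations (solve, not an explicit inverse).
w_lms = np.linalg.solve(X.T @ X, X.T @ y)
print(w_lms)  # should be close to w_true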
Computational complexity

Bottleneck of computing the solution?

  w⋆ = (X⊤X)^{−1} X⊤y

How many operations do we need?

• O(nd²) for the matrix multiplication X⊤X,
• O(d³) (e.g., using Gauss-Jordan elimination) or O(d^{2.373}) (recent theoretical advances) for the matrix inversion of X⊤X,
• O(nd) for the matrix multiplication X⊤y,
• O(d²) for (X⊤X)^{−1} times X⊤y.

In total:
  O(nd² + d³) – impractical for very large d or n.

70
Alternative method: Batch Gradient Descent

(Batch) Gradient Descent

• Initialize w to w_0 (e.g., randomly), choose η, and set t = 0
• Loop until convergence:
  1. Compute the gradient ∇L(w_t) = X⊤(Xw_t − y)
  2. Update the parameters w_{t+1} = w_t − η∇L(w_t)
  3. t ← t + 1

What is the complexity of each iteration?

• O(nd) – much better compared to O(nd² + d³). A small sketch of this loop follows below.

71
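A hedged NumPy sketch of this loop, reusing the kind of synthetic (X, y) setup from the closed-form example above; the learning rate, iteration count, and the factor 2 in the gradient are illustrative choices, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                       # initial point w0
eta = 0.1 / n                         # small learning rate (scaled by n for stability)
for t in range(2_000):
    grad = 2 * X.T @ (X @ w - y)      # O(nd) per iteration
    w = w - eta * grad

print(w)  # should approach the least-squares solution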
Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.
This is because L(w) is a convex function in its parameters w.
Hessian of L(w):

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const
  =⇒ ∇²L(w) = 2X⊤X

X⊤X is positive semidefinite, because for any v

  v⊤X⊤Xv = ∥Xv∥² ≥ 0.

72
Stochastic gradient descent (SGD)

Widrow-Hoff rule: update the parameters using one example at a time.

• Initialize w to w_0 (e.g., randomly), choose η, and set t = 0
• Loop until convergence (a small sketch follows below):
  1. Choose uniformly at random a training sample x_{i_t}
  2. Compute its contribution to the gradient: g_t = (w_t⊤x_{i_t} − y_{i_t}) x_{i_t}
  3. Update the parameters w_{t+1} = w_t − η g_t
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(nd) for gradient descent versus O(d) for SGD.

Exercise
• Show that g_t is an unbiased estimator of the gradient.

73
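A hedged NumPy sketch of this single-sample SGD loop on the same kind of synthetic linear-regression data; the fixed learning rate and the number of steps are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.01
for t in range(20_000):
    i = rng.integers(n)                    # pick one sample uniformly at random
    g = (X[i] @ w - y[i]) * X[i]           # O(d) stochastic gradient
    w = w - eta * g

print(w)  # noisy, but should end up near the least-squares solution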
SGD versus Batch GD

• Recall the empirical risk loss function that we considered for linear regression:

  L(w) = Σ_{i=1}^n (w⊤x_i − y_i)²

• For large training datasets (large n), computing gradients with respect to each data point is expensive. For example, for one iteration, the batch gradient is

  ∇L(w_t) = 2 Σ_{i=1}^n (w_t⊤x_i − y_i) x_i

• Therefore, we use stochastic gradient descent (SGD), where we choose a random data point x_{i_t} and use L̃(w) = (w⊤x_{i_t} − y_{i_t})² as an approximation to the entire sum.

74
SGD versus Batch GD

• SGD reduces per-iteration complexity from O(nd) to O(d)


• But it is noisier and can take longer to converge

75
Mini-batch SGD

• Mini-batch SGD is between these two extremes (a small sketch follows below).
• In each iteration t, we choose a set S_t of b samples uniformly at random from the n training samples and use

  L̃(w_t) = Σ_{i ∈ S_t} (w_t⊤x_i − y_i)²

  for back-propagation.
• Small b saves per-iteration computing cost but increases noise in the gradients and yields worse error convergence.
• Large b reduces gradient noise and gives better error convergence but increases computing cost per iteration.

76
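A short, hedged variant of the earlier SGD sketch using a mini-batch of size b per iteration; b = 32 and the learning rate are just illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d, b = 500, 4, 32
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.001
for t in range(5_000):
    S = rng.choice(n, size=b, replace=False)       # mini-batch S_t
    g = 2 * X[S].T @ (X[S] @ w - y[S])             # O(bd) mini-batch gradient
    w = w - eta * g

print(w)  # should end up near the least-squares solution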
How to Choose Mini-batch Size

• Small training datasets – use batch gradient descent b = n.


• Large training datasets – typical b are 64, 128, 256, . . ., whatever
fits in the CPU/GPU memory.
• Mini-batch size is another hyperparameter that you have to tune.

77
How to Choose Learning Rate

• SGD Update Rule: w_{t+1} = w_t − η g_t.
• Large η: Faster convergence, but a higher error floor (the flat portion of each curve).
• Small η: Slow convergence, but a lower error floor (the blue curve will eventually go below the red curve).
• To get the best of both worlds, decay η over time (a small sketch follows below).

78
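One common way to decay the learning rate is a step schedule. This is a generic, hedged sketch: the initial rate, decay factor, and interval are illustrative assumptions, not values from the slides.

def step_decay(eta0, t, drop=0.5, every=1000):
    """Step schedule: multiply the initial rate eta0 by `drop` every `every` iterations."""
    return eta0 * (drop ** (t // every))

# Example: eta starts at 0.1 and halves every 1000 SGD iterations.
for t in [0, 999, 1000, 2500, 5000]:
    print(t, step_decay(0.1, t))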
Questions?

78
References i

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019).
Fine-grained analysis of optimization and generalization for
overparameterized two-layer neural networks.
In International Conference on Machine Learning, pages 322–332.
PMLR.
Ding, J., Tramel, E., Sahu, A. K., Wu, S., Avestimehr, S., and Zhang, T.
(2022).
Federated learning challenges and opportunities: An outlook.
In ICASSP 2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8752–8756. IEEE.
Du, S. S., Hu, W., and Lee, J. D. (2018a).
Algorithmic regularization in learning deep homogeneous
models: Layers are automatically balanced.
Advances in Neural Information Processing Systems, 31.
References ii

Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018b).


Gradient descent provably optimizes over-parameterized neural
networks.
arXiv preprint arXiv:1810.02054.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M.,
Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R.,
et al. (2021).
Advances and open problems in federated learning.
Foundations and Trends® in Machine Learning, 14(1–2):1–210.
Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. (2016).
Federated optimization: Distributed machine learning for
on-device intelligence.
arXiv preprint arXiv:1610.02527.
References iii

Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. (2016a).


Federated optimization: Distributed machine learning for
on-device intelligence.
arXiv preprint arXiv:1610.02527.
Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and
Bacon, D. (2016b).
Federated learning: Strategies for improving communication
efficiency.
In NeurIPS Private Multi-Party Machine Learning Workshop.
Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2020).
Federated learning: Challenges, methods, and future directions.
IEEE Signal Processing Magazine, 37(3):50–60.
References iv

Ludwig, H. and Baracaldo, N. (2022).


Federated learning: A comprehensive overview of methods and
applications.
Springer Nature.
McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas,
B. A. (2017).
Communication-efficient learning of deep networks from
decentralized data.
In International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 1273–1282.
Qiu, X., Parcollet, T., Beutel, D. J., Topal, T., Mathur, A., and Lane,
N. D. (2020).
Can federated learning save the planet?
arXiv preprint arXiv:2010.06537.
References v

Sapio, A., Canini, M., Ho, C.-Y., Nelson, J., Kalnis, P., Kim, C.,
Krishnamurthy, A., Moshref, M., Ports, D. R., and Richtárik, P. (2019).

Scaling distributed machine learning with in-network


aggregation.
arXiv preprint arXiv:1903.06701.
Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat,
M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. (2021).
A field guide to federated optimization.
arXiv preprint arXiv:2107.06917.
References vi

Ye, T. and Du, S. S. (2021).


Global convergence of gradient descent for asymmetric
low-rank matrix factorization.
Advances in Neural Information Processing Systems,
34:1429–1439.
