
ML807: Distributed and Federated Learning

Course Overview and Basics

Instructors: Samuel Horvath, Eduard Gorbunov, Praneeth Vepakomma


Semester: Spring 2024

Machine Learning, Mohamed bin Zayed University of Artificial Intelligence


Outline

1. Introduction

2. Motivation

3. Challenges

4. Basics
Statistics
Calculus

5. SGD in Linear Regression

1
Introduction
About Instructors: Samuel Horvath

Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2022 - Present
• MS and PhD @ KAUST, Statistics: 2017 - 2022
• BS @ Comenius University: 2014 - 2017
• Industry experience: Meta AI Research, Samsung AI, Amazon AWS

Research
• Federated Learning
• Distributed Optimization
• Efficient Deep Learning
• Large-scale Machine Learning
• Optimization for Machine Learning

2
About Instructors: Eduard Gorbunov

Brief Bio
• PostDoc @ MBZUAI, Machine Learning, 2022 - Present
• PostDoc @ MILA: 2022
• Junior Researcher @ MIPT: 2020 - 2022
• BS, MS, and PhD @ MIPT, Applied Mathematics: 2014 - 2021
• Industry experience: Yandex, Huawei

Research
• Stochastic Optimization for Machine Learning
• Distributed Optimization
• Derivative-Free Optimization
• Variational Inequalities

3
About Instructors: Praneeth Vepakomma

Brief Bio
• Assistant Professor @ MBZUAI, Machine Learning: 2023 - Present
• PhD @ MIT, Statistics: 2018 - 2023
• Industry experience: Meta, Amazon, Apple, Motorola Solutions, Corning, and several startups

Research
• Distributed ML
• Distributed Statistical Inference
• Privacy
• Statistics
• Randomized Algorithms

4
ML807: Federated Learning

About this class


• This is a graduate course in a new branch of machine learning:
federated learning (FL).
• This course aims for students to become familiar with the field’s
key developments and practices.

Prerequisites
• MTH 701
• Good knowledge of multivariate calculus, linear algebra,
optimization, probability, and algorithms.
• Proficiency in some ML framework, e.g., PyTorch and TensorFlow.

5
ML807: Federated Learning

Goals
• Understand the key concepts, challenges, and issues behind FL,
• Get familiar with key advances and results in FL,
• Prepare and deliver several 45-minute presentations on selected FL papers,
• Write a final project in the form of a paper/report with original applied or theoretical research in the area of FL.

Assessment
• Quizzes – 30%
• Paper presentation – 30%
• Project – 40%
• Active Participation - 5% (Bonus): asking questions and participating in class discussions. Please be interactive!
6
Course Organization

Weeks 1–6 (7)


The first 6 weeks of the course will consist of lectures taught by the instructors.

Weeks 7 (8)–11
In weeks 7-11, students will present, discuss, and summarize papers
on the latest research in federated learning.

Weeks 12–15
In the last 4 weeks, students will present, discuss, and summarize their progress on their course project, chosen during the course.

7
Weeks 1-6

Covered Topics
• Introduction to Distributed and Federated
Learning
• Communication Bottleneck
• Computational and Data Heterogeneity
• Client Selection and Importance Sampling
• Decentralized Learning
• Fairness
• Robustness
• Personalized and Multi-task Learning
• Differential Privacy in FL

8
Quizzes

• There will be three quizzes


• Each quiz accounts for 10% of the overall evaluation
• Quizzes are designed to test the knowledge
of the covered material
• Quizzes consist of 2 types of questions:
• covered theoretical results,
• general understanding: FL system design,
i.e., how to design efficient FL systems,
challenges one needs to consider,
evaluation, etc.

9
Weeks 7-11

Paper Presentation (30%)


• Students’ ability to independently apply the topics learned in class to advanced FL problems will be tested through paper presentations.
• Each student will present 2-3 papers from
the provided list or approved by the
instructor.
• Topics will include all areas of FL, including
but not limited to the material covered
during the first five weeks.
• Evaluation criteria: clarity and coherence of
the content, and clarity of the presentation.

10
Weeks 12-15

Project
• Students select a topic for the project (from the covered materials).
• The topic must be approved by the instructor (relevant and doable).
• Students’ responsibilities: complete the project, present their findings and current progress.
• Main criterion: a substantial literature review (this is an advanced research course).
• Evaluation: clarity, writing, and presentation quality.
• Novelty is not required but preferred (a bonus score).
• A successful project does not need to be of a publishable nature, but it is strongly encouraged.

11
Project Criteria

• Introduction (5%)
• Review of related work (30%)
• Data exploration and preprocessing (10%)
• Methods (25%)
• Discussion of experiments and results (25%)
• Conclusion (5%)
• Novelty – Bonus 10% (Capped at 100%)
• Individual Projects (Possible exceptions)

12
Lectures & Office Hours

• Instructors:
Samuel Horvath ([email protected])
Eduard Gorbunov ([email protected])
Praneeth Vepakomma ([email protected])
• Lectures:
Monday/Tuesday (17:30 - 18:50), Friday (16:00 - 17:20)
Demanding!
• Office hours:
Right after the class or by email appointment

13
Useful Resources: Reading

Review articles
• Advances and open problems in federated learning
[Kairouz et al., 2021]
• A field guide to federated optimization [Wang et al., 2021]
• Federated learning: Challenges, methods, and future directions
[Li et al., 2020]
• Federated learning challenges and opportunities: An outlook
[Ding et al., 2022]

Books
• Federated Learning: A Comprehensive Overview of Methods and
Applications [Ludwig and Baracaldo, 2022]

14
Useful Resources: Software

FL Simulators
• FLSim: github.com/facebookresearch/FLSim
• FL_Pytorch: github.com/burlachenkok/flpytorch
• Simple FL:
github.com/SamuelHorvath/Simple_FL_Simulator

Beyond Simulations–Real-world Deployments


• Flower: https://flower.dev/
• FedML: https://github.com/FedML-AI

15
Useful Resources: FLOW Seminar

Federated Learning One World (FLOW) seminar


• FLOW provides a global online forum for the dissemination of
the latest scientific research results in all aspects of FL:
distributed optimization, learning algorithms, privacy,
cryptography, personalization, . . .
• Focus: the theoretical foundations of the field, applications,
datasets, benchmarking, software, hardware, and systems.

Stats
• Link: https://sites.google.com/view/one-world-seminar-series-flow
• Talks: 60+
• Participants: 1200+

16
Questions?

16
Introduce yourself + Why this course?

16
Motivation
The Past: A Brief History of Machine Learning

• Machine learning is not new. It can be traced as far as the 1950s.


• 1950: Alan Turing posed the idea of a thinking machine.
• 1957: Rosenblatt proposed the Perceptron, precursor to neural
networks
• 1986: Hinton popularised the use of the backpropagation
algorithm for efficiently training neural networks.
• 1998: LeCun et al. created the MNIST dataset and used an SVM to
achieve < 1% classification error.
• Useful link: Jürgen Schmidhuber AI blog

17
The Present: Mainstream Success

ML did not come into the mainstream until the 2000s due to lack of
computing power to train large neural network models and the
scarcity of training data.

• Early 2000s: Amazon launched AWS, which made cloud computing flexible, scalable, and inexpensive.
• 2009–2010: Fei-Fei Li et al. created the massive ImageNet dataset and launched the ILSVRC challenge.
• 2012: Krizhevsky et al. proposed AlexNet, a convolutional neural network-based model to classify ImageNet.

After these advances, ML took off and achieved widespread success in a plethora of applications.

18
The Present: Distributed Machine Learning

ML Pillars:
1. Large-scale training infrastructures
2. The vast amounts of training data

Moore’s law (the performance of processors doubles every 18 months) no longer powers the advancements in Machine Learning / Artificial Intelligence. Over the last couple of years, the computational resources needed to train state-of-the-art models have grown 10−35× every 18 months.

We have to go distributed.

19
The Present: Types of Distributed ML

• Data-parallel Distributed ML: The training dataset is shuffled and split across multiple nodes (servers in the cloud, often equipped with GPUs).
• Model-parallel Distributed ML: Large NNs are split across nodes, and the model parameters are synchronized across them.

We need a system-aware research approach, incorporating concepts from both optimization theory and networking/scheduling.
20
Issues with Distributed ML

• Most data collection happens at edge devices such as cellphones
• It is expensive to send all this data to the cloud for training
• Negative privacy implications of data collection
• Privacy Initiatives:
• GDPR (European Commission)
• Learning with Privacy at Scale (Apple)
• Large carbon footprint of training

What can we do?

21
The Potential Future: Federated Learning

• Federated Learning (FL) brings training to the edge (decentralized)
• Obeys the data locality paradigm, i.e., data are processed where they were collected
• Possibility to lower the carbon footprint [Qiu et al., 2020]
• Originally introduced by Google researchers in 2016
• First works on FL:
• [Konečný et al., 2016]
• [Konečný et al., 2016b]
• [Konečný et al., 2016a]
• [McMahan et al., 2017]

22
Federated Learning

Definition
Federated Learning is a Machine Learning setting in which multiple parties collaborate in solving a machine learning problem, with a central server coordinating the collaboration. The core idea of FL is decentralized learning: the user data is never sent to the server, transferred, or exchanged; it is kept locally. Instead, the learning objective is achieved using focused updates intended for immediate aggregation, to limit privacy leaks.

FL needs an even more specialized, system-aware research approach due to new challenges (more on this later).

23
Applications of Federated Learning

• Commercial applications already in production:
• Apple: “Hey Siri”, QuickType
• Google: “Hey Google”, Gboard
• Next game changer for:
• domains with historically low amounts of data due to privacy constraints
• Smart health applications: Medical Research and Diagnosis
• Fintech applications: Fraud Detection (WeBank)

24
FL settings: Cross-silo Federated Learning

• Clients: Institutions (hospitals, banks, companies, etc.)


• Number of clients: Up to 100
• Amount of data: moderate (can train a meaningful local model)

25
FL settings: Cross-device Federated Learning

• Clients: IoT edge devices (mobile phones, tablets, laptops, smart watches, etc.)
• Number of clients: Up to 10¹⁰
• Amount of data: few data points

26
Challenges
Training Loop for Distributed ML

Repeat until convergence (conceptual sketch):

1. The global model is sent to clients (GPUs, devices, etc.).
2. Clients update local models based on local data (e.g., local steps).
3. Updates are synchronized (aggregation).
4. The global model is updated.

A minimal simulation of this loop is sketched below.

27
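A minimal NumPy sketch of this loop for a linear least-squares model, roughly in the spirit of FedAvg [McMahan et al., 2017]. The synthetic local datasets, the helper names (local_sgd, num_rounds), and all constants are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)
d, num_clients, num_rounds, local_steps, lr = 5, 10, 50, 5, 0.1

# Synthetic local datasets: each client holds (X_k, y_k) from a shared linear model.
w_true = rng.normal(size=d)
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(20, d))
    y = X @ w_true + 0.1 * rng.normal(size=20)
    clients.append((X, y))

def local_sgd(w, X, y, steps, lr):
    """Run a few local gradient steps on one client's least-squares loss."""
    w = w.copy()
    for _ in range(steps):
        g = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * g
    return w

w_global = np.zeros(d)
for r in range(num_rounds):
    # Steps 1-2: broadcast the global model; clients update it locally.
    local_models = [local_sgd(w_global, X, y, local_steps, lr) for X, y in clients]
    # Steps 3-4: aggregate (simple average) and update the global model.
    w_global = np.mean(local_models, axis=0)

print("distance to w_true:", np.linalg.norm(w_global - w_true))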
Challenge 1: Communication Bottleneck (Data Center)

Issue
• In typical distributed computing
architectures, communication is
much slower than computation
• This is a major limiting factor for
many applications

Source: [Sapio et al., 2019] 28


Challenge 1: Communication Bottleneck (FL)

Issue
• Wireless links and other end-user
internet connections are slower than
datacenter links.
• Even for data centres,
communication is already a
bottleneck.

Remedies
• Employing communication compression (a minimal sketch follows below)
• More local work, e.g., local updates, increased mini-batch size
• Limit the number of clients that communicate (partial participation)
29
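As one illustration of communication compression, here is a hedged NumPy sketch of top-k gradient sparsification: only the k largest-magnitude coordinates of an update are transmitted. This is a generic technique sketch; the function name and the choice k = 50 are illustrative, not prescribed by the slides.

import numpy as np

def topk_compress(g, k):
    """Keep the k largest-magnitude entries of g; zero out the rest.

    The sender only needs to transmit k (index, value) pairs instead of the
    full d-dimensional vector, reducing communication per round.
    """
    idx = np.argpartition(np.abs(g), -k)[-k:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse

g = np.random.default_rng(1).normal(size=1_000)
g_hat = topk_compress(g, k=50)   # ~5% of the coordinates are sent
print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))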
Challenge 2: Systems Heterogeneity

Issue
• Clients (GPUs, mobile devices, etc.) can be unreliable and/or
heterogeneous.
• Vast heterogeneity of devices is one of the key challenges for deploying and training models in the “wild.”

Remedies
• Straggler-resilient algorithms
• Asynchronous updates
• Drop low-tier devices (FL)
• Limit the global model’s size
(FL)

30
Challenge 3: Client Availability

Issue
• The standard consideration is that clients are only available when they are on a fast network and connected to a charger.
• This might cause undesired variations that optimization algorithms may need to address.
• It contradicts iid sampling (clients are typically available only during night time).

Credits: Google AI blog 31


Challenge 4: On-device AI

Issue
• On-device inference has been happening for a while now; on-device training is still at a very early stage.
• It requires considerably more compute:
• forward and backward propagation,
• memory for gradients, activations, optimizers.

32
Challenge 5: Limited labels

Issue
• Devices have a lot of data, but
only a few labels
• Next word prediction is
popular (labels for free).

Remedies
• Semi-supervised learning
• Incentives for clients to label
their data

33
Challenge 6: Personalization

Issue
• Objective: Is a single global
model optimal?
• Next word prediction: clearly
not.

Remedies
• Meta learning
• New personalized objectives:
clustering, local optima, etc.

34
Challenge 7: Privacy Guarantees

Issue
• Users’ data do not leave
device, but updates do
• Privacy leaks are possible.

Remedies
• Cryptography: homomorphic encryption, trusted execution environments (TEEs)
• Differential privacy

35
Challenge 8: Adversarial Attacks

Issue
• Federated Learning =
collaboration among
mutually untrusted clients
• Easy target for poisoning
• Heterogeneous datasets: hard to identify malicious clients

Remedies
• Defensive mechanisms: proof
of computations
• Byzantine-robust learning

36
Challenge 9: Incentives to Participate

Issue
• Local Motivation (Reward vs. Cost):
• What is my motivation to
participate? Or to label data?
• What is my local cost (energy,
privacy, . . .)?
• Do I benefit from the federation?
• Local Contribution:
• How much do I contribute to the
federation?
• Ownership of model:
• Community or Enterprise (Google,
Amazon)

Remedies
• Model FL as a market

37
Challenges: Summary

New Challenges
• Unique challenges that require devising new system-aware
efficient optimization techniques
• This involves optimization theory and networking/scheduling
techniques

Goals
• First 11 weeks: Understand techniques to tackle these challenges
• Last 4 weeks: Explore, analyse, and validate new/existing
solutions
• Ultimate goal: Enabling efficient collaborative (distributed) ML

38
Basics
Statistics
Expectation
Let X ∈ R^d be a random variable and g : R^d → R be any function.

1. If X is discrete, then the expectation of g(X) is defined as

   E[g(X)] = Σ_{x ∈ X} g(x) p(x),

   where p is the probability mass function of X and X is the support of X.

2. If X is continuous, then the expectation of g(X) is defined as

   E[g(X)] = ∫_{R^d} g(x) p(x) dx,

   where p is the probability density function of X.

One sometimes writes E_X[·] to emphasize that the expectation is taken with respect to a particular random variable X.
39
Statistics

Properties of Expectation
Let X, Y ∈ R^d be random variables.

1. Linearity: For any constants A, B ∈ R^{m×d} and C ∈ R^m,

   E[AX + BY + C] = A E[X] + B E[Y] + C.

2. Decomposition: If X and Y are independent, then

   E[X⊤Y] = E[X]⊤ E[Y].

   In addition, U = g(X) and V = h(Y) are also independent for any functions g and h.

40
Statistics

Example
• Let X be a Bernoulli random variable, i.e., X = 1 with probability p and X = 0 otherwise.

• What is its expectation?

• Solution:

  E[X] = p · 1 + (1 − p) · 0 = p

41
Statistics

Variance
Let X ∈ R^d be a random variable. Its variance is defined as

  V[X] = E[(X − E[X])⊤(X − E[X])] = E[∥X − E[X]∥²].

Second Moment Decomposition
Let X ∈ R^d be a random variable. Its second moment can be decomposed as

  E[∥X∥²] = V[X] + ∥E[X]∥².    (1)
42
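As a quick sanity check of (1), here is a small NumPy sketch that estimates both sides by Monte Carlo for a Gaussian vector; the sample size and the choice of distribution and mean are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# X ~ N(mu, I): draw many samples and compare both sides of (1).
mu = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d)) + mu

second_moment = np.mean(np.sum(X**2, axis=1))            # E[||X||^2]
variance = np.mean(np.sum((X - X.mean(0))**2, axis=1))   # V[X] = E[||X - E[X]||^2]
print(second_moment, variance + np.sum(mu**2))           # the two should be close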
Statistics

Exercise
Show that (1) holds.

Exercise
Show that

  E[∥X − b∥²] = V[X] + ∥E[X] − b∥².

Exercise
Show that the following equality holds. For any independent random variables X_1, X_2, . . . , X_n ∈ R^d, we have

  E[∥(1/n) Σ_{i=1}^n (X_i − E[X_i])∥²] = (1/n²) Σ_{i=1}^n E[∥X_i − E[X_i]∥²].
43
Statistics
Tower Property
Let X, Y ∈ R^d be random variables. The law of total expectation states that

  E_X[X] = E_Y[E_X[X | Y]].

Example
Suppose that only two factories supply light bulbs to the market. Factory X’s bulbs work for an average of 5000 hours, whereas factory Y’s bulbs work for an average of 4000 hours. It is known that factory X supplies 60% of the total bulbs available. What is the expected length of time that a purchased bulb will work for?

Solution
We condition on the factory and use the tower property:

  E[L] = 5000 · 0.6 + 4000 · 0.4 = 4600


44
Statistics
Markov’s Inequality
Let X ∈ R be a non-negative random variable. Then, for any a ∈ R_+,

  Pr[X > a] ≤ E[X] / a.

Example
Let X_1, X_2, . . . , X_t, . . . ∈ R be a sequence of non-negative random variables that converges to 0 in expectation, i.e., lim_{t→∞} E[X_t] = 0. Show that {X_t}_{t≥1} also converges in probability, i.e., lim_{t→∞} Pr[X_t > a] = 0, ∀a > 0.

Solution
Let a > 0 be fixed. Then

  lim_{t→∞} Pr[X_t > a] ≤ lim_{t→∞} E[X_t]/a = (1/a) lim_{t→∞} E[X_t] = 0.

45
Statistics

Useful Inequalities
Let X, Y ∈ R^d be random variables.

• Jensen’s Inequality: If g : R^d → R is a convex function, then

  E[g(X)] ≥ g(E[X]).

• Hölder’s Inequality: Let p, q ∈ (1, ∞) satisfy 1/p + 1/q = 1. Then,

  E[|X⊤Y|] ≤ E[∥X∥_p^p]^{1/p} E[∥Y∥_q^q]^{1/q}.

• Minkowski’s Inequality: Let p ∈ (1, ∞). Then,

  E[∥X + Y∥_p^p]^{1/p} ≤ E[∥X∥_p^p]^{1/p} + E[∥Y∥_p^p]^{1/p}.

46
Statistics

Exercise
Prove the Cauchy-Schwarz inequality:

  E[|X⊤Y|] ≤ √(E[∥X∥²] E[∥Y∥²]).

Exercise
Show that the covariance can be upper-bounded as follows:

  Cov(X, Y) ≤ √(V[X]) √(V[Y]), where Cov(X, Y) := E[(X − E[X])⊤(Y − E[Y])].

Exercise
Show that if g : R^d → R is a concave function, then

  E[g(X)] ≤ g(E[X]).

47
Convexity

• Convex set: A set X ⊆ R^d is convex if and only if

  αx + (1 − α)y ∈ X,  ∀x, y ∈ X, ∀α ∈ [0, 1].

• Convex function: A function f : R^d → R is convex if and only if

  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),  ∀x, y ∈ dom f, ∀α ∈ [0, 1].

48
Gradient

Gradient
Consider a multivariate function f : R^d → R. The gradient of f is

  ∇f(x) = [∂f/∂x_1, . . . , ∂f/∂x_d]⊤,   [∇f(x)]_i = ∂f(x)/∂x_i  ∀i ∈ [d] = {1, 2, . . . , d}.

∇f(x) points in the direction of the steepest ascent from x (Why?).

Example
• Question: find the gradient of f(x) = x_1² + 3x_1x_2.
• Answer: ∇f(x) = [2x_1 + 3x_2, 3x_1]⊤.

49
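A small, hedged NumPy sketch that checks the gradient from the example numerically via central finite differences; the step size and test point are arbitrary illustrative choices.

import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    # Analytical gradient from the example: [2x1 + 3x2, 3x1]
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.5, -2.0])
print(grad_f(x), numerical_grad(f, x))  # the two should agree closely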
Hessian

Hessian
Consider a multivariate function f : R^d → R. The Hessian of f is the d × d matrix with entries

  [H_f(x)]_{ij} = ∂²f(x)/(∂x_i ∂x_j)  ∀i, j ∈ [d].

H_f(x) captures the curvature of the function at x (Why?).

Example
• Question: find the Hessian of f(x) = x_1² + 3x_1x_2.
• Answer: H_f(x) = [[2, 3], [3, 0]].

50
Hessian

Clairaut’s Theorem
If the second-order partial derivatives of f : R^d → R are continuous at a point x, then

  ∂²f/(∂x_i ∂x_j) = ∂²f/(∂x_j ∂x_i)  ∀i, j ∈ [d].

The Hessian of a convex function is positive semi-definite, that is,

  y⊤ H_f(x) y ≥ 0,  ∀x, y ∈ R^d.

H_f(x) is positive semi-definite if and only if all its eigenvalues are non-negative.

51
Smoothness

Smoothness
A differentiable function f : R^d → R is L-smooth if

  ∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥,  ∀x, y ∈ R^d.

• Smoothness directly leads to the following quadratic upper bound:

  f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ∥y − x∥²,  ∀x, y ∈ R^d.

• Example: Show that the objective f(x) = ∥Ax − b∥² is L-smooth.
• Solution: Exercise (a sketch is given below).

52
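A hedged sketch of one standard argument for the exercise (not necessarily the intended one): the gradient ∇f(x) = 2A⊤(Ax − b) is linear in x, so for any x, y ∈ R^d,

  ∥∇f(x) − ∇f(y)∥ = ∥2A⊤A(x − y)∥ ≤ 2λ_max(A⊤A) ∥x − y∥,

hence f is L-smooth with L = 2λ_max(A⊤A).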
Bregman Divergence
Bregman Divergence
The Bregman divergence associated with f : R^d → R is defined as

  D_f(y, x) = f(y) − f(x) − ⟨∇f(x), y − x⟩.

• The Bregman divergence is not symmetric.

53
Bregman Divergence

• For convex functions, we have

  D_f(x, y) ≥ 0,  ∀x, y ∈ R^d.

• For L-smooth functions, we have

  D_f(x, y) ≤ (L/2) ∥x − y∥²,  ∀x, y ∈ R^d.

54
Strong Convexity
Strong Convexity
We say that a differentiable function f : R^d → R is µ-strongly convex (or µ-convex) if there exists µ > 0 such that

  (µ/2) ∥x − y∥² ≤ D_f(x, y),  ∀x, y ∈ R^d.

• Putting µ-convexity and L-smoothness together leads to

  (µ/2) ∥x − y∥² ≤ D_f(x, y) ≤ (L/2) ∥x − y∥².

55
Cheat-sheet

• A very useful cheat-sheet: https://fa.bianp.net/blog/2017/optimization-inequalities-cheatsheet/ by Fabian Pedregosa (Google)

56
Disclaimer

Neural Networks
• Neural Networks (NNs) are neither L-smooth nor µ-convex globally.
• However, such properties usually hold locally.
• Assuming global properties facilitates analysis.
• NNs (NNs proxy) Optimization: [Du et al., 2018a, Du et al., 2018b, Arora et al., 2019, Ye and Du, 2021]

Exercise
Show that NNs are non-smooth and non-convex.
Hint: Consider an NN with one hidden layer with no bias and activation.

57
Questions?

57
Optimizing Convex Objectives

• Consider a convex objective function f : R^d → R, which depends on a parameter vector x ∈ R^d.
• Our goal is to find x⋆ that minimizes this function, that is,

  x⋆ ∈ arg min_{x ∈ R^d} f(x).

• How to find x⋆?
• Find stationary points of f(x), that is, x⋆ that satisfy ∇f(x⋆) = 0.
• Since f(x) is convex, the stationary point(s) minimize f(x).

Example
• Question: Find x that minimizes f(x_1, x_2) = (x_1 − 3)² + 2(x_2 − 2)².
• Answer: ∇f(x) = [2(x_1 − 3), 4(x_2 − 2)]⊤ = 0
• =⇒ x⋆ = [3, 2]⊤

58
Gradient Descent (GD)

• In many problems, it is difficult to solve for x⋆ that satisfies ∇f(x) = 0.
• Gradient descent (GD) is an iterative method that can help find x⋆. It uses the fact that −∇f(x) is the direction of the steepest descent of f(x). Starting from a random initial point x_0:

  x_{t+1} = x_t − η∇f(x_t),

  for a small learning rate η > 0.

Intuition
• A person standing on a hill covered in fog wants to find their way down the hill. They can feel the slope with their feet and move along the direction of steepest descent.

59
Gradient Descent (GD)

• GD starts from a random initial point x_0 and updates the current iterate x_t as follows:

  x_{t+1} = x_t − η∇f(x_t),

  for a small learning rate η > 0.
• For convex f and small enough η, GD is guaranteed to converge to an optimal x⋆ (more on this later; a small sketch follows below).
• But for non-convex functions, it can get stuck at local minima or saddle points.

60
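A minimal, hedged NumPy sketch of gradient descent on the earlier example f(x_1, x_2) = (x_1 − 3)² + 2(x_2 − 2)²; the learning rate and iteration count are arbitrary illustrative choices.

import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = (x1 - 3)^2 + 2*(x2 - 2)^2
    return np.array([2 * (x[0] - 3), 4 * (x[1] - 2)])

x = np.zeros(2)       # arbitrary initial point x0
eta = 0.1             # small learning rate
for t in range(200):
    x = x - eta * grad_f(x)

print(x)  # should be close to the minimizer [3, 2]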
Concept Check Exercise

• Question: Write the gradient descent update rule for f(x) = 2x_1² + x_2².
• Answer:

  x^{t+1} = x^t − η [4x_1^t, 2x_2^t]⊤.

61
SGD in Linear Regression
Example: Predicting house prices

Sale price ≈ (price per square footage) × (square footage) + (fixed expense)

62
Linear Regression

Setup
• Input: x ∈ R^d (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w_0 + w⊤x.
• w: weights, parameters, or parameter vector.
• w_0: bias.
• Sometimes, we also call ŵ = [w_0, w⊤]⊤ the parameters.
• Training data: D = {(x_i, y_i), i ∈ [n]}

Minimize the empirical risk, which for linear regression is the residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w⊤x_i − y_i)²

63
A simple case: x is just one-dimensional (d = 1)

Residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w_1 x_i − y_i)²

Figure 1: Left: Loss (L) is the sum of squares of the dashed lines. Middle: Adjust (w_0, w_1) to reduce the loss. Right: Loss minimized at (w_0⋆, w_1⋆).

65
A simple case: x is just one-dimensional (d = 1)

Residual sum of squares:

  L(w) = Σ_{i=1}^n (f(x_i) − y_i)² = Σ_{i=1}^n (w_0 + w_1 x_i − y_i)²

Stationary points
Take the derivative with respect to each parameter and set it to zero:

  ∂L(w)/∂w_0 = 0 =⇒ 2 Σ_{i=1}^n (w_0 + w_1 x_i − y_i) = 0,
  ∂L(w)/∂w_1 = 0 =⇒ 2 Σ_{i=1}^n (w_0 + w_1 x_i − y_i) x_i = 0.

66
A simple case: x is just one-dimensional (d = 1)

Simplify these expressions to get the “Normal Equations”:

  Σ_{i=1}^n y_i = n w_0 + w_1 Σ_{i=1}^n x_i,
  Σ_{i=1}^n y_i x_i = w_0 Σ_{i=1}^n x_i + w_1 Σ_{i=1}^n x_i².

Solving the system, we obtain the least squares coefficient estimates:

  w_1⋆ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²   and   w_0⋆ = ȳ − w_1⋆ x̄,

where x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/n) Σ_{i=1}^n y_i.

67
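A small NumPy sketch that computes these closed-form estimates on synthetic one-dimensional data; the data-generating choices (true coefficients, noise level, sample size) are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true w0 = 1.5, w1 = 2.0

# Closed-form least squares estimates from the normal equations.
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)  # should be close to (1.5, 2.0)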
Least Mean Squares when x is d-dimensional

L(w) in matrix form: L(w) = Σ_{i=1}^n (w⊤x_i − y_i)², where we have redefined some variables (by augmenting)

  x = [1, x_1, . . . , x_{d−1}]⊤,   w = [w_0, w_1, . . . , w_{d−1}]⊤.

Design matrix and target vector:

  X = [x_1⊤; . . . ; x_n⊤] ∈ R^{n×d},   y = [y_1, . . . , y_n]⊤ ∈ R^n.

Compact expression:

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const
68
Least-Squares Solution

Compact expression:

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const

Gradients of linear and quadratic functions:

  ∇_x(b⊤x) = b,
  ∇_x(x⊤Ax) = 2Ax   (for symmetric A).

Normal equation:

  ∇_w L(w) = 2X⊤Xw − 2X⊤y = 0.

This leads to the least-mean-squares (LMS) solution

  w_LMS = (X⊤X)^{−1} X⊤y
69
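A minimal NumPy sketch of this closed-form solution on synthetic data; in practice one would typically solve the normal equations (or use np.linalg.lstsq) rather than form an explicit inverse, and the data-generating choices here are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])   # augment with a ones column for the bias
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form LMS solution via the normal equations (solve, not an explicit inverse).
w_lms = np.linalg.solve(X.T @ X, X.T @ y)
print(w_lms)  # should be close to w_true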
Computational complexity

Bottleneck of computing the solution?

  w⋆ = (X⊤X)^{−1} X⊤y

How many operations do we need?

• O(nd²) for the matrix multiplication X⊤X,
• O(d³) (e.g., using Gauss-Jordan elimination) or O(d^{2.373}) (recent theoretical advances) for the matrix inversion of X⊤X,
• O(nd) for the matrix multiplication X⊤y,
• O(d²) for (X⊤X)^{−1} times X⊤y.

In total:
  O(nd² + d³) – impractical for very large d or n.

70
Alternative method: Batch Gradient Descent

(Batch) Gradient Descent

• Initialize w to w_0 (e.g., randomly), choose η, and set t = 0
• Loop until convergence:
  1. Compute the gradient ∇L(w_t) = X⊤(Xw_t − y)
  2. Update the parameters w_{t+1} = w_t − η∇L(w_t)
  3. t ← t + 1

What is the complexity of each iteration?

• O(nd) – much better compared to O(nd² + d³). A small sketch of this loop follows below.

71
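A hedged NumPy sketch of this loop, reusing the kind of synthetic (X, y) setup from the closed-form example above; the learning rate, iteration count, and the factor 2 in the gradient are illustrative choices, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                       # initial point w0
eta = 0.1 / n                         # small learning rate (scaled by n for stability)
for t in range(2_000):
    grad = 2 * X.T @ (X @ w - y)      # O(nd) per iteration
    w = w - eta * grad

print(w)  # should approach the least-squares solution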
Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.
This is because L(w) is a convex function in its parameters w.
Hessian of L(w):

  L(w) = ∥Xw − y∥² = w⊤X⊤Xw − 2(X⊤y)⊤w + const
  =⇒ ∇²L(w) = 2X⊤X

X⊤X is positive semidefinite, because for any v

  v⊤X⊤Xv = ∥Xv∥² ≥ 0.

72
Stochastic gradient descent (SGD)

Widrow-Hoff rule: update the parameters using one example at a time.

• Initialize w to w_0 (e.g., randomly), choose η, and set t = 0
• Loop until convergence (a small sketch follows below):
  1. Choose uniformly at random a training sample x_{i_t}
  2. Compute its contribution to the gradient: g_t = (w_t⊤x_{i_t} − y_{i_t}) x_{i_t}
  3. Update the parameters w_{t+1} = w_t − η g_t
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(nd) for gradient descent versus O(d) for SGD.

Exercise
• Show that g_t is an unbiased estimator of the gradient.

73
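A hedged NumPy sketch of this single-sample SGD loop on the same kind of synthetic linear-regression data; the fixed learning rate and the number of steps are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.01
for t in range(20_000):
    i = rng.integers(n)                    # pick one sample uniformly at random
    g = (X[i] @ w - y[i]) * X[i]           # O(d) stochastic gradient
    w = w - eta * g

print(w)  # noisy, but should end up near the least-squares solution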
SGD versus Batch GD

• Recall the empirical risk loss function that we considered for linear regression:

  L(w) = Σ_{i=1}^n (w⊤x_i − y_i)²

• For large training datasets (large n), computing gradients with respect to each data point is expensive. For example, for one iteration, the batch gradient is

  ∇L(w_t) = 2 Σ_{i=1}^n (w_t⊤x_i − y_i) x_i

• Therefore, we use stochastic gradient descent (SGD), where we choose a random data point x_{i_t} and use L̃(w) = (w⊤x_{i_t} − y_{i_t})² as an approximation to the entire sum.

74
SGD versus Batch GD

• SGD reduces per-iteration complexity from O(nd) to O(d)


• But it is noisier and can take longer to converge

75
Mini-batch SGD

• Mini-batch SGD is between these two extremes (a small sketch follows below).
• In each iteration t, we choose a set S_t of b samples uniformly at random from the n training samples and use

  L̃(w_t) = Σ_{i ∈ S_t} (w_t⊤x_i − y_i)²

  for back-propagation.
• Small b saves per-iteration computing cost but increases noise in the gradients and yields worse error convergence.
• Large b reduces gradient noise and gives better error convergence but increases computing cost per iteration.

76
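A short, hedged variant of the earlier SGD sketch using a mini-batch of size b per iteration; b = 32 and the learning rate are just illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d, b = 500, 4, 32
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.001
for t in range(5_000):
    S = rng.choice(n, size=b, replace=False)       # mini-batch S_t
    g = 2 * X[S].T @ (X[S] @ w - y[S])             # O(bd) mini-batch gradient
    w = w - eta * g

print(w)  # should end up near the least-squares solution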
How to Choose Mini-batch Size

• Small training datasets – use batch gradient descent b = n.


• Large training datasets – typical b are 64, 128, 256, . . ., whatever
fits in the CPU/GPU memory.
• Mini-batch size is another hyperparameter that you have to tune.

77
How to Choose Learning Rate

• SGD Update Rule: w_{t+1} = w_t − η g_t.
• Large η: Faster convergence, but a higher error floor (the flat portion of each curve).
• Small η: Slow convergence, but a lower error floor (the blue curve will eventually go below the red curve).
• To get the best of both worlds, decay η over time (a small sketch follows below).

78
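One common way to decay the learning rate is a step schedule. This is a generic, hedged sketch: the initial rate, decay factor, and interval are illustrative assumptions, not values from the slides.

def step_decay(eta0, t, drop=0.5, every=1000):
    """Step schedule: multiply the initial rate eta0 by `drop` every `every` iterations."""
    return eta0 * (drop ** (t // every))

# Example: eta starts at 0.1 and halves every 1000 SGD iterations.
for t in [0, 999, 1000, 2500, 5000]:
    print(t, step_decay(0.1, t))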
Questions?

78
References i

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019).
Fine-grained analysis of optimization and generalization for
overparameterized two-layer neural networks.
In International Conference on Machine Learning, pages 322–332.
PMLR.
Ding, J., Tramel, E., Sahu, A. K., Wu, S., Avestimehr, S., and Zhang, T.
(2022).
Federated learning challenges and opportunities: An outlook.
In ICASSP 2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8752–8756. IEEE.
Du, S. S., Hu, W., and Lee, J. D. (2018a).
Algorithmic regularization in learning deep homogeneous
models: Layers are automatically balanced.
Advances in Neural Information Processing Systems, 31.
References ii

Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018b).


Gradient descent provably optimizes over-parameterized neural
networks.
arXiv preprint arXiv:1810.02054.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M.,
Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R.,
et al. (2021).
Advances and open problems in federated learning.
Foundations and Trends® in Machine Learning, 14(1–2):1–210.
Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. (2016).
Federated optimization: Distributed machine learning for
on-device intelligence.
arXiv preprint arXiv:1610.02527.
References iii

Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. (2016a).


Federated optimization: Distributed machine learning for
on-device intelligence.
arXiv preprint arXiv:1610.02527.
Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and
Bacon, D. (2016b).
Federated learning: Strategies for improving communication
efficiency.
In NeurIPS Private Multi-Party Machine Learning Workshop.
Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2020).
Federated learning: Challenges, methods, and future directions.
IEEE Signal Processing Magazine, 37(3):50–60.
References iv

Ludwig, H. and Baracaldo, N. (2022).


Federated learning: A comprehensive overview of methods and
applications.
Springer Nature.
McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas,
B. A. (2017).
Communication-efficient learning of deep networks from
decentralized data.
In International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 1273–1282.
Qiu, X., Parcollet, T., Beutel, D. J., Topal, T., Mathur, A., and Lane,
N. D. (2020).
Can federated learning save the planet?
arXiv preprint arXiv:2010.06537.
References v

Sapio, A., Canini, M., Ho, C.-Y., Nelson, J., Kalnis, P., Kim, C.,
Krishnamurthy, A., Moshref, M., Ports, D. R., and Richtárik, P. (2019).

Scaling distributed machine learning with in-network


aggregation.
arXiv preprint arXiv:1903.06701.
Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat,
M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. (2021).
A field guide to federated optimization.
arXiv preprint arXiv:2107.06917.
References vi

Ye, T. and Du, S. S. (2021).


Global convergence of gradient descent for asymmetric
low-rank matrix factorization.
Advances in Neural Information Processing Systems,
34:1429–1439.
