SGD

The document provides an introduction to Stochastic Gradient Descent (SGD) as an optimization technique for minimizing risk functions in machine learning. It explains the iterative process of gradient descent, the concept of subgradients for nondifferentiable functions, and outlines the SGD algorithm along with its variants and applications. Additionally, it discusses the importance of projection steps and the performance guarantees of SGD in convex-Lipschitz bounded problems.

Uploaded by

shahpanav8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

SGD

The document provides an introduction to Stochastic Gradient Descent (SGD) as an optimization technique for minimizing risk functions in machine learning. It explains the iterative process of gradient descent, the concept of subgradients for nondifferentiable functions, and outlines the SGD algorithm along with its variants and applications. Additionally, it discusses the importance of projection steps and the performance guarantees of SGD in convex-Lipschitz bounded problems.

Uploaded by

shahpanav8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

DS303: Introduction to Machine Learning

Stochastic Gradient Descent

Manjesh K. Hanawal
SGD Introduction

▶ Gradient descent is an iterative optimization procedure in which, at each step, we improve the solution by taking a step along the negative of the gradient of the function to be minimized at the current point. In learning, we typically minimize the risk function LD(h) = E_{z∼D}[ℓ(h, z)]. Referring to hypotheses as vectors w, we try to minimize the risk function LD(w).
▶ We are minimizing the risk function, and since we do not
know D we also do not know the gradient of LD (w ).
▶ SGD circumvents this problem by allowing the optimization procedure to take a step in an appropriately chosen random direction.
▶ SGD is an efficient algorithm and enjoys the same sample
complexity as the regularized risk minimization rule.

DS303 Manjesh K. Hanawal 2


Gradient Descent

The gradient of a differentiable function f : Rd → R at w, denoted ∇f(w), is the vector of partial derivatives of f, namely,

∇f(w) = ( ∂f(w)/∂w1 , . . . , ∂f(w)/∂wd ).
Gradient descent is an iterative algorithm. We start with an initial
value of w (say, w(1) = 0). Then, at each iteration, we take a step
in the direction of the negative of the gradient at the current
point. That is, the update step is

w(t+1) = w(t) − η∇f (w(t) ), (1)


where η > 0 is a parameter called learning rate.
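The update rule can be sketched in a few lines of code. A minimal illustration (our own, not from the slides) on the convex function f(w) = ∥w∥², whose gradient is 2w; the function and parameter names are assumptions:

```python
import numpy as np

def gradient_descent(grad, w0, eta, T):
    """Iterate w(t+1) = w(t) - eta * grad(w(t)) for T steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(T):
        w = w - eta * grad(w)
    return w

# f(w) = ||w||^2 has gradient 2w and is minimized at the origin.
w_final = gradient_descent(lambda w: 2 * w, w0=[4.0, -2.0], eta=0.1, T=100)
print(np.round(w_final, 8))  # very close to [0, 0]
```

With η = 0.1 each step multiplies w by 0.8, so the iterates shrink geometrically toward the minimizer.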



Gradient Descent

Another way to motivate gradient descent is by relying on Taylor


approximation. The gradient of f at w yields the first-order Taylor
approximation of f around w by

f (u) ≈ f (w) + ⟨u − w, ∇f (w)⟩.


When f is convex, this approximation lower bounds f , that is,

f (u) ≥ f (w) + ⟨u − w, ∇f (w)⟩.


Therefore, for w close to w(t) , we have that

f (w) ≈ f (w(t) ) + ⟨w − w(t) , ∇f (w(t) )⟩.



Gradient Descent

Hence, we can minimize the approximation of f(w). However, the approximation might become loose for w far away from w(t). Therefore, we would like to jointly minimize the distance between w and w(t) and the approximation of f around w(t):

w(t+1) = argmin_w [ (1/2)∥w − w(t)∥² + η( f(w(t)) + ⟨w − w(t), ∇f(w(t))⟩ ) ].

The parameter η controls the trade-off between the two terms. Solving the preceding minimization yields the same update rule as in Equation (1) on the previous slides.



Subgradients

▶ The gradient descent (GD) algorithm requires that the


function f be differentiable.
▶ We generalize the discussion beyond differentiable functions.
We will show that the GD algorithm can be applied to
nondifferentiable functions by using a so-called subgradient of
f (w) at w(t) , instead of the gradient.



Subgradients

To motivate the definition of subgradients, recall that for a convex function f, the gradient at w defines the slope of a tangent that lies below f, that is,

∀u, f(u) ≥ f(w) + ⟨u − w, ∇f(w)⟩. (2)

An illustration is given on the left-hand side of the figure on the next slide.
The existence of a tangent that lies below f is an important property of convex functions, which is, in fact, an alternative characterization of convexity.
Lemma: Let S be an open convex set. A function f : S → R is convex if and only if for every w ∈ S there exists v such that

∀u ∈ S, f(u) ≥ f(w) + ⟨u − w, v⟩. (3)



Subgradients
Definition (Subgradients): A vector v that satisfies Equation (3) is called a subgradient of f at w. The set of all subgradients of f at w is called the differential set and is denoted by ∂f(w).

Figure: Left: The right-hand side of Equation 2 is the tangent of f at w.


For a convex function, the tangent lower bounds f. Right: Illustration of
several subgradients of a nondifferentiable convex function.



Calculating Subgradients
Claim 1 : If f is differentiable at w, then ∂f (w) contains a single
element – the gradient of f at w, ∇f (w).

Example (The Differential Set of the Absolute Function):


Consider the absolute value function f (x) = |x|. Using Claim 1, we
can easily construct the differential set for the differentiable parts
of f , and the only point that requires special attention is x0 = 0.
At that point, it is easy to verify that the subdifferential is the set
of all numbers between −1 and 1. Hence:

{1},
 if x > 0
∂f (x) = {−1}, if x < 0

[−1, 1], if x = 0

For many practical uses, we do not need to calculate the whole set
of subgradients at a given point, as one member of this set would
suffice.
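The case analysis above can be checked numerically. A small sketch (our own code, not from the slides) that returns one member of ∂f(x) for f(x) = |x| and verifies the subgradient inequality f(u) ≥ f(x) + (u − x)·v:

```python
def abs_subgradient(x):
    """Return one subgradient of f(x) = |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] would do

# the subgradient inequality holds at every pair of points
for x in (-2.0, 0.0, 3.0):
    v = abs_subgradient(x)
    for u in (-5.0, -1.0, 0.0, 2.0):
        assert abs(u) >= abs(x) + (u - x) * v
```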
Subgradient of a Maximum Function

Claim 2: Let g(w) = max_{i∈[r]} gi(w) for r convex differentiable functions g1, . . . , gr. Given some w, let j ∈ argmax_i gi(w). Then ∇gj(w) ∈ ∂g(w).
Proof : Since gj is convex, we have that for all u:

gj (u) ≥ gj (w) + ⟨u − w, ∇gj (w)⟩.

Since g (w) = gj (w) and g (u) ≥ gj (u), we obtain:

g (u) ≥ g (w) + ⟨u − w, ∇gj (w)⟩.

which concludes our proof. □



Subgradient of Hinge Loss and Lipschitz functions

Hinge Loss: The hinge loss is f(w) = max{0, 1 − y⟨w, x⟩}. Using Claim 2, we get a subgradient v of the hinge loss as follows:

v = 0      if 1 − y⟨w, x⟩ ≤ 0
v = −y x   if 1 − y⟨w, x⟩ > 0
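The two cases translate directly to code. A sketch with made-up inputs; the helper name is our own:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of w -> max(0, 1 - y*<w, x>), via Claim 2."""
    if 1 - y * np.dot(w, x) <= 0:
        return np.zeros_like(w)  # the zero branch of the max is active
    return -y * x                # gradient of the active linear branch

w, x = np.array([0.5, -1.0]), np.array([1.0, 1.0])
print(hinge_subgradient(w, x, y=1))  # 1 - <w,x> = 1.5 > 0, so v = -y*x = [-1, -1]
```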

Lemma: Let A be a convex open set and let f : A → R be a


convex function. Then, f is ρ-Lipschitz over A iff for all w ∈ A and
v ∈ ∂f (w) we have that:

∥v∥ ≤ ρ.



Stochastic Gradient Descent (SGD)
In stochastic gradient descent we do not require the update
direction to be based exactly on the gradient. Instead, we allow the
direction to be a random vector and only require that its expected
value at each iteration will equal the gradient direction. Or, more
generally, we require that the expected value of the random vector
will be a subgradient of the function at the current vector.

Figure: Gradient descent on the left and stochastic gradient descent on


the right. In the right figure, the black line depicts the average value of w



Algorithm for SGD

Stochastic Gradient Descent (SGD) for minimizing f (w)


parameters: Scalar η > 0, integer T > 0
initialize: w(1) = 0
for t = 1, 2, . . . , T
▶ Choose vt at random from a distribution such that
E[vt |w(t) ] ∈ ∂f (w(t) ).
▶ Update w(t+1) = w(t) − ηvt .
output w̄ = (1/T) Σ_{t=1}^{T} w(t)
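The pseudocode above maps directly to code. A sketch (our own; the noisy-gradient oracle is made up) that runs SGD on f(w) = ∥w − c∥², where the random vector vt is the true gradient plus zero-mean noise, so E[vt | w(t)] = ∇f(w(t)):

```python
import numpy as np

def sgd(subgrad_sample, dim, eta, T, rng):
    """SGD: step along random vectors and return the averaged iterate."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(T):
        w_sum += w                  # accumulate w(1), ..., w(T)
        v = subgrad_sample(w, rng)  # E[v | w] is a (sub)gradient at w
        w = w - eta * v
    return w_sum / T

c = np.array([1.0, -1.0])
noisy_grad = lambda w, rng: 2 * (w - c) + rng.normal(0.0, 0.1, size=w.shape)
w_bar = sgd(noisy_grad, dim=2, eta=0.05, T=2000, rng=np.random.default_rng(0))
print(np.round(w_bar, 2))  # close to the minimizer c = [1, -1]
```

Even though each step is noisy, the averaged iterate w̄ lands near the minimizer, because the noise cancels in expectation.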



Variants of SGD

A. Using a Variable Step Size:
Decrease the step size as a function of t: instead of using a fixed η for the update, use ηt. For example, we can take ηt = B/(ρ√t).
B. Averaging Techniques:
Use more sophisticated averaging schemes, which can improve the convergence speed.



Variants of SGD: For Strongly Convex function
Claim: If f is λ-strongly convex, then for every w, u and v ∈ ∂f(w), we have

⟨w − u, v⟩ ≥ f(w) − f(u) + (λ/2)∥w − u∥².
SGD algorithm for minimizing a λ-strongly convex function
Goal: Solve min_{w∈H} f(w)
Parameter: T
Initialize: w(1) = 0
for t = 1, . . . , T do
▶ Choose a random vector vt such that E[vt | w(t)] ∈ ∂f(w(t))
▶ Set ηt = 1/(λt)
▶ Set w(t+1/2) = w(t) − ηt vt
▶ Set w(t+1) = argmin_{w∈H} ∥w − w(t+1/2)∥²
Output: w̄ = (1/T) Σ_{t=1}^{T} w(t)
Adding Projection step for SGD

Let H = {w : ∥w∥ ≤ B}.
A. Projected Stochastic Gradient Descent (Projected SGD)
The basic idea is to add a projection step, where we first subtract a subgradient from the current value of w and then project the resulting vector onto a feasible set H.
1. w(t+1/2) = w(t) − η vt
2. w(t+1) = argmin_{w∈H} ∥w − w(t+1/2)∥
The projection step replaces the current value of w by the vector in H closest to it.
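For H = {w : ∥w∥ ≤ B}, the projection has a closed form: rescale w onto the boundary whenever it lies outside the ball. A small sketch (our own code, with a made-up function name):

```python
import numpy as np

def project_to_ball(w, B):
    """Euclidean projection of w onto H = {w : ||w|| <= B}."""
    norm = np.linalg.norm(w)
    if norm <= B:
        return w           # already feasible: the projection is w itself
    return (B / norm) * w  # otherwise rescale onto the boundary of the ball

print(project_to_ball(np.array([3.0, 4.0]), B=1.0))  # [0.6 0.8]
```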



Adding Projection step for SGD
Lemma (Projection Lemma): Let H be a closed convex set and let v be the projection of w onto H, namely,

v = argmin_{x∈H} ∥x − w∥².

Then for every u ∈ H,

∥w − u∥² − ∥v − u∥² ≥ 0.

Using the lemma, we can easily adapt the analysis of SGD to the case in which we add projection steps on a closed and convex set. Simply note that for every t,

∥w(t+1) − w⋆∥² − ∥w(t) − w⋆∥²
= ∥w(t+1) − w⋆∥² − ∥w(t+1/2) − w⋆∥² + ∥w(t+1/2) − w⋆∥² − ∥w(t) − w⋆∥²
≤ ∥w(t+1/2) − w⋆∥² − ∥w(t) − w⋆∥².
Convex-Lipschitz bounded problems

▶ Consider a convex-Lipschitz-bounded learning problem with parameters ρ, B.
▶ Then, for every ϵ > 0, if we run the SGD method for minimizing LD(w) with a number of iterations (i.e., number of examples) T ≥ B²ρ²/ϵ² and with η = √(B²/(ρ²T)), then the output of SGD satisfies

E[LD(w̄)] ≤ min_{w∈H} LD(w) + ϵ.

▶ The required sample complexity is of the same order of magnitude as the sample complexity guarantee we derived for regularized loss minimization.
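The bound can be turned into a small helper that, for given ρ, B and target accuracy ϵ, returns the required number of iterations T and the matching step size η (a sketch; the function name is ours):

```python
import math

def sgd_parameters(rho, B, eps):
    """T >= B^2 * rho^2 / eps^2 iterations and eta = sqrt(B^2 / (rho^2 * T))."""
    T = math.ceil((B ** 2) * (rho ** 2) / (eps ** 2))
    eta = math.sqrt(B ** 2 / (rho ** 2 * T))
    return T, eta

T, eta = sgd_parameters(rho=1.0, B=1.0, eps=0.1)
print(T)  # 100 iterations (examples) suffice for eps = 0.1
```

Note the 1/ϵ² dependence: halving the target error ϵ quadruples the required number of examples.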



Minimizing Loss function with SGD

Figure: Minimizing Loss function

