Neural Nets and Deep Learning 1
EE412: Foundation of Big Data Analytics
Fall 2024

• Homeworks:
  • HW3 (due: 11/27)
    • Link analysis
    • Neural nets and deep learning
    • Graph representation learning
Recap
• Topic-Specific PageRank
• Clustering and Partitioning
• Finding Overlapping Communities
[Figure: recap illustrations, including a graph over nodes A, B, D, E with the "smallest cut" vs. "best cut" of a partition, and nodes u, v, w in an overlapping-community example]

Outline
1. Introduction to Deep Learning
2. Neural Networks
3. Objective Functions
"+ 1 = 41 + 6 where % = 4, 6 .
• Linear models are not expressive enough.
• XOR problem: Function &2 = 3$! + 5$" cannot learn XOR.
Д. Q
• Depth is meaningful only if we adopt nonlinear activation.
БГБодивіть рысть
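As a minimal NumPy sketch of this point (all names and the hand-picked, untrained weights are my own, chosen only for illustration): one hidden ReLU layer represents XOR exactly, which no purely linear model can do.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all four XOR inputs
y = np.array([0, 1, 1, 0])                      # XOR labels

# A purely linear model y = w1*x1 + w2*x2 + b cannot match all four labels:
# its predictions on (0,1) and (1,0) average to its prediction at the
# midpoint of (0,0) and (1,1), so the XOR pattern is unreachable.

# One hidden ReLU layer suffices (hand-picked weights):
#   h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # input-to-hidden weights
b1 = np.array([0.0, -1.0])         # hidden biases
w2 = np.array([1.0, -2.0])         # hidden-to-output weights

h = np.maximum(0.0, X @ W1 + b1)   # nonlinear (ReLU) activation
y_hat = h @ w2
print(y_hat)                       # [0. 1. 1. 0.] -- matches XOR exactly
```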
• What do we expect from activation functions?
  • Takes a single number.
  • Performs a differentiable nonlinear operation on it.
  • We care about differentiability for training.
• Sigmoid is common for perceptrons or shallow networks.
  • Pros: Continuous, differentiable, and possible to interpret as a probability.
  • Cons: Saturates quickly beyond the "critical region" around 0, where the gradient is almost 0.
  • σ(x) = 1 / (1 + e^(−x))
• Tanh function: tanh(x) = 2σ(2x) − 1 (both activations are sketched below).
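A small NumPy sketch of the two definitions above (function names are my own); the printed gradients illustrate how quickly the sigmoid saturates away from the critical region around 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # derivative of the sigmoid

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0   # tanh(x) = 2*sigmoid(2x) - 1

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(x))                    # [0.25, 0.105, 0.0066, 4.5e-05]: vanishes fast
print(np.allclose(tanh(x), np.tanh(x)))   # True: the identity above holds
```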
Rectified Linear Unit (ReLU)
• ReLU has replaced sigmoid in modern (deep) neural networks.
  • Pros: Efficient to compute, and the gradient does not saturate for x ≥ 0.
  • Cons: Saturation of the derivative when x < 0, where the output is stuck at 0 (dying ReLU).
• f(x) = max(0, x)
• f′(x) = 1 if x ≥ 0, 0 if x < 0
  • Non-differentiability at x = 0 is not an important issue.

ReLU Variants
• Leaky ReLU:
  • Attempts to fix the dying ReLU problem.
  • Introduces a hyperparameter α (e.g., α = 0.01).
  • f(x) = x if x ≥ 0, αx if x < 0
• Exponential LU (ELU):
  • Smooth version of ReLU.
  • Introduces a hyperparameter α (e.g., α = 1).
  • f(x) = x if x ≥ 0, α(exp(x) − 1) if x < 0
(ReLU and both variants appear in the sketch below.)
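A minimal sketch of the three activations as defined above, assuming the example hyperparameters from the slides (α = 0.01 for Leaky ReLU, α = 1 for ELU); function names are my own.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # x for x >= 0, alpha*x for x < 0: keeps a small gradient on negatives
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # x for x >= 0, alpha*(exp(x) - 1) for x < 0: smooth version of ReLU
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [ 0.     0.     0.     0.5    2.   ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
print(elu(x))         # [-0.865 -0.393  0.     0.5    2.   ]
```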
Output Layer
• Regression: One neuron that returns a real number.
• Binary classification: One neuron that returns a probability (by sigmoid).
• k-way classification: k neurons, each of which corresponds to a class.
  • ŷ = softmax(a), i.e., ŷ_i = exp(a_i) / ∑_j exp(a_j) (a stable implementation is sketched below).
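A sketch of the softmax defined above; subtracting the maximum is a standard numerical-stability trick rather than part of the definition (softmax is shift-invariant), and the function name is my own.

```python
import numpy as np

def softmax(a):
    # Subtract the max before exponentiating to avoid overflow;
    # this does not change the result since softmax is shift-invariant.
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

a = np.array([2.0, 1.0, 0.1])   # raw scores for a 3-way classification
p = softmax(a)
print(p)                        # [0.659 0.242 0.099]
print(p.sum())                  # 1.0: a valid probability distribution over classes
```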
Regression Loss
• There is a single output node, which produces a real value.
• Squared error (L2) loss: ℒ(y, ŷ) = (y − ŷ)².
• Huber loss: ℒ(y, ŷ) = (y − ŷ)² if |y − ŷ| ≤ δ; 2δ(|y − ŷ| − δ/2) otherwise.
  • Less sensitive to outliers (compared with L2 in the sketch below).

Classification Loss
• Consider a multiclass classification task (classes: c_1, c_2, ⋯, c_k).
• The label y is a probability distribution p, which is typically one-hot.
  • p_i = 1 and p_j = 0 for j ≠ i.
• The model's output is a probability distribution q = (q_1, q_2, ⋯, q_k).
  • Can use softmax on the output layer to obtain probabilities.
[Figure: bar charts of p = (1, 0) and q = (0.8, 0.2) over classes C1 and C2; optimal binary code illustration over symbols A, B, C]
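A short sketch comparing the two regression losses defined above (δ = 1 and the example residuals are my own choices for illustration).

```python
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    # quadratic inside the threshold, linear outside (as on the slide)
    return np.where(r <= delta, r ** 2, 2 * delta * (r - delta / 2))

y, y_hat = 3.0, np.array([3.5, 8.0])   # one small and one large residual
print(squared_error(y, y_hat))         # [ 0.25 25.  ]
print(huber(y, y_hat))                 # [ 0.25  9.  ] -- outlier penalized only linearly
```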
Cross Entropy
• Cross entropy H(p, q) is the "cost" of encoding p through q.
  • Average number of bits if we use the encoding scheme of q.
• Definition: H(p, q) = −∑_i p_i log q_i.
• Let's say p is fixed, and we want to change q.
  • Maximum: H(p, q) = ∞ if q_i ≈ 0 for any i such that p_i > 0.
  • Minimum: H(p, q) = H(p) if p = q.

KL Divergence
• Kullback-Leibler (KL) divergence measures a statistical difference.
• D_KL(p ∥ q) = −∑_i p_i log(q_i / p_i) = H(p, q) − H(p).
• Note: It is not a distance metric (e.g., it is not symmetric).
• Smaller value if (a) p and q are similar or (b) p is more uncertain.
• Example: with p = (1, 0) and q = (0.8, 0.2) over classes C1, C2, D_KL(p ∥ q) = 1 × log₂(1/0.8) ≈ 0.32 (reproduced in the sketch below).
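A sketch of cross entropy and KL divergence as defined above, assuming base-2 logarithms to match the "bits" interpretation and the 0.32 example; the function names and the eps guard against log(0) are my own.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p_i * log2(q_i); eps guards against log(0)
    return -np.sum(p * np.log2(q + eps))

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) = H(p, q) - H(p)
    return cross_entropy(p, q, eps) - cross_entropy(p, p, eps)

p = np.array([1.0, 0.0])    # one-hot label over classes C1, C2
q = np.array([0.8, 0.2])    # model's predicted distribution
print(kl_divergence(p, q))  # ~0.322, matching the slide's example
print(kl_divergence(p, p))  # ~0.0: the minimum, reached when p == q
```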