16-dl-1 - converted

The document outlines a course on Neural Networks and Deep Learning, detailing homework assignments and key topics such as machine learning fundamentals, model structures, and activation functions. It emphasizes the importance of design decisions in building neural networks and discusses various loss functions for regression and classification tasks. Additionally, it introduces concepts like cross-entropy and KL divergence for measuring differences between probability distributions.


Neural Nets and Deep Learning 1
EE412: Foundation of Big Data Analytics
Fall 2024
Jaemin Yoo

Announcements
• Homeworks:
  • HW3 (due: 11/27)
    • Link analysis
    • Neural nets and deep learning
    • Graph representation learning

Recap
• Topic-Specific PageRank
• Clustering and Partitioning
• Finding Overlapping Communities
[Figures: a graph-partitioning example contrasting the smallest cut with the best cut, and a community-affiliation diagram with memberships F_u(A), F_v(A), F_w(B) for nodes u, v, w and communities A, B]

Outline
1. Introduction to Deep Learning
2. Neural Networks
3. Objective Functions


What is Machine Learning?
• Given data D, we want a function f to solve a problem P.
• Traditional data mining: Engineers design f on their own.
  • E.g., A-priori is designed to find frequent itemsets.
• Machine learning (ML): Engineers “learn” f from D.
  • Requires training, which is a computationally expensive step.
  • Not better if an existing algorithm is already good enough for D.

What is Machine Learning?
• Core components of machine learning are given as follows:
  • Training data D = {(x₁, y₁), (x₂, y₂), ⋯, (x_N, y_N)}.
    • Feature xᵢ ∈ X and label yᵢ ∈ Y as the i-th sample.
    • Labels may or may not be given depending on the task.
  • Model f_θ: X → Y.
    • How f_θ performs depends on its parameters θ ∈ Θ.
  • Loss function ℒ: Θ × X × Y → ℝ₊.
    • Determines how good the current parameters θ are.
    • Smaller values of ℒ typically represent better models.

What is Machine Learning?
• Goal of ML: Optimize f_θ on D with respect to ℒ.
  • → Find parameters θ* that minimize the expected loss on D (a code sketch follows below):

    θ* = argmin_θ L(θ) = argmin_θ (1/N) Σᵢ₌₁ᴺ ℒ(f_θ(xᵢ), yᵢ)

  • Although we expect f_θ also to work well on unseen test data.
[Figure. Source: Morimoto and Ponton (2021)]

Machine Learning Tasks
1. Supervised learning:
  • Train a model from labeled data {(xᵢ, yᵢ)}ᵢ.
  • E.g., image classification.
    • To classify an image xᵢ into yᵢ ∈ {cat, dog}.
2. Unsupervised learning:
  • Train a model from unlabeled data {xᵢ}ᵢ.
  • E.g., dimensionality reduction.
    • UV decomposition learns matrices U and V.
    • To understand how users and items interact.
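
A minimal sketch of this setup in NumPy (the toy data, the linear model, and the squared-error loss are assumptions made for illustration, not taken from the slides). It evaluates the average training loss L(θ) for two candidate parameter vectors and shows that the smaller loss corresponds to the better fit.

```python
import numpy as np

# Toy training data D = {(x_i, y_i)}: predict income (arbitrary units) from age.
x = np.array([20.0, 30.0, 40.0, 50.0])
y = np.array([2.0, 3.1, 3.9, 5.2])

def f(theta, x):
    """Model f_theta(x) = a*x + b with parameters theta = (a, b)."""
    a, b = theta
    return a * x + b

def avg_loss(theta, x, y):
    """Average loss L(theta) = (1/N) * sum_i (f_theta(x_i) - y_i)^2."""
    return np.mean((f(theta, x) - y) ** 2)

# Compare two candidate parameter vectors: a smaller loss means a better model.
for theta in [(0.0, 3.5), (0.1, 0.0)]:
    print(theta, avg_loss(theta, x, y))
```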


Choice of Model Structure
• The functional design of f_θ is an important decision to make.
  • Although ML allows us to find its optimal parameters θ*.
• Consider a linear regression model f_θ as an example:
  • f_θ(x) = ax + b where θ = (a, b).
• Is f_θ sufficient to predict the expected income from age?

Limited Expressivity of Linearity
• Most traditional DM algorithms are linear in the input.
  • Dimensionality reduction: SVD, PCA, and UV decomposition.
  • Link analysis: PageRank and its variations.
• Linear models are not expressive enough.
  • XOR problem: A function ŷ = ax₁ + bx₂ cannot learn XOR (see the sketch below).

    x₁  x₂  y
    0   0   0
    0   1   1
    1   0   1
    1   1   0

[Figure. Source: Tech-Quantum]
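
A quick check of the XOR claim (a sketch; the least-squares fit and the added bias term are assumptions for illustration): even the best-fitting linear function predicts 0.5 for all four points, so no threshold can separate the two classes.

```python
import numpy as np

# XOR truth table: inputs (x1, x2) and target y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Least-squares fit of y_hat = a*x1 + b*x2 + c (a bias c is added for generality).
A = np.hstack([X, np.ones((4, 1))])            # design matrix [x1, x2, 1]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("best linear parameters (a, b, c):", theta)
print("predictions:", A @ theta)               # all 0.5: XOR is not linearly separable
```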

Deep Learning
• A neural network is a flexible, nonlinear structure of f_θ.
  • We can tune its expressivity by tuning hyperparameters.
• Deep learning is a machine learning paradigm including:
  • Choosing the structure of a neural network.
  • Training its parameters with a suitable objective function.

Design Decisions
• Building a neural network is partially art and partially science.
• Before training a network, we need to make design decisions:
  • How many hidden layers?
  • How many nodes (or neurons) per hidden layer?
  • What objective function to use?
  • Which algorithm should be used to optimize the weights?


Outline
1. Introduction to Deep Learning
2. Neural Networks
3. Objective Functions

Perceptron
• A perceptron (or a neuron) is a linear binary classifier.
  • Input: x = (x₁, x₂, x₃).
  • Output: f(x) = σ(Σᵢ wᵢxᵢ + b).
  • Parameters: θ = (w₁, w₂, w₃, b).
• Three steps of computation (see the sketch below):
  1. Perform a dot product with w.
  2. Add the bias b.
  3. Apply the activation function σ.
[Figure: perceptron diagram. Source: Towards Data Science]
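
A minimal NumPy sketch of the three steps (the input, weights, bias, and the choice of sigmoid as σ are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    """Logistic sigmoid: maps any score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def perceptron(x, w, b):
    s = np.dot(w, x)          # 1. dot product with w
    s = s + b                 # 2. add the bias b
    return sigmoid(s)         # 3. apply the activation function

x = np.array([0.5, -1.0, 2.0])    # input x = (x1, x2, x3)
w = np.array([0.2, 0.4, -0.1])    # weights
b = 0.1
print(perceptron(x, w, b))        # a value in (0, 1)
```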

Activation Function
• Activation function:
  • Transforms a score into an interpretable signal, typically in [−1, +1].
  • The (logistic) sigmoid function is a common choice.
    • σ(x) = 1 / (1 + exp(−x)) ∈ (0, 1)
• Consider a binary classification task.
  • Features: { size, weight } of a fish.
  • Label: Whether it is a goldfish or not.
  • Model: ŷ = σ(w₁ · size + w₂ · weight + b).
    • Probability to be a goldfish.

Neural Network
• Neural network:
  • A collection of neurons that are connected in a directed acyclic graph.
• Neural networks are organized into layers:
  • Outputs of some neurons are inputs to other neurons.


Layer
• We create each layer by combining multiple neurons.
• Consider the following network as an example:
  • Consists of two layers with weights W₁ ∈ ℝ^(2×3) and W₂ ∈ ℝ^(1×2).
  • Computes f(x) = W₂ σ(W₁x + b₁) + b₂.

Importance of Activation
• Note: f reduces to a single-layer perceptron without σ.
• Suppose that f is a 2-layer NN without activation functions:

  f(x) = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂ = W′x + b′,
  where W′ = W₂W₁ and b′ = W₂b₁ + b₂.

• Learning W′ is equivalent to learning W₂ and W₁ separately.
• Depth is meaningful only if we adopt nonlinear activation.
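
A quick numerical check of this collapse (the random weights and layer sizes below are arbitrary): without an activation between the layers, the composed map equals a single linear map W′x + b′.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 1: R^3 -> R^2
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)   # layer 2: R^2 -> R^1
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2        # no activation between the layers

W_prime = W2 @ W1                          # collapsed weights W' = W2 W1
b_prime = W2 @ b1 + b2                     # collapsed bias b' = W2 b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))   # True: depth adds nothing without sigma
```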



Activation Function
(Recall: ŷ = σ(Σᵢ wᵢxᵢ + b))
• What do we expect from activation functions?
  • Takes a single number.
  • Performs a differentiable nonlinear operation on it.
  • We care about differentiability for training.
• Some choices of σ are:
  • Sigmoid function: σ(x) = 1 / (1 + exp(−x)).
  • Tanh function: tanh(x) = 2σ(2x) − 1.
  • ReLU function: relu(x) = max(0, x).

Sigmoid Function
• Sigmoid is common for perceptrons or shallow networks.
  • Pros: Continuous, differentiable, and possible to interpret as a probability.
  • Cons: Saturates quickly beyond the “critical region” around 0, where the gradient is almost 0.
• σ(x) = 1 / (1 + exp(−x))
• σ′(x) = σ(x)(1 − σ(x))
[Figure: plot of the sigmoid curve]
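
These formulas translate directly into code (a small sketch; the test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))        # saturates toward 0 and 1 away from the critical region
print(sigmoid_grad(x))   # largest at x = 0 (0.25), nearly 0 at |x| = 5
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # tanh(x) = 2*sigma(2x) - 1
```
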
Rectified Linear Unit (ReLU)
• ReLU has replaced sigmoid in modern (deep) neural networks.
  • Pros: Efficient to compute, and the gradient never saturates for positive inputs.
  • Cons: Saturation of the derivative when x < 0, where the output is stuck at 0 (the “dying ReLU” problem).
  • Non-differentiability at x = 0 is not an important issue.
• σ(x) = max(0, x)
• (d/dx) σ(x) = 1 if x ≥ 0, and 0 if x < 0.

ReLU Variants
• Leaky ReLU:
  • Attempts to fix the dying ReLU problem.
  • Introduces a hyperparameter α (e.g., α = 0.01).
  • σ(x) = x if x ≥ 0, and αx if x < 0.
• Exponential LU (ELU):
  • Smooth version of ReLU.
  • Introduces a hyperparameter α (e.g., α = 1).
  • σ(x) = x if x ≥ 0, and α(exp(x) − 1) if x < 0.
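
A direct NumPy transcription of the three piecewise definitions (a sketch; the default α values follow the examples above, and the test points are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """x for x >= 0, alpha * x otherwise: keeps a small gradient for x < 0."""
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    """x for x >= 0, alpha * (exp(x) - 1) otherwise: a smooth variant of ReLU."""
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))         # [ 0.     0.     0.     0.5    2.   ]
print(leaky_relu(x))   # [-0.02  -0.005  0.     0.5    2.   ]
print(elu(x))          # [-0.865 -0.393  0.     0.5    2.   ]
```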

Output Layer
• The number of output neurons is determined by the task.
  • Regression: One neuron that returns a real number.
  • Binary classification: One neuron that returns a probability (by sigmoid).
  • K-way classification: K neurons, each of which corresponds to a class.
    • Shouldn’t be K − 1.

Softmax Function
• For classification, softmax converts scores s(x) to probabilities:

  f(x) = softmax(s(x)),  where  f(x)ᵢ = exp(s(x)ᵢ) / Σⱼ exp(s(x)ⱼ)

• Normalizes any vector into a probability distribution that sums to one.
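
A sketch of softmax following the formula above (subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something the slide requires; softmax is invariant to that shift):

```python
import numpy as np

def softmax(scores):
    """Normalize a score vector into a probability distribution."""
    shifted = scores - np.max(scores)     # stability only; does not change the result
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)

scores = np.array([2.0, 1.0, 0.1])        # raw outputs of a 3-way classifier
probs = softmax(scores)
print(probs)                              # approx. [0.659 0.242 0.099]
print(probs.sum())                        # 1.0
```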


Stochastic Gradient Descent
• Train a neural network through stochastic gradient descent (SGD).
  • Let θ be a set of parameters to learn.
  • Let ℒ be an objective function to minimize.
  1. For each data point (x, y), compute the gradient ∇_θ ℒ(x, y).
  2. Update the parameters: θ ← θ − η · ∇_θ ℒ(x, y).

Gradient
• Given a function f: ℝᵈ → ℝ and θ = (θ₁, θ₂, ⋯, θ_d).
• The gradient of f(θ) with respect to θ is

  ∇_θ f(θ) = ( ∂f(θ)/∂θ₁, ∂f(θ)/∂θ₂, ⋯, ∂f(θ)/∂θ_d ).

• ∂/∂θᵢ: Partial derivative that treats all other variables as constants.
• Example: ∇_θ (aθ₁² + bθ₂ + c) = (2aθ₁, b) if θ is 2-dimensional.
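
The two SGD steps applied to a tiny example (the linear model, per-sample squared-error loss, dataset, and learning rate are assumed for illustration; the gradients are derived by hand from ℒ = (wx + b − y)²):

```python
import numpy as np

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # roughly y = 2x
w, b = 0.0, 0.0           # parameters theta
eta = 0.01                # learning rate

for epoch in range(200):
    for x, y in data:                 # one SGD step per data point
        err = (w * x + b) - y
        grad_w = 2.0 * err * x        # dL/dw for L = (w*x + b - y)^2
        grad_b = 2.0 * err            # dL/db
        w -= eta * grad_w             # theta <- theta - eta * gradient
        b -= eta * grad_b

print(w, b)                           # close to slope 2 and a small intercept
```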

Outline
1. Introduction to Deep Learning
2. Neural Networks
3. Objective Functions

Objective Function
• Let y be the correct label and ŷ be the model’s prediction.
• The loss function ℒ(y, ŷ) quantifies the difference between y and ŷ.
• We usually compute the average for each batch (assuming SGD).
• Determined by the target task, model properties, desired goals, etc.
Regression Loss
• There is a single output node, which produces a real value.
• Squared error (L2) loss: ℒ(y, ŷ) = (y − ŷ)².
• Huber loss: ℒ(y, ŷ) = (y − ŷ)² if |y − ŷ| ≤ δ, and 2δ(|y − ŷ| − δ/2) otherwise.
  • Less sensitive to outliers (see the sketch below).

Classification Loss
• Consider a multiclass classification task (classes: c₁, c₂, ⋯, c_K).
• The label y is a probability distribution p, which is typically one-hot.
  • pᵢ = 1 and pⱼ = 0 for j ≠ i.
• The model’s output is a probability distribution q = (q₁, q₂, ⋯, q_K).
  • Can use softmax on the output layer to obtain probabilities.
[Figure: bar charts of an example pair, p one-hot on class C1 and q = (0.8, 0.2)]
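
A sketch of the two regression losses (δ = 1 and the toy values are assumptions for illustration); the last prediction is an outlier, and the Huber loss grows only linearly with it:

```python
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def huber(y, y_hat, delta=1.0):
    """Quadratic near the target, linear for large errors."""
    err = np.abs(y - y_hat)
    return np.where(err <= delta, err ** 2, 2 * delta * (err - delta / 2))

y = np.array([1.0, 1.0, 1.0])
y_hat = np.array([1.2, 2.0, 11.0])     # the last prediction is far off
print(squared_error(y, y_hat))         # [  0.04   1.   100.  ]
print(huber(y, y_hat))                 # [  0.04   1.    19.  ]
```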

Probabilistic Difference
• Let p and q be probability distributions of the same length.
• How can we measure the difference between distributions?
• Three concepts related to the question:
  1. Entropy H(p)
  2. Cross entropy H(p, q)
  3. Kullback-Leibler (KL) divergence D(p‖q)

Entropy
• Entropy H(p) is the average number of bits to encode an event.
  • H(p) = “uncertainty” = “information.”
• Definition: H(p) = − Σᵢ pᵢ log pᵢ.
  • Maximum: H(p) = log n if p is a uniform distribution.
  • Minimum: H(p) = 0 if p is a one-hot vector.
[Figure: a distribution p over {A, B, C} with probabilities 0.5, 0.25, 0.25 and its optimal binary code]
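
Computing the entropy of the example distribution above (base-2 logarithms, so the unit is bits; a small sketch):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits, matching the optimal code lengths 1, 2, 2
print(entropy([1/3, 1/3, 1/3]))     # log2(3) ~ 1.585, the maximum for 3 outcomes
print(entropy([1.0, 0.0, 0.0]))     # 0.0: a one-hot vector has no uncertainty
```
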
Cross Entropy
• Cross entropy H(p, q) is the “cost” of encoding p through q.
  • Average number of bits if we use the encoding scheme of q.
• Definition: H(p, q) = − Σᵢ pᵢ log qᵢ.
• Let’s say p is fixed, and we want to change q.
  • Maximum: H(p, q) = ∞ if qᵢ ≈ 0 for any i such that pᵢ > 0.
  • Minimum: H(p, q) = H(p) if p = q.

KL Divergence
• Kullback-Leibler (KL) divergence measures a statistical difference:

  D(p‖q) = − Σᵢ pᵢ log (qᵢ / pᵢ) = H(p, q) − H(p)

• Note: It is not a distance metric (it does not satisfy the required properties).
• Smaller value if (a) p and q are similar or (b) p is more uncertain.
• Example: For p = (1, 0) and q = (0.8, 0.2), D(p‖q) = 1 × log(1 / 0.8) ≈ 0.32.
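
Verifying the example and the identity D(p‖q) = H(p, q) − H(p) numerically (base-2 logarithms; a small sketch):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), over entries where p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

p = np.array([1.0, 0.0])       # one-hot label on class C1
q = np.array([0.8, 0.2])       # model output
print(cross_entropy(p, q))     # ~0.322 bits
print(kl(p, q))                # ~0.322 bits, since H(p) = 0 for a one-hot p
print(kl(q, q))                # 0.0: identical distributions
```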

Classification Loss: Cross Entropy
• Our goal is to produce output ŷ that estimates the answer y.
• Since the entropy H(y) is a constant, we have:

  argmin_ŷ D(y‖ŷ) = argmin_ŷ H(y, ŷ)

• We use the cross-entropy loss for most classification tasks:

  ℒ(y, ŷ) = − Σᵢ yᵢ log ŷᵢ

Summary
1. Introduction to Deep Learning
  • Supervised and unsupervised learning
  • Expressivity
2. Neural Networks
  • Activation functions
  • Gradient descent
3. Objective Functions
  • Entropy and cross entropy
  • KL divergence
