
CS 585

Natural Language Processing

October 10, 2024


Announcements / Reminders
 Please follow the Week 08 To Do List instructions (if you haven't
already)

 Programming Assignment #02 due date moved from Sunday (10/13/24) to Sunday (10/20/24) at 11:59 PM CST

2
Plan for Today
 Classification: Evaluating and improving
performance
 Introduction to Neural Networks

3
Classification: Improving
Performance

4
Classifier Evaluation: Confusion Matrix

                                  Predicted class
                         Positive                Negative
Actual   Positive   True Positive (TP)     False Negative (FN)     Sensitivity (Recall)
class                                      Type II Error           = TP / (TP + FN)

         Negative   False Positive (FP)    True Negative (TN)      Specificity
                    Type I Error                                   = TN / (TN + FP)

                    Precision              Negative Predictive     Accuracy
                    = TP / (TP + FP)       Value                   = (TP + TN) /
                                           = TN / (TN + FN)          (TP + TN + FP + FN)

5
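
A minimal Python sketch (not from the slides) of how these metrics follow from the four confusion-matrix counts; the counts below are made up for illustration:

    # Hypothetical confusion-matrix counts (made-up numbers)
    tp, fn, fp, tn = 90, 10, 30, 870

    sensitivity = tp / (tp + fn)                    # recall / true positive rate
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    npv         = tn / (tn + fn)                    # negative predictive value
    accuracy    = (tp + tn) / (tp + tn + fp + fn)

    print(f"recall={sensitivity:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")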
Classifier Evaluation: Accuracy
 Why don't we use accuracy as our metric?
 Imagine we saw 1 million tweets
 100 of them talked about Delicious Pie Co.
 999,900 talked about something else

 We could build a dumb classifier that just labels every


tweet "not about pie"
 It would get 99.99% accuracy!!! Wow!!!!
 But useless! Doesn't return the comments we are looking for!
 That's why we use precision and recall instead
6
Classifier Evaluation: Accuracy
 Our dumb pie-classifier
 Just label nothing as "about pie"
 Accuracy = 99.99%, but
 Recall = 0 (it doesn't get any of the 100 Pie tweets)
 Precision and recall, unlike accuracy, emphasize
true positives:
 finding the things that we are supposed to be looking
for.

7
Classifier Performance Metrics
 Precision and recall provide two ways to
summarize the errors made for the positive class
(FP, FN).

 F-measure provides a single score that


summarizes the precision and recall.

 Accuracy summarizes the correct predictions for


both positive and negative classes.
8
Classifier Evaluation: F-Score
 F-Score is a measure of a model’s accuracy on a
dataset
 evaluates binary classification systems
 F-score is a way of combining the precision and
recall of the model
 Used in NLP, Information Retrieval, ML

9
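
The slide does not show the formula; the standard balanced F1 score is the harmonic mean of precision and recall. A minimal sketch:

    def f1_score(precision: float, recall: float) -> float:
        """Balanced F-measure: harmonic mean of precision and recall."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1_score(0.75, 0.90))   # 0.818... for e.g. precision 0.75 and recall 0.90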
Receiver Operating Characteristic
Threshold: 0.5
                 Predicted HAM    Predicted SPAM
Actual HAM       TP = 3           FN = 2             TPR = TP / (TP + FN) = 3/5
Actual SPAM      FP = 1           TN = 3             FPR = FP / (FP + TN) = 1/4

Threshold: 0.2
Actual HAM       TP = 4           FN = 1             TPR = 4/5
Actual SPAM      FP = 1           TN = 3             FPR = 1/4

Threshold: 0.8
Actual HAM       TP = 3           FN = 2             TPR = 3/5
Actual SPAM      FP = 0           TN = 4             FPR = 0/4

[Figure: ROC Curve - each (FPR, TPR) pair plotted with True Positive Rate (TPR) on the y-axis and False Positive Rate (FPR) on the x-axis, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
10
ROC Area Under the Curve
Same threshold confusion matrices (0.5, 0.2, 0.8) as on the previous slide.

[Figure: ROC Curve with the area under the curve shaded and labeled ROC AUC; axes: True Positive Rate (TPR) vs. False Positive Rate (FPR), both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
11
Receiver Operating Characteristic
Same threshold confusion matrices (0.5, 0.2, 0.8) as on slide 10.

[Figure: two ROC curves - a perfect classifier with AUC: 1.0 and the example classifier with AUC: ~0.8; axes: TPR vs. FPR, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
12
Receiver Operating Characteristic
Same threshold confusion matrices (0.5, 0.2, 0.8) as on slide 10.

[Figure: ROC curve - you want your classifier in the upper-left region (high TPR, low FPR), i.e. with high AUC; axes: TPR vs. FPR, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
13
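
A minimal sketch of how the (FPR, TPR) points above could be computed from classifier scores; the scores and labels are made up, chosen only so that the counts match the slides' three thresholds (HAM is the positive class):

    import numpy as np

    scores = np.array([0.90, 0.85, 0.80, 0.30, 0.10, 0.60, 0.15, 0.12, 0.05])  # made-up scores
    labels = np.array([1,    1,    1,    1,    1,    0,    0,    0,    0])     # 1 = HAM, 0 = SPAM

    def roc_point(threshold):
        pred = scores >= threshold                       # predict HAM when the score clears the threshold
        tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0)); tn = np.sum(~pred & (labels == 0))
        return fp / (fp + tn), tp / (tp + fn)            # (FPR, TPR) - one point on the ROC curve

    for t in (0.2, 0.5, 0.8):
        fpr, tpr = roc_point(t)
        print(f"threshold={t}: TPR={tpr:.2f} FPR={fpr:.2f}")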
Receiver Operating Characteristic

14
Precision - Recall Curve

[Figure: Precision-Recall curve - you want your classifier in the high-precision, high-recall region]

15
ROC vs. Precision-Recall Curves
 Both summarize model performance using
different probability thresholds
 ROC curves should be used when there are
roughly equal numbers of observations for each
class
 Precision-Recall curves should be used when
there is a moderate to large class imbalance
(when we are interested in the positive class and there are only a few positive samples)

16
3-class Confusion Matrix

17
Macroaveraging and Microaveraging
Macroaveraging:
 compute the performance for each class, and then
average over classes

Microaveraging:
 collect decisions for all classes into one confusion
matrix
 compute precision and recall from that table.

18
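
A minimal sketch of the difference between the two averages, using per-class counts that are only illustrative:

    # Hypothetical per-class counts: class -> (TP, FP)
    counts = {"urgent": (8, 10), "normal": (60, 55), "spam": (200, 33)}

    # Macroaveraging: compute precision per class, then average the per-class values
    macro_precision = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)

    # Microaveraging: pool all decisions into one table, then compute precision once
    total_tp = sum(tp for tp, fp in counts.values())
    total_fp = sum(fp for tp, fp in counts.values())
    micro_precision = total_tp / (total_tp + total_fp)

    print(f"macro precision={macro_precision:.2f}  micro precision={micro_precision:.2f}")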
Macroaveraging and Microaveraging

19
Text Classification System Pipeline
1. Obtain / collect / create labeled data set suitable for the task
2. Split the data set into:
 two (training and test sets) parts OR
 three (training, validation, and test sets) parts
3. Choose evaluation metric
4. Transform raw text into feature vectors:
 bag of words
 other types
5. Using feature vectors and labels from the training set, train the classifier /
create a model
6. Using evaluation metric from (3) benchmark the classifier / model
performance using the test set
7. Deploy the classifier / model to serve a real world application and monitor its
performance

20
Text Classification System Pipeline

1    Training data (texts and their labels)
          |
2-4  Pre-processing and feature extraction process
          |
5-6  Train and evaluate classifier(s) (learning)
          |
7    New data / text with unknown labels -> use classifier to predict labels on new data

21
Poor Classifier Performance: Reasons
1. With all possible features extracted, we ended up with a sparse feature vector (some features are too rare and end up being noise) → makes training hard
2. Few (~20%) relevant samples compared to non-relevant (~80%) samples in the data set → skews learning towards non-relevant data
3. Need better learning algorithm
4. Need better pre-processing / feature extraction
5. Classifier parameters / hyperparameters need tuning

22
Underfitting / Overfitting

[Figure: example model fits illustrating underfitting and overfitting]

Underfitting: “failing” to find the pattern in the data.
Overfitting: fitting the noise and peculiarities of the training data instead of the general pattern.

23
Bias vs. Variance
[Figure: example fits ranging from high bias / low variance to low bias / high variance]

 Bias: the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets
 Variance: the amount of change in the hypothesis (model) due to fluctuations in training data

24
Parameter Tuning

25
K-Fold Cross-Validation
Single validation split:    Train | Validate   ->  Score

4-fold cross-validation:
  Train      Train      Train      Validate   ->  ScoreA
  Train      Train      Validate   Train      ->  ScoreB
  Train      Validate   Train      Train      ->  ScoreC
  Validate   Train      Train      Train      ->  ScoreD

Score = (ScoreA + ScoreB + ScoreC + ScoreD) / 4

26
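
A minimal scikit-learn sketch of 4-fold cross-validation; the toy data, model choice, and scoring metric are assumptions, not part of the slide:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=400, n_features=20, random_state=0)  # toy data
    model = LogisticRegression(max_iter=1000)

    # Each fold is held out once for validation; the overall score is the mean of the fold scores
    scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    print(scores, scores.mean())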
Ensemble Learning
In ensemble learning we create a collection (an ensemble) of hypotheses (models) h1, h2, ..., hN and combine their predictions by averaging, voting, or another level of machine learning. Individual hypotheses (models) are base models and their combination is the ensemble model.
 Bagging
 Boosting
 Random Trees
 etc.

27
Bagging: Classification
In bagging we generate K training sets by sampling with replacement from the original training set.

Train set 1 (M data points)  ->  Model 1 | h1
Train set 2 (M data points)  ->  Model 2 | h2
Train set 3 (M data points)  ->  Model 3 | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  Model K | hK

28
Bagging: Classification
In bagging we generate K training sets by sampling with replacement from the original training set.

Train set 1 (M data points)  ->  NaiveBayes1 | h1
Train set 2 (M data points)  ->  NaiveBayes2 | h2
Train set 3 (M data points)  ->  NaiveBayes3 | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  NaiveBayesK | hK

Bagging tends to reduce variance and helps with smaller data sets.

29
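
A minimal scikit-learn sketch of bagging K Naive Bayes base models; the toy texts, vectorizer, and parameter values are assumptions:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts  = ["free pie coupons now", "meeting moved to noon", "win a free prize", "lunch at the pie shop"]
    labels = ["spam", "ham", "spam", "ham"]            # toy data
    X = CountVectorizer().fit_transform(texts)         # bag-of-words features

    # K = 10 Naive Bayes models, each trained on a bootstrap sample (sampling with replacement);
    # their predictions are combined by voting / averaging.
    # Note: in scikit-learn versions before 1.2 the parameter is named base_estimator.
    bagged = BaggingClassifier(estimator=MultinomialNB(), n_estimators=10, bootstrap=True, random_state=0)
    bagged.fit(X, labels)
    print(bagged.predict(X))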
Ensemble Classification
Individual hypotheses (models) are base models and their combination is the ensemble model.

Train set 1 (M data points)  ->  NaiveBayes1 | h1
Train set 2 (M data points)  ->  Perceptron  | h2
Train set 3 (M data points)  ->  k-NN        | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  NaiveBayes2 | hK

30
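
A minimal sketch of a heterogeneous ensemble (Naive Bayes, perceptron, k-NN) combined by plurality vote; the data and parameters are assumptions:

    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Perceptron
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier

    texts  = ["free pie coupons now", "meeting moved to noon", "win a free prize", "lunch at the pie shop"]
    labels = ["spam", "ham", "spam", "ham"]            # toy data
    X = CountVectorizer().fit_transform(texts)

    # voting="hard" is a plurality vote over the base models' predicted labels
    ensemble = VotingClassifier(
        estimators=[("nb", MultinomialNB()), ("perc", Perceptron()), ("knn", KNeighborsClassifier(n_neighbors=1))],
        voting="hard",
    )
    ensemble.fit(X, labels)
    print(ensemble.predict(X))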
Supervised Learning

31
What Kinds of Questions Does ML Answer?

Question                 ML Category              Example
Is this A or B?          Classification           Will this car fail in the next two months? Yes or no?
Is this weird?           Anomaly detection        Is this credit card charge normal?
How much / many?         Regression               What will the temperature be tomorrow?
How is this organized?   Clustering               Which car models have the most brake problems?
What should I do next?   Reinforcement learning   Adjust room humidity or leave as is?

32
Main Machine Learning Categories
Supervised learning: one of the most common techniques in machine learning. It is based on known relationship(s) and patterns within data (for example: the relationship between inputs and outputs). Frequently used types: regression and classification.

Unsupervised learning: involves finding underlying patterns within data. Typically used in clustering data points (similar customers, etc.).

Reinforcement learning: inspired by behavioral psychology. It is based on rewarding / punishing an algorithm. Rewards and punishments are based on the algorithm's actions within its environment.

33
Choosing Hypothesis / Model
Given a training set of N example input-output
(feature-label) pairs
(x1, y1), (x2, y2), ..., (xN, yN)
where each pair was generated by
y = f(x)
Ideally, we would like our model h(x) (hypothesis)
that approximates the true function f(x) to be:
h(x) = y = f(x) (consistent hypothesis)
34
Choosing Hypothesis / Model
Typically a consistent hypothesis is impossible or difficult to achieve:
 use best-fit model / hypothesis

Our model needs to be tested on the test set inputs


(data the model has not “seen” yet) to see how
well it generalizes (how accurately it predicts the
outputs of the test set).
35
Neural Networks
Basics

36
McCulloch-Pitts Model (1943)
[Diagram: inputs x1..x4 with weights w1..w4 feeding a summation unit Σ that produces output y]

First computational models of an Artificial Neural Network (loosely inspired by biological neural networks) were proposed by Warren McCulloch and Walter Pitts in 1943. Their ideas are a key component of modern day machine and deep learning.

37
A Biological Neuron
A neuron or nerve cell is an
electrically excitable cell that
communicates with other
cells via specialized
connections called synapses.
Most neurons receive signals
via the dendrites and soma
and send out signals down
the axon. At the majority of
synapses, signals cross from
the axon of one neuron to a
dendrite of another.

Source: https://en.wikipedia.org/wiki/Neuron

38
Biological vs. Artificial Neuron

[Figure: biological neuron annotated with the artificial-neuron analogues “Dendrites”, “Synapses”, “Neuron”, and “Axon”]

Source: https://en.wikipedia.org/wiki/Neuron

39
Artificial Neuron (Perceptron)
[Diagram: inputs x1..x4 (numbers), weights w1..w4, summation Σ, activation, output y]

A (single-layer) perceptron is a model of a biological neuron. It is made of the following components:
 inputs xi - numerical values (numbers) representing information
 weights wi - numerical values representing how “important” the corresponding input is
 weighted sum: Σ wi * xi
 activation function f that decides if the neuron “fires”

40
Artificial Neuron (Perceptron)
[Diagram: the same perceptron, with the weights w1..w4 highlighted as the model parameters]

A (single-layer) perceptron is a model of a biological neuron (same components as the previous slide); the weights wi are the model parameters.

41
Artificial Neuron (Perceptron)
[Diagram: two copies of the perceptron, one that does not fire and one that fires]

Σ wi * xi < 0  →  f = 0  →  DON’T “fire”          Σ wi * xi ≥ 0  →  f = 1  →  “fire”

42
Single-layer Perceptron as a Classifier
[Diagram: two copies of the perceptron used as a binary classifier]

Σ wi * xi < 0  →  f = 0  →  NO          Σ wi * xi ≥ 0  →  f = 1  →  YES

43
Perceptron with Step Activation
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, step activation f, output y]

Σ wi * xi + b  →  f  →  y

44
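
A minimal sketch of the unit above: a weighted sum plus bias passed through a step activation (the weights, bias, and inputs are made-up numbers):

    def step(z: float) -> int:
        # Step activation: "fire" (output 1) when the weighted sum is non-negative
        return 1 if z >= 0 else 0

    def perceptron(inputs, weights, bias):
        z = sum(w * x for w, x in zip(weights, inputs)) + bias   # sum of wi * xi, plus b
        return step(z)                                           # f(z) = y

    print(perceptron([1.0, 0.0, 2.0], [0.5, -1.0, 0.25], bias=-0.9))   # 1, since 0.5 + 0.5 - 0.9 >= 0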
Perceptrons = Linear Classifiers
[Diagram: the same perceptron with step activation; the output y is binary (0 or 1)]

Σ wi * xi + b  →  f  →  y

45
Classification: Linear Separation
[Figure: HAM and SPAM points in feature space separated by a straight line]

The w·x + b = 0 line is the decision boundary.

46
Perceptron with Sigmoid Activation
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, sigmoid activation f, output y]

Σ wi * xi + b  →  f  →  y

47
Logistic Regression Classifier
[Diagram: the same unit with sigmoid activation - this is a logistic regression classifier]

Σ wi * xi + b  →  f  →  y

48
Single-layer Perceptron as a Classifier
[Diagram: inputs word1..wordN with weights w1..wN feeding the summation and step activation]

Σ wi * wordi < 0  →  f = 0  →  SPAM          Σ wi * wordi ≥ 0  →  f = 1  →  HAM

49
Basic Neural Unit
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, nonlinear transform σ, output y]

z = Σ wi * xi + b          y = σ(z)

50
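
A minimal numpy sketch of the unit above: z is the weighted sum plus bias, followed by the nonlinear transform (sigmoid); the numbers are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 0.6, 0.1])   # feature vector (made-up values)
    w = np.array([0.2, 0.3, 0.9])   # weights
    b = 0.5                         # bias

    z = np.dot(w, x) + b            # weighted sum
    y = sigmoid(z)                  # nonlinear transform
    print(z, y)                     # 0.87 and roughly 0.70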
Basic Neural Unit

[Figure: the basic neural unit with its parts labeled: feature vector (input layer), weights, bias, weighted sum, nonlinear activation function (can differ for each layer!), output]

51
Selected Activation Functions
[Figure: plots of selected activation functions]

ReLU: Rectified Linear Unit

52
Classification: Linear Separation?
[Figure: HAM and SPAM points arranged so that no straight line can separate them]

Sometimes the decision boundary CANNOT be linear: the function is not linearly separable.

53
Hypothesis: Classification “Boundary”

54
XOR: Not a Linearly Separable f()

Logical XOR is an example of a function that is NOT linearly separable

55
XOR: Not a Linearly Separable f()?

56
Artificial Neural Network

57
Basic Neural Unit
[Figure: a single unit drawn as a network - an input layer connected by weights to the output layer]

58
Basic Neural Unit

[Figure: the basic neural unit with its parts labeled: feature vector (input layer), weights, bias, weighted sum, nonlinear activation function (can differ for each layer!), output]

59
Artificial Neural Network (ANN)
An artificial neural network is made of multiple artificial neuron layers.

[Figure: network with an input layer, two hidden layers, and an output layer]

60
Feedforward Neural Network
[Figure: feedforward network - features enter the input layer and pass through weights, two hidden layers, and the output layer to produce the output]

Also called (historically): multi-layer perceptron

61
XOR: Hidden Layer Approach

62
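
XOR cannot be computed by a single linear unit, but a network with one hidden layer can compute it. A minimal sketch with hand-picked weights (one standard construction, not necessarily the one shown on the slide):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    # Hidden layer: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1); output: y = h1 - 2 * h2
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    W2 = np.array([1.0, -2.0])

    def xor(x1, x2):
        h = relu(W1 @ np.array([x1, x2]) + b1)
        return int(W2 @ h)

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, "->", xor(a, b))    # prints 0, 1, 1, 0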
Hidden Layer
[Figure: network with an input layer, one hidden layer, and an output layer]

63
2 Layer Network
[Figure: 2-layer network - input layer (features), hidden layer, output layer]

64
Training Data: Features + Labels
Typically input data will be represented by a limited set of features.
Features: Wheels: 4, Weight: 8 tons, Passengers: 1    →  Label: Truck
Features: Wheels: 6, Weight: 8 tons, Passengers: 1    →  Label: Truck
Features: Wheels: 4, Weight: 1 ton,  Passengers: 4    →  Label: Car
Features: Wheels: 4, Weight: 2 tons, Passengers: 4    →  Label: Car

65
ANN: Supervised Learning
[Figure: network whose input layer takes the features wheels, weight, and passengers, followed by two hidden layers and an output layer]

66
Training Data: Images + Labels
A classifier needs to be “shown” thousands of labeled examples to learn.

[Figure: labeled example images - BUS, CAR, BRIDGE, PALM, TRAFFIC LIGHT, TAXI, CROSSWALK, CHIMNEY, MOTORCYCLE, STREET SIGN, HYDRANT, BICYCLE]
Note how some images are “incomplete” and “flawed”.

67
Digit Image as ANN Feature Set
Individual features need to be “extracted” from an image. An image is numbers.

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

68
ANN: Supervised Learning
An untrained classifier will NOT label input data correctly.
[Figure: untrained network given an input image produces arbitrary output-layer scores (0.12, 0.99, 0.55, Other)]

69
ANN: Training
Given input data and its corresponding expected label (DOG), calculate the “error”.

[Figure: the network outputs 0.12 for the DOG class (should be 1!), along with other output scores (0.99, 0.55, Other)]

“Error” = 0.88. Go back and adjust all the weights to ensure it is lower next time.

70
ANN: Training
Show a data / label pair: [image] / DOG.

[Figure: the same network and outputs - the DOG output should be 1]

Correct all the weights. Repeat many times.

71
ANN as a Complex Function
In ANNs hypotheses take form of complex algebraic circuits with
tunable connection strengths (weights).
[Figure: network with input layer, two hidden layers, and output layer - the weights are the tunable connection strengths]

72
Exercise: ANN Demo
http://playground.tensorflow.org/

73
Logistic Regression = 1 Layer Network
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, sigmoid f, scalar output y]

Σ wi * xi + b  →  f  →  y

74
Binary Logistic Regression = 1 Layer
[Figure: input layer connected by weights directly to a single output node]

75
Multinomial Logistic Regression
[Figure: input layer connected by weights to several output nodes]

Multinomial Logistic Regression is still a 1-layer network.

76
Fully Connected Network
[Figure: every input node connected by a weight to every output node]

This Multinomial Logistic Regression network is a fully connected network.

77
Softmax: Sigmoid Generalization

78
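
The slide's figure is not reproduced here; as a reminder, softmax generalizes the sigmoid from a single output to a vector of K outputs that sum to 1. A minimal sketch:

    import numpy as np

    def softmax(z):
        # Numerically stable softmax: shift by the max, exponentiate, normalize
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([2.0, 1.0, 0.1])   # made-up output-layer scores
    print(softmax(z))               # roughly [0.66 0.24 0.10] - a probability distribution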
Binary Logistic Regression
[Figure: input layer (features) connected by weights to one output node]

In Binary Logistic Regression the output is a SCALAR:  y = σ(Σ wi * xi + b)    [σ - sigmoid]

79
Multinomial Logistic Regression
[Figure: input layer (features) connected by weights to several output nodes]

In Multinomial Logistic Regression the output is a VECTOR:  y = s(Σ wi * xi + b)    [s - softmax]
80
2 Layer Network
[Figure: 2-layer network - input layer (features), hidden layer, output layer]

81
2 Layer Network
[Figure: the same 2-layer network, with node indices i and j marked for the weight notation]

82
2 Layer Network
[Figure: 2-layer network - each hidden unit applies activation function f1, the output unit applies activation function f2]

Activation function f1: sigmoid, tanh, ReLU, etc. | Activation function f2: sigmoid

83
2 Layer Network
[Figure: 2-layer network - each hidden unit applies activation function f1, the output units apply activation function f2]

Activation function f1: sigmoid, tanh, ReLU, etc. | Activation function f2: softmax

84
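
A minimal numpy sketch of the forward pass for a 2-layer network like the one above (ReLU as f1 in the hidden layer, softmax as f2 at the output); the sizes and values are made up:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    x  = rng.normal(size=4)                            # feature vector (4 made-up features)
    W1 = rng.normal(size=(5, 4)); b1 = np.zeros(5)     # input layer -> hidden layer (5 units)
    W2 = rng.normal(size=(3, 5)); b2 = np.zeros(3)     # hidden layer -> output layer (3 classes)

    h = relu(W1 @ x + b1)                              # hidden layer, activation f1
    y = softmax(W2 @ h + b2)                           # output layer, activation f2
    print(y, y.sum())                                  # a distribution over the 3 classes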
Multilayer Neural Net: Notation

85
Multilayer Neural Net: Notation
[Diagram: inputs x1..xN, weights w1..wN, bias b; weighted sum z; nonlinear transform g; activation a; output y]

z = Σ wi * xi + b          a = g(z)

86
Multilayer Neural Net: Notation
[Diagram: same unit and notation as the previous slide]

z = Σ wi * xi + b          a = g(z)

87
Replacing the Bias Unit
[Diagram: the same unit, with the separate bias term b highlighted]

z = Σ wi * xi + b          The separate bias term is a bit inconvenient.

88
Replacing the Bias Unit
Let's switch to a notation without the bias unit:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]0 = 1, a[1]0 = 1, a[2]0 = 1, …
and the bias is absorbed into the weighted sum: z = Σ wi * ai (with a0 = 1 and w0 = b).

89
Replacing the Bias Unit

90
Deep Learning

91
Deep Learning
Deep learning is a broad family of techniques for
machine learning (also a sub-field of ML) in which
hypotheses take the form of complex algebraic
circuits with tunable connections. The word “deep”
refers to the fact that the circuits are typically
organized into many layers, which means that
computation paths from inputs to outputs have
many steps.

92
Shallow vs. Deep Models

[Figure: two shallow models side by side with a deep model - the deep model has a longer computation path from inputs to outputs]
93
Machine Learning vs. Deep Learning

Source: https://www.quora.com/What-is-the-difference-between-deep-learning-and-usual-machine-learning

94
Machine Learning vs. Deep Learning

Source: https://www.intel.com/content/www/us/en/artificial-intelligence/posts/difference-between-ai-machine-learning-deep-
learning.html

95
Deep Learning: Feature Extraction

Source: https://en.wikipedia.org/wiki/Deep_learning

96
Neural Networks in NLP
Let’s consider the NLP modeling we explored
so far:
 Classification
 Language Modeling
Can we apply Neural Networks?

97
Logistic Regression Sentiment Analysis
[Figure: 1-layer network - input layer (features) connected by weights to a single output node that produces a BINARY answer]

98
Logistic Regression Sentiment Analysis
[Figure: 2-layer network - input layer (features), hidden layer, and an output node that produces a BINARY answer]

99
Complex Feature Vector Relationships

Adding hidden layers can help capture non-linear relationships between features!

100
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

101
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDem
o/index.html

102
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the input layer of a 2-layer network]

Multiclass output: add more output layer nodes + use softmax (instead of sigmoid)

103
Embeddings as Input Features

104
Embeddings as Input Features

Assumption:
“3-word sentences”

105
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the input layer of a 2-layer network whose output is a BINARY answer]

106
Texts in Different Sizes: Ideas
Some simple solutions:
1. Make the input the length of the longest sample
 if shorter then pad with zero embeddings
 truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same
dimensionality as a word) to represent all the words
 take the mean of all the word embeddings
 take the element-wise max of all the word
embeddings
 for each dimension, pick the max value from all
words
107
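
A minimal numpy sketch of idea 2 above: building a single "sentence embedding" from word embeddings by taking the mean or the element-wise max (the embeddings are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    word_embeddings = rng.normal(size=(7, 50))        # 7 words in the text, 50-dim embeddings (stand-ins)

    mean_embedding = word_embeddings.mean(axis=0)     # average of all the word embeddings
    max_embedding  = word_embeddings.max(axis=0)      # for each dimension, the max value over all words

    print(mean_embedding.shape, max_embedding.shape)  # both (50,) - same dimensionality as one word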
Language Models Revisited
Language Modeling: Calculating the probability of the next
word in a sequence given some history.
• N-gram based language models
• other: neural network-based?

Task: predict next word wt


given prior words wt-1, wt-2, wt-3, …
Problem: Now we’re dealing with sequences of arbitrary
length.
Solution: Sliding windows (of fixed length)

108
Neural Language Model

109
Neural LM Better Than N-Gram LM
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed

Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog" embeddings
to generalize and predict “fed” after dog

110
Training Neural Networks

111
Training Neural Networks: Intuition
For every training tuple (x ,y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update weights:
  For every output node
  Compute loss L between true y and the estimated ŷ
 For every weight w from hidden layer to the output layer
 Update the weight

 For every hidden node


 Assess how much blame it deserves for the current answer
 For every weight w from input layer to the hidden layer
 Update the weight

112
Back-propagation
1. Feed forward: feed a labeled sample (inputs x and y with weights w1 and w2) through the network to compute z = f(x, y).
2. Evaluate loss: how “incorrect” is the result compared to the label?  Loss = z - z_expected
3. Back-propagation: propagate the gradients ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backwards and update the weights (use Gradient Descent).
113
Gradients and Learning Rate
 The weight update is the value of the gradient (slope in our example) weighted by a learning rate η

 A higher learning rate means w moves faster

114
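
A minimal sketch of the update rule the slide describes: the weight moves against the gradient, scaled by the learning rate η (the loss function and values are made up):

    # Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
    def grad(w):
        return 2 * (w - 3)

    w, eta = 0.0, 0.1              # initial weight and learning rate (made-up values)
    for _ in range(25):
        w = w - eta * grad(w)      # a higher learning rate eta moves w faster
    print(w)                       # close to the minimum at w = 3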
NN Node: Derivative of the Loss
[Diagram: the same unit - inputs x1..xN, weights w1..wN, bias b, weighted sum z, nonlinear transform g, output y - for which the derivative of the loss is computed]

z = Σ wi * xi + b

115
Convolutional Neural Networks
The name Convolutional Neural Network (CNN) indicates that the
network employs a mathematical operation called convolution.

Convolutional networks are a specialized type of neural networks


that use convolution in place of general matrix multiplication in at
least one of their layers.

CNN is able to successfully capture the spatial dependencies in an


image (data grid) through the application of relevant filters.

CNNs can reduce images (data grids) into a form which is easier to
process without losing features that are critical for getting a good
prediction.

116
Convolutional Neural Networks
[Figure: CNN architecture with convolution, pooling, and flattening stages]

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

117
Convolution: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolutional_Neural_Network_NeuralNetworkFilter.gif

118
Kernel / Filter: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif

119
Convoluting Matrices
Convolution (and Convolutional Neural Networks) can be applied
to any grid-like data (tensors: matrices, vectors, etc.).

kernel         data            "overlay" (element-wise products)
0 1 0          0 2 3           0*0  1*2  0*3
1 1 1   conv   2 4 1    ->     1*2  1*4  1*1     ->  sum = 12
0 1 0          0 3 0           0*0  1*3  0*0
120
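
A minimal numpy sketch reproducing the overlay-and-sum above:

    import numpy as np

    kernel = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])
    data   = np.array([[0, 2, 3],
                       [2, 4, 1],
                       [0, 3, 0]])

    overlay = kernel * data    # element-wise products ("overlay")
    print(overlay.sum())       # 12 - the value written into the output feature map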
Selected Image Processing Kernels

Sharpen Mean Blur Gaussian Blur

Laplacian Prewitt (Edge) Prewitt (Edge)

121
Image Processing: Kernels / Filters

122
Applying Kernels / Filters

3 x 3 Kernel / Filter

123
Convolutional NN Kernels
In practice, Convolutional Neural Network kernels can be larger than
3x3 and are learned using back propagation.

Convolution Layer 1 Convolution Layer 2 Convolution Layer 3

124
Convolution Layer 1

[Figure: Kernel 1 of Convolution Layer 1 applied to the input image]

125
Convolution Layer 1

[Figure: Kernel 1 and Kernel 2 of Convolution Layer 1 applied to the input image]

126
Convolution Layer 1

[Figure: the original image convolved with Kernel 1, Kernel 2, and Kernel 3 to produce the Convolution 1 feature maps]

127
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

128
Max Pooling Layer
[Figure: max pooling applied to the Convolution 1 feature maps]

129
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

130
Convolution Layer 2

[Figure: the pooled output of Convolution Layer 1 convolved with Kernel A, Kernel B, and Kernel C to produce the Convolution A feature maps]

131
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

132
Flattening
Final output of convolution layers is “flattened” to become a vector of features.

[Figure: the final convolution layer output converted to a vector]

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

133
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow cycles in the computational graph
(network). A network node (unit) can take its own output from an earlier step as
input (with delay introduced).
Enables having internal state / memory  inputs received earlier affect the RNN
response to current input.

134
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network. Unlike standard
feedforward neural networks, LSTM has feedback connections. Such a recurrent
neural network (RNN) can process not only single data points (such as images), but
also entire sequences of data (such as speech or video). This characteristic makes
LSTM networks ideal for processing and predicting data.

135
Large Language Model (LLM)
A large language model (LLM) is a language model
consisting of a neural network with many
parameters (typically billions of weights or more),
trained on large quantities of unlabeled text using
self-supervised learning.

Source: Wikipedia

136
Generative Pre-trained Transformer 3
What is it?
Generative Pre-trained Transformer 3 (GPT-3) is an
autoregressive language model that uses deep learning
to produce human-like text. It is the third-generation
language prediction model in the GPT-n series (and the
successor to GPT-2) created by OpenAI, a San Francisco-
based artificial intelligence research laboratory.

Size:
175 billion machine learning parameters
~45 GB
Source: Wikipedia

137
Parameters? What Are Those?
[Figure: the 2-layer network from before - its weights (and biases) are the model parameters]

138
Transformer Architecture

139
GPT-4 Architecture

Source: TheAiEdge.io

140
Self-Attention
In artificial neural networks, attention is a technique that is meant to mimic
cognitive attention. The effect enhances some parts of the input data while
diminishing other parts — the motivation being that the network should devote
more focus to the important parts of the data, even though they may be small.
Learning which part of the data is more important than another depends on the
context, and this is trained by gradient descent.

Source: Park et al. – “SANVis: Visual Analytics for Understanding Self-Attention Networks”

141
Generative Pre-trained Transformer 4
What is it?
Generative Pre-trained Transformer 4 (GPT-4) is a
multimodal large language model created by OpenAI. As a
transformer, GPT-4 was pretrained to predict the next
token (using both public data and "data licensed from
third-party providers"), and was then fine-tuned with
reinforcement learning from human and AI feedback for
human alignment and policy compliance.
Size:
1 trillion machine learning parameters

Source: Wikipedia

142
Large Language Models Data Sources

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

143
LLM Data Pre-Processing Pipeline

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

144
ChatGPT
What is it?
ChatGPT is a chatbot developed by OpenAI and released in
November 2022. It is built on top of OpenAI's GPT-3.5 and
GPT-4 families of large language models (LLMs) and has
been fine-tuned (an approach to transfer learning) using
both supervised and reinforcement learning techniques.

Source: Wikipedia

145
Transfer Learning
In transfer learning, experience with one
learning task helps an agent learn better on
another task.

Pre-trained models can be used as a starting


point for developing new models.

146
