
CS 585

Natural Language Processing

October 10, 2024


Announcements / Reminders
 Please follow the Week 08 To Do List instructions (if you haven't
already)

 Programming Assignment #02 due date moved from Sunday (10/13/24) to Sunday (10/20/24) at 11:59 PM CST

2
Plan for Today
 Classification: Evaluating and improving
performance
 Introduction to Neural Networks

3
Classification: Improving
Performance

4
Classifier Evaluation: Confusion Matrix

                                  Predicted class
                         Positive                Negative
Actual   Positive   True Positive (TP)     False Negative (FN)     Sensitivity (Recall)
class                                      Type II Error           = TP / (TP + FN)

         Negative   False Positive (FP)    True Negative (TN)      Specificity
                    Type I Error                                   = TN / (TN + FP)

                    Precision              Negative Predictive     Accuracy
                    = TP / (TP + FP)       Value                   = (TP + TN) /
                                           = TN / (TN + FN)          (TP + TN + FP + FN)

5
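
A minimal Python sketch (not from the slides) of how these metrics follow from the four confusion-matrix counts; the counts below are made up for illustration:

    # Hypothetical confusion-matrix counts (made-up numbers)
    tp, fn, fp, tn = 90, 10, 30, 870

    sensitivity = tp / (tp + fn)                    # recall / true positive rate
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    npv         = tn / (tn + fn)                    # negative predictive value
    accuracy    = (tp + tn) / (tp + tn + fp + fn)

    print(f"recall={sensitivity:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")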
Classifier Evaluation: Accuracy
 Why don't we use accuracy as our metric?
 Imagine we saw 1 million tweets
 100 of them talked about Delicious Pie Co.
 999,900 talked about something else

 We could build a dumb classifier that just labels every


tweet "not about pie"
 It would get 99.99% accuracy!!! Wow!!!!
 But useless! Doesn't return the comments we are looking for!
 That's why we use precision and recall instead
6
Classifier Evaluation: Accuracy
 Our dumb pie-classifier
 Just label nothing as "about pie"
 Accuracy = 99.99%, but
 Recall = 0 (it doesn't get any of the 100 Pie tweets)
 Precision and recall, unlike accuracy, emphasize
true positives:
 finding the things that we are supposed to be looking
for.

7
Classifier Performance Metrics
 Precision and recall provide two ways to
summarize the errors made for the positive class
(FP, FN).

 F-measure provides a single score that


summarizes the precision and recall.

 Accuracy summarizes the correct predictions for


both positive and negative classes.
8
Classifier Evaluation: F-Score
 F-Score is a measure of a model’s accuracy on a
dataset
 evaluates binary classification systems
 F-score is a way of combining the precision and
recall of the model
 Used in NLP, Information Retrieval, ML

9
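
The slide does not show the formula; the standard balanced F1 score is the harmonic mean of precision and recall. A minimal sketch:

    def f1_score(precision: float, recall: float) -> float:
        """Balanced F-measure: harmonic mean of precision and recall."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1_score(0.75, 0.90))   # 0.818... for e.g. precision 0.75 and recall 0.90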
Receiver Operating Characteristic
Threshold: 0.5
                 Predicted HAM    Predicted SPAM
Actual HAM       TP = 3           FN = 2             TPR = TP / (TP + FN) = 3/5
Actual SPAM      FP = 1           TN = 3             FPR = FP / (FP + TN) = 1/4

Threshold: 0.2
Actual HAM       TP = 4           FN = 1             TPR = 4/5
Actual SPAM      FP = 1           TN = 3             FPR = 1/4

Threshold: 0.8
Actual HAM       TP = 3           FN = 2             TPR = 3/5
Actual SPAM      FP = 0           TN = 4             FPR = 0/4

[Figure: ROC Curve - each (FPR, TPR) pair plotted with True Positive Rate (TPR) on the y-axis and False Positive Rate (FPR) on the x-axis, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
10
ROC Area Under the Curve
Same threshold confusion matrices (0.5, 0.2, 0.8) as on the previous slide.

[Figure: ROC Curve with the area under the curve shaded and labeled ROC AUC; axes: True Positive Rate (TPR) vs. False Positive Rate (FPR), both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
11
Receiver Operating Characteristic
Same threshold confusion matrices (0.5, 0.2, 0.8) as on slide 10.

[Figure: two ROC curves - a perfect classifier with AUC: 1.0 and the example classifier with AUC: ~0.8; axes: TPR vs. FPR, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
12
Receiver Operating Characteristic
Same threshold confusion matrices (0.5, 0.2, 0.8) as on slide 10.

[Figure: ROC curve - you want your classifier in the upper-left region (high TPR, low FPR), i.e. with high AUC; axes: TPR vs. FPR, both from 0 to 1]

TPR: Sensitivity | FPR: 1 - Specificity
13
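
A minimal sketch of how the (FPR, TPR) points above could be computed from classifier scores; the scores and labels are made up, chosen only so that the counts match the slides' three thresholds (HAM is the positive class):

    import numpy as np

    scores = np.array([0.90, 0.85, 0.80, 0.30, 0.10, 0.60, 0.15, 0.12, 0.05])  # made-up scores
    labels = np.array([1,    1,    1,    1,    1,    0,    0,    0,    0])     # 1 = HAM, 0 = SPAM

    def roc_point(threshold):
        pred = scores >= threshold                       # predict HAM when the score clears the threshold
        tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0)); tn = np.sum(~pred & (labels == 0))
        return fp / (fp + tn), tp / (tp + fn)            # (FPR, TPR) - one point on the ROC curve

    for t in (0.2, 0.5, 0.8):
        fpr, tpr = roc_point(t)
        print(f"threshold={t}: TPR={tpr:.2f} FPR={fpr:.2f}")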
Receiver Operating Characteristic

14
Precision - Recall Curve

[Figure: Precision-Recall curve - you want your classifier in the high-precision, high-recall region]

15
ROC vs. Precision-Recall Curves
 Both summarize model performance using
different probability thresholds
 ROC curves should be used when there are
roughly equal numbers of observations for each
class
 Precision-Recall curves should be used when
there is a moderate to large class imbalance
(when we are interested in the positive class and there are only a few positive samples)

16
3-class Confusion Matrix

17
Macroaveraging and Microaveraging
Macroaveraging:
 compute the performance for each class, and then
average over classes

Microaveraging:
 collect decisions for all classes into one confusion
matrix
 compute precision and recall from that table.

18
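
A minimal sketch of the difference between the two averages, using per-class counts that are only illustrative:

    # Hypothetical per-class counts: class -> (TP, FP)
    counts = {"urgent": (8, 10), "normal": (60, 55), "spam": (200, 33)}

    # Macroaveraging: compute precision per class, then average the per-class values
    macro_precision = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)

    # Microaveraging: pool all decisions into one table, then compute precision once
    total_tp = sum(tp for tp, fp in counts.values())
    total_fp = sum(fp for tp, fp in counts.values())
    micro_precision = total_tp / (total_tp + total_fp)

    print(f"macro precision={macro_precision:.2f}  micro precision={micro_precision:.2f}")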
Macroaveraging and Microaveraging

19
Text Classification System Pipeline
1. Obtain / collect / create labeled data set suitable for the task
2. Split the data set into:
 two (training and test sets) parts OR
 three (training, validation, and test sets) parts
3. Choose evaluation metric
4. Transform raw text into feature vectors:
 bag of words
 other types
5. Using feature vectors and labels from the training set, train the classifier /
create a model
6. Using evaluation metric from (3) benchmark the classifier / model
performance using the test set
7. Deploy the classifier / model to serve a real world application and monitor its
performance

20
Text Classification System Pipeline

1    Training data (texts and their labels)
          |
2-4  Pre-processing and feature extraction process
          |
5-6  Train and evaluate classifier(s) (learning)
          |
7    New data / text with unknown labels -> use classifier to predict labels on new data

21
Poor Classifier Performance: Reasons
1. With all possible features extracted, we ended up with a sparse feature vector (some features are too rare and end up being noise) → makes training hard
2. Few (~20%) relevant samples compared to non-relevant (~80%) samples in the data set → skews learning towards non-relevant data
3. Need better learning algorithm
4. Need better pre-processing / feature extraction
5. Classifier parameters / hyperparameters need tuning

22
Underfitting / Overfitting

[Figure: example model fits illustrating underfitting and overfitting]

Underfitting: “failing” to find the pattern in the data.
Overfitting: fitting the noise and peculiarities of the training data instead of the general pattern.

23
Bias vs. Variance
[Figure: example fits ranging from high bias / low variance to low bias / high variance]

 Bias: the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets
 Variance: the amount of change in the hypothesis (model) due to fluctuations in training data

24
Parameter Tuning

25
K-Fold Cross-Validation
Single validation split:    Train | Validate   ->  Score

4-fold cross-validation:
  Train      Train      Train      Validate   ->  ScoreA
  Train      Train      Validate   Train      ->  ScoreB
  Train      Validate   Train      Train      ->  ScoreC
  Validate   Train      Train      Train      ->  ScoreD

Score = (ScoreA + ScoreB + ScoreC + ScoreD) / 4

26
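
A minimal scikit-learn sketch of 4-fold cross-validation; the toy data, model choice, and scoring metric are assumptions, not part of the slide:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=400, n_features=20, random_state=0)  # toy data
    model = LogisticRegression(max_iter=1000)

    # Each fold is held out once for validation; the overall score is the mean of the fold scores
    scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    print(scores, scores.mean())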
Ensemble Learning
In ensemble learning we create a collection (an ensemble) of hypotheses (models) h1, h2, ..., hN and combine their predictions by averaging, voting, or another level of machine learning. Individual hypotheses (models) are base models and their combination is the ensemble model.
 Bagging
 Boosting
 Random Trees
 etc.

27
Bagging: Classification
In bagging we generate K training sets by sampling with replacement from the original training set.

Train set 1 (M data points)  ->  Model 1 | h1
Train set 2 (M data points)  ->  Model 2 | h2
Train set 3 (M data points)  ->  Model 3 | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  Model K | hK

28
Bagging: Classification
In bagging we generate K training sets by sampling with replacement from the original training set.

Train set 1 (M data points)  ->  NaiveBayes1 | h1
Train set 2 (M data points)  ->  NaiveBayes2 | h2
Train set 3 (M data points)  ->  NaiveBayes3 | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  NaiveBayesK | hK

Bagging tends to reduce variance and helps with smaller data sets.

29
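
A minimal scikit-learn sketch of bagging K Naive Bayes base models; the toy texts, vectorizer, and parameter values are assumptions:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts  = ["free pie coupons now", "meeting moved to noon", "win a free prize", "lunch at the pie shop"]
    labels = ["spam", "ham", "spam", "ham"]            # toy data
    X = CountVectorizer().fit_transform(texts)         # bag-of-words features

    # K = 10 Naive Bayes models, each trained on a bootstrap sample (sampling with replacement);
    # their predictions are combined by voting / averaging.
    # Note: in scikit-learn versions before 1.2 the parameter is named base_estimator.
    bagged = BaggingClassifier(estimator=MultinomialNB(), n_estimators=10, bootstrap=True, random_state=0)
    bagged.fit(X, labels)
    print(bagged.predict(X))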
Ensemble Classification
Individual hypotheses (models) are base models and their combination is the ensemble model.

Train set 1 (M data points)  ->  NaiveBayes1 | h1
Train set 2 (M data points)  ->  Perceptron  | h2
Train set 3 (M data points)  ->  k-NN        | h3     ->  Plurality vote  ->  Output
...
Train set K (M data points)  ->  NaiveBayes2 | hK

30
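
A minimal sketch of a heterogeneous ensemble (Naive Bayes, perceptron, k-NN) combined by plurality vote; the data and parameters are assumptions:

    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Perceptron
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier

    texts  = ["free pie coupons now", "meeting moved to noon", "win a free prize", "lunch at the pie shop"]
    labels = ["spam", "ham", "spam", "ham"]            # toy data
    X = CountVectorizer().fit_transform(texts)

    # voting="hard" is a plurality vote over the base models' predicted labels
    ensemble = VotingClassifier(
        estimators=[("nb", MultinomialNB()), ("perc", Perceptron()), ("knn", KNeighborsClassifier(n_neighbors=1))],
        voting="hard",
    )
    ensemble.fit(X, labels)
    print(ensemble.predict(X))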
Supervised Learning

31
What Kinds of Questions Does ML Answer?

Question                 ML Category              Example
Is this A or B?          Classification           Will this car fail in the next two months? Yes or no?
Is this weird?           Anomaly detection        Is this credit card charge normal?
How much / many?         Regression               What will the temperature be tomorrow?
How is this organized?   Clustering               Which car models have the most brake problems?
What should I do next?   Reinforcement learning   Adjust room humidity or leave as is?

32
Main Machine Learning Categories
Supervised learning: one of the most common techniques in machine learning. It is based on known relationship(s) and patterns within data (for example: the relationship between inputs and outputs). Frequently used types: regression and classification.

Unsupervised learning: involves finding underlying patterns within data. Typically used in clustering data points (similar customers, etc.).

Reinforcement learning: inspired by behavioral psychology. It is based on rewarding / punishing an algorithm. Rewards and punishments are based on the algorithm's actions within its environment.

33
Choosing Hypothesis / Model
Given a training set of N example input-output
(feature-label) pairs
(x1, y1), (x2, y2), ..., (xN, yN)
where each pair was generated by
y = f(x)
Ideally, we would like our model h(x) (hypothesis)
that approximates the true function f(x) to be:
h(x) = y = f(x) (consistent hypothesis)
34
Choosing Hypothesis / Model
Typically a consistent hypothesis is impossible or difficult to achieve:
 use best-fit model / hypothesis

Our model needs to be tested on the test set inputs


(data the model has not “seen” yet) to see how
well it generalizes (how accurately it predicts the
outputs of the test set).
35
Neural Networks
Basics

36
McCulloch-Pitts Model (1943)
[Diagram: inputs x1..x4 with weights w1..w4 feeding a summation unit Σ that produces output y]

First computational models of an Artificial Neural Network (loosely inspired by biological neural networks) were proposed by Warren McCulloch and Walter Pitts in 1943. Their ideas are a key component of modern day machine and deep learning.

37
A Biological Neuron
A neuron or nerve cell is an
electrically excitable cell that
communicates with other
cells via specialized
connections called synapses.
Most neurons receive signals
via the dendrites and soma
and send out signals down
the axon. At the majority of
synapses, signals cross from
the axon of one neuron to a
dendrite of another.

Source: https://en.wikipedia.org/wiki/Neuron

38
Biological vs. Artificial Neuron

[Figure: biological neuron annotated with the artificial-neuron analogues “Dendrites”, “Synapses”, “Neuron”, and “Axon”]

Source: https://en.wikipedia.org/wiki/Neuron

39
Artificial Neuron (Perceptron)
[Diagram: inputs x1..x4 (numbers), weights w1..w4, summation Σ, activation, output y]

A (single-layer) perceptron is a model of a biological neuron. It is made of the following components:
 inputs xi - numerical values (numbers) representing information
 weights wi - numerical values representing how “important” the corresponding input is
 weighted sum: Σ wi * xi
 activation function f that decides if the neuron “fires”

40
Artificial Neuron (Perceptron)
[Diagram: the same perceptron, with the weights w1..w4 highlighted as the model parameters]

A (single-layer) perceptron is a model of a biological neuron (same components as the previous slide); the weights wi are the model parameters.

41
Artificial Neuron (Perceptron)
[Diagram: two copies of the perceptron, one that does not fire and one that fires]

Σ wi * xi < 0  →  f = 0  →  DON’T “fire”          Σ wi * xi ≥ 0  →  f = 1  →  “fire”

42
Single-layer Perceptron as a Classifier
[Diagram: two copies of the perceptron used as a binary classifier]

Σ wi * xi < 0  →  f = 0  →  NO          Σ wi * xi ≥ 0  →  f = 1  →  YES

43
Perceptron with Step Activation
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, step activation f, output y]

Σ wi * xi + b  →  f  →  y

44
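
A minimal sketch of the unit above: a weighted sum plus bias passed through a step activation (the weights, bias, and inputs are made-up numbers):

    def step(z: float) -> int:
        # Step activation: "fire" (output 1) when the weighted sum is non-negative
        return 1 if z >= 0 else 0

    def perceptron(inputs, weights, bias):
        z = sum(w * x for w, x in zip(weights, inputs)) + bias   # sum of wi * xi, plus b
        return step(z)                                           # f(z) = y

    print(perceptron([1.0, 0.0, 2.0], [0.5, -1.0, 0.25], bias=-0.9))   # 1, since 0.5 + 0.5 - 0.9 >= 0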
Perceptrons = Linear Classifiers
[Diagram: the same perceptron with step activation; the output y is binary (0 or 1)]

Σ wi * xi + b  →  f  →  y

45
Classification: Linear Separation
[Figure: HAM and SPAM points in feature space separated by a straight line]

The w·x + b = 0 line is the decision boundary.

46
Perceptron with Sigmoid Activation
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, sigmoid activation f, output y]

Σ wi * xi + b  →  f  →  y

47
Logistic Regression Classifier
[Diagram: the same unit with sigmoid activation - this is a logistic regression classifier]

Σ wi * xi + b  →  f  →  y

48
Single-layer Perceptron as a Classifier
[Diagram: inputs word1..wordN with weights w1..wN feeding the summation and step activation]

Σ wi * wordi < 0  →  f = 0  →  SPAM          Σ wi * wordi ≥ 0  →  f = 1  →  HAM

49
Basic Neural Unit
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, nonlinear transform σ, output y]

z = Σ wi * xi + b          y = σ(z)

50
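
A minimal numpy sketch of the unit above: z is the weighted sum plus bias, followed by the nonlinear transform (sigmoid); the numbers are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 0.6, 0.1])   # feature vector (made-up values)
    w = np.array([0.2, 0.3, 0.9])   # weights
    b = 0.5                         # bias

    z = np.dot(w, x) + b            # weighted sum
    y = sigmoid(z)                  # nonlinear transform
    print(z, y)                     # 0.87 and roughly 0.70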
Basic Neural Unit

[Figure: the basic neural unit with its parts labeled: feature vector (input layer), weights, bias, weighted sum, nonlinear activation function (can differ for each layer!), output]

51
Selected Activation Functions
[Figure: plots of selected activation functions]

ReLU: Rectified Linear Unit

52
Classification: Linear Separation?
[Figure: HAM and SPAM points arranged so that no straight line can separate them]

Sometimes the decision boundary CANNOT be linear: the function is not linearly separable.

53
Hypothesis: Classification “Boundary”

54
XOR: Not a Linearly Separable f()

Logical XOR is an example of a function that is NOT linearly separable

55
XOR: Not a Linearly Separable f()?

56
Artificial Neural Network

57
Basic Neural Unit
[Figure: a single unit drawn as a network - an input layer connected by weights to the output layer]

58
Basic Neural Unit

[Figure: the basic neural unit with its parts labeled: feature vector (input layer), weights, bias, weighted sum, nonlinear activation function (can differ for each layer!), output]

59
Artificial Neural Network (ANN)
An artificial neural network is made of multiple artificial neuron layers.

[Figure: network with an input layer, two hidden layers, and an output layer]

60
Feedforward Neural Network
[Figure: feedforward network - features enter the input layer and pass through weights, two hidden layers, and the output layer to produce the output]

Also called (historically): multi-layer perceptron

61
XOR: Hidden Layer Approach

62
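
XOR cannot be computed by a single linear unit, but a network with one hidden layer can compute it. A minimal sketch with hand-picked weights (one standard construction, not necessarily the one shown on the slide):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    # Hidden layer: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1); output: y = h1 - 2 * h2
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    W2 = np.array([1.0, -2.0])

    def xor(x1, x2):
        h = relu(W1 @ np.array([x1, x2]) + b1)
        return int(W2 @ h)

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, "->", xor(a, b))    # prints 0, 1, 1, 0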
Hidden Layer
[Figure: network with an input layer, one hidden layer, and an output layer]

63
2 Layer Network
[Figure: 2-layer network - input layer (features), hidden layer, output layer]

64
Training Data: Features + Labels
Typically input data will be represented by a limited set of features.
Features: Wheels: 4, Weight: 8 tons, Passengers: 1    →  Label: Truck
Features: Wheels: 6, Weight: 8 tons, Passengers: 1    →  Label: Truck
Features: Wheels: 4, Weight: 1 ton,  Passengers: 4    →  Label: Car
Features: Wheels: 4, Weight: 2 tons, Passengers: 4    →  Label: Car

65
ANN: Supervised Learning
[Figure: network whose input layer takes the features wheels, weight, and passengers, followed by two hidden layers and an output layer]

66
Training Data: Images + Labels
A classifier needs to be “shown” thousands of labeled examples to learn.

[Figure: labeled example images - BUS, CAR, BRIDGE, PALM, TRAFFIC LIGHT, TAXI, CROSSWALK, CHIMNEY, MOTORCYCLE, STREET SIGN, HYDRANT, BICYCLE]
Note how some images are “incomplete” and “flawed”.

67
Digit Image as ANN Feature Set
Individual features need to be “extracted” from an image. An image is numbers.

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

68
ANN: Supervised Learning
An untrained classifier will NOT label input data correctly.
[Figure: untrained network given an input image produces arbitrary output-layer scores (0.12, 0.99, 0.55, Other)]

69
ANN: Training
Given input data and its corresponding expected label (DOG), calculate the “error”.

[Figure: the network outputs 0.12 for the DOG class (should be 1!), along with other output scores (0.99, 0.55, Other)]

“Error” = 0.88. Go back and adjust all the weights to ensure it is lower next time.

70
ANN: Training
Show a data / label pair: [image] / DOG.

[Figure: the same network and outputs - the DOG output should be 1]

Correct all the weights. Repeat many times.

71
ANN as a Complex Function
In ANNs hypotheses take form of complex algebraic circuits with
tunable connection strengths (weights).
[Figure: network with input layer, two hidden layers, and output layer - the weights are the tunable connection strengths]

72
Exercise: ANN Demo
http://playground.tensorflow.org/

73
Logistic Regression = 1 Layer Network
[Diagram: inputs x1..xN, weights w1..wN, bias b, summation Σ, sigmoid f, scalar output y]

Σ wi * xi + b  →  f  →  y

74
Binary Logistic Regression = 1 Layer
[Figure: input layer connected by weights directly to a single output node]

75
Multinomial Logistic Regression
[Figure: input layer connected by weights to several output nodes]

Multinomial Logistic Regression is still a 1-layer network.

76
Fully Connected Network
[Figure: every input node connected by a weight to every output node]

This Multinomial Logistic Regression network is a fully connected network.

77
Softmax: Sigmoid Generalization

78
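
The slide's figure is not reproduced here; as a reminder, softmax generalizes the sigmoid from a single output to a vector of K outputs that sum to 1. A minimal sketch:

    import numpy as np

    def softmax(z):
        # Numerically stable softmax: shift by the max, exponentiate, normalize
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([2.0, 1.0, 0.1])   # made-up output-layer scores
    print(softmax(z))               # roughly [0.66 0.24 0.10] - a probability distribution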
Binary Logistic Regression
[Figure: input layer (features) connected by weights to one output node]

In Binary Logistic Regression the output is a SCALAR:  y = σ(Σ wi * xi + b)    [σ - sigmoid]

79
Multinomial Logistic Regression
[Figure: input layer (features) connected by weights to several output nodes]

In Multinomial Logistic Regression the output is a VECTOR:  y = s(Σ wi * xi + b)    [s - softmax]
80
2 Layer Network
[Figure: 2-layer network - input layer (features), hidden layer, output layer]

81
2 Layer Network
[Figure: the same 2-layer network, with node indices i and j marked for the weight notation]

82
2 Layer Network
[Figure: 2-layer network - each hidden unit applies activation function f1, the output unit applies activation function f2]

Activation function f1: sigmoid, tanh, ReLU, etc. | Activation function f2: sigmoid

83
2 Layer Network
[Figure: 2-layer network - each hidden unit applies activation function f1, the output units apply activation function f2]

Activation function f1: sigmoid, tanh, ReLU, etc. | Activation function f2: softmax

84
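
A minimal numpy sketch of the forward pass for a 2-layer network like the one above (ReLU as f1 in the hidden layer, softmax as f2 at the output); the sizes and values are made up:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    x  = rng.normal(size=4)                            # feature vector (4 made-up features)
    W1 = rng.normal(size=(5, 4)); b1 = np.zeros(5)     # input layer -> hidden layer (5 units)
    W2 = rng.normal(size=(3, 5)); b2 = np.zeros(3)     # hidden layer -> output layer (3 classes)

    h = relu(W1 @ x + b1)                              # hidden layer, activation f1
    y = softmax(W2 @ h + b2)                           # output layer, activation f2
    print(y, y.sum())                                  # a distribution over the 3 classes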
Multilayer Neural Net: Notation

85
Multilayer Neural Net: Notation
[Diagram: inputs x1..xN, weights w1..wN, bias b; weighted sum z; nonlinear transform g; activation a; output y]

z = Σ wi * xi + b          a = g(z)

86
Multilayer Neural Net: Notation
[Diagram: same unit and notation as the previous slide]

z = Σ wi * xi + b          a = g(z)

87
Replacing the Bias Unit
[Diagram: the same unit, with the separate bias term b highlighted]

z = Σ wi * xi + b          The separate bias term is a bit inconvenient.

88
Replacing the Bias Unit
Let's switch to a notation without the bias unit:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]0 = 1, a[1]0 = 1, a[2]0 = 1, …
and the bias is absorbed into the weighted sum: z = Σ wi * ai (with a0 = 1 and w0 = b).

89
Replacing the Bias Unit

90
Deep Learning

91
Deep Learning
Deep learning is a broad family of techniques for
machine learning (also a sub-field of ML) in which
hypotheses take the form of complex algebraic
circuits with tunable connections. The word “deep”
refers to the fact that the circuits are typically
organized into many layers, which means that
computation paths from inputs to outputs have
many steps.

92
Shallow vs. Deep Models

[Figure: two shallow models side by side with a deep model - the deep model has a longer computation path from inputs to outputs]
93
Machine Learning vs. Deep Learning

Source: https://www.quora.com/What-is-the-difference-between-deep-learning-and-usual-machine-learning

94
Machine Learning vs. Deep Learning

Source: https://www.intel.com/content/www/us/en/artificial-intelligence/posts/difference-between-ai-machine-learning-deep-
learning.html

95
Deep Learning: Feature Extraction

Source: https://en.wikipedia.org/wiki/Deep_learning

96
Neural Networks in NLP
Let’s consider the NLP modeling we explored
so far:
 Classification
 Language Modeling
Can we apply Neural Networks?

97
Logistic Regression Sentiment Analysis
[Figure: 1-layer network - input layer (features) connected by weights to a single output node that produces a BINARY answer]

98
Logistic Regression Sentiment Analysis
[Figure: 2-layer network - input layer (features), hidden layer, and an output node that produces a BINARY answer]

99
Complex Feature Vector Relationships

Adding hidden layers can help capture non-linear relationships between features!

100
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

101
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDem
o/index.html

102
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the input layer of a 2-layer network]

Multiclass output: add more output layer nodes + use softmax (instead of sigmoid)

103
Embeddings as Input Features

104
Embeddings as Input Features

Assumption:
“3-word sentences”

105
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the input layer of a 2-layer network whose output is a BINARY answer]

106
Texts in Different Sizes: Ideas
Some simple solutions:
1. Make the input the length of the longest sample
 if shorter then pad with zero embeddings
 truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same
dimensionality as a word) to represent all the words
 take the mean of all the word embeddings
 take the element-wise max of all the word
embeddings
 for each dimension, pick the max value from all
words
107
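
A minimal numpy sketch of idea 2 above: building a single "sentence embedding" from word embeddings by taking the mean or the element-wise max (the embeddings are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    word_embeddings = rng.normal(size=(7, 50))        # 7 words in the text, 50-dim embeddings (stand-ins)

    mean_embedding = word_embeddings.mean(axis=0)     # average of all the word embeddings
    max_embedding  = word_embeddings.max(axis=0)      # for each dimension, the max value over all words

    print(mean_embedding.shape, max_embedding.shape)  # both (50,) - same dimensionality as one word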
Language Models Revisited
Language Modeling: Calculating the probability of the next
word in a sequence given some history.
• N-gram based language models
• other: neural network-based?

Task: predict next word wt


given prior words wt-1, wt-2, wt-3, …
Problem: Now we’re dealing with sequences of arbitrary
length.
Solution: Sliding windows (of fixed length)

108
Neural Language Model

109
Neural LM Better Than N-Gram LM
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed

Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog" embeddings
to generalize and predict “fed” after dog

110
Training Neural Networks

111
Training Neural Networks: Intuition
For every training tuple (x ,y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update weights:
  For every output node
  Compute loss L between true y and the estimated ŷ
 For every weight w from hidden layer to the output layer
 Update the weight

 For every hidden node


 Assess how much blame it deserves for the current answer
 For every weight w from input layer to the hidden layer
 Update the weight

112
Back-propagation
1. Feed forward: feed a labeled sample (inputs x and y with weights w1 and w2) through the network to compute z = f(x, y).
2. Evaluate loss: how “incorrect” is the result compared to the label?  Loss = z - z_expected
3. Back-propagation: propagate the gradients ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backwards and update the weights (use Gradient Descent).
113
Gradients and Learning Rate
 The weight update is the value of the gradient (slope in our example) weighted by a learning rate η

 A higher learning rate means w moves faster

114
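
A minimal sketch of the update rule the slide describes: the weight moves against the gradient, scaled by the learning rate η (the loss function and values are made up):

    # Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
    def grad(w):
        return 2 * (w - 3)

    w, eta = 0.0, 0.1              # initial weight and learning rate (made-up values)
    for _ in range(25):
        w = w - eta * grad(w)      # a higher learning rate eta moves w faster
    print(w)                       # close to the minimum at w = 3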
NN Node: Derivative of the Loss
[Diagram: the same unit - inputs x1..xN, weights w1..wN, bias b, weighted sum z, nonlinear transform g, output y - for which the derivative of the loss is computed]

z = Σ wi * xi + b

115
Convolutional Neural Networks
The name Convolutional Neural Network (CNN) indicates that the
network employs a mathematical operation called convolution.

Convolutional networks are a specialized type of neural networks


that use convolution in place of general matrix multiplication in at
least one of their layers.

CNN is able to successfully capture the spatial dependencies in an


image (data grid) through the application of relevant filters.

CNNs can reduce images (data grids) into a form which is easier to
process without losing features that are critical for getting a good
prediction.

116
Convolutional Neural Networks
[Figure: CNN architecture with convolution, pooling, and flattening stages]

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

117
Convolution: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolutional_Neural_Network_NeuralNetworkFilter.gif

118
Kernel / Filter: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif

119
Convoluting Matrices
Convolution (and Convolutional Neural Networks) can be applied
to any grid-like data (tensors: matrices, vectors, etc.).

kernel         data            "overlay" (element-wise products)
0 1 0          0 2 3           0*0  1*2  0*3
1 1 1   conv   2 4 1    ->     1*2  1*4  1*1     ->  sum = 12
0 1 0          0 3 0           0*0  1*3  0*0
120
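
A minimal numpy sketch reproducing the overlay-and-sum above:

    import numpy as np

    kernel = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])
    data   = np.array([[0, 2, 3],
                       [2, 4, 1],
                       [0, 3, 0]])

    overlay = kernel * data    # element-wise products ("overlay")
    print(overlay.sum())       # 12 - the value written into the output feature map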
Selected Image Processing Kernels

Sharpen Mean Blur Gaussian Blur

Laplacian Prewitt (Edge) Prewitt (Edge)

121
Image Processing: Kernels / Filters

122
Applying Kernels / Filters

3 x 3 Kernel / Filter

123
Convolutional NN Kernels
In practice, Convolutional Neural Network kernels can be larger than
3x3 and are learned using back propagation.

Convolution Layer 1 Convolution Layer 2 Convolution Layer 3

124
Convolution Layer 1

[Figure: Kernel 1 of Convolution Layer 1 applied to the input image]

125
Convolution Layer 1

[Figure: Kernel 1 and Kernel 2 of Convolution Layer 1 applied to the input image]

126
Convolution Layer 1

[Figure: the original image convolved with Kernel 1, Kernel 2, and Kernel 3 to produce the Convolution 1 feature maps]

127
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

128
Max Pooling Layer
[Figure: max pooling applied to the Convolution 1 feature maps]

129
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

130
Convolution Layer 2

[Figure: the pooled output of Convolution Layer 1 convolved with Kernel A, Kernel B, and Kernel C to produce the Convolution A feature maps]

131
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

132
Flattening
Final output of convolution layers is “flattened” to become a vector of features.

[Figure: the final convolution layer output converted to a vector]

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

133
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow cycles in the computational graph
(network). A network node (unit) can take its own output from an earlier step as
input (with delay introduced).
Enables having internal state / memory  inputs received earlier affect the RNN
response to current input.

134
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network. Unlike standard
feedforward neural networks, LSTM has feedback connections. Such a recurrent
neural network (RNN) can process not only single data points (such as images), but
also entire sequences of data (such as speech or video). This characteristic makes
LSTM networks ideal for processing and predicting data.

135
Large Language Model (LLM)
A large language model (LLM) is a language model
consisting of a neural network with many
parameters (typically billions of weights or more),
trained on large quantities of unlabeled text using
self-supervised learning.

Source: Wikipedia

136
Generative Pre-trained Transformer 3
What is it?
Generative Pre-trained Transformer 3 (GPT-3) is an
autoregressive language model that uses deep learning
to produce human-like text. It is the third-generation
language prediction model in the GPT-n series (and the
successor to GPT-2) created by OpenAI, a San Francisco-
based artificial intelligence research laboratory.

Size:
175 billion machine learning parameters
~45 GB
Source: Wikipedia

137
Parameters? What Are Those?
[Figure: the 2-layer network from before - its weights (and biases) are the model parameters]

138
Transformer Architecture

139
GPT-4 Architecture

Source: TheAiEdge.io

140
Self-Attention
In artificial neural networks, attention is a technique that is meant to mimic
cognitive attention. The effect enhances some parts of the input data while
diminishing other parts — the motivation being that the network should devote
more focus to the important parts of the data, even though they may be small.
Learning which part of the data is more important than another depends on the
context, and this is trained by gradient descent.

Source: Park et al. – “SANVis: Visual Analytics for Understanding Self-Attention Networks”

141
Generative Pre-trained Transformer 4
What is it?
Generative Pre-trained Transformer 4 (GPT-4) is a
multimodal large language model created by OpenAI. As a
transformer, GPT-4 was pretrained to predict the next
token (using both public data and "data licensed from
third-party providers"), and was then fine-tuned with
reinforcement learning from human and AI feedback for
human alignment and policy compliance.
Size:
1 trillion machine learning parameters

Source: Wikipedia

142
Large Language Models Data Sources

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

143
LLM Data Pre-Processing Pipeline

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

144
ChatGPT
What is it?
ChatGPT is a chatbot developed by OpenAI and released in
November 2022. It is built on top of OpenAI's GPT-3.5 and
GPT-4 families of large language models (LLMs) and has
been fine-tuned (an approach to transfer learning) using
both supervised and reinforcement learning techniques.

Source: Wikipedia

145
Transfer Learning
In transfer learning, experience with one
learning task helps an agent learn better on
another task.

Pre-trained models can be used as a starting


point for developing new models.

146
