NLP_Module_I_IV
Natural Language
Processing
Modules:
➢ Module I:
➢ Introduction to Natural language processing (NLP)
➢ NLP problems in text summarization, text classification, sentiment
analysis, question answering, neural translation, etc.
➢ Module II:
➢ Introduction to Neural Networks
➢ Optimization formulations for Deep learning, Gradient-based
optimization, Gradient descent,
➢ Neural networks, feed-forward NN, Gradient-based learning, and back-
propagation, and differentiation algorithms
➢ Module III:
➢ Elements of NLP: Expression, word, corpora,
➢ Token and tokenization, word normalization, lemmatization, stemming,
sentence segmentation, sequence labelling, context-free grammars
2
Modules:
➢ Learning Outcomes
➢ Module I: Understand the importance of NLP problems and different types
of problems in the literature
➢ Module II: Understand the concepts of Optimization for Deep Learning,
Basics of Neural Networks, Feed-forward Neural Networks, Back
propagation and automatic differentiation
➢ Module III: Basic terminology and operation in text processing, different
types of tokenization, word normalization, minimum edit distance and their
computations
3
Modules:
➢ Module IV:
➢ N-gram Language models, lexical and vector semantics,
➢ TF-IDF, word2vector, semantic properties of embeddings
➢ Module V:
➢ Recurrent Neural Networks (RNNs),
➢ RNNs for Language modeling, Sequence labeling and classification,
sequence-to-sequence tasks,
➢ Stacked and bidirectional RNNs, LSTMs
4
Modules:
➢ Learning Outcomes
➢ Module IV: Understand the formulation of language models (LMs), different LMs
such as n-grams, TF-IDF, etc., and evaluation metrics from a language viewpoint
➢ Module V: Basics of RNNs and their variants, sequence modeling and
different problems associated with sequence modeling
5
Modules:
➢ Module VI:
➢ Introduction to transformer:
➢ Attention mechanism, multi-head attention, transformer architecture,
➢ Embedding of tokens and positions, language model head
➢ Module VII:
➢ Large Language Models: Use of Transformer Architecture,
➢ Pretraining architectures: Decoders, Encoders, and Encoder-Decoders,
➢ Finetuning, Generative Pretrained transformer
➢ Module VIII: More on LLMs:
➢ Scaling laws and scaling efficiency
➢ Prompting and instruction tuning, alignment, preference fine-tuning
➢ Applications of LLMs in Machine translations, question answering, and
information retrieval
6
Modules:
➢ Learning Outcomes
➢ Module VI: Learn the concepts of transformers and associated concepts
such as attention, multi-head attention, etc.
➢ Module VII: Different architectures of LLMs, finetuning methods
➢ Module VIII: Understand the scaling laws, and model parameters, prompt
engineering aspects and applications of LLMs
7
Module I: Introduction to Natural
language processing (NLP)
8
Quotes to remember
“For a large class of cases-though not for all-in which we employ
the word ‘meaning’ it can be defined thus: the meaning of a word
is its use in the language.”
- (Wittgenstein 1968, §43)
“All grammars leak”
-(Sapir 1921: 38)
“You shall know a word by the company it keeps”
-(Firth 1957: 11)
9
Can machines mimic humans?
Well! Yes. As wise as in this example
Killing is a honor
Generated text:
Killing is a honor in my tribe. It showed that you were strong, brave, and willing to do
what needed to be done. To get this close to him, to have my arrow ready to fire the
moment he revealed himself to be what he was, filled me with a strange mix of fear and
excitement. Thoughts of the demon were pushed to the
Source: https://app.inferkit.com/generate
10
Or as stupid as in this example
Generated text
Or prove it.
There is no need to go to any center to learn things.
Today was an easy day.
I went into town to get to the PC shop.
I found the shop but it wasn't open.
I rang the bell but nobody came out.
I decided I would go to look at the esplanade.
The trees were starting to turn colors.
11
What is NLP?
⚫ Programming computers to process and analyze large amounts of natural
language data
⚫ Subfield of Linguistics and Computer Science
12
History
⚫ Symbolic NLP (1950 – early 1990s) – Mostly hand-written rules:
◦ Rule based parsing
◦ Morphology
◦ Semantics
⚫ Statistical NLP (1990 – 2010s) – Textual corpora were used predominantly
◦ Supervised Learning over hand-annotated data
◦ Unsupervised, semi-supervised learning over unannotated internet data
◦ Machine translation of governmental proceedings as a major focus
⚫ Neural NLP (present) – Deep Learning
◦ State of the art techniques
◦ Language modeling and many other applications
13
Applications of NLP
➢ Spam Detection:
✓ Scanning emails for words that indicate spam
➢ Machine Translation:
✓ Google Translate is a well-known example of machine translation
✓ Capturing the meaning and tone of the source language is important
14
Applications of NLP
➢ Virtual Agents and Chat Bots:
✓ Siri and Alexa are examples of virtual agents that can take voice
commands and perform tasks
✓ Chat bots are developed to respond to human typed questions with
helpful answers
✓ Most websites which directly interact with many consumers have these
chat boxes
➢ Social Media Sentiment Analysis:
✓ Analysing social media posts, reviews, etc. to extract response
(positive/negative) to products, events, movies, etc.
➢ Text Summarisation:
✓ To ingest huge volumes of text and create summaries for indexes, busy
readers, etc.
15
Other Applications
➢ Drug Discovery
➢ Developing new drugs by understanding language(s) of molecules
➢ Recommender systems:
Recommending products or movies to consumers based on their
historical consumption
➢ Targeted Advertisements:
Getting insights into customer behaviour and needs, and targeting ads accordingly
based on search history, clicked items, etc.
16
Motivation: Biological Neuron to
Artificial Neuron
17
Module II: Introduction to Neural
Networks
18
Biological Neuron
• Basic working unit of the brain and nervous system
• Close to 100 billion interconnected neurons in a human brain
• Function together to aid decision making
• Parts and Functioning:
➢ Dendrite: Takes signals (stimulus) from the other neurons or other cells in the body
➢ Cell body (soma): Processes the signal and may or may not fire the neuron – excitation and inhibition
➢ Axon: Transmits the output (response) to other neurons or cells
[Figure: biological neuron – dendrites, nucleus, cell body (soma), axon, axon terminals]
19
Biological Neuron to Artificial Neuron
Artificial neuron
➢ Mathematical model of a biological neuron
➢ Mimics the functioning of a biological neuron
➢ Takes input in the form of numbers
➢ Processes the input to give out an output
➢ Output = f(inputs)
➢ Different models of artificial neurons have been developed based on this idea
[Figure: biological neuron (dendrites, nucleus, cell body, axon, axon terminals) alongside an artificial neuron (inputs → f(.) → output)]
20
Artificial Neuron: McCulloch Pitts Model
o Inputs x1, x2, …, xn are binary numbers (0 or 1)
o Aggregated input passes through an activation function to give output (y)
o Activation function is based on thresholding logic
o Here, model refers to the function relating the output to the inputs
   a = x1 + x2 + … + xn
   y = f(a) = 1 if a ≥ θ
   y = f(a) = 0 if a < θ
[Figure: McCulloch Pitts model – binary inputs x1…xn, summation Σ, threshold θ, output y ∈ {0,1}]
21
McCulloch Pitts Model: Boolean Functions
o This model can be used to represent most Boolean functions

Logical AND function (2 inputs): θ = 2, so y = 1 if a ≥ 2, y = 0 if a < 2
  x1  x2  y
  0   0   0
  0   1   0
  1   0   0
  1   1   1

Logical OR function (2 inputs): θ = 1, so y = 1 if a ≥ 1, y = 0 if a < 1
  x1  x2  y
  0   0   0
  0   1   1
  1   0   1
  1   1   1
22
McCulloch Pitts Model: Drawbacks
• Drawbacks:
➢ Cannot handle non-boolean inputs and outputs
➢ Deciding an appropriate threshold value might be hard as the
number of inputs increases
➢ Equal weightage to all inputs – What if more importance is to
be attached to some inputs?
• How to overcome these issues? – Perceptron Model
23
Artificial Neuron: Perceptron Model
o Inputs x1, x2, …, xn are real numbers
o Neuron takes the weighted combination of the inputs
o Bias (b) is added to the weighted inputs
o Weighted input passes through an activation function to give the output (y)
   a = w1 x1 + w2 x2 + … + wn xn + b
   y = f(a) = f(w1 x1 + w2 x2 + … + wn xn + b)
[Figure: perceptron – inputs x1…xn with weights w1…wn, bias b, summation Σ, activation f(.), output y]
24
Artificial Neuron: A Simple Example
o Using an artificial neuron to decide whether to watch a movie or not
o Weights: 0.4, 0.3, 0.2, 0.1; activation: f(a) = 1 if a > 7, f(a) = 0 otherwise

  Feature             Value for Movie 1   Value for Movie 2
  Lead Actor (x1)     10                  7
  Director (x2)       8                   5
  Thrill factor (x3)  8                   9
  Run time (x4)       9                   5

  Movie 1: a = 8.9, y = f(a) = 1          Movie 2: a = 6.7, y = f(a) = 0
25
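As a quick illustration of the movie example above, here is a minimal Python sketch of the perceptron decision (weights, threshold and feature values are the ones used on this slide).

```python
# Minimal perceptron sketch for the movie example (values from the slide).
def perceptron(x, w, threshold=7.0):
    a = sum(wi * xi for wi, xi in zip(w, x))   # weighted sum of inputs
    return 1 if a > threshold else 0            # step activation

w = [0.4, 0.3, 0.2, 0.1]       # lead actor, director, thrill factor, run time
movie1 = [10, 8, 8, 9]
movie2 = [7, 5, 9, 5]

print(perceptron(movie1, w))   # 1 -> watch (a = 8.9)
print(perceptron(movie2, w))   # 0 -> skip  (a is below the threshold of 7)
```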
Artificial Neural Network
26
Artificial Neural Network: Motivation
➢ One neuron is not sufficient to take complex decisions (complex functions)
➢ Again, inspired by the brain's neural network, the artificial neural network was developed
➢ In the brain, many neurons are involved in taking a decision
➢ All the neurons are inter-connected in the brain
➢ They are arranged hierarchically in layers
[Figure: perceptron model – inputs x1…xn, weights w1…wn, bias b, a = w1x1 + w2x2 + … + wnxn + b, y = f(a)]
27
Artificial Neural Network (ANN)
o ANN consists of multiple layers with multiple neurons in each layer (hidden layers)
o Each neuron (except the inputs) represents a perceptron model
o Every neuron in one layer is connected to every neuron in the successive layer
o Outputs of one layer's neurons are passed as inputs to the neurons of the next layer
[Figure: inputs at the bottom, hidden layers of neurons, output at the top]
28
ANN Architecture
• Input layer (0th) with n inputs: x = [x1 x2 … xn]^T
• L − 1 hidden layers with m neurons each
• Output layer (Lth) with k neurons
• W_i is the matrix containing the weights between layers i − 1 and i (0 < i ≤ L)
• b_i is the vector representing the biases
   W_1 = [ w^1_11 … w^1_1n ; … ; w^1_m1 … w^1_mn ]  (m × n),   b_1 = [b^1_1, …, b^1_m]^T
   W_i = [ w^i_11 … w^i_1m ; … ; w^i_m1 … w^i_mm ]  (m × m),   b_i = [b^i_1, …, b^i_m]^T
[Figure: network with weight matrices W1, W2, W3 and biases b1, b2, b3 between successive layers; inputs x1…xn at the bottom, output y at the top]
29
ANN Architecture
   W_L = [ w^L_11 … w^L_1m ; … ; w^L_k1 … w^L_km ]  (k × m),   b_L = [b^L_1, …, b^L_k]^T
• For a single output, W_L will be a vector and b_L will be a scalar
• Each neuron in the hidden and output layers has an activation function
• If there are more than 3 hidden layers, then the ANN is referred to as a Deep Neural Network (DNN) – depth refers to the number of layers
[Figure: network diagram with W1, b1, W2, b2, W3, b3]
30
DNN Feed Forward Calculation
• Feed forward calculation involves finding the output as a function of the input, weights and biases
• Input to the activation function at hidden layer 1:  a_1 = W_1 x + b_1
• Activation at hidden layer 1:  h_1 = g_h(a_1)
• h_i is the output vector at layer i
• g_h is the activation function which maps vector a_i to vector h_i
• Input to the activation function at hidden layer i:  a_i = W_i h_{i−1} + b_i
• Activation at hidden layer i:  h_i = g_h(a_i) = g_h(W_i h_{i−1} + b_i)
[Figure: network annotated with a1, h1, a2, h2, a3 between layers]
31
DNN Feed Forward Calculation
   a_1 = W_1 x + b_1,   h_1 = g_h(a_1)
   a_i = W_i h_{i−1} + b_i,   h_i = g_h(a_i) = g_h(W_i h_{i−1} + b_i)
• Input to the activation function at output layer L:  a_L = W_L h_{L−1} + b_L
• Activation at the output layer:  ŷ = g_o(a_L) = g_o(W_L h_{L−1} + b_L)
• Model (function) being approximated by the DNN (assuming L = 3):
   ŷ = g_o( W_3 g_h( W_2 g_h( W_1 x + b_1 ) + b_2 ) + b_3 )
   ŷ = f(x)
32
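To make the composition above concrete, here is a minimal NumPy sketch of the feed-forward pass for an L = 3 network. The layer sizes, the random parameters, and the choice of ReLU/softmax are illustrative assumptions, not values from the slides.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def softmax(a):
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

def feed_forward(x, weights, biases):
    """weights/biases: lists [W1, W2, W3], [b1, b2, b3]."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)          # hidden layers: a_i = W_i h_{i-1} + b_i, h_i = g_h(a_i)
    return softmax(weights[-1] @ h + biases[-1])   # output: y_hat = g_o(W_L h_{L-1} + b_L)

# Tiny example: 2 inputs, two hidden layers of 3 neurons, 3 outputs
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(3, 3))]
bs = [rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)]
print(feed_forward(np.array([1.5, 0.5]), Ws, bs))   # probabilities summing to 1
```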
Types of Activation Functions
33
Activation Function
o Activation function is like a gate between the input and output of a
neuron
o Purpose: To introduce non-linearity into the model and enable
learning complex functions (models)
o It affects the DNN output, accuracy and convergence
o Types of activation function:
1. Linear activation function
2. Sigmoid activation function
3. Tanh activation function
4. Relu activation function
5. Softmax activation function
34
Linear Activation Function
• Output is directly proportional to the input:  f(a) = c·a
• Output can take any real number
• Gradient is always constant and does not depend on the input
• Generally used in the output layer of regression problems
[Plot: f(a) = c·a, a straight line through the origin]
35
Sigmoid Activation Function
• Any value of the input is mapped to a value between 0 and 1:  f(a) = 1 / (1 + e^(−a))
• Gradient is close to zero when the output is close to 0 or 1
• Useful when the expected output is a probabilistic value between 0 and 1
[Plot: sigmoid curve rising from 0 to 1]
36
Softmax Activation Function
• Sigmoid gives a value between 0 and 1, and can be used for binary classification
• However, sigmoid cannot be used to output multiple probability values which add up to 1 (multi-class)
• Softmax function is an extension of the sigmoid function
• Softmax calculates the relative probabilities of multiple classes and ensures that the total probability is 1
• Input to the output layer of a DNN: a = [a1 a2 … ak]
   f(a_i) = e^(a_i) / Σ_{j=1}^{k} e^(a_j)
[Figure: binary classification – probabilities of class 0 and class 1 sum to 1; multi-class classification – probabilities of classes 1, 2, 3 sum to 1]
37
Tanh Activation Function
• Any value of the input is mapped to a value between −1 and 1:  f(a) = (e^a − e^(−a)) / (e^a + e^(−a))
• Positive inputs map to values between 0 and 1, negative inputs to values between −1 and 0
• Output is zero centered, which enables quick convergence
• Gradient is close to zero when the output is close to −1 or 1
[Plot: tanh curve from −1 to 1]
38
ReLU Activation Function
Rectified Linear Unit:  f(a) = max(0, a)
• All positive values go through directly while all negative values are mapped to zero
• Gradient is zero when the input ≤ 0 and 1 for all positive inputs
• ReLU is one of the most popular activation functions and has many variants
[Plot: f(a) = 0 for a ≤ 0 and f(a) = a for a > 0]
39
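The four non-linear activations above take only a few lines of NumPy; a minimal sketch for reference (vectorised over arrays):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # maps any real value into (0, 1)

def tanh(a):
    return np.tanh(a)                   # zero-centred, maps into (-1, 1)

def relu(a):
    return np.maximum(0.0, a)           # 0 for a <= 0, identity for a > 0

def softmax(a):
    e = np.exp(a - np.max(a))           # shift by the max for numerical stability
    return e / e.sum()                  # probabilities that sum to 1

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a), softmax(a), sep="\n")
```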
Feed Forward Calculation - Example
40
Feed Forward Calculation - Example
• Task: To classify a person as underweight (Class 1), normal weight (Class 2) or overweight (Class 3) given the height and weight as input features
• NN Architecture: A neural network with 2 inputs, 2 hidden layers with 3 neurons each, and 3 outputs
• 2 inputs to take the 2 features, and 3 outputs to predict the probability of the 3 classes
• Activation functions: All hidden layer neurons have ReLU activation and output layer neurons have softmax activation
[Figure: network (x1, x2) → a1/h1 → a2/h2 → a3 → (y1, y2, y3) with parameters W1, b1, W2, b2, W3, b3]
41
Feed Forward Calculation - Example
• Inputs: x1 = 1.5, x2 = 0.5
• Weights and Biases:
   W1 = [ 0.53  0.86 ;  1.84  0.32 ; −2.25 −1.31 ],   b1 = [−0.43, 0.34, 3.58]^T
   W2 = [ 1.41 −1.21  0.49 ;  1.42  0.72  0.03 ;  0.67  1.63  0.73 ],   b2 = [−0.2, −0.12, 1.49]^T
   W3 = [ −0.3  0.89 −0.81 ;  0.29 −1.14 −2.9 ; −0.78 −1.07 −1.43 ],   b3 = [0.32, −0.75, 1.37]^T
42
Feed Forward Calculation - Example
• Inputs: x1 = 1.5, x2 = 0.5
• Hidden Layer 1 calculation:
   a1 = W1 x + b1 = [0.8, 3.25, −0.46]^T
   h1 = relu(a1) = [0.8, 3.25, 0]^T
43
Feed Forward Calculation - Example
• Inputs: x1 = 1.5, x2 = 0.5
• Hidden Layer 2 calculation:
   a2 = W2 h1 + b2 = [−3, 3.34, 7.33]^T
   h2 = relu(a2) = [0, 3.34, 7.33]^T
44
Feed Forward Calculation - Example
• Inputs: x1 = 1.5, x2 = 0.5
• Output calculation:
   a3 = W3 h2 + b3 = [−2.63, −26.18, 8.33]^T
   y = softmax(a3) ≈ [0, 0, 1]^T
• This example illustrates how y is a function of x
45
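The first two layers of this worked example can be reproduced in a few lines of NumPy. This is a minimal sketch using W1, b1, W2, b2 and the input as transcribed from the slides; small differences from the slide values are due to rounding.

```python
import numpy as np

relu = lambda a: np.maximum(0, a)

x  = np.array([1.5, 0.5])
W1 = np.array([[0.53, 0.86], [1.84, 0.32], [-2.25, -1.31]])
b1 = np.array([-0.43, 0.34, 3.58])
W2 = np.array([[1.41, -1.21, 0.49], [1.42, 0.72, 0.03], [0.67, 1.63, 0.73]])
b2 = np.array([-0.2, -0.12, 1.49])

a1 = W1 @ x + b1          # ~[ 0.80, 3.26, -0.45]  (slide: [0.8, 3.25, -0.46])
h1 = relu(a1)             # negative component clipped to 0
a2 = W2 @ h1 + b2         # ~[-3.02, 3.36,  7.34]  (slide: [-3, 3.34, 7.33])
h2 = relu(a2)
print(a1, a2)             # the output layer then applies W3, b3 and softmax
```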
Universal Approximation Theorem
46
Universal Approximation Theorem (UAT)
• UAT establishes that neural networks have a kind of universality in approximating functions
• For any given function of the inputs, y = f(x), there exists a neural network which can approximate the output
• Holds even when the function has multiple inputs and outputs
• Condition: Activation functions should be non-linear
[Figure: network with W1, b1, W2, b2, W3, b3 approximating y = f(x)]
Ref: Article by Michael Nielsen - Neural networks and deep learning
47
Supervised Learning using DNN
48
Supervised Learning using DNN
Data for Supervised Learning:
➢ Inputs: Values of input features − x
➢ Outputs: Values of predicted variables − y
   Regression – real numbers
   Classification – discrete class or probability of each class
• DNN is expected to take an input and predict the desired output
• Implies: DNN should approximate a function f(x) which maps inputs to outputs
[Figure: DNN with parameters W1, b1, W2, b2, W3, b3]
49
Supervised Learning using DNN
➢ Question: What is a suitable f(x) for the given data or task?
✓ Answer: Generally not known, and it can be a complex function
➢ Question: Can we find the weights and biases which will approximate the desired f(x)?
✓ Answer: Yes! They can be learnt from data
• Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data
[Figure: DNN with parameters W1, b1, W2, b2, W3, b3]
50
Summary
➢ Deep Learning is a sub-field of machine learning with many applications
in diverse areas
➢ Functioning of a biological neuron was mathematically modelled to
replicate its decision making capability
➢ An artificial neural network was developed inspired from the structure
of a brain neural network
➢ ANN consists of multiple layers of inter-connected neurons which
process inputs to give out outputs
➢ Universal Approximation theorem establishes that there always exists an
ANN which can approximate any function of any complexity
➢ An ANN can be trained to map inputs to desired outputs by learning the
weights and biases
51
Supervised Learning using DNN
➢ In supervised learning, input (x) and output (y) data is given to learn a function ŷ = f(x) such that ŷ ≈ y
➢ Question: What is a suitable f(x) for the given data or task?
➢ Question: Can we find the weights and biases which will approximate the desired f(x)?
➢ Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data
[Figure: DNN with parameters W1, b1, W2, b2, W3, b3]
53
Training a DNN
Steps to train a DNN using data:
1. Create a neural network (choose the number of layers and neurons)
2. Randomly initialize the weights and biases (W1, … WL, b1, … bL)
3. Pass inputs x through the network and get the output (ŷ)
[Figure: DNN with parameters W1, b1, W2, b2, W3, b3]
54
Training a DNN – Step 4
Step 4: Find out the difference between the actual output (y) and the predicted output (ŷ) - Prediction error
➢ A loss function or cost function is defined to quantify the prediction error
➢ For regression: Mean squared error loss, since the output is a real number
   ℒ = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²
55
Training a DNN – Step 5
Step 5: Adjust the weights to minimise the loss function
60
Training a DNN – Summary of Step 5
a) Known:
➢ x, y − N samples
➢ W1, W2, … WL and b1, b2, …, bL (initialised values)
➢ ŷ for each sample and the overall loss ℒ
b) Calculate the gradients of the loss function w.r.t. all the weights and biases using the chain rule
   Note: Gradient calculation involves a summation over all samples in the data
c) Update the weights and biases:
   w_new = w_old − α ∂ℒ/∂w
   b_new = b_old − α ∂ℒ/∂b
d) With the new parameters, compute ŷ for each sample and the overall loss ℒ
e) Repeat (b), (c) and (d) for many iterations – until the loss is minimised
61
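Steps (b)-(e) amount to a plain gradient-descent loop. Below is a minimal sketch for a toy linear regression model with a mean-squared-error loss; the closed-form gradients are for this toy model only and are used just to illustrate the update rule w_new = w_old − α ∂ℒ/∂w.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N = 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # synthetic targets

w = np.zeros(3); b = 0.0                       # step 2: initialise parameters
alpha = 0.1                                    # learning rate

for it in range(200):                          # step e: repeat for many iterations
    y_hat = X @ w + b                          # step 3: forward pass
    err = y_hat - y
    loss = np.mean(err ** 2)                   # step 4: MSE loss
    grad_w = 2 * X.T @ err / len(y)            # step b: gradients of the loss
    grad_b = 2 * err.mean()
    w -= alpha * grad_w                        # step c: w_new = w_old - alpha * dL/dw
    b -= alpha * grad_b                        # step c: b_new = b_old - alpha * dL/db
print(loss, w, b)
```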
DNN Model Training - Components
• Data (Given): (x_i, y_i); i = 1 … N
• Model (Chosen): ŷ = f(x, W1, … WL, b1, … bL)
• Parameters (To be learnt): W1, …, WL, b1, …, bL
• Loss Function: Mean squared error for regression and cross entropy for classification
• Training Algorithm: Gradient descent
[Figure: DNN with parameters W1, b1, W2, b2, W3, b3]
62
Computing Gradients
➢ How to compute gradients efficiently in Back propagation algorithm?
➢ How did we compute derivatives in high school or in undergraduate calculus?
63
Computing Gradients
➢ Rules of partial derivatives
➢ e.g., derivative of a sum of two functions f(.) + g(.)
64
Gradient Descent with Backpropagation
   ℒ = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²   (regression)   or   ℒ = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} y log ŷ   (classification)
   ŷ = g_o( W3 g_h( W2 g_h( W1 x + b1 ) + b2 ) + b3 )
➢ The loss function is connected to the parameters through ŷ
➢ Issue: Writing the loss function explicitly in terms of each of the parameters is a complex process
➢ Solution: The chain rule can be used to compute the gradients, e.g.
   ∂ℒ/∂w1 = (∂ℒ/∂ŷ)(∂ŷ/∂w1) = (∂ℒ/∂ŷ)(∂ŷ/∂a3)(∂a3/∂w1)
           = (∂ℒ/∂ŷ)(∂ŷ/∂a3)(∂a3/∂h2)(∂h2/∂a2)(∂a2/∂h1)(∂h1/∂a1)(∂a1/∂w1)
   Note: ∂ℒ/∂w1 will vary for each input sample
[Figure: computation path ŷ ← a3 ← (w3, h2) ← a2 ← (w2, h1) ← a1 ← w1]
66
Types of Gradient Descent
➢ Depending on the number of samples used to estimate the gradient of the loss
w.r.t. the parameters, there are 3 types of gradient descent:
➢ Batch Gradient Descent
➢ Stochastic Gradient Descent
➢ Mini-Batch Gradient Descent
67
Some Terminology
➢ Epoch – One epoch of training is said to be complete when every sample in the
training dataset has been used for gradient calculation and parameter updates
68
Batch Gradient Descent
• Parameters are updated by estimating the gradient using all the N samples, i.e., batch size b = N
   w1(new) = w1(old) − α ∂ℒ/∂w1 = w1(old) − α [ (∂ℒ/∂w1)_1 + (∂ℒ/∂w1)_2 + … + (∂ℒ/∂w1)_N ]
69
Stochastic Gradient Descent
• Parameters are updated by estimating the gradient using a single sample, i.e., batch size b = 1
   w1(new) = w1(old) − α ∂ℒ/∂w1 = w1(old) − α (∂ℒ/∂w1)_1
   ⋮
   w1(new) = w1(old) − α ∂ℒ/∂w1 = w1(old) − α (∂ℒ/∂w1)_N
• In this case, the parameters are updated N times in an epoch
• Advantage: Very fast to minimise the loss function
• Disadvantages: Too much variance in the gradient estimates, and learning might not be stable
• Convergence cannot be guaranteed
70
Mini-Batch Gradient Descent
• A balance between batch and stochastic gradient descent
• Parameters are updated by using the gradient computed over a small batch of samples, i.e., 1 < b < N
   w1(new) = w1(old) − α ∂ℒ/∂w1 = w1(old) − α [ (∂ℒ/∂w1)_1 + … + (∂ℒ/∂w1)_b ]
• In this case, the parameters are updated N/b times in an epoch
• Advantages: Faster than batch gradient descent
• Lesser variance and more stability compared to stochastic gradient descent
• Convergence cannot be guaranteed, but it is the most preferred in practice
71
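The three variants differ only in how many samples are used per update. A minimal sketch of mini-batch gradient descent on the toy linear model used earlier; setting b = N recovers batch gradient descent and b = 1 recovers SGD.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w, alpha, b = np.zeros(3), 0.05, 16            # b = batch size (1 < b < N here)

for epoch in range(20):
    idx = rng.permutation(len(y))              # shuffle once per epoch
    for start in range(0, len(y), b):          # N/b parameter updates per epoch
        batch = idx[start:start + b]
        err = X[batch] @ w - y[batch]
        grad = 2 * X[batch].T @ err / len(batch)   # gradient estimated on this mini-batch only
        w -= alpha * grad                      # same update rule as batch GD / SGD
print(w)                                       # close to [1, -2, 0.5]
```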
Variants of Gradient Descent - Visualisation
72
Batch Normalisation
73
Why Batch Normalisation?
• Intuition: All the samples can be considered to be drawn from a multi-variate distribution
• If batch gradient descent is performed, then the distribution of samples for each batch of inputs remains the same
• Issue: However, with stochastic and mini-batch gradient descent, the distribution varies from one batch to another
[Figure: DNN with activations a1, h1, a2, h2 and parameters W1, b1, W2, b2, W3, b3]
74
Why Batch Normalisation?
• Illustration: Suppose the input samples (x) are fed to the DNN in multiple mini-batches
• In each iteration, the samples change and the distribution of samples also changes
• Weights and biases would have to adjust to a different distribution of inputs in each iteration
• Learning (fitting of weights and biases to inputs) would be hard
• How to overcome this issue? – Batch Normalisation
[Figure: DNN with activations a1, h1, a2, h2 and parameters W1, b1, W2, b2, W3, b3]
75
Applying Batch Normalisation
• Inputs to each layer are normalised to be unit gaussians before the activation function
• Mean and variance across the set of samples in that batch are calculated for performing the normalisation
• This normalisation step is differentiable and hence we can backpropagate through it
• Advantage: Learning is much faster and leads to better convergence
[Figure: DNN with activations a1, h1, a2, h2 and parameters W1, b1, W2, b2, W3, b3]
Ref: Article by Johann Huber - Batch normalization in 3 levels of understanding
76
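A minimal sketch of the normalisation step described above, applied to one layer's pre-activations over a mini-batch. The learnable scale/shift parameters gamma and beta are part of the usual batch-norm formulation and are included here as an assumption, since the slides do not spell them out.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """a: (batch_size, num_neurons) pre-activations for one layer."""
    mu = a.mean(axis=0)                    # per-neuron mean over the mini-batch
    var = a.var(axis=0)                    # per-neuron variance over the mini-batch
    a_hat = (a - mu) / np.sqrt(var + eps)  # normalise to roughly zero mean, unit variance
    return gamma * a_hat + beta            # learnable scale and shift

a = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(16, 4))
out = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per neuron
```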
Regularisation
77
Regularisation – Motivation
DNN Model Training Objectives:
• Finding a model which is a good fit to the given data
• Model should be able to generalise over unseen data (Test data)
• To control model complexity, an additional component is added to the loss function:
   ℒ(θ) + λ Ω(θ)
• Ω(θ) is the regularisation term which regularises the model
• Ω(θ) ensures that the model is neither too complex nor too simple
• λ is the regularisation rate (hyperparameter)
• λ determines how much the model is to be regularised
[Figure: train and test error vs model complexity – test error is lowest at an intermediate, ideal model complexity]
79
Types of Regularization
➢ 𝑙1 and 𝑙2 regularization
➢ Early stopping
➢ Ensemble Methods
➢ Dropout
80
𝑙1 and 𝑙2 Regularisation
➢ The regularisation term is the 𝑙1 or 𝑙2 norm of the vector of weights in the neural network
➢ This introduces a constraint over the parameters
➢ Loss function with the 𝑙1 regularisation term:  ℒ̄(θ) = ℒ(θ) + λ ‖θ‖₁
81
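As a small illustration of adding a regularisation term to the loss, here is a sketch of l2 regularisation (weight decay) on the toy linear model used earlier; the l1 version would swap the squared norm for the sum of absolute values.

```python
import numpy as np

def regularised_mse(w, X, y, lam=0.01):
    err = X @ w - y
    data_loss = np.mean(err ** 2)                 # L(theta): ordinary MSE
    penalty = lam * np.sum(w ** 2)                # lambda * Omega(theta), squared l2 norm
    grad = 2 * X.T @ err / len(y) + 2 * lam * w   # gradient of both terms
    return data_loss + penalty, grad

X = np.random.default_rng(0).normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
loss, grad = regularised_mse(np.zeros(3), X, y)
print(loss, grad)
```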
Early Stopping
• Prediction error on a validation set is tracked
• A new hyperparameter called the patience parameter p is introduced
• Check if there is improvement in the validation error for p consecutive iterations
• If not, stop and take the model from before those p iterations
[Figure: train error keeps decreasing while validation error starts rising; training is stopped at iteration k and the model from iteration k − p is kept]
82
Ensemble Methods
• Different models are trained for the same task and their predictions are combined (e.g., a random forest is an ensemble of trees)
• Example: an ensemble of 2 DNN models with ŷ = (y1 + y2) / 2
• Computationally very expensive and hence not preferred
83
Dropout
• Refers to dropping out of neurons during training
• For an iteration, some neurons with all their connections are removed (inactive)
[Figure: DNN with some neurons and their connections dropped]
84
Dropout
• Refers to dropping out of neurons during training
• For an iteration, some neurons with all their connections are removed (inactive)
• Feed forward calculation and backpropagation happen only with the active connections
• Update the weights and biases of only the active connections
• Effectively, learning is happening on a different neural network in each iteration
• Output is equivalent to an ensembled output
[Figure: at the i-th iteration, a different subset of neurons is active]
85
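A minimal sketch of (inverted) dropout applied to one layer's activations during training. The rescaling by the keep probability is the common "inverted dropout" formulation and is an assumption here, since the slides do not specify it.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return h                                   # no neurons dropped at test time
    mask = rng.random(h.shape) >= p_drop           # 0/1 mask: which neurons stay active
    return h * mask / (1.0 - p_drop)               # inverted dropout: rescale the survivors

h = np.ones((2, 6))                                # activations of one hidden layer
print(dropout(h))                                  # roughly half the columns zeroed out
```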
Module III: Elements of NLP
86
Module III
87
Language
◦ Forma
◦ Part of every text processing task
• Not a general NLP solution
• But very useful as part of those systems (e.g., for pre-
processing or text formatting)
◦ Necessary for data analysis of text data
◦ A widely used tool in industry and academics
89
90
Language
◦ A collection of strings
◦ Formal Language Theory to understand different
languages (including programming languages)
◦ Important aspect: Regular Expression
• It is a way to define a language
• Simple application: text matching and searching
90
Regular expressions
◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Groundhog
◦ groundhogs
91
Regular Expressions: Disjunctions
⚫ Letters inside square brackets []
  Pattern             Matches
  [wW]oodchuck        Woodchuck, woodchuck
  [1234567890]        Any one digit
⚫ Pipe | for disjunction
  Pattern                       Matches
  groundhog|woodchuck           woodchuck
  yours|mine                    yours
  a|b|c = [abc]
  [gG]roundhog|[Ww]oodchuck     Woodchuck
93
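These disjunction patterns map directly onto Python's re module; a small sketch:

```python
import re

text = "Woodchucks and groundhogs: how much wood would a woodchuck chuck?"

print(re.findall(r"[wW]oodchuck", text))            # ['Woodchuck', 'woodchuck']
print(re.findall(r"[0-9]", "Room 101"))             # any one digit: ['1', '0', '1']
print(re.findall(r"groundhog|woodchuck", text))     # disjunction with the pipe
print(re.findall(r"[gG]roundhog|[Ww]oodchuck", text))
```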
How many words in a sentence?
◦ Utterances contain fragments (broken-off words such as "main-") and filled pauses ("uh", "um") – should these count as words?
94
How many words in a sentence?
⚫ Word Type: an element of the vocabulary (i.e., one of the distinct words in a corpus)
⚫ Word Token: an instance of a type occurring in running text
They lay back on the San Francisco grass and looked at the stars and their
⚫ How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)
96
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Heaps Law (= Herdan's Law): |V| = kN^β, where often 0.67 < β < 0.75
i.e., vocabulary size grows faster than the square root of the number of word tokens
Counts depend on the corpus: a text is produced
• at a specific time,
• in a specific variety,
• of a specific language
98
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
• AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but hoshla rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity,
SES
99
Corpus datasheets
Gebru et al (2020), Bender and Friedman (2018)
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
⚫ +Annotation process, language variety,
demographics, etc.
100
Basic Text Processing: Words and Corpora
101
Basic Text Processing: Word tokenization
102
Text Normalization
103
Space-based tokenization
104
Issues in Tokenization
⚫Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
⚫Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
⚫When should multiword expressions (MWE) be
words?
◦ New York, rock ’n’ roll
105
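Because blindly splitting on whitespace and punctuation breaks cases like these, practical tokenizers use regular expressions. Here is a deliberately simplified sketch of such a pattern; real tokenizers (e.g., the ones shipped with NLTK) handle many more cases.

```python
import re

# Simplified regex tokenizer: keeps URLs, hashtags, emails, dotted abbreviations,
# numbers with internal punctuation, and words with internal apostrophes intact.
TOKEN_RE = re.compile(r"""
    https?://\S+              # URLs
  | \#\w+                     # hashtags like #nlproc
  | \w+@\w+(?:\.\w+)+         # email addresses
  | (?:[A-Za-z]\.)+[A-Za-z]?  # abbreviations like m.p.h., Ph.D.
  | \$?\d+(?:[.,/]\d+)*%?     # prices, dates, percentages
  | \w+(?:'\w+)?              # ordinary words, possibly with an apostrophe
  | [^\w\s]                   # any remaining single punctuation mark
""", re.VERBOSE)

print(TOKEN_RE.findall("Visit http://www.stanford.edu for $45.55 on 01/02/06 #nlproc, cap'n!"))
```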
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!
107
Word tokenization in Chinese
108
How to do word tokenization in Chinese?
⚫ 姚明进入总决赛
“Yao Ming reaches the finals”
3 words?
⚫姚明 进入 总决赛
Yao Ming reaches finals
5 words?
⚫姚 明 进入 总 决赛
Yao Ming reaches overall finals
7 characters? (don't use words at all):
⚫姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
109
Word tokenization / segmentation
110
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
use the data to tell us how to tokenize: subword tokenization (tokens can be parts of words as well as whole words)
111
Subword tokenization
114
Byte Pair Encoding (BPE) Addendum
115
BPE token learner
Original (very fascinating) corpus:
low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
Add an end-of-word symbol _ and represent the corpus as symbol sequences with counts:
5 l o w _ ,  2 l o w e s t _ ,  6 n e w e r _ ,  3 w i d e r _ ,  2 n e w _
116
BPE token learner
Merge e r to er
117
BPE
Merge er _ to er_
118
BPE
Merge n e to ne
119
BPE
120
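The merge-learning loop sketched on these slides can be written compactly in Python; a minimal sketch of the BPE token learner on the same toy corpus (which reproduces the merges e r, er _, n e shown above):

```python
from collections import Counter

def bpe_merges(word_counts, k):
    """Learn k BPE merges from {word (tuple of symbols): count}."""
    vocab = dict(word_counts)
    merges = []
    for _ in range(k):
        pairs = Counter()
        for word, cnt in vocab.items():
            for i in range(len(word) - 1):
                pairs[word[i], word[i + 1]] += cnt     # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # most frequent pair
        merges.append(best)
        merged = {}
        for word, cnt in vocab.items():                # replace the pair everywhere
            w, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    w.append(word[i] + word[i + 1]); i += 2
                else:
                    w.append(word[i]); i += 1
            merged[tuple(w)] = merged.get(tuple(w), 0) + cnt
        vocab = merged
    return merges

corpus = {tuple("low_"): 5, tuple("lowest_"): 2, tuple("newer_"): 6,
          tuple("wider_"): 3, tuple("new_"): 2}
print(bpe_merges(corpus, 3))   # [('e', 'r'), ('er', '_'), ('n', 'e')]
```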
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_,
etc.
⚫Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
121
Properties of BPE tokens
122
Basic Text Processing: Byte Pair Encoding
123
Word Normalization
124
Case folding
125
Lemmatization
⚫ Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
⚫ Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into the morpheme amar ‘to love’, and the morphological features 3PL and future subjunctive.
127
Stemming
⚫ Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.
Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note .
128
Porter Stemmer
129
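A quick way to try the Porter stemmer is via NLTK (assuming nltk is installed); a minimal sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["accurate", "copies", "complete", "crosses", "exception", "written"]
print([stemmer.stem(w) for w in words])
# e.g. 'accurate' -> 'accur', 'copies' -> 'copi', 'crosses' -> 'cross'
```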
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very
ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules based on this tokenization.
131
➢ Module IV:
➢N-gram Language models, lexical and vector semantics,
➢TF-IDF, word2vector, semantic properties of embeddings
132
N-gram Language Modeling
Predicting words
⚫ The water of Walden Pond is beautifully ...
   Likely continuations: blue, green, clear
   Unlikely continuations: *refrigerator, *that
133
Language Models
134
Why word prediction?
• Speech recognition: "I will be back soonish" should be preferred over "I will be bassoon dish"
135
Why word prediction?
136
Language Modeling (LM) more formally
137
How to estimate these probabilities
138
How to compute P(W) or P(wn|w1, …wn-1)
139
Reminder: The Chain Rule
⚫ More variables:
   P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
⚫ The Chain Rule in General:
   P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn−1)
140
The Chain Rule applied to compute the joint probability of words in a sentence
   P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi−1)
⚫ Simplifying assumption (the Markov assumption, after Andrei Markov):
   P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
142
Bigram Markov Assumption
⚫ Instead of conditioning on the entire history, condition only on the previous word:
   P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
⚫ More generally, we approximate each component in the product, so that
   P(w1 w2 … wn) ≈ ∏_i P(wi | wi−1)
143
Simplest case: Unigram model
   P(w1 w2 … wn) ≈ ∏_i P(wi)
Some automatically generated sentences from two different unigram models:
To him swallowed confess hear both . Which . Of save on trail for are ay device and rote life have
What means, sir. I confess she? then all sorts, he is trim, captain.
148
N-gram
Language
Modeling
Estimating N-gram
⚫
Probabilities
149
Estimating bigram probabilities
150
An example
   P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
151
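The MLE count ratio above is easy to compute directly; a minimal sketch on the three-sentence corpus from this slide:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))          # adjacent word pairs

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]   # P(w | prev) = c(prev, w) / c(prev)

print(p_bigram("I", "<s>"))     # 2/3
print(p_bigram("Sam", "am"))    # 1/2
print(p_bigram("do", "I"))      # 1/3
```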
More examples:
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse
152
Raw bigram counts
153
Raw bigram probabilities
⚫Normalize by unigrams:
⚫Result:
154
Bigram estimates of sentence probabilities
⚫P(english|want) = .0011
⚫P(chinese|want) = .0065
⚫P(to|want) = .66
⚫P(food | to) = 0
⚫P(want | spend) = 0
⚫P (i | <s>) = .25
156
Dealing with scale in large n-grams
157
N-gram Language Modeling
⚫Estimating N-gram
Probabilities
160
Evaluation and Perplexity
Language
Modeling
161
How to evaluate N-gram models
164
Choosing training and test sets
168
Intuition of perplexity 2: Predicting upcoming words
The Shannon Game: How well can we predict the next word?
• Once upon a ____   (e.g. time 0.9, dream 0.03, midnight 0.02, …, and 1e-100)
• That is a picture of a ____
• For breakfast I ate my usual ____
Unigrams are terrible at this game (Why?)
   perplexity = N-th root of ( 1 / P(w1 w2 … wN) )
171
Intuition of perplexity 5: the inverse
   perplexity(W) = N-th root of ( 1 / P(w1 w2 … wN) )
(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
172
Intuition of perplexity 6: N-grams
   PP(W) = P(w1 w2 … wN)^(−1/N) = N-th root of ( 1 / P(w1 w2 … wN) )
Chain rule:  PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | w1 … wi−1) )^(1/N)
Bigrams:     PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | wi−1) )^(1/N)
173
Intuition of perplexity 7: Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red, blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓
Given a test set T = "red red red red blue":
   PerplexityA(T) = PA(red red red red blue)^(−1/5) = ((⅓)^5)^(−1/5) = (⅓)^(−1) = 3
⚫ But now suppose red was very likely in the training set, such that for LM B:
   ◦ P(red) = .8, P(green) = .1, P(blue) = .1
⚫ We would expect the probability to be higher, and hence the perplexity to be smaller:
   PerplexityB(T) = PB(red red red red blue)^(−1/5) = (.8^4 × .1)^(−1/5) ≈ 1.89
176
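The two toy perplexities above can be checked numerically; a minimal sketch under the unigram assumption used on this slide:

```python
import math

def unigram_perplexity(test_words, probs):
    n = len(test_words)
    log_p = sum(math.log(probs[w]) for w in test_words)   # log P(w1 ... wN)
    return math.exp(-log_p / n)                            # P(w1...wN)^(-1/N)

test = ["red", "red", "red", "red", "blue"]
lm_a = {"red": 1/3, "blue": 1/3, "green": 1/3}
lm_b = {"red": 0.8, "blue": 0.1, "green": 0.1}
print(unigram_perplexity(test, lm_a))   # 3.0
print(unigram_perplexity(test, lm_b))   # ~1.89
```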
Language Modeling: Sampling and Generalization
177
The Shannon (1948) Visualization Method
Sample words from an LM
⚫ Unigram:
PRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
⚫ Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
178
How Shannon sampled those words in 1948
180
Visualizing Bigrams the Shannon Way
181
Approximating Shakespeare
183
Shakespeare as corpus
184
The Wall Street Journal is not Shakespeare
185
Zeros
⚫ Training set:
  … ate lunch
  … ate dinner
  … ate a
  … ate the
• Test set:
  … ate lunch
  … ate breakfast
P("breakfast" | ate) = 0
189
Zero probability bigrams
190
N-gram Language
Modeling
⚫ Smoothing,
Interpolation, and
Backoff
191
The intuition of smoothing (from Dan Klein)
⚫ When we have sparse statistics:
   P(w | denied the):  3 allegations, 2 reports, 1 claims, 1 request   (7 total)
⚫ Steal probability mass to generalize better:
   P(w | denied the):  allegations, reports, 0.5 claims, 0.5 request, 2 other   (7 total)
[Figure: histograms of the counts before and after smoothing; some mass is reallocated to unseen words such as outcome, attack, man]
192
Add-one estimation or Laplace Smoothing
⚫ MLE estimate:    P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
⚫ Add-1 estimate:  P_Add-1(wi | wi−1) = ( c(wi−1, wi) + 1 ) / ( c(wi−1) + V )
193
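Add-1 smoothing is a one-line change to the MLE bigram sketch; a minimal, self-contained example on the same toy corpus:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)                                            # vocabulary size (word types)

def p_mle(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

def p_add1(w, prev):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)   # add 1 to every count

print(p_mle("ham", "Sam"), p_add1("ham", "Sam"))   # 0.0 vs a small non-zero value
print(p_mle("I", "<s>"), p_add1("I", "<s>"))       # 2/3 shrinks toward uniform
```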
Maximum Likelihood Estimates
⚫ The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
⚫Suppose the word “bagel” occurs 400 times in a corpus of a million
words
⚫What is the probability that a random word from some other text will be
“bagel”?
⚫MLE estimate is 400/1,000,000 = .0004
⚫This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400
times in a million word corpus.
194
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse
195
Raw bigram counts
196
Berkeley Restaurant Corpus: Laplace smoothed
bigram counts
197
Laplace-smoothed bigrams
198
Reconstituted counts
199
Compare with raw bigram counts
[Tables: raw (original) bigram counts vs. Laplace-smoothed reconstituted counts]
200
Add-1 estimation is a blunt instrument
201
Backoff and Interpolation
⚫ Simple interpolation: mix the unigram, bigram and trigram estimates
   P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn),   with Σ_i λi = 1
203
How to set λs for interpolation?
204
Backoff
205
Vector Semantics & Embeddings
207
What do words mean?
⚫ N-gram or text classification methods we've seen so
far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
⚫ Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
⚫ That seems hardly better!
208
Desiderata: needed or wanted?
209
Lemmas and senses
lemma: mouse (N)
  sense 1: any of numerous small rodents...
  sense 2: a hand-operated device that controls a cursor...
A sense or “concept” is the meaning component of a word
Lemmas can be polysemous (have multiple senses)
Modified from the online thesaurus WordNet
210
Relations between senses: Synonymy
211
Relations between senses: Synonymy
212
Relation: Synonymy?
water/H2O
"H2O" in a surfing guide?
big/large
my big sister != my large sister
213
The Linguistic Principle of Contrast
◦ A difference in linguistic form is associated with some difference in meaning
214
Abbé Gabriel Girard 1718
Re: "exact" synonyms
"
"
215
Thanks to Mark Aronoff!
Relation: Similarity
car, bicycle
cow, horse
216
Ask humans how similar 2 words are
218
Semantic field
⚫Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.
hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, chef
houses
door, roof, kitchen, family, bed
219
Relation: Antonymy
220
Connotation (sentiment)
223
Vector Semantics & Embeddings
224
Computational models of word meaning
225
Ludwig Wittgenstein
⚫PI #43:
"The meaning of a word is its use in the language"
226
Let's define words by their usages
227
What does recent English borrowing ongchoi mean?
230
Idea 2: Meaning as a point in space (Osgood et al. 1957)
⚫ 3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted

  Dimension   High-scoring words                  Low-scoring words
  Valence     love 1.000, happy 1.000             toxic 0.008, nightmare 0.005
  Arousal     elated 0.960, frenzy 0.965          mellow 0.069, napping 0.046
  Dominance   powerful 0.991, leadership 0.983    weak 0.045, empty 0.081
232
Defining meaning as a point in space based on distribution
233
We define meaning of a word as a vector
◦ With embeddings:
• Feature is a word vector
• 'The previous word was vector [35,22,17…]
• Now in the test set we might see a similar vector [34,21,14]
• We can generalize to similar but unseen words!!!
235
We'll discuss 2 kinds of embeddings
⚫ tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of
nearby words
⚫ Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings 236
⚫ Words and Vectors
238
Vector and Documents
Term-document matrix
Work of Shakespeare
Each document is represented by a vector of words
240
Vectors are the basis of information retrieval
battle is "the kind of word that occurs in Julius Caesar and Henry V"
243
244
Cosine for computing word similarity
245
Computing word similarity: Dot product and cosine
⚫ The dot product between two vectors a and b is a scalar:
   a · b = Σ_{i=1}^{d} a_i b_i
⚫ Normalising by the vector lengths gives the cosine similarity:
   cosine(a, b) = (a · b) / (|a| |b|)
248
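A minimal sketch of cosine similarity between word vectors; the context-count vectors below are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # dot product / product of lengths

# Hypothetical context-count vectors for three words
digital     = np.array([0.0, 1.0, 2.0])
information = np.array([1.0, 6.0, 1.0])
cherry      = np.array([442.0, 8.0, 2.0])

print(cosine(digital, information))   # higher: the two words share contexts
print(cosine(digital, cherry))        # lower: very different contexts
```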
Cosine as a similarity metric
250
Visualizing cosines
(well, angles)
251
TF-IDF: Term Frequency-Inverse Document Frequency
252
But raw frequency is a bad representation
• The co-occurrence matrices we have seen represent
each cell by word frequencies.
• Frequency is clearly useful; if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, or they are not
very informative about the context
• It's a paradox! How can we balance these two conflicting
constraints?
253
Two common solutions for word weighting
255
Document frequency (df)
256
Inverse document frequency (idf)
257
What is a document?
258
Final tf-idf weighted value for a word
⚫ Raw counts are dampened:  tf(t, d) = log10( count(t, d) + 1 )
⚫ Inverse document frequency:  idf(t) = log10( N / df(t) )
⚫ tf-idf:  w(t, d) = tf(t, d) × idf(t)
259
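A minimal sketch of these weights over a toy document collection, following the log10(count+1) and log10(N/df) variants given above:

```python
import math
from collections import Counter

docs = ["sweet sweet nurse love",
        "sweet sorrow",
        "how sweet is love",
        "nurse"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))   # document frequency

def tf_idf(term, doc_tokens):
    tf = math.log10(doc_tokens.count(term) + 1)
    idf = math.log10(N / df[term])
    return tf * idf

print(tf_idf("sweet", tokenized[0]))   # frequent in most documents -> low idf, low weight
print(tf_idf("nurse", tokenized[0]))   # rarer across documents -> higher weight
```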
Vector Semantics & Embeddings: Word2vec
260
Sparse versus dense vectors
261
Sparse versus dense vectors
264
Word2vec
265
Word2vec
267
Skip-Gram Training Data
⚫ Assume a +/- 2 word window, given the training sentence:
   …lemon, a [tablespoon of apricot jam, a] pinch…
   target word w = apricot; context words: tablespoon, of, jam, a
268
Skip-Gram Classifier
270
Turning dot products into probabilities
⚫ Sim(w, c) ≈ w · c
⚫ To turn this into a probability, pass the dot product through the sigmoid (logistic) function:
   P(+ | w, c) = σ(c · w) = 1 / (1 + exp(−c · w)),   P(− | w, c) = 1 − P(+ | w, c)
271
How Skip-Gram Classifier computes P(+|w, c)
⚫ This is for one context word, but we have lots of context words.
⚫ We'll assume independence and just multiply them:
   P(+ | w, c_{1:L}) = ∏_{i=1}^{L} σ(c_i · w)
272
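A minimal sketch of that probability computation for one target embedding and a window of context embeddings (randomly initialised here, as they would be at the start of training):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)                     # target word embedding
contexts = rng.normal(size=(4, d))         # embeddings of 4 context words

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
p_pos_each = sigmoid(contexts @ w)         # P(+ | w, c_i) = sigma(c_i . w)
p_pos_window = p_pos_each.prod()           # independence assumption: multiply them
print(p_pos_each, p_pos_window)
```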
Skip-gram classifier: summary
274
Skip-Gram Training data
275
Skip-Gram Training data
277
Word2vec: how to learn vectors
279
Learning the classifier
⚫How to learn?
◦ Stochastic gradient descent!
280
Intuition of one step of gradient descent
281
Reminder: gradient descent
• At each step:
• Direction: We move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient, d/dw L(f(x; w), y), weighted by a learning rate η
• Higher learning rate means move w faster
   w^(t+1) = w^t − η (d/dw) L(f(x; w), y)
282
The derivatives of the loss function
283
Update equation in SGD
284
Two sets of embeddings
285
Summary: How to learn word2vec (skip-gram) embeddings
⚫ Start with V random d-dimensional vectors as initial
embeddings
⚫ Train a classifier based on embedding similarity
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
286
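In practice these steps are packaged in libraries; a minimal sketch with gensim's Word2Vec (assuming gensim is installed; sg=1 selects skip-gram and negative=5 enables negative sampling, and the tiny corpus here is only for illustration):

```python
from gensim.models import Word2Vec

corpus = [["the", "water", "of", "walden", "pond", "is", "beautifully", "blue"],
          ["the", "pond", "water", "is", "clear", "and", "blue"],
          ["i", "ate", "breakfast", "and", "lunch"]]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 sg=1, negative=5, min_count=1, epochs=50, seed=0)

print(model.wv["water"][:5])                     # learned embedding (first 5 dimensions)
print(model.wv.most_similar("water", topn=3))    # nearest neighbours in embedding space
```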
The kinds of neighbors depend on window size
⚫ Small windows (C= +/- 2) : nearest words are
syntactically similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
•Sunnydale, Evernight, Blandings
287
Analogical relations
288
Analogical relations via parallelogram
289
Structure in GloVe Embedding space
290
Caveats with the parallelogram method
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of
Semantic Change. Proceedings of ACL. 292
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is
to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS,
pp. 4349-4357. 2016.