
Introduction

Natural Language Processing
Modules:
➢ Module I:
➢ Introduction to Natural language processing (NLP)
➢ NLP problems in text summarization, text classification, sentiment
analysis, question answering, neural translation, etc.
➢ Module II:
➢ Introduction to Neural Networks
➢ Optimization formulations for Deep learning, Gradient-based
optimization, Gradient descent,
➢ Neural networks, feed-forward NN, Gradient-based learning, and back-
propagation, and differentiation algorithms
➢ Module III:
➢ Elements of NLP: Expression, word, corpora,
➢ Token and tokenization, word normalization, lemmatization, stemming,
sentence segmentation, sequence labelling, context-free grammars

2
Modules:
➢ Learning Outcomes
➢ Module I: Understand the importance of NLP problems and different types
of problems in the literature
➢ Module II: Understand the concepts of Optimization for Deep Learning,
Basics of Neural Networks, Feed-forward Neural Networks, Back
propagation and automatic differentiation
➢ Module III: Basic terminology and operation in text processing, different
types of tokenization, word normalization, minimum edit distance and their
computations

3
Modules:
➢ Module IV:
➢ N-gram Language models, lexical and vector semantics,
➢ TF-IDF, word2vector, semantic properties of embeddings
➢ Module V:
➢ Recurrent Neural Networks (RNNs),
➢ RNNs for Language modeling, Sequence labeling and classification,
sequence-to-sequence tasks,
➢ Stacked and bidirectional RNNs, LSTMs

4
Modules:
➢ Learning Outcomes
➢ Module IV: Understand formulation of language models (LMs), different LMs
such as n-grams, TF-IDF, etc. evaluation metrics from language view points
➢ Module V: Basics of RNNs and their variants, sequence modeling and
different problems associated with sequence modeling

5
Modules:
➢ Module VI:
➢ Introduction to transformer:
➢ Attention mechanism, multi-head attention, transformer architecture,
➢ Embedding of tokens and positions, language model head
➢ Module VII:
➢ Large Language Models: Use of Transformer Architecture,
➢ Pretraining architectures: Decoders, Encoders, and Encoder-Decoders,
➢ Finetuning, Generative Pretrained transformer
➢ Module VIII :More on LLMs:
➢ Scaling laws and scaling efficiency
➢ Prompting and instruction tuning, alignment, preference fine-tuning
➢ Applications of LLMs in Machine translations, question answering, and
information retrieval

6
Modules:
➢ Learning Outcomes
➢ Module VI: Learn the concepts of transformers and associated concepts
such as attention, multi-head attention, etc.
➢ Module VII: Different architectures of LLMs, finetuning methods
➢ Module VIII: Understand the scaling laws, and model parameters, prompt
engineering aspects and applications of LLMs

7
Module I: Introduction to Natural
language processing (NLP)

8
Quotes to remember
“For a large class of cases-though not for all-in which we employ
the word ‘meaning’ it can be defined thus: the meaning of a word
is its use in the language.”
- (Wittgenstein 1968, §43)
“All grammars leak”
-(Sapir 1921: 38)
“You shall know a word by the company it keeps”
-(Firth 1957: 11)

9
Can machines mimic humans?
Well! Yes. As wise as in this example
Killing is a honor

Generated text:
Killing is a honor in my tribe. It showed that you were strong, brave, and willing to do
what needed to be done. To get this close to him, to have my arrow ready to fire the
moment he revealed himself to be what he was, filled me with a strange mix of fear and
excitement. Thoughts of the demon were pushed to the

Source: https://app.inferkit.com/generate

10
Or as stupid as in this example

The world is round and can be proved.

Generated text
Or prove it.
There is no need to go to any center to learn things.
Today was an easy day.
I went into town to get to the PC shop.
I found the shop but it wasn't open.
I rang the bell but nobody came out.
I decided I would go to look at the esplanade.
The trees were starting to turn colors.

11
What is NLP?
⚫ Programming computers to process and analyze large amounts of natural language data
⚫ Subfield of Linguistics and Computer Science

⚫ Two schools of thought:

◦ Rationalist approach (dominated from 1960 to 1985)


• Significant part of the knowledge in human mind is not derived by the senses
• Knowledge is fixed in advance (presumably by genetic inheritance)
• Poverty of the stimulus (e.g., Chomsky 1986: 7)
◦ Empiricist approach (resurging today)
• Cognitive abilities present in the brain
• Association, Generalization, Pattern Recognition to learn the structure of language
• Insists on generic language model (statistical and corpus linguistics)

12
History
⚫ Symbolic NLP (1950 – early 1990s) – Mostly hand-written rules:
◦ Rule based parsing
◦ Morphology
◦ Semantics
⚫ Statistical NLP (1990 – 2010s) – Textual corpora were used predominantly
◦ Supervised Learning over hand-annotated data
◦ Unsupervised, semi-supervised learning over unannotated internet data
◦ Machine translation of governmental proceedings as a major focus
⚫ Neural NLP (present) – Deep Learning
◦ State of the art techniques
◦ Language modeling and many other applications

13
Applications of NLP
➢ Spam Detection:
✓ Scanning emails for words that indicate spam
➢ Machine Translation:
✓ Google translate is the best example of machine translation
✓ Capturing the meaning and tone of the source language is important

14
Applications of NLP
➢ Virtual Agents and Chat Bots:
✓ Siri and Alexa are examples of virtual agents that can take voice
commands and perform tasks
✓ Chat bots are developed to respond to human typed questions with
helpful answers
✓ Most websites which directly interact with many consumers have these
chat boxes
➢ Social Media Sentiment Analysis:
✓ Analysing social media posts, reviews, etc. to extract response
(positive/negative) to products, events, movies, etc.
➢ Text Summarisation:
✓ To ingest huge volumes of text and create summaries for indexes, busy
readers, etc.
15
Other Applications
➢ Drug Discovery
➢ Developing new drugs by understanding language(s) of molecules

➢ Molecules: Language representation

➢ Recommender systems:
Recommending products or movies to consumers based on their
historical consumption
➢ Targeted Advertisements:
Getting insights into customer behaviour and needs, and targeting ads accordingly based on search history, clicked items, etc.

16
Motivation: Biological Neuron to
Artificial Neuron

17
Module II: Introduction to Neural
Networks

18
Biological Neuron
• Basic working unit of the brain and nervous system
• Close to 100 billion interconnected neurons in a human brain
• Function together to aid decision making
• Parts and Functioning:
➢ Dendrite: Takes signals (stimulus) from the other neurons or other cells in the body
➢ Cell body (soma): Processes the signal and may or may not fire the neuron – excitation and inhibition
➢ Axon: Transmits the output (response) to other neurons or cells
[Figure: biological neuron with dendrites, nucleus, cell body (soma), axon and axon terminals labelled]

19
Biological Neuron to Artificial Neuron
Artificial neuron
➢ Mathematical model of a biological neuron
➢ Mimics the functioning of a biological neuron
➢ Takes input in the form of numbers
➢ Processes the input to give out an output
➢ Output = f(inputs)
➢ Different models of artificial neurons have been developed based on this idea
[Figure: biological neuron (dendrites, nucleus, cell body, axon, axon terminals) alongside an artificial neuron that maps inputs through a function 𝑓(.) to an output]
20
Artificial Neuron: McCulloch Pitts Model
o Inputs 𝑥1, 𝑥2, …, 𝑥𝑛 are binary numbers (0 or 1), i.e., 𝑥𝑖 ∈ {0, 1}
o Aggregated input passes through an activation function to give the output 𝑦 ∈ {0, 1}
o Activation function is based on thresholding logic
o Here, "model" refers to the function relating the output to the inputs

McCulloch Pitts Model:
𝑎 = 𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛
𝑦 = 𝑓(𝑎) = 1 if 𝑎 ≥ 𝜃
𝑦 = 𝑓(𝑎) = 0 if 𝑎 < 𝜃
21
McCulloch Pitts Model: Boolean Functions
o This model can be used to represent most Boolean functions

Logical AND function (2 inputs): 𝑎 = 𝑥1 + 𝑥2, 𝑦 = 1 if 𝑎 ≥ 2, 𝑦 = 0 if 𝑎 < 2
𝑥1  𝑥2  𝑦
0   0   0
0   1   0
1   0   0
1   1   1

Logical OR function (2 inputs): 𝑎 = 𝑥1 + 𝑥2, 𝑦 = 1 if 𝑎 ≥ 1, 𝑦 = 0 if 𝑎 < 1
𝑥1  𝑥2  𝑦
0   0   0
0   1   1
1   0   1
1   1   1
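As an illustration, a minimal sketch of the McCulloch Pitts unit in Python; the thresholds for AND and OR follow the slide above, while the function and variable names are chosen here purely for illustration:

```python
def mcculloch_pitts(inputs, theta):
    """McCulloch-Pitts unit: sum the binary inputs and apply a threshold."""
    a = sum(inputs)                # aggregated input a = x1 + x2 + ... + xn
    return 1 if a >= theta else 0  # thresholding logic

# Logical AND with 2 inputs: fires only when both inputs are 1 (theta = 2)
# Logical OR  with 2 inputs: fires when at least one input is 1 (theta = 1)
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcculloch_pitts([x1, x2], theta=2), mcculloch_pitts([x1, x2], theta=1))
```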
22
McCulloch Pitts Model: Drawbacks
• Drawbacks:
➢ Cannot handle non-boolean inputs and outputs
➢ Deciding an appropriate threshold value might be hard as the number of inputs increases
➢ Equal weightage to all inputs – What if more importance is to
be attached to some inputs?
• How to overcome these issues? – Perceptron Model

23
Artificial Neuron: Perceptron Model
o Inputs 𝑥1, 𝑥2, …, 𝑥𝑛 are real numbers
o Neuron takes the weighted combination of the inputs (weights 𝑤1, 𝑤2, …, 𝑤𝑛)
o Bias (𝑏) is added to the weighted inputs
o Weighted input passes through an activation function 𝑓(.) to give the output (𝑦)

Perceptron Model:
𝑎 = 𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏
𝑦 = 𝑓(𝑎) = 𝑓(𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏)
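A minimal sketch of this perceptron computation in Python; the weights, bias and step threshold below are illustrative placeholders, not values from the slides:

```python
def perceptron(x, w, b, threshold=0.0):
    """Weighted sum of real-valued inputs plus bias, followed by a step activation."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b   # a = w1*x1 + ... + wn*xn + b
    return 1 if a > threshold else 0               # f(a): step activation

# Example: three inputs with unequal importance (unlike the McCulloch-Pitts unit)
print(perceptron(x=[0.9, 0.2, 0.5], w=[0.6, 0.1, 0.3], b=-0.4))
```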

24
Artificial Neuron: A Simple Example
o Using an artificial neuron to decide whether to watch a movie or not
o Weights: 0.4 (lead actor), 0.3 (director), 0.2 (thrill factor), 0.1 (run time)
o Activation: 𝑓(𝑎) = 1 if 𝑎 > 7, 𝑓(𝑎) = 0 otherwise

Feature             Value for Movie 1   Value for Movie 2
Lead Actor (𝑥1)     10                  7
Director (𝑥2)       8                   5
Thrill factor (𝑥3)  8                   9
Run time (𝑥4)       9                   5

Movie 1: 𝑎 = 8.9, 𝑦 = 𝑓(𝑎) = 1        Movie 2: 𝑎 = 6.7, 𝑦 = 𝑓(𝑎) = 0

25
Artificial Neural Network

26
Artificial Neural Network: Motivation
➢ One neuron is not sufficient to take complex decisions (complex functions)
➢ Again inspired by the brain's neural network, the artificial neural network was developed
➢ In the brain, many neurons are involved in taking a decision
➢ All the neurons are inter-connected in the brain
➢ They are arranged hierarchically in layers

Perceptron Model (single neuron):
𝑎 = 𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏
𝑦 = 𝑓(𝑎) = 𝑓(𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏)

27
Artificial Neural Network (ANN)
o ANN consists of multiple layers with multiple neurons in each layer (hidden layers)
o Each neuron (except the inputs) represents a perceptron model
o Every neuron in one layer is connected to every neuron in the successive layer
o Outputs of one layer's neurons are passed as inputs to the neurons of the next layer
[Figure: inputs at the bottom, hidden layers of neurons in the middle, and the output at the top]

28
ANN Architecture
• Input layer (0th) with 𝑛 inputs: 𝒙 = [𝑥1 𝑥2 … 𝑥𝑛]ᵀ
• 𝐿 − 1 hidden layers with 𝑚 neurons each
• Output layer (𝐿th) with 𝑘 neurons
• 𝑾𝑖 is the matrix containing the weights between layers 𝑖 − 1 and 𝑖 (0 < 𝑖 ≤ 𝐿); 𝒃𝑖 is the vector representing the biases at layer 𝑖
• 𝑾1 (between the input layer and hidden layer 1) is an 𝑚 × 𝑛 matrix of weights; 𝒃1 is an 𝑚 × 1 vector of biases
• 𝑾𝑖 (between hidden layers 𝑖 − 1 and 𝑖) is an 𝑚 × 𝑚 matrix of weights; 𝒃𝑖 is an 𝑚 × 1 vector of biases
29
ANN Architecture
• 𝑾𝐿 (between the last hidden layer and the output layer) is a 𝑘 × 𝑚 matrix of weights; 𝒃𝐿 is a 𝑘 × 1 vector of biases
• For a single output, 𝑾𝐿 will be a row vector and 𝒃𝐿 will be a scalar
• Each neuron in the hidden and output layers has an activation function
• If there are more than 3 hidden layers, then the ANN is referred to as a Deep Neural Network (DNN) – depth refers to the number of layers
30
DNN Feed Forward Calculation
• Feed forward calculation involves finding the output as a function of the input, weights and biases
• Input to the activation function at hidden layer 1: 𝒂1 = 𝑾1𝒙 + 𝒃1
• Activation at hidden layer 1: 𝒉1 = 𝑔ℎ(𝒂1)
• 𝒉𝑖 is the output vector at layer 𝑖; 𝑔ℎ is the activation function which maps vector 𝒂𝑖 to vector 𝒉𝑖
• Input to the activation function at hidden layer 𝑖: 𝒂𝑖 = 𝑾𝑖𝒉𝑖−1 + 𝒃𝑖
• Activation at hidden layer 𝑖: 𝒉𝑖 = 𝑔ℎ(𝒂𝑖) = 𝑔ℎ(𝑾𝑖𝒉𝑖−1 + 𝒃𝑖)
31
DNN Feed Forward Calculation
𝒂1 = 𝑾1𝒙 + 𝒃1,  𝒉1 = 𝑔ℎ(𝒂1)
𝒂𝑖 = 𝑾𝑖𝒉𝑖−1 + 𝒃𝑖,  𝒉𝑖 = 𝑔ℎ(𝒂𝑖) = 𝑔ℎ(𝑾𝑖𝒉𝑖−1 + 𝒃𝑖)
• Input to the activation function at output layer 𝐿: 𝒂𝐿 = 𝑾𝐿𝒉𝐿−1 + 𝒃𝐿
• Activation at the output layer: ŷ = 𝑔𝑜(𝒂𝐿) = 𝑔𝑜(𝑾𝐿𝒉𝐿−1 + 𝒃𝐿)
• Model (function) being approximated by the DNN (assuming 𝐿 = 3):
  ŷ = 𝑔𝑜(𝑾3 𝑔ℎ(𝑾2 𝑔ℎ(𝑾1𝒙 + 𝒃1) + 𝒃2) + 𝒃3)
  ŷ = 𝑓(𝒙)
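A minimal NumPy sketch of this feed-forward computation; the layer sizes, the ReLU/softmax choice and the random parameters below are illustrative placeholders, not the course's reference implementation:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - np.max(a))      # subtract the max for numerical stability
    return e / e.sum()

def feed_forward(x, weights, biases):
    """h_i = g_h(W_i h_{i-1} + b_i) for hidden layers, softmax at the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)        # a_i = W_i h_{i-1} + b_i, h_i = g_h(a_i)
    return softmax(weights[-1] @ h + biases[-1])   # y_hat = g_o(W_L h_{L-1} + b_L)

# Toy network: 2 inputs -> 3 hidden -> 3 hidden -> 3 outputs (random parameters)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(3, 3))]
bs = [rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)]
print(feed_forward(np.array([1.5, 0.5]), Ws, bs))   # three class probabilities summing to 1
```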
32
Types of Activation Functions

33
Activation Function
o Activation function is like a gate between the input and output of a
neuron
o Purpose: To introduce non-linearity into the model and enable
learning complex functions (models)
o It affects the DNN output, accuracy and convergence
o Types of activation functions:
1. Linear activation function
2. Sigmoid activation function
3. Tanh activation function
4. Relu activation function
5. Softmax activation function

34
Linear Activation Function
• 𝑓(𝑎) = 𝑐𝑎
• Output is directly proportional to the input
• Output can take any real number
• Gradient is always constant and does not depend on the input
• Generally used in the output layer of regression problems

35
Sigmoid Activation Function
• 𝑓(𝑎) = 1 / (1 + 𝑒^(−𝑎))
• Any value of input is mapped to a value between 0 and 1
• Gradient is close to zero when the output is close to 0 or 1
• Useful when the expected output is a probabilistic value between 0 and 1
36
Softmax Activation Function
• Sigmoid gives a value between 0 and 1, and can be used for binary classification (the probabilities of the two classes sum to 1)
• However, sigmoid cannot be used to output multiple probability values which add up to 1 (multi-class)
• Softmax function is an extension of the sigmoid function
• Softmax calculates the relative probabilities of multiple classes and ensures that the total probability is 1
• Input to the output layer of a DNN: 𝒂 = [𝑎1 𝑎2 … 𝑎𝑘]
  𝑓(𝑎𝑖) = 𝑒^(𝑎𝑖) / Σⱼ₌₁ᵏ 𝑒^(𝑎ⱼ)

37
Tanh Activation Function
• 𝑓(𝑎) = (𝑒^𝑎 − 𝑒^(−𝑎)) / (𝑒^𝑎 + 𝑒^(−𝑎))
• Any value of input is mapped to a value between −1 and 1
• Positive inputs map to values between 0 and 1; negative inputs map to values between −1 and 0
• Output is zero-centered, which enables quick convergence
• Gradient is close to zero when the output is close to −1 or 1

38
Relu Activation Function
Rectified Linear Unit: 𝑓(𝑎) = max(0, 𝑎)
• All positive values go through directly while all negative values are mapped to zero
• Gradient is zero when the input ≤ 0 and 1 for all positive inputs
• Relu is one of the most popular activation functions and has many variants
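A small sketch of these activation functions in Python with NumPy, vectorized over an array of pre-activations (purely illustrative):

```python
import numpy as np

def linear(a, c=1.0):
    return c * a                                  # f(a) = c*a

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))               # maps any input to (0, 1)

def tanh(a):
    return np.tanh(a)                             # maps any input to (-1, 1), zero-centered

def relu(a):
    return np.maximum(0.0, a)                     # passes positives, zeroes out negatives

def softmax(a):
    e = np.exp(a - np.max(a))                     # stable: subtract the max before exponentiating
    return e / e.sum()                            # relative probabilities that sum to 1

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a), softmax(a), sep="\n")
```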

39
Feed Forward Calculation - Example

40
Feed Forward Calculation - Example
• Task: To classify a person as underweight (Class 1), normal weight (Class 2) or overweight (Class 3) given the height and weight as input features
• NN Architecture: A neural network with 2 inputs, 2 hidden layers with 3 neurons each, and 3 outputs
• 2 inputs to take the 2 features, and 3 outputs to predict the probability of the 3 classes
• Activation functions: All hidden layer neurons have relu activation and output layer neurons have softmax activation
41
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Weights and Biases (matrix rows separated by semicolons):
  𝑾1 = [0.53 0.86; 1.84 0.32; −2.25 −1.31]        𝒃1 = [−0.43; 0.34; 3.58]
  𝑾2 = [1.41 −1.21 0.49; 1.42 0.72 0.03; 0.67 1.63 0.73]        𝒃2 = [−0.2; −0.12; 1.49]
  𝑾3 = [−0.3 0.89 −0.81; 0.29 −1.14 −2.9; −0.78 −1.07 −1.43]        𝒃3 = [0.32; −0.75; 1.37]
42
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Hidden Layer 1 calculation:
  𝒂1 = 𝑾1𝒙 + 𝒃1 = [0.8; 3.25; −0.46]
  𝒉1 = relu(𝒂1) = [0.8; 3.25; 0]
43
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Hidden Layer 2 calculation:
  𝒂2 = 𝑾2𝒉1 + 𝒃2 = [−3; 3.34; 7.33]
  𝒉2 = relu(𝒂2) = [0; 3.34; 7.33]
44
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Output calculation:
  𝒂3 = 𝑾3𝒉2 + 𝒃3 = [−2.63; −26.18; 8.33]
  𝒚 = softmax(𝒂3) = [0; 0; 1]
• This example illustrates how 𝒚 is a function of 𝒙
45
Universal Approximation Theorem

46
Universal Approximation Theorem (UAT)
• UAT establishes that neural networks have a kind of universality in approximating functions
• For any given function of the inputs, 𝑦 = 𝑓(𝒙), there exists a neural network which can approximate the output
• Holds even when the function has multiple inputs and outputs
• Condition: Activation functions should be non-linear
Ref: Article by Michael Nielsen – Neural Networks and Deep Learning

47
Supervised Learning using DNN

48
Supervised Learning using DNN
Data for Supervised Learning:
➢ Inputs: Values of input features − 𝒙
➢ Outputs: Values of predicted variables − 𝒚
   Regression – Real numbers
   Classification – Discrete class or probability of each class
• DNN is expected to take an input and predict the desired output
• Implies: DNN should approximate a function 𝑓(𝒙) which maps inputs to outputs

49
Supervised Learning using DNN
➢ Question: What is a suitable 𝑓(𝒙) for the given data or task?
✓ Answer: Generally not known, and it can be a complex function
➢ Question: Can we find the weights and biases which will approximate the desired 𝑓(𝒙)?
✓ Answer: Yes! They can be learnt from data
• Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data

50
Summary
➢ Deep Learning is a sub-field of machine learning with many applications
in diverse areas
➢ Functioning of a biological neuron was mathematically modelled to
replicate its decision making capability
➢ An artificial neural network was developed inspired from the structure
of a brain neural network
➢ ANN consists of multiple layers of inter-connected neurons which
process inputs to give out outputs
➢ Universal Approximation theorem establishes that there always exists an
ANN which can approximate any function of any complexity
➢ An ANN can be trained to map inputs to desired outputs by learning the
weights and biases
51
Supervised Learning using DNN
➢ In supervised learning, input (𝒙) and output (𝒚) data is given to learn a function ŷ = 𝑓(𝒙) such that ŷ ≈ 𝒚
➢ Question: What is a suitable 𝑓(𝒙) for the given data or task?
➢ Question: Can we find the weights and biases which will approximate the desired 𝑓(𝒙)?
➢ Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data

53
Training a DNN
Steps to train a DNN using data:
1. Create a neural network (# layers, # neurons)
2. Randomly initialize the weights and biases (𝑾1, … 𝑾𝐿, 𝒃1, … 𝒃𝐿)
3. Pass inputs 𝒙 through the network and get the output (ŷ)
4. Find out the difference between the actual output (𝑦) and the predicted output (ŷ) – prediction error
5. Adjust the weights to minimise the difference between 𝑦 and ŷ
Result: A DNN based model is trained on the given data, i.e., ŷ ≈ 𝑦 for all inputs

54
Training a DNN – Step 4
Step 4: Find out the difference between the actual output (𝑦) and the predicted output (ŷ) – prediction error
➢ A loss function or cost function is defined to quantify the prediction error
➢ For regression: Mean squared error loss, since the output is a real number
   ℒ = (1/𝑁) Σᵢ₌₁ᴺ (𝑦ᵢ − ŷᵢ)²
➢ For classification: Cross entropy loss, since the output is a probabilistic value for each class (𝑘 classes)
   ℒ = −(1/𝑁) Σᵢ₌₁ᴺ Σⱼ₌₁ᵏ 𝑦ᵢⱼ log ŷᵢⱼ

55
Training a DNN – Step 5
Step 5: Adjust the weights to minimise the loss function
➢ Question: How to adjust the parameters to minimise the loss function ℒ(𝑦, ŷ)?
➢ Suppose there are only 2 parameters, 𝑤1 and 𝑤2, on which the loss function depends
➢ To minimize the loss, take small steps in the direction along which ℒ decreases
➢ Implies: 𝑤1 and 𝑤2 are to be modified at each step such that ℒ decreases
➢ Directions at each step are given by the gradient
[Figure: loss surface ℒ over (𝑤1, 𝑤2); moving from a point with loss ℒ1 to a point with lower loss ℒ2]
57
Gradient Descent
➢ Gradient: Derivative of the loss function with respect to the parameters, ∂ℒ/∂𝑤1 and ∂ℒ/∂𝑤2
➢ Interpretation: The gradient is the rate at which the loss increases when the parameter increases
➢ Example: ∂ℒ/∂𝑤1 = 2 ⇒ when 𝑤1 increases by a small amount, ℒ increases by approximately twice that amount
➢ By updating 𝑤1 as 𝑤1 + ∂ℒ/∂𝑤1 and 𝑤2 as 𝑤2 + ∂ℒ/∂𝑤2, the value of ℒ increases
58
Gradient Descent
➢ Since the objective is to decrease ℒ, the parameters are updated as follows:
   𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1
   𝑤2(new) = 𝑤2(old) − 𝛼 ∂ℒ/∂𝑤2
➢ 𝛼 is the learning rate, which decides how large a step to take at each update
➢ Parameters are updated iteratively until the loss function is minimised (convergence)
59
Training a DNN – Gradient Descent
➢ Note: In a DNN, the loss depends on all the parameters 𝑾1, 𝑾2, … 𝑾𝐿 and 𝒃1, 𝒃2, …, 𝒃𝐿
➢ So the gradient needs to be calculated w.r.t. all the parameters
➢ General parameter update:
   𝑤(new) = 𝑤(old) − 𝛼 ∂ℒ/∂𝑤
   𝑏(new) = 𝑏(old) − 𝛼 ∂ℒ/∂𝑏
➢ Question: How to calculate the gradient of ℒ w.r.t. all the parameters in the DNN?
➢ Answer: Gradient descent with backpropagation

60
Training a DNN – Summary of Step 5
a) Known:
➢ 𝒙, 𝒚 − 𝑁 samples
➢ 𝑾1, 𝑾2, … 𝑾𝐿 and 𝒃1, 𝒃2, …, 𝒃𝐿 (initialised values)
➢ ŷ for each sample and the overall loss ℒ
b) Calculate the gradients of the loss function w.r.t. all the weights and biases using the chain rule
   Note: Gradient calculation involves a summation over all samples in the data
c) Update the weights and biases:
   𝑤(new) = 𝑤(old) − 𝛼 ∂ℒ/∂𝑤
   𝑏(new) = 𝑏(old) − 𝛼 ∂ℒ/∂𝑏
d) With the new parameters, compute ŷ for each sample and the overall loss ℒ
e) Repeat (b), (c) and (d) for many iterations – until the loss is minimised
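A compact sketch of this training loop in Python, using a single-layer linear model with mean squared error so the gradients can be written by hand; the synthetic data and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # N samples, 3 input features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3         # synthetic targets

w = rng.normal(size=3)                           # step 2: random initialisation
b = 0.0
alpha = 0.1                                      # learning rate

for iteration in range(200):                     # step (e): repeat until the loss is minimised
    y_hat = X @ w + b                            # step 3 / (d): forward pass
    err = y_hat - y
    loss = np.mean(err ** 2)                     # step 4: mean squared error loss
    grad_w = 2 * X.T @ err / len(y)              # step (b): dL/dw, summed over all samples
    grad_b = 2 * err.mean()                      # dL/db
    w -= alpha * grad_w                          # step (c): gradient descent update
    b -= alpha * grad_b

print(w, b, loss)                                # w ≈ [2, -1, 0.5], b ≈ 0.3, loss ≈ 0
```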
61
DNN Model Training - Components
• Data (Given): (𝒙𝑖, 𝒚𝑖); 𝑖 = 1 … 𝑁
• Model (Chosen): ŷ = 𝑓(𝒙, 𝑾1, … 𝑾𝐿, 𝒃1, … 𝒃𝐿)
• Parameters (To be learnt): 𝑾1, …, 𝑾𝐿, 𝒃1, …, 𝒃𝐿
• Loss Function: Mean squared error for regression and cross entropy for classification
• Training Algorithm: Gradient descent
62
Computing Gradients
➢ How to compute gradients efficiently in the backpropagation algorithm?
➢ How did we do it in high school or undergraduate courses?
➢ Numerical Gradient Computation:
   ∂ℒ/∂𝑤 ≈ [ℒ(𝑤 + 𝜖) − ℒ(𝑤 − 𝜖)] / (2𝜖)
➢ Not computationally efficient: 𝑛 parameters require 2𝑛 function evaluations
➢ Often used to check the automatic differentiation algorithms
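A short sketch of such a finite-difference gradient check in Python; the quadratic loss and epsilon value are illustrative:

```python
def numerical_gradient(loss_fn, params, eps=1e-6):
    """Central-difference estimate of dL/dw for each parameter: 2 evaluations per parameter."""
    grads = []
    for i in range(len(params)):
        bumped_up = params.copy();   bumped_up[i] += eps
        bumped_down = params.copy(); bumped_down[i] -= eps
        grads.append((loss_fn(bumped_up) - loss_fn(bumped_down)) / (2 * eps))
    return grads

# Example: L(w) = w1^2 + 3*w2, so dL/dw1 = 2*w1 and dL/dw2 = 3
loss = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(loss, [1.5, -2.0]))   # ≈ [3.0, 3.0]
```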

63
Computing Gradients
➢ Rule of Partial derivatives
➢ Sum of two functions f (.) + g (.)

➢ Product of two functions f(.) g (.)

➢ Chain Rule f(g(.))

64
Gradient Descent with Backpropagation
ℒ = (1/𝑁) Σᵢ₌₁ᴺ (𝑦ᵢ − ŷᵢ)²   (regression)   or   ℒ = −(1/𝑁) Σᵢ₌₁ᴺ Σⱼ₌₁ᵏ 𝑦ᵢⱼ log ŷᵢⱼ   (classification)
ŷ = 𝑔𝑜(𝑾3 𝑔ℎ(𝑾2 𝑔ℎ(𝑾1𝒙 + 𝒃1) + 𝒃2) + 𝒃3)
➢ The loss function is connected to the parameters through ŷ
➢ Issue: Writing the loss function explicitly in terms of each of the parameters is a complex process
➢ Solution: The chain rule can be used to compute the gradient w.r.t. each of the parameters, e.g.,
   ∂ℒ/∂𝑤3 = (∂ℒ/∂ŷ)(∂ŷ/∂𝑤3) = (∂ℒ/∂ŷ)(∂ŷ/∂𝑎3)(∂𝑎3/∂𝑤3)
   ∂ℒ/∂𝑤1 = (∂ℒ/∂ŷ)(∂ŷ/∂𝑎3)(∂𝑎3/∂ℎ2)(∂ℎ2/∂𝑎2)(∂𝑎2/∂ℎ1)(∂ℎ1/∂𝑎1)(∂𝑎1/∂𝑤1)
➢ So the loss is being backpropagated through the chain of gradients
Note: ∂ℒ/∂𝑤1 will vary for each input 𝑥 and output
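To make the chain of gradients concrete, here is a scalar sketch in Python for a tiny two-layer network with a sigmoid hidden unit and squared error; all values and the network shape are illustrative, not taken from the slides:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x, y = 2.0, 1.0                      # one training example
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2 # scalar parameters

# Forward pass: a1 -> h1 -> a2 -> y_hat -> loss
a1 = w1 * x + b1
h1 = sigmoid(a1)
a2 = w2 * h1 + b2
y_hat = a2                           # linear output unit
loss = (y - y_hat) ** 2

# Backward pass: chain rule, factor by factor
dL_dyhat = -2 * (y - y_hat)          # dL/dy_hat
dyhat_da2 = 1.0                      # linear output
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)              # sigmoid'(a1)
da1_dw1 = x

dL_dw1 = dL_dyhat * dyhat_da2 * da2_dh1 * dh1_da1 * da1_dw1
print(dL_dw1)                        # gradient of the loss w.r.t. the first-layer weight
```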
65
Variants of Gradient Descent

66
Types of Gradient Descent
➢ Depending on the number of samples used to estimate the gradient of loss
w.r.t. the parameters, there are 3 types of gradient descent:
➢ Batch Gradient Descent

➢ Stochastic Gradient Descent

➢ Mini-Batch Gradient Descent

➢ Reasons for these 3 types of gradient descent:


➢ Computational efficiency

➢ Accuracy of estimated gradient

67
Some Terminology
➢ Epoch – One epoch of training is said to be complete when every sample in the training dataset has been used for gradient calculation and parameter update
➢ Batch size (𝑏) – Number of samples used in one gradient computation
➢ One batch corresponds to one epoch if all the samples in the dataset are used for computing the gradient

68
Batch Gradient Descent
• Parameters are updated by estimating the gradient using all the 𝑁 samples, i.e., batch size 𝑏 = 𝑁
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 [(∂ℒ/∂𝑤1)₁ + (∂ℒ/∂𝑤1)₂ + ⋯ + (∂ℒ/∂𝑤1)_𝑁]
• In this case, the parameters are updated only once in an epoch
• Advantages: Convergence is guaranteed in this case; gradients are estimated in an unbiased manner
• Disadvantage: Slow to converge with large datasets

69
Stochastic Gradient Descent
• Parameters are updated by estimating the gradient using a single sample, i.e., batch size 𝑏 = 1
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 (∂ℒ/∂𝑤1)ᵢ   (using the 𝑖-th sample, for 𝑖 = 1, …, 𝑁)
• In this case, the parameters are updated 𝑁 times in an epoch
• Advantage: Very fast at reducing the loss function
• Disadvantages: Too much variance in the gradient calculation, and learning might not be stable; convergence cannot be guaranteed

70
Mini-Batch Gradient Descent
• A balance between batch and stochastic gradient descent
• Parameters are updated by using the gradient computed over a small batch of samples, i.e., 1 < 𝑏 < 𝑁
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 [(∂ℒ/∂𝑤1)₁ + ⋯ + (∂ℒ/∂𝑤1)_𝑏]
• In this case, the parameters are updated 𝑁/𝑏 times in an epoch
• Advantages: Faster than batch gradient descent; less variance and more stability compared to stochastic gradient descent
• Convergence cannot be guaranteed, but this is the most preferred variant (see the sketch below)
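A minimal sketch of mini-batch updates in Python, reusing the same kind of linear-regression toy problem as the earlier training-loop sketch; the batch size and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3

w, b, alpha, batch_size = np.zeros(3), 0.0, 0.1, 20   # 1 < b < N

for epoch in range(50):
    order = rng.permutation(len(y))                   # shuffle samples each epoch
    for start in range(0, len(y), batch_size):        # N/b parameter updates per epoch
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb
        w -= alpha * (2 * Xb.T @ err / len(idx))      # gradient estimated on the mini-batch only
        b -= alpha * (2 * err.mean())

print(w, b)   # ≈ [2, -1, 0.5] and 0.3
```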

71
Variants of Gradient Descent - Visualisation
[Figure: contour plot of the loss with the paths taken by batch, mini-batch and stochastic gradient descent]
• An ellipse with a larger radius indicates a higher value of the loss function
Image Source: medium.com

72
Batch Normalisation

73
Why Batch Normalisation?
• Intuition: All the samples can be considered to be drawn from a multi-variate distribution
• If batch gradient descent is performed, then the distribution of samples for each batch of inputs remains the same
• Issue: However, with stochastic and mini-batch gradient descent, the distribution varies from one batch to another

74
Why Batch Normalisation?
• Illustration: Suppose the input samples (𝒙) are fed to the DNN in multiple mini-batches
• In each iteration, the samples change and the distribution of samples also changes
• Weights and biases would have to adjust to a different distribution of inputs in each iteration
• Learning (fitting of weights and biases to inputs) would be hard
• How to overcome this issue? – Batch Normalisation
75
Applying Batch Normalisation
• Inputs to each layer are normalised to be unit gaussians before the activation function
• Mean and variance are calculated across the set of samples in that batch for performing the normalisation
• This normalisation step is differentiable, and hence we can backpropagate through it
• Advantage: Learning is much faster and leads to better convergence
Ref: Article by Johann Huber – Batch normalization in 3 levels of understanding
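A bare-bones NumPy sketch of the normalisation step (per-feature mean and variance over one mini-batch); the epsilon and the learnable scale/shift used in practice are assumptions added here, not taken from the slides:

```python
import numpy as np

def batch_norm_forward(A, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalise a (batch_size, features) matrix of pre-activations to zero mean, unit variance.

    gamma/beta are the learnable scale and shift commonly used in practice (assumed here)."""
    mean = A.mean(axis=0)                 # per-feature mean over the mini-batch
    var = A.var(axis=0)                   # per-feature variance over the mini-batch
    A_hat = (A - mean) / np.sqrt(var + eps)
    return gamma * A_hat + beta

batch = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 240.0]])
print(batch_norm_forward(batch))          # each column now has mean ≈ 0 and variance ≈ 1
```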

76
Regularisation

77
Regularisation – Motivation
DNN Model Training Objectives:
• Finding a model which is a good fit to the given data
• Model should be able to generalise over unseen data (test data)
• Prediction error should be low both on training and test data
Issues:
• In DNNs, there is the issue of overfitting due to the large number of parameters
• Test error could be high due to overfitting
Solution:
• To obtain a balanced model, regularisation is performed
[Figure: training error and test error vs. model complexity; the test error is lowest at an ideal model complexity]
78
Finding Ideal Model Complexity
• To get near the ideal model complexity, an additional component is added to the loss function:
  ℒ(𝜽) + 𝜆 Ω(𝜽)
• Ω(𝜽) is the regularisation term which regularises the model
• Ω(𝜽) ensures that the model is neither too complex nor too simple
• 𝜆 is the regularisation rate (hyperparameter)
• 𝜆 determines how much the model is to be regularised

79
Types of Regularization
➢ 𝑙1 and 𝑙2 regularization
➢ Early stopping
➢ Ensemble Methods
➢ Dropout

80
𝑙1 and 𝑙2 Regularisation
➢ The regularisation term is the 𝑙1 or 𝑙2 norm of the vector of weights in the neural network
➢ This introduces a constraint over the parameters
➢ Pushes some of the weights towards zero
➢ Some neuron connections become negligible (no impact on the output) and the overall complexity reduces

Loss function with 𝑙1 regularisation term:  ℒ̄(𝜽) = ℒ(𝜽) + 𝜆 ‖𝜽‖₁
Loss function with 𝑙2 regularisation term:  ℒ̄(𝜽) = ℒ(𝜽) + (𝜆/2) ‖𝜽‖₂²
𝜽 − vector containing all the weights and biases of the neural network
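A small sketch of adding an 𝑙2 penalty to a loss and its gradient in Python (assuming the squared 𝑙2 norm, as is common; the names and numbers are illustrative):

```python
import numpy as np

def regularised_loss_and_grad(theta, data_loss, data_grad, lam=0.01):
    """L_bar(theta) = L(theta) + (lam/2) * ||theta||_2^2 and its gradient."""
    loss = data_loss + 0.5 * lam * np.sum(theta ** 2)
    grad = data_grad + lam * theta            # the penalty's gradient simply adds lam * theta
    return loss, grad

theta = np.array([0.5, -2.0, 1.0])
loss, grad = regularised_loss_and_grad(theta, data_loss=1.3, data_grad=np.array([0.1, -0.4, 0.2]))
print(loss, grad)   # the penalty nudges every weight towards zero at each update
```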

81
Early Stopping
• Prediction error on a validation set is tracked
• A new hyperparameter called the patience parameter 𝑝 is introduced
• Check if there is improvement in the validation error for 𝑝 continuous iterations
• If not, stop and take the model from before those 𝑝 iterations
[Figure: training error keeps decreasing while the validation error starts rising; training is stopped at iteration 𝑘, and the model from iteration 𝑘 − 𝑝 is kept]

82
Ensemble Methods
• Different models are trained for the same task using different features, hyperparameters, samples, etc.
• Outputs of these models are combined to reduce the prediction error, e.g., for an ensemble of 2 DNN models: ŷ = (𝑦1 + 𝑦2)/2
• Similar to random forests or bagging of trees
• Computationally very expensive and hence not preferred

83
Dropout
• Refers to dropping out neurons during training
• For an iteration, some neurons with all their connections are removed (made inactive)

84
Dropout
• Refers to dropping out neurons during training
• For an iteration, some neurons with all their connections are removed (made inactive)
• Feed forward calculation and backpropagation happen only over the active connections
• The weights and biases of only the active connections are updated
• Effectively, learning happens on a different neural network in each iteration
• The output is equivalent to an ensembled output (a sketch of the masking step follows below)
[Figure: the network at the 𝑖-th iteration with some neurons dropped]
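A minimal sketch of the dropout masking step in NumPy; the keep probability and the inverted-dropout rescaling are common practice assumed here, not taken from the slides:

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    """Randomly zero out activations during training; at test time the layer is left unchanged."""
    if not training:
        return h
    mask = rng.random(h.shape) < keep_prob       # 1 = neuron stays active, 0 = dropped
    return h * mask / keep_prob                  # inverted dropout: rescale so the expected value is unchanged

h = np.array([0.8, 3.25, 0.0, 1.7])
print(dropout(h))                    # some activations are zeroed for this iteration
print(dropout(h, training=False))    # unchanged at inference time
```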
85
Module III: Elements of NLP

86
Module III

➢ Elements of NLP: Expression, word, corpora,


➢ Token and tokenization, word normalization, lemmatization,
stemming, sentence segmentation, sequence labelling, context-free
grammars
➢ Slide Courtesy: Dan Jurafsky and James H. Martin
➢ Errors: Nirav

87
88

Regular expressions are used everywhere


◦ Part of every text processing task
• Not a general NLP solution
• But very useful as part of those systems (e.g., for pre-processing or text formatting)
◦ Necessary for data analysis of text data
◦ A widely used tool in industry and academics

88
90

Language

◦ A collection of strings
◦ Formal Language Theory to understand different
languages (including programming languages)
◦ Important aspect: Regular Expression
• It is a way to define a language
• Simple application: text matching and searching

90
Regular expressions

• A formal language for specifying text strings


• How can we search for mentions of these cute animals in text?

◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Groundhog
◦ groundhogs
91
Regular Expressions: Disjunctions
⚫ Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any one digit
⚫ Ranges using the dash [A-Z]
Pattern   Matches
[A-Z]     An upper case letter   "Drenched Blossoms"
[a-z]     A lower case letter    "my beans were impatient"
[0-9]     A single digit         "Chapter 1: Down the Rabbit Hole"
92
Regular Expressions: More Disjunction

⚫Groundhog is another name for woodchuck!


⚫The pipe symbol | for disjunction

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
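A quick illustration of these disjunction patterns with Python's re module (the sample sentence is made up):

```python
import re

text = "The Woodchuck and the groundhog are the same animal; a woodchuck chucks wood."

print(re.findall(r"[wW]oodchuck", text))            # ['Woodchuck', 'woodchuck']
print(re.findall(r"groundhog|woodchuck", text))     # the pipe | is disjunction of whole patterns
print(re.findall(r"[A-Z]", text))                   # every upper-case letter
print(re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))   # ['1']
```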
93
How many words in a sentence?
⚫ "I do uh main- mainly business data processing"
   – "uh" is a filled pause; "main-" is a fragment
⚫ "Seuss's cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
  • cat and cats = same lemma
◦ Wordform: the full inflected surface form
  • cat and cats = different wordforms

94
How many words in a sentence?
⚫ Word Type: an element of the vocabulary (or number of
distinct words in a corpus)

⚫ Token (Word Instance): an instance of that type in running


text.

They lay back on the San Francisco grass and looked at the
stars and their

⚫ How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)

96
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Heaps' Law (= Herdan's Law): |V| = kN^β, where often .67 < β < .75
i.e., vocabulary size grows faster than the square root of the number of word tokens

Corpus                           Tokens = N     Types = |V|
Switchboard phone conversations  2.4 million    20 thousand
Shakespeare                      884,000        31 thousand
COCA                             440 million    2 million
Google N-grams                   1 trillion     13+ million
97
Corpora

Words don't appear out of nowhere!


A text is produced by
• a specific writer(s),

• at a specific time,

• in a specific variety,

• of a specific language,

• for a specific function.

98
Corpora vary along dimension like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
• AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but hoshla rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity,
SES
99
Corpus datasheets
Gebru et al (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
⚫ +Annotation process, language variety,
demographics, etc.
100
Basic Text Processing ⚫ Words and Corpora

101
Basic Text Processing ⚫ Word tokenization

102
Text Normalization

⚫ Every NLP task requires text normalization:


1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

103
Space-based tokenization

⚫ A very simple way to tokenize


◦ For languages that use space characters between words
• Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces

104
Issues in Tokenization
⚫Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
⚫Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
⚫When should multiword expressions (MWE) be
words?
◦ New York, rock ’n’ roll
105
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!

How do we decide where the token boundaries


should be?

107
Word tokenization in Chinese

Chinese words are composed of characters called


"hanzi" (or sometimes just "zi")
Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
But deciding what counts as a word is complex and not
agreed upon.

108
How to do word tokenization in Chinese?
⚫ 姚明进入总决赛
“YaoMing reaches the finals”
3 words?
⚫姚明 进入 总决赛
Yao Ming reaches finals
5 words?
⚫姚 明 进入 总 决赛
Yao Ming reaches overall finals
7 characters? (don't use words at all):
⚫姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game

109
Word tokenization / segmentation

So in Chinese it's common to just treat each character


(zi) as a token.
• So the segmentation step is very simple

In other languages (like Thai and Japanese), more


complex word segmentation is required.
• The standard algorithms are neural sequence
models trained by supervised machine learning.

110
Another option for text tokenization

Instead of
• white-space segmentation

• single-character segmentation

Use the data to tell us how to tokenize.


Subword tokenization (because tokens can be
parts of words as well as whole words)

111
Subword tokenization

⚫ Three common algorithms:


◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization
(Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
⚫ All have 2 parts:
◦ A token learner that takes a raw training corpus and
induces a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
112
Byte Pair Encoding (BPE) token learner

Let vocabulary be the set of all individual


characters
= {A, B, C, D,…, a, b, c, d….}
⚫Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
⚫ Until k merges have been done.
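A compact sketch of the BPE token learner in Python; the toy word counts mirror the corpus shown a few slides later, and the end-of-word symbol follows the BPE addendum slide (this is an illustrative sketch, not the textbook's reference code):

```python
from collections import Counter

def learn_bpe(word_counts, k):
    """Learn k BPE merges from a dict of {word: count}; words start as character sequences."""
    corpus = {tuple(word) + ("_",): n for word, n in word_counts.items()}  # '_' marks end of word
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, n in corpus.items():                     # count adjacent symbol pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                      # most frequent adjacent pair
        merges.append(best)
        merged_corpus = {}
        for symbols, n in corpus.items():                     # replace every adjacent A B with AB
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged_corpus[tuple(out)] = n
        corpus = merged_corpus
    return merges

counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
print(learn_bpe(counts, k=4))   # first merges: ('e','r'), ('er','_'), ('n','e'), ...
```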
113
BPE token learner algorithm

114
Byte Pair Encoding (BPE) Addendum

Most subword algorithms are run inside space-


separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.

115
BPE token learner
Original (very fascinating ) corpus:

low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new
new

Add end-of-word tokens and separate each word into characters, giving the initial vocabulary and corpus representation (figure).

116
BPE token learner

Merge e r to er

117
BPE

Merge er _ to er_

118
BPE

Merge n e to ne

119
BPE

The next merges are:

120
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_,
etc.
⚫Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
121
Properties of BPE tokens

Usually include frequent words


And frequent subwords
• Which are often morphemes like -est or –er

A morpheme is the smallest meaning-bearing unit of


a language
• unlikeliest has 3 morphemes un-, likely, and -est

122
Basic Text Processing ⚫ Byte Pair Encoding

123
Word Normalization

⚫ Putting words/tokens in a standard format


• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are

124
Case folding

⚫Applications like IR: reduce all letters to lower case


◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
⚫For sentiment analysis, MT, Information extraction
◦ Case is helpful (US versus us is important)

125
Lemmatization

Represent all words as their lemma, their shared root


= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ Spanish quiero (‘I want’), quieres (‘you want’)
→ querer ‘want'
◦ He is reading detective stories
→ He be read detective story
126
Lemmatization is done by Morphological Parsing

⚫ Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
⚫ Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’)
into morpheme amar ‘to love’, and the morphological
features 3PL and future subjunctive. 127
Stemming
⚫ Reduce terms to stems, chopping off affixes
crudely

Original: This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.
Stemmed:  Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note .

128
Porter Stemmer

⚫ Based on a series of rewrite rules run in series


◦ A cascade, in which output of each pass fed to next pass
⚫ Some sample rules:
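The rule table itself is not reproduced here; as an alternative illustration, a hedged sketch using NLTK's PorterStemmer (assuming the nltk package is available; the outputs in the comment are typical of the Porter algorithm):

```python
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["running", "caresses", "grasses", "motoring", "relational"]:
    print(word, "->", stemmer.stem(word))
# typical outputs: run, caress, grass, motor, relat  (stems need not be real words)
```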

129
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very
ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization. 131
➢ Module IV:
➢N-gram Language models, lexical and vector semantics,
➢TF-IDF, word2vector, semantic properties of embeddings

132
N-gram Language Modeling
Predicting words
⚫ The water of Walden Pond is beautifully ...

blue
*refrigerator
green
*that
clear

133
Language Models

⚫ Systems that can predict upcoming words


• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence

134
Why word prediction?

It's a helpful part of language tasks


• Grammar or spell checking
  "Their are two midterms" → "There are two midterms"
  "Everything has improve" → "Everything has improved"
• Speech recognition
  "I will be back soonish" vs. "I will be bassoon dish"

135
Why word prediction?

It's how large language models (LLMs) work!


LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next
word
LLMs generate text by predicting words
• By predicting the next word over and over again

136
Language Modeling (LM) more formally

⚫ Goal: compute the probability of a sentence or


sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
⚫ Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
⚫ An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)

137
How to estimate these probabilities

⚫Could we just count and divide?
P(blue | The water of Walden Pond is so beautifully) =
   Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)

⚫No! Too many possible sentences!


⚫We’ll never see enough data for estimating these

138
How to compute P(W) or P(wn|w1, …wn-1)

⚫ How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

⚫ Intuition: let’s rely on the Chain Rule of Probability

139
Reminder: The Chain Rule

⚫Recall the definition of conditional probabilities


P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)

⚫More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
⚫The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

140
The Chain Rule applied to compute joint
probability of words in sentence

P(“The water of Walden Pond”) =


P(The) × P(water|The) × P(of|The water)
× P(Walden|The water of) × P(Pond|The water of
Walden)
141
Markov Assumption

⚫ Simplifying assumption: the probability of the next word depends only on the most recent word(s), e.g.,
   P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)
[Photo: Andrei Markov, Wikimedia commons]

142
Bigram Markov Assumption
⚫Instead of conditioning on the entire history, condition only on the previous word:
   P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)
⚫More generally, we approximate each component in the product:
   P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

143
Simplest case: Unigram model

P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from two different unigram models
To him swallowed confess hear both . Which . Of save on trail for
are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter

Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives


144
Bigram model

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

Some automatically generated sentences from two different bigram models:
Why dost stand forth thy canopy, forsooth; he is this palpable hit
the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N.


B. E. C. Taylor would seem to complete the major central planners one
gram point five percent of U. S. E. has already old M. X. corporation
of living

on information such as more frequently fishing to keep her


145
Problems with N-gram models

• N-grams can't handle long-distance dependencies:


“The soups that I made from that new cookbook I
bought yesterday were amazingly delicious."
• N-grams don't do well at modeling new sequences
with similar meanings
The solution: Large language models
• can handle much longer contexts
• because of using embedding spaces, can model
synonymy better, and generate better novel strings 146
Why N-gram models?

A nice clear paradigm that lets us introduce many


of the important issues for large language
models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
147
N-gram Language Modeling
⚫ Introduction to N-grams

148
N-gram Language Modeling
⚫ Estimating N-gram Probabilities

149
Estimating bigram probabilities

⚫ The Maximum Likelihood Estimate:
   P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

150
An example
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I|<s>) = 2/3    P(am|I) = 2/3    P(Sam|am) = 1/2
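A short sketch that computes these maximum likelihood bigram estimates from the three example sentences (pure Python, illustrative):

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))          # adjacent word pairs

def p(w, prev):
    """Maximum likelihood estimate P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"), p("am", "I"), p("Sam", "am"))   # 2/3, 2/3, 1/2
```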

151
More examples:
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse

⚫i’m looking for a good place to eat breakfast

⚫when is caffe venezia open during the day

152
Raw bigram counts

⚫ Out of 9222 sentences

153
Raw bigram probabilities

⚫Normalize by unigrams:

⚫Result:

154
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =


P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
155
What kinds of knowledge do N-grams represent?

⚫P(english|want) = .0011
⚫P(chinese|want) = .0065

⚫P(to|want) = .66

⚫P(eat | to) = .28

⚫P(food | to) = 0

⚫P(want | spend) = 0

⚫P (i | <s>) = .25

156
Dealing with scale in large n-grams

⚫ LM probabilities are stored and computed


in log format, i.e. log probabilities
⚫This avoids underflow from multiplying

many small numbers


log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
If we need probabilities we can do one exp at the end

157
N-gram Language Modeling
⚫ Estimating N-gram Probabilities

160
Language Modeling
⚫ Evaluation and Perplexity

161
How to evaluate N-gram models

⚫ "Extrinsic (in-vivo) Evaluation"


To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B 162
Intrinsic (in-vitro) evaluation

⚫ Extrinsic evaluation not always possible


• Expensive, time-consuming
• Doesn't always generalize to other applications
⚫ Intrinsic evaluation: perplexity
• Directly measures language model performance at
predicting words.
• Doesn't necessarily correspond with real application
performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
163
Training sets and test sets

We train parameters of our model on a training


set.
We test the model’s performance on data we
haven’t seen.
◦ A test set is an unseen dataset; different from training
set.
• Intuition: we want to measure generalization to unseen data
◦ An evaluation metric (like perplexity) tells us how
well our model does on the test set.

164
Choosing training and test sets

• If we're building an LM for a specific task


• The test set should reflect the task language
we want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training
data
• We don't want the training set or the test set
to be just from one domain or author or
language.
165
166

Training on the test set

We can’t allow test sentences into the training


set
• Or else the LM will assign that sentence an artificially
high probability when we see it in the test set
• And hence assign the whole test set a falsely high
probability.
• Making the LM look better than it really is
This is called “Training on the test set”
Bad science!
166
Dev sets

• If we test on the test set many times we might


implicitly tune to its characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times

• That means we need a third dataset:


• A development test set or, devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
167
Intuition of perplexity as evaluation metric: How
good is our language model?
Intuition: A good LM prefers "real" sentences
• Assign higher probability to “real” or “frequently
observed” sentences
• Assigns lower probability to “word salad” or
“rarely observed” sentences?

168
Intuition of perplexity 2: Predicting upcoming words
The Shannon Game: How well can we predict the next word?
• Once upon a ____
• That is a picture of a ____
• For breakfast I ate my usual ____
Example distribution over the next word: time 0.9, dream 0.03, midnight 0.02, …, and 1e-100
Unigrams are terrible at this game (Why?)
[Photo: Claude Shannon]

A good LM is one that assigns a higher probability


to the next word that actually occurs
169
Picture credit: Historiska bildsamlingen
https://creativecommons.org/licenses/by/2.0/
Intuition of perplexity 3: The best language
model is one that best predicts the entire unseen
test set
• We said: a good LM is one that assigns a higher
probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test
set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (=be less
surprised by) the test set than the other LM.
170
Intuition of perplexity 4: Use perplexity instead of
raw probability
• Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test
set, normalized by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)
171
Intuition of perplexity 5: the inverse

Perplexity is the inverse probability of the test set, normalized


by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)
(The inverse comes from the original definition of perplexity
from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
172
Intuition of perplexity 6: N-grams
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:  PP(W) = (∏ᵢ₌₁ᴺ 1 / P(wi | w1 … wi-1))^(1/N)

Bigrams:  PP(W) = (∏ᵢ₌₁ᴺ 1 / P(wi | wi-1))^(1/N)
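A tiny sketch of computing bigram perplexity in log space (the probability function here is a uniform stand-in; in practice it would come from a trained LM):

```python
import math

def perplexity(tokens, bigram_prob):
    """PP(W) = exp(-(1/N) * sum(log P(w_i | w_{i-1}))), computed in log space to avoid underflow."""
    log_sum = 0.0
    n = 0
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob(w, prev))
        n += 1
    return math.exp(-log_sum / n)

# Toy language from the branching-factor slide: every word equally likely given any previous word
uniform = lambda w, prev: 1.0 / 3.0
print(perplexity("<s> red red red red blue".split(), uniform))   # 3.0
```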

173
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓
Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(-1/5) = ((⅓)^5)^(-1/5) = (⅓)^(-1) = 3
⚫But now suppose red was very likely in the training set, such that for LM B:
◦ P(red) = .8, P(green) = .1, P(blue) = .1
⚫We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(-1/5) = (.8 × .8 × .8 × .8 × .1)^(-1/5) = .04096^(-1/5) = (.527)^(-1) = 1.89
174


Holding test set constant:
Lower perplexity = better language model

⚫Training on 38 million words, testing on 1.5 million words of WSJ text:

N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109
175
Language Modeling
⚫ Evaluation and Perplexity

176
Language Modeling
⚫ Sampling and Generalization

177
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
Claude Shannon


PRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN
DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF
TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE
HAD BE THESE.

⚫ Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED. 178
How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the


page. This letter is recorded. The book is then opened to another
page and one reads until this letter is encountered. The
succeeding letter is then recorded. Turning to another page this
second letter is searched for and the succeeding letter recorded,
etc." 179
Sampling a word from a distribution
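A quick sketch of sampling a word from a probability distribution in Python (the toy distribution is illustrative):

```python
import random

words = ["the", "of", "a", "to", "polyphonic"]
probs = [0.35, 0.25, 0.20, 0.19, 0.01]          # must sum to 1

# random.choices picks according to the given weights, like rolling a biased die
sampled = random.choices(words, weights=probs, k=10)
print(sampled)                                   # frequent words appear often, rare words rarely
```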

180
Visualizing Bigrams the Shannon Way
⚫Choose a random bigram (<s>, w) according to its probability p(w|<s>)
⚫Now choose a random bigram (w, x) according to its probability p(x|w)
⚫And so on until we choose </s>
⚫Then string the words together

<s> I
    I want
      want to
           to eat
              eat Chinese
                  Chinese food
                          food </s>
I want to eat Chinese food
181
Approximating Shakespeare

183
Shakespeare as corpus

N=884,647 tokens, V=29,066


Shakespeare produced 300,000 bigram types out of
V2= 844 million possible bigrams.
◦ So 99.96% of the possible bigrams were never seen (have
zero entries in the table)
◦ That sparsity is even worse for 4-grams, explaining why
our sampling generated actual Shakespeare.

184
The Wall Street Journal is not Shakespeare

185
186

Can you guess the author? These 3-gram sentences


are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and gram Brazil
on market conditions
2) This shall forbid it should be
branded, if renown made it empty.
3) “You are uniformly charming!” cried
he, with a smile of associating and now
and then I bowed and they perceived a
chaise and four to wish for.
186
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set

188
Zeros
⚫ Training set: • Test set
… ate lunch … ate lunch
… ate dinner … ate breakfast
… ate a
… ate the
P(“breakfast” | ate) = 0

189
Zero probability bigrams

Bigrams with zero probability


◦ Will hurt our performance for texts where those words
appear!
◦ And mean that we will assign 0 probability to the test
set!
And hence we cannot compute perplexity (can’t
divide by 0)!

190
N-gram Language Modeling
⚫ Smoothing, Interpolation, and Backoff

191
The intuition of smoothing (from Dan Klein)
⚫ When we have sparse statistics:
P(w | denied the)
  3 allegations
  2 reports
  1 claims
  1 request
  7 total
⚫ Steal probability mass to generalize better:
P(w | denied the)
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2 other
  7 total
192
Add-one estimation or Laplace Smoothing
⚫ Pretend we saw each word one more time than we did
⚫ Just add one to all the counts!
⚫ MLE estimate:    P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
⚫ Add-1 estimate:  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
193
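A small sketch contrasting the two estimates; the counts and vocabulary size below are made up for illustration:

from collections import Counter

# Hypothetical bigram/unigram counts from a tiny training corpus
bigram_count  = Counter({("ate", "lunch"): 3, ("ate", "dinner"): 2, ("ate", "a"): 1, ("ate", "the"): 1})
unigram_count = Counter({"ate": 7})
V = 10_000  # assumed vocabulary size

def p_mle(w_prev, w):
    return bigram_count[(w_prev, w)] / unigram_count[w_prev]

def p_add1(w_prev, w):
    return (bigram_count[(w_prev, w)] + 1) / (unigram_count[w_prev] + V)

print(p_mle("ate", "breakfast"))   # 0.0 -> unseen bigram gets zero probability
print(p_add1("ate", "breakfast"))  # small but nonzero, so perplexity stays computable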
Maximum Likelihood Estimates
⚫ The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
⚫Suppose the word “bagel” occurs 400 times in a corpus of a million
words
⚫What is the probability that a random word from some other text will be
“bagel”?
⚫MLE estimate is 400/1,000,000 = .0004
⚫This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400
times in a million word corpus.

194
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse

⚫i’m looking for a good place to eat breakfast

⚫when is caffe venezia open during the day

195
Raw bigram counts

⚫ Out of 9222 sentences

196
Berkeley Restaurant Corpus: Laplace smoothed
bigram counts

197
Laplace-smoothed bigrams

198
Reconstituted counts

199
Compare with raw bigram counts
(tables: original raw counts vs. Laplace-smoothed reconstituted counts)
Add-1 estimation is a blunt instrument

⚫So add-1 isn’t used for N-grams:


◦ Generally we use interpolation or backoff instead
⚫But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.

201
Backoff and Interpolation

⚫ Sometimes it helps to use less context


◦ Condition on less context for contexts you know less
about
⚫ Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
⚫ Interpolation:
◦ mix unigram, bigram, trigram

Interpolation works better 202


Linear Interpolation
⚫ Simple interpolation:
P̂(w_n | w_{n-2} w_{n-1}) = λ1 P(w_n) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n | w_{n-2} w_{n-1}),  with Σ_i λ_i = 1
⚫ Lambdas conditional on context: each λ_i can itself be a function of the preceding words
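A sketch of simple interpolation; the component estimates and λ values below are placeholders (in practice the component models come from training counts and the λs are tuned on held-out data):

# Toy stand-ins for unigram / bigram / trigram estimates from a real corpus
p_uni = lambda w: 0.001
p_bi  = lambda w, w2: 0.01
p_tri = lambda w, w1, w2: 0.0      # unseen trigram

def p_interp(w, w1, w2, lambdas=(0.1, 0.3, 0.6)):
    """P(w | w1 w2) as a weighted mix of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas           # must sum to 1
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

print(p_interp("pancakes", "delicious", "soufflé"))  # nonzero even though the trigram count is 0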
How to set λs for interpolation?
⚫ Use a held-out corpus:
Training Data | Held-Out Data | Test Data
⚫ Choose λs to maximize the probability of the held-out data:
◦ Fix the N-gram probabilities (on the training data)
◦ Then search for the λs that give the largest probability to the held-out set
Backoff
Suppose you want P(pancakes | delicious soufflé)
⚫ If the trigram probability is 0, use the bigram P(pancakes | soufflé)
⚫ If the bigram probability is 0, use the unigram P(pancakes)
Complication: we need to discount the higher-order n-grams so the probabilities don't sum to more than 1 (e.g., Katz backoff). A control-flow sketch follows.
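A minimal sketch of that backoff control flow; the toy component estimates are invented, and a real scheme (e.g. Katz backoff) would also apply discounting so the distribution still sums to 1:

# Toy stand-ins for n-gram estimates (both higher-order estimates are zero here)
p_uni = lambda w: 0.001
p_bi  = lambda w, w2: 0.0
p_tri = lambda w, w1, w2: 0.0

def p_backoff(w, w1, w2):
    """Fall back to lower-order estimates when the higher-order estimate is zero."""
    if p_tri(w, w1, w2) > 0:
        return p_tri(w, w1, w2)
    if p_bi(w, w2) > 0:
        return p_bi(w, w2)
    return p_uni(w)

print(p_backoff("pancakes", "delicious", "soufflé"))  # falls all the way back to the unigram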
Vector Semantics & Embeddings
What do words mean?
⚫ N-gram or text classification methods we've seen so
far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
⚫ Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
⚫ That seems hardly better!
208
Desiderata: needed or wanted?

⚫ What should a theory of word meaning do for


us?
⚫Let's look at some desiderata

⚫From lexical semantics, the linguistic study of


word meaning

209
Lemmas and senses
lemma: mouse (N)
  sense 1: any of numerous small rodents...
  sense 2: a hand-operated device that controls a cursor...
A sense or "concept" is the meaning component of a word
Lemmas can be polysemous (have multiple senses)
Modified from the online thesaurus WordNet
Relations between senses: Synonymy

⚫ Synonyms have the same meaning in some or all


contexts.
◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O

211
Relations between senses: Synonymy

⚫ Note that there are probably no examples of


perfect synonymy.
◦ Even if many aspects of meaning are identical
◦ Still may differ based on politeness, slang, register,
genre, etc.

212
Relation: Synonymy?
water / H2O: would you say "H2O" in a surfing guide?
big / large: "my big sister" != "my large sister"
213
The Linguistic Principle of Contrast

⚫Difference in form → difference in


meaning

214
Abbé Gabriel Girard 1718, re: "exact" synonyms:
"[I do not believe that there is a synonymous word in any language]"
Thanks to Mark Aronoff!
Relation: Similarity

Words with similar meanings. Not synonyms, but sharing


some element of meaning

car, bicycle
cow, horse

216
Ask humans how similar 2 words are

word1 word2 similarity


vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98

SimLex-999 dataset (Hill et al., 2015)


217
Relation: Word relatedness

⚫Also called "word association"


⚫Words can be related in any way, perhaps via a semantic
frame or field

• coffee, tea: similar


• coffee, cup: related, not similar

218
Semantic field

⚫Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, menu, chef
houses
door, roof, kitchen, family, bed
219
Relation: Antonymy

⚫ Senses that are opposites with respect to only one


feature of meaning
⚫Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
⚫ More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
• long/short, fast/slow
• rise/fall, up/down

220
Connotation (sentiment)

• Words have affective meanings


• Positive connotations (happy)
• Negative connotations (sad)
• Connotations can be subtle:
• Positive connotation: copy, replica, reproduction
• Negative connotation: fake, knockoff, forgery
• Evaluation (sentiment!)
• Positive evaluation (great, love)
• Negative evaluation (terrible, hate)
221
Connotation
Osgood et al. (1957)
⚫ Words seem to vary along 3 affective (related to
feelings) dimensions:
◦ valence: the pleasantness of the stimulus
◦ arousal: the intensity of emotion provoked by the stimulus
◦ dominance: the degree of control exerted by the stimulus
Word Score Word Score
Valence love 1.000 toxic 0.008
happy 1.000 nightmare 0.005
Arousal elated 0.960 mellow 0.069
frenzy 0.965 napping 0.046
Dominance powerful 0.991 weak 0.045
leadership 0.983 empty 0.081
222
Values from NRC VAD Lexicon (Mohammad 2018)
So far

⚫ Concepts or word senses


◦ Have a complex many-to-many association with words
(homonymy, multiple senses)
⚫ Have relations with each other
◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation

223
Vector Semantics & Embeddings

224
Computational models of word meaning

⚫ Can we build a theory of how to represent word


meaning, that accounts for at least some of the desiderata?
⚫We'll introduce vector semantics

⚫ The standard model in language processing!


⚫ Handles many of our goals!

225
Ludwig Wittgenstein

⚫PI #43:
"The meaning of a word is its use in the language"

226
Let's define words by their usages

⚫ One way to define "usage":


words are defined by their environments (the words around
them)
If A and B have almost identical environments we
say that they are synonyms.

227
What does recent English borrowing ongchoi mean?
⚫ Suppose you see these sentences:
• Ongchoi is delicious sautéed with garlic.
• Ongchoi is superb over rice
• Ongchoi leaves with salty sauces
⚫ And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
⚫ Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard greens
• We could conclude this based on words like "leaves" and "delicious" and "sautéed"
Idea 1: Defining meaning by linguistic distribution
Let's define the meaning of a word by its
distribution in language use, meaning its
neighboring words or grammatical environments.

230
Idea 2: Meaning as a point in space
(Osgood et al. 1957)
⚫ 3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted
Word Score Word Score
Valence love 1.000 toxic 0.008
happy 1.000 nightmare 0.005
Arousal elated 0.960 mellow 0.069
frenzy 0.965 napping 0.046
Dominance powerful 0.991 weak 0.045
◦ leadership 0.983 empty 0.081

⚫ Hence the connotation of a word is a vector in 3-space 231


Idea 1: Defining meaning by linguistic distribution

Idea 2: Meaning as a point in multidimensional space

232
Defining meaning as a point in space based on distribution

⚫Each word = a vector (not just "good" or "w45")


⚫Similar words are "nearby in semantic space"

⚫We build this space automatically by seeing which words are


nearby in text

233
We define meaning of a word as a vector

⚫ Called an "embedding" because it's embedded into a


space

⚫ The standard way to represent meaning in NLP


Every modern NLP algorithm uses embeddings
as the representation of word meaning

⚫ Fine-grained model of meaning for similarity


234
Intuition: why vectors?

⚫Consider sentiment analysis (+, -, Neutral):


◦ With words, a feature is a word identity
• Feature 5: 'The previous word was "terrible"'
• requires exact same word to be in training and test

◦ With embeddings:
• Feature is a word vector
• 'The previous word was vector [35,22,17…]
• Now in the test set we might see a similar vector [34,21,14]
• We can generalize to similar but unseen words!!!

235
We'll discuss 2 kinds of embeddings
⚫ tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of
nearby words
⚫ Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings 236
⚫ Words and Vectors

238
Vectors and Documents
Term-document matrix (the works of Shakespeare)
Each document is represented by a vector of word counts
Each column vector represents a document as a point in |V|-dimensional space
Visualizing document vectors

240
Vectors are the basis of information retrieval

Vectors are similar for the two comedies

But comedies are different than the other two


Comedies have more fools and wit and fewer
battles. 241
Idea for word meaning: Words can be vectors too!!!
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Vector for the word "fool": [36, 58, 1, 4]
More common: word-word matrix
(or "term-context matrix")

⚫Two words are similar in meaning if their context vectors are


similar

243
244
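A minimal sketch of building such a word-word (term-context) co-occurrence matrix with a ±2 word window; the toy corpus is invented:

from collections import defaultdict

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 2
cooc = defaultdict(lambda: defaultdict(int))   # cooc[word][context_word] = count

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        # Count every context word within `window` positions of w
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[w][words[j]] += 1

print(dict(cooc["like"]))  # {'i': 2, 'deep': 1, 'learning': 1, 'nlp': 1}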
Cosine for computing word similarity

245
Computing word similarity:
Dot product and cosine
⚫ The dot product between two vectors is a
scalar:

⚫ The dot product tends to be high when the two


vectors have large values in the same dimensions
⚫Dot product can thus be a useful similarity
metric between vectors
246
Problem with raw dot-product

⚫Dot product favors long vectors


⚫ Dot product is higher if a vector is longer (has higher values in many dimensions)
⚫Vector length:

⚫Frequent words (of, the, you) have long vectors


(since they occur many times with other words).
⚫So dot product overly favors frequent words
247
Alternative:
cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b

248
Cosine as a similarity metric

⚫-1: vectors point in opposite directions


⚫+1: vectors point in same directions
⚫0: vectors are orthogonal

⚫ But since raw frequency values are non-negative,


the cosine for term-term matrix vectors ranges from
0–1
249
Cosine examples
cos(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i²) · sqrt(Σ_{i=1}^{N} w_i²) )

Counts:        pie   data   computer
cherry         442      8          2
digital          5   1683       1670
information      5   3982       3325
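A short numpy sketch computing cosine similarity for the count vectors in the table above:

import numpy as np

# Context-count vectors over the dimensions (pie, data, computer)
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(cherry, information))   # low (≈ 0.02): cherry lives in the "pie" dimension
print(cosine(digital, information))  # high (≈ 1.0): both live in the data/computer dimensions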
Visualizing cosines
(well, angles)

251
TF-IDF: Term Frequency-Inverse Document Frequency

252
But raw frequency is a bad representation
• The co-occurrence matrices we have seen represent
each cell by word frequencies.
• Frequency is clearly useful; if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, or they are not
very informative about the context
• It's a paradox! How can we balance these two conflicting
constraints?

253
Two common solutions for word weighting
⚫ tf-idf: the tf-idf value for word t in document d
◦ Words like "the" or "it" have very low idf
⚫ PMI (Pointwise Mutual Information):
PMI(w1, w2) = log [ p(w1, w2) / ( p(w1) p(w2) ) ]
◦ See if words like "good" appear more often with "great" than we would expect by chance
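A small sketch of PMI from co-occurrence counts; the counts below are invented for illustration:

import math

# Hypothetical counts over a corpus with N observed word-pair events
N = 1_000_000
count_w1w2 = 300        # "good" and "great" co-occur
count_w1   = 5_000      # "good"
count_w2   = 2_000      # "great"

p_w1w2 = count_w1w2 / N
p_w1   = count_w1 / N
p_w2   = count_w2 / N

pmi = math.log2(p_w1w2 / (p_w1 * p_w2))
print(pmi)   # > 0 means the pair co-occurs more often than chance would predict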
Term frequency (tf) in the tf-idf algorithm
⚫ We could imagine using the raw count:
tf_{t,d} = count(t, d)
⚫ But instead of using the raw count, we usually squash it a bit:
tf_{t,d} = log10( count(t, d) + 1 )
Document frequency (df)

⚫ dft is the number of documents t occurs in.


⚫(note this is not collection frequency: total count
across all documents)
⚫"Romeo" is very distinctive for one Shakespeare play:

256
Inverse document frequency (idf)
idf_t = log10( N / df_t ), where N is the total number of documents in the collection
What is a document?

⚫Could be a play or a Wikipedia article


⚫But for the purposes of tf-idf, documents can be
anything; we often call each paragraph a document!

258
Final tf-idf weighted value for a word
⚫ Raw counts: the cell value is count(t, d)
⚫ tf-idf: the cell value is w_{t,d} = tf_{t,d} × idf_t
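A sketch of this weighting over a toy document collection (the documents are invented; tf is the log-squashed count and idf is log10(N/df)):

import math
from collections import Counter

docs = {
    "doc1": "the fool and the wit".split(),
    "doc2": "the battle and the king".split(),
    "doc3": "the fool the fool the fool".split(),
}
N = len(docs)

def tf(term, doc):
    return math.log10(Counter(docs[doc])[term] + 1)            # squashed raw count

def idf(term):
    df = sum(1 for words in docs.values() if term in words)    # number of docs containing term
    return math.log10(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("fool", "doc3"))  # positive: frequent here, absent from doc2
print(tfidf("the", "doc3"))   # 0.0: "the" appears in every document, so idf = 0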
Vector Semantics & Embeddings: Word2vec
Sparse versus dense vectors

⚫ tf-idf (or PMI) vectors are


◦ long (length |V|= 20,000 to 50,000)
◦ sparse (most elements are zero)
⚫ Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)

261
Sparse versus dense vectors

⚫ Why dense vectors?


◦ Short vectors may be easier to use as features in
machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit
counts
◦ Dense vectors may do better at capturing synonymy:
• car and automobile are synonyms, but are distinct dimensions
• a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better
Common methods for getting short dense vectors

⚫“Neural Language Model”-inspired models


◦ Word2vec (skipgram, CBOW), GloVe
⚫Singular Value Decomposition (SVD)
◦ A special case of this is called LSA – Latent Semantic
Analysis
⚫Alternative to these "static embeddings":
• Contextual Embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
263
Simple static embeddings you can download!
⚫ Word2vec (Mikolov et al.)
https://code.google.com/archive/p/word2vec/
⚫ GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
Word2vec

⚫Popular embedding method


⚫Very fast to train

⚫Code available on the web

⚫Idea: predict rather than count

⚫Word2vec provides various options. We'll do:

⚫ skip-gram with negative sampling (SGNS)

265
Word2vec

⚫Instead of counting how often each word w occurs near


"apricot"
◦ Train a classifier on a binary prediction task:
• Is w likely to show up near "apricot"?
⚫We don’t actually care about this task
• But we'll take the learned classifier weights as the word embeddings
⚫ Big idea: self-supervision:
• A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
• No need for human labels
• Bengio et al. (2003); Collobert et al. (2011)
266
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word
c as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings

267
Skip-Gram Training Data
⚫ Assume a +/- 2 word window, given the training sentence:
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
Skip-Gram Classifier
(assuming a +/- 2 word window)
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
⚫ Goal: train a classifier that, given a candidate (word, context) pair
  (apricot, jam)
  (apricot, aardvark)
assigns each pair a probability:
  P(+ | w, c)
  P(− | w, c) = 1 − P(+ | w, c)
Similarity is computed from dot product

⚫ Remember: two vectors are similar if they have a


high dot product
◦ Cosine is just a normalized dot product
⚫ So:
◦ Similarity(w,c) ∝ w ∙ c
⚫ We’ll need to normalize to get a probability
◦ (cosine isn't a probability either)

270
Turning dot products into probabilities
⚫ Sim(w, c) ≈ w · c
⚫ To turn this into a probability, we'll use the sigmoid from logistic regression:
P(+ | w, c) = σ(c · w) = 1 / (1 + exp(−c · w))
P(− | w, c) = 1 − P(+ | w, c) = σ(−c · w)

How the Skip-Gram Classifier computes P(+ | w, c)
⚫ This is for one context word, but we have lots of context words.
⚫ We'll assume independence and just multiply them:
P(+ | w, c_{1:L}) = Π_{i=1}^{L} σ(c_i · w)
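A numpy sketch of how this classifier scores one (target, context) pair and then a whole window; the embeddings here are random stand-ins for learned vectors:

import numpy as np

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)                 # target-word embedding (e.g. "apricot")
contexts = rng.normal(size=(4, d))     # embeddings of the window words c1..c4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_pos(w, c):
    # P(+ | w, c) = sigma(c . w)
    return sigmoid(np.dot(c, w))

# Assuming independence, multiply over the context words in the window
p_window = np.prod([p_pos(w, c) for c in contexts])
print(p_window)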
Skip-gram classifier: summary

⚫A probabilistic classifier, given


• a test target word w
• its context window of L words c1:L
⚫ Estimates probability that w occurs in this window based
on similarity of w (embeddings) to c1:L (embeddings).

⚫ To compute this, we just need embeddings for all the


words.
273
These embeddings we'll need: a set for w, a set for c

274
Skip-Gram Training data
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
(positive examples: the target paired with each window word; for each positive example we'll grab k negative examples, sampling by frequency)
Word2vec: how to learn vectors

⚫ Given the set of positive and negative training


instances, and an initial set of embedding vectors
⚫The goal of learning is to adjust those word vectors
such that we:
◦ Maximize the similarity of the target word, context word
pairs (w , cpos) drawn from the positive data
◦ Minimize the similarity of the (w , cneg) pairs drawn from
the negative data.
278
Loss function for one w with cpos, cneg1 ... cnegk
⚫ Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words:
L_CE = −[ log σ(c_pos · w) + Σ_{i=1}^{k} log σ(−c_neg_i · w) ]
Learning the classifier

⚫How to learn?
◦ Stochastic gradient descent!

⚫We’ll adjust the word weights to


◦ make the positive pairs more likely
◦ and the negative pairs less likely,
◦ over the entire training set.

280
Intuition of one step of gradient descent

281
Reminder: gradient descent
• At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient, d/dw L(f(x; w), y), weighted by a learning rate η
• A higher learning rate means we move w faster

w^{t+1} = w^t − η · d/dw L(f(x; w), y)
The derivatives of the loss function
∂L_CE/∂c_pos = [σ(c_pos · w) − 1] w
∂L_CE/∂c_neg_i = σ(c_neg_i · w) w
∂L_CE/∂w = [σ(c_pos · w) − 1] c_pos + Σ_{i=1}^{k} σ(c_neg_i · w) c_neg_i
Update equation in SGD

Start with randomly initialized C and W matrices, then incrementally do updates

284
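A sketch of one SGD update for a single positive pair and its k negative samples, following the derivatives above; the embedding matrices are randomly initialized stand-ins and the word indices are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
V, d, k, eta = 1000, 50, 2, 0.1
W = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w_id, cpos_id, cneg_ids):
    w, cpos = W[w_id].copy(), C[cpos_id].copy()
    # Gradient w.r.t. the target embedding accumulates over the positive and negative pairs
    g_w = (sigmoid(cpos @ w) - 1) * cpos
    C[cpos_id] -= eta * (sigmoid(cpos @ w) - 1) * w
    for cn_id in cneg_ids:
        cn = C[cn_id].copy()
        g_w += sigmoid(cn @ w) * cn
        C[cn_id] -= eta * sigmoid(cn @ w) * w
    W[w_id] -= eta * g_w

sgd_step(w_id=5, cpos_id=17, cneg_ids=rng.integers(0, V, size=k))   # one incremental update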
Two sets of embeddings

⚫ SGNS learns two sets of embeddings


⚫ Target embeddings matrix W
⚫ Context embedding matrix C
⚫It's common to just add them together,
representing word i as the vector wi + ci

285
Summary: How to learn word2vec (skip-gram) embeddings
⚫ Start with V random d-dimensional vectors as initial
embeddings
⚫ Train a classifier based on embedding similarity
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
286
The kinds of neighbors depend on window size
⚫ Small windows (C= +/- 2) : nearest words are
syntactically similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
•Sunnydale, Evernight, Blandings

⚫ Large windows (C= +/- 5) : nearest words are


related words in same semantic field
◦Hogwarts nearest neighbors are Harry Potter world:
•Dumbledore, half-blood, Malfoy

287
Analogical relations

⚫ The classic parallelogram model of analogical


reasoning (Rumelhart and Abrahamson 1973)
⚫To solve: "apple is to tree as grape is to _____"

⚫ Add tree – apple to grape to get vine

288
Analogical relations via parallelogram

⚫ The parallelogram method can solve analogies with


both sparse and dense embeddings (Turney and
Littman 2005, Mikolov et al. 2013b)
⚫ king – man + woman is close to queen
⚫ Paris – France + Italy is close to Rome
⚫ For a problem a : a* :: b : b*, the parallelogram method finds the word b̂* whose vector is closest to a* − a + b:
b̂* = argmin_x distance(x, a* − a + b)
(see the sketch below)
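A numpy sketch of the parallelogram method over a toy embedding table; the 3-dimensional vectors below are invented purely for illustration (real systems would load 50-300-dimensional word2vec or GloVe vectors):

import numpy as np

emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "paris": np.array([0.1, 0.8, 0.2]),
}

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, a_star, b):
    """Solve a : a_star :: b : ?  by finding the word nearest to a_star - a + b."""
    target = emb[a_star] - emb[a] + emb[b]
    candidates = {word: cosine(target, vec) for word, vec in emb.items()
                  if word not in {a, a_star, b}}   # exclude the input words, as is standard
    return max(candidates, key=candidates.get)

print(analogy("man", "king", "woman"))   # -> "queen" with these toy vectors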
Structure in GloVe Embedding space

290
Caveats with the parallelogram method

⚫ It only seems to work for frequent words, small


distances and certain relations (relating countries
to capitals, or parts of speech), but not others.
(Linzen 2016, Gladkova et al. 2016, Ethayarajh et
al. 2019a)

⚫ Understanding analogy is an open area of


research (Peterson et al. 2020)
291
Embeddings as a window onto historical semantics

Train embeddings on different decades of historical text to see meanings


~30 million books, 1850-1990, Google Books data

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of
Semantic Change. Proceedings of ACL. 292
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is
to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS,
pp. 4349-4357. 2016.

⚫Ask “Paris : France :: Tokyo : x”


◦ x = Japan
⚫Ask “father : doctor :: mother : x”
◦ x = nurse
⚫Ask “man : computer programmer :: woman : x”
◦ x = homemaker
Algorithms that use embeddings as part of e.g., hiring searches
for programmers, might lead to bias in hiring 293
Historical embedding as a tool to study cultural
biases
• Compute a gender or ethnic bias for each adjective: e.g.,
how much closer the adjective is to "woman" synonyms than
"man" synonyms, or names of particular ethnicities
• Embeddings for competence adjective (smart, wise,
brilliant, resourceful, thoughtful, logical) are biased
toward men, a bias slowly decreasing 1960-1990
• Embeddings for dehumanizing adjectives (barbaric,
monstrous, bizarre) were biased toward Asians in the
1930s, bias decreasing over the 20th century.
• These match the results of old surveys done in the 1930s
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. 294
Proceedings of the National Academy of Sciences 115(16), E3635–E3644.
THANK YOU