
Introduction

Natural Language Processing
Modules:
➢ Module I:
➢ Introduction to Natural language processing (NLP)
➢ NLP problems in text summarization, text classification, sentiment
analysis, question answering, neural translation, etc.
➢ Module II:
➢ Introduction to Neural Networks
➢ Optimization formulations for Deep learning, Gradient-based
optimization, Gradient descent,
➢ Neural networks, feed-forward NN, Gradient-based learning, and back-
propagation, and differentiation algorithms
➢ Module III:
➢ Elements of NLP: Expression, word, corpora,
➢ Token and tokenization, word normalization, lemmatization, stemming,
sentence segmentation, sequence labelling, context-free grammars

2
Modules:
➢ Learning Outcomes
➢ Module I: Understand the importance of NLP problems and different types
of problems in the literature
➢ Module II: Understand the concepts of Optimization for Deep Learning,
Basics of Neural Networks, Feed-forward Neural Networks, Back
propagation and automatic differentiation
➢ Module III: Basic terminology and operation in text processing, different
types of tokenization, word normalization, minimum edit distance and their
computations

3
Modules:
➢ Module IV:
➢ N-gram Language models, lexical and vector semantics,
➢ TF-IDF, word2vector, semantic properties of embeddings
➢ Module V:
➢ Recurrent Neural Networks (RNNs),
➢ RNNs for Language modeling, Sequence labeling and classification,
sequence-to-sequence tasks,
➢ Stacked and bidirectional RNNs, LSTMs

4
Modules:
➢ Learning Outcomes
➢ Module IV: Understand formulation of language models (LMs), different LMs
such as n-grams, TF-IDF, etc. evaluation metrics from language view points
➢ Module V: Basics of RNNs and their variants, sequence modeling and
different problems associated with sequence modeling

5
Modules:
➢ Module VI:
➢ Introduction to transformer:
➢ Attention mechanism, multi-head attention, transformer architecture,
➢ Embedding of tokens and positions, language model head
➢ Module VII:
➢ Large Language Models: Use of Transformer Architecture,
➢ Pretraining architectures: Decoders, Encoders, and Encoder-Decoders,
➢ Finetuning, Generative Pretrained transformer
➢ Module VIII :More on LLMs:
➢ Scaling laws and scaling efficiency
➢ Prompting and instruction tuning, alignment, preference fine-tuning
➢ Applications of LLMs in Machine translations, question answering, and
information retrieval

6
Modules:
➢ Learning Outcomes
➢ Module VI: Learn the concepts of transformers and associated concepts
such as attention, multi-head attention, etc.
➢ Module VII: Different architectures of LLMs, finetuning methods
➢ Module VIII: Understand the scaling laws, and model parameters, prompt
engineering aspects and applications of LLMs

7
Module I: Introduction to Natural
language processing (NLP)

8
Quotes to remember
“For a large class of cases-though not for all-in which we employ
the word ‘meaning’ it can be defined thus: the meaning of a word
is its use in the language.”
- (Wittgenstein 1968, §43)
“All grammars leak”
-(Sapir 1921: 38)
“You shall know a word by the company it keeps”
-(Firth 1957: 11)

9
Can machines mimic humans?
Well! Yes. As wise as in this example
Killing is a honor

Generated text:
Killing is a honor in my tribe. It showed that you were strong, brave, and willing to do
what needed to be done. To get this close to him, to have my arrow ready to fire the
moment he revealed himself to be what he was, filled me with a strange mix of fear and
excitement. Thoughts of the demon were pushed to the

Source: https://app.inferkit.com/generate

10
Or as stupid as in this example

The world is round and can be proved.

Generated text
Or prove it.
There is no need to go to any center to learn things.
Today was an easy day.
I went into town to get to the PC shop.
I found the shop but it wasn't open.
I rang the bell but nobody came out.
I decided I would go to look at the esplanade.
The trees were starting to turn colors.

11
What is NLP?
⚫ Programming computers to process and analyze large amounts of natural language data
⚫ Subfield of Linguistics and Computer Science

⚫ Two schools of thought:

◦ Rationalist approach (dominated from 1960 to 1985)


• Significant part of the knowledge in human mind is not derived by the senses
• Knowledge is fixed in advance (presumably by genetic inheritance)
• Poverty of the stimulus (e.g., Chomsky 1986: 7)
◦ Empiricist approach (resurging today)
• Cognitive abilities present in the brain
• Association, Generalization, Pattern Recognition to learn the structure of language
• Insists on generic language model (statistical and corpus linguistics)

12
History
⚫ Symbolic NLP (1950 – early 1990s) – Mostly hand-written rules:
◦ Rule based parsing
◦ Morphology
◦ Semantics
⚫ Statistical NLP (1990 – 2010s) – Textual corpora were used predominantly
◦ Supervised Learning over hand-annotated data
◦ Unsupervised, semi-supervised learning over unannotated internet data
◦ Machine translation of governmental proceedings as a major focus
⚫ Neural NLP (present) – Deep Learning
◦ State of the art techniques
◦ Language modeling and many other applications

13
Applications of NLP
➢ Spam Detection:
✓ Scanning emails for words that indicate spam
➢ Machine Translation:
✓ Google translate is the best example of machine translation
✓ Capturing the meaning and tone of the source language is important

14
Applications of NLP
➢ Virtual Agents and Chat Bots:
✓ Siri and Alexa are examples of virtual agents that can take voice
commands and perform tasks
✓ Chat bots are developed to respond to human typed questions with
helpful answers
✓ Most websites which directly interact with many consumers have these
chat boxes
➢ Social Media Sentiment Analysis:
✓ Analysing social media posts, reviews, etc. to extract response
(positive/negative) to products, events, movies, etc.
➢ Text Summarisation:
✓ To ingest huge volumes of text and create summaries for indexes, busy
readers, etc.
15
Other Applications
➢ Drug Discovery
➢ Developing new drugs by understanding language(s) of molecules

➢ Molecules: Language representation

➢ Recommender systems:
Recommending products or movies to consumers based on their
historical consumption
➢ Targeted Advertisements:
Getting insights into customer behaviour and needs, and targeting ads accordingly based on search history, clicked items, etc.

16
Motivation: Biological Neuron to
Artificial Neuron

17
Module II: Introduction to Neural
Networks

18
Biological Neuron
• Basic working unit of the brain and nervous system
• Close to 100 billion interconnected neurons in a human brain
• Function together to aid decision making
• Parts and Functioning:
➢ Dendrite: Takes signals (stimulus) from the other neurons or other cells in the body
➢ Cell body (soma): Processes the signal and may or may not fire the neuron – excitation and inhibition
➢ Axon: Transmits the output (response) to other neurons or cells
[Figure: biological neuron with dendrites, nucleus, cell body (soma), axon and axon terminals labelled]

19
Biological Neuron to Artificial Neuron
Artificial neuron
➢ Mathematical model of a biological neuron
➢ Mimics the functioning of a biological neuron
➢ Takes input in the form of numbers
➢ Processes the input to give out an output
➢ Output = f(inputs)
➢ Different models of artificial neurons have been developed based on this idea
[Figure: biological neuron (dendrites, nucleus, cell body, axon, axon terminals) alongside an artificial neuron that maps inputs through a function 𝑓(.) to an output]
20
Artificial Neuron: McCulloch Pitts Model
o Inputs 𝑥1, 𝑥2, …, 𝑥𝑛 are binary numbers (0 or 1), i.e., 𝑥𝑖 ∈ {0, 1}
o Aggregated input passes through an activation function to give the output 𝑦 ∈ {0, 1}
o Activation function is based on thresholding logic
o Here, "model" refers to the function relating the output to the inputs

McCulloch Pitts Model:
𝑎 = 𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛
𝑦 = 𝑓(𝑎) = 1 if 𝑎 ≥ 𝜃
𝑦 = 𝑓(𝑎) = 0 if 𝑎 < 𝜃
21
McCulloch Pitts Model: Boolean Functions
o This model can be used to represent most Boolean functions

Logical AND function (2 inputs): 𝑎 = 𝑥1 + 𝑥2, 𝑦 = 1 if 𝑎 ≥ 2, 𝑦 = 0 if 𝑎 < 2
𝑥1  𝑥2  𝑦
0   0   0
0   1   0
1   0   0
1   1   1

Logical OR function (2 inputs): 𝑎 = 𝑥1 + 𝑥2, 𝑦 = 1 if 𝑎 ≥ 1, 𝑦 = 0 if 𝑎 < 1
𝑥1  𝑥2  𝑦
0   0   0
0   1   1
1   0   1
1   1   1
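As an illustration, a minimal sketch of the McCulloch Pitts unit in Python; the thresholds for AND and OR follow the slide above, while the function and variable names are chosen here purely for illustration:

```python
def mcculloch_pitts(inputs, theta):
    """McCulloch-Pitts unit: sum the binary inputs and apply a threshold."""
    a = sum(inputs)                # aggregated input a = x1 + x2 + ... + xn
    return 1 if a >= theta else 0  # thresholding logic

# Logical AND with 2 inputs: fires only when both inputs are 1 (theta = 2)
# Logical OR  with 2 inputs: fires when at least one input is 1 (theta = 1)
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcculloch_pitts([x1, x2], theta=2), mcculloch_pitts([x1, x2], theta=1))
```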
22
McCulloch Pitts Model: Drawbacks
• Drawbacks:
➢ Cannot handle non-boolean inputs and outputs
➢ Deciding an appropriate threshold value might be hard as the number of inputs increases
➢ Equal weightage to all inputs – What if more importance is to
be attached to some inputs?
• How to overcome these issues? – Perceptron Model

23
Artificial Neuron: Perceptron Model
o Inputs 𝑥1, 𝑥2, …, 𝑥𝑛 are real numbers
o Neuron takes the weighted combination of the inputs (weights 𝑤1, 𝑤2, …, 𝑤𝑛)
o Bias (𝑏) is added to the weighted inputs
o Weighted input passes through an activation function 𝑓(.) to give the output (𝑦)

Perceptron Model:
𝑎 = 𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏
𝑦 = 𝑓(𝑎) = 𝑓(𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏)
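A minimal sketch of this perceptron computation in Python; the weights, bias and step threshold below are illustrative placeholders, not values from the slides:

```python
def perceptron(x, w, b, threshold=0.0):
    """Weighted sum of real-valued inputs plus bias, followed by a step activation."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b   # a = w1*x1 + ... + wn*xn + b
    return 1 if a > threshold else 0               # f(a): step activation

# Example: three inputs with unequal importance (unlike the McCulloch-Pitts unit)
print(perceptron(x=[0.9, 0.2, 0.5], w=[0.6, 0.1, 0.3], b=-0.4))
```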

24
Artificial Neuron: A Simple Example
o Using an artificial neuron to decide whether to watch a movie or not
o Weights: 0.4 (lead actor), 0.3 (director), 0.2 (thrill factor), 0.1 (run time)
o Activation: 𝑓(𝑎) = 1 if 𝑎 > 7, 𝑓(𝑎) = 0 otherwise

Feature             Value for Movie 1   Value for Movie 2
Lead Actor (𝑥1)     10                  7
Director (𝑥2)       8                   5
Thrill factor (𝑥3)  8                   9
Run time (𝑥4)       9                   5

Movie 1: 𝑎 = 8.9, 𝑦 = 𝑓(𝑎) = 1        Movie 2: 𝑎 = 6.7, 𝑦 = 𝑓(𝑎) = 0

25
Artificial Neural Network

26
Artificial Neural Network: Motivation
➢ One neuron is not sufficient to take complex decisions (complex functions)
➢ Again inspired by the brain's neural network, the artificial neural network was developed
➢ In the brain, many neurons are involved in taking a decision
➢ All the neurons are inter-connected in the brain
➢ They are arranged hierarchically in layers

Perceptron Model (single neuron):
𝑎 = 𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏
𝑦 = 𝑓(𝑎) = 𝑓(𝑤1𝑥1 + 𝑤2𝑥2 + ⋯ + 𝑤𝑛𝑥𝑛 + 𝑏)

27
Artificial Neural Network (ANN)
o ANN consists of multiple layers with multiple neurons in each layer (hidden layers)
o Each neuron (except the inputs) represents a perceptron model
o Every neuron in one layer is connected to every neuron in the successive layer
o Outputs of one layer's neurons are passed as inputs to the neurons of the next layer
[Figure: inputs at the bottom, hidden layers of neurons in the middle, and the output at the top]

28
ANN Architecture
• Input layer (0th) with 𝑛 inputs: 𝒙 = [𝑥1 𝑥2 … 𝑥𝑛]ᵀ
• 𝐿 − 1 hidden layers with 𝑚 neurons each
• Output layer (𝐿th) with 𝑘 neurons
• 𝑾𝑖 is the matrix containing the weights between layers 𝑖 − 1 and 𝑖 (0 < 𝑖 ≤ 𝐿); 𝒃𝑖 is the vector representing the biases at layer 𝑖
• 𝑾1 (between the input layer and hidden layer 1) is an 𝑚 × 𝑛 matrix of weights; 𝒃1 is an 𝑚 × 1 vector of biases
• 𝑾𝑖 (between hidden layers 𝑖 − 1 and 𝑖) is an 𝑚 × 𝑚 matrix of weights; 𝒃𝑖 is an 𝑚 × 1 vector of biases
29
ANN Architecture
• 𝑾𝐿 (between the last hidden layer and the output layer) is a 𝑘 × 𝑚 matrix of weights; 𝒃𝐿 is a 𝑘 × 1 vector of biases
• For a single output, 𝑾𝐿 will be a row vector and 𝒃𝐿 will be a scalar
• Each neuron in the hidden and output layers has an activation function
• If there are more than 3 hidden layers, then the ANN is referred to as a Deep Neural Network (DNN) – depth refers to the number of layers
30
DNN Feed Forward Calculation
• Feed forward calculation involves finding the output as a function of the input, weights and biases
• Input to the activation function at hidden layer 1: 𝒂1 = 𝑾1𝒙 + 𝒃1
• Activation at hidden layer 1: 𝒉1 = 𝑔ℎ(𝒂1)
• 𝒉𝑖 is the output vector at layer 𝑖; 𝑔ℎ is the activation function which maps vector 𝒂𝑖 to vector 𝒉𝑖
• Input to the activation function at hidden layer 𝑖: 𝒂𝑖 = 𝑾𝑖𝒉𝑖−1 + 𝒃𝑖
• Activation at hidden layer 𝑖: 𝒉𝑖 = 𝑔ℎ(𝒂𝑖) = 𝑔ℎ(𝑾𝑖𝒉𝑖−1 + 𝒃𝑖)
31
DNN Feed Forward Calculation
𝒂1 = 𝑾1𝒙 + 𝒃1,  𝒉1 = 𝑔ℎ(𝒂1)
𝒂𝑖 = 𝑾𝑖𝒉𝑖−1 + 𝒃𝑖,  𝒉𝑖 = 𝑔ℎ(𝒂𝑖) = 𝑔ℎ(𝑾𝑖𝒉𝑖−1 + 𝒃𝑖)
• Input to the activation function at output layer 𝐿: 𝒂𝐿 = 𝑾𝐿𝒉𝐿−1 + 𝒃𝐿
• Activation at the output layer: ŷ = 𝑔𝑜(𝒂𝐿) = 𝑔𝑜(𝑾𝐿𝒉𝐿−1 + 𝒃𝐿)
• Model (function) being approximated by the DNN (assuming 𝐿 = 3):
  ŷ = 𝑔𝑜(𝑾3 𝑔ℎ(𝑾2 𝑔ℎ(𝑾1𝒙 + 𝒃1) + 𝒃2) + 𝒃3)
  ŷ = 𝑓(𝒙)
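A minimal NumPy sketch of this feed-forward computation; the layer sizes, the ReLU/softmax choice and the random parameters below are illustrative placeholders, not the course's reference implementation:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - np.max(a))      # subtract the max for numerical stability
    return e / e.sum()

def feed_forward(x, weights, biases):
    """h_i = g_h(W_i h_{i-1} + b_i) for hidden layers, softmax at the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)        # a_i = W_i h_{i-1} + b_i, h_i = g_h(a_i)
    return softmax(weights[-1] @ h + biases[-1])   # y_hat = g_o(W_L h_{L-1} + b_L)

# Toy network: 2 inputs -> 3 hidden -> 3 hidden -> 3 outputs (random parameters)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(3, 3))]
bs = [rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)]
print(feed_forward(np.array([1.5, 0.5]), Ws, bs))   # three class probabilities summing to 1
```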
32
Types of Activation Functions

33
Activation Function
o Activation function is like a gate between the input and output of a
neuron
o Purpose: To introduce non-linearity into the model and enable
learning complex functions (models)
o It affects the DNN output, accuracy and convergence
o Types of activation functions:
1. Linear activation function
2. Sigmoid activation function
3. Tanh activation function
4. Relu activation function
5. Softmax activation function

34
Linear Activation Function
• 𝑓(𝑎) = 𝑐𝑎
• Output is directly proportional to the input
• Output can take any real number
• Gradient is always constant and does not depend on the input
• Generally used in the output layer of regression problems

35
Sigmoid Activation Function
• 𝑓(𝑎) = 1 / (1 + 𝑒^(−𝑎))
• Any value of input is mapped to a value between 0 and 1
• Gradient is close to zero when the output is close to 0 or 1
• Useful when the expected output is a probabilistic value between 0 and 1
36
Softmax Activation Function
• Sigmoid gives a value between 0 and 1, and can be used for binary classification (the probabilities of the two classes sum to 1)
• However, sigmoid cannot be used to output multiple probability values which add up to 1 (multi-class)
• Softmax function is an extension of the sigmoid function
• Softmax calculates the relative probabilities of multiple classes and ensures that the total probability is 1
• Input to the output layer of a DNN: 𝒂 = [𝑎1 𝑎2 … 𝑎𝑘]
  𝑓(𝑎𝑖) = 𝑒^(𝑎𝑖) / Σⱼ₌₁ᵏ 𝑒^(𝑎ⱼ)

37
Tanh Activation Function
• 𝑓(𝑎) = (𝑒^𝑎 − 𝑒^(−𝑎)) / (𝑒^𝑎 + 𝑒^(−𝑎))
• Any value of input is mapped to a value between −1 and 1
• Positive inputs map to values between 0 and 1; negative inputs map to values between −1 and 0
• Output is zero-centered, which enables quick convergence
• Gradient is close to zero when the output is close to −1 or 1

38
Relu Activation Function
Rectified Linear Unit: 𝑓(𝑎) = max(0, 𝑎)
• All positive values go through directly while all negative values are mapped to zero
• Gradient is zero when the input ≤ 0 and 1 for all positive inputs
• Relu is one of the most popular activation functions and has many variants
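A small sketch of these activation functions in Python with NumPy, vectorized over an array of pre-activations (purely illustrative):

```python
import numpy as np

def linear(a, c=1.0):
    return c * a                                  # f(a) = c*a

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))               # maps any input to (0, 1)

def tanh(a):
    return np.tanh(a)                             # maps any input to (-1, 1), zero-centered

def relu(a):
    return np.maximum(0.0, a)                     # passes positives, zeroes out negatives

def softmax(a):
    e = np.exp(a - np.max(a))                     # stable: subtract the max before exponentiating
    return e / e.sum()                            # relative probabilities that sum to 1

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a), softmax(a), sep="\n")
```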

39
Feed Forward Calculation - Example

40
Feed Forward Calculation - Example
• Task: To classify a person as underweight (Class 1), normal weight (Class 2) or overweight (Class 3) given the height and weight as input features
• NN Architecture: A neural network with 2 inputs, 2 hidden layers with 3 neurons each, and 3 outputs
• 2 inputs to take the 2 features, and 3 outputs to predict the probability of the 3 classes
• Activation functions: All hidden layer neurons have relu activation and output layer neurons have softmax activation
41
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Weights and Biases (matrix rows separated by semicolons):
  𝑾1 = [0.53 0.86; 1.84 0.32; −2.25 −1.31]        𝒃1 = [−0.43; 0.34; 3.58]
  𝑾2 = [1.41 −1.21 0.49; 1.42 0.72 0.03; 0.67 1.63 0.73]        𝒃2 = [−0.2; −0.12; 1.49]
  𝑾3 = [−0.3 0.89 −0.81; 0.29 −1.14 −2.9; −0.78 −1.07 −1.43]        𝒃3 = [0.32; −0.75; 1.37]
42
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Hidden Layer 1 calculation:
  𝒂1 = 𝑾1𝒙 + 𝒃1 = [0.8; 3.25; −0.46]
  𝒉1 = relu(𝒂1) = [0.8; 3.25; 0]
43
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Hidden Layer 2 calculation:
  𝒂2 = 𝑾2𝒉1 + 𝒃2 = [−3; 3.34; 7.33]
  𝒉2 = relu(𝒂2) = [0; 3.34; 7.33]
44
Feed Forward Calculation - Example
• Inputs: 𝑥1 = 1.5, 𝑥2 = 0.5
• Output calculation:
  𝒂3 = 𝑾3𝒉2 + 𝒃3 = [−2.63; −26.18; 8.33]
  𝒚 = softmax(𝒂3) = [0; 0; 1]
• This example illustrates how 𝒚 is a function of 𝒙
45
Universal Approximation Theorem

46
Universal Approximation Theorem (UAT)
• UAT establishes that neural networks have a kind of universality in approximating functions
• For any given function of the inputs, 𝑦 = 𝑓(𝒙), there exists a neural network which can approximate the output
• Holds even when the function has multiple inputs and outputs
• Condition: Activation functions should be non-linear
Ref: Article by Michael Nielsen – Neural Networks and Deep Learning

47
Supervised Learning using DNN

48
Supervised Learning using DNN
Data for Supervised Learning:
➢ Inputs: Values of input features − 𝒙
➢ Outputs: Values of predicted variables − 𝒚
   Regression – Real numbers
   Classification – Discrete class or probability of each class
• DNN is expected to take an input and predict the desired output
• Implies: DNN should approximate a function 𝑓(𝒙) which maps inputs to outputs

49
Supervised Learning using DNN
➢ Question: What is a suitable 𝑓(𝒙) for the given data or task?
✓ Answer: Generally not known, and it can be a complex function
➢ Question: Can we find the weights and biases which will approximate the desired 𝑓(𝒙)?
✓ Answer: Yes! They can be learnt from data
• Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data

50
Summary
➢ Deep Learning is a sub-field of machine learning with many applications
in diverse areas
➢ Functioning of a biological neuron was mathematically modelled to
replicate its decision making capability
➢ An artificial neural network was developed inspired from the structure
of a brain neural network
➢ ANN consists of multiple layers of inter-connected neurons which
process inputs to give out outputs
➢ Universal Approximation theorem establishes that there always exists an
ANN which can approximate any function of any complexity
➢ An ANN can be trained to map inputs to desired outputs by learning the
weights and biases
51
Supervised Learning using DNN
➢ In supervised learning, input (𝒙) and output (𝒚) data is given to learn a function ŷ = 𝑓(𝒙) such that ŷ ≈ 𝒚
➢ Question: What is a suitable 𝑓(𝒙) for the given data or task?
➢ Question: Can we find the weights and biases which will approximate the desired 𝑓(𝒙)?
➢ Training a DNN: Learning the parameters of the DNN (weights and biases) using the given data

53
Training a DNN
Steps to train a DNN using data:
1. Create a neural network (# layers, # neurons)
2. Randomly initialize the weights and biases (𝑾1, … 𝑾𝐿, 𝒃1, … 𝒃𝐿)
3. Pass inputs 𝒙 through the network and get the output (ŷ)
4. Find out the difference between the actual output (𝑦) and the predicted output (ŷ) – prediction error
5. Adjust the weights to minimise the difference between 𝑦 and ŷ
Result: A DNN based model is trained on the given data, i.e., ŷ ≈ 𝑦 for all inputs

54
Training a DNN – Step 4
Step 4: Find out the difference between the actual output (𝑦) and the predicted output (ŷ) – prediction error
➢ A loss function or cost function is defined to quantify the prediction error
➢ For regression: Mean squared error loss, since the output is a real number
   ℒ = (1/𝑁) Σᵢ₌₁ᴺ (𝑦ᵢ − ŷᵢ)²
➢ For classification: Cross entropy loss, since the output is a probabilistic value for each class (𝑘 classes)
   ℒ = −(1/𝑁) Σᵢ₌₁ᴺ Σⱼ₌₁ᵏ 𝑦ᵢⱼ log ŷᵢⱼ

55
Training a DNN – Step 5
Step 5: Adjust the weights to minimise the loss function
➢ Question: How to adjust the parameters to minimise the loss function ℒ(𝑦, ŷ)?
➢ Suppose there are only 2 parameters, 𝑤1 and 𝑤2, on which the loss function depends
➢ To minimize the loss, take small steps in the direction along which ℒ decreases
➢ Implies: 𝑤1 and 𝑤2 are to be modified at each step such that ℒ decreases
➢ Directions at each step are given by the gradient
[Figure: loss surface ℒ over (𝑤1, 𝑤2); moving from a point with loss ℒ1 to a point with lower loss ℒ2]
57
Gradient Descent
➢ Gradient: Derivative of the loss function with respect to the parameters, ∂ℒ/∂𝑤1 and ∂ℒ/∂𝑤2
➢ Interpretation: The gradient is the rate at which the loss increases when the parameter increases
➢ Example: ∂ℒ/∂𝑤1 = 2 ⇒ when 𝑤1 increases by a small amount, ℒ increases by approximately twice that amount
➢ By updating 𝑤1 as 𝑤1 + ∂ℒ/∂𝑤1 and 𝑤2 as 𝑤2 + ∂ℒ/∂𝑤2, the value of ℒ increases
58
Gradient Descent
➢ Since the objective is to decrease ℒ, the parameters are updated as follows:
   𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1
   𝑤2(new) = 𝑤2(old) − 𝛼 ∂ℒ/∂𝑤2
➢ 𝛼 is the learning rate, which decides how large a step to take at each update
➢ Parameters are updated iteratively until the loss function is minimised (convergence)
59
Training a DNN – Gradient Descent
➢ Note: In a DNN, the loss depends on all the parameters 𝑾1, 𝑾2, … 𝑾𝐿 and 𝒃1, 𝒃2, …, 𝒃𝐿
➢ So the gradient needs to be calculated w.r.t. all the parameters
➢ General parameter update:
   𝑤(new) = 𝑤(old) − 𝛼 ∂ℒ/∂𝑤
   𝑏(new) = 𝑏(old) − 𝛼 ∂ℒ/∂𝑏
➢ Question: How to calculate the gradient of ℒ w.r.t. all the parameters in the DNN?
➢ Answer: Gradient descent with backpropagation

60
Training a DNN – Summary of Step 5
a) Known:
➢ 𝒙, 𝒚 − 𝑁 samples
➢ 𝑾1, 𝑾2, … 𝑾𝐿 and 𝒃1, 𝒃2, …, 𝒃𝐿 (initialised values)
➢ ŷ for each sample and the overall loss ℒ
b) Calculate the gradients of the loss function w.r.t. all the weights and biases using the chain rule
   Note: Gradient calculation involves a summation over all samples in the data
c) Update the weights and biases:
   𝑤(new) = 𝑤(old) − 𝛼 ∂ℒ/∂𝑤
   𝑏(new) = 𝑏(old) − 𝛼 ∂ℒ/∂𝑏
d) With the new parameters, compute ŷ for each sample and the overall loss ℒ
e) Repeat (b), (c) and (d) for many iterations – until the loss is minimised
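A compact sketch of this training loop in Python, using a single-layer linear model with mean squared error so the gradients can be written by hand; the synthetic data and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # N samples, 3 input features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3         # synthetic targets

w = rng.normal(size=3)                           # step 2: random initialisation
b = 0.0
alpha = 0.1                                      # learning rate

for iteration in range(200):                     # step (e): repeat until the loss is minimised
    y_hat = X @ w + b                            # step 3 / (d): forward pass
    err = y_hat - y
    loss = np.mean(err ** 2)                     # step 4: mean squared error loss
    grad_w = 2 * X.T @ err / len(y)              # step (b): dL/dw, summed over all samples
    grad_b = 2 * err.mean()                      # dL/db
    w -= alpha * grad_w                          # step (c): gradient descent update
    b -= alpha * grad_b

print(w, b, loss)                                # w ≈ [2, -1, 0.5], b ≈ 0.3, loss ≈ 0
```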
61
DNN Model Training - Components
• Data (Given): (𝒙𝑖, 𝒚𝑖); 𝑖 = 1 … 𝑁
• Model (Chosen): ŷ = 𝑓(𝒙, 𝑾1, … 𝑾𝐿, 𝒃1, … 𝒃𝐿)
• Parameters (To be learnt): 𝑾1, …, 𝑾𝐿, 𝒃1, …, 𝒃𝐿
• Loss Function: Mean squared error for regression and cross entropy for classification
• Training Algorithm: Gradient descent
62
Computing Gradients
➢ How to compute gradients efficiently in the backpropagation algorithm?
➢ How did we do it in high school or undergraduate courses?
➢ Numerical Gradient Computation:
   ∂ℒ/∂𝑤 ≈ [ℒ(𝑤 + 𝜖) − ℒ(𝑤 − 𝜖)] / (2𝜖)
➢ Not computationally efficient: 𝑛 parameters require 2𝑛 function evaluations
➢ Often used to check the automatic differentiation algorithms
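A short sketch of such a finite-difference gradient check in Python; the quadratic loss and epsilon value are illustrative:

```python
def numerical_gradient(loss_fn, params, eps=1e-6):
    """Central-difference estimate of dL/dw for each parameter: 2 evaluations per parameter."""
    grads = []
    for i in range(len(params)):
        bumped_up = params.copy();   bumped_up[i] += eps
        bumped_down = params.copy(); bumped_down[i] -= eps
        grads.append((loss_fn(bumped_up) - loss_fn(bumped_down)) / (2 * eps))
    return grads

# Example: L(w) = w1^2 + 3*w2, so dL/dw1 = 2*w1 and dL/dw2 = 3
loss = lambda w: w[0] ** 2 + 3 * w[1]
print(numerical_gradient(loss, [1.5, -2.0]))   # ≈ [3.0, 3.0]
```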

63
Computing Gradients
➢ Rule of Partial derivatives
➢ Sum of two functions f (.) + g (.)

➢ Product of two functions f(.) g (.)

➢ Chain Rule f(g(.))

64
Gradient Descent with Backpropagation
ℒ = (1/𝑁) Σᵢ₌₁ᴺ (𝑦ᵢ − ŷᵢ)²   (regression)   or   ℒ = −(1/𝑁) Σᵢ₌₁ᴺ Σⱼ₌₁ᵏ 𝑦ᵢⱼ log ŷᵢⱼ   (classification)
ŷ = 𝑔𝑜(𝑾3 𝑔ℎ(𝑾2 𝑔ℎ(𝑾1𝒙 + 𝒃1) + 𝒃2) + 𝒃3)
➢ The loss function is connected to the parameters through ŷ
➢ Issue: Writing the loss function explicitly in terms of each of the parameters is a complex process
➢ Solution: The chain rule can be used to compute the gradient w.r.t. each of the parameters, e.g.,
   ∂ℒ/∂𝑤3 = (∂ℒ/∂ŷ)(∂ŷ/∂𝑤3) = (∂ℒ/∂ŷ)(∂ŷ/∂𝑎3)(∂𝑎3/∂𝑤3)
   ∂ℒ/∂𝑤1 = (∂ℒ/∂ŷ)(∂ŷ/∂𝑎3)(∂𝑎3/∂ℎ2)(∂ℎ2/∂𝑎2)(∂𝑎2/∂ℎ1)(∂ℎ1/∂𝑎1)(∂𝑎1/∂𝑤1)
➢ So the loss is being backpropagated through the chain of gradients
Note: ∂ℒ/∂𝑤1 will vary for each input 𝑥 and output
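To make the chain of gradients concrete, here is a scalar sketch in Python for a tiny two-layer network with a sigmoid hidden unit and squared error; all values and the network shape are illustrative, not taken from the slides:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x, y = 2.0, 1.0                      # one training example
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2 # scalar parameters

# Forward pass: a1 -> h1 -> a2 -> y_hat -> loss
a1 = w1 * x + b1
h1 = sigmoid(a1)
a2 = w2 * h1 + b2
y_hat = a2                           # linear output unit
loss = (y - y_hat) ** 2

# Backward pass: chain rule, factor by factor
dL_dyhat = -2 * (y - y_hat)          # dL/dy_hat
dyhat_da2 = 1.0                      # linear output
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)              # sigmoid'(a1)
da1_dw1 = x

dL_dw1 = dL_dyhat * dyhat_da2 * da2_dh1 * dh1_da1 * da1_dw1
print(dL_dw1)                        # gradient of the loss w.r.t. the first-layer weight
```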
65
Variants of Gradient Descent

66
Types of Gradient Descent
➢ Depending on the number of samples used to estimate the gradient of loss
w.r.t. the parameters, there are 3 types of gradient descent:
➢ Batch Gradient Descent

➢ Stochastic Gradient Descent

➢ Mini-Batch Gradient Descent

➢ Reasons for these 3 types of gradient descent:


➢ Computational efficiency

➢ Accuracy of estimated gradient

67
Some Terminology
➢ Epoch – One epoch of training is said to be complete when every sample in the training dataset has been used for gradient calculation and parameter update
➢ Batch size (𝑏) – Number of samples used in one gradient computation
➢ One batch corresponds to one epoch if all the samples in the dataset are used for computing the gradient

68
Batch Gradient Descent
• Parameters are updated by estimating the gradient using all the 𝑁 samples, i.e., batch size 𝑏 = 𝑁
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 [(∂ℒ/∂𝑤1)₁ + (∂ℒ/∂𝑤1)₂ + ⋯ + (∂ℒ/∂𝑤1)_𝑁]
• In this case, the parameters are updated only once in an epoch
• Advantages: Convergence is guaranteed in this case; gradients are estimated in an unbiased manner
• Disadvantage: Slow to converge with large datasets

69
Stochastic Gradient Descent
• Parameters are updated by estimating the gradient using a single sample, i.e., batch size 𝑏 = 1
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 (∂ℒ/∂𝑤1)ᵢ   (using the 𝑖-th sample, for 𝑖 = 1, …, 𝑁)
• In this case, the parameters are updated 𝑁 times in an epoch
• Advantage: Very fast at reducing the loss function
• Disadvantages: Too much variance in the gradient calculation, and learning might not be stable; convergence cannot be guaranteed

70
Mini-Batch Gradient Descent
• A balance between batch and stochastic gradient descent
• Parameters are updated by using the gradient computed over a small batch of samples, i.e., 1 < 𝑏 < 𝑁
  𝑤1(new) = 𝑤1(old) − 𝛼 ∂ℒ/∂𝑤1 = 𝑤1(old) − 𝛼 [(∂ℒ/∂𝑤1)₁ + ⋯ + (∂ℒ/∂𝑤1)_𝑏]
• In this case, the parameters are updated 𝑁/𝑏 times in an epoch
• Advantages: Faster than batch gradient descent; less variance and more stability compared to stochastic gradient descent
• Convergence cannot be guaranteed, but this is the most preferred variant (see the sketch below)
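A minimal sketch of mini-batch updates in Python, reusing the same kind of linear-regression toy problem as the earlier training-loop sketch; the batch size and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3

w, b, alpha, batch_size = np.zeros(3), 0.0, 0.1, 20   # 1 < b < N

for epoch in range(50):
    order = rng.permutation(len(y))                   # shuffle samples each epoch
    for start in range(0, len(y), batch_size):        # N/b parameter updates per epoch
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb
        w -= alpha * (2 * Xb.T @ err / len(idx))      # gradient estimated on the mini-batch only
        b -= alpha * (2 * err.mean())

print(w, b)   # ≈ [2, -1, 0.5] and 0.3
```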

71
Variants of Gradient Descent - Visualisation
[Figure: contour plot of the loss with the paths taken by batch, mini-batch and stochastic gradient descent]
• An ellipse with a larger radius indicates a higher value of the loss function
Image Source: medium.com

72
Batch Normalisation

73
Why Batch Normalisation?
• Intuition: All the samples can be considered to be drawn from a multi-variate distribution
• If batch gradient descent is performed, then the distribution of samples for each batch of inputs remains the same
• Issue: However, with stochastic and mini-batch gradient descent, the distribution varies from one batch to another

74
Why Batch Normalisation?
• Illustration: Suppose the input samples (𝒙) are fed to the DNN in multiple mini-batches
• In each iteration, the samples change and the distribution of samples also changes
• Weights and biases would have to adjust to a different distribution of inputs in each iteration
• Learning (fitting of weights and biases to inputs) would be hard
• How to overcome this issue? – Batch Normalisation
75
Applying Batch Normalisation
• Inputs to each layer are normalised to be unit gaussians before the activation function
• Mean and variance are calculated across the set of samples in that batch for performing the normalisation
• This normalisation step is differentiable, and hence we can backpropagate through it
• Advantage: Learning is much faster and leads to better convergence
Ref: Article by Johann Huber – Batch normalization in 3 levels of understanding
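A bare-bones NumPy sketch of the normalisation step (per-feature mean and variance over one mini-batch); the epsilon and the learnable scale/shift used in practice are assumptions added here, not taken from the slides:

```python
import numpy as np

def batch_norm_forward(A, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalise a (batch_size, features) matrix of pre-activations to zero mean, unit variance.

    gamma/beta are the learnable scale and shift commonly used in practice (assumed here)."""
    mean = A.mean(axis=0)                 # per-feature mean over the mini-batch
    var = A.var(axis=0)                   # per-feature variance over the mini-batch
    A_hat = (A - mean) / np.sqrt(var + eps)
    return gamma * A_hat + beta

batch = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 240.0]])
print(batch_norm_forward(batch))          # each column now has mean ≈ 0 and variance ≈ 1
```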

76
Regularisation

77
Regularisation – Motivation
DNN Model Training Objectives:
• Finding a model which is a good fit to the given data
• Model should be able to generalise over unseen data (test data)
• Prediction error should be low both on training and test data
Issues:
• In DNNs, there is the issue of overfitting due to the large number of parameters
• Test error could be high due to overfitting
Solution:
• To obtain a balanced model, regularisation is performed
[Figure: training error and test error vs. model complexity; the test error is lowest at an ideal model complexity]
78
Finding Ideal Model Complexity
• To get near the ideal model complexity, an additional component is added to the loss function:
  ℒ(𝜽) + 𝜆 Ω(𝜽)
• Ω(𝜽) is the regularisation term which regularises the model
• Ω(𝜽) ensures that the model is neither too complex nor too simple
• 𝜆 is the regularisation rate (hyperparameter)
• 𝜆 determines how much the model is to be regularised

79
Types of Regularization
➢ 𝑙1 and 𝑙2 regularization
➢ Early stopping
➢ Ensemble Methods
➢ Dropout

80
𝑙1 and 𝑙2 Regularisation
➢ The regularisation term is the 𝑙1 or 𝑙2 norm of the vector of weights in the neural network
➢ This introduces a constraint over the parameters
➢ Pushes some of the weights towards zero
➢ Some neuron connections become negligible (no impact on the output) and the overall complexity reduces

Loss function with 𝑙1 regularisation term:  ℒ̄(𝜽) = ℒ(𝜽) + 𝜆 ‖𝜽‖₁
Loss function with 𝑙2 regularisation term:  ℒ̄(𝜽) = ℒ(𝜽) + (𝜆/2) ‖𝜽‖₂²
𝜽 − vector containing all the weights and biases of the neural network
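A small sketch of adding an 𝑙2 penalty to a loss and its gradient in Python (assuming the squared 𝑙2 norm, as is common; the names and numbers are illustrative):

```python
import numpy as np

def regularised_loss_and_grad(theta, data_loss, data_grad, lam=0.01):
    """L_bar(theta) = L(theta) + (lam/2) * ||theta||_2^2 and its gradient."""
    loss = data_loss + 0.5 * lam * np.sum(theta ** 2)
    grad = data_grad + lam * theta            # the penalty's gradient simply adds lam * theta
    return loss, grad

theta = np.array([0.5, -2.0, 1.0])
loss, grad = regularised_loss_and_grad(theta, data_loss=1.3, data_grad=np.array([0.1, -0.4, 0.2]))
print(loss, grad)   # the penalty nudges every weight towards zero at each update
```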

81
Early Stopping
• Prediction error on a validation set is tracked
• A new hyperparameter called the patience parameter 𝑝 is introduced
• Check if there is improvement in the validation error for 𝑝 continuous iterations
• If not, stop and take the model from before those 𝑝 iterations
[Figure: training error keeps decreasing while the validation error starts rising; training is stopped at iteration 𝑘, and the model from iteration 𝑘 − 𝑝 is kept]

82
Ensemble Methods
• Different models are trained for the same task using different features, hyperparameters, samples, etc.
• Outputs of these models are combined to reduce the prediction error, e.g., for an ensemble of 2 DNN models: ŷ = (𝑦1 + 𝑦2)/2
• Similar to random forests or bagging of trees
• Computationally very expensive and hence not preferred

83
Dropout
• Refers to dropping out neurons during training
• For an iteration, some neurons with all their connections are removed (made inactive)

84
Dropout
• Refers to dropping out neurons during training
• For an iteration, some neurons with all their connections are removed (made inactive)
• Feed forward calculation and backpropagation happen only over the active connections
• The weights and biases of only the active connections are updated
• Effectively, learning happens on a different neural network in each iteration
• The output is equivalent to an ensembled output (a sketch of the masking step follows below)
[Figure: the network at the 𝑖-th iteration with some neurons dropped]
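A minimal sketch of the dropout masking step in NumPy; the keep probability and the inverted-dropout rescaling are common practice assumed here, not taken from the slides:

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    """Randomly zero out activations during training; at test time the layer is left unchanged."""
    if not training:
        return h
    mask = rng.random(h.shape) < keep_prob       # 1 = neuron stays active, 0 = dropped
    return h * mask / keep_prob                  # inverted dropout: rescale so the expected value is unchanged

h = np.array([0.8, 3.25, 0.0, 1.7])
print(dropout(h))                    # some activations are zeroed for this iteration
print(dropout(h, training=False))    # unchanged at inference time
```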
85
Module III: Elements of NLP

86
Module III

➢ Elements of NLP: Expression, word, corpora,


➢ Token and tokenization, word normalization, lemmatization,
stemming, sentence segmentation, sequence labelling, context-free
grammars
➢ Slide Courtesy: Dan Jurafsky and James H. Martin
➢ Errors: Nirav

87
88

Regular expressions are used everywhere


◦ Part of every text processing task
• Not a general NLP solution
• But very useful as part of those systems (e.g., for pre-processing or text formatting)
◦ Necessary for data analysis of text data
◦ A widely used tool in industry and academics

88
90

Language

◦ A collection of strings
◦ Formal Language Theory to understand different
languages (including programming languages)
◦ Important aspect: Regular Expression
• It is a way to define a language
• Simple application: text matching and searching

90
Regular expressions

• A formal language for specifying text strings


• How can we search for mentions of these cute animals in text?

◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Groundhog
◦ groundhogs
91
Regular Expressions: Disjunctions
⚫ Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any one digit
⚫ Ranges using the dash [A-Z]
Pattern   Matches
[A-Z]     An upper case letter   "Drenched Blossoms"
[a-z]     A lower case letter    "my beans were impatient"
[0-9]     A single digit         "Chapter 1: Down the Rabbit Hole"
92
Regular Expressions: More Disjunction

⚫Groundhog is another name for woodchuck!


⚫The pipe symbol | for disjunction

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
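A quick illustration of these disjunction patterns with Python's re module (the sample sentence is made up):

```python
import re

text = "The Woodchuck and the groundhog are the same animal; a woodchuck chucks wood."

print(re.findall(r"[wW]oodchuck", text))            # ['Woodchuck', 'woodchuck']
print(re.findall(r"groundhog|woodchuck", text))     # the pipe | is disjunction of whole patterns
print(re.findall(r"[A-Z]", text))                   # every upper-case letter
print(re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))   # ['1']
```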
93
How many words in a sentence?
⚫ "I do uh main- mainly business data processing"
   – "uh" is a filled pause; "main-" is a fragment
⚫ "Seuss's cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
  • cat and cats = same lemma
◦ Wordform: the full inflected surface form
  • cat and cats = different wordforms

94
How many words in a sentence?
⚫ Word Type: an element of the vocabulary (or number of
distinct words in a corpus)

⚫ Token (Word Instance): an instance of that type in running


text.

They lay back on the San Francisco grass and looked at the
stars and their

⚫ How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)

96
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Heaps' Law (= Herdan's Law): |V| = kN^β, where often .67 < β < .75
i.e., vocabulary size grows faster than the square root of the number of word tokens

Corpus                           Tokens = N     Types = |V|
Switchboard phone conversations  2.4 million    20 thousand
Shakespeare                      884,000        31 thousand
COCA                             440 million    2 million
Google N-grams                   1 trillion     13+ million
97
Corpora

Words don't appear out of nowhere!


A text is produced by
• a specific writer(s),

• at a specific time,

• in a specific variety,

• of a specific language,

• for a specific function.

98
Corpora vary along dimension like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
• AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but hoshla rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity,
SES
99
Corpus datasheets
Gebru et al (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
⚫ +Annotation process, language variety,
demographics, etc.
100
Basic Text Processing ⚫ Words and Corpora

101
Basic Text Processing ⚫ Word tokenization

102
Text Normalization

⚫ Every NLP task requires text normalization:


1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

103
Space-based tokenization

⚫ A very simple way to tokenize


◦ For languages that use space characters between words
• Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces

104
Issues in Tokenization
⚫Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
⚫Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
⚫When should multiword expressions (MWE) be
words?
◦ New York, rock ’n’ roll
105
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!

How do we decide where the token boundaries


should be?

107
Word tokenization in Chinese

Chinese words are composed of characters called


"hanzi" (or sometimes just "zi")
Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
But deciding what counts as a word is complex and not
agreed upon.

108
How to do word tokenization in Chinese?
⚫ 姚明进入总决赛
“YaoMing reaches the finals”
3 words?
⚫姚明 进入 总决赛
Yao Ming reaches finals
5 words?
⚫姚 明 进入 总 决赛
Yao Ming reaches overall finals
7 characters? (don't use words at all):
⚫姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game

109
Word tokenization / segmentation

So in Chinese it's common to just treat each character


(zi) as a token.
• So the segmentation step is very simple

In other languages (like Thai and Japanese), more


complex word segmentation is required.
• The standard algorithms are neural sequence
models trained by supervised machine learning.

110
Another option for text tokenization

Instead of
• white-space segmentation

• single-character segmentation

Use the data to tell us how to tokenize.


Subword tokenization (because tokens can be
parts of words as well as whole words)

111
Subword tokenization

⚫ Three common algorithms:


◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization
(Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
⚫ All have 2 parts:
◦ A token learner that takes a raw training corpus and
induces a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
112
Byte Pair Encoding (BPE) token learner

Let vocabulary be the set of all individual


characters
= {A, B, C, D,…, a, b, c, d….}
⚫Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
⚫ Until k merges have been done.
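A compact sketch of the BPE token learner in Python; the toy word counts mirror the corpus shown a few slides later, and the end-of-word symbol follows the BPE addendum slide (this is an illustrative sketch, not the textbook's reference code):

```python
from collections import Counter

def learn_bpe(word_counts, k):
    """Learn k BPE merges from a dict of {word: count}; words start as character sequences."""
    corpus = {tuple(word) + ("_",): n for word, n in word_counts.items()}  # '_' marks end of word
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, n in corpus.items():                     # count adjacent symbol pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                      # most frequent adjacent pair
        merges.append(best)
        merged_corpus = {}
        for symbols, n in corpus.items():                     # replace every adjacent A B with AB
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged_corpus[tuple(out)] = n
        corpus = merged_corpus
    return merges

counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
print(learn_bpe(counts, k=4))   # first merges: ('e','r'), ('er','_'), ('n','e'), ...
```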
113
BPE token learner algorithm

114
Byte Pair Encoding (BPE) Addendum

Most subword algorithms are run inside space-


separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.

115
BPE token learner
Original (very fascinating ) corpus:

low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new
new

Add end-of-word tokens and separate each word into characters, giving the initial vocabulary and corpus representation (figure).

116
BPE token learner

Merge e r to er

117
BPE

Merge er _ to er_

118
BPE

Merge n e to ne

119
BPE

The next merges are:

120
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_,
etc.
⚫Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
121
Properties of BPE tokens

Usually include frequent words


And frequent subwords
• Which are often morphemes like -est or –er

A morpheme is the smallest meaning-bearing unit of


a language
• unlikeliest has 3 morphemes un-, likely, and -est

122
Basic Text Processing ⚫ Byte Pair Encoding

123
Word Normalization

⚫ Putting words/tokens in a standard format


• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are

124
Case folding

⚫Applications like IR: reduce all letters to lower case


◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
⚫For sentiment analysis, MT, Information extraction
◦ Case is helpful (US versus us is important)

125
Lemmatization

Represent all words as their lemma, their shared root


= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ Spanish quiero (‘I want’), quieres (‘you want’)
→ querer ‘want'
◦ He is reading detective stories
→ He be read detective story
126
Lemmatization is done by Morphological Parsing

⚫ Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
⚫ Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’)
into morpheme amar ‘to love’, and the morphological
features 3PL and future subjunctive. 127
Stemming
⚫ Reduce terms to stems, chopping off affixes
crudely

Original: This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.
Stemmed:  Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note .

128
Porter Stemmer

⚫ Based on a series of rewrite rules run in series


◦ A cascade, in which output of each pass fed to next pass
⚫ Some sample rules:
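The rule table itself is not reproduced here; as an alternative illustration, a hedged sketch using NLTK's PorterStemmer (assuming the nltk package is available; the outputs in the comment are typical of the Porter algorithm):

```python
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["running", "caresses", "grasses", "motoring", "relational"]:
    print(word, "->", stemmer.stem(word))
# typical outputs: run, caress, grass, motor, relat  (stems need not be real words)
```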

129
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very
ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization. 131
➢ Module IV:
➢N-gram Language models, lexical and vector semantics,
➢TF-IDF, word2vector, semantic properties of embeddings

132
N-gram Language Modeling
Predicting words
⚫ The water of Walden Pond is beautifully ...

blue
*refrigerator
green
*that
clear

133
Language Models

⚫ Systems that can predict upcoming words


• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence

134
Why word prediction?

It's a helpful part of language tasks


• Grammar or spell checking
  "Their are two midterms" → "There are two midterms"
  "Everything has improve" → "Everything has improved"
• Speech recognition
  "I will be back soonish" vs. "I will be bassoon dish"

135
Why word prediction?

It's how large language models (LLMs) work!


LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next
word
LLMs generate text by predicting words
• By predicting the next word over and over again

136
Language Modeling (LM) more formally

⚫ Goal: compute the probability of a sentence or


sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
⚫ Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
⚫ An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)

137
How to estimate these probabilities

⚫Could we just count and divide?
P(blue | The water of Walden Pond is so beautifully) =
   Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)

⚫No! Too many possible sentences!


⚫We’ll never see enough data for estimating these

138
How to compute P(W) or P(wn|w1, …wn-1)

⚫ How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

⚫ Intuition: let’s rely on the Chain Rule of Probability

139
Reminder: The Chain Rule

⚫Recall the definition of conditional probabilities


P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)

⚫More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
⚫The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

140
The Chain Rule applied to compute joint
probability of words in sentence

P(“The water of Walden Pond”) =


P(The) × P(water|The) × P(of|The water)
× P(Walden|The water of) × P(Pond|The water of
Walden)
141
Markov Assumption

⚫ Simplifying assumption: the probability of the next word depends only on the most recent word(s), e.g.,
   P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)
[Photo: Andrei Markov, Wikimedia commons]

142
Bigram Markov Assumption
⚫Instead of conditioning on the entire history, condition only on the previous word:
   P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)
⚫More generally, we approximate each component in the product:
   P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

143
Simplest case: Unigram model

P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from two different unigram models
To him swallowed confess hear both . Which . Of save on trail for
are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter

Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives


144
Bigram model

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

Some automatically generated sentences from two different bigram models:
Why dost stand forth thy canopy, forsooth; he is this palpable hit
the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N.


B. E. C. Taylor would seem to complete the major central planners one
gram point five percent of U. S. E. has already old M. X. corporation
of living

on information such as more frequently fishing to keep her


145
Problems with N-gram models

• N-grams can't handle long-distance dependencies:


“The soups that I made from that new cookbook I
bought yesterday were amazingly delicious."
• N-grams don't do well at modeling new sequences
with similar meanings
The solution: Large language models
• can handle much longer contexts
• because of using embedding spaces, can model
synonymy better, and generate better novel strings 146
Why N-gram models?

A nice clear paradigm that lets us introduce many


of the important issues for large language
models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
147
N-gram Language Modeling
⚫ Introduction to N-grams

148
N-gram Language Modeling
⚫ Estimating N-gram Probabilities

149
Estimating bigram probabilities

⚫ The Maximum Likelihood Estimate:
   P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

150
An example
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I|<s>) = 2/3    P(am|I) = 2/3    P(Sam|am) = 1/2
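A short sketch that computes these maximum likelihood bigram estimates from the three example sentences (pure Python, illustrative):

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))          # adjacent word pairs

def p(w, prev):
    """Maximum likelihood estimate P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"), p("am", "I"), p("Sam", "am"))   # 2/3, 2/3, 1/2
```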

151
More examples:
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse

⚫i’m looking for a good place to eat breakfast

⚫when is caffe venezia open during the day

152
Raw bigram counts

⚫ Out of 9222 sentences

153
Raw bigram probabilities

⚫Normalize by unigrams:

⚫Result:

154
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =


P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
155
What kinds of knowledge do N-grams represent?

⚫P(english|want) = .0011
⚫P(chinese|want) = .0065

⚫P(to|want) = .66

⚫P(eat | to) = .28

⚫P(food | to) = 0

⚫P(want | spend) = 0

⚫P (i | <s>) = .25

156
Dealing with scale in large n-grams

⚫ LM probabilities are stored and computed


in log format, i.e. log probabilities
⚫This avoids underflow from multiplying

many small numbers


log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
If we need probabilities we can do one exp at the end

157
N-gram Language Modeling
⚫ Estimating N-gram Probabilities

160
Language Modeling
⚫ Evaluation and Perplexity

161
How to evaluate N-gram models

⚫ "Extrinsic (in-vivo) Evaluation"


To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B 162
Intrinsic (in-vitro) evaluation

⚫ Extrinsic evaluation not always possible


• Expensive, time-consuming
• Doesn't always generalize to other applications
⚫ Intrinsic evaluation: perplexity
• Directly measures language model performance at
predicting words.
• Doesn't necessarily correspond with real application
performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
163
Training sets and test sets

We train parameters of our model on a training


set.
We test the model’s performance on data we
haven’t seen.
◦ A test set is an unseen dataset; different from training
set.
• Intuition: we want to measure generalization to unseen data
◦ An evaluation metric (like perplexity) tells us how
well our model does on the test set.

164
Choosing training and test sets

• If we're building an LM for a specific task


• The test set should reflect the task language
we want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training
data
• We don't want the training set or the test set
to be just from one domain or author or
language.
165
166

Training on the test set

We can’t allow test sentences into the training


set
• Or else the LM will assign that sentence an artificially
high probability when we see it in the test set
• And hence assign the whole test set a falsely high
probability.
• Making the LM look better than it really is
This is called “Training on the test set”
Bad science!
166
Dev sets

• If we test on the test set many times we might


implicitly tune to its characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times

• That means we need a third dataset:


• A development test set or, devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
167
Intuition of perplexity as evaluation metric: How
good is our language model?
Intuition: A good LM prefers "real" sentences
• Assign higher probability to “real” or “frequently
observed” sentences
• Assigns lower probability to “word salad” or
“rarely observed” sentences?

168
Intuition of perplexity 2: Predicting upcoming words
The Shannon Game: How well can we predict the next word?
• Once upon a ____
• That is a picture of a ____
• For breakfast I ate my usual ____
Example distribution over the next word: time 0.9, dream 0.03, midnight 0.02, …, and 1e-100
Unigrams are terrible at this game (Why?)
[Photo: Claude Shannon]

A good LM is one that assigns a higher probability


to the next word that actually occurs
169
Picture credit: Historiska bildsamlingen
https://creativecommons.org/licenses/by/2.0/
Intuition of perplexity 3: The best language
model is one that best predicts the entire unseen
test set
• We said: a good LM is one that assigns a higher
probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test
set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (=be less
surprised by) the test set than the other LM.
170
Intuition of perplexity 4: Use perplexity instead of
raw probability
• Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test
set, normalized by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)
171
Intuition of perplexity 5: the inverse

Perplexity is the inverse probability of the test set, normalized


by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)
(The inverse comes from the original definition of perplexity
from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
172
Intuition of perplexity 6: N-grams
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:  PP(W) = (∏ᵢ₌₁ᴺ 1 / P(wi | w1 … wi-1))^(1/N)

Bigrams:  PP(W) = (∏ᵢ₌₁ᴺ 1 / P(wi | wi-1))^(1/N)
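A tiny sketch of computing bigram perplexity in log space (the probability function here is a uniform stand-in; in practice it would come from a trained LM):

```python
import math

def perplexity(tokens, bigram_prob):
    """PP(W) = exp(-(1/N) * sum(log P(w_i | w_{i-1}))), computed in log space to avoid underflow."""
    log_sum = 0.0
    n = 0
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob(w, prev))
        n += 1
    return math.exp(-log_sum / n)

# Toy language from the branching-factor slide: every word equally likely given any previous word
uniform = lambda w, prev: 1.0 / 3.0
print(perplexity("<s> red red red red blue".split(), uniform))   # 3.0
```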

173
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓
Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(-1/5) = ((⅓)^5)^(-1/5) = (⅓)^(-1) = 3
⚫But now suppose red was very likely in the training set, such that for LM B:
◦ P(red) = .8, P(green) = .1, P(blue) = .1
⚫We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(-1/5) = (.8 × .8 × .8 × .8 × .1)^(-1/5) = .04096^(-1/5) = (.527)^(-1) = 1.89
174


Holding test set constant:
Lower perplexity = better language model

⚫Training on 38 million words, testing on 1.5 million words of WSJ text:

N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109
175
Language Modeling
⚫ Evaluation and Perplexity

176
Language Modeling
⚫ Sampling and Generalization

177
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
Claude Shannon


PRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN
DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF
TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE
HAD BE THESE.

⚫ Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED. 178
How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the


page. This letter is recorded. The book is then opened to another
page and one reads until this letter is encountered. The
succeeding letter is then recorded. Turning to another page this
second letter is searched for and the succeeding letter recorded,
etc." 179
Sampling a word from a distribution
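A quick sketch of sampling a word from a probability distribution in Python (the toy distribution is illustrative):

```python
import random

words = ["the", "of", "a", "to", "polyphonic"]
probs = [0.35, 0.25, 0.20, 0.19, 0.01]          # must sum to 1

# random.choices picks according to the given weights, like rolling a biased die
sampled = random.choices(words, weights=probs, k=10)
print(sampled)                                   # frequent words appear often, rare words rarely
```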

180
Visualizing Bigrams the Shannon Way
⚫Choose a random bigram (<s>, w) according to its probability p(w|<s>)
⚫Now choose a random bigram (w, x) according to its probability p(x|w)
⚫And so on until we choose </s>
⚫Then string the words together

<s> I
    I want
      want to
           to eat
              eat Chinese
                  Chinese food
                          food </s>
I want to eat Chinese food
181
Approximating Shakespeare

183
Shakespeare as corpus

N=884,647 tokens, V=29,066


Shakespeare produced 300,000 bigram types out of
V2= 844 million possible bigrams.
◦ So 99.96% of the possible bigrams were never seen (have
zero entries in the table)
◦ That sparsity is even worse for 4-grams, explaining why
our sampling generated actual Shakespeare.

184
The Wall Street Journal is not Shakespeare

185
186

Can you guess the author? These 3-gram sentences


are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and gram Brazil
on market conditions
2) This shall forbid it should be
branded, if renown made it empty.
3) “You are uniformly charming!” cried
he, with a smile of associating and now
and then I bowed and they perceived a
chaise and four to wish for.
186
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set

188
Zeros
⚫ Training set: • Test set
… ate lunch … ate lunch
… ate dinner … ate breakfast
… ate a
… ate the
P(“breakfast” | ate) = 0

189
Zero probability bigrams

Bigrams with zero probability


◦ Will hurt our performance for texts where those words
appear!
◦ And mean that we will assign 0 probability to the test
set!
And hence we cannot compute perplexity (can’t
divide by 0)!

190
N-gram Language Modeling
⚫ Smoothing, Interpolation, and Backoff

191
The intuition of smoothing (from Dan Klein)
⚫ When we have sparse statistics:
P(w | denied the)
  3 allegations
  2 reports
  1 claims
  1 request
  7 total
⚫ Steal probability mass to generalize better:
P(w | denied the)
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2 other
  7 total
192
Add-one estimation or Laplace Smoothing
⚫ Pretend we saw each word one more time than we did
⚫ Just add one to all the counts!
⚫ MLE estimate:    P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
⚫ Add-1 estimate:  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
193
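A small sketch contrasting the two estimates; the counts and vocabulary size below are made up for illustration:

from collections import Counter

# Hypothetical bigram/unigram counts from a tiny training corpus
bigram_count  = Counter({("ate", "lunch"): 3, ("ate", "dinner"): 2, ("ate", "a"): 1, ("ate", "the"): 1})
unigram_count = Counter({"ate": 7})
V = 10_000  # assumed vocabulary size

def p_mle(w_prev, w):
    return bigram_count[(w_prev, w)] / unigram_count[w_prev]

def p_add1(w_prev, w):
    return (bigram_count[(w_prev, w)] + 1) / (unigram_count[w_prev] + V)

print(p_mle("ate", "breakfast"))   # 0.0 -> unseen bigram gets zero probability
print(p_add1("ate", "breakfast"))  # small but nonzero, so perplexity stays computable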
Maximum Likelihood Estimates
⚫ The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
⚫Suppose the word “bagel” occurs 400 times in a corpus of a million
words
⚫What is the probability that a random word from some other text will be
“bagel”?
⚫MLE estimate is 400/1,000,000 = .0004
⚫This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400
times in a million word corpus.

194
Berkeley Restaurant Project sentences
⚫can you tell me about any good cantonese restaurants close
by
⚫tell me about chez panisse

⚫i’m looking for a good place to eat breakfast

⚫when is caffe venezia open during the day

195
Raw bigram counts

⚫ Out of 9222 sentences

196
Berkeley Restaurant Corpus: Laplace smoothed
bigram counts

197
Laplace-smoothed bigrams

198
Reconstituted counts

199
Compare with raw bigram counts
(tables: original raw counts vs. Laplace-smoothed reconstituted counts)
Add-1 estimation is a blunt instrument

⚫So add-1 isn’t used for N-grams:


◦ Generally we use interpolation or backoff instead
⚫But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.

201
Backoff and Interpolation

⚫ Sometimes it helps to use less context


◦ Condition on less context for contexts you know less
about
⚫ Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
⚫ Interpolation:
◦ mix unigram, bigram, trigram

Interpolation works better 202


Linear Interpolation
⚫ Simple interpolation:
P̂(w_n | w_{n-2} w_{n-1}) = λ1 P(w_n) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n | w_{n-2} w_{n-1}),  with Σ_i λ_i = 1
⚫ Lambdas conditional on context: each λ_i can itself be a function of the preceding words
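A sketch of simple interpolation; the component estimates and λ values below are placeholders (in practice the component models come from training counts and the λs are tuned on held-out data):

# Toy stand-ins for unigram / bigram / trigram estimates from a real corpus
p_uni = lambda w: 0.001
p_bi  = lambda w, w2: 0.01
p_tri = lambda w, w1, w2: 0.0      # unseen trigram

def p_interp(w, w1, w2, lambdas=(0.1, 0.3, 0.6)):
    """P(w | w1 w2) as a weighted mix of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas           # must sum to 1
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

print(p_interp("pancakes", "delicious", "soufflé"))  # nonzero even though the trigram count is 0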
How to set λs for interpolation?
⚫ Use a held-out corpus:
Training Data | Held-Out Data | Test Data
⚫ Choose λs to maximize the probability of the held-out data:
◦ Fix the N-gram probabilities (on the training data)
◦ Then search for the λs that give the largest probability to the held-out set
Backoff
Suppose you want P(pancakes | delicious soufflé)
⚫ If the trigram probability is 0, use the bigram P(pancakes | soufflé)
⚫ If the bigram probability is 0, use the unigram P(pancakes)
Complication: we need to discount the higher-order n-grams so the probabilities don't sum to more than 1 (e.g., Katz backoff). A control-flow sketch follows.
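A minimal sketch of that backoff control flow; the toy component estimates are invented, and a real scheme (e.g. Katz backoff) would also apply discounting so the distribution still sums to 1:

# Toy stand-ins for n-gram estimates (both higher-order estimates are zero here)
p_uni = lambda w: 0.001
p_bi  = lambda w, w2: 0.0
p_tri = lambda w, w1, w2: 0.0

def p_backoff(w, w1, w2):
    """Fall back to lower-order estimates when the higher-order estimate is zero."""
    if p_tri(w, w1, w2) > 0:
        return p_tri(w, w1, w2)
    if p_bi(w, w2) > 0:
        return p_bi(w, w2)
    return p_uni(w)

print(p_backoff("pancakes", "delicious", "soufflé"))  # falls all the way back to the unigram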
Vector Semantics & Embeddings
What do words mean?
⚫ N-gram or text classification methods we've seen so
far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
⚫ Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
⚫ That seems hardly better!
208
Desiderata: needed or wanted?

⚫ What should a theory of word meaning do for


us?
⚫Let's look at some desiderata

⚫From lexical semantics, the linguistic study of


word meaning

209
Lemmas and senses
lemma: mouse (N)
  sense 1: any of numerous small rodents...
  sense 2: a hand-operated device that controls a cursor...
A sense or "concept" is the meaning component of a word
Lemmas can be polysemous (have multiple senses)
Modified from the online thesaurus WordNet
Relations between senses: Synonymy

⚫ Synonyms have the same meaning in some or all


contexts.
◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O

211
Relations between senses: Synonymy

⚫ Note that there are probably no examples of


perfect synonymy.
◦ Even if many aspects of meaning are identical
◦ Still may differ based on politeness, slang, register,
genre, etc.

212
Relation: Synonymy?
water / H2O: would you say "H2O" in a surfing guide?
big / large: "my big sister" != "my large sister"
213
The Linguistic Principle of Contrast

⚫Difference in form → difference in


meaning

214
Abbé Gabriel Girard 1718, re: "exact" synonyms:
"[I do not believe that there is a synonymous word in any language]"
Thanks to Mark Aronoff!
Relation: Similarity

Words with similar meanings. Not synonyms, but sharing


some element of meaning

car, bicycle
cow, horse

216
Ask humans how similar 2 words are

word1 word2 similarity


vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98

SimLex-999 dataset (Hill et al., 2015)


217
Relation: Word relatedness

⚫Also called "word association"


⚫Words can be related in any way, perhaps via a semantic
frame or field

• coffee, tea: similar


• coffee, cup: related, not similar

218
Semantic field

⚫Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, menu, chef
houses
door, roof, kitchen, family, bed
219
Relation: Antonymy

⚫ Senses that are opposites with respect to only one


feature of meaning
⚫Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
⚫ More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
• long/short, fast/slow
• rise/fall, up/down

220
Connotation (sentiment)

• Words have affective meanings


• Positive connotations (happy)
• Negative connotations (sad)
• Connotations can be subtle:
• Positive connotation: copy, replica, reproduction
• Negative connotation: fake, knockoff, forgery
• Evaluation (sentiment!)
• Positive evaluation (great, love)
• Negative evaluation (terrible, hate)
221
Connotation
Osgood et al. (1957)
⚫ Words seem to vary along 3 affective (related to
feelings) dimensions:
◦ valence: the pleasantness of the stimulus
◦ arousal: the intensity of emotion provoked by the stimulus
◦ dominance: the degree of control exerted by the stimulus
Word Score Word Score
Valence love 1.000 toxic 0.008
happy 1.000 nightmare 0.005
Arousal elated 0.960 mellow 0.069
frenzy 0.965 napping 0.046
Dominance powerful 0.991 weak 0.045
leadership 0.983 empty 0.081
222
Values from NRC VAD Lexicon (Mohammad 2018)
So far

⚫ Concepts or word senses


◦ Have a complex many-to-many association with words
(homonymy, multiple senses)
⚫ Have relations with each other
◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation

223
Vector Semantics & Embeddings

224
Computational models of word meaning

⚫ Can we build a theory of how to represent word


meaning, that accounts for at least some of the desiderata?
⚫We'll introduce vector semantics

⚫ The standard model in language processing!


⚫ Handles many of our goals!

225
Ludwig Wittgenstein

⚫PI #43:
"The meaning of a word is its use in the language"

226
Let's define words by their usages

⚫ One way to define "usage":


words are defined by their environments (the words around
them)
If A and B have almost identical environments we
say that they are synonyms.

227
What does recent English borrowing ongchoi mean?
⚫ Suppose you see these sentences:
• Ongchoi is delicious sautéed with garlic.
• Ongchoi is superb over rice
• Ongchoi leaves with salty sauces
⚫ And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
⚫ Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard greens
• We could conclude this based on words like "leaves" and "delicious" and "sautéed"
Idea 1: Defining meaning by linguistic distribution
Let's define the meaning of a word by its
distribution in language use, meaning its
neighboring words or grammatical environments.

230
Idea 2: Meaning as a point in space
(Osgood et al. 1957)
⚫ 3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted
Word Score Word Score
Valence love 1.000 toxic 0.008
happy 1.000 nightmare 0.005
Arousal elated 0.960 mellow 0.069
frenzy 0.965 napping 0.046
Dominance powerful 0.991 weak 0.045
◦ leadership 0.983 empty 0.081

⚫ Hence the connotation of a word is a vector in 3-space 231


Idea 1: Defining meaning by linguistic distribution

Idea 2: Meaning as a point in multidimensional space

232
Defining meaning as a point in space based on distribution

⚫Each word = a vector (not just "good" or "w45")


⚫Similar words are "nearby in semantic space"

⚫We build this space automatically by seeing which words are


nearby in text

233
We define meaning of a word as a vector

⚫ Called an "embedding" because it's embedded into a


space

⚫ The standard way to represent meaning in NLP


Every modern NLP algorithm uses embeddings
as the representation of word meaning

⚫ Fine-grained model of meaning for similarity


234
Intuition: why vectors?

⚫Consider sentiment analysis (+, -, Neutral):


◦ With words, a feature is a word identity
• Feature 5: 'The previous word was "terrible"'
• requires exact same word to be in training and test

◦ With embeddings:
• Feature is a word vector
• 'The previous word was vector [35,22,17…]
• Now in the test set we might see a similar vector [34,21,14]
• We can generalize to similar but unseen words!!!

235
We'll discuss 2 kinds of embeddings
⚫ tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of
nearby words
⚫ Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings 236
⚫ Words and Vectors

238
Vectors and Documents
Term-document matrix (the works of Shakespeare)
Each document is represented by a vector of word counts
Each column vector represents a document as a point in |V|-dimensional space
Visualizing document vectors

240
Vectors are the basis of information retrieval

Vectors are similar for the two comedies

But comedies are different than the other two


Comedies have more fools and wit and fewer
battles. 241
Idea for word meaning: Words can be vectors too!!!
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
Vector for the word "fool": [36, 58, 1, 4]
More common: word-word matrix
(or "term-context matrix")

⚫Two words are similar in meaning if their context vectors are


similar

243
244
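A minimal sketch of building such a word-word (term-context) co-occurrence matrix with a ±2 word window; the toy corpus is invented:

from collections import defaultdict

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 2
cooc = defaultdict(lambda: defaultdict(int))   # cooc[word][context_word] = count

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        # Count every context word within `window` positions of w
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[w][words[j]] += 1

print(dict(cooc["like"]))  # {'i': 2, 'deep': 1, 'learning': 1, 'nlp': 1}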
Cosine for computing word similarity

245
Computing word similarity:
Dot product and cosine
⚫ The dot product between two vectors is a
scalar:

⚫ The dot product tends to be high when the two


vectors have large values in the same dimensions
⚫Dot product can thus be a useful similarity
metric between vectors
246
Problem with raw dot-product

⚫Dot product favors long vectors


⚫ Dot product is higher if a vector is longer (has higher values in many dimensions)
⚫Vector length:

⚫Frequent words (of, the, you) have long vectors


(since they occur many times with other words).
⚫So dot product overly favors frequent words
247
Alternative:
cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b

248
Cosine as a similarity metric

⚫-1: vectors point in opposite directions


⚫+1: vectors point in same directions
⚫0: vectors are orthogonal

⚫ But since raw frequency values are non-negative,


the cosine for term-term matrix vectors ranges from
0–1
249
Cosine examples
cos(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i²) · sqrt(Σ_{i=1}^{N} w_i²) )

Counts:        pie   data   computer
cherry         442      8          2
digital          5   1683       1670
information      5   3982       3325
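A short numpy sketch computing cosine similarity for the count vectors in the table above:

import numpy as np

# Context-count vectors over the dimensions (pie, data, computer)
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(cherry, information))   # low (≈ 0.02): cherry lives in the "pie" dimension
print(cosine(digital, information))  # high (≈ 1.0): both live in the data/computer dimensions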
Visualizing cosines
(well, angles)

251
TF-IDF: Term Frequency-Inverse Document Frequency

252
But raw frequency is a bad representation
• The co-occurrence matrices we have seen represent
each cell by word frequencies.
• Frequency is clearly useful; if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, or they are not
very informative about the context
• It's a paradox! How can we balance these two conflicting
constraints?

253
Two common solutions for word weighting
⚫ tf-idf: the tf-idf value for word t in document d
◦ Words like "the" or "it" have very low idf
⚫ PMI (Pointwise Mutual Information):
PMI(w1, w2) = log [ p(w1, w2) / ( p(w1) p(w2) ) ]
◦ See if words like "good" appear more often with "great" than we would expect by chance
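A small sketch of PMI from co-occurrence counts; the counts below are invented for illustration:

import math

# Hypothetical counts over a corpus with N observed word-pair events
N = 1_000_000
count_w1w2 = 300        # "good" and "great" co-occur
count_w1   = 5_000      # "good"
count_w2   = 2_000      # "great"

p_w1w2 = count_w1w2 / N
p_w1   = count_w1 / N
p_w2   = count_w2 / N

pmi = math.log2(p_w1w2 / (p_w1 * p_w2))
print(pmi)   # > 0 means the pair co-occurs more often than chance would predict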
Term frequency (tf) in the tf-idf algorithm
⚫ We could imagine using the raw count:
tf_{t,d} = count(t, d)
⚫ But instead of using the raw count, we usually squash it a bit:
tf_{t,d} = log10( count(t, d) + 1 )
Document frequency (df)

⚫ dft is the number of documents t occurs in.


⚫(note this is not collection frequency: total count
across all documents)
⚫"Romeo" is very distinctive for one Shakespeare play:

256
Inverse document frequency (idf)
idf_t = log10( N / df_t ), where N is the total number of documents in the collection
What is a document?

⚫Could be a play or a Wikipedia article


⚫But for the purposes of tf-idf, documents can be
anything; we often call each paragraph a document!

258
Final tf-idf weighted value for a word
⚫ Raw counts: the cell value is count(t, d)
⚫ tf-idf: the cell value is w_{t,d} = tf_{t,d} × idf_t
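A sketch of this weighting over a toy document collection (the documents are invented; tf is the log-squashed count and idf is log10(N/df)):

import math
from collections import Counter

docs = {
    "doc1": "the fool and the wit".split(),
    "doc2": "the battle and the king".split(),
    "doc3": "the fool the fool the fool".split(),
}
N = len(docs)

def tf(term, doc):
    return math.log10(Counter(docs[doc])[term] + 1)            # squashed raw count

def idf(term):
    df = sum(1 for words in docs.values() if term in words)    # number of docs containing term
    return math.log10(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("fool", "doc3"))  # positive: frequent here, absent from doc2
print(tfidf("the", "doc3"))   # 0.0: "the" appears in every document, so idf = 0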
Vector Semantics & Embeddings: Word2vec
Sparse versus dense vectors

⚫ tf-idf (or PMI) vectors are


◦ long (length |V|= 20,000 to 50,000)
◦ sparse (most elements are zero)
⚫ Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)

261
Sparse versus dense vectors

⚫ Why dense vectors?


◦ Short vectors may be easier to use as features in
machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit
counts
◦ Dense vectors may do better at capturing synonymy:
• car and automobile are synonyms, but are distinct dimensions
• a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better
Common methods for getting short dense vectors

⚫“Neural Language Model”-inspired models


◦ Word2vec (skipgram, CBOW), GloVe
⚫Singular Value Decomposition (SVD)
◦ A special case of this is called LSA – Latent Semantic
Analysis
⚫Alternative to these "static embeddings":
• Contextual Embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
263
Simple static embeddings you can download!
⚫ Word2vec (Mikolov et al.)
https://code.google.com/archive/p/word2vec/
⚫ GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
Word2vec

⚫Popular embedding method


⚫Very fast to train

⚫Code available on the web

⚫Idea: predict rather than count

⚫Word2vec provides various options. We'll do:

⚫ skip-gram with negative sampling (SGNS)

265
Word2vec

⚫Instead of counting how often each word w occurs near


"apricot"
◦ Train a classifier on a binary prediction task:
• Is w likely to show up near "apricot"?
⚫We don’t actually care about this task
• But we'll take the learned classifier weights as the word embeddings
⚫ Big idea: self-supervision:
• A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
• No need for human labels
• Bengio et al. (2003); Collobert et al. (2011)
266
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word
c as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings

267
Skip-Gram Training Data
⚫ Assume a +/- 2 word window, given the training sentence:
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
Skip-Gram Classifier
(assuming a +/- 2 word window)
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
⚫ Goal: train a classifier that, given a candidate (word, context) pair
  (apricot, jam)
  (apricot, aardvark)
assigns each pair a probability:
  P(+ | w, c)
  P(− | w, c) = 1 − P(+ | w, c)
Similarity is computed from dot product

⚫ Remember: two vectors are similar if they have a


high dot product
◦ Cosine is just a normalized dot product
⚫ So:
◦ Similarity(w,c) ∝ w ∙ c
⚫ We’ll need to normalize to get a probability
◦ (cosine isn't a probability either)

270
Turning dot products into probabilities
⚫ Sim(w, c) ≈ w · c
⚫ To turn this into a probability, we'll use the sigmoid from logistic regression:
P(+ | w, c) = σ(c · w) = 1 / (1 + exp(−c · w))
P(− | w, c) = 1 − P(+ | w, c) = σ(−c · w)

How the Skip-Gram Classifier computes P(+ | w, c)
⚫ This is for one context word, but we have lots of context words.
⚫ We'll assume independence and just multiply them:
P(+ | w, c_{1:L}) = Π_{i=1}^{L} σ(c_i · w)
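A numpy sketch of how this classifier scores one (target, context) pair and then a whole window; the embeddings here are random stand-ins for learned vectors:

import numpy as np

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)                 # target-word embedding (e.g. "apricot")
contexts = rng.normal(size=(4, d))     # embeddings of the window words c1..c4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_pos(w, c):
    # P(+ | w, c) = sigma(c . w)
    return sigmoid(np.dot(c, w))

# Assuming independence, multiply over the context words in the window
p_window = np.prod([p_pos(w, c) for c in contexts])
print(p_window)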
Skip-gram classifier: summary

⚫A probabilistic classifier, given


• a test target word w
• its context window of L words c1:L
⚫ Estimates probability that w occurs in this window based
on similarity of w (embeddings) to c1:L (embeddings).

⚫ To compute this, we just need embeddings for all the


words.
273
These embeddings we'll need: a set for w, a set for c

274
Skip-Gram Training data
…lemon, a [tablespoon of apricot jam, a] pinch…
               c1       c2  [target] c3  c4
(positive examples: the target paired with each window word; for each positive example we'll grab k negative examples, sampling by frequency)
Word2vec: how to learn vectors

⚫ Given the set of positive and negative training


instances, and an initial set of embedding vectors
⚫The goal of learning is to adjust those word vectors
such that we:
◦ Maximize the similarity of the target word, context word
pairs (w , cpos) drawn from the positive data
◦ Minimize the similarity of the (w , cneg) pairs drawn from
the negative data.
278
Loss function for one w with cpos, cneg1 ... cnegk
⚫ Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words:
L_CE = −[ log σ(c_pos · w) + Σ_{i=1}^{k} log σ(−c_neg_i · w) ]
Learning the classifier

⚫How to learn?
◦ Stochastic gradient descent!

⚫We’ll adjust the word weights to


◦ make the positive pairs more likely
◦ and the negative pairs less likely,
◦ over the entire training set.

280
Intuition of one step of gradient descent

281
Reminder: gradient descent
• At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient, d/dw L(f(x; w), y), weighted by a learning rate η
• A higher learning rate means we move w faster

w^{t+1} = w^t − η · d/dw L(f(x; w), y)
The derivatives of the loss function
∂L_CE/∂c_pos = [σ(c_pos · w) − 1] w
∂L_CE/∂c_neg_i = σ(c_neg_i · w) w
∂L_CE/∂w = [σ(c_pos · w) − 1] c_pos + Σ_{i=1}^{k} σ(c_neg_i · w) c_neg_i
Update equation in SGD

Start with randomly initialized C and W matrices, then incrementally do updates

284
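A sketch of one SGD update for a single positive pair and its k negative samples, following the derivatives above; the embedding matrices are randomly initialized stand-ins and the word indices are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
V, d, k, eta = 1000, 50, 2, 0.1
W = rng.normal(scale=0.1, size=(V, d))   # target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w_id, cpos_id, cneg_ids):
    w, cpos = W[w_id].copy(), C[cpos_id].copy()
    # Gradient w.r.t. the target embedding accumulates over the positive and negative pairs
    g_w = (sigmoid(cpos @ w) - 1) * cpos
    C[cpos_id] -= eta * (sigmoid(cpos @ w) - 1) * w
    for cn_id in cneg_ids:
        cn = C[cn_id].copy()
        g_w += sigmoid(cn @ w) * cn
        C[cn_id] -= eta * sigmoid(cn @ w) * w
    W[w_id] -= eta * g_w

sgd_step(w_id=5, cpos_id=17, cneg_ids=rng.integers(0, V, size=k))   # one incremental update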
Two sets of embeddings

⚫ SGNS learns two sets of embeddings


⚫ Target embeddings matrix W
⚫ Context embedding matrix C
⚫It's common to just add them together,
representing word i as the vector wi + ci

285
Summary: How to learn word2vec (skip-gram) embeddings
⚫ Start with V random d-dimensional vectors as initial
embeddings
⚫ Train a classifier based on embedding similarity
◦ Take a corpus and take pairs of words that co-occur as
positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings.
286
The kinds of neighbors depend on window size
⚫ Small windows (C= +/- 2) : nearest words are
syntactically similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
•Sunnydale, Evernight, Blandings

⚫ Large windows (C= +/- 5) : nearest words are


related words in same semantic field
◦Hogwarts nearest neighbors are Harry Potter world:
•Dumbledore, half-blood, Malfoy

287
Analogical relations

⚫ The classic parallelogram model of analogical


reasoning (Rumelhart and Abrahamson 1973)
⚫To solve: "apple is to tree as grape is to _____"

⚫ Add tree – apple to grape to get vine

288
Analogical relations via parallelogram

⚫ The parallelogram method can solve analogies with


both sparse and dense embeddings (Turney and
Littman 2005, Mikolov et al. 2013b)
⚫ king – man + woman is close to queen
⚫ Paris – France + Italy is close to Rome
⚫ For a problem a : a* :: b : b*, the parallelogram method finds the word b̂* whose vector is closest to a* − a + b:
b̂* = argmin_x distance(x, a* − a + b)
(see the sketch below)
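A numpy sketch of the parallelogram method over a toy embedding table; the 3-dimensional vectors below are invented purely for illustration (real systems would load 50-300-dimensional word2vec or GloVe vectors):

import numpy as np

emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "paris": np.array([0.1, 0.8, 0.2]),
}

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, a_star, b):
    """Solve a : a_star :: b : ?  by finding the word nearest to a_star - a + b."""
    target = emb[a_star] - emb[a] + emb[b]
    candidates = {word: cosine(target, vec) for word, vec in emb.items()
                  if word not in {a, a_star, b}}   # exclude the input words, as is standard
    return max(candidates, key=candidates.get)

print(analogy("man", "king", "woman"))   # -> "queen" with these toy vectors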
Structure in GloVe Embedding space

290
Caveats with the parallelogram method

⚫ It only seems to work for frequent words, small


distances and certain relations (relating countries
to capitals, or parts of speech), but not others.
(Linzen 2016, Gladkova et al. 2016, Ethayarajh et
al. 2019a)

⚫ Understanding analogy is an open area of


research (Peterson et al. 2020)
291
Embeddings as a window onto historical semantics

Train embeddings on different decades of historical text to see meanings


~30 million books, 1850-1990, Google Books data

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of
Semantic Change. Proceedings of ACL. 292
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is
to computer programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS,
pp. 4349-4357. 2016.

⚫Ask “Paris : France :: Tokyo : x”


◦ x = Japan
⚫Ask “father : doctor :: mother : x”
◦ x = nurse
⚫Ask “man : computer programmer :: woman : x”
◦ x = homemaker
Algorithms that use embeddings as part of e.g., hiring searches
for programmers, might lead to bias in hiring 293
Historical embedding as a tool to study cultural
biases
• Compute a gender or ethnic bias for each adjective: e.g.,
how much closer the adjective is to "woman" synonyms than
"man" synonyms, or names of particular ethnicities
• Embeddings for competence adjective (smart, wise,
brilliant, resourceful, thoughtful, logical) are biased
toward men, a bias slowly decreasing 1960-1990
• Embeddings for dehumanizing adjectives (barbaric,
monstrous, bizarre) were biased toward Asians in the
1930s, bias decreasing over the 20th century.
• These match the results of old surveys done in the 1930s
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. 294
Proceedings of the National Academy of Sciences 115(16), E3635–E3644.
THANK YOU