
Statistical Natural Language Processing: an introduction

Statistical natural language processing
1. Introduction
2. Applications
3. Why is it so hard? Challenges of natural language data
Language is processed in our phones and homes
Including televisions, phones, new assistance devices, toys

Language is used for several everyday tasks
Including dictation, captioning, translation, interpretation, information retrieval, conversational assistants, language learning
Language is human communication
● Rich communication signal between humans
● Human speech is the most complex of all biosignals
● speech => text + emotion, loudness, speed, emphasis, ...
● text + emotion, loudness, speed, emphasis, ... => speech
● How much language “understanding” is needed?
● People perceive the use of language as a sign of “intelligence”
Modeling of language
● Language is a complex, adaptive system
● Storing and processing text and speech: large datasets
● We want to make systems that 'understand'
● Take into account language-related phenomena
● Building models of natural language using large data sets
Statistical Natural Language Processing

Methodological basis:
● machine learning
● pattern recognition
● probability theory
● statistics
● signal processing

Related fields:
● computational linguistics
● corpus linguistics
● phonetics
● speech processing
● discourse
What is in a language?

Phonetics and phonology:
• physical sounds
• patterns of sounds

Morphology: building blocks of words

Syntax: grammatical structure

Semantics: meaning of words

Pragmatics, discourse, spoken interaction...
Application areas
● Information retrieval
● Text clustering and classification
● Automatic speech recognition
● Natural language interfaces
● Statistical machine translation
● ...
Information retrieval
PageRank algorithm
Text clustering and classification
Speech recognition

Speech recognition: large probabilistic models
Input: sound observations
Output: word sequence
Components: decoding algorithm, acoustic model, language model
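These components fit together in the standard noisy-channel formulation (not spelled out on the slide): the decoding algorithm searches for the word sequence W that maximizes P(W | O) ∝ P(O | W) · P(W), where O is the sequence of sound observations, P(O | W) is the acoustic model and P(W) is the language model.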
Machine translation

Machine translation: large probabilistic models
Natural language interfaces

Dialogue generation: a large probabilistic model point of view
Input: word sequence
Output: word sequence
Components: decoding algorithm, dialogue model, language model
More application areas
● Information retrieval
● Text clustering and classification
● Automatic speech recognition
● Natural language interfaces
● Statistical machine translation
● Topic detection
● Sentiment analysis
● Word sense disambiguation
● Syntactic parsing
● Text generation
● Image, audio and video description
● Text-to-speech synthesis
Complexity of natural languages
• 6000+ languages, many dialects
• Each has many words
• Each word is understood slightly differently by each speaker
• Large variety of sentence structures
(Figure: all languages vs. languages with Google ASR vs. languages with no ASR)
Languages on the internet
www.internetworldstats.com
Effect of morphology: vocabulary size as a function of corpus size
Varjokallio, Kurimo, Virpioja (2016)
Challenges of segmentation
● Modeling morphology: segmenting words
  istua "to sit", istuutua "to sit down",
  istun "I sit", istahdan "I sit down for a while",
  istahtaisin "I would sit down for a while",
  istahtaisinko? "should I sit down for a while?",
  istahtaisinkohan? "I wonder if I should sit down for a while?"
● Where are the word boundaries?
Challenge of modeling syntax
Challenges of natural language
● Understanding the meaning of words is subjective:
  learning language through individual life paths, we end up having different ways of understanding and producing language
● Many words have several meanings:
  e.g. “play”, “game”, “window”
● Sentences have several interpretations:
  e.g. “Big children and adults saw a man with a telescope”
Example: color naming
Different cultural contexts
Challenge of encoding world knowledge
● For good performance, world knowledge is needed
● Quantitatively this is challenging
● Qualitatively there are also many problems (the mapping between language and the world is complex, cf. the examples above)
● Note: the world is essentially dynamic, continuous and multimodal; symbolic systems are not
Corpus-based methods
Corpora are large collections of text
● Annotated: add knowledge about words or structure into the corpus
● Or just plain text
● Statistical information on
  ● distribution of words and parts of words
  ● structure
  ● word similarity
● Allow us to build models and test hypotheses
● Allow us to explore
● Choose the best models based on statistics
Natural language processing

METHODS:
• Hidden Markov model
• Vector space model
• Recurrent neural network

TOOLS:
• Speech-to-text
• Text-to-speech
• Machine translation
• Information retrieval
• Named entity recognition
• Sentence parsing
• Topic detection
Natural language modeling: basic tasks

Word level:
1. Vector space models
2. Text preprocessing
3. Bag of words models
4. Modeling morphology

Sentence level:
1. Part-of-speech tagging
2. Named entity recognition
3. Statistical language models
4. Neural language models
A recent revolution in the language modeling approach
• Split language into tokens (e.g. UNRELATED => UN+ +RELATE+ +D)
• Vector space modeling, embedding
• Representation learning
• Deep & recurrent learning
• Sequence-to-sequence mapping (e.g. TURN IT OFF => SAMMUTA SE)
=> artificial intelligence
Statistical NLP
How to deal with natural language?

• Statistical methods are relevant to language acquisition, change, variation, generation and comprehension.

• Pure algebraic methods are inadequate for understanding many important properties of language, such as the measure of goodness that allows us to identify the correct parse among a large candidate set.

• The focus of computational linguistics has so far been on technology, but the same techniques promise progress at...
Statistical methods for NLP
Descriptive Statistics in NLP

• Frequency Counts: Frequency counts involve tallying


occurrences of words, phrases, or characters in a text
corpus.
• For instance, counting word frequency helps in
identifying the most common words, which can be
instrumental in tasks such as text summarization,
keyword extraction, and sentiment analysis.

• Measures of Central Tendency:


• Mean: The average length of words or sentences in a
corpus can indicate the complexity of the text.
• Median: This provides a central value, offering insight
into the typical word or sentence length, which is less
affected by outliers than the mean.
• Mode: The most frequent word or sentence length can
reveal common patterns in language usage.
Descriptive Statistics in NLP

Measures of Dispersion:
Variance and Standard Deviation: These metrics
indicate the variability in word or sentence lengths. High
variance suggests a mix of very short and very long
words or sentences, while low variance indicates more
uniform lengths.
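As a minimal illustration of the statistics above (the toy corpus and variable names are my own), everything can be computed with the Python standard library:

from collections import Counter
import statistics

corpus = "the cat sat on the mat . the dog sat too ."   # hypothetical toy corpus
words = corpus.split()

# Frequency counts: tally the occurrences of each word
freq = Counter(words)
print(freq.most_common(3))           # the most common words

# Central tendency of word lengths
lengths = [len(w) for w in words]
print(statistics.mean(lengths))      # mean word length
print(statistics.median(lengths))    # median word length
print(statistics.mode(lengths))      # most frequent word length

# Dispersion of word lengths
print(statistics.variance(lengths))  # variance
print(statistics.stdev(lengths))     # standard deviation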
Probability Distributions

Uniform Distribution: In NLP, a uniform distribution


might be used for generating random words or
characters, where each has an equal probability of
selection. This can serve as a baseline for more
sophisticated models.

Normal Distribution: Many linguistic features, such as


sentence lengths and word frequencies, approximate a
normal distribution. This assumption is useful for various
statistical tests and for modeling language phenomena.
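A small sketch of the idea with SciPy (the sentence lengths are invented): fit a normal distribution to observed sentence lengths and compare it with a uniform baseline.

import numpy as np
from scipy import stats

sentence_lengths = np.array([5, 8, 12, 9, 7, 11, 10, 8, 6, 9])   # hypothetical data

# Normal distribution: estimate mean and standard deviation from the data
mu, sigma = stats.norm.fit(sentence_lengths)

# Uniform baseline: every length between the observed min and max is equally likely
low, high = sentence_lengths.min(), sentence_lengths.max()
uniform = stats.uniform(loc=low, scale=high - low)

print(stats.norm(mu, sigma).pdf(9))  # density of length 9 under the fitted normal
print(uniform.pdf(9))                # density under the uniform baseline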
Zipf’s Laws (1929)
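Zipf's law states that the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank times frequency stays roughly constant across the vocabulary. A minimal check on any plain-text corpus (the file name is a placeholder):

from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus file
freq = Counter(words)

# Under Zipf's law, rank * frequency should be roughly constant
for rank, (word, count) in enumerate(freq.most_common(10), start=1):
    print(rank, word, count, rank * count)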
Statistical Methods for Text
Analysis
1. Tokenization:

• Tokenization is the process of splitting text into smaller units,


such as words or sentences.
• This is a fundamental preprocessing step in NLP.
• For instance, in sentiment analysis, tokenizing text into words
allows the analysis of individual words' sentiments.
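A naive, dependency-free sketch of both kinds of tokenization (the regular expressions are simplistic and only illustrative):

import re

text = "The movie was great. I loved the acting!"

# Naive sentence tokenization: split after sentence-final punctuation
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)   # ['The movie was great.', 'I loved the acting!']

# Naive word tokenization: keep word characters together, punctuation separate
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)      # ['The', 'movie', 'was', 'great', '.', 'I', 'loved', 'the', 'acting', '!']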

2. N-gram Models:
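The body of this slide is not reproduced above. As a rough sketch, an n-gram model estimates the probability of a word from the preceding n-1 words using corpus counts; a bigram version (toy corpus of my own) looks like this:

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()   # toy corpus

# Bigram model: P(w2 | w1) is estimated as count(w1, w2) / count(w1)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))   # 2/3: "cat" follows two of the three occurrences of "the"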
Statistical Methods for Text
Analysis
3. TF, IDF:

•Term Frequency (TF): Measures how frequently a


term appears in a document.

•Inverse Document Frequency (IDF): Measures


how common or rare a term is across all documents.

•TF-IDF (Term Frequency-Inverse Document


Frequency):
• TF-IDF is a statistic that reflects the importance of
a word in a document relative to a corpus.
• The product of TF and IDF, highlighting words that
are important in a specific document but not too
common across the entire corpus. It is widely
used in information retrieval and text mining.
Example: Calculating TF, IDF
Suppose we are looking for documents using the
query Q and our database is composed of the
documents D1, D2, and D3.
•Q: The cat.
•D1: The cat is on the mat.
•D2: My dog and cat are the best.
•D3: The locals are playing.

TF(word, document) = “number of occurrences of the word in the document” / “number of words in the document”

Let’s compute the TF scores of the words “the” and “cat” (i.e. the query
words) with respect to the documents D1, D2, and D3.
TF(“the”, D1) = 2/6 = 0.33
TF(“the”, D2) = 1/7 = 0.14
TF(“the”, D3) = 1/4 = 0.25
TF(“cat”, D1) = 1/6 = 0.17
TF(“cat”, D2) = 1/7 = 0.14
TF(“cat”, D3) = 0/4 = 0
IDF can be calculated by taking the total number of
documents, dividing it by the number of documents that
contain the word, and taking the logarithm. If the word
is very common and appears in many documents, this
number will approach 0; the rarer the word, the larger its IDF.
IDF(word) = log(number of documents / number of
documents that contain the word)
Let’s compute the IDF scores of the words “the” and “cat”.
IDF(“the”) = log(3/3) = log(1) = 0
IDF(“cat”) = log(3/2) = 0.18
Multiplying TF and IDF gives the TF-IDF score of a word in a
document. The higher the score, the more relevant that word is in
that particular document.
TF-IDF(word, document) = TF(word, document) *
IDF(word)
Let’s compute the TF-IDF scores of the words “the” and “cat”.
TF-IDF(“the”, D1) = 0.33 * 0 = 0
TF-IDF(“the”, D2) = 0.14 * 0 = 0
TF-IDF(“the”, D3) = 0.25 * 0 = 0
TF-IDF(“cat”, D1) = 0.17 * 0.18 = 0.0306
TF-IDF(“cat”, D2) = 0.14 * 0.18 = 0.0252
TF-IDF(“cat”, D3) = 0 * 0.18 = 0
The next step is to use a ranking function to order the documents
according to the TF-IDF scores of their words. We can use the average
TF-IDF word scores over each document to get the ranking
of D1, D2, and D3 with respect to the query Q.

Average TF-IDF of D1 = (0 + 0.0306) / 2 = 0.0153


Average TF-IDF of D2 = (0 + 0.0252) / 2 = 0.0126
Average TF-IDF of D3 = (0 + 0) / 2 = 0

It looks like the word “the” does not contribute to the TF-IDF score of
any document. This is because “the” appears in all of the documents,
so its IDF is zero and it is not treated as a relevant, discriminative word.
As a conclusion, when performing the query “The cat” over
the collection of documents D1, D2, and D3, the ranked
results would be:

1.D1: The cat is on the mat.


2.D2: My dog and cat are the best.
3.D3: The locals are playing.
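A short Python sketch that reproduces this ranking (log base 10; document and function names are my own). The averages differ slightly from the hand-rounded values above:

import math

docs = {
    "D1": "the cat is on the mat".split(),
    "D2": "my dog and cat are the best".split(),
    "D3": "the locals are playing".split(),
}
query = ["the", "cat"]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word):
    n_containing = sum(1 for doc in docs.values() if word in doc)
    return math.log10(len(docs) / n_containing)

# Average TF-IDF of the query words in each document
for name, doc in docs.items():
    score = sum(tf(w, doc) * idf(w) for w in query) / len(query)
    print(name, round(score, 4))   # D1 ≈ 0.0147, D2 ≈ 0.0126, D3 = 0.0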
Statistical Methods for Text
Analysis
4. Word Embeddings:

• Word embeddings represent words as dense


vectors in a continuous vector space,
capturing semantic relationships.

• Techniques like Word2Vec, GloVe, and


FastText learn embeddings by leveraging
large text corpora, enabling applications
such as semantic similarity, text
classification, and machine translation.
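A minimal training sketch with the gensim library (assumes gensim 4.x is installed; the corpus and hyperparameters are purely illustrative, real embeddings need much larger corpora):

from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"][:5])                 # first 5 dimensions of the "cat" vector
print(model.wv.similarity("cat", "dog"))   # cosine similarity of the two embeddings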
Cosine Similarity

The cosine similarity cos(θ) ranges from -1 (dissimilar, opposite directions) to +1 (very similar).

It is defined as: cos(θ) = (a · b) / (|a| |b|).


Here we see that point A(1.5, 1.5) and point B(2.0, 1.0) are close
together in a 2-dimensional embedding space.

When we calculate the cosine similarity, we obtain a value of 0.948,


confirming that both vectors are quite similar.

In contrast, when we compare the similarity of point A(1.5, 1.5) and


point C(-1.0, -0.5), we observe that the cosine similarity is -0.948,
indicating that both vectors are dissimilar.

We can see that they are in opposite directions in the embedding


space.

A cos(θ) value of 0 would indicate that both vectors are perpendicular to each other, showing neither similarity nor dissimilarity.
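The calculation for the three points above, as a short NumPy sketch:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.5, 1.5])
B = np.array([2.0, 1.0])
C = np.array([-1.0, -0.5])

print(cosine_similarity(A, B))   # ≈ 0.95: similar direction
print(cosine_similarity(A, C))   # ≈ -0.95: opposite direction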
Hypothesis Testing

Chi-Square Test: The Chi-Square test assesses whether


there is a significant association between two categorical
variables. In NLP, it can be used to test the
independence of word occurrences across different
document categories, helping in feature selection for
text classification.

T-Test and ANOVA:


•T-Test: Compares the means of two groups to determine
if they are statistically different. For example, comparing
the average sentiment scores of two different sets of
documents.
•ANOVA (Analysis of Variance): Extends the T-Test to
compare the means of three or more groups, useful in
experiments involving multiple text categories.
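As an illustration of the chi-square case, a SciPy sketch with an invented contingency table counting how often the word "excellent" occurs in positive versus negative reviews:

from scipy.stats import chi2_contingency

# Rows: reviews containing / not containing "excellent"
# Columns: positive reviews, negative reviews (counts are invented)
table = [[40, 5],
         [60, 95]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the word and the class are associated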
Machine Learning Models

Logistic Regression: Logistic regression is a linear model


used for binary classification tasks. It predicts the
probability of a categorical outcome (e.g., spam vs. non-
spam emails) and is interpretable, making it a popular
choice for text classification.

Naive Bayes: Naive Bayes classifiers are based on Bayes'


theorem and assume independence between features.
Despite this strong assumption, they perform remarkably
well in text classification tasks such as sentiment analysis
and spam detection, due to the nature of text data.

Support Vector Machines (SVM): SVMs are powerful


classifiers that find the hyperplane best separating the data
into classes. They are used for both text classification and
regression tasks, offering robust performance with high-
dimensional data.
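A compact scikit-learn sketch of a text classifier on toy data; the same pipeline works with LogisticRegression, MultinomialNB or LinearSVC as the final step:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great movie", "awful film", "loved it", "terrible acting"]   # toy data
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features followed by a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["what a great film"]))   # predicted label for a new review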
Neural Networks:

•Recurrent Neural Networks (RNNs): Designed for


sequential data, RNNs can remember previous inputs,
making them suitable for language modeling and text
generation. However, they suffer from the vanishing
gradient problem.

•Long Short-Term Memory (LSTM): A type of RNN that


mitigates the vanishing gradient problem by maintaining
long-range dependencies, making it effective for tasks like
part-of-speech tagging and named entity recognition.

•Transformers: Advanced models using self-attention


mechanisms to handle long-range dependencies.
Transformers underpin state-of-the-art models like BERT
and GPT, excelling in various NLP tasks such as translation,
summarization, and question answering.
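As a rough illustration of using a pretrained Transformer, the Hugging Face transformers library offers ready-made pipelines (this assumes the library is installed and downloads a default model on first use):

from transformers import pipeline

# Sentiment analysis with a default pretrained Transformer model
classifier = pipeline("sentiment-analysis")
print(classifier("Statistical NLP is surprisingly fun."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]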
Dimensionality Reduction

Principal Component Analysis (PCA):


• PCA reduces the dimensionality of data by transforming
it into a set of linearly uncorrelated components,
preserving as much variance as possible.
• In NLP, PCA can be used to visualize high-dimensional
word embeddings.

t-SNE (t-Distributed Stochastic Neighbor


Embedding):
• t-SNE is a non-linear dimensionality reduction technique
that maps high-dimensional data to lower dimensions
(2D or 3D) for visualization.
• It is particularly effective in visualizing the structure of
word embeddings and document clusters.
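A minimal scikit-learn sketch that projects (randomly generated stand-in) embedding vectors to 2D with both techniques:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 300)   # stand-in for 100 word vectors of dimension 300

pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)   # (100, 2) (100, 2)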
Bayesian Inference

Latent Dirichlet Allocation (LDA):


• LDA is a generative probabilistic model for topic
modeling, which discovers hidden topics in a
collection of documents.
• Each document is represented as a mixture of topics,
and each topic as a distribution over words.
• LDA is used for organizing large corpora, improving
search, and recommending content.

Bayesian Networks:
• Bayesian networks are probabilistic graphical
models representing variables and their conditional
dependencies.
• They are used in NLP for tasks like part-of-speech
tagging, named entity recognition, and understanding...
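Relating to the LDA part above, a small scikit-learn sketch (toy documents; real topic models need far more text):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat chased the mouse",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares on the market",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.transform(counts))   # per-document topic mixtures (each row sums to 1)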
Time Series Analysis

Markov Chains: Markov chains model sequences of


events where the probability of each event depends only
on the previous state. In NLP, they are used for text
generation, speech recognition, and predictive text
input.

Hidden Markov Models (HMMs): HMMs are statistical


models where the system being modeled is assumed to
be a Markov process with hidden states. They are
applied in sequence labeling tasks such as part-of-
speech tagging, named entity recognition, and speech
processing.
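A tiny first-order, word-level Markov chain for text generation (the corpus is illustrative); HMMs add hidden states on top of this idea:

import random
from collections import defaultdict

tokens = "the cat sat on the mat and the cat slept on the mat".split()

# First-order Markov chain: record the possible next words for each word
transitions = defaultdict(list)
for w1, w2 in zip(tokens, tokens[1:]):
    transitions[w1].append(w2)

random.seed(0)
word = "the"
generated = [word]
for _ in range(8):
    word = random.choice(transitions[word])   # the next word depends only on the current one
    generated.append(word)

print(" ".join(generated))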
Maximum Likelihood Estimation (MLE)
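The body of this slide is not reproduced above. As a standard illustration, the maximum likelihood estimate of a unigram language model is simply the relative frequency of each word in the training data:

from collections import Counter

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = len(tokens)

# MLE of a unigram model: P(w) = count(w) / total number of tokens
p_mle = {w: c / N for w, c in counts.items()}
print(p_mle["the"])   # 2/6 ≈ 0.33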
Read more

Manning & Schütze: Foundations of Statistical Natural Language Processing