
Statistical Natural Language Processing: an introduction

Statistical natural language processing
1. Introduction
2. Applications
3. Why is it so hard? Challenges of natural language data
Language is processed in our phones and homes
Including televisions, phones, new assistance devices, toys

Language is used for several everyday tasks
Including dictation, captioning, translation, interpretation, information retrieval, conversational assistants, language learning
Language is human communication
● Rich communication signal between humans
● Human speech is the most complex of all biosignals
● speech => text + emotion, loudness, speed, emphasis, ...
● text + emotion, loudness, speed, emphasis, ... => speech
● How much language “understanding” is needed?
● People perceive the use of language as a sign of “intelligence”
Modeling of language
● Language is a complex, adaptive system
● Storing and processing text and speech: large datasets
● We want to make systems that 'understand'
● Take into account language-related phenomena
● Building models of natural language using large data sets
Statistical Natural Language Processing

Methodological basis:
● machine learning
● pattern recognition
● probability theory
● statistics
● signal processing

Related fields:
● computational linguistics
● corpus linguistics
● phonetics
● speech processing
● discourse
What is in a language?

Phonetics and phonology:
• physical sounds
• patterns of sounds

Morphology: building blocks of words

Syntax: grammatical structure

Semantics: meaning of words

Pragmatics, discourse, spoken interaction...
Application areas
● Information retrieval
● Text clustering and classification
● Automatic speech recognition
● Natural language interfaces
● Statistical machine translation
● ...
Information retrieval
PageRank algorithm
Text clustering and classification
Speech recognition

Speech recognition: large probabilistic models
Input: sound observations
Output: word sequence
Components: decoding algorithm, acoustic model, language model
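These components fit together in the standard noisy-channel formulation (not spelled out on the slide): the decoding algorithm searches for the word sequence W that maximizes P(W | O) ∝ P(O | W) · P(W), where O is the sequence of sound observations, P(O | W) is the acoustic model and P(W) is the language model.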
Machine translation

Machine translation: large probabilistic models
Natural language interfaces

Dialogue generation: a large probabilistic model point of view
Input: word sequence
Output: word sequence
Components: decoding algorithm, dialogue model, language model
More application areas
● Information retrieval
● Text clustering and classification
● Automatic speech recognition
● Natural language interfaces
● Statistical machine translation
● Topic detection
● Sentiment analysis
● Word sense disambiguation
● Syntactic parsing
● Text generation
● Image, audio and video description
● Text-to-speech synthesis
Complexity of natural languages
• 6000+ languages, many dialects
• Each has many words
• Each word is understood slightly differently by each speaker
• Large variety of sentence structures
(Figure: all languages vs. languages with Google ASR vs. languages with no ASR)
Languages on the internet
www.internetworldstats.com
Effect of morphology: vocabulary size as a function of corpus size
Varjokallio, Kurimo, Virpioja (2016)
Challenges of segmentation
● Modeling morphology: segmenting words
  istua "to sit", istuutua "to sit down",
  istun "I sit", istahdan "I sit down for a while",
  istahtaisin "I would sit down for a while",
  istahtaisinko? "should I sit down for a while?",
  istahtaisinkohan? "I wonder if I should sit down for a while?"
● Where are the word boundaries?
Challenge of modeling syntax
Challenges of natural language
● Understanding the meaning of words is subjective:
  learning language through individual life paths, we end up having different ways of understanding and producing language
● Many words have several meanings:
  e.g. “play”, “game”, “window”
● Sentences have several interpretations:
  e.g. “Big children and adults saw a man with a telescope”
Example: color naming
Different cultural contexts
Challenge of encoding world knowledge
● For good performance, world knowledge is needed
● Quantitatively this is challenging
● Qualitatively there are also many problems (the mapping between language and the world is complex, cf. the examples above)
● Note: the world is essentially dynamic, continuous and multimodal; symbolic systems are not
Corpus-based methods
Corpora are large collections of text
● Annotated: add knowledge about words or structure into the corpus
● Or just plain text
● Statistical information on
  ● distribution of words and parts of words
  ● structure
  ● word similarity
● Allow us to build models and test hypotheses
● Allow us to explore
● Choose the best models based on statistics
Natural language processing

METHODS:
• Hidden Markov model
• Vector space model
• Recurrent neural network

TOOLS:
• Speech-to-text
• Text-to-speech
• Machine translation
• Information retrieval
• Named entity recognition
• Sentence parsing
• Topic detection
Natural language modeling: basic tasks

Word level:
1. Vector space models
2. Text preprocessing
3. Bag of words models
4. Modeling morphology

Sentence level:
1. Part-of-speech tagging
2. Named entity recognition
3. Statistical language models
4. Neural language models
A recent revolution in the language modeling approach
• Split language into tokens (e.g. UNRELATED => UN+ +RELATE+ +D)
• Vector space modeling, embedding
• Representation learning
• Deep & recurrent learning
• Sequence-to-sequence mapping (e.g. TURN IT OFF => SAMMUTA SE)
=> artificial intelligence
Statistical NLP
How to deal with natural language?

• Statistical methods are relevant to language acquisition, change, variation, generation and comprehension.

• Pure algebraic methods are inadequate for understanding many important properties of language, such as the measure of goodness that allows us to identify the correct parse among a large candidate set.

• The focus of computational linguistics has so far been on technology, but the same techniques promise progress at...
Statistical methods for NLP
Descriptive Statistics in NLP

• Frequency Counts: Frequency counts involve tallying


occurrences of words, phrases, or characters in a text
corpus.
• For instance, counting word frequency helps in
identifying the most common words, which can be
instrumental in tasks such as text summarization,
keyword extraction, and sentiment analysis.

• Measures of Central Tendency:


• Mean: The average length of words or sentences in a
corpus can indicate the complexity of the text.
• Median: This provides a central value, offering insight
into the typical word or sentence length, which is less
affected by outliers than the mean.
• Mode: The most frequent word or sentence length can
reveal common patterns in language usage.
Descriptive Statistics in NLP

Measures of Dispersion:
Variance and Standard Deviation: These metrics
indicate the variability in word or sentence lengths. High
variance suggests a mix of very short and very long
words or sentences, while low variance indicates more
uniform lengths.
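As a minimal illustration of the statistics above (the toy corpus and variable names are my own), everything can be computed with the Python standard library:

from collections import Counter
import statistics

corpus = "the cat sat on the mat . the dog sat too ."   # hypothetical toy corpus
words = corpus.split()

# Frequency counts: tally the occurrences of each word
freq = Counter(words)
print(freq.most_common(3))           # the most common words

# Central tendency of word lengths
lengths = [len(w) for w in words]
print(statistics.mean(lengths))      # mean word length
print(statistics.median(lengths))    # median word length
print(statistics.mode(lengths))      # most frequent word length

# Dispersion of word lengths
print(statistics.variance(lengths))  # variance
print(statistics.stdev(lengths))     # standard deviation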
Probability Distributions

Uniform Distribution: In NLP, a uniform distribution


might be used for generating random words or
characters, where each has an equal probability of
selection. This can serve as a baseline for more
sophisticated models.

Normal Distribution: Many linguistic features, such as


sentence lengths and word frequencies, approximate a
normal distribution. This assumption is useful for various
statistical tests and for modeling language phenomena.
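A small sketch of the idea with SciPy (the sentence lengths are invented): fit a normal distribution to observed sentence lengths and compare it with a uniform baseline.

import numpy as np
from scipy import stats

sentence_lengths = np.array([5, 8, 12, 9, 7, 11, 10, 8, 6, 9])   # hypothetical data

# Normal distribution: estimate mean and standard deviation from the data
mu, sigma = stats.norm.fit(sentence_lengths)

# Uniform baseline: every length between the observed min and max is equally likely
low, high = sentence_lengths.min(), sentence_lengths.max()
uniform = stats.uniform(loc=low, scale=high - low)

print(stats.norm(mu, sigma).pdf(9))  # density of length 9 under the fitted normal
print(uniform.pdf(9))                # density under the uniform baseline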
Zipf’s Laws (1929)
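Zipf's law states that the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank times frequency stays roughly constant across the vocabulary. A minimal check on any plain-text corpus (the file name is a placeholder):

from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus file
freq = Counter(words)

# Under Zipf's law, rank * frequency should be roughly constant
for rank, (word, count) in enumerate(freq.most_common(10), start=1):
    print(rank, word, count, rank * count)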
Statistical Methods for Text
Analysis
1. Tokenization:

• Tokenization is the process of splitting text into smaller units,


such as words or sentences.
• This is a fundamental preprocessing step in NLP.
• For instance, in sentiment analysis, tokenizing text into words
allows the analysis of individual words' sentiments.
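A naive, dependency-free sketch of both kinds of tokenization (the regular expressions are simplistic and only illustrative):

import re

text = "The movie was great. I loved the acting!"

# Naive sentence tokenization: split after sentence-final punctuation
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)   # ['The movie was great.', 'I loved the acting!']

# Naive word tokenization: keep word characters together, punctuation separate
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)      # ['The', 'movie', 'was', 'great', '.', 'I', 'loved', 'the', 'acting', '!']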

2. N-gram Models:
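The body of this slide is not reproduced above. As a rough sketch, an n-gram model estimates the probability of a word from the preceding n-1 words using corpus counts; a bigram version (toy corpus of my own) looks like this:

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()   # toy corpus

# Bigram model: P(w2 | w1) is estimated as count(w1, w2) / count(w1)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))   # 2/3: "cat" follows two of the three occurrences of "the"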
Statistical Methods for Text
Analysis
3. TF, IDF:

•Term Frequency (TF): Measures how frequently a


term appears in a document.

•Inverse Document Frequency (IDF): Measures


how common or rare a term is across all documents.

•TF-IDF (Term Frequency-Inverse Document


Frequency):
• TF-IDF is a statistic that reflects the importance of
a word in a document relative to a corpus.
• The product of TF and IDF, highlighting words that
are important in a specific document but not too
common across the entire corpus. It is widely
used in information retrieval and text mining.
Example: Calculating TF, IDF
Suppose we are looking for documents using the
query Q and our database is composed of the
documents D1, D2, and D3.
•Q: The cat.
•D1: The cat is on the mat.
•D2: My dog and cat are the best.
•D3: The locals are playing.

TF(word, document) = “number of occurrences of the word in the document” / “number of words in the document”

Let’s compute the TF scores of the words “the” and “cat” (i.e. the query
words) with respect to the documents D1, D2, and D3.
TF(“the”, D1) = 2/6 = 0.33
TF(“the”, D2) = 1/7 = 0.14
TF(“the”, D3) = 1/4 = 0.25
TF(“cat”, D1) = 1/6 = 0.17
TF(“cat”, D2) = 1/7 = 0.14
TF(“cat”, D3) = 0/4 = 0
IDF can be calculated by taking the total number of
documents, dividing it by the number of documents that
contain the word, and taking the logarithm. If the word
is very common and appears in many documents, this
number will approach 0; the rarer the word, the larger its IDF.
IDF(word) = log(number of documents / number of
documents that contain the word)
Let’s compute the IDF scores of the words “the” and “cat”.
IDF(“the”) = log(3/3) = log(1) = 0
IDF(“cat”) = log(3/2) = 0.18
Multiplying TF and IDF gives the TF-IDF score of a word in a
document. The higher the score, the more relevant that word is in
that particular document.
TF-IDF(word, document) = TF(word, document) *
IDF(word)
Let’s compute the TF-IDF scores of the words “the” and “cat”.
TF-IDF(“the”, D1) = 0.33 * 0 = 0
TF-IDF(“the”, D2) = 0.14 * 0 = 0
TF-IDF(“the”, D3) = 0.25 * 0 = 0
TF-IDF(“cat”, D1) = 0.17 * 0.18 = 0.0306
TF-IDF(“cat”, D2) = 0.14 * 0.18 = 0.0252
TF-IDF(“cat”, D3) = 0 * 0.18 = 0
The next step is to use a ranking function to order the documents
according to the TF-IDF scores of their words. We can use the average
TF-IDF word scores over each document to get the ranking
of D1, D2, and D3 with respect to the query Q.

Average TF-IDF of D1 = (0 + 0.0306) / 2 = 0.0153


Average TF-IDF of D2 = (0 + 0.0252) / 2 = 0.0126
Average TF-IDF of D3 = (0 + 0) / 2 = 0

It looks like the word “the” does not contribute to the TF-IDF score of
any document. This is because “the” appears in all of the documents,
so its IDF is zero and it is not treated as a relevant, discriminative word.
As a conclusion, when performing the query “The cat” over
the collection of documents D1, D2, and D3, the ranked
results would be:

1.D1: The cat is on the mat.


2.D2: My dog and cat are the best.
3.D3: The locals are playing.
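A short Python sketch that reproduces this ranking (log base 10; document and function names are my own). The averages differ slightly from the hand-rounded values above:

import math

docs = {
    "D1": "the cat is on the mat".split(),
    "D2": "my dog and cat are the best".split(),
    "D3": "the locals are playing".split(),
}
query = ["the", "cat"]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word):
    n_containing = sum(1 for doc in docs.values() if word in doc)
    return math.log10(len(docs) / n_containing)

# Average TF-IDF of the query words in each document
for name, doc in docs.items():
    score = sum(tf(w, doc) * idf(w) for w in query) / len(query)
    print(name, round(score, 4))   # D1 ≈ 0.0147, D2 ≈ 0.0126, D3 = 0.0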
Statistical Methods for Text
Analysis
4. Word Embeddings:

• Word embeddings represent words as dense


vectors in a continuous vector space,
capturing semantic relationships.

• Techniques like Word2Vec, GloVe, and


FastText learn embeddings by leveraging
large text corpora, enabling applications
such as semantic similarity, text
classification, and machine translation.
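A minimal training sketch with the gensim library (assumes gensim 4.x is installed; the corpus and hyperparameters are purely illustrative, real embeddings need much larger corpora):

from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"][:5])                 # first 5 dimensions of the "cat" vector
print(model.wv.similarity("cat", "dog"))   # cosine similarity of the two embeddings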
Cosine Similarity

The cosine similarity cos(θ) ranges from -1 (dissimilar, opposite directions) to +1 (very similar).

It is defined as: cos(θ) = (a · b) / (|a| |b|).


Here we see that point A(1.5, 1.5) and point B(2.0, 1.0) are close
together in a 2-dimensional embedding space.

When we calculate the cosine similarity, we obtain a value of 0.948,


confirming that both vectors are quite similar.

In contrast, when we compare the similarity of point A(1.5, 1.5) and


point C(-1.0, -0.5), we observe that the cosine similarity is -0.948,
indicating that both vectors are dissimilar.

We can see that they are in opposite directions in the embedding


space.

A cos(θ) value of 0 would indicate that both vectors are perpendicular to each other, showing neither similarity nor dissimilarity.
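The calculation for the three points above, as a short NumPy sketch:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.5, 1.5])
B = np.array([2.0, 1.0])
C = np.array([-1.0, -0.5])

print(cosine_similarity(A, B))   # ≈ 0.95: similar direction
print(cosine_similarity(A, C))   # ≈ -0.95: opposite direction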
Hypothesis Testing

Chi-Square Test: The Chi-Square test assesses whether


there is a significant association between two categorical
variables. In NLP, it can be used to test the
independence of word occurrences across different
document categories, helping in feature selection for
text classification.

T-Test and ANOVA:


•T-Test: Compares the means of two groups to determine
if they are statistically different. For example, comparing
the average sentiment scores of two different sets of
documents.
•ANOVA (Analysis of Variance): Extends the T-Test to
compare the means of three or more groups, useful in
experiments involving multiple text categories.
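As an illustration of the chi-square case, a SciPy sketch with an invented contingency table counting how often the word "excellent" occurs in positive versus negative reviews:

from scipy.stats import chi2_contingency

# Rows: reviews containing / not containing "excellent"
# Columns: positive reviews, negative reviews (counts are invented)
table = [[40, 5],
         [60, 95]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the word and the class are associated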
Machine Learning Models

Logistic Regression: Logistic regression is a linear model


used for binary classification tasks. It predicts the
probability of a categorical outcome (e.g., spam vs. non-
spam emails) and is interpretable, making it a popular
choice for text classification.

Naive Bayes: Naive Bayes classifiers are based on Bayes'


theorem and assume independence between features.
Despite this strong assumption, they perform remarkably
well in text classification tasks such as sentiment analysis
and spam detection, due to the nature of text data.

Support Vector Machines (SVM): SVMs are powerful


classifiers that find the hyperplane best separating the data
into classes. They are used for both text classification and
regression tasks, offering robust performance with high-
dimensional data.
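A compact scikit-learn sketch of a text classifier on toy data; the same pipeline works with LogisticRegression, MultinomialNB or LinearSVC as the final step:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great movie", "awful film", "loved it", "terrible acting"]   # toy data
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features followed by a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["what a great film"]))   # predicted label for a new review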
Neural Networks:

•Recurrent Neural Networks (RNNs): Designed for


sequential data, RNNs can remember previous inputs,
making them suitable for language modeling and text
generation. However, they suffer from the vanishing
gradient problem.

•Long Short-Term Memory (LSTM): A type of RNN that


mitigates the vanishing gradient problem by maintaining
long-range dependencies, making it effective for tasks like
part-of-speech tagging and named entity recognition.

•Transformers: Advanced models using self-attention


mechanisms to handle long-range dependencies.
Transformers underpin state-of-the-art models like BERT
and GPT, excelling in various NLP tasks such as translation,
summarization, and question answering.
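As a rough illustration of using a pretrained Transformer, the Hugging Face transformers library offers ready-made pipelines (this assumes the library is installed and downloads a default model on first use):

from transformers import pipeline

# Sentiment analysis with a default pretrained Transformer model
classifier = pipeline("sentiment-analysis")
print(classifier("Statistical NLP is surprisingly fun."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]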
Dimensionality Reduction

Principal Component Analysis (PCA):


• PCA reduces the dimensionality of data by transforming
it into a set of linearly uncorrelated components,
preserving as much variance as possible.
• In NLP, PCA can be used to visualize high-dimensional
word embeddings.

t-SNE (t-Distributed Stochastic Neighbor


Embedding):
• t-SNE is a non-linear dimensionality reduction technique
that maps high-dimensional data to lower dimensions
(2D or 3D) for visualization.
• It is particularly effective in visualizing the structure of
word embeddings and document clusters.
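A minimal scikit-learn sketch that projects (randomly generated stand-in) embedding vectors to 2D with both techniques:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 300)   # stand-in for 100 word vectors of dimension 300

pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)   # (100, 2) (100, 2)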
Bayesian Inference

Latent Dirichlet Allocation (LDA):


• LDA is a generative probabilistic model for topic
modeling, which discovers hidden topics in a
collection of documents.
• Each document is represented as a mixture of topics,
and each topic as a distribution over words.
• LDA is used for organizing large corpora, improving
search, and recommending content.

Bayesian Networks:
• Bayesian networks are probabilistic graphical
models representing variables and their conditional
dependencies.
• They are used in NLP for tasks like part-of-speech
tagging, named entity recognition, and understanding...
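Relating to the LDA part above, a small scikit-learn sketch (toy documents; real topic models need far more text):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat chased the mouse",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares on the market",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.transform(counts))   # per-document topic mixtures (each row sums to 1)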
Time Series Analysis

Markov Chains: Markov chains model sequences of


events where the probability of each event depends only
on the previous state. In NLP, they are used for text
generation, speech recognition, and predictive text
input.

Hidden Markov Models (HMMs): HMMs are statistical


models where the system being modeled is assumed to
be a Markov process with hidden states. They are
applied in sequence labeling tasks such as part-of-
speech tagging, named entity recognition, and speech
processing.
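A tiny first-order, word-level Markov chain for text generation (the corpus is illustrative); HMMs add hidden states on top of this idea:

import random
from collections import defaultdict

tokens = "the cat sat on the mat and the cat slept on the mat".split()

# First-order Markov chain: record the possible next words for each word
transitions = defaultdict(list)
for w1, w2 in zip(tokens, tokens[1:]):
    transitions[w1].append(w2)

random.seed(0)
word = "the"
generated = [word]
for _ in range(8):
    word = random.choice(transitions[word])   # the next word depends only on the current one
    generated.append(word)

print(" ".join(generated))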
Maximum Likelihood Estimation (MLE)
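The body of this slide is not reproduced above. As a standard illustration, the maximum likelihood estimate of a unigram language model is simply the relative frequency of each word in the training data:

from collections import Counter

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = len(tokens)

# MLE of a unigram model: P(w) = count(w) / total number of tokens
p_mle = {w: c / N for w, c in counts.items()}
print(p_mle["the"])   # 2/6 ≈ 0.33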
Read more

Manning & Schütze: Foundations of Statistical Natural Language Processing