Statistical Natural Language Processing: An Introduction
1. Introduction
2. Applications
3. Why is it so hard?
   - Challenges of natural language data
Language is processed in our phones and homes, including televisions, phones, new assistance devices, and toys. Language is used for many everyday tasks.
• Phonetics: physical sounds
• Phonology: patterns of sounds
• Morphology: building blocks of words
Applications:
• Automatic speech recognition
• Natural language interfaces
• Statistical machine translation
• ...
• Information retrieval
• PageRank algorithm
• Text clustering and classification
Speech recognition: large probabilistic models
• Input: sound observations
• Output: word sequence
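A standard way to make this precise (our own formulation; the slide only names the model class) is the noisy-channel decomposition, where the recognizer searches for the word sequence W that is most probable given the acoustic observations O:

$$ \hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\,P(W) $$

Here P(O | W) is the acoustic model (e.g. the hidden Markov model listed later under METHODS) and P(W) is the language model.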
• Automatic speech recognition
• Natural language interfaces
• Statistical machine translation
• Disambiguation
• Syntactic parsing
• Text generation
• Image, audio and video description
• Text-to-speech synthesis
Complexity of natural languages
• 6000+ languages, many dialects
• Each has many words
• Each word is understood slightly differently by each speaker
• Large variety of sentence structures
[Figure: all languages vs. languages with / without Google ASR support]
Languages on the internet (source: www.internetworldstats.com)
Effect of morphology: vocabulary size as a function of corpus size
METHODS:
• Hidden Markov model
• Vector space model
• Recurrent neural network

TOOLS:
• Speech-to-text
• Text-to-speech
• Machine translation
• Information retrieval
• Named entity recognition
• Sentence parsing
• Topic detection
Natural language modeling: basic tasks

Word level:
1. Vector space models
2. Text preprocessing
3. Bag of words models (see the sketch after this list)
4. Modeling morphology

Sentence level:
1. Part-of-speech tagging
2. Named entity recognition
3. Statistical language models
4. Neural language models
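As a minimal illustration of the bag-of-words idea named above, here is a plain-Python sketch; the example sentences and the `bag_of_words` helper are our own, not from the slides:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count word occurrences."""
    return Counter(text.lower().split())

# Hypothetical example documents.
for doc in ["the cat sat on the mat", "the dog chased the cat"]:
    print(bag_of_words(doc))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
# Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})
```

Each document becomes an unordered multiset of words; word order is discarded, which is exactly the simplification that vector space models build on.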
A recent revolution in the language modeling approach
• Split language into tokens (e.g. UNRELATED → UN+ RELATE+ +D)
• Vector space modeling, embedding
• Representation learning
• Deep & recurrent learning
• Sequence-to-sequence mapping (e.g. "TURN IT OFF" → Finnish "SAMMUTA SE")
=> artificial intelligence
Statistical NLP
How do we deal with natural language?
Measures of Dispersion
Variance and standard deviation indicate the variability in word or sentence lengths. High variance suggests a mix of very short and very long words or sentences, while low variance indicates more uniform lengths. A minimal computation is sketched below.
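A short sketch of these metrics over sentence lengths, using only the Python standard library; the example sentences are our own:

```python
import statistics

sentences = [
    "Language is everywhere.",
    "Statistical models help us process natural language at scale.",
    "Why is it hard?",
]

# Sentence length measured in words (split on whitespace).
lengths = [len(s.split()) for s in sentences]  # [3, 9, 4]

mean = statistics.mean(lengths)
variance = statistics.pvariance(lengths)  # population variance
stdev = statistics.pstdev(lengths)        # population standard deviation

print(f"lengths={lengths}, mean={mean:.2f}, variance={variance:.2f}, stdev={stdev:.2f}")
```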
Statistical Methods for Text Analysis
1. Probability distributions
2. N-gram models (a bigram sketch follows below)
3. TF and IDF
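The slide only names n-gram models, so here is a minimal bigram language model sketch; the toy corpus and the `bigram_prob` helper are our own, using maximum-likelihood estimates without smoothing:

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, twice followed by "cat"
print(bigram_prob("cat", "sat"))  # 1/2
```

The probability of a whole sentence is then approximated as the product of its bigram probabilities.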
Let’s compute the TF scores of the words “the” and “cat” (i.e. the query words) with respect to the documents D1, D2, and D3, where TF(word, document) = (occurrences of word in document) / (total words in document).
TF(“the”, D1) = 2/6 = 0.33
TF(“the”, D2) = 1/7 = 0.14
TF(“the”, D3) = 1/4 = 0.25
TF(“cat”, D1) = 1/6 = 0.17
TF(“cat”, D2) = 1/7 = 0.14
TF(“cat”, D3) = 0/4 = 0
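The documents themselves are not shown here, so the ones in this sketch are our own assumptions, chosen only so that the word counts reproduce the TF scores above:

```python
def tf(word, document):
    """Term frequency: relative count of `word` among the document's words."""
    words = document.lower().split()
    return words.count(word) / len(words)

# Hypothetical documents matching the counts used above
# (D1: 6 words, "the" twice; D2: 7 words; D3: 4 words, no "cat").
D1 = "the cat sat on the mat"
D2 = "a dog chased the cat outside yesterday"
D3 = "the dog is sleeping"

for name, doc in [("D1", D1), ("D2", D2), ("D3", D3)]:
    print(name, round(tf("the", doc), 2), round(tf("cat", doc), 2))
# D1 0.33 0.17
# D2 0.14 0.14
# D3 0.25 0.0
```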
IDF can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm. If the word is very common and appears in all the documents, this value approaches 0; the rarer the word, the larger its IDF.
IDF(word) = log(number of documents / number of documents that contain the word)
Let’s compute the IDF scores of the words “the” and “cat” (using the base-10 logarithm):
IDF(“the”) = log(3/3) = log(1) = 0
IDF(“cat”) = log(3/2) = 0.18
Multiplying TF and IDF gives the TF-IDF score of a word in a
document. The higher the score, the more relevant that word is in
that particular document.
TF-IDF(word, document) = TF(word, document) *
IDF(word)
Let’s compute the TF-IDF scores of the words “the” and “cat”.
TF-IDF(“the”, D1) = 0.33 * 0 = 0
TF-IDF(“the”, D2) = 0.14 * 0 = 0
TF-IDF(“the”, D3) = 0.25 * 0 = 0
TF-IDF(“cat”, D1) = 0.17 * 0.18 = 0.0306
TF-IDF(“cat”, D2) = 0.14 * 0.18 = 0.0252
TF-IDF(“cat”, D3) = 0 * 0.18 = 0
The next step is to use a ranking function to order the documents according to the TF-IDF scores of their words. We can use the average TF-IDF score of the query words in each document to rank D1, D2, and D3 with respect to the query Q.
It looks like the word “the” does not contribute to the TF-IDF score of any document. This is because “the” appears in all of the documents, so it is considered an uninformative word.
In conclusion, when performing the query “The cat” over the collection of documents D1, D2, and D3, the ranked results are D1, D2, D3: D1 has the highest average TF-IDF for the query words (driven entirely by “cat”), and D3, which does not contain “cat” at all, comes last.
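Putting the steps together, the following self-contained sketch recomputes the IDF and average TF-IDF scores and ranks the documents; the TF table is copied from the worked example, and the small differences from the rounded values above (0.0306, 0.0252) come from not rounding intermediate results:

```python
import math

# TF values from the worked example above.
tf = {
    "D1": {"the": 2/6, "cat": 1/6},
    "D2": {"the": 1/7, "cat": 1/7},
    "D3": {"the": 1/4, "cat": 0/4},
}

def idf(word, docs):
    """IDF(word) = log10(total documents / documents containing the word)."""
    containing = sum(1 for scores in docs.values() if scores[word] > 0)
    return math.log10(len(docs) / containing)

query = ["the", "cat"]

# Average TF-IDF of the query words per document, ranked high to low.
avg_scores = {
    doc: sum(tf[doc][w] * idf(w, tf) for w in query) / len(query)
    for doc in tf
}
for doc, score in sorted(avg_scores.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 4))
# D1 0.0147
# D2 0.0126
# D3 0.0
```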
Bayesian Networks:
• Bayesian networks are probabilistic graphical models representing variables and their conditional dependencies.
• They are used in NLP for tasks like part-of-speech tagging, named entity recognition, and language understanding; a chain-shaped example is sketched below.
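As a minimal illustration (our own toy numbers, not from the source), part-of-speech tagging can be framed as a chain-shaped Bayesian network in which each tag depends on the previous tag and each word depends on its tag, so the joint probability factorizes into local conditional probabilities:

```python
# Toy chain-shaped Bayesian network for POS tagging:
# tag_1 -> tag_2 -> ... (transitions) and tag_i -> word_i (emissions).
# All probabilities are made-up illustrative numbers.

p_start = {"DET": 0.6, "NOUN": 0.4}                      # P(tag_1)
p_trans = {("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
           ("NOUN", "DET"): 0.3, ("NOUN", "NOUN"): 0.7}  # P(tag_i | tag_{i-1})
p_emit = {("DET", "the"): 0.8, ("DET", "cat"): 0.0,
          ("NOUN", "the"): 0.01, ("NOUN", "cat"): 0.2}   # P(word_i | tag_i)

def joint(words, tags):
    """P(words, tags) = P(tag_1) * prod_i P(tag_i | tag_{i-1}) * prod_i P(word_i | tag_i)."""
    p = p_start[tags[0]] * p_emit[(tags[0], words[0])]
    for i in range(1, len(words)):
        p *= p_trans[(tags[i - 1], tags[i])] * p_emit[(tags[i], words[i])]
    return p

print(joint(["the", "cat"], ["DET", "NOUN"]))  # 0.6 * 0.8 * 0.9 * 0.2 = 0.0864
```

This factorization is exactly the hidden Markov model mentioned earlier under METHODS, viewed as a special case of a Bayesian network.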
Time Series Analysis