Text Pre-Processing With NLTK
NLTK
Text mining
It refers to data mining using text
documents as data.
There are many special techniques for
pre-processing text documents to make
them suitable for mining.
Most of these techniques are from the field
of “Information Retrieval”.
Information Retrieval (IR)
Conceptually, information retrieval (IR) is the
study of finding needed information; that is, IR helps
users find information that matches their
information needs.
Historically, information retrieval is about
document retrieval, emphasizing document as
the basic unit.
Technically, IR studies the acquisition,
organization, storage, retrieval, and distribution of
information.
IR has become a center of focus in the Web era.
Text Processing
Word (token) extraction
Stop words
Stemming
Frequency counts
Tokenization
splitting text into individual words (tokens), from
which a vocabulary may be constructed.
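In NLTK, tokenization is typically done with nltk.word_tokenize (after downloading the punkt models). A dependency-free sketch using a simple regular expression instead, with frequency counts via collections.Counter:

```python
import re
from collections import Counter

text = "Text mining refers to data mining using text documents as data."
# extract word tokens (a simple regex stand-in for nltk.word_tokenize)
tokens = re.findall(r"[A-Za-z]+", text.lower())
# frequency counts over the token list
counts = Counter(tokens)
print(tokens)
print(counts.most_common(3))
```

The counts show that "text", "mining", and "data" each occur twice in this sentence.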
Why do we need to remove stop words?
Reduce indexing (or data) file size
stop words account for 20-30% of total word counts.
Improve efficiency
stop words are not useful for searching or text mining
import nltk
from nltk.corpus import stopwords
# first run: nltk.download('stopwords')
print(stopwords.words('english'))
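Once the list is loaded, removing stop words is a simple filter over the token list. A minimal sketch, where the small set below stands in for NLTK's full stopwords.words('english') list:

```python
# a few entries standing in for NLTK's English stop word list
stop_words = {"the", "a", "is", "in", "of", "and", "to"}

tokens = ["the", "cat", "is", "in", "the", "garden"]
# keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'garden']
```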
Stemming
Techniques used to find out the root/stem of a
word:
E.g.,
user, users, used, using --> stem: use
engineering, engineered, engineer --> stem: engineer
Usefulness
improving effectiveness of IR and text mining
matching similar words
reducing indexing size
combining words with the same roots can reduce the
size of the corpus by as much as 40-50%.
Stemming Algorithms
Porter Stemmer
Snowball Stemmer
Lancaster Stemmer
Regex-based Stemmer
NLTK Code for Stemmer
import nltk
from nltk.stem import PorterStemmer
ps = PorterStemmer()  # create a stemmer object
print(ps.stem('coder'))
print(ps.stem('coding'))
print(ps.stem('code'))
Basic stemming methods
remove ending
if a word ends with a consonant other than s,
followed by an s, then delete s.
if a word ends in es, drop the s.
if a word ends in ing, delete the ing unless the
remaining word consists only of one letter or of th.
If a word ends with ed, preceded by a consonant,
delete the ed unless this leaves only a single letter.
…...
transform words
if a word ends with “ies” but not “eies” or “aies” then
“ies --> y.”
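The rules above can be sketched in plain Python. This is one possible ordering of the rules (the "ies" transform is checked before the generic s-removal so that "flies" becomes "fly"); the function name basic_stem is ours:

```python
VOWELS = "aeiou"

def basic_stem(word):
    # transform rule: "ies" -> "y", but not for "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # delete "ing" unless only one letter or "th" would remain
    if word.endswith("ing"):
        rest = word[:-3]
        return rest if len(rest) > 1 and rest != "th" else word
    # delete "ed" when preceded by a consonant, unless a single letter remains
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    # ends in "es": drop the s
    if word.endswith("es"):
        return word[:-1]
    # ends with a consonant other than s, followed by s: drop the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS + "s":
        return word[:-1]
    return word

print([basic_stem(w) for w in ["flies", "running", "wanted", "cats", "thing"]])
```

Note that crude suffix stripping produces non-words such as "runn"; the Porter stemmer adds measure conditions and recoding steps to handle such cases.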
Term Frequency Inverse Document Frequency
To put it in more formal mathematical terms,
the TF-IDF score for the word t in the
document d from the document set D is
calculated as follows:
tf-idf(t, d, D) = tf(t, d) * idf(t, D)
Where:
tf(t, d) = (count of t in d) / (total number of terms in d)
idf(t, D) = log(N / (number of documents in D containing t)), with N = |D|
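A minimal sketch of this computation in plain Python, assuming tf is the normalized count and idf uses the natural log (libraries such as scikit-learn's TfidfVectorizer use a smoothed variant, so exact values differ):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]
N = len(docs)

def tf(t, d):
    return d.count(t) / len(d)           # normalized term frequency

def idf(t):
    df = sum(1 for d in docs if t in d)  # number of documents containing t
    return math.log(N / df)

def tf_idf(t, d):
    return tf(t, d) * idf(t)

print(tf_idf("the", docs[0]))  # 0.0 -- "the" appears in every document
print(tf_idf("cat", docs[0]))  # positive -- "cat" is rare in the collection
```

A word that occurs in every document gets idf = log(1) = 0, so its TF-IDF score is zero regardless of how often it appears.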
Cosine Similarity
Cos(x, y) = x . y / (||x|| * ||y||)
Where:
x . y = dot product of the vectors 'x' and 'y'
||x|| and ||y|| = lengths of the two vectors 'x' and 'y'.
Example:
Consider an example to find the similarity between two
vectors, x and y, using cosine similarity.
The x vector has values x = { 3, 2, 0, 5 }
The y vector has values y = { 1, 0, 0, 0 }
The formula for calculating the cosine similarity is:
Cos(x, y) = x . y / (||x|| * ||y||)
x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √(3² + 2² + 0² + 5²) = √38 ≈ 6.16
||y|| = √(1² + 0² + 0² + 0²) = 1
Cos(x, y) = 3 / (6.16 * 1) ≈ 0.49
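The same worked computation in plain Python:

```python
import math

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

dot = sum(a * b for a, b in zip(x, y))     # 3
norm_x = math.sqrt(sum(a * a for a in x))  # sqrt(38), about 6.16
norm_y = math.sqrt(sum(b * b for b in y))  # 1.0
cos = dot / (norm_x * norm_y)
print(round(cos, 2))  # 0.49
```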
Manhattan Distance.
Jaccard Similarity.
Minkowski Distance.
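Minimal sketches of these measures in plain Python: Jaccard is shown on token sets, the other two on the vectors from the cosine example; Minkowski generalizes Manhattan (p=1) and Euclidean (p=2) distance:

```python
def manhattan(x, y):
    # sum of absolute coordinate differences (L1 distance)
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # (sum of |a - b|^p) ^ (1/p); p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def jaccard(a, b):
    # set overlap: |intersection| / |union|
    return len(a & b) / len(a | b)

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]
print(manhattan(x, y))      # 9
print(minkowski(x, y, 2))   # Euclidean distance
print(jaccard({"the", "cat", "sat"}, {"the", "dog", "sat"}))  # 0.5
```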
Applications of TF-IDF
Documents represented as TF-IDF weight vectors can be compared with cosine similarity:
Similarity(Di, Dj) = Σ(k=1..n) tik * tjk / ( √(Σ(k=1..n) tik²) * √(Σ(k=1..n) tjk²) )
where tik is the weight of term k in document Di.
Vector Space Representation
Each doc j is a vector, one component for each
term (= word).
Have a vector space
terms are attributes
n docs live in this space
even with stop word removal and stemming, we
may have 10,000+ dimensions, or even 1,000,000+.
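A sketch of building such a document-term matrix for a toy corpus, with vocabulary terms as attributes and one count vector per document:

```python
docs = ["the cat sat", "the dog sat on the mat"]
tokens = [d.split() for d in docs]
# the vocabulary: one attribute (dimension) per distinct term
vocab = sorted({w for d in tokens for w in d})
# each document becomes a vector of term counts over the vocabulary
vectors = [[d.count(t) for t in vocab] for d in tokens]
print(vocab)
print(vectors)
```

Most entries are zero even in this tiny example; with 10,000+ dimensions, real systems store these vectors sparsely (only the nonzero components).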