
LAB MANUAL

FOR

NATURAL LANGUAGE PROCESSING


WITH PYTHON LABORATORY

II B. TECH II SEMESTER (JNTUK-R20)

NEWTON’S INSTITUTE OF ENGINEERING COLLEGE


ALUGURAJUPALLI, KOPPUNOOR, MACHERLA, PALNADU, 522426

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING


Introduction to Natural Language Processing

NLP is a branch of data science that provides systematic processes for analyzing, understanding, and
deriving information from text data in a smart and efficient manner. By utilizing NLP and its
components, one can organize massive chunks of text data, perform numerous automated tasks, and
solve a wide range of problems such as automatic summarization, machine translation, named entity
recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, here are some terms that are used throughout this manual:

 Tokenization – the process of converting a text into tokens
 Tokens – the words or entities present in the text
 Text object – a sentence, a phrase, a word, or an article

Steps to install NLTK and its data:

Install Pip: run in terminal:

sudo easy_install pip

Install NLTK: run in terminal:

sudo pip install -U nltk

Download NLTK data: open a Python shell (in the terminal) and run the following code:

```
import nltk
nltk.download()
```
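Note that nltk.download() with no arguments opens an interactive downloader. On a machine without a GUI, the individual resources used in the exercises below can be fetched by name instead; a minimal sketch (resource names as used by recent NLTK 3.x releases):

```
import nltk

# tokenizer models, WordNet (for lemmatization), stop word lists and the POS tagger
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# quick check that tokenization works after the download
from nltk import word_tokenize
print(word_tokenize("Natural Language Processing is fun."))
# ['Natural', 'Language', 'Processing', 'is', 'fun', '.']
```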

Text Preprocessing

Since text is the most unstructured form of all the available data, various types of noise are present in it,
and the data is not readily analyzable without pre-processing. The entire process of cleaning and
standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly comprises three steps:

 Noise Removal
 Lexicon Normalization
 Object Standardization

1. Demonstrate Noise Removal for any textual data and remove regular expression patterns such
as hashtags from textual data.

Noise Removal
Any piece of text which is not relevant to the context of the data and the end output can be treated as
noise.
For example – language stopwords (commonly used words of a language – is, am, the, of, in, etc.),
URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words.
This step deals with the removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the
text object token by token (or word by word), eliminating the tokens that are present in the noise
dictionary. A minimal sketch of this dictionary approach is shown first, followed by the regex-based
variant asked for in this exercise.
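A minimal sketch of the dictionary approach described above (the noise list here is illustrative, not a standard stop word list):

def _remove_noise(input_text):
    noise_list = ["is", "a", "this", "the"]   # illustrative noise dictionary
    words = input_text.split()
    # keep only the tokens that are not present in the noise dictionary
    noise_free_words = [word for word in words if word.lower() not in noise_list]
    return " ".join(noise_free_words)

print(_remove_noise("this is a sample text"))
>>> "sample text"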

# Sample code to remove a regex pattern

import re

def _remove_regex(input_text, regex_pattern):
    # find every match of the pattern and delete it from the text
    matches = re.finditer(regex_pattern, input_text)
    for match in matches:
        input_text = re.sub(re.escape(match.group().strip()), '', input_text)
    return input_text

regex_pattern = r"#[\w]*"

print(_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern))

2. Perform lemmatization and stemming using the Python library NLTK.

Lexicon Normalization

Another type of textual noise is the multiple representations exhibited by a single word.

For example, "play", "player", "played", "plays" and "playing" are different variations of the word "play".
Though they look different, contextually they are all similar. This step converts all such disparities of a word
into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering
with text, as it converts high-dimensional features (N different features) into a low-dimensional space
(1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

 Stemming: Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
 Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).

Below is sample code that performs lemmatization and stemming using Python's popular library, NLTK.

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "multiplying"

lem.lemmatize(word, "v")
>> "multiply"
stem.stem(word)
>> "multipli"
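As a quick contrast on another word (reusing the lem and stem objects created above; the example word is illustrative):

word = "studies"
lem.lemmatize(word, "n")
>> "study"
stem.stem(word)
>> "studi"

The lemmatizer maps the word to its dictionary form, while the stemmer only strips the suffix.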

3. Demonstrate object standardization, such as replacing social media slang in a text.

Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionary. These
pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of
regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below
uses a dictionary lookup method to replace social media slang in a text.

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome',
               'luv': 'love'}   # ... extend with further slang terms as needed

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # replace the word if its lower-cased form appears in the slang dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

print(_lookup_words("RT this is a retweeted tweet by Shiva kumar"))
>>> "Retweet this is a retweeted tweet by Shiva kumar"

4. Perform part of speech tagging on any textual data.

Syntactic Parsing
Part of speech tagging – Apart from grammar relations, every word in a sentence is also associated with a part of speech
(POS) tag (nouns, verbs, adjectives, adverbs, etc.). The POS tags define the usage and function of a word in the sentence.
The complete list of possible POS tags is the Penn Treebank tag set defined by the University of Pennsylvania. The
following code uses NLTK to perform POS tagging annotation on input text (NLTK provides several implementations;
the default one is the perceptron tagger).

Part of Speech tagging is used for many important purposes in NLP:

A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example, consider the
two sentences below:
I. “Please book my flight for Delhi”
II. “I am going to read this book in the flight”

"Book" is used in a different context in each sentence, and the part of speech tag for the two cases is different: in sentence I
the word "book" is used as a verb, while in II it is used as a noun. (The Lesk algorithm is also used for similar purposes.)

B. Improving word-based features: A learning model can learn different contexts of a word when words
are used as features; however, if the part of speech tag is linked with them, the context is preserved,
thus making stronger features. For example:
Sentence -“book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1),
(“this_DT”, 1), (“book_NN”, 1)

C. Normalization and Lemmatization: POS tags are the basis of lemmatization process for converting a word to
its base form (lemma).

D. Efficient stopword removal: POS tags are also useful for efficient removal of stopwords. For example, some
tags always identify the low-frequency / less important words of a language, such as (IN – "within", "upon",
"except"), (CD – "one", "two", "hundred"), and (MD – "may", "must", etc.). A minimal sketch of this idea follows
the tagging example below.

from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'),
('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ...]
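As a follow-up to item D above, a minimal sketch of dropping tokens by POS tag; the set of tags to discard is illustrative rather than a standard list:

from nltk import word_tokenize, pos_tag

drop_tags = {'IN', 'CD', 'MD', 'DT'}   # prepositions, numbers, modals, determiners

tokens = word_tokenize("I will read this book in the flight")
kept = [word for word, tag in pos_tag(tokens) if tag not in drop_tags]
print(kept)
>>> ['I', 'read', 'book', 'flight']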

5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.

Topic Modeling
Topic modeling is the process of automatically identifying the topics present in a text corpus; it derives the
hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as "a
repeating pattern of co-occurring terms in a corpus". A good topic model results in "health", "doctor",
"patient", "hospital" for a topic such as Healthcare, and "farm", "crops", "wheat" for a topic such as Farming.

Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. Following is the code to
implement topic modeling using LDA in Python.

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting the list of documents (corpus) into a Document-Term Matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for the LDA model using the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())
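Because doc.split() keeps stop words and punctuation, the topics above tend to be dominated by words such as "my" and "to". A minimal sketch of a slightly cleaner preprocessing step, assuming the NLTK stopwords corpus has been downloaded:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

def clean(doc):
    # lower-case, strip trailing punctuation and drop stop words
    tokens = [w.lower().strip('.,') for w in doc.split()]
    return [w for w in tokens if w and w not in stop]

doc_clean = [clean(doc) for doc in doc_complete]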

6. Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.

Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this section:

A. Term Frequency – Inverse Document Frequency (TF – IDF)


TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text
documents into vector models on the basis of the occurrence of words in the documents, without
considering the exact ordering. For example, say there is a dataset of N text documents. For any
document "D", TF and IDF are defined as follows:

Term Frequency (TF) – TF for a term "t" is defined as the count of the term "t" in the document "D".

Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the total
number of documents available in the corpus to the number of documents containing the term t.

TF-IDF – the TF-IDF score gives the relative importance of a term in a corpus (list of documents):
TF-IDF(t, D) = TF(t, D) * log(N / number of documents containing t). Following is code using Python's
scikit-learn package to convert text into TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

>>>
(0, 1) 0.345205016865
(0, 4) ... 0.444514311537
(2, 1) 0.345205016865
(2, 4) 0.444514311537
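The sparse output above refers to terms by column index. A short way to map indices back to terms and to view the dense matrix (vocabulary_ and toarray() are standard scikit-learn / SciPy members):

print(obj.vocabulary_)   # term -> column index mapping learned by the vectorizer
print(X.toarray())       # dense document-term matrix of TF-IDF weights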

7. Demonstrate word embeddings using word2vec.

Word Embedding (text vectors)

Word embedding is the modern way of representing words as vectors. The aim of word embedding is to
redefine high-dimensional word features as low-dimensional feature vectors while preserving the
contextual similarity in the corpus. Word embeddings are widely used in deep learning models such as
Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are the two popular models for creating word embeddings from text. These models
take a text corpus as input and produce word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called
Continuous Bag of Words (CBOW), and another shallow neural network model called skip-gram. These
models are widely used for many other NLP problems. Word2Vec first constructs a vocabulary from the
training corpus and then learns the word embedding representations. The following code uses the gensim
package to prepare the word embeddings as vectors.

from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

# in gensim 4.x the trained vectors live on model.wv
# (exact values vary from run to run because the weights are randomly initialized)
print(model.wv.similarity('data', 'science'))
>>> 0.11222489293

print(model.wv['learning'])
>>> array([ 0.00459356  0.00303564 -0.00467622  0.00209638, ...])
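A minimal sketch of the main training parameters, assuming gensim 4.x, where the embedding dimensionality is set with vector_size and sg=1 selects the skip-gram architecture (sg=0, the default, is CBOW):

# reuses the sentences list from the example above
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)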

8. Implement text classification using the Naive Bayes classifier and the TextBlob library.

Text Classification

Text classification is one of the classical problems of NLP. Well-known examples include email spam
identification, topic classification of news, sentiment classification, and organization of web pages by
search engines.

Text classification, in plain words, is a technique to systematically classify a text object (document or
sentence) into one of a fixed set of categories. It is really helpful when the amount of data is too large,
especially for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction. First, the text
input is processed and features are created. The machine learning model then learns these features and
is used for predicting against new text.

Here is code that uses a Naive Bayes classifier from the TextBlob library (built on top of NLTK).

from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob

training_corpus = [
    ('I am exhausted of this work.', 'Class_B'),
    ("I can't cooperate with this", 'Class_B'),
    ('He is my badest enemy!', 'Class_B'),
    ('My management is poor.', 'Class_B'),
    ('I love this burger.', 'Class_A'),
    ('This is an brilliant place!', 'Class_A'),
    ('I feel very good about these dates.', 'Class_A'),
    ('This is my best work.', 'Class_A'),
    ("What an awesome view", 'Class_A'),
    ('I do not like this dish', 'Class_B')]

test_corpus = [
    ("I am not feeling well today.", 'Class_B'),
    ("I feel brilliant!", 'Class_A'),
    ('Gary is a friend of mine.', 'Class_A'),
    ("I can't believe I'm doing this.", 'Class_B'),
    ('The date was good.', 'Class_A'),
    ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus)

print(model.classify("Their codes are amazing."))
>>> "Class_A"
print(model.classify("I don't like their computer."))
>>> "Class_B"
print(model.accuracy(test_corpus))
>>> 0.83

9. Apply a support vector machine for text classification.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# preparing data for the SVM model (using the same training_corpus and test_corpus as in the Naive Bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = []
test_labels = []
for row in test_corpus:
    test_data.append(row[0])
    test_labels.append(row[1])

# Create feature vectors
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)
# Apply model on test data
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)
prediction = model.predict(test_vectors)
print(prediction)
>>> ['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']

print(classification_report(test_labels, prediction))
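With only ten short training sentences, min_df=4 keeps very few terms in the vocabulary; on a corpus this small, a looser setting is usually more informative. A sketch of an alternative (not a tuned choice):

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9)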

10. Convert text to vectors (using term frequency) and apply cosine similarity to measure the closeness
between two texts.

Text Matching / Similarity

One of the important areas of NLP is the matching of text objects to find similarities. Important
applications of text matching include automatic spelling correction, data de-duplication, and genome
analysis.
A number of text matching techniques are available depending upon the requirement. This
section describes the important techniques in detail.

A. Levenshtein Distance – The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with the allowable edit operations
being insertion, deletion, or substitution of a single character. (A short memory-efficient sketch of
this distance appears after the cosine similarity example below.)
B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person's name, a
location name, etc.) and produces a character string that identifies a set of words that are (roughly)
phonetically similar. It is very useful for searching large text corpora, correcting spelling errors, and
matching relevant names. Soundex and Metaphone are two main phonetic algorithms used for this
purpose. Python's module Fuzzy can be used to compute Soundex strings for different words.
C. Flexible String Matching – A complete text matching system includes different algorithms
pipelined together to compute a variety of text variations. Regular expressions are really helpful for
this purpose as well. Other common techniques include exact string matching, lemmatized
matching, and compact matching (which takes care of spaces, punctuation, slang, etc.).
D. Cosine Similarity – When the text is represented in vector notation, a general cosine similarity can
be applied to measure vectorized similarity. The following code converts text to vectors
(using term frequency) and applies cosine similarity to provide the closeness between two texts.

import math
from collections import Counter

def get_cosine(vec1, vec2):
    # dot product over the terms common to both vectors
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    # product of the two vector norms
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(cosine)

>>> 0.62
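As referenced in item A above, a minimal sketch of the Levenshtein distance using a memory-efficient, two-row dynamic programming approach (the function name levenshtein is illustrative):

def levenshtein(s1, s2):
    # keep only the previous row of the dynamic programming table in memory
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insert_cost = current[j] + 1
            delete_cost = previous[j + 1] + 1
            substitute_cost = previous[j] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("analyze", "analyse"))
>>> 1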

11. Case study 1: Identify the sentiment of tweets

In this problem, you are provided with tweet data and asked to predict netizens' sentiment toward
electronic products.

Introduction
Natural Language Processing (NLP) is a hotbed of research in data science these days and one of the
most common applications of NLP is sentiment analysis. From opinion polls to creating entire
marketing strategies, this domain has completely reshaped the way businesses work, which is why this is
an area every data scientist must be familiar with.

Thousands of text documents can be processed for sentiment (and other features including named
entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to
manually complete the same task.

We will solve this problem by following the sequence of steps needed for a general sentiment analysis
task. We will start with preprocessing and cleaning of the raw text of the tweets. Then we will explore the
cleaned text and try to get some intuition about the context of the tweets. After that, we will extract
numerical features from the data and finally use these feature sets to train models and identify the
sentiments of the tweets. A minimal sketch of such a pipeline follows.
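The sketch below reuses the TF-IDF and SVM tools from earlier exercises. The file name tweets.csv and its columns 'tweet' and 'label' are placeholders for illustration, not part of the original case study data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm

data = pd.read_csv('tweets.csv')   # hypothetical file with columns 'tweet' and 'label'

# basic cleaning: lower-case and strip mentions, hashtags and URLs
data['clean'] = data['tweet'].str.lower().str.replace(r'@\w+|#\w+|http\S+', '', regex=True)

X_train, X_test, y_train, y_test = train_test_split(data['clean'], data['label'], test_size=0.2)

vectorizer = TfidfVectorizer()
train_vec = vectorizer.fit_transform(X_train)
test_vec = vectorizer.transform(X_test)

model = svm.SVC(kernel='linear')
model.fit(train_vec, y_train)
print(model.score(test_vec, y_test))   # accuracy of the sentiment classifier on held-out tweets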
