NLP Cheatsheet
The goal of Natural Language Processing (NLP) is to analyze and extract
meaningful information from natural human language, such as speech and
text.
Sentence Tokenization

>>> import nltk
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better."
>>> sent_tokenize(para)
['You can do this.', 'Don’t tell people your plans.', 'Show them your results.',
 'No pressure, no diamonds.', 'Try Again.', 'Fail again.', 'Fail better.']

Word Tokenization

>>> word_tokenize(para)
['You', 'can', 'do', 'this', '.', 'Don', '’', 't', 'tell', 'people', 'your',
 'plans', '.', 'Show', 'them', 'your', 'results', '.', 'No', 'pressure', ',',
 'no', 'diamonds', '.', 'Try', 'Again', '.', 'Fail', 'again', '.', 'Fail', 'better', '.']

Stemming

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.',
...                   'pressure', ',', 'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> stemmed_text = [stemmer.stem(word) for word in tokenized_text if word not in '.,’']
>>> stemmed_text
['tell', 'peopl', 'plan', 'show', 'result', 'pressur', 'diamond', 'tri', 'fail', 'fail', 'better']

Lemmatization

>>> from nltk.stem import WordNetLemmatizer
>>> nltk.download('wordnet')
>>> nltk.download('omw-1.4')
>>> lemmatizer = WordNetLemmatizer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.',
...                   'pressure', ',', 'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> lemmatized_text = [lemmatizer.lemmatize(word) for word in tokenized_text if word not in '.,’']
>>> lemmatized_text
['tell', 'people', 'plan', 'show', 'result', 'pressure', 'diamond', 'try', 'fail', 'fail', 'better']
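The tokenized_text list used in the stemming and lemmatization examples can be produced from the same paragraph; a minimal sketch, assuming the NLTK stopwords corpus has been downloaded, that lowercases the word tokens and drops English stop words:

>>> from nltk.corpus import stopwords
>>> nltk.download('stopwords')
>>> stop_words = set(stopwords.words("english"))
>>> # keep only words (and punctuation) that are not English stop words
>>> tokenized_text = [word for word in word_tokenize(para.lower()) if word not in stop_words]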
3.1 Bag of Words (BOW)
A Bag of Words is a representation of extracted text features as a vector that
describes the occurrence of words within a document.
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize, sent_tokenize
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> stop_words = set(stopwords.words("english"))
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.".lower()
>>> tokenized_sent = sent_tokenize(para)
>>> tokenized_words_sent = [word_tokenize(sent) for sent in tokenized_sent]
>>> filtered_sent_token = [[word for word in sent if word.casefold() not in stop_words]
...                        for sent in tokenized_words_sent]
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatized_token = [' '.join([lemmatizer.lemmatize(word) for word in sent_token
...                               if word not in '.,’'])
...                     for sent_token in filtered_sent_token]
>>> vectorizer = CountVectorizer()
>>> bag_of_words = vectorizer.fit_transform(lemmatized_token).toarray()
>>> bag_of_words
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
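The columns of the array follow the vectorizer's alphabetically sorted vocabulary; in recent scikit-learn versions (1.0 and later) it can be inspected with get_feature_names_out. For the sentences above the expected vocabulary would be:

>>> vectorizer.get_feature_names_out()
array(['better', 'diamond', 'fail', 'people', 'plan', 'pressure',
       'result', 'show', 'tell', 'try'], dtype=object)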
3.3 TF-IDF
TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is
often used in information retrieval and text mining; it is a statistical measure of
how important a word is to a document in a collection or corpus.
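In formula form, for a term t in a document d drawn from a corpus of N documents:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t))

Here tf(t, d) is how often t occurs in d, and df(t) is the number of documents that contain t. This is the convention used in the example below.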
Example: a corpus of 3 documents containing words such as ai, machine, smart, technique, using, learns, deep, powerful (Freq = number of documents containing the word).

Word        Freq      IDF
ai          1         log(3/1)
technique   2         log(3/2)
powerful    1         log(3/1)
...         ...       ...
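A minimal sketch that reproduces the IDF column by hand (assuming a natural logarithm, since the table does not fix the log base) and turns it into a TF-IDF weight:

>>> import math
>>> n_docs = 3                                           # documents in the corpus
>>> doc_freq = {'ai': 1, 'technique': 2, 'powerful': 1}  # Freq column above
>>> idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
>>> {word: round(value, 3) for word, value in idf.items()}
{'ai': 1.099, 'technique': 0.405, 'powerful': 1.099}
>>> round(2 * idf['technique'], 3)   # tf-idf of 'technique' in a document where it occurs twice
0.811

Note that scikit-learn's TfidfVectorizer uses a smoothed variant of this formula by default, so its values will differ slightly from this hand calculation.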
3.4 TF-IDF