NLP Cheatsheet
The goal of Natural Language Processing (NLP) is to analyze and extract
meaningful information from natural human language, such as speech and
text.
Sentence Tokenization

>>> import nltk
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better."
>>> sent_tokenize(para)
['You can do this.', 'Don’t tell people your plans.', 'Show them your results.',
 'No pressure, no diamonds.', 'Try Again.', 'Fail again.', 'Fail better.']

Word Tokenization

>>> word_tokenize(para)
['You', 'can', 'do', 'this', '.', 'Don', '’', 't', 'tell', 'people', 'your',
 'plans', '.', 'Show', 'them', 'your', 'results', '.', 'No', 'pressure', ',',
 'no', 'diamonds', '.', 'Try', 'Again', '.', 'Fail', 'again', '.', 'Fail', 'better', '.']

Stemming

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.',
...                   'pressure', ',', 'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> stemmed_text = [stemmer.stem(word) for word in tokenized_text if word not in '.,’']
>>> stemmed_text
['tell', 'peopl', 'plan', 'show', 'result', 'pressur', 'diamond', 'tri', 'fail', 'fail', 'better']

Lemmatization

>>> from nltk.stem import WordNetLemmatizer
>>> nltk.download('wordnet')
>>> nltk.download('omw-1.4')
>>> lemmatizer = WordNetLemmatizer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.',
...                   'pressure', ',', 'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> lemmatized_text = [lemmatizer.lemmatize(word) for word in tokenized_text if word not in '.,’']
>>> lemmatized_text
['tell', 'people', 'plan', 'show', 'result', 'pressure', 'diamond', 'try', 'fail', 'fail', 'better']
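The tokenized_text list used in the stemming and lemmatization examples can be produced from the same paragraph; a minimal sketch, assuming the NLTK stopwords corpus has been downloaded, that lowercases the word tokens and drops English stop words:

>>> from nltk.corpus import stopwords
>>> nltk.download('stopwords')
>>> stop_words = set(stopwords.words("english"))
>>> # keep only words (and punctuation) that are not English stop words
>>> tokenized_text = [word for word in word_tokenize(para.lower()) if word not in stop_words]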
3.1 Bag of Words (BOW)
A Bag of Words is a representation of extracted text features as a vector that
describes the occurrence of words within a document.
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize, sent_tokenize
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> stop_words = set(stopwords.words("english"))
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.".lower()
>>> tokenized_sent = sent_tokenize(para)
>>> tokenized_words_sent = [word_tokenize(sent) for sent in tokenized_sent]
>>> filtered_sent_token = [[word for word in sent if word.casefold() not in stop_words]
...                        for sent in tokenized_words_sent]
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatized_token = [' '.join([lemmatizer.lemmatize(word) for word in sent_token
...                               if word not in '.,’'])
...                     for sent_token in filtered_sent_token]
>>> vectorizer = CountVectorizer()
>>> bag_of_words = vectorizer.fit_transform(lemmatized_token).toarray()
>>> bag_of_words
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
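The columns of the array follow the vectorizer's alphabetically sorted vocabulary; in recent scikit-learn versions (1.0 and later) it can be inspected with get_feature_names_out. For the sentences above the expected vocabulary would be:

>>> vectorizer.get_feature_names_out()
array(['better', 'diamond', 'fail', 'people', 'plan', 'pressure',
       'result', 'show', 'tell', 'try'], dtype=object)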
3.3 TF-IDF
TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is
often used in information retrieval and text mining; it is a statistical measure of
how important a word is to a document in a collection or corpus.
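In formula form, for a term t in a document d drawn from a corpus of N documents:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t))

Here tf(t, d) is how often t occurs in d, and df(t) is the number of documents that contain t. This is the convention used in the example below.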
Example: a corpus of 3 documents containing words such as ai, machine, smart, technique, using, learns, deep, powerful (Freq = number of documents containing the word).

Word        Freq      IDF
ai          1         log(3/1)
technique   2         log(3/2)
powerful    1         log(3/1)
...         ...       ...
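A minimal sketch that reproduces the IDF column by hand (assuming a natural logarithm, since the table does not fix the log base) and turns it into a TF-IDF weight:

>>> import math
>>> n_docs = 3                                           # documents in the corpus
>>> doc_freq = {'ai': 1, 'technique': 2, 'powerful': 1}  # Freq column above
>>> idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
>>> {word: round(value, 3) for word, value in idf.items()}
{'ai': 1.099, 'technique': 0.405, 'powerful': 1.099}
>>> round(2 * idf['technique'], 3)   # tf-idf of 'technique' in a document where it occurs twice
0.811

Note that scikit-learn's TfidfVectorizer uses a smoothed variant of this formula by default, so its values will differ slightly from this hand calculation.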
3.4 TF-IDF