[Slides] Module 44
Youth
NLP Algorithms and
Applications (Language
Recognition Model,
Sentiment Analysis)
Legal Disclaimers
The Intel® Digital Readiness Programs and Intel® AI for Youth program are developed by Intel Corporation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands
may be claimed as the property of others. All rights reserved. Program dates and lesson plans are subject to change.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
2
Recap
1. Importance of Vectorization
2. Vectorization Methods
3. Importance of Preprocessing
Figure: two example sentences (“Can you set an alarm for 6 a.m.?” and “Where is the best restaurant in town?”), each mapped to a vector representation such as 4.5 2.3 1.3 0.3
Force of LSTM and GRU – Blog. (2022). Retrieved 10 September 2022, from https://dudeperf3ct.github.io/lstm/gru/nlp/2019/01/28/Force-of-LSTM-and-GRU/
5
Importance of Pre-Processing
High-dimensional vectors cause:
Increased training time for the AI model
More memory required to store the AI model
6
Removing Stopwords, Special Characters & Numbers
it is into in if
@ # $ % &
7
Converting Text to a Common Case
hello
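As a small illustration of why a common case helps, the sketch below counts distinct tokens before and after lowercasing. The sample sentence is invented for this example; fewer distinct tokens means a lower-dimensional vector.

```python
# Toy sentence (invented for illustration).
text = "Hello hello HELLO world World"

tokens = text.split()
vocab_raw = set(tokens)                       # case-sensitive vocabulary
vocab_lower = set(t.lower() for t in tokens)  # vocabulary after lowercasing

# Fewer distinct tokens means fewer vector dimensions.
print(len(vocab_raw), len(vocab_lower))
```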
8
Stemming vs Lemmatization
9
You will be able to answer the following:
1. What are the different real-world applications of NLP? How do they work?
4. Which pretrained models can be used to build language recognition and translation?
6. What are the pre-processing steps involved during sentiment analysis of the dataset?
10
Module Sections
11
NLP Pre-trained
Functions and Models
1. NLP Pre-trained Functions and Models
2. Language Detection and Translation
3. Sentiment Analysis
NLP Applications
Language Translation
Chatbots
13
All the applications we just saw use
Sequence-to-Sequence Models!
Let’s try to explore this further…
14
Sequence-to-sequence Modeling
The sequence-to-sequence (seq2seq) model is used in several applications that we use daily. For example,
Google Translate is built upon the core concept of seq2seq models.
15
What is sequence-to-sequence modeling?
▪ Sequence-to-sequence models are a particular class of neural network architectures.
▪ They are typically used to solve complex language problems like language translation, chatbot creation, text summarization, etc.
Language Translation: “Hello, how are you?” → Sequence Model → “Hola, ¿cómo estás?”
Chatbot: “How are you doing?” → Sequence Model → “I am very well. Thank you.”
16
Understanding this model requires an
understanding of a series of concepts that
are built on top of each other.
17
Sequence to Sequence Model
▪ A sequence-to-sequence model is a model that takes a sequence of items (words,
letters, features of an image, etc.) and outputs another sequence of items.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/
18
Sequence to Sequence Model
▪ In Language Translation, a sequence is a series of words, as shown below. It is
processed in a sequence, one after the other. Output is also generated similarly.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/
19
What is inside the Seq2Seq Box?
20
Introduction to Encoders and Decoders
▪ Inside the black box (the sequence-to-sequence model), we will find an encoder and a decoder.
▪ The Encoder takes each word in the sequence as input and processes them item by item.
▪ This processed information is compiled and stored in a vector.
▪ After processing the input, the Encoder sends the vector over to the Decoder.
▪ The Decoder produces the output in a sequence using the vector.
21
Introduction to Encoders and Decoders
▪ This vector is called the context, which is passed on from encoder to decoder.
22
What’s inside the encoder and decoder stage?
▪ Each input in the sequence needs a Neural Network(NN) processing unit, as shown below in the diagram.
▪ Encoders or decoders are chains of several NN units where each unit accepts a single element of a sequence.
NN NN NN NN NN NN NN
23
Vectors and Hidden States
▪ Each NN unit takes two inputs at each step: an input and a hidden state.
▪ In the encoder stage, the inputs are an individual word of the input sequence and a hidden state (h).
▪ However, these individual words need to be represented using vectors.
SEQUENCE TO SEQUENCE MODEL
h1 h2 h3
NN NN NN NN NN NN NN
24
Vectors and Hidden States
▪ Because encoders and decoders are both a specific class of NNs, each time one of the NN units does some
processing, it updates its hidden state based on its current input and the previous inputs it has received.
▪ Observe how the last hidden state is actually the context that is passed along to the decoder.
SEQUENCE TO SEQUENCE MODEL
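The hidden-state update described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the model from the slides: each word is replaced by a single invented number instead of a word vector, and the weights w_x and w_h are fixed, hypothetical values rather than learned ones.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One NN unit: combine the current input x with the previous hidden state h."""
    return math.tanh(w_x * x + w_h * h)

def encode(inputs):
    """Run the encoder over the sequence and return every hidden state."""
    h = 0.0                      # initial hidden state
    hidden_states = []
    for x in inputs:             # one element of the sequence per time step
        h = rnn_step(x, h)
        hidden_states.append(h)
    return hidden_states

# Invented stand-in values; real models use word vectors here.
word_values = {"how": 0.1, "are": 0.4, "you": 0.9}
inputs = [word_values[w] for w in ["how", "are", "you"]]

hidden_states = encode(inputs)
context = hidden_states[-1]      # the last hidden state is passed to the decoder
print(hidden_states, context)
```

Each hidden state depends on the current input and on everything seen before it, which is why the last one can serve as the context.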
25
Processing Unit
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/
26
This was a high-level look at the architecture
of the Sequence to Sequence Model.
27
Attention Mechanism
▪ Attention Mechanism is an improved technique over the basic seq2seq model architecture we just saw.
▪ The first difference is that instead of passing the last hidden state of the encoding stage,
the encoder passes all the hidden states to the decoder.
SEQUENCE TO SEQUENCE MODEL
28
Attention Mechanism
▪ The second difference is that the decoder in the attention mechanism goes through a series of steps before
producing the output:
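The individual steps are not reproduced in this text, so the sketch below assumes one common formulation: score every encoder hidden state against the current decoder state with a dot product, turn the scores into weights with softmax, and build a context as the weighted sum of the hidden states. The vectors and the function names (softmax, attend) are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # 1. Score each encoder hidden state against the decoder state (dot product).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Normalize the scores into attention weights.
    weights = softmax(scores)
    # 3. Weighted sum of the hidden states gives the context for this step.
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Invented two-dimensional hidden states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 1.0]

weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The hidden state that best matches the decoder state receives the largest weight, so it contributes most to the context used for the next output word.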
29
Look at this Model Structure
Try to answer the question on the next slide using this structure as a reference.
Figure: encoder-decoder structure with INPUT, ENCODER, DECODER and OUTPUT blocks
Understanding Encoder-Decoder Sequence to Sequence Model | by Simeon Kostadinov | Towards Data Science. (2022). Retrieved 14 September 2022, from https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346
30
If the input sequence is “How are you?” What will be x2
according to the encoder block discussed in the last slide?
Options: How, are, you, Hidden
31
If the input sequence is “How are you?” What will be x2
according to the encoder block discussed in the last slide?
Options: How, are, you, Hidden
32
Language Detection
and Translation
Let’s bring in AI Project Cycle again!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
34
Starting with a problem statement..
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
35
What is Language Barrier?
Learning new language gaining ground in India. (2022). Retrieved 14 September 2022, from https://www.thehansindia.com/amp/hans/young-hans/learning-new-language-gaining-ground-in-india-719417
36
How many languages are there?
37
Language Statistics
▪ According to Ethnologue,
7,151 languages are spoken in the
world today.
▪ These languages are spoken by
communities whose lives are
shaped by our rapidly changing
world.
▪ 40% of languages are
now endangered.
38
What is the most spoken language?
English is the largest language in the world if you count both native and non-native speakers. If you count only
native speakers, Mandarin Chinese is the largest.
What is the most spoken language?. (2022). Retrieved 14 September 2022, from https://www.ethnologue.com/guides/most-spoken-languages
39
Since Mandarin speakers are mostly native
speakers, language translation can come in handy
to facilitate global communication.
40
Top 10 Spoken Languages Worldwide
By Carmen Ang. The World’s Top 10 Most Spoken Languages. (2022). Retrieved 14 September 2022, from https://www.visualcapitalist.com/the-worlds-top-10-most-spoken-languages/
41
Why do some languages disappear?
Problem Scoping
disappearing languages - HeeJoo the Piggy. (2022). Retrieved 14 September 2022, from https://heejoothepiggy.weebly.com/blog/disappearing-languages
42
Impact of Dying Languages
43
NLP applications like language detection and
translation could be used to help preserve
languages and support healthy global
communication.
44
A Quick Analogy!
When you start analyzing the data, you realize the data is generated in different languages worldwide.
How would you solve this problem?
45
The first problem would be to know how we can
detect the language in which each tweet is written.
46
The solution would be to build a language
detection model for a scenario like this.
Let’s learn how to build a language detection model…
47
Where should we find the data?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
48
Hands-on Session:
Duration: 120 Minutes
Refer to [Jupyter Notebooks-Youth] Language Detection
and Translation
Data Acquisition
Dataset source:
https://www.kaggle.com/code/martinkk5575/language-detection/data
Data were initially extracted from the WiLI-2018 Wikipedia dataset.
▪ WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs of 235 languages. Each language in this dataset contains 1,000 rows/paragraphs.
▪ After data selection and pre-processing, 22 selected languages from the original dataset were used, namely:
⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ Indonesian
⦁ Dutch
⦁ Japanese
⦁ Thai
50
Load the Dataset
Data
Acquisition
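A minimal sketch of loading a text/label dataset, using only the standard library so that it runs anywhere. The column names "Text" and "language" and the three sample rows are assumptions made for illustration; the notebook may load the real file differently, for example into a data frame.

```python
import csv
import io

# Stand-in for the downloaded CSV file; rows and column names are assumed.
sample_csv = io.StringIO(
    "Text,language\n"
    "How are you today?,English\n"
    "Comment allez-vous aujourd'hui ?,French\n"
    "¿Cómo estás hoy?,Spanish\n"
)

rows = list(csv.DictReader(sample_csv))      # one dict per row
texts = [row["Text"] for row in rows]        # model inputs
labels = [row["language"] for row in rows]   # target classes

print(len(rows), labels)
```

With a real file, the io.StringIO object would be replaced by an opened file handle.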
51
The next stage is Data Exploration or Pre-processing!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
52
Let’s explore the data! Data
Exploration
53
Let’s explore the data! Data
Exploration
▪ Visualizing the languages present
in the data frame along with
the number of samples for each
language.
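The per-language frequency can be sketched with a simple counter and a text bar chart. The label list below is invented; the notebook presumably draws a graphical plot instead.

```python
from collections import Counter

# Invented labels standing in for the language column of the data frame.
labels = ["English", "French", "English", "Hindi", "French", "English"]

counts = Counter(labels)

# Text "bar chart": one row per language, most frequent first.
for language, n in counts.most_common():
    print(f"{language:<10} {'#' * n} ({n})")
```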
54
Let’s explore the data! Data
Exploration
55
Let’s explore the data! Data
Exploration
▪ The tqdm library is used to display
the progress of
execution.
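A minimal sketch of wrapping a loop in tqdm to get a progress bar. The sample texts and the lowercasing step are placeholders; the fallback definition only exists so the sketch still runs where tqdm is not installed.

```python
try:
    from tqdm import tqdm              # third-party progress bar
except ImportError:                    # fallback so the sketch still runs
    def tqdm(iterable, **kwargs):
        return iterable

# Invented sample texts.
texts = ["Hello, World!", "Bonjour le monde 123", "Hola   mundo"]

cleaned = []
for text in tqdm(texts, desc="Cleaning"):   # shows progress while iterating
    cleaned.append(text.lower())

print(cleaned)
```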
56
Let’s explore the data! Data
Exploration
▪ The join() method takes all
items in an iterable and joins
them into one string.
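For example, join() can rebuild a sentence from a list of words; the separator string it is called on is placed between the items. The word list is invented.

```python
words = ["where", "is", "the", "best", "restaurant"]

sentence = " ".join(words)     # separator " " goes between the items
print(sentence)                # where is the best restaurant
```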
57
Let’s explore the data! Data
Exploration
▪ Converting sentences into
vectors using CountVectorizer
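To show the idea behind count-based vectorization, here is a dependency-free sketch. It is a simplified stand-in, not CountVectorizer itself (which, for instance, applies its own tokenization rules); the function names fit_vocabulary and to_count_vector are hypothetical.

```python
def fit_vocabulary(sentences):
    """Assign each distinct word a column index (alphabetical order)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def to_count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Invented sample sentences.
sentences = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(sentences)
vectors = [to_count_vector(s, vocabulary) for s in sentences]

print(vocabulary)
print(vectors)
```

Every sentence becomes a vector with one entry per vocabulary word, which is why a larger vocabulary leads to higher-dimensional vectors.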
58
Splitting the data Data
Exploration
▪ Splitting the data into training
and testing data sets.
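A sketch of a random train/test split using only the standard library. The 80/20 ratio, the seed and the function name split_data are assumptions for illustration; the notebook may rely on a library helper instead.

```python
import random

def split_data(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the indices once, then cut them into test and train parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Invented data: ten texts with alternating labels.
samples = [f"text {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = split_data(samples, labels)
print(len(X_train), len(X_test))
```

Shuffling indices rather than the data keeps each text paired with its label.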
59
The next stage in the AI Project cycle is Modeling!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
60
Model Building
Modeling
61
The next stage in the AI Project cycle is Evaluation!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
62
Model Evaluation
Evaluation
63
The last stage in the AI Project cycle is Deployment!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
64
Save the Model
Deployment
65
Testing the Model
Deployment
66
Testing the Model
Deployment
67
Language Translation
68
Language Translation
▪ Comparing the
results by verifying
the output against
Google Translate.
69
Sentiment Analysis
What is the first step of AI Project Cycle?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
71
What is Climate Change?
freepik2m images. Hand drawn flat design climate change concept. (2022). Retrieved 14 September 2022, from https://www.freepik.com/premium-vector/hand-drawn-flat-design-climate-change-concept_18262695.htm
72
Let’s learn more about climate change...
73
What are the different causes of Climate Change?
By Nikita Kaul. Climate Change: Extinction in Disguise | Its Causes & Effects. (2022). Retrieved 14 September 2022, from https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/
74
What is the impact of Climate Change?
By Nikita Kaul. Climate Change: Extinction in Disguise | Its Causes & Effects. (2022). Retrieved 14 September 2022, from https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/
75
Step 1: Problem Scoping Problem
Scoping
As human activities have been the main contributing factor to climate change, we need to ensure there is
awareness about climate change and that corrective steps can be taken to minimize its impact.
New team members Isometric Illustrations. (2022). Retrieved 14 September 2022, from https://storyset.com/illustration/new-team-members/amico
76
How do we differentiate between sentiments or
opinions of people on climate change?
We can use NLP to build a model that can do sentiment analysis. Let’s dive in…
77
Hands-on Session:
Duration: 120 Minutes
Refer to [Jupyter Notebooks-Youth] Semantic Analysis
Import the Libraries
79
What is the second step of the AI Project Cycle?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
80
Step 2: Data Acquisition
Data
Acquisition
▪ This dataset aggregates tweets about climate change
collected between Apr 27, 2015 and Feb 21, 2018.
▪ In total, 43943 tweets were annotated.
▪ Each tweet is labeled as one of the following classes:
▪ 2(News): the tweet links to factual news about
climate change
▪ 1(Pro): the tweet supports the belief in man-made
climate change
▪ 0(Neutral): the tweet neither supports nor refutes
the belief in man-made climate change
▪ -1(Anti): the tweet does not believe in man-made
climate change.
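The class codes listed above can be mapped to readable names and summarized as shares, which is the information a distribution chart would show. The six sample labels are invented; only the code-to-name mapping comes from the slide.

```python
from collections import Counter

# Mapping taken from the class list above.
LABEL_NAMES = {2: "News", 1: "Pro", 0: "Neutral", -1: "Anti"}

# Invented sample of sentiment codes.
sample_labels = [1, 1, 2, 0, -1, 1]

names = [LABEL_NAMES[code] for code in sample_labels]
shares = {name: count / len(names) for name, count in Counter(names).items()}

print(names)
print(shares)
```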
81
Is the data ethical?
Data
Acquisition
▪ The data is collected from Twitter; it comes from
varied demographics and is random in nature.
▪ Therefore, it may not be accurate or reliable.
Thinking face Semi Flat Illustrations. (2022). Retrieved 14 September 2022, from https://storyset.com/illustration/thinking-face/pana
82
Load the dataset
Data
Acquisition
▪ Load the dataset into a data frame
called df.
▪ Viewing the first five entries of the
data frame using head()
83
What is the third step of the AI Project Cycle?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
84
Explore the dataset
Data
Exploration
85
Explore the dataset
Data
Exploration
▪ Visualizing the
distribution of sentiment
using a pie chart.
86
Let’s prepare the data for sentiment analysis…
87
1. Tokenization
Data
Exploration
Tokenization is the first step in any NLP Pipeline. Basically, it is a process of breaking down
the raw text into smaller units like words or sentences called tokens.
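A minimal word-level tokenizer can be sketched with a regular expression. The pattern and the sample tweet are assumptions for illustration; the notebook may use a library tokenizer with different rules.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Invented sample tweet.
tweet = "Climate change is real, and it's happening now!"
tokens = tokenize(tweet)
print(tokens)
```

Punctuation such as the comma and the exclamation mark is dropped, while "it's" survives as a single token.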
88
1. Tokenization
Data
Exploration
▪ Tweet texts will be
transformed and
vectorized to be fed into
models
Sample tokens
89
2. Stop Word Removal
Data
Exploration
Stop words, such as "is", "was" and "and", occur commonly in all documents and add little meaning to
sentences. Such words can be removed to make the learning process faster.
90
2. Stop Word Removal
Data
Exploration
▪ Stop word removal is yet
another critical step in
NLP.
▪ Defining a function to
remove stop words
▪ Tokenized list(tokenizedLi)
generated in the last step
will be the input
parameter to this
function.
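A sketch of such a function is shown below. Only the parameter name tokenizedLi comes from the slide; the function name remove_stop_words and the small hand-written stop word set are assumptions (the notebook may use a library-provided list).

```python
# Small hand-written stop word set, for illustration only.
STOP_WORDS = {"is", "was", "and", "it", "into", "in", "if", "the", "a"}

def remove_stop_words(tokenizedLi):
    """Return the tokens that are not stop words."""
    return [t for t in tokenizedLi if t.lower() not in STOP_WORDS]

# Invented token list, as produced by a tokenization step.
tokenizedLi = ["climate", "change", "is", "real", "and", "urgent"]
filtered = remove_stop_words(tokenizedLi)
print(filtered)
```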
91
2. Stop Word Removal
Data
Exploration
92
3. Stemming
Data
Exploration
Stemming is a process of extracting the base or root form of the word. It is used to
normalize and prepare the text for further processing.
Walking
Walk
Walked
Walks
93
3. Stemming
Data
Exploration
▪ Defining a function to stem
words to their root form.
▪ Using Porter Stemmer algorithm
for performing stemming.
▪ Output after stemming
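To make the idea concrete without extra libraries, the sketch below uses a naive suffix-stripping rule. It is explicitly not the Porter Stemmer algorithm named above, which applies a much larger set of rules; naive_stem and its suffix list are hypothetical.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip one known suffix if at least three letters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Walking", "Walk", "Walked", "Walks"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All four forms collapse to the same root, so the model sees them as one feature instead of four.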
94
4. Vectorization
Data
Exploration
Vectorization is a technique of converting textual data into numerical feature vectors.
Ashutosh Tripathi. Word2Vec and Semantic Similarity using spacy | NLP spacy Series | Part 7 – Data Science Duniya. (2022). Retrieved 14 September 2022, from https://ashutoshtripathi.com/2020/09/04/word2vec-and-semantic-similarity-using-spacy-nlp-spacy-series-part-7/
95
4. Vectorization
Data
Exploration
96
4. Vectorization
Data
Exploration
97
N-Gram Language Modeling
Data
Exploration
An N-gram is a sequence of N consecutive words in a text. N-grams are sets of co-occurring words within a
given window.
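A sliding window over a token list is enough to produce N-grams. The function name ngrams and the sample tokens are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Slide a window of size n over the tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented token list.
tokens = ["climate", "change", "is", "real"]

bigrams = ngrams(tokens, 2)    # N = 2
trigrams = ngrams(tokens, 3)   # N = 3
print(bigrams)
print(trigrams)
```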
98
N-Gram Language Modeling
Data
Exploration
99
N-Gram Language Modeling
Data Exploration
▪ Visualizing the bi-gram model for sentiment -1 (Anti-climate tweets): the top 20 occurrences
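The counts behind such a chart can be sketched as follows. The tokenized tweets are invented, and top_bigrams is a hypothetical helper; the slide uses the top 20, while this toy data only has a handful of bi-grams.

```python
from collections import Counter

def top_bigrams(tokenized_tweets, k=20):
    """Count adjacent word pairs across all tweets and return the k most common."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(zip(tokens, tokens[1:]))   # adjacent pairs in one tweet
    return counts.most_common(k)

# Invented tokenized tweets standing in for one sentiment class.
tweets = [
    ["climate", "change", "hoax"],
    ["climate", "change", "myth"],
    ["global", "warming", "hoax"],
]

top = top_bigrams(tweets, k=2)
print(top)
```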
100
Defining Functions
Data
Exploration
▪ Defining a function for building the model
▪ Defining a function for evaluating the model
101
Defining Functions
Data
Exploration
102
Which Pre-Processing technique has been applied
here?
Input Output
Stemming Tokenization
103
Which Pre-Processing technique has been applied
here?
Input Output
Stemming Tokenization
104
Modeling is the next step in AI Project Cycle!
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
105
Model Building
Modeling
106
Model Building
Modeling
▪ Using DecisionTreeClassifier
to build the model.
▪ Here, we are using initial
vectorizedTweets without
N-gram sequencing
▪ modelAndPredict() function
gives a call to
evaluateModel() function,
which prints the accuracy
of the model.
Model results (actual and
predicted values of
sentiments) can be viewed
using the confusion matrix
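The relationship between the two functions can be sketched as below. Only the names modelAndPredict() and evaluateModel() come from the slide; their signatures here are assumptions. MajorityClassifier is a deliberately trivial stand-in, not a decision tree: any object that offers fit() and predict(), such as a scikit-learn classifier, could be passed in its place.

```python
from collections import Counter

def evaluateModel(actual, predicted):
    """Print the accuracy and return it with a confusion matrix."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    # Confusion matrix as a mapping: (actual, predicted) -> count.
    confusion = Counter(zip(actual, predicted))
    print(f"Accuracy: {accuracy:.2f}")
    return accuracy, confusion

class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def modelAndPredict(model, X_train, y_train, X_test, y_test):
    """Train the model, predict on the test set, then evaluate."""
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    return evaluateModel(y_test, predicted)

# Invented vectorized tweets and sentiment labels.
X_train, y_train = [[1, 0], [0, 1], [1, 1]], [1, 1, -1]
X_test, y_test = [[1, 0], [0, 0]], [1, 0]

accuracy, confusion = modelAndPredict(MajorityClassifier(), X_train, y_train, X_test, y_test)
print(confusion)
```

Keeping evaluation in its own function means the same accuracy and confusion-matrix code can be reused for every model that is tried.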
107
Model Building
Modeling
▪ Using
RandomForestClassifier to
build the model.
▪ Here, we are using initial
vectorizedTweets without
N-gram sequencing
▪ modelAndPredict() function
gives a call to
evaluateModel() function,
which prints the accuracy
of the model.
▪ Model results (actual and
predicted values of
sentiments) can be viewed
using the confusion matrix
108
Which is the next stage after Modeling?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
109
Model Comparisons
Evaluation
110
What is the last stage of the AI Project Cycle?
Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment
111
Save the Model
Deployment
▪ Defining a preprocessing
function to reuse the built
model to predict other
tweets.
▪ Model is also saved as a
binary file.
▪ Rest of the code is a
similar preprocessing
pipeline code.
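Saving a model as a binary file and loading it again can be sketched with Python's pickle module. The dictionary below is an invented placeholder for a trained model, and the file name is hypothetical; the notebook may use a different serialization helper.

```python
import os
import pickle
import tempfile

# Invented placeholder for a trained model and its vocabulary.
model = {"vocabulary": {"climate": 0, "hoax": 1}, "majority_label": 1}

# Write the model to a binary file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another program: load the model back for predictions.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, because loading one can execute code.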
112
Save the Model
Deployment
113
Quiz time
Let’s Kahoot!
▪ Let’s go!
114
Key Takeaways
1. Different steps involved in NLP Pipeline
2. Exploring NLP applications like Language Detection and Translation, and
Sentiment Analysis.
115
Reflection
116
Bibliography
▪ Causes and Effects of Climate Change | United Nations. (n.d.). Retrieved September 14, 2022, from
https://www.un.org/en/climatechange/science/causes-effects-climate-change#collapseOne
▪ Climate Change: Extinction in Disguise | Its Causes & Effects. (n.d.). Retrieved September 14, 2022, from
https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/
▪ Four Things That Happen When a Language Dies | Smart News| Smithsonian Magazine. (n.d.). Retrieved September 14, 2022, from
https://www.smithsonianmag.com/smart-news/four-things-happen-when-language-dies-and-one-thing-you-can-do-help-180962188/
▪ GitHub - JatinSadhwani02/Langugae_Identification_NLP. (n.d.). Retrieved September 14, 2022, from
https://github.com/JatinSadhwani02/Langugae_Identification_NLP
▪ How many languages are endangered? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/how-many-languages-endangered
▪ How Many Languages Are There In The World? (n.d.). Retrieved September 14, 2022, from https://www.babbel.com/en/magazine/how-
many-languages-are-there-in-the-world
▪ How many languages are there in the world? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/how-many-languages
▪ How to Break Down Language Barriers in Your International Dev Team - DistantJob - Remote Recruitment Agency. (n.d.). Retrieved
September 14, 2022, from https://distantjob.com/blog/break-down-language-barrier-dev-teams/
▪ Seq2Seq Model | Understand Seq2Seq Model Architecture. (n.d.). Retrieved September 14, 2022, from
https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/
117
Bibliography
▪ The Top 10 Most Spoken Languages Across the Globe. (n.d.). Retrieved September 14, 2022, from https://www.visualcapitalist.com/the-
worlds-top-10-most-spoken-languages/
▪ Twitter Climate Change Sentiment Analysis | Kaggle. (n.d.). Retrieved September 14, 2022, from
https://www.kaggle.com/code/roellekim/twitter-climate-change-sentiment-analysis
▪ Understanding Encoder-Decoder Sequence to Sequence Model | by Simeon Kostadinov | Towards Data Science. (n.d.). Retrieved
September 14, 2022, from https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-
679e04af4346
▪ What Is Climate Change? | United Nations. (n.d.). Retrieved September 14, 2022, from https://www.un.org/en/climatechange/what-is-
climate-change
▪ What is the most spoken language? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/most-spoken-languages
▪ Word2Vec and Semantic Similarity using spacy | NLP spacy Series | Part 7 – Data Science Duniya. (n.d.). Retrieved September 14, 2022,
from https://ashutoshtripathi.com/2020/09/04/word2vec-and-semantic-similarity-using-spacy-nlp-spacy-series-part-7/
▪ Transformers, what can they do? - Hugging Face Course. (n.d.). Retrieved September 30, 2022, from
https://huggingface.co/course/chapter1/3?fw=pt
▪ Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine
learning one concept at a time. (n.d.). Retrieved September 30, 2022, from https://jalammar.github.io/visualizing-neural-machine-
translation-mechanics-of-seq2seq-models-with-attention/
118