[Slides] Module 44

AI for Youth
NLP Algorithms and Applications (Language Recognition Model, Sentiment Analysis)
Legal Disclaimers
The Intel® Digital Readiness Programs and Intel® AI for Youth program are developed by Intel Corporation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands
may be claimed as the property of others. All rights reserved. Program dates and lesson plans are subject to change.

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Results have been estimated or simulated.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Your costs and results may vary.

2
Recap

1. Importance of Vectorization

2. Vectorization Methods

3. Importance of Preprocessing

4. NLP Data Preprocessing Techniques


We need to convert text into numbers (a vector representation):

What is the weather like outside?        →  6.3 2.3 4.4 1.3
Can you set an alarm for 6 a.m.?         →  4.5 2.3 1.3 0.3
Where is the best restaurant in town?    →  5.7 2.5 5.0 2.0

Please note: the values in the vectors are illustrative, not accurate.

4
Vectorization Methods

1. Bag of Words
2. TF-IDF
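The sketch below (illustrative, not taken from the module notebooks) shows how both methods can be applied with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["What is the weather like outside?",
        "Can you set an alarm for 6 a.m.?"]

bow = CountVectorizer()              # Bag of Words: raw term counts
print(bow.fit_transform(docs).toarray())

tfidf = TfidfVectorizer()            # TF-IDF: counts weighted by inverse document frequency
print(tfidf.fit_transform(docs).toarray())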

Force of LSTM and GRU – Blog. (2022). Retrieved 10 September 2022, from https://dudeperf3ct.github.io/lstm/gru/nlp/2019/01/28/Force-of-LSTM-and-GRU/

5
Importance of Pre-Processing
High-dimensional vectors cause:

▪ Increased training time for the AI model
▪ More memory required to store the AI model

6
Removing Stopwords, Special Characters & Numbers

a an and are as for

it is into in if

on or such the there to

@ # $ % &

7
Converting Text to a Common Case

HELLO HeLLo HELLo hELLO hElLO HellO

hello

8
Stemming vs Lemmatization

Caring → Lemmatization → Care

Caring → Stemming → Car

9
You will be able to answer the following:
1. What are the different real-world applications of NLP? How do they work?

2. How do language recognition and translation models work?

3. How can AI be used to preserve marginalized languages?

4. Which pretrained models can be used to build language recognition and translation?

5. How does sentiment analysis work?

6. What are the pre-processing steps involved during sentiment analysis of the dataset?

10
Module Sections

11
NLP Pre-trained Functions and Models

1. NLP Pre-trained Functions and Models
2. Language Detection and Translation
3. Sentiment Analysis
NLP Applications

▪ Language Translation
▪ Chatbots
▪ Text Summarization
▪ Question Answering

13
All the applications we just saw use
Sequence-to-Sequence Models!
Let’s try to explore this further…

14
Sequence-to-sequence Modeling

A sequence-to-sequence (seq2seq) model is used in several applications that we use daily. For example,
Google Translate is built upon the core concept of seq2seq models.

15
What is sequence-to-sequence modeling?
▪ Sequence-to-Sequence models are a particular class of Neural Network architectures.
▪ They are typically used to solve complex language problems like Language Translation, Chatbot creation, Text Summarization, etc.

Language Translation:  "Hello, how are you?"  →  Sequence Model  →  "Hola, ¿cómo estás?"

Chatbot:  "How are you doing?"  →  Sequence Model  →  "I am very well. Thank you."

Text Summarization:  "Bob and Clara attended a party in the town. While at the party, Clara collapsed and was rushed to the hospital."  →  Sequence Model  →  "Clara was hospitalized after attending a party with Bob."

16
Understanding this model requires an
understanding of a series of concepts that
are built on top of each other.

17
Sequence to Sequence Model
▪ A sequence-to-sequence model is a model that takes a sequence of items (words,
letters, features of an image, etc.) and outputs another sequence of items.

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/

18
Sequence to Sequence Model
▪ In Language Translation, a sequence is a series of words, as shown below. The words are
processed one after the other, and the output is generated in the same way.

▪ In the example, a sentence in French is translated into English.

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/

19
What is inside the Seq2Seq Box?

20
Introduction to Encoders and Decoders
▪ Inside the BlackBox or Sequence-to-Sequence Model, we will find an encoder and decoder.
▪ The Encoder takes each word in the sequence as input and processes them item by item.
▪ This processed information is compiled and stored in a vector.

SEQUENCE TO SEQUENCE MODEL:  je suis étudiant  →  ENCODER  →  DECODER  →  I am a student

▪ After processing the input, the Encoder sends the vector over to the Decoder.
▪ The Decoder produces the output in a sequence using the vector.

21
Introduction to Encoders and Decoders
▪ This vector is called the context, which is passed on from encoder to decoder.

SEQUENCE TO SEQUENCE MODEL:  je suis étudiant  →  ENCODER  →  CONTEXT  →  DECODER  →  I am a student

▪ The context is nothing but an array of numbers for machine translation.


▪ And the encoders and decoders are a specific class of neural networks.

22
What’s inside the encoder and decoder stage?
▪ Each input in the sequence needs a Neural Network(NN) processing unit, as shown below in the diagram.
▪ Encoders or decoders are chains of several NN units where each unit accepts a single element of a sequence.

Diagram: the ENCODING STAGE processes "je suis etudiant" through a chain of NN units; the DECODING STAGE uses a chain of NN units to produce "I am a student".

23
Vectors and Hidden States
▪ Each NN unit takes two inputs at each step: an input and a hidden state
▪ In the Encoder stage, input will be the individual words of the input sequence and a hidden state(h).
▪ However, these individual words need to be represented using vectors.
Diagram: each encoder NN unit takes one input word ("je", "suis", "etudiant") plus a hidden state, producing hidden states h1, h2, h3; the decoder NN units then generate "I am a student".

24
Vectors and Hidden States
▪ As encoders and decoders are both a specific class of NNs, at each time step one of the NNs does some
processing and updates its hidden state based on its current input and the previous inputs it has received.
▪ Observe how the last hidden state is actually the context that is passed along to the decoder.
Diagram: the encoder's last hidden state (h3) is the context passed to the decoding stage, which then outputs "I am a student".
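To make the hidden-state update concrete, here is a minimal numpy sketch (made-up, untrained weights; not the module's code) of an RNN-style encoder processing the input one word at a time:

import numpy as np

vocab = {"je": 0, "suis": 1, "etudiant": 2}
embed_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)

E   = rng.normal(size=(len(vocab), embed_dim))   # word-vector lookup table
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

h = np.zeros(hidden_dim)                         # initial hidden state h0
for word in ["je", "suis", "etudiant"]:
    x = E[vocab[word]]                           # vector for the current word
    h = np.tanh(W_x @ x + W_h @ h)               # update the hidden state

context = h                                      # the last hidden state is the context vector
print(context)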

25
Processing Unit

▪ This specific class of NN is the RNN.

▪ RNN stands for Recurrent Neural Network.
▪ If you want to learn more about RNNs, visit this link: A friendly introduction to RNN

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine learning one concept at a time. (2022). Retrieved 18 September 2022, from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-
attention/

26
This was a high-level look at the architecture
of the Sequence to Sequence Model.

27
Attention Mechanism
▪ The Attention Mechanism is an improvement over the basic seq2seq architecture we just saw.
▪ The first difference is that instead of passing only the last hidden state of the encoding stage,
the encoder passes all the hidden states to the decoder.
Diagram: with attention, the encoder passes all hidden states (h1, h2, h3), not just the last one, to the decoding stage.

28
Attention Mechanism
▪ The second difference is that the decoder in the attention mechanism goes through a series of steps before
producing the output:

▪ Consider the hidden states received from the encoder stage
▪ Assign a score to each hidden state (let's not focus here on how the score is computed)
▪ Run the scores through a softmax and multiply each hidden state by its softmaxed score
▪ Multiplying will magnify the hidden states with a high score and drown out the hidden states with a low score
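A minimal numpy sketch of this weighting step (illustrative values only):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Three encoder hidden states (made-up values) and their raw scores
hidden_states = np.array([[0.1, 0.3, 0.5],
                          [0.7, 0.2, 0.1],
                          [0.4, 0.9, 0.6]])
scores = np.array([1.2, 0.3, 2.5])

weights = softmax(scores)                                  # high scores dominate after softmax
context = (weights[:, None] * hidden_states).sum(axis=0)   # weighted sum of the hidden states
print(weights, context)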

29
Look at this Model Structure
Try to answer the question on the next slide, taking this structure as a reference.

Diagram: INPUT → ENCODER → DECODER → OUTPUT (stacked encoder and decoder blocks).

Understanding Encoder-Decoder Sequence to Sequence Model | by Simeon Kostadinov | Towards Data Science. (2022). Retrieved 14 September 2022, from https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346

30
If the input sequence is "How are you?", what will be x2 according to the encoder block discussed in the last slide?

Options: How | are | you | Hidden

31
If the input sequence is "How are you?", what will be x2 according to the encoder block discussed in the last slide?

Options: How | are | you | Hidden
Answer: "are" (x2 is the second element of the input sequence).

32
Language Detection and Translation

1. NLP Pre-trained Functions and Models
2. Language Detection and Translation
3. Sentiment Analysis
Let’s bring in AI Project Cycle again!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

34
Starting with a problem statement..

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

35
What is a Language Barrier?

▪ A language barrier refers to linguistic barriers to communication.

▪ It is the barrier in communication faced by people speaking different languages, or in some cases even different dialects.

Learning new language gaining ground in India. (2022). Retrieved 14 September 2022, from https://www.thehansindia.com/amp/hans/young-hans/learning-new-language-gaining-ground-in-india-719417

36
How many languages are there?

37
Language Statistics

▪ According to Ethnologue,
7,151 languages are spoken in the
world today.
▪ These languages are spoken by
communities whose lives are
shaped by our rapidly changing
world.
▪ 40% of languages are
now endangered.

Living Languages 2022 by Ethnologue


How many languages are there in the world?. (2022). Retrieved 14 September 2022, from https://www.ethnologue.com/guides/how-many-languages

38
What is the most spoken language?
English is the largest language in the world if you count both native and non-native speakers. If you count only
native speakers, Mandarin Chinese is the largest.

What is the most spoken language?. (2022). Retrieved 14 September 2022, from https://www.ethnologue.com/guides/most-spoken-languages

39
Since Mandarin speakers are mostly native
speakers, language translation can come in handy
to facilitate global communication.

40
Top 10 Spoken Languages Worldwide

By Carmen Ang. The World’s Top 10 Most Spoken Languages. (2022). Retrieved 14 September 2022, from https://www.visualcapitalist.com/the-worlds-top-10-most-spoken-languages/

41
Why do some languages disappear?
Problem Scoping

▪ A language becomes endangered when its users begin to teach and speak a more dominant language to the children in the community.

▪ Due to their nature, endangered languages often have few speakers left, and it may be difficult to get information about them.

▪ Other times, the last known speaker of a language may die without public records.

disappearing languages - HeeJoo the Piggy. (2022). Retrieved 14 September 2022, from https://heejoothepiggy.weebly.com/blog/disappearing-languages

42
Impact of Dying Languages

▪ Language loss can be culturally devastating

▪ We lose the memory of the planet’s many histories and cultures

▪ Some people lose their mother tongue

43
NLP applications like language detection and translation could be used to help preserve
languages and support healthy global communication.

44
A Quick Analogy!

Imagine you are a data scientist, and you have been assigned a task to summarize the understanding of climate change from across the world.

One of the first tasks will be to collect tweets related to #climatechange.

When you start analyzing the data, you realize the data is generated in different languages worldwide.

How would you solve this problem?

45
The first problem is to detect the language in which each tweet is written.

46
The solution would be to build a language
detection model for a scenario like this.
Let’s learn how to build a language detection model…

47
Where should we find the data?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

48
Hands-on Session:
Duration: 120 Minutes
Refer to [Jupyter Notebooks-Youth] Language Detection and Translation

Data Acquisition

Dataset source: https://www.kaggle.com/code/martinkk5575/language-detection/data

▪ The data were initially extracted from the WiLI-2018 Wikipedia dataset.
▪ WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs of 235 languages.
▪ Each language in this dataset contains 1000 rows/paragraphs.
▪ After data selection and pre-processing, 22 selected languages were used from the original dataset: English, Arabic, French, Hindi, Urdu, Portuguese, Persian, Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, Indonesian, Dutch, Japanese, Thai.

50
Load the Dataset
Data Acquisition

▪ Loading the dataset into a data frame ‘data’ using pandas.

▪ head() is used to display the first five entries of the data frame.

▪ The shape attribute gives the total number of rows and columns in the data frame.
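A minimal sketch of this step (the CSV file name is an assumption; use the file downloaded from the Kaggle link above):

import pandas as pd

data = pd.read_csv("Language Detection.csv")   # hypothetical file name
print(data.head())                             # first five rows
print(data.shape)                              # (number of rows, number of columns)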

51
The next stage is Data Exploration or Pre-processing!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

52
Let’s explore the data!
Data Exploration

▪ value_counts(): checking how many unique languages are present in the dataset and the total count per language.

53
Let’s explore the data!
Data Exploration

▪ Visualizing the languages present in the data frame along with the frequency of each language.
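A sketch of these two exploration steps, assuming the label column is named "language" (the notebook's actual column name may differ) and the data frame from the loading sketch above:

import matplotlib.pyplot as plt

counts = data["language"].value_counts()   # unique languages and the number of rows for each
print(counts)

counts.plot(kind="bar")                    # simple bar chart of language frequencies
plt.tight_layout()
plt.show()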

54
Let’s explore the data!
Data Exploration

▪ To view the text for a particular language, specify the column name and the index number.

▪ Importing the nltk library for NLP pre-processing.

55
Let’s explore the data!
Data Exploration

▪ The tqdm library is used to display the progress of execution.

▪ The re.sub() function searches for a pattern in the string and replaces the matched strings with the replacement.

▪ The lower() method returns a lowercase copy of the given string.

▪ The split() method splits a string into a list.

56
Let’s explore the data!
Data Exploration

▪ The join() method takes all items in an iterable and joins them into one string.

▪ The append() method is used to add elements to the end of the list ‘corpus’.
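A sketch of the pre-processing loop described on the last two slides (the text column name and the exact regular expression are assumptions):

import re
from tqdm import tqdm

corpus = []
for text in tqdm(data["Text"]):                              # "Text" column name is assumed
    text = re.sub(r"[!@#$(),\n\"%^*?:;~`0-9]", " ", text)    # drop special characters and digits
    text = text.lower()                                      # convert to a common case
    words = text.split()                                     # split the string into a list of words
    corpus.append(" ".join(words))                           # join back and store in the corpus list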

57
Let’s explore the data!
Data Exploration

▪ Converting sentences into vectors using CountVectorizer.

▪ Applying the fit_transform method on the corpus.

▪ Applying label encoding to the data.
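A sketch of this step (the label column name is an assumption):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

cv = CountVectorizer()
X = cv.fit_transform(corpus)                 # sentences -> sparse count vectors

le = LabelEncoder()
y = le.fit_transform(data["language"])       # language names -> integer labels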

58
Splitting the data
Data Exploration

▪ Splitting the data into training and testing data sets.
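A sketch of the split, assuming X and y from the previous step (the notebook's split ratio and random seed may differ):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)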

59
The next stage in the AI Project cycle is Modeling!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

60
Model Building
Modeling

▪ The Multinomial Naive Bayes algorithm is suitable for classification and is widely used in NLP.
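A minimal sketch of fitting this classifier on the training split from the previous step:

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)          # learn word-count statistics for each language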

61
The next stage in the AI Project cycle is Evaluation!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

62
Model Evaluation
Evaluation

▪ score() is used to calculate the accuracy of the model.

▪ A confusion matrix is used to compare the actual and the predicted outputs.
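A sketch of both evaluation steps, reusing the fitted model and the test split from the earlier sketches:

from sklearn.metrics import confusion_matrix

print(model.score(X_test, y_test))             # accuracy on the held-out test set

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # actual vs. predicted labels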

63
The last stage in the AI Project cycle is Deployment!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

64
Save the Model
Deployment

▪ The joblib library is used to save the model.

▪ The model is saved with a .sav extension.
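A sketch of saving and reloading the model and then testing it on a new sentence (file names and the predict_language helper are hypothetical; cv, le, and model come from the earlier sketches):

import joblib

joblib.dump(model, "language_detection_model.sav")     # hypothetical file name
loaded_model = joblib.load("language_detection_model.sav")

def predict_language(text):
    # Apply the same vectorizer used during training, then decode the predicted label
    vec = cv.transform([text])
    return le.inverse_transform(loaded_model.predict(vec))[0]

print(predict_language("Où est le meilleur restaurant de la ville ?"))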

65
Testing the Model
Deployment

▪ Code for testing the language detection model.

66
Testing the Model
Deployment

▪ Some example sentences to test the language detection model.

67
Language Translation

▪ Using the googletrans module for language translation.

▪ Translating French text into English.
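A minimal sketch using the classic synchronous googletrans API (newer releases of the library behave differently, so treat the exact call as an assumption):

from googletrans import Translator

translator = Translator()
result = translator.translate("Bonjour, comment allez-vous ?", src="fr", dest="en")
print(result.text)      # an English translation such as "Hello, how are you?"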

68
Language Translation

▪ Using the module to translate multiple sentences.

▪ The example shows the translation of English text into Spanish.

▪ Comparing the results by verifying the output against the live Google Translate service.

69
Sentiment Analysis

1. NLP Pre-trained Functions and Models
2. Language Detection and Translation
3. Sentiment Analysis
What is the first step of AI Project Cycle?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

71
What is Climate Change?

▪ Climate change refers to long-term shifts in temperatures and weather patterns.
▪ These shifts may be natural, such as through variations in the solar cycle.
▪ But it has been observed that human activities have been the main driver of climate change.

freepik2m images. Hand drawn flat design climate change concept. (2022). Retrieved 14 September 2022, from https://www.freepik.com/premium-vector/hand-drawn-flat-design-climate-change-concept_18262695.htm

72
Let’s learn more about climate change...

73
What are the different causes of Climate Change?

▪ Burning fossil fuels generates greenhouse gas (carbon dioxide and methane) emissions.
▪ These gases act like a blanket wrapped around the Earth, trapping the sun’s heat and raising temperatures.
▪ Deforestation limits nature’s ability to keep emissions out of the atmosphere.
▪ You may read this link to learn more: Causes of Climate Change

By Nikita Kaul. Climate Change: Extinction in Disguise | Its Causes & Effects. (2022). Retrieved 14 September 2022, from https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/

74
What is the impact of Climate Change?

▪ As greenhouse gas concentrations rise, we see a rise in the global surface temperature as well.
▪ All these climate change factors contribute to:
▪ Hotter Temperatures
▪ Severe Storms
▪ Increased Drought
▪ Loss of Species
▪ Increased Health Risks
▪ Poverty and Displacement

By Nikita Kaul. Climate Change: Extinction in Disguise | Its Causes & Effects. (2022). Retrieved 14 September 2022, from https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/

75
Step 1: Problem Scoping

Public opinion on climate change is dynamic and differentiated, so it is very important to understand people's sentiments regarding climate change.

As human activities have been the main contributing factor to climate change, we need to ensure there is awareness about climate change and that corrective steps can be taken to minimize its impact.
New team members Isometric Illustrations. (2022). Retrieved 14 September 2022, from https://storyset.com/illustration/new-team-members/amico

76
How do we differentiate between sentiments or
opinions of people on climate change?
We can use NLP to build a model that can do sentiment analysis. Let’s dive in…

77
Hands-on Session:
Duration: 120 Minutes
Refer to [Jupyter Notebooks-Youth] Semantic Analysis
Import the Libraries

▪ Importing all the required libraries for natural language processing, like nltk (the Natural Language Toolkit).
▪ We will also require libraries for basic data operations and modeling.

79
What is the second step of the AI Project Cycle?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

80
Step 2: Data Acquisition

▪ This dataset aggregates tweets about climate change collected between Apr 27, 2015 and Feb 21, 2018.
▪ In total, 43943 tweets were annotated.
▪ Each tweet is labeled as one of the following classes:
▪ 2 (News): the tweet links to factual news about climate change
▪ 1 (Pro): the tweet supports the belief in man-made climate change
▪ 0 (Neutral): the tweet neither supports nor refutes the belief in man-made climate change
▪ -1 (Anti): the tweet does not believe in man-made climate change

81
Is the data ethical?
Data Acquisition

▪ The data is collected from Twitter; it comes from a varied demographic and is random in nature.
▪ Therefore, it may not be fully accurate or reliable.

Thinking face Semi Flat Illustrations. (2022). Retrieved 14 September 2022, from https://storyset.com/illustration/thinking-face/pana

82
Load the dataset
Data Acquisition

▪ Load the dataset into a data frame called df.
▪ Viewing the first five entries of the data frame using head().

83
What is the third step of the AI Project Cycle?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

84
Explore the dataset
Data Exploration

▪ info() gives a basic summary of the data.
▪ The data frame has a total of 43943 entries with no null entries and has 3 features.

▪ value_counts() is used to check the distribution of sentiments.

85
Explore the dataset
Data Exploration

▪ Visualizing the distribution of sentiments using a pie chart.

86
Let’s prepare the data for sentiment analysis…

87
1. Tokenization
Data Exploration

Tokenization is the first step in any NLP pipeline. It is the process of breaking down raw text into smaller units, called tokens, such as words or sentences.

"Like a diamond in the sky"  →  ["Like", "a", "diamond", "in", "the", "sky"]

88
1. Tokenization
Data Exploration

▪ Tweet texts will be transformed and vectorized before being fed into models.

▪ A very critical step in NLP is tokenization, so here all the tweets will be split into arrays of words.

Sample tokens
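A minimal nltk sketch of this step (the notebook may tokenize differently, for example with a simple split()):

import nltk
nltk.download("punkt")                      # tokenizer models (download once)
from nltk.tokenize import word_tokenize

tweet = "Climate change is real and it is happening now"
tokens = word_tokenize(tweet)               # split the tweet into an array of words
print(tokens)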

89
2. Stop Word Removal
Data Exploration

Stop words occur frequently in all documents and add little meaning to sentences (e.g. is, was, and). Such words can be removed to make the learning process faster.

"Like a diamond in the sky"  →  "Like diamond sky"

90
2. Stop Word Removal
Data Exploration

▪ Stop word removal is yet another critical step in NLP.
▪ Defining a function to remove stop words.
▪ The tokenized list (tokenizedLi) generated in the last step will be the input parameter to this function.

91
2. Stop Word Removal
Data Exploration

▪ Passing tokenizedLi as a parameter to the function and verifying the output.
▪ You can observe that words like as, it, was, is, an, etc. have been removed from the list.
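A sketch of such a stop-word-removal function (the notebook's implementation may differ; the parameter name tokenizedLi follows the slides):

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")                  # stop word lists (download once)

stop_words = set(stopwords.words("english"))

def remove_stopwords(tokenizedLi):
    # Keep only the tokens that are not in the English stop word list
    return [w for w in tokenizedLi if w.lower() not in stop_words]

print(remove_stopwords(["it", "was", "an", "amazing", "day"]))   # -> ['amazing', 'day']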

92
3. Stemming
Data Exploration

Stemming is the process of extracting the base or root form of a word. It is used to normalize and prepare the text for further processing.

Walking, Walked, Walks  →  Walk

93
3. Stemming
Data Exploration

▪ Defining a function to stem words to their root form.
▪ Using the Porter Stemmer algorithm to perform stemming.
▪ Output after stemming
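A minimal sketch of a stemming function using nltk's PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(tokens):
    # Reduce each word to its root form
    return [stemmer.stem(w) for w in tokens]

print(stem_words(["walking", "walked", "walks"]))   # -> ['walk', 'walk', 'walk']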

94
4. Vectorization
Data Exploration

Vectorization is a technique for converting textual data into numerical feature vectors.

Ashutosh Tripathi. Word2Vec and Semantic Similarity using spacy | NLP spacy Series | Part 7 – Data Science Duniya. (2022). Retrieved 14 September 2022, from https://ashutoshtripathi.com/2020/09/04/word2vec-and-semantic-similarity-using-spacy-nlp-spacy-series-part-7/

95
4. Vectorization
Data Exploration

▪ In simple words, vectorization is the process of converting words into numbers.
▪ A sample sentence is used to compare the difference before and after vectorization.

96
4. Vectorization
Data Exploration

▪ The print output shows the results of 2 sample sentences after being vectorized.
▪ Each vector index corresponds to a unique word across all tweets.
▪ For example, observe that two index pairs appear in both sentences, (0, 12943) and (0, 13774), which are the representations of "climat" and "chang" respectively.

97
N-Gram Language Modeling
Data Exploration

An N-gram is a sequence of N words used in NLP modeling. N-grams are basically sets of co-occurring words within a given window.

Sentence: Like a diamond in the sky

Uni-grams: "Like", "a", "diamond", "in", "the", "sky"
Bi-grams: "Like a", "a diamond", "diamond in", "in the", "the sky"
Tri-grams: "Like a diamond", "a diamond in", "diamond in the", "in the sky"

98
N-Gram Language Modeling
Data Exploration

▪ Using N-grams, we can group N words together and analyze their frequencies for specific sentiment ratings.
▪ In this example, we are first checking for a bi-gram sequence with SIZE=2.
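A sketch of bi-gram counting using CountVectorizer's ngram_range (the notebook may use a different n-gram utility, such as nltk.ngrams):

from sklearn.feature_extraction.text import CountVectorizer

SIZE = 2                                             # bi-grams
ngram_cv = CountVectorizer(ngram_range=(SIZE, SIZE))
counts = ngram_cv.fit_transform(["like a diamond in the sky"])

# Pair each bi-gram with its count in the text
print(dict(zip(ngram_cv.get_feature_names_out(), counts.toarray()[0])))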

99
N-Gram Language Modeling
Data Exploration

▪ Visualizing the bi-gram model for sentiment (-1), the Anti-climate tweets, for the top 20 occurrences.

100
Defining Functions
Data Exploration

▪ Defining a function for building the model.
▪ Defining a function for evaluating the model.

101
Defining Functions
Data Exploration

▪ Defining a function for displaying the confusion matrix.
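A sketch of what such helper functions could look like (the bodies are assumptions; modelAndPredict and evaluateModel follow the names used on the slides, while showConfusionMatrix is hypothetical):

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

def modelAndPredict(model, X_train, X_test, y_train, y_test):
    # Fit the given classifier and return its predictions on the test set
    model.fit(X_train, y_train)
    return model.predict(X_test)

def evaluateModel(y_test, y_pred):
    # Print the accuracy of the model
    print("Accuracy:", accuracy_score(y_test, y_pred))

def showConfusionMatrix(y_test, y_pred):
    # Display actual vs. predicted sentiment classes
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
    plt.show()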

102
Which pre-processing technique has been applied here?

Input: "I like to Party"   →   Output: "like Party"

Options: Stemming | Tokenization | Stop Words Removal | Vectorization

103
Which pre-processing technique has been applied here?

Input: "I like to Party"   →   Output: "like Party"

Options: Stemming | Tokenization | Stop Words Removal | Vectorization
Answer: Stop Words Removal ("I" and "to" have been removed).

104
Modeling is the next step in the AI Project Cycle!

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

105
Model Building
Modeling

▪ Using Logistic Regression to build the model.
▪ Here, we are using the initial vectorized tweets without N-gram sequencing.
▪ The modelAndPredict() function calls the evaluateModel() function, which prints the accuracy of the model.
▪ Model results (actual and predicted values of sentiments) can be viewed using the confusion matrix.

106
Model Building
Modeling

▪ Using DecisionTreeClassifier to build the model.
▪ Here, we are using the initial vectorized tweets without N-gram sequencing.
▪ The modelAndPredict() function calls the evaluateModel() function, which prints the accuracy of the model.
▪ Model results (actual and predicted values of sentiments) can be viewed using the confusion matrix.

107
Model Building
Modeling

▪ Using RandomForestClassifier to build the model.
▪ Here, we are using the initial vectorized tweets without N-gram sequencing.
▪ The modelAndPredict() function calls the evaluateModel() function, which prints the accuracy of the model.
▪ Model results (actual and predicted values of sentiments) can be viewed using the confusion matrix.

108
Which is the next stage after Modeling?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

109
Model Comparisons
Evaluation

▪ The code compares 3 models: Logistic Regression, Decision Tree, and Random Forest Classifier.
▪ Logistic Regression seems to perform the best because it has the highest accuracy, recall, precision, and F1 score among all the models.
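A sketch of such a comparison loop on the vectorized tweets (X_train, X_test, y_train, y_test are assumed to come from an earlier split; hyperparameters are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")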

110
What is the last stage of the AI Project Cycle?

Problem Scoping → Data Acquisition → Data Exploration → Modeling → Evaluation → Deployment

111
Save the Model
Deployment

▪ Defining a preprocessing function so the built model can be reused to predict other tweets.
▪ The model is also saved as a binary file.
▪ The rest of the code is the same preprocessing pipeline seen earlier.
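A sketch of this step (the file name, predict_sentiment, and best_model are hypothetical; cv, remove_stopwords, and stem_words refer to the earlier sketches in this module):

import joblib
from nltk.tokenize import word_tokenize

joblib.dump(best_model, "sentiment_model.sav")        # best_model = the chosen classifier
saved_model = joblib.load("sentiment_model.sav")

def predict_sentiment(tweet):
    # Reuse the same pipeline: tokenize, remove stop words, stem, vectorize, predict
    tokens = stem_words(remove_stopwords(word_tokenize(tweet.lower())))
    vec = cv.transform([" ".join(tokens)])            # cv = the CountVectorizer fitted on the training corpus
    return saved_model.predict(vec)[0]                # returns -1, 0, 1 or 2

print(predict_sentiment("Climate change is a serious threat to our planet"))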

112
Save the Model
Deployment

▪ Passing some random tweets and statements about climate change to the function.
▪ The output classifies the tweets into the 4 different categories.

113
Quiz time

Let’s Kahoot!

▪ Time to take a short, simple, fun quiz to see how much we have learnt!

▪ Open Kahoot! App or join at www.kahoot.it

▪ Enter Game PIN to start the quiz!

▪ Let’s go!

114
Key Takeaways
1. Different steps involved in the NLP pipeline
2. Exploring NLP applications like Language Detection and Translation, and Sentiment Analysis

115
Reflection

• Any social impact NLP project that comes to your mind?


• Which NLP applications can you think of for building social impact projects?

116
Bibliography
▪ Causes and Effects of Climate Change | United Nations. (n.d.). Retrieved September 14, 2022, from
https://www.un.org/en/climatechange/science/causes-effects-climate-change#collapseOne
▪ Climate Change: Extinction in Disguise | Its Causes & Effects. (n.d.). Retrieved September 14, 2022, from
https://www.pranaair.com/blog/climate-change-its-causes-effects-extinction-in-disguise/
▪ Four Things That Happen When a Language Dies | Smart News| Smithsonian Magazine. (n.d.). Retrieved September 14, 2022, from
https://www.smithsonianmag.com/smart-news/four-things-happen-when-language-dies-and-one-thing-you-can-do-help-180962188/
▪ GitHub - JatinSadhwani02/Langugae_Identification_NLP. (n.d.). Retrieved September 14, 2022, from
https://github.com/JatinSadhwani02/Langugae_Identification_NLP
▪ How many languages are endangered? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/how-many-languages-endangered
▪ How Many Languages Are There In The World? (n.d.). Retrieved September 14, 2022, from https://www.babbel.com/en/magazine/how-
many-languages-are-there-in-the-world
▪ How many languages are there in the world? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/how-many-languages
▪ How to Break Down Language Barriers in Your International Dev Team - DistantJob - Remote Recruitment Agency. (n.d.). Retrieved
September 14, 2022, from https://distantjob.com/blog/break-down-language-barrier-dev-teams/
▪ Seq2Seq Model | Understand Seq2Seq Model Architecture. (n.d.). Retrieved September 14, 2022, from
https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/

117
Bibliography
▪ The Top 10 Most Spoken Languages Across the Globe. (n.d.). Retrieved September 14, 2022, from https://www.visualcapitalist.com/the-
worlds-top-10-most-spoken-languages/
▪ Twitter Climate Change Sentiment Analysis | Kaggle. (n.d.). Retrieved September 14, 2022, from
https://www.kaggle.com/code/roellekim/twitter-climate-change-sentiment-analysis
▪ Understanding Encoder-Decoder Sequence to Sequence Model | by Simeon Kostadinov | Towards Data Science. (n.d.). Retrieved
September 14, 2022, from https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-
679e04af4346
▪ What Is Climate Change? | United Nations. (n.d.). Retrieved September 14, 2022, from https://www.un.org/en/climatechange/what-is-
climate-change
▪ What is the most spoken language? | Ethnologue. (n.d.). Retrieved September 14, 2022, from
https://www.ethnologue.com/guides/most-spoken-languages
▪ Word2Vec and Semantic Similarity using spacy | NLP spacy Series | Part 7 – Data Science Duniya. (n.d.). Retrieved September 14, 2022,
from https://ashutoshtripathi.com/2020/09/04/word2vec-and-semantic-similarity-using-spacy-nlp-spacy-series-part-7/
▪ Transformers, what can they do? - Hugging Face Course. (n.d.). Retrieved September 30, 2022, from
https://huggingface.co/course/chapter1/3?fw=pt
▪ Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) – Jay Alammar – Visualizing machine
learning one concept at a time. (n.d.). Retrieved September 30, 2022, from https://jalammar.github.io/visualizing-neural-machine-
translation-mechanics-of-seq2seq-models-with-attention/

118
