
Next Word Prediction with NLP and Deep Learning

Designing a Word Predictive system using LSTM

Wouldn’t it be cool if your device could predict the next word you are about
to type? This is how predictive text keyboards work in apps like WhatsApp,
Facebook Messenger, and Instagram, in e-mail clients, and even in Google
search suggestions.

Introduction:

This section explains what the next word prediction model will do. The
model takes the last word of a given sentence and predicts the next most
likely word. We will use methods from natural language processing, language
modeling, and deep learning. We will start by analyzing the data, then
pre-process it, tokenize it, and finally build the deep learning model. The
deep learning model will be built using LSTMs. The entire code is provided
at the end of the article with a link to the GitHub repository.

Approach:

Text datasets are easy to find. We can use Project Gutenberg, a volunteer
effort to digitize and archive cultural works and to “encourage the creation
and distribution of eBooks”. From there we can get many stories, documents,
and other text data suited to our problem statement. The dataset links can
be obtained from here. We will use the text of the book Metamorphosis by
Franz Kafka, which you can download from here. However, if you have the time
to collect your own e-mails and texting data, I would highly recommend doing
so. This will be very helpful for a virtual assistant project, where the
predictive keyboard will make predictions that match your style of texting
or the way you compose your e-mails.

Pre-processing the Dataset:


The first step is to remove all the unnecessary data from the Metamorphosis
text. We will delete the front matter and back matter of the file, since
that data is irrelevant to us. The file should start with the line:

One morning, when Gregor Samsa woke from troubled dreams, he found

and end with the line:

first to get up and stretch out her young body.

Once this step is done, save the file as Metamorphosis_clean.txt. We will
read Metamorphosis_clean.txt using utf-8 encoding. The next step of the
cleaning process involves removing the unnecessary extra new lines, the
carriage returns, and the Unicode BOM character, and mapping all punctuation
to spaces. Finally, we will keep only unique words: each word is considered
only once and any additional repetitions are removed. This helps the model
train with less confusion caused by repeated words. Below is the complete
code for pre-processing the text data.
import string

# Read the cleaned text file
file = open("metamorphosis_clean.txt", "r", encoding="utf8")
lines = []
for i in file:
    lines.append(i)
file.close()

# Join all lines into a single string
data = ' '.join(lines)

# Remove newlines, carriage returns, and the Unicode BOM character
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '')

# Map every punctuation character to a space
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
data = data.translate(translator)

# Keep each word only once, preserving the order of first occurrence
z = []
for i in data.split():
    if i not in z:
        z.append(i)

data = ' '.join(z)

Tokenization: Tokenization refers to splitting larger bodies of text, such
as essays or corpora, into smaller segments. These smaller segments can be
smaller documents or lines of text. They can also be a dictionary of words.

The Keras Tokenizer allows us to vectorize a text corpus by turning each
text into either a sequence of integers (each integer being the index of a
token in a dictionary) or a vector where the coefficient for each token can
be binary, based on word count, or based on tf-idf. To learn more about the
Tokenizer class and text data pre-processing using Keras, visit here.

We will then convert the texts to sequences. This represents the text data
as numbers so that we can analyze it more easily. We will then create the
training dataset. ‘X’ will contain the input text data and ‘y’ will contain
the corresponding outputs, i.e. ‘y’ holds the next-word label for each input
in ‘X’.

We will calculate vocab_size as the length of tokenizer.word_index plus 1.
We add 1 because index 0 is reserved for padding and we want to start our
count from 1. Finally, we will convert the prediction data ‘y’ to
categorical data of size vocab_size using to_categorical, which converts a
class vector (integers) into a binary class matrix. This is needed because
our loss will be categorical_crossentropy. The code for tokenizing the data,
creating the dataset, and converting the prediction set into categorical
data is as follows:

Note: Improvements can be made in the pre-processing. You can try different
methods to improve the pre-processing step, which would help achieve a
better loss and accuracy in fewer epochs.
import pickle
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# saving the tokenizer for the predict function
pickle.dump(tokenizer, open('tokenizer1.pkl', 'wb'))

sequence_data = tokenizer.texts_to_sequences([data])[0]

vocab_size = len(tokenizer.word_index) + 1

# build (previous word, next word) pairs
sequences = []
for i in range(1, len(sequence_data)):
    words = sequence_data[i-1:i+1]
    sequences.append(words)

sequences = np.array(sequences)

X = []
y = []
for i in sequences:
    X.append(i[0])
    y.append(i[1])

X = np.array(X)
y = np.array(y)

# one-hot encode the labels for categorical_crossentropy
y = to_categorical(y, num_classes=vocab_size)

Creating the Model:

We will build a sequential model. First we create an embedding layer and
specify its input and output dimensions. It is important to set the input
length to 1, since the prediction is made on exactly one word and we receive
one response for that word. We then add an LSTM layer with 1000 units and
set return_sequences to True so that its output can be passed to another
LSTM layer. The next LSTM layer also has 1000 units, but we don’t need to
specify return_sequences there since it is False by default. We pass this
through a hidden Dense layer with 1000 units and a relu activation. Finally,
we add an output Dense layer of size vocab_size with a softmax activation.
The softmax ensures that we receive a probability for each word in the
vocabulary. The entire code for our model structure is shown below. After
the code, we will also look at the model summary and the model plot.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

Model Summary:
Model Plot:
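The summary and plot referenced above can be reproduced with the standard
Keras utilities. Below is a minimal sketch; it assumes TensorFlow's Keras
and that the pydot and graphviz packages are installed for plot_model, and
the output file name model_plot.png is just an illustrative choice.

from tensorflow.keras.utils import plot_model

# Print a layer-by-layer summary of the architecture
model.summary()

# Save a diagram of the model to an image file
# (requires the pydot and graphviz packages to be installed)
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)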

Callbacks:

The callbacks we will be using for the next word prediction model are shown
in the code block below:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard

# Save the best weights (lowest training loss) to nextword1.h5
checkpoint = ModelCheckpoint("nextword1.h5", monitor='loss', verbose=1,
                             save_best_only=True, mode='auto')

# Reduce the learning rate when the loss plateaus for 3 epochs
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3,
                           min_lr=0.0001, verbose=1)

# Log training metrics for TensorBoard
logdir = 'logsnextword1'
tensorboard_Visualization = TensorBoard(log_dir=logdir)

We import the 3 callbacks required for training our model: ModelCheckpoint,
ReduceLROnPlateau, and TensorBoard. Let us look at the task each of these
callbacks performs.

1. ModelCheckpoint — This callback saves the weights of our model during
training. By specifying save_best_only=True we keep only the best weights,
and we monitor training using the loss metric.

2. ReduceLROnPlateau — This callback reduces the learning rate of the
optimizer when the monitored metric stops improving for a specified number
of epochs. Here we have set the patience to 3: if the loss does not improve
for 3 epochs, the learning rate is reduced by a factor of 0.2. The monitored
metric is again the loss.

3. TensorBoard — The TensorBoard callback is used for visualizing training
graphs, namely the accuracy and loss curves. Here, we will only be looking
at the loss graph of the next word prediction model.

We save the best model, based on the loss metric, to the file nextword1.h5.
This file will be crucial when we access the predict function and try to
predict the next word. We wait 3 epochs for the loss to improve; if it does
not, we reduce the learning rate. Finally, we use the TensorBoard callback
for visualizing graphs and histograms if needed.

Compile and Fit:

Below is the code block for compiling and fitting the model.
from tensorflow.keras.optimizers import Adam

model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))

model.fit(X, y, epochs=150, batch_size=64,
          callbacks=[checkpoint, reduce, tensorboard_Visualization])

In this final step we compile and fit the model. We train the model and save
the best weights to nextword1.h5, so that we don’t have to re-train the model
repeatedly and can load the saved model when required. Here I have trained
only on the training data; however, you can also choose to hold out a
validation set. The loss is categorical_crossentropy, which computes the
cross-entropy between the labels and the predictions. The optimizer is Adam
with a learning rate of 0.001, and we monitor the loss metric during
training. Our result is as shown below:

Graph:
Prediction:

For the prediction notebook, we will load the tokenizer file that we stored
in pickle format. We will then load the next word model saved in our
directory. We use this same tokenizer to tokenize each input sentence for
which we want to make predictions, and then use the saved model to predict
the next word.

We wrap the prediction call in try and except statements. We do this
because, in case there is an error while processing the input sentence, we
do not want the program to exit the loop. The script should keep running as
long as the user wants; the user must manually choose to stop it. A minimal
sketch of such a prediction script is shown below.
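The exact prediction script is provided through the GitHub repository linked
later; the sketch below is an illustrative reconstruction based on the
description above. The file names nextword1.h5 and tokenizer1.pkl come from
the training code, while the helper Predict_Next_Words and the “stop the
script” handling are assumptions for illustration.

import pickle
import numpy as np
from tensorflow.keras.models import load_model

# Load the trained model and the fitted tokenizer saved earlier
model = load_model('nextword1.h5')
tokenizer = pickle.load(open('tokenizer1.pkl', 'rb'))

def Predict_Next_Words(model, tokenizer, text):
    # Convert the input word into its integer index
    sequence = np.array(tokenizer.texts_to_sequences([text]))
    # Pick the class with the highest softmax probability
    pred = np.argmax(model.predict(sequence), axis=-1)[0]
    # Map the predicted index back to the corresponding word
    for word, index in tokenizer.word_index.items():
        if index == pred:
            print(word)
            return word

while True:
    text = input("Enter your line: ")
    if text == "stop the script":
        print("Ending The Program.....")
        break
    try:
        # Only the very last word of the line is used for prediction
        last_word = text.split(" ")[-1]
        Predict_Next_Words(model, tokenizer, last_word)
    except Exception as e:
        print("Error occurred: ", e)
        continue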
Let us have a brief look at the predictions made by the model. This is done
as follows:

Enter your line: at the dull
weather
Enter your line: collection of textile
samples
Enter your line: what a strenuous
career
Enter your line: stop the script
Ending The Program.....

This can be tested with the prediction script, which is provided through the
GitHub repository linked in the next section. As we can see, the model
predicts well on most lines. When we enter the line “stop the script”, the
program terminates. For all other sentences, a prediction is made on the
very last word of the entered line, which is matched with the next word that
has the highest probability.

Note: There are cases where the program will not return the expected result.
This is expected, because each word is considered only once, which causes
issues for particular sentences. To improve the accuracy of the model, you
can consider using bi-grams or tri-grams as input instead of the uni-grams
used in this approach; a sketch of the bi-gram variant is shown below. A few
additional pre-processing steps can also help. Overall, there is a lot of
scope for improvement.
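As an illustration of the bi-gram idea, the dataset-creation loop could be
adapted so that each training example consists of two consecutive words as
input and the third word as the label. This is a hedged sketch, not part of
the original code; it assumes the same sequence_data, vocab_size, and
to_categorical set-up as above, and the Embedding layer's input_length would
need to change from 1 to 2.

# Build (word_i-2, word_i-1) -> word_i training pairs (bi-gram inputs)
bigram_sequences = []
for i in range(2, len(sequence_data)):
    bigram_sequences.append(sequence_data[i-2:i+1])

bigram_sequences = np.array(bigram_sequences)

# First two columns are the inputs, last column is the next-word label
X2 = bigram_sequences[:, :-1]
y2 = to_categorical(bigram_sequences[:, -1], num_classes=vocab_size)

# The model would then use input_length=2 in its Embedding layer, e.g.
# model.add(Embedding(vocab_size, 10, input_length=2))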

Observation:

We are able to develop a good next word prediction model for the
Metamorphosis dataset and reduce the loss significantly in about 150 epochs.
The model is fairly accurate on the provided dataset, and the overall
quality of the predictions is good. However, certain pre-processing steps
and changes in the model could further improve its predictions.

With this, we have reached the end of the article. The entire code can be
accessed through this link. The next word prediction model is now complete
and performs decently well on the dataset. I would recommend that you build
your own next word prediction model using your e-mail or texting data; this
will work better for a virtual assistant project. Feel free to refer to the
GitHub repository for the entire code. I would also highly recommend the
Machine Learning Mastery website, which is an amazing place to learn more;
it was of great help for this project and you can check it out here. Thank
you so much for reading the article, and I hope all of you have a wonderful
day!
