Next Word Prediction With NLP and Deep Learning
Designing a Word Prediction System Using LSTMs
Wouldn’t it be cool if your device could predict the next word you are
planning to type? This is similar to how the predictive text keyboard works
in apps like WhatsApp, Facebook Messenger, and Instagram, in e-mail
clients, and even in Google searches. Below is an image illustrating these
predictive searches.
Introduction:
This section covers what the next word prediction model we build will
actually do. The model will consider the last word of a given sentence and
predict the next most likely word. We will use methods from natural
language processing, language modeling, and deep learning. We will start by
analyzing the data, then pre-process it, tokenize it, and finally build the
deep learning model. The model will be built using LSTMs. The entire code
will be provided at the end of the article with a link to the GitHub
repository.
Approach:
Datasets for text data are easy to find; we can use Project Gutenberg, a
volunteer effort to digitize and archive cultural works and to “encourage
the creation and distribution of eBooks”. From it we can get many stories,
documentation, and other text data suitable for our problem statement. The
dataset links can be obtained from here. We will use the text of the book
Metamorphosis by Franz Kafka. You can download the dataset from here.
However, if you have the time to collect your own e-mails and texting data,
I would highly recommend doing so. This will be very helpful for a virtual
assistant project, where the predictive keyboard will make predictions that
match your style of texting or the way you compose your e-mails.
Below is the code block for reading and pre-processing the text data (the
file path is an assumption for illustration):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

# Read the dataset line by line (assumed path to the cleaned text file)
file = open("metamorphosis_clean.txt", "r", encoding="utf-8")
lines = []
for i in file:
    lines.append(i)

# Join all the lines into a single string
data = ' '.join(lines)

# Collect the unique words in the data
z = []
for i in data.split():
    if i not in z:
        z.append(i)

# Fit the tokenizer on the text and convert it into a sequence of word indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
sequence_data = tokenizer.texts_to_sequences([data])[0]
from tensorflow.keras.utils import to_categorical

vocab_size = len(tokenizer.word_index) + 1

# Build (current word, next word) pairs from the sequence of word indices
sequences = []
for i in range(1, len(sequence_data)):
    sequences.append(sequence_data[i-1:i+1])
sequences = np.array(sequences)

# Split each pair into the input word X and the target word y
X = []
y = []
for i in sequences:
    X.append(i[0])
    y.append(i[1])
X = np.array(X)
y = np.array(y)

# One-hot encode the targets for the categorical cross-entropy loss
y = to_categorical(y, num_classes=vocab_size)
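The exact architecture appears only in the model summary image, so the
following is a minimal sketch of a plausible model, assuming an Embedding
layer on a single input word, stacked LSTMs, and a softmax output over the
vocabulary; the layer sizes here are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Minimal sketch of the LSTM model; layer sizes are illustrative assumptions
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))  # one input word per example
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))    # probabilities over the vocabulary
model.summary()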
Model Summary:
Model Plot:
Callbacks:
The callbacks we will be using for the next word prediction model are shown
in the code block below:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard

# Save the best weights whenever the monitored loss improves
checkpoint = ModelCheckpoint("nextword1.h5", monitor='loss', verbose=1,
                             save_best_only=True, mode='auto')

# Lower the learning rate when the loss plateaus
# (the factor, patience, and min_lr values are illustrative assumptions)
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3,
                           min_lr=0.0001, verbose=1)

# Log training metrics for visualization in TensorBoard
logdir = 'logsnextword1'
tensorboard_Visualization = TensorBoard(log_dir=logdir)
We import the 3 required callbacks for training our model: ModelCheckpoint,
ReduceLROnPlateau, and TensorBoard. ModelCheckpoint saves the model weights
to nextword1.h5 whenever the monitored loss improves, ReduceLROnPlateau
reduces the learning rate once the loss stops improving, and TensorBoard
writes logs so that the training metrics can be visualized.
Below is the code block for compiling and fitting the model.

from tensorflow.keras.optimizers import Adam

model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
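The fitting step then looks like the sketch below; the epoch count and
batch size are illustrative assumptions rather than values from the article:

model.fit(X, y, epochs=150, batch_size=64,
          callbacks=[checkpoint, reduce, tensorboard_Visualization])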
We compile and fit our model in the final step. Here, we train the model and
save the best weights to nextword1.h5 so that we don’t have to re-train the
model repeatedly and can load the saved model when required. Here I have
trained only on the training data; however, you can choose to train with
both training and validation data. The loss we use is
categorical_crossentropy, which computes the cross-entropy loss between the
labels and the predictions. The optimizer is Adam with a learning rate of
0.001, and we monitor the loss during training. Our result is as shown below:
Graph:
Prediction:
For the prediction notebook, we will load the tokenizer file that we stored
in pickle format. We will then load the next word model saved in our
directory. We will use this same tokenizer to tokenize each of the input
sentences on which we want to make predictions. After this step, we can
make predictions on the input sentence using the saved model.
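A minimal sketch of these steps follows; the file names and the
predict_next_word helper are assumptions for illustration, not code from
the article:

import pickle
import numpy as np
from tensorflow.keras.models import load_model

# Assumed file names for the saved tokenizer and model
with open("tokenizer1.pkl", "rb") as f:
    tokenizer = pickle.load(f)
model = load_model("nextword1.h5")

def predict_next_word(model, tokenizer, text):
    # Tokenize the input and keep only the last word, since the model
    # was trained on (current word, next word) pairs
    sequence = tokenizer.texts_to_sequences([text])[0]
    sequence = np.array([[sequence[-1]]])
    # Pick the word index with the highest predicted probability
    predicted_index = np.argmax(model.predict(sequence), axis=-1)[0]
    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            return word
    return ""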
We will wrap the predictions in a try and except block. We use this so
that, if there is an error in processing an input sentence, the program
does not exit the loop. The script keeps running for as long as the user
wants, and it stops only when the user manually chooses to exit.
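A hedged sketch of that loop, reusing the predict_next_word helper from
above and treating the line “stop the script” as the exit command:

while True:
    text = input("Enter your line: ")
    # The user exits by typing this exact line
    if text.strip().lower() == "stop the script":
        print("Ending the program.....")
        break
    try:
        print(predict_next_word(model, tokenizer, text))
    except Exception as e:
        # Any error in processing the input should not end the loop
        print("Error occurred:", e)
        continue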
Let us have a brief look at the predictions made by the model. This is done
as follows:
This can be tested using the predictions script, which will be provided in
the next section of the article along with a link to the GitHub repository.
As we can see, the model predicts well on most lines. Entering the line
“stop the script” terminates the entire program. For every other sentence,
a prediction is made on the last word of the entered line: we take the very
last word of each line and return the next word with the highest predicted
probability.
Note: There are certain cases where the program might not return the
expected result. This is to be expected because the model conditions on
only one previous word; we have used only uni-grams in this approach, which
causes issues for particular sentences where you will not receive the
desired output. To improve the accuracy of the model, you can consider
trying out bi-grams or tri-grams, and a few more steps can be added to the
pre-processing. Overall, there is a lot of scope for improvement, for
example as sketched below.
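As an illustration of the tri-gram idea, not code from the article, the
training pairs could be extended to condition on the previous two words
like this:

# Build (word_{i-2}, word_{i-1}) -> word_i training triples
sequences = []
for i in range(2, len(sequence_data)):
    sequences.append(sequence_data[i-2:i+1])
sequences = np.array(sequences)

X = sequences[:, :-1]   # two context words per example
y = to_categorical(sequences[:, -1], num_classes=vocab_size)
# The Embedding layer's input_length would change from 1 to 2 accordingly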
Observation:
With this, we have reached the end of the article. The entire code can be
accessed through this link. The next word prediction model is now complete,
and it performs decently well on the dataset. I would recommend building
your own next word prediction model using your e-mail or texting data; this
will be better suited for a virtual assistant project. Feel free to refer
to the GitHub repository for the entire code. I would also highly recommend
the Machine Learning Mastery website, which is an amazing resource for
learning more; it was of great help for this project, and you can check out
the website here. Thank you so much for reading the article, and I hope all
of you have a wonderful day!