Sentiment Analysis From Hotel Reviews: Data Mining For Business Intelligence
Aditya(17122003)
Aniket Sujay(17122006)
Sudhir(17122027)
Sentiment analysis is a branch of natural language processing that consists of extracting emotions from raw text. It is usually applied to social media posts and customer reviews in order to automatically determine whether users are positive or negative, and why. The goal of this study is to show how sentiment analysis can be performed using Python.
We use a dataset of hotel reviews. Each observation consists of one customer review for one hotel. Each customer review is composed of textual feedback about the customer's experience at the hotel, together with an overall rating.
For each textual review, we want to predict whether it corresponds to a good review (the customer is happy) or a bad one (the customer is not satisfied). The overall ratings range from 2.5/10 to 10/10. To simplify the problem, we split them into two categories: good reviews and bad reviews.
Load data
We first start by loading the raw data. Each textual review is split into a positive part and a negative part. We group them together so that we start with only raw text data and no other information.
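As an illustration, the loading and grouping could look like the following pandas sketch; the file name Hotel_Reviews.csv and the column names Negative_Review, Positive_Review and Reviewer_Score are assumptions about the dataset layout, not confirmed by the report:

import pandas as pd

# Load the raw reviews (file and column names are assumed, not confirmed).
reviews_df = pd.read_csv("Hotel_Reviews.csv")

# Group the negative and positive parts of each review into a single
# raw-text column, so that we start with text only.
reviews_df["review"] = (reviews_df["Negative_Review"] + " "
                        + reviews_df["Positive_Review"])

# Keep only the text and the overall rating.
reviews_df = reviews_df[["review", "Reviewer_Score"]]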
Clean data
If the user didn't leave any negative feedback comment, this appears as "No Negative" in our data. The same holds for positive comments, with the default value "No Positive". We have to remove these placeholder values from our texts.
To clean the textual data, we created a custom 'clean_text' function that performs several transformations, among them:
● lemmatize the text: transform every word into its root form (e.g. rooms -> room, slept -> sleep)
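The full list of transformations is not reproduced here, so the following is only a possible sketch of such a clean_text function using NLTK; it assumes the usual lower-casing, punctuation and stopword removal in addition to the lemmatization described above:

import string
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
# requires nltk data: punkt, stopwords, wordnet, averaged_perceptron_tagger

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet POS so that verbs such as
    # "slept" lemmatize to "sleep" instead of staying unchanged.
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def clean_text(text):
    # Lower-case, tokenize, and drop punctuation and stopwords.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in word_tokenize(text.lower())
              if t not in string.punctuation and t not in stop_words]
    # POS-tag, then lemmatize every word into its root form.
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(tok, get_wordnet_pos(tag))
                    for tok, tag in pos_tag(tokens))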
Now that we have cleaned our data, we can do some feature engineering for the modeling part.
Feature engineering
We first start by adding sentiment analysis features, because we can guess that customers' reviews are highly linked to how they felt about their stay at the hotel. We use Vader, which is part of the NLTK module and is designed for sentiment analysis. Vader uses a lexicon of words to find which ones are positive or negative. It also takes the context of the sentences into account to determine the sentiment scores. For each text, Vader returns four values:
● a neutrality score
● a positivity score
● a negativity score
● an overall score that summarizes the previous scores
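As a minimal illustration of how these four values are obtained with NLTK's Vader implementation (the example sentence is ours):

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# requires nltk data: vader_lexicon

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The room was spotless but the staff was rude.")
# scores is a dict with the four values described above:
# {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}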
The next step consists of extracting vector representations for every review. The Gensim module creates a numerical vector representation of every word in the corpus by using the contexts in which it appears (Word2Vec). This is performed using shallow neural networks (i.e. a neural network with a single hidden layer).
Each text can also be transformed into a numerical vector using the word vectors (Doc2Vec). Similar texts will have similar representations, which is why we can use those vectors as training features.
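A minimal Gensim sketch of this step, assuming cleaned_reviews is the list of cleaned review texts from the previous step and using illustrative hyperparameter values:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each cleaned review becomes a TaggedDocument with a unique integer tag.
documents = [TaggedDocument(words=review.split(), tags=[i])
             for i, review in enumerate(cleaned_reviews)]

# Train Doc2Vec; vector_size, window and epochs are illustrative values.
model = Doc2Vec(documents, vector_size=100, window=5, min_count=2, epochs=20)

# One numerical vector per review, usable as training features.
doc_vectors = [model.dv[i] for i in range(len(documents))]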
Finally, we add the TF-IDF (Term Frequency - Inverse Document Frequency) values for every word and every document. These are the steps involved in TF-IDF:
● TF counts the number of times the word appears in the given text
● IDF computes the relative importance of the word, which depends on how many texts the word can be found in
We add TF-IDF columns for every word that appears in at least 10 different texts, which filters out rare words and limits the size of the final feature set.
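In scikit-learn, this filtering corresponds to the min_df parameter; one way to compute these columns, again assuming cleaned_reviews from the cleaning step:

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=10 keeps only words appearing in at least 10 different reviews.
tfidf = TfidfVectorizer(min_df=10)
tfidf_features = tfidf.fit_transform(cleaned_reviews)  # sparse matrix: reviews x words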
The data is highly imbalanced: less than 5% of the samples belong to the positive class.
Sentiment Analysis using Machine Learning:
After cleaning and preprocessing the data, we adopted the following two methods to build our models:
● RandomForestClassifier:
We have built a Random Forest Classifier with all the default parameters (see the sketch after this list).
● Stacking:
We first trained and evaluated the following classifiers:
1. KNearestNeighbors Classifier
2. RandomForest Classifier
3. XGBoost Classifier
4. AdaBoost Classifier
5. Logistic Regression
6. MultiLayerPerceptron(MLP) Classifier
Out of these models, we chose the three best-performing classifiers (by accuracy score), i.e., RandomForest, KNN and XGBoost, for the first level of our stacked model. RandomForest was chosen as our meta-learner, as it gave the highest accuracy on this dataset.
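A minimal scikit-learn sketch of both models; X_train, y_train, X_test and y_test are assumed to come from a train/test split of the engineered features, and the stacked model is expressed here with scikit-learn's StackingClassifier, which may differ from the report's exact implementation:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Baseline: a Random Forest with all default parameters.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Stacked model: the three best classifiers as level-0 learners,
# with RandomForest as the meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("xgb", XGBClassifier()),
    ],
    final_estimator=RandomForestClassifier(),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # accuracy on the held-out split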
Sentiment Analysis using Transfer Learning with the FastAI library:
In this method, we use transfer learning for Natural Language Processing with the FastAI library. We apply the approach of the Universal Language Model Fine-tuning for Text Classification (ULMFiT) paper to classify the sentiment of hotel reviews. As with transfer learning in Computer Vision, we can obtain relatively good results even without a huge amount of training data.
Load Data:
The format of the given reviews is particular, as each review is separated into positive and negative points.
Since the data seems quite balanced, we can use accuracy as the metric.
We can see the true prowess of transfer learning by working with only 10000 samples.
In the fastai library, text preprocessing takes a single line. Behind the scenes, it performs all the different preprocessing steps:
● cleaning
● tokenizing
● indexing
● building the vocabulary
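A sketch of this single line with the fastai v2 API, assuming a dataframe df with a 'review' text column and a 'label' column (the report does not pin a fastai version):

from fastai.text.all import TextDataLoaders

# One call that cleans, tokenizes, indexes and builds the vocabulary.
dls = TextDataLoaders.from_df(df, text_col="review", label_col="label",
                              valid_pct=0.2)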
Then, rather than training a sentiment analysis model from scratch, we fine-tune a pre-trained language model whose weights are available in the fastai library. It uses the AWD_LSTM architecture, an LSTM network without attention, heavily regularized with several forms of dropout.
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modelling, and question answering. This model considers the specific problem of word-level language modelling and investigates strategies for regularizing and optimizing LSTM-based models. It uses the weight-dropped LSTM, which applies DropConnect on hidden-to-hidden weights as a form of recurrent regularization. It further uses NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, the model achieves state-of-the-art word-level perplexities on standard language-modelling benchmarks.
To see how well it has been trained, we can try to generate sentences with it. Even if they don't make much sense, the generated sentences seem grammatically correct.
We obtain an accuracy of about 30%, which means that our language model predicts the correct next word with probability 0.3. This is quite good for one minute of training, and especially with only 10,000 training examples. The language model now generates sentences adapted to the context of the data we just trained it on.
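A possible sketch of this fine-tuning and generation step with the fastai v2 API, under the same assumptions as above:

from fastai.text.all import (TextDataLoaders, language_model_learner,
                             AWD_LSTM, accuracy)

# Language-model dataloaders over the raw review text (is_lm=True).
dls_lm = TextDataLoaders.from_df(df, text_col="review", is_lm=True,
                                 valid_pct=0.1)

# Fine-tune the pre-trained AWD_LSTM language model on the reviews;
# accuracy here measures next-word prediction.
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fit_one_cycle(1, 1e-2)

# Generate text adapted to the hotel-review domain.
print(learn_lm.predict("The hotel room was", n_words=20))

# Keep the fine-tuned encoder for the classifier that follows.
learn_lm.save_encoder("ft_enc")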
We further improve upon this score with some of the hyperparameter tuning techniques presented in the ULMFiT paper.
Discriminative learning rates are a technique that consists of applying a different learning rate to each layer of the network. As we move toward the lowest layers, we decrease the learning rate, since the lowest layers represent the most general knowledge (Yosinski et al. 2014).
One-cycle learning is a technique commonly used in the fastai library. It consists of a cycle of the learning rate, which starts low, increases to the maximum value passed to the fit_one_cycle function, then decreases again. It prevents our network from overfitting; the technique was introduced in Leslie Smith's work on super-convergence.
Gradual unfreezing
Rather than fine-tuning all layers at once, the ULMFiT paper experiments with gradually unfreezing the network, from the last layer down to the lowest ones, each time fitting a single epoch. A sketch combining these three techniques follows.
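The sketch below uses the fastai v2 API, the classification dataloaders dls built earlier, and the ULMFiT ratio of 2.6**4 between the lowest and highest learning rates; the exact schedule in our experiments may differ:

from fastai.text.all import text_classifier_learner, AWD_LSTM, accuracy

# Classifier on top of the fine-tuned language-model encoder saved above.
learn_clf = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
learn_clf.load_encoder("ft_enc")

# One-cycle training of the head only, then gradual unfreezing with
# discriminative learning rates: slice(...) gives the lowest layers a
# smaller learning rate than the last ones.
learn_clf.fit_one_cycle(1, 2e-2)              # head only
learn_clf.freeze_to(-2)                       # unfreeze the last two layer groups
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clf.freeze_to(-3)
learn_clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
learn_clf.unfreeze()                          # finally fine-tune everything
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))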
The model is run on a few hotel comments to predict their category, i.e., positive, negative or neutral.
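For example, with the classifier fine-tuned above (the comment is ours):

# Run the fine-tuned classifier on a new hotel comment; predict returns
# the decoded class, its index, and the class probabilities.
pred_class, pred_idx, probs = learn_clf.predict(
    "The staff was friendly and the room was spotless.")
print(pred_class, probs)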
Conclusion:
3. Transfer Learning: