
Data Mining For Business Intelligence

Sentiment Analysis from Hotel Reviews


Group Members:

Ritesh Chhetri (17118057)

Sanat Bhargava (17122023)

Aditya (17122003)

Aniket Sujay (17122006)

Harshit Kankariya (17122011)

Aakash Khepar (17122001)

Hritik Saini (17122013)

Gagan Singh Saini (17122010)

Aman Kumar (17122005)

Sudhir (17122027)

Ankit Kumar (17122007)


Introduction
Sentiment analysis is one of the Natural Language Processing (NLP) techniques; it consists of extracting the emotions expressed in raw text. It is usually applied to social media posts and customer reviews in order to automatically determine whether users are positive or negative, and why. The goal of this study is to show how sentiment analysis can be performed using Python. Here are some of the main libraries we will use:

● NLTK: the most famous Python module for NLP techniques
● Gensim: a topic-modelling and vector-space modelling toolkit
● Scikit-learn: the most used Python machine-learning library

We will use hotel reviews data. Each observation consists of one customer review for one hotel. Each customer review is composed of textual feedback on the customer's experience at the hotel and an overall rating.

For each textual review, we want to predict whether it corresponds to a good review (the customer is happy) or to a bad one (the customer is not satisfied). The reviews' overall ratings range from 2.5/10 to 10/10. In order to simplify the problem, we split them into two categories:

● bad reviews have overall ratings < 5
● good reviews have overall ratings >= 5

Load data
We first start by loading the raw data. Each textual review is split into a positive part and a negative part. We group them together in order to start with only raw text data and no other information.

Clean data
If the user doesn't leave any negative feedback comment, this appears as "No Negative" in our data. The same holds for the positive comments, with the default value "No Positive". We have to remove those placeholders from our texts.

To clean the textual data, we created a custom 'clean_text' function that performs several transformations (a minimal sketch follows the list below):

● lowercase the text
● tokenize the text (split the text into words) and remove the punctuation
● remove useless words that contain numbers
● remove useless stop words like 'the', 'a', 'this', etc.
● Part-Of-Speech (POS) tagging: assign a tag to every word to define whether it corresponds to a noun, a verb, etc., using the WordNet lexical database

● lemmatize the text: transform every word into its root form (e.g. rooms -> room, slept -> sleep)
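A minimal sketch of such a function, assuming the relevant NLTK resources (punkt, stopwords, wordnet, averaged_perceptron_tagger) have been downloaded; the exact implementation used in the project may differ:

import string
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(tag):
    # map a Penn Treebank POS tag to the corresponding WordNet tag
    mapping = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}
    return mapping.get(tag[0], wordnet.NOUN)

def clean_text(text):
    # lowercase the text
    text = text.lower()
    # tokenize and strip punctuation
    tokens = [t.strip(string.punctuation) for t in word_tokenize(text)]
    # drop empty tokens and words that contain numbers
    tokens = [t for t in tokens if t and not any(c.isdigit() for c in t)]
    # remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    # POS-tag each word, then lemmatize it into its root form
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(t, get_wordnet_pos(tag))
                    for t, tag in pos_tag(tokens))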

Now that we have cleaned our data, we can do some feature engineering for the modelling part.

Feature engineering
We first start by adding sentiment analysis features, because we can expect customers' reviews to be highly linked to how they felt about their stay at the hotel. We use VADER, a part of the NLTK module designed for sentiment analysis. VADER uses a lexicon of words to find which ones are positive or negative, and it also takes the context of the sentences into account to determine the sentiment scores. For each text, VADER returns 4 values:

● a neutrality score
● a positivity score
● a negativity score
● an overall score that summarizes the previous scores

We will integrate those 4 values as features in our dataset.
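A sketch of how these scores can be added with NLTK's VADER implementation, after downloading the vader_lexicon resource; the DataFrame and column names (reviews_df, 'review') are illustrative assumptions:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
sentiments = reviews_df['review'].apply(sid.polarity_scores).apply(pd.Series)
reviews_df = pd.concat([reviews_df, sentiments], axis=1)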

Next, we add some simple metrics for every text (computed as shown below):

● number of characters in the text
● number of words in the text
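These can be computed in two lines (same illustrative DataFrame as above):

# number of characters and number of words in each review
reviews_df['nb_chars'] = reviews_df['review'].apply(len)
reviews_df['nb_words'] = reviews_df['review'].apply(lambda x: len(x.split()))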

The next step consists of extracting vector representations for every review. The Gensim module creates a numerical vector representation of every word in the corpus by using the contexts in which they appear (Word2Vec). This is performed using a shallow neural network (a neural network with just one hidden layer).

Each text can also be transformed into a numerical vector using the word vectors (Doc2Vec). Similar texts will have similar representations, and that is why we can use those vectors as training features.
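A sketch with Gensim's Doc2Vec; the cleaned-text column 'review_clean', the vector size, and the other training parameters are illustrative assumptions:

import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per review, tagged with its row index
documents = [TaggedDocument(text.split(), [i])
             for i, text in enumerate(reviews_df['review_clean'])]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

# infer one vector per review and add each component as a feature column
doc2vec_df = (reviews_df['review_clean']
              .apply(lambda x: pd.Series(model.infer_vector(x.split())))
              .add_prefix('doc2vec_vector_'))
reviews_df = pd.concat([reviews_df, doc2vec_df], axis=1)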

Finally, we add the TF-IDF (Term Frequency - Inverse Document Frequency) values for every word and every document. TF-IDF involves two steps:

● TF counts the number of times the word appears in the text
● IDF computes the relative importance of the word, which depends on how many texts the word can be found in

We add TF-IDF columns only for words that appear in at least 10 different texts, to filter some of them out and reduce the size of the final output.
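With scikit-learn, the "at least 10 different texts" filter corresponds to the min_df parameter (same illustrative column names as above):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# keep only the words that appear in at least 10 different texts
tfidf = TfidfVectorizer(min_df=10)
tfidf_matrix = tfidf.fit_transform(reviews_df['review_clean']).toarray()
tfidf_df = pd.DataFrame(tfidf_matrix,
                        columns=tfidf.get_feature_names_out(),
                        index=reviews_df.index).add_prefix('word_')
reviews_df = pd.concat([reviews_df, tfidf_df], axis=1)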

Exploratory Data Analysis (EDA):

These are our major takeaways from the EDA:

The data is highly imbalanced: less than 5% of the total samples belong to the positive class.

Sentiment Analysis using Machine Learning:

After cleaning the data and preprocessing it, we have adopted these two methods to build
our models:

● RandomForestClassifier:

We have built a Random Forest Classifier with all the default parameters.
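A sketch with scikit-learn, assuming the engineered features are stacked in a matrix X and the binary sentiment label is in y:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Random Forest with all default hyperparameters
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy on the held-out set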

● Stacking:

Stacking, or Stacked Generalization, is a machine learning ensemble method that relies on the aggregation of several models to build a more robust one. The idea behind stacking is to feed the dataset to different estimators, each of which learns a part of the problem but may not be able to cover the whole problem space. One can therefore build multiple different learners and use them to produce intermediate predictions, one prediction for each learned model. A new model, also called a meta learner, is then added, which learns the same target from these intermediate predictions. The theoretical basis of this idea can be found in Wolpert's original Stacked Generalization paper (1992).

We considered the performance of the following models on our dataset:

1. KNearestNeighbors Classifier
2. RandomForest Classifier
3. XGBoost Classifier
4. AdaBoost Classifier
5. Logistic Regression
6. MultiLayerPerceptron(MLP) Classifier

Out of these models, we chose the three best-performing (by accuracy score) classifiers, i.e. RandomForest, KNN and XGBoost, for the first level of our stacked model. RandomForest was chosen as our meta learner, as it gave the highest accuracy on this particular dataset. A sketch of this setup follows.
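A sketch of this two-level setup with scikit-learn's StackingClassifier (xgboost is a separate package; X_train and y_train are taken from the split above):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# first level: the three best-performing base classifiers
estimators = [
    ('rf', RandomForestClassifier()),
    ('knn', KNeighborsClassifier()),
    ('xgb', XGBClassifier()),
]
# the meta learner (another Random Forest) is trained on the base
# models' out-of-fold predictions
stack = StackingClassifier(estimators=estimators,
                           final_estimator=RandomForestClassifier())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))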

Sentiment Analysis using Transfer Learning with the FastAI library:

In this method, we'll use transfer learning in Natural Language Processing with the FastAI library. We apply the Universal Language Model Fine-tuning for Text Classification (ULMFiT) paper to classify the sentiment of hotel reviews. As with transfer learning in Computer Vision, we can obtain relatively good results without a huge sample of training data.

Load Data:
The format of the given reviews is particular, as each review is separated into positive and negative points.

We will transform this problem into a binary classification problem.

Since the data seems quite balanced, we can use accuracy as the metric.

We can see the true power of transfer learning by working with only 10,000 samples.

Pre-process and Fine-tune the Language Model:

In the fastai library, preprocessing text takes a single line. Behind the scenes, it performs the different steps of preprocessing:

● cleaning
● tokenizing
● indexing
● building the vocabulary

Then, rather than training a sentiment analysis model directly from scratch, we fine-tune a pre-trained language model whose weights are available in the fastai library. It uses the AWD-LSTM architecture, an LSTM network without attention, regularized with several forms of dropout (including DropConnect on the recurrent weights).

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modelling, and question answering. This model considers the specific problem of word-level language modelling and investigates strategies for regularizing and optimizing LSTM-based models. It uses the weight-dropped LSTM, which applies DropConnect on the hidden-to-hidden weights as a form of recurrent regularization. It further uses NT-ASGD, a variant of the averaged stochastic gradient method, in which the averaging trigger is determined by a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, the model achieves state-of-the-art word-level perplexities on standard language-modelling benchmarks.
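A sketch of the language-model fine-tuning with the fastai v2 API; the DataFrame df and its 'review' column are illustrative assumptions:

from fastai.text.all import *

# a single line of preprocessing: cleaning, tokenizing, indexing, vocabulary
dls_lm = TextDataLoaders.from_df(df, text_col='review', is_lm=True, valid_pct=0.1)

# fine-tune the pre-trained AWD_LSTM language model on the hotel reviews
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fit_one_cycle(1, 1e-2)

# save the fine-tuned encoder for the downstream sentiment classifier
learn_lm.save_encoder('ft_enc')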

To see how well it is trained, we can try to generate sentences with it. Even if they don't make any sense, the generated sentences seem grammatically correct.

We obtain an accuracy of about 30%, which means that our language model predicts the next word correctly with a probability of 0.3. This is quite good for 1 minute of training, especially with only 10,000 training examples. When we generate text with our language model, it now produces sentences adapted to the context of the data we just trained it on.
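We can sample from the fine-tuned model, e.g. (prompt and parameter values illustrative):

# generate 30 words following a prompt; temperature controls randomness
print(learn_lm.predict("The hotel room was", n_words=30, temperature=0.75))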

Train a sentiment analysis model:


In only 1 epoch with 10K samples, we reach more than 90% accuracy.
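A sketch of the classifier, reusing the fine-tuned encoder from above (the 'label' column name is an illustrative assumption):

from fastai.text.all import *

# the classifier DataLoaders must share the language model's vocabulary
dls_clas = TextDataLoaders.from_df(df, text_col='review', label_col='label',
                                   text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2)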

We further improve upon this score with some hyperparameter tuning techniques presented in the ULMFiT paper.

Hyperparameter tuning techniques

Discriminative learning rates

Discriminative learning rates is a technique that consists of applying a different learning rate to each layer of the network. Going from the last layer down to the first, we decrease the learning rate, since the lowest layers represent the most general knowledge (Yosinski et al. 2014).

One cycle learning

One cycle learning is a technique commonly used in the fastai library. It consists of a cycle of the learning rate, which starts low, increases up to the maximum value passed to the fit_one_cycle function, and then decreases again. It helps prevent our network from overfitting; the technique is described in detail in Leslie Smith's paper on super-convergence.

Gradual unfreezing

Rather than fine-tuning all layers at once, the ULMFiT paper experiments with gradual unfreezing, from the last layer down to the lowest ones, each time fitting one single epoch. A sketch combining these three techniques follows.
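A sketch of the three techniques together with fastai; the learning-rate values and the 2.6 scaling factor (taken from the ULMFiT paper) are illustrative:

# gradual unfreezing: first unfreeze only the last two layer groups
learn_clas.freeze_to(-2)
# one-cycle schedule with discriminative learning rates:
# slice(...) assigns lower rates to the earlier, more general layers
learn_clas.fit_one_cycle(1, slice(1e-3 / 2.6, 1e-3))

# then unfreeze the whole network and fit one more epoch
learn_clas.unfreeze()
learn_clas.fit_one_cycle(1, slice(1e-4 / 2.6, 1e-4))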

Results of Transfer Learning:

The model is run on a few hotel comments to predict their category, i.e. positive, negative or neutral.

Conclusion:

1. Random Forest Classifier:

2. Stacked Generalization (Stacking):

3. Transfer Learning:

