SENTIMENTAL ANALYSIS ON MOBILE PHONE REVIEWS
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
M. LEKHANA (17BF1A1232)
K. MOULI (17BF1A1231)
G. RAGA MALIKA (17BF1A1220)
CERTIFICATE
This is to certify that the project report entitled "SENTIMENTAL ANALYSIS ON MOBILE PHONE REVIEWS" is the bonafide work carried out by
M. LEKHANA (17BF1A1232)
K. MOULI (17BF1A1231)
G. RAGA MALIKA (17BF1A1220)
in partial fulfillment of the requirements for the award of the Degree of BACHELOR OF TECHNOLOGY in INFORMATION TECHNOLOGY, JNTUA, Anantapur.
ACKNOWLEDGEMENT
We are thankful to our project coordinator, Mr. A. Basi Reddy, Senior Assistant Professor, for his guidance and regular schedules.
We would like to express our grateful and sincere thanks to Dr. S. Murali Krishna, Head, Dept of IT,
for his kind help and encouragement during the course of our study and in the successful completion of
the project work.
We have great pleasure in expressing our hearty thanks to our beloved Principal, Dr. N. Sudhakar Reddy, for spending his valuable time with us to complete the project.
Successful completion of any project cannot be done without proper support and encouragement. We
sincerely thank the Management for providing all the necessary facilities during the course of study.
We would like to thank our parents and friends, who have made the greatest contributions to all our achievements, for their great care and blessings in making us successful in all our endeavors.
M. LEKHANA 17BF1A1232
K. MOULI 17BF1A1231
G. RAGA MALIKA 17BF1A1220
DECLARATION
We hereby declare that the project report entitled "SENTIMENTAL ANALYSIS ON MOBILE PHONE REVIEWS" is submitted to the Department of Information Technology, Sri Venkateswara College of Engineering, Tirupati, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology.
This Project is the result of our own effort and it has not been submitted to any other University or
Institution for the award of any degree or diploma other than specified above.
M. LEKHANA 17BF1A1232
K. MOULI 17BF1A1231
G. RAGA MALIKA 17BF1A1220
CONTENTS
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
1 INTRODUCTION
2 LITERATURE SURVEY
2.1 ASPECT BASED SENTIMENTAL ANALYSIS
2.2 CLASSIFICATION LEVELS
2.3 SENTIMENT CLASSIFICATION TECHNIQUES
3 COMPUTATIONAL ENVIRONMENT
4 FEASIBILITY STUDY
4.1 ECONOMICAL FEASIBILITY
4.2 TECHNICAL FEASIBILITY
4.3 SOCIAL FEASIBILITY
5 SYSTEM ANALYSIS
6 SYSTEM DESIGN
7 SYSTEM IMPLEMENTATION
7.1 MODULES
7.2 ALGORITHMS
8 TESTING
8.1 UNIT TESTING
8.2 INTEGRATION TESTING
8.3 FUNCTIONAL TESTING
8.4 SYSTEM TESTING
8.5 ACCEPTANCE TESTING
Index Terms — Sentimental Analysis, Bag of Words model, TF-IDF and Logistic Regression
LIST OF ABBREVIATIONS
LIST OF FIGURES
SENTIMENTAL ANALYSIS ON MOBILE PHONE REVIEWS
1. INTRODUCTION
1.1 Introduction & Objective:
Sentiment Analysis is one of the interesting applications of text analytics.
Although it is often associated with sentiment classification of documents, broadly
speaking it refers to the use of text analytics approaches applied to the set of problems
related to identifying and extracting subjective material in text sources.
1.3 Objectives
The main objective of this project is to go the extra mile and provide users with an output that summarizes the analysis of thousands of reviews. This saves time: the system analyzes thousands of reviews in a short period, whereas analyzing those reviews manually could take decades.
Sentiment is an idea or feeling that someone expresses in words. With that in mind, sentiment analysis is the process of predicting/extracting these ideas or feelings. We want to know if the sentiment of a piece of writing is positive, negative or neutral. Exactly what we mean by positive/negative sentiment depends on the problem we're trying to solve. In a song-review example (say, reviews of a BTS song), we are trying to predict a listener's opinion: positive sentiment would mean the listener enjoyed the song. Alternatively, we could be using sentiment analysis to flag potential hate speech on our platform; in this case, negative sentiment would mean the text potentially contains hate speech.
On lines 2–5, we have some standard packages such as Pandas/NumPy to handle our data and Matplotlib/Seaborn to visualise it. For modelling, we use the svm package (line 7) from scikit-learn. We also use some metrics packages (line 8) to measure the performance of our model. The last set of packages is used for text processing; they will help us clean our text data and create model features.
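A minimal sketch of the kind of import block being described is shown below; the exact listing used in this project appears later in the Jupyter notebook section, and the packages named here are only those mentioned in the paragraph above.

import pandas as pd                       # data handling
import numpy as np
import matplotlib.pyplot as plt           # visualisation
import seaborn as sns
from sklearn import svm                   # support vector machine models
from sklearn import metrics               # accuracy, confusion matrix, etc.
from nltk.corpus import stopwords         # text processing helpers
from nltk.tokenize import word_tokenize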
Dataset
To train our sentiment analysis model, we use a sample of tweets from the dataset. This dataset
contains 1.6 million tweets that have been classified as having either a positive or negative
sentiment.
Text cleaning
The next step is to clean the text. We do this to remove aspects of the text that are not important, which hopefully makes our models more accurate. Specifically, we will make our text lower case and remove punctuation. We will also remove very common words, known as stopwords, from the text. To do this, we have created the function below, which takes a piece of text, performs the above cleaning and returns the cleaned text. In line 18, we apply this function to every tweet in our dataset.
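A minimal sketch of such a cleaning function is given below, assuming NLTK's English stopword list has been downloaded; the DataFrame column name 'text' is only illustrative.

import re
import string
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # lower case, strip punctuation and drop stopwords
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

# illustrative use on a DataFrame column named 'text' (column name assumed)
# df['clean_text'] = df['text'].apply(clean_text)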
Even after cleaning, like all ML models, SVMs cannot understand text. What we mean by this is
that our model cannot take in the raw text as an input. We have to first represent the text in a
mathematical way. In other words, we must transform the tweets into model features. One way to
do this is by using N-grams.
N-grams are sets of N consecutive words. In Figure 2, we see an example of how a sentence is
broken down into 1-grams (unigrams) and 2-grams (bigrams). Unigrams are just the individual
words in the sentence. Bigrams are the set of all two consecutive words. Trigrams (3-grams)
would be the set of all 3 consecutive words and so on. You can represent text mathematically by
simply counting the number of times certain N-grams occur.
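As an illustration of counting N-grams, scikit-learn's CountVectorizer can produce unigram and bigram counts; this is a small sketch with a made-up sentence, not the project's actual feature pipeline.

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the battery life of this phone is great"]

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(sentences)

print(sorted(vectorizer.vocabulary_))   # the unigrams and bigrams found
print(counts.toarray())                 # how many times each one occurs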
2. LITERATURE SURVEY
Sentiment analysis, or opinion mining, is the computational study of people's opinions, sentiments, attitudes, and emotions expressed in written language. It is one of the most active research areas in natural language processing and text mining in recent years. Aspect-based sentiment analysis (ABSA) deals with identifying aspects of given target entities and estimating the sentiment polarity for each mentioned aspect.
Aspect extraction can be thought of as a kind of information extraction that deals with recognizing aspects of the entity. There are two ways to perform aspect extraction:
• Find high-frequency words or phrases across reviews and filter them by conditions such as "occurs right after a sentiment word".
• Specify all the aspects in advance and find them in the reviews.
In our project, we display a word cloud, which is a graphical representation of the frequently used words in the dataset.
Aspect sentiment classification deals with the opinions and determines whether a certain opinion on an aspect is positive, negative or neutral.
A word can be described in many ways: it can be attractive, shocking, positive, negative or emotional. To identify the type of word description we need to perform sentiment analysis, where we need to understand the classification between words and refine the sentiments into certain categories such as positive and negative. This can be done using a lexicon, where a lexicon is a collection of hash tables, dictionaries and word lists used to identify the polarity of words as positive or negative. Using these lexicons we can classify word sentiment.
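As a toy illustration of lexicon-based word classification, the sketch below uses two hand-made word lists; real lexicons of the kind described above are far larger.

# toy positive/negative word lists (illustrative only)
positive_words = {"good", "great", "excellent", "amazing", "love"}
negative_words = {"bad", "poor", "terrible", "awful", "hate"}

def word_polarity(word):
    # classify a single word as positive, negative or neutral using the lexicon
    word = word.lower()
    if word in positive_words:
        return "positive"
    if word in negative_words:
        return "negative"
    return "neutral"

print(word_polarity("great"))     # positive
print(word_polarity("terrible"))  # negative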
Sentiment Analysis (SA) is considered a three-layered approach. The first layer is document based, the second is sentence based, and the third is the aspect level, also considered word or phrase based.
In SA, the first level is considered the document level. At this level, an entire document is
considered as a whole for SA. In this treatment for conducting SA, the one making an
opinion is considered a single source or an individual entity. A different aspect of SA at
this level has been observed to be sentiment regression. To gauge the extent of positive or
negative outlook, many researchers have turned to using supervised learning to estimate
ratings of a document. In another study, researchers have proposed a linear-based combination approach based on polarities observed in text documents. A major problem observed at the document level is that not all sentences expressing opinions can be deemed subjective sentences. Hence, the accuracy of results depends on how closely each sentence is extracted and analysed individually. This method, therefore, promotes the rate at which subjective sentences can be extracted for the purpose of SA, while objective sentences can be set aside. Research in SA techniques often places major thrust and emphasis at the sentence level.
Research studies abound on classifying and analysing each sentence in a document or piece
of text as either objective or subjective ones. In one example, the authors propose
conducting SA on subjective sentences alone following classification. As a tool for
detecting subjective sentences, machine learning has been an asset for researchers and
scholars in this field. In one study, the authors have proposed a model based on logarithmic
probability rate and a number of root terms to form the basis of scorecard for categorization
of each subjective sentence classified. In another paper, research scholars have postulated
a model that takes into account sentiments of all terms in each sentence to formulate an
overall sentiment for the sentence under consideration. SA at the sentence level is not without its own shortcomings: there may be objective sentences that actually carry sentiments which go undetected. An example of such a sentence could be: "I bought a table from a reputed online store only to find that its legs are not stable enough."
Aspect level SA aims at addressing the shortcomings of the document and sentence levels of SA. Fine-grained control can be exercised with its help. The target focus of aspect level SA is to examine opinions critically and exclusively. Aspect level SA assumes that an opinion can only be one among positive, neutral, negative or an outright objective sentiment. If one were to consider the sentence "Telephone call quality of Sony phones is remarkable, save and except for the quality of the battery", two statements are immediately apparent: the call quality of Sony phones is good, while that of the battery is not so much. This approach, therefore, enables turning unstructured content into an organized form of information that can be subjected to a range of subjective and quantitative experiments. Such a finer degree of SA is mostly beyond the scope of document and sentence level analyses.
Sentiment classification techniques can be segregated into three categories. These are
machine learning, lexicon-based and hybrid approaches. The first of these techniques
involve popular machine learning (ML) algorithms and involves using linguistic features.
The second involves analyses through a collection of sentiment terms that are precompiled
into sentiment lexicon. This is further divided into dictionary- and corpus-based
approaches that use semantic or statistical methods to gauge the extent of polarity of
sentiment. The hybrid approach involves combining ML and lexicon-based approaches.
The following illustration aims at providing an insight into more popular algorithms used
in sentiment classification techniques.
In the machine learning approach, machine learning (ML) algorithms are used almost exclusively and extensively to conduct SA. ML algorithms are used in conjunction with linguistic and syntactic features.
Supervised Learning
Supervised learning involves datasets that are clearly labelled. Such datasets are shared with assorted supervised learning models [23, 24, 25, 27].
Decision Tree Classifiers
This type of classifier extends a hierarchical breakdown of the training data space in which attribute values are used for data segregation [28]. This method is predicated upon the absence or presence of one or more words and is applied recursively until a minimum number of records are registered with leaf nodes that are used for classification.
Linear Classification
Linear classification models include the support vector machine (SVM), a form of classifier focused on finding direct separators between different classes; they also include neural networks [29]. SVM is a form of supervised learning model and works chiefly on the principle of decision boundaries set up by decision planes. A decision plane is defined as a boundary separating sets of objects that belong to different class memberships. SVMs have been designed fundamentally to identify and isolate linear separators in the search space for the purpose of categorizing assorted classes. Text data is considered ideally suited to SVM classification in view of its sparse nature: only a few features are irrelevant, while features tend to be correlated with one another and organized into linearly distinguishable buckets, so text data is often considered an ideal candidate for SVM. With the help of SVM, a non-linear decision surface can also be constructed over the original feature space; this can be achieved by mapping data instances non-linearly to an inner product space where the classes can be linearly segregated using a hyperplane.
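As a hedged illustration of the SVM approach described above, the sketch below trains a linear SVM on TF-IDF features; the example reviews and labels are made up, and this is not the classifier used in the rest of this project (which uses logistic regression).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# illustrative training data: review texts and sentiment labels (1 = positive, 0 = negative)
texts = ["battery life is great", "screen cracked after a week",
         "camera quality is excellent", "worst phone I have ever bought"]
labels = [1, 0, 1, 0]

svm_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # sparse text features
    ("svm", LinearSVC()),           # linear separator in the feature space
])
svm_clf.fit(texts, labels)
print(svm_clf.predict(["the battery is excellent"]))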
Neural networks are constituted, as the term suggests, of neurons as the basic building block. The inputs to a neuron are depicted using a vector denoting the word frequencies in a document. A set of weights is associated with each neuron to enable computation of a function of the inputs involved. For boundaries that are nonlinear, multilayer neural networks are employed: the multiple layers are used in conjunction with multiple pieces of linear boundaries to approximate enclosed regions belonging to a particular class. Neuron outputs generated in earlier layers feed the neurons in subsequent layers. The training process therefore becomes progressively more complex as errors are back-propagated across all layers. As shown by the authors, SVM and NN can also be deployed for classifying relationships of a personal nature in biographical texts. In their research, relations between two individuals were marked as being positive, neutral or unknown.
Maximum Entropy Classifier
The maximum entropy (ME) classifier is a type of probabilistic classifier belonging to the exponential class of models; it does not rely on the assumption that features are independent. Instead, ME is based on the Principle of Maximum Entropy and selects the model that has the largest entropy. ME classifiers find use in applications such as language identification, sentiment analysis and topic classification.
Lexicon-based Approach
This kind of approach determines polarity by employing opinion words from a sentiment dictionary and matching them with the data. Such an approach assigns sentiment scores to indicate positive, negative or objective types of words. Lexicon-based approaches depend on a sentiment lexicon involving a set of precompiled and known sentiment phrases, terms and idioms. Two sub-classifications exist for this type of approach.
These are discussed in subsequent sections.
Dictionary-based Approach
In this approach, a small seed list of opinion words is collected manually and then expanded by looking up a notable corpus, WordNet, for appropriate synonyms and antonyms relevant to the SA. The process is iterative and stops only when no new words are detected; otherwise, subsequent iterations follow as words are progressively appended to the seed list. After the process stops, a manual appraisal is conducted to evaluate and correct errors. However, this approach is not without its flaws, as it is often unable to detect opinion words whose orientation is specific to a particular domain or context.
Corpus-Based
This approach involves dictionaries specific to given domain. The dictionaries are
produced on the basis of seeds of opinion terms that grow out of search for related words
through use of statistical or semantic procedures.
Aside from the individual ML approaches or the lexicon-based approach described earlier, there are a select few research techniques that involve a mixture of both.
The improved Naïve Bayes and SVM algorithms find frequent mention in research studies.
To narrow the gap between positive and negative, feature selections such as unigrams and
bigrams are often used. Studies have shown that combination of ML and dictionary-based
methods can significantly improve the level of classification of sentiments.
3. COMPUTATIONAL ENVIRONMENT
PIP is a package management system used to install and manage software packages/libraries
written in Python. These files are stored in a large “on-line repository” termed as Python
Package Index (PyPI).
pip uses PyPI as the default source for packages and their dependencies.
To install Jupyter using pip, we need to first check if pip is updated in our system. Use the
following command to update pip:
python -m pip install --upgrade pip
After updating the pip version, follow the instructions provided below to install Jupyter:
• Command to install Jupyter:
python -m pip install jupyter
Launching Jupyter:
Use the following command to launch Jupyter from the command line:
jupyter notebook
3.3.2 Python
HISTORY
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired
by SETL), capable of exception handling and interfacing with the Amoeba operating
system. Its implementation began in December 1989. Van Rossum shouldered sole
responsibility for the project, as the lead developer, until July 12, 2018, when he announced
his "permanent vacation" from his responsibilities as Python's Benevolent Dictator for
Life, a title the Python community bestowed upon him to reflect his long-term commitment
as the project's chief decision-maker. He now shares his leadership as a member of a five-
person steering council. In January 2019, active Python core developers elected Brett Cannon,
Nick Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering
Council" to lead the project. Python 2.0 was released on 16 October 2000 with many major new
features, including a cycle-detecting garbage collector and support for Unicode. Python 3.0 was
released on 3 December 2008. It was a major revision of the language that is not completely
backward-compatible. Many of its major features were backported to Python 2.6.x and 2.7.x
version series. Releases of Python 3 include the 2to3 utility, which automates (at least partially)
the translation of Python 2 code to Python 3. Python 2.7's end-of-life date was initially set at 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.
1. The default install path is in the AppData/directory of the current Windows user.
2. The Customize installation button can be used to customize the installation location and
which additional features get installed, including pip and IDLE.
3. The "Install launcher for all users (recommended)" checkbox is checked by default. This means every user on the machine will have access to the py.exe launcher. You can uncheck this box to restrict Python to the current Windows user.
4. The Add Python 3.8 to PATH checkbox is unchecked by default. There are several
reasons that you might not want Python on PATH, so make sure you understand the
implications before you check this box.
The full installer gives you total control over the installation process. Congratulations—
you now have the latest version of Python 3 on your Windows machine!
3.3.3 Anaconda
Installing on Windows
1. Download the installer:
o Miniconda installer for Windows.
o Anaconda installer for Windows.
2. Verify your installer hashes.
3. Double-click the .exe file.
4. Follow the instructions on the screen.
If you are unsure about any setting, accept the defaults. You can change them later.
When installation is finished, from the start menu, open the Anaconda Prompt.
5. Test your installation. In your terminal window or Anaconda Prompt, run the command conda list. A list of installed packages appears if it has been installed correctly.
3.3.4 STREAMLIT:
Streamlit is a tool for creating web-based front ends, aimed at machine learning scientists and engineers. It allows them to quickly put together a Python-based web GUI front end. It is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science.
Installing on Windows
1. Make sure that you have Python 3.6 – 3.8 installed.
2. Install Streamlit using pip and run the 'hello world' app:
pip install streamlit
streamlit hello
3. That's it! In the next few seconds, the sample app will open in a new tab in your default browser.
Still with us? Great! Now make your own app in just 3 more steps:
1. Open a new Python file, import Streamlit, and write some code
2. Run the file with:
streamlit run [filename]
3. When you’re ready, click ‘Deploy’ from the Streamlit menu to share your app with the
world!
Now that you’re set up, let’s dive into more of how Streamlit works and how to build great apps.
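For reference, a minimal Streamlit script of the kind described above could look like the sketch below; the file name and its contents are illustrative and are not part of this project's app1.py.

# hello_app.py  (illustrative file name; run with: streamlit run hello_app.py)
import streamlit as st

st.title("Sentiment Analysis on Mobile Phone Reviews")
review = st.text_input("Enter a review")

if review:
    # a real app would pass the review to a trained model here
    st.write("You entered:", review)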
4. FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis,
the feasibility study of the proposed system is to be carried out. This is to ensure that the
proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
4.1 TYPES OF FEASIBILITY
Three key considerations involved in the feasibility analysis are
• ECONOMICAL FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY
4.3 SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the users. It includes the process of training the user about the system and making him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
5. SYSTEM ANALYSIS
5.1 EXISTING SYSTEM
Due to the increase in demand for e-commerce, with people preferring online purchasing of goods and products, a vast amount of information is being shared. E-commerce websites are loaded with millions of reviews. The customer finds it difficult to precisely locate the reviews for a particular feature of a product that he intends to buy. There is a mixture of positive and negative reviews, thereby making it difficult for the customer to judge whether the product is genuine or not. Also, the reviews suffer from spammed reviews posted by unauthenticated users.
DISADVANTAGES
➢ There are thousands of reviews on e-commerce websites; if those reviews were analyzed manually, it could take decades.
➢ It is difficult to find the sentiment of a review, i.e. whether it is positive or negative.
6. SYSTEM DESIGN
System Design
The Systems design is the process of defining the architecture, components, modules,
interfaces, and data for a system to satisfy specified requirements. One could see it as the
application of systems theory to product development. There is some overlap with the
disciplines of systems analysis, systems architecture and systems engineering. If the broader topic of product development "blends the perspective of marketing, design, and manufacturing into a single approach to product development", then design is the act of taking the marketing information and creating the design of the product to be manufactured. Systems design is therefore the process of defining and developing systems to satisfy the specified requirements of the user.
Physical Design
Physical design relates to the actual input and output processes of the system. This is laid
down in terms of how data is input into a system, how it is verified / authenticated, how it
is processed, and how it is displayed as output. In physical design, the following requirements about the system are decided:
➢ Input requirements
➢ Output requirements
➢ Storage requirements
➢ Processing Requirements
➢ System control and backup or recovery
The physical portion of systems design can generally be broken down into three sub-tasks:
➢ User Interface
➢ Design Data Design
➢ Process Design
User Interface Design is concerned with how users add information to the system and with
how the system presents information back to them. Data Design is concerned with how the
data is represented and stored within the system. Finally, Process Design is concerned with
how data moves through the system, and with how and where it is validated, secured and/or
transformed as it flows into, through and out of the system.
Class Diagram
• Implementation: a dotted line with a solid arrowhead that points from a class to the interface that it implements.
• Association: a solid line with an open arrowhead that represents a "has a" relationship. The arrow points from the containing class to the contained class. Associations can be one of the following two types or not specified.
• Dependency: a dotted line with an open arrowhead that shows that one entity depends on the behavior of another entity.
Class Diagram
Use Case Diagram
A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. A use case is made up of a set of possible sequences of interactions between systems and users in a particular environment, related to a particular goal. It consists of a group of elements (for example, classes and interfaces) that can be used together in a way that will have an effect larger than the sum of the separate elements combined.
The main purpose of a use case diagram is to show what system functions are performed
for which actor. Roles of the actors in the system can be depicted.
Relationships
Generalization
Associations
Associations between actors and use cases are indicated in use case diagrams by solid lines.
Associations are modelled as lines connecting use cases and actors to one another, with an
optional arrowhead on one end of the line.
1. Identifying actor
2. Identifying use cases
3. Review your use case for completeness
Activity Diagram
Before drawing an activity diagram, we must have a clear understanding of the elements used in an activity diagram. An activity is a function performed by the system. After identifying the activities, we need to understand how they are associated with constraints and conditions. So before drawing an activity diagram we should identify the following elements:
• Activities
• Association
• Conditions
• Constraints
The following are the basic notational elements that can be used to make up a diagram:
Initial state
An initial state represents a default vertex that is the source for a single transition to the default state of a composite state. There can be at most one initial vertex in a region. The outgoing transition from the initial vertex may have behaviour, but not a trigger or guard. It is represented by a filled circle pointing to the initial state.
Final state
A special kind of state signifying that the enclosing region is completed. If the enclosing region is directly contained in a state machine and all other regions in the state machine are also completed, then the entire state machine is completed. It is represented by a hollow circle containing a smaller filled circle, indicating the final state.
Rounded rectangle
It denotes a state. The top of the rectangle contains the name of the state. It can contain a horizontal line in the middle, below which the activities performed in that state are indicated.
Arrow
It denotes a transition. The name of the event (if any) causing this transition labels the arrow body.
Activity Diagram
Component Diagram
A component diagram is used to break down a large object-oriented system into smaller components, so as to make them more manageable. It models the physical view of a system, such as executables, files, libraries, etc., that reside within the node.
It visualizes the relationships as well as the organization between the components present
in the system. It helps in forming an executable system. A component is a single unit of the
system, which is replaceable and executable. The implementation details of a component are
hidden, and it necessitates an interface to execute a function. It is like a black box whose behavior
is explained by the provided and required interfaces. The purpose of a component diagram is to
show the relationship between different components in a system.
The steps below outline the major steps to take in creating a UML Component Diagram.
1) Decide on the purpose of the diagram.
2) Add components to the diagram, grouping them within other components if appropriate.
3) Add other elements to the diagram, such as classes, objects and interfaces.
4) Add the dependencies between the elements of the diagram.
Component Diagram
Deployment Diagram
Deployment diagrams are used to visualize the topology of the physical components of a system, where the software components are deployed. They are used to describe the static deployment view of a system and consist of nodes and their relationships. A deployment diagram shows the configuration of run-time processing nodes and the components that live on them. A deployment diagram is a kind of structure diagram used in modelling the physical aspects of an object-oriented system.
Deployment Diagram
Data Flow Diagram
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modeling its process aspects. A DFD is often used as a preliminary step to create an overview of the system without going into great detail, which can later be elaborated.
Admin:
7. SYSTEM IMPLEMENTATION
7.1 MODULES
Modules:
The system can be divided into three major modules, which can be sub-tasked further. The modules are described below.
The Bag of Words (BoW) model learns a vocabulary from all of the documents, and then
models each document by counting the number of times each word appears. In this model,
a text (such as a sentence or a document) is represented as the bag (multiset) of its words,
disregarding grammar and even word order but keeping multiplicity. The BoW model is
commonly used in methods of document classification, where the (frequency of)
occurrence of each word is used as a feature for training a classifier.
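A small sketch of the BoW representation using scikit-learn's CountVectorizer, with made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the phone is good", "the battery is not good"]

bow = CountVectorizer()
X = bow.fit_transform(docs)

print(sorted(bow.vocabulary_))   # the learned vocabulary
print(X.toarray())               # per-document word counts (columns follow the sorted vocabulary)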
● We build the prediction model based on the logistic regression algorithm, as the problem here is a supervised classification problem.
● Logistic regression is a mathematical model used in statistics to estimate the probability of an event occurring given some previous data. It is used when the dependent (target) variable is categorical. The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
● Logistic regression generally explains the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
● In this algorithm we use the sigmoid function in order to map predicted values to probabilities. This function maps any prediction into a probability whose value lies between 0 and 1.
● The performance of the logistic regression algorithm is evaluated using a confusion matrix and a classification report.
● In any classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X; the sigmoid function maps the model's raw output to such a probability, as shown in the sketch below.
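The sketch below shows how such a prediction model can be assembled from TF-IDF features and scikit-learn's LogisticRegression; the training data here is made up for illustration, while the real pipeline over the Amazon reviews appears in the implementation listing.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# illustrative reviews and sentiment labels (1 = positive, 0 = negative)
reviews = ["great phone, love the camera", "terrible battery, would not recommend",
           "works perfectly and looks great", "stopped working after two days"]
sentiment = [1, 0, 1, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
model.fit(reviews, sentiment)

# predict_proba returns a probability between 0 and 1 for each class
print(model.predict_proba(["the camera is great but the battery is terrible"]))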
The plot shows a model of the relationship between a continuous predictor and the
probability of an event or outcome. The linear model clearly does not fit if this is the true
relationship between X and the probability. In order to model this relationship directly, you must
use a nonlinear function. The plot displays one such function. The S-shape of the function is
known as sigmoid.
Logit transformation
A logistic regression model applies a logit transformation to the probabilities. The logit is the
natural log of the odds.
Here, as the diagram shows, if the value of z goes to positive infinity then the predicted value of y becomes 1, and if it goes to negative infinity then the predicted value of y becomes 0.
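In symbols (standard definitions, included here for reference), the sigmoid and the logit are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(z) \to 1 \text{ as } z \to +\infty, \qquad \sigma(z) \to 0 \text{ as } z \to -\infty

\operatorname{logit}(p) = \ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k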
In our project, a confusion matrix is applied to the logistic regression algorithm to check its performance.
Confusion matrix
● Expected down the side: each row of the matrix corresponds to an actual (expected) class.
● Predicted across the top: each column of the matrix corresponds to a predicted class.
True Positive:
Interpretation: You predicted positive and it’s true. You predicted that a woman is pregnant and
she actually is.
True Negative:
Interpretation: You predicted negative and it’s true. You predicted that a man is not pregnant and
he actually is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false. You predicted that a man is pregnant but
he actually is not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it's false. You predicted that a woman is not pregnant but she actually is.
Precision:
Precision can be explained as: of all the instances we have predicted as positive, how many are actually positive. Precision should be as high as possible.
Accuracy:
Accuracy measures, of all the classes (positive and negative), how many we have predicted correctly. The accuracy should be as high as possible.
F-Measure:
It is difficult to compare two models with low precision and high recall or vice versa. So, to make them comparable, we use the F-score. The F-score helps to measure recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing the extreme values more.
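The standard formulas behind these metrics, written in terms of true/false positives and negatives, are:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}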
The report shows the main classification metrics precision, recall and f1-score on a per-class
basis. The metrics are calculated by using true and false positives, true and false negatives.
Positive and negative in this case are generic names for the predicted classes. There are four
ways to check if the predictions are right or wrong:
Here, precision is what percentage of your positive predictions were correct, recall is what percentage of the actual positive cases you caught, and the F1-score combines the two as their harmonic mean.
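In scikit-learn, this per-class report can be produced with classification_report; the labels below are made up for illustration.

from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]   # actual sentiment labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (illustrative)

print(confusion_matrix(y_true, y_pred))       # rows = actual class, columns = predicted class
print(classification_report(y_true, y_pred))  # precision, recall and f1-score per class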
WORD CLOUD
A word cloud was created for the positive and negative sentiment reviews of a selected brand of mobile phones. Word clouds, also known as text clouds, work in a simple way: the more a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud. A word cloud is a collection, or cluster, of words depicted in different sizes; the bigger and bolder a word appears, the more often it is mentioned within a given text and the more important it is.
8. TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the Software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests; each test addresses a specific testing requirement.
TYPES OF TESTS
Features to be tested
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
(JUPYTER NOTEBOOK)
#imported modules
import pandas as pd
import numpy as np
import nltk
import future
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from wordcloud import WordCloud
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import re
import nltk
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
# Load csv file
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()
#description
print("\nTotal number of reviews: ",len(df))
print("\nTotal number of brands: ", len(list(set(df['Brand Name']))))
print("\nTotal number of unique products: ", len(list(set(df['Product Name']))))
#labelled data
def label_data():
    rows = pd.read_csv('Amazon_Unlocked_Mobile.csv', header=0, index_col=False,
                       delimiter=',')
    labels = []
    for cell in rows['Rating']:
        if cell >= 4:
            labels.append('2')   # Good
        elif cell == 3:
            labels.append('1')   # Neutral
        else:
            labels.append('0')   # Poor
    rows['Label'] = labels
    del rows['Review Votes']
    return rows
def clean_data(data):
    # replace blank values in all the cells with NaN
    df.replace('', np.nan, inplace=True)
    # delete all the rows which contain at least one cell with a NaN value
    df.dropna(axis=0, how='any', inplace=True)
    # save output csv file
    df.to_csv('labelled_dataset.csv', index=False)
    return data
clean_data(df)
df = pd.read_csv('labelled_dataset.csv')
df.head()
l = df["Rating"].values
print(list(l).count(0))
print(list(l).count(1))
print(list(l).count(2))
print(list(l).count(3))
print(list(l).count(4))
print(list(l).count(5))
df3 = pd.DataFrame([["Positive", 230674], ["Neutral", 26058], ["Negative", 77603]],
                   columns=["Polarity", "Frequency"])
df = df.sample(frac=0.1, random_state=0)   # comment this line out to use the full set of data
# Drop missing values
df.dropna(inplace=True)
# Remove any 'neutral' ratings equal to 3
df = df[df['Rating'] != 3]
# Encode 4s and 5s as 1 (positive sentiment) and 1s and 2s as 0 (negative sentiment)
df['Sentiment'] = np.where(df['Rating'] > 3, 1, 0)
df.head()
def cleanText(raw_text, remove_stopwords=False, stemming=False, split_text=False):
    '''
    Convert a raw review to a cleaned review
    '''
    text = BeautifulSoup(raw_text, 'lxml').get_text()   # remove html
    letters_only = re.sub("[^a-zA-Z]", " ", text)       # remove non-letter characters
    words = letters_only.lower().split()                # convert to lower case
    if remove_stopwords:                                # remove stopwords
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    if stemming:                                        # stemming
        # stemmer = PorterStemmer()
        stemmer = SnowballStemmer('english')            # requires: from nltk.stem import SnowballStemmer
        words = [stemmer.stem(w) for w in words]
    if split_text:                                      # return a list of words
        return words
    return " ".join(words)
# Split data into training set and validation set
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Sentiment'],
                                                     test_size=0.1, random_state=0)
print('Load %d training examples and %d validation examples. \n'
      % (X_train.shape[0], X_test.shape[0]))
print('Show a review in the training set : \n', X_train.iloc[10])
# Preprocess text data in training set and validation set
X_train_cleaned = []
X_test_cleaned = []
for d in X_train:
    X_train_cleaned.append(cleanText(d))
print('Show a cleaned review in the training set : \n', X_train_cleaned[11])
for d in X_test:
    X_test_cleaned.append(cleanText(d))
# Split review text into parsed sentences using NLTK's punkt tokenizer
# nltk.download()
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
def parseSent(review, tokenizer, remove_stopwords=False):
    '''
    Parse text into sentences
    '''
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    # assumed completion of the truncated listing: clean each non-empty sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(cleanText(raw_sentence, remove_stopwords, split_text=True))
    return sentences
# Plot number of reviews for top 50 brands
brands = df["Brand Name"].value_counts()   # assumed definition (elided in the original listing)
plt.figure(figsize=(12, 8))
brands[:50].plot(kind='bar')
plt.title("Number of Reviews for Top 50 Brands")
# Plot number of reviews for top 20 products
products = df["Product Name"].value_counts()
plt.figure(figsize=(12, 8))
products[:20].plot(kind='bar')
plt.title("Number of Reviews for Top 20 products")
# Plot distribution of review length
review_length = df["Reviews"].dropna().map(lambda x: len(x))
plt.figure(figsize=(12, 8))
review_length.loc[review_length < 1500].hist()
plt.title("Distribution of Review Length")
plt.xlabel('Review length (Number of characters)')
plt.ylabel('Count')
df["length"] = review_length
# word cloud
def create_word_cloud(brand, sentiment):
    try:
        df_brand = df.loc[df['Brand Name'].isin([brand])]
        df_brand_sample = df_brand.sample(frac=0.1)
        word_cloud_collection = ''
        if sentiment == 1:
            df_reviews = df_brand_sample[df_brand_sample["Sentiment"] == 1]["Reviews"]
        if sentiment == 0:
            df_reviews = df_brand_sample[df_brand_sample["Sentiment"] == 0]["Reviews"]
        for val in df_reviews.str.lower():
            tokens = nltk.word_tokenize(val)
            # remainder reconstructed from the full create_word_cloud listing in app1.py below
            tokens = [word for word in tokens if word not in stopwords.words('english')]
            for words in tokens:
                word_cloud_collection = word_cloud_collection + words + ' '
        wordcloud = WordCloud(max_font_size=50, width=500,
                              height=300).generate(word_cloud_collection)
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        plt.show()
    except Exception:
        pass
APPLICATION
app1.py
import streamlit as st
import numpy as np
import pandas as pd
import plotly.express as px
from wordcloud import WordCloud
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import re
import nltk
import warnings
return (words)
return( " ".join(words))
if page == "Visualization" :
st.header("Distribution of Rating")
df1 = pd.DataFrame(data = [[1,5787],[2,2023],[4,5009],[5,17998]],columns =
["Rating","Count"])
fig = px.pie(df1, values= "Count", names='Rating', title='Distribution of Rating')
st.plotly_chart(fig,use_container_width=10)
st.header("Number of Reviews for Top 20 Brands")
brands = df["Brand Name"].value_counts()
b = brands.to_frame()
b = b.reset_index()
b = b.iloc[0:20,:]
b.columns = ["Brand Name","Number of Reviews"]
fig = px.bar(b,x='Brand Name',y = "Number of Reviews",title="Number of Reviews
for Top 20 Brands")
st.plotly_chart(fig,use_container_width=10)
st.header("Number of Reviews for Top 50 Brands")
brands = df["Brand Name"].value_counts()
b = brands.to_frame()
b = b.reset_index()
b = b.iloc[0:50,:]
b.columns = ["Brand Name","Number of Reviews"]
fig = px.bar(b,x='Brand Name',y = "Number of Reviews",title="Number of Reviews for
Top 50 Brands")
st.plotly_chart(fig,use_container_width=10)
st.header("Number of Reviews for Top 20 products")
brands = df["Product Name"].value_counts()
b = brands.to_frame()
b = b.reset_index()
b = b.iloc[0:20,:]
b.columns = ["Product Name","Number of Reviews"]
fig = px.bar(b,x='Product Name',y = "Number of Reviews",title="Number of Reviews
for Top 20 products")
st.plotly_chart(fig,use_container_width=30)
st.header("Number of Reviews for Top 50 products")
brands = df["Product Name"].value_counts()
b = brands.to_frame()
b = b.reset_index()
b = b.iloc[0:50,:]
b.columns = ["Product Name","Number of Reviews"]
fig = px.bar(b,x='Product Name',y = "Number of Reviews",title="Number of Reviews
for Top 50 products")
st.plotly_chart(fig,use_container_width=30)
st.header("Distribution of Review Length")
review_length = df["Reviews"].dropna().map(lambda x: len(x))
df["Review length (Number of character)"] = review_length
fig = px.histogram(df, x="Review length (Number of character)",title = "Distribution of
Review Length" )
st.plotly_chart(fig,use_container_width=20)
st.header("Polarity Distribution")
df3
=pd.DataFrame([["Positive",230674],["Neutral",26058],["Negative",77603]],columns=
["Polarity","Frequency"])
fig = px.bar(df3,x='Polarity',y = "Frequency",title = "Polarity Distribution")
st.plotly_chart(fig,use_container_width=20)
def create_word_cloud(brand, sentiment):
    df_brand = df.loc[df['Brand Name'].isin([brand])]
    df_brand_sample = df_brand.sample(frac=0.1)
    word_cloud_collection = ''
    if sentiment == 1:
        df_reviews = df_brand_sample[df_brand_sample["Sentiment"] == 1]["Reviews"]
    if sentiment == 0:
        df_reviews = df_brand_sample[df_brand_sample["Sentiment"] == 0]["Reviews"]
    for val in df_reviews.str.lower():
        tokens = nltk.word_tokenize(val)
        tokens = [word for word in tokens if word not in stopwords.words('english')]
        for words in tokens:
            word_cloud_collection = word_cloud_collection + words + ' '
    wordcloud = WordCloud(max_font_size=50, width=500,
                          height=300).generate(word_cloud_collection)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    plt.savefig('WC.jpg')
    img = Image.open("WC.jpg")   # requires: from PIL import Image
    return img
if page == "Word cloud" :
st.header("Word cloud")
form = st.form(key='my_form1')
brand = form.text_input(label='Enter Brand Name')
s = form.selectbox("Select The Sentiment",["Positive","Negative"])
submit_button = form.form_submit_button(label='Plot Word Cloud')
if submit_button:
if s=="Positive" :
img = create_word_cloud(brand,1 )
st.image(img)
else :
img = create_word_cloud(brand,0 )
st.image(img)
10. SCREEN LAYOUTS
10.2 DATASET:
Results:
Sentiment analysis deals with the classification of texts based on the sentiments they contain. This sentiment analysis model consists of three core steps, namely data preparation, review analysis and sentiment classification, and this report describes representative techniques involved in those steps. The prediction model is built using a logistic regression algorithm to perform sentiment analysis on mobile phone reviews and is used to predict whether a review is positive or negative. At the end we used quality metric parameters to measure the performance of the prediction model. Exploratory visualizations were performed on the mobile reviews and plotted as graphs. Finally, a word cloud was created for the positive and negative sentiment reviews of a selected brand. Sentiment analysis is an emerging research area in text mining and computational linguistics.
The future of sentiment analysis is going to continue to dig deeper, far past the surface of the number of likes, comments and shares, and aim to reach, and truly understand, the significance of social media interactions and what they tell us about the consumers behind the screens. This forecast also predicts broader applications for sentiment analysis: brands will continue to leverage this tool, but so will individuals in the public eye, governments, nonprofits, education centers and many other organizations.
12. REFERENCES
➢ https://web.stanford.edu/class/cs124/lec/sentiment.pdf
➢ http://www.ryanmcd.com/papers/local_service_summ.pdf
➢ Ding, Xiaowen, Bing Liu, and Philip S. Yu. "A holistic lexicon-based approach to opinion mining." Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, pp. 231-240, 2008.
➢ Word Cloud: https://boostlabs.com/blog/what-are-word-clouds-value-simple-visualizations/