Customer Review Analysis Using Data Science
DATA SCIENCE
Submitted By:
KRITIKA GUPTA (16BCS1223)
SAHIL GROVER (16BCS1201)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Chandigarh University, Gharuan
NOV 2019
ACKNOWLEDGEMENT
We have put considerable effort into this project. However, it would not have been possible
without the kind support and help of many individuals.
We would like to extend our sincere thanks to all of them. We are highly indebted to Ms. Khyati
for her guidance and constant supervision, for providing the necessary information
regarding the project, and for her support in completing it.
We would like to express our gratitude to our parents and to the members of Chandigarh
University for their kind co-operation and encouragement, which helped us complete this
project.
We would also like to express our special gratitude to all the resources and books from which
we gathered information about the project.
Our thanks and appreciation also go to our colleagues who helped develop the project and to the
people who willingly helped us with their abilities.
B.E. CSE
Abstract
Efficient and accurate customer review analysis has been an important topic in the
advancement of computer vision systems. With the advent of deep learning techniques, the
accuracy of object detection has increased drastically. The project aims to incorporate
state-of-the-art techniques for object detection with the goal of achieving high accuracy with
real-time performance. A major challenge in many object detection systems is the dependency on
other computer vision techniques to support the deep learning based approach, which leads to
slow and non-optimal performance. In this project, we use a completely deep learning based
approach to solve the problem of object detection in an end-to-end fashion. The network is trained
on the most challenging publicly available dataset (PASCAL VOC), on which an object detection
challenge is conducted annually. The resulting system is fast and accurate, thus aiding those
applications which require object detection.
List of Figures
Table of contents
i. ACKNOWLEDGEMENT
ii. ABSTRACT
iii. LIST OF FIGURES
Chapter 1 Introduction
Chapter 2 SRS
1.1. HARDWARE INTERFACE
1.2. SOFTWARE INTERFACE
1.3. USER INTERFACE
1.4. MEMORY CONSTRAINTS
1.5. OTHER NON-FUNCTIONAL REQUIREMENTS
1.6. DATABASE REQUIREMENTS
1.8. PROCESS DIAGRAM
1.9. OPERATING ENVIRONMENT
Chapter 7 References
Chapter 1 - INTRODUCTION
Machine Learning Basics:
Machine learning, as the name suggests, gives machines the ability to
learn autonomously from experience and observation by analysing patterns within a
given data set, without being explicitly programmed. When we write a program for
a specific purpose, we are writing a definite set of instructions which the
machine will follow. In machine learning, by contrast, we supply a data set from which the
machine learns by identifying and analysing patterns, and it then
takes decisions autonomously based on what it has learned from that data set.
How does a machine learn? The simple answer is: just as humans do. First, we receive knowledge
about a certain thing, and then, keeping this knowledge in mind, we are able to identify that
thing in the future. Past experiences also help us take decisions in the future. Our
brain trains itself by identifying the features and patterns in the knowledge or data it receives,
thus enabling itself to identify or distinguish between various things.
Similarly, we feed knowledge/data to the machine. This data is divided into two parts,
namely training data and testing data. The machine learns the patterns and features from
the training data and trains itself to take decisions such as identifying, classifying or predicting
new data. To check how accurately the machine takes these decisions, its
predictions are evaluated on the testing data.
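The division into training and testing data can be sketched in a few lines of Python. This is only an illustration: the function name train_test_split, the 80/20 ratio, the fixed seed and the sample records are assumptions made for the example, not details taken from the project.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)          # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]    # (training data, testing data)

# Hypothetical labelled records: (review text, label)
records = [(f"review {i}", i % 2) for i in range(10)]
train, test = train_test_split(records, test_fraction=0.2)
print(len(train), "training records,", len(test), "testing records")
```

The model would be fitted on train only, and test would be held back until the end so that the measured accuracy reflects data the model has not seen.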
Let’s understand this with the help of a basic machine learning example.
Suppose you want to predict whether the next day is going to be rainy or sunny.
Generally, we do this by looking at a combination of data, such as the weather conditions
of the past few days and present data such as wind direction and cloud formation. If it had
been raining for the past few days, we would predict, based on the pattern, that it will rain
the next day too, and vice versa. Similarly, we feed the past few days’ weather data,
along with present data such as wind direction and cloud formation, to the machine.
Based on the data provided, the machine analyses the patterns and predicts
the weather for the next day.
In supervised learning, the data set on which the machine is trained consists of labelled
data; that is, it contains both the input parameters and the required output. For
example, consider classifying whether a person is male or female. Here, male and female are
our labels, and the training data set is already classified into these labels based on
certain parameters. The machine learns these features and patterns and then
classifies new input data based on what it learned from the training data.
Supervised learning algorithms can be broadly divided into two types:
classification and regression.
Classification Algorithms
Just as the name suggests, these algorithms are used to classify data into predefined classes
or labels. We will discuss one of the most widely used classification algorithms, the
K-Nearest Neighbor (KNN) classification algorithm.
This algorithm classifies a set of data points into specific groups or classes based
on the similarities between the data points. Let’s consider an example where we need to
check whether a person is fit or not based on the person’s height and weight. Suppose we
give the following table as the training data set:
Now consider a new person who needs to be classified as fit or not fit. Let us take
K=3, which means we will consider the 3 nearest neighbours. The nearest neighbours are
found by computing the Euclidean distance between the height and weight of the new
person and the height and weight of each person in the table. The 3 persons with the
smallest distances are the nearest neighbours. We then check how
many of these 3 are fit. If 2 or more of the 3 are fit, we classify the new
person as fit; otherwise, we classify the person as not fit. If, for some value of K, we get an
equal number of neighbours with different outcomes, we can increase the value of K and check again.
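The procedure just described can be sketched as follows. The heights, weights and labels are hypothetical stand-ins for the training table, and knn_classify is an illustrative name rather than code from the project.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort the training rows by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((height in cm, weight in kg), label)
training = [
    ((170, 65), "fit"),
    ((175, 70), "fit"),
    ((180, 75), "fit"),
    ((160, 90), "not fit"),
    ((165, 95), "not fit"),
    ((172, 100), "not fit"),
]
print(knn_classify(training, (174, 68), k=3))
```

In practice the features are usually rescaled first, because height and weight are measured in different units and the larger-valued feature would otherwise dominate the distance.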
Regression Algorithms
These algorithms are used to determine the mathematical relationship between two or more
variables and the level of dependency between variables. These can be used for predicting
an output based on the interdependency of two or more variables. For example, an increase
in the price of a product will decrease its consumption, which means, in this case, the
amount of consumption will depend on the price of the product. Here, the amount of
consumption is called the dependent variable and the price of the product is called
the independent variable. The level of dependency of the amount of consumption on the
price of the product helps us predict the future amount of consumption based
on changes in the price of the product.
There are two types of regression algorithms: linear regression and logistic regression.
Linear regression is used with continuous-valued variables, as in the previous example, in
which the price of the product and the amount of consumption are continuous variables, which
means that they can take an infinite number of possible values. Linear regression can also
be represented on a graph known as a scatter plot, where all the data points of the dependent
and independent variables are plotted and a straight line is drawn through them such that
the points lie on the line or as close to it as possible.
This line, also called the regression line, helps us determine the relationship
between the dependent and independent variables, and from it the linear regression
equation is formed.
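As an illustration of how a regression line can be obtained, the sketch below fits a straight line by ordinary least squares, the usual criterion for making the points lie as close to the line as possible. The price and consumption figures are invented for the example, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: consumption falls as price rises.
prices = [10, 12, 14, 16, 18]
consumption = [100, 90, 80, 70, 60]

slope, intercept = fit_line(prices, consumption)
predicted = slope * 20 + intercept   # predicted consumption at a price of 20
print(slope, intercept, predicted)
```

The negative slope expresses the dependency described above: each unit increase in price lowers the predicted consumption by a fixed amount.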
The difference between linear and logistic regression is that logistic regression is used with
categorical dependent variables (e.g. Yes/No, Male/Female, Sunny/Rainy/Cloudy,
Red/Blue), unlike the continuous-valued variables used in linear regression. Logistic
regression helps determine the probability that a certain observation belongs to a certain
group, such as whether it is night or day, or whether the colour is red or blue. The graph of
logistic regression is a non-linear sigmoid curve which shows these probabilities.
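The sigmoid function mentioned above maps any real number to a value between 0 and 1, which is why it can be read as a probability. The sketch below assumes a single input feature; the weight and bias values are arbitrary placeholders, not fitted parameters.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """Probability of the positive class for a one-feature logistic model."""
    return sigmoid(weight * x + bias)

# Hypothetical model parameters chosen only to show the shape of the curve.
for x in (-4, 0, 4):
    p = predict_probability(x, weight=1.5, bias=0.0)
    label = "Yes" if p >= 0.5 else "No"
    print(x, round(p, 3), label)
```

A threshold of 0.5 is used here to turn the probability into a Yes/No decision; other thresholds can be chosen depending on the application.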
Unlike supervised learning algorithms, where we deal with labelled data for training, the
training data is unlabelled for unsupervised machine learning algorithms. The
clustering of data into specific groups is done on the basis of the similarities between
the variables. Unsupervised machine learning algorithms include K-means
clustering and some neural network models. In this section, we will talk about the K-means
clustering algorithm.
K-means clustering
Before we look at how the K-means clustering algorithm works, let us first break down
the term K-means clustering to understand what it means.
Clustering: In this algorithm, we form clusters, which are collections of data points grouped
together due to their similarities.
K refers to the number of centroids considered for a specific problem, whereas
‘means’ refers to a centroid, which is the central point of a cluster.
The algorithm proceeds as follows:
1. Define the value of K. For example, if K = 2, then we will have two centroids.
2. Randomly select K data points as centroids.
3. Compute the distance of each data point from each centroid.
4. Assign each data point to the centroid to which it has the minimum distance, thus forming
a cluster of similar data points.
5. Recalculate the centroid of each newly formed cluster and reassign the data points to the
cluster whose centroid is at the minimum distance from the data point.
You can choose the number of iterations for which step 5 is repeated.
When the centroids stop changing from one iteration to the next,
we have reached the stopping point and the algorithm has converged.
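The steps above can be sketched in plain Python as follows. The function name k_means, the iteration cap, the seed and the sample points are assumptions made for illustration; the project's own implementation may differ.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 2: pick K data points
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                           # steps 3-4: assign to nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                          # step 5: recompute each centroid
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:             # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

An empty cluster keeps its previous centroid in this sketch; other strategies, such as re-seeding the centroid from a random point, are also common.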
Another machine learning concept which is extensively used in the field is neural
networks.
In reinforcement learning, a machine can be adjusted and programmed to focus more on
either long-term rewards or short-term rewards. When the machine is in a particular
state and has to choose the action that leads to the next state in order to achieve the reward,
this process is modelled as a Markov Decision Process.
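One common way to tune the balance between short-term and long-term rewards is a discount factor, usually written gamma, which is not named in the text above and is introduced here as an assumption. The sketch below compares two hypothetical reward sequences under a low and a high discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))

# Hypothetical reward sequences: a small reward now versus a larger reward later.
quick = [1, 0, 0, 0]
patient = [0, 0, 0, 5]

for gamma in (0.5, 0.9):
    q = discounted_return(quick, gamma)
    p = discounted_return(patient, gamma)
    preferred = "quick" if q > p else "patient"
    print(gamma, q, p, preferred)
```

With a small gamma the later reward is discounted heavily, so the immediate reward is preferred; with gamma close to 1 the larger delayed reward wins.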
Today, digital reviews play a pivotal role in enhancing global communication among
consumers and influencing consumer buying patterns. E-commerce giants like Amazon and
Flipkart provide a platform for consumers to share their experience and provide real
insights about the performance of a product to future buyers. In order to extract valuable
insights from a large set of reviews, classification of reviews into positive and negative
sentiment is required. Sentiment analysis is the computational study of extracting subjective
information from text. In the proposed work, over 400,000 reviews have been classified
into positive and negative sentiments using sentiment analysis. Out of the various
classification models, Naïve Bayes, Support Vector Machine (SVM) and Decision Tree have
been employed for classification of reviews. The models are evaluated using 10-fold
cross-validation. An overview of all these problems is depicted in Fig. 1.
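To make the classification and evaluation steps concrete, the sketch below implements a small multinomial Naïve Bayes classifier with add-one smoothing and a k-fold cross-validation loop in plain Python. It is a toy illustration under stated assumptions: the ten reviews are invented, whitespace tokenisation is used, and 5 folds are used instead of 10 only because the sample is so small.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the given text."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)                        # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:                             # skip unseen words
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(samples, k):
    """Average accuracy over k folds; fold i holds out samples i, i+k, i+2k, ..."""
    accuracies = []
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        model = train_naive_bayes(train)
        correct = sum(predict(model, text) == label for text, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical labelled reviews, alternating positive and negative.
reviews = [
    ("great phone love it", "positive"), ("terrible battery hate it", "negative"),
    ("excellent battery great screen", "positive"), ("poor screen terrible phone", "negative"),
    ("love the camera excellent", "positive"), ("hate the camera poor", "negative"),
    ("great value love it", "positive"), ("terrible value hate it", "negative"),
    ("excellent phone great camera", "positive"), ("poor phone terrible camera", "negative"),
]
model = train_naive_bayes(reviews)
print(predict(model, "great excellent love"))
print(cross_validate(reviews, k=5))
```

With the full data set, k would be set to 10 to match the evaluation described above, and the same loop could wrap an SVM or decision tree model in place of Naïve Bayes.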
1.2 Challenges
The major challenge in this problem is the variable dimension of the output, which
is caused by the variable number of objects that can be present in any given input image.
Any general machine learning task requires a fixed dimension of input and output for the
model to be trained. Another important obstacle to widespread adoption of object
detection systems is the requirement of real-time performance (>30fps) while remaining
accurate in detection. The more complex the model is, the more time it requires for inference;
the less complex the model is, the lower its accuracy. This trade-off between accuracy and
performance needs to be chosen as per the application. The problem involves classification
as well as regression, so the model must learn both simultaneously. This adds to the
complexity of the problem.
1.4 Overview
This project is two-fold. The first part deals with an exploration of the dataset, with the aim
of understanding some properties of the delays registered by the algorithm. This exploration
gives us the opportunity to use various visualization tools offered by Python. The second
part of the notebook consists of the elaboration of a model aimed at predicting bounding
boxes. For that purpose, we have used polynomial regressions and shown the importance of
regularization techniques.
This aspect of data science is all about uncovering findings from data: diving in at a granular
level to mine and understand complex behaviors, trends, and inferences. It's about surfacing
hidden insight that can help companies make smarter business decisions. For example:
Netflix mines movie viewing patterns to understand what drives user interest, and uses that
to make decisions on which Netflix original series to produce.
Target identifies the major customer segments within its base and the unique shopping
behaviors within those segments, which helps to guide messaging to different market audiences.
Procter & Gamble uses time series models to understand future demand more clearly, which
helps it plan production levels more optimally.
How do data scientists mine out insights? It starts with data exploration. When given a
challenging question, data scientists become detectives. They investigate leads and try to
understand patterns or characteristics within the data. This requires a big dose of analytical
creativity.
Then, as needed, data scientists may apply quantitative techniques in order to get a level
deeper, e.g. inferential models, segmentation analysis, time series forecasting, synthetic
control experiments, etc. The intent is to scientifically piece together a forensic view of
what the data is really saying.
This data-driven insight is central to providing strategic guidance. In this sense, data
scientists act as consultants, guiding business stakeholders on how to act on findings.
A "data product" is a technical asset that: (1) utilizes data as input, and (2) processes that
data to return algorithmically-generated results. The classic example of a data product is a
recommendation engine, which ingests user data, and makes personalized
recommendations based on that data. Here are some examples of data products:
Amazon's recommendation engines suggest items for you to buy, determined by their
algorithms. Netflix recommends movies to you. Spotify recommends music to you.
Gmail's spam filter is a data product: an algorithm behind the scenes processes incoming
mail and determines whether a message is junk or not.
Computer vision used for self-driving cars is also a data product: machine learning
algorithms are able to recognize traffic lights, other cars on the road, pedestrians, etc.
This is different from the "data insights" described above, where the outcome is
perhaps to provide advice to an executive to make a smarter business decision. In contrast, a
data product is technical functionality that encapsulates an algorithm and is designed to
integrate directly into core applications. Respective examples of applications that
incorporate data products behind the scenes are Amazon's homepage, Gmail's inbox, and
autonomous driving software.
Data scientists play a central role in developing data products. This involves building out
algorithms, as well as testing, refinement, and technical deployment into production
systems. In this sense, data scientists serve as technical developers, building assets that
can be leveraged at wide scale.
As systems grew more complex, it became evident that the goal of the entire system
could not be easily comprehended. Hence the need for the requirements analysis phase arose.
Now, for large software systems, requirements analysis is perhaps the most difficult
activity and also the most error-prone.
Some of the difficulty is due to the scope of this phase. The software project is
initiated by the client's needs. In the beginning, these needs are in the minds of various people
in the client organization. The requirements analyst has to identify the requirements by
talking to these people and understanding their needs. In situations where the software is
to automate a currently manual process, most of the needs can be understood by
observing the current practice.
The SRS is a means of translating the ideas in the minds of the clients (the input)
into a formal document (the output of the requirements phase). Thus the output of the phase
is a set of formally specified requirements, which hopefully are complete and consistent,
while the input has none of these properties.
Hardware interfaces
This project is intended to be platform independent. Therefore no specific hardware is
excluded. But it will at least work on x86 systems without any additional porting efforts.
Moreover, no special hardware is needed for software operation. The hardware
specifications on which the project has been developed:
Processor: Intel i5 7200U (Minimum - Pentium processor and above)
x86 compatible processor
Software interfaces
This project is intended to work on any operating system, as it has been programmed in
Python, which is a platform-independent language. We have tested the software on
Windows. The software specifications on which the project has been developed:
Operating System: Windows 10
Front End: Jupyter Notebook (Python 3.6)
User interfaces
GUI (Graphical User Interface) will be used in this application.
Memory constraints
The project is expected to use 512 MB of RAM or more and 4 GB of external storage or
more.
Other Nonfunctional Requirements
Adaptability and reusability: The suite is simple (and intuitive) enough that it should not
be difficult to adapt it to the user’s needs. The code is standards compliant and is therefore
easily reusable in other applications.
Portability: The code is platform independent; it should be easily portable to different
architectures and operating systems.
Reliability: Serious attempts are made to make sure the code is reliable and of enterprise
quality.
Usability: The project is still in the planning stage.
Performance: Most data analysis techniques grow very complex due to the large amount of
uneven and repeated data. We have therefore carefully organized the data and removed
redundancy in the datasets so as to provide a high grade of feedback analysis for the industry.
Every aspect of the feedback is analyzed separately so as to improve performance in every
field.
Database Requirements
All the datasets are stored on a local drive in the semi-structured .csv format.
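Loading such a file might look like the sketch below. It uses Python's built-in csv module and an in-memory string in place of a real file so that it runs on its own; the column names review and rating and the two rows are hypothetical. In the notebook itself, pandas.read_csv would be the more usual route.

```python
import csv
import io

# Stand-in for a local .csv file; with a real file, open(path, newline="")
# would replace io.StringIO. Column names and rows are hypothetical.
raw = io.StringIO(
    "review,rating\n"
    '"Comfortable seats, friendly crew",5\n'
    '"Flight delayed twice",2\n'
)

rows = [
    {"review": row["review"], "rating": int(row["rating"])}
    for row in csv.DictReader(raw)
]
average = sum(r["rating"] for r in rows) / len(rows)
print(len(rows), "reviews, average rating", average)
```

Quoted fields allow a review to contain commas without breaking the column layout, which matters for free-text review data.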
2. PROCESS DIAGRAM
3. OPERATING ENVIRONMENT
The system is divided into three phases. The first phase is the data collection phase, in which
individual passengers' reviews, along with their ratings, are scraped from a collection of airline
forum links. The second phase is the data analysis phase, in which the extracted reviews are
analyzed using analysis techniques to obtain the sentiment. The third phase is the
presentation phase, which involves plotting the ratings and sentiment graphically.
1. Pandas - pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python.
2. NumPy - NumPy is the fundamental package for scientific computing with Python. It
contains, among other things, a powerful N-dimensional array object and sophisticated
(broadcasting) functions.
3. Matplotlib - Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+.
4. SciPy - SciPy (pronounced /ˈsaɪpaɪ/ "Sigh Pie") is a free and open-source Python library
used for scientific computing and technical computing.
5. Seaborn - Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
6. Basemap - Basemap is a tool for creating maps using Python in a simple way. It is a
Matplotlib extension, so it has all of Matplotlib's features for creating data visualizations, and
it adds geographical projections and some datasets so that coast lines, countries, and so
on can be plotted directly from the library.
7. scikit-learn - In scikit-learn, an estimator for classification is a Python object that implements
the methods fit(X, y) and predict(T). An example of an estimator is the
class sklearn.svm.SVC, which implements support vector classification.
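The fit(X, y) and predict(T) convention can be illustrated without scikit-learn itself. The class below is a hypothetical toy estimator written for this example, not part of any library: it simply predicts the most frequent training label, but it exposes the same two methods, so code written against the convention could swap it for sklearn.svm.SVC.

```python
from collections import Counter

class MajorityClassifier:
    """Toy estimator: always predicts the most frequent label seen during fit."""

    def fit(self, X, y):
        # X is unused here; a real estimator would learn from the features.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, T):
        # One prediction per row of T.
        return [self.majority_ for _ in T]

# Hypothetical feature rows and labels.
X = [[5], [4], [1], [5]]
y = ["positive", "positive", "negative", "positive"]

clf = MajorityClassifier().fit(X, y)
predictions = clf.predict([[3], [2]])
print(predictions)
```

A baseline of this kind is also a useful reference point: a trained classifier should at least beat the accuracy obtained by always predicting the majority class.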
Chapter 5 – Screenshots
Using K-means :
The computational complexity of our algorithm is linear in the size of a video frame and
the number of vehicles detected. As we have considered traffic on highways, there is no
question of shadows cast by objects such as trees, but sometimes, due to occlusions, two
vehicles are merged together and treated as a single entity.
Chapter 7 - References
Books –
[1] Python for Data Analysis by Wes McKinney
Websites –
[1] https://www.kaggle.com/
[2] https://www.journals.elsevier.com/big-data-research
[3] https://www.dataquest.io/
[4] https://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-data-science-python/
[5] https://www.researchgate.net/publication/315382748_A_Review_on_Flight_Delay_Prediction