Aspect Identification and Sentiment Analysis in Text-Based Reviews

by

Sean Byrne

A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science

Lehigh University
May 2017

© Copyright 2017
Sean Byrne

This thesis is accepted and approved in partial fulfillment of the requirements for the Master of Science.
Acknowledgements
I would like to express my gratitude to Professor Martin Takáč for his guidance and
encouragement throughout my research and my time at Lehigh. I’d also like to thank
my parents, Kevin and Jenifer Byrne, and my brothers Matthew and Jake for their
constant support in everything I do. In addition, I’d like to thank Dr. Ali Yazdanyar,
physician at Reading Hospital, for allowing me to apply what I’ve learned in a real-world
setting.
Contents

Acknowledgements
Abstract
1 Introduction
1.1 Automatic Aspect-Based Review System
1.2 Natural Language Processing
1.3 Application: Reading Hospital
2 Dataset Structure and Text Features
3 Aspect Identification
3.1 Problem Description
3.2 Sequential Labeling: Conditional Random Fields
3.2.1 Labeling Method
3.2.2 Background: Naive Bayes and Maximum Entropy Models
3.2.3 Hidden Markov Models
3.2.4 CRF Model Description
3.2.5 CRF Training
3.2.6 CRF Evaluation
3.3 Association Mining Method (Hu and Liu)
3.3.1 Association Mining Method Description
3.3.2 Association Mining Method Evaluation
4 Aspect-Based Sentiment Analysis
5 Conclusion
Bibliography
Biography

List of Tables

4.1 The results using VADER on aspect terms in the Laptop domain.
4.2 The results using VADER on aspect terms in the Restaurant domain.
4.3 The results using VADER on aspect categories in the Restaurant domain.
4.4 True and predicted ratings for each category in the Restaurant domain.
Abstract

Online text-based reviews are often associated with only an aggregate numeric rating that does not account for nuances in the sentiment towards specific aspects of the review's subject. This thesis explores the problem of determining review scores for specific aspects of a subject. This involves two major subtasks: aspect identification (identifying specific words and phrases that refer to aspects of the review subject) and aspect-based sentiment analysis (determining the sentiment of each aspect). We examine two different models, conditional random fields and an association mining algorithm, for performing aspect identification. We also develop a method for aspect-based sentiment analysis using VADER, a rule-based sentiment analysis algorithm built for sentiment analysis of social media. We identify key problem considerations, including other important subtasks and ideal training dataset qualities.
Chapter 1
Introduction
Text-based reviews found online have become a common way to evaluate options when
making a decision. These reviews span subjects from a variety of topics - products avail-
able for purchase online, downloadable applications, movie and music releases, restau-
rants, hotels, and more. Oftentimes, these reviews are associated with an overall numeric average rating for a given subject. However, these ratings often hide the details present in the text of the reviews. For example, by examining a set of laptop reviews with an average rating of 3.0 out of 5.0, one might find that the screen of the laptop is mostly referred to positively, but the keyboard is mostly referred to negatively. This nuance is not reflected in an overall 5-point numeric rating, despite the fact that users often have preferences that require a more detailed view of the subject.
1.1 Automatic Aspect-Based Review System

In order to more accurately reflect how reviewers feel about different aspects of a subject, a system can be developed that rates each aspect of the subject separately, providing more meaningful information to those who may have specific preferences. A shopper looking to purchase a laptop, for example, may desire a high screen quality while not caring much about the processing power. This shopper would benefit from finding a laptop with a highly-rated score for the aspect "Screen" and may not mind if the laptop's overall score is dragged down by a lower rating for the aspect "Processing Power". It's possible that websites aiming to have a more comprehensive set of ratings could force users to rate specific qualities on a numeric scale, rather than just the overall product. However, this requires more effort from the end user, and ignores the large amount of review data that already exists.
One way such a system can be developed using existing product reviews is to utilize sentiment analysis. Sentiment analysis methods derive measures of subjectivity from written text, typically labeling text using either the labels "subjective" and "objective" (ignoring polarity of subjective text), or the labels "positive", "negative", and "neutral" (where "positive" and "negative" are opposite polarities). Text-based reviews are an important source of data for sentiment analysis because they consist primarily of subjective opinions, making them particularly useful for building models with the ability to determine sentiment polarity. If a particular attribute is found to be associated with positive or negative polarity for most instances within a set of reviews, then it is given a high or low rating, respectively, for
that particular attribute. These attributes (or aspects) can be found through aspect
identification - determining what words and phrases (terms) refer to specific aspects of
the subject. For example, in the sentence "The battery life is quite strong and lasts all day long," the phrase "battery life" is an aspect term of the subject.
Once these aspect terms have been identified, sentiment analysis can be used to
determine the sentiment polarity of each aspect term. Specifically, aspect-based senti-
ment analysis attempts to determine the sentiment of each aspect term. Accurately
determining the polarity of aspect terms is more challenging than the typical sentiment
analysis task. Sentiment analysis relies heavily on sentiment lexicons that classify ad-
jectives based on their sentiment polarity, but an adjective that has a positive sentiment
when used to describe one aspect may have a negative or neutral sentiment when used
to describe another aspect. For example, “long” tends to have a positive sentiment
when used to refer to “battery life” in a laptop, but a negative sentiment when used to
refer to “wait times” at a restaurant. Another significant issue is when multiple aspect
terms are mentioned within the same sentence. If one aspect has a positive sentiment
and another has a negative sentiment, determining these sentiments accurately requires correctly associating the surrounding sentiment-bearing words with each aspect.

The remainder of this thesis is organized as follows. In the rest of this chapter, we give a brief overview of natural language processing and mention an ongoing application of the methods described in this thesis. In Chapter 2, we describe the datasets used and qualities of a useful dataset for the problems examined, as well as the features that can be derived from the text. In Chapter 3, we examine methods for extracting aspect terms from these datasets, and in Chapter 4, we examine methods for determining the sentiment of these aspect terms and of aspect categories. Chapter 5 concludes.
1.2 Natural Language Processing

Natural Language Processing (NLP) is a field of study within computer science and artificial intelligence that focuses on analyzing and deriving meaningful information from human language. The field began with machine translation (MT), the problem of translating text from one language to another, in the 1950s. Research was severely limited due to the relatively low processing power and small datasets available at the time. Early work began with attempts to translate sentences word-for-word, but issues with determining the correct syntax (the arrangement of words) and semantics (the meaning of words) in the translated language proved difficult to overcome. Despite these limitations, research of this time period was able to effectively identify the importance of developing an explicit structure and definition for language that could allow methods to be generalized and implemented with computers [16]. The low quality of the methods developed led a committee reviewing the field to express doubt in the merit of continued MT research in a report in 1966 [8]. The committee suggested that more fundamental problems in the computational processing of language be effectively tackled, leading to a significant shift away from MT in the late 1960s. This shift allowed other problems within NLP to be explored, eventually leading to the broad field that exists today.
The massive amount of data and processing power that are accessible today has
opened the door to new heights in the world of natural language processing. Modern NLP
research examines problems such as converting speech to text [12], answering text-based questions [13], determining grammatical relationships between words, and much more. NLP has been utilized in a large variety of business contexts as well. Lawyers use NLP software to analyze large sets of legal documents to find meaningful information. Spam filters utilize NLP to find patterns within email text that indicate a high likelihood of being spam, and Google uses NLP in their language-translation software. Various social media sites use NLP to analyze the large amounts of text their users generate.

In this thesis, we utilize some commonly-used software for natural language processing. In particular, we make extensive use of the Natural Language Toolkit (NLTK) [6] and Stanford's CoreNLP toolkit [18]. These packages provide a large set of functions and datasets useful for natural language processing, accessible from Python.
1.3 Application: Reading Hospital

With the rise of electronic medical records, applications have started to appear within the medical field. Taking text-based data from the past (in this case, from physicians' notes) and data related to the eventual treatment of the patient's medical issues (for example, procedures done, diagnoses given, and success/failure rates), patterns can be found within the text of the doctors' notes. In this way, physicians' notes can be analyzed to help identify the treatments with the highest likelihood of success for a given diagnosis and set of physical traits.

One potential area for the application of natural language processing techniques is a project recently started at Reading Hospital. Because of the nature of this work, the specific details cannot be shared in this paper. However, a basic problem outline can be shared. When a scan is done to examine a particular portion of the body, secondary information can be gathered. For example, in a CT scan where the primary objective is to examine a tumor, secondary nodules could be found on the scan that aren't directly related to the tumor. In this case, the doctor often suggests that the patient make a follow-up appointment. However, there is often no process in place to connect the patient to the appropriate office for a follow-up appointment, and oftentimes patients end up ignoring these secondary findings until their next appointment months or even years later. Complications that could have been treated easily, if dealt with earlier, can end up becoming much more serious medical issues because of this.

The project's goal is to use NLP techniques to identify keywords in the notes of these scans indicating that a follow-up is needed. The details of these patients and scans could then be routed to the appropriate place automatically. In this way, the methods discussed in this paper can have a real-world impact.
Chapter 2

Dataset Structure and Text Features

2.1 Datasets
There is a great deal of text-based review data available online - however, the raw data alone isn't enough. In order to perform the three major tasks associated with aspect-based sentiment analysis, the data provided must contain information about which words are aspect terms, which words are a part of which aspect categories, and whether each aspect term or category carries positive, negative, or neutral sentiment. This requires manual annotation of datasets, along with cross-validation measures to ensure that the tags are consistent across annotators.

The difficulty of creating adequate data sources causes significant issues when tackling the problem of aspect-based sentiment analysis. It significantly limits the effectiveness of methods that rely heavily on domain-based features, since each subject (spanning all categories of online products, media, restaurants, and others) may require a different set of training data for these methods to be effective. Thus, developing a model that is not overly reliant on the domain of the training data is particularly important.
The Semantic Evaluation (SemEval) workshops are a series of competitions held annually by SIGLEX (the Special Interest Group on the Lexicon of the Association for Computational Linguistics) [1]. Each year, a set of tasks related to semantics within natural language processing are developed, with the goals of developing methods of discerning meaning from language and identifying issues worth exploring further. From 2014 to 2016, one of the tasks was "Aspect-Based Sentiment Analysis" [5] [4] [24]. In this task, participants were given data annotated with aspect terms, aspect categories, and their polarities. The goal of the task was to predict each of these for a set of testing data.
We utilize datasets from the 2014-2016 SemEval competitions. They have been
cross-validated to ensure that inter-annotator agreement is high [23], and there is data
available from two different domains: laptop and restaurant reviews. The sentences
in each year's data are largely the same, but the format they're stored in (as well as their aspect term and aspect category annotations) varies. In all formats, aspect terms and/or aspect categories are associated with a sentiment polarity from the set {"positive", "negative", "neutral"}, though the 2014 and 2016 formats also allow for a fourth "conflict" value that represents subjective statements without clear overall polarity.
The 2014 datasets are stored as sentences (without review context) in two different
domains: laptop reviews with 3,141 sentences and restaurant reviews with 3,145 sen-
tences. Aspect terms are provided for sentences in the datasets of both domains, and
aspect categories are provided for sentences in the dataset of the restaurant domain. The
sentiment polarity fields in this dataset support the "conflict" value when the dominant sentiment polarity is not clear. For each aspect term, character offsets are provided (in two fields: "from" and "to", representing the beginning and end of the term, respectively) to identify the location of each aspect term within the sentence. Offsets start at index 0 within a sentence, and the "to" field stores the index of the character immediately after the last character of the aspect term. Each sentence contains zero or more aspect terms.
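As an illustration, a sentence entry in this 2014 format can be read with Python's ElementTree as sketched below. The sentence text, id, and offsets are made up for the example, but the element and attribute names follow the format described above.

import xml.etree.ElementTree as ET

# A minimal sentence entry in the 2014 format (illustrative values).
entry = '''
<sentence id="42">
  <text>The battery life is quite strong and lasts all day long.</text>
  <aspectTerms>
    <aspectTerm term="battery life" polarity="positive" from="4" to="16"/>
  </aspectTerms>
</sentence>
'''

sentence = ET.fromstring(entry)
text = sentence.find('text').text
for a in sentence.find('aspectTerms').findall('aspectTerm'):
    frm, to = int(a.get('from')), int(a.get('to'))
    # "to" is the index immediately after the last character of the term,
    # so slicing recovers the aspect term exactly.
    assert text[frm:to] == a.get('term')
    print(a.get('term'), a.get('polarity'), (frm, to))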
The 2015 datasets are stored as reviews in two different domains: laptop reviews
and restaurant reviews. Each review is provided as a list of sentences in order, and each
sentence is associated with zero or more aspect categories. Aspect categories are stored
as pairs of entities (E) and attributes (A) in the following format: ”E#A”. Entities
are components of the overall topic - for example, entities in the set of Laptop reviews
include ”CPU”, ”Software”, ”Shipping”, and ”Support”. Attributes are specific features
or qualities of the entities - for example, attributes in the set of laptop reviews include
”Price”, ”Quality”, and ”Portability”. This dataset does not support the ”conflict” value
Dataset Structure and Text Features 11
for sentiment polarity. For reviews in the Restaurant dataset, an opinion ”target” can be
specified - this happens when an entity E is explicitly referenced through a target word
or phrase in the sentence. This allows aspect terms to be linked to aspect categories.
The keyword ”NULL” is used if an aspect category’s entity is not explicitly referenced
through a target. If an opinion target is specified, ”from” and ”to” fields are used to
specify the location of the target within the sentence. These are set to 0 when the target
is ”NULL”.
The 2016 dataset is provided in two different formats. One is identical to the 2015
dataset format. The other is a review-based format that stores sentences and aspect
categories separately. Each review consists of a list of sentences and a separate list
of the aspect categories within the review. This means that polarity ratings for each
aspect category are review-level rather than sentence-level, and so each aspect category
is assigned the sentiment polarity that is dominant within most sentences that contain the aspect category. Aspect categories are defined in a similar way to the 2015 dataset, using an entity-attribute pair to represent each category. In cases where the dominant sentiment polarity is not clear, the polarity is defined as "conflict". Opinion targets are
not provided.
The formats can be summarized as follows. The 2014 dataset identifies specific as-
pect terms and their associated polarities, as well as aspect categories for the Restaurant
dataset that are not explicitly linked to aspect terms. The 2015 dataset is somewhat
more specific - it identifies specific entity-attribute combinations that form aspect categories, as well as target aspect terms for the Restaurant dataset that explicitly link aspect categories to words in the sentence, with sentences grouped by review. The 2016 dataset is more general - it identifies specific entity-attribute combinations that form aspect categories that are found within a review as a whole, rather than in individual sentences. By comparing methods across dataset formats with very similar data, the value of creating training datasets with a higher or lower level of annotation detail can be assessed.

2.2 Text Features
We break each sentence down into tokens consisting of words and punctuation using
the Penn Treebank tokenizer within NLTK [20]. This tokenizer splits contractions (for
example, “don’t” will become the separate tokens “do” and “n’t”) and stores punctuation
as separate tokens.
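For illustration, the Treebank tokenizer's behavior on a short sentence (a quick sketch, not taken from our datasets):

from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize("The screen doesn't disappoint.")
print(tokens)  # e.g. ['The', 'screen', 'does', "n't", 'disappoint', '.']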
Some features can be extracted from individual tokens without the need for infor-
mation from the rest of the sentence or corpus. We store the original token text, as well
as a lowercase version of the token. Several binary features are stored - whether or not the token is punctuation, whether or not it is in "titlecase" (the first letter of the token is capitalized, and the following letters are all lowercase), and whether the token is a
digit. We use a popular word stemmer, the Porter stemmer, to store the stem of a given word, stripping common suffixes from the token [25].
Some features require sentence-level context. The index of each token within the sentence
is stored, with 0 being the first token of the sentence. A part-of-speech (POS) tagger
using the Penn Treebank tagset is used to tag the part-of-speech for each token in a
sentence [20]. The full POS tag and the first 2 characters of the POS tag are stored as separate features, as the first two characters indicate a broader category that the full tag belongs to (for example, "NN", "NNP", "NNS", and "NNPS" are all tags that describe nouns). Each token also stores information about the previous and next tokens in the sentence - the text, lowercase text, stem, and both POS tag features of the previous and next tokens, storing a default value if the previous or next token doesn't exist.
Oftentimes, text-based reviews are associated with an overall numeric rating. Our datasets do not contain numeric rating information, but utilizing these review ratings in an aspect-based sentiment analysis model may yield positive results, and is worth future consideration when designing annotated datasets from online reviews.
Many other features are commonly used for natural language processing purposes. Word-
Net is a lexical database designed to store words based on their word sense (the meaning
of the word) rather than the word itself [21]. It contains over 155,000 words and 117,000
synonym sets (sets of words with the same meaning), with over 206,000 word-sense pairs
in total [2]. Several other semantic relations are stored as well, such as antonyms. Hypernyms and hyponyms, which refer to more general and more specific versions of a given word, are stored - for example, the pair "sport" and "baseball" is a hypernym-hyponym pair. Meronyms and holonyms refer to component parts and the collective whole, respectively - for example, the pair "car" and "wheel" is a holonym-meronym pair. Using WordNet in a natural language processing model, particularly for the problems of aspect identification and aspect-based sentiment analysis, would give the model access to semantic relationships among words that are not apparent from the text alone. However, WordNet has been found to not significantly impact the performance of text classification models [19], and the limited tests we performed showed little benefit. Despite this, usage of WordNet in other models for aspect identification and aspect-based sentiment analysis may be worth exploring further.
Word2Vec is a deep learning algorithm that takes sentences as inputs and outputs a vectorization of each distinct word within the training data. This can be used to determine the similarity of one word to another. Word2Vec also allows for meaningful arithmetic operations among word vectors, meaning syntactic and semantic patterns can be captured. For example, suppose vec(word) is the Word2Vec vector representation of a word; then vec("king") - vec("man") + vec("woman") yields a vector close to vec("queen"). Word2Vec was designed for massive datasets, ranging from tens of millions to billions of words, and so attempts to train Word2Vec on the datasets available here (with only several thousand sentences each) were not effective. However, training on the much larger corpora available, such as the full English Wikipedia, has resulted in positive results in other work.
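As a sketch of how such vectors are trained and queried in practice, the example below uses the gensim library (version 4.x parameter names); the toy corpus is far too small to produce meaningful vectors, as discussed above.

from gensim.models import Word2Vec

# A toy corpus of tokenized sentences (far too small for useful vectors).
sentences = [
    ["the", "battery", "life", "is", "strong"],
    ["the", "screen", "is", "bright", "and", "sharp"],
    ["battery", "life", "lasts", "all", "day"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between two words' vectors.
print(model.wv.similarity("battery", "screen"))

# Analogy queries of the vec("king") - vec("man") + vec("woman") form are
# posed as: model.wv.most_similar(positive=["king", "woman"], negative=["man"])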
Chapter 3

Aspect Identification
3.1 Problem Description

In some texts, particularly text-based reviews, there is an overall subject being discussed throughout the text. Aspect identification (or aspect term extraction) is the process of identifying what words and phrases (terms) refer to specific aspects of a subject in these texts. We focus on explicit aspect terms - terms that are directly mentioned within the sentence, rather than implied terms. For example, the sentence "The restaurant was quite expensive" does not explicitly mention price, but "expensive" is an adjective referring to the price of the food, an implicit aspect within the sentence. We do not attempt to identify implicit aspects such as these.
An ideal system would not rely heavily on the domain of the training data, as
otherwise a new set of training data would be required for each new domain examined.
Identifying aspect terms requires human annotators to manually record these aspect terms and their sentiment, and requires a consistent approach so that these human annotators mostly agree with each other. When each set of training data requires potentially hundreds of reviews (thousands of sentences), this task becomes infeasible to complete for every new domain. There is therefore a trade-off between accuracy and robustness: the most accurate models will likely require more detailed training data - accurate sentence-level datasets identifying aspect terms and their respective polarities (positive, negative, or neutral). But the most domain-neutral models will rely on more general features and potentially unsupervised methods. Thus, we examine both supervised and unsupervised approaches, and test across domains to see how applicable each approach is outside its training domain.

3.2 Sequential Labeling: Conditional Random Fields
Aspect term extraction can be modeled as a sequence labeling problem, where each sen-
tence is examined as a sequence of tokens, taking the context of an individual token into
account. This framework is used for problems such as part-of-speech tagging, named
entity recognition, and shallow parsing [26]. We describe and implement a common sequence labeling model called a conditional random field (CRF), building on a related model called a Hidden Markov Model. These are sequential labeling models based on generalizations of the single-label models described with the naive Bayes classifier and Maximum Entropy models. The goal of a CRF is to determine the conditional distribution of potential labels (in our case, using the IOB2 tagging format) given the output (each token's text). Using the framework for Maximum Entropy models and CRFs, feature functions can be defined that allow a vector of output features to be incorporated into the model.

3.2.1 Labeling Method
We use the IOB2 tagging format, where each token is associated with one of three labels - inside an aspect term ("I"), outside an aspect term ("O"), or the beginning of an aspect term ("B"). All aspect terms start with a "B", so only multi-token aspect terms contain "I" labels. For example, the sentence "The battery life is strong" would be labeled O B I O O, with "battery life" as its only aspect term.
3.2.2 Background: Naive Bayes and Maximum Entropy Models

The naive Bayes classifier is used to predict a class label y given a feature vector x. It is based on the assumption of conditional independence of the individual features given the class label. The model attempts to maximize the joint probability p(x, y) of the features and the class label, which due to their conditional independence can be described as follows:

$$p(x, y) = p(y) \prod_{i=1}^{n} p(x_i \mid y). \tag{3.1}$$
The Maximum Entropy model, on the other hand, makes the assumption that log(p(y | x)) can be represented as a linear combination of the features and a constant. This is useful in that the features are not assumed to be independent, and so the relationships among the output features are considered. The conditional probability takes the form:

$$p(y \mid x) = \frac{1}{Z} \exp(\beta_y x + \beta_{y,0}), \tag{3.2}$$

where $Z = \sum_{y} \exp(\beta_y x + \beta_{y,0})$ is a normalization constant which adjusts to ensure valid probabilities. The parameters $\beta_y$ and $\beta_{y,0}$ can be chosen based on the training data by maximum likelihood estimation.
Naive Bayes is a generative model, meaning that the model estimates the joint
probability distribution of the state and the feature vector and uses this learned dis-
tribution to predict the likelihood of a feature vector x being assigned a class label y.
Maximum Entropy models, on the other hand, are discriminative - they learn the conditional probability distribution p(y | x) directly. This is advantageous because unlike generative models, the probability distribution of outputs p(x) does not
need to be learned. In the case of natural language processing where the observed out-
puts are words, there are almost certainly words that don’t exist in the training corpus
that may occur when using the model, meaning p(x) cannot be accurately estimated
without training data that contains every possible word - an unfeasible task.
Because these classifiers only predict a single class label for a set of features, they
cannot model the relationships among the hidden states. Graphical models such as
Hidden Markov Models and CRFs, on the other hand, are able to account for the
3.2.3 Hidden Markov Models

One model for labeling sequences of inputs is called a Hidden Markov Model (HMM).
The system is assumed to be a Markov process, where the state of the current node is
dependent only on the state of the previous node in the sequence. However, instead of
observing the state of a given node directly, we observe an output that is dependent on
the state, and each state has a probability distribution over the set of outputs. HMMs
also assume conditional independence of the output features given each node’s state,
making them a generalization of the Naive Bayes classifier. Given a sequence of outputs
and information about each state's distribution of possible outputs, we can predict a likely sequence of hidden states.
In our problem, the sequence of words or tokens within the sentence is the se-
quence of outputs, and the sequence of labels, using the IOB2 standard, are the hidden
states. Our goal is to predict the IOB2 labels of each token within a sentence, using the observed tokens as outputs.
Let $X = (x_1, x_2, \ldots, x_n)$ be the sequence of observed outputs and $Y = (y_1, y_2, \ldots, y_n)$ be the sequence of hidden states. $x_i$ can be any value within a set of possible outputs $O$ and $y_i$ can be any value within a set of possible state labels $L$. We define the transition probability $p(y_i \mid y_{i-1})$ of the current state given the previous state. The emission probability $p(x_i \mid y_i)$ is the probability of observing the current output given the state of the node. The joint probability distribution of a sequence of outputs $x$ and a sequence of hidden states $y$ can be defined as follows, denoting $p(y_1)$ as $p(y_1 \mid y_0)$ for simplicity:

$$p(x, y) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}) \, p(x_i \mid y_i). \tag{3.3}$$
This is a generalization of the joint probability distribution defined in the naive Bayes classifier. It can be rewritten as:

$$p(x, y) = \exp\left[ \sum_{i=1}^{n} \log(p(y_i \mid y_{i-1})) + \sum_{i=1}^{n} \log(p(x_i \mid y_i)) \right]. \tag{3.4}$$

If we replace $\log(p(y_i \mid y_{i-1}))$ with a parameter $\beta_{y_i, y_{i-1}}$, $\log(p(x_i \mid y_i))$ with a parameter $\mu_{x_i, y_i}$, and adjust by a normalization factor $Z$, we can rewrite this further as:

$$p(x, y) = \frac{1}{Z} \exp\left[ \sum_{i=1}^{n} \beta_{y_i, y_{i-1}} + \sum_{i=1}^{n} \mu_{x_i, y_i} \right]. \tag{3.5}$$

These parameters can be indexed based on the sets of labels and outputs by using indicator functions:

$$p(x, y) = \frac{1}{Z} \exp\left[ \sum_{i=1}^{n} \sum_{j,k \in L} \beta_{j,k} \mathbf{1}_{\{y_i = j\}} \mathbf{1}_{\{y_{i-1} = k\}} + \sum_{i=1}^{n} \sum_{o \in O} \sum_{l \in L} \mu_{o,l} \mathbf{1}_{\{x_i = o\}} \mathbf{1}_{\{y_i = l\}} \right]. \tag{3.6}$$

Finally, feature functions can be defined to simplify the notation used. Let $f_{j,k}(y_i, y_{i-1}, x_i) = \mathbf{1}_{\{y_i = j\}} \mathbf{1}_{\{y_{i-1} = k\}}$ and $f_{o,l}(y_i, y_{i-1}, x_i) = \mathbf{1}_{\{x_i = o\}} \mathbf{1}_{\{y_i = l\}}$. Under this notation, each pair of possible labels $(j, k)$ and each observation-label pair $(o, l)$ has a feature function defined. By indexing these feature functions and their corresponding parameters $\beta_{j,k}$ and $\mu_{o,l}$ using $q$ (with $F$ total functions), we can write the joint probability as follows:

$$p(x, y) = \frac{1}{Z} \exp\left[ \sum_{i=1}^{n} \sum_{q=1}^{F} \lambda_q f_q(y_i, y_{i-1}, x_i) \right]. \tag{3.7}$$

This notation will allow the differences between HMMs and CRFs to be highlighted.
3.2.4 CRF Model Description

As with HMMs, we define X as the sequence of observed outputs and Y as the sequence of hidden state labels. However, unlike HMMs, feature functions can be defined that can account for other output features. In the case of aspect term extraction, this means that the token features defined in the previous chapter can be used to train the model [27].
Consider the joint probability distribution for HMMs. The conditional probability can be written as:

$$p(y \mid x) = \frac{\exp\left[ \sum_{i=1}^{n} \sum_{q=1}^{F} \lambda_q f_q(y_i, y_{i-1}, x_i) \right]}{\sum_{y'} \exp\left[ \sum_{i=1}^{n} \sum_{q=1}^{F} \lambda_q f_q(y'_i, y'_{i-1}, x_i) \right]}. \tag{3.8}$$
More generally, we can describe each output $x_i$ as a vector of features. In our case, this means that rather than using only the word itself as a feature, we can use the various token features described in the previous chapter. Feature functions can then be defined for any function of the current features, the current label, and the previous label. The conditional probability becomes:

$$p(y \mid x) = \frac{\exp\left[ \sum_{i=1}^{n} \sum_{q=1}^{F} \lambda_q f_q(y_i, y_{i-1}, x_i) \right]}{Z(x)}, \tag{3.9}$$

where $Z(x) = \sum_{y'} \exp\left[ \sum_{i=1}^{n} \sum_{q=1}^{F} \lambda_q f_q(y'_i, y'_{i-1}, x_i) \right]$ is the normalization constant, computed by summing the feature functions multiplied by their weights over the possible
label combinations. The number of possible label combinations grows very large, but dynamic programming over the label sequence makes computing $Z(x)$ tractable.

3.2.5 CRF Training

Training requires a set of training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ consisting of $N$ 'documents' - in our case, sentences. Each $x^{(i)}$ is a sequence of feature vectors, with one feature vector for each token in the sentence. The goal of training is to maximize the conditional log-likelihood for a set of parameters $\theta = \{\lambda_q\}_{q=1}^{F}$:
$$l(\theta) = \sum_{i=1}^{N} \log(p(y^{(i)} \mid x^{(i)})). \tag{3.10}$$

To reduce overfitting, a regularization term is added:

$$l(\theta) = \sum_{i=1}^{N} \log(p(y^{(i)} \mid x^{(i)})) - \sum_{q=1}^{F} \frac{\lambda_q^2}{2\sigma^2}. \tag{3.11}$$
This assumes a Gaussian prior on the parameters $\theta$, each with a mean of 0 and variance $\sigma^2$. The partial derivative with respect to each parameter $\lambda_q$ is:

$$\frac{\partial l}{\partial \lambda_q} = \sum_{i=1}^{N} \sum_{j=1}^{n_i} f_q(y_j^{(i)}, y_{j-1}^{(i)}, x_j^{(i)}) - \sum_{i=1}^{N} \frac{\partial}{\partial \lambda_q} \log(Z(x^{(i)})) - \frac{\lambda_q}{\sigma^2}, \tag{3.12}$$

where

$$\sum_{i=1}^{N} \frac{\partial}{\partial \lambda_q} \log(Z(x^{(i)})) = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \sum_{y, y' \in L} f_q(y, y', x_j^{(i)}) \, p(y, y' \mid x^{(i)}). \tag{3.13}$$
The partial derivative with respect to $\lambda_q$ can be interpreted as follows: the first component is the number of observed occurrences of the feature $f_q$, the second component is the expected number of occurrences of the feature $f_q$, and the third is the regularization adjustment. At the maximum likelihood solution, the expected and observed counts of each feature are equal.
The maximum likelihood function l(θ) with regularization is strictly concave, and
so a global optimum can be found [27]. This can be done with nonlinear optimization al-
gorithms such as L-BFGS, stochastic gradient descent, and others. CRFsuite, a software package implementing CRFs, is used for the experiments in this thesis.

3.2.6 CRF Evaluation

An important consideration is the method with which aspect term extraction (ATE) systems are evaluated. One
key question is whether to apply these methods to distinct aspect terms or to each instance of an aspect term. To evaluate using distinct aspect terms, we take the set of predicted distinct aspect terms and compare them to the list of actual distinct aspect terms. However, aspect terms with higher frequency are more valuable, given that our eventual goal is to determine polarity scores for the most common aspects. Under this view, a model that accurately predicts high-frequency aspect terms, but is less effective at predicting low-frequency aspect terms, is more valuable than a model that is better at predicting low-frequency terms than high-frequency terms.
On the other hand, evaluation based on instances of each aspect term can lead to
overconfidence in models that can identify some of the most common terms with accu-
racy, but cannot accurately identify most other terms. Aspect terms with the highest
frequencies in the dataset aren’t always more important to accurately identify than as-
pect terms with lower frequencies. An individual aspect term may be more frequent than
other aspect terms simply because it has few or no synonyms (for example, "Microsoft Office" has no synonyms, while "price" has several different words representing the same concept).
Thus, we evaluate the methods described in the previous sections with respect to both distinct aspect terms and instances of each aspect term. We use 70% of the data available in each domain for training and 30% for testing. As a review, three of the most common measures for evaluating models are precision, recall, and F-measure. Precision describes the fraction of predicted aspect terms that actually exist in the dataset. Recall is the fraction of true aspect terms that are predicted by the model. F-measure is the harmonic mean of precision and recall.
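In terms of true positives (TP), false positives (FP), and false negatives (FN), these measures can be written as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$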
CRFsuite implements several algorithms to solve for the CRF parameters. Two of the most common optimization algorithms for solving CRFs are provided: L-BFGS and stochastic gradient descent. L-BFGS is a limited-memory quasi-Newton method that avoids storing a full approximated Hessian, making it useful for problems such as CRFs where there are often a large number of parameters to be found [17]. Stochastic gradient descent (SGD) is an extension of gradient descent that moves in the direction of the gradient at a randomly chosen data point at each iteration.

Table 3.1: The results for CRFs using distinct aspect terms.
CRFsuite also provides three other training algorithms: averaged perceptron (AP), passive aggressive (PA), and Adaptive Regularization of Weight Vectors (AROW). The averaged perceptron iterates over the training data, updating the feature weights of a perceptron whenever the model cannot make a correct prediction and updating the average feature weights. The final averaged feature weights are returned by the algorithm [7]. Passive-aggressive algorithms work by aggressively shifting the current parameter estimate when the current training instance has a positive value for the loss function and making no adjustment when the loss function is zero. AROW uses a Gaussian distribution to measure the confidence in each parameter estimate. It adjusts the model to prevent overly aggressive shifts that can occur when using passive-aggressive updates [10].
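A minimal sketch of training and applying a CRF with pycrfsuite is shown below, assuming the per-token feature dictionaries and IOB2 label sequences described earlier; the helper names, model file name, and parameter values are illustrative.

import pycrfsuite

# X_train: list of sentences, each a list of per-token feature dicts.
# y_train: list of IOB2 label sequences, e.g. ['O', 'B', 'I', 'O'].
def train_crf(X_train, y_train, algorithm='lbfgs', model_path='ate.crfsuite'):
    trainer = pycrfsuite.Trainer(algorithm=algorithm, verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    if algorithm == 'lbfgs':
        # c2 is the L2 regularization weight (the Gaussian prior above).
        trainer.set_params({'c2': 1.0})
    trainer.train(model_path)

def predict_labels(X_test, model_path='ate.crfsuite'):
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return [tagger.tag(xseq) for xseq in X_test]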
The results for distinct aspect terms for both Restaurant and Laptop datasets
(using the 2014 format described in Figure 2.1) can be seen in Table 3.1. The best
training algorithms for both datasets evaluated with distinct aspect terms were L-BFGS
and PA. Overall, CRFs were more effective on the restaurant domain (with a best F-measure of 0.5984 when using L-BFGS) than on the laptop domain, where the best F-measure was lower.

Table 3.2: The results for CRFs using instances of aspect terms.
The results for instances of aspect terms can be seen in Table 3.2. The best training
algorithms for both datasets evaluated with aspect term instances were L-BFGS and
PA, with F-measures of 0.7810 and 0.7867 respectively for the Restaurant domain and
0.6944 and 0.7018 respectively for the Laptop domain. Again, the CRF seems to be more effective on the restaurant domain.
The model seems to have significantly higher precision than recall regardless of
algorithm and across both distinct and instance-based evaluation methods. This sug-
gests that the models may have difficulty identifying some aspect terms; however, the
significant increase in both precision and recall when evaluating the instances of each
aspect term suggests that much of this may come from failing to identify infrequent
aspect terms.
To see how effective the model would be on data outside of the training domain, we
attempt to train each model on one domain and evaluate its performance using testing
data from the other domain. These results can be found in Table 3.3 and Table 3.4.
Table 3.3: The results for CRFs using distinct aspect terms across domains.
Table 3.4: The results for CRFs using instances of aspect terms across domains.
Overall, the quality of the results suffered significantly, suggesting that the model doesn’t
perform well on data outside of the domain of the training data. However, some of the
algorithms used seem to suffer less reduction in quality than others. Using SGD on
the Laptops dataset for training, the F-measures for distinct and instances (0.3438 and
0.4459, respectively) for the Laptop testing set are relatively close to their values when
using the Restaurant testing set (0.3455 and 0.3750, respectively). Though the overall
results were still poor, this suggests that some methods and training datasets may be
more generalizable than others. Discovering which methods and training datasets generalize best is an area that may be worth exploring in the future.
3.3 Association Mining Method (Hu and Liu)

A method based on association mining to find frequent itemsets was defined in [14]. The method extracts nouns and noun phrases in each sentence, then prunes them to identify aspect terms. This is based on the notion that reviewers tend to use similar words when describing aspects of a review topic, and so frequently-occurring sets of words are more likely to be aspect terms.

3.3.1 Association Mining Method Description
First, a set of initial candidate itemsets is generated. A list of nouns and noun phrases N, ordered by their placement within the sentence, is extracted from each sentence as initial itemsets. Pairs and triples of these nouns and noun phrases within each sentence are also considered candidate terms. This is only done for adjacent nouns and noun phrases. More specifically, if $N = (N_1, N_2, \ldots, N_k)$ is the ordered list for a sentence, the extracted pairs are $\{N_j, N_{j+1}\}$ for $1 \leq j \leq k-1$ and the extracted triples are $\{N_j, N_{j+1}, N_{j+2}\}$ for $1 \leq j \leq k-2$.

At this point, we have an initial set of candidate terms. We reduce the set of candidate terms by keeping only those whose frequency across the dataset meets a minimum support level m. This can also be based off a specified percentage of the dataset. All other candidate terms are removed.
Two additional pruning measures are taken to reduce the set of candidate terms.
An adjusted frequency measure called “p-support” is found that only counts a candidate
term in a sentence if it is not a subset of another candidate term within the sentence.
For example, if a sentence contained the phrase "ham sandwich" and both "ham" and "ham sandwich" were candidate terms, then this sentence would not count towards the p-support of "ham", since it's a subset of another candidate term, "ham sandwich", that exists within the sentence. If the p-support of a candidate term is low, and it appears as part of a larger candidate term, the candidate term is likely a component of the larger term rather than a genuine aspect term. Thus, if the p-support of a candidate term is less than a minimum threshold p and the candidate term is a subset of some other term, we remove it from the set of candidate terms.
Another pruning measure attempts to correct for issues that can arise from using
frequent itemsets. When the initial set of candidate terms is created, pairs and triples
of nouns and noun phrases are considered candidate terms. However, these words may
be relatively far apart within a sentence, suggesting that they might not be part of the
same aspect term. For a term within a given sentence, we find the maximum distance
between any two adjacent words in the term. Their distance is measured by how many
tokens apart they are in the sentence. If this value exceeds a token distance threshold
w, then we consider the term non-compact within the sentence. If a term is found to be non-compact in more than a maximum allowed number of sentences, it is removed from the set of candidate terms.
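The sketch below illustrates the candidate generation, minimum support pruning, and p-support computation just described, starting from per-sentence lists of nouns and noun phrases; the helper names and default threshold are illustrative and not those of the original implementation in [14].

from collections import Counter
from itertools import chain

def candidate_terms(noun_lists, min_support=3):
    """noun_lists: for each sentence, the ordered nouns / noun phrases."""
    per_sentence = []
    for nouns in noun_lists:
        cands = [(n,) for n in nouns]
        cands += [tuple(nouns[j:j + 2]) for j in range(len(nouns) - 1)]  # adjacent pairs
        cands += [tuple(nouns[j:j + 3]) for j in range(len(nouns) - 2)]  # adjacent triples
        per_sentence.append(cands)
    # Keep only candidates meeting the minimum support threshold.
    support = Counter(chain.from_iterable(per_sentence))
    frequent = {c for c, count in support.items() if count >= min_support}
    return frequent, per_sentence

def p_support(frequent, per_sentence):
    """Count a frequent term in a sentence only when it is not contained
    in a longer frequent candidate appearing in the same sentence."""
    counts = Counter()
    for cands in per_sentence:
        present = [c for c in set(cands) if c in frequent]
        for c in present:
            if not any(set(c) < set(other) for other in present):
                counts[c] += 1
    return counts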
3.3.2 Association Mining Method Evaluation

An important consideration in aspect term extraction is the idea that aspect terms are
likely to be nouns and noun phrases. In the case of one-word aspect terms that are
nouns, identification is as simple as finding nouns from each sentence using a part-of-
speech tagger, then using other methods to filter out nouns that aren’t actually aspect
terms. For the case of multi-word aspect terms, noun phrases must be identified. Any
unsupervised method for aspect identification must somehow identify these noun phrases
without the benefit of training data. The general problem of identifying grammatical
structures such as noun phrases is called shallow parsing [3]. Three different methods
were explored for identifying noun phrases. We attempted to use NLTK's "Regexp"
(regular expression) feature, which finds specific patterns in text using a pre-defined
search pattern [6]. However, noun phrases take many possible forms, and defining all the
possible search patterns that noun phrases may exist in is unfeasible. We also examined
bigram and trigram classifiers, trained on a portion of Treebank data available in NLTK, as well as NLTK's named-entity chunker.
In the testing of the Association Mining algorithm, it became clear that noun
chunking was a significant issue that hindered the performance of the algorithm as a
whole. After tuning the input parameters for the Restaurant domain dataset, the best model using the named-entity chunker had a precision of 0.3777 and a recall of 0.2480, using a maximum non-compact frequency threshold of 1. We examined the full list of candidate terms before pruning (consisting of all nouns and noun phrases in the sentence, as well as all adjacent pairs and triples of nouns and noun phrases) and found that only 61.18% of the true aspect terms were detected - this provides an upper bound on the recall of the model. In the future, examining more accurate methods for identifying noun phrases may significantly improve the performance of this approach.

Chapter 4

Aspect-Based Sentiment Analysis

4.1 Problem Description
This chapter is focused on estimating the sentiment of the aspect terms in a sentence,
assuming the aspect terms are known. We also examine the case where aspect categories
are provided for each sentence rather than individual aspect terms.
Given a set of reviews with aspect terms identified, we would like to accurately
estimate the sentiment of each occurrence of an aspect term in a sentence. With one
aspect per sentence, an assumption can be made that polarity within the sentence is
associated with the polarity of the aspect. When multiple aspect terms are present in
one sentence, words associated with one aspect term may incorrectly be associated with
another aspect term, causing the polarities of each aspect term within the sentence to
An issue with using aspect terms individually is that oftentimes, multiple aspect
terms will refer to the same or similar aspects. For example, "price" and "cost" refer
to the same aspect, yet are considered separate aspect terms. This suggests that a way to categorize aspect terms is desirable when designing a system to accurately rate aspects. In this chapter, we focus on the problem of determining the sentiment of individual instances of aspect terms, rather than the problem of aggregating these sentiments.
A secondary formulation of the problem can be given for aspect categories (predefined categories representing common aspects of a review's subject). Given a set of reviews with sentence-level aspect categories identified, we would like to accurately estimate the sentiment of each aspect category in a sentence. There are multiple benefits to using aspect categories rather than specific aspect terms. Typically there will be a much smaller number of aspect categories than aspect terms, and these categories will be present in a larger number of sentences than individual aspect terms. This means a smaller amount of data is needed to produce reliable ratings. However, identifying these aspect categories in the first place can be difficult, and requires a predefined list of categories for each domain. As such, the results provided here assume that the aspect categories in each sentence are given.

4.2 VADER
VADER (Valence Aware Dictionary for sEntiment Reasoning) [15] is a rule-based model for performing sentiment analysis on a per-sentence basis. The system was trained on online media text, some of which included movie and product reviews. VADER utilizes a sentiment lexicon constructed with the purpose of being generalizable to multiple domains. This makes VADER particularly suitable for analyzing online review data. In addition, because it requires no domain-specific training, it can easily be applied and tested on newly-seen data and data across domains.
Their sentiment lexicon was based on several existing sentiment lexicons, as well as additional lexical features common in social media text, with human ratings that contain information about sentiment intensity (how strongly a word expresses a given sentiment) used to score each lexical feature's intensity and polarity. Five major heuristics are used to determine the valence score of a given sentence:

1. Exclamation points increase the magnitude of the valence score. For example, "The keyboard is great." is rated with a lower magnitude than "The keyboard is great!".

2. Full-word capitalization, especially when other nearby words aren't fully capitalized, increases the magnitude of the valence score. For example, "The keyboard is GREAT." is rated with a higher magnitude than "The keyboard is great.".

3. A set of adverbs called 'degree modifiers' is used to increase or decrease the magnitude of the valence score, depending on the word. For example, "The keyboard is great." is rated with a lower magnitude than "The keyboard is very great." and a higher magnitude than "The keyboard is marginally great.".

4. The conjunction "but" signals a shift in sentiment polarity. The sentiment of the portion of the sentence after "but" is considered to be the dominant sentiment, and contributes a greater amount (two-thirds) to the valence score than the portion of the sentence before "but" (one-third).

5. The trigram before an occurrence of a lexical feature (as determined by the sentiment lexicon) is examined for negations, which flip the valence score of the feature to the opposite polarity. For example, "The keyboard is not great" would be given a negative valence score, since "not" is a negation that flips the polarity of "great".
VADER returns a set of four scores: one each for "positive", "negative", and "neutral" (which together sum to 1.0), as well as a "compound" score (ranging from -1.0 to 1.0) reflecting the intensity of the polarity within the sentence. Negative scores are associated with negative polarity within a sentence, and positive scores are associated with positive sentence polarity. Larger magnitudes of the "compound" score are associated with stronger overall sentiment. In order to compare these scores with the available data, for each sentence we return a single predicted label. If the sentence's "negative" score is greater than its "positive" score (or if the "compound" score is less than 0), we return "negative". Otherwise, we return "positive". We do not predict the "conflict" label.
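A sketch of this scoring and label mapping, using the VADER implementation distributed with NLTK (the vader_lexicon resource must be downloaded once), is shown below.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires a one-time download: nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

def predict_label(sentence):
    # scores is a dict: {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    scores = analyzer.polarity_scores(sentence)
    if scores['neg'] > scores['pos'] or scores['compound'] < 0:
        return 'negative'
    return 'positive'

print(predict_label("The keyboard is great!"))      # positive
print(predict_label("The keyboard is not great."))  # negative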
4.2.1 Evaluation
We keep track of the predicted and true label values for each occurrence of an aspect
term (and additionally for each occurrence of an aspect category, in the case of the Restaurant domain).
Table 4.1: The results using VADER on aspect terms in the Laptop domain.
Accuracy: 0.5855
Label Precision Recall F-Measure
Positive 0.8140 0.6543 0.7254
Negative 0.3362 0.2916 0.3123
Neutral 0.4535 0.7104 0.5536
Table 4.2: The results using VADER on aspect terms in the Restaurant domain.
Accuracy: 0.6501
Label Precision Recall F-Measure
Positive 0.8235 0.7558 0.7882
Negative 0.4028 0.3287 0.3620
Neutral 0.3730 0.6423 0.4719
Table 4.3: The results using VADER on aspect categories in the Restaurant domain.
Accuracy: 0.6535
Label Precision Recall F-Measure
Positive 0.7918 0.7974 0.7946
Negative 0.5226 0.2942 0.3765
Neutral 0.3674 0.6570 0.4713
Accuracy is the fraction of instances labeled correctly by our model; precision, recall, and F-measure are also calculated with respect to the labels "positive", "negative", and "neutral". The term-based results can be found in Table 4.1 for the Laptop domain dataset and in Table 4.2 for the Restaurant domain. The accuracy is reasonably high for an unsupervised model, though it is somewhat lower for the Laptop domain (0.5855) versus the Restaurant domain dataset (0.6501). Evaluating based on aspect categories for the Restaurant domain dataset provides similar results, without significant variation in any of the evaluation measures. These results can be found in Table 4.3.
Given the number of positive (p), negative (n), neutral/objective (o), and conflict (c) labels for a given aspect term or aspect category, a rating (r) from 1 to 5 can be determined as follows:

$$r = 4 \cdot \frac{p + 0.5c}{p + n + c} + 1. \tag{4.1}$$
This model assumes that "conflict" labels are associated with an equal split between positive and negative sentiment, and assumes that positive occurrences should be weighted the same as negative occurrences. This assumption is based on the idea that a review with $n$ out of $N$ stars has a fraction of positive to negative sentiment of $\frac{n-1}{N-1}$. However, this may not be true in practice. Given a dataset with quantitative review scores in addition to review text, a more accurate model of the proportions of positive and negative sentiment may be developed. Using these proportions, weights can be used to more heavily skew an occurrence of a particular sentiment label versus other sentiment labels.
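Equation 4.1 translates directly into code; the counts in the usage example below are made up for illustration.

def aspect_rating(p, n, c):
    """Equation 4.1: map positive/negative/conflict counts to a 1-5 rating.
    Neutral occurrences do not affect the rating."""
    return 4.0 * (p + 0.5 * c) / (p + n + c) + 1.0

# E.g., an aspect mentioned positively 30 times, negatively 10, with 2 conflicts:
print(round(aspect_rating(30, 10, 2), 2))  # ~3.95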
We use the restaurant domain dataset’s aspect categories to calculate ratings, since
there are significantly more occurrences of each aspect category than any one aspect
term. The ratings based on the adjusted VADER model’s predictions and based on the
true sentiment labels can be found in Table 4.4. Overall, the predicted ratings tended to overestimate the true rating by an average of 0.564; this suggests that VADER is skewed towards predicting positive sentiment on this data.

Table 4.4: True and predicted ratings for each category in the Restaurant domain.

Chapter 5

Conclusion
In this thesis, we explored some of the key tasks in the development of an aspect-based review system. We described the qualities of an ideal dataset for the purpose of training models for aspect identification and aspect-based sentiment analysis. For the task of aspect identification, two algorithms were described and tested: a supervised sequential learning model called a conditional random field and an unsupervised association mining algorithm. The results for conditional random fields suggest that they are an effective classifier for identifying aspect terms, particularly when the parameters are learned using L-BFGS or a passive-aggressive algorithm. The results for the association mining algorithm were relatively poor due to issues with identifying noun phrases, but illuminated an area for further exploration: accurately identifying noun phrases. For the task of aspect-based sentiment analysis, we described a method for applying VADER, a rule-based sentiment analysis model, to estimate the sentiment of aspect terms and aspect categories [15]. The results for this model were reasonably accurate for an unsupervised approach.

One important area of future work is grouping together aspect
terms that are synonyms of each other (for example, "price" and "cost") and aspect terms that are a part of an overarching category (for example, "water" and "wine" might be part of an overarching category called "beverages"). This can be done with predefined categories, which can allow for a supervised approach to the clustering problem. Review-level and sentence-level training data is difficult to generate for a large number of domains, but having a small number of predefined categories to capture the most common aspect terms for each domain is much more feasible. Unsupervised clustering methods may also be explored, given a fixed number of clusters. In this case, clusters of similar aspect terms could be formed without the need for predefined categories.

Bibliography
[6] S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69-72. Association for Computational Linguistics, 2006.

[7] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 1-8. Association for Computational Linguistics, 2002.

[13] L. Hirschman and R. Gaizauskas. Natural language question answering: The view from here. Natural Language Engineering, 7(04):275-300, 2001.

[14] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177. ACM, 2004.

[15] C. J. Hutto and E. Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media, 2014.

[17] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503-528, 1989.

[26] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134-141. Association for Computational Linguistics, 2003.
Appendix - Data Processing and Test Functions

import time
from collections import defaultdict
import xml.etree.ElementTree as ET
import libraries.structure as st
from libraries.structure import Corpus
import aspect_identification as ai
import sentiment_analysis as sa
from stanford_corenlp_python import jsonrpc


def sentimentAnalysisTest(data):
    instances = data.corpus
    trueTermPolsBySent = sa.getTermPolarities(instances)
    trueCatPolsBySent = sa.getCategoryPolarities(instances)

    predictedTermPolsBySent = sa.vaderTermPolarities(instances)
    print "Evaluate By Terms:"
    sa.evaluatePolarities(trueTermPolsBySent, predictedTermPolsBySent)

    # ...

    test = testR
    numMethods = 0
    if HL == True:
        # H&L settings
        minSupports = [0]
        minPsupports = [0]
        maxWordDist = [10.0]
        maxNonCompact = [10.0]
        params = [(i, j, k, l) for i in minSupports for j in minPsupports
                  for k in maxWordDist for l in maxNonCompact]
        numMethods += len(params)

    if CRF == True:
        # CRF settings
        algs = ['lbfgs', 'l2sgd', 'ap', 'pa', 'arow']
        numMethods += len(algs)

    predictedFDs = range(numMethods)
    predictedTermsBySent = range(numMethods)
    methodNames = []
    count = 0

    if HL == True:
        for p in range(len(params)):
            # Run Association Mining (Hu & Liu) algorithm
            i = params[p][0]
            j = params[p][1]
            k = params[p][2]
            l = params[p][3]
            predictedFDs[count], predictedTermsBySent[count] = ai.HuLiu(
                dataL.corpus, minSupport=i, minPsupport=j,
                maxWordDist=k, maxNonCompact=l)
            methodNames.append("H&L: (minS=" + str(i) + ",minPS=" + str(j) +
                               ",maxWD=" + str(k) + ",maxNC=" + str(l) + ")")
            count += 1

    if CRF == True:
        for k in algs:
            # Run Conditional Random Field algorithm
            crfLabels = ai.crf(train, test, k)
            predictedFDs[count], predictedTermsBySent[count] = ai.IOB2toAspectTerms(crfLabels, test)
            methodNames.append(k)
            count += 1

    # Evaluate methods
    ai.evaluateAspectTerms(testFD, testBySent, predictedFDs,
                           predictedTermsBySent, methodNames, True)
def process_semeval_2015():
    # the train set is composed by train and trial data set
    corpora = dict()
    corpora['laptop'] = dict()
    train_filename = 'datasets/ABSA-SemEval2015/ABSA-15_Restaurants_Train_Final.xml'
    trial_filename = 'datasets/ABSA-SemEval2015/absa-2015_restaurants_trial.xml'

    reviews = ET.parse(train_filename).getroot().findall('Review') + \
        ET.parse(trial_filename).getroot().findall('Review')

    sentences = []
    for r in reviews:
        sentences += r.find('sentences').getchildren()

    # ...
    #     ET.parse(trial_filename).getroot().findall('sentence'))
    # ...
    #     jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
Appendix - Class Definitions

import string
from collections import defaultdict

def fd(counts):
    '''Given a list of occurrences (e.g., [1,1,1,2]), return a dictionary
    of frequencies (e.g., {1:3, 2:1}).'''
    d = defaultdict(lambda: 0)
    for i in counts: d[i] = d[i] + 1 if i in d else 1
    return d

def fd2(counts):
    '''Given a list of 2-uplets (e.g., [(a,pos), (a,pos), (a,neg), ...]),
    form a dict of frequencies of specific items (e.g., {a:{pos:2, neg:1}, ...}).'''
    d = {}
    for i in counts:
        # If the first element of the 2-uplet is not in the map, add it.
        if i[0] in d:
            if i[1] in d[i[0]]:
                d[i[0]][i[1]] += 1
            else:
                d[i[0]][i[1]] = 1
        else:
            d[i[0]] = defaultdict(lambda: 0)
            d[i[0]][i[1]] += 1
    return d
def validate(filename):
    '''Validate an XML file, w.r.t. the format given in the 4th task of
    **SemEval '14**.'''
    elements = ET.parse(filename).getroot().findall('sentence')
    aspects = []
    for e in elements:
        for eterms in e.findall('aspectTerms'):
            if eterms is not None:
                for a in eterms.findall('aspectTerm'):
                    aspects.append(Aspect('', '').createEl(a).term)
    return elements, aspects


# Dice coefficient
def dice(t1, t2, stopwords=[]):
    tokenize = lambda t: set([w for w in t.split() if (w not in stopwords)])
    t1, t2 = tokenize(t1), tokenize(t2)
    return 2. * len(t1.intersection(t2)) / (len(t1) + len(t2))

# Find the index of the nth occurrence of a word within a tokenized text
def findNthOccurrence(tokenized_text, word, n):
    if n < 1:
        print "Error: n must be an integer >= 1"
        exit()
    k = 0  # How many occurrences we've seen so far
    for index in range(len(tokenized_text)):
        if word in tokenized_text[index]:
            k = k + 1
            if k == n:
                return index
    print "Error: Could not find nth occurrence"
    return -1

def generate(sentences):
    features = [[token.toDict() for token in s.tokens] for s in sentences]
    labels = [[token.actualIOB2 for token in s.tokens] for s in sentences]
    return features, labels
75 class Category:
Appendix - Class Definitions 51
76 '''Category objects contain the term and polarity (i.e., pos, neg,
,→ neu, conflict) of the category (e.g., food, price, etc.) of a
,→ sentence.'''
77
91 class Token:
92 ''' Token objects contain information about an individual token -
,→ usually a word or punctuation. '''
93
107
143
        # Update the next and previous tokens for each Token object
        for i in range(len(self.tokens)):
            token = self.tokens[i]
            if i == 0 and i == (len(self.tokens) - 1):
                # A single-token sentence is padded with empty Tokens on both sides
                token.setPrev(Token())
                token.setNext(Token(index=len(self.tokens)))
            elif i == 0:
                token.setPrev(Token())
                token.setNext(self.tokens[i+1])
            elif i == (len(self.tokens) - 1):
                token.setPrev(self.tokens[i-1])
                token.setNext(Token(index=len(self.tokens)))
            else:
                token.setPrev(self.tokens[i-1])
                token.setNext(self.tokens[i+1])
        # ...
        # Add a list of the Token objects for the aspect term
        at.setTokens(self.tokens[at.headIndex:at.endIndex])
        # ...
        # Collect predicted aspect terms: a "B" token starts a term and any
        # consecutive "I" tokens extend it
        term = []
        termList = []
        i = 0
        while i < len(self.tokens):
            t = self.tokens[i]
            if t.predictedIOB2 == "B":
                term.append(t)
                while i+1 < len(self.tokens):
                    if self.tokens[i+1].predictedIOB2 == "I":
                        term.append(self.tokens[i+1])
                        i = i + 1
                    else:
                        break
                termList.append(term)
                term = []
            i = i + 1
        return termList
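The same span-grouping logic, sketched over a bare IOB2 label list (the function name and test labels here are illustrative, not from the thesis code):

def iob2_spans(labels):
    # A "B" starts a span; consecutive "I" labels extend it
    spans, i = [], 0
    while i < len(labels):
        if labels[i] == "B":
            start = i
            while i + 1 < len(labels) and labels[i+1] == "I":
                i += 1
            spans.append((start, i + 1))
        i += 1
    return spans

print iob2_spans(["O", "B", "I", "O", "B"])    # [(1, 3), (4, 5)]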
Appendix - Aspect Identification

import time
import math
from collections import defaultdict
import xml.etree.ElementTree as ET
from libraries.structure import Corpus
from libraries.structure import fd
from libraries.structure import freq_rank
from libraries.structure import generate

import nltk
import nltk.corpus, nltk.tag
from nltk import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer as Tokenizer
import nltk.chunk as chunk
from nltk.stem.porter import PorterStemmer as Stemmer

import pycrfsuite
    # Flag: use treebank-trained tag chunkers instead of NE-based chunking
    tbc = False
    if tbc:
        # Train backoff chunkers on the CoNLL-style treebank chunk corpus
        treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
        train_chunks = conll_tag_chunks(treebank_sents)
        u_chunker = nltk.tag.UnigramTagger(train_chunks)
        ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
        ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
        ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
        utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
        # Find nouns and noun phrases in each sentence - these are initial
        # candidate terms. Nouns are lists of (String, Index) tuples
        nounsBySentence = [nounsAndPhrasesInSentence(s, chunker=ub_chunker, tbc=True)
                           for s in instances]
    else:
        # Find nouns and noun phrases in each sentence - these are initial
        # candidate terms. Nouns are lists of (String, Index) tuples
        nounsBySentence = [nounsAndPhrasesInSentence(s, ne=True) for s in instances]
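conll_tag_chunks is called above but not shown in this excerpt; a minimal sketch consistent with the standard NLTK tag-based chunking recipe (an assumption, not necessarily the thesis's exact version) is:

def conll_tag_chunks(chunk_sents):
    # Convert chunk trees to (POS tag, IOB chunk tag) sequences, dropping
    # the words so the taggers learn chunk tags from POS tags alone
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tag_sents]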
    # Get all frequent candidate terms - those that meet the minimum support.
    terms['sent']['(Term,Indices)'] = [[term for term in s
                                        if support[term[0]] >= minSupport]
                                       for s in temp]

    # Update support
    support = fd([term[0] for s in terms['sent']['(Term,Indices)']
                  for term in s])

    # ...
    # Stores occurrences of the non-compact form of each term
    nonCompact = dict.fromkeys(terms['all']['set'], 0)
    # Stores the p-support of each term
    pSupport = fd([term[0] for s in terms['sent']['(Term,Indices)']
                   for term in removeSubsets(s)])
    isSubset = dict.fromkeys(terms['all']['set'])

    # ... (intervening code omitted)
# (enclosing function definition omitted in this excerpt)
    '''Given a sentence tagged with chunks (using nltk.pos_tag and
    RegexpParser), return a list. Each element is itself a list of the
    indices of one noun / noun phrase: single-element lists hold a single
    index and correspond to single-word nouns, while multi-element lists
    store a sequence of indices and correspond to noun phrases.
    '''
    index = 0
    parsed = []
    for i in range(len(cTagged)):
        if type(cTagged[i]) is nltk.tree.Tree:
            # A Tree is a chunk; record the range of indices it covers
            parsed.append(range(index, index + len(cTagged[i])))
            index = index + len(cTagged[i])
        else:
            # Leaves are (word, tag) tuples, so test the POS tag
            if cTagged[i][1].startswith('NN'):
                parsed.append([index])
            index = index + 1
    return parsed
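Assuming the body above sits in a function, hypothetically named parseChunks, a usage sketch:

cTagged = [nltk.tree.Tree('NP', [('battery', 'NN'), ('life', 'NN')]),
           ('is', 'VBZ'), ('poor', 'JJ'), ('overall', 'NN')]
print parseChunks(cTagged)    # [[0, 1], [4]]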
    # Train the CRF and write the model file (the filename is carried over
    # from the pycrfsuite tutorial)
    trainer.train('conll2002-esp.crfsuite')
    # Statistics from the final training iteration
    trainer.logparser.last_iteration
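The trainer construction and tagging steps around this call are elided; a minimal sketch using the public pycrfsuite API (the parameter values are the library tutorial's defaults, not necessarily the thesis settings):

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(features, labels):    # e.g., from generate(sentences)
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 1.0,                              # L1 regularization coefficient
    'c2': 1e-3,                             # L2 regularization coefficient
    'max_iterations': 50,
    'feature.possible_transitions': True,
})
trainer.train('conll2002-esp.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')
predicted = [tagger.tag(xseq) for xseq in features]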
    # ... (intervening code omitted)

    # Distinct
    TP, FN, FP, P, R, F = evaluateAspectTermsDistinct(set(trueFD.keys()),
                                                      set(predictedFD.keys()))
    distinct["TP"][methodNames[i]] = TP
    distinct["FN"][methodNames[i]] = FN
    distinct["FP"][methodNames[i]] = FP
    distinct["P"][methodNames[i]] = P
    distinct["R"][methodNames[i]] = R
    distinct["F"][methodNames[i]] = F

    # Instances
    TP, FN, FP, P, R, F = evaluateAspectTermsInstances(trueSent, predictedSent,
                                                       trueFD, predictedFD)
    instances["TP"][methodNames[i]] = TP
    instances["FN"][methodNames[i]] = FN
    instances["FP"][methodNames[i]] = FP
    instances["P"][methodNames[i]] = P
    instances["R"][methodNames[i]] = R
    instances["F"][methodNames[i]] = F
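evaluateAspectTermsDistinct itself is elided from this excerpt; given that it receives the sets of distinct true and predicted terms, a sketch consistent with that usage (an assumption about its implementation):

def evaluateAspectTermsDistinct(atSet, ptSet):
    TP = len(atSet & ptSet)    # distinct terms both predicted and true
    FP = len(ptSet - atSet)    # predicted but not true
    FN = len(atSet - ptSet)    # true but not predicted
    if TP == 0:
        P = R = F = 0.0
    else:
        P = float(TP) / (TP + FP)
        R = float(TP) / (TP + FN)
        F = 2.0 * P * R / (P + R)
    return TP, FN, FP, P, R, F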
    if toScreen:
        print "------------------- Distinct: -------------------"
        for m in methodNames:
            print m
            print "TP: " + str(distinct["TP"][m])
            print "FN: " + str(distinct["FN"][m])
            print "FP: " + str(distinct["FP"][m])
            print "P: " + str(distinct["P"][m])
            print "R: " + str(distinct["R"][m])
            print "F: " + str(distinct["F"][m])
            print ""
    '''
    print "\nEvaluate by Distinct Term: "
    print "Predicted: " + str(len(ptSet))
    print "Actual: " + str(len(atSet))
    '''

    if toScreen:
        print ("True Positive: %f -- False Negative: %f -- False Positive: %f "
               "(P = %f, R = %f, F = %f)") % (TP, FN, FP, P, R, F)
    # Below is a version of the code that gives frequencies of true
    # positives, false negatives, and false positives for each word:
    # ... (commented-out block omitted)

    TP = sum(truePos.values())
    FN = sum(falseNeg.values())
    FP = sum(falsePos.values())

    if TP == 0:
        P = 0.0
        R = 0.0
        F = 0.0
    else:
        P = float(TP)/(TP + FP)
        R = float(TP)/(TP + FN)
        F = 2.0*P*R/(P+R)
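As a quick numeric check of these formulas, with TP = 3, FP = 1, and FN = 2:

TP, FP, FN = 3, 1, 2
P = float(TP) / (TP + FP)    # 3/4 = 0.75
R = float(TP) / (TP + FN)    # 3/5 = 0.60
F = 2.0 * P * R / (P + R)    # 0.9 / 1.35, approximately 0.667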
    '''
    print "\nEvaluate by Instance: "
    print "Predicted: " + str(sum(train_data.predicted_terms_fd.values()))
    print "Actual: " + str(sum(train_data.aspect_terms_fd.values()))
    '''

    if toScreen:
        print ("True Positive: %f -- False Negative: %f -- False Positive: %f "
               "(P = %f, R = %f, F = %f)") % (TP, FN, FP, P, R, F)
Appendix - Sentiment Analysis

import time
from collections import defaultdict
import xml.etree.ElementTree as ET
from libraries.structure import Corpus
from libraries.structure import fd
from libraries.structure import freq_rank
from libraries.structure import generate

import nltk
from nltk import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer as Tokenizer
from nltk.corpus import sentiwordnet as swn
import nltk.chunk as chunk
from nltk.stem.porter import PorterStemmer as Stemmer
from nltk.corpus import wordnet as wn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import pycrfsuite
def getTermPolarities(instances):
    '''Input: A list of instances
    Output: A list of sentences, where each sentence is a list of
    (term, polarity) tuples, where polarity is one of (positive, negative,
    neutral, conflict)
    '''
    return [[(aspect.term, aspect.polarity) for aspect in instance.aspect_terms]
            for instance in instances]

def getCategoryPolarities(instances):
    '''Input: A list of instances
    Output: A list of sentences, where each sentence is a list of
    (category, polarity) tuples, where polarity is one of (positive,
    negative, neutral, conflict)
    '''
    return [[(category.term, category.polarity) for category in instance.aspect_categories]
            for instance in instances]
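To illustrate how VADER scores can be mapped to the polarity labels used above, a minimal sketch (the +/-0.05 compound-score cutoffs are the conventional VADER thresholds, an assumption rather than the thesis's exact rule; the vader_lexicon corpus must be downloaded first):

sia = SentimentIntensityAnalyzer()

def vaderPolarity(text):
    # VADER's compound score lies in [-1, 1]
    compound = sia.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    return 'neutral'

print vaderPolarity("The fajitas were great.")    # positive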
Biography
Sean Byrne was born in 1994 in the state of Pennsylvania. He attended Lehigh University
for his undergraduate education, where he was highly involved in the Industrial & Systems
Engineering department and the ISE Student Council. He graduated from Lehigh University
with a Bachelor of Science in May 2016. He is now completing a Master of Science degree
in Industrial & Systems Engineering through the President's Scholars program, and will
graduate in May 2017.