Received February 18, 2022, accepted March 13, 2022, date of publication March 16, 2022, date of current version March 28, 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3160172

A Study of the Application of Weight Distributing Method Combining Sentiment Dictionary and TF-IDF for Text Sentiment Analysis
HAO LIU , XI CHEN , AND XIAOXIAO LIU
Collaborative Innovation Centre of Assessment for Basic Education Quality, BNU, Beijing 100875, China
Institute of Education, University of Alberta, Edmonton, AB T6G 2G5, Canada
Corresponding author: Hao Liu ([email protected])
This work was supported in part by the National Social Science Fund of China under Grant 20CTJ019.

ABSTRACT The most commonly used methods in text sentiment analysis are the rule-based sentiment dictionary and machine learning, with the latter referring to the use of vectors to represent text followed by the use of machine learning to classify the vectors. Both methods have their limitations, including inflexibility of rules and non-prominence of sentiment words. In this paper, we design a weight distributing method combining the two methods for text sentiment analysis, by which the sentence vectors obtained highlight words with sentiment meanings while retaining their text information. Empirical results show that with this new method the accuracy of text sentiment analysis reaches 82.1%, which is 13.9 percentage points higher than the rule-based sentiment dictionary method and 7.7 percentage points higher than the TF-IDF weighting method.

INDEX TERMS Text sentiment analysis, sentiment dictionary, sentence vector, weights distribution.

I. INTRODUCTION

Natural language processing (NLP) is a way to systematically analyze, understand and extract information from text data. Because the capacity for language is one of the central features of human intelligence, NLP is one of the most important branches of artificial intelligence, which integrates social sciences (linguistics, logic, etc.), natural sciences (computer science, statistics, etc.) and engineering (electrical engineering, etc.) [1], [2]. The goal of NLP is to provide new computational capabilities around human language, which typically involves information extraction, machine translation, text generation and so on [2].

Text sentiment analysis is an important application of NLP, which refers to "analysis of subjective texts with sentiment overtones to tap and classify their sentiment connotations and attitudes" [3]. With the rapid development of the Internet and social media, people often log on to different types of websites to express their opinions on current affairs or products, so individuals and organizations are increasingly using the content in these media for decision making [4]. Effectively obtaining and analyzing the sentiment information contained in massive digital texts has become one of today's hot research topics. Sentiment analysis applications have spread to almost every possible domain, such as consumer products, services, healthcare, financial services and social events [4]. By analyzing the sentiment orientation in comments, governments can effectively grasp the trend of public opinion and obtain a basis for policy making; businesses can tap customer feedback on various products to develop more precise marketing strategies; and ordinary consumers can learn other users' opinions about a product and so make better purchasing decisions [5]. In the era of big data, when the analysis of massive text data can only be done by computers, the common challenge faced by researchers in this area is how to improve the accuracy and efficiency of text sentiment analysis.

II. RELATED WORK

The initial steps of text sentiment analysis include data collection and data pre-processing; the latter includes such processes as removal of invalid characters, word tokenization and stop-word filtering. This section reviews the work done in the field of text sentiment analysis from three aspects: sentiment analysis based on sentiment dictionary, sentiment analysis methods based on machine learning, and sentiment analysis methods based on the integration of sentiment dictionary and machine learning.

The associate editor coordinating the review of this manuscript and approving it for publication was Md. Asaduzzaman.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
32280 VOLUME 10, 2022
H. Liu et al.: Study of Application of Weight Distributing Method Combining Sentiment Dictionary and TF-IDF

A. SENTIMENT ANALYSIS BASED ON SENTIMENT DICTIONARY

The most important indicators of sentiments are sentiment words, also called opinion words, which are commonly used to express positive or negative sentiments [4]. Rule-based sentiment dictionary analysis refers to statistically calculating the sentiment weights of sentiment words in a text based on a sentiment dictionary [6]. The main strategy is rule-based, using a dictionary of words marked with sentiment to determine the sentiment of a sentence, which is an unsupervised method [7]. This method is the most intuitive and is similar to the way people recognize sentiments in a text. In sentiment analysis, the more widely used dictionaries are the HowNet sentiment dictionary from CNKI, the affective lexicon ontology of Dalian University of Technology, and the Chinese sentiment dictionary from Taiwan University [8]. Some scholars have explored sentiment tendency analysis based on this method. Haodong and Wenqi [6] improved the average accuracy of sentiment tendency analysis of Chinese microblogs to some extent by combining the HowNet sentiment dictionary and a lexical ontology database, and by incorporating numerous features of emoticons, semantic rules, negation words, degree adverbs and Internet neologisms. Peiyu and Yan [8] constructed a Chinese novel sentiment dictionary based on a lexicon and Word2Vec, including a basic sentiment dictionary, an expanded dictionary and a sentiment-imagery dictionary, and obtained the sentiment tendency of sentences by accumulating and averaging sentiment scores. Xu et al. [9] constructed an extended sentiment dictionary, including basic sentiment words, field sentiment words and polysemous sentiment words, and obtained the sentiment polarity of the text by using the extended sentiment dictionary and the designed sentiment scoring rules. Yu and Egger [10] used the VADER algorithm, a sentiment-scoring method for text based on sentiment lexicons and rules, to explore how tourists feel about safety, service, queues, etc. when visiting popular and crowded attractions.

B. SENTIMENT ANALYSIS BASED ON MACHINE LEARNING

Machine learning, a branch of artificial intelligence (AI) and computer science, generates empirical-data-based models that can make decisions and judgments in new situations [11]. The method is often used to process and analyze large amounts of data and is widely used in finance, healthcare, education, and other fields. In text sentiment analysis, the sentiment analysis task is usually modeled as a classification problem. The researchers convert the text into feature vectors, and then feed these vectors to the model to generate predicted labels for the corresponding vectors [7].

The key to using machine learning for sentiment classification is to select text features suitable for sentiment classification [12], represent the text data with a vector, and then train the model and classify the text using machine learning methods. Many studies have focused on feature selection. For example, Jianzhong et al. [13] used four features, i.e., sentiment word frequency, the polarity of the first occurrence of a sentiment word, emoji polarity, and negation words, to analyze space-related text data collected from the Web, using SVM (support vector machine) and Naive Bayes, respectively. The results are satisfactory in terms of accuracy and recall. Word2vec, a word embedding tool open-sourced by Google in 2013, is one of the most commonly used methods for generating representation vectors of words; its popularity has been attributed to a few initial successes that motivated early adopters to do more, while leaving plenty of room for them to contribute and benefit [14]. Embedding is essentially the representation of words with low-dimensional vectors, in which word vectors that are close to each other correspond to words with similar meanings [15].

After the vectors representing the sentence are obtained from word vectors, the sentence vectors are classified via machine learning. The TF-IDF weighting method is one of the commonly used methods for generating sentence vectors based on word vectors. For example, Thomas and Latha [16] used TF-IDF for feature selection of Kannada text and used a decision tree classifier to classify the text. Soumya and Pramod [17] used methods such as BOW and TF-IDF to form feature vectors. They then used different machine learning techniques, such as Naive Bayes (NB), Support Vector Machines (SVM), and Random Forests (RF), to classify tweets into positive and negative ones. Mukwazvure and Supreethi [18] also used TF-IDF for information weighting to generate sentence vectors and then used SVM and K-nearest neighbors for classification, which achieved good results for text sentiment classification. Gang and Fei [19] proposed to perform sentiment clustering through a voting mechanism over multiple clusterings, which is a kind of unsupervised machine learning. Specifically, each clustering uses the TF-IDF feature weighting method, which overcomes the low accuracy and instability of K-means.

As a new direction of machine learning, deep learning has gradually been widely used in sentiment analysis because it is more complex and accurate than traditional machine learning models. Jane [20] utilized deep learning techniques such as the Doc2Vec algorithm, Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) to extract insights valuable to electric vehicle buyers, marketers and manufacturers. Wang et al. [21] used an unsupervised BERT (Bidirectional Encoder Representations from Transformers) model to classify texts on Sina Weibo into positive, neutral, and negative, and used a TF-IDF model to summarize the topics of posts. Aydin and Gungor [22] proposed a novel neural network framework that combines recurrent and recursive neural network models. Recurrent models propagate information about sentiment labels throughout word sequences; recursive models extract syntactic structure from text. The neural network framework achieved state-of-the-art results and outperformed the baseline study by a significant margin.


C. SENTIMENT ANALYSIS INTEGRATING SENTIMENT DICTIONARY AND MACHINE LEARNING

Although both methods perform well in sentiment analysis, their shortcomings are also obvious. The rule-based sentiment dictionary method relies too much on sentiment words and its rules are not flexible enough, while the machine learning approach does not highlight the important role of sentiment words. To address these drawbacks, some scholars have improved the method of sentence vector generation based on sentiment dictionaries. Yang [23] proposed to assign weights of 2 to sentiment words and 1 to neutral words, and then sum the word vectors obtained from Word2Vec training according to these weights to obtain the corresponding sentence vectors. Hui et al. [12] adjusted the word vectors obtained from Word2Vec training to obtain word vectors that account for both semantic and sentiment tendencies, weighted and summed them by their TF-IDF values to obtain the text vector representation, and then classified text sentiment with machine learning methods. Dashtipour et al. [24] first judged sentence sentiment polarity based on a sentiment dictionary and rules; if the sentence could not be classified, the concatenated fastText embedding of the sentence was fed into a DNN (deep neural network) to determine its polarity. Yang et al. [25] used the BERT model to train word vectors, used a sentiment dictionary to enhance the sentiment features in the text, and then passed the weighted sentiment features through a convolutional layer and a pooling layer for classification. Chiny et al. [26] proposed a hybrid sentiment analysis model based on a Long Short-Term Memory network (LSTM), a rule-based sentiment dictionary (VADER) and the TF-IDF weighting method. Each of the three methods produces a sentiment score; the three scores are then treated as inputs to classification models such as Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM) for sentiment polarity classification.

Currently, the major difficulty facing researchers in the area is how to combine sentiment dictionary and machine learning. To solve this problem, we propose the Sentiment Dictionary Weighting method based on TF-IDF, which combines a sentiment dictionary with pre-trained word vectors so that the resulting sentence vectors retain text information while highlighting the words with sentiment tendencies.

III. RESEARCH METHODOLOGY

A. RULES BASED ON SENTIMENT DICTIONARY

The sentiment dictionary used in this study is a Chinese ontology resource compiled and labeled by the Information Retrieval Research Laboratory of Dalian University of Technology. The resource classifies sentiments into 7 major categories and 20 subcategories, and describes a Chinese word or phrase from different perspectives, including lexical category, sentiment category, sentiment intensity and polarity, etc. [27], [28]. In this study, the 2 major sentiment categories "happy" and "good" are combined into "positive sentiment," and the 5 major sentiment categories "angry," "sad," "scared," "disgusted" and "astonished" are combined into "negative sentiment." The sentiment score of a sentence is obtained from semantic rules and the intensity of the sentiment words: if the score is greater than 0, the sentence is judged as positive sentiment; if the score is less than 0, it is judged as negative sentiment; if the score is equal to 0, it is judged as neutral sentiment.

The rules used in this study include the following three.

Rule 1: When a negative word, e.g., "not" or "without," appears before a sentiment word and there is no punctuation between the two, the sentiment is reversed and the weight of the negative word is −1. For example, in "I'm not happy," the word "happy" is a positive sentiment word with an intensity of 7, so the sentence has a sentiment score of −7 and is judged as negative sentiment.

Rule 2: When there is an adversative conjunction, such as "but" or "however," the sentiment before the conjunction is reversed. For example, take "Although he studied hard, his performance was still unsatisfactory." The strengths of the sentiment words glossed as "someone studies so hard that he forgets to eat and sleep" and "satisfactory" are 7 and 5, respectively. The adversative conjunction "but" gives the former a weight of −1, and the negative word "not" gives "satisfactory" a weight of −1. The final sentiment score of the sentence is −7 − 5 = −12, which is judged as negative sentiment. When a paragraph contains multiple sentences, an adversative conjunction reverses only the sentence it is in rather than the sentiment of all the sentences, so the score needs to be calculated sentence by sentence.

Rule 3: Adverbs of degree, e.g., "a little" or "very," are given different weights. In this study, the degree adverbs are divided into 5 categories, "extremely," "very," "more," "slightly" and "less," which are given weights of 1.5, 1.4, 1.2, 0.8 and 0.5, respectively. When a degree adverb and a negative word appear in the same sentence, the negative word is given a weight of −0.5 if it comes before the degree adverb and a weight of −1.5 if it comes after the degree adverb. For example, in "I'm very unhappy," the strength of the word "happy" is 7, the weight of the degree adverb "very" is 1.4, and the weight of the negative word is −1.5 because it comes after the degree adverb. Therefore, the final score is 7 × 1.4 × (−1.5) = −14.7, which is a negative sentiment.

As the most widely used approach, this method has the advantages of easy operation and of highlighting the role of sentiment words, but it also has many shortcomings: only sentiment words, negative words and related words are considered while other information in the sentence is overlooked, and the rules of classification are rigid and inflexible, among others.
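The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the lexicon entries are hypothetical English glosses (the study uses the Dalian University of Technology ontology), the input is assumed to be already tokenized in the original word order, and modifiers are attached to the next sentiment word in the same clause.

```python
from typing import List

# Hypothetical toy lexicon with English glosses; the study itself uses the
# Dalian University of Technology ontology, whose entries are not shown here.
INTENSITY = {"happy": 7.0, "studious": 7.0, "satisfactory": 5.0}
NEGATIONS = {"not", "without"}
ADVERSATIVES = {"but", "however"}
DEGREE = {"extremely": 1.5, "very": 1.4, "more": 1.2, "slightly": 0.8, "less": 0.5}
PUNCTUATION = {",", ";", ".", "!", "?"}


def score_sentence(tokens: List[str]) -> float:
    """Score one tokenized sentence with Rules 1 to 3."""
    total = 0.0
    pending: List[str] = []  # modifiers waiting for the next sentiment word
    for tok in tokens:
        if tok in PUNCTUATION:
            pending.clear()  # Rule 1: punctuation cuts a modifier off
        elif tok in ADVERSATIVES:
            total = -total  # Rule 2: reverse the sentiment accumulated so far
            pending.clear()
        elif tok in NEGATIONS or tok in DEGREE:
            pending.append(tok)
        elif tok in INTENSITY:
            has_degree = any(m in DEGREE for m in pending)
            degree_seen = False
            weight = 1.0
            for m in pending:
                if m in DEGREE:
                    weight *= DEGREE[m]  # Rule 3: degree adverb weight
                    degree_seen = True
                elif not has_degree:
                    weight *= -1.0  # Rule 1: plain negation
                else:
                    # Rule 3: negation after the adverb is -1.5, before it -0.5
                    weight *= -1.5 if degree_seen else -0.5
            total += weight * INTENSITY[tok]
            pending.clear()
    return total


def polarity(score: float) -> str:
    """Map a sentence score to the three sentiment classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

On token sequences mirroring the three worked examples, `["i", "not", "happy"]`, `["although", "he", "studious", ",", "but", "result", "not", "satisfactory"]` and `["i", "very", "not", "happy"]`, the sketch gives −7, −12 and approximately −14.7, matching the scores in the text.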


FIGURE 1. The framework of weight distribution method proposed in this paper.

B. TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a common weighting technique used in information retrieval and data mining. TF (term frequency) refers to the frequency of a word in a text, and IDF refers to the inverse document frequency. The essential idea underlying TF-IDF is that words that appear more frequently in one document and less in others should be more important because they are more useful for classification [26]. Therefore, it is widely used in keyword extraction, text similarity comparison and topic classification. The TF-IDF algorithm computes a weight for every word in every text of a document. It takes into account the frequency TF of a word's occurrence in a single text and the weight IDF of the word in the entire document [29]. Assuming that there are $N$ texts in one document, the weight of word $w$ in text $s$ is calculated as

$$t_{sw} = tf_w \times idf_w = tf_w \times \log\left(\frac{N}{N_w}\right) \tag{1}$$

where $N_w$ is the number of texts containing the word $w$.

The Institute of Chinese Information Processing of Beijing Normal University provides 36 open-source pre-trained word vector libraries on GitHub [30]. In this study, we use one of these pre-trained Chinese word vector libraries, built from 8599 modern Chinese literary works, in which each word is represented by a 300-dimensional vector of real numbers. The TF-IDF values of the words are used as weights, and the pre-trained word vectors $v_w$ are weighted and summed to obtain a vector of text $s$:

$$v_s = \sum_{w \in s} t_{sw}\, v_w \tag{2}$$

After the sentence vectors are generated, sentiment classification can be performed with tools such as logistic regression, support vector machines, Bayesian classifiers, and neural networks. This is a supervised learning process that requires manually labeling the sentiment of the training texts. This method retains the information of all words and is more flexible in application than the rule-based sentiment dictionary method. However, it has a major deficiency: the weight of each word in sentiment classification is related only to its frequency of occurrence and does not highlight the importance of sentiment words.

C. A NEW METHOD COMBINING SENTIMENT DICTIONARY AND TF-IDF

In order to combine the advantages of the above two methods, we designed a new sentence vector generation method. To both highlight the role of sentiment words in sentiment classification and maintain flexibility, this study tries to improve the TF-IDF vector weighting method. We use the ontology library of sentiment words from Dalian University of Technology and the word vector library provided by the Institute of Chinese Information Processing of Beijing Normal University, and denote the set of all sentiment words by $E$. First, the weight sum of a text is constrained to 1. Then, the weight sum 1 is divided into two parts: the weight $\alpha$ is assigned to all sentiment words and $1-\alpha$ is assigned to all neutral words in the text. Within each part, $\alpha$ is distributed among the sentiment words in proportion to their sentiment strength, and $1-\alpha$ is distributed among the neutral words in proportion to their TF-IDF values. Thus we get the vector equation for text $s$:

$$v_s = \sum_{w \in s} \left[\alpha \frac{d_w}{D_s} I_{w \in E} + (1-\alpha) \frac{t_w}{T_s} I_{w \notin E}\right] v_w \tag{3}$$

Here $d_w$ is the intensity of sentiment word $w$, $t_w$ is the TF-IDF value of $w$ in $s$, $D_s$ is the sum of the intensities of all sentiment words and $T_s$ is the sum of the TF-IDF values of all neutral words in text $s$. $I_C$ is an indicator function that takes 1 when condition $C$ is satisfied and 0 otherwise. The framework of the weight distribution method proposed in this paper is shown in Figure 1.

TABLE 1. The vector computing method for sentiment classification.

As the weight sum of sentiment words, $\alpha$ represents the importance of sentiment words. To highlight the importance of sentiment words, it cannot be too small, but if it is too large, the role of neutral words would be reduced. According to the latent variable generation model of Arora et al. [31], the generation of sentences is considered as a dynamic process, where the sentence vector does a slow random walk in order to generate similar words in the context at moment $t$. This means that the probability of the occurrence of words at moment $t$ is related to the sentences that already exist at moment $t$. Assuming that the sentence vector does not change much as words appear in the sentence, the time-dependent vectors in the sentence can simply be replaced with a single sentence vector [32], i.e.

$$P(w \in s \mid v_s) \propto \exp(v_s \cdot v_w) \tag{4}$$

Thus we make the hypothesis that when the sentence vector of sentence $s$ is known, the probability that word $w$ appears in sentence $s$ is proportional to the exponential of the inner product of the sentence vector and the word vector. To facilitate the calculation, we assume that the intercept term is 0, so

$$P(w \in s \mid v_s) = A \times \exp(v_s \cdot v_w)$$

where $A$ is a constant, and we obtain a log-likelihood function for a sentence

$$
\begin{aligned}
\ln L(w, v_s) &= \ln\Big[\prod_{w \in s} A \times \exp(v_s \cdot v_w)\Big] \propto \sum_{w \in s} v_s \cdot v_w \\
&= \sum_{w \in s} \sum_{w' \in s} \left[\alpha \frac{d_{w'}}{D_s} I_{w' \in E} + (1-\alpha) \frac{t_{w'}}{T_s} I_{w' \notin E}\right] v_{w'} \cdot v_w \\
&= \alpha \sum_{w \in s} \sum_{w' \in s_1} \frac{d_{w'}}{D_s}\, v_{w'} \cdot v_w + (1-\alpha) \sum_{w \in s} \sum_{w' \in s_2} \frac{t_{w'}}{T_s}\, v_{w'} \cdot v_w
\end{aligned}
$$
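As a rough numerical illustration of Equations (1) to (3) and of the decomposition just derived, the following Python sketch computes TF-IDF values, distributes the weight α over sentiment words and 1 − α over neutral words, builds the sentence vector, and checks that the log-likelihood term splits into the two double sums. Everything about the data is an assumption made for the example: the three tokenized texts, the two-dimensional word vectors, the single lexicon entry, the use of relative frequency for tf, the natural logarithm, and treating each distinct word once. It is not the authors' code.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def tf_idf(text: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Equation (1): tf_w * log(N / N_w) for every distinct word of `text`."""
    n_texts = len(corpus)
    values = {}
    for w in dict.fromkeys(text):  # distinct words, original order
        tf = text.count(w) / len(text)  # assumption: relative frequency
        n_w = sum(1 for other in corpus if w in other)
        values[w] = tf * math.log(n_texts / n_w)
    return values


def distribute_weights(tfidf: Dict[str, float], intensity: Dict[str, float],
                       alpha: float) -> Dict[str, float]:
    """Bracketed term of Equation (3) for every word of the text."""
    d_s = sum(intensity[w] for w in tfidf if w in intensity)
    t_s = sum(t for w, t in tfidf.items() if w not in intensity)
    weights = {}
    for w, t in tfidf.items():
        if w in intensity:
            weights[w] = alpha * intensity[w] / d_s
        else:
            weights[w] = (1 - alpha) * t / t_s  # assumes t_s > 0
    return weights


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def sentence_vector(weights: Dict[str, float], vectors: Dict[str, Vector]) -> Vector:
    """Equation (3): weighted sum of the word vectors."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for w, weight in weights.items():
        for i, x in enumerate(vectors[w]):
            out[i] += weight * x
    return out


def likelihood_sums(tfidf: Dict[str, float], intensity: Dict[str, float],
                    vectors: Dict[str, Vector]) -> Tuple[float, float]:
    """The two double sums in the last line of the derivation."""
    senti_share = distribute_weights(tfidf, intensity, 1.0)    # d_w / D_s
    neutral_share = distribute_weights(tfidf, intensity, 0.0)  # t_w / T_s
    words = list(tfidf)
    s_sum = sum(senti_share[w2] * dot(vectors[w2], vectors[w])
                for w in words for w2 in words)
    n_sum = sum(neutral_share[w2] * dot(vectors[w2], vectors[w])
                for w in words for w2 in words)
    return s_sum, n_sum


# Hypothetical toy data.
corpus = [["i", "am", "happy", "today"], ["rain", "today"], ["i", "like", "rain"]]
intensity = {"happy": 7.0}
vectors = {"i": [0.1, 0.3], "am": [0.2, 0.1], "happy": [0.9, 0.4], "today": [0.3, 0.2]}
alpha = 0.75

tfidf = tf_idf(corpus[0], corpus)
weights = distribute_weights(tfidf, intensity, alpha)
v_s = sentence_vector(weights, vectors)
log_lik = sum(dot(v_s, vectors[w]) for w in tfidf)
sum_sentiment, sum_neutral = likelihood_sums(tfidf, intensity, vectors)
assert abs(log_lik - (alpha * sum_sentiment + (1 - alpha) * sum_neutral)) < 1e-9
```

The final check confirms that the log-likelihood term is linear in α, so only the relative size of the two sums matters when a value of α is chosen.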


FIGURE 2. Sentiment distribution in the text data.

In the log-likelihood decomposition, $s_1$ is the set of sentiment words in $s$ and $s_2$ is the set of neutral words in $s$. We define:

$$a = \sum_{w \in s} \sum_{w' \in s_1} \frac{d_{w'}}{D_s}\, v_{w'} \cdot v_w, \qquad b = \sum_{w \in s} \sum_{w' \in s_2} \frac{t_{w'}}{T_s}\, v_{w'} \cdot v_w \tag{5}$$

It can be seen that, if $a < b$, the smaller $\alpha$ is, the larger the likelihood function, which means that for the whole sentence structure, neutral words are more important than sentiment words; given the importance of sentiment words in sentiment classification, sentiment words and neutral words can be considered as equally important, so $\alpha$ is taken as 0.5. When $a > b$, the larger $\alpha$ is, the larger the likelihood function, which means that for the whole sentence structure, sentiment words are more important than neutral words. Considering the importance of sentiment words in sentiment classification, the weight of sentiment words can be enlarged, but it cannot be taken as 1, otherwise the role of neutral words will be completely eliminated. So $\alpha$ takes 0.75, the middle value between 0.5 and 1. In summary, the following formula is used to determine the value taken in the text $s$.

$$
\alpha_s =
\begin{cases}
0.75 & \text{if } a > b \\
0.5 & \text{if } a < b \\
0 & \text{if } s_1 = \emptyset
\end{cases} \tag{6}
$$

The vector computing method for sentiment classification is shown in Table 1.

IV. EMPIRICAL STUDY

A. DATA SOURCES

In the experiment, Yu Hua's novel "Cries in the Drizzle," which includes 16 chapters, was used as the domain classification corpus. We utilized leave-four-chapters-out cross-validation to compare the predictive power of the three methods mentioned: each experiment uses four chapters as the test set and the remaining 12 chapters as the training set, repeating four times altogether. The sentiment of the novel text was manually marked. Because one assumption that researchers often make about sentence-level analysis is that a sentence expresses a single sentiment from a single opinion holder [4], long, emotionally rich paragraphs were segmented so that each segment contained only one kind of sentiment as far as possible. A total of 2079 sentences were obtained, including 486 neutral sentences, 535 sentences with positive sentiment and 1058 sentences with negative sentiment. There are 503 sentences in chapters 1-4, 478 sentences in chapters 5-8, 363 sentences in chapters 9-12, and 735 sentences in chapters 13-16. Figure 2 shows the distribution of sentiment in the novel. It can be found that the sentiment distribution of each part of the text is almost the same, and that negative sentences outnumber positive and neutral sentences, accounting for about half of the total, which is in line with the tragic characteristics of loneliness, fear and sadness of the novel.

B. EVALUATION METRICS

All three methods determine the sentiment category (negative sentiment, neutral sentiment, and positive sentiment) of each sentence. For the overall performance evaluation, accuracy is used:

$$\text{accuracy} = \frac{\text{number of texts correctly classified}}{\text{total number of texts}}$$

In order to show the details of the three methods more clearly, we use precision, recall and F1 score as evaluation criteria, where:

$$\text{precision} = \frac{\text{number of texts correctly classified into the category}}{\text{number of texts classified into the category}}$$

$$\text{recall} = \frac{\text{number of texts correctly classified into the category}}{\text{total number of texts in the category}}$$

$$\text{F1 score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

FIGURE 3. Comparison of accuracy of three sentiment classification methods.

TABLE 2. Comparison of the precision of three sentiment classification methods.

C. COMPARISON OF THE CLASSIFICATION PERFORMANCE OF THE THREE SENTIMENT ANALYSIS METHODS

In order to test the effectiveness of the above three methods for sentiment classification in novels, they were applied to the leave-four-chapters-out cross-validation for sentiment classification, and the classification performance of the three methods was measured using the above criteria. A support vector machine (SVM) is used to classify sentiment from the sentence vectors obtained by the TF-IDF weighting method and the method of this paper. SVM is a linear classifier with maximum margin defined on the feature space [33], and the basic idea is to find a maximum-margin separating hyperplane in the sample space based on the training set to separate different classes of samples [11]. Define a hyperplane by

$$\{x : f(x) = x^T \beta + \beta_0 = 0\}$$

The optimization problem is

$$\min_{\beta, \beta_0} \ \frac{1}{2} \lVert \beta \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } \xi_i \ge 0,\ y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i \ \ \forall i$$
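To make the soft-margin objective above concrete, here is a minimal Python sketch that evaluates it for a fixed hyperplane. For given β and β0, the smallest slack satisfying both constraints is ξ_i = max(0, 1 − y_i(x_i^T β + β0)), which is what the code computes. The four two-dimensional points, the labels and the chosen β, β0 and C are made-up values for illustration; the sketch evaluates the objective only and does not solve the optimization problem.

```python
from typing import List

Vector = List[float]


def decision_value(x: Vector, beta: Vector, beta0: float) -> float:
    """f(x) = x^T beta + beta0."""
    return sum(xj * bj for xj, bj in zip(x, beta)) + beta0


def slacks(X: List[Vector], y: List[int], beta: Vector, beta0: float) -> List[float]:
    """Smallest feasible slack per sample: max(0, 1 - y_i * f(x_i))."""
    return [max(0.0, 1.0 - yi * decision_value(xi, beta, beta0))
            for xi, yi in zip(X, y)]


def soft_margin_objective(X: List[Vector], y: List[int], beta: Vector,
                          beta0: float, C: float) -> float:
    """0.5 * ||beta||^2 + C * sum of slacks."""
    return 0.5 * sum(b * b for b in beta) + C * sum(slacks(X, y, beta, beta0))


# Hypothetical toy data: labels are +1 / -1.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [0.5, 1.0]]
y = [1, -1, 1, -1]
beta, beta0, C = [1.0, 0.0], 0.0, 10.0

print(slacks(X, y, beta, beta0))                    # [0.0, 0.0, 0.5, 1.5]
print(soft_margin_objective(X, y, beta, beta0, C))  # 20.5
```

The first two points lie outside the margin and need no slack, the third lies inside the margin on the correct side (slack 0.5), and the fourth is misclassified (slack above 1), which is the case the penalty C is meant to discourage.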


TABLE 3. Comparison of the recall of three sentiment classification methods.

TABLE 4. Comparison of the F1 score of three sentiment classification methods.
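The per-class figures compared in Tables 2 to 4 rest on the precision, recall and F1 definitions of Section IV-B. A minimal Python sketch of those definitions follows; the six labels are invented for illustration and are not taken from the experiment, and returning 0 when a denominator is zero is an assumption the paper does not discuss.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of texts whose predicted class equals the labeled class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_scores(y_true: List[str], y_pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 score for one sentiment category."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # classified into the category
    actual = sum(1 for t in y_true if t == label)     # truly in the category
    precision = hits / predicted if predicted else 0.0  # assumption: 0 if undefined
    recall = hits / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for six sentences.
y_true = ["neg", "neg", "neg", "pos", "pos", "neu"]
y_pred = ["neg", "neg", "pos", "pos", "neu", "neu"]

for label in ("neg", "neu", "pos"):
    print(label, per_class_scores(y_true, y_pred, label))
print("accuracy", accuracy(y_true, y_pred))
```

In a leave-four-chapters-out setting, these functions would be applied to each of the four test folds in turn.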

In the SVM optimization problem, the slack variable $\xi_i$ is the proportional amount by which the prediction $f(x_i) = x_i^T \beta + \beta_0$ is on the wrong side of its margin [34]. The cost parameter $C$ represents the penalty for misclassification. A kernel function is used in nonlinear classification tasks. The basic idea is to use a transformation to map the sample points to a new space in which they become linearly separable [30]. The most popular choice of kernel function in the SVM literature is the radial basis function

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

To maximize the predictive ability of the model, parameter selection is performed first. A grid search was used for the parameters $C$ and $\gamma$ of the SVM, with 5-fold cross-validation for each combination of the two parameters: the training set is randomly divided into five subsets, four subsets are used for model training, and the remaining one is used for testing. This process is repeated five times, and the average of the five results is taken to represent the classification performance of that parameter combination. We found that the optimal parameter combination for both the TF-IDF method and the method of this paper is C = 10 and γ = 0.00. The model was trained with the training set and sentiment classification was done on the test set; Figure 3 compares the accuracy of the three sentiment classification methods.

As can be seen from Figure 3, the performance of the rule-based sentiment dictionary method is not satisfactory, with accuracy between 60% and 70%. The method proposed in this paper performed the best in all four experiments, with accuracy above 80%. The mean accuracies of the three methods are 68.2%, 74.4% and 82.1%, respectively. Table 2, Table 3, and Table 4 compare the precision, recall and F1 score of the experimental results of leave-four-chapters-out cross-validation.

From the results of our experiment, the rule-based sentiment dictionary method has a significant gap in classification effect compared with the other two methods; the TF-IDF method, which does not consider sentiment tendency, also has significant shortcomings for sentiment analysis. The method proposed in the paper combines the advantages of the two, and the classification effect shows an obvious improvement: the mean accuracy is 13.9 percentage points higher than that of the rule-based sentiment dictionary method and 7.7 percentage points higher than that of the TF-IDF weighting method. The good classification results of the proposed method illustrate that it fully extracts the sentiment information in the text and verify its effectiveness in sentiment analysis.

V. CONCLUSION

In this paper, a sentence vector generation method based on sentiment dictionaries and pre-trained word vectors is proposed for sentiment classification, which calculates the weights of sentiment words and neutral words in a sentence separately, and retains the overall information of the sentence while highlighting the sentiment words. A supervised machine learning method is used to determine the sentiment polarity of the text (in this paper, SVM is used for classification), and the classification effect is compared with the commonly used rule-based sentiment dictionary and TF-IDF weighting methods. Yu Hua's novel Cries in the Drizzle was used as the domain classification corpus. From the experimental result, the accuracy of text sentiment analysis using the method proposed in this paper reaches 82.1%, which is 13.9% and

[6] Z. Haodong and L. Wenqi, "Chinese micro-blog emotional analysis method based on semantic rules and expression weighting," J. Light Ind., vol. 35, no. 2, pp. 74–82, 2020.
[7] P. Sudhir and V. D. Suresh, "Comparative study of various approaches, applications and classifiers for sentiment analysis," Global Transitions Proc., vol. 2, no. 2, pp. 205–211, Nov. 2021.
[8] S. Peiyu and X. Yan, "Research on multi-feature fusion method for sentiment analysis of Chinese microblog," Electron. World, vol. 2, pp. 20–21, Feb. 2018.
[9] G. Xu, Z. Yu, H. Yao, F. Li, Y. Meng, and X. Wu, "Chinese text sentiment analysis based on extended sentiment dictionary," IEEE Access, vol. 7, pp. 43749–43762, 2019.
[10] J. Yu and R. Egger, Tourist Experiences at Overcrowded Attractions: A Text Analytics Approach. Springer, 2021, pp. 231–243.
[11] Z. Zhihua, Machine Learning. Beijing, China: Tsinghua Univ., 2016.
[12] D. Hui, X. Xueke, W. Dayong, L. Yue, Y. Zhihua, and C. Xueqi, "A sentiment classification method based on sentiment-specific word embedding," J. Chin. Inf. Process., vol. 31, no. 3, pp. 170–176, 2017.
[13] X. Jianzhong, Z. Jun, Z. Rui, Z. Liang, H. Liang, and L. Jiaojiao, "Sentiment analysis of aerospace microblog using SVM," J. Inf. Secur. Res., vol. 3, no. 12, pp. 1129–1133, 2017.
[14] K. W. Church, "Word2Vec," Nat. Lang. Eng., vol. 23, no. 1, pp. 155–162, 2017, doi: 10.1017/S1351324916000334.
[15] C. Deguang, M. Jinlin, M. Ziping, and Z. Jie, "Review of pre-training techniques for natural language processing," J. Frontiers Comput. Sci. Technol., vol. 15, no. 8, pp. 1359–1389, 2021.
[16] V. Rohini, M. Thomas, and C. A. Latha, "Domain based sentiment analysis in regional Language-Kannada using machine learning algorithm," in Proc. IEEE Int. Conf. Recent Trends Electron., Inf. Commun. Technol. (RTEICT), May 2016, pp. 503–507.
[17] S. Soumya and K. V. Pramod, "Sentiment analysis of Malayalam tweets using machine learning techniques," ICT Exp., vol. 6, no. 4, pp. 300–305, Dec. 2020.
[18] A. Mukwazvure and K. P. Supreethi, "A hybrid approach to sentiment analysis of news comments," in Proc. 4th Int. Conf. Rel., Infocom Technol. Optim. (ICRITO) (Trends Future Directions), Sep. 2015, pp. 1–6.
[19] G. Li and F. Liu, "A clustering-based approach on sentiment analysis," in
7.7% higher respectively than the other two methods. Proc. IEEE Int. Conf. Intell. Syst. Knowl. Eng., Nov. 2010, pp. 331–337.
It must be admitted that there are still some limitations and [20] R. Jena, ‘‘An empirical case study on Indian consumers’ sentiment towards
shortcomings in the proposed method, which needs further electric vehicles: A big data analytics approach,’’ Ind. Marketing Manage.,
vol. 90, pp. 605–616, Oct. 2020.
improvemen. The method proposed in this paper cannot break
[21] T. Wang, K. Lu, K. P. Chow, and Q. Zhu, ‘‘COVID-19 sens-
the boundary of single sentence, which cuts the connec- ing: Negative sentiment analysis on social media in China via
tion between text and context, which hinders the study of BERT model,’’ IEEE Access, vol. 8, pp. 138162–138169, 2020, doi:
paragraph and discourse comprehension. Second, sentiment 10.1109/ACCESS.2020.3012595.
[22] C. R. Aydin and T. Gungor, ‘‘Combination of recursive and recurrent neural
analysis at the sentence level is often insufficient for appli- networks for aspect-based sentiment analysis using inter-aspect relations,’’
cations because they do not identify opinion targets or assign IEEE Access, vol. 8, pp. 77820–77832, 2020.
sentiments to such targets [4]. In addition, it can be seen from [23] G. Yang, ‘‘Research and application of sentiment analysis based on
Word2Vec method,’’ Xiamen Univ., Xiamen, China, Tech. Rep., 2019.
the classification results that the classification results of the [24] K. Dashtipour, M. Gogate, J. Li, F. Jiang, B. Kong, and A. Hussain,
three methods are significantly better in negative sentiment ‘‘A hybrid Persian sentiment analysis framework: Integrating depen-
than neutral and positive sentiment. The reason for this phe- dency grammar based rules and deep neural networks,’’ Neurocomputing,
vol. 380, pp. 1–10, Mar. 2020.
nomenon is that the number of positive sentiment sentences is [25] L. Yang, Y. Li, J. Wang, and R. S. Sherratt, ‘‘Sentiment analysis for
more than the negative and neutral sentences. It indicates that e-commerce product reviews in Chinese based on sentiment lexicon
the classification performance is interfered by the imbalance and deep learning,’’ IEEE Access, vol. 8, pp. 23522–23530, 2020, doi:
10.1109/ACCESS.2020.2969854.
of text sentiment. [26] M. Chiny, M. Chihab, O. Bencharef, and Y. Chihab, ‘‘LSTM, VADER and
TF-IDF based hybrid sentiment analysis model,’’ Int. J. Adv. Comput. Sci.
REFERENCES Appl., vol. 12, no. 7, p. 2021, 2021, doi: 10.14569/IJACSA.2021.0120730.
[27] X. Lin-Hong, L. Hong-Fei, and Z. Jing, ‘‘Construction and analysis of
[1] F. Zhi-wei, ‘‘Academic position of natural language processing,’’ J. PLA
emotional corpus,’’ J. Chin. Inf. Process., vol. 22, no. 1, pp. 116–122,
Univ. Foreign Languagesguage Process., vol. 3, no. 28, pp. 1–8, 2005.
2008.
[2] J. Eisenstein, Natural Language Processing. Cambridge, MA, USA: MIT [28] L. Xu, H. Lin, Y. Pan, H. Ren, and J. Chen, ‘‘Constructing the affective
Press, 2018. lexicon ontology,’’ J. China Soc. Sci. Tech. Inf., vol. 27, no. 2, pp. 180–185,
[3] W. Ting and Y. Wenzhong, ‘‘Review of text sentiment analysis methods,’’ 2008.
Comput. Eng. Appl., vol. 57, no. 12, pp. 11–24, 2021. [29] T. MingZ Lei and Z. Xian-chun, ‘‘Document vector representation
[4] B. Liu, Sentiment Analysis and Opinion Mining. San Rafael, CA, USA: based on Word2CVec,’’ Comput. Sci., vol. 6, no. 43, pp. 214–217,
Morgan & Claypool, 2012. 2016.
[5] C. Long, G. Ziyu, H. Jianhong, and P. Jinye, ‘‘A survey on sentiment [30] S. Li, Z. Zhao, R. Hu, W. Li, T. Liu, and X. Du, ‘‘Analogical
classification,’’ J. Comput. Res. Develop., vol. 54, no. 6, pp. 1150–1170, reasoning on Chinese morphological and semantic relations,’’ 2018,
2017. arXiv:1805.06504.


[31] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “A latent variable model approach to PMI-based word embeddings,” 2015, arXiv:1502.03520.
[32] S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embeddings,” in Proc. 5th Int. Conf. Learn. Represent., 2017, pp. 1–16.
[33] L. Hang, Statistical Learning Methods. Beijing, China: Tsinghua Univ., 2019.
[34] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Cham, Switzerland: Springer, 2009.

HAO LIU was born in Chifeng, Inner Mongolia, China, in 1985. He received the B.S. degree from Beijing Normal University, in 2008, the M.S. degree from Kent University, U.K., in 2010, and the Ph.D. degree in statistics from Beijing Normal University, in January 2017. He has been working with the Collaborative Innovation Center for Basic Education Quality Monitoring, Beijing Normal University, China, as a Lecturer and a Master’s Supervisor. His research interests include basic education quality impact factors, educational data mining, and other areas.

XI CHEN was born in Linxia, Gansu, China, in 1997. He received the B.S. degree from the South China University of Technology, in 2016. He is currently pursuing the master’s degree with the Beijing Normal University, Beijing, China. His current research interests include machine learning, sentiment analysis, and model interpretability.

XIAOXIAO LIU was born in Linyi, Shandong, China, in 1995. She received the B.S. degree in management from the University of Jinan, in 2018, and the master’s degree from Beijing Normal University, in 2021. She is currently pursuing the master’s degree in educational psychology with the University of Alberta, Edmonton, Canada. Her research interests include data mining, machine learning, and natural language processing.
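
As an illustration of the idea summarized in the conclusion (weighting sentiment words and neutral words separately before combining pre-trained word vectors into a sentence vector), a minimal sketch follows. It is not the authors' implementation, and it does not reproduce the weighting formula defined earlier in the paper. The function name `sentence_vector`, the parameter `alpha` (a fixed share of the total weight reserved for sentiment words), and the toy dictionaries are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def sentence_vector(tokens, word_vectors, sentiment_intensity, idf, alpha=0.6):
    """Return (vector, weights) for one tokenized sentence.

    Assumption (not from the paper): sentiment words share a fixed fraction
    `alpha` of the total weight in proportion to their non-negative dictionary
    intensity, and neutral words share the remaining 1 - alpha in proportion
    to their TF-IDF score.
    """
    tokens = [t for t in tokens if t in word_vectors]
    if not tokens:
        raise ValueError("no token has a word vector")

    tf = Counter(tokens)
    # Raw scores: dictionary intensity for sentiment words, TF-IDF for the rest.
    senti = {t: sentiment_intensity[t] * tf[t]
             for t in tf if t in sentiment_intensity}
    neutral = {t: tf[t] / len(tokens) * idf.get(t, 1.0)
               for t in tf if t not in sentiment_intensity}

    # If one group is empty, the other group receives all of the weight.
    senti_share = alpha if neutral else 1.0
    neutral_share = 1.0 - alpha if senti else 1.0

    weights = {}
    for group, share in ((senti, senti_share), (neutral, neutral_share)):
        total = sum(group.values())
        for t, score in group.items():
            # Fall back to a uniform split if every score in the group is zero.
            weights[t] = share * (score / total if total > 0 else 1.0 / len(group))

    # Weighted sum of the word vectors; the weights add up to 1.
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for t, w in weights.items():
        for i, x in enumerate(word_vectors[t]):
            vec[i] += w * x
    return vec, weights


if __name__ == "__main__":
    # Toy two-dimensional "pre-trained" vectors and a one-word sentiment dictionary.
    toy_vectors = {"rain": [1.0, 0.0], "sad": [0.0, 1.0], "day": [1.0, 1.0]}
    toy_intensity = {"sad": 5.0}
    toy_idf = {"rain": 1.0, "day": 1.0}
    print(sentence_vector(["rain", "sad", "day"], toy_vectors, toy_intensity, toy_idf))
```

Sentence vectors built this way could then be passed to any SVM implementation for polarity classification, as the paper does. Reserving a fixed share for sentiment words is just one simple way to keep them prominent without discarding the rest of the sentence.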

