Combining Lexical and Semantic Features For Short Text Classification
Procedia Computer Science 22 (2013) 78 – 86
Abstract
In this paper, we propose a novel approach to classifying short texts by combining their lexical and semantic features. We present an improved measure for lexical feature selection and obtain the semantic features from a background knowledge repository that covers the target category domains. The combination of lexical and semantic features is achieved by mapping words to topics with different weights. In this way, the dimensionality of the feature space is reduced to the number of topics. We use Wikipedia as background knowledge and employ a Support Vector Machine (SVM) as the classifier. Experimental results show that our approach is more effective than existing methods for classifying short texts.
© 2013 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of KES International.
Keywords: Short text; Topic model; Wikipedia; Feature selection
1. Introduction
Text classification plays a very important role in many application domains. With the widespread use of web applications such as social networks and online review systems, we are confronted with many more short texts and news items every day. Traditional text mining methods have limitations for the automatic classification of short texts, owing to the word sparseness of short texts, the lack of context information and informal sentence expression.
A common way to overcome these problems when classifying short texts is to enrich the original texts with additional information. One approach is to employ search engines and use the search results to expand the related contextual content [1, 2, 3]. The other is to use external repositories (e.g., Wikipedia and the Open Directory Project) as background knowledge [4, 5, 6, 7]. Although both methods improve short text classification to some extent, naively expanding the original texts introduces a large amount of unrelated and noisy information.
Probabilistic latent topic models [6, 8, 9, 10] have been used effectively in text mining. The basic idea of these models is to learn topics from domain-related datasets and to assume that each text is a multinomial distribution over these topics. As the number of topics is relatively small, the dimensionality of each text becomes low and the vector space of texts is no longer sparse. We observe, however, that the probabilities of all topics are non-zero, because these models must ensure that each text has a probability of being generated by any of the topics. This means that every text is related, to some degree, to every topic. In real applications, however, a text is usually bound up with a small number of topics and often has no relation with the others at all. Applying only the topic distribution therefore has obvious limitations, especially when dealing with short texts.
In this paper, we propose a topic-model-based approach which combines both lexical and semantic features to avoid the aforementioned limitations of short text classification. Like some existing methods, we employ a background knowledge repository to learn topics with respect to all target categories. After we obtain all topics from the repository, we assign each word of a short text to the learned topics by means of a Gibbs sampling method. That is, we map each word occurrence to a particular topic and then represent the short text with these mapped topics instead. In this way, the words in a short text may be mapped to a few, but not all, topics. Additionally, with respect to the discriminative capacity of words, we adopt different mapping weights. For words coherent with a particular category, we assume that the topics to which these words are assigned are more closely related to the target category. We thus present an expected cross entropy method based on lexical evidence for measuring the discriminative capacity of words in short texts. We evaluate the performance of our proposed approach on both the GoogleSnippet and Ohsumed datasets using Wikipedia as background knowledge. The experimental results show that our approach achieves better effectiveness compared to existing methods.
The remainder of this paper is organized as follows. Section 2 introduces the background and related work. Section 3 presents our proposed approach in detail. Section 4 shows the experiments and result analysis on two real-world datasets. Section 5 gives a discussion and Section 6 draws the concluding remarks.
2. Related Work
One of the main challenges of text classification is the high dimensionality of the feature space, which not only leads to high computational complexity but is also prone to overfitting. Plenty of feature selection measures have been put forward over the years to reduce dimensionality, such as term frequency-inverse document frequency (TF-IDF), information gain (IG), mutual information (MI) and expected cross entropy (ECE) [11]. Documents are then represented by the selected features. By applying a classification model (e.g., K-Nearest Neighbor, Naive Bayes or Support Vector Machine) to the training set, we obtain a classifier that can be employed to predict the category labels of future unseen documents. This type of classification method is called lexical-based classification.
Semantic-based text classification sprang up after topic models became popular for semantic analysis. [8] and [9] reduced the dimensionality of the feature space of a document to the number of topics by using the topic distribution parameters of each document, and then combined them with traditional classifiers to achieve classification. [12] associated each document with a single categorical label corresponding to a topic. [13] proposed a novel cross-domain text classification algorithm which extends the traditional PLSA algorithm to integrate both labeled and unlabeled data into a unified probabilistic model. [14] designed a one-to-one mapping between topics and labels which can be applied to multi-label classification.
While both aforementioned types of classification, lexical-based and semantic-based, are comparatively suitable for classifying long texts, problems arise when dealing with short texts, which are constantly emerging nowadays. [15] proposed a method for extracting important topic words from a blog by measuring whether the blog includes rich content, achieved by comparing the web search results of the candidate words with the content of the blog. [1] presented a contextual vector to represent each short text by using the L2 normalization of the centroid of all results returned by a search engine. [16] utilized TAGME, a powerful tool that identifies meaningful phrases for annotating short and poorly composed texts, so as to help the understanding of short texts. [17] proposed a topic-based similarity measurement method that selects feature words based on both the lexical weight and the relationships of the topics the words belong to. [10] analyzed short texts by assuming each short text is associated with one certain topic.
Another line of short text mining techniques cooperates with external repositories in combination with topic models. [4] presented a novel approach to cluster short text messages via transfer learning from auxiliary longer textual data and applied a topic model that assumes short texts and auxiliary texts have different generative processes. [5] presented a "universal dataset" based hidden topic analysis method which integrates topics and words by appending the topics inferred for each text to the word feature vectors when building a classifier.
In this paper, we propose a novel way to combine lexical and semantic features for classifying short texts while keeping the dimensionality of the feature space low. In our approach, short texts are considered to be related to only a small number of topics, which overcomes the limitation of traditional semantic-based classification methods.
3. Proposed Approach
Here we present our approach in detail. The main process of our approach is as follows.
(1) Choose a credible external repository and extract longer documents related to the target categories as background knowledge.
(2) Apply a topic model to these longer documents to learn a certain number of topics.
(3) Select discriminative feature words using our improved expected cross entropy as the measure.
(4) Map the weighted words of short texts to corresponding topics as the vector representations of short texts.
(5) Train the classification model on labeled data, as sketched below.
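For concreteness, a minimal Python sketch of this pipeline is given below. The helper functions learn_background_topics, select_feature_words and to_topic_vector are hypothetical placeholders for steps (1)-(4), which are detailed in the remainder of this section; the classifier is a linear SVM, as used in our experiments.

```python
# Minimal sketch of the overall pipeline (steps (1)-(5)); the helpers named
# below are hypothetical placeholders for the procedures detailed later.
from sklearn.svm import LinearSVC

def classify_short_texts(background_docs, train_texts, train_labels, test_texts,
                         num_topics=100, top_n=200):
    # (1)-(2) learn topics (phi: K x V topic-word distributions) from the
    #         background knowledge repository with LDA
    phi, word_index = learn_background_topics(background_docs, num_topics)

    # (3) select the top-N discriminative words per category with the
    #     improved expected cross entropy (M-ECE) measure
    feature_words = select_feature_words(train_texts, train_labels, top_n)

    # (4) represent every short text as a weighted topic vector
    X_train = [to_topic_vector(t, phi, word_index, feature_words) for t in train_texts]
    X_test = [to_topic_vector(t, phi, word_index, feature_words) for t in test_texts]

    # (5) train a linear SVM on the labeled topic vectors and predict
    clf = LinearSVC()
    clf.fit(X_train, train_labels)
    return clf.predict(X_test)
```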
In our approach, the topics related to the target categories are learned from a background knowledge repository. The choice of the repository is of great importance because its content should be abundant enough to cover the topics related to the categories as much as possible. After collecting related long texts, we apply a topic model to learn topics from the background knowledge dataset.
Latent Dirichlet Allocation (LDA) [8] is a generative probabilistic model for learning semantic topics from a corpus. The basic idea is that documents are represented as multinomial distributions over latent topics, while each topic is characterized by a multinomial distribution over words. In the generative process, each topic k draws a word distribution φ_k ∼ Dirichlet(β), each document m draws a topic distribution θ_m ∼ Dirichlet(α), and every word position of m first draws a topic z from θ_m and then draws the word from φ_z. After Gibbs sampling on the background corpus, the topic-word distribution is estimated as
\phi_{kt} = \frac{n_{kt} + \beta}{n_k + V\beta} \qquad (1)
where n_{kt} is the number of times word t is assigned to topic k, n_k is the total number of words assigned to topic k, and V is the vocabulary size; β is the hyper-parameter introduced in the generative process.
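For concreteness, the estimate in Formula (1) can be computed directly from the word-topic assignment counts of the background corpus. The following is a minimal sketch; the count matrix and the value of β are assumed inputs.

```python
import numpy as np

def estimate_phi(n_kt, beta=0.01):
    """Topic-word distributions phi via Formula (1).

    n_kt: K x V matrix, n_kt[k, t] = number of times word t is assigned to topic k.
    """
    K, V = n_kt.shape
    n_k = n_kt.sum(axis=1, keepdims=True)     # total words assigned to each topic
    return (n_kt + beta) / (n_k + V * beta)   # Formula (1)
```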
Expected cross entropy (ECE) is a feature selection measure that considers both the word frequency and the relationship between word and category. The bigger the ECE value, the larger the impact of the corresponding word for classification purposes. The ECE value of word w is usually calculated as:
f(w) = p(w) \sum_i p(C_i \mid w) \log \frac{p(C_i \mid w)}{p(C_i)} \qquad (2)
The top-N distinctive words for each category are selected to represent lexical features. For these feature
words, we use different mapping weights when combining with semantic features in our subsequent procedure.
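As an illustration, the standard ECE of Formula (2) can be computed from word-category co-occurrence counts as sketched below; the improved M-ECE measure used in our experiments additionally keeps the per-category terms F(w, C_i). The count matrix and the smoothing constant are assumptions of this sketch.

```python
import numpy as np

def ece_scores(counts, eps=1e-12):
    """Expected cross entropy of every word, following Formula (2).

    counts: (num_words x num_categories) matrix of word occurrence counts per
    category; probabilities are estimated by relative frequencies.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_w = counts.sum(axis=1) / total                                # p(w)
    p_c = counts.sum(axis=0) / total                                # p(C_i)
    p_c_w = counts / (counts.sum(axis=1, keepdims=True) + eps)      # p(C_i | w)
    return p_w * (p_c_w * np.log((p_c_w + eps) / (p_c + eps))).sum(axis=1)

# e.g. indices of the 200 most discriminative words:
# top_words = np.argsort(-ece_scores(counts))[:200]
```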
We now present the way to obtain the semantics of a text and reduce the dimensionality of the feature space to the number of topics. We then show how to combine lexical and semantic features to represent a short text while keeping the dimensionality of the feature space unchanged.
We first map the words of short texts to the learned topics, adopting a Gibbs sampling approach. For every word in each text, we iteratively use Formula (5) to assign a topic to it.
p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}_{\neg i}, \cdot) \propto \frac{n_{mk}^{\neg i} + \alpha}{n_m + K\alpha} \cdot \phi_{kt} \qquad (5)
where n_{mk} represents the number of words assigned to topic k in document m, and n_m represents the length of document m. ¬i means not including the word currently being processed, t represents the word in the i-th position of document m, and φ_{kt} is obtained from the topic learning on the background knowledge and calculated by (1). K is the number of topics and α is the hyper-parameter of the LDA generative process. As we can see, the assignment of a word to a topic is influenced by both the context information of the text and the topic-word distribution φ obtained from the background knowledge.
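A minimal sketch of this assignment step is given below; it holds the background topic-word distribution φ fixed and resamples only the topic of each word of the short text according to Formula (5). The hyper-parameter value and iteration count are illustrative assumptions.

```python
import numpy as np

def assign_topics(doc, phi, alpha=0.5, iterations=50, seed=0):
    """Assign a topic to every word of one short text via Gibbs sampling.

    doc: list of word indices into the background vocabulary.
    phi: K x V topic-word distributions learned from the background corpus.
    """
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    n_m = len(doc)
    z = rng.integers(K, size=n_m)                        # random initial topics
    n_mk = np.bincount(z, minlength=K).astype(float)     # topic counts in this text
    for _ in range(iterations):
        for i, t in enumerate(doc):
            n_mk[z[i]] -= 1                              # exclude current word (¬i)
            p = (n_mk + alpha) / (n_m + K * alpha) * phi[:, t]   # Formula (5)
            z[i] = rng.choice(K, p=p / p.sum())          # resample topic of word i
            n_mk[z[i]] += 1
    return z
```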
After finishing the word-topic assignment, we obtain the semantic representation of a short text by using the corresponding topics in place of the original words. Every text can therefore be represented by a vector in which each element records the number of times the corresponding topic was assigned. Thus the dimensionality of the feature space is reduced to the number of topics. As we noticed, several words in a text may be assigned to the same topic, while some topics may not be assigned to any word of the text. Hence, the resulting vectors have some zero elements and a small number of elements that are much larger than the others.
Now we combine the semantic features of a text with the lexical features obtained by the feature selection method we proposed. We assume that if words are closely related to a certain category, the corresponding topics are also closely related to that category. Therefore, we increase the mapping weights of these words to their assigned topics in order to put more emphasis on them. Here, the mapping weight η refers to how many times a topic is counted as appearing in the text when a word is assigned to it. For example, the word currency is a lexical feature of category Business. If it is assigned to topic 1, then topic 1 is considered to appear η times owing to an occurrence of the word currency in the text. η is calculated by Formula (6).
where F(w, C_i) is the M-ECE value of word w with regard to category C_i. That is to say, the more important a word is in a category, the higher the mapping weight it gets, and hence the corresponding topic is emphasized when the short text is represented with topics.
After mapping words to topics with different weights, short texts are still represented over all the learned topics, but the elements of the vector differ from those obtained by considering semantics alone, because more emphasis has been put on the topics of feature words.
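A sketch of the resulting representation is shown below: each word contributes a count of 1 to its assigned topic, or η if it is a selected lexical feature. The dictionary feature_eta (mapping feature words to their η values from Formula (6)) and the vocabulary index word_index are assumed inputs, and assign_topics refers to the sampling sketch above.

```python
import numpy as np

def to_topic_vector(doc_words, word_index, phi, feature_eta, alpha=0.5):
    """Represent a short text as a K-dimensional weighted topic-count vector.

    feature_eta: dict mapping selected feature words to their mapping weight eta.
    """
    K = phi.shape[0]
    words = [w for w in doc_words if w in word_index]     # drop out-of-vocabulary words
    z = assign_topics([word_index[w] for w in words], phi, alpha)
    vec = np.zeros(K)
    for w, k in zip(words, z):
        vec[k] += feature_eta.get(w, 1.0)   # feature words count eta times
    return vec
```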
4. Experiments

In order to evaluate our approach, we conduct experiments on two datasets. The GoogleSnippet 1 dataset contains web search results related to 8 different domains. We choose 5 categories of the original dataset as our experimental dataset; the details of this dataset are shown in Table 1. Ohsumed 2 includes medical abstracts from the Medical Subject Headings (MeSH) categories of the year 1991, organized into 23 categories. We also choose 5 categories of the original dataset, and part of their abstracts, as our second dataset. The statistics of Ohsumed are shown in Table 2.
As we can see, the average length (AveLen) of the texts in the GoogleSnippet dataset is only about 16 words after preprocessing. Although the abstracts of the Ohsumed dataset are much longer, they still contain little word co-occurrence.
1 http://jwebpro.sourceforge.net/data-web-snippets.tar.gz
2 http://disi.unitn.it/moschitti/corpora.htm
Table 3: Background Dataset for GoogleSnippet
Category           # webPages   AveLen
Business           642          1169.91
Computer           639          1058.13
Health             555          1265.29
Politics-Society   561          1326.02
Sports             458          1249.83

Table 4: Background Dataset for Ohsumed
Category            # webPages   AveLen
Cardiovascular      554          789.17
Digestive           457          1130.01
Immunology          570          1259.98
Neoplasms           607          986.40
Respiratory Tract   638          1003.05
As we can see in Fig. 2(a), in almost all cases of different feature sizes M-ECE performs better than traditional ECE; the only exception is a feature size of 300, where they perform the same. Specifically, when the feature size is small this superiority is more obvious, while traditional ECE and M-ECE have similar performance as the feature size becomes larger. Besides, we notice that when the number of features exceeds 200 the accuracy stays almost stable, so we may infer that 200 is the best feature size for lexical classification of this dataset. Fig. 2(b) demonstrates similar results on the Ohsumed dataset: in most cases M-ECE performs better than traditional ECE, although an exception exists when the feature size is 250. Besides, the accuracy begins to decrease when the feature size exceeds 200 if we use traditional ECE as the selection measure, whereas M-ECE achieves stable accuracy as the number of features grows. Hence, we can conclude that M-ECE is a more effective and stable measure than ECE for selecting features for text classification.
The ratio of training to testing size ranges from 0.25 to 4, and the classification results are shown in Fig. 4. They show that even with a small labeled training set our method achieves high accuracy. On the GoogleSnippet dataset, even when the training set is a quarter of the testing set, the accuracy is 92.38%, and it remains almost stable as the training set grows. On the Ohsumed dataset, the accuracy increases very slowly as the training set grows. Hence, we can conclude that the classifier built with our proposed measure has high predictive capability and scales well with the variation of training and testing sizes.
Fig. 4: The Effect of Different Training Sizes
Fig. 5: The Effect of Topic Numbers
5. Discussion
Our proposed approach relies heavily on learning the right topics for the given corpus of short texts, which implies the importance of choosing the right background knowledge as well as of how we acquire high-quality topics from it.
For the sake of simplicity, we constructed the background knowledge for both experimental datasets by crawling web pages from Wikipedia, known as the richest online encyclopedia. However, other collections may be more appropriate as background knowledge for a particular short text corpus, for example, related medical journals for the Ohsumed dataset. Another issue is the quality of the topics learned from the background knowledge. As far as we know, the HDP model [18] is probably better adapted to learn high-quality topics, as it can learn a suitable number of topics automatically according to the nature of the corpus without manual settings. Both issues are worth further research.
6. Conclusion
In this paper, we present a novel approach that combines lexical and semantic features for short text classification, and we also put forward a new measure for selecting lexical features from short texts. Experimental results indicate improvements in both feature selection and classification for short texts. Our future work lies in applying the proposed approach to other related text analysis and mining tasks. We are also interested in extending the basic LDA model to correlated topic models in order to extract not only semantic features but also the correlations between these features for mining short texts.
References
[1] M. Sahami, T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. Proceedings of the 15th International Conference on World Wide Web, 2006.
[2] D. Bollegala, Y. Matsuo, M. Ishizuka. Measuring semantic similarity between words using web search engines. Proceedings of the 16th International Conference on World Wide Web, 2007.
[3] W. Yih, C. Meek. Improving similarity measures for short segments of text. Proceedings of the 22nd National Conference on Artificial Intelligence, 2007.
[4] O. Jin, N. N. Liu, K. Zhao, Y. Yu, Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2011.
[5] X.-H. Phan, L.-M. Nguyen, S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. Proceedings of the 17th International Conference on World Wide Web, 2008.
[6] M. Chen, X. Jin, D. Shen. Short text classification improved by learning multi-granularity topics. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, p. 1776-1781, 2011.
[7] S. Banerjee, K. Ramanathan, A. Gupta. Clustering short texts using Wikipedia. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
[8] D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, p. 993-1022, 2003.
[9] Y. Lu, Q. Mei, C. Zhai. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, v.14, n.2, p. 178-203, 2011.
[10] Q. Diao, J. Jiang, F. Zhu, E.-P. Lim. Finding bursty topics from microblogs. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Volume 1, p. 536-544, 2012.
[11] D. Mladenic, M. Grobelnik. Feature selection for unbalanced class distribution and Naive Bayes. Proceedings of the Sixteenth International Conference on Machine Learning, p. 258-267, 1999.
[12] S. Lacoste-Julien, F. Sha, M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, volume 22, 2008.
[13] G.-R. Xue, W. Dai, Q. Yang, Y. Yu. Topic-bridged PLSA for cross-domain text classification. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
[14] D. Ramage, D. Hall, R. Nallapati, C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), p. 248-256, 2009.
[15] J. Park, S. Lee, H.-W. Jung, J.-H. Lee. Topic word selection for blogs by topic richness using web search result clustering. Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, 2012.
[16] P. Ferragina, U. Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[17] X. Quan, G. Liu, Z. Lu, X. Ni, L. Wenyin. Short text similarity based on probabilistic topics. Knowledge and Information Systems, v.25, n.3, p. 473-491, 2010.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei. Hierarchical Dirichlet processes. Technical Report, Department of Statistics, UC Berkeley, 2004.