Multi-Label Hierarchical Text Classification Using The ACM Taxonomy
1 Introduction
1 Available at http://www.acm.org/class/1998/
The most important element of a classifier is its training set. A training set is just a set
of documents that exemplify the different classifications as fully as possible. If the
training set is poor, the classifier cannot classify incoming documents correctly.
There are several text collections, including Reuters-21578², 20 Newsgroups³ and biological data sets such as ENZYME [20], used and referenced in the literature for multi-label classification. However, these collections are not suitable for the study of multi-label hierarchical classification: besides some lacking an original hierarchical structure (Reuters-21578, Reuters-RCV1), they are not multi-label (20 Newsgroups) or are not even text collections (ENZYME).
Due to the lack of a truly hierarchical and multi-label collection of texts, we developed a solution able to autonomously navigate the ACM digital library, collecting Web pages and extracting the relevant data to build a test collection. From the various types of scientific documents available (journals, magazines, proceedings, among others), we chose to collect the proceedings, because this is the most represented document type in the ACM library and it covers a large share of the ACM categories involved, making it unnecessary to collect other document types.
2.1 Architecture
Using the SAX2 API (Simple API for XML, version 2), the ACM tree was extracted from an XML file and stored in a database⁴. Due to the large number of Web pages needed and the high volume of data to extract, we designed and implemented a system capable of automatically browsing the relevant Web pages of the ACM website and extracting the content of interest into the created database. The system architecture was based on Google's architecture [2].
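For illustration, a minimal Python sketch of the tree extraction step, assuming a hypothetical XML layout (nested <node id label> elements) and an SQLite database; the actual implementation used the Java SAX2 API and a different schema:

# Sketch: load a category tree from XML with a SAX parser into SQLite.
# The element name "node", its attributes, the file name and the table
# schema are assumptions, not the real ACM XML layout.
import sqlite3
import xml.sax

class ACMTreeHandler(xml.sax.ContentHandler):
    def __init__(self, db: sqlite3.Connection):
        self.db = db
        self.stack = []                       # tracks the current parent node

    def startElement(self, name, attrs):
        if name == "node":                    # hypothetical element name
            parent = self.stack[-1] if self.stack else None
            self.db.execute(
                "INSERT INTO acm_tree(id, label, parent) VALUES (?, ?, ?)",
                (attrs["id"], attrs["label"], parent),
            )
            self.stack.append(attrs["id"])

    def endElement(self, name):
        if name == "node":
            self.stack.pop()

db = sqlite3.connect("acm.db")
db.execute("CREATE TABLE IF NOT EXISTS acm_tree(id TEXT, label TEXT, parent TEXT)")
xml.sax.parse("acm_ccs.xml", ACMTreeHandler(db))   # hypothetical file name
db.commit()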
Figure 1 shows a simplified architecture of the system, in which its key component, the Extractor, is represented. The system works as follows (a code sketch of this loop is given after step 4): (1) the Crawler is initially supplied with an address (URL) of the ACM digital library; (2) the Crawler delivers the Web page to the Extractor; (3) the Extractor consults its set of rules and:
a. if the page has details of a scientific article, it extracts data such as title, keywords, abstract, primary and secondary classifications, etc., and saves the data in the database;
b. if the page points to scientific articles, their links are extracted and saved in the database;
c. if the page is of neither of the previous types, it is not relevant and nothing is done.
(4) The Crawler checks whether there are Web addresses left in the database; if there are, it goes back to step (2); otherwise, it stops.
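A minimal sketch of this loop, in Python for illustration; the rule functions (is_article_page, parse_article, save_metadata, extract_links) and the frontier table are assumptions, not the paper's actual components:

# Sketch of the Crawler/Extractor loop, steps (1)-(4). requests and sqlite3
# stand in for the real system; a possible parse_article is sketched below.
import sqlite3
import requests

def crawl(seed_url: str, db: sqlite3.Connection) -> None:
    db.execute("INSERT INTO frontier(url) VALUES (?)", (seed_url,))   # step (1)
    while True:
        row = db.execute("SELECT url FROM frontier LIMIT 1").fetchone()
        if row is None:                          # step (4): no addresses left, stop
            break
        url = row[0]
        db.execute("DELETE FROM frontier WHERE url = ?", (url,))
        html = requests.get(url).text            # step (2): hand the page to the Extractor
        if is_article_page(html):                # step (3a): article detail page
            save_metadata(db, parse_article(html))
        else:
            for link in extract_links(html):     # step (3b): page pointing to articles
                db.execute("INSERT INTO frontier(url) VALUES (?)", (link,))
            # step (3c): if no links either, the page is irrelevant; nothing is done
        db.commit()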
2 http://www.daviddlewis.com/resources/testcollections/reuters21578/
3 http://people.csail.mit.edu/jrennie/20Newsgroups/
4 Available at http://www.dei.isep.ipp.pt/~i000313/
The implemented process has the ability to find the relevant Web pages and ignore the others. In total, 112,000 Web pages were collected; of those, 106,000 correspond to pages with information on scientific documents, such as title, abstract, keywords, general terms, ACM classification, authors and the link to the full document in PDF format. The information on each document was obtained by first identifying the relevant Web pages and then acquiring and extracting the information. The information concerning each publication was extracted from the Web pages based on manually written rules. This was possible because the pages of the ACM site are reasonably structured, which facilitates the writing of rules to automatically extract the information. From the documents collected from the ACM portal, it is possible to infer that a document may have between zero and seven primary classifications and between zero and thirty-six additional classifications; such large numbers of classifications occur only in extreme cases. Considering the number of classifications, and ignoring the distinction between primary and secondary classifications, the collected documents are distributed according to the number of classifications as shown in Figure 2.
The distribution shows that 31,122 documents do not have any classification, while 13,132 documents have a single classification and the remaining documents have two or more classifications.
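As noted above, publication metadata was extracted with manually written rules over the reasonably structured ACM pages. A minimal sketch of one such rule, with entirely hypothetical markup patterns (the real page structure is not reproduced here):

# Sketch: a hand-written extraction rule. Both regular expressions assume
# invented markup; real rules would target the actual ACM page structure.
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)                 # hypothetical
ABSTRACT_RE = re.compile(r'<div class="abstract">(.*?)</div>', re.S)  # hypothetical

def parse_article(html: str) -> dict:
    def first(rx):
        m = rx.search(html)
        return m.group(1).strip() if m else None
    return {"title": first(TITLE_RE), "abstract": first(ABSTRACT_RE)}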
3 Preprocessing
After obtaining the document text collection, in which each training document is composed of title, abstract, keywords and the respective classifications, it was necessary to preprocess it in order to get better performance from the classification algorithms. In this step, we used the WVTool 1.1 (Word Vector Tool) API [13]. The preprocessing tasks performed were the following (a code sketch is given after the list):
– Term normalization: changing all characters to lowercase, removing accents and treating equivalent terms written in different ways (for example, color/colour);
– Segmentation: dividing the text into single units (a procedure known as tokenization), where all non-letter characters were treated as separators, yielding tokens consisting only of letters;
– Stopword removal: based on a standard English word list incorporated into the API;
– Stemming: reducing each word to its stem, using the Porter stemmer algorithm [11];
– Pruning: eliminating from the collection words with frequency lower than 3. We did not eliminate words with frequency greater than a given value, because such words were already deleted in the stopword removal or excluded in the selection of the most important terms (a task reported in the next section).
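The following Python sketch mirrors these steps; the paper used the Java WVTool API, so NLTK's stopword list and Porter stemmer are illustrative stand-ins:

# Sketch of the preprocessing pipeline: lowercasing, accent removal,
# tokenization on non-letters, stopword removal, Porter stemming, and
# pruning of terms with collection frequency below 3.
import re
import unicodedata
from collections import Counter

from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def normalize(text: str) -> str:
    # Lowercase and strip accents (e.g. "Montréal" -> "montreal").
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in text if not unicodedata.combining(c))

def tokenize(text: str) -> list:
    # Every non-letter character acts as a separator; tokens are letters only.
    return re.findall(r"[a-z]+", text)

def preprocess(docs: list, min_freq: int = 3) -> list:
    token_docs = [
        [stemmer.stem(t) for t in tokenize(normalize(d)) if t not in STOP]
        for d in docs
    ]
    # Prune terms whose collection frequency is below min_freq.
    freq = Counter(t for doc in token_docs for t in doc)
    return [[t for t in doc if freq[t] >= min_freq] for doc in token_docs]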
4 Feature Selection
After the document pre-processing phase, the number of terms in the text collection, although considerably reduced, remained high. A high number of terms or features is typical of text classification problems. It is not desirable because it significantly increases the time a classifier needs to learn. In fact, not all the terms used in text documents are relevant to describe them, and some may even reduce the quality of the classifier's learning. As a result, it is common to choose a subset made up of the most important terms. To assign a value representing the importance of a term, there are several measures, such as information gain [12], mutual information, χ² (chi-square) and frequency [9]. We applied information gain, since it is one of the most effective measures [18].
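For reference, a sketch of the information gain computation for a single binary term feature against a class variable; in the multi-label setting this would be applied per binarized label, and entropies are in bits:

# IG(C; t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(term_present: np.ndarray, labels: np.ndarray) -> float:
    mask = term_present.astype(bool)
    gain = entropy(labels)
    for part, weight in ((labels[mask], mask.mean()), (labels[~mask], (~mask).mean())):
        if len(part):
            gain -= weight * entropy(part)
    return gain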
As described before, each document in the ACM collection has several primary and secondary classifications (multi-label classification), and each classification is organized in a hierarchical structure of four levels. However, in our experiments we only considered the first two levels of the ACM hierarchy; that is, the documents are classified according to the classes A..K of the first level and the topics of the second level. From the collection of documents gathered from the ACM site, we created two smaller collections: one with 5000 documents and the other with 10,000 documents. Both are described in Table 1.
Table 1. Characteristics of the two document collections.

                                                      Multi-Label 5000   Multi-Label 10000
Number of documents                                   5000               10000
Total number of labels                                11306              23786
Average number of labels per document                 2.2612             2.3786
Maximum number of labels per document                 14                 19
Distinct terms after stopword removal                 11743              16475
Distinct terms after stopword removal and pruning
  of terms with frequency < 3                         4467               5697
Average documents per category, 1st level (11 cat.)   454.5              909
Average documents per category, 2nd level (81 cat.)   61.7               123.4

Note: each document has one or more labels. In other words, a document belongs to one or more categories, and therefore contributes as a unit to each of those categories.
Since each of the collections above has a high number of distinct terms, we created new collections of documents from each of them, each with the 200, 400, 600, 800 and 1000 most important terms, selected according to the information gain measure.
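For illustration, the information_gain sketch from Section 4 can be reused to keep only the k most important terms (k = 200, ..., 1000); the binary document-term matrix layout is an assumption:

# Hypothetical selection of the k highest-gain terms.
import numpy as np

def top_k_terms(X_bin: np.ndarray, labels: np.ndarray, terms: list, k: int = 200) -> list:
    gains = [information_gain(X_bin[:, j], labels) for j in range(X_bin.shape[1])]
    order = np.argsort(gains)[::-1][:k]   # indices of the k highest-gain columns
    return [terms[j] for j in order]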
The flat approach does not make use of the hierarchy of categories, and that knowledge is therefore lost. Although the results reported in [8] were very similar when dealing with the problem in both hierarchical and flat form, in our opinion such results were only possible due to the reduced number of categories involved in the problem. The results reported in [7], with a higher number of categories, support the use of a hierarchy between the categories. For this reason, in this work we adopt the local hierarchical approach.

5 Available at http://mlkd.csd.auth.gr/multilabel.html
6 Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
7 Available at http://www.cs.waikato.ac.nz/ml/weka
[Figure: the two-level class hierarchy (Root; Level 1 with categories A, B, ..., K; Level 2 with their topics), and the local approach's classifiers P1 at the Root and P2, P3, ... at the Level 1 nodes.]
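A minimal sketch of this local, top-down scheme, with scikit-learn as an illustrative stand-in for the Mulan/Weka tools used in the paper; the data layout (binary label matrices per level) is an assumption:

# Local hierarchical approach: one binary-relevance classifier at the root
# over the 11 first-level classes, plus one classifier per class for its
# second-level topics, each trained only on its parent's documents.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

def train_local(X, Y1, Y2_by_class):
    """X: doc-term matrix; Y1: binary matrix over first-level classes A..K;
    Y2_by_class[j]: binary matrix over the second-level topics of class j."""
    root = OneVsRestClassifier(MultinomialNB()).fit(X, Y1)
    children = {}
    for j, Y2 in Y2_by_class.items():
        mask = Y1[:, j] == 1       # restrict training to the parent's documents
        children[j] = OneVsRestClassifier(MultinomialNB()).fit(X[mask], Y2[mask])
    return root, children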
In the global hierarchical approach (or big-bang approach) [15], the class hierarchy is treated as a whole, and thus a single classifier is responsible for discriminating among all the classes [7]. This approach is similar to transforming the hierarchical structure into a flat one, but it does, to some extent, consider the hierarchical relationships between classes. For this reason, flat classification methods cannot be used in their original form; some transformations are necessary to capture the relationships between classes.
The construction of a classifier using the global approach is more complex than following the local approach: it is computationally more demanding and less flexible; for example, each time there is a change in the hierarchy of classes, the classifier needs to be retrained.
6 Evaluation
The best-known methods for evaluating classifiers, such as the holdout method, k-fold cross-validation, leave-one-out and bootstrap, were designed to assess flat classification problems. Since ours is a hierarchical classification problem, it was necessary to adapt the k-fold cross-validation method to each level of the hierarchy. As there is still no consensus or clear trend on the evaluation measures to apply to multi-label hierarchical classification problems, we chose to implement several example-based measures.
Note that, for each classifier, all the steps (learning, classification and evaluation) were carried out together, not separately. The experiments followed the k-fold cross-validation method with k = 3, where k is the number of subsets and the number of times the cross-validation process is repeated. Although this value is not as popular as k = 10 or k = 5, which have been shown to be the most appropriate, it was chosen given the large number of experiments performed and the size of the text sets.
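As an illustration of example-based measures, a sketch of example-based accuracy, precision and recall in the style of [4]; these are standard definitions, not necessarily the exact set of measures implemented:

# Example-based multi-label measures, averaged over documents.
def example_based_scores(Y_true, Y_pred):
    # Y_true, Y_pred: lists of label sets, one per document.
    n = len(Y_true)
    acc = prec = rec = 0.0
    for t, p in zip(Y_true, Y_pred):
        inter = len(t & p)
        acc += inter / len(t | p) if (t | p) else 1.0
        prec += inter / len(p) if p else 1.0
        rec += inter / len(t) if t else 1.0
    return acc / n, prec / n, rec / n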
7 Results
With respect to the ACM tree, we only used the first and second levels, because going down to the third level would result in categories with small numbers of documents. To both the 5000- and 10,000-document collections we applied the various algorithms referred to above, using the 200, 400, 600, 800 and 1000 most important terms of each collection, selected according to the information gain measure. In the following, we only present the classification results at the second level of the tree, because only at this level is the final document classification obtained (see Figures 5 and 6). Among all the combinations of algorithms tested, we only present the best results, which were obtained with Binary Relevance combined with Naive Bayes Multinomial (B.R. NB-M.), Label Powerset combined with Sequential Minimal Optimization (L.P.SMO), and Multi-label k-Nearest Neighbor (MLkNN).
The best results were obtained with Binary Relevance combined with Naive Bayes Multinomial. The results obtained with this combination are largely independent of both the size of the collection and the number of terms used: they are similar in both the 5000- and 10,000-document collections and are better with a smaller number of terms, meaning that 200 terms are sufficient to characterize this type of scientific document collected from the ACM repository. As for the other algorithms, MLkNN gives better results than L.P.SMO. Regarding the document collections, MLkNN performs best with the largest collection (10,000 documents) and the largest number of terms (1000), while L.P.SMO performs better with the smallest collection (5000 documents) and also needs a large number of terms (1000).
Fig. 5. Results of the different algorithms on dataset 5000, according to the number of terms
Fig. 6. Results of the different algorithms on dataset 10000, according to the number of terms
8 Conclusion
Text classification has received the attention of researchers since the mid-1990s. However, multi-label hierarchical text classification remains a current topic and offers exciting challenges, leaving room for new research and for the optimization of existing contributions. In this paper, we gave an overview of hierarchical classification problems and their solutions. A multi-label hierarchical text collection classified according to the ACM scheme was created and pre-processed using various techniques.
Acknowledgments. We thank the Foundation for Science and Technology for its support, expressed through the funding of the research project COPSRO - Computational Approach to Ontology Profiling of Scientific Research Organizations. The authors also thank GECAD for all the support provided.
References
1. Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6, pp. 37--66 (1991)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on World Wide Web, Computer Networks, pp. 107--117 (1998)
3. Clare, A., King, R.: Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 42--53, Freiburg, Germany (2001)
4. Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 22--30 (2004)
5. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pp. 137--142 (1998)
6. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 338--345 (1995)