
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 10, OCTOBER 2020

Automatic Classification of Algorithm Citation Functions in Scientific Literature

Suppawong Tuarob, Member, IEEE, Sung Woo Kang, Poom Wettayakorn, Chanatip Pornprasit, Tanakitti Sachati, Saeed-Ul Hassan, Member, IEEE, and Peter Haddawy

S. Tuarob, P. Wettayakorn, C. Pornprasit, T. Sachati, and P. Haddawy are with the Faculty of Information and Communication Technology, Mahidol University, Salaya Phutthamonthon 73170, Thailand. E-mail: [email protected], {poom.wet, chanathip.por, tanakitti.sac}@student.mahidol.edu, [email protected].
S. Kang is with the College of Engineering, Inha University, Incheon 22212, South Korea. E-mail: [email protected].
S. Hassan is with Information Technology University, Punjab, Lahore, Pakistan. E-mail: [email protected].

Manuscript received 30 Sept. 2018; revised 2 Mar. 2019; accepted 9 Apr. 2019. Date of publication 26 Apr. 2019; date of current version 10 Sept. 2020. (Corresponding author: Suppawong Tuarob.) Recommended for acceptance by Z. Cai. Digital Object Identifier no. 10.1109/TKDE.2019.2913376

Abstract—Computer sciences and related disciplines revolve around developing, evaluating, and applying algorithms. Typically, an
algorithm is not developed from scratch, but uses and builds upon existing ones, which often are proposed and published in scholarly
articles. The ability to capture this evolution relationship among these algorithms in scientific literature would not only allow us to
understand how a particular algorithm is composed, but also shed light on large-scale analysis of algorithmic evolution through different
temporal spans and thematic scales. We propose to capture such evolution relationship between two algorithms by investigating the
knowledge represented in citation contexts, where authors explain how cited algorithms are used in their works. A set of heterogeneous
ensemble machine-learning methods is proposed, where the combination of two base classifiers trained with heterogeneous feature
types is used to automatically identify the algorithm usage relationship. The proposed heterogeneous ensemble methods achieve the
best average F1 of 0.749 and 0.905 for fine-grained and binary algorithm citation function classification, respectively. The success of
this study will allow us to generate a large-scale algorithm citation network from a collection of scholarly documents representing
multiple time spans, venues, and fields of study. Such a network will be used as an instrument not only to answer critical questions in
algorithm search, such as identifying the most influential and generalizable algorithms, but also to study the evolution of algorithmic
development and trends over time.

Index Terms—Algorithm citation, ensemble machine learning, scholarly big data, algorithmic evolution

1 INTRODUCTION

ALGORITHMS are ubiquitous in computer science and computing-related literature. An algorithm is a set of step-wise instructions for solving a well-defined computing problem. Examples of well-known algorithms include Dijkstra's shortest path [1], Quicksort [1], and PageRank [2]. These algorithms are crucial not only in computer science literature, but also in other fields of study where problems are framed as algorithms and solved by applying traditional algorithmic methods. For example, the algorithms used in financial portfolio diversification are applied to diversify the results of document searches in information retrieval systems [3]. In product design analysis, a number of algorithms that were originally used in text mining, such as Latent Dirichlet Allocation (LDA) [4] and document ranking [5], are used effectively to extract notable product features that are so widely discussed across social networks [6]. Furthermore, many machine-learning algorithms have been used wholesale to solve critical problems in diverse non-computing fields, such as medicine [7], finance [8], natural sciences [9], and so on.

A majority of well-established and well-evaluated algorithms usually appear first in scientific articles, mostly published through computer science conferences and journals. Table 1 reports the approximate numbers of algorithms proposed in some reputable computer science conferences from 2005 to 2009, reproduced from [10]. Recent studies have investigated the extraction of these algorithms in the form of pseudo-codes and algorithmic procedures [11]. Later, AlgorithmSeer, a prototype search engine, was proposed to facilitate searching of algorithms in massive scholarly documents [12]. Such a system extracts and indexes textual metadata that are locally available in algorithm-proposing documents, such as captions and reference sentences, to make the extracted algorithms searchable through a web-search interface. However, an algorithm possesses certain properties (such as algorithm class, complexity, performance, type of problem and data-structure, and instruction-wise information) that cannot be captured by traditional document representations, thus AlgorithmSeer would fail to deliver accurate search results if these specific algorithmic needs are implicated.

One aspect of algorithm semantics that could be useful in algorithm retrieval is its use in scientific literature. For example, algorithms that are used extensively to solve problems in diverse fields of study tend to be generalizable, and are preferred by users who seek reliable algorithmic solutions to their problems. Similarly, if an algorithm is used as a building block in many generations of advanced algorithms, then it is likely to be highly influential, and thus desired by researchers looking for both a solid baseline to benchmark their newly proposed algorithm, and a way to extend or improve upon it.


TABLE 1
Approximate Number of Algorithms Published in Different Computer Science Conferences During 2005-2009, from [10]

Conference   Number of Proposed Algorithms
SIGIR            75
SIGMOD          301
STOC             74
VLDB            278
WWW             142

Fig. 1. Example citation context, taken from [13].
Fig. 2. Example citation context, taken from [14].

The ability to automatically identify how an existing algorithm is used in scientific literature would not only be pivotal to the algorithm retrieval task, but also pave the way for investigations into algorithmic evolution in a large-scale scholarly corpus.

We observe that the algorithm usage relationship can be captured from the citation context, where the citing article describes how an algorithm proposed in the cited work is used. Fig. 1 shows an example of a citation context implying that the authors of the citing work used the algorithm Deb's NSGA-II proposed in [13] in their research. The example in Fig. 2 represents a citation context implying that the citing article proposes a new algorithm that extends the Accelerated A* (AA*) algorithm originally proposed in [7]. The problem of discovering algorithm usage is then framed as an algorithm-citation context classification in which a citation context that describes how a citing article uses the cited work is classified into one of the algorithmic-usage types.

While automatic citation-context classification has been a focal point of research since Teufel et al. proposed a method to map a citation to its function type [15], most relevant studies use classification schemes that are inapplicable to our current work, in which we focus on how an algorithm is used in an article. Nevertheless, some of their classification schemes and features have been adapted and extended in our proposed method. We found that the context features drawn from the citation context alone are insufficient to identify an algorithm's usage accurately, and propose to use features extracted from both the content and the context information (such as article-level metadata, language structure, proximity, sentiment, n-grams, etc.), then combine these heterogeneous feature sets by ensemble-classification techniques, in which each base classifier learns a different feature set, or a combined feature space, and together they make an ensemble decision. This indicates the significant technical novelty of our method over previously proposed approaches in which various feature sets are combined in a single feature pool. Such feature-heterogeneous ensemble classification has been found effective in many classification tasks in which a single feature set, drawn from just one aspect of the data, may not convey sufficient information [16].

This article makes the following key contributions:

(1) We propose two sets of features to represent a citation context. These features characterize how the citing article uses the algorithms proposed in the cited work. The first is drawn from contextual information, where we extend the features from the previous state-of-the-art feature set used to solve similar problems. The second is drawn from the content of a citation context using weighted n-grams.
(2) We explored the use of various ensemble methods that allow base classifiers trained with different feature types to make collective decisions, such as combined features, majority vote, and weighted average.
(3) We validated our proposed feature sets and classification algorithms using standard information retrieval experiment protocols. Furthermore, we strengthened our reasons for choosing the proposed features by showing how each feature type impacts classification performance.
(4) We created a ground-truth dataset of citation contexts drawn from scientific documents in various fields of study and venues, by manually labelling them with the algorithmic-usage types. This dataset is available for others to use.

2 BACKGROUND AND RELATED WORK

The relevant literature is divided into two parts. Since the objective of this research is to understand how existing algorithms are used in scholarly works, the literature on mining algorithms in scholarly data is first discussed. Next, we discuss works related to citation classification, since we frame the problem at hand as a citation-context classification.

2.1 Mining Algorithms in Scholarly Data

Sumit et al. suggested that algorithms are typically represented as pseudo-codes in scholarly documents [17]. The locations of the pseudo-codes were identified by using a set of regular expressions to detect the presence of corresponding captions. Later, Sumit et al. proposed extracting the textual metadata of pseudo-codes using a set of reference sentences and a synthesized synopsis generated from the textual content of the document in which each pseudo-code appears [10]. Tuarob et al. discovered that roughly 26 percent of pseudo-codes in scholarly documents had no accompanying captions, then proposed a set of ensemble machine-learning approaches to detect and extract pseudo-codes by using features extracted from both the content and the context information [11]. They first proposed a heuristic algorithm to extract sparse boxes from a text.

A sparse box in a document is defined as a sequence of consecutive lines in which there is less textual content than a given threshold, and thus has a higher chance of being a pseudo-code. A set of 47 features were extracted from each sparse box. These features were divided into four groups: fontstyle-based (FS); context-based (CX); content-based (CN); and structure-based (ST). A majority vote of three base classifiers, trained on such features, was used to make ensemble decisions on whether a sparse box is a pseudo-code or not. Furthermore, Tuarob et al. discovered that authors use not only pseudo-codes but also step-by-step instructions (i.e., algorithmic procedures) to represent algorithmic concepts. Hence, they also proposed a set of machine learning based methods to identify the locations of these algorithmic procedures. Recently, studies have found that pseudo-code detection features can be enhanced by proximity information, such as their relative location in the document, and in which section the sparse box appears [18], [19].

Recently, Tuarob et al. proposed AlgorithmSeer, a search system for algorithms in scholarly big data [12]. Their proposed system detects algorithms and extracts corresponding metadata from scientific documents in the CiteseerX repository [20]. The primary metadata are extracted from the document containing the algorithm, including reference sentences and a synthesized description [10]. For those algorithms whose textual metadata were insufficient, they adopted a topic modelling-based document annotation algorithm proposed in [11], [21] to transfer knowledge from metadata-rich algorithms to enhance the textual metadata of algorithms with sparse textual metadata. Recently, Safder et al. proposed using recurrent convolutional neural networks to locate the sentences in an algorithm-proposing document that discuss the efficiency of the proposed algorithms [22].

In addition to extracting algorithms and metadata, Tuarob et al. analyzed the algorithm co-citation network [23], generated from 1,370,000 documents (i.e., nodes) by filtering for citations whose citation sentences contain an algorithm keyword, resulting in a network of 9,409,433 edges. The Markov Cluster algorithm [24] was used to cluster the network, resulting in clusters of documents that address similar algorithmic problems. Recently, these authors conducted an exploratory study on 300 randomly selected algorithm citation contexts, and proposed a classification scheme for algorithm citation functions. This can be divided into nine classes: Extension; Direct-use; Suggestion; Similarity; Difference; Analysis; Mention; Baseline; and Argument. In this work, the previously proposed scheme is consolidated to make automatic classification feasible with an acceptable degree of accuracy.

Most of the previous substantial works on algorithm mining focus on metadata extraction and the search for individual algorithms that have been proposed in scholarly documents. To the best of our knowledge, we are the first to investigate how existing algorithms amid the massive stream of computing literature may be utilized. The findings of this research could complement many existing algorithm mining tasks, such as by helping to identify influential and/or generalizable algorithms, ranking algorithm search results, identifying important algorithm inventors, and discovering articles that propose algorithms. The results of this work could also be used to study the evolution of algorithmic usages, topics, and trends over various time spans and fields of study.

2.2 Automatic Citation Function Classification

Functional aspects of citations have been extensively studied, in terms of both function identification and behaviors. Although the functional analysis of citations in literature has been extensively investigated [25], we discuss here only works on the automatic identification of citation functions, since these are relevant to our research problem.

Garfield was among the pioneers in the field of automatic citation-function classification [26]. Garfield analyzed a corpus of citations and discovered that authors use citations for at least 15 different reasons, then discussed the need for automated citation classification systems. Teufel et al. proposed a citation function scheme of 12 function classes, which may be grouped into three sentiment classes (i.e., Negative, Positive, and Neutral) [15]. Features such as cue phrase, verb tense, voice, modality, and location (with respect to the document) are extracted from a citation sentence. These features are used to train a kNN classifier to classify a citation sentence into an appropriate citation function class. A dataset of citation sentences extracted from 360 articles from the Computation and Language e-print archive was used to validate the proposed methodology, finding an average F1 of 0.57 for functional classification and an average F1 of 0.71 for sentiment classification.

Building on [15], Dong and Schafer studied organic or perfunctory citations [27], and proposed a citation scheme of four classes, namely Background, Fundamental Idea, Technical Basis, and Comparison [28]. They found that using keyword-based features alone was insufficient to build an accurate, fine-grained classifier that captures how the cited work is used in the citing article, and proposed combining three different feature sets into a single feature space to mitigate this challenge. The three feature sets include textual features (i.e., cue words), physical features (i.e., location and density of the citation), and syntactic features generated by part-of-speech patterns. The proposed methods were evaluated on a dataset of roughly 1,768 citation sentences drawn from 122 documents in the ACL anthology (http://aclweb.org/anthology/).

Guo et al. proposed using pair-wise features that characterize the similarity between a pair of cited and citing articles [29]. These features were used to classify a citation according to three different citation schemes, namely Research Question versus Methodology versus Dataset versus Evaluation, Organic versus Perfunctory, and Evolutionary versus Juxtapositional. The pair-wise features were combined with local features (extracted from the citation sentence) and global features (extracted from article-level metadata), and used to train a Random Forest [30] classifier. The evaluation on a dataset of 2,156 citations extracted from 54 ACM SIGIR 2011 articles assessed the proposed methodology, and found that the pair-wise features were useful only for the Evolutionary versus Juxtapositional classification task.

Hassan et al. proposed an extension to classify a citation according to its importance (i.e., Incidental versus Important) [31]. The additional features include the cue words extracted from the citation context in each class. Five classification algorithms (Support Vector Machine, Naive Bayes, Decision Tree, K-Nearest Neighbors, and Random Forest) were trained on the proposed extension features, along with context and textual features from the existing literature, and experiments were conducted on an annotated dataset of 465 citations. The best performance was achieved using a Random Forest classifier, yielding an average ROC of 0.91.

Recently, Jurgens et al. noted that authors are sensitive to discourse structure and publication venues when selecting and composing citations, and proposed an enhanced feature set [32], which is an extension of [15]. The additional features include citation context topic, function pattern, prototypicality, and venue. They consolidated the general citation function scheme from [15] to seven functions, namely Background, Motivation, Uses, Extension, Comparison, Contrast, and Future. The proposed features, combined with features from [15], were used to train Random Forest classifiers and to evaluate the same dataset used by Teufel et al. [15]. In terms of citation function classification, they achieved a best average F1 of 0.53.

The objective of our research is to develop a set of methods that automatically discover different usages of algorithms in scientific literature. We conjecture that such information could be captured from citation contexts in which a citing article describes how a cited work is utilized. Doing so allows us to extend useful features that have been discovered in previous studies, such as by Teufel et al. [15], Dong and Schafer [28], and Jurgens et al. [32]. However, the fundamental challenge is different from that in previous citation function analysis, in the sense that we focus on how algorithms proposed in the cited work are used in the citing article, as opposed to general cited works, the nature of whose usage is different from that of algorithms. Hence, in this work the classification scheme has been further consolidated from [32] to reflect the distinct, practical uses of algorithms in scholarly work. Furthermore, an additional set of features is proposed to better capture the various usage types of algorithms, which can be inferred from the citation context.

3 METHODOLOGY

The problem of algorithm usage discovery is framed as a classification task in which a citation context is classified into an algorithm usage type. Mathematically, let $D = \{d_1, d_2, d_3, \ldots\}$ be the collection of scholarly documents and $C(d) = \{c_1, c_2, c_3, \ldots\}$ be the set of all citation contexts in document $d$. A document $d = \langle t_1, t_2, t_3, \ldots \rangle$ is an ordered sequence of sentences. A citation context $c = \langle t, d', d_t \rangle \in C(d')$ is a tuple of the textual content $t$ that appears in $d'$ and a target cited work $d_t$. The textual content $t = \langle t^*, n^-, n^+ \rangle$ is a sequence consisting of $n^-$ sentences before the citation sentence $t^*$, the citation sentence $t^*$ itself, and $n^+$ sentences after $t^*$. The problem statement then becomes: given a citation context $c$ from document $d'$ that cites $d_t$, we would like to classify $c$ into an algorithm-usage class that describes how $d'$ uses algorithms proposed in $d_t$. The subsequent sections explain the classification schemes used in this work, the proposed features, and the classification methods.
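To make the formulation above concrete, the sketch below shows one possible in-memory representation of a citation context and the classification interface implied by the problem statement. It is a minimal illustration under our own naming (CitationContext, classify_usage, etc.), not the authors' implementation.

from dataclasses import dataclass
from typing import List

# Hypothetical container mirroring c = <t, d', d_t>: the textual content t
# (the n- sentences before the citation sentence t*, t* itself, and the n+
# sentences after it), plus identifiers for the citing document d' and the
# cited (target) work d_t.
@dataclass
class CitationContext:
    sentences_before: List[str]   # the n- sentences preceding t*
    citation_sentence: str        # t*, the sentence containing the citation symbol
    sentences_after: List[str]    # the n+ sentences following t*
    citing_doc_id: str            # identifier of d'
    cited_doc_id: str             # identifier of d_t

USAGE_CLASSES = ["USE", "EXTEND", "MENTION", "NOTALGO"]

def classify_usage(context: CitationContext) -> str:
    """Placeholder for the learned classifier: maps a citation context to
    one of the four USAGE classes listed in USAGE_CLASSES."""
    raise NotImplementedError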
3.1 Algorithm Citation Function Schemes

The overarching goal of our research is to study how algorithms evolve over time. To do this, the ability to determine whether an existing algorithm is either used or extended in a given work is crucial. In this research, we categorize the usage of an existing algorithm into three classes: USE; EXTEND; or MENTION. A citing article uses a cited algorithm to conduct experiments, generate data, etc., without modifying it (i.e., simply using it as-is). A citing article extends a cited algorithm either by using it as a building block in a more complex algorithm, or by modifying it, either to make it more efficient or to solve a different, emerging problem. If an algorithm is cited yet neither used nor extended, then it is merely mentioned. Since a citation does not necessarily reference only algorithms (i.e., it could cite ideas, data, case studies, theorems, etc.), another class, NOTALGO, was added to the proposed scheme to catch non-algorithmic citations.

The list of citation function classes and corresponding examples is presented in Table 2. Here, two schemes are presented: UTILIZATION and USAGE. The UTILIZATION scheme is a binary classification in which a citation context is classified as UTILIZE if the cited work proposes an algorithm that is used or extended in the citing article, and NOTUTILIZE otherwise. This binary classification scheme is designed to allow the detection and discovery of influential and generalizable algorithms in scientific literature.

The USAGE scheme presents a finer-grained classification of algorithm functions. Specifically, the UTILIZE class is further divided into the USE and EXTEND classes. Similarly, the NOTUTILIZE class is split into the MENTION and NOTALGO classes. From the USE example in Table 2, it is apparent that the citing article uses Deb's NSGA-II algorithm proposed in [13] simply to generate sets of solutions. From this citation context alone, we cannot determine that the citing article extends the cited algorithm, so this citation context belongs in the USE class. In the EXTEND example in Table 2, the citing article extends the Accelerated A* (AA*) algorithm previously proposed in [7] to develop the Iterative Accelerated A* (IAA*) algorithm, hence it would be classified as the EXTEND class. In the MENTION example in Table 2, while the cited work (i.e., [8]) proposes an algorithm, whether the cited algorithm is used or extended in the citing article cannot be determined from the citation context. Therefore, this example falls into the MENTION class. Finally, in the NOTALGO example, it cannot be determined from the citation context whether the cited work (i.e., (Prabhakar et al., 2007)) proposes any algorithms. Rather, the authors of the citing article seem to be describing an experimental protocol in the cited work. Hence, this example is classified as NOTALGO. In this work, we term UTILIZE, USE, and EXTEND the positive classes, and the remainder the negative classes.

The reason behind the UTILIZATION scheme is to provide the ability to identify "truly influential" algorithms, which are either EXTENDed or USEd in a paper. Hence, the EXTEND and USE classes are combined. MENTIONed algorithms do not directly impact a citing algorithm, hence they are not truly influential. As future work, we would like to investigate the evolution of algorithms over time. Hence, the ability to identify algorithms that are truly utilized (used or extended) could prove to be crucial.

Note that, while previous studies have proposed a finer-grained algorithm citation-function scheme of nine classes [33], here we consolidate the scheme into just four distinct classes. This is not only to focus on identifying algorithms that are indeed either used or extended (hence influential), but also to make it more feasible to develop automatic classification algorithms, since fine-grained citation-function classification has proved to be a challenging task that is yet to be achieved with any great accuracy [32].
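Since the binary UTILIZATION labels are defined directly in terms of the fine-grained USAGE labels, the consolidation can be restated as a simple mapping. The snippet below is only an illustrative restatement of the scheme described above, not code from the paper.

# UTILIZE groups the two positive usage classes (USE, EXTEND);
# NOTUTILIZE groups MENTION and NOTALGO.
USAGE_TO_UTILIZATION = {
    "USE": "UTILIZE",
    "EXTEND": "UTILIZE",
    "MENTION": "NOTUTILIZE",
    "NOTALGO": "NOTUTILIZE",
}

def to_utilization(usage_label: str) -> str:
    return USAGE_TO_UTILIZATION[usage_label]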

TABLE 2
List of Algorithm Citation Function Classes and Examples (the target citation is in bold-italic font in the original)

UTILIZE / USE: "Therefore, it might be desirable to present a set of optimal solutions that the end user may choose from. We have used Deb's NSGA-II [13] to generate sets of solutions. NSGA-II is a fast and elitist genetic algorithm framework designed for dealing with multi-objective optimization problems."

UTILIZE / EXTEND: "We present the Iterative Accelerated A* (IAA*) algorithm for trajectory planning in Section 3. This algorithm is an extension of the Accelerated A* (AA*) algorithm [7]. The original AA* uses a variable discretization step size to reduce ..."

NOTUTILIZE / MENTION: "In [6], we proposed an algorithm based on the SPLICE technique for speech enhancement. In the same work, a speech detector based on the energy in the bone channel was proposed. In [8], we proposed an algorithm called direct filtering (DF) based on learning mappings in a maximum likelihood framework. However, one drawback with the DF algorithm is the absence of a strong speech mode."

NOTUTILIZE / NOTALGO: "The mixture was incubated for 10 min and the absorbance was measured at 412 nm against appropriate blanks. The glutathione content was calculated by using the standard plot under same experimental conditions (Prabhakar et al., 2007). The mice were sacrificed under light ether anesthesia, liver samples of all group were preserved in 10% neutral buffered formalin as described by Luna (1968)."

3.2 Features

Features are drawn from both local (textual information within a citation context) and global (document-level metadata) scopes. In this work, we propose using two heterogeneous sets of features: context and content features. These heterogeneous feature sets are used to train individual machine classifiers whose decisions are combined by ensemble methods.

3.2.1 Context Features

Context features refer to both global features from article-level metadata and local features extracted from the citation context using rules. Most of the context features are from previous studies on citation-function classification [15], [28], [32]. When implementing and testing these earlier features, we observed that a majority of misclassified algorithm citations were not well discriminated by these features, hence we proposed an additional set of algorithm-specific features that were shown to enhance the classification performance significantly. All the context features are listed in Table 3, divided into five categories: structure (XS); lexical-morphological-grammatical (XL); cue word (XC); sentiment (XN); and venue (XV). Features with source 'T' were originally proposed by Teufel et al. [15], 'D' by Dong and Schafer [28], and 'J' by Jurgens et al. [32]. Features with source '+' are our own proposed features.

The structure (XS) features capture both the location of and the reference density in the citation context. The location of a citation sentence is represented relative to the document's scope. From our previous studies [33], authors typically announce an extension of previous algorithms early in the paper and explain their usage in the methodology sections. Furthermore, authors typically mention or acknowledge the existence of previous algorithms in the literature review sections. The reference density characterizes how the target algorithm reference is cited along with other references. From our observations, authors typically single out cited algorithms to emphasize how they would use or extend them (e.g., "... We have used Deb's NSGA-II [13] to ..."), as opposed to group citations, when authors simply want to mention a similar set of works or algorithms (e.g., "... However, in mice lacking Trim24, RAR and VDR repressed genes are reexpressed [67, 68, 69, 70] ...").

The lexical, morphological and grammatical (XL) features are derived locally from the textual content in the citation context, representing the language patterns, tense, voice, and certain writing styles that indicate algorithm names. The language patterns are defined using a sequence of POS tags near the target citation symbol. Most of these language patterns are extended from [32]. The verb tense of the citation sentence may imply useful information. For example, authors typically use the past tense and/or the passive voice when discussing previous algorithms, and use the present tense and active voice when explaining how they use/extend the cited algorithms. Furthermore, a majority of algorithms have names, usually presented as Pascal-case words (e.g., RandomForest). Hence, the presence of Pascal-case words in the proximity of the corresponding citation symbol may indicate that the citing article refers to an algorithm proposed in the cited work.

The cue word (XC) features detect the presence of cue words, further divided into 22 cue sets. Half of the cue word sets were proposed by Dong and Schafer [28], who found them useful in citation-function classification tasks. In addition to these, we propose 11 cue sets to characterize the presence of algorithm entities, algorithm classes, and actions on algorithms. For each cue set, 10 numeric features were extracted from a given citation context, including the binary presence and frequency of the cue words in various strategic portions of the citation context, such as the citation sentence, the whole context, before/after the target citation symbol, and so on.

TABLE 3
Context Features, Divided into Five Subsets: Structure (XS); Lexical-Morphological-Grammatical (XL); Cue Words (XC); Sentiment (XN); and Venue (XV)
(Source: T = Teufel et al. [15], D = Dong and Schafer [28], J = Jurgens et al. [32], + = proposed in this work)

Structure (XS)
  T  Relative position of t* in d'
  T  # of total citations in t*
  T  # of total citations in the same group, e.g., [12,13,14] = 3
  +  # of total citations in the citation context

Lexical, Morphological, and Grammatical (XL)
  T  Function patterns of Teufel (how to extract these)
  J  Bootstrapped function patterns
  J  Custom function patterns
  J  Citation prototypicality
  J  Whether the target citation is used in nominative or parenthetical form
  J  Whether preceded by a Pascal-cased word
  +  Whether a Pascal-cased word is in the preceding 3 words
  J  Whether preceded by an all-capital-case word
  T  Verb tense of the citation sentence (t*)
  T  Length of t* (the number of words in the citation sentence)
  T  Length of the containing clause
  T  Is the citation in a parenthetical statement
  +  Portion of noun words = # of nouns / # of words
  +  Portion of verb words = # of verbs / # of words
  +  Portion of pronouns = # of pronouns / # of words
  +  # of named entities
  +  Portion of named entities = # of named entities / # of words

Cue Words (XC)
  D  CUE_SUBJECT = {we, our, us, table, figure, paper, algorithm, here}
  D  CUE_QUANTITY = {many, some, most, several, number of, numerous, variety, ...}
  D  CUE_FREQUENCY = {usually, often, common, commonly, typical, ...}
  D  CUE_TENSE = {recent, recently, prior, previous, early}
  D  CUE_EXAMPLE = {such as, for example, for instance, e.g.}
  D  CUE_SUGGEST = {may, might, could, would, will, can, should}
  D  CUE_HEDGE = {suppose, conjecture, want, possible}
  D  CUE_IDEA = {following, similar to, motivate, inspired, idea, spirit}
  D  CUE_BASIS = {provided by, taken from, extracted from, based on, use, ...}
  D  CUE_COMPARE = {compare, differ, deviate, contrast, exceed, outperform, ...}
  D  CUE_RESULT = {result, accuracy, precision, performance, baseline}
  +  CUE_ALGORITHM = {algorithm, method, approach, procedure, routine}
  +  CUE_MODEL = {mechanism, framework, model, scheme, signature, system, ...}
  +  CUE_ALGOCLASS = {calculation, waveguide, presentation, translation, ...}
  +  CUE_USE = {use, using, uses, apply, applies, applied, applying}
  +  CUE_EXTEND = {extend, extending, extended, extension, adapt, adapted}
  +  CUE_PROPOSE = {propose, present, introduce, describe, develop, devise, ...}
  +  CUE_EXPLAIN = {outline, present, derive, focus, describe, review, introduce, ...}
  +  CUE_ALGOKEY = {algorithm, pseudocode, pseudo-code, procedure, ...}
  +  CUE_DOCELKEY = {table, figure, fig., algorithm, pseudo-code, diagram}
  +  CUE_PREP = {in, at, of, for, on, into, between, when}
  +  CUE_HUMAN = {we, our, this paper, this research}

Sentiment (XN)
  +  Positive sentiment level using SentiStrength
  +  Negative sentiment level using SentiStrength

Venue (XV)
  J  Citing paper venue
  +  Citing paper publisher
  J  Cited paper venue
  +  Cited paper publisher
  +  Citing paper venue group
  +  Cited paper venue group
  T  # of years' difference in publication dates

The sentiment (XN) features quantify the tone used by the authors in composing the citation context. While most authors maintain an objective tone when writing scientific documents, Athar and Teufel discovered that, when discussing previous works, authors could express sentiments of criticism that can be captured by sentiment analysis tools [34]. Such criticism is often found in the literature review sections, where authors mention previous algorithms or discuss previous works. By contrast, authors tend to use an objective tone when explaining how they would use or extend existing algorithms. In this research, SentiStrength (http://sentistrength.wlv.ac.uk/) is used to quantify positive and negative sentiment scores in a given citation context and its various strategic portions.

The reason for having both positive and negative scores (instead of a single score that summarizes the sentiment polarity) is that a study has found that positive and negative sentiments need not be in opposite directions on the same axis, but may coexist [35].

The venue (XV) features capture the venue information of both the citing and cited works, in terms of the publishers, conferences, journals, and sub-fields in computer science. From our observation, authors typically mention previous works (algorithms) from the same or similar venues as their own. Hence, the ability to decode the venue information could help to glean citations that represent mentions of an algorithm.

All five context feature groups were combined into one context feature space when training a base machine-learning model. Different base machine-learning classifiers were tested, and the best were used in the ensemble step.

3.2.2 Content Features

While the context features presented in the previous section can capture many positive instances, we noted that a majority of misclassified citation contexts were composed not only by linguistic styles and indirect word choices which are not well captured by the rules defined in the context features, but also by the noise from text remnants caused by broken equations, symbolic characters, citation symbols themselves, and erroneous text remnants from the PDF parser. Since a citation context is a short document (of roughly three sentences), previous studies have shown that weighted n-gram features could be additionally useful features to represent discriminating linguistic patterns in short, noisy texts [16]. These n-gram features from the textual content of the citation context are referred to as content features.

To extract the content features, a citation context was treated as a document. Various text pre-processing steps were investigated for their applicability, including lower-casing, stemming, and term weighting. Four term-weighting methods were explored: Binary; Term Frequency (TF); Inverse Document Frequency (IDF); and Term Frequency-Inverse Document Frequency (TF-IDF). Let $S$ be the set of documents (textual contents of citation contexts), $V = \langle v_1, \ldots, v_M \rangle$ be the vocabulary extracted from $S$, $t$ be the test message, and $F(t) = \langle f_1, \ldots, f_M \rangle$ be the feature vector of the test message $t$. We define the term weighting schemes as follows:

Binary Weight. The binary weighting scheme is the simplest representation of term vectors. The feature value is 1 if the corresponding term appears in the document, and 0 otherwise. Mathematically,

$$ f_i^{bin} = \begin{cases} 1 & \text{if } v_i \in t \text{ and } v_i \in V \\ 0 & \text{otherwise,} \end{cases} $$

where $f_i^{bin}$ is the binary value of the term $v_i$.

Term Frequency (TF). The term frequency weighting scheme counts the occurrences of each term in the document, hence taking the length of the document into account even when terms are duplicated. Mathematically,

$$ f_i^{freq} = \begin{cases} 0.5 + 0.5 \cdot \frac{TF(v_i, t)}{\max_{v \in t} TF(v, t)} & \text{if } v_i \in t \\ 0 & \text{otherwise,} \end{cases} $$

where $f_i^{freq}$ is the TF weight of the term $v_i$, and $TF(v_i, t)$ is the number of occurrences of term $v_i$ in document $t$.

Inverse Document Frequency (IDF). The IDF term-weighting scheme is similar to the binary scheme, only the value of each term is determined by its meaningfulness with respect to the corpus. The meaningfulness of a term has an inverse relationship to the number of documents in which it appears. Formally, the IDF term-weighting scheme is defined as

$$ f_i^{idf} = \begin{cases} \log \frac{|S|}{1 + |\{s \in S : v_i \in s\}|} & \text{if } v_i \in t \\ 0 & \text{otherwise,} \end{cases} $$

where $f_i^{idf}$ is the IDF value of the term $v_i$.

Term Frequency-Inverse Document Frequency (TF-IDF). While the TF scheme takes the frequency of terms into account, the IDF scheme is able to identify meaningful terms that are likely to be discriminating features. In order to combine these two schemes, we used the TF-IDF term-weighting scheme, defined as

$$ f_i^{tfidf} = \begin{cases} \left( \frac{1}{2} + \frac{1}{2} \cdot \frac{TF(v_i, t)}{\max_{v \in t} TF(v, t)} \right) \cdot \log \frac{|S|}{1 + |\{s \in S : v_i \in s\}|} & \text{if } v_i \in t \\ 0 & \text{otherwise,} \end{cases} $$

where $f_i^{tfidf}$ is the TF-IDF weight of the term $v_i$. The TF-IDF of a term is the product of its TF and IDF scores. The combination of these two statistics produces a single measure that takes advantage of both a term's frequency and its meaningfulness.
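A direct, unoptimized transcription of the four weighting schemes above is sketched below for reference. Variable names loosely follow the notation in the text ($S$ as the corpus, a document as a list of terms), but the code itself is our illustration rather than the authors' released implementation.

import math
from collections import Counter

def binary_weight(term, doc_terms, vocabulary):
    # f_i^bin: 1 if the term occurs in the document and the vocabulary.
    return 1.0 if term in doc_terms and term in vocabulary else 0.0

def tf_weight(term, doc_terms):
    # f_i^freq: 0.5 + 0.5 * TF(v_i, t) / max_v TF(v, t), or 0 if absent.
    counts = Counter(doc_terms)
    if counts[term] == 0:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf_weight(term, doc_terms, corpus):
    # f_i^idf: log(|S| / (1 + |{s in S : v_i in s}|)), or 0 if absent from t.
    if term not in doc_terms:
        return 0.0
    df = sum(1 for s in corpus if term in s)
    return math.log(len(corpus) / (1 + df))

def tfidf_weight(term, doc_terms, corpus):
    # f_i^tfidf is the product of the TF and IDF weights.
    if term not in doc_terms:
        return 0.0
    return tf_weight(term, doc_terms) * idf_weight(term, doc_terms, corpus)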
3.3 Classifiers

Classification models for the USAGE and UTILIZATION schemes are trained independently. Specifically, the USAGE and UTILIZATION classifiers are four-class and binary classifiers, respectively. A number of base classifiers were drawn from multiple families of machine-learning classification algorithms. For context features, the following base classifiers were explored: Bernoulli Naive Bayes (NB) [36]; Support Vector Machine (SVM) with linear kernels [37]; k-Nearest Neighbors (kNN) [38]; Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [39]; C4.5 Decision Tree (C4.5) [40]; and Random Forest (RF) with 500 trees [30]. For content features, classifiers known for short-text document classification were evaluated, including Bernoulli Naive Bayes (NB), Discriminative Multinomial Naive Bayes (DMNB), Multinomial Naive Bayes (NBM), Support Vector Machine with linear kernels, Random Forest with 500 trees, Sparse Generative Model (SG) [41], Deep Neural Networks (DNN) [42], and Convolutional Neural Networks (CNN) [43].

The Deep Neural Network model used in this research has five hidden layers and one output layer. Each hidden layer contains 128 neurons. Binary cross-entropy was used as the objective function, and RMSprop [44] as the optimizer. The training was run for 200 epochs.

Furthermore, we also deployed a Convolutional Neural Network for algorithm citation context classification. The CNN employs neuron layers with convolving filters that are applied over the local feature vectors. In this work, we train a CNN model with two convolutional layers along with max pooling schemes, and one fully connected layer on top of the word vectors. These embedding/word vectors are obtained from pre-trained word vectors over Wikipedia 2017, using FastText (https://fasttext.cc/).

Moreover, for regularization, we added dropout on the penultimate layer to avoid co-adaptation of neurons. For the hyperparameter settings, we used rectified linear units (ReLU) at the hidden layers, a softmax activation function at the output layer, windows of filter size 64, and a dropout rate of 0.3 with a batch size of 256. Also, we used stochastic gradient descent over random mini-batches with the Adam optimizer, using a learning rate of 0.001.
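The CNN configuration described above maps roughly onto the Keras-style sketch below. This is a plausible reconstruction under the stated hyperparameters rather than the authors' code: the kernel size, sequence length, the reading of "windows of filter size 64" as 64 filters, the use of a plain trainable embedding in place of the FastText vectors, and the categorical cross-entropy loss are all our assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def build_citation_cnn(vocab_size: int, seq_len: int = 100,
                       embed_dim: int = 300, num_classes: int = 4):
    # Two convolutional layers with max pooling over word vectors, one fully
    # connected layer, and dropout (0.3) on the penultimate layer.
    inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)   # stand-in for FastText vectors
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)                             # reported dropout rate
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training would then use mini-batches of the reported size, e.g.:
# model.fit(X_train, y_train, batch_size=256, epochs=..., validation_data=...)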
Since the samples are imbalanced (see Section 4.1), with the majority distribution on class MENTION and relatively low population from classes USE and EXTEND, data balancing techniques were explored to reduce such imbalance bias in the training sets. Specifically, the following data balancing techniques were evaluated:

- Weight Re-balancing: Weights for all the samples are re-calculated and re-assigned such that each class has the same total weight.
- Resampling: Random sub-samples are generated from the original dataset with replacement. The minority and majority classes are up-sampled and down-sampled, respectively, such that classes are equal-sized, while maintaining the original size of the total population.

However, our preliminary evaluation results indicated that these data balancing techniques did not improve the performance for the USAGE scheme, and only marginally improved the performance for the UTILIZATION scheme. Hence, they were excluded from our experiments.

From a preliminary investigation, in which the 10-fold cross-validation results from a SVM classifier trained with only content features were compared with a Random Forest classifier trained with only the context features, we found that the SVM classifier misclassified 962 samples (11 percent) and the RF classifier 885 samples (10 percent). Of these misclassified samples, 539 (41 percent of all the misclassified samples) were mutually misclassified, and the other 59 percent of the misclassified samples were misclassified by either the context-based classifier or the content-based classifier. Since the majority of content and context features are derived from different aspects of citation contexts, we conjectured that such a small overlap in the misclassified results may imply that the ensemble decisions of base classifiers, each of which was trained on a different feature type, could then help the other to correctly classify samples that would otherwise be misclassified by individual classifiers.

In addition to previous works on citation function classification, in which different feature types are combined into a single feature space, we propose training the base classifiers on the context- and content-feature sets separately. The best classifiers for each feature set were then combined using standard ensemble classification techniques, such as weighted average and majority vote. The weight $\gamma \in [0, 1]$ was used to control the influence of the context-based and content-based classifiers, as follows:

$$ P^{AVG} = \gamma \cdot P^{X} + (1 - \gamma) \cdot P^{T}. \quad (1) $$

$P^{AVG}$ is the weighted average probability for each class; $P^{X}$ and $P^{T}$ are the probability distributions from the context- and content-based classifiers, respectively. The reason for choosing heterogeneous ensemble learning is because previous work on binary short-text classification has shown that combining context features with content (n-grams) features into a single feature pool may obscure the importance of the context-feature set, whose size is much smaller than the content-feature set [16].
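Equation (1) amounts to a convex combination of the two classifiers' class-probability vectors, with $\gamma$ tuned on the held-out split described in Section 4. A minimal sketch of that combination, and of a simple grid search for $\gamma$, is given below using our own function names; it is an illustration, not the authors' implementation.

import numpy as np

def weighted_average_ensemble(p_context, p_content, gamma):
    """Eq. (1): P_AVG = gamma * P_X + (1 - gamma) * P_T, applied row-wise to
    per-sample class-probability vectors from the two base classifiers."""
    return gamma * np.asarray(p_context) + (1.0 - gamma) * np.asarray(p_content)

def tune_gamma(p_context, p_content, y_true, score_fn, grid=np.linspace(0, 1, 21)):
    """Pick the gamma that maximizes a chosen score (e.g. F1 of the positive
    classes) on held-out data, mirroring the 10 percent hold-out tuning."""
    best_gamma, best_score = 0.0, -1.0
    for g in grid:
        preds = weighted_average_ensemble(p_context, p_content, g).argmax(axis=1)
        score = score_fn(y_true, preds)
        if score > best_score:
            best_gamma, best_score = g, score
    return best_gamma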
4 EXPERIMENTS

Standard information retrieval experiments were conducted to validate the efficacy of the proposed features and classification algorithms. Specifically, document-wise 10-fold cross-validation was performed for each classification scheme (i.e., USAGE and UTILIZATION). Standard precision, recall, F1, ROC, and Precision-Recall Curve (PRC) were used as the evaluation metrics. In the test set, let $G_c$ be the set of citation contexts from class $c$. Assume that $R_c$ is the set of samples classified as class $c$, so that the correctly classified samples are $G_c \cap R_c$. Precision, recall, and F1 are defined as follows:

$$ PR_c = \frac{|G_c \cap R_c|}{|R_c|}, \qquad RE_c = \frac{|G_c \cap R_c|}{|G_c|}, \qquad F1_c = \frac{2 \cdot PR_c \cdot RE_c}{PR_c + RE_c}. $$

Since our focus is to discover influential algorithms that are indeed used or extended in the literature (i.e., the positive classes), the main evaluation metric for the USAGE scheme was the weighted average F1 of the positive classes (i.e., F_USE and F_EXTEND), or F_UX. For the UTILIZATION scheme, F_UTILIZE was used as the main criterion to compare different feature types and classification algorithms. For each iteration of the 10-fold cross-validation, 10 percent of the training data was reserved as hold-out data to tune the probability cut-off to maximize F_UTILIZE for UTILIZATION (binary) classification, and $\gamma$ for the weighted average ensemble classification. All the experiments were conducted on a Linux server with an Intel Xeon E5-2620v4 2.1 GHz CPU and 32 GB of RAM. We used the implementation of the base machine-learning classifiers from Weka (https://www.cs.waikato.ac.nz/ml/weka/). We developed the code for data labelling, data processing, feature extraction, ensemble machine learning, and the evaluation framework ourselves.
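For concreteness, the per-class metrics above and the positive-class aggregate F_UX used later can be computed as in the short sketch below; the helper names and the support-weighted reading of "weighted average F1" are our assumptions.

def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, following
    PR_c = |G_c intersect R_c| / |R_c| and RE_c = |G_c intersect R_c| / |G_c|."""
    g = {i for i, y in enumerate(y_true) if y == label}   # G_c
    r = {i for i, y in enumerate(y_pred) if y == label}   # R_c
    correct = len(g & r)
    pr = correct / len(r) if r else 0.0
    re = correct / len(g) if g else 0.0
    f1 = 2 * pr * re / (pr + re) if (pr + re) else 0.0
    return pr, re, f1

def f_ux(y_true, y_pred, positive=("USE", "EXTEND")):
    """Average F1 over the positive classes, weighted by class support."""
    scores = [(sum(1 for y in y_true if y == c),
               per_class_metrics(y_true, y_pred, c)[2]) for c in positive]
    total = sum(n for n, _ in scores)
    return sum(n * f for n, f in scores) / total if total else 0.0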

TABLE 4
Distribution of Annotated Citation Contexts in Each Class, According to Both the USAGE and UTILIZATION Annotation Schemes

Utilization   Usage      # Samples   % of Total
UTILIZE       USE            752        8.55
              EXTEND         378        4.30
NOTUTILIZE    MENTION      5,750       65.37
              NOTALGO      1,916       21.78
Total                      8,796      100.00

Fig. 3. Distribution of publication years of the documents in our dataset.


4.1 Dataset

Unlike previous works on citation classification, in which the ground truth datasets are drawn from specific conferences with limited temporal spans and with sufficient structural metadata (i.e., a section-wise XML format), such as the Computation and Language e-print archive [15] or the ACL conference [28], [32], our annotated corpus comprises randomly selected citation contexts from scholarly documents in the CiteseerX repository, a digital library that stores and indexes over 2.4 million research articles in computer science and related fields [45]. Since our goal is to discover algorithm usage in scientific literature that can potentially be used to study the evolution of algorithms, it is important that the training dataset comprise documents that represent a wide range of document types, venues, computer science sub-fields, and temporal spans.

A total of 8,063 documents were randomly selected from the CiteseerX repository, covering conference, journal, and technical report documents from various venues such as CVPR, ICDT, VLDB, SODA, and INFOCOM, and multiple computer science disciplines such as AI, algorithms and theory, data mining, parallel computing, security, HCI, and bioinformatics. Fig. 3 shows the distribution of publication years of the selected documents, ranging from 1985 to 2015, with the majority of documents published during 2003 to 2006.

From the selected documents, 405,843 citation contexts were extracted using the ParsCit algorithm [46]. Each citation context comprises a citation sentence together with one sentence before and after it, containing on average 65 words. Of these, citation contexts containing algorithm-indicating key words (such as algorithm, procedure, pseudo-code, classification, method, and approach) [33] were randomly selected for hand labelling by four computer science students using the USAGE scheme defined in Section 3.1. Since citation contexts containing algorithm-indicating key words tend to contain algorithm citations, we intentionally added non-algorithm citation contexts (also verified by the four human taggers) to the dataset to increase the negative (class NOTALGO) population. In total, 8,796 citation contexts were hand labelled. The distribution of classes according to both the USAGE and UTILIZATION annotation schemes is shown in Table 4. The inter-annotator agreement measured by Fleiss' kappa is 0.62, suggesting a substantial level of agreement [47]. To the best of our knowledge, we have annotated the largest citation context dataset to date, in terms of sample size. The dataset is available at https://goo.gl/LrtHM8 for research purposes.

4.2 Results

The experiments were conducted in two stages. First, base classifiers were evaluated with all the context features and content features. Second, the best base classifiers for context and content features were chosen to evaluate the ensemble learning. The subsequent sub-sections report the experimental results for both the USAGE and UTILIZATION schemes.

4.2.1 Context-Based Features

Context features were extracted from the dataset and then used to validate the selected base classifiers. Table 5 reports the classification results of base classifiers trained with only context features on the USAGE scheme. In terms of F_UX, the SVM classifier performed the best (F_UX = 0.476), slightly better than the RF classifier (F_UX = 0.461). However, in terms of F_AVG, the RF classifier achieved the best result, with an F_AVG of 0.748. Looking at finer-grained classification results, the RF classifier performed best in terms of the precision of the USE and EXTEND classes, while the NB classifier achieved the best recall of the two positive classes. The best F1 scores for classes USE, MENTION, and NOTALGO were achieved by the RF classifier, while the SVM classifier had the best F1 for the EXTEND class. Since F_UX was the main evaluation criterion for the USAGE scheme, SVM was selected as the best context-based classifier for the USAGE scheme.

For the UTILIZATION scheme, the classification results are listed in Table 6. The SVM and NB classifiers yielded the best results in terms of precision and recall of the positive class (i.e., UTILIZE), respectively. However, the RF classifier outperformed the other classifiers in terms of F_UTIL, along with averaged F1, ROC, and PRC. Hence, the RF classifier was chosen as the best base classifier with the context features for the UTILIZATION scheme.

4.2.2 Content-Based Features

Although the context features were handcrafted to characterize the algorithmic functional representation of citation contexts, upon preliminary investigation of the misclassified results by the context classifiers, we found that a majority of the positive samples were classified as MENTION or NOTUTILIZE, especially those that did not contain signaling cue words. Furthermore, we noted that some of the positive samples misclassified as MENTION contained a combination of words that, in proximity, could be a signal. For example, "...was used to..." and "...used to generate...", when in the same citation context, may imply that an algorithm had been used in the citing work. Similarly, a majority of negative samples were misclassified as USE or EXTEND, since they simply contained cue words that indicate algorithm usage and extension, such as "...majority of the benefits of the optimization algorithm used come early in the algorithm...", which is clearly a MENTION sample yet was classified as USE by a context-based classifier. Hence, weighted n-gram features that encode words appearing in the same proximity could prove to be additionally useful.

Each citation context was treated as a document in which the textual information is pre-processed by lower-casing, tokenizing, and stemming. In this work, the vocabulary of the n-grams was generated from the corpus of training data, consisting of the union of uni-grams, bi-grams, and tri-grams. Each citation context was represented by a vector of TF-IDF weighted uni-grams, bi-grams, and tri-grams. We also tried other term-weighting schemes such as binary, TF, and IDF.

TABLE 5
Classification Results for the USAGE Scheme, Using Only Context Features, Trained with Base Machine-Learning Classifiers

Precision, recall, and F1 for each class, along with weighted average F1 (F_AVG), F1 of classes USE and EXTEND (F_UX), area under ROC (ROC_AVG), area under precision-recall curve (PRC_AVG), and training time (ms) are reported.

TABLE 6
Classification Results for the UTILIZATION Scheme, Using Only Context Features, Trained with Base Machine-Learning Classifiers

Classifier PNOTUTIL PUTIL RNOTUTIL RUTIL FNOTUTIL FUTIL FAVG ROCAVG PRCAVG TrainTime (ms)
NB 0.957 0.368 0.810 0.752 0.877 0.495 0.828 0.855 0.904 1887
SVM 0.913 0.681 0.974 0.371 0.943 0.480 0.883 0.864 0.921 688609
kNN 0.915 0.521 0.946 0.402 0.930 0.454 0.869 0.764 0.868 57038
RIPPER 0.920 0.609 0.959 0.431 0.939 0.505 0.883 0.700 0.853 46769
C4.5 0.917 0.531 0.945 0.422 0.931 0.470 0.872 0.727 0.842 23243
RF 0.947 0.558 0.925 0.647 0.936 0.599 0.892 0.901 0.938 432237

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG), area under ROC (ROCAVG), area under precision-recall curve (PRCAVG), and training time (ms) are reported.

TABLE 7
Classification Results for USAGE Scheme, Using Only Content (n-grams) Features, Trained with Base
Machine-Learning Classifiers

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), F1 of classes USE and EXTEND (FUX ), area under ROC (ROCAVG ),
area under precision-recall curve (PRCAVG ), and training time (ms) are reported.

Overall, the TF-IDF weights yielded optimal performance. Note that stop words were not removed, since they can provide useful information, especially prepositions, pronouns, articles, and modal verbs.
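As an illustration of the content-feature representation described above (a sketch under assumptions, not the authors' released pre-processing code), the citation contexts could be vectorized along the following lines; the Porter stemmer, the tokenizer, and the example contexts are assumptions made here for concreteness.

    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_tokenize(text):
        # Lower-case, split on word characters, and stem every token.
        return [stemmer.stem(tok) for tok in re.findall(r"\w+", text.lower())]

    # TF-IDF weighted uni-, bi-, and tri-grams; stop words are deliberately retained.
    vectorizer = TfidfVectorizer(tokenizer=stem_tokenize,
                                 ngram_range=(1, 3),
                                 lowercase=True,
                                 stop_words=None)

    contexts = [
        "The PageRank algorithm [2] was used to rank the candidate documents.",
        "Several ranking algorithms have been proposed in the literature [2], [3].",
    ]
    X = vectorizer.fit_transform(contexts)      # sparse document-term matrix
    print(X.shape, len(vectorizer.vocabulary_))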
Table 7 lists notable classification results from base classifiers trained with content features for the USAGE scheme. Eight base classifiers were tested with content features, including NB, DMNB, NBM, SVM, RF, SG, DNN, and CNN (refer to Section 3.3). In terms of F1, the SG classifier outperformed the others for classes USE, MENTION, and NOTALGO, and the SVM classifier for class EXTEND. The SVM classifier yielded the best FUX of 0.508, a 6.7 percent improvement over the best context-based classifier. Hence, we conclude that SVM is the most suitable content-based classifier for the USAGE scheme.

Table 8 shows the classification results of the base classifiers trained with only the content (n-gram) features on the UTILIZATION scheme. There is consensus that the SVM classifier yielded the best F1 for classes UTILIZE (FUTILIZE = 0.482) and NOTUTILIZE (FNOTUTILIZE = 0.944), as well as the best weighted average F1 (FAVG = 0.884). Other notable results for the UTILIZATION scheme include the RF classifier, which yielded the best precision yet had poor recall in the UTILIZE class, and the NB classifier, which gave the highest recall but had low precision in the UTILIZE class. Overall, the SVM classifier was deemed to be the most suitable content-based classifier for the UTILIZATION scheme.

TABLE 8
Classification Results for UTILIZATION Scheme, Using Only Content (n-grams) Features, Trained with
Base Machine-Learning Classifiers

Classifier PNOTUTIL PUTIL RNOTUTIL RUTIL FNOTUTIL FUTIL FAVG ROCAVG PRCAVG TrainTime (ms)
NB 0.936 0.250 0.703 0.672 0.803 0.365 0.747 0.709 0.837 3,142,208
DMNB 0.905 0.740 0.985 0.299 0.943 0.426 0.877 0.852 0.920 5,633
NBM 0.892 0.332 0.929 0.239 0.910 0.278 0.829 0.704 0.852 363
SVM 0.913 0.701 0.977 0.367 0.944 0.482 0.884 0.865 0.922 7,364,915
RF 0.888 0.874 0.997 0.144 0.939 0.247 0.850 0.877 0.929 9,780,036
SG 0.903 0.805 0.990 0.276 0.944 0.411 0.876 0.879 0.931 11,071
DNN 0.915 0.622 0.971 0.346 0.942 0.444 0.881 0.658 0.295 3,380,000
CNN 0.870 0.490 0.960 0.220 0.910 0.310 0.820 0.590 0.665 33,090,000

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), area under ROC (ROCAVG ), area under precision-recall curve (PRCAVG ), and
training time (ms) are reported.

It is worth noting that the RF classifier performed significantly worse (in terms of FUX and FUTILIZE) when trained with the content features, as opposed to the context features. There could be two explanations for this phenomenon. First, Random Forest is built upon a multitude of decision trees, which is suitable for rule-extracted features, each of which should be an indicating signal of the class attribute. Since the content feature space, represented by n-grams, can be massive and sparse, this can result not only in inaccurate classification, but also in over-fitted branch splitting. Although the RF algorithm has a built-in automatic feature-selection mechanism, the massive size of the word features may overwhelm such an ability. Second, a decision tree can be implemented by a set of rules that characterize each class. Since the samples of the minority classes (e.g., USE, EXTEND, and NOTALGO) are significantly fewer than those of the MENTION class, this can cause the RF classifiers not only to extract insufficient rules for these minority classes (as evidenced by the highest precision but poor recall of the minority classes), but also to make the classification over-fitted to the majority class (as evidenced by the highest recall of class MENTION, RMENTION = 0.994). Hence, we conclude that Random Forest and tree-based classifiers are not suitable for capturing rules from bag-of-words features, in alignment with the findings by Tuarob et al. on short-text classification [16]. Such word features would be more suitable for projection-based or generative classifiers, such as DMNB, NBM, SVM, and SG, which yielded much better overall results.

It is also worth noting that deep learning based methods (i.e., DNN and CNN) performed worse than SVM for both the USAGE and UTILIZATION schemes. While deep learning methods have been shown to work well for text classification tasks [43], they require a relatively large amount of labeled data to achieve optimal performance. Since the dataset used in this research comprises 8,796 samples, distributed across four classes, it may not be sufficient for DNN and CNN to automatically capture the wide variety of language styles used to compose these citation contexts.

4.2.3 Ensemble Classification
After we tested each feature type (i.e., context and content features) separately and compared the misclassification results from the best base classifiers for each feature type, we found that only 41 percent of the misclassified samples were mutually misclassified by both the context- and content-based classifiers. If we could combine the knowledge from both feature types, then the machine learners would be able to make ensemble decisions that correct each other's mistakes, minimizing the other 59 percent of exclusively misclassified samples.

We proposed three methods to combine the context and content features: combined features; majority voting of the best three base classifiers from each feature type; and a weighted average of the best classifier from each feature type. Table 9 summarizes the ensemble classification results for the USAGE scheme. The SVM classifier trained with the combined features yielded the highest FUX of 0.522, a 9.7 and 2.8 percent improvement over the best context-based and content-based classifiers, respectively. Such a classifier also yielded the best FAVG of 0.749. The second best method, in terms of FUX, was the weighted average of an SVM classifier trained with context features and an SVM classifier trained with content features, with FUX of 0.498.

It is worth noting that the RF classifier trained with the combined features (FUX = 0.130) performed significantly worse than when trained with only the context (FUX = 0.461) or content features (FUX = 0.244). While one might expect the combined features to enhance the performance of individual base learners by incorporating more information, this is not true for the RF classifier. Adding the context features to the pool of content features actually worsened the performance of the RF classifier, yet it still yielded the highest precision for the minority classes, namely USE and EXTEND, and achieved the highest recall in the majority class (i.e., MENTION). Such a phenomenon can be attributed to the fact that the content-feature pool is already too large and sparse for a tree-based classifier (i.e., RF) to handle effectively. Adding more features, although informative, not only makes the algorithm generate too many unusable rules, but also creates classification bias towards the majority class in multi-class classification.

The ensemble classification results for the UTILIZATION scheme are listed in Table 10. For the UTILIZATION scheme, the weighted average ensemble of a context-based RF classifier and a content-based SVM classifier yielded the best results, with FUTIL = 0.639, FNOTUTIL = 0.944, and FAVG = 0.905. Fig. 4 summarizes the FUX and FUTIL from the different ensemble methods for the USAGE and UTILIZATION schemes, respectively.

The results confirmed our conjecture that the ensemble decision of the context-based classifier and the content-based classifier would help to correct each other, making an overall better classification decision. The FUTIL achieved by the best configuration of the weighted average method was improved by 32.6 percent over the best FUTIL achieved by the best content-based classifier, and by 6.7 percent over the best context-based classifier. The optimal γ is 0.749, implying that the decision was still significantly influenced by the context-based classifier (i.e., 74.9 percent); however, the additional information offered by the content-based classifier contributed to increased classification efficacy. The majority voting of the three best context-based classifiers (RF, RIPPER, and NB) and the three best content-based classifiers (SVM, SG, and DMNB) performed second best in terms of FUTIL. The classifiers trained with the combined pool of features performed reasonably well, apart from the RF classifier. Consistent with the earlier findings, tree-based classifiers such as Random Forest are not suitable for learning from the massively sparse features extracted from our dataset.
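To make the weighted-average ensemble concrete, the sketch below (an illustration under assumptions, not the authors' implementation) combines the class-probability outputs of a context-feature classifier and a content-feature classifier as p = gamma * p_context + (1 - gamma) * p_content. The classifier pairing (RF on context features, SVM on content features) and gamma = 0.749 follow the best UTILIZATION configuration reported above; the feature matrices are assumed to be pre-computed, and the SVM kernel choice is an assumption.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    GAMMA = 0.749  # weight of the context-based classifier (best reported setting)

    def weighted_average_ensemble(X_ctx_train, X_txt_train, y_train,
                                  X_ctx_test, X_txt_test, gamma=GAMMA):
        # Base learners: RF on context features, SVM (with probability estimates)
        # on TF-IDF content features. Both are fitted on the same labels.
        ctx_clf = RandomForestClassifier(n_estimators=100, random_state=0)
        txt_clf = SVC(kernel="linear", probability=True, random_state=0)

        ctx_clf.fit(X_ctx_train, y_train)
        txt_clf.fit(X_txt_train, y_train)

        p_ctx = ctx_clf.predict_proba(X_ctx_test)
        p_txt = txt_clf.predict_proba(X_txt_test)

        # Weighted average of the two probability distributions; predict the arg-max class.
        p = gamma * p_ctx + (1.0 - gamma) * p_txt
        return ctx_clf.classes_[np.argmax(p, axis=1)]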

TABLE 9
Ensemble Classification Results for USAGE Scheme

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), F1 of classes USE and EXTEND (FUX ), area under ROC (ROCAVG ), area
under precision-recall curve (PRCAVG ), and training time (ms) are reported. X and T denote base classifiers trained with only context and content features,
respectively.

TABLE 10
Ensemble Classification Results for UTILIZATION Scheme

Ensemble Option                                     PNOTUTIL  PUTIL  RNOTUTIL  RUTIL  FNOTUTIL  FUTIL  FAVG
Combined Features NB                                0.964     0.372  0.801     0.798  0.875     0.507  0.828
Combined Features SVM                               0.927     0.663  0.964     0.486  0.944     0.561  0.896
Combined Features RF                                0.877     0.836  0.998     0.054  0.934     0.101  0.827
Majority Vote X(RF, RIPPER, NB), T(SVM, SG, DMNB)   0.945     0.603  0.938     0.630  0.941     0.613  0.899
Weighted Average X(RF), T(SVM), γ = 0.749           0.950     0.615  0.938     0.666  0.944     0.639  0.905

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), area under ROC (ROCAVG ), area under precision-recall curve (PRCAVG ), and
training time (ms) are reported. X and T denote base classifiers trained with only context and content features, respectively.

4.3 Impact of Context Features
One of the main technical contributions of this work is the proposed additional set of context features crafted to characterize the various algorithm usage types that can be captured from a citation context. The subsequent sections further investigate the impact of the different context-feature types and the impact of our proposed features on both the USAGE and UTILIZATION schemes.

4.3.1 Impact of Context Feature Types
The context features used in our proposed methodology may be divided into five types: Structure (XS); Lexical and Grammatical (XL); Cue Word (XC); Sentiment (XN); and Venue (XV) (refer to Section 3.2). In this section, a separate SVM classifier was evaluated with each of the context feature types and with the combined context features (X-ALL) to explore how each feature type contributes to the classification performance of each class. The findings explain not only how effective each context feature type is, but also how well each context-feature type captures the characteristics of the algorithm citation classes.

Table 11 reports the classification results of the various context feature types, trained with SVM classifiers, on the USAGE scheme. The XC features yielded the best average F1 of 0.749, and also the best F1 for classes USE, MENTION, and NOTALGO. Since most of the XC features were extracted using the presence of cue words, the results suggest that such indicating terms are the most effective features for characterizing the various algorithm usage types in our dataset. The XS features locate the citation context in the paper. These features can be particularly useful for detecting mentioned algorithm citations in literature review sections, which are typically located either at the beginning of the article or before the conclusion, depending on the format of the venue. Furthermore, such structural features can filter out a majority of remnant samples caused by erroneous citation-context parsing. Most of this remnant citation context comprises mishandled references from the reference sections, located at the end of most articles. Hence, as evidenced by the results in Table 11, the XS features are capable of detecting the MENTION class (with FMENTION of 0.787) and identifying some of the NOTALGO samples. The XL features encode the linguistic characteristics of a given citation context, including tense, Pascal-case words, length, citation symbol format, and named entities. The tense-based features could be useful in capturing mentioned algorithms, which are typically expressed in the past tense in literature review sections. The presence of Pascal-case words can indicate algorithm names, which are often expressed when authors describe the use of such algorithms. Hence, the XL features are particularly advantageous when used to capture the USE and MENTION classes. The XN features encode sentiments in different parts of a given citation context, and were found to be useful in capturing MENTION samples, as evidenced by FMENTION of 0.875. While it is customary to write scientific articles in sentiment-neutral language to present factual results and avoid unjustified opinions, recent studies have discovered that a majority of citation contexts do convey sentiment [34]. Most such opinion-containing citation contexts are found in the literature review sections, in which authors criticize previous algorithms. The XN features allow the classifier to detect such opinion-containing algorithm citation contexts and classify them as MENTION samples.

Fig. 4. Comparison of ensemble classification results for USAGE (FUX) and UTILIZATION (FUTIL) schemes.
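As a rough sketch of how a few of the context features discussed in this section might be extracted (the actual extractors are defined in Section 3.2; the cue-word lists and regular expressions below are purely illustrative assumptions), consider:

    import re

    # Hypothetical cue words signalling algorithm usage and extension (illustrative only).
    USE_CUES = {"use", "used", "using", "employ", "employed", "apply", "applied"}
    EXTEND_CUES = {"extend", "extended", "modify", "modified", "improve", "improved"}

    PASCAL_CASE = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")  # e.g., PageRank, AlgorithmSeer
    PAST_TENSE = re.compile(r"\b\w+ed\b")                    # crude past-tense heuristic

    def context_features(context):
        tokens = set(re.findall(r"\w+", context.lower()))
        return {
            "has_use_cue": bool(tokens & USE_CUES),
            "has_extend_cue": bool(tokens & EXTEND_CUES),
            "num_pascal_case": len(PASCAL_CASE.findall(context)),
            "num_past_tense": len(PAST_TENSE.findall(context)),
            "length_in_tokens": len(context.split()),
        }

    print(context_features("We extended the PageRank algorithm [2] and used it to rank authors."))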

TABLE 11
Classification Results for USAGE Scheme, Using Various Context-Feature Types, Trained with SVM Classifiers

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), F1 of classes USE and EXTEND (FUX ), area under ROC (ROCAVG ),
area under precision-recall curve (PRCAVG ), and training time (ms) are reported.

TABLE 12
Classification Results for UTILIZATION Scheme, Using Different Context-Feature Types, Trained with RF Classifiers

Features PNOTUTIL PUTIL RNOTUTIL RUTIL FNOTUTIL FUTIL FAVG ROCAVG PRCAVG TrainTime (ms)
XS 0.918 0.156 0.390 0.765 0.548 0.259 0.511 0.599 0.814 60563
XL 0.928 0.296 0.798 0.577 0.858 0.392 0.798 0.753 0.865 205108
XC 0.944 0.561 0.928 0.626 0.936 0.592 0.892 0.896 0.935 417697
XN 0.881 0.130 0.100 0.908 0.180 0.227 0.186 0.507 0.777 139884
XV 0.878 0.130 0.194 0.817 0.318 0.224 0.306 0.525 0.784 134979
X-ALL 0.947 0.558 0.925 0.647 0.936 0.599 0.892 0.901 0.938 385634

Precision, recall, and F1 for each class, along with weighted average F1 (FAVG ), area under ROC (ROCAVG ), area under precision-recall curve (PRCAVG ), and
training time (ms) are reported.

The XV features characterize the venue of the citing and cited work. Although such publisher-based features are not particularly useful for capturing the positive classes, they allow the classifier to recognize MENTION samples, with FMENTION of 0.769. In line with the analysis in [32], authors tend to mention most frequently those works that were previously published in the same venue. Therefore, this knowledge may allow the classifier to recognize a majority of the MENTION samples. While the various types of context features allow the classifier to learn different sets of contextual characteristics, the combination of the five types of contextual features (X-ALL) yielded the best FUX of 0.476, suggesting that the many types of context features extract different aspects of the citation contexts that may be additionally useful in combination. It is worth noting that, while the combined features (X-ALL) yielded the best FUX, a drop of 2.54 percent in FAVG from the XC feature set was observed. However, since we focused on the positive classes (i.e., USE and EXTEND), whose classification performance is mainly indicated by FUX, we concluded that the combined context-feature set is the most suitable for the algorithm citation-context dataset used in this research.

Besides the USAGE scheme, we also investigated the impact of the various context-feature types for the UTILIZATION scheme, using RF as the base classifier. The classification results are summarized in Table 12. Similar to the analysis for the USAGE scheme, the XC features were the most useful context-feature set, yielding the highest FUTIL of 0.592 and FAVG of 0.892 among the various context-feature types. Combining all the context-feature types allowed the classifier to achieve optimal performance, an increase in FUTIL of roughly 1.2 percent over the classifier trained with the XC features alone. Fig. 5 summarizes the FUX and FUTIL from the different feature types for the USAGE and UTILIZATION schemes, respectively.

4.3.2 Impact of Novel Features
Our research problem and scope are unlike previous citation classification works [15], [28], [32] for two reasons: 1) We wanted to identify the use of algorithms in the literature using information from the citation contexts, while earlier works aimed to identify general citation functions; and 2) We wanted to develop a method to handle citation contexts from heterogeneous time spans, sources, and types of scientific documents, including conference papers, journals, and technical reports, while earlier works proposed methods developed specifically for citation contexts extracted from documents published by a single venue with a tight time span. Therefore, we proposed a set of novel features that allows the classifier to identify various usages of existing algorithms in the citing article. Most of our proposed features are algorithm-specific cue words, sentiments, and modifications of previously proposed features (see Table 3). While the combination of our proposed features and earlier ones allowed the classifier to achieve optimal classification performance, this section reports an analysis of the efficacy of the proposed features compared to those previously proposed.

Table 13 lists the classification results for the USAGE scheme, using the context features proposed by Teufel et al. [15] (T 2006), Dong and Schäfer [28] (D 2011), Jurgens et al. [32] (J 2018), the combination of the earlier features (T+D+J), our proposed features, and a combination of both the earlier and proposed features (All), using SVM as the base classifier.

Fig. 5. Comparison of classification results using different feature types for USAGE (FUX) and UTILIZATION (FUTIL) schemes.

TABLE 13
Classification Results for USAGE Scheme, Using Various Context Features Proposed by Earlier Works
Trained with SVM Classifiers

T 2006, D 2011, J 2018, and T+D+J are the feature sets proposed by [15], [28], [32] and the combination of the three feature sets, respectively. Precision,
recall, and F1 for each class, along with weighted average F1 (FAVG ), F1 of classes USE and EXTEND (FUX ), area under ROC (ROCAVG ), area under
precision-recall curve (PRCAVG ), and training time (ms) are reported.

TABLE 14
Classification Results for UTILIZATION Scheme, Using Various Context Features Proposed by Different Works,
Trained with RF Classifiers

Features PNOTUTIL PUTIL RNOTUTIL RUTIL FNOTUTIL FUTIL FAVG ROCAVG PRCAVG TrainTime (ms)
T 2006 0.905 0.155 0.459 0.672 0.609 0.252 0.563 0.576 0.804 167921
D 2011 0.937 0.383 0.856 0.606 0.895 0.470 0.840 0.820 0.893 443440
J 2018 0.865 0.128 0.044 0.954 0.083 0.226 0.101 0.509 0.778 120989
T+D+J 0.938 0.418 0.876 0.607 0.906 0.495 0.853 0.842 0.907 388750
Proposed 0.948 0.522 0.911 0.660 0.929 0.583 0.884 0.893 0.932 310403
All 0.947 0.558 0.925 0.647 0.936 0.599 0.892 0.901 0.938 394287

T 2006, D 2011, J 2018, and T+D+J are the feature sets proposed by [15], [28], [32] and the combination of the three feature sets, respectively. Precision, recall,
and F1 for each class, along with weighted average F1 (FAVG ), area under ROC (ROCAVG ), area under precision-recall curve (PRCAVG ), and training time (ms)
are reported.

Both the T 2006 and J 2018 feature sets performed similarly, especially in terms of F1 measures. Specifically, such features are not effective in training a classifier to capture the positive classes, as evidenced by low FUX values of 0.044 and 0.010, respectively. An explanation for this phenomenon is that, while T 2006 and J 2018 features such as location, tense, and venue features are useful for identifying MENTION samples, they do not characterize well the USE and EXTEND samples, which constitute our positive classes. Furthermore, most of the J 2018 features are composed of patterns generated from a specific ACL corpus, which may not be generalizable to the wide range of writing styles and citation formats followed by authors in various venues, fields of study, and publication types. The D 2011 features performed much better than T 2006 and J 2018 in both FUX and FAVG, because most of the D 2011 features are cue words, some of which are capable of capturing the usage of the cited work. The combination of the earlier features (i.e., T+D+J) additionally helped the classifier achieve a higher classification performance than when trained with the individual feature sets.

Regardless, compared with the earlier features, our feature set performs better in terms of F1 in every class. Specifically, our features are more effective because: 1) the proposed cue words were specifically handcrafted to capture different algorithm usages from citation contexts; and 2) the proposed features were drawn from our dataset, which is characterized by the heterogeneous writing styles followed by the various fields of study and venues. As a result, our proposed features are capable of training the machine-learning classifier not only to distinguish effectively between algorithmic usages, but also to filter out those citation contexts that are unrelated to algorithms. Finally, the combination of our proposed features and the earlier ones (All) allows the classifier to achieve the optimal FUX of 0.476, while sacrificing roughly 2.4 percent of FAVG, mainly due to a marginal decrease in the ability to capture the MENTION samples.

Table 14 displays similar classification results for RF classifiers trained with the earlier features and our proposed features for the UTILIZATION scheme. Similar to the analysis above, our proposed features are more effective at representing the characteristics of an algorithm citation context, as evidenced by a higher FUTIL of 0.583 (a 17.8 percent improvement over the T+D+J features) and FAVG of 0.884 (a 3.6 percent improvement over the T+D+J features). Nonetheless, a combination of all the features achieves the highest FUTIL of 0.599 and FAVG of 0.892. Fig. 6 summarizes the FUX and FUTIL obtained using the context features proposed by the different works for the USAGE and UTILIZATION schemes, respectively.

Fig. 6. Comparison of classification results using various context features proposed by different works for USAGE (FUX) and UTILIZATION (FUTIL) schemes.

5 CONCLUSIONS AND FUTURE DIRECTIONS
Algorithms are essential parts of computer science and related disciplines. Authors engage with existing algorithms in various ways, such as using, extending, or simply mentioning them. The ability to automatically identify algorithm usage in scientific literature would not only allow us to discover influential and generalizable algorithms that could be extensively used to solve problems in diverse fields of study, but also enable us to construct a temporal algorithm usage network to study the evolution of algorithmic development over time. We conjecture that the algorithm usage in a scientific work, represented by a scholarly document, could be captured in citation contexts where authors discuss the use of existing algorithms. We frame the problem as a citation-context classification problem, where a citation context is classified according to two classification schemes: USAGE and UTILIZATION. The ground-truth dataset comprises 8,796 citation contexts, randomly selected from scholarly documents that represent a wide variety of temporal spans, document types, fields of study, and venues. We proposed a novel set of features that effectively characterize the usage of existing algorithms that can be extracted from a citation context. The best results for the USAGE scheme were achieved by an SVM classifier trained with combined content and context features. The best results for the UTILIZATION scheme were achieved by the weighted average heterogeneous ensemble classifiers, comprising an SVM classifier trained with content features and a Random Forest classifier trained with context features. Future work includes improving the quality of paper metadata extraction using deep learning based parsers [48], using the developed algorithm-citation context classifiers to discover algorithm usage in large-scale collections of scholarly documents, such as those hosted by the CiteSeerX digital library, and generating an algorithm usage network that would allow us to further examine and study how algorithms are developed and evolve through different temporal and thematic scales.

ACKNOWLEDGMENTS
This project is supported by the Office of Higher Education Commission (OHEC) Thailand and the Thailand Research Fund (TRF), through grant MRG6080252, and the Korean Foundation for Advanced Studies (KFAS) International Scholar Exchange Fellowship (ISEF) for the academic year of 2017-2018.

REFERENCES
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[2] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford InfoLab, Stanford, CA, 1999, http://ilpubs.stanford.edu:8090/422/.
[3] J. Wang, "Mean-variance analysis: A new document ranking theory in information retrieval," in Proc. Eur. Conf. Inf. Retrieval, 2009, pp. 4–16.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[5] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[6] S. Tuarob and C. S. Tucker, "Quantifying product favorability and extracting notable product features using large scale social media data," J. Comput. Inf. Sci. Eng., vol. 15, no. 3, 2015, Art. no. 031003.
[7] O. Frunza, D. Inkpen, and T. Tran, "A machine learning approach for identifying disease-treatment relations in short texts," IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 801–814, Jun. 2011.
[8] Q. Li, Y. Chen, J. Wang, Y. Chen, and H. Chen, "Web media and stock markets: A survey and future directions from a big data perspective," IEEE Trans. Knowl. Data Eng., vol. 30, no. 2, pp. 381–399, Feb. 2018.
[9] A. Karpatne, I. Ebert-Uphoff, S. Ravela, H. A. Babaie, and V. Kumar, "Machine learning for the geosciences: Challenges and opportunities," IEEE Trans. Knowl. Data Eng., vol. 31, no. 8, pp. 1544–1554, Aug. 2019.
[10] S. Bhatia and P. Mitra, "Summarizing figures, tables, and algorithms in scientific publications to augment search results," ACM Trans. Inf. Syst., vol. 30, no. 1, 2012, Art. no. 3.
[11] S. Tuarob, S. Bhatia, P. Mitra, and C. L. Giles, "Automatic detection of pseudocodes in scholarly documents using machine learning," in Proc. 12th Int. Conf. Document Anal. Recognit., 2013, pp. 738–742.
[12] S. Tuarob, S. Bhatia, P. Mitra, and C. L. Giles, "AlgorithmSeer: A system for extracting and searching for algorithms in scholarly big data," IEEE Trans. Big Data, vol. 2, no. 1, pp. 3–17, Mar. 2016.
[13] A. Sokolov, A. Sanyal, D. Whitley, and Y. Malaiya, "Dynamic power minimization during combinational circuit testing as a traveling salesman problem," in Proc. IEEE Congress Evolutionary Comput., 2005, vol. 2, pp. 1088–1095.
[14] S. Kopriva, D. Sislak, D. Pavlicek, and M. Pechoucek, "Iterative accelerated A* path planning," in Proc. 49th IEEE Conf. Decision Control, 2010, pp. 1201–1206.
[15] S. Teufel, A. Siddharthan, and D. Tidhar, "Automatic classification of citation function," in Proc. Conf. Empirical Methods Natural Lang. Process., 2006, pp. 103–110.
[16] S. Tuarob, C. S. Tucker, M. Salathe, and N. Ram, "An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages," J. Biomed. Inform., vol. 49, pp. 255–268, 2014.
[17] S. Bhatia, S. Tuarob, P. Mitra, and C. L. Giles, "An algorithm search engine for software developers," in Proc. 3rd Int. Workshop Search-Driven Develop.: Users, Infrastructure, Tools, Eval., 2011, pp. 13–16.
[18] S. Tuarob, P. Mitra, and C. L. Giles, "A hybrid approach to discover semantic hierarchical sections in scholarly documents," in Proc. 13th Int. Conf. Document Anal. Recognit., 2015, pp. 1081–1085.
[19] S. Tuarob, "Improving pseudo-code detection in ubiquitous scholarly data using ensemble machine learning," in Proc. Int. Comput. Sci. Eng. Conf., 2016, pp. 1–6.
[20] Z. Wu, J. Wu, M. Khabsa, K. Williams, H.-H. Chen, W. Huang, S. Tuarob, S. R. Choudhury, A. Ororbia, P. Mitra, et al., "Towards building a scholarly big data platform: Challenges, lessons and opportunities," in Proc. 14th ACM/IEEE-CS Joint Conf. Digital Libraries, 2014, pp. 117–126.
[21] S. Tuarob, L. C. Pouchard, P. Mitra, and C. L. Giles, "A generalized topic modeling approach for automatic document annotation," Int. J. Digital Libraries, vol. 16, no. 2, pp. 111–128, 2015.
[22] I. Safder, J. Sarfraz, S.-U. Hassan, M. Ali, and S. Tuarob, "Detecting target text related to algorithmic efficiency in scholarly big data using recurrent convolutional neural network model," in Proc. Int. Conf. Asian Digital Libraries, 2017, pp. 30–40.
[23] S. Tuarob, P. Mitra, and C. L. Giles, "Improving algorithm search using the algorithm co-citation network," in Proc. 12th ACM/IEEE-CS Joint Conf. Digital Libraries, 2012, pp. 277–280.
[24] S. Dongen, "Graph clustering by flow simulation," Ph.D. dissertation, Centers for Mathematics and Computer Science, University of Utrecht, 2000.
[25] I. Spiegel-Rosing, "Science studies: Bibliometric and content analysis," Social Stud. Sci., vol. 7, no. 1, pp. 97–113, 1977.
[26] E. Garfield, et al., "Can citation indexing be automated," in Proc. Symp. Statistical Assoc. Methods Mechanized Documentation, vol. 269, 1965, pp. 189–192.
[27] M. J. Moravcsik and P. Murugesan, "Some results on the function and quality of citations," Social Stud. Sci., vol. 5, no. 1, pp. 86–92, 1975.
[28] C. Dong and U. Schäfer, "Ensemble-style self-training on citation classification," in Proc. 5th Int. Joint Conf. Natural Lang. Process., 2011, pp. 623–631.
[29] C. Guo, Y. Yu, A. Sanjari, and X. Liu, "Citation role labeling via local, pairwise, and global features," Proc. Amer. Soc. Inf. Sci. Technol., vol. 51, no. 1, pp. 1–10, 2014.
[30] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[31] S.-U. Hassan, A. Akram, and P. Haddawy, "Identifying important citations using contextual information from full text," in Proc. 17th ACM/IEEE Joint Conf. Digital Libraries, 2017, pp. 41–48.
[32] D. Jurgens, S. Kumar, R. Hoover, D. McFarland, and D. Jurafsky, "Measuring the evolution of a scientific field through citation frames," Trans. Assoc. Comput. Linguistics, vol. 6, pp. 391–406, 2018.


[33] S. Tuarob, P. Mitra, and C. L. Giles, "A classification scheme for algorithm citation function in scholarly works," in Proc. 13th ACM/IEEE-CS Joint Conf. Digital Libraries, 2013, pp. 367–368.
[34] A. Athar and S. Teufel, "Context-enhanced citation sentiment detection," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2012, pp. 597–601.
[35] M. Thelwall, "The heart and soul of the web? Sentiment strength detection in the social web with SentiStrength," in Cyberemotions. Berlin, Germany: Springer, 2017, pp. 119–134.
[36] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proc. 11th Conf. Uncertainty Artif. Intell., 1995, pp. 338–345.
[37] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.
[38] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Mach. Learn., vol. 6, no. 1, pp. 37–66, 1991.
[39] W. W. Cohen, "Fast effective rule induction," in Proc. 12th Int. Conf. Mach. Learn., 1995, pp. 115–123.
[40] J. R. Quinlan, C4.5: Programs for Machine Learning. Amsterdam, Netherlands: Elsevier, 2014.
[41] A. Puurula, "Scalable text classification with sparse generative modeling," in Proc. Pacific Rim Int. Conf. Artif. Intell., 2012, pp. 458–469.
[42] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, 2015.
[43] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. 2014 Conf. Empirical Methods Natural Lang. Process., Oct. 2014, pp. 1746–1751.
[44] M. C. Mukkamala and M. Hein, "Variants of RMSProp and Adagrad with logarithmic regret bounds," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2545–2553.
[45] C. Caragea, J. Wu, A. Ciobanu, K. Williams, J. Fernandez-Ramirez, H.-H. Chen, Z. Wu, and L. Giles, "CiteSeerX: A scholarly big dataset," in Proc. Eur. Conf. Inf. Retrieval, 2014, pp. 311–322.
[46] I. G. Councill, C. L. Giles, and M.-Y. Kan, "ParsCit: An open-source CRF reference string parsing package," in Proc. Int. Conf. Lang. Resources Eval., 2008, vol. 8, pp. 661–667.
[47] J. Sim and C. C. Wright, "The kappa statistic in reliability studies: Use, interpretation, and sample size requirements," Phys. Therapy, vol. 85, no. 3, pp. 257–268, 2005.
[48] A. Prasad, M. Kaur, and M.-Y. Kan, "Neural ParsCit: A deep learning-based reference string parser," Int. J. Digital Libraries, vol. 19, no. 4, pp. 323–337, 2018.

Suppawong Tuarob received the BSE and MSE degrees, both in computer science and engineering, from the University of Michigan-Ann Arbor, and the MS degree in industrial engineering and the PhD degree in computer science and engineering, both from the Pennsylvania State University. Currently, he is an assistant professor of computer science at Mahidol University, Thailand. His research involves data mining in large-scale scholarly, social media, and healthcare domains by applying multiple cutting-edge techniques, such as machine learning, topic modelling, and sentiment analysis. He is a member of the IEEE.

Sung Woo Kang received the BS degree in industrial engineering from Inha University, Korea, the MS degree in industrial engineering from Myongji University, Korea, and the PhD degree in industrial and manufacturing engineering from the Pennsylvania State University. He is currently an assistant professor of industrial & management engineering, School of Engineering, Inha University. He is the director of the Technical Approach for Computing Trend Information & Convergence System (TACTICS) Laboratory. His research interests include data mining, massive data processing, product service design (which covers intelligent information), and knowledge management.

Poom Wettayakorn received the BS degree in information and communication technology from Mahidol University. His research interests include text mining, multivariate time-series forecasting, and a wide spectrum of image processing applications using deep-learning techniques.

Chanathip Pornprasit is working toward the graduate degree in computer science at Mahidol University. His research interests evolve around applications of advanced machine-learning techniques in text mining and natural language processing fields.

Tanakitti Sachati is working toward the graduate degree in computer science at Mahidol University. His expertise lies in large-scale data storage and manipulation, information retrieval, text mining, and machine-learning applications.

Saeed Ul Hassan received the PhD degree in the field of information management from the Asian Institute of Technology, Thailand. He is the director of the Scientometrics Lab and a faculty member at Information Technology University (ITU). His research interests lie within the areas of data science, scientometrics, bibliometric tools for evidence-based research policy formulation, information retrieval, and text mining. He is a member of the IEEE.

Peter Haddawy received the BA degree in mathematics from Pomona College, and the MSc and PhD degrees in computer science from the University of Illinois Urbana-Champaign. He is a professor with the Faculty of Information and Communication Technology, Mahidol University, and director of the Mahidol Bremen Medical Informatics Research Unit there. He is also an honorary professor of medical informatics at the University of Bremen. He was a tenured associate professor with the Department of EE&CS, University of Wisconsin-Milwaukee, professor of computer science and information management at the Asian Institute of Technology, as well as vice-president for academic affairs there, and served in the United Nations as director of UNU-IIST. He has been a Fulbright fellow, Hanse-Wissenschaftskolleg fellow, Avery Brundage Scholar, and Shell Oil Company fellow. His research interests are in the areas of artificial intelligence, applications of AI in medicine and public health, and scientometrics.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
