A Comparative Analysis of Text Similarity Measures and Algorithms in Research Paper Recommender Systems
Abstract—The increase in the number of research papers published online can be attributed to recent developments in internet and web technologies. However, researchers and online users have a difficult time finding relevant and accurate information due to the information explosion on the internet. In this paper, we seek to establish which combinations of algorithms and similarity metrics can be used to optimise the search and recommendation of articles in a research paper recommender system. Our investigation utilised non-linear classification algorithms with text similarity measures. An offline evaluation approach is used to determine the models' accuracy and performance, while various similarity metrics are assessed using available datasets. We utilise the Recursive PARTitioning (rpart), Random Forest and Boosted machine learning algorithms on research paper similarity evaluation datasets. The rpart algorithm generally performed well when compared to the Boosted and Random Forest algorithms, achieving an average accuracy of 80.73% and a time efficiency of 2.354628 seconds. The cosine similarity performed best when compared with the other similarity metrics. New similarity metrics and measures will be proposed. This paper establishes that there are better combinations of metrics and algorithms when attempting to develop models for research paper similarity evaluation and recommendation. Further challenges and open issues are identified.

Keywords— recommender systems; documents; data mining; information retrieval; big data; similarity metrics and measures

I. INTRODUCTION

The internet has brought about an increase in electronic documents online, further compelling textual categorization and classification of documents in various online repositories. At present, text mining, machine learning and natural language processing techniques and methodologies are employed to process the big data that is steadily overwhelming the internet user. Widespread research has also brought about an explosion of research papers and scientific outputs in the form of journals and scientific literature. Hence, researchers are determined to search for topics of interest that are related to their area of expertise. They might also be motivated by the need to include research papers as part of their citations, or to get information about new research fronts in a particular field.

The world wide web is nowadays used for conducting research and searching for relevant research papers [1]; on the other hand, knowledge is constantly being discovered by researchers and is captured and stored in various repositories around the world, leading to information overload [2]. Hence, to address these challenges and to satisfy users' informational needs, researchers are exploiting approaches from recommender systems and techniques from information search and retrieval [3]. However, the task of finding related or similar papers is not always easy, and therefore data mining algorithms, document recommendation techniques and information retrieval methods are combined and applied to research paper features to find the most similar, relevant and important documents for a researcher. Machine learning algorithms, data mining methods and natural language processing work together to automatically discover, classify and recommend electronic documents [4].

Feature engineering is the combination of multiple features which might be predictors in a model, and this brings about more effectiveness than using single predictors; the ratio of two predictors is highly likely to be more effective than utilising two or more independent predictors [5]. Term similarity is a concept used to check whether two documents are similar by measuring the similarity of their terms. Other possible measures include the length of the document, the number of common terms, usual or unusual terms, and the number of times a term appears. Text mining necessitates the extraction of textual features from documents to facilitate operations like document retrieval, classification and summarization [4]. Hence we seek to utilise research paper term features to find similarities between papers.

The focal objectives of this work include: first, to perform a thorough review of scientific publications on document search and retrieval, highlighting the similarity measures that are used; secondly, to test and evaluate information search and retrieval models; thirdly, to establish similarity metrics that support our claims; and finally, to discuss future directions and recommendations. The contribution of this article is the construction of a classifier that can be used to predict class labels of research papers accurately.
ISBN 978-1-5386-1001-5/18/$31.00 ©2018 IEEE
Authorized licensed use limited to: T.C. Cumhurbaskanligi Kutuphanesi. Downloaded on January 31,2025 at 10:34:59 UTC from IEEE Xplore. Restrictions apply.
II. LITERATURE REVIEW

According to [6], similarity between documents can be partitioned into three categories: string-based (character-based and term-based), corpus-based, and knowledge-based (similarity, relatedness). This work utilises term-based similarity measures, i.e. cosine similarity, Euclidean distance, Jaccard similarity and the Pearson correlation coefficient. The authors of [7] performed a comparative systematic study of similarity measures for online documents. They compared four similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with a variety of clustering techniques (k-means, weighted graph partitioning, hyper-graph partitioning, self-organising feature map and random). In their work, the cosine similarity metric performed better than the rest, and weighted graph partitioning outperformed the other clustering techniques. Our research follows a different direction in that it concentrates on the same four similarity measures, but uses them in conjunction with classification techniques (the rpart, boosted and random forest algorithms), and similarly compares how the four similarity measures perform with classification algorithms.

A. Huang [8] investigated partitional clustering algorithms against hierarchical clustering schemes and established that partitional clustering algorithms performed better. Further, similarity measures were utilised to compare and analyse their effectiveness on document clustering. Those experiments established that three components ultimately affect the final result in a text clustering scenario: the objects, the distance or similarity measure used, and the clustering algorithm employed. It has also been reported that, given the diverse set of distance and similarity measures available in data mining, their effectiveness in text classification is still not very clear.

Aggarwal et al. [9] combined corpus-based semantic relatedness with knowledge-based semantic similarity to calculate the degree of similarity between two texts. All the scores obtained were fed into linear regression and bagging machine learning algorithms. Their experiments showed a significant improvement when similarity-based and knowledge-based measures were fused together. Our experiment utilises term-based similarity measures. The terms can be extracted from research paper titles, abstracts, tags etc. [1]. In this work, the titles of the research papers were utilised to calculate similarity between papers.

III. PROBLEM DEFINITION

A research paper document can be represented as a vector $\vec{d}$, which is contained in a collection of research papers represented as a corpus $c$ that in turn contains terms $t$; each term can be denoted by $\vec{t}$, a single unit vector in the corpus. Each unit vector enclosed within a corpus should be orthogonal to the other unit vectors within the same corpus. Hence the collection of all unit vectors $\vec{t}$ in a corpus comprises the formal space of the corpus, corresponding to all the terms contained in the dictionary. This space can be called a research-paper space. Therefore a research-paper vector $\vec{d}$ can be defined as

$$\vec{d} \equiv \sum_{t} w(d,t)\,\vec{t} \qquad (1)$$

where $w(d,t)$ is a function that measures the importance of term $t$ in the research paper corpus.

A. Defining the importance function

With a wide range of ways to quantify and measure the importance of words in a corpus, we select the frequency of term $t$ in the research paper document $d$ as the importance function $w(d,t)$. It is however important to note that the vector $\vec{d}$ will not express the full information about the document, as some information will not be captured during the computation of the importance function. A variety of measures have to be applied to each document in preparation for the activities that determine the importance function.

B. Pre-processing data

In any research paper or document, stop words (e.g. "the", "is", "an") form part of the unimportant terms that are not useful and habitually reduce the performance of computing systems. On the other hand, rare words or terms in a research paper or document are less frequent and should be considered important. Zipf's law states that "the frequency with which a word appears is inversely proportional to its rank of importance". It takes the following form:

$$f \propto \frac{1}{r} \qquad (2)$$

where $f$ is the frequency of a word and $r$ is its rank. Term weighting based on this observation dampens the importance of commonly available terms while increasing the importance of rare terms. Terms were stemmed using Porter's stemming algorithm [10]. To define the enhanced concept of importance, let $N$ be the total number of research papers in the corpus. Let $df_t$ be the number of research paper documents in which the term $t$ appears, known as the research-paper document frequency for term $t$. Let $tf_{d,t}$ represent the number of times a term $t$ appears in a research paper document $d$, known as the term frequency. The term frequency–inverse document frequency (tf-idf) for the representation above is then defined as

$$w(d,t) \equiv tf_{d,t} \times \log_2\!\left(\frac{N}{df_t}\right) \qquad (3)$$

where the $\log_2$ is used to dampen the effect of the inverse document frequency (IDF).

C. Finding documents similar to a query

To find a similar or relevant document, the target paper's metadata is used as a query $q$, which can be defined as a vector $\vec{q}$. It should be noted that the query is itself a research paper, and can therefore also be represented as a document vector:

$$\vec{q} \equiv \sum_{t} w(q,t)\,\vec{t} \qquad (4)$$
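As a concrete illustration of the importance function in equation (3), the minimal sketch below computes $w(d,t)$ for a toy corpus of titles. The corpus, the whitespace tokenisation and the function name are illustrative assumptions; the paper's pipeline additionally applies stop-word removal and stemming, which are omitted here.

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """Compute w(d, t) = tf_{d,t} * log2(N / df_t) for every document."""
    docs = [doc.lower().split() for doc in corpus]
    N = len(docs)
    # df_t: number of documents in which term t appears
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)  # tf_{d,t}: raw count of t in this document
        weights.append({t: tf[t] * math.log2(N / df[t]) for t in tf})
    return weights

corpus = [
    "clustering of text documents",
    "similarity measures for text",
    "research paper recommender systems",
]
w = tfidf_weights(corpus)
```

Note that a term appearing in every document receives weight 0, since $\log_2(N/N) = 0$; rarer terms are weighted more heavily, as intended.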
D. Relevance measure

A similarity measure determines how close or separate a query vector $\vec{q}$ is from a document vector $\vec{d}$ in the corpus, and choosing a suitable similarity measure for the classification algorithms is central to designing an efficient and effective research paper recommender system.

IV. METRICS

With the availability of various distance measures, we need to qualify the measures to be used as authentic metrics using the following conditions. Let the metric space be a set X having a distance function d that assigns a real number d(x, y) to every pair x, y ∈ X, satisfying the axioms:
• d(x,y) ≥ 0, implying that the distance between any two points must be a nonnegative value.
• d(x,y) = 0 ⇔ x = y, implying that the distance between two objects can be zero if and only if they are identical.
• d(x,y) = d(y,x), implying that the distance between two points is the same irrespective of the point from which it is measured.
• d(x,y) + d(y,z) ≥ d(x,z), implying that the sum of any two sides must be greater than or equal to the length of the remaining side.

A. Cosine similarity

The cosine similarity metric satisfies the four axioms mentioned above, qualifying it as a true metric. When documents are represented as vectors, the degree of similarity can be measured as the correlation between the two documents, usually quantified as the cosine of the angle between the two vectorised documents [8]. To find the similarity between the document $\vec{d}$ and the query $\vec{q}$, we consider the angle θ between the query $\vec{q}$ and the research paper document $\vec{d}$, using the following equation:

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}| \times |\vec{d}|} \qquad (5)$$

The function cos θ measures the similarity between the query (target paper) terms and the document (research paper) terms, where $\vec{q}$ and $\vec{d}$ are m-dimensional vectors over the set of terms $T = \{t_1,\ldots,t_m\}$.

B. Euclidean distance

Let D represent the distance measure between a query $\vec{q}$ and a document $\vec{d}$. The Euclidean distance between the query terms and the document terms can be represented as:

$$D(\vec{q},\vec{d}) = \left|\vec{q}-\vec{d}\right| \qquad (6)$$

where the term set $T = \{t_1,\ldots,t_m\}$ is as defined previously and the tf-idf values are used as the term weights, for instance $w_{q,t} = tfidf(q,t)$ and $w_{d,t} = tfidf(d,t)$.

C. Jaccard coefficient

This similarity measure is sometimes referred to as the Tanimoto coefficient; it measures similarity as the intersection of the objects divided by the union of the objects:

$$SIM_J(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|^2 + |\vec{d}|^2 - \vec{q}\cdot\vec{d}} \qquad (7)$$

This measure ranges from zero (0) to one (1). It is 1 when $\vec{q} = \vec{d}$, implying that the two documents are the same, and 0 if $\vec{q}$ and $\vec{d}$ are disjoint, meaning that the two documents are totally dissimilar. The corresponding distance measure is $D = 1 - SIM_J$.

D. Pearson correlation coefficient

The Pearson correlation coefficient is another metric that can be used to measure the relationship between textual term features in document classification. Given a set of terms $T = \{t_1,\ldots,t_m\}$, the Pearson correlation metric can take the following form:

$$SIM_P(\vec{q},\vec{d}) = \frac{m\sum_{t} w_{q,t}\,w_{d,t} - TF_q \times TF_d}{\sqrt{\left[m\sum_{t} w_{q,t}^2 - TF_q^2\right]\left[m\sum_{t} w_{d,t}^2 - TF_d^2\right]}} \qquad (8)$$

where $TF_q = \sum_t w_{q,t}$ and $TF_d = \sum_t w_{d,t}$ are the total term weights of the query (research paper) and the document(s) of the corpus respectively. The Pearson correlation ranges from +1 to −1, where +1 means a very strong positive relationship, 0 means no relationship, and −1 means a very strong negative relationship.

V. EXPERIMENTS AND DISCUSSION

The process of conducting experiments to determine which algorithm and similarity measure should be used to get better performance is not a trivial task, since the experiments may be subjective depending upon the dataset type and the experimental conditions.

A. Datasets

This work utilised offline evaluation methods due to the benefits of dataset availability, affordability, short experiment duration and the suitability of the method in determining the accuracy of a recommender system. The dataset used for the similarity examination provides ground truth established from annotations on 220 artificial intelligence (AI) research paper documents, produced by eight (8) experts in the field of AI. Using term frequency–inverse document frequency together with the cosine similarity measure, the 30 most similar papers in a collection of 16597 were found for each of the 220 papers. Figure III-1 shows the labels/tags that were used for the dataset; they were either similar (positive) or dissimilar (negative). To evaluate any research paper similarity method that has been developed, each paper in testids.txt should be compared against the 30 other papers in the evaluation.txt file. The file documens.txt contains all the records about the papers used in the experiment.
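The four measures of Section IV, applied to tf-idf term weights as in equations (5)–(8), can be sketched in Python as follows. The weight vectors are illustrative values rather than the paper's data, and guards against zero-length vectors are omitted for brevity.

```python
import math

# Each document is an m-dimensional list of tf-idf weights over the
# shared term set T = {t1, ..., tm} (illustrative values only).

def cos_sim(q, d):      # equation (5): q.d / (|q| |d|)
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in d)))

def euclidean(q, d):    # equation (6): |q - d|
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))

def jaccard(q, d):      # equation (7), Tanimoto form
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (sum(a * a for a in q) + sum(b * b for b in d) - dot)

def pearson(q, d):      # equation (8), with TF_q and TF_d the total weights
    m = len(q)
    tf_q, tf_d = sum(q), sum(d)
    num = m * sum(a * b for a, b in zip(q, d)) - tf_q * tf_d
    den = math.sqrt((m * sum(a * a for a in q) - tf_q ** 2) *
                    (m * sum(b * b for b in d) - tf_d ** 2))
    return num / den

q = [0.0, 1.0, 2.0, 0.5]
d = [0.0, 1.0, 2.0, 0.5]
```

For identical vectors, cosine, Jaccard and Pearson all evaluate to 1 and the Euclidean distance to 0, matching the boundary behaviour stated for equations (5)–(8).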
B. Evaluation

To evaluate the performance of our system we utilised a dataset from the department of computer science at Ghent University that is used to evaluate the similarity of research papers. There are three major evaluation methodologies in the field of research paper recommender systems: offline evaluations, online evaluations and user studies [11]. The algorithms split the dataset into training and testing sets in a ratio of 70% for training to 30% for testing.

Figure III-1: Proportion of research papers annotated by experts

The similarity between the research papers was computed using the cosine similarity. Having measured the cosine similarity, the top-k most similar papers can be collected.

C. Experimental results

The purpose of the experiments conducted was to test the performance of the data mining algorithms that are going to be used to develop our research paper recommender system. We tested three algorithms (Random Forest, Recursive Partitioning and Boosted tree) for their accuracy and efficiency. When we evaluated the performance of all three algorithms, the rpart algorithm proved to be markedly more time-efficient, with accuracy comparable to that of its two counterparts. Further accuracy prediction was conducted, and the rpart machine learning algorithm was selected and used in the classification of research papers due to its shortest running time to classify the datasets. The performance of the algorithms is shown in Table III-1.

Table III-1: Performance of algorithms

                         RF           Rpart        Boosted
Time efficiency          39.83342 s   2.354628 s   41.35908 s
Model accuracy           80.38 %      80.73 %      83.20 %
Area under curve (ROC)   0.6201       0.6201       0.7741
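The 70/30 evaluation protocol described above can be sketched as follows. The paper's experiments use R implementations (rpart, random forest, boosted trees); this language-neutral Python sketch only illustrates the split and the accuracy computation, and the function names and fixed seed are illustrative assumptions.

```python
import random

def train_test_split(records, train_ratio=0.7, seed=42):
    """Shuffle the records and split them into training and testing sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Fraction of predicted class labels that match the ground truth."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# 100 dummy record ids, split 70% / 30% as in the paper's evaluation
records = list(range(100))
train, test = train_test_split(records)
```

A classifier fitted on `train` would then be scored with `accuracy` on `test`, mirroring the model-accuracy row of Table III-1.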
D. Discussion

The rationale for choosing the rpart machine learning algorithm over the rest of the algorithms is its time performance and classification accuracy. The boosted tree's performance was impressive when it came to model accuracy: it scored 83.2% with an AUC of 0.7741. However, it took 41.35908 seconds to process, which is too long. This is also the case with the random forest algorithm, which had a model accuracy of 80.38% and an AUC of 0.6201 but took 39.83342 seconds. The rpart algorithm scored 80.73% on model accuracy and 0.6201 on AUC, with the shortest time duration of 2.354628 seconds.

The top-N most similar research papers for a query can be ranked by computing the angle θ for every document and then returning the N research papers with the smallest angles. A function that computes the degree of similarity between documents is referred to as a similarity measure. With similarity measures, it is possible to rank documents in order of importance, control the number of retrieved documents through the enforcement of a threshold, and reformulate relevance feedback.

Using decision trees for classification with few tests often leads to poor performance in text classification [4]. Partitional clustering algorithms are known to be more suitable for processing large datasets than hierarchical clustering algorithms [8], and hence we propose a system that will automatically detect the research paper being read or viewed by a researcher and, based on the activities on that paper, extract its important features and use them to compute its similarity with other documents. More advanced models like latent semantic analysis will be used in future experiments, since they can classify documents as belonging to the same class even when they do not share similar words and terms.

REFERENCES

[5] M. Kuhn and K. Johnson, Applied Predictive Modeling, 2013. Available: http://dx.doi.org/10.1007/978-1-4614-6849-3
[6] W. H. Gomaa and A. A. Fahmy, "A survey of text similarity approaches," International Journal of Computer Applications, vol. 68, 2013.
[7] A. Strehl and J. Ghosh, "Impact of similarity measures on web-page clustering," 2000.
[8] A. Huang, "Similarity measures for text document clustering," in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, 2008, pp. 49-56.
[9] N. Aggarwal, K. Asooja, and P. Buitelaar, "DERI&UPM: Pushing corpus based relatedness to similarity: Shared task system description," in Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, 2012, pp. 643-647.
[10] J. B. Lovins, "Development of a stemming algorithm," 1968.
[11] J. Beel and S. Langer, "A comparison of offline evaluations, online evaluations, and user studies in the context of research-paper recommender systems," in International Conference on Theory and Practice of Digital Libraries, 2015, pp. 153-168.