Know-Center
Inffeldgasse 13/VI, 8010 Graz
{pkraker, aenkhbayar}@know-center.at
Abstract
In a scientific publishing environment that is increasingly moving online,
identifiers of scholarly work are gaining in importance. In this paper, we
analysed identifier distribution and coverage of articles from the discipline of
quantitative biology using arXiv, Mendeley and CrossRef as data sources.
The results show that when retrieving arXiv articles from Mendeley, we were
able to find more papers using the DOI than the arXiv ID. This indicates that
DOI may be a better identifier with respect to findability. We also find that
coverage of articles on Mendeley decreases in the most recent years, whereas
the coverage of DOIs does not decrease in the same order of magnitude. This
hints at the fact that there is a certain time lag involved, before articles are
covered in crowd-sourced services on the scholarly web.
Keywords: scholarly identifiers, pre-prints, arXiv, DOI, readership
Introduction
In a scientific publishing environment that is increasingly moving online,
identifiers of scholarly work are gaining in importance. With the advent of
pre-print archives, there is often more than one version of an article available
and these versions may be hosted in various places around the web. Scholarly
communication is no longer limited to articles alone, but it also takes place in
different forms on various social media platforms. Identifiers are therefore
crucial for disambiguation and traceability of scholarly articles and their reception.
The need for persistent identifiers is often mentioned in the literature (see
e.g. Davidson & Douglas, 1998; Bourne & Fink, 2008) and consequently, a
variety of identifier systems have been proposed (see e.g. Van De Sompel et
al., 2001; Warner, 2010). Prominent examples of identifiers on the article
level are the Digital Object Identifier or DOI (DOI Foundation, n.d.) and the
arXiv ID. Notable identifiers on the author level are ORCID (Haak et al., 2012) and ResearcherID (Thomson-Reuters,
n.d.). Some of the most longstanding identifiers predate the digital age, including the International Standard Book Number (ISBN) and the International Standard Serial Number (ISSN).
Despite their importance, little is empirically known about the coverage and
distribution of scholarly identifiers, and how they propagate on the scholarly
web. In our work, we address this gap in the scientometric literature. Specifically, our research was guided by the following research questions:
How are scholarly identifiers distributed in crowd-sourced systems, e.g. pre-print archives and online reference management systems?
Which identifier combinations are the most common? Who are the top providers of identifiers?
Does the provision of different identifiers have an influence on the findability of scientific publications in other bibliographic and bibliometric sources?
We used three data sources: (i) arXiv, a pre-print archive, (ii) CrossRef, a metadata and linking service, and (iii) Mendeley, an online reference management system.
The data collection pipeline is shown in Figure 1. First, we collected
metadata on all publicly available arXiv articles in quantitative biology. In all
cases, the most recent upload to arXiv was used and all older entries were
discarded. This resulted in n=14,195 metadata records. Quantitative biology
represents a medium-to-small collection on arXiv. The collected metadata
includes: arXiv ID, DOI (optional), title, authors, year, and journal (optional).
This data was sourced on 17.11.2014 and was used as a basis for all following steps. First, the initial data set was divided into entries with a DOI
(n=5,215 entries, 36.7%) and without a DOI (n=8,980 entries, 63.3%). arXiv
is primarily used as a way to disseminate pre-prints, and not all authors add a
DOI to the arXiv record after an article has been published. Therefore, we
performed a CrossRef metadata lookup in order to acquire additional DOIs.
We used the following metadata to search for an entry: title, author, journal,
and year.
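The lookup step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the public CrossRef REST API (`https://api.crossref.org/works` with the `query.bibliographic` and `query.author` parameters), which may differ from the interface used in 2014. The function name `build_crossref_query` and the sample metadata are hypothetical. The sketch only builds the request URL and performs no network access.

```python
from urllib.parse import urlencode

# Public CrossRef REST endpoint (an assumption; the paper does not name the interface used).
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(title, authors, journal=None, year=None, rows=3):
    """Build a CrossRef query URL from the arXiv metadata fields (hypothetical helper)."""
    # Title, journal and year are combined into one free-text bibliographic query.
    parts = [title]
    if journal:
        parts.append(journal)
    if year:
        parts.append(str(year))
    params = {
        "query.bibliographic": " ".join(parts),
        "query.author": " ".join(authors),
        "rows": rows,  # number of candidate records to return
    }
    return CROSSREF_WORKS + "?" + urlencode(params)


# Made-up metadata record, for illustration only.
url = build_crossref_query("An example title", ["Doe"], journal="Example Journal", year=2012)
print(url)
```

The candidates returned by such a query would still need to be checked against the arXiv record before a DOI is accepted, for example by comparing titles.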
With this procedure, we found DOIs for an additional 1,885 entries, bringing
the number of entries with a DOI up to 7,100 (50.02%). We then attempted
to retrieve the corresponding documents for all entries on Mendeley. We
used either the arXiv ID or both the DOI and the arXiv ID to locate the document. If both arXiv ID and DOI yielded a result on Mendeley, the Mendeley
IDs were compared. If they did not match, we used the result that contained
additional identifier fields, e.g. a PubMed ID, if available. If both results
contained the same number of articles, we chose the item found with the
DOI.
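The selection rule above can be sketched as follows. The record layout (a dict with an `id` and an `identifiers` mapping) and the names `count_identifiers` and `choose_result` are hypothetical. The tie rule is read here as "equal number of identifier fields", which is an interpretation of the text rather than a confirmed detail.

```python
# Identifier fields assumed to be present on a Mendeley catalog record.
ID_FIELDS = ("arxiv", "doi", "isbn", "issn", "pmid", "scopus", "ssrn")


def count_identifiers(doc):
    """Count the non-empty identifier fields of a record."""
    identifiers = doc.get("identifiers", {})
    return sum(1 for field in ID_FIELDS if identifiers.get(field))


def choose_result(by_arxiv, by_doi):
    """Pick one record from the arXiv ID lookup and the DOI lookup."""
    if by_arxiv is None:
        return by_doi
    if by_doi is None:
        return by_arxiv
    if by_arxiv["id"] == by_doi["id"]:
        return by_doi  # both lookups returned the same document
    # Different documents: prefer the one with more identifier fields.
    if count_identifiers(by_arxiv) > count_identifiers(by_doi):
        return by_arxiv
    return by_doi  # the DOI result wins when it has more fields or on a tie


# Made-up records, for illustration only.
hit_arxiv = {"id": "a1", "identifiers": {"arxiv": "1234.5678", "pmid": "100"}}
hit_doi = {"id": "d1", "identifiers": {"doi": "10.1000/example"}}
chosen = choose_result(hit_arxiv, hit_doi)
print(chosen["id"])
```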
Finally, we compared the arXiv ID of the obtained Mendeley document with
the original arXiv entry. If the obtained Mendeley document did not provide
one, the two titles were compared using approximate string matching in order
to ascertain matching documents.
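A minimal sketch of such a title comparison, using `difflib.SequenceMatcher` from the Python standard library. The paper does not state which string matching method or threshold was used, so both the similarity measure and the cut-off of 0.9 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def titles_match(title_a, title_b, threshold=0.9):
    """Return True if the two titles are similar enough (threshold is an assumption)."""
    ratio = SequenceMatcher(None, normalize(title_a), normalize(title_b)).ratio()
    return ratio >= threshold


# Titles that differ only in case and spacing count as a match after normalization.
print(titles_match("Exploring  Coverage of Identifiers", "exploring coverage of identifiers"))
```

A threshold below 1.0 tolerates small differences such as punctuation or encoding variations between the arXiv title and the Mendeley title, at the cost of a few false matches.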
After this procedure, we arrived at a final set of n=11,570 articles that could be
found on Mendeley (81.5%). For these articles, we retrieved basic readership
data and identifier data. Available identifiers on Mendeley are [1]:
arxiv: arXiv ID
doi: Digital Object Identifier (DOI)
isbn: International Standard Book Number (ISBN)
issn: International Standard Serial Number (ISSN)
pmid: PubMed ID (assigned to publications indexed in PubMed)
scopus: Scopus ID (assigned to publications indexed in Scopus)
ssrn: Social Science Research Network (SSRN) ID
[1] see http://dev.mendeley.com/methods/#catalog-documents
Results
Identifier distribution in arXiv and findability on Mendeley
Table 1 sums up the basic results of the crawling process. Of the 14,195
unique articles, 36.7% had a DOI on arXiv. Using CrossRef, an additional
1,885 DOIs could be found, bringing the share of articles with a DOI up to
50.02%. 11,570 articles (81.5%) could finally be found on Mendeley.
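The shares reported above follow directly from the counts. The short check below recomputes them; the number of DOIs already present on arXiv is obtained by subtracting the CrossRef additions from the final DOI count. Variable names are illustrative.

```python
# Counts reported in the text.
total_arxiv = 14195        # unique quantitative biology records
doi_after_lookup = 7100    # records with a DOI after the CrossRef lookup
doi_from_crossref = 1885   # DOIs added by the CrossRef lookup
found_on_mendeley = 11570  # records located on Mendeley

# DOIs already present on arXiv follow by subtraction.
doi_on_arxiv = doi_after_lookup - doi_from_crossref


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for label, count in [
    ("DOI on arXiv", doi_on_arxiv),
    ("DOI added via CrossRef", doi_from_crossref),
    ("DOI after lookup", doi_after_lookup),
    ("found on Mendeley", found_on_mendeley),
]:
    print(f"{label}: {count:,} ({share(count, total_arxiv):.2f}%)")
```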
There was a difference in findability with respect to whether we used a DOI
or the arXiv ID to search for the articles on Mendeley (see also Table 3). Of
the 14,195 articles, 72.6% could be retrieved on Mendeley using the arXiv
ID. In contrast to that, 91.4% of the 7,100 articles with a Digital Object Identifier (either on arXiv or via metadata lookup on CrossRef) could be found on
Mendeley using the DOI.
One reason could be that records with a DOI represent
articles that have eventually been published in a journal. In order to test this
assumption, we analysed the registrants for all entries with a DOI (7,100
articles). We used a list of DOI registrants by Alf Eaton [2] with manual extensions to identify registrants. The results confirm our assumption (see Table
2). The top registrants are established publishers such as Elsevier and
Springer. These publishers usually assign DOIs to articles published in their
journals and books, in contrast to archives such as figshare, which assign a
DOI to any submitted article regardless of whether it was published in a
journal or not.
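One way to identify registrants is by DOI prefix, i.e. the part of the DOI before the first slash. The sketch below assumes that the registrant list is keyed by prefix. The mapping shown is a small illustrative excerpt, not the list used in the study, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative excerpt of a prefix-to-registrant mapping (not the full list).
REGISTRANTS = {
    "10.1103": "American Physical Society",
    "10.1016": "Elsevier",
    "10.1007": "Springer",
    "10.1371": "Public Library of Science",
}


def doi_prefix(doi):
    """Return the registrant prefix, i.e. everything before the first slash."""
    return doi.strip().lower().split("/", 1)[0]


def count_registrants(dois):
    """Count DOIs per registrant; unknown prefixes are grouped as 'Other'."""
    return Counter(REGISTRANTS.get(doi_prefix(doi), "Other") for doi in dois)


# Two DOIs taken from the reference list of this paper.
counts = count_registrants(["10.1371/journal.pcbi.1000247", "10.1087/20120404"])
print(counts.most_common())
```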
[2] see https://gist.github.com/hubgit/5974843
CrossRef: additional DOIs    1,885 (13.3%)
Mendeley: found              11,570 (81.5%)
# DOIs    Percentage
1,507     21.2%
1,029     14.5%
668       9.4%
502       7.1%
439       6.2%
335       4.7%
217       3.1%
194       2.7%
180       2.5%
141       2.0%
1,888     26.6%
7,100     100%
To eliminate effects that relate to the nature of the article that has been posted
on arXiv (whether it stayed a pre-print or went on to become a journal article), we also compared findability for articles that have both a DOI and an
arXiv ID (see Table 3). We also found a difference in these cases: 91.4% of
articles with a DOI could be found using the very same identifier, whereas
only 71.4% of articles with a DOI could be found with the arXiv ID. The
lowest findability was observed for articles with no DOI: of the 7,095 articles
with no DOI, only 69.0% were retrieved using the arXiv ID.
Table 3 (header labels: n, arXiv ID & DOI, arXiv ID, Sum): 6,492 (91.44%)
Another interesting finding among the top providers is that the American Physical Society, which is, among other things, working to advance and diffuse
the knowledge of physics through its outstanding research journals [3], is the
top registrant for DOIs in quantitative biology. One reason
could be that arXiv allows authors to assign more than one category to
each article. The analysis of article categories (see Table 4) shows that quantitative biology is the primary discipline for only 61.4% of articles with a
DOI (4,358 articles). 30.7% (2,178 articles) are assigned to a primary category that falls into the discipline of physics. This indicates a high number of
interdisciplinary articles in the sample.
Figure 2 shows the distribution of articles from 1992 to 2013. There is a
strong, at times exponential, increase in the number of articles. The coverage
on Mendeley, however, has declined for the most recent articles, as can be seen
in Figure 3. The percentage of articles with a DOI does not decrease in the
same order of magnitude.
[3] see http://www.aps.org/about/index.cfm
Number of articles    Percentage
4,358                 61.4%
2,178                 30.7%
247                   3.5%
211                   3.0%
105                   1.5%
1                     0.0%
7,100                 100.0%
Note that we left ISBN out of this analysis, because the metadata quality
was very poor with respect to this field on Mendeley.
                   arxiv            doi              scopus           pmid             issn
frequency          10,351 (89.5%)   8,321 (71.9%)    8,409 (72.7%)    5,477 (47.3%)    8,119 (70.2%)
mean readership    20.4             25.4             25.4             32.4             25.9
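Figures of this kind can be derived from the retrieved Mendeley records roughly as follows. This is a sketch under stated assumptions: the record layout (`identifiers` mapping, `reader_count` field) and the function name `identifier_stats` are hypothetical, and mean readership is taken to be the average reader count over all documents that carry the respective identifier, which the text does not spell out.

```python
from collections import defaultdict

# Identifier fields included in the analysis (ISBN is left out, as noted above).
FIELDS = ("arxiv", "doi", "scopus", "pmid", "issn")


def identifier_stats(docs):
    """Compute frequency, share and mean readership per identifier field."""
    counts = defaultdict(int)
    readers = defaultdict(int)
    for doc in docs:
        for field in FIELDS:
            if doc["identifiers"].get(field):
                counts[field] += 1
                readers[field] += doc["reader_count"]
    total = len(docs)
    stats = {}
    for field in FIELDS:
        n = counts[field]
        stats[field] = {
            "frequency": n,
            "share": n / total if total else 0.0,
            "mean_readership": readers[field] / n if n else None,
        }
    return stats


# Made-up records, for illustration only.
sample = [
    {"identifiers": {"arxiv": "1234.5678", "doi": "10.1000/a"}, "reader_count": 10},
    {"identifiers": {"arxiv": "2345.6789"}, "reader_count": 30},
]
stats = identifier_stats(sample)
print(stats["arxiv"], stats["doi"])
```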
Acknowledgments
The Know-Center is funded within the Austrian COMET program (Competence Centers for Excellent Technologies) under the auspices of the Austrian
Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth, and the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.
Curriculum Vitae
Peter Kraker is a postdoctoral researcher in the field of social computing at
the Know-Center of Graz University of Technology. He completed his PhD
thesis on visualizing research fields based on scholarly communication on
the web at University of Graz with honours. His main research interests are
Science 2.0, Open Science and Altmetrics. Peter contributed to several EU-funded projects in these areas, including TEAM and STELLAR.
Asura Enkhbayar is a junior researcher in the field of social computing at the
Know-Center of Graz University of Technology. He received his bachelor's
degree in Electronic Engineering from the University of Applied Sciences
Technikum Vienna. His main research interests are Scholarly Communication, Data Mining and Analysis, Machine Learning, Medical Image Processing and Data Journalism.
Elisabeth Lex is the head of the Social Computing group at Know-Center and
holds an assistant professor position at Graz University of Technology. Her
research interests include Social Computing, Science 2.0, Web Science, Social Network Analysis, Machine Learning, Information Retrieval, and Data
Mining. Elisabeth is work package leader in the FP7 IP Learning Layers and
coordinates the FP7-PEOPLE-2011-IRSES project WIQ-EI.
References
Bourne, P. E., & Fink, J. L. (2008). I am not a scientist, I am a number. PLoS
Computational Biology, 4(12), e1000247. doi:10.1371/journal.pcbi.1000247
Davidson, L. A., & Douglas, K. (1998). Digital Object Identifiers: Promise
and Problems for Scholarly Publishing. The Journal of Electronic Publishing,
4(2). doi:10.3998/3336451.0004.203
DOI Foundation (n.d.). Digital Object Identifier System. URL:
http://www.doi.org/ (Accessed 14/01/2015)
Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: a system to uniquely identify researchers. Learned Publishing, 25(4),
259-264. doi:10.1087/20120404
Kraker, P., Körner, C., Jack, K., & Granitzer, M. (2012). Harnessing User
Library Statistics for Research Evaluation and Knowledge Domain Visualization. In Proceedings of the 21st International Conference Companion on
World Wide Web (pp. 1017-1024). doi:10.1145/2187980.2188236
Van de Sompel, H., Lagoze, C., Bekaert, J., Liu, X., Payette, S., & Warner,
S. (2006). An interoperable fabric for scholarly value chains. D-Lib Magazine, 12(10).
Thomson-Reuters (n.d.). ResearcherID. URL: http://www.researcherid.com
(Accessed 14/01/2015)
Van de Sompel, H., & Beit-Arie, O. (2001). Open linking in the scholarly
information environment using the OpenURL Framework. New Review of
Information Networking, 7(1), 59-76. doi:10.1080/13614570109516969
Warner, S. (2010). Author identifiers in scholarly repositories. Journal of
Digital Information, 11(1), 1-10.
Zahedi, Z., Haustein, S., & Bowman, T. D. (2014). Exploring data quality
and retrieval strategies for Mendeley reader counts. In Metrics14 - ASIS&T
Workshop on Informetric and Scientometric Research Introduction. Retrieved from
http://www.asis.org/SIG/SIGMET/data/uploads/sigmet2014/zahedi.pdf (Accessed 04/03/2014)