Achieving human and machine accessibility of cited data in scholarly publications

PeerJ Computer Science
Robust citation of archived methods and materials (particularly highly variable materials such as cell lines and engineered animal models) and of software are important questions not dealt with here. See Vasilevsky et al. (2013) for an excellent discussion of this topic for biological reagents.
Individuals representing the following organizations participated in the JDDCP development effort: Biomed Central; California Digital Library; CODATA-ICSTI Task Group on Data Citation Standards and Practices; Columbia University; Creative Commons; DataCite; Digital Science; Elsevier; European Molecular Biology Laboratories/European Bioinformatics Institute; European Organization for Nuclear Research (CERN); Federation of Earth Science Information Partners (ESIP); FORCE11.org; Harvard Institute for Quantitative Social Sciences; ICSU World Data System; International Association of STM Publishers; Library of Congress (US); Massachusetts General Hospital; MIT Libraries; NASA Solar Data Analysis Center; The National Academies (US); OpenAIRE; Rensselaer Polytechnic Institute; Research Data Alliance; Science Exchange; National Snow and Ice Data Center (US); Natural Environment Research Council (UK); National Academy of Sciences (US); SBA Research (AT); National Information Standards Organization (US); University of California, San Diego; University of Leuven/KU Leuven (NL); University of Oxford; VU University Amsterdam; World Wide Web Consortium (Digital Publishing Activity). See https://www.force11.org/datacitation/workinggroup for details.
These organizations include the American Physical Society, Association of Research Libraries, Biomed Central, CODATA, CrossRef, DataCite, DataONE, Data Registration Agency for Social and Economic Data, ELIXIR, Elsevier, European Molecular Biology Laboratories/European Bioinformatics Institute, Leibniz Institute for the Social Sciences, Inter-University Consortium for Political and Social Research, International Association of STM Publishers, International Union of Biochemistry and Molecular Biology, International Union of Crystallography, International Union of Geodesy and Geophysics, National Information Standards Organization (US), Nature Publishing Group, OpenAIRE, PLoS (Public Library of Science), Research Data Alliance, Royal Society of Chemistry, Swiss Institute of Bioinformatics, Cambridge Crystallographic Data Centre, Thomson Reuters, and the University of California Curation Center (California Digital Library).
NISO Z39.96-2012 is derived from the former “NLM-DTD” model originally developed by the US National Library of Medicine.
URIs are very similar in concept to the more widely understood Uniform Resource Locators (URL, or “Web address”), but URIs do not specify the location of an object or service—they only identify it. URIs specify abstract resources on the Web. The associated server is responsible for resolving a URI to a specific physical resource—if the resource is resolvable. (URIs may also be used to identify physical things such as books in a library, which are not directly resolvable resources on the Web.)
ORCiD IDs are numbers, issued by a consortium of prominent academic publishers and others, that identify individual researchers (Editors, 2010; Maunsell, 2014).
Force11.org (http://force11.org) is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. It is incorporated as a US 501(c)(3) not-for-profit organization in California.

Main article text

 

Introduction

Background

Why cite data?

The eight core Principles of data citation

  • Principle 1—Importance: “Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.”

  • Principle 2—Credit and Attribution: “Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.”

  • Principle 3—Evidence: “In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.”

  • Principle 4—Unique Identification: “A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.”

  • Principle 5—Access: “Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.”

  • Principle 6—Persistence: “Unique identifiers, and metadata describing the data, and its disposition, should persist—even beyond the lifespan of the data they describe.”

  • Principle 7—Specificity and Verifiability: “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.”

  • Principle 8—Interoperability and Flexibility: “Citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.”
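The fixity information mentioned in Principle 7 can be illustrated with a checksum comparison. This is only a minimal sketch: the choice of SHA-256 and the function names `fixity_digest` and `matches_citation` are assumptions made for illustration, since the Principles do not prescribe a particular algorithm or interface.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return a hex SHA-256 digest that could be stored with the citation metadata."""
    return hashlib.sha256(data).hexdigest()


def matches_citation(retrieved: bytes, cited_digest: str) -> bool:
    """Check that data retrieved later is byte-identical to what was cited."""
    return fixity_digest(retrieved) == cited_digest.lower()


# Hypothetical dataset content, used only to exercise the two functions.
original = b"sample_id,value\n1,0.42\n"
cited = fixity_digest(original)
unchanged = matches_citation(original, cited)
altered = matches_citation(original + b"2,0.43\n", cited)
```

Here `unchanged` is true and `altered` is false, which is the distinction a reader needs when verifying that a retrieved version is the one originally cited.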

Implementation questions arising from the JDDCP

  1. Document Data Model—How should publishers adapt their document data models to support direct citation of data?

  2. Publishing Workflows—How should publishers change their editorial workflows to support data citation? What do publisher data deposition and citation workflows look like where data is being cited today, such as in Nature Scientific Data or GigaScience?

  3. Common Repository Application Program Interfaces (APIs)—Are there any approaches that can provide standard programmatic access to data repositories for data deposition, search and retrieval?

  4. Identifiers, Metadata, and Machine Accessibility—What identifier schemes, identifier resolution patterns, standard metadata, and recommended machine programmatic accessibility patterns are recommended for directly cited data?

The Document Data Model group noted that publishers use a variety of XML schemas (Bray et al., 2008; Gao, Sperberg-McQueen & Thompson, 2012; Peterson et al., 2012) to model scholarly articles. However, there is a relevant National Information Standards Organization (NISO) specification, NISO Z39.96-2012, which is increasingly used by publishers, and is the archival form for biomedical publications in PubMed Central. This group therefore developed a proposal for revision of the NISO Journal Article Tag Suite to support direct data citation. NISO-JATS version 1.1d2 (National Center for Biotechnology Information, 2014), a revision based on this proposal, was released on December 29, 2014, by the JATS Standing Committee, and is considered a stable release, although it is not yet an official revision of the NISO Z39.96-2012 standard.

The Publishing Workflows group met jointly with the Research Data Alliance’s Publishing Data Workflows Working Group to collect and document exemplar publishing workflows. An article on this topic is in preparation, reviewing basic requirements and exemplar workflows from Nature Scientific Data, GigaScience (Biomed Central), F1000Research, and Geoscience Data Journal (Wiley).

  • definition of machine accessibility;

  • identifiers and identifier schemes;

  • landing pages;

  • minimum acceptable information on landing pages;

  • best practices for dataset description; and

  • recommended data access methods.

Recommendations for Achieving Machine Accessibility

What is machine accessibility?

Unique identification

Digital Object Identifiers (DOIs)

Handle System (HDLs)

Identifiers.org Uniform Resource Identifiers (URIs)

PURLs

HTTP URIs

URIs (Uniform Resource Identifiers) are strings of characters used to identify resources. They are the identifier system for the Web. URIs begin with a scheme name, such as http or ftp or mailto, followed by a colon, and then a scheme-specific part. HTTP URIs will be quite familiar as they are typed every day into browser address bars, and begin with http:. Their scheme-specific part is next, beginning with “//”, followed by an identifier, which often but not always is resolvable to a specific resource on the Web. URIs by themselves have no mechanism for storing metadata about any objects to which they are supposed to resolve, nor do they have any particular associated persistence policy. However, other identifier schemes with such properties, such as DOIs, are often represented as URIs for convenience (Berners-Lee, Fielding & Masinter, 1998; Jacobs & Walsh, 2004).
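The structure described above (a scheme name, a colon, then a scheme-specific part) can be sketched with Python's standard `urllib.parse.urlsplit`. The helper name `describe_uri`, the DOI `10.1234/example`, and the mail address are hypothetical values chosen for illustration only.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a URI into its scheme and scheme-specific part."""
    parts = urlsplit(uri)
    # Everything after the first "scheme:" is the scheme-specific part.
    scheme_specific = uri[len(parts.scheme) + 1:] if parts.scheme else uri
    return {
        "scheme": parts.scheme,
        "scheme_specific_part": scheme_specific,
        "authority": parts.netloc,  # empty for schemes such as mailto
        "path": parts.path,
    }


# A DOI represented as an HTTP URI (the DOI itself is a made-up example).
doi_uri = describe_uri("https://doi.org/10.1234/example")
# A non-HTTP scheme: there is no "//" and no authority component.
mail_uri = describe_uri("mailto:someone@example.org")
```

For the DOI example, the scheme-specific part begins with "//" followed by the resolver host and the identifier; for the `mailto` example it is just the address. Nothing in either string carries metadata or a persistence policy, which is the limitation the paragraph points out.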

Archival Resource Key (ARKs)

National Bibliography Number (NBNs)

Landing pages

There are three main reasons to resolve identifiers to landing pages rather than directly to data. First, as proposed in the JDDCP, the metadata and the data may have different lifespans, the metadata potentially surviving the data. This is true because data storage imposes costs on the hosting organization. Just as printed volumes in a library may be de-accessioned from time to time, based on considerations of their value and timeliness, so will datasets. The JDDCP proposes that metadata, essentially cataloging information on the data, should still remain a citable part of the scholarly record even when the dataset may no longer be available.

  • (recommended) Dataset descriptions: The landing page must provide descriptions of the datasets available, and information on how to programmatically retrieve data where a user or device is so authorized. (See Dataset description for formats.)

  • (conditional) Versions: What versions of the data are available, if there is more than one version that may be accessed.

  • (optional) Explanatory or contextual information: Provide explanations, contextual guidance, caveats, and/or documentation for data use, as appropriate.

  • (conditional) Access controls: Access controls based on content licensing, Protected Health Information (PHI) status, Institutional Review Board (IRB) authorization, embargo, or other restrictions, should be implemented here if they are required.

  • (recommended) Persistence statement: Reference to a statement describing the data and metadata persistence policies of the repository should be provided at the landing page. Data persistence policies will vary by repository but should be clearly described. (See Persistence guarantee for recommended language).

  • (recommended) Licensing information: Information regarding licensing should be provided, with links to the relevant licensing or waiver documents as required (e.g., the Creative Commons CC0 waiver description (https://creativecommons.org/publicdomain/zero/1.0/), or other relevant material).

  • (conditional) Data availability and disposition: The landing page should provide information on the availability of the data if it is restricted, or has been de-accessioned (i.e., removed from the archive). As stated in the JDDCP, metadata should persist beyond de-accessioning.

  • (optional) Tools/software: What tools and software may be associated or useful with the datasets, and how to obtain them (certain datasets are not readily usable without specific software).
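The requirement levels in the list above can be captured as a small checklist. This is an illustrative sketch, not part of the recommendations: the key names and the function `missing_elements` are hypothetical, and only the requirement levels are taken from the list.

```python
# Requirement level for each landing page element, taken from the list above.
# The short key names are illustrative assumptions, not a standard vocabulary.
LANDING_PAGE_ELEMENTS = {
    "dataset_descriptions": "recommended",
    "versions": "conditional",
    "explanatory_information": "optional",
    "access_controls": "conditional",
    "persistence_statement": "recommended",
    "licensing_information": "recommended",
    "availability_and_disposition": "conditional",
    "tools_software": "optional",
}


def missing_elements(page: dict, level: str = "recommended") -> list:
    """Return the elements of the given level that the page lacks or leaves empty."""
    return sorted(
        name
        for name, required in LANDING_PAGE_ELEMENTS.items()
        if required == level and not page.get(name)
    )


# A hypothetical landing page record that omits its persistence statement.
example_page = {
    "dataset_descriptions": ["Hypothetical survey data, 2014 release"],
    "licensing_information": "CC0 1.0",
}
gaps = missing_elements(example_page)
```

A registry or repository could run a check of this kind to report which recommended elements a landing page still lacks. Conditional elements would need additional context (for example, whether more than one version exists) before they can be checked.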

Content encoding on landing pages

  • HTML; that is, the native browser-interpretable format used to generate a graphical and/or language-based display in a browser window, for human reading and understanding.

  • At least one non-proprietary machine-readable format; that is, a content format with a fully specified syntax capable of being parsed by software without ambiguity, at a data element level. Options: XML, JSON/JSON-LD, RDF (Turtle, RDF-XML, N-Triples, N-Quads), microformats, microdata, RDFa.
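As a minimal sketch of the two encodings listed above, the same metadata record can be rendered once as HTML for human readers and once as JSON for software. The function names and the sample record are assumptions for illustration; a real landing page would carry far richer markup, and JSON is only one of the listed machine-readable options.

```python
import json
from html import escape


def render_json(metadata: dict) -> str:
    """Machine-readable encoding: JSON with a stable key order."""
    return json.dumps(metadata, sort_keys=True, indent=2)


def render_html(metadata: dict) -> str:
    """Human-readable encoding: a simple HTML definition list."""
    rows = "\n".join(
        f"  <dt>{escape(key)}</dt><dd>{escape(str(value))}</dd>"
        for key, value in sorted(metadata.items())
    )
    return f"<dl>\n{rows}\n</dl>"


# Hypothetical record; the angle brackets show that values are escaped in HTML.
metadata = {"title": "Example <dataset>", "version": "1.0"}
as_json = render_json(metadata)
as_html = render_html(metadata)
```

Generating both encodings from one record keeps the human-readable and machine-readable views from drifting apart.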

Best practices for dataset description

  • Dataset Identifier: A machine-actionable identifier resolvable on the Web to the dataset.

  • Title: The title of the dataset.

  • Description: A description of the dataset, with more information than the title.

  • Creator: The person(s) and/or organizations who generated the dataset and are responsible for its integrity.

  • Publisher/Contact: The organization and/or contact who published the dataset and is responsible for its persistence.

  • PublicationDate/Year/ReleaseDate: ISO 8601 standard dates are preferred (Klyne & Newman, 2002).

  • Version: The dataset version identifier (if applicable).
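The description elements above map naturally onto a simple record type. The sketch below is an assumption-laden illustration: the class name `DatasetDescription`, its field names, and all sample values (including the DOI) are hypothetical, and the only constraint taken from the text is that dates are emitted in ISO 8601 form and that the version is included only when applicable.

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DatasetDescription:
    identifier: str            # machine-actionable identifier resolvable on the Web
    title: str
    description: str
    creators: List[str]        # person(s) and/or organizations
    publisher: str             # organization or contact responsible for persistence
    publication_date: date
    version: Optional[str] = None  # only if applicable

    def to_record(self) -> dict:
        """Return a plain dict with the date in ISO 8601 form (YYYY-MM-DD)."""
        record = asdict(self)
        record["publication_date"] = self.publication_date.isoformat()
        if self.version is None:
            del record["version"]
        return record


example = DatasetDescription(
    identifier="https://doi.org/10.1234/example",
    title="Example dataset",
    description="A hypothetical dataset used only to illustrate the fields.",
    creators=["A. Researcher"],
    publisher="Example Repository",
    publication_date=date(2015, 5, 27),
)
record = example.to_record()
```

The resulting dictionary can be serialized to any of the machine-readable formats a repository chooses to offer.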

Serving the landing pages

Persistence guarantees

“[Organization/Institution Name] is committed to maintaining persistent identifiers in [Repository Name] so that they will continue to resolve to a landing page providing metadata describing the data, including elements of stewardship, provenance, and availability.

[Organization/Institution Name] has made the following plan for organizational persistence and succession: [plan].”

Implementation: Stakeholder Responsibilities

  1. Archives and repositories: (a) Identifiers, (b) resolution behavior, (c) landing page metadata elements, (d) dataset description and (e) data access methods, should all conform to the technical recommendations in this article.

  2. Registries: Registries of data repositories such as databib (http://databib.org) and re3data (http://www.re3data.org) should document repository conformance to these recommendations as part of their registration process, and should make this information readily available to researchers and the public. This also applies to lists of “recommended” repositories maintained by publishers, such as those maintained by Nature Scientific Data (http://www.nature.com/sdata/data-policies/repositories) and F1000Research (http://f1000research.com/for-authors/data-guidelines).

  3. Researchers: Researchers should treat their original data as first-class research objects. They should ensure it is deposited in an archive that adheres to the practices described here. We also encourage authors to publish preferentially with journals which implement these practices.

  4. Funding agencies: Agencies and philanthropies funding research should require that recipients of funding follow the guidelines applicable to them.

  5. Scholarly societies: Scholarly societies should strongly encourage adoption of these practices by their members and by publications that they oversee.

  6. Academic institutions: Academic institutions should strongly encourage adoption of these practices by researchers appointed to them and should ensure that any institutional repositories they support also apply the practices relevant to them.

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Joan Starr and Tim Clark conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, and reviewed drafts of the paper.

Funding

This work was funded in part by generous grants from the US National Institutes of Health and National Aeronautics and Space Administration, the Alfred P. Sloan Foundation, and the European Union (FP7). Support from the National Institutes of Health (NIH) was provided via grant # NIH 1U54AI117925-01 in the Big Data to Knowledge program, supporting the Center for Expanded Data Annotation and Retrieval (CEDAR). Support from the National Aeronautics and Space Administration (NASA) was provided under Contract NNG13HQ04C for the Continued Operation of the Socioeconomic Data and Applications Center (SEDAC). Support from The Alfred P. Sloan Foundation was provided under two grants: a. Grant # 2012-3-23 to the Harvard Institute for Quantitative Social Sciences, “Helping Journals to Upgrade Data Publication for Reusable Research”; and b. a grant to the California Digital Library, “CLIR/DLF Postdoctoral Fellowship in Data Curation for the Sciences and Social Sciences”. The European Union partially supported this work under the FP7 contracts #269977 supporting the Alliance for Permanent Access and #269940 supporting Digital Preservation for Timeless Business Processes and Services. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
