
archive.org ↔ Open Library synchronisation

RayBB edited this page May 17, 2024 · 1 revision

In order for Open Library users to access readable and borrowable books from archive.org, the respective data records need to be correctly synchronised.

This page documents the specific and technical requirements, and lists potential challenges to keeping the records synchronised.

Significance

As of 2019-06-24, 60,000 books were in our inlibrary lending program but not available through OpenLibrary.org. That means about 6% of our entire catalog is not borrowable through Open Library. Another 281K books (excluding inlibrary) in printdisabled have ISBNs and MARC records but are not on OpenLibrary.org. Without linking archive.org and Open Library, we also lose the ability to reliably determine the availability of our works. We have chosen to work on this project now because over the past months we have eliminated hundreds of thousands of orphaned editions, which was a prerequisite. A successful IA ↔ OL sync also means a revitalized import process that is more effective at importing Internet Archive works going forward.

Existing fields

archive.org items

  • openlibrary_edition, format example: OL12345M, creates a link from the item's details page to the exact Open Library edition represented by this scan.
  • openlibrary_work, format example: OL12345W, creates a link from the item's details page to the Open Library work that groups other editions of this scan.
  • openlibrary, format example: OL5189756M, a now DEPRECATED reference to an Open Library edition. Potentially used in the archive.org scanning process to locate MARC records, and in Open Library import code as a short cut for matching existing records. Both uses need to be investigated and updated to use the newer fields above.
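The identifier formats above can be checked mechanically before a value is written to an item. The following is a minimal sketch, assuming (from the examples only) that an identifier is `OL`, then digits, then `M` for editions or `W` for works. The function names are hypothetical and not part of any Open Library or archive.org code.

```python
import re

# Assumption: the shape "OL<digits>M" / "OL<digits>W" is inferred from the
# format examples on this page, not from a formal specification.
_EDITION_RE = re.compile(r"OL\d+M")
_WORK_RE = re.compile(r"OL\d+W")


def is_edition_id(value: str) -> bool:
    """True if value looks like an Open Library edition id, e.g. OL12345M."""
    return _EDITION_RE.fullmatch(value) is not None


def is_work_id(value: str) -> bool:
    """True if value looks like an Open Library work id, e.g. OL12345W."""
    return _WORK_RE.fullmatch(value) is not None
```

A check like this would catch a work id accidentally stored in openlibrary_edition, or the reverse.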

Open Library Edition level metadata

  • ocaid, format example: callofdistantmam00ward. This is the primary useful link back to an archive.org item. It stores only one value, which is a problem when multiple scans of an edition exist on archive.org: only one is linked from OL to IA, even though multiple IA items may refer to the same edition. The current OL sync process only automatically updates the archive.org item present in this ocaid field.
  • source_records, format example: ["ia:callofdistantmam00ward", ...]
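Because ocaid holds a single value while source_records is a list, the two fields together can name more archive.org items than ocaid alone. The sketch below assumes an edition record is available as a plain dict with the two fields described above. The helper name and the second identifier in the usage example are hypothetical.

```python
from typing import List


def ia_identifiers(edition: dict) -> List[str]:
    """Collect archive.org identifiers referenced by an edition record.

    The ocaid (if any) comes first, followed by any "ia:"-prefixed
    source_records entries that are not already listed.
    """
    found: List[str] = []
    ocaid = edition.get("ocaid")
    if ocaid:
        found.append(ocaid)
    for record in edition.get("source_records", []):
        if record.startswith("ia:"):
            identifier = record[len("ia:"):]
            if identifier and identifier not in found:
                found.append(identifier)
    return found


# Usage with a made-up second scan ("otherscan00test" is hypothetical):
edition = {
    "ocaid": "callofdistantmam00ward",
    "source_records": ["ia:callofdistantmam00ward", "ia:otherscan00test"],
}
print(ia_identifiers(edition))
```

Any identifier returned here beyond the first is a scan that the current sync process would not update automatically.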

Other less common IA related fields, possibly to be deprecated?:

  • "ia_box_id": ["IA113601"]
  • "ia_loaded_id": ["callofdistantmam00ward"]

Note: All fake-subject references to archive.org categories that may once have been used for classifying borrowable status are now deprecated. Examples: In Library, Protected DAISY, Accessible_book, Internet Archive Wishlist, Lending library, and possibly others. Issue #2107 tracks this cleanup.

See Open Library Client JSON schemata for the currently recognised and useful metadata fields for Open Library records.

Technical requirements

PRIORITY: Borrowable books should be synchronised properly to enable discovery and utilisation

  • All borrowable collection:inlibrary books should have openlibrary_edition populated.

PRIORITY: Items with print-disabled digital copies should be correctly synchronised to enable discovery for those who need them

  • archive.org print disabled collection items representing books, which are not necessarily borrowable by users without print disabilities, should have entries on Open Library to capture the existence of a book we know about, and aid discovery by print disabled users. The following query uses the presence of an ISBN as an indicator that an item is a book with sufficient metadata to count as good for importing.

    • CRITERIA NOT MET: collection:printdisabled AND NOT collection:inlibrary AND NOT openlibrary_edition:* AND isbn:*
      • Note: the number of items returned by this query depends on user account privileges; not all users see all print-disabled-only items by default on archive.org. @ June 2019, there are 330K items in the maximal list that are not linked to Open Library.
      • Existing issue #1047
      • ‼️ SYNCH TASK RESULTS: @ 17 July (after running an import/re-import task for the MARC records) there are now 184,445 print-disabled only items without corresponding Open Library links (improvement: ~150K)
  • The following query attempts to locate items that are printdisabled only and do NOT have ISBNs in metadata, but are good scanned books:
    • collection:printdisabled AND NOT collection:inlibrary AND NOT openlibrary_edition:* AND NOT isbn:* AND collection:internetarchivebooks
    • There are 13,708 results, but most appear to have incomplete titles, strangely with ISBNs in the title field. It looks like these have stalled in the scanning process somehow?
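To re-run counts like these from a script, a query string can be turned into a request URL for the archive.org advanced search endpoint. This is a sketch only: it assumes the `advancedsearch.php` endpoint with the `q`, `fl[]`, `rows` and `output` parameters, builds the URL without fetching it, and uses hypothetical names for the helper and constant. As noted above, results depend on the privileges of the account making the request.

```python
from urllib.parse import urlencode

# Assumption: archive.org's advanced search endpoint and parameter names.
ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"

# The "criteria not met" query for print-disabled items with ISBNs.
PRINTDISABLED_UNLINKED = (
    "collection:printdisabled AND NOT collection:inlibrary "
    "AND NOT openlibrary_edition:* AND isbn:*"
)


def search_url(query: str, rows: int = 50) -> str:
    """Build (but do not fetch) a search URL that returns identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("output", "json"),
    ]
    return ADVANCED_SEARCH + "?" + urlencode(params)


url = search_url(PRINTDISABLED_UNLINKED)
```

Keeping each criterion as a named query string makes it easier to compare counts before and after an import task.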

Deprecate openlibrary in favour of openlibrary_edition + _work

  • All archive.org book items with a populated openlibrary metadata field should also have openlibrary_edition.

  • All archive.org book items with openlibrary_edition MUST have openlibrary_work, and vice versa.

    • CRITERION NOT MET: mediatype:texts AND openlibrary_edition:* AND NOT openlibrary_work:*
    • CRITERION NOT MET: mediatype:texts AND openlibrary_work:* AND NOT openlibrary_edition:*
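The field rules in this section can also be applied to a single item's metadata, for example when reviewing items one at a time. The sketch below assumes the metadata is a plain dict keyed by the field names used on this page. The function name and the problem labels are hypothetical.

```python
from typing import List


def link_field_problems(metadata: dict) -> List[str]:
    """List which of the linking rules an archive.org item's metadata breaks."""
    has_legacy = bool(metadata.get("openlibrary"))
    has_edition = bool(metadata.get("openlibrary_edition"))
    has_work = bool(metadata.get("openlibrary_work"))

    problems: List[str] = []
    # Rule: a populated deprecated field should come with openlibrary_edition.
    if has_legacy and not has_edition:
        problems.append("openlibrary set without openlibrary_edition")
    # Rule: edition and work ids must always appear together.
    if has_edition and not has_work:
        problems.append("missing openlibrary_work")
    if has_work and not has_edition:
        problems.append("missing openlibrary_edition")
    return problems
```

An empty list means the item satisfies all three rules. It says nothing about whether the ids themselves point at the right records.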

Only import and synchronise books (IA mediatype:texts)

  • Openlibrary identifiers on archive.org should only be on mediatype:texts items as only books should be represented on Open Library.
    • CRITERION NOT MET: openlibrary:* OR openlibrary_edition:* OR openlibrary_work:* AND NOT mediatype:texts
    • Some items not meeting this technical criterion may be legitimate. Some items appear to be gallery catalogs (i.e. books) that are linked to mediatype:image, and other archive.org items could be legitimate books that are mis-categorised. @ June 2019 there are 55 items matched above. Each item needs to be examined to find the fix, or at least to come up with a set of fix categories. Simply deleting the linking metadata would be incorrect in many of these situations, as the links are probably a sign of further data issues on OL or IA.
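Since each matched item needs manual review, it can help to pull the candidates out of a batch of metadata records first. The following sketch assumes each record is a dict with `identifier` and `mediatype` keys plus any of the three Open Library fields. The function name and the sample identifiers are hypothetical.

```python
from typing import Iterable, List

OL_FIELDS = ("openlibrary", "openlibrary_edition", "openlibrary_work")


def non_text_items_with_ol_links(items: Iterable[dict]) -> List[str]:
    """Return identifiers of items that carry an Open Library id but are not mediatype:texts."""
    flagged = []
    for item in items:
        has_ol_link = any(item.get(field) for field in OL_FIELDS)
        if has_ol_link and item.get("mediatype") != "texts":
            flagged.append(item.get("identifier"))
    return flagged


# Made-up records for illustration only.
sample = [
    {"identifier": "book-scan", "mediatype": "texts", "openlibrary_edition": "OL12345M"},
    {"identifier": "gallery-catalog", "mediatype": "image", "openlibrary_work": "OL12345W"},
    {"identifier": "plain-image", "mediatype": "image"},
]
print(non_text_items_with_ol_links(sample))
```

The function only reports candidates and deliberately leaves the links untouched, in line with the caution above against simply deleting them.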

Orphaned items with ocaid

This problem affects the ability of items to become synchronised using the existing mechanisms, and it blocks the requirement that items carry both a work id and an edition id. The solution is to resolve and add works for all editions which don't have them. This overall effort is being tracked on another wiki page, but the following notes relate to orphans with OCAIDs.

remaining total @ 26 June 2019: 38347

NONE are duplicated

re-running re-import process

20730 were successfully matched or had works created, fixing the orphan (54%)

316 were matched on a different existing edition. !!FIX for these: get the orphan by opening https://openlibrary.org/books/ia:<ocaid> and then associate it with the matched work.
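For a list of orphan OCAIDs, the lookup URLs can be generated rather than typed by hand. This is a small sketch that assumes the `https://openlibrary.org/books/ia:<ocaid>` URL pattern. The function name is hypothetical, and associating the orphan with the matched work remains a separate step.

```python
from urllib.parse import quote


def orphan_lookup_url(ocaid: str) -> str:
    """Build the Open Library URL that resolves an archive.org identifier to its edition."""
    # quote() guards against unexpected characters; typical OCAIDs pass through unchanged.
    return "https://openlibrary.org/books/ia:" + quote(ocaid, safe="")


print(orphan_lookup_url("callofdistantmam00ward"))
```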

the remaining 17301 were not resolved due to the following issues:

Proposal: the no-imagecount and noindex-true OL orphans should simply be deleted. They tend to have been created from problematic archive.org records that we would not currently import, and the main reason for the no-index flag appears to be mismatched metadata, or otherwise bad scans. The records I have checked all seem to have better non-broken scanned items elsewhere, and have been imported properly via those.
