Preserving Digital Materials: Final Report of The Digital Preservation and Archive Committee
Membership
Howard Besser, UCLA
Curtis Fornadley, UCLA
Anne Gilliland-Swetland, UCLA
Sal Guerena, UCSB
Bernie Hurley, UCB (Chair)
Richard Marciano, UCSD Supercomputer Center
Reagan Moore, UCSD Supercomputer Center
Barclay Ogden, UCB
David Walker, CDL
Brad Westbrook, UCSD
EXECUTIVE SUMMARY AND MAJOR RECOMMENDATIONS
THE NEED TO ADDRESS THE LONG–TERM PRESERVATION OF DIGITAL MATERIALS
DEFINING A UC LIBRARY PRESERVATION REPOSITORY
    FORMAL DEFINITION
    SCOPE OF MATERIALS TO BE PRESERVED
    WHY A PRESERVATION REPOSITORY AND NOT AN “ARCHIVE”
    WHY A PRESERVATION REPOSITORY IS NOT A “BACK-UP”
PRESERVATION REPOSITORY’S RELATIONSHIP TO THE OAIS REFERENCE MODEL
    OAIS ROLES AND RESPONSIBILITIES
        Role of the Producer
        Role of the Consumer
        Role of the Preservation Repository Administration
        Role of Management
    THE SIP, AIP AND DIP
PRESERVATION REPOSITORY SERVICES
    INGEST, STORAGE AND DISSEMINATION SERVICES
    DIGITAL OBJECT INTEGRITY SERVICES
        Physical and Linking Integrity
        Data Migration Integrity Services
            Basic Migration Service
            Transformative Migration Services
    EDUCATION AND OUTREACH SERVICES
    DISCOVERY AND DISSEMINATION SERVICES
        Advanced Discovery and Dissemination Services (optional)
    DATA RESCUE SERVICE (optional)
INTELLECTUAL PROPERTY AND COPYRIGHT ISSUES
CENTRALIZED VS. DECENTRALIZED PRESERVATION REPOSITORY
COSTS
APPENDIX A: METHODS TO MITIGATE THE RISK OF LOSING DIGITAL MATERIALS
APPENDIX B: OVERVIEW OF THE OAIS REFERENCE MODEL
APPENDIX C: AGREEMENT TO TRANSFER DIGITAL MATERIALS TO THE CDL PRESERVATION REPOSITORY
EXECUTIVE SUMMARY AND MAJOR RECOMMENDATIONS
The long-term retention of digital library materials represents an urgent problem for
the University of California Libraries. Therefore, the Digital Preservation and Archiving
Committee (DPAC) recommends that the UC Libraries:
1) Establish a centralized UC Library preservation repository service that
conforms to the OAIS [1] Reference Model, is administered by the CDL, and provides
the following services:
a) Education and Outreach – promotes the importance of digital
preservation, explains policies and procedures that govern the responsibilities of the
preservation repository and the libraries using its services, and provides expert
consultation and training on digital preservation issues. Centralizing this service will
limit the need to develop highly specialized, digital preservation expertise at every UC
library.
b) Ingest, Data Storage & Dissemination – creates submission and
dissemination agreements that define library and preservation repository responsibilities
for depositing, storing and returning digital materials from the repository. Initial
submissions to the repository should be limited to the following materials submitted by
UC Libraries: EADs, materials created to the CDL Digital Objects Standard (CDL-DOs) [2]
and MARC records.
c) Digital Object Integrity and Transformative Migration – creates
policies, procedures, tools and technologies that ensure the physical and intellectual
integrity of preserved materials. Physical integrity ensures digital objects (i.e., their bits)
are not inadvertently or maliciously altered. In the case of a transformative migration,
when a digital object’s bits must be changed, the preservation repository works with the
depositing library to retain all the essential information needed to ensure the continued
intellectual value of the object.
d) Discovery and Dissemination – allows for the identification and
retrieval of repository objects. The most basic discovery and dissemination service
returns objects that are requested by their unique ID. An advanced real-time object
export service would allow “access systems” that directly serve end-users the ability to
retrieve objects from the preservation repository and therefore, avoid the cost of storing
these locally. The DPAC does not recommend that the preservation repository directly
serve end-users.
e) Data Rescue – offers a recharge service to help libraries convert
legacy, non-standard digital objects to standards accepted by the repository.
[1] The Open Archival Information System (OAIS) Reference Model is currently being reviewed as an
ISO Draft International Standard.
[2] CDL Digital Objects (CDL-DOs) contain descriptive, administrative and structural metadata, as well as
the actual content (digitized images, text, audio, and video). Content types in CDL-DOs created to date
are primarily digitized images and/or text.
2) Form a governance/management structure composed of major
stakeholders to establish policy and guide the CDL’s administration of the
preservation repository.
THE NEED TO ADDRESS THE LONG–TERM PRESERVATION OF
DIGITAL MATERIALS
The primary motivations to address digital preservation issues are itemized in the
Digital Preservation and Archiving Committee’s (DPAC’s) charge:
“The long-term retention of digital library materials represents an urgent problem
for the University of California. Faculty are hesitant to embrace publishing in
"electronic only" formats, as they require more assurance that their scholarship
will endure through time. In addition, UC libraries need to create an environment
where these digital materials become part of their permanent collections, thus
ensuring they are available for use by future generations of scholars. Finally, the
University must be concerned with protecting its significant and growing
investment in these digital assets.”
The DPAC concurs with the above statement and notes that its theme is a concern
over the longevity and integrity of digital materials: the faculty’s need to publish an
accurate and longstanding record of their work; the librarians’ increasing addition of
digital materials to their permanent collections; and the institutional requirement to protect
the long-term fiscal investment made in digital collections.
The most cited risk to maintaining the longevity and integrity of digital materials is
the concern over migrating data to new technologies and accompanying data formats.
The motivation for adopting new technologies and formats will vary from capturing cost
efficiencies to providing better services. Some of these migrations are relatively simple;
for example, moving data to new storage media where the “bit streams” that represent the
content do not change. Other data migrations will be much more complex, such as
moving digital content to new technologies that employ new storage formats (i.e., the bit
streams do change). In this case, the challenge is to retain the essential information stored
in the original object that is necessary to ensure its continued intellectual integrity. Some
procedures that will minimize the risk of losing information are listed in Appendix A.
Another serious risk to the longevity and integrity of digital materials is the lack of
policies, procedures and best practices for creating, maintaining and preserving digital
objects. In many organizations, digital materials are treated as ephemeral resources, with
little thought given to their existence beyond posting them to a website. The UC libraries
are taking a leadership role in addressing this risk factor and have implemented
well-reasoned and practical approaches to building digital library objects. The creation and
adoption of UC standards such as the Encoded Archival Description (EAD), the CDL
Digital Object Standard and the CDL Digitization Standard are helping to ensure that
metadata and digital content produced by the University libraries are created and encoded
to quality levels that are likely to be preserved.
Protecting the longevity and integrity of digital library materials is an urgent need and
major challenge for the University of California Libraries. Therefore, the DPAC
recommends the UC Libraries create a digital preservation program that focuses on the
formation of a “preservation repository.” The following sections of this report define a
preservation repository and itemize the services it provides.
FORMAL DEFINITION
The DPAC defines a preservation repository as a trusted, neutral, shared service
where the libraries can deposit digital materials to ensure their long-term integrity and
accessibility [3].
[3] As the DPAC was finalizing its work, the RLG/OCLC Working Group released its report titled Attributes
of a Trusted Digital Repository: Meeting the Needs of Research Resources
(http://www.rlg.org/longterm/attributes01.pdf). The DPAC believes that this thoughtful report provides a
broad framework that supports the preservation repository definition and services as conceived by the
DPAC.
• Experimental Data Formats for Digital Content
CDL-DOs contain descriptive, administrative and structural metadata, as
well as the actual content. The data formats that represent content in existing CDL-DOs
are images and text (e.g., digitized photographs, transcribed oral histories). The DPAC
recommends that the preservation repository program be used as a testbed to
experiment with other content formats as well, including those “born digital.” The
goal of these experiments will be to help understand how the preservation repository
services can be extended to collections that use these data formats.
The digital library community uses the last and broader definition of an archive
when it talks of creating archives of digital materials, as is evidenced by the terminology
used in the OAIS Reference Model. Library special collections departments may
consider themselves archives under the second definition. Finally, archivists may prefer
to use the first and specific definition of an archive.
WHY A PRESERVATION REPOSITORY IS NOT A “BACK-UP”
Backing up data is not the same as preserving digital materials. A back-up is a
copy of a computer’s proprietary file system, not a replication of the standardized digital
objects it initially imported. It is important to understand that most “access systems” do
not store a standard form of a digital object. For example, the OAC access system
disassembles EADs and stores their data in a series of proprietary database tables and
indexes, which are spread out across the computer’s file system. The EADs cannot be
recreated and exported from this system.
Certainly, back-ups are critically important for library computing systems in order
to recover from a catastrophic failure (i.e., loss of data). They are optimized to restore
lost data by restoring the computer’s proprietary file system in the shortest time possible.
For this reason, a preservation repository itself would be backed up.
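[Figure: OAIS roles. The Producer sends a SIP to the Repository (run by the Repository Administration, which holds the AIP); the Repository sends a DIP to the Consumer; Management sits over the Repository.]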
The DPAC has adopted the OAIS terminology, as it has precise meaning and is
being used internationally. The most notable exception to this guideline is that the OAIS
term “archive” has been renamed to “preservation repository,” for reasons explained
earlier. A more detailed diagram of the OAIS model can be found in Appendix B, or at:
http://ssdoo.gsfc.nasa.gov/nost/isoas/dads/OAISOverview.html
Role of Management
Management sets overall policy for the preservation repository service and is
responsible for securing funding. It must resolve high-level conflicts between producers,
consumers and the preservation repository administration. Within the UC system, the
Regional Library Facility Boards provide a model for a preservation repository
governance structure. The DPAC recommends a preservation repository management
membership roster that reflects the preservation repository's stakeholders. For
example, it could include two University Librarians, one or two faculty members, and
one representative each from the Collection Development Committee (CDC), Library
Technology Advisory Group (LTAG), and the preservation repository administration.
The AIP is the archival information package and represents the format used to
store the materials in the preservation repository. For now, this can also be the MARC, EAD
and CDL Digital Object Standard definitions, extended to include any additional
information required by the repository (e.g., a link from the digital object to the
submission agreement it was entered under).
Finally, the DIP is the dissemination information package and represents the
format and encoding used to send materials extracted from the preservation repository to
consumers. As the DPAC is not recommending a public interface, the main consumers
are the producers. Again, the MARC, EAD and CDL Digital Object Standard definitions could be
used.
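The relationship between the three packages can be sketched as follows. This is a minimal illustration only, assuming hypothetical class and field names (`SIP`, `AIP`, `DIP`, `object_id`, `submission_agreement_id`) that do not appear in the OAIS model or in any CDL specification. It shows the AIP as the submitted object extended with a link to its submission agreement, and the DIP as the object returned to the producer.

```python
from dataclasses import dataclass

# Formats the report proposes for initial submissions.
ACCEPTED_FORMATS = {"MARC", "EAD", "CDL-DO"}


@dataclass(frozen=True)
class SIP:
    """Submission information package (field names are hypothetical)."""
    object_id: str
    fmt: str
    payload: bytes


@dataclass(frozen=True)
class AIP:
    """Archival package: the submitted object plus repository additions."""
    object_id: str
    fmt: str
    payload: bytes
    submission_agreement_id: str  # link back to the governing agreement


@dataclass(frozen=True)
class DIP:
    """Dissemination package returned to the consumer."""
    object_id: str
    fmt: str
    payload: bytes


def ingest(sip: SIP, submission_agreement_id: str) -> AIP:
    """Accept a SIP in a supported format and record its agreement."""
    if sip.fmt not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported format: {sip.fmt}")
    return AIP(sip.object_id, sip.fmt, sip.payload, submission_agreement_id)


def disseminate(aip: AIP) -> DIP:
    """Return the stored object without the repository-only additions."""
    return DIP(aip.object_id, aip.fmt, aip.payload)


sip = SIP("ead-0001", "EAD", b"<ead>...</ead>")
aip = ingest(sip, "agreement-ucb-01")
dip = disseminate(aip)
```

In this sketch the payload is carried through unchanged, which matches the report's suggestion that the same format definitions can serve for all three packages at the outset.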
The DPAC recommends that only materials covered by a pre-negotiated
submission and dissemination agreement be deposited in the preservation repository.
The first case is where links are within an object. For example, the CDL-DO
standard has a file inventory section that tracks all the different files that make up the
digital content. If a digitized book has 100 pages, there could be 100 TIFF files that
represent the master images for all pages. These files can be embedded in the object, or
pointed to via links. In the latter case, digital object integrity would require depositing
the separate content files in the repository.
The most problematic case is when a deposited object links to a digital object
outside the repository. This could happen when the link is to an object that has copyright
restrictions and therefore cannot be added to the repository.
Careful planning will help to achieve object integrity and should be addressed
in the submission agreement negotiated between the producer and preservation repository
administration. With the producer’s cooperation, it should be possible for the
administration to ensure integrity within and between objects in the repository. However,
the preservation repository may not be able to ensure long-term integrity for relationships
represented by links to materials outside the repository.
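A linking-integrity check of the kind described above might look like the following sketch. The function name, file names and the idea of representing the repository's holdings as a set of identifiers are assumptions made for illustration; the CDL-DO file inventory is not modeled in detail.

```python
def audit_file_inventory(inventory, repository_holdings):
    """Split an object's file inventory into files held and files missing.

    inventory: list of file identifiers linked from the digital object.
    repository_holdings: set of file identifiers stored in the repository.
    """
    held = [f for f in inventory if f in repository_holdings]
    missing = [f for f in inventory if f not in repository_holdings]
    return held, missing


# Hypothetical digitized book: 100 master page images referenced by link.
inventory = [f"book42/page{n:03d}.tif" for n in range(1, 101)]

# Suppose the last two page images were never deposited.
repository_holdings = set(inventory[:98])

held, missing = audit_file_inventory(inventory, repository_holdings)
print(f"{len(held)} files held, {len(missing)} missing: {missing}")
```

An audit like this can only report on links that resolve inside the repository. Links to materials held elsewhere would show up as missing, which is why the submission agreement needs to address them.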
Data Migration Integrity Services
All policies and procedures for the preservation repository are designed to
protect the integrity of deposited materials over the long-term. There will come times
when it will become necessary to migrate digital objects to new, more efficient and cost
effective technologies and data formats. The DPAC describes a migration as the ability
to create an exact copy or transformed version of deposited materials. An exact copy is
one in which the migrated metadata and content has not been changed (i.e., the bits that
make up the metadata and digital content are the same). A transformed version of the
original digital object is created when the bits that represent the object’s metadata and
content need to be changed.
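The distinction between an exact copy and a transformed version can be tested mechanically by comparing digests of the bit streams before and after migration. The sketch below assumes SHA-256 as the digest and uses a hypothetical function name, `classify_migration`; the report does not prescribe a particular algorithm.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a bit stream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def classify_migration(original: bytes, migrated: bytes) -> str:
    """Label a migration by whether the bits are unchanged."""
    if sha256_hex(original) == sha256_hex(migrated):
        return "exact copy"
    return "transformed version"


master = b"example master image bits"
copied_to_new_media = bytes(master)          # same bits, new storage
converted_format = b"NEWFMT" + master        # bits changed by conversion

print(classify_migration(master, copied_to_new_media))
print(classify_migration(master, converted_format))
```

A matching digest confirms only that the bits are identical. For a transformed version, the digest will differ by design, so intellectual integrity has to be established by other means agreed with the depositing library.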
repository and the community using its services, and provide expert consultation and
training on digital preservation issues.
The long-term preservation of digital materials requires input and action from
many different stakeholders: faculty, library selectors, catalogers, technologists,
preservation repository staff, etc. The preservation repository administration is in a
unique position to take a proactive role in educating the University community about the
risks inherent in the long-term preservation of digital materials, as well as actions to
mitigate these risks. In particular, preservation repository staff could help to educate
producers about their responsibilities in minimizing risks. For example, producer
communities need to establish digital preservation policies, as well as workflows and best
practices to capture and encode digital content and metadata, including preservation
metadata. The staff could organize workshops and training sessions to help campuses
identify digital preservation issues, develop responses, and then document these in
submission agreements.
First, the DPAC feels it’s important to strictly limit end-user access to the
preservation repository in order to contain costs. If a preservation repository begins
adding advanced discovery and dissemination services for end-users, the complexity and
the cost of the system will increase dramatically. No matter how many advanced services
are added, there would be continued pressure to keep adding new features found in
commercial access systems. The DPAC believes the preservation repository will collapse
under its own complexity and cost if it attempts to directly compete with end-user access
systems.
Preservation rights address the legality of replicating digital objects for preservation
purposes (i.e., exact copies or transformed versions). Without preservation rights, digital
objects cannot be preserved over time. Preservation rights must be secured or granted by
the producer and specified in the submission agreement. If these rights change for
materials held in the preservation repository, the producer must alert the repository
administration in a timely manner.
The DPAC recommends that UC libraries strive to only license digital materials for
which preservation rights can be secured, and that the university administration and
libraries work to ensure that intellectual property legislation does not impede the
preservation of digital materials (e.g., restrict the right to make exact, transformed or
derivative copies needed for preservation).
The DPAC further recommends that CDL act as the centralized preservation
repository administration, as it is well placed to ensure that best practices are both
negotiated in submission agreements, and followed in practice.
If the preservation repository is serving content to access systems via the real-time
object export feature, the DPAC recommends that the preservation repository be run as
a “high availability” service (e.g. using clustered servers) to minimize unscheduled
downtime. The costs and benefits of the high availability service would have to be
determined to justify the service.
COSTS
The cost of an initial implementation of a preservation repository is separated into
the one-time capital costs and recurring labor costs. Each cost element relates to an
initial system scaled to store three currently existing collections: MARC records, EAD
records, and the CDL digital objects. The initial scale of the system is:
• 30 million records
• 5 TB total capacity (expected to last 3-4 years)
• Access rate of 10% per year
• Approximately 11 submission agreements to start (one per library)
$377K is the hardware/software startup cost and $227K is the recurring salary
expense needed to implement an off-line preservation repository. This implementation
would not include the real-time object export service (i.e., providing digital objects to
public access systems). An additional $87K startup expense would be required if the
preservation repository were expanded to support real-time object export, for a total
startup cost of $464K. The following explains the costs in more detail. All these costs
assume the preservation repository runs in an established data center.
Director of the Preservation Repository ($111K per year; includes 23% benefits) – A
CRM II would be hired to manage the repository services, advocate use, and manage the
interactions with the UC campus libraries, including negotiation of the submission
agreements.
Production ($116K per year; includes 23% benefits) – A 0.5 FTE PA IV ($55K) is needed
to manage the database and to schedule and manage the submission process.
A 0.25 FTE PA III ($23K) is needed as a system administrator to
manage the accession, storage, and access platforms. A 0.5 FTE PA II ($38K) is needed to
support the submission and retrieval of digital objects.
Hardware and Software ($377K) – The basic hardware system is designed to support
an off-line storage service. The components are:
• Accession platform ($48K) – includes a 2-processor server at $25K, an Oracle
license at $18K, and 50 GB of disk space for the information catalog at $5K.
• Access platform ($48K) – the system also serves as the back-up system to the
accessioning platform, and uses a duplicate hardware configuration.
• Storage system ($281K) – tape storage is assumed as the cheapest support
mechanism for off-line access. The cost of tape media is currently $1,250 per
TB. The cost of buying a tape robot is about $125K, and the cost of a
hierarchical storage manager is about $100K. The cost of the CPUs for
managing the system is about $50K. The capacity of the system is much
greater than initially needed.
Expanded hardware and software system for a real-time object export service ($87K
additional, for a total startup cost of $464K) – The basic hardware system would be
augmented with a disk cache to support on-line access. Disk systems currently cost about
$25,000 per TB. Given that the disk cache only needs to hold actively referenced data,
the size of the disk cache should be about 35% of the archive size (set from the combined
access rates). This implies a disk cache of 3.5 TB at a cost of $87K.
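The cost figures in this section can be checked with a short calculation. The sketch below uses only the unit costs quoted above, in thousands of dollars, and assumes that the published totals were truncated to whole thousands. The 3.5 TB cache size is taken as given; note that 35% of the 5 TB initial capacity would be 1.75 TB, so the stated size appears to assume a larger archive.

```python
# All amounts are in thousands of dollars ($K), taken from the text above.

# Accession platform: server + Oracle license + 50 GB catalog disk.
accession = 25 + 18 + 5

# Access platform: duplicate of the accession hardware configuration.
access = accession

# Storage system: tape robot + hierarchical storage manager + CPUs,
# plus tape media for the initial 5 TB at $1.25K per TB.
tape_media = 5 * 1.25
storage = 125 + 100 + 50 + tape_media

hardware_startup = accession + access + storage

# Recurring salaries: director plus the three production positions.
production = 55 + 23 + 38
salaries = 111 + production

# Real-time object export: disk cache of 3.5 TB at $25K per TB.
disk_cache = 3.5 * 25

print(f"Accession platform: ${accession}K")
print(f"Storage system:     ${storage}K")
print(f"Hardware startup:   ${hardware_startup}K")
print(f"Recurring salaries: ${salaries}K")
print(f"Disk cache:         ${disk_cache}K")
print(f"Startup with cache: ${int(hardware_startup) + int(disk_cache)}K")
```

Under the truncation assumption these values agree with the $48K, $281K, $377K, $227K, $87K and $464K figures reported in this section.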
APPENDIX A: METHODS TO MITIGATE THE RISK OF LOSING
DIGITAL MATERIALS
All organizations that wish to mitigate the risk of losing digital materials should:
store the same materials in multiple repositories using different technologies.
This would protect against a catastrophic software failure that damages materials
in the primary repository and its replications.
12) User Education
Last but not least, it is critical that producers and consumers of preservation
repository services be educated about its mission, policies and procedures, as well as
their own responsibilities (e.g., developing and following standards, guidelines
and best practices, negotiating submission and dissemination agreements with the
repository, etc.).
APPENDIX B: OVERVIEW OF THE OAIS REFERENCE MODEL
The functional areas of Figure 1 include both systems and people needed to support the
OAIS operation. The OAIS is an archive that meets a set of responsibilities as defined in
the reference model document and this allows an OAIS archive to be distinguished from
other uses of the term 'archive.'
Information objects (which may be any type of data), together with attributes needed for
efficient ingest, archival preservation, and searching, are received from data Producers by
the INGEST function using a Submission Information Package (SIP). The INGEST
function does validation, adds supporting information as needed, ensures that the
information is understandable to the designated Consumer communities, and performs any
transformations needed to put the information into archival storage forms. These
transformations may include reorganizing and reformatting to meet archival storage and
dissemination needs. The resulting information objects are sent to ARCHIVAL
STORAGE using an Archival Information Package, and search information (e.g., Catalog
data) used to support Consumer selection of archived data is sent to DATA
MANAGEMENT as Descriptive Information.
DATA MANAGEMENT is the repository for all information used to support search aids,
and for all information (outside of ARCHIVAL STORAGE) used to support the general
operation of the archive. It stores all the Descriptive Information (catalog information)
used to support searching and ordering. It stores all the request information generated by
Consumers and by the archive in responding to requests. It stores all the information
about Consumers.
Consumers interact primarily with the ACCESS and DISSEMINATION function to find
and receive information objects of interest. The finding aids used are supported by the
catalog data (Descriptive Information) held by DATA MANAGEMENT. Requests to
ARCHIVAL STORAGE yield Archival Information Packages (AIPs) which are
processed as needed by ACCESS and DISSEMINATION to complete the order. Standing
orders are processed automatically as the information becomes available and meets
distribution requirements. Disseminations are provided as Dissemination Information
Packages (DIPs) to the Consumer using some protocol (e.g., FTP, HTTP, or tape).
Information Models
The other major dimension for the reference model is the modeling of information in the
OAIS. A general model of archival information objects is shown in Figure 2. Much more
extensive modeling is contained in the full document.
The Archival Information Package (AIP) contains two primary information objects that
are identified as Content Information (CI) and Preservation Description Information
(PDI). The Content Information is that information which is the primary information
submitted for preservation. The scope of what constitutes this information is agreed to
between the archive and the Producer. To be complete, and preservable for the long-term,
this information must include the associated Representation Information (or format
information) that turns the Content Information bits into meaningful information.
Once the Content Information has been determined, it is possible to ask what constitutes
the Preservation Description Information for that particular Content Information. The
PDI includes several types of additional information that are needed to help preserve the
Content Information. These are:
• Reference: How consumers can uniquely identify the Content Information from any
other Content Information.
• Provenance: Who has had custody of the Content Information and what was its source.
This would include the processing that generated it.
• Context: How the Content Information relates to other information objects, such as
why it was created and how it may be used with other information objects.
• Fixity: Information and mechanisms used to protect the Content Information from
accidental change.
The PDI information is needed for long-term preservation and its completeness is a key
element in determining the quality of the archival function being performed.
Within the archive, the Content Information and Preservation Description Information
need to be tracked and associated. This is done using the Packaging Information. For
example, this may consist of some directory and file names, and their underlying
implementations, on some medium. Or it may consist of a tar file together with some
information relating the Content Information bits, its Representation Information, and
Preservation Description Information.
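The tar-file form of Packaging Information mentioned above can be sketched as follows. This is an illustration only: the member names (`content.bin`, `manifest.json`), the JSON manifest layout and the use of SHA-256 for fixity are assumptions, not requirements of the OAIS model.

```python
import hashlib
import io
import json
import tarfile


def build_aip_tar(content: bytes, representation_info: str, pdi: dict) -> bytes:
    """Package content with a manifest relating it to its RI and PDI."""
    manifest = {
        "content_file": "content.bin",
        "representation_information": representation_info,
        "preservation_description_information": pdi,
    }
    members = (
        ("content.bin", content),
        ("manifest.json", json.dumps(manifest, indent=2).encode("utf-8")),
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_manifest(aip_tar: bytes) -> dict:
    """Read the manifest back out of a packaged AIP."""
    with tarfile.open(fileobj=io.BytesIO(aip_tar), mode="r") as tar:
        return json.load(tar.extractfile("manifest.json"))


content = b"example content bits"
pdi = {
    "reference": "hypothetical-id-0001",
    "provenance": "digitized by a hypothetical campus unit",
    "context": "part of a hypothetical oral history collection",
    "fixity": {"sha256": hashlib.sha256(content).hexdigest()},
}
package = build_aip_tar(content, "plain text, UTF-8", pdi)
manifest = read_manifest(package)
```

Keeping the four PDI elements in one manifest alongside the Representation Information means the package remains self-describing if it is separated from the repository's catalog.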
Also associated with the Archival Information Package is the Descriptive Information.
This is the information that is used to populate finding aids and is typically thought of as
the catalogue information. It is this information that supports Consumer searches or that
triggers the dissemination of information in response to a standing order.
APPENDIX C: AGREEMENT TO TRANSFER DIGITAL
MATERIALS TO THE CDL PRESERVATION REPOSITORY
I: REPOSITORY INFORMATION
A: Transferring Agency
C: Address
D: Contact Person
E: Title:
F: Email address:
G: Telephone Number:
[Note: the upgrade provision assumes the Data Rescue Service is
implemented and is a recharge service.]
5: will identify all persons having authority to submit and retrieve the
Producer’s digital objects stored in the Preservation Repository.
Dissemination Information Package (DIP) specified in this
Submission Agreement;
___ Permanent