LOCKSS: Format Migration

From CLOCKSS Trusted Digital Repository Documents
Revision as of 01:30, 27 September 2013 by Dshr (Initial version)

Obsolescence of Web Formats

Jeff Rothenberg's path-breaking pre-Web analysis of digital preservation predicted that:

  • The probability of an individual format going obsolete was high.
  • The time that would elapse between the introduction of a new format and its obsolescence would likely be short.

The OAIS reference model inherited this analysis, along with its implication that it was desirable to expend resources now to prepare for the likely rapid obsolescence of formats.

The LOCKSS technology was designed from the start specifically to preserve content published on the Web. A Web format becomes obsolete when support for it is removed from browsers. Theoretical analyses of the mechanisms by which this would happen predicted that it would be a rare occurrence, because the incentives for removing support are weak and the disincentives are strong. Subsequent practical research by Matt Holden of INA examined the renderability of the Web formats predicted to be most likely to suffer obsolescence, namely audio-visual formats from the early days of the Web. It showed that, 15 years later, format obsolescence was negligible.

LOCKSS Strategy for Format Obsolescence

Thus Web archives, such as the CLOCKSS archive, have a different model of when to devote resources to format migration, because:

  • The probability of a format going obsolete is low.
  • If a format does go obsolete, it will be a long time after its introduction.

Further, studies have shown that:

Given these observations, the LOCKSS system's strategy for preserving content is:

  • Store, and maintain the integrity of, the original bits.
  • Exploit the content negotiation capabilities of the Web (and presumably any successor technology to the Web) to detect when a reader's browser does not support the original format in which the bits are stored.
  • If this detection shows migration to be necessary, use John Ockerbloom's Typed Object Model technology to construct a format migration pipeline capable of migrating from the original format to a format the reader's browser can render.
  • Use this pipeline to generate a temporary access copy of the original in a format suitable for the reader's browser.
  • Discard this access copy when it is no longer needed.

A framework to support this strategy was implemented in the LOCKSS software and demonstrated in 2005.
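
The steps of this strategy can be illustrated with a minimal sketch. It is not the LOCKSS implementation and it does not use the Typed Object Model software. It assumes a hypothetical converter registry (CONVERTERS), made-up MIME types and stand-in converter functions, and it treats the HTTP Accept header as the content negotiation signal. A breadth-first search over the registry stands in for pipeline construction.

```python
from collections import deque

# Hypothetical converter registry: (source MIME type, target MIME type) -> function.
# The converters are stand-ins that only tag the bytes; real ones would transcode.
CONVERTERS = {
    ("image/x-oldformat", "image/bmp"): lambda data: b"bmp:" + data,
    ("image/bmp", "image/png"): lambda data: b"png:" + data,
}


def parse_accept(header):
    """Return the set of MIME types in an Accept header whose q-value is above 0."""
    accepted = set()
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        mime = parts[0]
        if not mime:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:
            accepted.add(mime)
    return accepted


def is_acceptable(mime, accepted):
    """True if mime matches an accepted type, including */* and type/* wildcards."""
    major = mime.split("/")[0]
    return mime in accepted or "*/*" in accepted or major + "/*" in accepted


def find_pipeline(source, accepted, converters=CONVERTERS):
    """Breadth-first search for the shortest converter chain to an acceptable type."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        current, path = queue.popleft()
        if is_acceptable(current, accepted):
            return path  # empty list means the original is already acceptable
        for (src, dst) in converters:
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, dst)]))
    return None  # no chain of converters reaches an acceptable type


def serve(stored_bytes, stored_mime, accept_header, converters=CONVERTERS):
    """Return (mime, bytes) for one request; the stored original is never modified."""
    accepted = parse_accept(accept_header)
    pipeline = find_pipeline(stored_mime, accepted, converters)
    if pipeline is None:
        return stored_mime, stored_bytes  # nothing better to offer than the original
    data, mime = stored_bytes, stored_mime
    for step in pipeline:
        data = converters[step](data)  # temporary access copy, built step by step
        mime = step[1]
    return mime, data
```

Under these assumptions, a request whose Accept header lists image/png but not the stored image/x-oldformat is served through a two-step pipeline, while a request that accepts the stored type (or */*) receives the original bits unchanged. The access copy exists only as the return value, so it is discarded as soon as the response has been sent.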

This strategy has a number of significant advantages:

  • It uses the minimum amount of storage.
  • It does not waste resources migrating content which is unlikely to be accessed and, if ever accessed, is unlikely to have suffered format obsolescence.
  • It performs any format migration that is actually necessary as late as possible, when the technology for performing it is likely to be better.
  • It expends resources as late as possible, exploiting the time value of money to the maximum extent.
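
The last two advantages can be made concrete with a small worked calculation. This is a sketch under simplifying assumptions that are not in the analysis above: a constant annual discount rate, a single future migration, and illustrative figures chosen by the caller. The function names are hypothetical.

```python
def present_value(cost, annual_rate, years):
    """Value today of a cost paid `years` from now, discounted at `annual_rate`."""
    return cost / (1.0 + annual_rate) ** years


def expected_deferred_cost(cost, annual_rate, years, probability_needed):
    """Present value of a deferred migration, weighted by the chance it is ever needed."""
    return probability_needed * present_value(cost, annual_rate, years)


if __name__ == "__main__":
    # Illustrative numbers only: migrating now costs 100 units; the same
    # migration deferred 10 years at a 5% discount rate, and needed with
    # probability 0.1, is compared against it.
    now = 100.0
    later = expected_deferred_cost(100.0, 0.05, 10, 0.1)
    print(f"migrate now: {now:.2f}  migrate on access: {later:.2f}")
```

Deferral lowers the cost in two independent ways: discounting reduces the present value of any money that is eventually spent, and the low probability of obsolescence means that in most cases the money is never spent at all.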

Format Migration in the CLOCKSS Archive

The CLOCKSS archive is a dark archive. No readers (Consumers in the OAIS terminology) ever "interact with [CLOCKSS] services to find preserved information of interest and to access that information in detail". If content is ever triggered from the archive, readers access it from one of a number of re-publishing systems. Dissemination of triggered content is a transaction between the archive and one or more of these re-publishing systems. It involves constructing a Dissemination Information Package (DIP) and transmitting it to the re-publishing system(s). The digital object is stored in the re-publishing system in the format in which it was represented in the DIP. If a subsequent reader's browser is unable to render that format, the technique described above can be applied by the re-publishing system the reader is accessing.

The LOCKSS software used by the CLOCKSS archive integrates the File Identification Tool Set (FITS), which includes JHOVE, DROID and other tools. FITS is currently configured as follows:

  • Tool priority: DROID is placed at the top of the list, followed by JHOVE.
  • The NLNZ Metadata Extractor is disabled, because it throws exceptions.
  • Validation is turned off, because it is very time-consuming on large files.
  • DROID is configured to exclude html, xhtml and pdf files, because it performs badly on these file types.
  • JHOVE is configured to exclude js files, because it performs badly on files of this type.
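
The effect of this configuration on an individual file can be sketched as follows. The data structure and the tools_for function are hypothetical illustrations. They are not the FITS configuration file format or the FITS API, and the other tools bundled with FITS are omitted.

```python
import os

# Hypothetical in-memory model of the configuration listed above, in priority order.
TOOLS = [
    {"name": "DROID", "enabled": True, "exclude_exts": {"html", "xhtml", "pdf"}},
    {"name": "JHOVE", "enabled": True, "exclude_exts": {"js"}},
    {"name": "NLNZ Metadata Extractor", "enabled": False, "exclude_exts": set()},
]

VALIDATE = False  # validation is turned off because it is slow on large files


def tools_for(filename, tools=TOOLS):
    """Return the names of the enabled tools, in priority order, that examine filename."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return [t["name"] for t in tools if t["enabled"] and ext not in t["exclude_exts"]]
```

In this model a pdf file is examined only by JHOVE, a js file only by DROID, and a file with any other extension by both, with DROID's result taking priority.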

Various outputs from FITS are available through the LOCKSS daemon's GUI for every URL in an AU.

Thus, if the format in which a digital object is stored in the archive is known at the time of a trigger event to be obsolete (in that the vast majority of browsers in general use are unable to render it), the technique described above can be applied while generating the DIP, by emulating a browser that cannot render the format in question. In this case the format in which the digital object is stored in the re-publishing system will differ from that in the archive, being the result of a format migration of the original. The original continues to be preserved in its original format in the archive; what is stored in the re-publishing system is the temporary access copy.
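
One way to picture "emulating a browser that cannot render the format" is to build the Accept header such a browser would send. The sketch below is an assumption about how this could be expressed, not a description of the CLOCKSS trigger process; the function name and the MIME types are hypothetical.

```python
def emulated_accept(supported_types, obsolete_types):
    """Build an Accept header for a synthetic browser that lacks the obsolete formats.

    No */* wildcard is added, so any format missing from the header is
    treated as unrenderable by whatever consumes the header.
    """
    obsolete = set(obsolete_types)
    kept = [t for t in supported_types if t not in obsolete]
    return ", ".join(kept)


if __name__ == "__main__":
    header = emulated_accept(
        ["text/html", "image/png", "image/x-oldformat"],
        ["image/x-oldformat"],
    )
    print(header)
```

Requesting each object with such a header during DIP generation would cause content stored in an obsolete format to be delivered as a migrated access copy, while all other content is delivered as stored.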

Availability of Format Converters

Any strategy for format migration, not just the one taken by the LOCKSS software, depends upon the timely availability of converters capable of transforming the doomed format into a less doomed one. As regards the Web formats preserved by LOCKSS networks and the CLOCKSS archive, the sunk investment in (and thus value of) existing Web content in format A means that format A will not be rendered obsolete by format B (i.e. support for rendering format A will not be removed from, and support for format B added to, browsers in common use) unless and until there is a suitable converter from format A to format B. Thus the risk of a format going obsolete with no suitable converter is low.

Further, the Web content of LOCKSS networks and the CLOCKSS archive can be satisfactorily rendered by a completely open source stack. Thus there are open source renderers for the content, which:

Note that, since in the LOCKSS approach format migration takes place at access time rather than at some earlier pre-emptive migration time, any criticism of the approach on the basis that converters might not be available applies a fortiori to the pre-emptive approach.

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Chief Scientist

Relevant Documents

  1. Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995
  2. OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 2. June 2012 (ISO 14721:2012) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31
  3. David S. H. Rosenthal and Vicky Reich. "Permanent Web Publishing". In Proceedings of the FREENIX Track: 2000 USENIX Annual Technical Conference. June 18-23, 2000, San Diego, California. pp. 129-140. http://lockss.org/locksswiki/files/Freenix2000.pdf
  4. David S. H. Rosenthal. "Format Obsolescence: Scenarios", April 29, 2007 http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
  5. David S. H. Rosenthal. "Spring CNI Plenary: The Remix". April 10, 2009 http://blog.dshr.org/2009/04/spring-cni-plenary-remix.html
  6. David S. H. Rosenthal. "Format Obsolescence: Right Here Right Now?" January 3, 2008 http://blog.dshr.org/2008/01/format-obsolescence-right-here-right.html
  7. John Ockerbloom. "Mediating Among Diverse Data Formats". Tech. Rep. CMU-CS-98-102, Carnegie Mellon University, 1998. http://reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-102.pdf
  8. David S. H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. "Transparent Format Migration of Preserved Web Content". D-Lib Magazine, vol. 11, no. 1, January 2005. doi:10.1045/january2005-rosenthal
  9. Ian Adams, Ethan L. Miller, Mark W. Storer. "Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories", Tech. Rept. UCSC-SSRC-11-01, University of California, Santa Cruz, March 2011 http://www.ssrc.ucsc.edu/pub/adams-ssrctr-11-01.html
  10. David S. H. Rosenthal, Daniel C. Rosenthal, Ethan L. Miller, Ian F. Adams, Mark W. Storer, Erez Zadok. "The Economics of Long-Term Digital Storage", Memory of the World in the Digital Age, Vancouver, BC, September 2012. http://www.lockss.org/locksswp/wp-content/uploads/2012/09/unesco2012.pdf
  11. David S. H. Rosenthal. "Formats Through Time" October 9, 2012 http://blog.dshr.org/2012/10/formats-through-time.html
  12. David S. H. Rosenthal. "Are Format Specifications Important For Preservation?" January 4, 2009 http://blog.dshr.org/2009/01/are-format-specifications-important-for.html