LOCKSS: Format Migration

From CLOCKSS Trusted Digital Repository Documents
Revision as of 04:00, 7 December 2013 by Dshr (Talk | contribs)

Obsolescence of Web Formats

Jeff Rothenberg's path-breaking pre-Web analysis of digital preservation predicted that:

  • The probability of an individual format going obsolete was high.
  • The time that would elapse between the introduction of a new format and its obsolescence would likely be short.

The OAIS reference model inherited this analysis, along with its implication that it was desirable to expend resources now to prepare for the likely rapid obsolescence of formats.

The LOCKSS technology was designed from the start specifically to preserve content published on the Web. A Web format becomes obsolete when support for it is removed from browsers. Theoretical analyses of the mechanisms by which this would happen predicted that it would be a rare occurrence, because the incentives for removing support are weak and the disincentives are strong. Subsequent practical research by Matt Holden of INA examined the renderability of the Web formats predicted to be most likely to suffer obsolescence, namely audio-visual formats from the early days of the Web, and showed that 15 years later format obsolescence was negligible.

Further, the alternative to format migration is emulation. The argument for format migration has always been that it would be impractical to deliver emulation to end-users. Recent work has demonstrated two viable paths to delivering emulation to readers of the types of web content preserved by LOCKSS and CLOCKSS:

  • A team from the University of Freiburg presented papers at IDCC2013 and iPRES2013 showing that it was possible to deliver emulation-as-a-cloud service to browsers using only HTML5, with no special plugin. What delivery method could be more convenient than embedding a live emulation in a Web page simply by pasting a link into it?
  • Building on earlier work by, among others, the University of Oxford, running JavaScript emulations of obsolete environments in the reader's browser is now routine.

Thus it is far from clear that, even if Web formats eventually suffer obsolescence, format migration would be necessary. By the time obsolescence happens, if it ever does, delivering a transparent emulation to the reader's browser may well be the preferred method.

LOCKSS Strategy for Format Obsolescence

Thus Web archives, such as the CLOCKSS archive, have a different model of when to devote resources to format migration, because:

  • The probability of a format going obsolete is low.
  • If a format does go obsolete, it will be a long time after its introduction.
  • It may well be that emulation, rather than format migration, would be the preferred way to deliver content in an obsolete format, were obsolescence ever to occur.

Further, studies have shown that:

Given these observations, the LOCKSS system's strategy for preserving content is:

  • Store, and maintain the integrity of, the original bits.
  • Exploit the content negotiation capabilities of the Web (and presumably any successor technology to the Web) to detect when a reader's browser does not support the original format in which the bits are stored.
  • If this detection shows migration to be necessary, use John Ockerbloom's Typed Object Model technology to construct a format migration pipeline capable of migrating from the original format to a format the reader's browser can render.
  • Use this pipeline to generate a temporary access copy of the original in a format suitable for the reader's browser.
  • Discard this access copy when it is no longer needed.
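
The steps above can be sketched as follows. This is a minimal illustration and not the LOCKSS implementation: the function names, the `converters` table and the simplified handling of the HTTP `Accept` header (including wildcards) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical converter type: takes the original bytes, returns migrated bytes.
Converter = Callable[[bytes], bytes]


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equally preferred types keep header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def accepts(prefs: List[Tuple[str, float]], media_type: str) -> bool:
    """True if the most specific matching range has q > 0."""
    lookup = dict(prefs)
    major = media_type.split("/")[0]
    for candidate in (media_type, major + "/*", "*/*"):
        if candidate in lookup:
            return lookup[candidate] > 0
    return False


def serve(original: bytes, stored_type: str, accept_header: str,
          converters: Dict[Tuple[str, str], Converter]) -> Tuple[str, bytes]:
    """Return (media_type, body) for one request.

    The stored original is never modified. Any migrated copy exists only
    as the return value, so it is discarded once the response is sent.
    """
    prefs = parse_accept(accept_header)
    if not prefs or accepts(prefs, stored_type):
        return stored_type, original  # the browser can render the original
    for target, q in prefs:
        if q > 0 and (stored_type, target) in converters:
            return target, converters[(stored_type, target)](original)
    return stored_type, original  # no converter known: fall back to the bits
```

For example, with a stand-in converter registered for `("image/gif", "image/png")`, a request whose `Accept` header lists `image/png` but not `image/gif` receives a migrated access copy, while a request that accepts `image/gif` receives the original bits unchanged.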

A framework to support this strategy was implemented in the LOCKSS software and demonstrated in 2005. To avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area is on hold. When there is evidence that some format of content under preservation is facing obsolescence, a decision will be taken as to whether a production version of this migration strategy is the appropriate path to take, or whether (for example) an in-browser emulation strategy would be more effective.

This strategy has a number of significant advantages:

  • It uses the minimum amount of storage.
  • It does not waste resources migrating content which is unlikely to be accessed and, if ever accessed, is unlikely to have suffered format obsolescence.
  • It performs any format migration that is actually necessary as late as possible, when the technology for performing it is likely to be better.
  • It expends resources as late as possible, exploiting the time value of money to the maximum extent.
  • It does not commit to format migration, which may not be the appropriate strategy at the time the reader requests access.
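
The time-value-of-money advantage can be made concrete with the standard present-value formula. The sketch below is illustrative only: the function names are hypothetical, and any cost, delay, discount rate or probability passed in is an assumption of the caller, not a figure from the CLOCKSS archive.

```python
def present_value(cost: float, years: float, discount_rate: float) -> float:
    """Present value of a cost incurred `years` from now."""
    return cost / (1.0 + discount_rate) ** years


def expected_deferred_cost(cost: float, years: float, discount_rate: float,
                           probability_needed: float) -> float:
    """Expected present value of a migration that is deferred until access
    and is only performed with probability `probability_needed`."""
    return probability_needed * present_value(cost, years, discount_rate)
```

Under these assumptions, a migration costing 100 units that is deferred for 10 years at a 5% discount rate has a present value of roughly 61 units, and the expected cost falls further if there is only a small probability that the migration is ever needed.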

This dual strategy of being prepared for both format migration and emulation is the approach endorsed by Jeff Rothenberg in a March 2012 presentation (see, for example, slide 41).

Format Migration in the CLOCKSS Archive

The CLOCKSS archive is a dark archive. No readers (Consumers in the OAIS terminology) ever "interact with [CLOCKSS] services to find preserved information of interest and to access that information in detail". If content is ever triggered from the archive, readers access it from one of a number of re-publishing systems. Dissemination of triggered content is a transaction between the archive and one or more of these re-publishing systems, which involves construction of a Dissemination Information Package (DIP) and its transmission to the re-publishing system(s). If a subsequent reader's browser is unable to render the format in which the digital object was represented in the DIP, and in which it is thus stored in the re-publishing system the reader is accessing, the technique described above can be applied.

The LOCKSS software used by the CLOCKSS archive integrates the File Identification Tool Set (FITS), which includes JHOVE, DROID and other tools. FITS is currently configured as follows:

  • Priority: Put DROID at top of list, then JHOVE.
  • Disable NLNZ Metadata Extractor, which gives Exceptions.
  • Turn off validation, which is very time consuming on large files.
  • Configure DROID to exclude html, xhtml & pdf because it performs badly on these file types.
  • Configure JHOVE to exclude js because it performs badly on files of this type.
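
The effect of this configuration can be modelled as follows. This is a sketch of the rules listed above, not FITS's actual configuration syntax or API: the `ToolRule` class, the `tools_for` function and the `VALIDATION_ENABLED` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class ToolRule:
    """One identification tool and the restrictions placed on it."""
    name: str
    enabled: bool = True
    exclude_exts: FrozenSet[str] = frozenset()


# List order expresses priority: DROID first, then JHOVE.
TOOL_RULES = [
    ToolRule("DROID", exclude_exts=frozenset({"html", "xhtml", "pdf"})),
    ToolRule("JHOVE", exclude_exts=frozenset({"js"})),
    ToolRule("NLNZ Metadata Extractor", enabled=False),  # gives Exceptions
]

# Validation is turned off because it is very slow on large files.
VALIDATION_ENABLED = False


def tools_for(filename: str, rules: Sequence[ToolRule] = TOOL_RULES) -> List[str]:
    """Return the names of the tools that would examine this file, in priority order."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    return [r.name for r in rules if r.enabled and ext not in r.exclude_exts]
```

In this model a file named `article.pdf` is examined only by JHOVE, `app.js` only by DROID, and `figure.gif` by DROID and then JHOVE.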

Various outputs from FITS are available through the LOCKSS daemon's GUI for every URL in an AU.

Thus, if a format in which a digital object is stored in the archive is known at the time of a trigger event to be obsolete (in that the vast majority of browsers in general use are unable to render it), the technique described above can be applied in the process of generating the DIP, by emulating a browser that cannot render the format in question. In this case the format in which the digital object is stored in the re-publishing system will be different from that in the archive, being the result of a format migration of the original. The original continues to be preserved in its original format in the archive. What is stored in the re-publishing system is the temporary access copy.

Availability of Format Converters

Any strategy for format migration, not just the one taken by the LOCKSS software, depends upon the timely availability of converters capable of transforming the doomed format into a less doomed one. As regards the Web formats preserved by LOCKSS networks and the CLOCKSS archive, the sunk investment in (and thus value of) existing Web content in format A means that format A will not be rendered obsolete by format B (i.e. support for rendering format A will not be removed from, and support for format B added to, browsers in common use) unless and until there is a suitable converter from format A to format B. Thus the risk of a format going obsolete with no suitable converter is low.

Further, the Web content of LOCKSS networks and the CLOCKSS archive can be satisfactorily rendered by a completely open source stack. Thus there are open source renderers for the content, which:

Note that, since in the LOCKSS approach format migration takes place at access time rather than at some earlier pre-emptive migration time, any criticism of the approach on the basis that converters might not be available applies a fortiori to the pre-emptive approach.

Again, to avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area, such as integration with a registry of format converters, is on hold.
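
To illustrate what consulting a registry of format converters might involve, the sketch below treats the registry as a directed graph of (source format, target format) pairs and searches breadth-first for the shortest chain of conversions ending in a format the browser can render. It is an assumption-laden example: `find_pipeline` is a hypothetical name, and this is neither the Typed Object Model's interface nor part of the LOCKSS software.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def find_pipeline(converters: Iterable[Tuple[str, str]], source: str,
                  renderable: Set[str]) -> Optional[List[str]]:
    """Return the shortest chain of formats from `source` to any renderable
    format, or None if the registered converters cannot reach one."""
    if source in renderable:
        return [source]  # no migration needed
    graph: Dict[str, List[str]] = {}
    for src, dst in converters:
        graph.setdefault(src, []).append(dst)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], []):
            if nxt in seen:
                continue
            if nxt in renderable:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None
```

Given converters from a made-up format `image/x-old` to `image/gif` and from `image/gif` to `image/png`, and a browser that renders only `image/png`, the search returns the two-step chain through `image/gif`. A result of `None` corresponds to the case discussed above in which no suitable converter is available.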

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Chief Scientist

Relevant Documents

  1. Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995
  2. OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 2. June 2012 (ISO 14721:2012) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31
  3. David S. H. Rosenthal and Vicky Reich. "Permanent Web Publishing", in Proceedings of the FREENIX Track: 2000 USENIX Annual Technical Conference. June 18-23, 2000, San Diego, California. pp. 129-140. http://lockss.org/locksswiki/files/Freenix2000.pdf
  4. David S. H. Rosenthal. "Format Obsolescence: Scenarios", April 29, 2007 http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
  5. David S. H. Rosenthal. "Spring CNI Plenary: The Remix". April 10, 2009 http://blog.dshr.org/2009/04/spring-cni-plenary-remix.html
  6. David S. H. Rosenthal. "Format Obsolescence: Right Here Right Now?" January 3, 2008 http://blog.dshr.org/2008/01/format-obsolescence-right-here-right.html
  7. John Ockerbloom. "Mediating Among Diverse Data Formats". Tech. Rep. CMU-CS-98-102, Carnegie Mellon University, 1998. http://reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-102.pdf
  8. David S. H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. "Transparent Format Migration of Preserved Web Content", D-Lib Magazine, vol. 11, no. 1, January 2005. doi:10.1045/january2005-rosenthal
  9. Ian Adams, Ethan L. Miller, Mark W. Storer. "Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories", Tech. Rep. UCSC-SSRC-11-01, University of California, Santa Cruz, March 2011 http://www.ssrc.ucsc.edu/pub/adams-ssrctr-11-01.html
  10. David S. H. Rosenthal, Daniel C. Rosenthal, Ethan L. Miller, Ian F. Adams, Mark W. Storer, Erez Zadok. "The Economics of Long-Term Digital Storage", Memory of the World in the Digital Age, Vancouver, BC, September 2012. http://www.lockss.org/locksswp/wp-content/uploads/2012/09/unesco2012.pdf
  11. David S. H. Rosenthal. "Formats Through Time" October 9, 2012 http://blog.dshr.org/2012/10/formats-through-time.html
  12. David S. H. Rosenthal. "Are Format Specifications Important For Preservation?" January 4, 2009 http://blog.dshr.org/2009/01/are-format-specifications-important-for.html