LOCKSS: Basic Concepts

From CLOCKSS Trusted Digital Repository Documents
Revision as of 02:07, 25 July 2014 by Dshr

This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.

LOCKSS Program

The LOCKSS Program started on 4 October 1998 under the auspices of Stanford University Libraries to develop and support the LOCKSS technology. It was funded initially by a small grant from Michael Lesk at the NSF, and then supported by the Andrew W. Mellon Foundation, the NSF and Sun Microsystems. It transitioned from grant funding to the "Red Hat" model of free, open source software and paid support thanks to a matching grant from the Mellon Foundation, and has been financially stable since 2008 on that basis. Although program staff are Stanford employees, at no time has Stanford provided any financial support for the program. The program pays all staff and operational costs, Stanford indirect costs, and an "occupancy charge" for its office space.

LOCKSS Daemon

The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. It is the only application program that runs in a LOCKSS box: every action of the box, whether for ingest, preservation, dissemination or administration, is performed by the daemon. The daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor its performance. Among the functions performed by the LOCKSS daemon are:

  • Ingest via Web crawling or file import.
  • Preservation via the LOCKSS: Polling and Repair Protocol.
  • Dissemination by acting as both a Web server and a Web proxy, and by file export.
  • Administration via a Web user interface.
  • Status and statistics reporting.

Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves files. It preserves URLs: each URL's content and its associated headers (metadata) are preserved as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.

Preservation

The LOCKSS system is sometimes criticized as providing only bit-level preservation, but this is a misunderstanding. The system employs exactly the same techniques (and in most cases exactly the same software tools) as other preservation systems, including:

The difference between LOCKSS and most other preservation systems lies not in what techniques are employed but in when those techniques are employed. In the interest of economy, the LOCKSS system stores only the original bits, and delays all operations on them except integrity checking as long as possible. Some systems, for example, preemptively migrate content in bulk from formats that are not yet obsolete into formats that are presumed to be less at risk of obsolescence, which consumes processing resources, and then store both the original and the migrated copies, which consumes storage resources. LOCKSS instead migrates only individual files, and only when a reader's request indicates that migration of that file is necessary. To save storage, the migrated version is discarded when it is no longer needed. This capability was demonstrated in 2005 but has remained unused in practice because the formats of content preserved in the LOCKSS system are not going obsolete.
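The migrate-on-access idea can be illustrated with a minimal Python sketch. It is not the daemon's implementation (the daemon is written in Java). The StoredFile type, the converter registry, and the use of a set of acceptable media types to stand in for "the reader's request indicates that migration is necessary" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

# Hypothetical converter registry type: (source type, target type) -> function.
Converter = Callable[[bytes], bytes]


@dataclass(frozen=True)
class StoredFile:
    """The original bits as ingested, plus the media type recorded for them."""
    content: bytes
    media_type: str


def serve(stored: StoredFile,
          accepted: Set[str],
          converters: Dict[Tuple[str, str], Converter]) -> Tuple[bytes, str]:
    """Return (bytes, media_type) for a single read request.

    The stored original is never modified, and migrated bytes are not
    written back, so nothing extra is kept after the response is built.
    """
    if stored.media_type in accepted:
        # The reader can use the original format: no migration needed.
        return stored.content, stored.media_type
    for target in sorted(accepted):
        convert = converters.get((stored.media_type, target))
        if convert is not None:
            # Migrate this one file, for this one request only.
            return convert(stored.content), target
    # No suitable converter is known: fall back to the original bits.
    return stored.content, stored.media_type


# Usage with made-up media types and a stand-in converter.
converters = {("image/x-old", "image/x-new"): lambda data: b"NEW:" + data}
original = StoredFile(b"pixels", "image/x-old")
migrated = serve(original, {"image/x-new"}, converters)
untouched = serve(original, {"image/x-old"}, converters)
```

In this sketch only the original bits are ever held in storage; the cost of conversion is paid per request, and only for files whose format a reader cannot use.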

Plugins

The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of particular content it is to preserve. This is done via the "plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".

Archival Units

Here is an example plugin XML file for Taylor and Francis journals. It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs), each typically representing a year or a volume of a journal. Each AU is defined by the plugin class name, in this case org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin, and a set of definitional parameters defined by the XML file, in this case:

  • base_url
  • journal_id
  • volume_name

For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at au_start_url:

  <entry>
    <string>au_start_url</string>
    <string>"%sclockss/%s/%s/index.html", base_url, journal_id, volume_name</string>
  </entry>
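The au_start_url entry reads as a printf-style format string followed by the names of the parameters that fill it. The Python sketch below shows one way such a template could be expanded. The function name expand_template is hypothetical, the parsing is simplified (it assumes the format string itself contains no commas), and the daemon's own Java code may handle templates differently. The parameter values are the ones given for Volume 6 of Advances in Building Energy Research in the Title Database section.

```python
from typing import Dict


def expand_template(template: str, params: Dict[str, str]) -> str:
    """Expand a template of the form '"<format>", name1, name2, ...'.

    Simplifying assumption: the quoted format string contains no commas.
    """
    fmt, *names = [part.strip() for part in template.split(",")]
    fmt = fmt.strip('"')
    # Substitute the named parameter values into the %s placeholders in order.
    return fmt % tuple(params[name] for name in names)


template = '"%sclockss/%s/%s/index.html", base_url, journal_id, volume_name'
params = {
    "base_url": "http://www.tandfonline.com/",
    "journal_id": "taer20",
    "volume_name": "6",
}
start_url = expand_template(template, params)
print(start_url)  # http://www.tandfonline.com/clockss/taer20/6/index.html
```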

Title Database

The values for these parameters come from the Title Database (TDB), which is not actually a database but a knowledge base: a set of text files in an easy-to-edit syntax that are processed into an XML file, which the LOCKSS daemon then obtains. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each parameter defined by that plugin class whose value differs from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:

{

  publisher <
    name = Taylor & Francis ;
    info[contract] = 2008 ;
    info[tester] = A
  >

  plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin
  param[base_url] = http://www.tandfonline.com/
  implicit < status ; status2 ; year ; name ; param[volume_name] >
...

  {

    title <
      name = Advances in Building Energy Research ;
      issn = 1751-2549 ;
      eissn = 1756-2201 ;
      issnl = 1751-2549
    >

    param[journal_id] = taer20

    au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 >
    au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 >
    au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 >
    au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 >
    au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 >
    au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 >
    au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 >
    au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 >

  }
...
}

The definitional parameters are specified as follows:

  • base_url is http://www.tandfonline.com/ for all Taylor and Francis journals, as specified at the top of the entry.
  • journal_id is taer20, as specified in the section for Advances in Building Energy Research.
  • volume_name is specified by the 5th field of each au line, as declared by the implicit line.
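To make the resolution of parameters concrete, here is a minimal Python sketch that pairs the fields of one au line with the column names of the implicit line and merges the result with the parameters set at the enclosing levels. It is not the actual TDB processor: the function names are hypothetical, and the sketch assumes that fields are separated by ";" and never contain ";" themselves.

```python
from typing import Dict, List


def split_fields(line: str) -> List[str]:
    """Return the ';'-separated fields found between '<' and '>'."""
    inner = line[line.index("<") + 1 : line.rindex(">")]
    return [field.strip() for field in inner.split(";")]


def au_params(implicit_line: str, au_line: str,
              inherited: Dict[str, str]) -> Dict[str, str]:
    """Resolve the definitional parameters for one au line."""
    columns = split_fields(implicit_line)
    values = split_fields(au_line)
    if len(columns) != len(values):
        raise ValueError("au line does not match the implicit column list")
    params = dict(inherited)  # e.g. base_url and journal_id set further up
    for column, value in zip(columns, values):
        # Only columns written as param[...] contribute parameters.
        if column.startswith("param[") and column.endswith("]"):
            params[column[len("param["):-1]] = value
    return params


implicit = "implicit < status ; status2 ; year ; name ; param[volume_name] >"
au = "au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 >"
inherited = {"base_url": "http://www.tandfonline.com/", "journal_id": "taer20"}
resolved = au_params(implicit, au, inherited)
```

For the 2012 line, resolved holds base_url and journal_id from the enclosing levels plus volume_name set to "6" from the fifth field.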

The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the Property Server and its backup in the Amazon cloud.

AUID

Everywhere an AU needs to be uniquely identified, we use an internal name, its Archival Unit ID (AUID), for example as a key in maps and databases, or in the messages of the LOCKSS: Polling and Repair Protocol. The AUID for an AU is an immutable string with an encoded representation of:

  • The fully-qualified Java class name of the plugin.
  • For each of the definitional parameters defined by the plugin XML:
    • The parameter name.
    • The parameter value.

Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on.

The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above and used as the example in Definition of AIP, is:

org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6
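The structure of this string can be reproduced with a short Python sketch. The encoding rules used here are inferred from the example alone and are assumptions, not a specification of the daemon's Java code: dots in the class name become "|", each parameter is written as name~value after an "&", parameters appear in sorted order by name, and values are percent-encoded with "." additionally written as %2E. The sketch does not attempt to handle values that contain "~".

```python
from typing import Dict
from urllib.parse import quote


def encode_component(text: str) -> str:
    """Percent-encode reserved characters, and '.' as well (assumed rule)."""
    return quote(text, safe="").replace(".", "%2E")


def make_auid(plugin_class: str, params: Dict[str, str]) -> str:
    """Build an AUID-like string from a plugin class name and parameters."""
    plugin_key = plugin_class.replace(".", "|")
    pairs = [
        f"{encode_component(name)}~{encode_component(value)}"
        for name, value in sorted(params.items())  # assumed: sorted by name
    ]
    return "&".join([plugin_key] + pairs)


auid = make_auid(
    "org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin",
    {
        "volume_name": "6",
        "journal_id": "taer20",
        "base_url": "http://www.tandfonline.com/",
    },
)
print(auid)
```

Because the parameters are sorted before encoding, the same AU yields the same string regardless of the order in which its parameters are supplied, which is what makes such a string usable as a key.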

Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.