Difference between revisions of "LOCKSS: Basic Concepts"

From CLOCKSS Trusted Digital Repository Documents
Jump to: navigation, search
(Initial version)

Revision as of 21:45, 6 April 2014

Contents

LOCKSS: Basic Concepts

This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.

LOCKSS Daemon

The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. The LOCKSS daemon is the only application program that runs in a LOCKSS box. Every action of a LOCKSS box, for ingest, preservation, dissemination and administration is performed by the LOCKSS daemon. The LOCKSS daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor the daemon's performance. Among the functions performed by the LOCKSS daemon are:

  • Ingest via Web crawling, or file import.
  • Preservation via the LOCKSS: Polling and Repair Protocol.
  • Dissemination by acting as both a Web server and a Web proxy, and by file export.
  • Administration via a Web user interface.
  • Status and statistics reporting.

Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves files. It preserves URLs; their content and their associated headers (metadata) as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.

Plugins

The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of particular content it is to preserve. This is done via the"plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".

Archival Units

Here is an example plugin XML file, for Taylor and Francis journals, It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs) representing, typically, a year or a volume of a journal. Each AU is defined by the plugin class name, in this case org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin and a set of definitional parameters defined by the XML file, in this case:

  • base_url
  • journal_id
  • volume_name

For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at au_start_url:

  <entry>
    <string>au_start_url</string>
    <string>"%sclockss/%s/%s/index.html", base_url, journal_id, volume_name</string>
  </entry>

Title Database

The values for these parameters come from the Title Data Base (TDB), which is not actually a database, but a knowledge base represented as a set of text files in an easy-to-edit syntax that are processed into an XML file that is obtained by the LOCKSS daemon. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each of the parameters defined by that plugin class that are different from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:

{

  publisher <
    name = Taylor & Francis ;
    info[contract] = 2008 ;
    info[tester] = A
  >

  plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin
  param[base_url] = http://www.tandfonline.com/
  implicit < status ; status2 ; year ; name ; param[volume_name] >
...

  {

    title <
      name = Advances in Building Energy Research ;
      issn = 1751-2549 ;
      eissn = 1756-2201 ;
      issnl = 1751-2549
    >

    param[journal_id] = taer20

    au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 >
    au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 >
    au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 >
    au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 >
    au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 >
    au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 >
    au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 >
    au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 >

  }
...
}

The definitional parameters are specified as follows:

  • base_url is http://www.tandfonline.com/ for all Taylor and Francis journals specified at the top.
  • journal_id is taer20 specified in the section for Advances in Building Energy Research.
  • volume_name is specified by the 5th column of the table.

The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the Property Server and its backup in the Amazon cloud.

AUID

Everywhere an AU needs to be uniquely identified, we use an internal name, its Archival Unit ID (AUID) as the means to do so, for example as a key in maps and databases, or in the messages of the LOCKSS: Polling and Repair Protocol. The AUID for an AU is immutable string with an encoded representation of:

  • The fully-qualified Java class name of the plugin.
  • &
  • For each of the definitional parameters defined by the plugin XML:
    • The parameter name.
    • ~
    • The parameter value.
    • & (except for the last parameter)

Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on,

The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above, and used as the example in Definition of AIP is:

org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6

Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Chief Scientist