Difference between revisions of "Definition of AIP"

From CLOCKSS Trusted Digital Repository Documents
(Creating an AIP from a file transfer SIP: Response to Site Visit Schedule)
(restoring post-2016 edits)
 
(3 intermediate revisions by one user not shown)
Line 26: Line 26:
 
== CLOCKSS Archival Information Package (AIP) ==
 
  
CLOCKSS calls its AIPs Archival Units (AUs). They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL name, since they would be tags within the URL) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to that version of the content. This metadata is represented as a file containing {key, value} pairs. It will include all the HTTP headers obtained by the <tt>GET</tt> that obtained the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It will also include additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum, and a timestamp. This somewhat complex representation is required because:
+
CLOCKSS calls its AIPs [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]]. They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. The order of the protocol and host fields of the URL is reversed, so content collected from the same host via multiple protocols (e.g. HTTP and HTTPS) is in subtrees of the host's directory. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL name, since they would be tags within the URL) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to that version of the content. This metadata is represented as a file containing {key, value} pairs. It will include all the HTTP headers obtained by the <tt>GET</tt> that obtained the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It will also include additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum, and a timestamp. This somewhat complex representation is required because:
 
* The URLs <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> may both contain (different) content.
 
 
* The content and/or the headers obtained from <tt>http://www.example.com/foo/bar</tt> at time T(0) and at time T(1) may differ.
 
The representation allows for easy access by tools other than the LOCKSS daemon, for example shell scripts.
+
The representation allows for easy access by tools other than the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]], for example shell scripts, should the LOCKSS software ever become obsolete.
 +
 
 +
[[File:AU-Structure.png]]
  
 
=== CLOCKSS AIP Examples ===
 
Line 52: Line 54:
 
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:
 
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:
 
<ul>
 
<ul>
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the plugin ID.</li>
+
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the [[LOCKSS: Basic Concepts#Plugins|plugin ID]].</li>
 
<li>The ''plugin ID'', which, in encoded form, identifies the class to be instantiated and supplies additional information.</li>
 
 
</ul>
 
</ul>
Line 58: Line 60:
 
The context for the content of a URL in an AU consists of the associated metadata, including the <tt>Content-Type</tt> and the other HTTP headers which together provide the information a Web browser uses to render the content.<br />
 
 
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li>
 
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li>
<li>'''Reference:''' Each AU has an immutable internal name, computed from its context information and stored with it. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li>
+
<li>'''Reference:''' Each AU has an immutable internal name, its [[LOCKSS: Basic Concepts#AUID|AUID]], computed from its context information and stored in the file named <tt>#au_id_file</tt> in the root directory of the AU. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li>
 
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul>
 
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul>
  
 
== Creating AIPs from SIPs ==
 
  
When the AU is created its root directory is created, and the context information (plugin ID and AU parameters) are recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.
+
When the AU is created, its root directory is created, and the context information ([[LOCKSS: Basic Concepts#Plugins|plugin ID]] and AU parameters) is recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.
  
 
Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are [[Definition of SIP|two kinds of SIP]]:
 
Line 79: Line 81:
 
=== Creating an AIP from a harvest SIP ===
 
  
When the scheduled time for the AU's requested collection attempt arrives, the plugin configures the box's Web crawler as it first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site, then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly-discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].
+
When the scheduled time for the AU's requested collection attempt arrives, the [[LOCKSS: Basic Concepts#Plugins|plugin]] configures the box's Web crawler, which first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site and then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].
  
 
Note that for quality assurance (QA) reasons, as set out in [[CLOCKSS: Ingest Pipeline]], in practice each harvest AU is created twice:
 
Line 86: Line 88:
 
** crawl the content from the CLOCKSS ingest boxes
 
 
** start regular integrity checks with the other production CLOCKSS boxes using the [[LOCKSS: Polling and Repair Protocol]].
 
The process the production boxes use is exactly the same as used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will come from the ingest box. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the Title DataBase (TDB) to ZAPPED (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.
+
The process the production boxes use is exactly the same as used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the publisher's URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will be that previously collected from the publisher's URL by the ingest box (see [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|Harvest Content Processing]]). Note that the publisher's web site is not involved in this process. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the [[LOCKSS: Basic Concepts#Title Database|Title DataBase (TDB)]] to ZAPPED. The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.
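
The proxy behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <tt>proxy_response</tt> and the in-memory dictionary standing in for the ingest box's repository are hypothetical and are not part of the LOCKSS daemon.

```python
from http import HTTPStatus


def proxy_response(store: dict, url: str):
    """Answer a proxied GET using only previously collected content.

    `store` maps a publisher URL to (headers, body); it is a hypothetical
    stand-in for the ingest box's repository. The publisher's web site is
    never contacted.
    """
    entry = store.get(url)
    if entry is None:
        # The ingest box holds no content for this URL.
        return HTTPStatus.NOT_FOUND, {}, b""
    headers, body = entry
    return HTTPStatus.OK, headers, body
```

Because the lookup is keyed by the publisher's URL, the production box stores the content under the publisher's name even though the bytes come from the ingest box.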
  
 
=== Creating an AIP from a file transfer SIP ===
 
Line 97: Line 99:
 
Content and metadata in SIPs received via file transfer are organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the ''staging server''. The hierarchy contains a directory per publisher, each containing a directory per year that holds the content and metadata received from that publisher during that year, and an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information; for those publishers, fixity information is computed as soon as possible after file transfer and stored with the content.
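
As an illustration of the automatically generated manifest page, the following Python sketch lists every file under one publisher's year directory and emits an HTML page with links and a permission statement. The function name, the HTML layout, and the permission text passed in by the caller are assumptions; the actual staging server scripts are shell scripts whose details are not given here.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_manifest(year_dir: Path, permission_statement: str) -> str:
    """Return an HTML manifest linking every file under one year directory."""
    # Relative POSIX-style paths, sorted so the page is stable between runs.
    files = sorted(
        p.relative_to(year_dir).as_posix()
        for p in year_dir.rglob("*")
        if p.is_file()
    )
    items = "\n".join(
        f'<li><a href="{html.escape(quote(f), quote=True)}">{html.escape(f)}</a></li>'
        for f in files
    )
    return (
        "<html><body>\n"
        f"<p>{html.escape(permission_statement)}</p>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>\n"
    )
```

A crawl that starts from this page and follows only its links reaches every file received so far for that publisher and year.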
  
At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's plugin is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server.  The publisher's names or URLs for these files are contained in, or recoverable from metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.
+
At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's [[LOCKSS: Basic Concepts#Plugins|plugin]] is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server. The publisher's names or URLs for these files are contained in, or recoverable from, metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.
  
 
After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its status changed to ZAPPED in the TDB (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved [[CLOCKSS: Logging and Records#External Reports|process for generating external reports]].
 
Line 103: Line 105:
 
== CLOCKSS Representation Information ==
 
  
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS archive is preserved outside the CLOCKSS archive, primarily in open source code repositories.
+
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS Archive is preserved outside the CLOCKSS Archive, primarily in open source code repositories.
  
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS archive, in the SourceForge repository and elsewhere.
+
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS Archive, in the GitHub repository and elsewhere.
  
 
Thus the Representation Information that must be preserved with content objects in a CLOCKSS AIP is the information needed by web browsers to render, and by the LOCKSS software to migrate, the content obtained from a URL. This information is in two parts:
Line 132: Line 134:
 
Changes to this document require:
 
 
* Review by LOCKSS Engineering Staff
 
* Approval by LOCKSS Chief Scientist
+
* Approval by LOCKSS Technical Manager
  
 
== Relevant Documents ==
 

Latest revision as of 23:26, 14 August 2019


CLOCKSS Definition of AIP

OAIS Archival Information Package (AIP)

The OAIS definition of AIP is:

Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.

Other relevant definitions are:

AIP Edition: An AIP whose Content Information or Preservation Description Information has been upgraded or improved with the intent not to preserve information, but to increase or improve it. An AIP edition is not considered to be the result of a Migration.
AIP Version: An AIP whose Content Information or Preservation Description Information has undergone a Transformation on a source AIP and is a candidate to replace the source AIP. An AIP version is considered to be the result of a Digital Migration.
Archival Information Collection (AIC): An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages.
Archival Information Unit (AIU): An Archival Information Package where the Archive chooses not to break down the Content Information into other Archival Information Packages. An AIU can consist of multiple digital objects (e.g., multiple files).
Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.

The OAIS discussion of AIP is:

Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The AIP may also contain a collection of other AIPs, and this is discussed and modeled in section 4. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.

The OAIS definition of Representation Information is:

Representation Information: The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.

Other relevant definitions are:

Representation Network: The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.
Representation Rendering Software: A type of software that displays Representation Information of an Information Object in forms understandable to humans.

The OAIS discussion of Representation Information is:

In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.

CLOCKSS Archival Information Package (AIP)

CLOCKSS calls its AIPs Archival Units (AUs). They are constructed from CLOCKSS SIPs as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. The order of the protocol and host fields of the URL is reversed, so content collected from the same host via multiple protocols (e.g. HTTP and HTTPS) is in subtrees of the host's directory. Each of these directories, in addition to the names of descendant components, may contain files whose names start with # (and thus cannot be components of the URL name, since they would be tags within the URL) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named #content containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to that version of the content. This metadata is represented as a file containing {key, value} pairs. It will include all the HTTP headers obtained by the GET that obtained the content, in particular the Content-Type, one component of which is the Media-Type. It will also include additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum, and a timestamp. This somewhat complex representation is required because:

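  • The URLs http://www.example.com/foo and http://www.example.com/foo/bar may both contain (different) content.
  • The content and/or the headers obtained from http://www.example.com/foo/bar at time T(0) and at time T(1) may differ.

The representation allows for easy access by tools other than the LOCKSS daemon, for example shell scripts, should the LOCKSS software ever become obsolete.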
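
As a rough illustration of the directory layout described above, the following Python sketch maps a URL to a directory under an AU's root, placing the host before the protocol. The function names, the example root path, and the handling of empty path components are assumptions made for illustration; the actual LOCKSS repository may encode names differently, for example for ports, query strings, or special characters.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_au_path(au_root: str, url: str) -> PurePosixPath:
    """Map a URL to a directory under the AU root (illustrative only).

    The host comes before the protocol, so content collected from one host
    via HTTP and HTTPS ends up in sibling subtrees of the host's directory.
    """
    parts = urlsplit(url)
    # Drop empty components produced by leading or doubled slashes (assumption).
    components = [c for c in parts.path.split("/") if c]
    return PurePosixPath(au_root, parts.netloc, parts.scheme, *components)


def content_dir(au_root: str, url: str) -> PurePosixPath:
    """Directory holding the versions of content collected from `url`."""
    return url_to_au_path(au_root, url) / "#content"


if __name__ == "__main__":
    root = "/cache/au"  # hypothetical AU root directory
    print(content_dir(root, "http://www.example.com/foo"))
    print(content_dir(root, "http://www.example.com/foo/bar"))
```

Because every URL component is a directory and the content lives in a child whose name starts with #, both http://www.example.com/foo and http://www.example.com/foo/bar can hold content without colliding.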

AU-Structure.png

CLOCKSS AIP Examples

File Transfer AU

The Journal of Laser Applications Volume 24:

Harvest AU

Advances in Building Energy Research Volume 6:

CLOCKSS Preservation Description Information (PDI)

The OAIS Reference Model classifies the Preservation Description Information (PDI) included in an AIP as follows:

  • Provenance: the provenance of each version of each URL in the AU can be determined as follows. The content of that version was obtained from the URL represented by the path from the root of the AU's directory hierarchy to the parent of the content directory, at the time of the timestamp. If this content was the result of a repair from another CLOCKSS box, that is recorded in the metadata.
  • Fixity: the metadata for each version of each URL in the AU includes a checksum computed at the time it was obtained and, if one is available, a checksum provided by the Web server from which it was obtained.
  • Context: the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:
    • The parameters, which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the plugin ID.
    • The plugin ID, which, in encoded form, identifies the class to be instantiated and supplies additional information.

    In effect, the context for the AU is a customized instance of a Java class, normally referred to as its plugin. It is thus executable, capable of performing operations on the AU such as adding content and metadata from a SIP, extracting metadata, and taking part in integrity checks.
    The context for the content of a URL in an AU consists of the associated metadata, including the Content-Type and the other HTTP headers which together provide the information a Web browser uses to render the content.

    Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.
  • Reference: Each AU has an immutable internal name, its AUID, computed from its context information and stored in the file named #au_id_file in the root directory of the AU. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is described below.
  • Access Rights: the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the CLOCKSS board declares a trigger event for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.
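
The Fixity item above can be illustrated with a short Python sketch that records a checksum and timestamp alongside the HTTP headers when a version is collected, and later recomputes the checksum to confirm the content is unchanged. The hash algorithm (SHA-256), the key names checksum and timestamp, and the dictionary standing in for the {key, value} metadata file are all assumptions; they are not specified in this document.

```python
import hashlib


def record_version(content: bytes, headers: dict, timestamp: str) -> dict:
    """Build the metadata for one collected version (hypothetical key names)."""
    metadata = dict(headers)  # keep the HTTP headers, e.g. Content-Type
    metadata["checksum"] = hashlib.sha256(content).hexdigest()
    metadata["timestamp"] = timestamp
    return metadata


def verify_version(content: bytes, metadata: dict) -> bool:
    """Return True if the content still matches its recorded checksum."""
    return hashlib.sha256(content).hexdigest() == metadata["checksum"]
```

A tool independent of the LOCKSS daemon could run such a check over each version in an AU using only the stored content and its metadata.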

Creating AIPs from SIPs

When the AU is created, its root directory is created, and the context information (plugin ID and AU parameters) is recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.
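
A minimal Python sketch of this step follows, under stated assumptions. The file name #au_id_file is taken from the Reference item above; the function names and the way derive_au_id joins the plugin ID with the sorted parameters are hypothetical and are not the actual LOCKSS AUID encoding. The sketch omits the central copy in the daemon's configuration directory.

```python
import os
import stat
from pathlib import Path


def derive_au_id(plugin_id: str, params: dict) -> str:
    """Deterministic name from the context information (hypothetical format)."""
    # Sorting the parameter names makes the result independent of input order,
    # so every box derives the same name from the same context.
    pairs = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return f"{plugin_id}&{pairs}"


def create_au(root: Path, plugin_id: str, params: dict) -> str:
    """Create the AU root directory and record its identity read-only."""
    root.mkdir(parents=True, exist_ok=False)
    au_id = derive_au_id(plugin_id, params)
    id_file = root / "#au_id_file"
    id_file.write_text(au_id + "\n", encoding="utf-8")
    # Remove write permission so the recorded context is not changed casually.
    os.chmod(id_file, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return au_id
```

The name depends only on the context information, not on where the root directory lives, which matches the requirement that the name be the same on all boxes while the location may differ.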

Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are two kinds of SIP:

  • A harvest SIP represents content that the CLOCKSS archive will ingest by crawling the publisher's web site.
  • A file transfer SIP represents content that the publisher will package and transfer to the CLOCKSS archive via FTP, rsync, or other file transfer mechanism.

The process of creating an AU (AIP) from a SIP is different for each of the two types of SIP, but each process starts by configuring the AU on the CLOCKSS boxes, which involves supplying the context information (plugin ID and the parameters it requires) via the CLOCKSS property server. Each box creates a root directory for its instance of the AU at a suitable place in its POSIX file system and calls the AU's plugin to arrange attempts to collect SIPs. The plugin contains information about times when content collection is allowed. The LOCKSS daemon on each box maintains a schedule of all AUs' collection attempts; when an AU requests a collection attempt it is scheduled so as to conform to box-wide and publisher-specific limits on the number of simultaneous collection attempts.
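
The scheduling constraint described above, with box-wide and publisher-specific limits on simultaneous collection attempts, can be sketched as follows. The class name, its fields, and the limit values are hypothetical; the LOCKSS daemon's real scheduler also takes into account the allowed collection times held by the plugin, which this sketch leaves out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CrawlScheduler:
    """Admit collection attempts subject to box-wide and per-publisher limits."""

    box_limit: int
    publisher_limit: int
    active: Counter = field(default_factory=Counter)  # publisher -> running crawls

    def try_start(self, publisher: str) -> bool:
        """Start a collection attempt if neither limit would be exceeded."""
        if sum(self.active.values()) >= self.box_limit:
            return False  # the box is already running its maximum
        if self.active[publisher] >= self.publisher_limit:
            return False  # this publisher is already at its maximum
        self.active[publisher] += 1
        return True

    def finish(self, publisher: str) -> None:
        """Record that one of the publisher's collection attempts has ended."""
        if self.active[publisher] > 0:
            self.active[publisher] -= 1
```

An attempt that is refused would simply remain in the schedule and be retried later.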

AIP (AU) instances in CLOCKSS boxes may be created even before the first SIP supplying content for them is available. The instance continues to accumulate content via a sequence of SIPs becoming available through time, as the LOCKSS daemon repeatedly crawls the Web server from which SIPs are collected. Nominally, each AU represents a delimited span of time, such as a year or a volume of a journal. But because SIPs containing errata or corrections can arrive even after the delimited span of time, there is in general never a point in time at which the AU can be said to be definitively complete in the sense that no further content will ever be added.

Although, as documented in CLOCKSS: Ingest Pipeline, the quality assurance process that content undergoes as it is ingested into the CLOCKSS archive includes some visual spot checks that the content renders properly in a web browser, these are primarily intended to assure that all necessary URLs are being harvested. At the scale of the CLOCKSS archive's operations it is not feasible for these checks to be exhaustive. Further, the Content Information in an AIP is copyright by the publisher. It is their responsibility to ensure that it is Independently Understandable for their readers, who are also the eventual Consumers of the content if it is ever triggered from the CLOCKSS archive. Even if these spot checks were to detect some rendering problems, the CLOCKSS archive would not be permitted to modify the Content Information to correct them.

The goal of the CLOCKSS: Ingest Pipeline is not to ensure that the AUs are "complete and correct"; there are no operationally implementable definitions of those terms. If there were, they would involve second-guessing the publishers. The goal is rather to ensure that the AUs "faithfully reflect what the publisher has published on their web site (for harvested AUs) or supplied to the archive (for file transfer AUs)".

Creating an AIP from a harvest SIP

When the scheduled time for the AU's requested collection attempt arrives, the plugin configures the box's Web crawler, which first verifies that the appropriate CLOCKSS permission statement is present on the publisher's Web site and then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described above.
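The three controls described above (permission check, crawl bounds, rate limit) can be sketched as a small breadth-first crawl. This is an illustration under stated assumptions, not the LOCKSS crawler: `fetch` and `extract_links` are hypothetical callables supplied by the caller, `PERMISSION_TEXT` is a placeholder rather than the real statement wording, and the crawl bounds are modelled as a single regular expression.

```python
import re
import time
from collections import deque

PERMISSION_TEXT = "EXAMPLE PERMISSION STATEMENT"   # placeholder, not the real wording


def crawl(start_url, permission_url, fetch, extract_links, include, delay_s=1.0):
    """Breadth-first crawl bounded by `include`, pausing `delay_s` before each fetch.

    fetch(url) -> str and extract_links(url, body) -> list[str] are
    hypothetical stand-ins for the crawler's internals.
    Returns {url: body} for everything collected.
    """
    # Refuse to crawl unless the permission statement is present.
    if PERMISSION_TEXT not in fetch(permission_url):
        raise PermissionError(f"no permission statement at {permission_url}")

    collected = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        time.sleep(delay_s)                 # rate limit between fetches
        body = fetch(url)
        collected[url] = body
        for link in extract_links(url, body):
            # Follow only links inside the AU's part of the URL name space.
            if link not in seen and include.match(link):
                seen.add(link)
                queue.append(link)
    return collected
```

Passing the fetch and link-extraction functions in as arguments keeps the sketch free of network code, so the bounding and permission logic can be exercised against an in-memory site.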

Note that for quality assurance (QA) reasons, as set out in CLOCKSS: Ingest Pipeline, in practice each harvest AU is created twice:

  • First, a temporary AU is created on the CLOCKSS ingest machines. These machines collect the AU's current content, and then come to agreement on the content using the LOCKSS: Polling and Repair Protocol.
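  • Once agreement is reached, the permanent AU is created on each of the production CLOCKSS boxes, which then collect the AU's content by way of the ingest machines and come to agreement among themselves, as described below.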

The process the production boxes use is exactly the same as used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the publisher's URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will be that previously collected from the publisher's URL by the ingest box (see Harvest Content Processing). Note that the publisher's web site is not involved in this process. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the Title DataBase (TDB) to ZAPPED. The current process for doing so has been deemed inadequate (see CLOCKSS: Ingest Pipeline); a replacement process is under development.

Creating an AIP from a file transfer SIP

Publishers choosing to supply content via file transfer can choose either:

  • To push the content via FTP, SFTP or rsync to a CLOCKSS-run ingest server, using a user name and password chosen by the CLOCKSS team and specific to the publisher.
  • To have the CLOCKSS archive's ingest server pull the content from a publisher-run FTP, SFTP or rsync server using a user name and password chosen by, and specific to, the publisher.

In both cases, the combination of the DNS name at the publisher's end, the user name, and the password identifies the publisher for provenance purposes.

Content and metadata in SIPs received via file transfer are organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the staging server. The hierarchy contains a directory per publisher, each containing a directory per year that holds the content and metadata received from that publisher during that year, together with an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information; for those publishers, fixity information is computed as soon as possible after file transfer and stored with the content.
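As a rough sketch of the two automated steps in this paragraph, the following computes a checksum for a received file and generates a manifest page for one year's directory. The production system uses shell scripts; this Python version is illustrative only. The choice of SHA-256, the file name `manifest.html`, and the page layout are assumptions, and the permission statement text is passed in rather than reproduced here.

```python
import hashlib
import html
from pathlib import Path
from urllib.parse import quote

MANIFEST_NAME = "manifest.html"   # hypothetical file name


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(year_dir: Path, permission_text: str) -> str:
    """Return an HTML page linking to every file under year_dir."""
    links = []
    for p in sorted(year_dir.rglob("*")):
        if p.is_file() and p.name != MANIFEST_NAME:
            rel = p.relative_to(year_dir).as_posix()
            # Percent-encode the link target, HTML-escape the visible text.
            links.append(f'<li><a href="{quote(rel)}">{html.escape(rel)}</a></li>')
    return (
        "<html><body>\n"
        f"<p>{html.escape(permission_text)}</p>\n"
        "<ul>\n" + "\n".join(links) + "\n</ul>\n"
        "</body></html>\n"
    )
```

Because the manifest links to every file, a crawl that starts from it reaches all content received so far, and regenerating it after each transfer exposes newly received SIPs to the next crawl.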

At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the CLOCKSS property server. The AU's plugin is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server. The publisher's names or URLs for these files are contained in, or recoverable from, metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.

After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its status changed to ZAPPED in the TDB (see LOCKSS: Extracting Bibliographic Metadata). A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved process for generating external reports.

CLOCKSS Representation Information

CLOCKSS: Designated Community documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS Archive is preserved outside the CLOCKSS Archive, primarily in open source code repositories.

LOCKSS: Format Migration documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement transparent, on-access format migration. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS Archive, in the GitHub repository and elsewhere.

Thus the Representation Information that is necessary to preserve with content objects in a CLOCKSS AIP is that information needed by web browsers to render, and the LOCKSS software to migrate, the content obtained from a URL. This information is in two parts:

  • The MIME type and other HTTP headers obtained from the URL. The metadata preserved for each version of each URL within an AU includes all HTTP headers.
  • "Magic Number" and other information contained in the HTTP content payload itself. The content preserved for each version of each URL within an AU is the entire HTTP content payload.

A web browser has no other information upon which to base its rendering of the content of a URL than its name (perhaps with a file extension) and these two parts, which must therefore be adequate for the purpose of making the content understandable to Consumers.
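To make the two parts concrete, the sketch below reads the declared media type from a set of preserved HTTP headers and, separately, recognizes a few well-known magic numbers at the start of a payload. It is illustrative only: the function names are hypothetical, the magic-number table covers just four formats, and real browsers apply considerably more elaborate sniffing rules.

```python
from typing import Optional

# Leading byte signatures for a few common formats.
MAGIC_NUMBERS = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\xff\xd8\xff": "image/jpeg",
}


def declared_media_type(headers: dict) -> Optional[str]:
    """Return the media type from Content-Type, without its parameters."""
    for name, value in headers.items():
        if name.lower() == "content-type":      # header names are case-insensitive
            return value.split(";", 1)[0].strip().lower()
    return None


def sniffed_media_type(payload: bytes) -> Optional[str]:
    """Return a media type inferred from the payload's leading bytes, if known."""
    for magic, media_type in MAGIC_NUMBERS.items():
        if payload.startswith(magic):
            return media_type
    return None
```

For a stored version whose headers include `Content-Type: text/html; charset=UTF-8`, `declared_media_type` yields `text/html`; for a payload beginning with `%PDF-1.4`, `sniffed_media_type` yields `application/pdf`. Both inputs are available from what the AU preserves for each version of each URL.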

Locating Digital Objects

The term "digital objects" is often used in discussions of preserved digital information. In the CLOCKSS context, it might refer to either of two types of object:

  • AIPs, or in the CLOCKSS context AUs.
  • Content objects within an AU.

Locating AIPs

The CLOCKSS archive has a single class of AIP, called an Archival Unit (AU). As described above, the Reference part of the AU's Preservation Description Information is an immutable name that is the same on each CLOCKSS box. The location of the instance of the AU on a particular CLOCKSS box may be obtained by querying a map from this internal name to the path to the AU's root directory. This map is built during box startup and subsequently maintained as new AUs are created or existing AUs are moved.

Locating content objects

Content within an AU can be located in one of two ways:

  • Via the URL from which it was obtained. As described above, there is a reversible mapping between the URL from which the content was collected and its path, relative to the root of the AU, in the POSIX file system containing the AU. This enables content within an AU to be located via the map between the AU's internal name and its root location, and the components of the URL.
  • Via metadata search. As described above, content within an AU can be located by querying the metadata database using specific bibliographic metadata fields to match against the bibliographic metadata supplied by the publisher or derived from the AU's context.
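The first of these lookups can be sketched as below. The layout follows the description earlier in this document (host directory first, then protocol, then one directory per URL component), but the details are simplified assumptions: query strings, trailing slashes, and characters needing escaping are ignored, and the AU name and root path used in the usage note are invented for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_path(au_root: str, url: str) -> PurePosixPath:
    """Map a URL to a directory under the AU root: root/host/protocol/components."""
    parts = urlsplit(url)
    components = [c for c in parts.path.split("/") if c]
    return PurePosixPath(au_root, parts.netloc, parts.scheme, *components)


def path_to_url(au_root: str, path: PurePosixPath) -> str:
    """Invert url_to_path for paths it produced."""
    host, scheme, *components = path.relative_to(au_root).parts
    return f"{scheme}://{host}/" + "/".join(components)


def locate(au_roots: dict, au_name: str, url: str) -> PurePosixPath:
    """Combine the AU-name-to-root map with the URL-to-path mapping."""
    return url_to_path(au_roots[au_name], url)
```

For example, with a hypothetical map `{"example-au-2013": "/cache/a/17"}`, the URL `https://www.example.com/j/v1/a.html` resolves to `/cache/a/17/www.example.com/https/j/v1/a.html`, and `path_to_url` recovers the original URL from that path.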

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Technical Manager

Relevant Documents

  1. OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 2. June 2012 (ISO 14721:2012) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31
  2. Definition of SIP
  3. CLOCKSS: Extracting Triggered Content
  4. LOCKSS: Metadata Database
  5. LOCKSS: Extracting Bibliographic Metadata
  6. LOCKSS: Polling and Repair Protocol
  7. CLOCKSS: Ingest Pipeline
  8. CLOCKSS: Designated Community
  9. LOCKSS: Format Migration
  10. David S.H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. http://dx.doi.org/10.1045/january2005-rosenthal accessed 2013.08.07