Definition of SIP

From CLOCKSS Trusted Digital Repository Documents
Revision as of 19:30, 5 April 2014 by Dshr (Talk | contribs)

CLOCKSS Definition of SIP

OAIS Submission Information Package (SIP)

The OAIS definition of a SIP is:

Submission Information Package (SIP): An Information Package that is delivered by the Producer to the OAIS for use in the construction or update of one or more AIPs and/or the associated Descriptive Information.

Other relevant definitions are:

Data Submission Session: A delivery of media or a single telecommunications session that provides Data to an OAIS. The Data Submission Session format/contents is based on a data model negotiated between the OAIS and the Producer in the Submission Agreement. This data model identifies the logical constructs used by the Producer and how they are represented on each media delivery or in the telecommunication session.
Submission Agreement: The agreement reached between an OAIS and the Producer that specifies a data model, and any other arrangements needed, for the Data Submission Session. This data model identifies format/contents and the logical constructs used by the Producer and how they are represented on each media delivery or in a telecommunication session.
Ingest Functional Entity: The OAIS functional entity that contains the services and functions that accept Submission Information Packages from Producers, prepares Archival Information Packages for storage, and ensures that Archival Information Packages and their supporting Descriptive Information become established within the OAIS.

The discussion of SIPs in OAIS is:

The Submission Information Package (SIP) is that package that is sent to an OAIS by a Producer. Its form and detailed content are typically negotiated between the Producer and the OAIS (see related standards in 1.5). Most SIPs will have some Content Information and some PDI.
The relationships between SIPs and AIPs can be complex; as well as a simple one-to-one relationship in which one SIP produces one AIP, other possibilities include: one AIP being produced from multiple SIPs produced at different times by one Producer or by many Producers; one SIP resulting in a number of AIPs; and many SIPs from one or more sources being unbundled and recombined in different ways to produce many AIPs. Even in the first case, the OAIS may have to perform a number of transformations on the SIP. The Packaging Information will always be present in some form.

CLOCKSS Submission Information Package (SIP)

The majority of the content the CLOCKSS archive is chartered to preserve is electronic journals, which are serials. Serials publishers (Producers in OAIS terminology) emit a continuous stream of articles through time, in most cases together with metadata that organizes the articles into a logical structure of issues, volumes and journals. Decisions to trigger content from the CLOCKSS archive will normally be taken at the granularity of a journal, or perhaps of a range of volumes. It is thus important to preserve this logical structure; the CLOCKSS archive does so by organizing content into Archival Units (AUs) which generally correspond to a volume or a year of a journal, but in some cases may correspond to a year of multiple journals. An AU, together with some related information, forms the CLOCKSS archive's AIP.

Publishers submitting content to the CLOCKSS archive choose one of two ways to do so:

  • Harvest, in which the CLOCKSS archive collects the content the publisher supplies to readers from their web site shortly after it is published (see CLOCKSS: Ingest Pipeline).
  • File transfer, in which the publisher packages up content and metadata in a form they define and arranges for the packages to be transferred to the CLOCKSS archive at a time of the publisher's choosing.

Comparing these with OAIS' SIP definitions above, some important aspects are evident:

  • SIPs are a structure that the archive imposes on what from the publisher's point of view is a continuous stream of content. SIP is not a useful concept in communicating with publishers.
  • In neither case is the form of content as it is transferred to the archive defined by the archive. It is defined by the publisher; the archive has very limited ability to affect this definition during negotiation of the Submission Agreement (see CLOCKSS: Publisher Agreement).
  • In both cases the content is constructed by the publisher.
  • In neither case is the content transferred a completed logical unit of content. What is transferred is the set of articles published since the last transfer. This will in almost all cases form only part of a logical unit, for example a volume of a journal, and will in some cases form part of multiple logical units. Examples are transfers that span the end of one volume and the start of its successor, and file transfers from large publishers which typically include articles from many journals. Thus CLOCKSS SIPs are "for use in the construction or update of one or more AIPs".
  • In neither case is a completed logical unit of content, or an entire SIP, transferred at a single point in time:
    • For harvested content, the timing of the transfer is under the control of the archive. The CLOCKSS archive cannot wait until the successor of a volume has started and then ingest the articles the older volume contains as a completed unit because:
      • Doing so places content published early in a volume at additional risk.
      • Publishers are not constrained to cease publishing content in an earlier volume once they have started publishing in its successor. Both "publish ahead of print" and corrections are examples in which this happens.
    • For file transfer content, the timing of the transfer is under the control of the publisher. In general, publishers arrange for the transfer to happen shortly after publication, but they are not constrained to do so, nor are they constrained to identify the point in time at which a volume is complete.

The CLOCKSS archive is thus an example of how "The relationships between SIPs and AIPs can be complex".
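
The many-to-many relationship described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each AU is keyed by journal and volume and that articles carry a hypothetical identifier; the class, function, and field names are not part of the CLOCKSS software.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    journal: str
    volume: str
    ident: str  # hypothetical identifier, used only for illustration


def apply_transfer(aus, transfer):
    """Add each article in one transfer to the AU for its journal and volume.

    Returns the set of AU keys that the transfer touched.
    """
    touched = set()
    for article in transfer:
        key = (article.journal, article.volume)
        aus[key].add(article.ident)
        touched.add(key)
    return touched


aus = defaultdict(set)

# First transfer: articles from two journals, so it updates two AUs.
first = [
    Article("Journal A", "12", "a-12-001"),
    Article("Journal B", "7", "b-7-001"),
]
# Second transfer: spans the end of volume 12 and the start of volume 13.
second = [
    Article("Journal A", "12", "a-12-002"),
    Article("Journal A", "13", "a-13-001"),
]

touched_first = apply_transfer(aus, first)
touched_second = apply_transfer(aus, second)
```

In this sketch each transfer touches two AUs, and the AU for Journal A volume 12 is built from both transfers, so no transfer corresponds to a completed logical unit.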

Because the CLOCKSS archive provides two different ways publishers can choose to provide content for preservation, it supports two different types of SIP, for harvest and for file transfer.

Harvest SIP

Publishers choosing to have the CLOCKSS archive harvest their content normally demonstrate its logical structure by providing volume and issue table-of-contents (ToC) pages. The volume ToC acquires links to the issues as they are published; the issue ToC acquires links to the articles in that issue as they are published. The CLOCKSS archive requires two things of each publisher choosing harvest that, together with the content as it accumulates, form the SIP for that logical unit of content from that publisher:

  • A volume ToC page, or some page playing an equivalent role, via whose links the content that is to form the AU can be found. This page is termed the "manifest page".
  • A permission statement on the manifest page, or on a page with a known relation to the manifest page, containing either a Creative Commons license, or a statement granting CLOCKSS permission to collect and preserve the content pointed to by the manifest page.
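
As a rough sketch of how these two requirements might be checked, the following uses Python's standard `html.parser` to collect the links on a manifest page and to look for a permission statement or a Creative Commons license link. The permission wording, the license-link test, and all names here are assumptions made for illustration; they are not the statement text or the logic the CLOCKSS archive actually uses.

```python
from html.parser import HTMLParser

# Hypothetical wording; the real permission statement text is not given here.
PERMISSION_TEXT = "permission to collect and preserve"
# Assumption: a Creative Commons license is recognised by a link to its deed.
CC_LINK_MARKER = "creativecommons.org/licenses/"


class ManifestParser(HTMLParser):
    """Collect link targets and visible text from a manifest page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def inspect_manifest(html):
    """Return (permission_found, links) for one manifest page."""
    parser = ManifestParser()
    parser.feed(html)
    parser.close()
    # Normalise whitespace and case before searching for the statement.
    text = " ".join(" ".join(parser.text_parts).split()).lower()
    has_statement = PERMISSION_TEXT in text
    has_cc_license = any(CC_LINK_MARKER in link for link in parser.links)
    return has_statement or has_cc_license, parser.links


page = """<html><body><h1>Journal A, Volume 12</h1>
<p>CLOCKSS has permission to collect and preserve this volume.</p>
<a href="/a/12/1/">Issue 1</a><a href="/a/12/2/">Issue 2</a>
</body></html>"""

permitted, links = inspect_manifest(page)
```

A harvester following this sketch would collect the content reachable from `links` only when `permitted` is true.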

File Transfer SIP

Publishers choosing to supply content via file transfer do so in some format of their choosing, subject to agreement by the CLOCKSS archive that the format is adequately specified and contains adequate metadata. The transferred files containing content and metadata form the SIP for that content from that publisher.

Creating AIPs from SIPs

In both harvest and file transfer cases the CLOCKSS: Ingest Pipeline uses additional, internal publisher-specific information to create AUs (AIPs) from the SIPs. For details, see Definition of AIP. As described in CLOCKSS: Logging and Records, the CLOCKSS Executive Director receives regular reports on the progress of this process, upon which publisher billing is based.

Content Information and Information Properties

OAIS Content Information and Information Properties

In the OAIS model, the SIP is a means to transfer Content Information containing Information Properties to the archive. The definitions of these terms are:

Content Information: A set of information that is the original target of preservation or that includes part or all of that information. It is an Information Object composed of its Content Data Object and its Representation Information.
Information Property: That part of the Content Information as described by the Information Property Description. The detailed expression, or value, of that part of the information content is conveyed by the appropriate parts of the Content Data Object and its Representation Information.

Other relevant definitions are:

Content Data Object: The Data Object, that together with associated Representation Information, comprises the Content Information.
Information Property Description: The description of the Information Property. It is a description of a part of the information content of a Content Information object that is highlighted for a particular purpose.
Representation Information: The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.

The discussion of Information Properties includes:

The Producer may provide, or the Archive may itself define, as part of the Provenance Information, Information Property Descriptions of Information Properties which should be maintained over time, and indeed may provide Information Property Descriptions of Information Properties which do not need to be maintained over time. An Information Property is that part of the Content Information as described by the Information Property Description. An Information Property Description is a description of a part of the information content of a Content Information object that is highlighted for a particular purpose. The detailed expression, or value, of that part of the information content is conveyed by the appropriate parts of the Content Data Object and its Representation Information. For example, consider a simple digital book which when rendered appears as pages with margins, title, chapter headings, paragraphs, and text lines composed of words and punctuation. Information Property Descriptions for Information Properties that must be preserved could be expressed as ‘paragraph identification’ and ‘characters expressing words and punctuation’. The Information Properties would consist of all the book’s paragraph identifications, words, and punctuation as expressed by the Content Data Object and its Representation Information. This means that all formatting other than the recognition of paragraphs and readable text could be altered while still maintaining required preservation. The Archive may express an evaluation of the Authenticity of its holdings, based on community practice and recommendations (including best practices, guidelines, standards, and legal requirements). For example scientific Archives may have less stringent evaluation criteria than State Archives; however, the Consumer may make his/her own judgment of the Authenticity starting with the evidence obtained from PDI.

CLOCKSS Content Information and Information Properties

The OAIS definition of Information Properties is, in effect, properties of the Content Information that are, or are not, the same before and after a transformation of the content as part of a preservation operation. An example would be paragraph identifications, which might or might not be preserved by a format migration of the preserved content.

The CLOCKSS archive does not perform content transformations as part of preservation operations, only as part of dissemination operations. The archive preserves the content it ingests in its original format and as its original bits. This content includes bibliographic and representational metadata. As described in LOCKSS: Extracting Bibliographic Metadata, subsequent to ingestion the bibliographic metadata will be extracted and added to the LOCKSS: Metadata Database for ease of access and reporting. The metadata database is merely a cache of information from the preserved content. The database is not itself preserved; it can be reconstructed from the preserved content, which is unaffected by metadata extraction.
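
The cache relationship can be sketched as follows. The example assumes a toy content format in which each preserved file starts with `key=value` header lines; that format, the URLs, and the function names are hypothetical and stand in for the real extraction process described in LOCKSS: Extracting Bibliographic Metadata.

```python
import hashlib

# Preserved content: URL -> original bytes, never modified after ingest.
# The "key=value" header format is a toy assumption for this sketch.
preserved = {
    "http://example.com/a/12/1/art1": b"title=First article\nvolume=12\n\nBody text.",
    "http://example.com/a/12/1/art2": b"title=Second article\nvolume=12\n\nBody text.",
}


def extract_metadata(data):
    """Read key=value lines from the header without altering the bytes."""
    header = data.split(b"\n\n", 1)[0].decode("utf-8")
    return dict(line.split("=", 1) for line in header.splitlines() if "=" in line)


def build_metadata_db(content):
    """Derive the metadata cache from the preserved content."""
    return {url: extract_metadata(data) for url, data in content.items()}


def content_digest(content):
    """Digest over all preserved bytes, used to show they are unchanged."""
    h = hashlib.sha256()
    for url in sorted(content):
        h.update(url.encode("utf-8"))
        h.update(content[url])
    return h.hexdigest()


digest_before = content_digest(preserved)
metadata_db = build_metadata_db(preserved)
snapshot = dict(metadata_db)

metadata_db.clear()                        # simulate losing the cache
metadata_db = build_metadata_db(preserved)  # rebuild it from the content
digest_after = content_digest(preserved)
```

The rebuilt database equals the snapshot taken before the loss, and the digest of the preserved content is the same before and after extraction.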

If content is ever triggered, it may be transformed, as part of the process of generating a DIP containing it, from its original form to a form that is understandable by applying the Knowledge Base of the eventual Consumers. CLOCKSS: Extracting Triggered Content describes this type of transformation as currently applied to file transfer content. The reasons for transforming content only at dissemination are:

  • The CLOCKSS archive is a dark archive. It is intended that the vast majority of its content will never be triggered. Devoting resources to transforming content that will never be triggered would be wasteful.
  • CLOCKSS: Designated Community specifies that the Knowledge Base of the eventual Consumers of any triggered content includes Web browsers, or their equivalent in some successor technology to the Web. The goal is that the eventual Consumers of any triggered content are as able to understand it as readers of the publisher's Web site were at the time of publication. To ensure this, two conditions must hold:
    • The original Content Information must include everything that would have been provided to a reader of the publisher's Web site at the time of publication, see Definition of AIP.
    • The DIP containing the triggered Content Information must include it in a format that Web browsers, or their equivalent in some successor technology to the Web, can render successfully. If this is not the original format, a format migration will be performed during generation of the DIP, see LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive, but this will not affect the preserved AIP.

Thus the Information Properties that the CLOCKSS archive preserves are the entire set of Information Properties of the Content Information. No operations are performed on the preserved content that would affect those properties. It is not necessary for the CLOCKSS archive to specify individual Information Properties that are, or are not, to be preserved since all Information Properties are preserved. If a transformation is performed as part of generating a DIP, the properties of the Content Information may not be the same before and after the transformation, but this does not mean that the Information Properties have not been preserved. A later, different DIP generation process might result in the properties of the Content Information being the same before and after the transformation.
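
The separation between the preserved AIP and a generated DIP can be sketched as below. This is an illustration under stated assumptions, not the CLOCKSS implementation: the AIP is modelled as a read-only mapping of URL to bytes, and the character-encoding conversion is a hypothetical stand-in for a real format migration.

```python
from types import MappingProxyType


def generate_dip(aip, migrate=None):
    """Return a new DIP; the AIP mapping itself is never written to."""
    if migrate is None:
        return dict(aip)
    return {url: migrate(data) for url, data in aip.items()}


def latin1_to_utf8(data):
    """Hypothetical migration: re-encode Latin-1 text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


URL = "http://example.com/a/12/1/art1"  # hypothetical URL

# Read-only view of the preserved content (original bits).
aip = MappingProxyType({URL: b"caf\xe9"})

# One DIP generation process migrates the format; another does not.
dip_migrated = generate_dip(aip, latin1_to_utf8)
dip_original = generate_dip(aip)
```

Here the migrated DIP differs from the preserved bytes, while a later generation process without migration yields a DIP identical to them; in both cases the AIP is unchanged.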

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Chief Scientist

Relevant Documents

  1. OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 2. June 2012 (ISO 14721:2012) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31
  2. LOCKSS: Format Migration
  3. Definition of AIP
  4. Definition of DIP
  5. CLOCKSS: Designated Community
  6. CLOCKSS: Extracting Triggered Content
  7. LOCKSS: Extracting Bibliographic Metadata
  8. LOCKSS: Metadata Database
  9. CLOCKSS: Publisher Agreement