CLOCKSS: Extracting Triggered Content


The CLOCKSS board may decide to trigger content from the CLOCKSS archive when it is no longer available from any publisher. Reasons for doing so include:

  • A publisher discontinuing an entire title.
  • A publisher who has acquired a title deciding not to re-host a run of back issues.
  • Disaster.

The contract under which publishers deposit content in the CLOCKSS archive states that triggered content will be re-published under a Creative Commons license. At present, EDINA and Stanford have volunteered to re-publish triggered content, but it is important to note that the Creative Commons license means that anyone can re-publish such content, and that re-publishing is not a core function of the CLOCKSS archive.

Once the CLOCKSS board notifies the LOCKSS Executive Director that a trigger event has occurred, the CLOCKSS Content Lead and assigned LOCKSS staff run a process that delivers the triggered content to the re-publishing sites and updates a section of the CLOCKSS website with links to the re-published content at both sites. In OAIS terminology, the package delivered to the re-publishing sites is a Dissemination Information Package (DIP).

Overview of the Trigger Process

The trigger process involves:

  1. identifying the triggered content in the CLOCKSS preservation network,
  2. extracting the triggered content from the CLOCKSS preservation network,
  3. preparing the content for publication on the triggered content machines,
  4. re-hosting triggered content on the CLOCKSS triggered content site,
  5. re-registering triggered article DOIs.

The following sections describe the process as it is currently implemented. It is possible that content triggered in the future might contain files whose format has become obsolete. The additional processes that would be needed in this situation are described in LOCKSS: Format Migration.

Identifying the Triggered Content

The triggering process begins by identifying the triggered content within the CLOCKSS preservation network. First, the Archival Units (AUs) that contain the title are identified by querying the database of bibliographic metadata that is compiled and maintained by each CLOCKSS box. This database contains a record of articles published for each preserved title, indexed by year, volume, and issue.

A query to this database on a production CLOCKSS box yields a list of AU identifiers (AUIDs) that identify the associated AUs (AIPs in OAIS terminology) being preserved in the CLOCKSS network. A similar query to an ingest machine identifies any triggered AUs that have yet to emerge from the ingest pipeline (see CLOCKSS: Ingest Pipeline). The AUIDs are the same on every CLOCKSS box and, as described in Definition of AIP, uniquely identify the location on each box of that AU (AIP). For content that was harvested from a publisher's website, a single AU will normally contain all the files for a given volume or year. For content obtained via file transfer from publishers, the content is organized into AUs containing the titles, volumes, and issues from the same publisher that were ingested in a given calendar year.
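
As a concrete illustration, here is a minimal sketch of such a query, assuming a hypothetical relational schema for the metadata database; the table and column names are illustrative only and do not reflect the actual LOCKSS schema.

import sqlite3  # stand-in only; the real metadata database is not SQLite

def auids_for_title(db_path, title):
    """Return the AUIDs of all AUs that contain articles of the given title."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT au_id FROM article_metadata WHERE publication_title = ?",
            (title,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]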

Extracting the Triggered Content

Once the AUs have been identified, the triggered content must be extracted from the CLOCKSS preservation network. If the content has already been released to the production CLOCKSS boxes, it is extracted from the Stanford production CLOCKSS box (clockss-stanford.clockss.org), because that box is the most readily accessible to the technical staff. Content that is still in the ingest pipeline is flushed through to the production boxes (see CLOCKSS: Ingest Pipeline) and triggered from there.

For harvested content, the extraction process involves locating the directories in the CLOCKSS box repository that correspond to each AU of the triggered content. Each of the hierarchies under these directories is included in the DIP. If all content is on the Stanford CLOCKSS box, it can be exported directly into zip or tar files using the LOCKSS daemon's export functionality.

For file transfer content, the extraction process involves locating the directories in the CLOCKSS box that correspond to the AUs that contain the triggered content. Within these directories are either sub-directories or archive files that contain the content. These directories or archive files are copied from the repository to a server where the triggered content will be prepared for publication.
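
As an illustration, here is a minimal sketch of that copy step, assuming the repository directories for the triggered AUs have already been located; the function and path names are illustrative only, not the actual extraction tooling.

import shutil
from pathlib import Path

def copy_aus_to_staging(au_dirs, staging_root):
    """Copy each AU directory tree from the box repository to a staging area
    where the triggered content will be prepared for publication."""
    staging = Path(staging_root)
    staging.mkdir(parents=True, exist_ok=True)
    for au_dir in map(Path, au_dirs):
        shutil.copytree(au_dir, staging / au_dir.name)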

Preparing File Transfer Content for Dissemination

File transfer content is typically a collection of files that includes PDFs of articles, XML-formatted full-text files, supporting files such as images and multi-media, and metadata files containing bibliographic information. These files are the input to the publishing platform that displays the publisher's content to readers. The exact contents of the collection vary between publishers. Typically, further processing is required to extract the content from the directories or archive files copied from the CLOCKSS box repository and to generate a web site from them, involving the following steps:

  • Unpacking any archive files that contain the content into their own directories and isolating the files that correspond to the triggered content. Publishers tend to group the files for individual issues into their own directories, so isolating the files involves retaining only those directories that correspond to the content being triggered and discarding directories for other content.
  • Adding any full-text PDF files unmodified to the web site.
  • Adding other files, such as images and multi-media, to the web site.
  • Rendering any XML files into readable full-text HTML pages.
  • Generating HTML abstract pages linking to the full-text PDF and HTML pages.
  • Inserting article-level bibliographic metadata into the abstract pages' <meta> tags.
  • Creating article-level metadata files in standard forms such as RIS and BibTeX, and linking to them from the abstract pages.
  • Generating an HTML issue table of contents page with links to the article abstract pages.
  • Generating a volume table of contents page with links to the issue table of contents pages.
  • Generating a journal table of contents page with links to the volume table of contents pages.

This process takes place on a temporary preparation server. Here is a sample generated issue table of contents page:

Generated issue table of contents

Here is a sample generated article abstract page:

Generated article abstract

Here is an example of Dublin Core article-level metadata included in the generated abstract page shown above:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="dc.Format" content="text/HTML">
<meta name="dc.Publisher" content="Taylor & Francis">
<meta name="dc.Title" content="Implementation and Assessment of an Introductory Pharmacy Practice Course Sequence ">
<meta name="dc.Identifier" scheme="coden" content="Journal of Pharmacy Teaching, Vol. 14, No. 1, 2007: pp. 5–17">
<meta name="dc.Identifier" scheme="doi" content="10.1300/J060v14n01_02">
<meta name="dc.Date" content="31Oct2007">
<meta name="dc.Creator" content=" Assistant Professor and coordinator Dr. Emily W. Evans Pharm.D. and AE-C and CDM">
<meta name="keywords" content=", Education, pharmacy practice, laboratory education">

The result of this process is a directory hierarchy capable of being exported by a web server such as Apache. Initially, this was the form in which file transfer content was disseminated to the re-publishing sites, which used Apache directly. Now, the content is exported by Apache running on the preparation server, and ingested in the normal way by a LOCKSS daemon, after which the generated directory hierarchy can be deleted. This results in a set of AUs analogous to those which would have been obtained had the LOCKSS daemon ingested the content from the original publisher, albeit with a different look and feel. As time permits, early triggered content will be re-disseminated using this technique, as it provides better compatibility with link resolvers and other library systems.

Assembling the DIP

The AUs to be re-published can be collected from the selected CLOCKSS box and ingest machines (for harvested content) or from the preparation server (for file transfer content) and converted to a compressed archive using zip or tar. This forms a DIP that can be transferred to the re-publishing sites using rsync, sftp or scp.
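
For example, a minimal sketch of the packaging step might look like the following; the directory and file names are illustrative, and the resulting archive would then be transferred with rsync, sftp or scp as described above.

import tarfile
from pathlib import Path

def build_dip(au_dirs, dip_path):
    """Bundle the collected AU directories into one gzip-compressed tar archive."""
    with tarfile.open(dip_path, "w:gz") as dip:
        for au_dir in map(Path, au_dirs):
            dip.add(au_dir, arcname=au_dir.name)

# e.g. build_dip(["/staging/journal_vol14"], "journal-dip.tar.gz")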

Re-publishing the Triggered Content

Currently, two sites re-publish triggered content: Stanford University and EDINA at the University of Edinburgh. Both use a combination of an Apache web server and a LOCKSS daemon to do so, although, as described in Definition of AIP, the structure of AUs allows other tools easy access to the content and metadata. Other ways to re-publish the content delivered as this type of DIP are easy to envisage, such as a shell script to convert it to an Apache web site.

The current re-publishing sites handle newly triggered content delivered as this type of DIP as follows. On each of the re-publishing machines:

  • Configure the appropriate plugin and Title Database (TDB) entries by updating the triggered-content configuration and plugin repositories on props.lockss.org (see LOCKSS: Property Server Operations).
  • Force the LOCKSS daemon to reload its configuration and plugins from the repository.
  • Unpack the AUs into the repository hierarchy of the LOCKSS daemon.

The final step is to add an entry for the newly triggered title to the index of triggered titles in the “Triggered Content” section of the CLOCKSS website.

Triggered titles index

The entry points to a new landing page for the title that provides information about the title, its publication history, and the triggering process. It also includes links to the issues hosted at EDINA and Stanford.

Triggered title landing page

Because the triggered content carries Creative Commons licenses, other institutions can also re-publish it. For example, here is Annals of Clinical Teaching at the Internet Archive.

Re-registering Triggered Article DOIs

Most publishers register individual article Digital Object Identifiers (DOIs) with a registration agency of the International DOI Foundation, such as CrossRef. Once a title is no longer available from the publisher, the registration records for its articles should be updated to refer to the content hosted at EDINA and Stanford. This is done by preparing a tab-separated file with the DOI of each article and the corresponding new URL. Separate files are required for the articles hosted at EDINA and for those hosted at Stanford. The registrar uses the data in these files to update the records for the corresponding DOIs.

For content served by the two re-publishing servers, this task is simple because their LOCKSS daemons provide an OpenURL resolver, allowing access to articles via their DOIs. The DOI information is available from the article metadata. Here is a portion of such a file, for a title hosted at EDINA, that can be sent to the DOI registrar.

10.1300/J060v09n02_04	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_04 
10.1300/J060v09n02_05	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_05 
10.1300/J060v09n02_07	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_07 
10.1300/J060v09n02_08	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_08 
10.1300/J060v09n01_02	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_02 
10.1300/J060v09n01_03	http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_03 
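
A minimal sketch of generating such a file is shown below, assuming the list of article DOIs has been extracted from the article metadata; the host is the name of one of the re-publishing sites, and the URL pattern follows the ServeContent OpenURL form shown above.

def write_doi_update_file(dois, host, out_path):
    """Write one tab-separated line per article: the DOI and its new
    OpenURL-resolver URL at the re-publishing site."""
    with open(out_path, "w") as out:
        for doi in dois:
            url = "http://%s/ServeContent?rft_id=info:doi/%s" % (host, doi)
            out.write("%s\t%s\n" % (doi, url))

# e.g. write_doi_update_file(["10.1300/J060v09n02_04"],
#                            "triggered.edina.clockss.org", "edina-dois.tsv")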

Recording the Dissemination

A DIP is generated only at the direction of the CLOCKSS board; that decision is recorded in the board's minutes.

Configuring the Re-publishing Systems

The current re-publishing sites each comprise a LOCKSS daemon and an Apache web server. Both currently reside on the same machine, but they could also be hosted on separate machines. The Apache web server runs on the standard HTTP port 80, while the LOCKSS daemon serves content on port 8082. External access to the LOCKSS daemon is limited to administrative uses and to the Apache server running on the same system.

The Apache server runs a proxy module that forwards requests matching a certain pattern to the CLOCKSS daemon's content server. A triggered content request is first handled by the node's Apache web server, which matches it against its ProxyPassMatch patterns. If the request matches a pattern, it is passed to the LOCKSS daemon's ServeContent servlet.


If the requested content is in the LOCKSS daemon's AUs, the page is returned to the Apache web server and then to the user. Otherwise, the daemon returns an HTTP error response, and the Apache web server returns a page indicating that the content was not preserved. This configuration serves two purposes. The first is to provide additional processing capabilities beyond those available from the LOCKSS daemon. The second is to allow external access to the re-published content through a standard port, an important consideration for many IT firewall configurations.
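
To illustrate the flow, here is a minimal client-side sketch that requests an article through the front-end Apache server and reports the status code; the host and DOI arguments are illustrative.

import urllib.error
import urllib.request

def check_triggered(host, doi):
    """Request an article via the re-publishing site's Apache front end.
    200 means the LOCKSS daemon served it; 404 means Apache returned the
    not-preserved page."""
    url = "http://%s/ServeContent?rft_id=info:doi/%s" % (host, doi)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        return err.code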

For file transfer content that was triggered using the earlier technique and has yet to be updated, the Apache server acts as a normal web server that responds to requests for the prepared content.


Virtual machine requirements

Re-publishing hosts have modest requirements:

  • Processor: single-core CPU at 2GHz or better
  • Memory: 1GB
  • Storage: 40GB
  • Network: 10/100 Mbit/s

Setting up the CLOCKSS daemon

Re-publishing hostconfig:

The LOCKSS daemons at the re-publishing sites are part of the clockss-triggered preservation group. When running hostconfig on a new CLOCKSS Triggered Content Node, care must be taken to configure it as a member of that group:

Props URL: http://props.lockss.org:8001/clockss-triggered/lockss.xml
Preservation Group: clockss-triggered

Re-publishing ServeContent servlet:

The LOCKSS daemon needs to have its ServeContent servlet enabled on port 8082. This is done by logging into the LOCKSS daemon's administrative UI, clicking on "Content Access Options" then "Content Server Options" and then checking "Enable content server on port 8082".

Setting up the Apache server

The following configuration is used for the Apache server at each re-publishing site:

NameVirtualHost triggered.SITE.clockss.org:80
<VirtualHost triggered.SITE.clockss.org:80>
    ServerAdmin support@support.clockss.org
    DocumentRoot /var/www/html
    ServerName triggered.SITE.clockss.org
    <IfModule mod_proxy.c>
        ProxyRequests Off
        ProxyVia On
        <Proxy triggered.SITE.clockss.org/*>
                AddDefaultCharset off
                Order deny,allow
                Allow from all
        </Proxy>
        ProxyPassMatch ^/((ServeContent|images).*)$ http://localhost:8082/$1
        ProxyErrorOverride On
        ErrorDocument 404 /not-preserved.html
    </IfModule>
    ErrorLog logs/error_log
    CustomLog logs/access_log common
</VirtualHost>

Managing Re-publishing Sites

The re-publishing site should follow the security, maintenance and upgrade guidelines for CLOCKSS boxes.

Balancing load across CLOCKSS triggered content nodes

A load-balancing Apache server is also configured to provide a single point of access, at triggered.clockss.org, to the re-publishing servers at EDINA and Stanford. This additional Apache instance serves two purposes: the first is to provide a single URL for services that can accept only one, such as link resolvers; the second is to ensure high availability of the triggered content.


Here is the configuration file for this load balancing Apache server.

<VirtualHost *>
        ServerName triggered.clockss.org
        ServerAdmin support@support.clockss.org
        DocumentRoot /var/www/clockss-triggered
        <IfModule mod_proxy.c>
                ProxyRequests off
                ProxyVia On
                ProxyPassMatch  ^/((ServeContent|images).*)$ balancer://triggered-pool/$1
                <Proxy balancer://triggered-pool>
                        BalancerMember http://triggered.edina.clockss.org/
                        BalancerMember http://triggered.stanford.clockss.org/
                        ProxySet lbmethod=byrequests
                </Proxy>
        </IfModule>
        CustomLog /var/log/apache2/access.log combined
        ErrorLog /var/log/apache2/error.log
        ServerSignature On
</VirtualHost> 

Change Process

Changes to this document require:

  • Review by:
    • CLOCKSS Content Lead
    • CLOCKSS Network Administrator
  • Approval by LOCKSS Technical Lead

Relevant Documents

  1. Definition of AIP
  2. Definition of DIP
  3. CLOCKSS Triggered Content
  4. LOCKSS: Format Migration
  5. LOCKSS: Metadata Database
  6. CLOCKSS: Ingest Pipeline
  7. LOCKSS: Property Server Operations
  8. CLOCKSS: Box Operations