CLOCKSS: Ingest Pipeline

From CLOCKSS Trusted Digital Repository Documents
Revision as of 05:50, 16 December 2013 by Dshr (Talk | contribs)

The CLOCKSS Archive ingests two different types of content:

  • Harvest Content - the contents of a publisher's Web site, obtained by crawling it.
  • File Transfer content - the form of the content the publisher uses to create their Web site, obtained from the publisher typically via FTP.

Harvest Content Pipeline

Harvest Publisher Engagement

The process a new publisher signing up with CLOCKSS undergoes depends on whether the publisher uses a publishing platform already supported by the LOCKSS software:

  • If so, the discussion need cover only the process for turning on access for the CLOCKSS Archive's ingest machines.
  • Otherwise, a plugin writer from the LOCKSS team is designated to analyze the publisher's site and:
    • develop requirements for the publisher plugin as described in LOCKSS: Software Development Process.
    • work with the publisher to add CLOCKSS permission pages and make any other necessary changes to their site.

In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.

The CLOCKSS Plugin Lead is responsible for this process.

Harvest Plugin Development and Testing

The designated LOCKSS team member:

Once the plugin passes its tests and the CLOCKSS ingest machines have access to the publisher's site with CLOCKSS permissions, the plugin writer works with the LOCKSS content team to process sample batches of publisher content for quality assurance.

The CLOCKSS Plugin Lead is responsible for this process.

Harvest Content Processing

Once a plugin has been developed and tested with unit tests and sample content, it is released to a set of content-testing boxes, which are identical to production CLOCKSS boxes except generally smaller. Then:

  • Under the direction of an internally developed testing framework (AUTest), two daemons with the plugin are directed to collect two copies each of a substantial amount of real content, one copy collected from Stanford's IP address range, the other from Rice or Indiana.
  • AUTest then directs each of the two daemons to compute the message digest of the filtered contents of each collected AU.
  • AUTest compares the results to ensure correct collection and correct operation of hash filters.
  • Metadata is extracted and checked for sanity.
  • A sample of the content is browsed by a human tester to ensure that:
    • all the types of files that should be collected are,
    • and that files that shouldn't be collected are properly excluded.
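The digest comparison in the steps above can be sketched as follows. This is an illustration only, not AUTest itself: the function names, the use of SHA-256, and the example hash filter (which strips HTML comments) are assumptions made for the sketch. Each copy of an AU is modelled as a mapping from URL to content bytes.

```python
import hashlib
import re
from typing import Callable, Dict, List


def strip_volatile(content: bytes) -> bytes:
    """Hypothetical hash filter: drop HTML comments, which may carry timestamps."""
    return re.sub(rb"<!--.*?-->", b"", content, flags=re.DOTALL)


def filtered_digests(au: Dict[str, bytes],
                     hash_filter: Callable[[bytes], bytes]) -> Dict[str, str]:
    """Digest of the filtered content of every URL in one copy of an AU."""
    return {url: hashlib.sha256(hash_filter(body)).hexdigest()
            for url, body in au.items()}


def compare_copies(copy_a: Dict[str, bytes],
                   copy_b: Dict[str, bytes],
                   hash_filter: Callable[[bytes], bytes] = strip_volatile) -> List[str]:
    """Return the URLs that are missing from one copy or whose filtered digests differ."""
    da = filtered_digests(copy_a, hash_filter)
    db = filtered_digests(copy_b, hash_filter)
    return sorted(url for url in da.keys() | db.keys() if da.get(url) != db.get(url))
```

An empty result suggests that both collection and the hash filter behaved consistently across the two copies; any URL returned is a candidate for a collection error or a filter that fails to remove per-request variation.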

Any problems detected are addressed by modifying and testing the plugin in a development environment or, if necessary, by contacting the publisher to resolve systemic site errors. The tests are then repeated on the content-testing boxes.

The process above is repeated on a substantial sample of new content as it becomes available (in following years/volumes, or new publications from the same publisher), in order to detect changes to the publisher's site, or the format of new titles, which require changes to the plugin.

Once an AU has been successfully tested on the content-testing boxes, it is configured for collection on all ingest boxes. If the plugin is new or changed it is also released to the ingest network. The ingest boxes collect the content and run polls to detect and resolve transient collection errors. When all the copies of an AU have come into full agreement on the ingest boxes, they are then configured on the CLOCKSS production boxes.

Depending on the complexity and diversity of the publisher's content the "substantial sample" can be anything from quite small to the publisher's entire content. Note that the need for a "substantial sample" of content for testing means that there is a delay between the time the publisher starts adding content to the bibliographic unit represented by the AU, and the time the ingest network starts collecting it.

The CLOCKSS Content Lead is responsible for this process.

Harvest Process

Once AUs of content are configured on the CLOCKSS production boxes, they schedule collection of the content from the ingest boxes, as described in Definition of AIP.

A harvest content AU is considered to be preserved when it has been ingested by all the production CLOCKSS boxes and at least one poll on it has been successful (see LOCKSS: Polling and Repair Protocol). The AU can then be removed from the ingest boxes. A review of the process for doing so deemed it inadequate and too manual; a replacement process is under development that will integrate with the improvements being made to external report generation.
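The preservation criterion stated above can be expressed as a simple predicate. The sketch below is illustrative: the `AuStatus` record and its field names are hypothetical and do not correspond to any structure in the LOCKSS software.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class AuStatus:
    """Hypothetical per-AU record used only for this illustration."""
    boxes_ingested: Set[str]   # production boxes that have ingested the AU
    successful_polls: int      # number of successful polls on the AU


def is_preserved(status: AuStatus, production_boxes: Iterable[str]) -> bool:
    """True when every production box has ingested the AU and at least one poll succeeded."""
    return (set(production_boxes) <= status.boxes_ingested
            and status.successful_polls >= 1)
```

Only when this condition holds would an AU become a candidate for removal from the ingest boxes.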

The CLOCKSS Content Lead is responsible for both the current and replacement processes.

Completeness of Harvest Content

Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content on the publisher's web site. The CLOCKSS ingest pipeline includes multiple ingest CLOCKSS boxes that individually harvest content from the publisher's website. Once the content is collected, the ingest CLOCKSS boxes poll among themselves and repair content objects from the publisher if there is disagreement. The content testing process requires complete agreement among the ingest boxes before the content is released to the production CLOCKSS network. As a result, the content described by the CLOCKSS plugin and parameters for the Submission Information Package is preserved in the CLOCKSS PLN.

To account for the possibility of errors in the rules and procedures of the CLOCKSS plugin, a second layer of testing is done by visually inspecting content submitted by a publisher to ensure that it functions as it did on the publisher's website, and that all content available from the publisher's website is also preserved in the CLOCKSS ingest boxes before being released to the CLOCKSS production network.

The CLOCKSS Content Lead is responsible for this process.

Correctness of Harvest Content

Correctness in this context means that the publisher's content is Independently Understandable by the CLOCKSS: Designated Community. For the CLOCKSS archive, this is the responsibility of the publisher, not of the archive. Harvest content is used by customers of the publisher on a regular basis, so content that is not Independently Understandable on the publisher's website is rare, and is normally detected by the author. Under the CLOCKSS: Publisher Agreement, the publisher may submit content in any form, including proprietary formats, so full validation of content and content types is not always practical.

Annual Harvest Content Cycle

The processing cycle for the harvest content a publisher is publishing during a given year can be described as follows.

  • The first phase begins typically in the second quarter of the year.
    • A sample of the publisher's AUs for the year is configured and processed on the content-testing machines to ensure the plugin works as expected, and corrective action is taken if necessary.
    • When the plugin is deemed satisfactory, all the publisher's AUs for the year are configured on the ingest machines and allowed to crawl and poll.
  • The second phase continues throughout the rest of the year.
    • Poll results for AUs on the ingest machines are monitored, and corrective action is taken if necessary, including enhancing the plugin.
    • More content samples may be configured and processed on the content-testing machines to ensure the plugin continues to work as expected throughout the year.
  • The third phase begins at the beginning of the following year.
    • As the AUs for the year that is ending stop growing, final poll results are monitored until the AUs are deemed fully processed, at which point they are configured on the production machines.
    • Crawl and poll results of the AUs on the production machines are then monitored until the AUs are deemed fully processed.

As regards subsequent errata and corrections, we assume that publishers follow the NLM best practice guidelines and at least refer and link to the errata and corrections in a subsequent issue. If the publisher does so, they are collected with that issue.

The CLOCKSS Content Lead is responsible for this process.

e-Book Processing

Unlike serial content, which is published incrementally over a period of time, harvest content that takes the form of books can be processed over a regular cycle. During each cycle:

  • A sample of the publisher's books that were published during the previous interval is configured on the content-testing machines to verify that the plugin works adequately, and corrective action is taken if necessary.
  • When the plugin is deemed ready, all the publisher's books from the previous interval are configured on the ingest machines and allowed to crawl and poll. Results are monitored and corrective action is taken if necessary.
  • When a book is deemed fully processed on the ingest machines, it is configured on the production machines. The crawl and poll results are then monitored on the production machines until the book is deemed fully processed.

The CLOCKSS Content Lead is responsible for this process.

File Transfer Content Pipeline

File Transfer Publisher Engagement

The designated LOCKSS team member's discussion with a new publisher whose content is to be received via file transfer needs to determine two things:

  • The means by which the file transfer content is transferred to the CLOCKSS ingest machines. This is up to the publisher; techniques that have been implemented include:
    • FTP from an FTP server that the publisher maintains to a CLOCKSS ingest machine.
    • FTP by the publisher to an FTP server on a CLOCKSS ingest machine.
    • rsync between a publisher machine and a CLOCKSS ingest machine.
  • The format of the file transfer content, in particular:
    • How the ingest scripts can verify that the content received is correct, for example by checking manifests and checksums.
    • How the content can be rendered if it is ever triggered. A publisher-specific version of the abstract plan described in CLOCKSS: Extracting Triggered Content should be drawn up.

In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.

The CLOCKSS Plugin Lead is responsible for this process.

Ingest script development

Once the information above is available, the designated team member writes shell scripts, executed from cron, that keep the collected content up to date and verified:

  • If the content is obtained by FTP from the publisher, the script collects any as-yet-uncollected content.
  • If the content is obtained by rsync, the script runs rsync against the publisher's machine.
  • If the format in which the publisher makes file transfer content available lacks checksums and manifests, the ingest script generates them as the content is collected.
  • Otherwise, the ingest script verifies the manifest and checksums in the content, alerting the content team if any discrepancies are found.
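The manifest handling in the last two cases can be sketched as follows. The actual ingest scripts are shell scripts run from cron; this Python version is only an illustration, and the manifest layout (relative path mapped to a SHA-256 digest) and the function names are assumptions rather than the format any publisher uses.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def generate_manifest(root: Path) -> Dict[str, str]:
    """Build a manifest for content that arrived without one."""
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return one message per file that is missing or whose checksum differs."""
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(path) != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

A non-empty list from `verify_manifest` corresponds to the discrepancies about which the content team would be alerted. The same check could be reused by the periodic verification script.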

The designated team member also writes a verification script that is run against all content on the file transfer ingest machine at intervals.

The CLOCKSS Plugin Lead is responsible for this process.

File Transfer Plugin Development and Testing

File transfer content also requires a plugin. These plugins are developed and tested in the same way as harvest plugins (see above), except that the emphasis is on metadata extraction because it is in some cases more complex, whereas crawling is trivial and polling requires no filters.

The CLOCKSS Plugin Lead is responsible for this process.

File Transfer Content Pre-Processing

File transfer content undergoes a subset of the testing steps described above for harvest content. The structure of file transfer AUs is simple and consistent (typically a directory hierarchy) so testing the crawl rules on a large sample of content is unnecessary. All the CLOCKSS boxes collect the same copy of the content so few of the complexities of preserving harvest content (such as hash filters) are present.

A slightly simplified workflow in AUTest directs content-testing boxes to collect a single copy of a smaller sample of AUs from the staging server, and to extract metadata and check it for sanity. After any problems are corrected, the AU and similar AUs are configured on the CLOCKSS production boxes.

File Transfer Content Ingest

The collection scripts are added to the ingest user's crontab, and run daily to collect any new content. Every 24 hours the entire content of the file transfer ingest machine is synchronized with (a) an on-site backup and (b) an off-site backup using rsync. The verification scripts are run against each of these backup copies at intervals.

Once the ingested content is verified it is staged to a Web server that can be accessed only by the CLOCKSS boxes. The CLOCKSS boxes crawl the file transfer content staged on the Web server under control of a file transfer plugin and preserve it, as described in Definition of AIP.

File transfer content is considered to be preserved when it has been ingested by all production CLOCKSS boxes and at least one poll on it has been successful (see LOCKSS: Polling and Repair Protocol).

The CLOCKSS Content Lead is responsible for this process.

Completeness of File Transfer Content

Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content delivered by the publisher. Content submitted via file transfer is either made available from a content server operated by the publisher, or delivered by the publisher to a content server hosted by CLOCKSS. In either case the content is transferred to a staging server operated by CLOCKSS. The ingest scripts include procedures that verify the files submitted by the publisher correspond to those on the content server. Where publishers provide checksum information, checksums are also compared with locally computed checksums to ensure that the content is the same. If the checksums differ, the local copy is deleted and re-collected.
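The delete-and-re-collect behaviour described above can be sketched as a small retry loop. This is an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for the transfer step, SHA-256 stands in for whatever checksum the publisher supplies, and the attempt limit is an assumption, since the source does not say how many times collection is retried.

```python
import hashlib
from typing import Callable, Optional


def collect_verified(fetch: Callable[[], bytes],
                     expected_sha256: str,
                     max_attempts: int = 3) -> Optional[bytes]:
    """Collect content until its checksum matches the publisher's, or give up.

    Returns the verified bytes, or None if no attempt produced a matching checksum.
    """
    for _ in range(max_attempts):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # Mismatch: the local copy is discarded and the content is collected again.
    return None
```

Returning `None` rather than the last mismatching copy keeps unverified content out of the staging area.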

Correctness of File Transfer Content

Correctness in this context means that, if the content is ever triggered, the result will be Independently Understandable by the CLOCKSS: Designated Community. During initial engagement with file transfer publishers, their content types are assessed to develop a plan for rendering the content if it is ever triggered. One goal of the ingest scripts is to verify that the content types submitted match this assessment; if they do not, the plan for triggering the content must be revised to account for the new content types.
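One way such a content-type check might look is sketched below. It assumes, purely for illustration, that file extensions are an adequate proxy for content type and that the initial assessment is recorded as a set of expected extensions; the function name is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set


def unexpected_types(paths: Iterable[str], assessed_extensions: Set[str]) -> Set[str]:
    """Return the file extensions seen in a delivery that the assessment did not anticipate.

    Extensions are compared case-insensitively. A file with no extension
    contributes the empty string.
    """
    seen = {PurePosixPath(p).suffix.lower() for p in paths}
    return seen - {e.lower() for e in assessed_extensions}
```

A non-empty result would signal that the rendering plan for this publisher needs to be revisited.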

Feedback

The CLOCKSS Executive Director receives regular reports of the article counts ingested, upon which publishers are billed (see CLOCKSS: Logging and Records).

Content AUs can be in one of three preservation states, as reported to the publisher, the CLOCKSS Board and the Keepers Registry (see CLOCKSS: Logging and Records):

  1. Committed for preservation
  2. In process
  3. Preserved

Progress is tracked through the AU's configuration file.

The CLOCKSS Content Lead is responsible for this process.

Change Process

Changes to this document require:

  • Review by:
    • LOCKSS Technical Staff
    • CLOCKSS Plugin Lead
    • CLOCKSS Content Lead
  • Approval by CLOCKSS Technical Lead

Relevant Documents

  1. CLOCKSS: Designated Community
  2. CLOCKSS: Logging and Records
  3. LOCKSS: Software Development Process
  4. LOCKSS: Polling and Repair Protocol
  5. Definition of SIP
  6. Definition of AIP
  7. CLOCKSS: Publisher Agreement