CLOCKSS: Ingest Pipeline

From CLOCKSS Trusted Digital Repository Documents

The CLOCKSS Archive ingests two different types of content, both under the control of an appropriate plugin in the LOCKSS daemon:

  • Harvest Content - the contents of a publisher's Web site, obtained by crawling it.
  • File Transfer Content - the form of the content the publisher uses to create their Web site, obtained from the publisher, typically via FTP.

Harvest Content Pipeline

Harvest Publisher Engagement

The process a new publisher undergoes when signing up with CLOCKSS depends on whether the publisher uses a publishing platform already supported by the LOCKSS software:

  • If so, the discussion needs to cover only the process for turning on access for the CLOCKSS Archive's ingest machines.
  • Otherwise, a plugin writer from the LOCKSS team is designated to analyze the publisher's site and:
    • develop requirements for the publisher plugin as described in LOCKSS: Software Development Process.
    • work with the publisher to add CLOCKSS permission pages and make any other necessary changes to their site.

In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.

The CLOCKSS Harvest Plugin Lead is responsible for this process.

Harvest Plugin Development and Testing

The designated LOCKSS team member develops the plugin and its unit tests as described in LOCKSS: Software Development Process.

Once the plugin passes its tests and the CLOCKSS ingest machines have access to the publisher's site with CLOCKSS permissions, the plugin writer works with the LOCKSS content team to process sample batches of publisher content for quality assurance.

The CLOCKSS Harvest Plugin Lead is responsible for this process.

Harvest Content Processing

Once a plugin has been developed and tested with unit tests and sample content, it is released to a set of content-testing boxes, which are identical to production CLOCKSS boxes except that they are generally smaller. Then:

  • Under the direction of an internally developed testing framework (AUTest), two daemons with the plugin are directed to collect two copies each of a substantial amount of real content, one copy collected from Stanford's IP address range, the other from Rice or Indiana.
  • AUTest then directs each of the two daemons to compute the message digest of the filtered contents of each collected AU.
  • AUTest compares the results to ensure correct collection and correct operation of hash filters.
  • Metadata is extracted and checked for sanity.
  • A sample of the content is browsed by a human tester to verify the correctness of the plugin by ensuring that:
    • all the types of files that should be collected are collected, and
    • files that shouldn't be collected, such as advertisements or articles from previous years, are properly excluded.
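The digest comparison at the heart of this check can be sketched as follows. This is an illustration only, not the actual AUTest code: the `filtered_digest` helper, the ad-stripping filter, and the AU layout are all hypothetical, but the logic shows why hash filters let two crawls of the same AU agree even when ephemeral content such as advertisements differs between them.

```python
import hashlib

def filtered_digest(files, content_filter):
    """Hash the filtered contents of an AU's files in a canonical order."""
    h = hashlib.sha1()
    for name in sorted(files):
        h.update(content_filter(files[name]))
    return h.hexdigest()

# Hypothetical filter: strip an ad banner that varies between crawls.
def strip_ads(content):
    return content.replace(b"<!--AD-->", b"")

# Two copies of the same AU, crawled from different IP address ranges;
# only the ad banner differs, and the filter removes it before hashing.
copy_a = {"article1.html": b"<p>Text</p><!--AD-->", "toc.html": b"<ul/>"}
copy_b = {"article1.html": b"<p>Text</p>", "toc.html": b"<ul/>"}

assert filtered_digest(copy_a, strip_ads) == filtered_digest(copy_b, strip_ads)
```

If the filters were wrong or the collection differed substantively, the digests would disagree, which is exactly the failure AUTest is designed to surface.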

Any problems detected are addressed by modifying and testing the plugin in a development environment, or by contacting the publisher if necessary to resolve systemic site errors, then the tests are repeated on the content-testing boxes.

The process above is repeated on a substantial sample of new content as it becomes available (in following years/volumes, or new publications from the same publisher), in order to detect changes to the publisher's site, or the format of new titles, which require changes to the plugin.

Once an AU has been successfully tested on the content-testing boxes, it is configured for collection on all boxes in the ingest network. If the plugin is new or changed it is also released to the ingest network. Each ingest box collects the content and the network runs polls to detect and resolve transient collection errors. When all the copies of an AU have come into full agreement on the ingest boxes, they are then configured on the network of CLOCKSS production boxes.

Depending on the complexity and diversity of the publisher's content the "substantial sample" can be anything from quite small to the publisher's entire content. Note that the need for a "substantial sample" of content for testing means that there is a delay between the time the publisher starts adding content to the bibliographic unit represented by the AU, and the time the ingest network starts collecting it.

The CLOCKSS Harvest Content Lead is responsible for this process.

Harvest Process

Once AUs of content are configured on the CLOCKSS production boxes, they schedule collection of the content from the ingest boxes, as described in Definition of AIP.

A harvest content AU is considered to be preserved when it has been ingested by all the production CLOCKSS boxes and at least one poll on it has been successful (see LOCKSS: Polling and Repair Protocol). The AU can then be removed from the ingest boxes. A review of the process for doing so deemed it inadequate and too manual; a replacement process is under development that will integrate with the improvements being made to external report generation.
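The preservation criterion just stated can be expressed as a simple predicate. This is a hypothetical sketch; the function and parameter names are illustrative, not part of the LOCKSS daemon:

```python
def au_is_preserved(ingested_boxes, production_boxes, successful_polls):
    """An AU is preserved once every production box has ingested it and
    at least one poll on it has succeeded (illustrative names)."""
    return set(production_boxes) <= set(ingested_boxes) and successful_polls >= 1

# All three production boxes hold the AU and one poll has succeeded:
assert au_is_preserved(["b1", "b2", "b3"], ["b1", "b2", "b3"], 1)
# Missing a box: not yet preserved, however many polls have run.
assert not au_is_preserved(["b1", "b2"], ["b1", "b2", "b3"], 5)
```

Only once this predicate holds can the AU safely be removed from the ingest boxes.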

The CLOCKSS Harvest Content Lead is responsible for both the current and replacement processes.

Completeness of Harvest Content

Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content on the publisher's web site. The CLOCKSS ingest pipeline includes multiple ingest CLOCKSS boxes that individually harvest content from the publisher's website. Once collected, the ingest CLOCKSS boxes poll among themselves and repair content objects from the publisher if there is disagreement. The content testing process requires complete agreement among the ingest boxes before the content is released to the production CLOCKSS network. As a result the content described by the CLOCKSS plugin and parameters for the Submission Information Package is preserved in the CLOCKSS PLN.

To account for the possibility of errors in the rules and procedures of the CLOCKSS plugin, a second layer of testing is done by visually inspecting content submitted by a publisher to ensure that it functions as it did on the publisher's website, and that all content available from the publisher's website is also preserved in the CLOCKSS ingest boxes before being released to the CLOCKSS production network.

The CLOCKSS Harvest Content Lead is responsible for this process.

Correctness of Harvest Content

Correctness in this context means that the publisher's content is Independently Understandable by the CLOCKSS: Designated Community. For the CLOCKSS archive, this is the responsibility of the publisher, not of the archive. Harvest content is used by customers of the publisher on a regular basis, so content that is not Independently Understandable on the publisher's website is rare, and is normally detected by the author. Under the CLOCKSS: Publisher Agreement, the publisher may submit content in any form, including proprietary formats, so full validation of content and content types is not always practical. When the LOCKSS team detects cases in which content might not be Independently Understandable they are reported to the publisher.

Some content from both harvest and file transfer publishers is not intended to be displayed by browsers but to be input to other software. For example, the census of file formats requested during the ISO16363 audit showed the following instances of chemical formats:

  • chemical/x-cif 6850 instances
  • chemical/x-cml 6125 instances
  • chemical/x-cdx 5260 instances
  • chemical/cif 646 instances
  • chemical/x-chemdraw 247 instances
  • chemical/x-pdb 64 instances
  • chemical/x-xyz 44 instances
  • chemical/x-fasta 26 instances

The MIME types not intended for the browser are either widely supported formats (such as Microsoft Office formats) or formats specific to some field (such as chemistry: chemical/x-cif, chemical/x-cml, etc.). The evolution of these formats is a problem for the specific field; it is not something to which an archive can provide a generic solution. But see the discussion of emulation in LOCKSS: Format Migration.

Annual Harvest Content Cycle

The processing cycle for the harvest content a publisher is publishing during a given year can be described as follows.

  • The first phase begins typically in the second quarter of the year.
    • A sample of the publisher's AUs for the year are configured and processed on the content-testing machines to ensure the plugin works as expected, and corrective action is taken if necessary.
    • When the plugin is deemed satisfactory, all the publisher's AUs for the year are configured on the ingest machines and allowed to crawl and poll.
  • The second phase continues throughout the rest of the year.
    • Poll results for AUs on the ingest machines are monitored, and corrective action is taken if necessary, including enhancing the plugin.
    • More content samples may be configured and processed on the content-testing machines to ensure the plugin continues to work as expected throughout the year.
  • The third phase begins at the beginning of the following year.
    • After allowing for final updates (a time determined by the publishing frequency and experience with the publisher), a final "deep re-crawl" of each of that year's AUs is scheduled.
    • Then crawling for those AUs is disabled.
    • Then poll results are monitored until the AUs are deemed fully processed, at which point they are configured on the production machines.
    • Crawl and poll results of the AUs on the production machines are then monitored until the AUs are deemed fully processed.
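The calendar structure of the cycle above can be sketched as a date-to-phase mapping. The function below is purely illustrative: the Q2 start and phase names come from the description above, and real scheduling depends on the publisher's frequency and the team's experience with them:

```python
from datetime import date

def cycle_phase(today, volume_year):
    """Illustrative mapping from calendar date to the phase of the
    annual harvest cycle for a given volume year."""
    if today.year > volume_year:
        # Phase 3: begins at the start of the following year.
        return "phase 3: final deep re-crawl and hand-off to production"
    if today.month < 4:
        return "not yet started"
    if today.month < 7:
        # Phase 1: typically the second quarter.
        return "phase 1: sample testing, then configure all AUs"
    # Phase 2: the rest of the year.
    return "phase 2: monitor polls, test further samples"
```

For example, `cycle_phase(date(2014, 5, 1), 2014)` falls in phase 1, while `cycle_phase(date(2015, 2, 1), 2014)` falls in phase 3.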

The CLOCKSS Harvest Content Lead is responsible for this process.

Harvest Content Hosted Elsewhere

As described in Harvest Content Processing, during content testing the list of excluded URLs is audited and the harvest content is inspected visually. Systematic use by the publisher of off-site web servers, for example Figshare or S3, is detected at this stage. Figshare, for example, is already being preserved in CLOCKSS, so content hosted there would be collected. Otherwise, the publisher can be asked to include appropriate permission in the off-site web server so that the content there can be collected, or to switch to file transfer. Non-systematic use, for example an individual author linking to their blog, will not result in preservation of the linked-to content as it cannot be considered part of the publisher's copyright content.

Harvest Content Interactivity

Some harvest journals use JavaScript-based techniques such as AJAX that require a Web crawler to execute (some of) their content rather than simply parsing it to find the links to other URLs that need to be harvested. The LOCKSS team has been drawing attention to and working on this problem since at least 2009. We worked with students at CMU West to prototype collection of content via AJAX. Under our current Mellon grant, we have developed a production AJAX collector and will use it when required.

e-Book Processing

Unlike serial content, which is published incrementally over a period of time, harvest content that takes the form of books can be processed on a regular cycle. During each cycle:

  • A sample of the publisher's books that were published during the previous interval is configured on the content-testing machines to verify that the plugin works adequately, and corrective action is taken if necessary.
  • When the plugin is deemed ready, all the publisher's books from the previous interval are configured on the ingest machines and allowed to crawl and poll. Results are monitored and corrective action is taken if necessary.
  • When a book is deemed fully processed on the ingest machines, it is configured on the production machines. The crawl and poll results are then monitored on the production machines until the book is deemed fully processed.

The CLOCKSS Harvest Content Lead is responsible for this process.

File Transfer Content Pipeline

File Transfer Publisher Engagement

The designated LOCKSS team member's discussion with a new publisher whose content is to be received via file transfer needs to determine two things:

  • The means by which the file transfer content is transferred to the CLOCKSS ingest machines. This is up to the publisher; techniques that have been implemented include:
    • FTP from an FTP server that the publisher maintains to a CLOCKSS ingest machine.
    • FTP by the publisher to an FTP server on a CLOCKSS ingest machine.
    • rsync between a publisher machine and a CLOCKSS ingest machine.
  • The format of the file transfer content, in particular:
    • How the ingest scripts can verify that the content received is correct, for example by checking manifests and checksums.
    • How the content can be rendered if it is ever triggered (the triggering plan). A publisher-specific version of the abstract plan described in CLOCKSS: Extracting Triggered Content should be drawn up.

In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.

The CLOCKSS File Transfer Plugin Lead is responsible for this process.

Ingest script development

Once the information above is available, the designated team member writes shell scripts, executed from cron, that keep the collected content up to date:

  • If the publisher supplies content by FTP, the script collects any as-yet-uncollected content.
  • If rsync is used, the script runs rsync against the publisher's machine.
  • If the format in which the publisher makes file transfer content available lacks checksums and manifests, the ingest script generates them as the content is collected.
  • Otherwise, the ingest script verifies the manifest and checksums in the content, alerting the content team if any discrepancies are found.

The designated team member also writes a verification script that is run against all content on the file transfer ingest machine at intervals.
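The checksum handling these scripts perform might be sketched as follows. This is a hedged illustration in Python rather than the actual shell scripts: `generate_manifest` covers publishers whose content arrives without checksums, and `verify_manifest` is the kind of check the periodic verification script could run against the ingest machine or its backups:

```python
import hashlib
import os

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def generate_manifest(root):
    """Build a {relative path: checksum} manifest for content that
    arrived without one."""
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest

def verify_manifest(root, manifest):
    """Return the relative paths whose checksums no longer match;
    an empty list means the content verifies cleanly."""
    return [rel for rel, digest in manifest.items()
            if sha256_of(os.path.join(root, rel)) != digest]
```

A discrepancy reported by `verify_manifest` would be the point at which the real scripts alert the content team.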

The CLOCKSS File Transfer Plugin Lead is responsible for this process.

File Transfer Plugin Development and Testing

File transfer content also requires a plugin. These plugins are developed and tested in the same way as harvest plugins (see above), except that the emphasis is on metadata extraction, which is in some cases more complex, whereas crawling is trivial and polling requires no filters.

The CLOCKSS File Transfer Plugin Lead is responsible for this process.

File Transfer Content Pre-Processing

File transfer content undergoes a subset of the testing steps described above for harvest content. The structure of file transfer AUs is simple and consistent (typically a directory hierarchy) so testing the crawl rules on a large sample of content is unnecessary. All the CLOCKSS boxes collect the same copy of the content so few of the complexities of preserving harvest content (such as hash filters) are present.

A slightly simplified workflow in AUTest directs content-testing boxes to collect a single copy of a smaller sample of AUs from the staging server, and to extract metadata and check it for sanity. After any problems are corrected, the AU and similar AUs are configured on the CLOCKSS production boxes.

File Transfer Content Ingest

The collection scripts are added to the ingest user's crontab, and run daily to collect any new content. Every 24 hours the entire content of the file transfer ingest machine is synchronized with (a) an on-site backup and (b) an off-site backup using rsync. The verification scripts are run against each of these backup copies at intervals.
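The daily synchronization might be driven by commands of the following shape. The flags and paths below are conventional rsync usage, shown for illustration; they are not the actual CLOCKSS cron jobs:

```python
def rsync_command(source, destination, delete_extraneous=False):
    """Build an rsync invocation of the kind a daily cron job might run.
    -a preserves the directory tree; --checksum forces content-based
    comparison rather than relying on timestamps."""
    cmd = ["rsync", "-a", "--checksum"]
    if delete_extraneous:
        cmd.append("--delete")
    cmd += [source, destination]
    return cmd

# One on-site and one off-site replica of the ingest machine's content
# (host names and paths are hypothetical):
onsite = rsync_command("/ingest/content/", "backup-onsite:/ingest/content/")
offsite = rsync_command("/ingest/content/", "backup-offsite:/ingest/content/")
```

Keeping `--delete` off by default is a deliberately conservative choice for an archive: files that vanish from the source are not silently removed from the backups.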

Once the ingested content is verified it is staged to a Web server that can be accessed only by the CLOCKSS boxes. The CLOCKSS boxes crawl the file transfer content staged on the Web server under control of a file transfer plugin and preserve it, as described in Definition of AIP.

The staging Web server uses a clone of the directory hierarchy into which the file transfer content is collected. This hierarchy is created, by linking to the files, when the content in question is released to production. The names of the files in this hierarchy are the names assigned to them by the publisher as part of the FTP transfer process. These files are typically archive files, such as TAR or ZIP. The names of the files contained in these archives are those assigned by the publisher. File transfer SIPs do not necessarily correspond to a publisher's web site, and those that do, do not in general include the publisher's URLs, so the publisher's URLs (as opposed to their file names) are not preserved. Thus the process of triggering file transfer content cannot, and does not attempt to, reproduce the URL structure of the publisher's web site.

File transfer content is considered to be preserved when it has been ingested by all production CLOCKSS boxes and at least one poll on it has been successful (see LOCKSS: Polling and Repair Protocol).

The CLOCKSS File Transfer Content Lead is responsible for this process.

Completeness of File Transfer Content

Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content delivered by the publisher. Content submitted via file transfer is either made available from a content server operated by the publisher, or delivered by the publisher to a content server hosted by CLOCKSS. In either case the content is transferred to a staging server operated by CLOCKSS. The ingest scripts include procedures that verify the files submitted by the publisher correspond to those on the content server. Where publishers provide checksum information, checksums are also compared with locally computed checksums to ensure that the content is the same. If the checksums differ, the local copy is deleted and re-collected.

Correctness of File Transfer Content

Correctness in this context means that, if the content is ever triggered, the result will be Independently Understandable by the CLOCKSS: Designated Community. During initial engagement with file transfer publishers, their content types are assessed to develop a plan for rendering the content if it is ever triggered. One goal of the ingest scripts is to verify that the content types submitted match this assessment; if they do not the triggering plan for the content must be revised to account for the new content types.

Errata, Corrections and Retractions

As regards subsequent errata and corrections, it is assumed that publishers follow the NLM best practice guidelines and at least refer and link to the errata or corrections in a subsequent issue. If the publisher does, they will be collected with that issue.

As regards retractions, the CLOCKSS: Participating Publisher Agreement paragraph 2.E states:

E. At the request of Publisher, CLOCKSS will make reasonable efforts to identify Archived Content that contains information that has been retracted or removed (“Retracted”) by Publisher due to potential inaccuracy (“Retracted Content”). Upon a Trigger Event (as defined below) involving such Retracted Content, the CLOCKSS Board (the “Board”) shall determine whether such Retracted Content will be included in the Release (as defined below), and if Released, what if any notifications should accompany such Retracted Content. If the CLOCKSS Board determines that the Release of such Retracted Content would present a legal risk to CLOCKSS or Publisher, the Content will not be Released unless CLOCKSS and Publisher are indemnified against any Damages (as defined below). If Released Archived Content is later Retracted, the CLOCKSS Board shall determine what actions, if any, to take, including possible removal of such Retracted Content from public view. If Publisher at any time notifies CLOCKSS that Publisher has made a good-faith determination that the Release of certain Retracted Content may present a legal risk to CLOCKSS or Publisher, and the CLOCKSS Board nevertheless elects to Release such Retracted Content, Publisher shall have no indemnity obligation to CLOCKSS, whether pursuant to Section 10 below or otherwise, to the extent that any Claim (as defined below) arises from CLOCKSS Release of such Retracted Content.

The CLOCKSS: Participating Publisher Agreement paragraph 4.F states:

Any Released Content may be authorized to be removed from Host Organization(s) and public access by the affirmative vote of at least seventy-five percent (75%) of all the members of the Board then in office (“Removal Authorization”). However, because of the impact of a Removal Authorization, a negative vote of three (3) or more members of the Board then in office, shall override such Removal Authorization leaving the Released Content available to public access, unless the removal is requested by the Publisher, in which case the Released Archived Content shall be removed from Host Organization(s) and public access.

The LOCKSS team has yet to receive any such request. Presumably such a request would be made, and would need to be preserved, at the article level. The request would be recorded in the triggering plan for that publisher (for file transfer content) and in a CLOCKSS: Retraction Requests document (preserved in CLOCKSS), and an indication would be added to the TDB that there was a retraction request for that AU. If and when that title was triggered, the trigger process would exclude the retracted content. The team is investigating whether participating publishers could supply regular lists of retractions, instead of specific requests for triggered content to the Board. These would be recorded routinely, rather than dealing with retractions only as part of the trigger process.

There is potentially some conflict between these terms and the use of CC licenses for triggered content, since even if content redacted or withdrawn after triggering is removed from the triggered content sites other sites may have copied and re-published it, as is permitted by the CC license. The LOCKSS team are investigating possible ways to resolve this conflict, but presumably the CLOCKSS Archive would not be liable for the copies elsewhere that were made before notification of the redaction or withdrawal.

As regards changed harvest content, when an AU is closed in the Annual Harvest Content Cycle a final deep re-crawl of the content is undertaken to ensure that the closed AU contains up-to-date content. In addition, the CrossMark service offers the ability to detect versioning of articles. The LOCKSS team are investigating the possibility of using CrossMark to drive re-harvesting of updated content. This may be particularly important since at least one harvest publisher is already planning to update content without formal notification.

As regards changed file transfer content, the CLOCKSS archive is dependent upon the publisher supplying it in subsequent SIPs. The triggering plan for each file transfer publisher includes information describing how the process of extracting triggered content for that publisher ensures that the most recent version of content is triggered.

Omissions

As regards unintentional omissions at the article level, the CLOCKSS archive reports article counts ingested to the publisher for billing purposes. These counts can be checked against the publisher's article counts. The LOCKSS team have had instances of the ingest article counts being both too small (articles had been ingested but the counting process was wrong, which the team detected) and too large (publishers reported this but their article counts were wrong).

Unintentional omissions at the file level in harvest content would appear as broken links on the publisher's web site. The CLOCKSS crawlers on the ingest boxes would see these as URLs that should have been collected but which returned 404. The crawler's response to these 404s is configurable; almost all plugins are configured either to report via a warning and continue collecting the site or, after a delay, retry the fetch up to N times then report via a warning and continue.
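The configurable 404 handling can be sketched as follows. This is a hypothetical illustration; the real crawler is part of the LOCKSS daemon, and its retry policy is configured per plugin:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, delay=0.0, warn=print):
    """Retry a fetch that returns 404 up to max_retries times, then
    report via a warning and move on, so the rest of the site is still
    collected (names and signature are illustrative)."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 404:
            return body
        if attempt < max_retries:
            time.sleep(delay)
    warn("404 after %d retries: %s" % (max_retries, url))
    return None
```

Setting `max_retries=0` models the other configuration described above: warn immediately and continue collecting.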

The ability to detect unintentional omissions at the file level in file transfer content varies depending on the structure of the individual publisher's SIP which is encoded in the relevant ingest script.

As regards intentional omissions from file transfer content, there is nothing that an archive can do to prevent a file transfer publisher refusing to supply content. For harvest content, since that is obtained by emulating what the publisher's readers do to obtain content, a simplistic approach to intentionally omitting content risks also depriving readers of the content. More sophisticated approaches would avoid this risk but would add substantial overhead for the publisher. Fundamentally, the CLOCKSS archive is preserving what the publishers pay it to preserve.

Feedback

The CLOCKSS Executive Director receives regular reports of the article counts ingested, upon which publishers are billed (See CLOCKSS: Logging and Records).

Content AUs can be in one of three preservation states, as reported to the publisher, the CLOCKSS Board and the Keepers Registry (See CLOCKSS: Logging and Records):

  1. Committed for preservation
  2. In process
  3. Preserved

Progress is tracked through the AU's configuration file.
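The three states can be modeled as a simple forward-only progression. This enum is illustrative only; the actual tracking is done through the AU's configuration file as noted above:

```python
from enum import Enum

class PreservationState(Enum):
    """The three reported states of a content AU (from the list above)."""
    COMMITTED = "Committed for preservation"
    IN_PROCESS = "In process"
    PRESERVED = "Preserved"

def advance(state):
    """Illustrative forward-only progression; a Preserved AU stays
    Preserved."""
    order = list(PreservationState)
    i = order.index(state)
    return order[min(i + 1, len(order) - 1)]
```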

The CLOCKSS Harvest and File Transfer Content Leads are responsible for this process for their respective content.

Change Process

Changes to this document require:

  • Review by:
    • LOCKSS Technical Staff
    • CLOCKSS Plugin Lead
    • CLOCKSS Content Lead
  • Approval by CLOCKSS Technical Lead

Relevant Documents

  1. CLOCKSS: Designated Community
  2. CLOCKSS: Logging and Records
  3. LOCKSS: Software Development Process
  4. LOCKSS: Polling and Repair Protocol
  5. Definition of SIP
  6. Definition of AIP
  7. CLOCKSS: Publisher Agreement