CLOCKSS: Logging and Records

From CLOCKSS Trusted Digital Repository Documents
Revision as of 03:33, 10 April 2014 by Dshr (Talk | contribs)

The CLOCKSS system uses three types of record:

  • Logs: detailed records of the operations of the LOCKSS daemon, at an extensively customizable level of detail, written to log files on the host machine. The purpose of these logs is to enable diagnosis of problems that arise. Logs are retained on the machine that generated them in /var/log.
  • Alerts: messages sent off-machine by the LOCKSS daemon when significant events occur. The purpose of Alerts is to draw attention to potential problems that may need diagnosis. Alerts are sent via e-mail to the <clockss-alerts> mail alias, and added to the log files on the host machine via the syslog mechanism.
  • Records: statistical summaries and business records of the operation of the system as a whole, not of individual boxes. They are provided to the CLOCKSS board and CLOCKSS member organizations, and electronic copies are being stored in a system run by the Executive Director.

Retention Policy

Although the LOCKSS daemon can generate extremely detailed logs, doing so routinely is counter-productive. It buries the signal in the noise. The goal of the logging and record policy, in the absence of a specific problem to diagnose, is to:

  • Generate Logs adequate to, and retain them long enough to, enable simple diagnosis.
  • Generate Alerts on any condition that the daemon determines is anomalous, and on other significant events, with sufficient detail to draw the system administrator's attention to problems requiring diagnosis, and to retain them indefinitely.
  • Generate the Records needed for business and governance, and for monitoring of the CLOCKSS network's overall performance, and to retain them indefinitely.

Specific log retention policies for each CLOCKSS box are specified in /etc/logrotate.conf and the files in /etc/logrotate.d/. On each CLOCKSS Box:

  • System logs are retained for a month.
  • At least the most recent 20MB of LOCKSS daemon log data is retained.
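
The size-based rule for daemon logs can be illustrated with a short sketch. It does not reproduce the logrotate configuration on any CLOCKSS box; the function name, the file names and the reading of "20MB" as 20 MiB are assumptions made for illustration.

```python
from typing import List, Tuple

# "20MB" from the policy, read here as 20 MiB (an assumption).
MIN_RETAINED_BYTES = 20 * 1024 * 1024


def files_to_keep(logs_newest_first: List[Tuple[str, int]],
                  min_bytes: int = MIN_RETAINED_BYTES) -> List[str]:
    """Return the newest log files whose combined size first reaches min_bytes.

    logs_newest_first holds (name, size_in_bytes) pairs, newest file first.
    Files older than those returned could be removed while still retaining
    at least min_bytes of the most recent log data.
    """
    kept: List[str] = []
    total = 0
    for name, size in logs_newest_first:
        if total >= min_bytes:
            break
        kept.append(name)
        total += size
    return kept


# Hypothetical file names and sizes, for illustration only.
MIB = 1024 * 1024
example_logs = [("daemon", 5 * MIB), ("daemon.1", 10 * MIB),
                ("daemon.2", 10 * MIB), ("daemon.3", 10 * MIB)]
print(files_to_keep(example_logs))
```

The oldest file in the example is not needed to satisfy the rule, because the three newer files already hold more than the minimum.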

Ingest Alerts

An Alert is generated at the end of each crawl of a SIP that meets certain criteria. It records the final status of the crawl and the number of HTTP 200 results obtained (this is equivalent to the number of new URLs found plus the number of existing URLs found to have modified content). An example of such an alert:

  Date: Sat 19 Feb 2011 04:17:24 PST
  From: LOCKSS box ingest2.clockss.org <clockss-alert@xxx.xxx>
  Subject: [lockss-alert] LOCKSS box info: CrawlEnd

  LOCKSS box 'ingest2.clockss.org' raised an alert at Sat Feb 19 04:12:24 PST 2011

  Name: CrawlEnd
  Severity: info
  AU: Nature Reviews Genetics Volume 11
  Explanation: Crawl ended successfully: 2276 new files

  Crawl ended successfully, 2276 new files, 4 warnings.

Here is an example failed crawl alert from an ingest box:

From: LOCKSS box ingest1.clockss.org <clockss-alert@xxx.xxx>
To: clockss-alert@xxx.xxx
Date: Thu 20 Mar 2014 21:50:18 PDT
Subject: [clockss-alert] LOCKSS box warning: CrawlFailed

LOCKSS box 'ingest1.clockss.org' raised an alert at Thu Mar 20 21:45:18 PDT 2014

Name: CrawlFailed
Severity: warning
AU: Journal of Pharmacology and Experimental Therapeutics Volume 346
Explanation: Crawl finished with error: Can't fetch permission page: 0 files fetched, 0 warnings, 1 error
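
Because the alert bodies shown in this document use a regular "Field: value" layout, they are straightforward to process mechanically. The following is a minimal sketch, assuming only the layout visible in the samples; the function name `parse_alert_body` and the list of field names are illustrative and are not part of the LOCKSS daemon.

```python
from typing import Dict

# Field labels that appear in the sample alerts in this document.
ALERT_FIELDS = ("Name", "Severity", "AU", "AUID", "Explanation")


def parse_alert_body(body: str) -> Dict[str, str]:
    """Extract the 'Field: value' lines from the text of a single alert."""
    fields: Dict[str, str] = {}
    for line in body.splitlines():
        # Split on the first ": " only, so colons inside the value survive.
        key, sep, value = line.strip().partition(": ")
        if sep and key in ALERT_FIELDS and key not in fields:
            fields[key] = value.strip()
    return fields


sample = """LOCKSS box 'ingest1.clockss.org' raised an alert at Thu Mar 20 21:45:18 PDT 2014

Name: CrawlFailed
Severity: warning
AU: Journal of Pharmacology and Experimental Therapeutics Volume 346
Explanation: Crawl finished with error: Can't fetch permission page: 0 files fetched, 0 warnings, 1 error
"""

alert = parse_alert_body(sample)
print(alert["Name"], alert["Severity"])
```

Splitting on the first ": " keeps the whole Explanation text together even though it contains further colons.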

Preservation Alerts

An Alert is generated at the end of each poll that detects an integrity problem:

  • If there were a non-zero number of URLs for which:
    • A repair was needed because the content failed to match the consensus.
    • Repair content was fetched.
    • The repair content failed to match the consensus.
  • If there were a non-zero number of URL versions newly flagged as suspect because their content failed to match the locally stored hash.
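
The two conditions above amount to a simple predicate over per-poll counters. The sketch below is illustrative only: the `PollSummary` class and its field names are hypothetical and do not correspond to the daemon's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class PollSummary:
    """Hypothetical per-poll counters; the field names are illustrative."""
    # URLs that needed a repair, had repair content fetched, and still
    # failed to match the consensus afterwards.
    failed_repairs: int = 0
    # URL versions newly flagged as suspect because their content failed
    # to match the locally stored hash.
    newly_suspect_versions: int = 0


def should_raise_preservation_alert(summary: PollSummary) -> bool:
    """True when either integrity condition holds for the poll."""
    return summary.failed_repairs > 0 or summary.newly_suspect_versions > 0


print(should_raise_preservation_alert(PollSummary(failed_repairs=1)))
```

A poll in which every repair resolved the disagreement and no version was newly flagged would leave both counters at zero and raise no alert.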

Here is an example alert caused by injection of a failure of a repair to match the consensus during testing in the STF test environment:

From: LOCKSS box quark <xxx@xxx.xxx>
To: clockss-alert@xxx.xxx
Date: Thu 20 Mar 2014 22:50:33 PDT
Subject: LOCKSS box warning: PersistentDisagreement

LOCKSS box 'quark' raised an alert at Thu Mar 20 22:50:33 PDT 2014

Name: PersistentDisagreement
Severity: warning
AU: Simulated Content: simContent
AUID: org|lockss|plugin|simulated|SimulatedPlugin&root~simContent
Explanation: Poll did not achieve consensus on all files

21 URLs tallied, 95.23% agreement
1 repair received, 0 not received.

1 repair didn't resolve disagreement:
http://www.example.com/003file.txt

Dissemination Alerts

The CLOCKSS archive is a dark archive; access to the content is permitted only at the direction of the CLOCKSS board. Thus, as described in CLOCKSS: Box Operations, the content access mechanisms of the LOCKSS daemon are disabled, and packet filters are used to further prevent access. Nevertheless, Alerts are generated on any access to the content in order that they may be treated as Security Alerts.

Here is a sample access alert from an ingest box. These accesses are expected as they come from production boxes crawling the ingest box; the alerts were turned on briefly as a test but would normally be disabled.

From: LOCKSS box ingest1.clockss.org <clockss-alert@xxx.xxx>
To: clockss-alert@xxx.xxx
Date: Sun 05 Jan 2014 08:42:42 PST
Subject: LOCKSS box info: ContentAccess (multiple)

LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:31 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/rj/style/group.css : 200 from cache in 398ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 243ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 1ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 1ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 0ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:34 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 157ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:34 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:35 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:36 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms
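
A "(multiple)" message such as the one above bundles several alerts separated by rows of "=" characters. The sketch below shows one way to split such a message and pull out the URL, HTTP status and response time from each ContentAccess alert. The regular expressions are assumptions derived solely from this sample; other Explanation wordings are not covered, and the names `split_alerts` and `parse_proxy_access` are illustrative.

```python
import re
from typing import List, NamedTuple, Optional


class ProxyAccess(NamedTuple):
    url: str
    status: int
    millis: int


# A separator is a line consisting only of '=' characters (as in the sample).
SEPARATOR = re.compile(r"^={10,}[ \t]*$", re.MULTILINE)
# Matches only the "Proxy access ... from cache" wording seen in the sample.
ACCESS = re.compile(
    r"^Explanation: Proxy access: (\S+) : (\d{3}) from cache in (\d+)ms[ \t]*$",
    re.MULTILINE,
)


def split_alerts(body: str) -> List[str]:
    """Split a '(multiple)' message body into the text of each alert."""
    return [part.strip() for part in SEPARATOR.split(body) if part.strip()]


def parse_proxy_access(alert_text: str) -> Optional[ProxyAccess]:
    """Return the access details, or None if the wording is not recognized."""
    match = ACCESS.search(alert_text)
    if match is None:
        return None
    return ProxyAccess(match.group(1), int(match.group(2)), int(match.group(3)))


digest = """LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:31 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/rj/style/group.css : 200 from cache in 398ms

==========================================================================
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014

Name: ContentAccess
Severity: info
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 243ms
"""

accesses = [parse_proxy_access(text) for text in split_alerts(digest)]
for access in accesses:
    print(access)
```

Returning None for unrecognized wording, rather than raising an error, lets a caller set aside alerts that need a human to read them.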

Administrative and Security Alerts

Alerts are generated on the following administrative actions:

  • Changes to the configuration files.
  • Changes to the access control permissions.
  • Adding or de-activating an AU.
  • Enabling or disabling the content servers.
  • User account added or removed or password changed.

External Communications

Engagement

Engagement with harvest content publishers before ingestion is described in CLOCKSS: Ingest Pipeline.

Engagement with file transfer content publishers before ingestion is described in CLOCKSS: Ingest Pipeline.

In all cases interactions with the publisher take place through the RT ticketing system, so they are recorded permanently.

External Reports

The technology for generating reports is being revised. The earlier technology generated reports on each box from the LOCKSS: Metadata Database and then merged them, which became too inefficient as the number of articles on each box grew. The new technology is a centralized database with a row for each article and a column for each of the production and ingest boxes; each cell contains the ingest timestamp of the article on that box. The timestamps are obtained by a regular polling process that asks each box for the articles ingested since the last time it was asked.
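
The shape of the centralized table and the incremental polling step can be sketched in a few lines. This is not the CLOCKSS implementation: the in-memory dictionary stands in for the database, `BoxQuery` is a hypothetical stand-in for whatever interface each box exposes, and the use of the newest ingest timestamp seen as the "since" marker is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# article id -> {box name -> ingest timestamp}. A box missing from the inner
# dictionary corresponds to an empty cell: not yet ingested on that box.
IngestTable = Dict[str, Dict[str, int]]
# Hypothetical per-box query: given a timestamp, return the
# (article_id, ingest_timestamp) pairs for articles ingested after it.
BoxQuery = Callable[[int], Iterable[Tuple[str, int]]]


def poll_boxes(table: IngestTable, last_polled: Dict[str, int],
               boxes: Dict[str, BoxQuery]) -> None:
    """Ask each box for articles ingested since its previous poll."""
    for box, query in boxes.items():
        since = last_polled.get(box, 0)
        newest = since
        for article_id, ingested_at in query(since):
            table.setdefault(article_id, {})[box] = ingested_at
            newest = max(newest, ingested_at)
        # Assumption: remember the newest ingest time seen, not wall-clock time.
        last_polled[box] = newest


def fake_box(rows: List[Tuple[str, int]]) -> BoxQuery:
    """Build an in-memory stand-in for a box, for demonstration only."""
    return lambda since: [(a, t) for a, t in rows if t > since]


table: IngestTable = {}
last_polled: Dict[str, int] = {}
boxes = {
    "ingest1": fake_box([("art-1", 100), ("art-2", 200)]),
    "prod1": fake_box([("art-1", 150)]),
}
poll_boxes(table, last_polled, boxes)
print(table)
```

Because each poll transfers only the articles ingested since the previous one, the cost of keeping the table current depends on the rate of ingest, not on the total number of articles held by each box.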

The following reports are generated for external consumption:

  • Monthly reports of the state of preservation of all serials committed to preservation in the CLOCKSS archive are delivered to the CLOCKSS board and the Keepers Registry, and posted on the Web.
  • KBART reports are generated monthly and posted on the Web. For the Global LOCKSS Network, these reports are used to update link resolver knowledge bases so that libraries can provide their readers access to the content of their LOCKSS box. Because the CLOCKSS archive is a dark archive, these reports cannot be used to update link resolvers. However, several analysis tools use KBART as an input format, so the KBART reports for CLOCKSS are made public.
  • The CLOCKSS Executive Director is sent an e-mail report of the article counts in the CLOCKSS archive weekly. These reports are preserved in Stanford's backup system.
  • The CLOCKSS archive charges publishers a small fee for each current article ingested, billed quarterly. Thus a quarterly report is generated showing for each publisher the number of their articles ingested in that quarter for each publication year. The report is submitted to the CLOCKSS Executive Director for onward transmission to the publishers. Significant discrepancies between this and the publisher's own article counts will result (and have resulted) in investigation and corrective action. To aid in this process more detailed reports, down to the article level, can be generated on request.

The CLOCKSS Metadata Lead is responsible for the production and dissemination of these reports.
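
The quarterly billing report described above is essentially a count of ingested articles grouped by publisher and publication year. The following sketch shows that aggregation under stated assumptions: the `IngestedArticle` record is hypothetical, quarters are taken to be calendar quarters, and no attempt is made to decide which articles count as "current" for billing.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple


class IngestedArticle(NamedTuple):
    """Hypothetical record; the field names are illustrative."""
    publisher: str
    publication_year: int
    ingest_date: date


def quarter_of(d: date) -> Tuple[int, int]:
    """Return (year, quarter), assuming calendar quarters."""
    return d.year, (d.month - 1) // 3 + 1


def quarterly_counts(articles: Iterable[IngestedArticle],
                     year: int, quarter: int) -> Dict[Tuple[str, int], int]:
    """Count articles ingested in the given quarter, keyed by
    (publisher, publication_year)."""
    counts: Counter = Counter()
    for article in articles:
        if quarter_of(article.ingest_date) == (year, quarter):
            counts[(article.publisher, article.publication_year)] += 1
    return dict(counts)


# Invented data, for illustration only.
articles = [
    IngestedArticle("Publisher A", 2014, date(2014, 1, 15)),
    IngestedArticle("Publisher A", 2014, date(2014, 3, 31)),
    IngestedArticle("Publisher A", 2013, date(2014, 2, 1)),
    IngestedArticle("Publisher B", 2014, date(2014, 4, 1)),
]
print(quarterly_counts(articles, 2014, 1))
```

Keeping the per-article records rather than only the totals is what makes the more detailed, article-level reports possible when a discrepancy has to be investigated.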

Monitoring

Log Monitoring

  • Ingest boxes: The CLOCKSS Content Lead is responsible for monitoring logs on the ingest boxes.
  • Production boxes: The CLOCKSS Technical Lead is responsible for monitoring logs on production boxes when needed.
  • Web servers: The CLOCKSS Network Administrator is responsible for monitoring web server logs.

Alert Monitoring

The CLOCKSS Technical Lead is responsible for monitoring the Alerts generated by CLOCKSS boxes.

Nagios

The state of the CLOCKSS infrastructure, including the CLOCKSS boxes and the ingest machines, is monitored by Nagios as described in CLOCKSS: Box Operations.

The CLOCKSS Network Administrator is responsible for monitoring via Nagios.

Network Diagnostics

The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The Andrew W. Mellon Foundation funded work to implement and evaluate improvements in these areas. This is expected to be complete by March 2015. Although these improvements will be deployed to the CLOCKSS network, the areas of inefficiency are not relevant to it, because there are many fewer boxes in the CLOCKSS network than in the GLN. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.

The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. For examples of the use of this software, see LOCKSS: Polling and Repair Protocol.

The CLOCKSS Network Administrator is responsible for collecting and analyzing this data.

Change Process

Changes to this document require:

  • Review by:
    • LOCKSS Engineering Staff
    • CLOCKSS Network Administrator
  • Approval by CLOCKSS Technical Lead

Relevant Documents

  1. CLOCKSS: Box Operations
  2. CLOCKSS: Ingest Pipeline
  3. LOCKSS: Polling and Repair Protocol
  4. Definition of AIP