Difference between revisions of "CLOCKSS: Threats and Mitigations"

From CLOCKSS Trusted Digital Repository Documents
(Initial version)
 
(restoring post-2016 edits)
 
(4 intermediate revisions by one user not shown)

Latest revision as of 23:23, 14 August 2019

Contents

CLOCKSS: Threats and Mitigations

Threat Model

The system architecture and operations policies of the CLOCKSS Archive are based on the threat model underlying the LOCKSS technology, which was formalized in a 2005 paper published by the LOCKSS team, Requirements for Digital Preservation Systems: A Bottom-Up Approach, and on periodic reviews of code, configuration and policies. The paper identified the following threats:

  • Media Failure.
  • Hardware Failure.
  • Software Failure.
  • Communication Errors.
  • Failure of Network Services.
  • Media & Hardware Obsolescence.
  • Software Obsolescence.
  • Operator Error.
  • Natural Disaster.
  • External Attack.
  • Internal Attack.
  • Economic Failure.
  • Organizational Failure.

Mitigation Strategy

This set of threats is the basis for code and operations reviews. Although the set includes threats that are not traditionally classified as security risks, the LOCKSS team treats all threats as potentially security-related. As an example of the reasoning behind this, consider "Communication Errors". These might be random, they might be load-related, or they might be caused by a denial-of-service attack.

As described in LOCKSS: A Peer-to-Peer Digital Preservation System, Attrition Defenses for a Peer-to-Peer Digital Preservation System and other papers noted here, the LOCKSS system design assumes a hostile environment with a powerful adversary. It does not assume that all the boxes in a network such as CLOCKSS are benign; it merely assumes that the majority of boxes do not behave maliciously. The analysis in these papers assumes a completely distributed network, composed of independent peers with no central control whatsoever. If that were the case, these design assumptions applied to a network with a sufficiently large number of replicas would provide robust defenses against all the threats noted above. An attacker would have to compromise, and maintain control of, a large majority of the peers for an extended period of time in order to modify, delete, or significantly prevent access to the content.
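
The majority assumption can be illustrated with a minimal sketch. It is not the LOCKSS polling protocol, which is considerably more elaborate and is described in the papers cited above; the function name, the use of SHA-256 and the box names are assumptions made purely for illustration. Each box's copy of an item is hashed, and a copy is treated as suspect if it disagrees with the digest held by a strict majority of boxes.

```python
import hashlib
from collections import Counter


def find_dissenters(copies):
    """Return (majority_digest, boxes whose copy disagrees with the majority).

    `copies` maps a (hypothetical) box name to the bytes that box holds for
    one item. Illustration only: this simply counts digests and is not the
    actual LOCKSS polling protocol.
    """
    if not copies:
        return None, []
    digests = {box: hashlib.sha256(data).hexdigest() for box, data in copies.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        # No strict majority: the sketch refuses to name a winner.
        return None, sorted(copies)
    dissenters = sorted(box for box, d in digests.items() if d != digest)
    return digest, dissenters


copies = {
    "box-a": b"article text",
    "box-b": b"article text",
    "box-c": b"article text",
    "box-d": b"tampered text",
}
winner, damaged = find_dissenters(copies)
print(damaged)
```

In this toy model a single altered copy is outvoted by the others, which is why an attacker would need to control most of the boxes to change what the network agrees on.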

Of necessity, the CLOCKSS network does have central control, which therefore provides a vector by which some of the threats can be effective. In particular, the following threats applied to the central CLOCKSS organization can be effective and need to be mitigated as described in the following sections:

  • Software Obsolescence.
  • Operator Error.
  • Natural Disaster.
  • External Attack.
  • Internal Attack.
  • Economic Failure.
  • Organizational Failure.

The DLIB paper details the approach the LOCKSS technology takes to mitigating each of these risks. The CLOCKSS Archive is implemented using the LOCKSS technology but, because of its nature as a tightly-controlled dark archive, it configures the technology in ways that further reduce risk as compared to the Global LOCKSS Network for which the technology was originally designed. The configuration of the CLOCKSS network is described in CLOCKSS: Box Operations; briefly, the additional defenses include:

  • Implementing a large number (currently 12) of CLOCKSS boxes each holding the entire content of the archive.
  • Ensuring that, after an initial period, each CLOCKSS box's operating system is configured to prevent write or administrative access except by staff at the host institution.
  • Securing communication among authorized CLOCKSS boxes using SSL certificate checks at both ends of each connection.
  • Preventing dissemination of content from CLOCKSS boxes except during an approved trigger event (see CLOCKSS: Extracting Triggered Content).
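
Certificate checks at both ends of a connection are commonly called mutual TLS. The sketch below uses Python's standard ssl module only to show the idea; it is not taken from the LOCKSS daemon, and the function name and the optional CA file argument are assumptions for illustration.

```python
import ssl


def make_peer_contexts(ca_file=None):
    """Build server-side and client-side TLS contexts that both demand
    a valid certificate from the peer (hypothetical helper)."""
    # Server side: by default a server does not ask clients for a
    # certificate, so require one explicitly.
    server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    server_ctx.verify_mode = ssl.CERT_REQUIRED

    # Client side: verifying the server certificate is already the default.
    client_ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

    if ca_file is not None:
        # Trust only certificates issued by the network's own authority.
        server_ctx.load_verify_locations(cafile=ca_file)
        client_ctx.load_verify_locations(cafile=ca_file)
    # A real deployment would also call load_cert_chain() on each context
    # so that each side can present its own certificate.
    return server_ctx, client_ctx


server_ctx, client_ctx = make_peer_contexts()
print(server_ctx.verify_mode, client_ctx.verify_mode)
```

With both contexts set to require a certificate, a connection from a machine that cannot present an accepted certificate fails during the handshake, before any content is exchanged.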

The CLOCKSS network consists of (currently 12) CLOCKSS boxes in the US (Stanford, Rice, Indiana, Virginia, OCLC) and in Australia, Canada, Italy, Japan, Hong Kong, Germany and Scotland. Each of these boxes is configured to preserve a complete copy of all content successfully ingested into the CLOCKSS Archive, which is continually audited by the LOCKSS: Polling and Repair Protocol. Any CLOCKSS trigger event or any failure of both of the replicated triggered content servers could be satisfied by extracting and disseminating content from any one of these boxes as described in CLOCKSS: Extracting Triggered Content. Thus in this replicated system architecture each box is backed up by all of the others. The LOCKSS: Polling and Repair Protocol keeps each box informed of the existence and state of the content at each of the other boxes.

Awareness

The CLOCKSS Archive's awareness strategy has two parts:

  • Environmental awareness, meaning awareness of events outside the archive that could affect its operations.
  • Operational awareness, meaning awareness of the state of the CLOCKSS PLN (Private LOCKSS Network) and the effectiveness of its operations.

Environmental Awareness

Anticipating technology changes is one role of the senior engineers of the LOCKSS team, among whom are four each with more than 20 years in senior engineering positions in Silicon Valley. Awareness of future technology trends is a job requirement for positions such as these. To fulfill this requirement the team can draw on expertise from the Computer Science Departments of Stanford and UC Santa Cruz, and a network of colleagues in senior engineering and research positions in industry giants including Oracle, Google, NetApp, Seagate and HP, and in the Linux and FreeBSD communities. This in-house and local expertise acts in place of a subscription to a technology watch service as regards the open source ecosystem and generic PC technologies.

Senior technical staff of the LOCKSS Program attend and speak at many international digital preservation conferences (see LOCKSS Talks page). They conduct leading-edge research and publish extensively (see LOCKSS Publications page and Dr. David S. H. Rosenthal's blog) in digital preservation. They have done so consistently since 2000. This in-house expertise acts in place of a subscription to a technology watch service as regards digital preservation technologies.

Risk: Senior engineers are in demand in Silicon Valley, and these key team members could be recruited away. This risk is mitigated by the fact that there are four of them and that they are all older. Older engineers are less in demand by industry, and find Stanford's excellent benefits attractive in comparison to industry's higher salaries but benefit structures more aimed at youngsters. Staffing the LOCKSS program is a continuous process. Position descriptions and classifications adhere to Stanford University's standards.

Operational Awareness

The mechanisms the CLOCKSS Archive uses to observe the operations of the CLOCKSS boxes via logging and Alerts are described in CLOCKSS: Logging and Records. That document also describes:

LOCKSS Program operations staff monitor the state of the network using logs, Alerts, Nagios, and internal tools.

Risk: The major risk is that the volume of information overwhelms the staff monitoring it.

Threat Mitigations and Risks

The following sections describe the CLOCKSS Archive's approach to mitigating the threats identified by the Threat Model.

Media Failure

The hard disk media used by the CLOCKSS Archive can fail in three ways:

Risk: There is a risk that a whole-disk failure would not be observed, and the drive replaced, before another drive in the RAID group failed. The result would be loss of data on the box in question. This would be repaired by the LOCKSS: Polling and Repair Protocol, but doing so would take some time.
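
The size of this window can be estimated with a rough calculation. The sketch below assumes that drives fail independently at a constant rate, which is a simplification; the function name and the drive count, replacement window and MTBF figures are invented inputs, not values from the CLOCKSS configuration.

```python
import math


def prob_second_failure(other_drives, window_hours, mtbf_hours):
    """Probability that at least one of the remaining drives in a RAID group
    fails during the replacement window, assuming independent drives with a
    constant failure rate of 1 / mtbf_hours (a simplifying assumption)."""
    rate = other_drives / mtbf_hours  # combined failure rate of the survivors
    return 1.0 - math.exp(-rate * window_hours)


# Invented inputs: 7 surviving drives, a 72-hour versus a 30-day window,
# and a 1,000,000-hour MTBF.
for window in (72, 30 * 24):
    print(window, f"{prob_second_failure(7, window, 1_000_000):.5f}")
```

Under these assumptions the probability grows roughly in proportion to the length of the window, which is why prompt detection of a failed drive matters more than the exact failure rate.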

Hardware Failure

Apart from media, other components of CLOCKSS boxes can fail. Observed failures include:

  • Power supplies. CLOCKSS boxes have redundant power supplies, so the failure of one does not bring the box down.
  • Motherboards, whose failure does bring the box down.

CLOCKSS boxes can be down for extended periods without impairing the function of the network as a whole, as the box in Tokyo was in the aftermath of the Fukushima disaster.

The CLOCKSS Archive does not maintain service contracts for its hardware. It owns the hardware even though it is located at remote sites. The architecture of the CLOCKSS network means that rapid response to outages at individual sites is not required. All CLOCKSS hardware shipped to remote sites is equipped with redundant power supplies, and undergoes extended burn-in before shipment. Experience shows that failures of hardware components other than disks are rare. CLOCKSS boxes are equipped with warm spare disks to cover for disk failures. Non-disk failures are typically handled by exchanging a complete server with one from the LOCKSS team. Thus service contracts are not economically justified.

Risk: There is a risk that delays in repairing hardware failures would result in enough CLOCKSS boxes being down simultaneously to impair the function of the network. This risk is mitigated by monitoring of box operations and treating hardware repair as urgent. The purchase of spare hardware that could immediately be shipped from Stanford to reduce the delay in repair is being investigated.

Software Failure

Failures in the LOCKSS software are detected and diagnosed using the logging mechanisms. They are reported, and progress in their remediation tracked, as described in LOCKSS: Software Development Process. Because the LOCKSS daemon on each box operates independently, and because its operations are heavily randomized, it is unlikely that the occurrence of failures on multiple boxes would be correlated in time. If necessary, the various processes performed by each or all LOCKSS daemons, such as content collection and integrity checks, can be individually and temporarily disabled by means of the property server.

The LOCKSS and CLOCKSS networks are so large that full-scale testing in an isolated environment is economically infeasible. Testing in available isolated environments is a part of the process, but it cannot be representative of the load encountered by software in production use. This risk is mitigated by releasing new versions of the LOCKSS software to a number of LOCKSS boxes that the LOCKSS team runs as part of the Global LOCKSS Network (GLN) before they are made generally available or released to the CLOCKSS network. The GLN includes a much larger number of boxes than the CLOCKSS network, although on average each has much less content. Experience shows that problems not detected in an isolated testing environment are most likely to be caused by a large number of boxes rather than by a large amount of content per box.

Risk: There is a risk that bugs in the LOCKSS daemon could overwrite or delete content from a CLOCKSS box. This risk is mitigated in two ways:

  • Exclusion from the LOCKSS daemon of code that overwrites or deletes files from the repository, so that a bug cannot inappropriately execute it.
  • The LOCKSS: Polling and Repair Protocol, which would detect and repair the damage from another box.

Communication Errors

The CLOCKSS archive uses network communications for three purposes:

  • Ingest. HTTP is not a reliable transport protocol so, as described in CLOCKSS: Ingest Pipeline, content is ingested multiple times by different ingest machines and subsequently the LOCKSS: Polling and Repair Protocol detects and repairs any inconsistencies between the content at the ingest boxes.
  • Preservation. CLOCKSS boxes are configured to use SSL for all communication between them, specifically for the LOCKSS: Polling and Repair Protocol. Certificates are checked at both ends of all connections. Corruption would thus be detected. Interruptions of communication are normal; messages are re-tried until delivered or a specified time-out.
  • Dissemination. Communication problems during dissemination to the re-publishing servers would be detected by checksum verification and re-tried.
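
The pattern shared by the preservation and dissemination cases above (verify what arrived, and re-try until success or a time-out) can be sketched as follows. This is a minimal illustration, not the LOCKSS implementation: the function names are hypothetical, and SHA-256 is an assumption, since the text says only that a checksum is verified.

```python
import hashlib
import time


def sha256_hex(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def transfer_with_retry(fetch, expected_digest, timeout_s=5.0, delay_s=0.0,
                        clock=time.monotonic):
    """Call fetch() until its result matches expected_digest or time runs out.

    `fetch` is any zero-argument callable returning bytes (hypothetical).
    Returns (data, attempts); raises TimeoutError once the deadline passes.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            data = fetch()
            if sha256_hex(data) == expected_digest:
                return data, attempts
            # Digest mismatch: the transfer was corrupted, so try again.
        except OSError:
            # Interrupted communication is treated as normal and re-tried.
            pass
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {attempts} attempts")
        time.sleep(delay_s)
```

Corruption and interruption are handled by the same loop, so a caller needs to distinguish only two outcomes: verified data or a time-out.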

Risk: The mitigations are assessed as effective against the threat, so the risk is low.

Failure of Network Services

The CLOCKSS archive has to consider the possible failure of network services during each of the three phases:

  • Ingest:
    • Ingest of content via harvest requires the use of DNS and the publisher's Web server. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.
    • Ingest of content via file transfer requires the use of DNS and a file transfer service such as ftp either at the publisher or at Stanford. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.
  • Preservation. A major design goal of the LOCKSS: Polling and Repair Protocol was to avoid all dependencies on external network services, even DNS, since there was no guarantee that the service would continue. Provided it remains possible to route packets to a network address, the failure of other network services would not affect the preservation of CLOCKSS content. Currently, DNS is required during daemon start; a fix has been designed but has yet to be implemented.
  • Dissemination. Transfer of triggered content to the re-publishing servers requires DNS and a file transfer protocol such as rsync or sftp. Failure of either is presumed to be transient, so it would delay but not prevent dissemination.

Risk: If Internet connectivity were to be impossible for many months the content of the individual CLOCKSS boxes would be at significant risk, but this is assessed as a low probability event. The failures would be unlikely to be correlated, so once connectivity was restored the LOCKSS: Polling and Repair Protocol would have a high probability of recovering from them, although it would take some time.

Media & Hardware Obsolescence

All hardware and media components in use by the CLOCKSS archive are generic low-cost PC server technology and, as such:

  • Easy to monitor for obsolescence, since that would be an industry-wide event.
  • Easily replaced with newer physical or virtual resources when necessary.

The LOCKSS team monitors the state of the hardware inventory as documented in CLOCKSS: Logging and Records. The CLOCKSS archive has technical and financial plans in place to replace failed or life-expired hardware.

There are three reasons why ingest machines or CLOCKSS boxes might need to be replaced:

Monitoring means the risk of missing hardware failure or resource exhaustion is low, and because either would affect only one of the replicas the impact would be low. The risk of technological obsolescence of the hardware is low since it is all generic PC servers with no specialized components.

The technical specifications for the current hardware were drawn up with incremental upgrade over time in mind. The only components that we expect to upgrade in the next 5 years are the disk media. Beyond 5 years is too far ahead to draw up detailed specifications. For example, would we want to use ARM-based micro-servers? Spintronic storage media? Named data networking? We can't know yet.

Risk: There is a risk that, when the time comes, financial resources would be inadequate to replace life-expired hardware. This risk is mitigated by:

  • Assuming a service life (typically 5 years) for equipment that is much less than the equipment is capable of.
  • Using generic, low cost equipment.
  • The replication inherent in the CLOCKSS PLN, which means that a few boxes could be out of service for some time without impacting the archive's operations.

Software Obsolescence

All software in use by the CLOCKSS Archive is either:

  • free, open-source, industry standard software such as Linux and Java, or
  • internally developed free, open-source software (the LOCKSS daemon), or
  • internally developed tools used for content testing, and diagnosis of the CLOCKSS network's performance.

The LOCKSS daemon used to preserve the CLOCKSS Archive's content depends upon:

  • A POSIX file system.
  • A Java virtual machine, level 6 or above.
  • A set of Java libraries.

Changes that would prevent the Linux environment from satisfying these requirements are considered unlikely in the foreseeable future; if the Linux community were to envisage them, it would only be after open discussion, of which the LOCKSS team would be aware (see Awareness above). The LOCKSS software is maintained by the LOCKSS team using processes defined in LOCKSS: Software Development Process. LOCKSS Program technical staff monitor the evolution of the open source ecosystem and, when indicated, routinely migrate the LOCKSS software (for example, from one Java library to another deemed more suitable). The threat of obsolescence is also monitored by the testing processes described in LOCKSS: Software Development Process and CLOCKSS: Ingest Pipeline. Loss of key team members could impact the effectiveness of this; for mitigation see Awareness above. The rest of the stack is maintained by the Linux, Apache and other open source communities. Since all the software is free and open-source, no financial provision other than the normal LOCKSS: Software Development Process funding need be made for its replacement or upgrade.

One consequence of software obsolescence might be format obsolescence. The CLOCKSS Archive implements format migration on access. Doing so depends on the eventual availability of format converters, a topic discussed here.

Risk: The risk of the open source community being unable to sustain the dependencies on the Java virtual machine, some Java libraries, and the availability of a POSIX file system, or the Apache web server, is assessed as low. These basic dependencies have been stable since the LOCKSS prototype nearly 15 years ago. The requirements development process described in LOCKSS: Software Development Process might fail to detect the need for a change from the LOCKSS community or the content being preserved. Changes in the rest of the software stack might trigger a failure of one or more dependencies of the LOCKSS daemon. The unit and functional testing processes described in LOCKSS: Software Development Process are designed to detect this. CLOCKSS Archive income from libraries and publishers might not be adequate for the work needed to adapt to new publishers and conform to the evolution of existing publishers, leading to a backlog of content to be ingested.

One form of software obsolescence would be if the Web evolved in such a way as to prevent the LOCKSS software from collecting, preserving or disseminating current Web content. This is a significant concern for all Web archiving technologies, and the LOCKSS technical staff have been in the forefront of addressing the issue by running workshops at the International Internet Preservation Consortium. See blog posts about the 2013 and 2012 workshops. The Andrew W. Mellon Foundation is funding the LOCKSS Program to work in this area through mid-2014.

Operator Error

An error by the operator of an individual CLOCKSS box affects that individual box but does not compromise the integrity of the network as a whole.

An error by an operator of the Property Server can affect the entire network, but only by:

  • Interrupting service, which has no deleterious effect because each box caches the most recent set of properties.
  • Distributing a syntactically malformed property file, which will be detected by the boxes and treated as a service interruption.
  • Distributing a syntactically correct property file that sets unsuitable property values. The LOCKSS daemon software is skeptical of property values. Critical properties have range checks and the code takes other defensive measures to ensure that erroneous property values can at worst cause daemon activities such as polling to stop; they cannot cause loss of or damage to content.
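
The three cases above can be sketched as a single defensive loader. This is an illustration under stated assumptions, not the daemon's code: the property names, their ranges and the use of JSON as the file format are all hypothetical, and keeping the cached value when a new value is out of range is only one possible defensive policy.

```python
import json

# Hypothetical property names and permitted ranges (not real LOCKSS properties).
RANGES = {
    "poll.interval_hours": (1, 24 * 90),
    "crawl.max_threads": (0, 16),
}


def apply_property_file(text, cached):
    """Return the property set a box should use after receiving `text`.

    `cached` is the most recent good property set held by the box.
    """
    try:
        incoming = json.loads(text)
        if not isinstance(incoming, dict):
            raise ValueError("top level must be an object")
    except ValueError:
        # Malformed file: behave exactly as if the server were unreachable.
        return dict(cached)

    result = dict(cached)
    for name, value in incoming.items():
        bounds = RANGES.get(name)
        if bounds is None:
            continue  # Unknown property: ignore it.
        low, high = bounds
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if is_int and low <= value <= high:
            result[name] = value
        # Otherwise the value is out of range and the cached value is kept.
    return result
```

Because every path returns either the cached set or individually validated values, an erroneous file can change how activities are scheduled but has no route to the stored content.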

An error by a daemon or plugin developer, or by the daemon or plugin build master, could affect the entire network, but the testing and release process is as automated as possible and is designed to catch such errors before they reach the network.

Other precautions taken include:

  • Operator access to CLOCKSS boxes is logged.
  • Administrative actions via the LOCKSS daemon's administrative Web interface cause Alerts (see CLOCKSS: Logging and Records).

Risk: Experience of the LOCKSS system in production use shows this is a low risk, in that many such errors have been made with no serious effect.

Natural Disaster

Since each of the (currently 12) CLOCKSS boxes is configured to contain a complete copy of the Archive's content, a disaster causing the total loss of a few CLOCKSS boxes does not need to be treated as a disaster, merely the routine replacement of a few network nodes as documented in CLOCKSS: Box Operations.

All CLOCKSS triggered content is disseminated via two mirrored Web servers, one at Stanford, California and one at EDINA, Scotland. A disaster at one of these sites would not interrupt service. Mirroring could be easily restored by copying from the unaffected site or from one of the CLOCKSS boxes as documented in CLOCKSS: Extracting Triggered Content. All content triggered from the CLOCKSS network is under a Creative Commons license; there are neither technical nor legal barriers to other, unaffiliated, institutions bringing up additional mirrors.

A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of ingesting new content, since 3 of the 5 presentation ingest machines are located at Stanford. Each of these ingest machines contains the current state of the ingest pipeline, so that replacement machines could be cloned from one of the remaining machines at the cost of a week or two delay in ingest. (See CLOCKSS: Ingest Pipeline). The content of the source ingest pipeline is mirrored off-site.

A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of the "property server" used to manage the CLOCKSS network. The LOCKSS team maintains a hot standby of the property server in Amazon's cloud. (See LOCKSS: Property Server Operations). Each CLOCKSS box caches a complete copy of the contents of the CLOCKSS property server, so a service interruption would be unlikely to affect their operation during the time needed to fail over to the hot standby.

All CLOCKSS documents are preserved in the CLOCKSS Archive, and thus in each of the CLOCKSS boxes.

It appears that the use of the ServeContent servlet to serve most of the triggered content is preventing the Internet Archive's Wayback Machine from preserving it. We plan to investigate possible ways around this issue so that the Internet Archive would also be a re-publishing server for triggered content.

Risk: Given the high risk of a natural disaster in the Bay Area, further attention is needed to maintaining critical data outside the area, not merely off-site.

External Attack

As described in CLOCKSS: Box Operations, the configuration of each CLOCKSS box was carefully designed to prevent communication except with the other CLOCKSS boxes (enforced using SSL certificate checks at both ends of each connection) and with the CLOCKSS ingest and management machines (using firewall rules). CLOCKSS (and LOCKSS) boxes are single-function servers; there are no other services sharing the machine for an attacker to compromise. An attacker who gains access to an individual CLOCKSS box, perhaps by compromising a machine used by a host institution's administrator, does not compromise the integrity of the network as a whole, since the CLOCKSS boxes do not trust each other. The surface available to an external attacker is thus minimized. An attacker could compromise the CLOCKSS Property Server and modify the configuration of all boxes in the network. This could impede network operations until control of the property server was restored, but due to the design of the LOCKSS technology it would not result in content in the CLOCKSS boxes being modified or lost permanently. See LOCKSS: Property Server Operations.

Each CLOCKSS box's operating system is kept current with the CentOS repositories. Some CLOCKSS boxes update automatically from these repositories within 24 hours; others require administrator intervention. This mix mitigates the risk that an erroneous update from CentOS would impact all CLOCKSS boxes almost simultaneously.

The process by which security requirements for the LOCKSS software are developed and addressed is described in LOCKSS: Software Development Process. Once a security enhancement for the LOCKSS daemon is released, all CLOCKSS boxes install it automatically within 24 hours.

The following precautions are taken to prevent unauthorized access via a CLOCKSS box's administrative Web interface:

  • Packet filters prevent access except from the box's host institution's network, and from the LOCKSS team's subnet at Stanford.
  • Access requires HTTPS.
  • Administrative access is logged.
  • Administrative actions cause Alerts; see CLOCKSS: Logging and Records.

If an attack compromises one or more ingest boxes, the ingest network should be stopped via the property server, all boxes disconnected from the network, the vulnerability diagnosed, and all boxes wiped and their BIOS and operating system re-installed from scratch. Their content should be re-ingested from the publisher.

If an attack compromises one or more production boxes, the production network should be stopped via the property server, the affected boxes disconnected from the network, the vulnerability diagnosed, and the affected boxes wiped and their BIOS and operating system re-installed from scratch. Unless a majority of the production boxes were compromised, the LOCKSS: Polling and Repair Protocol will detect and repair any corruption of their content.
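
The reason a minority of compromised boxes cannot corrupt the archive can be illustrated with a deliberately simplified majority vote over content digests. This is not the LOCKSS: Polling and Repair Protocol, which is randomized and considerably more elaborate; the function names and the use of SHA-256 are assumptions made for the sketch.

```python
import hashlib
from collections import Counter


def digest(content):
    """Hex SHA-256 digest of a replica's bytes."""
    return hashlib.sha256(content).hexdigest()


def repair_by_majority(replicas):
    """Repair replicas that disagree with the majority.

    `replicas` maps a box name to the bytes that box holds for one item.
    Returns (repaired_replicas, names_of_repaired_boxes).
    Raises ValueError when no version is held by more than half the boxes.
    """
    counts = Counter(digest(content) for content in replicas.values())
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority: automatic repair is not safe")
    # Take the agreed content from any box that voted with the majority.
    good = next(c for c in replicas.values() if digest(c) == winner)
    damaged = sorted(box for box, c in replicas.items() if digest(c) != winner)
    return {box: good for box in replicas}, damaged
```

The explicit failure when no strict majority exists mirrors the caveat in the text: repair is only trustworthy while fewer than half of the boxes are compromised.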

Risk:

  • The open source community maintainers could issue a faulty update to a component of the CLOCKSS box stack.
  • The LOCKSS team could issue a faulty software update.

These risks are mitigated by the configuration of the CLOCKSS boxes, which prevents communication except with specifically authorized IP addresses, making it difficult for an attacker to exploit a remote vulnerability, and which prevents login access except by host institution administrators.

Internal Attack

Internal attack could take one of two forms:

  • Insider abuse at the CLOCKSS host institutions is limited to affecting a single box, not the preserved content as a whole. This is because each of the (currently 12) CLOCKSS boxes is independently administered; insiders at the host institution have access only to their box. The CLOCKSS boxes do not trust each other, only the consensus of the boxes as a whole.
  • Insider abuse by the LOCKSS team. The policy is that when a new CLOCKSS box is brought up, the LOCKSS staff managing the network have write and administrative access to it via sudo. All such accesses are logged. Once confidence is achieved in the working relationship with staff at the host institution, this access is terminated. This stage has been achieved with 7 of the 11 remote CLOCKSS boxes. Eventually, the LOCKSS staff will have such access only to the box at Stanford; their access to the other boxes is limited to:
    • read-only data collection (see CLOCKSS: Box Operations)
    • changes to the LOCKSS daemon configuration (see LOCKSS: Property Server Operations and discussion of Operator Error above)
    • changes to the LOCKSS daemon software (see LOCKSS: Software Development Process), which could introduce malicious code into the network. This risk is mitigated by three measures: the use of GitHub's source code control system, which allows code changes to be traced to their authorized committers and easily rescinded; code signing (CLOCKSS boxes verify the signature on all software, whether from the LOCKSS team or from the CentOS repositories, before installing it); and the staged release process.

With the exception of the Stanford CLOCKSS box, staff at the host institution of each CLOCKSS box have access only to their box, not to any of the others. Their role in changing the system is limited to maintaining the operating system of their box current with the requirements of the network. LOCKSS staff have read-only access to all boxes for monitoring and data collection purposes.

Members of the LOCKSS staff have delineated roles, responsibilities and authorizations regarding making changes to the system as follows:

  • LOCKSS technical staff can check changes in to the daemon source code repository, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a release candidate built and signed by the LOCKSS build master, and approved for release by the LOCKSS technical lead. The LOCKSS source code control system identifies the author of each change to the system. See LOCKSS: Software Development Process.
  • LOCKSS content staff can check changes in to the plugin source, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a plugin release built and signed by the plugin build master, and approved for release by the plugin lead. See LOCKSS: Software Development Process.
  • Access to the server room containing the Stanford CLOCKSS box, the property server and other critical systems is restricted to the LOCKSS sysadmin and the LOCKSS senior engineers. Similarly secure physical locations are required for the other CLOCKSS boxes; see CLOCKSS: Box Operations.

Risk: There is a risk that the LOCKSS build master could compromise the build process to introduce malware. Although this would be evident after the damage was done, because the signed package would not correspond to the tagged source, it is hard to see any pro-active mitigation.

Economic Failure

The LOCKSS software is maintained by the LOCKSS team, funded jointly by the CLOCKSS archive and the LOCKSS Alliance. The LOCKSS team has been economically sustainable for more than 5 years solely on this basis without grant funding.

There is a risk that the CLOCKSS administration might commit to preserve publishers whose content is very large without charging them enough to fund the storage necessary for their content. This risk is mitigated by regular reports on system capacity to CLOCKSS administration. Loss of CLOCKSS Archive library members would reduce funding without corresponding reduction in content (as loss of publisher members would) and might make timely hardware replacements difficult. This risk is mitigated by the 30-year history of exponential drops in storage cost per byte, and the existence of 12 complete replicas of the content, which makes the temporary loss of a few replicas while waiting for replacement less important.

Organizational Failure

For the business aspects of failing over to a successor organization see CLOCKSS: Succession Plan.

If, as part of the CLOCKSS: Succession Plan, it becomes necessary to transfer custody of the content of the CLOCKSS archive, this could be achieved in multiple ways. The successor organization could take custody of the content and metadata by, among other possible means:

  • Importing the content exported by a production CLOCKSS box in one of the packaging formats supported by the LOCKSS daemon, including ZIP, TAR and WARC files.
  • Crawling the content from a production CLOCKSS box using a standard Web crawler such as the Internet Archive's Heritrix.
  • Using shell scripts to traverse the file systems containing the LOCKSS daemon's repository, described in Definition of AIP, to create a different packaging format, then importing that.
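
The third option can be sketched as a traversal that repackages a directory tree and records a checksum manifest for the successor to verify on import. The text mentions shell scripts; Python is used here purely for illustration. The sketch assumes the repository is an ordinary directory tree of regular files, and the function name and manifest layout are hypothetical. The actual repository layout is described in Definition of AIP.

```python
import hashlib
import os
import tarfile


def package_tree(root, tar_path):
    """Write every regular file under `root` into a tar archive.

    Returns a manifest mapping each file's path relative to `root` to its
    SHA-256 digest. `tar_path` should lie outside `root`.
    """
    manifest = {}
    with tarfile.open(tar_path, "w") as tar:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # Visit directories in a deterministic order.
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                with open(full, "rb") as handle:
                    manifest[rel] = hashlib.sha256(handle.read()).hexdigest()
                tar.add(full, arcname=rel)
    return manifest
```

Producing the manifest during the same traversal lets the receiving organization confirm that what it imported matches what was exported, independently of the packaging format chosen.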

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Program Manager

Relevant Documents

  1. Requirements for Digital Preservation Systems: A Bottom-Up Approach
  2. LOCKSS: A Peer-to-Peer Digital Preservation System
  3. Attrition Defenses for a Peer-to-Peer Digital Preservation System
  4. LOCKSS papers and presentations
  5. Dr. David S. H. Rosenthal's blog
  6. CLOCKSS: Box Operations
  7. LOCKSS: Polling and Repair Protocol
  8. CLOCKSS: Extracting Triggered Content
  9. CLOCKSS: Logging and Records
  10. CLOCKSS: Ingest Pipeline
  11. LOCKSS: Property Server Operations
  12. CLOCKSS: Hardware and Software Inventory
  13. LOCKSS: Software Development Process
  14. LOCKSS: Format Migration
  15. CLOCKSS: Succession Plan
  16. Definition of AIP