LOCKSS: Polling and Repair Protocol
Overview
LOCKSS boxes run the LOCKSS polling and repair protocol as described in our ACM Transactions on Computer Systems paper. The paper describes the polling mechanism as applying to a single file; the LOCKSS daemon applies it to an entire Archival Unit (AU) of content. Each LOCKSS daemon chooses at random the next AU upon which it will use the LOCKSS polling and repair protocol to perform integrity checks. It acts as the poller to call a poll on that AU by:
- Selecting a random sample of the other CLOCKSS boxes (the voters).
- Inviting the voters to participate in a poll on the AU, and sending each of them a freshly-generated random nonce Np.
- Each voter then votes by:
  - Generating a fresh random nonce Nv.
  - Creating a vote containing, for every URL in the voter's instance of the AU:
    - The URL.
    - The hash of the concatenation of Np, Nv and the content of the URL.
  - Sending the vote to the poller. Note that the vote contains a hash for each URL in the voter's instance of the AU, but that hash is not the hash of the content alone. The nonces ensure that the hash in the vote is different for every vote in every poll. The voter cannot simply remember the hash it initially created; it must re-hash every URL each time it votes.
- The poller tallies the votes by:
  - For each URL in the poller's instance of the AU:
    - For each voter:
      - Computing the hash of Np, Nv and the content of the URL in the poller's instance of the AU.
      - Comparing the result with the hash value for that URL in that voter's vote.
  - Note that the nonces ensure that the poller must re-hash every URL in the AU; it cannot simply remember the hash it initially created.
- In tallying the votes, the poller may detect that:
  - A URL it has does not match the consensus of the voters, or
  - A URL that the consensus of the voters says should be present in the AU is missing from the poller's AU, or
  - A URL it has does not match the checksum generated when it was stored.
- If so, it repairs the problem by:
  - Requesting a new copy from one of the voters that agreed with the consensus,
  - Then verifying that the new copy does agree with the consensus.
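The voting and tallying steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the LOCKSS daemon's code: the names `vote_hash`, `make_vote` and `tally`, the use of SHA-256, the 20-byte nonces, and the representation of an AU as a dictionary from URL to content are all assumptions made for the example.

```python
import hashlib
import os


def vote_hash(np_nonce: bytes, nv_nonce: bytes, content: bytes) -> bytes:
    """Hash of the concatenation Np || Nv || content (SHA-256 is an assumption)."""
    return hashlib.sha256(np_nonce + nv_nonce + content).digest()


def make_vote(np_nonce: bytes, au: dict) -> dict:
    """Voter side: pick a fresh Nv and hash every URL in the voter's AU instance."""
    nv_nonce = os.urandom(20)  # fresh random nonce for this vote only
    hashes = {url: vote_hash(np_nonce, nv_nonce, body) for url, body in au.items()}
    return {"nv": nv_nonce, "hashes": hashes}


def tally(np_nonce: bytes, poller_au: dict, votes: list) -> dict:
    """Poller side: for each URL, count (agreeing, disagreeing) voters."""
    urls = set(poller_au)
    for vote in votes:
        urls |= set(vote["hashes"])  # also consider URLs only the voters hold
    result = {}
    for url in sorted(urls):
        agree = disagree = 0
        for vote in votes:
            theirs = vote["hashes"].get(url)  # None if the voter lacks the URL
            mine = None
            if url in poller_au:
                # The poller must re-hash its own copy with this voter's Nv.
                mine = vote_hash(np_nonce, vote["nv"], poller_au[url])
            if mine == theirs:
                agree += 1
            else:
                disagree += 1
        result[url] = (agree, disagree)
    return result


# Usage: three voters hold intact content; the poller's copy of one URL is damaged.
np_nonce = os.urandom(20)
au = {"http://example.org/a": b"alpha", "http://example.org/b": b"beta"}
votes = [make_vote(np_nonce, au) for _ in range(3)]
damaged = dict(au)
damaged["http://example.org/b"] = b"bet4"
print(tally(np_nonce, damaged, votes))
# {'http://example.org/a': (3, 0), 'http://example.org/b': (0, 3)}
```

Because each vote carries its own Nv, two votes over identical content contain different hashes, so neither side can reuse a hash computed for an earlier poll.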
In this way, polls on an AU occur at unpredictable but fairly regular intervals, and each poll checks the union of the sets of URLs in that AU on the box calling the poll (the poller) and on the boxes voting (the voters). The check establishes that each URL on the poller agrees with the consensus of the voters as to that URL's content. If it does not, the URL is repaired from one of the boxes in the consensus. Under our current Mellon grant we are investigating the potential benefits of an enhancement to the mechanism that results in every poll on an AU checking that every URL in that AU on each voter agrees with the same URL on the poller.
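The repair step can be sketched in the same style. This is a hedged illustration, not the daemon's actual logic: `repair_url`, the `fetch` callback, the `quorum` threshold and the vote layout are hypothetical names introduced for the example, and SHA-256 is again an assumption. The sketch also simplifies the procedure: rather than first selecting a voter known to agree with the consensus, it tries voters in turn and relies on the verification step to reject a copy that too few votes support.

```python
import hashlib


def vote_hash(np_nonce: bytes, nv_nonce: bytes, content: bytes) -> bytes:
    """Hash of the concatenation Np || Nv || content (SHA-256 is an assumption)."""
    return hashlib.sha256(np_nonce + nv_nonce + content).digest()


def repair_url(url, np_nonce, votes, fetch, quorum):
    """Return a verified replacement copy of `url`, or None if none verifies.

    votes  -- list of dicts with keys "voter", "nv" and "hashes" (url -> hash)
    fetch  -- hypothetical callback fetch(voter, url) returning bytes or None
    quorum -- hypothetical number of votes the new copy must agree with
    """
    for candidate in votes:
        if url not in candidate["hashes"]:
            continue  # this voter does not hold the URL
        content = fetch(candidate["voter"], url)
        if content is None:
            continue
        # Verify: re-hash the fetched copy with each voter's Nv and count
        # how many of the votes it matches.
        agreeing = sum(
            1
            for vote in votes
            if vote["hashes"].get(url) == vote_hash(np_nonce, vote["nv"], content)
        )
        if agreeing >= quorum:
            return content
    return None
```

Verifying the fetched copy against the votes already in hand means the poller does not need to trust the box that supplied the repair.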
Configuration of CLOCKSS Network
As described in CLOCKSS: Box Operations, the CLOCKSS boxes are configured to form a Private LOCKSS Network (PLN) with the following configuration options:
- Because the CLOCKSS PLN is a closed network secured by SSL certificate checks at both ends of all connections, the defenses against sybil attacks, which involve the adversary creating new peer identities, are not necessary and are not implemented.
- The efficiency enhancements described below are being gradually and cautiously deployed to the CLOCKSS PLN.
Currently, on average, a poll is called on each AU instance approximately once every 100 days. Since there are currently 12 boxes in the CLOCKSS network, approximately every 8 days on average one instance of a given AU is checked.
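The 8-day figure follows from dividing the per-instance poll interval by the number of boxes. The short calculation below makes that explicit; it assumes every box holds an instance of the AU and that polls are spread evenly over time, and the function name is hypothetical.

```python
def days_between_checks(poll_interval_days: float, boxes: int) -> float:
    """Average gap between checks of *some* instance of a given AU.

    Assumes each of `boxes` boxes holds an instance and calls a poll on it
    once every `poll_interval_days` on average, evenly spread over time.
    """
    return poll_interval_days / boxes


print(round(days_between_checks(100, 12), 1))  # 8.3
```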
Enhancements
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The Andrew W. Mellon Foundation funded work to implement and evaluate improvements in these areas; the grant period extends through March 2015. These improvements will be deployed to the CLOCKSS network, but because there are many fewer boxes in the CLOCKSS network than in the GLN, the areas of inefficiency are less relevant there. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. These tools were used on the CLOCKSS network for an initial 59-day period, collecting over 18M data items. The data collected has yet to be fully analyzed but initial analysis shows that the polling process among CLOCKSS boxes continues to operate satisfactorily. Some examples of the graphs generated follow.
This graph shows the number of AU instances in CLOCKSS boxes which have reached agreement with N other CLOCKSS boxes, showing the progress AUs make after ingest as the LOCKSS: Polling and Repair Protocol identifies matching AU instances at other boxes. It will be seen that there are few AU instances in the sample with few boxes with whom they have reached agreement, and that the majority of AU instances have reached agreement with AU instances at the majority of other CLOCKSS boxes.
This graph shows the extent of agreement among the over 40,000 successfully completed polls in the sample. As can be seen, the overwhelming majority of the polls showed complete agreement. Polls with less than complete agreement are likely to have been caused by polling among AU instances that were still collecting content, so had different sub-sets of the URLs in an AU.
Demonstration
The CRL auditors requested a demonstration of the polling and repair process. Demonstrating this on production content is difficult. The content is generally large, so polls take a long time. Each box is running many polls simultaneously, so the log entries for these polls are interleaved. Turning the logging level on polling up enough to show full details would affect all polls underway simultaneously, so the volume of log data would be overwhelming. Instead, we provided a live demonstration using a network of 5 LOCKSS daemons in the STF testing framework, preserving an AU of synthetic content. The demonstration consisted of two polls: the first detected no damage, and the second created, detected and repaired damage to the content of one URL. Annotated logs of the first poll are available from the poller and a voter. Annotated logs of the second poll are available from the poller and a voter.
Change Process
Changes to this document require:
- Review by LOCKSS Engineering Staff
- Approval by LOCKSS Chief Scientist
Relevant Documents
- CLOCKSS: Box Operations
- Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi. “LOCKSS: A Peer-to-Peer Digital Preservation System”, ACM Transactions on Computer Systems vol. 23, no. 1, February 2005, pp. 2-50. http://dx.doi.org/10.1145/1047915.1047917 accessed 2013.8.7