CLOCKSS Trusted Digital Repository Documents, contribution by Dshr, 2015-07-21: "Reviewed TAL, approved DSHR"<br />
<hr />
<div>= LOCKSS: Polling and Repair Protocol =<br />
<br />
== Overview ==<br />
<br />
LOCKSS boxes run the LOCKSS polling and repair protocol as described in [http://dx.doi.org/10.1145/1047915.1047917 our ''ACM Transactions on Computer Systems'' paper]. The paper describes the polling mechanism as applying to a single file; the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] applies it to an entire [[LOCKSS: Basic Concepts#Archival Unit|Archival Unit (AU)]] of content. Each LOCKSS daemon chooses at random the next AU upon which it will use the LOCKSS polling and repair protocol to perform integrity checks. It acts as the ''poller'' to call a poll on that AU by:<br />
* Selecting a random sample of the other CLOCKSS boxes (the ''voters'').<br />
* Inviting the voters to participate in a ''poll'' on the AU, and sending each of them a freshly-generated random nonce ''Np''.<br />
* The voters then vote by:<br />
** Generating a fresh random nonce ''Nv''.<br />
** Creating a vote containing, for every URL in the voter's instance of the AU:<br />
*** The URL<br />
*** The hash of the concatenation of ''Np'', ''Nv'' and the content of the URL.<br />
** Sending the vote to the poller. Note that the vote contains a hash for each URL in the voter's instance of the AU, but that hash is not the hash of the content alone. The nonces ensure that the hash in the vote is different for every vote in every poll. The voter cannot simply remember the hash it initially created; it must re-hash every URL each time it votes.<br />
* The poller tallies the votes by:<br />
** For each URL in the poller's instance of the AU:<br />
*** For each voter:<br />
**** Computing the hash of ''Np'', ''Nv'' and the content of the URL in the poller's instance of the AU.<br />
**** Comparing the result with the hash value for that URL in that voter's vote.<br />
** Note that the nonces ensure that the poller must re-hash every URL in the AU; it cannot simply remember the hash it initially created.<br />
* In tallying the votes, the poller may detect that:<br />
** A URL it has does not match the consensus of the voters, or<br />
** A URL that the consensus of the voters says should be present in the AU is missing from the poller's AU, or<br />
** A URL it has does not match the checksum generated when it was stored.<br />
* If so, it repairs the problem by:<br />
** requesting a new copy from one of the voters that agreed with the consensus,<br />
** then verifying that the new copy does agree with the consensus.<br />
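<br />
The nonce-and-hash scheme above can be sketched in Python. This is a minimal illustration only: SHA-1 and 20-byte nonces are assumptions made for the sketch, not the protocol's actual wire format.<br />

```python
import hashlib
import os

def vote_hash(np: bytes, nv: bytes, content: bytes) -> bytes:
    # Hash of the concatenation Np || Nv || content, one per URL in a vote.
    return hashlib.sha1(np + nv + content).digest()

# Poller: call a poll with a freshly-generated random nonce Np.
np_nonce = os.urandom(20)

# Voter: generate a fresh nonce Nv and hash every URL in its copy of the AU.
voter_au = {"http://example.com/a": b"content A",
            "http://example.com/b": b"content B"}
nv_nonce = os.urandom(20)
vote = {url: vote_hash(np_nonce, nv_nonce, body)
        for url, body in voter_au.items()}

# Poller: tally by re-hashing its own copy with the same nonces and
# comparing; a mismatch flags the URL as a candidate for repair.
poller_au = {"http://example.com/a": b"content A",
             "http://example.com/b": b"damaged"}
mismatches = [url for url, body in poller_au.items()
              if vote.get(url) != vote_hash(np_nonce, nv_nonce, body)]
```

Because both nonces enter every hash, neither side can cache and replay a previously computed hash; each poll forces a fresh read of the stored content.<br />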
<br />
In this way, at unpredictable but fairly regular intervals, every poll on an AU checks the union of the sets of URLs in that AU on the poller and on the voters. The check establishes that each URL on the poller agrees with the voters' consensus as to that URL's content. If it does not, it is repaired from one of the boxes in the consensus. Under our current Mellon grant we are investigating the potential benefits of an enhancement under which every poll on an AU also checks that every URL in that AU on each voter agrees with the same URL on the poller.<br />
<br />
== Configuration of CLOCKSS Network ==<br />
<br />
As described in [[CLOCKSS: Box Operations]] the CLOCKSS boxes are configured to form a Private LOCKSS Network (PLN) including the following configuration options:<br />
* Because the CLOCKSS PLN is a closed network secured by SSL certificate checks at both ends of all connections, the defenses against Sybil attacks, in which the adversary creates new peer identities, are not necessary and are not implemented.<br />
* The efficiency enhancements described below are being gradually and cautiously deployed to the CLOCKSS PLN.<br />
<br />
Currently, a poll is called on each AU instance approximately once every 100 days on average. Since there are currently 12 boxes in the CLOCKSS network, some instance of a given AU is checked approximately every 8 days on average.<br />
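<br />
The arithmetic behind those figures, assuming the 12 instances of an AU are polled independently:<br />

```python
poll_interval_days = 100  # average interval between polls on one AU instance
boxes = 12                # CLOCKSS boxes, each holding an instance of the AU

# With each of the 12 instances polled roughly every 100 days, some
# instance of a given AU is checked on average every 100 / 12 days.
days_between_checks = poll_interval_days / boxes  # about 8.3 days
```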
<br />
== Enhancements ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The [http://www.lockss.org/news-media/news/lockss-program-receives-andrew-w-mellon-foundation-grant/ Andrew W. Mellon Foundation funded work to implement and evaluate improvements] in these areas; the grant period extends through March 2015. Although these improvements will be deployed to the CLOCKSS network, it has many fewer boxes than the GLN, so the areas of inefficiency matter less there and the improvements are not expected to make a substantial difference to its performance.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect large volumes of data about the operation of each box. These tools were used on the CLOCKSS network for an initial 59-day period, collecting over 18M data items. The data collected have yet to be fully analyzed, but initial analysis shows that the polling process among CLOCKSS boxes continues to operate satisfactorily. Some examples of the graphs generated follow.<br />
<br />
[[File:hist_pr_auid_count27.png|200px|thumb|center]] This graph shows the number of AU instances in CLOCKSS boxes that have reached agreement with N other CLOCKSS boxes, illustrating the progress AUs make after ingest as the polling and repair protocol identifies matching AU instances at other boxes. Few AU instances in the sample have reached agreement with only a few other boxes; the majority have reached agreement with AU instances at the majority of other CLOCKSS boxes.<br />
<br />
[[File:Sample Graph 2.png|200px|thumb|center]] This graph shows the extent of agreement among the over 40,000 successfully completed polls in the sample. The overwhelming majority of the polls showed complete agreement. Polls with less than complete agreement are likely to have involved AU instances that were still collecting content, and so held different subsets of the URLs in the AU.<br />
<br />
== Demonstration ==<br />
<br />
The CRL auditors requested a demonstration of the polling and repair process. Demonstrating this on production content is difficult. The content is generally large, so polls take a long time. Each box is running many polls simultaneously, so the log entries for these polls are interleaved. Turning up the polling log level enough to show full details would affect all polls underway simultaneously, so the volume of log data would be overwhelming. Instead, we provided a live demonstration using a network of 5 LOCKSS daemons in the [[LOCKSS:_Software_Development_Process#Functional_Tests|STF testing framework]], preserving an AU of synthetic content. It consisted of two polls: the first detected no damage; the second created, detected, and repaired damage to the content of one URL. Annotated logs of the first poll are available from the [[Media:Poller-good.pdf|poller]] and a [[Media:Voter-good.pdf|voter]]. Annotated logs of the second poll are available from the [[Media:Poller-bad.pdf|poller]] and a [[Media:Voter-bad.pdf|voter]].<br />
<br />
=== Replicating the Demonstration ===<br />
<br />
These demos have been included in the 1.66 release of the LOCKSS software. These instructions have been tested on a vanilla install of Ubuntu 14.04.1, up-to-date as of August 4. They should work on other recent Debian-based Linux systems.<br />
<br />
The first step is to install the prerequisites:<br />
<pre><br />
foo@bar:~$ cd<br />
foo@bar:~$ sudo apt-get install default-jdk ant subversion libxml2-utils<br />
[sudo] password for foo:<br />
Reading package lists... 0%<br />
...<br />
0 upgraded, 46 newly installed, 0 to remove and 0 not upgraded.<br />
Need to get 70.6 MB of archives.<br />
After this operation, 127 MB of additional disk space will be used.<br />
Do you want to continue? [Y/n] y<br />
...<br />
done.<br />
foo@bar:~$ ls -l /etc/alternatives/javac<br />
lrwxrwxrwx 1 root root 42 Aug 4 18:45 /etc/alternatives/javac -&gt; /usr/lib/jvm/java-7-openjdk-i386/bin/javac<br />
foo@bar:~$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386<br />
foo@bar:~$ <br />
</pre><br />
The next step is to check the latest release of the LOCKSS daemon out from SourceForge:<br />
<pre><br />
foo@bar:~$ svn checkout svn://svn.code.sf.net/p/lockss/svn/lockss-daemon/tags/last_released_daemon lockss-daemon<br />
...<br />
foo@bar:~$ <br />
</pre><br />
The next step is to build the LOCKSS daemon. This takes a while; there's a lot of code to build. The build generates Java warnings, which you should be able to ignore, but no errors. Just to be sure that everything is OK, we run the unit and functional tests on the daemon that gets built. This takes much longer, especially on the little netbook I'm using to test these instructions:<br />
<pre><br />
foo@bar:~$ mkdir ~/.ant<br />
foo@bar:~$ cd ~/.ant<br />
foo@bar:~$ ln -s ~/lockss-daemon/lib .<br />
foo@bar:~$ cd ~/lockss-daemon<br />
foo@bar:~/lockss-daemon$ ant<br />
Buildfile: /home/foo/lockss-daemon/build.xml<br />
...<br />
BUILD SUCCESSFUL<br />
Total time: 81 minutes 21 seconds<br />
<br />
real 81m21.980s<br />
user 84m29.308s<br />
sys 5m13.708s<br />
foo@bar:~/lockss-daemon$ <br />
</pre><br />
The next step is to configure STF for the demos. The demos work without this configuration, but they are much more informative with it:<br />
<pre><br />
foo@bar:~$ cd ~/lockss-daemon/test/frameworks/run_stf<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ cp testsuite.opt.demo testsuite.opt<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$<br />
</pre><br />
This configuration ensures that:<br />
* The logs contain detailed information about the polling and repair process.<br />
* The logs aren't deleted after the demo.<br />
* The daemons stay running until you hit Enter. This allows you to use a Web browser to access the UI of the daemons and see the polling and voting status pages. See the STF README.txt file for details of how to do this.<br />
Now you can go ahead and run the first demo in the STF test framework. It creates a network of 5 LOCKSS boxes each preserving an Archival Unit (AU) of synthetic content, and causes the first box to call a poll on it, which should result in complete agreement among the boxes:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo1<br />
11:27:35.057: INFO: ===================================<br />
11:27:35.057: INFO: Demo a V3 poll with no disagreement<br />
11:27:35.057: INFO: -----------------------------------<br />
11:27:35.250: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:27:35.266: INFO: Waiting for framework to become ready<br />
11:27:45.624: INFO: Creating simulated AU's<br />
11:27:47.546: INFO: Waiting for simulated AU's to crawl<br />
11:27:47.759: INFO: AU's completed initial crawl<br />
11:27:47.760: INFO: No nodes damaged on client localhost:8041<br />
11:27:47.777: INFO: Waiting for a V3 poll to be called...<br />
11:28:18.087: INFO: Successfully called a V3 poll<br />
11:28:18.088: INFO: Checking V3 poll result...<br />
11:28:18.215: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:28:18.249: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:28:18.287: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:28:18.322: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:28:18.425: INFO: AU successfully polled<br />
11:28:19.427: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:29:08.161: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 93.213s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
You will find that the demo has created a file system tree under testcase-1 with a directory for each of the five boxes in the network:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls testcase-1<br />
daemon-8041 daemon-8042 daemon-8043 daemon-8044 daemon-8045 lockss.opt lockss.txt<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
daemon-8041 is the poller, the box that called the poll and tallied the result. You can see its log (an annotated version is linked in the Demonstration section above):<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls -l testcase-1/daemon-8041/test.out<br />
-rw-rw-r-- 1 foo foo 31399 Aug 5 13:56 testcase-1/daemon-8041/test.out<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
daemon-8042 through daemon-8045 are the voters, the boxes whose content is compared with the poller's. You can see their logs (an annotated version of one is linked in the Demonstration section above):<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls -l testcase-1/daemon-8042/test.out<br />
-rw-rw-r-- 1 foo foo 14755 Aug 5 13:56 testcase-1/daemon-8042/test.out<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$<br />
</pre><br />
Now we clean up in preparation for the second demo:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
In the second demo one of the daemons calls a poll, but before it does, one file in its simulated content is damaged. The other 4 daemons vote, and they all disagree with the poller about the damaged file. The poller requests a repair of this file from one of the voters. Once the repair is received, the poller re-tallies the poll and now finds 100% agreement. The logs end up in the usual place; annotated versions are available for the poller and a voter.<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo2<br />
11:16:24.793: INFO: ================================================<br />
11:16:24.793: INFO: Demo a basic V3 poll with repair via open access<br />
11:16:24.793: INFO: ------------------------------------------------<br />
11:16:24.987: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:16:25.002: INFO: Waiting for framework to become ready<br />
11:16:35.392: INFO: Creating simulated AU's<br />
11:16:37.454: INFO: Waiting for simulated AU's to crawl<br />
11:16:37.671: INFO: AU's completed initial crawl<br />
11:16:38.320: INFO: Damaged the following node(s) on client localhost:8041:<br />
http://www.example.com/branch1/001file.txt<br />
11:16:38.337: INFO: Waiting for a V3 poll to be called...<br />
11:17:03.523: INFO: Successfully called a V3 poll<br />
11:17:03.523: INFO: Waiting for V3 repair...<br />
11:17:03.765: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:17:03.802: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:17:03.839: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:17:03.869: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:17:03.943: INFO: AU successfully repaired<br />
11:17:04.945: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:17:15.661: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 50.956s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
Now we clean up in preparation for the third demo:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
In the second demo, the simulated content was open access, so there was no restriction on the voter sending a repair to the poller. The common case is that the content is not open access, in which case the voter must have a record of past agreement with the poller about the AU being repaired, so that it does not leak content to boxes that could not have obtained it directly from the publisher.<br />
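<br />
The repair-permission rule just described can be sketched as follows. This is a hypothetical illustration; the names ''willing_to_repair'' and ''prior_agreements'' are inventions for the sketch, not the daemon's API.<br />

```python
# (peer, AU id) pairs for which this voter has a recorded past agreement.
# Both identifiers here are made-up examples.
prior_agreements = {("peer:8041", "au-synthetic-1")}

def willing_to_repair(peer: str, au_id: str, open_access: bool) -> bool:
    # Serve a repair only for open-access content, or to a peer that has
    # previously shown (by agreeing in a poll) that it already holds this
    # AU, so that content is never leaked to a box that could not have
    # obtained it directly from the publisher.
    return open_access or (peer, au_id) in prior_agreements
```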
<br />
In the third demo the daemons reach agreement on the non-open-access content before damage is created at the poller. When the poller next calls a poll, detects the damage, and requests a repair, the voter finds the prior agreement and sends the repair.<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo3<br />
11:18:05.527: INFO: =======================================================<br />
11:18:05.527: INFO: Demo a basic V3 poll with repair via previous agreement<br />
11:18:05.527: INFO: -------------------------------------------------------<br />
11:18:05.722: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:18:05.743: INFO: Waiting for framework to become ready<br />
11:18:21.199: INFO: Creating simulated AU's<br />
11:18:23.138: INFO: Waiting for simulated AU's to crawl<br />
11:18:23.355: INFO: AU's completed initial crawl<br />
11:18:23.449: INFO: Waiting for a V3 poll by all simulated caches<br />
11:18:48.653: INFO: Client on port 8041 called V3 poll...<br />
11:18:48.694: INFO: Client on port 8042 called V3 poll...<br />
11:18:48.732: INFO: Client on port 8043 called V3 poll...<br />
11:18:48.764: INFO: Client on port 8044 called V3 poll...<br />
11:18:48.814: INFO: Client on port 8045 called V3 poll...<br />
11:18:48.814: INFO: Waiting for all peers to win their polls<br />
11:18:48.891: INFO: Client on port 8041 won V3 poll...<br />
11:18:48.972: INFO: Client on port 8042 won V3 poll...<br />
11:18:49.072: INFO: Client on port 8043 won V3 poll...<br />
11:18:49.157: INFO: Client on port 8044 won V3 poll...<br />
11:18:49.248: INFO: Client on port 8045 won V3 poll...<br />
11:18:50.347: INFO: Damaged the following node(s) on client localhost:8041:<br />
http://www.example.com/001file.bin<br />
http://www.example.com/001file.txt<br />
http://www.example.com/002file.bin<br />
http://www.example.com/002file.txt<br />
http://www.example.com/branch1/001file.bin<br />
http://www.example.com/branch1/001file.txt<br />
http://www.example.com/branch1/002file.bin<br />
http://www.example.com/branch1/002file.txt<br />
http://www.example.com/branch1/index.html<br />
http://www.example.com/index.html<br />
11:18:50.375: INFO: Waiting for a V3 poll to be called...<br />
11:19:25.638: INFO: Successfully called a V3 poll<br />
11:19:25.714: INFO: Waiting for a V3 poll to be called...<br />
11:19:25.742: INFO: Successfully called a V3 poll<br />
11:19:25.742: INFO: Waiting for V3 repair...<br />
11:19:26.871: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.871: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.872: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.872: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.920: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.956: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:27.083: INFO: AU successfully repaired<br />
11:19:28.086: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:20:46.489: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 161.058s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
Finally we clean up again:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh ; rm testsuite.opt<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi. “LOCKSS: A Peer-to-Peer Digital Preservation System”, ''ACM Transactions on Computer Systems'', vol. 23, no. 1, February 2005, pp. 2-50. http://dx.doi.org/10.1145/1047915.1047917, accessed 2013-08-07.</div>
<hr />
<div>= LOCKSS: Polling and Repair Protocol =<br />
<br />
== Overview ==<br />
<br />
LOCKSS boxes run the LOCKSS polling and repair protocol as described in [http://dx.doi.org/10.1145/1047915.1047917 our ''ACM Transactions on Computer Systems'' paper]. The paper describes the polling mechanism as applying to a single file; the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] applies it to an entire [[LOCKSS: Basic Concepts#Archival Unit|Archival Unit (AU)]] of content. Each LOCKSS daemon chooses at random the next AU upon which it will use the LOCKSS polling and repair protocol to perform integrity checks. It acts as the ''poller'' to call a poll on that AU by:<br />
* Selecting a random sample of the other CLOCKSS boxes (the <i>voters</i>).<br />
* Inviting the voters to participate in a <i>poll</i> on the AU, and sending each of them a freshly-generated random nonce ''Np''.<br />
* The poll involves the voters voting by:<br />
** Generating a fresh random nonce ''Nv''.<br />
** Creating a vote containing, for every URL in the voter's instance of the AU:<br />
*** The URL<br />
*** The hash of the concatenation of ''Np'', ''Nv'' and the content of the URL.<br />
** Sending the vote to the poller. Note that the vote contains a hash for each URL in the voter's instance of the AU, but that hash is not the hash of the content. The nonces ensure that the hash in the vote is different for every vote in every poll. The voter cannot simply remember the hash it initially created, it must re-hash every URL each time it votes.<br />
* The poller tallies the votes by:<br />
** For each URL in the poller's instance of the AU:<br />
*** For each voter:<br />
**** Computing the hash of ''Np'', ''Nv'' and the content of the URL in the poller's instance of the AU.<br />
**** Comparing the result with the hash value for that URL in that voter's vote.<br />
** Note that the nonces ensure that the poller must re-hash every URL in the AU; it cannot simply remember the hash it initially created.<br />
* In tallying the votes, the poller may detect that:<br />
** A URL it has does not match the consensus of the voters, or<br />
** A URL that the consensus of the voters says should be present in the AU is missing from the poller's AU, or<br />
** A URL it has does not match the checksum generated when it was stored.<br />
* If so, it repairs the problem by:<br />
** requesting a new copy from one of the voters that agreed with the consensus,<br />
** then verifying that the new copy does agree with the consensus.<br />
<br />
In this way, at unpredictable but fairly regular intervals, every poll on an AU checks the union of the set of URLs in that AU on the box calling the poll (poller) and the boxes voting (voters). The check establishes that the URL on the poller agrees with the consensus of the boxes voting in the poll (voters) as to that URL's content. If it does not, it is repaired from one of the boxes in the consensus. Under our current Mellon grant we are investigating the potential benefits of an enhancement to the mechanism that results in every poll on an AU checking that every URL in that AU on each voter agrees with the same URL on the poller.<br />
<br />
== Configuration of CLOCKSS Network ==<br />
<br />
As described in [[CLOCKSS: Box Operations]] the CLOCKSS boxes are configured to form a Private LOCKSS Network (PLN) including the following configuration options:<br />
* Because the CLOCKSS PLN is a closed network secured by SSL certificate checks at both ends of all connections, the defenses against sybil attacks, which involve the adversary creating new peer identities, are not necessary and are not implemented.<br />
* The efficiency enhancements described below are being gradually and cautiously deployed to the CLOCKSS PLN.<br />
<br />
Currently, on average, a poll is called on each AU instance approximately once every 100 days. Since there are currently 12 boxes in the CLOCKSS network, approximately every 8 days on average one instance of a given AU is checked.<br />
<br />
== Enhancements ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The [http://www.lockss.org/news-media/news/lockss-program-receives-andrew-w-mellon-foundation-grant/ Andrew W. Mellon Foundation funded work to implement and evaluate improvements] in these areas; the grant period extends through March 2015. Although these improvements will be deployed to the CLOCKSS network, because there are many fewer boxes in the CLOCKSS network than the GLN the areas of inefficiency are less relevant to the CLOCKSS network. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. These tools were used on the CLOCKSS network for an initial 59-day period, collecting over 18M data items. The data collected has yet to be fully analyzed but initial analysis shows that the polling process among CLOCKSS boxes continues to operate satisfactorily. Some examples of the graphs generated follow.<br />
<br />
[[File:hist_pr_auid_count27.png|200px|thumb|center]] This graph shows the number of AU instances in CLOCKSS boxes which have reached agreement with N other CLOCKSS boxes, showing the progress AUs make after ingest as the LOCKSS: Polling and Repair Protocol identifies matching AU instances at other boxes. It will be seen that there are few AU instances in the sample with few boxes with whom they have reached agreement, and that the majority of AU instances have reached agreement with AU instances at the majority of other CLOCKSS boxes.<br />
<br />
[[File:Sample Graph 2.png|200px|thumb|center]] This graph shows the extent of agreement among the over 40,000 successfully completed polls in the sample. As can be seen, the overwhelming majority of the polls showed complete agreement. Polls with less than complete agreement are likely to have been caused by polling among AU instances that were still collecting content, so had different sub-sets of the URLs in an AU.<br />
<br />
== Demonstration ==<br />
<br />
The CRL auditors requested a demonstration of the polling and repair process. Demonstrating this on production content is difficult. The content is generally large, so polls take a long time. Each box is running many polls simultaneously, so the log entries for these polls are interleaved. Turning the logging level on polling up enough to show full details would affect all polls underway simultaneously, so the volume of log data would be overwhelming. Instead, we provided a live demonstration using a network of 5 LOCKSS daemons in the [[LOCKSS:_Software_Development_Process#Functional_Tests|STF testing framework]], preserving an AU of synthetic content. It consisted of two polls, the first detected no damage and the second created, detected and repaired damage to the content of one URL. Annotated logs of the first poll are available from the [[Media:Poller-good.pdf|poller]] and a [[Media:Voter-good.pdf|voter]]. Annotated logs of the second poll are available from the [[Media:Poller-bad.pdf|poller]] and a [[Media:Voter-bad.pdf|voter]].<br />
<br />
=== Replicating the Demonstration ===<br />
<br />
These demos have been included in the 1.66 release of the LOCKSS software. These instructions have been tested on a vanilla install of Ubuntu 14.04.1, up-to-date as of August 4. They should work on other recent Debian-based Linux systems.<br />
<br />
The first step is to install the pre-requisites:<br />
<pre><br />
foo@bar:~$ cd<br />
foo@bar:~$ sudo apt-get install default-jdk ant subversion libxml2-utils<br />
[sudo] password for foo:<br />
Reading package lists... 0%<br />
...<br />
0 upgraded, 46 newly installed, 0 to remove and 0 not upgraded.<br />
Need to get 70.6 MB of archives.<br />
After this operation, 127 MB of additional disk space will be used.<br />
Do you want to continue? [Y/n] y<br />
...<br />
done.<br />
foo@bar:~$ ls -l /etc/alternatives/javac<br />
lrwxrwxrwx 1 root root 42 Aug 4 18:45 /etc/alternatives/javac -&gt; /usr/lib/jvm/java-7-openjdk-i386/bin/javac<br />
foo@bar:~$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386<br />
foo@bar:~$ <br />
</pre><br />
The next step is to check the latest release of the LOCKSS daemon out from SourceForge:<br />
<pre><br />
foo@bar:~$ svn checkout svn://svn.code.sf.net/p/lockss/svn/lockss-daemon/tags/last_released_daemon lockss-daemon<br />
...<br />
foo@bar:~$ <br />
</pre><br />
The next step is to build the LOCKSS daemon. This takes a while, there's a lot of code to build. It generates Java warnings that you should be able to ignore, but no errors. Just to be sure that everything is OK, we run the unit and functional tests on the daemon that gets built. This takes much longer, especially on the little netbook I'm using to test the instructions:<br />
<pre><br />
foo@bar:~$ mkdir ~/.ant<br />
foo@bar:~$ cd ~/.ant<br />
foo@bar:~$ ln -s ~/lockss-daemon/lib .<br />
foo@bar:~$ cd ~/lockss-daemon<br />
foo@bar:~/lockss-daemon$ ant<br />
Buildfile: /home/foo/lockss-daemon/build.xml<br />
...<br />
BUILD SUCCESSFUL<br />
Total time: 81 minutes 21 seconds<br />
<br />
real 81m21.980s<br />
user 84m29.308s<br />
sys 5m13.708s<br />
foo@bar:~/lockss-daemon$ <br />
</pre><br />
The next step is to configure STF for the demos. The demos work without this configuration, but they are much more informative with it:<br />
<pre><br />
foo@bar:~$ cd ~/lockss-daemon/test/frameworks/run_stf foo@bar:~/lockss-daemon/test/frameworks/run_stf$ cp testsuite.opt.demo testsuite.opt foo@bar:~/lockss-daemon/test/frameworks/run_stf$<br />
</pre><br />
This configuration ensures that:<br />
* The logs contain detailed information about the polling and repair process.<br />
* The logs aren't deleted after the demo.<br />
* The daemons stay running until you hit Enter. This allows you use a Web browser to access the UI of the daemons and see the polling and voting status pages. See the STF README.txt file for details of how to do this.<br />
Now you can go ahead and run the first demo in the STF test framework. It creates a network of five LOCKSS boxes, each preserving an Archival Unit (AU) of synthetic content, and causes the first box to call a poll on it, which should result in complete agreement among the boxes:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo1<br />
11:27:35.057: INFO: ===================================<br />
11:27:35.057: INFO: Demo a V3 poll with no disagreement<br />
11:27:35.057: INFO: -----------------------------------<br />
11:27:35.250: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:27:35.266: INFO: Waiting for framework to become ready<br />
11:27:45.624: INFO: Creating simulated AU's<br />
11:27:47.546: INFO: Waiting for simulated AU's to crawl<br />
11:27:47.759: INFO: AU's completed initial crawl<br />
11:27:47.760: INFO: No nodes damaged on client localhost:8041<br />
11:27:47.777: INFO: Waiting for a V3 poll to be called...<br />
11:28:18.087: INFO: Successfully called a V3 poll<br />
11:28:18.088: INFO: Checking V3 poll result...<br />
11:28:18.215: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:28:18.249: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:28:18.287: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:28:18.322: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:28:18.425: INFO: AU successfully polled<br />
11:28:19.427: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:29:08.161: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 93.213s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
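The complete agreement reported above means that, for every URL, each voter's nonce-based hash matched the value the poller recomputed from its own copy. The hashing scheme described in the Overview can be sketched in Python as follows (an illustration only, not the daemon's code; the choice of SHA-1 and all names here are assumptions):<br />

```python
import hashlib

def vote_hash(np: bytes, nv: bytes, content: bytes) -> str:
    """Digest of the concatenation Np || Nv || content for one URL.

    SHA-1 is illustrative only; the daemon's digest algorithm is
    configurable, and this sketch is not the daemon's actual code.
    """
    return hashlib.sha1(np + nv + content).hexdigest()

def make_vote(np: bytes, nv: bytes, au: dict) -> dict:
    """A vote: one (URL, hash) pair per URL in the voter's copy of the AU."""
    return {url: vote_hash(np, nv, body) for url, body in au.items()}
```

Because both nonces enter every digest, the same content yields a different hash in every poll, so a voter cannot replay a remembered hash; it must re-read and re-hash its copy each time it votes.<br />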
You will find that the demo has created a file system tree under testcase-1 with a directory for each of the five boxes in the network:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls testcase-1<br />
daemon-8041 daemon-8042 daemon-8043 daemon-8044 daemon-8045 lockss.opt lockss.txt<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
daemon-8041 is the poller, the box that called the poll and tallied the result. You can see its log (an annotated version is here):<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls -l testcase-1/daemon-8041/test.out<br />
-rw-rw-r-- 1 foo foo 31399 Aug 5 13:56 testcase-1/daemon-8041/test.out<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
daemon-8042 through daemon-8045 are the voters, the boxes whose content is compared with the poller's. You can see their logs (an annotated version is here):<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ls -l testcase-1/daemon-8042/test.out<br />
-rw-rw-r-- 1 foo foo 14755 Aug 5 13:56 testcase-1/daemon-8042/test.out<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$<br />
</pre><br />
Now we clean up in preparation for the second demo:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
In the second demo one of the daemons calls a poll, but before it does, one file in its simulated content is damaged. The other four boxes vote, and they all disagree with the poller about the damaged file. The poller requests a repair of this file from one of the voters. Once the repair is received, the poller re-tallies the poll and now finds 100% agreement. The logs end up in the usual place; annotated versions are available for the poller and a voter.<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo2<br />
11:16:24.793: INFO: ================================================<br />
11:16:24.793: INFO: Demo a basic V3 poll with repair via open access<br />
11:16:24.793: INFO: ------------------------------------------------<br />
11:16:24.987: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:16:25.002: INFO: Waiting for framework to become ready<br />
11:16:35.392: INFO: Creating simulated AU's<br />
11:16:37.454: INFO: Waiting for simulated AU's to crawl<br />
11:16:37.671: INFO: AU's completed initial crawl<br />
11:16:38.320: INFO: Damaged the following node(s) on client localhost:8041:<br />
http://www.example.com/branch1/001file.txt<br />
11:16:38.337: INFO: Waiting for a V3 poll to be called...<br />
11:17:03.523: INFO: Successfully called a V3 poll<br />
11:17:03.523: INFO: Waiting for V3 repair...<br />
11:17:03.765: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:17:03.802: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:17:03.839: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:17:03.869: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:17:03.943: INFO: AU successfully repaired<br />
11:17:04.945: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:17:15.661: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 50.956s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
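The repair shown in this transcript follows the tally logic described in the Overview: the poller recomputes each voter's hash over its own copy of each URL, and a URL on which it disagrees with the voter consensus is re-fetched from a consensus-holding voter and re-verified. A simplified Python sketch follows (illustrative only; the names are hypothetical, SHA-1 stands in for the configurable digest, and the real tally also handles missing URLs, stored checksums and partial agreement):<br />

```python
import hashlib

def vote_hash(np, nv, content):
    # Digest of Np || Nv || content for one URL (SHA-1 is illustrative).
    return hashlib.sha1(np + nv + content).hexdigest()

def tally_url(np, poller_content, votes):
    """votes maps voter_id -> (Nv, hash). The poller recomputes each
    voter's hash over its own copy of the URL and returns the set of
    voters whose vote disagrees with that recomputation."""
    return {v for v, (nv, h) in votes.items()
            if vote_hash(np, nv, poller_content) != h}

def repair_url(np, poller_content, votes, fetch):
    """If a majority of voters disagrees with the poller's copy, fetch a
    replacement from one of them and re-verify it against the votes.
    This sketch assumes the disagreeing voters all hold the same
    (consensus) content; the real protocol is more careful."""
    disagree = tally_url(np, poller_content, votes)
    if len(disagree) <= len(votes) // 2:
        return poller_content            # local copy matches the consensus
    donor = sorted(disagree)[0]          # a voter holding the consensus copy
    candidate = fetch(donor)
    if tally_url(np, candidate, votes):  # re-tally: must now agree
        raise RuntimeError("repair did not restore consensus")
    return candidate
```

This mirrors the transcript: the damaged copy disagrees with all four voters, a repair is fetched, and the re-tally then shows complete agreement.<br />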
Now we clean up in preparation for the third demo:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
In the second demo, the simulated content was open access, so there was no restriction on the voter sending a repair to the poller. The common case is that the content is not open access; then the voter must remember having agreed with the poller in the past about the AU being repaired, so that it does not leak content to boxes that could not have obtained it directly from the publisher.<br />
<br />
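The voter-side gate just described can be sketched as a simple predicate (all names are hypothetical; the daemon's bookkeeping of prior agreement is more involved):<br />

```python
from dataclasses import dataclass, field

@dataclass
class RepairPolicy:
    """Voter-side gate for repair requests.

    A repair for an AU is served only if the AU is open access, or the
    requesting peer previously proved, by agreeing with this box in a
    poll on that AU, that it already held the content."""
    open_access_aus: set = field(default_factory=set)
    prior_agreement: set = field(default_factory=set)   # {(peer, au_id)}

    def record_agreement(self, peer, au_id):
        # Called after a poll in which `peer` and this box agreed on au_id.
        self.prior_agreement.add((peer, au_id))

    def may_serve_repair(self, peer, au_id):
        return (au_id in self.open_access_aus
                or (peer, au_id) in self.prior_agreement)
```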
In the third demo the daemons first achieve agreement on the non-open-access content, and then damage is created at the poller. When the poller next calls a poll, detects the damage, and requests a repair, the voter remembers the prior agreement and sends the repair.<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ python testsuite.py AuditDemo3<br />
11:18:05.527: INFO: =======================================================<br />
11:18:05.527: INFO: Demo a basic V3 poll with repair via previous agreement<br />
11:18:05.527: INFO: -------------------------------------------------------<br />
11:18:05.722: INFO: Starting framework in /home/foo/gamma/lockss-daemon/test/frameworks/run_stf/testcase-1<br />
11:18:05.743: INFO: Waiting for framework to become ready<br />
11:18:21.199: INFO: Creating simulated AU's<br />
11:18:23.138: INFO: Waiting for simulated AU's to crawl<br />
11:18:23.355: INFO: AU's completed initial crawl<br />
11:18:23.449: INFO: Waiting for a V3 poll by all simulated caches<br />
11:18:48.653: INFO: Client on port 8041 called V3 poll...<br />
11:18:48.694: INFO: Client on port 8042 called V3 poll...<br />
11:18:48.732: INFO: Client on port 8043 called V3 poll...<br />
11:18:48.764: INFO: Client on port 8044 called V3 poll...<br />
11:18:48.814: INFO: Client on port 8045 called V3 poll...<br />
11:18:48.814: INFO: Waiting for all peers to win their polls<br />
11:18:48.891: INFO: Client on port 8041 won V3 poll...<br />
11:18:48.972: INFO: Client on port 8042 won V3 poll...<br />
11:18:49.072: INFO: Client on port 8043 won V3 poll...<br />
11:18:49.157: INFO: Client on port 8044 won V3 poll...<br />
11:18:49.248: INFO: Client on port 8045 won V3 poll...<br />
11:18:50.347: INFO: Damaged the following node(s) on client localhost:8041:<br />
http://www.example.com/001file.bin<br />
http://www.example.com/001file.txt<br />
http://www.example.com/002file.bin<br />
http://www.example.com/002file.txt<br />
http://www.example.com/branch1/001file.bin<br />
http://www.example.com/branch1/001file.txt<br />
http://www.example.com/branch1/002file.bin<br />
http://www.example.com/branch1/002file.txt<br />
http://www.example.com/branch1/index.html<br />
http://www.example.com/index.html<br />
11:18:50.375: INFO: Waiting for a V3 poll to be called...<br />
11:19:25.638: INFO: Successfully called a V3 poll<br />
11:19:25.714: INFO: Waiting for a V3 poll to be called...<br />
11:19:25.742: INFO: Successfully called a V3 poll<br />
11:19:25.742: INFO: Waiting for V3 repair...<br />
11:19:26.871: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.871: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.872: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.872: INFO: Asymmetric client localhost:8042 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.919: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.920: INFO: Asymmetric client localhost:8043 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.955: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.956: INFO: Asymmetric client localhost:8044 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:26.999: INFO: Asymmetric client localhost:8045 repairers OK<br />
11:19:27.083: INFO: AU successfully repaired<br />
11:19:28.086: INFO: No deadlocks detected<br />
&gt;&gt;&gt; Delaying shutdown. Press Enter to continue...<br />
11:20:46.489: INFO: Stopping framework<br />
----------------------------------------------------------------------<br />
Ran 1 test in 161.058s<br />
<br />
OK<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
Finally we clean up again:<br />
<pre><br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ ./clean.sh ; rm testsuite.opt<br />
foo@bar:~/lockss-daemon/test/frameworks/run_stf$ <br />
</pre><br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi. “LOCKSS: A Peer-to-Peer Digital Preservation System”, ACM Transactions on Computer Systems vol. 23, no. 1, February 2005, pp. 2-50. http://dx.doi.org/10.1145/1047915.1047917 accessed 2013.8.7</div>
<hr />
<div>= LOCKSS: Basic Concepts =<br />
<br />
This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.<br />
<br />
== LOCKSS Program ==<br />
<br />
The LOCKSS Program started on [http://blog.dshr.org/2013/10/it-was-fifteen-years-ago-today.html 4 October 1998] under the auspices of Stanford University Libraries to develop and support the LOCKSS technology. It was funded initially by a small grant from Michael Lesk at the NSF, and then supported by the Andrew W. Mellon Foundation, the NSF and Sun Microsystems. It transitioned from grant funding to the "Red Hat" model of free, open source software and paid support thanks to a matching grant from the Mellon Foundation, and has been financially stable since 2008 on that basis. Although program staff are Stanford employees, at no time has Stanford provided any financial support for the program. The program pays all staff and operational costs, Stanford indirect costs, and an "occupancy charge" for its office space.<br />
<br />
== LOCKSS Daemon ==<br />
<br />
The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. The LOCKSS daemon is the only application program that runs in a LOCKSS box. Every action of a LOCKSS box, for ingest, preservation, dissemination and administration is performed by the LOCKSS daemon. The LOCKSS daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor the daemon's performance. Among the functions performed by the LOCKSS daemon are:<br />
* Ingest via Web crawling, or file import.<br />
* Preservation via the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol].<br />
* Dissemination by acting as both a Web server and a Web proxy, and by file export.<br />
* Administration via a Web user interface.<br />
* Status and statistics reporting.<br />
Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves ''files''. It preserves ''URLs'': their content and their associated headers (metadata), as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.<br />
<br />
== Preservation ==<br />
<br />
The LOCKSS system is sometimes criticized as providing only bit-level preservation, but this is a misunderstanding. The system employs exactly the same techniques (and in most cases exactly the same software tools) as other preservation systems, including:<br />
* Format identification, [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|using FITS software]]<br />
* Format verification, [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|using FITS software]].<br />
* Format migration, see [[LOCKSS: Format Migration]].<br />
* Metadata creation, see [[LOCKSS: Extracting Bibliographic Metadata]] and [[LOCKSS: Metadata Database]].<br />
The difference between LOCKSS and most other preservation systems lies not in ''what'' techniques are employed but in ''when'' those techniques are employed. In the interest of economy, the LOCKSS system stores only the original bits, and delays all operations on them except integrity checking as long as possible. So, for example, unlike systems that preemptively migrate content in bulk from formats that are not yet obsolete into formats presumed less likely to become obsolete, thereby consuming processing resources, and store both the original and the migrated copies, thereby consuming storage resources, LOCKSS migrates the format of an individual file only when a reader's request indicates that migration of that file is necessary. The migrated version is discarded when no longer needed, to save storage. This capability was [http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html demonstrated in 2005] but has remained unused in practice because the formats of content preserved in the LOCKSS system are [[LOCKSS: Format Migration#Obsolescence of Web Formats|not going obsolete]].<br />
<br />
== Plugins ==<br />
<br />
The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of particular content it is to preserve. This is done via the "plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".<br />
<br />
== Archival Units ==<br />
<br />
Here is an [[Media:ClockssTaylorAndFrancisPlugin.xml.pdf|example plugin XML file]], for Taylor and Francis journals. It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs) representing, typically, a year or a volume of a journal. Each AU is defined by the plugin class name, in this case <tt>org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin</tt>, and a set of definitional parameters defined by the XML file, in this case:<br />
* <tt>base_url</tt><br />
* <tt>journal_id</tt><br />
* <tt>volume_name</tt><br />
<br />
For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at <tt>au_start_url</tt>:<br />
<pre><br />
<entry><br />
<string>au_start_url</string><br />
<string>&quot;%sclockss/%s/%s/index.html&quot;, base_url, journal_id, volume_name</string><br />
</entry><br />
</pre><br />
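As an illustration, the printf-style template above can be expanded with concrete parameter values (here, the Taylor and Francis values used later in this document; this sketch is for illustration only, the daemon performs the substitution in Java):<br />

```python
# Hypothetical expansion of the printf-style au_start_url template above.
# The parameter values are the Taylor and Francis examples used elsewhere
# in this document.
template = "%sclockss/%s/%s/index.html"
base_url = "http://www.tandfonline.com/"
journal_id = "taer20"
volume_name = "6"
start_url = template % (base_url, journal_id, volume_name)
print(start_url)  # → http://www.tandfonline.com/clockss/taer20/6/index.html
```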
<br />
== Title Database ==<br />
<br />
The values for these parameters come from the Title Database (TDB), which is not actually a database, but a knowledge base represented as a set of text files in an easy-to-edit syntax that are processed into an XML file obtained by the LOCKSS daemon. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each of the parameters defined by that plugin class that are different from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:<br />
<pre><br />
{<br />
<br />
publisher <<br />
name = Taylor & Francis ;<br />
info[contract] = 2008 ;<br />
info[tester] = A<br />
><br />
<br />
plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin<br />
param[base_url] = http://www.tandfonline.com/<br />
implicit < status ; status2 ; year ; name ; param[volume_name] ><br />
...<br />
<br />
{<br />
<br />
title <<br />
name = Advances in Building Energy Research ;<br />
issn = 1751-2549 ;<br />
eissn = 1756-2201 ;<br />
issnl = 1751-2549<br />
><br />
<br />
param[journal_id] = taer20<br />
<br />
au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 ><br />
au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 ><br />
au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 ><br />
au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 ><br />
au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 ><br />
au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 ><br />
au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 ><br />
au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 ><br />
<br />
}<br />
...<br />
}<br />
</pre><br />
<br />
The definitional parameters are specified as follows:<br />
* <tt>base_url</tt> is <tt>http://www.tandfonline.com/</tt> for all Taylor and Francis journals specified at the top.<br />
* <tt>journal_id</tt> is <tt>taer20</tt> specified in the section for Advances in Building Energy Research.<br />
* <tt>volume_name</tt> is specified by the 5th column of the table.<br />
<br />
The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the [[LOCKSS: Property Server Operations|Property Server]] and its backup in the Amazon cloud.<br />
<br />
== AUID ==<br />
<br />
Everywhere an AU needs to be uniquely identified, for example as a key in maps and databases, or in the messages of the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol], we use an internal name, its Archival Unit ID (AUID). The AUID for an AU is an immutable string with an encoded representation of:<br />
* The fully-qualified Java class name of the plugin.<br />
* For each of the definitional parameters defined by the plugin XML:<br />
** The parameter name.<br />
** The parameter value.<br />
Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on.<br />
<br />
The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above, and used as the example in [http://documents.clockss.org/index.php/Definition_of_AIP#Harvest_AU Definition of AIP] is:<br />
<pre><br />
org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6<br />
</pre><br />
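The encoding is regular enough to sketch. The following is a hedged illustration reverse-engineered from the example above, not the daemon's actual Java code: dots in the plugin class name become vertical bars, and the definitional parameters are appended in sorted order with percent-encoded values, '.' also being escaped as <tt>%2E</tt>:<br />

```python
from urllib.parse import quote

def make_auid(plugin_class, params):
    """Sketch of AUID construction, inferred from the example above."""
    auid = plugin_class.replace(".", "|")
    for name in sorted(params):
        # Percent-encode the value; '.' is also escaped, per the example.
        value = quote(params[name], safe="").replace(".", "%2E")
        auid += "&%s~%s" % (name, value)
    return auid

print(make_auid(
    "org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin",
    {"base_url": "http://www.tandfonline.com/",
     "journal_id": "taer20",
     "volume_name": "6"}))
```

This reproduces the AUID shown above for Volume 6 of Advances in Building Energy Research.<br />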
<br />
Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.</div>Dshrhttp://documents.clockss.org/index.php/Main_PageMain Page2014-07-23T15:30:32Z<p>Dshr: /* ISO 16363 Criteria */ Clarification before release.</p>
<hr />
<div>= CLOCKSS Archive =<br />
<br />
Welcome to the documentation Wiki of the [http://www.clockss.org CLOCKSS Archive].<br />
<br />
== CLOCKSS Archive Documents ==<br />
<br />
These documents describe the organization, policies, practices and plans of the CLOCKSS Archive:<br />
* [[CLOCKSS: Mission Statement]]<br />
* [[CLOCKSS: Governance and Organization]]<br />
* [[CLOCKSS: Budget and Planning Process]]<br />
* [[CLOCKSS: Business Plan Overview]]<br />
* [[CLOCKSS: Business History]]<br />
* [[CLOCKSS: Collection Development]]<br />
* [[CLOCKSS: Preservation Strategy]]<br />
* [[CLOCKSS: Access Policy]]<br />
* [[CLOCKSS: Succession Plan]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* [http://www.clockss.org/clockss/Contribute_to_CLOCKSS CLOCKSS: Fees]<br />
<br />
== LOCKSS Program Documents ==<br />
<br />
These documents describe the [http://www.lockss.org LOCKSS technology]:<br />
* [[LOCKSS: Basic Concepts]]<br />
* [[LOCKSS: Polling and Repair Protocol]]<br />
* [[LOCKSS: Format Migration]]<br />
* [[LOCKSS: Metadata Database]] <br />
* [[LOCKSS: Extracting Bibliographic Metadata]]<br />
* [[LOCKSS: Software Development Process]]<br />
* [[LOCKSS: Property Server Operations]]<br />
<br />
== LOCKSS Adaptations to CLOCKSS Archive Documents ==<br />
<br />
These documents describe how the LOCKSS technology is used by the CLOCKSS Archive:<br />
* [[CLOCKSS: Threats and Mitigations]]<br />
* [[CLOCKSS: Logging and Records]]<br />
* [[CLOCKSS: Ingest Pipeline]]<br />
* [[CLOCKSS: Box Operations]]<br />
* [[CLOCKSS: Extracting Triggered Content]]<br />
* [[CLOCKSS: Hardware and Software Inventory]]<br />
<br />
== OAIS Conformance Documents ==<br />
<br />
These documents describe the mapping between the CLOCKSS Archive and the [http://public.ccsds.org/publications/archive/650x0m2.pdf OAIS Reference Architecture]:<br />
* [[CLOCKSS: Designated Community]]<br />
* [[CLOCKSS: Mandatory Responsibilities]]<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[Definition of DIP]]<br />
<br />
== Background ==<br />
<br />
During the preparation for the [http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/trac TRAC audit] of the CLOCKSS Archive, it was decided to make as much documentation of the CLOCKSS Archive public as possible. This wiki is the result. The contents have been collected from a variety of sources, including:<br />
* Published papers<br />
* CLOCKSS board minutes and other Board documents<br />
* The LOCKSS team's internal wiki<br />
* The LOCKSS team's ticketing and bug tracking systems<br />
Many of these sources were not intended to be made public, and contained confidential or inappropriate material. The contents of this wiki were extracted from them and reviewed for publication as part of the audit preparations. These pages document:<br />
* The structure, policies and practices of the CLOCKSS Archive.<br />
* The conformance of the CLOCKSS Archive to the OAIS Reference Model<br />
* The policies, practices and technology of the LOCKSS Program, which operates the CLOCKSS Archive under contract to the CLOCKSS Board.<br />
* The adaptations made to the generic LOCKSS technology for the purposes of the CLOCKSS Archive.<br />
As structures, policies, practices and technologies change, these documents will be maintained so that up-to-date information on these topics is available to the public.<br />
<br />
In particular, some documents were edited after their initial submission but before release of the certification report by the auditors to clarify issues that arose during the audit process and to include some material from the confidential part of the submission that was judged non-confidential. Viewing the history of an individual page will reveal any such changes.<br />
<br />
== ISO 16363 Criteria ==<br />
<br />
For the purposes of the audit, the wiki also contains a page for each of the [http://www.iso.org/iso/catalogue_detail.htm?csnumber=56510 ISO 16363 criteria]. The audit was actually conducted using the closely-related but shortly to be obsolete [http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/trac TRAC criteria]; these pages use the ISO criteria in the interest of future-proofing. The goal for these pages was that they be a finding aid for the auditors, allowing them to easily locate the relevant parts of the documents for each criterion, but that all actual content be in the documents. Thus interested readers can use the documents without needing to refer to the criteria pages:<br />
<br />
* [[3) Organizational Infrastructure]]<br />
* [[4) Digital Object Management]]<br />
* [[5) Infrastructure and Security Risk Management]]<br />
<br />
== Confidential Documents ==<br />
<br />
The auditors requested additional information, some of which was confidential:<br />
* Requested information that was not confidential was added to the documents in this Wiki. For example, URL lists and metadata for sample AUs were added to [[Definition of AIP]].<br />
* Requested information that was confidential was supplied in a separate Wiki which was deactivated at the end of the audit.<br />
<br />
== CLOCKSS Permission Statement ==<br />
<br />
CLOCKSS system has permission to ingest, preserve, and serve this Archival Unit.</div>Dshrhttp://documents.clockss.org/index.php/Main_PageMain Page2014-07-23T15:25:16Z<p>Dshr: /* Background */ Clarification before release.</p>
<hr />
<div>= CLOCKSS Archive =<br />
<br />
Welcome to the documentation Wiki of the [http://www.clockss.org CLOCKSS Archive].<br />
<br />
== CLOCKSS Archive Documents ==<br />
<br />
These documents describe the organization, policies, practices and plans of the CLOCKSS Archive:<br />
* [[CLOCKSS: Mission Statement]]<br />
* [[CLOCKSS: Governance and Organization]]<br />
* [[CLOCKSS: Budget and Planning Process]]<br />
* [[CLOCKSS: Business Plan Overview]]<br />
* [[CLOCKSS: Business History]]<br />
* [[CLOCKSS: Collection Development]]<br />
* [[CLOCKSS: Preservation Strategy]]<br />
* [[CLOCKSS: Access Policy]]<br />
* [[CLOCKSS: Succession Plan]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* [http://www.clockss.org/clockss/Contribute_to_CLOCKSS CLOCKSS: Fees]<br />
<br />
== LOCKSS Program Documents ==<br />
<br />
These documents describe the [http://www.lockss.org LOCKSS technology]:<br />
* [[LOCKSS: Basic Concepts]]<br />
* [[LOCKSS: Polling and Repair Protocol]]<br />
* [[LOCKSS: Format Migration]]<br />
* [[LOCKSS: Metadata Database]] <br />
* [[LOCKSS: Extracting Bibliographic Metadata]]<br />
* [[LOCKSS: Software Development Process]]<br />
* [[LOCKSS: Property Server Operations]]<br />
<br />
== LOCKSS Adaptations to CLOCKSS Archive Documents ==<br />
<br />
These documents describe how the LOCKSS technology is used by the CLOCKSS Archive:<br />
* [[CLOCKSS: Threats and Mitigations]]<br />
* [[CLOCKSS: Logging and Records]]<br />
* [[CLOCKSS: Ingest Pipeline]]<br />
* [[CLOCKSS: Box Operations]]<br />
* [[CLOCKSS: Extracting Triggered Content]]<br />
* [[CLOCKSS: Hardware and Software Inventory]]<br />
<br />
== OAIS Conformance Documents ==<br />
<br />
These documents describe the mapping between the CLOCKSS Archive and the [http://public.ccsds.org/publications/archive/650x0m2.pdf OAIS Reference Architecture]:<br />
* [[CLOCKSS: Designated Community]]<br />
* [[CLOCKSS: Mandatory Responsibilities]]<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[Definition of DIP]]<br />
<br />
== Background ==<br />
<br />
During the preparation for the [http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/trac TRAC audit] of the CLOCKSS Archive, it was decided to make as much documentation of the CLOCKSS Archive public as possible. This wiki is the result. The contents have been collected from a variety of sources, including:<br />
* Published papers<br />
* CLOCKSS board minutes and other Board documents<br />
* The LOCKSS team's internal wiki<br />
* The LOCKSS team's ticketing and bug tracking systems<br />
Many of these sources were not intended to be made public, and contained confidential or inappropriate material. The contents of this wiki were extracted from them and reviewed for publication as part of the audit preparations. These pages document:<br />
* The structure, policies and practices of the CLOCKSS Archive.<br />
* The conformance of the CLOCKSS Archive to the OAIS Reference Model<br />
* The policies, practices and technology of the LOCKSS Program, which operates the CLOCKSS Archive under contract to the CLOCKSS Board.<br />
* The adaptations made to the generic LOCKSS technology for the purposes of the CLOCKSS Archive.<br />
As structures, policies, practices and technologies change, these documents will be maintained so that up-to-date information on these topics is available to the public.<br />
<br />
In particular, some documents were edited after their initial submission but before release of the certification report by the auditors to clarify issues that arose during the audit process and to include some material from the confidential part of the submission that was judged non-confidential. Viewing the history of an individual page will reveal any such changes.<br />
<br />
== ISO 16363 Criteria ==<br />
<br />
For the purposes of the audit, the wiki also contains a page for each of the ISO 16363 criteria. The goal for these pages was that they be a finding aid for the auditors, allowing them to easily locate the relevant parts of the documents for each criterion, but that all actual content be in the documents. Thus interested readers can use the documents without needing to refer to the criteria pages:<br />
<br />
* [[3) Organizational Infrastructure]]<br />
* [[4) Digital Object Management]]<br />
* [[5) Infrastructure and Security Risk Management]]<br />
<br />
== Confidential Documents ==<br />
<br />
The auditors requested additional information, some of which was confidential:<br />
* Requested information that was not confidential was added to the documents in this Wiki. For example, URL lists and metadata for sample AUs were added to [[Definition of AIP]].<br />
* Requested information that was confidential was supplied in a separate Wiki which was deactivated at the end of the audit.<br />
<br />
== CLOCKSS Permission Statement ==<br />
<br />
CLOCKSS system has permission to ingest, preserve, and serve this Archival Unit.</div>Dshrhttp://documents.clockss.org/index.php/File:Voter-bad.pdfFile:Voter-bad.pdf2014-07-21T17:43:32Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/File:Poller-bad.pdfFile:Poller-bad.pdf2014-07-21T17:43:05Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/File:Voter-good.pdfFile:Voter-good.pdf2014-07-21T17:42:30Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/File:Poller-good.pdfFile:Poller-good.pdf2014-07-21T17:41:43Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_ProtocolLOCKSS: Polling and Repair Protocol2014-07-21T17:40:53Z<p>Dshr: /* Enhancements */ Add Demonstration section - approved by DSHR</p>
<hr />
<div>= LOCKSS: Polling and Repair Protocol =<br />
<br />
== Overview ==<br />
<br />
LOCKSS boxes run the LOCKSS polling and repair protocol as described in [http://dx.doi.org/10.1145/1047915.1047917 our ''ACM Transactions on Computer Systems'' paper]. The paper describes the polling mechanism as applying to a single file; the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] applies it to an entire [[LOCKSS: Basic Concepts#Archival Unit|Archival Unit (AU)]] of content. Each LOCKSS daemon chooses at random the next AU upon which it will use the LOCKSS polling and repair protocol to perform integrity checks. It acts as the ''poller'' to call a poll on that AU by:<br />
* Selecting a random sample of the other CLOCKSS boxes (the <i>voters</i>).<br />
* Inviting the voters to participate in a <i>poll</i> on the AU, and sending each of them a freshly-generated random nonce ''Np''.<br />
* The poll involves the voters voting by:<br />
** Generating a fresh random nonce ''Nv''.<br />
** Creating a vote containing, for every URL in the voter's instance of the AU:<br />
*** The URL<br />
*** The hash of the concatenation of ''Np'', ''Nv'' and the content of the URL.<br />
** Sending the vote to the poller. Note that the vote contains a hash for each URL in the voter's instance of the AU, but that hash is not the hash of the content. The nonces ensure that the hash in the vote is different for every vote in every poll. The voter cannot simply remember the hash it initially created; it must re-hash every URL each time it votes.<br />
* The poller tallies the votes by:<br />
** For each URL in the poller's instance of the AU:<br />
*** For each voter:<br />
**** Computing the hash of ''Np'', ''Nv'' and the content of the URL in the poller's instance of the AU.<br />
**** Comparing the result with the hash value for that URL in that voter's vote.<br />
** Note that the nonces ensure that the poller must re-hash every URL in the AU; it cannot simply remember the hash it initially created.<br />
* In tallying the votes, the poller may detect that:<br />
** A URL it has does not match the consensus of the voters, or<br />
** A URL that the consensus of the voters says should be present in the AU is missing from the poller's AU, or<br />
** A URL it has does not match the checksum generated when it was stored.<br />
* If so, it repairs the problem by:<br />
** requesting a new copy from one of the voters that agreed with the consensus,<br />
** then verifying that the new copy does agree with the consensus.<br />
<br />
In this way, at unpredictable but fairly regular intervals, every poll on an AU checks the union of the sets of URLs in that AU on the box calling the poll (the poller) and on the boxes voting (the voters). The check establishes that each URL on the poller agrees with the consensus of the voters as to that URL's content. If it does not, it is repaired from one of the boxes in the consensus. Under our current Mellon grant we are investigating the potential benefits of an enhancement to the mechanism that results in every poll on an AU checking that every URL in that AU on each voter agrees with the same URL on the poller.<br />
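The voting and tallying steps above can be sketched as follows. This is a simplified illustration only, not the daemon's implementation: the real protocol is implemented in Java, and the hash algorithm, nonce sizes, URLs, and message formats here are placeholders:<br />

```python
import hashlib
import os

def vote_hash(np, nv, content):
    # Hash of the concatenation of the poller nonce Np, the voter nonce Nv,
    # and the content of the URL (SHA-1 chosen purely for illustration).
    return hashlib.sha1(np + nv + content).hexdigest()

def make_vote(np, nv, au):
    # A voter's vote: one (URL, nonced hash) pair per URL in its instance.
    return {url: vote_hash(np, nv, content) for url, content in au.items()}

def tally(np, nv, vote, poller_au):
    # The poller re-hashes its own copy with the same nonces and compares
    # against each hash in the vote; a missing URL also counts as disagreement.
    agree, disagree = [], []
    for url, h in vote.items():
        mine = vote_hash(np, nv, poller_au[url]) if url in poller_au else None
        (agree if mine == h else disagree).append(url)
    return agree, disagree

np, nv = os.urandom(20), os.urandom(20)   # fresh random nonces Np and Nv
voter_au = {"http://example.com/1": b"aaa", "http://example.com/2": b"bbb"}
vote = make_vote(np, nv, voter_au)
# The poller's copy has damaged content for the second URL.
poller_au = {**voter_au, "http://example.com/2": b"xxx"}
agree, disagree = tally(np, nv, vote, poller_au)
# disagree identifies the URL the poller should repair from the consensus.
```

Because the nonces are fresh for every poll, neither side can cache hashes across polls; both must re-hash the content each time, which is what makes the protocol an integrity check of the stored bits.<br />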
<br />
== Configuration of CLOCKSS Network ==<br />
<br />
As described in [[CLOCKSS: Box Operations]] the CLOCKSS boxes are configured to form a Private LOCKSS Network (PLN) including the following configuration options:<br />
* Because the CLOCKSS PLN is a closed network secured by SSL certificate checks at both ends of all connections, the defenses against Sybil attacks, which involve the adversary creating new peer identities, are not necessary and are not implemented.<br />
* The efficiency enhancements described below are being gradually and cautiously deployed to the CLOCKSS PLN.<br />
<br />
Currently, on average, a poll is called on each AU instance approximately once every 100 days. Since there are currently 12 boxes in the CLOCKSS network, approximately every 8 days on average one instance of a given AU is checked.<br />
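The arithmetic behind these rates is simple; a minimal sketch:<br />

```python
# Poll-rate arithmetic from the paragraph above: each of the 12 boxes
# polls its instance of a given AU about once every 100 days on average,
# so some instance of that AU is checked roughly every 100/12 days.
days_per_poll_per_instance = 100
boxes = 12
days_between_checks = days_per_poll_per_instance / boxes
print(round(days_between_checks, 1))  # → 8.3
```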
<br />
== Enhancements ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The [http://www.lockss.org/news-media/news/lockss-program-receives-andrew-w-mellon-foundation-grant/ Andrew W. Mellon Foundation funded work to implement and evaluate improvements] in these areas; the grant period extends through March 2015. Although these improvements will be deployed to the CLOCKSS network, because there are many fewer boxes in the CLOCKSS network than in the GLN the areas of inefficiency are less relevant to the CLOCKSS network. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. These tools were used on the CLOCKSS network for an initial 59-day period, collecting over 18M data items. The data collected has yet to be fully analyzed but initial analysis shows that the polling process among CLOCKSS boxes continues to operate satisfactorily. Some examples of the graphs generated follow.<br />
<br />
[[File:hist_pr_auid_count27.png|200px|thumb|center]] This graph shows the number of AU instances in CLOCKSS boxes which have reached agreement with N other CLOCKSS boxes, showing the progress AUs make after ingest as the LOCKSS: Polling and Repair Protocol identifies matching AU instances at other boxes. It will be seen that there are few AU instances in the sample that have reached agreement with only a few other boxes, and that the majority of AU instances have reached agreement with AU instances at the majority of other CLOCKSS boxes.<br />
<br />
[[File:Sample Graph 2.png|200px|thumb|center]] This graph shows the extent of agreement among the over 40,000 successfully completed polls in the sample. As can be seen, the overwhelming majority of the polls showed complete agreement. Polls with less than complete agreement are likely to have been caused by polling among AU instances that were still collecting content, and so had different subsets of the URLs in an AU.<br />
<br />
== Demonstration ==<br />
<br />
The CRL auditors requested a demonstration of the polling and repair process. Demonstrating this on production content is difficult. The content is generally large, so polls take a long time. Each box is running many polls simultaneously, so the log entries for these polls are interleaved. Turning the logging level on polling up enough to show full details would affect all polls underway simultaneously, so the volume of log data would be overwhelming. Instead, we provided a live demonstration using a network of 5 LOCKSS daemons in the [[LOCKSS:_Software_Development_Process#Functional_Tests|STF testing framework]], preserving an AU of synthetic content. It consisted of two polls: the first detected no damage; the second created, detected, and repaired damage to the content of one URL. Annotated logs of the first poll are available from the [[Media:Poller-good.pdf|poller]] and a [[Media:Voter-good.pdf|voter]]. Annotated logs of the second poll are available from the [[Media:Poller-bad.pdf|poller]] and a [[Media:Voter-bad.pdf|voter]].<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi. “LOCKSS: A Peer-to-Peer Digital Preservation System”, ACM Transactions on Computer Systems vol. 23, no. 1, February 2005, pp. 2-50. http://dx.doi.org/10.1145/1047915.1047917 accessed 2013.8.7</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Extracting_Triggered_ContentCLOCKSS: Extracting Triggered Content2014-06-25T01:43:11Z<p>Dshr: Previous change and this one approved by Tom Lipkis</p>
<hr />
<div>= CLOCKSS: Extracting Triggered Content =<br />
<br />
The CLOCKSS board may decide to trigger content from the CLOCKSS archive when it is no longer available from any publisher. Reasons for doing so include:<br />
* A publisher discontinuing an entire title.<br />
* A publisher who has acquired a title deciding not to re-host a run of back issues.<br />
* Disaster.<br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS Participating Publisher Agreement] paragraph 4.E states:<br />
<blockquote><br />
Released Content use terms and restrictions will be determined by an accompanying Creative Commons license (or equivalent license) chosen either by Publisher or, if Publisher fails to respond within thirty (30) days following receipt by it of the notice described above, by the CLOCKSS Board.<br />
</blockquote><br />
At present, EDINA and Stanford have volunteered to re-publish triggered content, but it is important to note that the Creative Commons license means that anyone can re-publish such content, and that re-publishing is not a core function of the CLOCKSS archive.<br />
<br />
Once the CLOCKSS board notifies the LOCKSS Executive Director that a trigger event has occurred, the CLOCKSS Metadata Lead and assigned LOCKSS staff run a process that delivers the triggered content to the re-publishing sites, and updates a [https://www.clockss.org/clockss/Triggered_Content section of the CLOCKSS website] with links pointing to the re-published content at both sites. In OAIS terminology, the package delivered to the re-publishing sites is a [[Definition of DIP|Distribution Information Package (DIP)]].<br />
<br />
Additionally, a publisher with content in CLOCKSS can request to be supplied with a copy. In such a case, the process described below is followed but the content is delivered to the requesting publisher not to a re-publishing site.<br />
<br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement] specifies that content must be unavailable for 6 months before it is triggered. It is important to observe that this delay is mandatory only when content is being triggered ''without the consent of the publisher''. In all cases so far content has been triggered ''at the request of the publisher'', and the technical process described below has taken 2-4 weeks.<br />
<br />
== Overview of the Trigger Process ==<br />
<br />
The trigger process involves: <br />
# identifying the triggered content in the CLOCKSS preservation network,<br />
# extracting the triggered content from the CLOCKSS preservation network, <br />
# preparing the content for publication on the triggered content machines,<br />
# re-hosting triggered content on the CLOCKSS triggered content site,<br />
# re-registering triggered article DOIs.<br />
The following sections describe the process as it is currently implemented. It is possible that content triggered in the future might contain files whose format has become obsolete. The additional processes that would be needed in this situation are described in [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|LOCKSS: Format Migration]].<br />
<br />
The CLOCKSS Metadata Lead is responsible for this process.<br />
<br />
== Identifying the Triggered Content ==<br />
<br />
The triggering process begins by identifying the triggered content within the CLOCKSS preservation network. First, the [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]] that contain the title are identified by querying the [[LOCKSS: Metadata Database|database of bibliographic metadata]] that is compiled and maintained by each CLOCKSS box. This database contains a record of articles published for each preserved title, indexed by year, volume, and issue.<br />
<br />
A query to this database on a production CLOCKSS box yields a list of [[LOCKSS: Basic Concepts#AUID|AU identifiers (AUIDs)]] that identify the associated AUs ([[Definition of AIP|AIPs]] in OAIS terminology) being preserved in the CLOCKSS network. A similar query to an ingest machine identifies any triggered AUs that have yet to emerge from the ingest pipeline (see [[CLOCKSS: Ingest Pipeline]]). The AUIDs are the same on every CLOCKSS box and, as described in [[Definition of AIP]], uniquely identify the location on each box of that AU (AIP). For content that was harvested from a publisher's website, a single AU will normally contain all the files for a given volume or year. For content obtained via file transfer from publishers, content is organized into AUs that contain titles, volumes, and issues from the same publisher that were ingested in a given calendar year.<br />
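The lookup can be sketched as below. The schema, table and column names here are hypothetical stand-ins; the actual database is described in [[LOCKSS: Metadata Database]].

```python
import sqlite3

# Hypothetical miniature of the bibliographic metadata database,
# just to illustrate the title -> AUID lookup.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (auid TEXT, title TEXT, year INT, volume TEXT)")
db.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        ("plugin&base_url~...&volume_name~6",
         "Advances in Building Energy Research", 2012, "6"),
        ("plugin&base_url~...&volume_name~7",
         "Advances in Building Energy Research", 2013, "7"),
        ("otherplugin&base_url~...&volume_name~3",
         "Some Other Journal", 2012, "3"),
    ],
)
# Find the distinct AUIDs holding the triggered title.
auids = [row[0] for row in db.execute(
    "SELECT DISTINCT auid FROM articles WHERE title = ?",
    ("Advances in Building Energy Research",),
)]
```

The resulting AUIDs identify the same AUs on every box in the network.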
<br />
== Extracting the Triggered Content ==<br />
<br />
Once the AU has been identified, the triggered content must be extracted from the CLOCKSS preservation network. If the content has already been released to the production CLOCKSS boxes, the content is extracted from the Stanford production CLOCKSS box (<tt>clockss-stanford.clockss.org</tt>) because it is the one most available to the technical staff. Content that is still in the ingest pipeline is flushed through the [[CLOCKSS: Ingest Pipeline]] to the production boxes and triggered from there.<br />
<br />
For harvested content the extraction process involves locating the directories in the CLOCKSS box repository that correspond to each AU of the triggered content. Each of the hierarchies under these directories is included in the DIP. If all content is on the Stanford CLOCKSS box, it can be exported directly into zip or tar files using the LOCKSS daemon's export functionality.<br />
<br />
For file transfer content, the extraction process involves locating the directories in the CLOCKSS box that correspond to the AUs that contain the triggered content. Within these directories are either sub-directories or archive files that contain the content. These directories or archive files are copied from the repository to a server where the triggered content will be prepared for publication.<br />
<br />
In both cases, a check is performed in case some of the extracted content includes material that is recorded as having been retracted or withdrawn, as described in [[CLOCKSS: Ingest Pipeline#Errata, Corrections and Retractions|CLOCKSS: Ingest Pipeline]]. If so, that content is excluded before the trigger process continues.<br />
<br />
== Preparing File Transfer Content for Dissemination ==<br />
<br />
File transfer content is typically a collection of files that include PDFs of articles, XML formatted full-text files, supporting files such as images and multi-media, and metadata files that contain bibliographic information. These files are input to the processes of the publishing platform that displays the publisher's content to readers. The exact content of the collection varies between publishers. Typically, further processing involving the following steps is required to extract the content from the directories or archive files that were copied from the CLOCKSS box repository, and generate from them a web site:<br />
* Running the script that verifies the MD5 checksums stored with the content.<br />
* Unpacking any archive files that contain the content into their own directories and isolating the files that correspond to the triggered content. Publishers tend to group files for individual issues into their own directories, so isolating the files involves retaining only those directories that correspond to the content being triggered and discarding directories for other content.<br />
* Adding any full-text PDF files unmodified to the web site.<br />
* Adding other files, such as images and multi-media, to the web site.<br />
* Rendering any XML files into readable full-text HTML pages.<br />
* Generating HTML abstract pages linking to the full-text PDF and HTML pages.<br />
* Inserting article-level bibliographic metadata into the abstract pages' <meta> tags.<br />
* Creating article-level metadata files in standard forms such as RIS and BibTeX, and linking to them from the abstract pages.<br />
* Generating an HTML issue table of contents page with links to the article abstract pages.<br />
* Generating a volume table of contents page with links to the issue table of contents pages.<br />
* Generating a journal table of contents page with links to the volume table of contents pages. <br />
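The checksum-verification step above can be sketched as follows. This is a minimal stand-in for the publisher-specific verification script; file and function names are illustrative.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    # Stream the file so large PDFs and archives need not fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    # Compare against the checksum delivered alongside the content.
    return md5_of(path) == expected_hex

# Self-contained demonstration on a throwaway temporary file.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"example content")
tmp.close()
ok = verify(tmp.name, hashlib.md5(b"example content").hexdigest())
bad = verify(tmp.name, "0" * 32)
os.unlink(tmp.name)
```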
This process takes place on a temporary ''preparation server''. Here is a sample generated issue table of contents page:<br />
<br />
[[File:Triggering_Content-1.png|thumb|center|Generated issue table of contents]]<br />
<br />
Here is a sample generated article abstract page:<br />
<br />
[[File:Triggering_Content-2.png|thumb|center|Generated article abstract]]<br />
<br />
Here is an example of Dublin Core article-level metadata included in the generated abstract page shown above:<br />
<pre style="font-size:12px"><br />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><br />
<meta name="dc.Format" content="text/HTML"><br />
<meta name="dc.Publisher" content="Taylor &amp; Francis"><br />
<meta name="dc.Title" content="Implementation and Assessment of an Introductory Pharmacy Practice Course Sequence "><br />
<meta name="dc.Identifier" scheme="coden" content="Journal of Pharmacy Teaching, Vol. 14, No. 1, 2007: pp. 5–17"><br />
<meta name="dc.Identifier" scheme="doi" content="10.1300/J060v14n01_02"><br />
<meta name="dc.Date" content="31Oct2007"><br />
<meta name="dc.Creator" content=" Assistant Professor and coordinator Dr. Emily W. Evans Pharm.D. and AE-C and CDM"><br />
<meta name="keywords" content=", Education, pharmacy practice, laboratory education"><br />
</pre><br />
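Generating such tags from a bibliographic record can be sketched as below; the function and the record fields are illustrative, not the actual preparation tooling.

```python
from html import escape

def dc_meta(name, content):
    # Emit one Dublin Core <meta> tag of the form shown above,
    # HTML-escaping the content value.
    return '<meta name="dc.%s" content="%s">' % (name, escape(content, quote=True))

record = {
    "Publisher": "Taylor & Francis",
    "Identifier": "10.1300/J060v14n01_02",
}
tags = [dc_meta(k, v) for k, v in record.items()]
```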
<br />
The result of this process is a directory hierarchy capable of being exported by a web server such as Apache. Initially, this was the form in which file transfer content was disseminated to the re-publishing sites, which used Apache directly. Now, the content is exported by Apache running on the preparation server, and ingested in the normal way by a LOCKSS daemon, after which the generated directory hierarchy can be deleted. This results in a set of AUs analogous to those which would have been obtained had the LOCKSS daemon ingested the content from the original publisher, albeit with a different look and feel. As time permits, early triggered content will be re-disseminated using this technique, as it provides better compatibility with link resolvers and other library systems.<br />
<br />
== Assembling the DIP ==<br />
<br />
The AUs to be re-published can be collected from the selected CLOCKSS box and ingest machines (for harvested content) or from the preparation server (for file transfer content) and converted to a compressed archive using <tt>zip</tt> or <tt>tar</tt>. This forms a [[Definition of DIP|DIP]] that can be transferred to the re-publishing sites using <tt>rsync</tt>, <tt>sftp</tt> or <tt>scp</tt>.<br />
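The archive-assembly step can be sketched with Python's tarfile module; directory names here are illustrative, and the real process may equally use <tt>zip</tt> or command-line <tt>tar</tt> as noted above.

```python
import os
import tarfile
import tempfile

def make_dip(au_dirs, out_path):
    # Bundle the extracted AU directories into one gzip-compressed
    # archive: the DIP shipped to the re-publishing sites.
    with tarfile.open(out_path, "w:gz") as tar:
        for d in au_dirs:
            tar.add(d, arcname=os.path.basename(d))

# Demonstration with a throwaway directory standing in for an AU.
root = tempfile.mkdtemp()
au = os.path.join(root, "au1")
os.makedirs(au)
with open(os.path.join(au, "index.html"), "w") as f:
    f.write("hi")
dip = os.path.join(root, "dip.tar.gz")
make_dip([au], dip)
with tarfile.open(dip) as t:
    members = t.getnames()
```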
<br />
== Re-publishing the Triggered Content ==<br />
<br />
Currently, two sites re-publish triggered content, [http://triggered.stanford.clockss.org Stanford University] and [http://triggered.edina.clockss.org EDINA at the University of Edinburgh]. Both use a combination of an Apache web server and a LOCKSS daemon to do so, although, as described in [[Definition of AIP]], the structure of AUs allows easy access by other tools to the content and metadata. Other ways to re-publish the content delivered as this type of [[Definition of DIP|DIP]] are easy to envisage, such as a shell script to convert it to an Apache web site.<br />
<br />
The technique for re-publishing newly triggered content delivered as this type of [[Definition of DIP|DIP]] used by the current re-publishing sites is as follows. On each of the re-publishing machines:<br />
* Configure the appropriate [[LOCKSS: Basic Concepts#Plugins|plugin]] and [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entries, by updating the triggered-content configuration and plugin repositories on <tt>props.lockss.org</tt> (see [[LOCKSS: Property Server Operations]]).<br />
* Force the LOCKSS daemon to reload its configuration and plugins from the repository.<br />
* Unpack the AUs into the repository hierarchy of the LOCKSS daemon.<br />
* Make a visual check of the resulting website.<br />
<br />
The final step is to add an entry for the newly triggered title to the index of triggered titles on the “Triggered Content” section of the CLOCKSS website.<br />
<br />
[[File:Triggering_Content-3.png|thumb|center|Triggered titles index]]<br />
<br />
The entry points to a new landing page for the title that provides information about the title, publication history, and triggering process. It also includes links to the issues hosted at Edina and Stanford.<br />
<br />
[[File:Triggering_Content-4.png|thumb|center|Triggered title landing page]]<br />
<br />
Because the triggered content carries Creative Commons licenses, other institutions can also re-publish it. For example, here is [http://web.archive.org/web/*/http://www.clockss.org/clockss/Annals_of_Clinical_Psychiatry <i>Annals of Clinical Psychiatry</i>] at the [http://www.archive.org Internet Archive].<br />
<br />
== Re-registering Triggered Article DOIs ==<br />
<br />
Most publishers register individual article Digital Object Identifiers (DOIs) with a registration agency of the International DOI Foundation, such as [http://www.crossref.org CrossRef]. Once a title is no longer available from the publisher, the registration records for the articles should be updated to refer to the content hosted at Edina and Stanford. This is done by preparing a tab-separated file with the DOI for each article and the corresponding new URL. A separate file is required for the articles hosted at Edina and for those hosted at Stanford. The registrar uses the data in this file to update the records for the corresponding DOIs.<br />
<br />
For content that is being served by the two re-publishing servers, this task is simple because their LOCKSS daemon provides an OpenURL resolver, allowing access to articles via their DOIs. The DOI information is available from the article metadata, via the DOI link in the daemon UI. Here is a portion of the file for a title being hosted at Edina that can be sent to the DOI registrar.<br />
<br />
<pre><br />
10.1300/J060v09n02_04 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_04 <br />
10.1300/J060v09n02_05 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_05 <br />
10.1300/J060v09n02_07 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_07 <br />
10.1300/J060v09n02_08 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_08 <br />
10.1300/J060v09n01_02 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_02 <br />
10.1300/J060v09n01_03 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_03 <br />
</pre><br />
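Producing the lines of this file can be sketched as below. The hostname is one of the real re-publishing sites; the function name is illustrative.

```python
def doi_update_line(doi, host="http://triggered.edina.clockss.org"):
    # One tab-separated record: the DOI, then the OpenURL-style
    # ServeContent URL that resolves it on the re-publishing server
    # (format taken from the sample file above).
    return "%s\t%s/ServeContent?rft_id=info:doi/%s" % (doi, host, doi)

line = doi_update_line("10.1300/J060v09n02_04")
```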
<br />
== Recording the Dissemination ==<br />
<br />
Generation of a DIP is performed only at the direction of the CLOCKSS board; the decision is recorded in the board's minutes.<br />
<br />
== Configuring the Re-publishing Systems ==<br />
<br />
The current re-publishing sites each comprise a LOCKSS daemon and an Apache web server. Both currently reside on the same machine, but they could also be hosted on separate machines. The Apache web server is running on the standard HTTP port 80, while the LOCKSS daemon is serving content on port 8082. External access to the LOCKSS daemon is limited to administrative uses and the Apache server on the system where it is running.<br />
<br />
The Apache server runs a proxy module that forwards requests matching certain patterns to the CLOCKSS daemon's content server. A triggered content request is first handled by the node's Apache web server and checked against the ProxyPassMatch filter; if the request matches a filter pattern, it is passed to the LOCKSS daemon's ServeContent servlet.<br />
<br />
[[File:Triggering_Content-5.png]]<br />
<br />
If the requested content is in the AUs of the LOCKSS daemon, the page is returned to the Apache web server and then to the user. Otherwise, the daemon returns an HTTP error response, and the Apache web server returns a page that indicates the content was not preserved. This configuration serves two purposes. The first is to provide additional processing capabilities beyond those available from the LOCKSS daemon. The second is so that external access to the re-published content can be made through a standard port, an important consideration for many IT firewall configurations.<br />
<br />
For file transfer content that was triggered using the earlier technique and has yet to be updated, the Apache server acts as a normal web server that responds to requests for the prepared content.<br />
<br />
[[File:Triggering_Content-6.png]]<br />
<br />
=== Virtual machine requirements ===<br />
<br />
Re-publishing hosts have modest requirements:<br />
<br />
* Processor: Single core CPU at 2GHz or better<br />
* Memory: 1GB<br />
* Storage: 40GB<br />
* Network: 10/100 Mbit<br />
<br />
=== Setting up the CLOCKSS daemon ===<br />
<br />
''Re-publishing hostconfig:''<br />
<br />
The LOCKSS daemons at the re-publishing sites are part of the clockss-triggered preservation group. When running hostconfig on a new CLOCKSS Triggered Content Node, care needs to be taken to configure it as such:<br />
<br />
<pre><br />
Props URL: http://props.lockss.org:8001/clockss-triggered/lockss.xml<br />
Preservation Group: clockss-triggered<br />
</pre><br />
<br />
''Re-publishing ServeContent servlet:''<br />
<br />
The LOCKSS daemon needs to have its ServeContent servlet enabled on port 8082. This is done by logging into the LOCKSS daemon's administrative UI, clicking on "Content Access Options" then "Content Server Options" and then checking "Enable content server on port 8082".<br />
<br />
=== Setting up the Apache server ===<br />
<br />
The following configuration is used for the Apache server at each re-publishing site:<br />
<br />
<pre><br />
NameVirtualHost triggered.SITE.clockss.org:80<br />
<VirtualHost triggered.SITE.clockss.org:80><br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/html<br />
ServerName triggered.SITE.clockss.org<br />
<IfModule mod_proxy.c><br />
ProxyRequests Off<br />
ProxyVia On<br />
<Proxy triggered.SITE.clockss.org/*><br />
AddDefaultCharset off<br />
Order deny,allow<br />
Allow from all<br />
</Proxy><br />
ProxyPassMatch ^/((ServeContent|images).*)$ http://localhost:8082/$1<br />
ProxyErrorOverride On<br />
ErrorDocument 404 /not-preserved.html<br />
</IfModule><br />
ErrorLog logs/error_log<br />
CustomLog logs/access_log common<br />
</VirtualHost><br />
</pre><br />
<br />
=== Managing Re-publishing Sites ===<br />
<br />
The re-publishing site should follow the [[CLOCKSS: Box Operations|security, maintenance and upgrade guidelines for CLOCKSS boxes]].<br />
<br />
=== Balancing load across CLOCKSS triggered content nodes ===<br />
<br />
A load balancing Apache server is also configured that provides a single point of access to the re-publishing servers at [http://triggered.edina.clockss.org EDINA] and [http://triggered.stanford.clockss.org Stanford]. It is [http://triggered.clockss.org here]. This additional Apache instance serves two purposes: the first is to provide a single URL for services that can accept only a single URL, such as link resolvers. The second purpose is to ensure high availability of the triggered content.<br />
<br />
[[File:Triggering_Content-7.png]]<br />
<br />
Here is the configuration file for this load balancing Apache server.<br />
<pre><br />
<VirtualHost *><br />
ServerName triggered.clockss.org<br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/clockss-triggered<br />
<IfModule mod_proxy.c><br />
ProxyRequests off<br />
ProxyVia On<br />
ProxyPassMatch ^/((ServeContent|images).*)$ balancer://triggered-pool/$1<br />
<Proxy balancer://triggered-pool><br />
BalancerMember http://triggered.edina.clockss.org/<br />
BalancerMember http://triggered.stanford.clockss.org/<br />
ProxySet lbmethod=byrequests<br />
</Proxy><br />
</IfModule><br />
CustomLog /var/log/apache2/access.log combined<br />
ErrorLog /var/log/apache2/error.log<br />
ServerSignature On<br />
</VirtualHost> <br />
</pre><br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Content Lead<br />
** CLOCKSS Network Administrator<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[Definition of AIP]]<br />
# [[Definition of DIP]]<br />
# [https://www.clockss.org/clockss/Triggered_Content CLOCKSS Triggered Content]<br />
# [[LOCKSS: Format Migration]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Property Server Operations]]<br />
# [[CLOCKSS: Box Operations]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Basic_ConceptsLOCKSS: Basic Concepts2014-06-20T14:21:28Z<p>Dshr: Section added based on discussion of draft certification report</p>
<hr />
<div>= LOCKSS: Basic Concepts =<br />
<br />
This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.<br />
<br />
== LOCKSS Program ==<br />
<br />
The LOCKSS Program started on [http://blog.dshr.org/2013/10/it-was-fifteen-years-ago-today.html 4 October 1998] under the auspices of Stanford University Libraries to develop and support the LOCKSS technology. It was funded initially by a small grant from Michael Lesk at the NSF, and then supported by the Andrew W. Mellon Foundation, the NSF and Sun Microsystems. It transitioned from grant funding to the "Red Hat" model of free, open source software and paid support thanks to a matching grant from the Mellon Foundation, and has been financially stable since 2008 on that basis. Although program staff are Stanford employees, at no time has Stanford provided any financial support for the program. The program pays all staff and operational costs, Stanford indirect costs, and an "occupancy charge" for its office space.<br />
<br />
== LOCKSS Daemon ==<br />
<br />
The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. The LOCKSS daemon is the only application program that runs in a LOCKSS box. Every action of a LOCKSS box, for ingest, preservation, dissemination and administration, is performed by the LOCKSS daemon. The LOCKSS daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor the daemon's performance. Among the functions performed by the LOCKSS daemon are:<br />
* Ingest via Web crawling, or file import.<br />
* Preservation via the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol].<br />
* Dissemination by acting as both a Web server and a Web proxy, and by file export.<br />
* Administration via a Web user interface.<br />
* Status and statistics reporting.<br />
Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves ''files''. It preserves ''URLs'': their content and their associated headers (metadata) as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.<br />
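The (content, header) pairing can be pictured as a simple record type. This is a conceptual sketch only, not the daemon's actual Java classes.

```python
from dataclasses import dataclass, field

@dataclass
class PreservedUrl:
    # The daemon's unit of preservation: a URL together with its
    # content and the headers (metadata) captured at ingest.
    url: str
    headers: dict = field(default_factory=dict)
    content: bytes = b""

page = PreservedUrl(
    url="http://www.example.com/toc.html",
    headers={"Content-Type": "text/html"},
    content=b"<html>...</html>",
)
```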
<br />
== Preservation ==<br />
<br />
The LOCKSS system is sometimes criticized as providing only bit-level preservation, but this is a misunderstanding. The system employs exactly the same techniques (and in most cases exactly the same software tools) as other preservation systems, including:<br />
* Format identification, [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|using FITS software]]<br />
* Format verification, [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|using FITS software]].<br />
* Format migration, see [[LOCKSS: Format Migration]].<br />
* Metadata creation, see [[LOCKSS: Extracting Bibliographic Metadata]] and [[LOCKSS: Metadata Database]].<br />
The difference between LOCKSS and most other preservation systems lies not in ''what'' techniques are employed but in ''when'' those techniques are employed. In the interest of economy, the LOCKSS system stores only the original bits, and delays all operations on them except integrity checking as long as possible. So, for example, unlike systems that preemptively migrate content in bulk from formats that are not yet obsolete into formats presumed to be less obsolete, thereby consuming processing resources, and that store both the original and the migrated copies, thereby consuming storage resources, LOCKSS migrates the formats of individual files only, and only when a reader's request indicates that migration of that file is necessary. To save storage, the migrated version is discarded when no longer needed. This capability was [http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html demonstrated in 2005] but has remained unused in practice because the formats of content preserved in the LOCKSS system are [[LOCKSS: Format Migration#Obsolescence of Web Formats|not going obsolete]].<br />
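The on-access policy can be sketched as follows. The store and converter here are hypothetical placeholders; the point is that only the original bits persist, and a migrated copy exists only for the duration of a request.

```python
def serve(url, wanted_format, store, convert):
    # Only the original bits are kept in the store; a migrated copy
    # is produced per request and never written back, so storage
    # holds exactly one (original) version of each URL.
    fmt, data = store[url]
    if fmt == wanted_format:
        return data
    return convert(data, fmt, wanted_format)  # transient, discarded after use

store = {"http://example.com/a": ("old/format", b"payload")}
out = serve("http://example.com/a", "new/format", store,
            lambda data, src, dst: b"converted:" + data)
```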
<br />
== Plugins ==<br />
<br />
The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of particular content it is to preserve. This is done via the "plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".<br />
<br />
== Archival Units ==<br />
<br />
Here is an [[Media:ClockssTaylorAndFrancisPlugin.xml.pdf|example plugin XML file]] for Taylor and Francis journals. It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs) representing, typically, a year or a volume of a journal. Each AU is defined by the plugin class name, in this case <tt>org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin</tt> and a set of definitional parameters defined by the XML file, in this case:<br />
* <tt>base_url</tt><br />
* <tt>journal_id</tt><br />
* <tt>volume_name</tt><br />
<br />
For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at <tt>au_start_url</tt>:<br />
<pre><br />
<entry><br />
<string>au_start_url</string><br />
<string>&quot;%sclockss/%s/%s/index.html&quot;, base_url, journal_id, volume_name</string><br />
</entry><br />
</pre><br />
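The printf-style template expands with an AU's definitional parameters. Using the values from the TDB entry for Advances in Building Energy Research volume 6 shown below:

```python
# printf-style expansion of au_start_url with one AU's definitional
# parameters (values taken from the TDB entry for Advances in
# Building Energy Research, volume 6).
template = "%sclockss/%s/%s/index.html"
start_url = template % ("http://www.tandfonline.com/", "taer20", "6")
```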
<br />
== Title Database ==<br />
<br />
The values for these parameters come from the Title Data Base (TDB), which is not actually a database, but a knowledge base represented as a set of text files in an easy-to-edit syntax that are processed into an XML file that is obtained by the LOCKSS daemon. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each of the parameters defined by that plugin class that are different from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:<br />
<pre><br />
{<br />
<br />
publisher <<br />
name = Taylor & Francis ;<br />
info[contract] = 2008 ;<br />
info[tester] = A<br />
><br />
<br />
plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin<br />
param[base_url] = http://www.tandfonline.com/<br />
implicit < status ; status2 ; year ; name ; param[volume_name] ><br />
...<br />
<br />
{<br />
<br />
title <<br />
name = Advances in Building Energy Research ;<br />
issn = 1751-2549 ;<br />
eissn = 1756-2201 ;<br />
issnl = 1751-2549<br />
><br />
<br />
param[journal_id] = taer20<br />
<br />
au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 ><br />
au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 ><br />
au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 ><br />
au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 ><br />
au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 ><br />
au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 ><br />
au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 ><br />
au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 ><br />
<br />
}<br />
...<br />
}<br />
</pre><br />
<br />
The definitional parameters are specified as follows:<br />
* <tt>base_url</tt> is <tt>http://www.tandfonline.com/</tt> for all Taylor and Francis journals specified at the top.<br />
* <tt>journal_id</tt> is <tt>taer20</tt> specified in the section for Advances in Building Energy Research.<br />
* <tt>volume_name</tt> is specified by the 5th column of the table.<br />
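The role of the <tt>implicit</tt> declaration can be sketched as follows: each <tt>au</tt> row is a tuple whose fields are named by the implicit column header (values copied from the entry above).

```python
# The implicit declaration names the columns of every au row.
implicit = ["status", "status2", "year", "name", "param[volume_name]"]
row = ["finished", "crawling", "2012",
       "Advances in Building Energy Research Volume 6", "6"]
au = dict(zip(implicit, row))
# The AU's volume_name parameter comes from the 5th column.
```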
<br />
The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the [[LOCKSS: Property Server Operations|Property Server]] and its backup in the Amazon cloud.<br />
<br />
== AUID ==<br />
<br />
Everywhere an AU needs to be uniquely identified, we use an internal name, its Archival Unit ID (AUID) as the means to do so, for example as a key in maps and databases, or in the messages of the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol]. The AUID for an AU is an immutable string containing an encoded representation of:<br />
* The fully-qualified Java class name of the plugin.<br />
* For each of the definitional parameters defined by the plugin XML:<br />
** The parameter name.<br />
** The parameter value.<br />
Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on.<br />
<br />
The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above, and used as the example in [http://documents.clockss.org/index.php/Definition_of_AIP#Harvest_AU Definition of AIP] is:<br />
<pre><br />
org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6<br />
</pre><br />
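A sketch that reproduces the example AUID above. The exact safe-character set and the alphabetical parameter ordering are inferred from the example rather than taken from the daemon's source, so treat them as assumptions.

```python
def encode(s):
    # Percent-encode everything except letters, digits, and underscore;
    # note that '.' becomes %2E (inferred from the example AUID).
    return "".join(c if (c.isalnum() or c == "_") else "%%%02X" % ord(c)
                   for c in s)

def auid(plugin_class, params):
    # The plugin class name with dots replaced by '|', followed by the
    # definitional parameters, sorted by name, as name~value pairs
    # joined with '&'.
    parts = [plugin_class.replace(".", "|")]
    for name in sorted(params):
        parts.append("%s~%s" % (encode(name), encode(params[name])))
    return "&".join(parts)

example = auid(
    "org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin",
    {"base_url": "http://www.tandfonline.com/",
     "journal_id": "taer20",
     "volume_name": "6"},
)
```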
<br />
Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Extracting_Triggered_ContentCLOCKSS: Extracting Triggered Content2014-06-20T14:16:37Z<p>Dshr: /* CLOCKSS: Extracting Triggered Content */ Clarification resulting from discussion of draft certification report</p>
<hr />
<div>= CLOCKSS: Extracting Triggered Content =<br />
<br />
The CLOCKSS board may decide to trigger content from the CLOCKSS archive when it is no longer available from any publisher. Reasons for doing so include:<br />
* A publisher discontinuing an entire title.<br />
* A publisher who has acquired a title deciding not to re-host a run of back issues.<br />
* Disaster.<br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS Participating Publisher Agreement] paragraph 4.E states:<br />
<blockquote><br />
Released Content use terms and restrictions will be determined by an accompanying Creative Commons license (or equivalent license) chosen either by Publisher or, if Publisher fails to respond within thirty (30) days following receipt by it of the notice described above, by the CLOCKSS Board.<br />
</blockquote><br />
At present, EDINA and Stanford have volunteered to re-publish triggered content, but it is important to note that the Creative Commons license means that anyone can re-publish such content, and that re-publishing is not a core function of the CLOCKSS archive.<br />
<br />
Once the CLOCKSS board notifies the LOCKSS Executive Director that a trigger event has occurred, the CLOCKSS Metadata Lead and assigned LOCKSS staff run a process that delivers the triggered content to the re-publishing sites, and updates a [https://www.clockss.org/clockss/Triggered_Content section of the CLOCKSS website] with links pointing to the re-published content at both sites. In OAIS terminology, the package delivered to the re-publishing sites is a [[Definition of DIP|Distribution Information Package (DIP)]].<br />
<br />
Additionally, a publisher with content in CLOCKSS can request to be supplied with a copy. In such a case, the process described below is followed, but the content is delivered to the requesting publisher rather than to a re-publishing site.<br />
<br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement] specifies that content must be unavailable for 6 months before it is triggered. It is important to observe that this delay is mandatory when content is being triggered ''without the consent of the publisher''. In all cases so far content has been triggered ''at the request of the publisher'', and the technical process described below has taken 2-4 weeks.<br />
<br />
== Overview of the Trigger Process ==<br />
<br />
The trigger process involves: <br />
# identifying the triggered content in the CLOCKSS preservation network,<br />
# extracting the triggered content from the CLOCKSS preservation network, <br />
# preparing the content for publication on the triggered content machines,<br />
# re-hosting triggered content on the CLOCKSS triggered content site,<br />
# re-registering triggered article DOIs.<br />
The following sections describe the process as it is currently implemented. It is possible that content triggered in the future might contain files whose format has become obsolete. The additional processes that would be needed in this situation are described in [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|LOCKSS: Format Migration]].<br />
<br />
The CLOCKSS Metadata Lead is responsible for this process.<br />
<br />
== Identifying the Triggered Content ==<br />
<br />
The triggering process begins by identifying the triggered content within the CLOCKSS preservation network. First, the [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]] that contain the title are identified by querying the [[LOCKSS: Metadata Database|database of bibliographic metadata]] that is compiled and maintained by each CLOCKSS box. This database contains a record of articles published for each preserved title, indexed by year, volume, and issue.<br />
<br />
A query to this database on a production CLOCKSS box yields a list of [[LOCKSS: Basic Concepts#AUID|AU identifiers (AUIDs)]] that identify the associated AUs ([[Definition of AIP|AIPs]] in OAIS terminology) being preserved in the CLOCKSS network. A similar query to an ingest machine identifies any triggered AUs that have yet to emerge from the ingest pipeline (see [[CLOCKSS: Ingest Pipeline]]). The AUIDs are the same on every CLOCKSS box and, as described in [[Definition of AIP]], uniquely identify the location on each box of that AU (AIP). For content that was harvested from a publisher's website, a single AU will normally contain all the files for a given volume or year. Content obtained via file transfer from publishers is organized into AUs that contain titles, volumes, and issues from the same publisher that were ingested in a given calendar year.<br />
<br />
== Extracting the Triggered Content ==<br />
<br />
Once the AUs have been identified, the triggered content must be extracted from the CLOCKSS preservation network. If the content has already been released to the production CLOCKSS boxes, it is extracted from the Stanford production CLOCKSS box (<tt>clockss-stanford.clockss.org</tt>) because it is the one most available to the technical staff. Content that is still in the ingest pipeline is flushed through the [[CLOCKSS: Ingest Pipeline]] to the production boxes and triggered from there.<br />
<br />
For harvested content the extraction process involves locating the directories in the CLOCKSS box repository that correspond to each AU of the triggered content. Each of the hierarchies under these directories is included in the DIP. If all content is on the Stanford CLOCKSS box, it can be exported directly into zip or tar files using the LOCKSS daemon's export functionality.<br />
<br />
For file transfer content, the extraction process involves locating the directories in the CLOCKSS box that correspond to the AUs that contain the triggered content. Within these directories are either sub-directories or archive files that contain the content. These directories or archive files are copied from the repository to a server where the triggered content will be prepared for publication.<br />
<br />
In both cases, a check is performed to determine whether any of the extracted content includes material that is recorded as having been retracted or withdrawn, as described in [[CLOCKSS: Ingest Pipeline#Errata, Corrections and Retractions|CLOCKSS: Ingest Pipeline]]. If so, that content is excluded before the trigger process continues.<br />
<br />
== Preparing File Transfer Content for Dissemination ==<br />
<br />
File transfer content is typically a collection of files that include PDFs of articles, XML formatted full-text files, supporting files such as images and multi-media, and metadata files that contain bibliographic information. These files are input to the processes of the publishing platform that displays the publisher's content to readers. The exact content of the collection varies between publishers. Typically, further processing involving the following steps is required to extract the content from the directories or archive files that were copied from the CLOCKSS box repository, and generate from them a web site:<br />
* Running the script that verifies the MD5 checksums stored with the content.<br />
* Unpacking any archive files that contain the content into their own directories and isolating the files that correspond to the triggered content. Publishers tend to group files for individual issues into their own directories, so isolating the files involves retaining only those directories that correspond to the content being triggered and discarding directories for other content.<br />
* Adding any full-text PDF files unmodified to the web site.<br />
* Adding other files, such as images and multi-media, to the web site.<br />
* Rendering any XML files into readable full-text HTML pages.<br />
* Generating HTML abstract pages linking to the full-text PDF and HTML pages.<br />
* Inserting article-level bibliographic metadata into the abstract pages' <meta> tags.<br />
* Creating article-level metadata files in standard forms such as RIS and BibTeX, and linking to them from the abstract pages.<br />
* Generating an HTML issue table of contents page with links to the article abstract pages.<br />
* Generating a volume table of contents page with links to the issue table of contents pages.<br />
* Generating a journal table of contents page with links to the volume table of contents pages. <br />
This process takes place on a temporary ''preparation server''. Here is a sample generated issue table of contents page:<br />
<br />
[[File:Triggering_Content-1.png|thumb|center|Generated issue table of contents]]<br />
<br />
Here is a sample generated article abstract page:<br />
<br />
[[File:Triggering_Content-2.png|thumb|center|Generated article abstract]]<br />
<br />
Here is an example of Dublin Core article-level metadata included in the generated abstract page shown above:<br />
<pre style="font-size:12px"><br />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><br />
<meta name="dc.Format" content="text/HTML"><br />
<meta name="dc.Publisher" content="Taylor &amp; Francis"><br />
<meta name="dc.Title" content="Implementation and Assessment of an Introductory Pharmacy Practice Course Sequence "><br />
<meta name="dc.Identifier" scheme="coden" content="Journal of Pharmacy Teaching, Vol. 14, No. 1, 2007: pp. 5–17"><br />
<meta name="dc.Identifier" scheme="doi" content="10.1300/J060v14n01_02"><br />
<meta name="dc.Date" content="31Oct2007"><br />
<meta name="dc.Creator" content=" Assistant Professor and coordinator Dr. Emily W. Evans Pharm.D. and AE-C and CDM"><br />
<meta name="keywords" content=", Education, pharmacy practice, laboratory education"><br />
</pre><br />
<br />
The result of this process is a directory hierarchy capable of being exported by a web server such as Apache. Initially, this was the form in which file transfer content was disseminated to the re-publishing sites, which used Apache directly. Now, the content is exported by Apache running on the preparation server, and ingested in the normal way by a LOCKSS daemon, after which the generated directory hierarchy can be deleted. This results in a set of AUs analogous to those which would have been obtained had the LOCKSS daemon ingested the content from the original publisher, albeit with a different look and feel. As time permits, early triggered content will be re-disseminated using this technique, as it provides better compatibility with link resolvers and other library systems.<br />
<br />
== Assembling the DIP ==<br />
<br />
The AUs to be re-published can be collected from the selected CLOCKSS box and ingest machines (for harvested content) or from the preparation server (for file transfer content) and converted to a compressed archive using <tt>zip</tt> or <tt>tar</tt>. This forms a [[Definition of DIP|DIP]] that can be transferred to the re-publishing sites using <tt>rsync</tt>, <tt>sftp</tt> or <tt>scp</tt>.<br />
<br />
== Re-publishing the Triggered Content ==<br />
<br />
Currently, two sites re-publish triggered content, [http://triggered.stanford.clockss.org Stanford University] and [http://triggered.edina.clockss.org EDINA at the University of Edinburgh]. Both use a combination of an Apache web server and a LOCKSS daemon to do so although, as described in [[Definition of AIP]], the structure of AUs allows easy access by other tools to the content and metadata. Other ways to re-publish the content delivered as this type of [[Definition of DIP|DIP]] are easy to envisage, such as a shell script to convert it to an Apache web site.<br />
<br />
The technique for re-publishing newly triggered content delivered as this type of [[Definition of DIP|DIP]] used by the current re-publishing sites is as follows. On each of the re-publishing machines:<br />
* Configure the appropriate [[LOCKSS: Basic Concepts#Plugins|plugin]] and [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entries, by updating the triggered-content configuration and plugin repositories on <tt>props.lockss.org</tt> (see [[LOCKSS: Property Server Operations]]).<br />
* Force the LOCKSS daemon to reload its configuration and plugins from the repository.<br />
* Unpack the AUs into the repository hierarchy of the LOCKSS daemon.<br />
* Make a visual check of the resulting website.<br />
<br />
The final step is to add an entry for the newly triggered title to the index of triggered titles in the “Triggered Content” section of the CLOCKSS website.<br />
<br />
[[File:Triggering_Content-3.png|thumb|center|Triggered titles index]]<br />
<br />
The entry points to a new landing page for the title that provides information about the title, publication history, and triggering process. It also includes links to the issues hosted at Edina and Stanford.<br />
<br />
[[File:Triggering_Content-4.png|thumb|center|Triggered title landing page]]<br />
<br />
Because the triggered content carries Creative Commons licenses, other institutions can also re-publish it. For example, here is [http://web.archive.org/web/*/http://www.clockss.org/clockss/Annals_of_Clinical_Psychiatry <i>Annals of Clinical Psychiatry</i>] at the [http://www.archive.org Internet Archive].<br />
<br />
== Re-registering Triggered Article DOIs ==<br />
<br />
Most publishers register individual article Digital Object Identifiers (DOIs) with a registrar sponsored by the International DOI Foundation, such as [http://www.crossref.org CrossRef]. Once a title is no longer available from the publisher, the registration records for the articles should be updated to refer to the content hosted at Edina and Stanford. This is done by preparing a tab-separated file with the DOI for each article and the corresponding new URL. Separate files are required for the articles hosted at Edina and those hosted at Stanford. The registrar uses the data in each file to update the records for the corresponding DOIs.<br />
<br />
For content that is being served by the two re-publishing servers, this task is simple because their LOCKSS daemon provides an OpenURL resolver, allowing access to articles via their DOIs. The DOI information is available from the article metadata, via the DOI link in the daemon UI. Here is a portion of the file for a title being hosted at Edina that can be sent to the DOI registrar.<br />
<br />
<pre><br />
10.1300/J060v09n02_04 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_04 <br />
10.1300/J060v09n02_05 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_05 <br />
10.1300/J060v09n02_07 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_07 <br />
10.1300/J060v09n02_08 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_08 <br />
10.1300/J060v09n01_02 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_02 <br />
10.1300/J060v09n01_03 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_03 <br />
</pre><br />
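Generating such a file is mechanical once the DOIs are known. Here is a sketch; the <tt>doi_url_rows</tt> helper is hypothetical, and only the <tt>ServeContent</tt> URL form is taken from the examples above:<br />

```python
def doi_url_rows(dois, host):
    """Produce one tab-separated '<DOI>\t<URL>' line per article, using
    the OpenURL (rft_id) form of ServeContent URL accepted by the
    re-publishing daemon's resolver."""
    template = "http://%s/ServeContent?rft_id=info:doi/%s"
    return ["%s\t%s" % (doi, template % (host, doi)) for doi in dois]
```

Concatenating the rows with newlines yields the file sent to the DOI registrar.<br />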
<br />
== Recording the Dissemination ==<br />
<br />
Generation of a DIP is performed only at the direction of the CLOCKSS board, which is recorded in their minutes.<br />
<br />
== Configuring the Re-publishing Systems ==<br />
<br />
The current re-publishing sites each comprise a LOCKSS daemon and an Apache web server. Both currently reside on the same machine, but they could also be hosted on separate machines. The Apache web server runs on the standard HTTP port 80, while the LOCKSS daemon serves content on port 8082. External access to the LOCKSS daemon is limited to administrative use and to the Apache server running on the same system.<br />
<br />
The Apache server runs a proxy module that forwards requests matching a certain pattern to the LOCKSS daemon's content server. A triggered content request is first handled by the node's Apache web server and processed through its ProxyPassMatch filter. If the request matches a filter pattern, it is passed to the LOCKSS daemon's ServeContent servlet.<br />
<br />
[[File:Triggering_Content-5.png]]<br />
<br />
If the requested content is in the AUs of the LOCKSS daemon, the page is returned to the Apache web server and then to the user. Otherwise, the daemon returns an HTTP error response, and the Apache web server returns a page that indicates the content was not preserved. This configuration serves two purposes. The first is to provide additional processing capabilities beyond those available from the LOCKSS daemon. The second is so that external access to the re-published content can be made through a standard port, an important consideration for many IT firewall configurations.<br />
<br />
For file transfer content that was triggered using the earlier technique and has yet to be updated, the Apache server acts as a normal web server that responds to requests for the prepared content.<br />
<br />
[[File:Triggering_Content-6.png]]<br />
<br />
=== Virtual machine requirements ===<br />
<br />
<br />
Re-publishing hosts have modest requirements:<br />
<br />
* Processor: single-core CPU at 2 GHz or better<br />
* Memory: 1 GB<br />
* Storage: 40 GB<br />
* Network: 10/100 Mbit/s<br />
<br />
=== Setting up the CLOCKSS daemon ===<br />
<br />
''Re-publishing hostconfig:''<br />
<br />
The LOCKSS daemons at the re-publishing sites are part of the clockss-triggered preservation group. When running hostconfig on a new CLOCKSS Triggered Content Node, care needs to be taken to configure it as such:<br />
<br />
<pre><br />
Props URL: http://props.lockss.org:8001/clockss-triggered/lockss.xml<br />
Preservation Group: clockss-triggered<br />
</pre><br />
<br />
''Re-publishing ServeContent servlet:''<br />
<br />
The LOCKSS daemon needs to have its ServeContent servlet enabled on port 8082. This is done by logging into the LOCKSS daemon's administrative UI, clicking on "Content Access Options" then "Content Server Options" and then checking "Enable content server on port 8082".<br />
<br />
=== Setting up the Apache server ===<br />
<br />
The following configuration is used for the Apache server at each re-publishing site:<br />
<br />
<pre><br />
NameVirtualHost triggered.SITE.clockss.org:80<br />
<VirtualHost triggered.SITE.clockss.org:80><br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/html<br />
ServerName triggered.SITE.clockss.org<br />
<IfModule mod_proxy.c><br />
ProxyRequests Off<br />
ProxyVia On<br />
<Proxy triggered.SITE.clockss.org/*><br />
AddDefaultCharset off<br />
Order deny,allow<br />
Allow from all<br />
</Proxy><br />
ProxyPassMatch ^/((ServeContent|images).*)$ http://localhost:8082/$1<br />
ProxyErrorOverride On<br />
ErrorDocument 404 /not-preserved.html<br />
</IfModule><br />
ErrorLog logs/error_log<br />
CustomLog logs/access_log common<br />
</VirtualHost><br />
</pre><br />
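To see which requests the <tt>ProxyPassMatch</tt> rule forwards to the daemon, the same pattern can be exercised in Python. This is only an illustration; Apache applies the pattern to the URL path and handles the query string separately, and its own matching semantics are authoritative.<br />

```python
import re

# The pattern from the ProxyPassMatch line in the configuration above.
PATTERN = re.compile(r"^/((ServeContent|images).*)$")

def forwards_to_daemon(path):
    """Return the daemon URL a request path would be proxied to, or None
    if the request falls through to Apache's DocumentRoot (and, on 404,
    the not-preserved.html error document)."""
    m = PATTERN.match(path)
    return "http://localhost:8082/" + m.group(1) if m else None
```
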
<br />
=== Managing Re-publishing Sites ===<br />
<br />
The re-publishing site should follow the [[CLOCKSS: Box Operations|security, maintenance and upgrade guidelines for CLOCKSS boxes]].<br />
<br />
=== Balancing load across CLOCKSS triggered content nodes ===<br />
<br />
A load balancing Apache server is also configured that provides a single point of access to the re-publishing servers at [http://triggered.edina.clockss.org EDINA] and [http://triggered.stanford.clockss.org Stanford]. It is [http://triggered.clockss.org here]. This additional Apache instance serves two purposes: the first is to provide a single URL for services that can accept only a single URL, such as link resolvers. The second purpose is to ensure high availability of the triggered content.<br />
<br />
[[File:Triggering_Content-7.png]]<br />
<br />
Here is the configuration file for this load balancing Apache server.<br />
<pre><br />
<VirtualHost *><br />
ServerName triggered.clockss.org<br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/clockss-triggered<br />
<IfModule mod_proxy.c><br />
ProxyRequests off<br />
ProxyVia On<br />
ProxyPassMatch ^/((ServeContent|images).*)$ balancer://triggered-pool/$1<br />
<Proxy balancer://triggered-pool><br />
BalancerMember http://triggered.edina.clockss.org/<br />
BalancerMember http://triggered.stanford.clockss.org/<br />
ProxySet lbmethod=byrequests<br />
</Proxy><br />
</IfModule><br />
CustomLog /var/log/apache2/access.log combined<br />
ErrorLog /var/log/apache2/error.log<br />
ServerSignature On<br />
</VirtualHost> <br />
</pre><br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Content Lead<br />
** CLOCKSS Network Administrator<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[Definition of AIP]]<br />
# [[Definition of DIP]]<br />
# [https://www.clockss.org/clockss/Triggered_Content CLOCKSS Triggered Content]<br />
# [[LOCKSS: Format Migration]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Property Server Operations]]<br />
# [[CLOCKSS: Box Operations]]</div>Dshrhttp://documents.clockss.org/index.php/Main_PageMain Page2014-06-20T14:09:20Z<p>Dshr: /* Background */ Note about post-submission edits</p>
<hr />
<div>= CLOCKSS Archive =<br />
<br />
Welcome to the documentation Wiki of the [http://www.clockss.org CLOCKSS Archive].<br />
<br />
== CLOCKSS Archive Documents ==<br />
<br />
These documents describe the organization, policies, practices and plans of the CLOCKSS Archive:<br />
* [[CLOCKSS: Mission Statement]]<br />
* [[CLOCKSS: Governance and Organization]]<br />
* [[CLOCKSS: Budget and Planning Process]]<br />
* [[CLOCKSS: Business Plan Overview]]<br />
* [[CLOCKSS: Business History]]<br />
* [[CLOCKSS: Collection Development]]<br />
* [[CLOCKSS: Preservation Strategy]]<br />
* [[CLOCKSS: Access Policy]]<br />
* [[CLOCKSS: Succession Plan]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* [http://www.clockss.org/clockss/Contribute_to_CLOCKSS CLOCKSS: Fees]<br />
<br />
== LOCKSS Program Documents ==<br />
<br />
These documents describe the [http://www.lockss.org LOCKSS technology]:<br />
* [[LOCKSS: Basic Concepts]]<br />
* [[LOCKSS: Polling and Repair Protocol]]<br />
* [[LOCKSS: Format Migration]]<br />
* [[LOCKSS: Metadata Database]] <br />
* [[LOCKSS: Extracting Bibliographic Metadata]]<br />
* [[LOCKSS: Software Development Process]]<br />
* [[LOCKSS: Property Server Operations]]<br />
<br />
== LOCKSS Adaptations to CLOCKSS Archive Documents ==<br />
<br />
These documents describe how the LOCKSS technology is used by the CLOCKSS Archive:<br />
* [[CLOCKSS: Threats and Mitigations]]<br />
* [[CLOCKSS: Logging and Records]]<br />
* [[CLOCKSS: Ingest Pipeline]]<br />
* [[CLOCKSS: Box Operations]]<br />
* [[CLOCKSS: Extracting Triggered Content]]<br />
* [[CLOCKSS: Hardware and Software Inventory]]<br />
<br />
== OAIS Conformance Documents ==<br />
<br />
These documents describe the mapping between the CLOCKSS Archive and the [http://public.ccsds.org/publications/archive/650x0m2.pdf OAIS Reference Architecture]:<br />
* [[CLOCKSS: Designated Community]]<br />
* [[CLOCKSS: Mandatory Responsibilities]]<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[Definition of DIP]]<br />
<br />
== Background ==<br />
<br />
During the preparation for the [http://www.iso.org/iso/catalogue_detail.htm?csnumber=56510 ISO 16363 audit] of the CLOCKSS Archive, it was decided to make as much documentation of the CLOCKSS Archive public as possible. This wiki is the result. The contents have been collected from a variety of sources, including:<br />
* Published papers<br />
* CLOCKSS board minutes and other Board documents<br />
* The LOCKSS team's internal wiki<br />
* The LOCKSS team's ticketing and bug tracking systems<br />
Many of these sources were not intended to be made public, and contained confidential or inappropriate material. The contents of this wiki were extracted from them and reviewed for publication as part of the audit preparations. These pages document:<br />
* The structure, policies and practices of the CLOCKSS Archive.<br />
* The conformance of the CLOCKSS Archive to the OAIS Reference Model<br />
* The policies, practices and technology of the LOCKSS Program, which operates the CLOCKSS Archive under contract to the CLOCKSS Board.<br />
* The adaptations made to the generic LOCKSS technology for the purposes of the CLOCKSS Archive.<br />
As structures, policies, practices and technologies change, these documents will be maintained so that up-to-date information on these topics is available to the public.<br />
<br />
In particular, some documents were edited after their initial submission but before release of the audit report to the auditors, to clarify issues that arose during the audit process. Viewing the history of an individual page will reveal any such changes.<br />
<br />
== ISO 16363 Criteria ==<br />
<br />
For the purposes of the audit, the wiki also contains a page for each of the ISO 16363 criteria. The goal for these pages was that they be a finding aid for the auditors, allowing them to easily locate the relevant parts of the documents for each criterion, but that all actual content be in the documents. Interested readers can thus use the documents without needing to refer to the criteria pages:<br />
<br />
* [[3) Organizational Infrastructure]]<br />
* [[4) Digital Object Management]]<br />
* [[5) Infrastructure and Security Risk Management]]<br />
<br />
== Confidential Documents ==<br />
<br />
The auditors requested additional information, some of which was confidential:<br />
* Requested information that was not confidential was added to the documents in this Wiki. For example, URL lists and metadata for sample AUs were added to [[Definition of AIP]].<br />
* Requested information that was confidential was supplied in a separate Wiki which was deactivated at the end of the audit.<br />
<br />
== CLOCKSS Permission Statement ==<br />
<br />
CLOCKSS system has permission to ingest, preserve, and serve this Archival Unit.</div>Dshrhttp://documents.clockss.org/index.php/4.2.7_Content_Information_is_understandable_for_Designated_Community_at_the_time_of_AIP_creation4.2.7 Content Information is understandable for Designated Community at the time of AIP creation2014-06-20T14:03:46Z<p>Dshr: Edited to reflect discussion of the draft certification report</p>
<hr />
<div>== 4.2.7 - The repository shall ensure that the Content Information of the AIPs is understandable for their Designated Community at the time of creation of the AIP. ==<br />
<br />
In the context of the CLOCKSS Archive the responsibility for ensuring that the Content Information is Independently Understandable by the [[CLOCKSS: Designated Community]] ''at the time of ingest'' lies with the publisher, not with the archive. The archive's responsibility is to ensure that the level of understandability at the time of ingest is preserved through time.<br />
<br />
[[Definition of AIP#Creating AIPS from SIPs|Definition of AIP]] points out that at the time of creation of a CLOCKSS AIP (AU) it contains no Content Information.<br />
<br />
The processes that ensure that the Content Information accumulated through time by an AU "faithfully reflects" what the publisher published (for harvest content) or supplied (for file transfer content) are described in [[CLOCKSS: Ingest Pipeline]].<br />
<br />
=== Relevant Documents ===<br />
# [[Definition of AIP]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[CLOCKSS: Designated Community]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_MetadataLOCKSS: Extracting Bibliographic Metadata2014-04-10T16:20:47Z<p>Dshr: /* LOCKSS: Extracting Bibliographic Metadata */</p>
<hr />
<div>= LOCKSS: Extracting Bibliographic Metadata =<br />
<br />
A part of the LOCKSS preservation process is extracting and indexing bibliographic metadata. Bibliographic metadata is supplied as part of the content submitted by a publisher as [[Definition of SIP|Submission Information Packages]] (SIPs) and preserved in [[Definition of AIP|Archival Information Packages]] in LOCKSS preservation networks.<br />
<br />
Publishers are required to submit adequate bibliographic metadata with the content to support the four uses the CLOCKSS Archive makes of such metadata:<br />
* To enable the CLOCKSS organization to verify and bill for the content submitted by the publisher. This requires reasonably accurate counts of articles per publisher, but no other metadata.<br />
* To track the holdings and progress in preserving content of LOCKSS networks through reports to agencies such as the [http://thekeepers.org/ Keepers] registry. This requires reasonably accurate publisher and date and/or volume range metadata, but no other metadata.<br />
* To identify relevant content in responding to a [[CLOCKSS: Extracting Triggered Content|trigger event]]. This requires accurate journal and volume metadata.<br />
* To make content disseminated from LOCKSS boxes, and [[CLOCKSS: Extracting Triggered Content|triggered from the CLOCKSS network]], accessible to end users using bibliographic information through online tools such as link resolvers. This requires full accurate metadata.<br />
<br />
This document describes the kinds of bibliographic information that is preserved by LOCKSS systems, the formats and methods most commonly used for transmitting bibliographic metadata, the mechanisms used to extract and index the bibliographic metadata in a metadata database, and ways of presenting and querying extracted bibliographic metadata. Note that for the routine uses of bibliographic metadata, full metadata is not required and a [[#Errors in Metadata|small level of noise]] is acceptable. For the uses which only happen as part of a trigger event, full, accurate metadata is required, but there is time to detect and remedy any remaining noise. Thus the goal in normal processing is not perfect metadata, but metadata with an acceptably low level of noise.<br />
<br />
== Kinds of Bibliographic Metadata ==<br />
<br />
LOCKSS can preserve many kinds of content. The mechanisms underlying the preservation process operate on storage units, typically the content of a URL on a website, or files on disk or contained within archive formats such as ZIP or TAR files. However, bibliographic metadata pertains to bibliographic units such as journal articles or book chapters. Therefore the kinds of bibliographic metadata that are supplied by publishers are related to bibliographic units rather than to storage units.<br />
<br />
The kinds of bibliographic metadata that are extracted and stored by LOCKSS depend on the bibliographic type of the preserved content. For serials such as journals, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of journal)<br />
* ISSN and eISSN<br />
* DOI<br />
* Volume<br />
* Issue<br />
* Publication date (preferably cover date)<br />
* Article title<br />
* Article DOI<br />
* Article author(s)<br />
* Article number<br />
* Article page range (start page or start/end page)<br />
* Article keywords<br />
* Article summary<br />
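For illustration, such a record might be carried as a simple mapping. The field names below are paraphrases of the list above rather than the actual schema of the [[LOCKSS: Metadata Database]], and the values are taken from the sample Taylor &amp; Francis article used elsewhere in these documents:<br />

```python
# Illustrative journal-article metadata record; the key names are
# paraphrases of the serial fields listed above, not actual schema
# column names.  Values come from the sample article "Implementation
# and Assessment of an Introductory Pharmacy Practice Course Sequence".
journal_article = {
    "publisher": "Taylor & Francis",
    "publication_name": "Journal of Pharmacy Teaching",
    "volume": "14",
    "issue": "1",
    "publication_date": "2007-10-31",
    "article_title": "Implementation and Assessment of an Introductory "
                     "Pharmacy Practice Course Sequence",
    "article_doi": "10.1300/J060v14n01_02",
    "article_page_range": "5-17",
}
```
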
<br />
For books and monographs, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of book)<br />
* Edition<br />
* ISBN and eISBN<br />
* DOI<br />
* Volume<br />
* Publication date (preferably cover date)<br />
* Author(s)/Editor(s)<br />
* Keywords<br />
* Summary<br />
<br />
For individual book or monograph chapters, the bibliographic metadata also includes:<br />
<br />
* Chapter title<br />
* Chapter DOI<br />
* Chapter author(s)<br />
* Chapter number<br />
* Chapter page ranges (start page or start/end page)<br />
* Chapter keywords<br />
* Chapter summary<br />
<br />
For book or monograph series, the bibliographic metadata also includes:<br />
<br />
* Series name<br />
* Series ISSN and eISSN<br />
<br />
In addition, certain physical metadata is also collected about the relationship of the bibliographic unit to its original [[Definition of SIP|Submission Information Package]] (SIP) and the [[Definition of AIP|AIP]] where it is preserved. This is collected from the CLOCKSS system primarily for auditing purposes, and to assist in any eventual triggering of the content. These include:<br />
<br />
* Publishing platform (e.g. HighWire Press)<br />
* [[LOCKSS: Basic Concepts#Plugins|Plugin ID]] and [[LOCKSS: Basic Concepts#AUID|AUID]] (identifies both the SIP and the AIP)<br />
* URLs of bibliographic features (e.g. abstract, full-text PDF, full-text HTML, citation file)<br />
* Date bibliographic unit was first added to the AIP<br />
<br />
== Formats and Methods for Transmitting Bibliographic Metadata ==<br />
<br />
Publishers encode and transmit bibliographic metadata in a variety of ways. How the metadata is encoded depends heavily on whether the content is being harvested from the publisher's website, or transferred to CLOCKSS by the publisher at a time of their choosing.<br />
<br />
=== Harvest Content ===<br />
<br />
Content that is harvested from the publisher's website generally takes the form of HTML pages with links to supporting files such as PDF and ePub files, videos and audio files, and other file types. publishers deliver metadata for harvested content in one of several formats and mechanisms: <br />
<br />
The most common is to embed metadata as HTML META tags in the header of each article HTML page, either the abstract page or in the full text HTML page or both. The most frequently used metadata encodings include Google Scholar and Dublin Core. Some publishers supply the same metadata in both formats in a file. Google Scholar is the encoding to facilitate searching for content through Google's Scholar project. Here is an example of the Google Scholar encoding for an article of a typical journal:<br />
<br />
<pre><br />
<meta content="Molecular Interventions" name="citation_journal_title" /><br />
<meta content="1534-0384" name="citation_issn" /><br />
<meta content="1543-2548" name="citation_issn" /><br />
<meta content="Duckles, Sue P." name="citation_authors" /><br />
<meta content="Dial M for Molecular" name="citation_title" /><br />
<meta content="04/01/2001" name="citation_date" /><br />
<meta content="1" name="citation_volume" /><br />
<meta content="1" name="citation_issue" /><br />
<meta content="6" name="citation_firstpage" /><br />
<meta content="1/1/6" name="citation_id" /><br />
<meta content="molint;1/1/6" name="citation_mjid" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full.pdf" name="citation_pdf_url" /><br />
</pre><br />
<br />
Here is the same information encoded as Dublin Core:<br />
<br />
<pre><br />
<meta name="DC.Format" content="text/html" /><br />
<meta name="DC.Language" content="en" /><br />
<meta content="Dial M for Molecular" name="DC.Title" /><br />
<meta content="" name="DC.Identifier" /><br />
<meta content="2001-04-01" name="DC.Date" /><br />
<meta content="American Society for Pharmacology and Experimental Therapeutics" name="DC.Publisher" /><br />
<meta content="Sue P. Duckles" name="DC.Contributor" /><br />
</pre><br />
<br />
Dublin Core is a more limited encoding, and full information is not always available in this encoding without resorting to non-standard extensions. It is often necessary to consult both encodings to extract complete metadata.<br />
<br />
The other way bibliographic information is provided for harvested content is through a separate file that is linked to from the abstract or full-text HTML page and is meant for use by citation management systems. The most common format is RIS, developed by Research Information Systems as an interchange format among citation management systems. Here is the RIS representation of the same article:<br />
<br />
<pre><br />
TY - JOUR<br />
PB - American Society for Pharmacology and Experimental Therapeutics<br />
JO - Molecular Interventions<br />
SN - 1534-0384<br />
SN - 1543-2548<br />
TI - Dial M for Molecular<br />
AU - Sue P. Duckles <br />
PY - 2001<br />
DA - 2001-04-01<br />
VL - 1<br />
IS - 1<br />
SP - 6<br />
UR - http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url<br />
</pre><br />
<br />
=== File Transfer Content ===<br />
<br />
Transferred content is typically in the form of "pre-publication" content that is used as input to a publication system. The content in this case consists of document files in PDF, ePub and other formats, supporting files such as image, audio, and video files, and one or more tagged text files that provide the content and metadata and refer to the other files. Many publishers have developed proprietary formats, but some have begun using evolving industry-wide formats, such as JATS (Journal Article Tag Suite) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. There are several profiles for describing various kinds of content.<br />
<br />
Extracting metadata from proprietary formats depends on an understanding of how the publisher has encoded the metadata information. The JATS XML encoding as an open standard is well-understood and it is relatively simple to extract metadata from it.<br />
<br />
== Mechanisms for Extracting and Indexing Bibliographic Metadata ==<br />
<br />
LOCKSS preserves content at the level of storage units, yet metadata is at the level of bibliographic units, so a mechanism is necessary to identify the storage units that contain metadata for each bibliographic unit, and then to extract the metadata and index the bibliographic units in the [[LOCKSS: Metadata Database|Metadata Database]]. Since the representation of metadata in preserved content depends on the publishing platform, the mechanism for performing these steps is expressed as a framework in the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. Plugins that conform to this framework embody all the rules and procedures for processing content from individual publishing platforms.<br />
<br />
=== Article Iterator ===<br />
<br />
The Article Iterator is slightly misnamed. Itis a plugin framework that identifies files that represent different types of bibliographic units within an[[Definition of AIP|AIP]] (Archival Unit or AU). Examples of bibliographic units are abstract, full-text, supplementary material, and so on. The definition of the mapping between the various types of bibligraphic units and the files within the AU is plugin-specific.<br />
<br />
The article iterator typically identifies files using patterns that match a specific file such as an HTML article page. For each such page, the article iterator identifies all the related files that also represent bibliographic features.<br />
<br />
For example, an article iterator whose patterns match the HTML abstract page can attempt to locate related files that represent the metadata, full-text HTML, full-text PDF, citation manager files, and other bibliographic features. The article iterator labels these feature URLs for use by the metadata extractor and other clients of the article iterator.<br />
<br />
=== Metadata Extractor ===<br />
<br />
The metadata extractor is a plugin framework that extracts metadata for each bibliographic unit using features identified by the article iterator. It typically uses the feature identified as the metadata feature. The metadata extractor operates by processing the contents of the storage unit that contains the metadata, identifying specific metadata elements, validating and normalizing values, and creating a record that contains the metadata for that bibliographic unit.<br />
<br />
The processing of the files to identify and extract the metadata values is unique to the plugin that governs content for a given AIP. For harvested journal articles that provide metadata using META tags in abstract or full-text HTML pages, the extraction process involves parsing the HTML header and extracting the corresponding "name" and "content" attribute values. Once the pairs are isolated, the values can be validated to ensure they conform to the expected type.<br />
<br />
==== Errors in Metadata ====<br />
<br />
Validation, deduplication and other forms of error correction are necessary because it is not uncommon for there to be errors in provided metadata. Pages are maintained on the LOCKSS team's internal wiki with information about metadata problems. A PDF of one of them, describing one class of problems, is [http://tdrd.clockss.org/images/6/63/MetadataProblems.pdf here].<br />
<br />
Normalization is necessary to ensure a common representation for the type of value. For example an ISSN should have four digits, an hyphen, three digits followed by a digit or the letter 'X". Normalization entails ensuring the correct punctation and ensuring that a final "X" is upper-cased. An additional example is the publisher name in the header of HTML files, which is often specified as:<br />
<pre><br />
<meta name="dc.publisher" content="publisher_name" /><br />
</pre><br />
It's common to find inconsistent spellings, abbreviations, etc. of the same publisher name in different files. These are translated to a consistent name by a manually-maintained mapping table. Entries are added whenever an unexpected publisher name appears in a report.<br />
<br />
A further example is missing DOIs. Some content may not have had a DOI assigned. In other cases, a DOI may have been assigned but the metadata extractor may fail to find it in the content, either because it is missing or because it is not in the place(s) that the extractor expects.<br />
<br />
The report generators require a de-duplication step. It uses a combination of bibliographic items, including publisher, publication title, publication year, volume, issue, start page, and a computed article ID. Two metadata items are the same if all the available values are the same. The DOI is the preferred unique ID, but if it isn't available a substitute is generated using the title, if there is one, and otherwise the access URL. The article ID is computed the same way regardless of whether it is harvest or file transfer content. Because these IDs are successively less reliable as unique IDs, the de-duplication is not completely reliable in the face of noisy metadata.<br />
<br />
=== Metadata Database Indexing ===<br />
<br />
Once metadata has been extracted from a bibliographic unit, the Metadata Manager in the LOCKSS box indexes it in the metadata database. The indexing operation is independent of how the bibliographic units were identified or the metadata was extracted. Parts of the indexing process are common to all bibliographic types, while other parts depend on which type is being indexed. The first section of this document shows what metadata is stored in the metadata database for each bibliographic type. <br />
<br />
The Metadata Manager also checks that the metadata for a bibliographic unit is complete enough to index. At least basic information such as the publisher, publication title, one or more feature URLs, and the ID of the AU (AIP), indicating the plugin and the preservation parameters, is required. Some publishers supply incomplete metadata. The Metadata Manager attempts to fill in missing bibliographic information from the [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entry for the AU (AIP).<br />
<br />
The TDB is a per-publisher knowledge base that is maintained by the LOCKSS team and is used to add new AUs to a LOCKSS box. The TDB provides all the preservation parameters necessary to define the AU, plus additional, readily available bibliographic information. For harvest content, this includes the publication name, publisher name, the ISSN/eISSN, ISBN/eISBN, volume, publication name, and publisher proprietary identifier. For transferred content, less bibliographic information is available in the TDB entries because each AU typically includes all content from a given publisher for a given year.<br />
<br />
If the publisher or publication title is not available from either the metadata or the TDB, the Metadata Manager generates values that can be readily identified for all bibliographic units in the AU being indexed. These "gensyms" can later be updated once more complete bibliographic information becomes available. Flagging missing metadata in this way is the only use the system makes of "gensyms".<br />
<br />
== Querying and Presenting Extracted Metadata ==<br />
<br />
The administrator of a LOCKSS box and, for CLOCKSS, authorized CLOCKSS staff can access the archive, including for the purposes of querying the metadata and generating various reports required to operate LOCKSS preservation networks and report the state of preservation.<br />
<br />
There are several ways that authorized staff can query the metadata of a LOCKSS or CLOCKSS box. Since the metadata is stored in a relational metadata database, it is possible for custom and standard report generators to run SQL queries against the metadata database of a LOCKSS box, or all CLOCKSS boxes, in the preservation network. <br />
<br />
An example of this is the monthly report submitted to the Keepers registry at the University of Edinburgh. The Keepers report shows the range of years and volumes for every title that is committed for preservation, in process, or preserved in the CLOCKSS preservation network. This report is necessary to satisfy the reporting requirements of the CLOCKSS board and the Designated Community. Custom reports can also be written that enable the CLOCKSS staff to satisfy requests from publishers under the terms of the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement].<br />
<br />
Other custom report generators could also be written using SQL access that export the bibliographic information of the metadata database. As an example, it would be possible to create a custom report that exports the bibliographic information in METS (Metadata Encoding and Transmission Standard) XML format for further processing, since there is a close correspondence between the LOCKSS and METS schema and types.<br />
<br />
== Relevant Documents ==<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[LOCKSS: Metadata Database]]<br />
* [[CLOCKSS: Designated Community]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* David S. H. Rosenthal "Talk on LOCKSS Metadata Extraction at IPCC2013" 29 April 2013 http://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_MetadataLOCKSS: Extracting Bibliographic Metadata2014-04-10T16:18:42Z<p>Dshr: /* LOCKSS: Extracting Bibliographic Metadata */</p>
<hr />
<div>= LOCKSS: Extracting Bibliographic Metadata =<br />
<br />
A part of the LOCKSS preservation process is extracting and indexing bibliographic metadata. Bibliographic metadata is supplied as part of the content submitted by a publisher as [[Definition of SIP|Submission Information Packages]] (SIPs) and preserved in [[Definition of AIP|Archival Information Packages]] in LOCKSS preservation networks.<br />
<br />
Publishers are required to submit adequate bibliographic metadata with the content to support the four uses the CLOCKSS Archive makes of such metadata:<br />
* To enable the CLOCKSS organization to verify and bill for the content submitted by the publisher. This requires reasonably accurate counts of articles per publisher, but no other metadata.<br />
* To track the holdings and progress in preserving content of LOCKSS networks through reports to agencies such as the [http://thekeepers.org/ Keepers] registry. This requires reasonably accurate publisher and date and/or volume range metadata, but no other metadata.<br />
* To identify relevant content in responding to a [[CLOCKSS: Extracting Triggered Content|trigger event]]. This requires accurate journal and volume metatadata.<br />
* To make content disseminated from LOCKSS boxes, and [[CLOCKSS: Extracting Triggered Content|triggered from the CLOCKSS network]], accessible to end users using bibliographic information through online tools such as link resolvers. This requires full accurate metadata.<br />
<br />
This document describes the kinds of bibliographic information that is preserved by LOCKSS systems, the formats and methods most commonly used for transmitting bibliographic metadata, the mechanisms used to extract and index the bibliographic metadata in a metadata database, and ways of presenting and querying extracted bibliographic metadata. Note that for the routine uses of bibliographic metadata, full metadata is not required and a small level of noise is acceptable. For the uses which only happen as part of a trigger event, full, accurate metadata is required, but there is time to detect and remedy any remaining noise. Thus the goal in normal processing is not perfect metadata, but metadata with an acceptably low level of noise.<br />
<br />
== Kinds of Bibliographic Metadata ==<br />
<br />
LOCKSS can preserve many kinds of content. The mechanisms underlying the preservation process operate on storage units, typically the content of a URL on a website, or files on disk or contained within archive formats such as ZIP or TAR files. However, bibliographic metadata pertains to bibliographic units such as journal articles or book chapters. Therefore the kinds of bibliographic metadata that are supplied by publsihers are related to bibliographic units rather than to storage units.<br />
<br />
The kinds of bibliographic metadata that are extracted and stored by LOCKSS depends on the bibliographic type of the preserved content. For serials such as journals, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of journal)<br />
* ISSN and eISSN<br />
* DOI<br />
* Volume<br />
* Issue<br />
* Publication date (preferably cover date)<br />
* Article title<br />
* Article DOI<br />
* Article author(s)<br />
* Article number<br />
* Article page range (start page or start/end page)<br />
* Article keywords<br />
* Article summary<br />
<br />
For books and monographs, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of book)<br />
* Edition<br />
* ISBN and eISBN<br />
* DOI<br />
* Volume<br />
* Publication date (preferably cover date)<br />
* Author(s)/Editor(s)<br />
* Keywords<br />
* Summary<br />
<br />
For individual book or monograph chapters, the bibliographic metadata also includes:<br />
<br />
* Chapter title<br />
* Chapter DOI<br />
* Chapter author(s)<br />
* Chapter number<br />
* Chapter page ranges (start page or start/end page)<br />
* Chapter author(s)<br />
* Chapter keywords<br />
* Chapter summary<br />
<br />
For book or monograph series, the bibliographic metadata also includes:<br />
<br />
* Series name<br />
* Series ISSN and eISSN<br />
<br />
In addition, certain physical metadata is also collected about the relationship of the bibliographic unit to its original [[Definition of SIP|Submission Information Package]] (SIP) and the [[Definition of AIP|AIP]] where it is preserved. This is collected from the CLOCKSS system primarily for auditing purposes, and to assist in any eventual triggering of the content. These include:<br />
<br />
* Publishing platform (e.g. HighWire Press)<br />
* [[LOCKSS: Basic Concepts#Plugins|Plugin ID]] and [[LOCKSS: Basic Concepts#AUID|AUID]] (identifies both the SIP and the AIP)<br />
* URLs of bibliographic features (e.g. abstract, full-text PDF, full-text HTML, citation file)<br />
* Date bibliographic unit was first added to the AIP<br />
<br />
== Formats and Methods for Transmitting Bibliographic Metadata ==<br />
<br />
Publishers encode and transmit bibliographic metadata in a variety of ways. How the metadata is encoded depends heavily on whether the content is being harvested from the publisher's website, or transferred to CLOCKSS by the publisher at a time of their choosing.<br />
<br />
=== Harvest Content ===<br />
<br />
Content that is harvested from the publisher's website generally takes the form of HTML pages with links to supporting files such as PDF and ePub files, videos and audio files, and other file types. publishers deliver metadata for harvested content in one of several formats and mechanisms: <br />
<br />
The most common is to embed metadata as HTML META tags in the header of each article HTML page, either the abstract page or in the full text HTML page or both. The most frequently used metadata encodings include Google Scholar and Dublin Core. Some publishers supply the same metadata in both formats in a file. Google Scholar is the encoding to facilitate searching for content through Google's Scholar project. Here is an example of the Google Scholar encoding for an article of a typical journal:<br />
<br />
<pre><br />
<meta content="Molecular Interventions" name="citation_journal_title" /><br />
<meta content="1534-0384" name="citation_issn" /><br />
<meta content="1543-2548" name="citation_issn" /><br />
<meta content="Duckles, Sue P." name="citation_authors" /><br />
<meta content="Dial M for Molecular" name="citation_title" /><br />
<meta content="04/01/2001" name="citation_date" /><br />
<meta content="1" name="citation_volume" /><br />
<meta content="1" name="citation_issue" /><br />
<meta content="6" name="citation_firstpage" /><br />
<meta content="1/1/6" name="citation_id" /><br />
<meta content="molint;1/1/6" name="citation_mjid" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full.pdf" name="citation_pdf_url" /><br />
</pre><br />
<br />
Here is the same information encoded as Dublin Core:<br />
<br />
<pre><br />
<meta name="DC.Format" content="text/html" /><br />
<meta name="DC.Language" content="en" /><br />
<meta content="Dial M for Molecular" name="DC.Title" /><br />
<meta content="" name="DC.Identifier" /><br />
<meta content="2001-04-01" name="DC.Date" /><br />
<meta content="American Society for Pharmacology and Experimental Therapeutics" name="DC.Publisher" /><br />
<meta content="Sue P. Duckles" name="DC.Contributor" /><br />
</pre><br />
<br />
Dublin Core is a more limited encoding, and full information is not always available in this encoding without resorting to non-standard extensions. It is often necessary to consult both encodings to extract complete metadata.<br />
<br />
The other way bibliographic information is provided for harvested content is through a separate file that is linked to from the abstract or full-text HTML page and is meant for use by citation management systems. The most common format is RIS, developed by Research Information Systems as an interchange format among citation management systems. Here is the RIS representation of the same article:<br />
<br />
<pre><br />
TY - JOUR<br />
PB - American Society for Pharmacology and Experimental Therapeutics<br />
JO - Molecular Interventions<br />
SN - 1534-0384<br />
SN - 1543-2548<br />
TI - Dial M for Molecular<br />
AU - Sue P. Duckles <br />
PY - 2001<br />
DA - 2001-04-01<br />
VL - 1<br />
IS - 1<br />
SP - 6<br />
UR - http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url<br />
</pre><br />
<br />
=== File Transfer Content ===<br />
<br />
Transferred content is typically in the form of "pre-publication" content that is used as input to a publication system. The content in this case consists of document files in PDF, ePub and other formats, supporting files such as image, audio, and video files, and one or more tagged text files that provide the content and metadata and refer to the other files. Many publishers have developed proprietary formats, but some have begun using evolving industry-wide formats, such as JATS (Journal Article Tag Suite) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. There are several profiles for describing various kinds of content.<br />
<br />
Extracting metadata from proprietary formats depends on an understanding of how the publisher has encoded the metadata information. The JATS XML encoding as an open standard is well-understood and it is relatively simple to extract metadata from it.<br />
<br />
== Mechanisms for Extracting and Indexing Bibliographic Metadata ==<br />
<br />
LOCKSS preserves content at the level of storage units, yet metadata is at the level of bibliographic units, so a mechanism is necessary to identify the storage units that contain metadata for each bibliographic unit, and then to extract the metadata and index the bibliographic units in the [[LOCKSS: Metadata Database|Metadata Database]]. Since the representation of metadata in preserved content depends on the publishing platform, the mechanism for performing these steps is expressed as a framework in the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. Plugins that conform to this framework embody all the rules and procedures for processing content from individual publishing platforms.<br />
<br />
=== Article Iterator ===<br />
<br />
The Article Iterator is slightly misnamed. Itis a plugin framework that identifies files that represent different types of bibliographic units within an[[Definition of AIP|AIP]] (Archival Unit or AU). Examples of bibliographic units are abstract, full-text, supplementary material, and so on. The definition of the mapping between the various types of bibligraphic units and the files within the AU is plugin-specific.<br />
<br />
The article iterator typically identifies files using patterns that match a specific file such as an HTML article page. For each such page, the article iterator identifies all the related files that also represent bibliographic features.<br />
<br />
For example, an article iterator whose patterns match the HTML abstract page can attempt to locate related files that represent the metadata, full-text HTML, full-text PDF, citation manager files, and other bibliographic features. The article iterator labels these feature URLs for use by the metadata extractor and other clients of the article iterator.<br />
<br />
=== Metadata Extractor ===<br />
<br />
The metadata extractor is a plugin framework that extracts metadata for each bibliographic unit using features identified by the article iterator. It typically uses the feature identified as the metadata feature. The metadata extractor operates by processing the contents of the storage unit that contains the metadata, identifying specific metadata elements, validating and normalizing values, and creating a record that contains the metadata for that bibliographic unit.<br />
<br />
The processing of the files to identify and extract the metadata values is unique to the plugin that governs content for a given AIP. For harvested journal articles that provide metadata using META tags in abstract or full-text HTML pages, the extraction process involves parsing the HTML header and extracting the corresponding "name" and "content" attribute values. Once the pairs are isolated, the values can be validated to ensure they conform to the expected type.<br />
<br />
==== Errors in Metadata ====<br />
<br />
Validation, deduplication and other forms of error correction are necessary because it is not uncommon for there to be errors in provided metadata. Pages are maintained on the LOCKSS team's internal wiki with information about metadata problems. A PDF of one of them, describing one class of problems, is [http://tdrd.clockss.org/images/6/63/MetadataProblems.pdf here].<br />
<br />
Normalization is necessary to ensure a common representation for the type of value. For example an ISSN should have four digits, an hyphen, three digits followed by a digit or the letter 'X". Normalization entails ensuring the correct punctation and ensuring that a final "X" is upper-cased. An additional example is the publisher name in the header of HTML files, which is often specified as:<br />
<pre><br />
<meta name="dc.publisher" content="publisher_name" /><br />
</pre><br />
It's common to find inconsistent spellings, abbreviations, etc. of the same publisher name in different files. These are translated to a consistent name by a manually-maintained mapping table. Entries are added whenever an unexpected publisher name appears in a report.<br />
<br />
A further example is missing DOIs. Some content may not have had a DOI assigned. In other cases, a DOI may have been assigned but the metadata extractor may fail to find it in the content, either because it is missing or because it is not in the place(s) that the extractor expects.<br />
<br />
The report generators require a de-duplication step. It uses a combination of bibliographic items, including publisher, publication title, publication year, volume, issue, start page, and a computed article ID. Two metadata items are the same if all the available values are the same. The DOI is the preferred unique ID, but if it isn't available a substitute is generated using the title, if there is one, and otherwise the access URL. The article ID is computed the same way regardless of whether it is harvest or file transfer content. Because these IDs are successively less reliable as unique IDs, the de-duplication is not completely reliable in the face of noisy metadata.<br />
<br />
=== Metadata Database Indexing ===<br />
<br />
Once metadata has been extracted from a bibliographic unit, the Metadata Manager in the LOCKSS box indexes it in the metadata database. The indexing operation is independent of how the bibliographic units were identified or the metadata was extracted. Parts of the indexing process are common to all bibliographic types, while other parts depend on which type is being indexed. The first section of this document shows what metadata is stored in the metadata database for each bibliographic type. <br />
<br />
The Metadata Manager also checks that the metadata for a bibliographic unit is complete enough to index. At least basic information such as the publisher, publication title, one or more feature URLs, and the ID of the AU (AIP), indicating the plugin and the preservation parameters, is required. Some publishers supply incomplete metadata. The Metadata Manager attempts to fill in missing bibliographic information from the [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entry for the AU (AIP).<br />
<br />
The TDB is a per-publisher knowledge base that is maintained by the LOCKSS team and is used to add new AUs to a LOCKSS box. The TDB provides all the preservation parameters necessary to define the AU, plus additional, readily available bibliographic information. For harvest content, this includes the publication name, publisher name, the ISSN/eISSN, ISBN/eISBN, volume, publication name, and publisher proprietary identifier. For transferred content, less bibliographic information is available in the TDB entries because each AU typically includes all content from a given publisher for a given year.<br />
<br />
If the publisher or publication title is not available from either the metadata or the TDB, the Metadata Manager generates values that can be readily identified for all bibliographic units in the AU being indexed. These "gensyms" can later be updated once more complete bibliographic information becomes available. Flagging missing metadata in this way is the only use the system makes of "gensyms".<br />
<br />
== Querying and Presenting Extracted Metadata ==<br />
<br />
The administrator of a LOCKSS box and, for CLOCKSS, authorized CLOCKSS staff can access the archive, including for the purposes of querying the metadata and generating various reports required to operate LOCKSS preservation networks and report the state of preservation.<br />
<br />
There are several ways that authorized staff can query the metadata of a LOCKSS or CLOCKSS box. Since the metadata is stored in a relational metadata database, it is possible for custom and standard report generators to run SQL queries against the metadata database of a LOCKSS box, or all CLOCKSS boxes, in the preservation network. <br />
<br />
An example of this is the monthly report submitted to the Keepers registry at the University of Edinburgh. The Keepers report shows the range of years and volumes for every title that is committed for preservation, in process, or preserved in the CLOCKSS preservation network. This report is necessary to satisfy the reporting requirements of the CLOCKSS board and the Designated Community. Custom reports can also be written that enable the CLOCKSS staff to satisfy requests from publishers under the terms of the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement].<br />
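A Keepers-style report reduces to a SQL aggregation over the metadata database. The sketch below runs against a hypothetical, much-simplified <tt>au_metadata</tt> table; the actual schema differs (see [[LOCKSS: Metadata Database]]).<br />
<br />
```python
import sqlite3

def year_volume_ranges(conn):
    """Report the range of years and volumes preserved for each title."""
    return conn.execute(
        """SELECT publication_title,
                  MIN(year), MAX(year),
                  MIN(volume), MAX(volume)
           FROM au_metadata
           GROUP BY publication_title
           ORDER BY publication_title"""
    ).fetchall()
```
<br />
Custom report generators would run queries of this shape against the metadata database of each box, then merge the per-box results into a network-wide view.<br />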
<br />
Other custom report generators could also be written using SQL access that export the bibliographic information of the metadata database. As an example, it would be possible to create a custom report that exports the bibliographic information in METS (Metadata Encoding and Transmission Standard) XML format for further processing, since there is a close correspondence between the LOCKSS and METS schema and types.<br />
<br />
== Relevant Documents ==<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[LOCKSS: Metadata Database]]<br />
* [[CLOCKSS: Designated Community]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* David S. H. Rosenthal "Talk on LOCKSS Metadata Extraction at IPCC2013" 29 April 2013 http://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Ingest_PipelineCLOCKSS: Ingest Pipeline2014-04-10T03:45:05Z<p>Dshr: Changes after auditor visit: approved by Tom Lipkis</p>
<hr />
<div>= CLOCKSS: Ingest Pipeline =<br />
<br />
The CLOCKSS Archive ingests two different types of content, both under the control of an appropriate [[LOCKSS: Basic Concepts#Plugins|plugin]] in the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]:<br />
* Harvest Content - the contents of a publisher's Web site, obtained by crawling it.<br />
* File Transfer content - the form of the content the publisher uses to create their Web site, obtained from the publisher typically via FTP.<br />
<br />
== Harvest Content Pipeline ==<br />
<br />
=== Harvest Publisher Engagement ===<br />
<br />
The process a new publisher undergoes when signing up with CLOCKSS depends on whether the publisher uses a publishing platform already supported by the LOCKSS software:<br />
* If so, the discussion need cover only the process for turning on access for the CLOCKSS Archive's ingest machines.<br />
* Otherwise, a plugin writer from the LOCKSS team is designated to analyze the publisher's site and:<br />
** develop requirements for the publisher plugin as described in [[LOCKSS: Software Development Process]].<br />
** work with the publisher to add CLOCKSS permission pages and make any other necessary changes to their site.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS Harvest Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Plugin Development and Testing ===<br />
<br />
The designated LOCKSS team member:<br />
* Works with the publisher to grant access to the CLOCKSS ingest machines' IP addresses.<br />
* Develops and tests any necessary software enhancements including unit tests (see [[LOCKSS: Software Development Process]]).<br />
* Works with a plugin writer to implement and test the necessary plugin and its unit tests (see [[LOCKSS: Software Development Process]]).<br />
<br />
Once the plugin passes its tests and the CLOCKSS ingest machines have access to the publisher's site with CLOCKSS permissions, the plugin writer works with the LOCKSS content team to process sample batches of publisher content for quality assurance.<br />
<br />
The CLOCKSS Harvest Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Content Processing ===<br />
<br />
Once a plugin has been developed and tested with unit tests and sample content, it is released to a set of content-testing boxes, which are identical to production CLOCKSS boxes except that they are generally smaller. Then:<br />
* Under the direction of an internally developed testing framework (AUTest), two daemons with the plugin are directed to collect two copies each of a substantial amount of real content, one copy collected from Stanford's IP address range, the other from Rice or Indiana.<br />
* AUTest then directs each of the two daemons to compute the message digest of the filtered contents of each collected AU.<br />
* AUTest compares the results to ensure correct collection and correct operation of hash filters.<br />
* Metadata is extracted and checked for sanity.<br />
* A sample of the content is browsed by a human tester to verify the correctness of the plugin by ensuring that:<br />
** all the types of files that should be collected are,<br />
** and that files that shouldn't be collected, such as advertisements or articles from previous years, are properly excluded.<br />
Any problems detected are addressed by modifying and testing the plugin in a development environment, or by contacting the publisher if necessary to resolve systemic site errors, then the tests are repeated on the content-testing boxes.<br />
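The comparison AUTest performs can be sketched as follows: hash the filtered content of each URL in two independently collected copies of an AU and report any disagreement. Here <tt>filter_content()</tt> is a stand-in for the plugin's hash filters, and the AUs are modeled as simple URL-to-content mappings; this is an illustration, not AUTest itself.<br />
<br />
```python
import hashlib

def filter_content(raw):
    # Placeholder for the plugin's hash filters, which strip volatile
    # content (timestamps, advertisements, etc.) before hashing.
    return raw

def au_digests(au):
    """Map each URL in an AU (a dict of url -> bytes) to its content digest."""
    return {url: hashlib.sha256(filter_content(data)).hexdigest()
            for url, data in au.items()}

def compare_copies(copy_a, copy_b):
    """Return the URLs whose filtered content differs between two copies."""
    digests_a, digests_b = au_digests(copy_a), au_digests(copy_b)
    return sorted(url for url in set(digests_a) | set(digests_b)
                  if digests_a.get(url) != digests_b.get(url))
```
<br />
If the two copies were collected correctly and the hash filters work, <tt>compare_copies()</tt> returns an empty list; any URLs it returns indicate either a collection error or a filter that fails to remove volatile content.<br />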
<br />
The process above is repeated on a substantial sample of new content as it becomes available (in following years/volumes, or new publications from the same publisher), in order to detect changes to the publisher's site, or the format of new titles, which require changes to the plugin.<br />
<br />
Once an AU has been successfully tested on the content-testing boxes, it is configured for collection on all boxes in the ingest network. If the plugin is new or changed it is also released to the ingest network. Each ingest box collects the content and the network runs polls to detect and resolve transient collection errors. When all the copies of an AU have come into full agreement on the ingest boxes, they are then configured on the network of CLOCKSS production boxes.<br />
<br />
Depending on the complexity and diversity of the publisher's content the "substantial sample" can be anything from quite small to the publisher's entire content. Note that the need for a "substantial sample" of content for testing means that there is a delay between the time the publisher starts adding content to the bibliographic unit represented by the AU, and the time the ingest network starts collecting it.<br />
<br />
The CLOCKSS Harvest Content Lead is responsible for this process.<br />
<br />
==== Harvest Process ====<br />
<br />
Once AUs of content are configured on the CLOCKSS production boxes, they schedule collection of the content from the ingest boxes, as described in [[Definition of AIP#Creating an AIP from a harvest SIP|Definition of AIP]].<br />
<br />
A harvest content AU is considered to be preserved when it has been ingested by all the production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). The AU can then be removed from the ingest boxes. A review of the process for doing so deemed it inadequate and too manual; a replacement process is under development that will integrate with the improvements being made to [[CLOCKSS: Logging and Records#External Reports|external report generation]].<br />
<br />
The CLOCKSS Harvest Content Lead is responsible for both the current and replacement processes.<br />
<br />
==== Completeness of Harvest Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content on the publisher's web site. The CLOCKSS ingest pipeline includes multiple ingest CLOCKSS boxes that individually harvest content from the publisher's website. Once collected, the ingest CLOCKSS boxes poll among themselves and repair content objects from the publisher if there is disagreement. The content testing process requires complete agreement among the ingest boxes before the content is released to the production CLOCKSS network. As a result, the content described by the CLOCKSS plugin and parameters for the [[Definition of SIP|Submission Information Package]] is preserved in the CLOCKSS PLN.<br />
<br />
To account for the possibility of errors in the rules and procedures of the CLOCKSS plugin, a second layer of testing is done by visually inspecting content submitted by a publisher to ensure that it functions as it did on the publisher's website, and that all content available from the publisher's website is also preserved in the CLOCKSS ingest boxes before being released to the CLOCKSS production network.<br />
<br />
The CLOCKSS Harvest Content Lead is responsible for this process.<br />
<br />
==== Correctness of Harvest Content ====<br />
<br />
Correctness in this context means that the publisher's content is Independently Understandable by the [[CLOCKSS: Designated Community]]. For the CLOCKSS archive, this is the responsibility of the publisher, not of the archive. Harvest content is used by customers of the publisher on a regular basis, so content that is not Independently Understandable on the publisher's website is rare, and is normally detected by the author. Under the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement], the publisher may submit content in any form, including proprietary formats, so full validation of content and content types is not always practical. When the LOCKSS team detects cases in which content might not be Independently Understandable they are reported to the publisher.<br />
<br />
Some content from both harvest and file transfer publishers is not intended to be displayed by browsers but to be input to other software. For example, the census of file formats requested during the ISO16363 audit showed the following instances of chemical formats:<br />
* chemical/x-cif 6850 instances<br />
* chemical/x-cml 6125 instances<br />
* chemical/x-cdx 5260 instances<br />
* chemical/cif 646 instances<br />
* chemical/x-chemdraw 247 instances<br />
* chemical/x-pdb 64 instances<br />
* chemical/x-xyz 44 instances<br />
* chemical/x-fasta 26 instances<br />
The MIME types not intended for the browser are either widely supported formats (such as Microsoft Office formats) or formats specific to some field (such as Chemistry: chemical/x-cif, chemical/x-cml, etc.). The evolution of these formats is a problem for the specific field; it is not something to which an archive can provide a generic solution. But see the discussion of emulation in [[LOCKSS:_Format_Migration|LOCKSS: Format Migration]].<br />
<br />
==== Annual Harvest Content Cycle ====<br />
<br />
The processing cycle for the harvest content a publisher is publishing during a given year can be described as follows.<br />
* The first phase begins typically in the second quarter of the year.<br />
** A sample of the publisher's AUs for the year is configured and processed on the content-testing machines to ensure the plugin works as expected, and corrective action is taken if necessary.<br />
** When the plugin is deemed satisfactory, all the publisher's AUs for the year are configured on the ingest machines and allowed to crawl and poll.<br />
* The second phase continues throughout the rest of the year.<br />
** Poll results for AUs on the ingest machines are monitored, and corrective action is taken if necessary, including enhancing the plugin.<br />
** More content samples may be configured and processed on the content-testing machines to ensure the plugin continues to work as expected throughout the year.<br />
* The third phase begins at the beginning of the following year.<br />
** After allowing for final updates (a time determined by the publishing frequency and experience with the publisher), a final "deep re-crawl" of each of that year's AUs is scheduled.<br />
** Then crawling for those AUs is disabled.<br />
** Then poll results are monitored until the AUs are deemed fully processed, at which point they are configured on the production machines.<br />
** Crawl and poll results of the AUs on the production machines are then monitored until the AUs are deemed fully processed.<br />
<br />
The CLOCKSS Harvest Content Lead is responsible for this process.<br />
<br />
==== Harvest Content Hosted Elsewhere ====<br />
<br />
As described in [[#Harvest_Content_Processing|Harvest Content Processing]], during content testing the list of excluded URLs is audited and the harvest content is inspected visually. Systematic use by the publisher of off-site web servers, for example Figshare or S3, is detected at this stage. Figshare, for example, is already being preserved in CLOCKSS, so content hosted there would be collected. Otherwise, the publisher can be asked to include appropriate permission in the off-site web server so that the content there can be collected, or to switch to file transfer. Non-systematic use, for example an individual author linking to their blog, will not result in preservation of the linked-to content as it cannot be considered part of the publisher's copyright content.<br />
<br />
==== Harvest Content Interactivity ====<br />
<br />
Some harvest journals use JavaScript-based techniques such as AJAX that require a Web crawler to execute (some of) their content rather than simply parsing it to find the links to other URLs that need to be harvested. The LOCKSS team has been drawing attention to and working on this problem [http://blog.dshr.org/2009/04/spring-cni-plenary-remix.html since at least 2009]. We [http://blog.dshr.org/2011/08/moonalice-plays-palo-alto.html worked with students at C-MU West to prototype collections via AJAX]. Under our current Mellon grant, we have developed a production AJAX collector and will use it when required.<br />
<br />
==== e-Book Processing ====<br />
<br />
Unlike serial content, which is published incrementally over a period of time, harvest content that takes the form of books can be processed over a regular cycle. During each cycle:<br />
* A sample of the publisher's books that were published during the previous interval is configured on the content-testing machines to verify that the plugin works adequately, and corrective action is taken if necessary.<br />
* When the plugin is deemed ready, all the publisher's books from the previous interval are configured on the ingest machines and allowed to crawl and poll. Results are monitored and corrective action is taken if necessary.<br />
* When a book is deemed fully processed on the ingest machines, it is configured on the production machines. The crawl and poll results are then monitored on the production machines until the book is deemed fully processed.<br />
<br />
The CLOCKSS Harvest Content Lead is responsible for this process.<br />
<br />
== File Transfer Content Pipeline ==<br />
<br />
=== File Transfer Publisher Engagement ===<br />
<br />
The designated LOCKSS team member's discussion with a new publisher whose content is to be received via file transfer needs to determine two things:<br />
* The means by which the file transfer content is transferred to the CLOCKSS ingest machines. This is up to the publisher; techniques that have been implemented include:<br />
** FTP from an FTP server that the publisher maintains to a CLOCKSS ingest machine.<br />
** FTP by the publisher to an FTP server on a CLOCKSS ingest machine.<br />
** <tt>rsync</tt> between a publisher machine and a CLOCKSS ingest machine.<br />
* The format of the file transfer content, in particular:<br />
** How the ingest scripts can verify that the content received is correct, for example by checking manifests and checksums.<br />
** How the content can be rendered if it is ever triggered, the ''triggering plan''. A publisher-specific version of the abstract plan described in [[CLOCKSS: Extracting Triggered Content#Preparing File Transfer Content for Dissemination|CLOCKSS: Extracting Triggered Content]] should be drawn up.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS File Transfer Plugin Lead is responsible for this process.<br />
<br />
=== Ingest script development ===<br />
<br />
Once the information above is available, the designated team member writes shell scripts to be executed from <tt>cron</tt> that ensure that collected content is up-to-date by:<br />
* If the content arrives via FTP from the publisher, collecting any as-yet-uncollected content.<br />
* If via <tt>rsync</tt>, running rsync against the publisher's machine.<br />
* If the format in which the publisher makes file transfer content available lacks checksums and manifests, generating them as the content is collected.<br />
* Otherwise, verifying the manifest and checksums in the content, and alerting the content team if any discrepancies are found.<br />
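The verification step can be sketched as follows, assuming a simple two-column "checksum filename" manifest using SHA-256; real publisher manifest formats vary, and <tt>verify_manifest()</tt> is an illustrative name rather than one of the actual ingest scripts.<br />
<br />
```python
import hashlib
import os

def verify_manifest(manifest_path, content_dir):
    """Return a list of discrepancies between a manifest and the files on disk."""
    problems = []
    with open(manifest_path) as manifest:
        for line in manifest:
            expected, name = line.split(None, 1)
            name = name.strip()
            path = os.path.join(content_dir, name)
            if not os.path.exists(path):
                # A file listed in the manifest was never received.
                problems.append("missing: " + name)
                continue
            with open(path, "rb") as data:
                actual = hashlib.sha256(data.read()).hexdigest()
            if actual != expected:
                # The file was received but its content does not match.
                problems.append("checksum mismatch: " + name)
    return problems
```
<br />
An ingest script would run a check of this kind after each collection pass and alert the content team if the returned list is non-empty.<br />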
<br />
The designated team member also writes a verification script that is run against all content on the file transfer ingest machine at intervals.<br />
<br />
The CLOCKSS File Transfer Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Plugin Development and Testing ===<br />
<br />
File transfer content also requires a plugin. These plugins are developed and tested in the same way as harvest plugins (see [[CLOCKSS: Ingest Pipeline#Harvest Plugin Development and Testing|above]]), except that the emphasis is on [[LOCKSS: Extracting Bibliographic Metadata|metadata extraction]] because it is in some cases more complex, whereas crawling is trivial and polling requires no filters.<br />
<br />
The CLOCKSS File Transfer Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Content Pre-Processing ===<br />
<br />
File transfer content undergoes a subset of the testing steps described above for [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|harvest content]]. The structure of file transfer AUs is simple and consistent (typically a directory hierarchy) so testing the crawl rules on a large sample of content is unnecessary. All the CLOCKSS boxes collect the same copy of the content so few of the complexities of preserving harvest content (such as hash filters) are present.<br />
<br />
A slightly simplified workflow in AUTest directs content-testing boxes to collect a single copy of a smaller sample of AUs from the staging server, and to extract metadata and check it for sanity. After any problems are corrected, the AU and similar AUs are configured on the CLOCKSS production boxes.<br />
<br />
=== File Transfer Content Ingest ===<br />
<br />
The collection scripts are added to the ingest user's <tt>crontab</tt>, and run daily to collect any new content. Every 24 hours the entire content of the file transfer ingest machine is synchronized with (a) an on-site backup and (b) an off-site backup using <tt>rsync</tt>. The verification scripts are run against each of these backup copies at intervals.<br />
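The daily synchronization can be sketched as follows. The helper builds an <tt>rsync</tt> mirror command and runs it once per backup destination; the source and destination paths are illustrative, not the actual CLOCKSS configuration.<br />
<br />
```python
import subprocess

def rsync_command(source, destination, dry_run=False):
    """Build an rsync invocation mirroring source to destination."""
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd

def sync_backups(source, destinations):
    """Mirror the ingest machine's content to each backup in turn."""
    for dest in destinations:
        # check=True raises if rsync exits non-zero, so a failed
        # synchronization is not silently ignored.
        subprocess.run(rsync_command(source, dest), check=True)
```
<br />
A cron entry would invoke <tt>sync_backups()</tt> once every 24 hours with the on-site and off-site backup destinations.<br />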
<br />
Once the ingested content is verified it is staged to a Web server that can be accessed only by the CLOCKSS boxes. The CLOCKSS boxes crawl the file transfer content staged on the Web server under control of a file transfer plugin and preserve it, as described in [[Definition of AIP#Creating an AIP from a file transfer SIP|Definition of AIP]].<br />
<br />
The staging Web server uses a clone of the directory hierarchy into which the file transfer content is collected. This hierarchy is created by linking to the files when the content in question is released to production. The names of the files in this hierarchy are the names assigned to them by the publisher as part of the FTP transfer process. These files are typically archive files, such as TAR or ZIP. The names of the files contained in these archives are those assigned by the publisher. File transfer SIPs do not necessarily correspond to a publisher's web site, and those that do generally do not include the publisher's URLs, so the publisher's URLs (as opposed to their file names) are not preserved. Thus the process of [[CLOCKSS: Extracting Triggered Content#Preparing File Transfer Content for Dissemination|triggering file transfer content]] cannot, and does not attempt to, reproduce the URL structure of the publisher's web site.<br />
<br />
File transfer content is considered to be preserved when it has been ingested by all production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). <br />
<br />
The CLOCKSS File Transfer Content Lead is responsible for this process.<br />
<br />
==== Completeness of File Transfer Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content delivered by the publisher. Content submitted via file transfer is either made available from a content server operated by the publisher, or delivered by the publisher to a content server hosted by CLOCKSS. In either case the content is transferred to a staging server operated by CLOCKSS. The ingest scripts include procedures that verify that the files submitted by the publisher correspond to those on the content server. Where publishers provide checksum information, checksums are also compared with locally computed checksums to ensure that the content is the same. If the checksums differ, the local copy is deleted and re-collected.<br />
<br />
==== Correctness of File Transfer Content ====<br />
<br />
Correctness in this context means that, if the content is ever triggered, the result will be Independently Understandable by the [[CLOCKSS: Designated Community]]. During initial engagement with file transfer publishers, their content types are assessed to develop a plan for rendering the content if it is ever triggered. One goal of the ingest scripts is to verify that the content types submitted match this assessment; if they do not the [[CLOCKSS: Ingest Pipeline#File Transfer Publisher Engagement|triggering plan for the content]] must be revised to account for the new content types.<br />
<br />
== Errata, Corrections and Retractions ==<br />
<br />
As regards subsequent errata and corrections, it is assumed that publishers follow the NLM ''best practice'' guidelines and at least refer and link to the errata or corrections in a subsequent issue. If the publisher does, they will be collected with that issue.<br />
<br />
As regards retractions, the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Participating Publisher Agreement] paragraph 2.E states:<br />
<blockquote><br />
E. At the request of Publisher, CLOCKSS will make reasonable efforts to identify Archived Content that contains information that has been retracted or removed (“Retracted”) by Publisher due to potential inaccuracy (“Retracted Content”). Upon a Trigger Event (as defined below) involving such Retracted Content, the CLOCKSS Board (the “Board”) shall determine whether such Retracted Content will be included in the Release (as defined below), and if Released, what if any notifications should accompany such Retracted Content. If the CLOCKSS Board determines that the Release of such Retracted Content would present a legal risk to CLOCKSS or Publisher, the Content will not be Released unless CLOCKSS and Publisher are indemnified against any Damages (as defined below). If Released Archived Content is later Retracted, the CLOCKSS Board shall determine what actions, if any, to take, including possible removal of such Retracted Content from public view. If Publisher at any time notifies CLOCKSS that Publisher has made a good-faith determination that the Release of certain Retracted Content may present a legal risk to CLOCKSS or Publisher, and the CLOCKSS Board nevertheless elects to Release such Retracted Content, Publisher shall have no indemnity obligation to CLOCKSS, whether pursuant to Section 10 below or otherwise, to the extent that any Claim (as defined below) arises from CLOCKSS Release of such Retracted Content.<br />
</blockquote><br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Participating Publisher Agreement] paragraph 4.F states:<br />
<blockquote><br />
Any Released Content may be authorized to be removed from Host Organization(s) and public access by the affirmative vote of at least seventy-five percent (75%) of all the members of the Board then in office (“Removal Authorization”). However, because of the impact of a Removal Authorization, a negative vote of three (3) or more members of the Board then in office, shall override such Removal Authorization leaving the Released Content available to public access, unless the removal is requested by the Publisher, in which case the Released Archived Content shall be removed from Host Organization(s) and public access.<br />
</blockquote><br />
The LOCKSS team has yet to receive any such request. Presumably such a request would be made, and would need to be preserved, at the article level. The request would be recorded in the triggering plan for that publisher (for file transfer content), and in a CLOCKSS: Retraction Requests document (preserved in CLOCKSS), and an indication added to the [[LOCKSS: Basic Concepts#Title Database|TDB]] that there was a retraction request for that AU. If and when that title was triggered, the trigger process would exclude the retracted content. The team is investigating whether participating publishers could supply regular lists of retractions, instead of specific requests for triggered content to the Board. These would be recorded routinely, rather than dealing with retractions only as part of the trigger process.<br />
<br />
There is potentially some conflict between these terms and the use of CC licenses for triggered content, since even if content redacted or withdrawn after triggering is removed from the triggered content sites other sites may have copied and re-published it, as is permitted by the CC license. The LOCKSS team are investigating possible ways to resolve this conflict, but presumably the CLOCKSS Archive would not be liable for the copies elsewhere that were made before notification of the redaction or withdrawal.<br />
<br />
As regards changed harvest content, when an AU is closed in the [[CLOCKSS:_Ingest_Pipeline#Annual_Harvest_Content_Cycle|Annual Harvest Content Cycle]] a final deep re-crawl of the content is undertaken to ensure that the closed AU contains up-to-date content. In addition, the [http://www.crossref.org/crossmark/ CrossMark service] offers the ability to detect versioning of articles. The LOCKSS team are investigating the possibility of using CrossMark to drive re-harvesting of updated content. This may be particularly important since at least one harvest publisher is already planning to update content without formal notification.<br />
<br />
As regards changed file transfer content, the CLOCKSS archive is dependent upon the publisher supplying it in subsequent SIPs. The triggering plan for each file transfer publisher includes information describing how the [[CLOCKSS:_Extracting_Triggered_Content#Preparing_File_Transfer_Content_for_Dissemination|process of extracting triggered content]] for that publisher ensures that the most recent version of content is triggered.<br />
<br />
== Omissions ==<br />
<br />
As regards unintentional omissions at the article level, the CLOCKSS archive [[CLOCKSS:_Logging_and_Records#External_Reports|reports article counts]] ingested to the publisher for billing purposes. These counts can be checked against the publisher's article counts. The LOCKSS team have had instances of the ingest article counts being both too small (articles had been ingested but the counting process was wrong, which the team detected) and too large (publishers reported this but their article counts were wrong).<br />
<br />
Unintentional omissions at the file level in harvest content would appear as broken links on the publisher's web site. The CLOCKSS crawlers on the ingest boxes would see these as URLs that should have been collected but which returned 404. The crawler's response to these 404s is configurable; almost all plugins are configured either to report via a warning and continue collecting the site or, after a delay, to retry the fetch up to N times and then report via a warning and continue.<br />
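The configurable retry behavior can be sketched as follows. The function and parameter names are illustrative, not the crawler's actual configuration interface; <tt>fetch</tt> stands in for any callable that attempts the collection and returns the HTTP status code.<br />
<br />
```python
import time

def fetch_with_retries(fetch, url, max_retries=3, delay=0.0, warn=print):
    """Fetch a URL, retrying on 404 up to max_retries times, then warn and continue."""
    for attempt in range(max_retries + 1):
        status = fetch(url)
        if status != 404:
            return status
        if attempt < max_retries:
            # Wait before retrying, in case the 404 is transient.
            time.sleep(delay)
    # All attempts returned 404: report via a warning and let the
    # crawl continue with the rest of the site.
    warn("giving up after %d retries: %s" % (max_retries, url))
    return 404
```
<br />
Setting <tt>max_retries=0</tt> corresponds to the report-and-continue configuration; a positive value corresponds to the delayed-retry configuration described above.<br />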
<br />
The ability to detect unintentional omissions at the file level in file transfer content varies depending on the structure of the individual publisher's SIP which is [[CLOCKSS:_Ingest_Pipeline#Ingest_script_development|encoded in the relevant ingest script]].<br />
<br />
As regards intentional omissions from file transfer content, there is nothing that an archive can do to prevent a file transfer publisher from refusing to supply content. For harvest content, since that is obtained by emulating what the publisher's readers do to obtain content, a simplistic approach to intentionally omitting content risks also depriving readers of the content. More sophisticated approaches would avoid this risk but would add substantial overhead for the publisher. Fundamentally, the CLOCKSS archive is preserving what the publishers pay it to preserve.<br />
<br />
== Feedback ==<br />
<br />
The CLOCKSS Executive Director receives regular reports of the article counts ingested, upon which publishers are billed (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]).<br />
<br />
Content AUs can be in one of three preservation states, as reported to the publisher, the CLOCKSS Board and the Keepers Registry (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]):<br />
# Committed for preservation<br />
# In process<br />
# Preserved<br />
Progress is tracked through the AU's configuration file.<br />
<br />
The CLOCKSS Harvest and File Transfer Content Leads are responsible for this process for their respective content.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Technical Staff<br />
** CLOCKSS Plugin Lead<br />
** CLOCKSS Content Lead<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Designated Community]]<br />
# [[CLOCKSS: Logging and Records]]<br />
# [[LOCKSS: Software Development Process]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[Definition of SIP]]<br />
# [[Definition of AIP]]<br />
# [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]</div>Dshrhttp://documents.clockss.org/index.php/Definition_of_AIPDefinition of AIP2014-04-10T03:42:00Z<p>Dshr: Changes after auditor visit: approved by David Rosenthal</p>
<hr />
<div>= CLOCKSS Definition of AIP =<br />
<br />
== OAIS Archival Information Package (AIP) ==<br />
<br />
The OAIS definition of AIP is:<br />
<blockquote>Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>AIP Edition: An AIP whose Content Information or Preservation Description Information has been upgraded or improved with the intent not to preserve information, but to increase or improve it. An AIP edition is not considered to be the result of a Migration.</blockquote><br />
<blockquote>AIP Version: An AIP whose Content Information or Preservation Description Information has undergone a Transformation on a source AIP and is a candidate to replace the source AIP. An AIP version is considered to be the result of a Digital Migration.</blockquote><br />
<blockquote>Archival Information Collection (AIC): An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages.</blockquote><br />
<blockquote>Archival Information Unit (AIU): An Archival Information Package where the Archive chooses not to break down the Content Information into other Archival Information Packages. An AIU can consist of multiple digital objects (e.g., multiple files).</blockquote><br />
<blockquote>Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.</blockquote><br />
The OAIS discussion of AIP is:<br />
<blockquote>Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The AIP may also contain a collection of other AIPs, and this is discussed and modeled in section 4. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.</blockquote><br />
<br />
The OAIS definition of Representation Information is:<br />
<blockquote>'''Representation Information:''' The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard.<br />
<br />
Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>'''Representation Network:''' The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.</blockquote><br />
<blockquote>'''Representation Rendering Software:''' A type of software that displays Representation Information of an Information Object in forms understandable to humans.</blockquote><br />
The OAIS discussion of Representation Information is:<br />
<blockquote>In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.</blockquote><br />
<br />
== CLOCKSS Archival Information Package (AIP) ==<br />
<br />
CLOCKSS calls its AIPs [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]]. They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is a directory hierarchy in a POSIX file system in which the directory names represent the components of the URL from which content was collected, including the terminal component. The order of the protocol and host fields of the URL is reversed, so content collected from the same host via multiple protocols (e.g. HTTP and HTTPS) is in subtrees of the host's directory. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL name, since <tt>#</tt> introduces a fragment within a URL) containing metadata relating to the URL up to that component. Additionally, each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to each version of the content. This metadata is represented as a file containing {key, value} pairs. It includes all the HTTP headers obtained by the <tt>GET</tt> that obtained the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It also includes additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum and a timestamp. This somewhat complex representation is required because:<br />
* The URLs <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> may both contain (different) content.<br />
* The content and/or the headers obtained from <tt>http://www.example.com/foo/bar</tt> at time T(0) and at time T(1) may differ.<br />
The representation allows easy access by tools other than the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]], for example shell scripts, should the LOCKSS software ever become obsolete.<br />
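The URL-to-path mapping described above can be sketched as follows (a minimal illustration in Python; the function name and the omission of escaping for special characters in path components are assumptions, not the LOCKSS daemon's actual code):<br />

```python
from urllib.parse import urlparse

def url_to_au_path(au_root, url):
    """Sketch of mapping a collected URL to its directory in an AU.

    The host precedes the protocol so that content collected from the
    same host via multiple protocols (e.g. HTTP and HTTPS) shares a
    single host subtree.  Escaping of special characters is omitted.
    """
    p = urlparse(url)
    parts = [seg for seg in p.path.split("/") if seg]
    return "/".join([au_root, p.netloc, p.scheme] + parts)

# Both of these URLs may hold (different) content; each maps to its own
# directory, whose "#content" subdirectory would hold the version files.
print(url_to_au_path("/repo/au123", "http://www.example.com/foo"))
# /repo/au123/www.example.com/http/foo
print(url_to_au_path("/repo/au123", "http://www.example.com/foo/bar"))
# /repo/au123/www.example.com/http/foo/bar
```

Because the mapping is reversible, the original URL can be reassembled from the path components, which is what makes the representation accessible to shell scripts.<br />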
<br />
[[File:AU-Structure.png]]<br />
<br />
=== CLOCKSS AIP Examples ===<br />
<br />
==== File Transfer AU ====<br />
<br />
The Journal of Laser Applications Volume 24:<br />
* [[Media:LA24 urls.pdf|A list of URLs (87-page PDF)]].<br />
* [[Media:LA24 metadata.pdf|The metadata]].<br />
<br />
==== Harvest AU ====<br />
<br />
Advances in Building Energy Research Volume 6:<br />
* [[Media:ABER V6 urls.pdf|A list of URLs (43-page PDF)]].<br />
* [[Media:ABER V6 metadata.pdf|The metadata]].<br />
<br />
=== CLOCKSS Preservation Description Information (PDI) ===<br />
<br />
The OAIS Reference Model classifies the Preservation Description Information (PDI) included in an AIP as follows:<br />
<ul><li>'''Provenance:''' the provenance of each version of each URL in the AU can be determined as follows. The content of that version was obtained from the URL represented by the path from the root of the AU's directory hierarchy to the parent of the content directory, at the time of the timestamp. If this content was the result of a repair from another CLOCKSS box, that is recorded in the metadata. </li><br />
<li>'''Fixity:''' the metadata for each version of each URL in the AU includes a checksum computed at the time it was obtained and, if one is available, a checksum provided by the Web server from which it was obtained.</li><br />
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:<br />
<ul><br />
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the [[LOCKSS: Basic Concepts#Plugins|plugin ID]].</li><br />
<li>The ''plugin ID'', which, in encoded form, identifies the class to be instantiated and supplies additional information.</li><br />
</ul><br />
In effect, the context for the AU is a customized instance of a Java class, normally referred to as its ''plugin''. It is thus executable, capable of performing operations on the AU such as [[Definition of AIP#Creating AIPs from SIPs|adding content and metadata from a SIP]], [[LOCKSS: Extracting Bibliographic Metadata|extracting metadata]], and taking part in [[LOCKSS: Polling and Repair Protocol|integrity checks]].<br /><br />
The context for the content of a URL in an AU consists of the associated metadata, including the <tt>Content-Type</tt> and the other HTTP headers which together provide the information a Web browser uses to render the content.<br /><br />
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li><br />
<li>'''Reference:''' Each AU has an immutable internal name, its [[LOCKSS: Basic Concepts#AUID|AUID]] computed from its context information and stored in the file named <tt>#au_id_file</tt> in the root directory of the AU. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li><br />
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul><br />
<br />
== Creating AIPs from SIPs ==<br />
<br />
When the AU is created, its root directory is created, and the context information ([[LOCKSS: Basic Concepts#Plugins|plugin ID]] and AU parameters) is recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.<br />
<br />
Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are [[Definition of SIP|two kinds of SIP]]:<br />
* A ''harvest'' SIP represents content that the CLOCKSS archive will ingest by crawling the publisher's web site.<br />
* A ''file transfer'' SIP represents content that the publisher will package and transfer to the CLOCKSS archive via FTP, rsync, or other file transfer mechanism.<br />
<br />
The process of creating an AU (AIP) from a SIP is different for each of the two types of SIP, but each process starts by ''configuring'' the AU on the CLOCKSS boxes, which involves supplying the context information (plugin ID and the parameters it requires) via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. Each box creates a root directory for its instance of the AU at a suitable place in its POSIX file system and calls the AU's plugin to arrange attempts to collect SIPs. The plugin contains information about times when content collection is allowed. The LOCKSS daemon on each box maintains a schedule of all AUs' collection attempts; when an AU requests a collection attempt it is scheduled so as to conform to box-wide and publisher-specific limits on the number of simultaneous collection attempts.<br />
<br />
AIP (AU) instances in CLOCKSS boxes may be created even before the first SIP supplying content for them is available. The instance continues to accumulate content via a sequence of SIPs becoming available through time, as the LOCKSS daemon repeatedly crawls the Web server from which SIPs are collected. Nominally, each AU represents a delimited span of time, such as a year or a volume of a journal. But because SIPs containing errata or corrections can arrive even after the delimited span of time, there is in general never a point in time at which the AU can be said to be definitively complete in the sense that no further content will ever be added.<br />
<br />
Although, as documented in [[CLOCKSS: Ingest Pipeline]], the quality assurance process that content undergoes as it is ingested into the CLOCKSS archive includes some visual spot checks that the content renders properly in a web browser, these are primarily intended to ensure that all necessary URLs are being harvested. At the scale of the CLOCKSS archive's operations it is not feasible for these checks to be exhaustive. Further, the Content Information in an AIP is copyrighted by the publisher. It is their responsibility to ensure that it is Independently Understandable for their readers, who are also the eventual Consumers of the content if it is ever triggered from the CLOCKSS archive. Even if these spot checks were to detect some rendering problems, the CLOCKSS archive would not be permitted to modify the Content Information to correct them.<br />
<br />
The goal of the [[CLOCKSS: Ingest Pipeline]] is not to ensure that the AUs are "complete and correct"; there are no operationally implementable definitions of those terms. If there were, they would involve second-guessing the publishers. The goal is rather to ensure that the AUs "faithfully reflect what the publisher has published on their web site (for harvested AUs) or supplied to the archive (for file transfer AUs)".<br />
<br />
=== Creating an AIP from a harvest SIP ===<br />
<br />
When the scheduled time for the AU's requested collection attempt arrives, the [[LOCKSS: Basic Concepts#Plugins|plugin]] configures the box's Web crawler, which first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site, then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly-discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].<br />
<br />
Note that for quality assurance (QA) reasons, as set out in [[CLOCKSS: Ingest Pipeline]], in practice each harvest AU is created twice:<br />
* First, a temporary AU is created on the CLOCKSS ingest machines. These machines collect the AU's current content, and then come to agreement on the content using the [[LOCKSS: Polling and Repair Protocol]].<br />
* Once agreement is reached, the permanent AU is created on each of the production CLOCKSS boxes. They then:<br />
** crawl the content from the CLOCKSS ingest boxes<br />
** start regular integrity checks with the other production CLOCKSS boxes using the [[LOCKSS: Polling and Repair Protocol]].<br />
The process the production boxes use is exactly the same as used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the publisher's URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will be that previously collected from the publisher's URL by the ingest box (see [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|Harvest Content Processing]]). Note that the publisher's web site is not involved in this process. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the [[LOCKSS: Basic Concepts#Title Database|Title DataBase (TDB)]] to ZAPPED. The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.<br />
<br />
=== Creating an AIP from a file transfer SIP ===<br />
<br />
Publishers choosing to supply content via file transfer can choose either:<br />
* To ''push'' the content via <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> to a CLOCKSS-run ingest server, using a user name and password chosen by the CLOCKSS team and specific to the publisher. <br />
* To have the CLOCKSS archive's ingest server ''pull'' the content from a publisher-run <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> server using a user name and password chosen by, and specific to, the publisher.<br />
In both cases, the combination of the DNS name at the publisher's end, the user name, and the password identifies the publisher for provenance purposes.<br />
<br />
Content and metadata in SIPs received via file transfer are organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the ''staging server''. The hierarchy contains a directory per publisher, each containing a directory per year that holds the content and metadata received from that publisher during that year, and an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information; for those publishers, fixity information is computed as soon as possible after file transfer and stored with the content.<br />
<br />
At the start of each new year, an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's [[LOCKSS: Basic Concepts#Plugins|plugin]] is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server. The publisher's names or URLs for these files are contained in, or recoverable from, metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.<br />
<br />
After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its status changed to ZAPPED in the TDB (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved [[CLOCKSS: Logging and Records#External Reports|process for generating external reports]].<br />
<br />
== CLOCKSS Representation Information ==<br />
<br />
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS archive is preserved outside the CLOCKSS archive, primarily in open source code repositories.<br />
<br />
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS archive, in the SourceForge repository and elsewhere.<br />
<br />
Thus the Representation Information that must be preserved with content objects in a CLOCKSS AIP is the information needed by web browsers to render, and by the LOCKSS software to migrate, the content obtained from a URL. This information is in two parts:<br />
* The <tt>Mime-Type</tt> and other HTTP headers obtained from the URL. The metadata preserved for each version of each URL within an AU includes all HTTP headers.<br />
* "Magic Number" and other information contained in the HTTP content payload itself. The content preserved for each version of each URL within an AU is the entire HTTP content payload.<br />
A web browser has no information on which to base its rendering of a URL's content other than the URL's name (perhaps including a file extension) and these two parts, which must therefore be adequate for the purpose of making the content understandable to Consumers.<br />
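How a renderer can combine these two parts is sketched below (illustrative only; the magic-number signatures shown are standard, but the precedence logic is an assumption of this sketch rather than documented browser or LOCKSS behaviour):<br />

```python
# Leading "magic number" bytes for a few common Web formats.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def effective_type(headers, payload):
    """Choose a media type for rendering preserved content: payload
    evidence first, then the declared Content-Type header, then a
    generic fallback."""
    for magic, mtype in MAGIC.items():
        if payload.startswith(magic):
            return mtype
    declared = headers.get("Content-Type", "").split(";")[0].strip()
    return declared or "application/octet-stream"
```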
<br />
== Locating Digital Objects ==<br />
<br />
The term "digital objects" is often used in discussions of preserved digital information. In the CLOCKSS context, it might refer to either of two types of object:<br />
* AIPs, or in the CLOCKSS context AUs.<br />
* Content objects within an AU.<br />
<br />
=== Locating AIPs ===<br />
<br />
The CLOCKSS archive has a single class of AIP, called an Archival Unit (AU). As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], the Reference part of the AU's Preservation Description Information is an immutable name that is the same on each CLOCKSS box. The location of the instance of the AU on a particular CLOCKSS box may be obtained by querying a map from this internal name to the path to the AU's root directory. This map is built during box startup and subsequently maintained as new AUs are created or existing AUs are moved.<br />
<br />
=== Locating content objects ===<br />
<br />
Content within an AU can be located in one of two ways:<br />
* Via the URL from which it was obtained. As [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|described above]], there is a reversible mapping between the URL from which the content was collected and the path from the root of the AU containing it in the POSIX file system containing the AU. This enables content within an AU to be located via the map between the AU's internal name and its root location, and the components of the URL.<br />
* Via metadata search. As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], content within an AU can be located by querying the [[LOCKSS: Metadata Database|metadata database]] using specific bibliographic metadata fields to match against the [[LOCKSS: Extracting Bibliographic Metadata|bibliographic metadata supplied by the publisher]] or derived from the AU's context. <br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# [[Definition of SIP]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[CLOCKSS: Designated Community]]<br />
# [[LOCKSS: Format Migration]]<br />
# David S.H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. http://dx.doi.org/10.1045/january2005-rosenthal accessed 2013.8.7</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Extracting_Triggered_ContentCLOCKSS: Extracting Triggered Content2014-04-10T03:36:42Z<p>Dshr: Changes after auditor visit: approved by Tom Lipkis</p>
<hr />
<div>= CLOCKSS: Extracting Triggered Content =<br />
<br />
The CLOCKSS board may decide to trigger content from the CLOCKSS archive when it is no longer available from any publisher. Reasons for doing so include:<br />
* A publisher discontinuing an entire title.<br />
* A publisher who has acquired a title deciding not to re-host a run of back issues.<br />
* Disaster.<br />
The [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS Participating Publisher Agreement] paragraph 4.E states:<br />
<blockquote><br />
Released Content use terms and restrictions will be determined by an accompanying Creative Commons license (or equivalent license) chosen either by Publisher or, if Publisher fails to respond within thirty (30) days following receipt by it of the notice described above, by the CLOCKSS Board.<br />
</blockquote><br />
At present, EDINA and Stanford have volunteered to re-publish triggered content, but it is important to note that the Creative Commons license means that anyone can re-publish such content, and that re-publishing is not a core function of the CLOCKSS archive.<br />
<br />
Once the CLOCKSS board notifies the LOCKSS Executive Director that a trigger event has occurred, the CLOCKSS Metadata Lead and assigned LOCKSS staff run a process that delivers the triggered content to the re-publishing sites, and updates a [https://www.clockss.org/clockss/Triggered_Content section of the CLOCKSS website] with links pointing to the re-published content at both sites. In OAIS terminology, the package delivered to the re-publishing sites is a [[Definition of DIP|Dissemination Information Package (DIP)]].<br />
<br />
Additionally, a publisher with content in CLOCKSS can request to be supplied with a copy. In such a case, the process described below is followed but the content is delivered to the requesting publisher, not to a re-publishing site.<br />
<br />
== Overview of the Trigger Process ==<br />
<br />
The trigger process involves: <br />
# identifying the triggered content in the CLOCKSS preservation network,<br />
# extracting the triggered content from the CLOCKSS preservation network, <br />
# preparing the content for publication on the triggered content machines,<br />
# re-hosting triggered content on the CLOCKSS triggered content site,<br />
# re-registering triggered article DOIs.<br />
The following sections describe the process as it is currently implemented. It is possible that content triggered in the future might contain files whose format has become obsolete. The additional processes that would be needed in this situation are described in [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|LOCKSS: Format Migration]].<br />
<br />
The CLOCKSS Metadata Lead is responsible for this process.<br />
<br />
== Identifying the Triggered Content ==<br />
<br />
The triggering process begins by identifying the triggered content within the CLOCKSS preservation network. First, the [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]] that contain the title are identified by querying the [[LOCKSS: Metadata Database|database of bibliographic metadata]] that is compiled and maintained by each CLOCKSS box. This database contains a record of articles published for each preserved title, indexed by year, volume, and issue.<br />
<br />
A query to this database on a production CLOCKSS box yields a list of [[LOCKSS: Basic Concepts#AUID|AU identifiers (AUIDs)]] that identify the associated AUs ([[Definition of AIP|AIPs]] in OAIS terminology) being preserved in the CLOCKSS network. A similar query to an ingest machine identifies any triggered AUs that have yet to emerge from the ingest pipeline (see [[CLOCKSS: Ingest Pipeline]]). The AUIDs are the same on every CLOCKSS box and, as described in [[Definition of AIP]], uniquely identify the location on each box of that AU (AIP). For content that was harvested from a publisher's website, a single AU will normally contain all the files for a given volume or year. For content obtained via file transfer from publishers, content is organized into AUs that contain titles, volumes, and issues from the same publisher that were ingested in a given calendar year.<br />
<br />
== Extracting the Triggered Content ==<br />
<br />
Once the AU has been identified, the triggered content must be extracted from the CLOCKSS preservation network. If the content has already been released to the production CLOCKSS boxes, the content is extracted from the Stanford production CLOCKSS box (<tt>clockss-stanford.clockss.org</tt>) because it is the one most available to the technical staff. Content that is still in the ingest pipeline is flushed through the [[CLOCKSS: Ingest Pipeline]] to the production boxes and triggered from there.<br />
<br />
For harvested content the extraction process involves locating the directories in the CLOCKSS box repository that correspond to each AU of the triggered content. Each of the hierarchies under these directories is included in the DIP. If all content is on the Stanford CLOCKSS box, it can be exported directly into zip or tar files using the LOCKSS daemon's export functionality.<br />
<br />
For file transfer content, the extraction process involves locating the directories in the CLOCKSS box that correspond to the AUs that contain the triggered content. Within these directories are either sub-directories or archive files that contain the content. These directories or archive files are copied from the repository to a server where the triggered content will be prepared for publication.<br />
<br />
In both cases, a check is performed in case some of the extracted content includes material that is recorded as having been retracted or withdrawn, as described in [[CLOCKSS: Ingest Pipeline#Errata, Corrections and Retractions|CLOCKSS: Ingest Pipeline]]. If so, that content is excluded before the trigger process continues.<br />
<br />
== Preparing File Transfer Content for Dissemination ==<br />
<br />
File transfer content is typically a collection of files that include PDFs of articles, XML formatted full-text files, supporting files such as images and multi-media, and metadata files that contain bibliographic information. These files are input to the processes of the publishing platform that displays the publisher's content to readers. The exact content of the collection varies between publishers. Typically, further processing involving the following steps is required to extract the content from the directories or archive files that were copied from the CLOCKSS box repository, and generate from them a web site:<br />
* Running the script that verifies the MD5 checksums stored with the content.<br />
* Unpacking any archive files that contain the content into their own directories and isolating the files that correspond to the triggered content. Publishers tend to group files for individual issues into their own directories, so isolating the files involves retaining only those directories that correspond to the content being triggered and discarding directories for other content.<br />
* Adding any full-text PDF files unmodified to the web site.<br />
* Adding other files, such as images and multi-media, to the web site.<br />
* Rendering any XML files into readable full-text HTML pages.<br />
* Generating HTML abstract pages linking to the full-text PDF and HTML pages.<br />
* Inserting article-level bibliographic metadata into the abstract pages' <meta> tags.<br />
* Creating article-level metadata files in standard forms such as RIS and BibTeX, and linking to them from the abstract pages.<br />
* Generating an HTML issue table of contents page with links to the article abstract pages.<br />
* Generating a volume table of contents page with links to the issue table of contents pages.<br />
* Generating a journal table of contents page with links to the volume table of contents pages. <br />
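The checksum verification step can be illustrated by a minimal sketch, which assumes a hypothetical <tt>checksums.md5</tt> manifest of hash-and-filename lines in each content directory (the actual scripts and the publishers' manifest formats vary):<br />
<br />
```python
import hashlib
from pathlib import Path

def verify_md5_manifest(directory):
    """Compare each file's MD5 with the entries in a 'checksums.md5' manifest.

    Returns a list of (filename, expected, actual) tuples for any mismatches;
    an empty list means every checksum verified.
    """
    directory = Path(directory)
    mismatches = []
    for line in (directory / "checksums.md5").read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        actual = hashlib.md5((directory / name.strip()).read_bytes()).hexdigest()
        if actual != expected.lower():
            mismatches.append((name.strip(), expected.lower(), actual))
    return mismatches
```
<br />
Content whose checksum fails verification would be investigated before preparation continues.<br />
<br />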
This process takes place on a temporary ''preparation server''. Here is a sample generated issue table of contents page:<br />
<br />
[[File:Triggering_Content-1.png|thumb|center|Generated issue table of contents]]<br />
<br />
Here is a sample generated article abstract page:<br />
<br />
[[File:Triggering_Content-2.png|thumb|center|Generated article abstract]]<br />
<br />
Here is an example of Dublin Core article-level metadata included in the generated abstract page shown above:<br />
<pre style="font-size:12px"><br />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><br />
<meta name="dc.Format" content="text/HTML"><br />
<meta name="dc.Publisher" content="Taylor &amp; Francis"><br />
<meta name="dc.Title" content="Implementation and Assessment of an Introductory Pharmacy Practice Course Sequence "><br />
<meta name="dc.Identifier" scheme="coden" content="Journal of Pharmacy Teaching, Vol. 14, No. 1, 2007: pp. 5–17"><br />
<meta name="dc.Identifier" scheme="doi" content="10.1300/J060v14n01_02"><br />
<meta name="dc.Date" content="31Oct2007"><br />
<meta name="dc.Creator" content=" Assistant Professor and coordinator Dr. Emily W. Evans Pharm.D. and AE-C and CDM"><br />
<meta name="keywords" content=", Education, pharmacy practice, laboratory education"><br />
</pre><br />
<br />
The result of this process is a directory hierarchy capable of being exported by a web server such as Apache. Initially, this was the form in which file transfer content was disseminated to the re-publishing sites, which used Apache directly. Now, the content is exported by Apache running on the preparation server, and ingested in the normal way by a LOCKSS daemon, after which the generated directory hierarchy can be deleted. This results in a set of AUs analogous to those which would have been obtained had the LOCKSS daemon ingested the content from the original publisher, albeit with a different look and feel. As time permits, early triggered content will be re-disseminated using this technique, as it provides better compatibility with link resolvers and other library systems.<br />
<br />
== Assembling the DIP ==<br />
<br />
The AUs to be re-published can be collected from the selected CLOCKSS box and ingest machines (for harvested content) or from the preparation server (for file transfer content) and converted to a compressed archive using <tt>zip</tt> or <tt>tar</tt>. This forms a [[Definition of DIP|DIP]] that can be transferred to the re-publishing sites using <tt>rsync</tt>, <tt>sftp</tt> or <tt>scp</tt>.<br />
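<br />
The packaging step can be sketched with Python's standard <tt>tarfile</tt> module, purely for illustration (in practice <tt>zip</tt> or <tt>tar</tt> is invoked directly, and the directory and function names here are hypothetical):<br />
<br />
```python
import tarfile
from pathlib import Path

def assemble_dip(au_dirs, dip_path):
    """Pack the collected AU directories into one compressed archive (the DIP)."""
    with tarfile.open(dip_path, "w:gz") as dip:
        for au_dir in au_dirs:
            au_dir = Path(au_dir)
            # Store each AU under its own top-level name inside the archive.
            dip.add(au_dir, arcname=au_dir.name)
    return dip_path
```
<br />
The resulting archive is what is then pushed to a re-publishing site with <tt>rsync</tt>, <tt>sftp</tt> or <tt>scp</tt>.<br />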
<br />
== Re-publishing the Triggered Content ==<br />
<br />
Currently, two sites re-publish triggered content: [http://triggered.stanford.clockss.org Stanford University] and [http://triggered.edina.clockss.org EDINA at the University of Edinburgh]. Both use a combination of an Apache web server and a LOCKSS daemon to do so although, as described in [[Definition of AIP]], the structure of AUs allows other tools easy access to the content and metadata. Other ways to re-publish the content delivered as this type of [[Definition of DIP|DIP]] are easy to envisage, such as a shell script that converts it to an Apache web site.<br />
<br />
The technique for re-publishing newly triggered content delivered as this type of [[Definition of DIP|DIP]] used by the current re-publishing sites is as follows. On each of the re-publishing machines:<br />
* Configure the appropriate [[LOCKSS: Basic Concepts#Plugins|plugin]] and [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entries, by updating the triggered-content configuration and plugin repositories on <tt>props.lockss.org</tt> (see [[LOCKSS: Property Server Operations]]).<br />
* Force the LOCKSS daemon to reload its configuration and plugins from the repository.<br />
* Unpack the AUs into the repository hierarchy of the LOCKSS daemon.<br />
* Make a visual check of the resulting website.<br />
<br />
The final step is to add an entry for the newly triggered title to the index of triggered titles in the “Triggered Content” section of the CLOCKSS website.<br />
<br />
[[File:Triggering_Content-3.png|thumb|center|Triggered titles index]]<br />
<br />
The entry points to a new landing page for the title that provides information about the title, publication history, and triggering process. It also includes links to the issues hosted at Edina and Stanford.<br />
<br />
[[File:Triggering_Content-4.png|thumb|center|Triggered title landing page]]<br />
<br />
Because the triggered content carries Creative Commons licenses, other institutions can also re-publish it. For example, here is [http://web.archive.org/web/*/http://www.clockss.org/clockss/Annals_of_Clinical_Psychiatry <i>Annals of Clinical Psychiatry</i>] at the [http://www.archive.org Internet Archive].<br />
<br />
== Re-registering Triggered Article DOIs ==<br />
<br />
Most publishers register individual article Digital Object Identifiers (DOIs) with a DOI registration agency such as [http://www.crossref.org CrossRef]. Once a title is no longer available from the publisher, the registration records for its articles should be updated to refer to the content hosted at Edina and Stanford. This is done by preparing a tab-separated file with the DOI of each article and the corresponding new URL; a separate file is required for the articles hosted at Edina and for those hosted at Stanford. The registrar uses the data in this file to update the records for the corresponding DOIs.<br />
<br />
For content that is being served by the two re-publishing servers, this task is simple because their LOCKSS daemon provides an OpenURL resolver, allowing access to articles via their DOIs. The DOI information is available from the article metadata, via the DOI link in the daemon UI. Here is a portion of the file for a title being hosted at Edina that can be sent to the DOI registrar.<br />
<br />
<pre><br />
10.1300/J060v09n02_04 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_04 <br />
10.1300/J060v09n02_05 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_05 <br />
10.1300/J060v09n02_07 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_07 <br />
10.1300/J060v09n02_08 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n02_08 <br />
10.1300/J060v09n01_02 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_02 <br />
10.1300/J060v09n01_03 http://triggered.edina.clockss.org/ServeContent?rft_id=info:doi/10.1300/J060v09n01_03 <br />
</pre><br />
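<br />
Producing this file is mechanical once the DOIs are known. A sketch (the function name is hypothetical, and the DOI list is assumed to have already been extracted from the article metadata via the daemon UI):<br />
<br />
```python
def doi_update_file(dois, host):
    """Return tab-separated 'DOI<TAB>URL' lines in the form shown above."""
    template = "http://{host}/ServeContent?rft_id=info:doi/{doi}"
    return "\n".join(
        "{}\t{}".format(doi, template.format(host=host, doi=doi)) for doi in dois
    )
```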
<br />
== Recording the Dissemination ==<br />
<br />
Generation of a DIP is performed only at the direction of the CLOCKSS board; that direction is recorded in the board's minutes.<br />
<br />
== Configuring the Re-publishing Systems ==<br />
<br />
The current re-publishing sites each comprise a LOCKSS daemon and an Apache web server. Both currently reside on the same machine, but they could also be hosted on separate machines. The Apache web server is running on the standard HTTP port 80, while the LOCKSS daemon is serving content on port 8082. External access to the LOCKSS daemon is limited to administrative uses and the Apache server on the system where it is running.<br />
<br />
The Apache server runs a proxy module that forwards requests matching certain patterns to the LOCKSS daemon's content server. A triggered content request is first handled by the node's Apache web server, which processes it through its ProxyPassMatch filter. If the request matches a filter pattern, it is passed to the LOCKSS daemon's ServeContent servlet.<br />
<br />
[[File:Triggering_Content-5.png]]<br />
<br />
If the requested content is in the AUs of the LOCKSS daemon, the page is returned to the Apache web server and then to the user. Otherwise, the daemon returns an HTTP error response, and the Apache web server returns a page that indicates the content was not preserved. This configuration serves two purposes. The first is to provide additional processing capabilities beyond those available from the LOCKSS daemon. The second is so that external access to the re-published content can be made through a standard port, an important consideration for many IT firewall configurations.<br />
<br />
For file transfer content that was triggered using the earlier technique and has yet to be updated, the Apache server acts as a normal web server that responds to requests for the prepared content.<br />
<br />
[[File:Triggering_Content-6.png]]<br />
<br />
=== Virtual machine requirements ===<br />
<br />
Re-publishing hosts have modest requirements:<br />
<br />
* Processor: Single-core CPU at 2GHz or better<br />
* Memory: 1GB<br />
* Storage: 40GB<br />
* Network: 10/100 Mbit/s<br />
<br />
=== Setting up the CLOCKSS daemon ===<br />
<br />
''Re-publishing hostconfig:''<br />
<br />
The LOCKSS daemons at the re-publishing sites are part of the clockss-triggered preservation group. When running hostconfig on a new CLOCKSS Triggered Content Node, care needs to be taken to configure it as such:<br />
<br />
<pre><br />
Props URL: http://props.lockss.org:8001/clockss-triggered/lockss.xml<br />
Preservation Group: clockss-triggered<br />
</pre><br />
<br />
''Re-publishing ServeContent servlet:''<br />
<br />
The LOCKSS daemon needs to have its ServeContent servlet enabled on port 8082. This is done by logging into the LOCKSS daemon's administrative UI, clicking on "Content Access Options" then "Content Server Options" and then checking "Enable content server on port 8082".<br />
<br />
=== Setting up the Apache server ===<br />
<br />
The following configuration is used for the Apache server at each re-publishing site:<br />
<br />
<pre><br />
NameVirtualHost triggered.SITE.clockss.org:80<br />
<VirtualHost triggered.SITE.clockss.org:80><br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/html<br />
ServerName triggered.SITE.clockss.org<br />
<IfModule mod_proxy.c><br />
ProxyRequests Off<br />
ProxyVia On<br />
<Proxy triggered.SITE.clockss.org/*><br />
AddDefaultCharset off<br />
Order deny,allow<br />
Allow from all<br />
</Proxy><br />
ProxyPassMatch ^/((ServeContent|images).*)$ http://localhost:8082/$1<br />
ProxyErrorOverride On<br />
ErrorDocument 404 /not-preserved.html<br />
</IfModule><br />
ErrorLog logs/error_log<br />
CustomLog logs/access_log common<br />
</VirtualHost><br />
</pre><br />
<br />
=== Managing Re-publishing Sites ===<br />
<br />
The re-publishing site should follow the [[CLOCKSS: Box Operations|security, maintenance and upgrade guidelines for CLOCKSS boxes]].<br />
<br />
=== Balancing load across CLOCKSS triggered content nodes ===<br />
<br />
A load-balancing Apache server is also configured to provide a single point of access to the re-publishing servers at [http://triggered.edina.clockss.org EDINA] and [http://triggered.stanford.clockss.org Stanford]. It is [http://triggered.clockss.org here]. This additional Apache instance serves two purposes: the first is to provide a single URL for services that can accept only one, such as link resolvers; the second is to ensure high availability of the triggered content.<br />
<br />
[[File:Triggering_Content-7.png]]<br />
<br />
Here is the configuration file for this load balancing Apache server.<br />
<pre><br />
<VirtualHost *><br />
ServerName triggered.clockss.org<br />
ServerAdmin support@support.clockss.org<br />
DocumentRoot /var/www/clockss-triggered<br />
<IfModule mod_proxy.c><br />
ProxyRequests off<br />
ProxyVia On<br />
ProxyPassMatch ^/((ServeContent|images).*)$ balancer://triggered-pool/$1<br />
<Proxy balancer://triggered-pool><br />
BalancerMember http://triggered.edina.clockss.org/<br />
BalancerMember http://triggered.stanford.clockss.org/<br />
ProxySet lbmethod=byrequests<br />
</Proxy><br />
</IfModule><br />
CustomLog /var/log/apache2/access.log combined<br />
ErrorLog /var/log/apache2/error.log<br />
ServerSignature On<br />
</VirtualHost> <br />
</pre><br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Content Lead<br />
** CLOCKSS Network Administrator<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[Definition of AIP]]<br />
# [[Definition of DIP]]<br />
# [https://www.clockss.org/clockss/Triggered_Content CLOCKSS Triggered Content]<br />
# [[LOCKSS: Format Migration]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Property Server Operations]]<br />
# [[CLOCKSS: Box Operations]]</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Logging_and_RecordsCLOCKSS: Logging and Records2014-04-10T03:33:36Z<p>Dshr: Changes after auditor visit: approved by Tom Lipkis</p>
<hr />
<div>= CLOCKSS: Logging and Records =<br />
<br />
The CLOCKSS system uses three types of record:<br />
* '''Logs:''' detailed logs, at an extensively customizable level of detail, of the operations of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]], written to log files on the host machine. The purpose of these logs is to enable diagnosis of problems that arise. Logs are retained on the machine that generated them in <tt>/var/log</tt>.<br />
* '''Alerts:''' messages sent off-machine by the LOCKSS daemon when significant events occur. The purpose of Alerts is to draw attention to potential problems that may need diagnosis. Alerts are sent via e-mail to the <<tt>clockss-alerts</tt>> mail alias, and added to the log files on the host machine via the <tt>syslog</tt> mechanism.<br />
* '''Records:''' statistical summaries and business records of the operation of the system as a whole, not of individual boxes. They are provided to the CLOCKSS board and CLOCKSS member organizations, and electronic copies are being stored in a system run by the Executive Director.<br />
<br />
== Retention Policy ==<br />
<br />
Although the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] can generate extremely detailed logs, doing so routinely is counter-productive. It buries the signal in the noise. The goal of the logging and record policy, in the absence of a specific problem to diagnose, is to:<br />
* Generate Logs adequate to, and retain them long enough to, enable simple diagnosis.<br />
* Generate Alerts on any condition that the daemon determines is anomalous, and on other significant events, with sufficient detail to draw the system administrator's attention to problems requiring diagnosis, and to retain them indefinitely.<br />
* Generate the Records needed for business and governance, and for monitoring of the CLOCKSS network's overall performance, and to retain them indefinitely.<br />
Specific log retention policies for each CLOCKSS box are specified in <tt>/etc/logrotate.conf</tt> and the files in <tt>/etc/logrotate.d/</tt>. On each CLOCKSS Box:<br />
* System logs are retained for a month.<br />
* At least the most recent 20MB of LOCKSS daemon log data is retained.<br />
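<br />
For example, the daemon-log part of such a retention policy could be expressed by a logrotate stanza along the following lines. This is only a sketch; the actual file names and options on a given box may differ.<br />
<pre><br />
/var/log/lockss/daemon.log {<br />
    # keep roughly the most recent 20MB (4 rotations of 5MB each)<br />
    size 5M<br />
    rotate 4<br />
    compress<br />
    missingok<br />
}<br />
</pre><br />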
<br />
== Ingest Alerts ==<br />
<br />
An Alert is generated at the end of each crawl of a [[Definition of AIP#Creating AIPs from SIPs|SIP]] that meets certain criteria, recording the final status of the crawl and the number of HTTP 200 results obtained (this is equivalent to the number of new URLs that were found, plus the number of existing URLs whose content was found to have been modified). An example of such an alert:<br />
<pre><br />
Date: Sat 19 Feb 2011 04:17:24 PST<br />
From: LOCKSS box ingest2.clockss.org <clockss-alert@xxx.xxx><br />
Subject: [lockss-alert] LOCKSS box info: CrawlEnd<br />
<br />
LOCKSS box 'ingest2.clockss.org' raised an alert at Sat Feb 19 04:12:24 PST 2011<br />
<br />
Name: CrawlEnd<br />
Severity: info<br />
AU: Nature Reviews Genetics Volume 11<br />
Explanation: Crawl ended successfully: 2276 new files<br />
<br />
Crawl ended successfully, 2276 new files, 4 warnings.<br />
</pre><br />
Here is an example failed crawl alert from an ingest box:<br />
<pre><br />
From: LOCKSS box ingest1.clockss.org <clockss-alert@xxx.xxx><br />
To: clockss-alert@xxx.xxx<br />
Date: Thu 20 Mar 2014 21:50:18 PDT<br />
Subject: [clockss-alert] LOCKSS box warning: CrawlFailed<br />
<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Thu Mar 20 21:45:18 PDT 2014<br />
<br />
Name: CrawlFailed<br />
Severity: warning<br />
AU: Journal of Pharmacology and Experimental Therapeutics Volume 346<br />
Explanation: Crawl finished with error: Can't fetch permission page: 0 files fetched, 0 warnings, 1 error<br />
</pre><br />
<br />
== Preservation Alerts ==<br />
<br />
An Alert is generated at the end of each [[LOCKSS: Polling and Repair Protocol|poll]] that detects an integrity problem:<br />
* If there were a non-zero number of URLs for which:<br />
** A repair was needed because the content failed to match the consensus.<br />
** Repair content was fetched.<br />
** The repair content failed to match the consensus.<br />
* If there were a non-zero number of URL versions newly flagged as suspect because their content failed to match the locally stored hash.<br />
Here is an example alert caused by an injected failure of a repair to match the consensus during testing in the STF test environment:<br />
<pre><br />
From: LOCKSS box quark <xxx@xxx.xxx><br />
To: clockss-alert@xxx.xxx<br />
Date: Thu 20 Mar 2014 22:50:33 PDT<br />
Subject: LOCKSS box warning: PersistentDisagreement<br />
<br />
LOCKSS box 'quark' raised an alert at Thu Mar 20 22:50:33 PDT 2014<br />
<br />
Name: PersistentDisagreement<br />
Severity: warning<br />
AU: Simulated Content: simContent<br />
AUID: org|lockss|plugin|simulated|SimulatedPlugin&root~simContent<br />
Explanation: Poll did not achieve consensus on all files<br />
<br />
21 URLs tallied, 95.23% agreement<br />
1 repair received, 0 not received.<br />
<br />
1 repair didn't resolve disagreement:<br />
http://www.example.com/003file.txt<br />
</pre><br />
<br />
== Dissemination Alerts ==<br />
<br />
The CLOCKSS archive is a dark archive; access to the content is permitted only at the direction of the CLOCKSS board. Thus, as described in [[CLOCKSS: Box Operations]], the content access mechanisms of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] are disabled, and packet filters are used to further prevent access. Nevertheless, Alerts are generated on any access to the content in order that they may be treated as [[CLOCKSS: Logging and Records#Administrative and Security Alerts|Security Alerts]].<br />
<br />
Here is a sample access alert from an ingest box. These accesses are expected as they come from production boxes crawling the ingest box; the alerts were turned on briefly as a test but would normally be disabled.<br />
<pre><br />
From: LOCKSS box ingest1.clockss.org <clockss-alert@xxx.xxx><br />
To: clockss-alert@xxx.xxx<br />
Date: Sun 05 Jan 2014 08:42:42 PST<br />
Subject: LOCKSS box info: ContentAccess (multiple)<br />
<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:31 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/rj/style/group.css : 200 from cache in 398ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 243ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 1ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 1ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:33 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n3/abs/nbt.1829.html : 200 from cache in 0ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:34 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 157ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:34 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:35 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms<br />
<br />
==========================================================================<br />
LOCKSS box 'ingest1.clockss.org' raised an alert at Sun Jan 05 08:12:36 PST 2014<br />
<br />
Name: ContentAccess<br />
Severity: info<br />
Explanation: Proxy access: http://www.nature.com/nbt/journal/v29/n9/covers/index.html : 200 from cache in 0ms<br />
</pre><br />
<br />
== Administrative and Security Alerts ==<br />
<br />
Alerts are generated on the following administrative actions:<br />
* Changes to the configuration files.<br />
* Changes to the access control permissions.<br />
* Adding or de-activating an AU.<br />
* Enabling or disabling the content servers.<br />
* Adding or removing a user account, or changing a password.<br />
<br />
== External Communications ==<br />
<br />
=== Engagement ===<br />
<br />
Engagement with harvest content publishers before ingestion is described in [[CLOCKSS: Ingest Pipeline#Harvest Publisher Engagement|CLOCKSS: Ingest Pipeline]].<br />
<br />
Engagement with file transfer content publishers before ingestion is described in [[CLOCKSS: Ingest Pipeline#File Transfer Publisher Engagement|CLOCKSS: Ingest Pipeline]].<br />
<br />
In all cases interactions with the publisher take place through the RT ticketing system, so they are recorded permanently.<br />
<br />
=== External Reports ===<br />
<br />
The technology for generating reports is being revised; the earlier technology, which generated a report on each box from the [[LOCKSS: Metadata Database]] and then merged them, became too inefficient as the number of articles on each box grew. The new technology is a centralized database with a row for each article and a column for each of the production and ingest boxes; each cell contains the ingest timestamp of the article on that box, obtained by a regular polling process that asks each box for the articles ingested since the last time it was asked.<br />
<br />
The following reports are generated for external consumption:<br />
* Monthly reports of the state of preservation of all serials committed to preservation in the CLOCKSS archive are delivered to the CLOCKSS board, the Keepers Registry and posted [http://www.clockss.org/keepers/ on the Web].<br />
* KBART reports are generated monthly and posted [http://www.clockss.org/kbart/ on the Web]. For the Global LOCKSS Network, these reports are used to update link resolver knowledge bases so that libraries can provide their readers access to the content of their LOCKSS box. Because the CLOCKSS archive is a dark archive, these reports cannot be used to update link resolvers. However, several analysis tools use KBART as an input format, so the KBART reports for CLOCKSS are made public.<br />
* The CLOCKSS Executive Director is sent an e-mail report of the article counts in the CLOCKSS archive weekly. These reports are preserved in Stanford's backup system.<br />
* The CLOCKSS archive charges publishers a small fee for each current article ingested, billed quarterly. Thus a quarterly report is generated showing, for each publisher, the number of their articles ingested in that quarter for each publication year. The report is submitted to the CLOCKSS Executive Director for onward transmission to the publishers. Significant discrepancies between this and the publisher's own article counts will result (and have resulted) in investigation and corrective action. To aid in this process, more detailed reports, down to the article level, can be generated on request.<br />
<br />
The CLOCKSS Metadata Lead is responsible for the production and dissemination of these reports.<br />
<br />
== Monitoring ==<br />
<br />
=== Log Monitoring ===<br />
<br />
* Ingest boxes: The CLOCKSS Content Lead is responsible for monitoring logs on the ingest boxes.<br />
* Production boxes: The CLOCKSS Technical Lead is responsible for monitoring logs on production boxes when needed.<br />
* Web servers: The CLOCKSS Network Administrator is responsible for monitoring web server logs.<br />
<br />
=== Alert Monitoring ===<br />
<br />
The CLOCKSS Technical Lead is responsible for monitoring the Alerts generated by CLOCKSS boxes.<br />
<br />
== Nagios ==<br />
<br />
The state of the CLOCKSS infrastructure, including the CLOCKSS boxes and the ingest machines, is monitored by Nagios as described in [[CLOCKSS: Box Operations#Nagios|CLOCKSS: Box Operations]].<br />
<br />
The CLOCKSS Network Administrator is responsible for monitoring via Nagios.<br />
<br />
== Network Diagnostics ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The Andrew W. Mellon Foundation funded work to implement and evaluate improvements in these areas, expected to be complete by March 2015. Although these improvements will be deployed to the CLOCKSS network, because there are many fewer boxes in the CLOCKSS network than in the GLN, the areas of inefficiency are not relevant to the CLOCKSS network. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. For examples of the use of this software, see [[LOCKSS: Polling and Repair Protocol#Enhancements|LOCKSS: Polling and Repair Protocol]].<br />
<br />
The CLOCKSS Network Administrator is responsible for collecting and analyzing this data.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Engineering Staff<br />
** CLOCKSS Network Administrator<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[Definition of AIP]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_ProtocolLOCKSS: Polling and Repair Protocol2014-04-10T03:30:19Z<p>Dshr: Changes after auditor visit: approved by David Rosenthal</p>
<hr />
<div>= LOCKSS: Polling and Repair Protocol =<br />
<br />
== Overview ==<br />
<br />
LOCKSS boxes run the LOCKSS polling and repair protocol as described in [http://dx.doi.org/10.1145/1047915.1047917 our ''ACM Transactions on Computer Systems'' paper]. The paper describes the polling mechanism as applying to a single file; the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] applies it to an entire [[LOCKSS: Basic Concepts#Archival Unit|Archival Unit (AU)]] of content. Each LOCKSS daemon chooses at random the next AU upon which it will use the LOCKSS polling and repair protocol to perform integrity checks. It acts as the ''poller'' to call a poll on that AU by:<br />
* Selecting a random sample of the other CLOCKSS boxes (the <i>voters</i>).<br />
* Inviting the voters to participate in a <i>poll</i> on the AU, and sending each of them a freshly-generated random nonce ''Np''.<br />
* The poll involves the voters voting by:<br />
** Generating a fresh random nonce ''Nv''.<br />
** Creating a vote containing, for every URL in the voter's instance of the AU:<br />
*** The URL<br />
*** The hash of the concatenation of ''Np'', ''Nv'' and the content of the URL.<br />
** Sending the vote to the poller. Note that the vote contains a hash for each URL in the voter's instance of the AU, but that hash is not the hash of the content. The nonces ensure that the hash in the vote is different for every vote in every poll. The voter cannot simply remember the hash it initially created, it must re-hash every URL each time it votes.<br />
* The poller tallies the votes by:<br />
** For each URL in the poller's instance of the AU:<br />
*** For each voter:<br />
**** Computing the hash of ''Np'', ''Nv'' and the content of the URL in the poller's instance of the AU.<br />
**** Comparing the result with the hash value for that URL in that voter's vote.<br />
** Note that the nonces ensure that the poller must re-hash every URL in the AU; it cannot simply remember the hash it initially created.<br />
* In tallying the votes, the poller may detect that:<br />
** A URL it has does not match the consensus of the voters, or<br />
** A URL that the consensus of the voters says should be present in the AU is missing from the poller's AU, or<br />
** A URL it has does not match the checksum generated when it was stored.<br />
* If so, it repairs the problem by:<br />
** requesting a new copy from one of the voters that agreed with the consensus,<br />
** then verifying that the new copy does agree with the consensus.<br />
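<br />
The per-URL hashing underlying the votes and the tally can be sketched as follows. The hash algorithm and nonce size here are illustrative assumptions, not the daemon's actual wire format:<br />
<br />
```python
import hashlib
import os

def make_nonce():
    # A freshly generated random nonce (the 20-byte size is illustrative).
    return os.urandom(20)

def vote_hash(poller_nonce, voter_nonce, content):
    """Hash of the concatenation of Np, Nv and the URL's content."""
    return hashlib.sha1(poller_nonce + voter_nonce + content).hexdigest()
```
<br />
Because ''Np'' and ''Nv'' are fresh for every poll, identical content yields a different hash in every vote, so neither poller nor voter can cache hashes between polls; the poller tallies a vote by recomputing the hash over its own copy of each URL's content and comparing the result with the value in the vote.<br />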
<br />
In this way, at unpredictable but fairly regular intervals, every poll on an AU checks the union of the sets of URLs in that AU on the box calling the poll (the poller) and on the boxes voting (the voters). For each such URL, the check establishes that the poller's copy agrees with the voters' consensus as to that URL's content; if it does not, it is repaired from one of the boxes in the consensus. Under our current Mellon grant we are investigating the potential benefits of an enhancement to the mechanism that results in every poll on an AU checking that every URL in that AU on each voter agrees with the same URL on the poller.<br />
<br />
== Configuration of CLOCKSS Network ==<br />
<br />
As described in [[CLOCKSS: Box Operations]] the CLOCKSS boxes are configured to form a Private LOCKSS Network (PLN) including the following configuration options:<br />
* Because the CLOCKSS PLN is a closed network secured by SSL certificate checks at both ends of all connections, the defenses against Sybil attacks, in which the adversary creates new peer identities, are not necessary and are not implemented.<br />
* The efficiency enhancements described below are being gradually and cautiously deployed to the CLOCKSS PLN.<br />
<br />
Currently, on average, a poll is called on each AU instance approximately once every 100 days. Since there are currently 12 boxes in the CLOCKSS network, some instance of a given AU is therefore checked roughly every 8 days on average.<br />
<br />
== Enhancements ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The [http://www.lockss.org/news-media/news/lockss-program-receives-andrew-w-mellon-foundation-grant/ Andrew W. Mellon Foundation funded work to implement and evaluate improvements] in these areas. This is expected to be complete by March 2014. Although these improvements will be deployed to the CLOCKSS network, because there are many fewer boxes in the CLOCKSS network than in the GLN the areas of inefficiency are less relevant to the CLOCKSS network. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. These tools were used on the CLOCKSS network for an initial 59-day period, collecting over 18M data items. The data collected has yet to be fully analyzed but initial analysis shows that the polling process among CLOCKSS boxes continues to operate satisfactorily. Some examples of the graphs generated follow.<br />
<br />
[[File:hist_pr_auid_count27.png|200px|thumb|center]] This graph shows the number of AU instances in CLOCKSS boxes which have reached agreement with N other CLOCKSS boxes, showing the progress AUs make after ingest as the LOCKSS: Polling and Repair Protocol identifies matching AU instances at other boxes. Few AU instances in the sample have reached agreement with only a small number of boxes; the majority have reached agreement with the corresponding AU instances at most of the other CLOCKSS boxes.<br />
<br />
[[File:Sample Graph 2.png|200px|thumb|center]] This graph shows the extent of agreement among the over 40,000 successfully completed polls in the sample. As can be seen, the overwhelming majority of the polls showed complete agreement. Polls with less than complete agreement are likely to have been caused by polling among AU instances that were still collecting content, so had different sub-sets of the URLs in an AU.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi. “LOCKSS: A Peer-to-Peer Digital Preservation System”, ACM Transactions on Computer Systems vol. 23, no. 1, February 2005, pp. 2-50. http://dx.doi.org/10.1145/1047915.1047917 accessed 2013.8.7</div>Dshrhttp://documents.clockss.org/index.php/File:Hist_pr_auid_count27.pngFile:Hist pr auid count27.png2014-04-10T03:29:46Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_MetadataLOCKSS: Extracting Bibliographic Metadata2014-04-10T03:26:13Z<p>Dshr: Changes after auditor visit: approved by Tom Lipkis</p>
<hr />
<div>= LOCKSS: Extracting Bibliographic Metadata =<br />
<br />
A part of the LOCKSS preservation process is extracting and indexing bibliographic metadata. Bibliographic metadata is supplied as part of the content submitted by a publisher as [[Definition of SIP|Submission Information Packages]] (SIPs) and preserved in [[Definition of AIP|Archival Information Packages]] in LOCKSS preservation networks.<br />
<br />
Publishers are required to submit adequate bibliographic metadata with the content to support the four uses the CLOCKSS Archive makes of such metadata:<br />
* To enable the CLOCKSS organization to verify and bill for the content submitted by the publisher. This requires reasonably accurate counts of articles per publisher, but no other metadata.<br />
* To track the holdings and progress in preserving content of LOCKSS networks through reports to agencies such as the [http://thekeepers.org/ Keepers] registry. This requires reasonably accurate publisher and date and/or volume range metadata, but no other metadata.<br />
* To identify relevant content in responding to a [[CLOCKSS: Extracting Triggered Content|trigger event]]. This requires accurate journal and volume metadata.<br />
* To make content disseminated from LOCKSS boxes, and [[CLOCKSS: Extracting Triggered Content|triggered from the CLOCKSS network]], accessible to end users using bibliographic information through online tools such as link resolvers. This requires full accurate metadata.<br />
<br />
This document describes the kinds of bibliographic information that is preserved by LOCKSS systems, the formats and methods most commonly used for transmitting bibliographic metadata, the mechanisms used to extract and index the bibliographic metadata in a metadata database, and ways of presenting and querying extracted bibliographic metadata. Note that for the routine uses of bibliographic metadata, full metadata is not required and a small level of noise is acceptable. For the uses which only happen as part of a trigger event, full, accurate metadata is required, but there is time to detect and remedy any remaining noise. Thus the goal in normal processing is not perfect metadata, but metadata with an acceptably low level of noise.<br />
<br />
== Kinds of Bibliographic Metadata ==<br />
<br />
LOCKSS can preserve many kinds of content. The mechanisms underlying the preservation process operate on storage units, typically the content of a URL on a website, or files on disk or contained within archive formats such as ZIP or TAR files. However, bibliographic metadata pertains to bibliographic units such as journal articles or book chapters. Therefore the kinds of bibliographic metadata that are supplied by publishers are related to bibliographic units rather than to storage units.<br />
<br />
The kinds of bibliographic metadata that are extracted and stored by LOCKSS depend on the bibliographic type of the preserved content. For serials such as journals, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of journal)<br />
* ISSN and eISSN<br />
* DOI<br />
* Volume<br />
* Issue<br />
* Publication date (preferably cover date)<br />
* Article title<br />
* Article DOI<br />
* Article author(s)<br />
* Article number<br />
* Article page range (start page or start/end page)<br />
* Article keywords<br />
* Article summary<br />
<br />
For books and monographs, the bibliographic metadata includes:<br />
<br />
* Publisher<br />
* Publication name (e.g. name of book)<br />
* Edition<br />
* ISBN and eISBN<br />
* DOI<br />
* Volume<br />
* Publication date (preferably cover date)<br />
* Author(s)/Editor(s)<br />
* Keywords<br />
* Summary<br />
<br />
For individual book or monograph chapters, the bibliographic metadata also includes:<br />
<br />
* Chapter title<br />
* Chapter DOI<br />
* Chapter author(s)<br />
* Chapter number<br />
* Chapter page ranges (start page or start/end page)<br />
* Chapter keywords<br />
* Chapter summary<br />
<br />
For book or monograph series, the bibliographic metadata also includes:<br />
<br />
* Series name<br />
* Series ISSN and eISSN<br />
<br />
In addition, certain physical metadata is also collected about the relationship of the bibliographic unit to its original [[Definition of SIP|Submission Information Package]] (SIP) and the [[Definition of AIP|AIP]] where it is preserved. This is collected from the CLOCKSS system primarily for auditing purposes, and to assist in any eventual triggering of the content. These include:<br />
<br />
* Publishing platform (e.g. HighWire Press)<br />
* [[LOCKSS: Basic Concepts#Plugins|Plugin ID]] and [[LOCKSS: Basic Concepts#AUID|AUID]] (identifies both the SIP and the AIP)<br />
* URLs of bibliographic features (e.g. abstract, full-text PDF, full-text HTML, citation file)<br />
* Date bibliographic unit was first added to the AIP<br />
<br />
== Formats and Methods for Transmitting Bibliographic Metadata ==<br />
<br />
Publishers encode and transmit bibliographic metadata in a variety of ways. How the metadata is encoded depends heavily on whether the content is being harvested from the publisher's website, or transferred to CLOCKSS by the publisher at a time of their choosing.<br />
<br />
=== Harvest Content ===<br />
<br />
Content that is harvested from the publisher's website generally takes the form of HTML pages with links to supporting files such as PDF and ePub files, videos and audio files, and other file types. Publishers deliver metadata for harvested content in one of several formats and mechanisms: <br />
<br />
The most common is to embed metadata as HTML META tags in the header of each article HTML page: either the abstract page, the full-text HTML page, or both. The most frequently used metadata encodings are Google Scholar and Dublin Core. Some publishers supply the same metadata in both encodings in the same file. The Google Scholar encoding facilitates searching for content through Google's Scholar service. Here is an example of the Google Scholar encoding for an article in a typical journal:<br />
<br />
<pre><br />
<meta content="Molecular Interventions" name="citation_journal_title" /><br />
<meta content="1534-0384" name="citation_issn" /><br />
<meta content="1543-2548" name="citation_issn" /><br />
<meta content="Duckles, Sue P." name="citation_authors" /><br />
<meta content="Dial M for Molecular" name="citation_title" /><br />
<meta content="04/01/2001" name="citation_date" /><br />
<meta content="1" name="citation_volume" /><br />
<meta content="1" name="citation_issue" /><br />
<meta content="6" name="citation_firstpage" /><br />
<meta content="1/1/6" name="citation_id" /><br />
<meta content="molint;1/1/6" name="citation_mjid" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url" /><br />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full.pdf" name="citation_pdf_url" /><br />
</pre><br />
<br />
Here is the same information encoded as Dublin Core:<br />
<br />
<pre><br />
<meta name="DC.Format" content="text/html" /><br />
<meta name="DC.Language" content="en" /><br />
<meta content="Dial M for Molecular" name="DC.Title" /><br />
<meta content="" name="DC.Identifier" /><br />
<meta content="2001-04-01" name="DC.Date" /><br />
<meta content="American Society for Pharmacology and Experimental Therapeutics" name="DC.Publisher" /><br />
<meta content="Sue P. Duckles" name="DC.Contributor" /><br />
</pre><br />
<br />
Dublin Core is a more limited encoding, and full information is not always available in this encoding without resorting to non-standard extensions. It is often necessary to consult both encodings to extract complete metadata.<br />
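Extracting these (name, content) pairs from the HTML header is mechanically simple. The following is a minimal sketch using Python's standard HTML parser; the LOCKSS daemon's real extractors are plugin-specific Java classes, and the sample page fragment is abbreviated from the example above.<br />

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect (name, content) pairs from META tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name, content = d.get("name"), d.get("content")
        if name and content:
            # Fields such as citation_issn may repeat (ISSN and eISSN),
            # so keep a list of values per field name.
            self.metadata.setdefault(name, []).append(content)

page = '''<head>
<meta content="Molecular Interventions" name="citation_journal_title" />
<meta content="1534-0384" name="citation_issn" />
<meta content="1543-2548" name="citation_issn" />
<meta content="Dial M for Molecular" name="citation_title" />
</head>'''

extractor = MetaTagExtractor()
extractor.feed(page)
```

Once the pairs are collected, the values can be validated and normalized as described below.<br />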
<br />
The other way bibliographic information is provided for harvested content is through a separate file that is linked to from the abstract or full-text HTML page and is meant for use by citation management systems. The most common format is RIS, developed by Research Information Systems as an interchange format among citation management systems. Here is the RIS representation of the same article:<br />
<br />
<pre><br />
TY - JOUR<br />
PB - American Society for Pharmacology and Experimental Therapeutics<br />
JO - Molecular Interventions<br />
SN - 1534-0384<br />
SN - 1543-2548<br />
TI - Dial M for Molecular<br />
AU - Sue P. Duckles <br />
PY - 2001<br />
DA - 2001-04-01<br />
VL - 1<br />
IS - 1<br />
SP - 6<br />
UR - http://molinterv.aspetjournals.org/content/1/1/6.full<br />
</pre><br />
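Because RIS is a simple tagged line format, parsing it is straightforward. Here is a minimal sketch of a RIS parser, assuming the simplified "TAG - value" layout of the example above; real RIS files may use a two-space "TAG  - value" separator and end records with an ER tag.<br />

```python
def parse_ris(text: str) -> dict:
    """Parse a RIS citation record into a tag -> list-of-values mapping.

    Tags such as SN (ISSN/eISSN) and AU (author) may repeat, so every
    tag maps to a list of values in the order encountered.
    """
    record = {}
    for line in text.splitlines():
        if " - " not in line:
            continue  # skip blank or malformed lines
        tag, value = line.split(" - ", 1)
        record.setdefault(tag.strip(), []).append(value.strip())
    return record

ris = """TY - JOUR
JO - Molecular Interventions
SN - 1534-0384
SN - 1543-2548
TI - Dial M for Molecular
VL - 1
IS - 1
SP - 6
"""
```
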
<br />
=== File Transfer Content ===<br />
<br />
Transferred content is typically "pre-publication" content that is used as input to a publication system. The content in this case consists of document files in PDF, ePub and other formats, supporting files such as image, audio, and video files, and one or more tagged text files that provide the content and metadata and refer to the other files. Many publishers have developed proprietary formats, but some have begun using evolving industry-wide formats such as JATS (Journal Article Tag Suite), which defines XML elements, expressed in the XML schema language of the World Wide Web Consortium, for tagging the content and bibliographic metadata of journal articles. There are several profiles for describing various kinds of content.<br />
<br />
Extracting metadata from proprietary formats depends on an understanding of how the publisher has encoded the metadata information. The JATS XML encoding as an open standard is well-understood and it is relatively simple to extract metadata from it.<br />
<br />
== Mechanisms for Extracting and Indexing Bibliographic Metadata ==<br />
<br />
LOCKSS preserves content at the level of storage units, yet metadata is at the level of bibliographic units, so a mechanism is necessary to identify the storage units that contain metadata for each bibliographic unit, and then to extract the metadata and index the bibliographic units in the [[LOCKSS: Metadata Database|Metadata Database]]. Since the representation of metadata in preserved content depends on the publishing platform, the mechanism for performing these steps is expressed as a framework in the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. Plugins that conform to this framework embody all the rules and procedures for processing content from individual publishing platforms.<br />
<br />
=== Article Iterator ===<br />
<br />
The Article Iterator is slightly misnamed. It is a plugin framework that identifies files that represent different types of bibliographic units within an [[Definition of AIP|AIP]] (Archival Unit or AU). Examples of bibliographic units are abstract, full-text, supplementary material, and so on. The definition of the mapping between the various types of bibliographic units and the files within the AU is plugin-specific.<br />
<br />
The article iterator typically identifies files using patterns that match a specific file such as an HTML article page. For each such page, the article iterator identifies all the related files that also represent bibliographic features.<br />
<br />
For example, an article iterator whose patterns match the HTML abstract page can attempt to locate related files that represent the metadata, full-text HTML, full-text PDF, citation manager files, and other bibliographic features. The article iterator labels these feature URLs for use by the metadata extractor and other clients of the article iterator.<br />
<br />
=== Metadata Extractor ===<br />
<br />
The metadata extractor is a plugin framework that extracts metadata for each bibliographic unit using features identified by the article iterator. It typically uses the feature identified as the metadata feature. The metadata extractor operates by processing the contents of the storage unit that contains the metadata, identifying specific metadata elements, validating and normalizing values, and creating a record that contains the metadata for that bibliographic unit.<br />
<br />
The processing of the files to identify and extract the metadata values is unique to the plugin that governs content for a given AIP. For harvested journal articles that provide metadata using META tags in abstract or full-text HTML pages, the extraction process involves parsing the HTML header and extracting the corresponding "name" and "content" attribute values. Once the pairs are isolated, the values can be validated to ensure they conform to the expected type.<br />
<br />
==== Errors in Metadata ====<br />
<br />
Validation, deduplication and other forms of error correction are necessary because it is not uncommon for there to be errors in provided metadata. Pages are maintained on the LOCKSS team's internal wiki with information about metadata problems. A PDF of one of them, describing one class of problems, is [http://tdrd.clockss.org/images/6/63/MetadataProblems.pdf here].<br />
<br />
Normalization is necessary to ensure a common representation for the type of value. For example, an ISSN should have four digits, a hyphen, then three digits followed by either a digit or the letter 'X'. Normalization entails ensuring the correct punctuation and ensuring that a final 'X' is upper-cased. An additional example is the publisher name in the header of HTML files, which is often specified as:<br />
<pre><br />
<meta name="dc.publisher" content="publisher_name" /><br />
</pre><br />
It's common to find inconsistent spellings, abbreviations, etc. of the same publisher name in different files. These are translated to a consistent name by a manually-maintained mapping table. Entries are added whenever an unexpected publisher name appears in a report.<br />
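These two normalization steps can be sketched as follows. The ISSN check follows the rule described above; the mapping table entries are illustrative stand-ins, not entries from the real manually-maintained table.<br />

```python
import re

def normalize_issn(raw: str) -> str:
    """Normalize an ISSN to NNNN-NNNX form: strip spaces and hyphens,
    upper-case a final check character, and re-insert the hyphen.
    Raises ValueError for malformed values."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        raise ValueError("not a valid ISSN: %r" % raw)
    return compact[:4] + "-" + compact[4:]

# Manually-maintained mapping from variant spellings to canonical
# publisher names (hypothetical example entries).
PUBLISHER_MAP = {
    "Taylor and Francis": "Taylor & Francis",
    "ASPET": "American Society for Pharmacology and Experimental Therapeutics",
}

def canonical_publisher(name: str) -> str:
    """Translate a variant publisher name to its canonical form;
    unknown names pass through unchanged."""
    return PUBLISHER_MAP.get(name.strip(), name.strip())
```
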
<br />
A further example is missing DOIs. Some content may not have had a DOI assigned. In other cases, a DOI may have been assigned but the metadata extractor may fail to find it in the content, either because it is missing or because it is not in the place(s) that the extractor expects.<br />
<br />
The report generators require a de-duplication step, which uses a combination of bibliographic items, including publisher, publication title, publication year, volume, issue, start page, and a computed article ID. Two metadata items are considered the same if all the available values are the same. The DOI is the preferred unique ID, but if it isn't available a substitute is generated from the title, if there is one, and otherwise from the access URL. The article ID is computed the same way regardless of whether the content is harvest or file transfer content. Because these IDs are successively less reliable as unique IDs, the de-duplication is not completely reliable in the face of noisy metadata.<br />
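The fallback chain for the article ID, and the combined de-duplication key, can be sketched as follows. This is a minimal illustration of the approach described above; the field names are hypothetical, not the daemon's actual schema.<br />

```python
def article_id(item: dict) -> str:
    """Compute an article ID, preferring the DOI, then the title,
    then the access URL, each successively less reliable as a
    unique identifier."""
    for field in ("doi", "title", "access_url"):
        if item.get(field):
            return field + ":" + item[field].strip().lower()
    raise ValueError("no usable identifier in metadata item")

def dedup_key(item: dict) -> tuple:
    """Two metadata items are treated as duplicates if all their
    available bibliographic values, plus the article ID, agree."""
    fields = ("publisher", "publication", "year",
              "volume", "issue", "start_page")
    return tuple(item.get(f, "") for f in fields) + (article_id(item),)
```
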
<br />
=== Metadata Database Indexing ===<br />
<br />
Once metadata has been extracted from a bibliographic unit, the Metadata Manager in the LOCKSS box indexes it in the metadata database. The indexing operation is independent of how the bibliographic units were identified or the metadata was extracted. Parts of the indexing process are common to all bibliographic types, while other parts depend on which type is being indexed. The first section of this document shows what metadata is stored in the metadata database for each bibliographic type. <br />
<br />
The Metadata Manager also checks that the metadata for a bibliographic unit is complete enough to index. At least basic information such as the publisher, publication title, one or more feature URLs, and the ID of the AU (AIP), indicating the plugin and the preservation parameters, is required. Some publishers supply incomplete metadata. The Metadata Manager attempts to fill in missing bibliographic information from the [[LOCKSS: Basic Concepts#Title Database|Title Database (TDB)]] entry for the AU (AIP).<br />
<br />
The TDB is a per-publisher knowledge base that is maintained by the LOCKSS team and is used to add new AUs to a LOCKSS box. The TDB provides all the preservation parameters necessary to define the AU, plus additional, readily available bibliographic information. For harvest content, this includes the publication name, publisher name, the ISSN/eISSN, ISBN/eISBN, volume, and publisher proprietary identifier. For transferred content, less bibliographic information is available in the TDB entries because each AU typically includes all content from a given publisher for a given year.<br />
<br />
If the publisher or publication title is not available from either the metadata or the TDB, the Metadata Manager generates values that can be readily identified for all bibliographic units in the AU being indexed. These "gensyms" can later be updated once more complete bibliographic information becomes available. Flagging missing metadata in this way is the only use the system makes of "gensyms".<br />
<br />
== Querying and Presenting Extracted Metadata ==<br />
<br />
The administrator of a LOCKSS box and, for CLOCKSS, authorized CLOCKSS staff can access the archive, including for the purposes of querying the metadata and generating various reports required to operate LOCKSS preservation networks and report the state of preservation.<br />
<br />
There are several ways that authorized staff can query the metadata of a LOCKSS or CLOCKSS box. Since the metadata is stored in a relational metadata database, it is possible for custom and standard report generators to run SQL queries against the metadata database of a LOCKSS box, or all CLOCKSS boxes, in the preservation network. <br />
<br />
An example of this is the monthly report submitted to the Keepers registry at the University of Edinburgh. The Keepers report shows the range of years and volumes for every title that is committed for preservation, in process, or preserved in the CLOCKSS preservation network. This report is necessary to satisfy the reporting requirements of the CLOCKSS board and the Designated Community. Custom reports can also be written that enable the CLOCKSS staff to satisfy requests from publishers under the terms of the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement].<br />
<br />
Other custom report generators could also be written using SQL access that export the bibliographic information of the metadata database. As an example, it would be possible to create a custom report that exports the bibliographic information in METS (Metadata Encoding and Transmission Standard) XML format for further processing, since there is a close correspondence between the LOCKSS and METS schema and types.<br />
<br />
== Relevant Documents ==<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[LOCKSS: Metadata Database]]<br />
* [[CLOCKSS: Designated Community]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* David S. H. Rosenthal "Talk on LOCKSS Metadata Extraction at IPCC2013" 29 April 2013 http://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Basic_ConceptsLOCKSS: Basic Concepts2014-04-09T23:04:20Z<p>Dshr: </p>
<hr />
<div>= LOCKSS: Basic Concepts =<br />
<br />
This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.<br />
<br />
== LOCKSS Daemon ==<br />
<br />
The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. The LOCKSS daemon is the only application program that runs in a LOCKSS box. Every action of a LOCKSS box, for ingest, preservation, dissemination and administration is performed by the LOCKSS daemon. The LOCKSS daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor the daemon's performance. Among the functions performed by the LOCKSS daemon are:<br />
* Ingest via Web crawling, or file import.<br />
* Preservation via the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol].<br />
* Dissemination by acting as both a Web server and a Web proxy, and by file export.<br />
* Administration via a Web user interface.<br />
* Status and statistics reporting.<br />
Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves ''files''. It preserves ''URLs''; their content and their associated headers (metadata) as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.<br />
<br />
== Plugins ==<br />
<br />
The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of particular content it is to preserve. This is done via the "plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".<br />
<br />
== Archival Units ==<br />
<br />
Here is an [[Media:ClockssTaylorAndFrancisPlugin.xml.pdf|example plugin XML file]], for Taylor and Francis journals. It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs) representing, typically, a year or a volume of a journal. Each AU is defined by the plugin class name, in this case <tt>org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin</tt> and a set of definitional parameters defined by the XML file, in this case:<br />
* <tt>base_url</tt><br />
* <tt>journal_id</tt><br />
* <tt>volume_name</tt><br />
<br />
For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at <tt>au_start_url</tt>:<br />
<pre><br />
<entry><br />
<string>au_start_url</string><br />
<string>&quot;%sclockss/%s/%s/index.html&quot;, base_url, journal_id, volume_name</string><br />
</entry><br />
</pre><br />
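The template uses printf-style substitution of the definitional parameters. A minimal sketch of the substitution, assuming the parameter values from the Taylor and Francis TDB entry shown below:<br />

```python
# The printf-style template from the plugin entry above.
AU_START_URL = "%sclockss/%s/%s/index.html"

def start_url(base_url: str, journal_id: str, volume_name: str) -> str:
    """Substitute an AU's definitional parameters into the
    au_start_url template to produce the crawl starting point."""
    return AU_START_URL % (base_url, journal_id, volume_name)
```

For example, with <tt>base_url</tt> of <tt>http://www.tandfonline.com/</tt>, <tt>journal_id</tt> of <tt>taer20</tt> and <tt>volume_name</tt> of <tt>6</tt>, crawling starts at <tt>http://www.tandfonline.com/clockss/taer20/6/index.html</tt>.<br />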
<br />
== Title Database ==<br />
<br />
The values for these parameters come from the Title Database (TDB), which is not actually a database, but a knowledge base represented as a set of text files in an easy-to-edit syntax that are processed into an XML file that is obtained by the LOCKSS daemon. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each of the parameters defined by that plugin class that are different from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:<br />
<pre><br />
{<br />
<br />
publisher <<br />
name = Taylor & Francis ;<br />
info[contract] = 2008 ;<br />
info[tester] = A<br />
><br />
<br />
plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin<br />
param[base_url] = http://www.tandfonline.com/<br />
implicit < status ; status2 ; year ; name ; param[volume_name] ><br />
...<br />
<br />
{<br />
<br />
title <<br />
name = Advances in Building Energy Research ;<br />
issn = 1751-2549 ;<br />
eissn = 1756-2201 ;<br />
issnl = 1751-2549<br />
><br />
<br />
param[journal_id] = taer20<br />
<br />
au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 ><br />
au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 ><br />
au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 ><br />
au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 ><br />
au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 ><br />
au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 ><br />
au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 ><br />
au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 ><br />
<br />
}<br />
...<br />
}<br />
</pre><br />
<br />
The definitional parameters are specified as follows:<br />
* <tt>base_url</tt> is <tt>http://www.tandfonline.com/</tt> for all Taylor and Francis journals specified at the top.<br />
* <tt>journal_id</tt> is <tt>taer20</tt> specified in the section for Advances in Building Energy Research.<br />
* <tt>volume_name</tt> is specified by the 5th column of the table.<br />
<br />
The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the [[LOCKSS: Property Server Operations|Property Server]] and its backup in the Amazon cloud.<br />
<br />
== AUID ==<br />
<br />
Everywhere an AU needs to be uniquely identified, we use an internal name, its Archival Unit ID (AUID), as the means to do so, for example as a key in maps and databases, or in the messages of the [http://documents.clockss.org/index.php/LOCKSS:_Polling_and_Repair_Protocol LOCKSS: Polling and Repair Protocol]. The AUID for an AU is an immutable string with an encoded representation of:<br />
* The fully-qualified Java class name of the plugin.<br />
* For each of the definitional parameters defined by the plugin XML:<br />
** The parameter name.<br />
** The parameter value.<br />
Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on.<br />
<br />
The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above, and used as the example in [http://documents.clockss.org/index.php/Definition_of_AIP#Harvest_AU Definition of AIP] is:<br />
<pre><br />
org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6<br />
</pre><br />
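The encoding visible in this example can be sketched as follows: dots in the plugin class name become '|', parameter values are percent-encoded (including '.', which standard URL encoding would leave alone), and name~value pairs are appended with '&'. This is a sketch inferred from the example above, assuming alphabetical parameter order, which matches it; it is not the daemon's actual encoder.<br />

```python
from urllib.parse import quote

def auid(plugin_class: str, params: dict) -> str:
    """Build an AUID from a plugin class name and its definitional
    parameters, matching the encoding of the example above."""
    key = plugin_class.replace(".", "|")
    for name in sorted(params):
        # Percent-encode everything, then additionally encode '.',
        # which quote() treats as unreserved.
        value = quote(params[name], safe="").replace(".", "%2E")
        key += "&" + name + "~" + value
    return key
```
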
<br />
Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Software_Development_ProcessLOCKSS: Software Development Process2014-04-06T22:37:46Z<p>Dshr: /* Testing */</p>
<hr />
<div>= LOCKSS: Software Development Process =<br />
<br />
The LOCKSS team operates two slightly different software development, testing and release processes, one for the core [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] code and one for the [[LOCKSS: Basic Concepts#Plugins|plugins]] that adapt the core daemon's behavior to particular content. All code is maintained in SourceForge's CVS repository. Two independent local copies of this repository are updated every 24 hours.<br />
<br />
== Daemon Development Process ==<br />
<br />
The development, testing and release process for the core daemon code has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
This process operates on a nominal 8-week release cycle; that is, the goal is to release a new version of the daemon code every 8 weeks. A release can be delayed if last-minute problems are detected, but if this happens a post-mortem is held to determine the cause and how to prevent it recurring. The cause has usually been some major enhancement that takes longer to debug than expected.<br />
<br />
=== Requirements ===<br />
<br />
There are four main sources of requirements for changes to the daemon:<br />
* The community of LOCKSS users.<br />
* The content processing staff in the LOCKSS team, as they process content from LOCKSS and CLOCKSS publishers for preservation and either:<br />
** notice content that isn't being collected or processed properly, or<br />
** encounter situations that require new capabilities.<br />
* The whole LOCKSS team as they interact with the LOCKSS daemon in testing, and the developers as they fix bugs and refactor code.<br />
* The deliverables under contracts with funders, as for example our current grant from the Andrew W. Mellon foundation.<br />
Support interactions with the community of LOCKSS users are tracked and managed using the RT ticketing system. If, during resolution of an RT ticket, it is determined that a change to the daemon is required, an issue to track it is generated in the Roundup bug tracking system. In some cases an informal conversation with a community member results in a Roundup issue being created without an RT ticket. The same may happen if a LOCKSS team member identifies a needed change, after informal discussion within the team, as a result of a code review, or if a deliverable is required by an external funder. Detailed instructions on the use of Roundup to create requirements and track progress are [http://wiki.lockss.org/cgi-bin/wiki.pl?RoundUp here].<br />
<br />
=== Prioritization ===<br />
<br />
At the start of each release cycle the development staff meet and review the Roundup issues to select the set of issues that will be prioritized for the upcoming release.<br />
This prioritization is based on the needs of the LOCKSS Alliance members and the CLOCKSS archive, <br />
deliverables for funders, security considerations and the severity of the bug in question.<br />
At intervals during the release cycle the prioritization is reviewed with the goal of meeting the<br />
8-week cycle time with the available resources and responding rapidly to any critical bugs.<br />
<br />
=== Tracking ===<br />
<br />
Roundup issues are assigned to developers after prioritization under the supervision of the LOCKSS Technical Lead.<br />
As progress is made the developer updates the issue's status.<br />
<br />
=== Testing ===<br />
<br />
Core daemon code undergoes three types of testing:<br />
* Unit testing.<br />
* Functional testing.<br />
* Load testing.<br />
<br />
Testing occurs in three stages:<br />
* Developers perform unit and functional testing before and after committing changes to SourceForge.<br />
* Every night the entire code base is checked out from SourceForge, built and both unit and functional tests are run. Any failures are reported to the whole team by e-mail. Remedying them is a high-priority task.<br />
* Every time a release candidate is built it runs unit and functional tests, then is installed on one or more of the LOCKSS team's GLN boxes for load testing.<br />
<br />
==== Unit Testing ====<br />
<br />
The LOCKSS team makes heavy use of the JUnit unit test framework for Java in the context of the Ant automated Java build tool.<br />
As of September 2013 there were 495 files containing 4953 individual tests for 867 classes in the daemon code base. <br />
All significant methods in all classes are required to have unit tests, although this requirement has yet to be satisfied.<br />
Where practical, subsystems should have unit tests.<br />
The "where practical" caveat is made necessary by the distributed and randomized architecture of the LOCKSS system,<br />
which makes some subsystems impractical to unit-test in the JUnit framework.<br />
Coverage of these tests is evaluated at intervals using Jcoverage; it is currently assessed as requiring improvement.<br />
A few specialized subsystems have functional tests implemented in the JUnit framework.<br />
These are treated as unit tests.<br />
<br />
Developers should make every effort to reproduce reported bugs by implementing<br />
one or more unit tests that fail because of the bug, which will then verify that the bug is fixed and does<br />
not re-appear later.<br />
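This practice can be sketched as follows (a hypothetical example: the UrlNormalizer code and its trailing-slash bug are invented for illustration, and a plain assertion stands in for the JUnit machinery the team actually uses):<br />

```java
// Hypothetical sketch of a bug-reproducing test. UrlNormalizer and its
// trailing-slash bug are invented for illustration; real daemon tests
// are JUnit cases built and run under Ant.
public class TestUrlNormalizer {
  // Minimal stand-in for the code under test.
  static String normalize(String url) {
    // The (hypothetical) fix: strip a trailing slash so equivalent URLs compare equal.
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
  }

  public static void main(String[] args) {
    // This check fails before the fix and guards against regression afterwards.
    String got = normalize("http://example.com/vol6/");
    if (!got.equals("http://example.com/vol6")) {
      throw new AssertionError("bug reproduced: " + got);
    }
    System.out.println("ok");
  }
}
```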
<br />
==== Functional Tests ====<br />
<br />
The LOCKSS team uses an in-house functional testing framework called STF (Stochastic Test Framework).<br />
This currently implements 23 different test scenarios.<br />
In each, the framework sets up a small LOCKSS network, typically of 4-5 daemons, on a single machine.<br />
The framework interacts with the Web user interface of each of these daemons as a normal LOCKSS box<br />
administrator would to direct them to ingest and poll on relatively small amounts of simulated content.<br />
The framework is capable of injecting faults,<br />
such as localized damage to, or total loss of, the simulated content.<br />
It uses the Web user interface of the daemons to monitor the results of the<br />
functional tests and detect any errors, such as failure to repair the injected faults.<br />
<br />
If a bug cannot be reproduced in the unit test environment, efforts should be<br />
made to reproduce it by constructing a suitable scenario in STF that fails<br />
because of the bug and can then validate that the bug is fixed and does not re-appear.<br />
<br />
==== Load testing ====<br />
<br />
Production LOCKSS and CLOCKSS boxes operate at a scale that is logistically<br />
infeasible to reproduce in the unit and functional test environments.<br />
If a bug cannot be reproduced in the unit or functional test environments, it must be reproduced manually in the internal test network.<br />
New daemon releases go through two types of load test:<br />
* They are installed in stages on a small internal network of 4 LOCKSS boxes with a substantial amount of selected real content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
* They are then installed on at least 4 of the 13 GLN LOCKSS boxes that the LOCKSS team operates, which typically have at least 5TB of content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
<br />
==== Hard-to-replicate bugs ====<br />
<br />
Sometimes bugs will be reported that the LOCKSS team fails to reproduce locally.<br />
Some diagnostic information will be available in the daemon logs at the<br />
reporting site.<br />
This is used to add self-checking and/or logging to the daemon to<br />
investigate the problem further.<br />
<br />
=== Approval ===<br />
<br />
When bug fixes and enhancements have been completed and pass unit and<br />
functional tests, the corresponding Roundup issues are set to Testing<br />
state. Once a release candidate has been built (below), each issue is<br />
checked again to confirm that it is operating properly in a production<br />
environment, then moved to Approved state. The rules for moving to<br />
Approved state vary according to the type of issue:<br />
<br />
* Changes that produce predictable behavioral differences are observed to ensure correct behavior.<br />
* Fixes for intermittent bugs (such as race conditions) remain in Testing state until the desired behavior is observed (or a suitable time elapses without observing the failure).<br />
* Changes that do not affect the daemon's behavior in observable ways (such as fixes or enhancements to test code) may be moved to Approved as soon as the test passes with the release build.<br />
<br />
=== Release ===<br />
<br />
On the freeze date, the LOCKSS build master creates a branch in the SourceForge repository,<br />
and labels it with the release name and the tag for the first release candidate.<br />
The first release candidate is then checked out from this branch and built.<br />
Subsequent fixes until the final release are made to both the release and main branches.<br />
All release candidates are tagged and built from the release branch. A release branch tag has the form:<br />
<code><br />
release-candidate_${N1}-${N2}-b${N3}<br />
</code><br />
where:<br />
* N1 is the LOCKSS daemon major version number, currently 1,<br />
* N2 is the LOCKSS daemon minor version number, currently 62,<br />
* N3 is the sequence number of the build, incremented by 1 each time a build is performed.<br />
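For example, the third candidate build of daemon release 1.62 would be tagged <code>release-candidate_1-62-b3</code>, as this sketch of the naming scheme shows (the build master's actual scripts are internal):<br />

```java
// Sketch of the release-candidate tag naming scheme described above.
public class ReleaseTag {
  static String tag(int major, int minor, int build) {
    return String.format("release-candidate_%d-%d-b%d", major, minor, build);
  }

  public static void main(String[] args) {
    System.out.println(tag(1, 62, 3)); // release-candidate_1-62-b3
  }
}
```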
<br />
If the build or tests fail, developers are notified, problems are fixed, and a new release candidate is tagged and built.<br />
When successful, the release candidate is signed using the LOCKSS code signing key,<br />
and uploaded to a test Yum repository, from where<br />
it is installed on a small internal test LOCKSS network.<br />
It is monitored carefully for at least a few days.<br />
If no serious problems are observed the candidate is installed on several internal boxes that participate in the<br />
GLN, in order to observe it under significant load.<br />
<br />
During the release testing phase each of the Roundup issues is checked<br />
to verify that it's operating as expected in the production system, and<br />
marked Approved if so. If problems are found in the release candidate,<br />
fixes are made on the branch (and the main branch if appropriate) and a<br />
new release candidate is produced and tested as above.<br />
<br />
When all issues (except possibly issues that can only be observed in production - see above) are marked Approved and the candidate has been running<br />
without significant problems for at least a week, the LOCKSS technical lead approves the release.<br />
The candidate is moved to the<br />
release Yum repository and the release announcement is sent. Users have<br />
the option to set their boxes up to install new releases automatically or<br />
do it manually when they receive the release announcement.<br />
The Approved issues may then have their status changed to Resolved.<br />
<br />
=== Documentation ===<br />
<br />
Routine changes to the daemon are documented in Roundup, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Daemon Enhancements ==<br />
<br />
Significant enhancements, such as any architectural changes, to the daemon follow a slightly different process.<br />
An experimental branch is created in the SourceForge repository and used<br />
to preserve the various steps of development of a prototype. These steps<br />
normally include:<br />
* Requirements generation, a group discussion to identify and document in the internal Wiki outline requirements for the enhancement sufficient to allow development of a prototype.<br />
* Prototyping, development of a working enhancement and sufficient unit and/or functional tests to demonstrate that the enhancement meets its outline requirements.<br />
* Design review, a formal review of the design of the prototype and documentation of a design for a production implementation.<br />
* Implementation of a production implementation, either from scratch (in a second branch) or by evolving the prototype.<br />
* Code review, a formal review of the code that is proposed for addition to the main branch of the repository that identifies a set of changes that must be made before the changes are actually applied.<br />
* Merge, in which the implementation with any changes required by the code review is committed to the main branch of the SourceForge repository and the branch(es) used during development abandoned.<br />
<br />
The enhancements then undergo the normal testing, approval and release process.<br />
<br />
=== Enhancement Documentation ===<br />
<br />
Significant enhancements to the daemon may require changes to documents such as the [[Definition of AIP]]. These changes are identified during the design review, and are the responsibility of the developer concerned.<br />
<br />
== Plugin Development Process ==<br />
<br />
The development, testing and release process for [[LOCKSS: Basic Concepts#LOCKSS Daemon|daemon]] [[LOCKSS: Basic Concepts#Plugins|plugins]] has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Development<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
The process normally operates asynchronously with the daemon development process; plugins are released when they are ready. If a plugin requires a new feature from the core daemon code, its release will be delayed until after the relevant daemon release. This is enforced by having the plugin declare a minimum required daemon version, which prevents the plugin from inadvertently being loaded by earlier daemons.<br />
<br />
=== Requirements ===<br />
<br />
The process for generating requirements for changes to the plugins is the same as for changes to the core daemon code, except that there is one additional source: new publishers joining LOCKSS and CLOCKSS. When a plugin writer is [[CLOCKSS: Ingest Pipeline#Harvest Publisher Engagement|assigned to analyze a new site]] they fill out a template that forms the specification for the new plugin and is checked in to CVS along with it. When changes are required, the appropriate changes are made to the specification and it is checked back in to CVS.<br />
<br />
=== Prioritization ===<br />
<br />
The process for prioritization of changes to the plugins differs from that for the daemon code in two respects:<br />
* It is a periodic process but it is not synchronous with the daemon release cycle. Plugin changes are released when ready, not (except in special circumstances) together with daemon releases.<br />
* Some plugin developments may be urgent, if they are caused by impending cessation of publication, loss of access to the content, publisher change or move of the content between publishing platforms.<br />
<br />
=== Tracking ===<br />
<br />
Progress on plugin developments is tracked in JIRA.<br />
<br />
=== Development ===<br />
<br />
Plugin development should normally take place in the source tree of the current daemon release. Exceptionally, if the plugin needs bug fixes or new features not in the current release, development can use the head of the CVS tree, with the plugin's <tt>required_daemon_version</tt> set to the next daemon release number. <br />
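For example, a plugin that needs a feature first appearing in daemon release 1.62 might declare (a hypothetical fragment in the style of the daemon's XML plugin definitions; the plugin identifier and version values are illustrative):<br />

```xml
<map>
  <entry>
    <string>plugin_identifier</string>
    <string>org.lockss.plugin.example.ExamplePlugin</string>
  </entry>
  <entry>
    <string>required_daemon_version</string>
    <string>1.62.0</string>
  </entry>
</map>
```

A daemon older than the declared version will decline to load the plugin.<br />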
<br />
=== Testing ===<br />
<br />
Plugins are implemented partly in Java and partly in XML.<br />
The Java classes have unit tests in the same way as the core daemon code does.<br />
As of September 2013 there were 221 files with 911 individual tests for 248 plugins plus 325 auxiliary plugin classes.<br />
During every build all the plugin unit tests are run,<br />
and some additional validation is performed on the XML.<br />
<br />
Once a new or changed plugin has passed these tests, it is loaded into<br />
a daemon in a test environment which is manually directed to collect<br />
one or more [[LOCKSS: Basic Concepts#Archival Units|Archival Units (AUs)]] of its target content. Three checks<br />
are performed:<br />
* The status info and daemon logs are examined to detect errors such as unexpected 404s.<br />
* The collected content is browsed using the daemon's "audit proxy" (a proxy that returns only collected content and 404 for everything else) to ensure that all the desired content is collected and undesired content is not. For example, if the AU represents a volume of a journal, all articles belonging to that volume should be present, along with the common files they reference (e.g., style sheets), and there should be no articles belonging to other volumes.<br />
* A visual check of the collected content against the publisher's original.<br />
<br />
Once a new or changed plugin has passed these tests it is released to the LOCKSS or CLOCKSS content test network as appropriate.<br />
Under the control of an internally developed testing framework (AUTest), several AUs are collected and polled, and the results checked for agreement.<br />
If the agreement is less than 100%, the reason is diagnosed.<br />
Either the plugin is further changed and the process repeated,<br />
or, if the diagnosis is that collection from the publisher suffered transient errors,<br />
the AUs are re-collected and the check repeated.<br />
Metadata extraction is performed on the collected AUs and checked for correctness and completeness.<br />
<br />
=== Approval ===<br />
<br />
When a plugin is ready, it's released. Plugins are released<br />
individually, independently of the daemon, except for cases where a<br />
plugin requires a daemon feature that has not yet been released, in<br />
which case it waits for a daemon release.<br />
<br />
The LOCKSS or CLOCKSS Plugin Lead (as appropriate) approves the release of a new plugin.<br />
<br />
=== Release ===<br />
<br />
A plugin release requires the following steps:<br />
* The plugin build master packages the plugin on a build machine.<br />
* The plugin build master signs the plugin with their key:<br />
** The keystore containing the keys that can sign plugins for the GLN is the daemon's default keystore, which is controlled by the LOCKSS technical lead.<br />
** PLNs, including the CLOCKSS PLN, have their own keystores. The CLOCKSS keystore is controlled by the LOCKSS technical lead.<br />
* The plugin is uploaded to the appropriate repository.<br />
* A plugin collection is triggered manually on one production or ingest box.<br />
* The logs are checked to ensure the plugin loaded correctly.<br />
* The remaining boxes will automatically fetch and load any new (or new versions of) plugins within 12 hours.<br />
For CLOCKSS, the plugin is released to the ingest and production boxes.<br />
<br />
=== Documentation ===<br />
<br />
Plugin changes are documented in JIRA, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Development Environment ==<br />
<br />
Developers are free to use the operating system and other tools of their choice on their own machines while developing LOCKSS software, provided that when performing [[LOCKSS: Software Development Process#Unit Testing|pre- and post-commit testing]] they use the currently approved versions of:<br />
* Apache Ant<br />
* The JDK<br />
* The Java libraries from CVS.<br />
Diversity in development environments assists in identifying hidden dependencies.<br />
<br />
== Dependencies ==<br />
<br />
Two types of dependency are of concern for the functioning of the LOCKSS system, and thus need to be proactively monitored:<br />
* Continued operating system support for the requirements of the LOCKSS software and the other software components used by the CLOCKSS archive.<br />
* The set of formats supported by the Web browsers in the Knowledge Base of the Designated Community.<br />
<br />
=== Operating System Dependencies ===<br />
<br />
The LOCKSS software depends upon:<br />
* A Java virtual machine, currently version 6 or 7.<br />
* A set of Java libraries.<br />
* A POSIX file system.<br />
* An SQL database.<br />
Any modern operating system can support these dependencies. Although the system is currently supported only on Red Hat compatible Linux distributions, in development it runs on many versions of Linux, on MacOS and with some restrictions on Windows. Some years ago it was ported from OpenBSD with little trouble. As the LOCKSS team has the software under continuous development on a range of operating systems, any problems with operating system support rapidly become evident to the team.<br />
<br />
Other tools required are similarly situated. The main one is the Apache web server, which is both widely supported, and could be replaced by a competitor with little trouble. Again, a lack of support for these tools would rapidly become evident to the team because they are indispensable in development.<br />
<br />
=== Browser Format Support ===<br />
<br />
[[LOCKSS: Format Migration|Web formats become obsolete]] when support for them is removed from the browsers in general use. The occurrence of such obsolescence is a subject of active research, in which the LOCKSS team participates on a continuing basis. Any looming obsolescence of a member of the set of formats identified by the [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|CLOCKSS software's use of the File Identification Tool Set (FITS)]] would become known through this research network.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Engineering Staff<br />
** LOCKSS Plugin Lead<br />
** CLOCKSS Plugin Lead<br />
** LOCKSS Content Lead<br />
** CLOCKSS Content Lead<br />
** LOCKSS Build Master<br />
* Approval by LOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[LOCKSS: Format Migration]]<br />
# [[CLOCKSS: Ingest Pipeline]]</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Threats_and_MitigationsCLOCKSS: Threats and Mitigations2014-04-06T22:18:27Z<p>Dshr: /* Operator Error */</p>
<hr />
<div>= CLOCKSS: Threats and Mitigations =<br />
<br />
== Threat Model ==<br />
<br />
The system architecture and operations policies of the CLOCKSS Archive are based on the threat model underlying the LOCKSS technology, which was formalized in a 2005 paper published by the LOCKSS team, [http://dx.doi.org/10.1045/november2005-rosenthal Requirements for Digital Preservation Systems: A Bottom-Up Approach], and periodic reviews of code, configuration and policies. The paper identified the following threats:<br />
* Media Failure.<br />
* Hardware Failure.<br />
* Software Failure.<br />
* Communication Errors.<br />
* Failure of Network Services.<br />
* Media & Hardware Obsolescence.<br />
* Software Obsolescence.<br />
* Operator Error.<br />
* Natural Disaster.<br />
* External Attack.<br />
* Internal Attack.<br />
* Economic Failure.<br />
* Organizational Failure.<br />
<br />
== Mitigation Strategy ==<br />
<br />
This set of threats is the basis for code and operations reviews. Although the set includes threats that are not traditionally classified as security risks, the LOCKSS team treats all threats as potentially security-related. As an example of the reasoning behind this, consider "Communication Errors". These might be random, they might be load-related, or they might be caused by a denial-of-service attack.<br />
<br />
As described in [http://dx.doi.org/10.1145/1047915.1047917 ''LOCKSS: A Peer-to-Peer Digital Preservation System''], [https://www.usenix.org/conference/2005-usenix-annual-technical-conference/attrition-defenses-peer-peer-digital-preservation ''Attrition Defenses for a Peer-to-Peer Digital Preservation System''] and other papers noted [http://www.lockss.org/news-media/publications/ here], the LOCKSS system design assumes a hostile environment with a powerful adversary. It does not assume that all the boxes in a network such as CLOCKSS are benign; it merely assumes that the majority of boxes do not behave maliciously. The analysis of these papers assumes a completely distributed network, composed of independent peers with no central control whatsoever. If that were the case, these design assumptions applied to a network with a sufficiently large number of replicas would provide robust defenses against all the threats noted above. An attacker would have to compromise, and maintain control of, a large majority of the peers for an extended period of time in order to modify, delete, or significantly prevent access to the content.<br />
<br />
Of necessity, the CLOCKSS network does have central control, which therefore provides a vector by which some of the threats can be effective. In particular, the following threats applied to the central CLOCKSS organization can be effective and need to be mitigated as described in the following sections:<br />
* Software Obsolescence.<br />
* Operator Error.<br />
* Natural Disaster.<br />
* External Attack.<br />
* Internal Attack.<br />
* Economic Failure.<br />
* Organizational Failure.<br />
<br />
[http://dx.doi.org/10.1045/november2005-rosenthal The DLIB paper] details the approach the LOCKSS technology takes to mitigating each of these risks. The CLOCKSS Archive is implemented using the LOCKSS technology but, because of its nature as a tightly-controlled dark archive, configures the technology in ways that further reduce risk as compared to the Global LOCKSS Network for which the technology was originally designed. The configuration of the CLOCKSS network is described in [[CLOCKSS: Box Operations]] but, briefly, the additional defenses include:<br />
* Implementing a large number (currently 12) of CLOCKSS boxes each holding the entire content of the archive.<br />
* Ensuring that, after an initial period, each CLOCKSS box's operating system is configured to prevent write or administrative access except by staff at the host institution.<br />
* Securing communication among authorized CLOCKSS boxes using SSL certificate checks at both ends of each connection.<br />
* Preventing dissemination of content from CLOCKSS boxes except during an approved trigger event (see [[CLOCKSS: Extracting Triggered Content]]).<br />
<br />
The CLOCKSS network consists of (currently 12) CLOCKSS boxes in the US (Stanford, Rice, Indiana, Virginia, OCLC) and in Australia, Canada, Italy, Japan, Hong Kong, Germany and Scotland. Each of these boxes is configured to preserve a complete copy of all content successfully ingested into the CLOCKSS Archive, which is continually audited by the [[LOCKSS: Polling and Repair Protocol]]. Any CLOCKSS trigger event or any failure of both of the replicated triggered content servers could be satisfied by extracting and disseminating content from any one of these boxes as described in [[CLOCKSS: Extracting Triggered Content]]. Thus in this replicated system architecture each box is backed up by all of the others. The [[LOCKSS: Polling and Repair Protocol]] keeps each box informed of the existence and state of the content at each of the other boxes.<br />
<br />
== Awareness ==<br />
<br />
The CLOCKSS archive's awareness strategy is in two parts:<br />
* Environmental awareness, meaning awareness of events outside the archive that could affect its operations.<br />
* Operational awareness, meaning awareness of the state of the CLOCKSS PLN and the effectiveness of its operations.<br />
<br />
=== Environmental Awareness ===<br />
<br />
Anticipating technology changes is one role of the senior engineers of the LOCKSS team, among whom are four engineers, each with more than 20 years in senior engineering positions in Silicon Valley. Awareness of future technology trends is a job requirement for positions such as these. To fulfill this requirement the team can draw on expertise from the Computer Science Departments of Stanford and UC Santa Cruz, and a network of colleagues in senior engineering and research positions in industry giants including Oracle, Google, NetApp, Seagate and HP, and in the Linux and FreeBSD communities. This in-house and local expertise acts in place of a subscription to a technology watch service as regards the open source ecosystem and generic PC technologies.<br />
<br />
Senior technical staff of the LOCKSS Program attend and speak at many international digital preservation conferences (see [http://www.lockss.org/news-media/talks/ LOCKSS Talks page]). They conduct leading-edge research and publish extensively (see [http://www.lockss.org/news-media/publications/ LOCKSS Publications page] and [http://blog.dshr.org Dr. David S. H. Rosenthal's blog]) in digital preservation. They have done so consistently since 2000. This in-house expertise acts in place of a subscription to a technology watch service as regards digital preservation technologies.<br />
<br />
Risk: Senior engineers are in demand in Silicon Valley, and these key team members could be recruited away. This risk is mitigated by the fact that there are four of them, and that they are all older. Older engineers are less in demand by industry, and find Stanford's excellent benefits attractive in comparison to industry's higher salaries and benefit structures aimed more at younger staff. Staffing the LOCKSS program is a continuous process. Position descriptions and classifications adhere to Stanford University's standards.<br />
<br />
=== Operational Awareness ===<br />
<br />
The mechanisms the CLOCKSS Archive uses to observe the operations of the CLOCKSS boxes via logging and Alerts are described in [[CLOCKSS: Logging and Records]]. That document also describes:<br />
* [[CLOCKSS: Logging and Records#Monitoring|responsibilities for monitoring]] these information streams. <br />
* how data regarding the [[CLOCKSS: Logging and Records#Network Diagnostics|performance of the network as a whole]] is collected and analyzed<br />
<br />
LOCKSS Program operations staff [[CLOCKSS: Logging and Records|monitor the state of the network]] using logs, Alerts, Nagios, and internal tools.<br />
<br />
Risk: The major risk is that too much information overwhelms the human monitoring.<br />
<br />
== Threat Mitigations and Risks ==<br />
<br />
The following sections describe the CLOCKSS archive's approach to mitigating the threats identified by the [[CLOCKSS: Threats and Mitigations#Threat Model|Threat Model]].<br />
<br />
=== Media Failure ===<br />
<br />
The hard disk media used by the CLOCKSS archive can fail in three ways:<br />
* Individual data corruption, which is detected and repaired by the [[LOCKSS: Polling and Repair Protocol]].<br />
* Individual data inaccessibility, which is detected and repaired by the bad block handling of the disk and O/S, and the [[CLOCKSS: Box Operations#RAID Configuration|RAID configuration]] of the CLOCKSS boxes.<br />
* Whole-disk failure, which is detected by the O/S, handled by the [[CLOCKSS: Box Operations#RAID Configuration|RAID configuration]], and repaired by [[CLOCKSS: Box Operations#Component Failure|replacing the drive]].<br />
<br />
Risk: There is a risk that a whole-disk failure would not be observed and the drive replaced in time before another drive in the RAID group failed. The result would be loss of data on the box in question. This would be repaired by the [[LOCKSS: Polling and Repair Protocol]], but doing so would take some time.<br />
<br />
=== Hardware Failure ===<br />
<br />
Other than media failures, other components of CLOCKSS boxes can fail. Observed failures include:<br />
* Power supplies. CLOCKSS boxes have redundant power supplies, so the failure of one does not bring the box down.<br />
* Motherboards, whose failure does bring the box down.<br />
CLOCKSS boxes can be down for extended periods without impairing the function of the network as a whole, as the box in Tokyo was in the aftermath of the Fukushima disaster.<br />
<br />
The CLOCKSS archive does not maintain service contracts for its hardware. It owns the hardware even though it is located at remote sites. The architecture of the CLOCKSS network means that rapid response to outages at individual sites is not required. All CLOCKSS hardware shipped to remote sites is equipped with redundant power supplies, and undergoes extended burn-in before shipment. Experience shows that failures of hardware components other than disks are rare. CLOCKSS boxes are equipped with warm spare disks to cover for disk failures. Non-disk failures are typically handled by exchanging a complete server with one from the LOCKSS team. Thus service contracts are not economically justified.<br />
<br />
Risk: There is a risk that delays in repairing hardware failures would result in enough CLOCKSS boxes being down simultaneously to impair the function of the network. This risk is mitigated by [[CLOCKSS: Logging and Records|monitoring of box operations]] and treating hardware repair as urgent. The purchase of spare hardware that could immediately be shipped from Stanford to reduce the delay in repair is being investigated.<br />
<br />
=== Software Failure ===<br />
<br />
Failures in the LOCKSS software are detected and diagnosed using the [[CLOCKSS: Logging and Records|logging mechanisms]]. They are reported, and progress in their remediation tracked, as described in [[LOCKSS: Software Development Process]]. Because the LOCKSS daemon on each box operates independently, and because its operations are heavily randomized, it is unlikely that the occurrence of failures on multiple boxes would be correlated in time. If necessary, the various processes performed by each or all LOCKSS daemons, such as [[CLOCKSS: Ingest Pipeline|collecting content]] and [[LOCKSS: Polling and Repair Protocol|integrity checks]] can be individually and temporarily disabled by means of the [[LOCKSS: Property Server Operations|property server]].<br />
<br />
The LOCKSS and CLOCKSS networks are so large that full-scale testing in an isolated environment is economically infeasible. Testing in available isolated environments is a part of the process, but it cannot be representative of the load encountered by software in production use. This risk is mitigated by releasing new versions of the LOCKSS software to a number of LOCKSS boxes that the LOCKSS team runs as part of the Global LOCKSS Network (GLN) before they are made generally available or released to the CLOCKSS network. The GLN includes a much larger number of boxes than the CLOCKSS network, although on average each has much less content. Experience shows that problems not detected in an isolated testing environment are most likely to be caused by a large number of boxes rather than by a large amount of content per box.<br />
<br />
Risk: There is a risk that bugs in the LOCKSS daemon could overwrite or delete content from a CLOCKSS box. This risk is mitigated in two ways:<br />
* Exclusion from the LOCKSS daemon of code that overwrites or deletes files from the repository, so that a bug cannot inappropriately execute it.<br />
* The [[LOCKSS: Polling and Repair Protocol]], which would detect and repair the damage from another box.<br />
<br />
=== Communication Errors ===<br />
<br />
The CLOCKSS archive uses network communications for three purposes:<br />
* Ingest. HTTP is not a reliable transport protocol so, as described in [[CLOCKSS: Ingest Pipeline]], content is ingested multiple times by different ingest machines and subsequently the [[LOCKSS: Polling and Repair Protocol]] detects and repairs any inconsistencies between the content at the ingest boxes.<br />
* Preservation. CLOCKSS boxes are [[CLOCKSS: Box Operations#CLOCKSS PLN Configuration|configured to use SSL]] for all communication between them, specifically for the [[LOCKSS: Polling and Repair Protocol]]. Certificates are checked at both ends of all connections, so corruption would be detected. Interruptions of communication are normal; messages are re-tried until they are delivered or a specified time-out expires.<br />
* Dissemination. Communication problems during dissemination to the re-publishing servers would be detected by checksum verification and re-tried.<br />
<br />
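All three purposes share the same detection-and-retry pattern: verify a checksum on the received bytes and treat a mismatch like any other transient failure. A minimal sketch, in which <tt>fetch</tt> is a hypothetical stand-in for whatever transfer is in use (HTTP, rsync, sftp):<br />
<br />
```python
import hashlib

def fetch_with_verify(fetch, expected_sha256: str, max_tries: int = 3) -> bytes:
    """Call `fetch()` until the returned bytes match the expected digest.

    A mismatched checksum is treated as a transient communication error
    and the transfer is simply retried, up to max_tries attempts.
    """
    for _ in range(max_tries):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    raise IOError("checksum mismatch after %d attempts" % max_tries)
```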
Risk: The mitigations are assessed as effective against the threat, so the risk is low.<br />
<br />
=== Failure of Network Services ===<br />
<br />
The CLOCKSS archive has to consider the possible failure of network services during each of the three phases:<br />
* Ingest:<br />
** Ingest of content via harvest requires the use of DNS and the publisher's Web server. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.<br />
** Ingest of content via file transfer requires the use of DNS and a file transfer service such as <tt>ftp</tt> either at the publisher or at Stanford. Failure of either is presumed to be transient, and thus to delay but not prevent ingest.<br />
* Preservation. A major design goal of the [[LOCKSS: Polling and Repair Protocol]] was to avoid all dependencies on external network services, even DNS, since there was no guarantee that the service would continue. Provided it remains possible to route packets to a network address, the failure of other network services would not affect the preservation of CLOCKSS content. Currently, DNS is required during daemon start; a fix has been designed but has yet to be implemented.<br />
* Dissemination. [[CLOCKSS: Extracting Triggered Content#Assembling the DIP|Transfer of triggered content]] to the re-publishing servers requires DNS and a file transfer protocol such as <tt>rsync</tt> or <tt>sftp</tt>. Failure of either is presumed to be transient, so it would delay but not prevent dissemination.<br />
<br />
Risk: If Internet connectivity were to be impossible for many months the content of the individual CLOCKSS boxes would be at significant risk, but this is assessed as a low probability event. The failures would be unlikely to be correlated, so once connectivity was restored the [[LOCKSS: Polling and Repair Protocol]] would have a high probability of recovering from them, although it would take some time.<br />
<br />
=== Media & Hardware Obsolescence ===<br />
<br />
All [[CLOCKSS: Hardware and Software Inventory#Hardware|hardware and media components]] in use by the CLOCKSS archive are generic low-cost PC server technology and, as such:<br />
* Easy to monitor for obsolescence, since that would be an industry-wide event.<br />
* Easily replaced with newer physical or virtual resources when necessary.<br />
The LOCKSS team monitors the state of the [[CLOCKSS: Hardware and Software Inventory#Hardware|hardware inventory]] as documented in [[CLOCKSS: Logging and Records]]. The CLOCKSS archive has [[CLOCKSS: Box Operations#Hardware Replacement|technical]] and [[CLOCKSS: Budget and Planning Process|financial]] plans in place to replace failed or life-expired hardware.<br />
<br />
There are three reasons why ingest machines or CLOCKSS boxes might need to be replaced:<br />
* Hardware failure, which would be revealed by the monitoring processes described in [[CLOCKSS: Logging and Records]], [[CLOCKSS: Ingest Pipeline]] and [[CLOCKSS: Box Operations]].<br />
* Resource exhaustion, which would be revealed by the monitoring processes described in [[CLOCKSS: Logging and Records]], [[CLOCKSS: Ingest Pipeline]] and [[CLOCKSS: Box Operations]].<br />
* Technological obsolescence, which would be evident through the staff awareness described in [[CLOCKSS: Threats and Mitigations#Awareness|Awareness]].<br />
Monitoring means the risk of missing hardware failure or resource exhaustion is low, and because either would affect only one of the replicas the impact would be low. The risk of technological obsolescence of the hardware is low since it is all generic PC servers with no specialized components.<br />
<br />
The technical specifications for the current hardware were drawn up with incremental upgrade over time in mind. The only components that we expect to upgrade in the next 5 years are the disk media. Beyond 5 years is too far ahead to draw up detailed specifications; for example, would we want to use ARM-based micro-servers? Spintronic storage media? Named data networking? We cannot know yet.<br />
<br />
Risk: There is a risk that, when the time comes, financial resources would be inadequate to replace life-expired hardware. This risk is mitigated by:<br />
* Assuming a service life for equipment (typically 5 years) that is much shorter than what the equipment is capable of.<br />
* Using generic, low cost equipment.<br />
* The replication inherent in the CLOCKSS PLN, which means that a few boxes could be out of service for some time without impacting the archive's operations.<br />
<br />
=== Software Obsolescence ===<br />
<br />
All [[CLOCKSS: Hardware and Software Inventory#Software|software in use by the CLOCKSS archive]] is either:<br />
* free, open-source, industry standard software such as Linux and Java, or<br />
* internally developed free, open-source software (the LOCKSS daemon), or<br />
* internally developed tools used for content testing, and diagnosis of the CLOCKSS network's performance.<br />
The LOCKSS daemon used to preserve the CLOCKSS archive's content [[LOCKSS: Software Development Process#Dependencies|depends upon]]:<br />
* A POSIX file system.<br />
* A Java virtual machine, level 6 or above.<br />
* A set of Java libraries.<br />
Changes which prevent the Linux environment satisfying these requirements are considered unlikely in the foreseeable future, and if they were to be envisaged by the Linux community it would only be after open discussion of which the LOCKSS team would be aware (see [[CLOCKSS: Threats and Mitigations#Awareness|Awareness above]]). The LOCKSS software is maintained by the LOCKSS team using processes defined in [[LOCKSS: Software Development Process]]. LOCKSS Program technical staff monitor the evolution of the open source ecosystem and, when indicated, routinely migrate the LOCKSS software (for example, from one Java library to another deemed more suitable). The threat of obsolescence is also monitored by the testing processes described in [[LOCKSS: Software Development Process]] and [[CLOCKSS: Ingest Pipeline]]. Loss of key team members could impact the effectiveness of this; for mitigation see [[CLOCKSS: Threats and Mitigations#Awareness|Awareness above]]. The rest of the stack is maintained by the Linux, Apache and other open source communities. Since all the software is free and open-source, no financial provision other than the normal [[LOCKSS: Software Development Process]] funding need be made for its replacement or upgrade.<br />
<br />
One consequence of software obsolescence might be format obsolescence. The CLOCKSS archive [[LOCKSS: Format Migration|implements format migration on access]]. Doing so depends on the eventual availability of format converters, a topic discussed [[LOCKSS: Format Migration#Availability of Format Converters|here]].<br />
<br />
Risk: The risk of the open source community being unable to sustain the dependencies on the Java virtual machine, some Java libraries, and the availability of a POSIX file system, or the Apache web server, is assessed as low. These basic dependencies have been stable since the LOCKSS prototype nearly 15 years ago. The requirements development process described in [[LOCKSS: Software Development Process]] might fail to detect the need for a change from the LOCKSS community or the content being preserved. Changes in the rest of the software stack might trigger a failure of one or more dependencies of the LOCKSS daemon. The unit and functional testing processes described in [[LOCKSS: Software Development Process]] are designed to detect this. CLOCKSS Archive income from libraries and publishers might not be adequate for the work needed to adapt to new publishers and conform to the evolution of existing publishers, leading to a backlog of content to be ingested.<br />
<br />
One form of software obsolescence would be if the Web evolved in such a way as to prevent the LOCKSS software from collecting, preserving or disseminating current Web content. This is a significant concern for all Web archiving technologies, and the LOCKSS technical staff have been at the forefront of addressing the issue by running workshops at the International Internet Preservation Consortium. See blog posts about the [http://blog.dshr.org/2013/04/talk-on-harvesting-future-web-at.html 2013] and [http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html 2012] workshops. The Andrew W. Mellon Foundation is funding the LOCKSS Program to work in this area through mid-2014.<br />
<br />
=== Operator Error ===<br />
<br />
An error by the operator of an individual CLOCKSS box affects that individual box but does not compromise the integrity of the network as a whole.<br />
<br />
An error by an operator of the [[LOCKSS: Property Server Operations|Property Server]] can affect the entire network, but only by:<br />
* Interrupting service, which has no deleterious effect because each box caches the most recent set of properties.<br />
* Distributing a syntactically malformed property file, which will be detected by the boxes and treated as a service interruption.<br />
* Distributing a syntactically correct property file that sets unsuitable property values. The LOCKSS daemon software is skeptical of property values. Critical properties have range checks and the code takes other defensive measures to ensure that erroneous property values can at worst cause daemon activities such as polling to stop; they cannot cause loss of or damage to content. <br />
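The defensive treatment of property values can be sketched as follows. The property name, its range, and the handling are illustrative, not the daemon's actual configuration code; the point is that a malformed or out-of-range value never propagates, and the previously cached value is kept instead:<br />
<br />
```python
# Hypothetical property name, default and permitted range -- chosen for
# illustration only.
DEFAULTS = {"poll.interval.hours": 24}
RANGES = {"poll.interval.hours": (1, 24 * 30)}

def apply_property(current: dict, key: str, raw: str) -> dict:
    """Return an updated configuration, rejecting bad values defensively."""
    updated = dict(current)
    try:
        value = int(raw)
    except ValueError:
        return updated                    # malformed: keep the cached value
    lo, hi = RANGES.get(key, (value, value))
    if lo <= value <= hi:
        updated[key] = value              # in range: accept the new value
    return updated                        # out of range: keep the cached value
```
<br />
At worst a rejected property leaves an activity such as polling running with its previous settings, or stopped; it cannot damage content.<br />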
An error by a [[LOCKSS: Basic Concepts#LOCKSS Daemon|daemon]] or [[LOCKSS: Basic Concepts#Plugins|plugin]] developer, or the daemon or plugin build master, could affect the entire network but the [[LOCKSS: Software Development Process|testing and release process]] is as automated as possible and designed to catch such errors before they get to the network.<br />
<br />
Other precautions taken include:<br />
* Operator access to CLOCKSS boxes is logged.<br />
* Administrative actions via the LOCKSS daemon's administrative Web interface cause Alerts (see [[CLOCKSS: Logging and Records#Administrative and Security Alerts|CLOCKSS: Logging and Records]]).<br />
<br />
Risk: Experience of the LOCKSS system in production use shows this is a low risk, in that many such errors have been made with no serious effect.<br />
<br />
=== Natural Disaster ===<br />
<br />
Since each of the (currently 12) CLOCKSS boxes is configured to contain a complete copy of the Archive's content, a disaster causing the total loss of a few CLOCKSS boxes does not need to be treated as a disaster, but merely as the routine replacement of a few network nodes as documented in [[CLOCKSS: Box Operations]].<br />
<br />
All CLOCKSS triggered content is disseminated via two mirrored Web servers, one at Stanford, California and one at EDINA, Scotland. A disaster at one of these sites would not interrupt service. Mirroring could be easily restored by copying from the unaffected site or from one of the CLOCKSS boxes as documented in [[CLOCKSS: Extracting Triggered Content]]. All content triggered from the CLOCKSS network is under a Creative Commons license; there are neither technical nor legal barriers to other, unaffiliated, institutions bringing up additional mirrors.<br />
<br />
A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of ingesting new content, since 3 of the 5 presentation ingest machines are located at Stanford. Each of these ingest machines contains the current state of the ingest pipeline, so that replacement machines could be cloned from one of the remaining machines at the cost of a week or two delay in ingest. (See [[CLOCKSS: Ingest Pipeline]]). The content of the source ingest pipeline is mirrored off-site.<br />
<br />
A disaster affecting the LOCKSS team at Stanford might interrupt service in terms of the "property server" used to manage the CLOCKSS network. The LOCKSS team maintains a hot standby of the property server in Amazon's cloud. (See [[LOCKSS: Property Server Operations]]). Each CLOCKSS box caches a complete copy of the contents of the CLOCKSS property server, so a service interruption would be unlikely to affect their operation during the time needed to fail over to the hot standby.<br />
<br />
All CLOCKSS documents are preserved in the CLOCKSS Archive, and thus in each of the CLOCKSS boxes.<br />
<br />
It appears that the use of the <tt>ServeContent</tt> servlet to serve most of the triggered content is preventing the [http://web.archive.org/web/ Internet Archive's Wayback Machine] from preserving it. We plan to investigate possible ways around this issue so that the Internet Archive could also act as a re-publishing server for triggered content.<br />
<br />
Risk: Given the high risk of a natural disaster in the Bay Area, further attention is needed to maintaining critical data outside the area, not merely off-site.<br />
<br />
=== External Attack ===<br />
<br />
As described in [[CLOCKSS: Box Operations]], the configuration of each CLOCKSS box was carefully designed to prevent communication except with the other CLOCKSS boxes (enforced using SSL certificate checks at both ends of each connection) and with the CLOCKSS ingest and management machines (using firewall rules). CLOCKSS (and LOCKSS) boxes are single-function servers; there are no other services sharing the machine for an attacker to compromise. An attacker who, perhaps by compromising a machine used by a host institution's administrator, gains access to an individual CLOCKSS box does not compromise the integrity of the network as a whole, since the CLOCKSS boxes do not trust each other. The surface available to an external attacker is thus minimized. An attacker could compromise the CLOCKSS Property Server, and modify the configuration of all boxes in the network. This could impede network operations until control of the property server was restored, but due to the design of the LOCKSS technology it would not result in content in the CLOCKSS boxes being modified or lost permanently. See [[LOCKSS: Property Server Operations]].<br />
<br />
Each CLOCKSS box's operating system is maintained current with the CentOS repositories. Some CLOCKSS boxes update automatically from these repositories within 24 hours, some require administrator intervention. This mitigates the risk that an erroneous update from CentOS would impact all CLOCKSS boxes almost simultaneously.<br />
<br />
The process by which security requirements for the LOCKSS software are developed and addressed is described in [[LOCKSS: Software Development Process]]. Once a security enhancement for the LOCKSS daemon is released, all CLOCKSS boxes install it automatically within 24 hours.<br />
<br />
The following precautions are taken to prevent unauthorized access via a CLOCKSS box's administrative Web interface:<br />
* Packet filters prevent access except from the box's host institution's network, and from the LOCKSS team's subnet at Stanford.<br />
* Access requires HTTPS.<br />
* Administrative access is logged.<br />
* Administrative actions cause Alerts, see [[CLOCKSS: Logging and Records#Administrative and Security Alerts|CLOCKSS: Logging and Records]].<br />
<br />
If an attack compromises one or more ingest boxes, the ingest network should be stopped via the [[LOCKSS: Property Server Operations|property server]], all boxes disconnected from the network, the vulnerability diagnosed, and all boxes wiped and their BIOS and operating system re-installed from scratch. Their content should be re-ingested from the publisher.<br />
<br />
If an attack compromises one or more production boxes, the production network should be stopped via the [[LOCKSS: Property Server Operations|property server]], the affected boxes disconnected from the network, the vulnerability diagnosed, and the affected boxes wiped and their BIOS and operating system re-installed from scratch. Unless a majority of the production boxes were compromised, the [[LOCKSS: Polling and Repair Protocol]] will detect and repair any corruption of their content.<br />
<br />
Risk: <br />
* The open source community maintainers could issue a faulty update to a component of the CLOCKSS box stack.<br />
* The LOCKSS team could issue a faulty software update.<br />
These risks are mitigated by the configuration of the CLOCKSS boxes, which prevents communication except with specifically authorized IP addresses, making it difficult for an attacker to exploit a remote vulnerability, and which prevents login access except by host institution administrators.<br />
<br />
=== Internal Attack ===<br />
<br />
Internal attack could take one of two forms:<br />
* Insider abuse at the CLOCKSS host institutions is limited to affecting a single box, not the preserved content as a whole. This is because each of the (currently 12) CLOCKSS boxes is independently administered; insiders at the host institution have access only to their box. The CLOCKSS boxes do not trust each other, only the consensus of the boxes as a whole.<br />
* Insider abuse by the LOCKSS team. The policy is that when a new CLOCKSS box is brought up, the LOCKSS staff managing the network have write and administrative access to it via <tt>sudo</tt>. All such accesses are logged. Once confidence is achieved in the working relationship with staff at the host institution, this access is terminated. This stage has been achieved with 7 of the 11 remote CLOCKSS boxes. Eventually, the LOCKSS staff will have such access only to the box at Stanford; their access to the other boxes is limited to:<br />
** read-only data collection (see [[CLOCKSS: Box Operations]])<br />
** changes to the LOCKSS daemon configuration (see [[LOCKSS: Property Server Operations]] and discussion of [[CLOCKSS: Threats and Mitigations#Operator Error|Operator Error]] above)<br />
** changes to the LOCKSS daemon software (see [[LOCKSS: Software Development Process]]), which could introduce malicious code into the network. This risk is mitigated by the use of SourceForge's source code control system, which allows code changes to be traced to their authorized committers, and easily rescinded, code signing (CLOCKSS boxes verify the signature on all software, whether from the LOCKSS team or from the CentOS repositories, before installing it), and the staged release process.<br />
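The code-signing check can be illustrated in miniature. This is not the daemon's actual mechanism, which verifies real cryptographic signatures on packages before installation; here an HMAC over a digest manifest stands in for the public-key signature purely to keep the sketch self-contained, and all names and keys are hypothetical:<br />
<br />
```python
import hashlib
import hmac

def verify_and_install(pkg: bytes, manifest: dict, manifest_mac: bytes,
                       key: bytes) -> bool:
    """Refuse a package unless (a) the manifest authenticates and
    (b) the package's digest appears in it.

    The HMAC over the serialized manifest stands in for the real
    public-key package signature made by the build master.
    """
    serialized = repr(sorted(manifest.items())).encode()
    expected = hmac.new(key, serialized, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, manifest_mac):
        return False                      # manifest not from the build master
    return hashlib.sha256(pkg).hexdigest() in manifest.values()
```
<br />
The same pattern, verify before install, applies whether the package comes from the LOCKSS team or from the CentOS repositories.<br />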
<br />
With the exception of the Stanford CLOCKSS box, staff at the host institution of each CLOCKSS box have access only to their box, not to any of the others. Their role in changing the system is limited to maintaining the operating system of their box current with the requirements of the network. LOCKSS staff have read-only access to all boxes for monitoring and data collection purposes.<br />
<br />
Members of the LOCKSS staff have delineated roles, responsibilities and authorizations regarding making changes to the system as follows:<br />
* LOCKSS technical staff can check changes in to the daemon source code repository, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a release candidate built and signed by the LOCKSS build master, and approved for release by the LOCKSS technical lead. The LOCKSS source code control system identifies the author of each change to the system. See [[LOCKSS: Software Development Process]].<br />
* LOCKSS content staff can check changes in to the plugin source, but these changes are not deployed to the CLOCKSS boxes until they have been tested, included in a plugin release built and signed by the plugin build master, and approved for release by the plugin lead. See [[LOCKSS: Software Development Process]].<br />
* Access to the server room containing the Stanford CLOCKSS box, the property server and other critical systems is restricted to the LOCKSS sysadmin and the LOCKSS senior engineers. Similar secure physical locations are required of the other CLOCKSS boxes, see [[CLOCKSS: Box Operations#Requirements for CLOCKSS host sites|CLOCKSS: Box Operations]].<br />
<br />
Risk: There is a risk that the LOCKSS build master could compromise the build process to introduce malware. Although this would be evident after the damage was done, because the signed package would not correspond to the tagged source, it is hard to see any pro-active mitigation.<br />
<br />
=== Economic Failure ===<br />
<br />
The LOCKSS software is maintained by the LOCKSS team, funded jointly by the CLOCKSS archive and the LOCKSS Alliance. The LOCKSS team has been economically sustainable for more than 5 years solely on this basis without grant funding.<br />
<br />
There is a risk that the CLOCKSS administration might commit to preserve publishers whose content is very large without charging them enough to fund the storage necessary for their content. This risk is mitigated by regular reports on system capacity to CLOCKSS administration. Loss of CLOCKSS Archive library members would reduce funding without corresponding reduction in content (as loss of publisher members would) and might make timely hardware replacements difficult. This risk is mitigated by the 30-year history of exponential drops in storage cost per byte, and the existence of 12 complete replicas of the content, which makes the temporary loss of a few replicas while waiting for replacement less important.<br />
<br />
=== Organizational Failure ===<br />
<br />
For the business aspects of failing over to a successor organization see [[CLOCKSS: Succession Plan]].<br />
<br />
If, as part of the [[CLOCKSS: Succession Plan]], it becomes necessary to transfer custody of the content of the CLOCKSS archive, this could be achieved in multiple ways. The successor organization could take custody of the content and metadata by, among other possible means:<br />
* Importing the content exported by a production CLOCKSS box in one of the packaging formats supported by the LOCKSS daemon, including ZIP, TAR and WARC files.<br />
* Crawling the content from a production CLOCKSS box using a standard Web crawler such as the Internet Archive's Heritrix.<br />
* Using shell scripts to traverse the file systems containing the LOCKSS daemon's repository, described in [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|Definition of AIP]], to create a different packaging format, then importing that.<br />
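The third option above, traversing the repository file systems to build a package, can be sketched as follows. The tar packaging and the flat file layout are illustrative assumptions; the repository's real on-disk structure is described in [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|Definition of AIP]]:<br />
<br />
```python
import pathlib
import tarfile

def export_repository(repo_root: str, out_tar: str) -> int:
    """Walk a repository tree and package every file into a tar archive.

    `repo_root` stands for the file system holding the daemon's
    repository. Paths are stored relative to the root so the archive
    can be re-imported elsewhere. Returns the number of files packaged.
    """
    root = pathlib.Path(repo_root)
    count = 0
    with tarfile.open(out_tar, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=str(path.relative_to(root)))
                count += 1
    return count
```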
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# [http://dx.doi.org/10.1045/november2005-rosenthal Requirements for Digital Preservation Systems: A Bottom-Up Approach]<br />
# [http://dx.doi.org/10.1145/1047915.1047917 ''LOCKSS: A Peer-to-Peer Digital Preservation System'']<br />
# [https://www.usenix.org/conference/2005-usenix-annual-technical-conference/attrition-defenses-peer-peer-digital-preservation ''Attrition Defenses for a Peer-to-Peer Digital Preservation System'']<br />
# [http://www.lockss.org/news-media/talks/ LOCKSS Talks page]<br />
# [http://www.lockss.org/news-media/publications/ LOCKSS Publications page]<br />
# [http://blog.dshr.org Dr. David S. H. Rosenthal's blog]<br />
# [[CLOCKSS: Box Operations]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[CLOCKSS: Logging and Records]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Property Server Operations]]<br />
# [[CLOCKSS: Hardware and Software Inventory]]<br />
# [[LOCKSS: Software Development Process]]<br />
# [[LOCKSS: Format Migration]]<br />
# [[CLOCKSS: Succession Plan]]<br />
# [[Definition of AIP]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Metadata_DatabaseLOCKSS: Metadata Database2014-04-06T22:13:29Z<p>Dshr: /* Membership of a Bibliographic Unit in an AIP (AU) */</p>
<hr />
<div>= LOCKSS: Metadata Database =<br />
<br />
The LOCKSS technology [[LOCKSS: Extracting Bibliographic Metadata|extracts bibliographic and other metadata]] from the preserved content and indexes it in a relational database. This database is not itself preserved; it is merely a cache of metadata extracted from the preserved content to facilitate access and management. The indexed metadata can be used to locate and retrieve information about articles, chapters, and other features of the preserved content. This document describes how this information is represented in the database tables.<br />
<br />
== Information Architecture of Stored Metadata ==<br />
<br />
The metadata database stores metadata about preserved content at the level of bibliographic units. For periodicals such as journals, the bibliographic units are journal articles. For books, bibliographic units are either individual book chapters or an entire book if chapters are not delivered individually. Bibliographic units for other types of content can be identified and represented at the same level. Each bibliographic unit is represented by descriptive information such as authors, title (e.g. of the article or chapter), keywords, and abstract. Identifying information about the bibliographic unit is also stored, including the Digital Object Identifier (DOI), ISSN, and ISBN.<br />
<br />
Information about relationships among bibliographic units is also stored in the metadata database. For example, a journal article is usually part of an issue, and is located at certain page numbers within the issue or identified by an article number. The issue is published as a certain issue number within a certain volume of a journal from a publisher. Similarly, a book chapter or article is part of a book or monograph, located at a certain chapter number, which may be part of a book or monographic series from a publisher. The metadata database stores relationship information that enables bibliographic units to be located with respect to these relationships. For example, all articles of an issue, all issues of a volume, all volumes of a journal, and all journals by a publisher can be identified. This enables reporting on, rendering, and browsing content in a way that reflects these relationships.<br />
<br />
Finally, the metadata database stores information about the membership of each bibliographic unit within a specific [[Definition of AIP|Archival Information Package]] (AIP), including the identifier of the AIP (the AUID), and the URI of the unit on the Provider's website in the case of harvested content, or relative to a delivered [[Definition of SIP|Submission Information Package]] (SIP) in the case of file transfer content. This information is sufficient to locate the bibliographic unit within the LOCKSS repository.<br />
<br />
== Schema Representation ==<br />
<br />
The schema used to represent bibliographic units, their bibliographic relationships, and their preservation information is encoded as tables in a relational model.<br />
<br />
=== Representation of Bibliographic Units ===<br />
<br />
A bibliographic unit is primarily represented by a MD_ITEM table that is decorated by a number of supporting tables with different types of bibliographic information. These tables include:<br />
<br />
* MD_ITEM_NAME -- the name of the bibliographic unit (e.g. article or book chapter title)<br />
* DOI -- the DOI of the bibliographic unit (e.g. article or book chapter DOI)<br />
* AUTHOR -- the authors of the unit (e.g. article or book chapter author)<br />
* KEYWORD -- the keywords for the unit (e.g. article or book chapter keywords)<br />
<br />
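The decoration pattern above can be sketched with an in-memory relational database. The column names here are simplified illustrations only; the automatically generated schema diagram later in this document is authoritative:<br />
<br />
```python
import sqlite3

# Illustrative subset: one MD_ITEM row per bibliographic unit, decorated
# by supporting tables keyed on md_item_seq. Column names are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE md_item      (md_item_seq INTEGER PRIMARY KEY,
                               parent_seq  INTEGER REFERENCES md_item);
    CREATE TABLE md_item_name (md_item_seq INTEGER REFERENCES md_item,
                               name TEXT);
    CREATE TABLE doi          (md_item_seq INTEGER REFERENCES md_item,
                               doi TEXT);
    CREATE TABLE author       (md_item_seq INTEGER REFERENCES md_item,
                               author_name TEXT, author_idx INTEGER);
    CREATE TABLE keyword      (md_item_seq INTEGER REFERENCES md_item,
                               keyword TEXT);
""")
# One article, decorated with its title, DOI and an author.
conn.execute("INSERT INTO md_item VALUES (1, NULL)")
conn.execute("INSERT INTO md_item_name VALUES (1, 'An Example Article')")
conn.execute("INSERT INTO doi VALUES (1, '10.9999/example.1')")
conn.execute("INSERT INTO author VALUES (1, 'A. Author', 0)")
```
<br />
Joining the decorating tables back on <tt>md_item_seq</tt> reassembles the full description of a unit.<br />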
=== Relationships Among Bibliographic Units ===<br />
<br />
Relationships among bibliographic units are represented by supporting tables that decorate the MD_ITEM with intermediate levels of containment information:<br />
<br />
* BIB_ITEM -- the intermediate bibliographic information (e.g. volume, issue, start page)<br />
<br />
Relationships among bibliographic units are also represented by a parent MD_ITEM that represents the containing publication. This parent MD_ITEM is further decorated by supporting tables with publication information. These tables include:<br />
<br />
* MD_ITEM_NAME -- the name of the publication (e.g. journal or book title)<br />
* ISSN -- the print or online ISSN of a periodical<br />
* ISBN -- the print or online ISBN of a book<br />
* PUBLICATION -- the publication record<br />
* PUBLISHER -- The publisher for a given publication<br />
<br />
An additional level of MD_ITEM is also used to represent a book or monographic series. This grandparent MD_ITEM is further decorated by supporting tables with publication information. These tables include:<br />
<br />
* ISSN -- the print or online ISSN of a book series<br />
<br />
=== Membership of a Bibliographic Unit in an AIP (AU) ===<br />
<br />
Membership of bibliographic units in an AIP (AU) is represented by supporting tables that decorate the MD_ITEM. These tables include:<br />
<br />
* AU_MD -- the membership record of a bibliographic unit in an AU<br />
* AU -- the AU information including [[LOCKSS: Basic Concepts#AUID|AUID]] and [[LOCKSS: Basic Concepts#Plugins|plugin]]<br />
* PLUGIN -- the plugin information including the plugin ID and publishing platform<br />
* PLATFORM -- the publishing platform<br />
<br />
Membership is also represented by tables that decorate the MD_ITEM of the bibliographic unit:<br />
<br />
* URL -- the original URL of the bibliographic unit at the Provider (e.g. the article or book chapter URL)<br />
<br />
The plugin, AU key, and original URL of the bibliographic unit are sufficient to locate the bibliographic unit within its AIP and to identify the SIP from which it came in the LOCKSS repository.<br />
<br />
=== Graphical Representation of Schema ===<br />
<br />
A graphical representation is automatically generated from the metadata database schema.<br />
<br />
[[File:MetaDatabase-Schema.png|400px|center|thumb|LOCKSS Metadata Database Schema Diagram]]<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Technical Staff<br />
* Approval by LOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[Definition of SIP]]</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T22:10:37Z<p>Dshr: /* LOCKSS Daemon Updates */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to have in-place disaster recovery plans should any of these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors, Iron Systems, located in Fremont, CA, and eRacks in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. Business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped.<br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with one to two years of limited hardware warranty, depending on the vendor. The warranty on most of the CLOCKSS boxes has since expired, but while in effect it proved useful for exchanging failed hardware components. Details about the warranties are in the purchase orders that were submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Opteron 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual GbE<br />
| | Onboard dual GbE<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMware vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget constraints, software needs, and cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, our hardware vendors have an extensive and rigorous process in place to ensure the machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, IPMI implementations have known security vulnerabilities, so it is disabled by default on all machines: it is turned off in the BIOS and the network cable is unplugged from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID based hardware RAID controllers and eight 3TB disks. These disks were grouped into sets of four and put into a RAID5 configuration. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays.<br />
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
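As a hypothetical illustration of the arrays described above (device names are illustrative, and the commands are a sketch rather than the exact procedure used), a four-disk RAID5 array gives up one disk's worth of capacity to parity:<br />
<br />
```shell
# Sketch: create a RAID5 array from four disks and format it with
# EXT4 (device names are illustrative; requires root):
#   mdadm --create /dev/md0 --level=5 --raid-devices=4 \
#         /dev/sdb /dev/sdc /dev/sdd /dev/sde
#   mkfs.ext4 /dev/md0

# raid5_usable N SIZE -- raw usable capacity of an N-disk RAID5
# array: one disk's worth of space is consumed by parity.
raid5_usable() { echo $(( ($1 - 1) * $2 )); }

raid5_usable 4 3   # four 3TB disks -> prints 9 (TB raw)
```
<br />
The approximately 8.1TB usable figure quoted above is what remains of this 9TB raw capacity after RAID metadata and EXT4 filesystem overhead.<br />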
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor <tt>mdadm</tt> RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an OK, it sends an alert to CLOCKSS engineers.<br />
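The custom plugin itself is not reproduced here, but its logic can be sketched as follows (a hypothetical, simplified version; the production plugin's name, options and thresholds may differ). Nagios-style plugins report status through their exit code: 0 (OK), 1 (WARNING), 2 (CRITICAL), 3 (UNKNOWN).<br />
<br />
```shell
# check_md [MDSTAT_FILE] -- hypothetical simplified sketch of an NRPE
# plugin that checks mdadm RAID health by parsing /proc/mdstat.
# Return codes follow the Nagios convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
check_md() {
    mdstat="${1:-/proc/mdstat}"
    if [ ! -r "$mdstat" ]; then
        echo "UNKNOWN: cannot read $mdstat"; return 3
    fi
    # A failed member is flagged "(F)" in the device list ...
    if grep -q '(F)' "$mdstat"; then
        echo "CRITICAL: failed disk in RAID array"; return 2
    fi
    # ... and a degraded array shows "_" in its status field,
    # e.g. [UU_] instead of [UUU].
    if grep -q '\[U*_[U_]*\]' "$mdstat"; then
        echo "WARNING: RAID array degraded or rebuilding"; return 1
    fi
    echo "OK: all RAID arrays healthy"
}
```
<br />
Under NRPE the exit code and the one-line message are relayed back to the Nagios server, which maps them to alert states.<br />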
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system, and then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0 (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) is set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is set up so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> rules should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two step process:<br />
## The first is to create a new file, <tt>lockss.repo</tt> under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
##Then to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
#The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt> and <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
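The repository mount-point preparation in step 2 above can be sketched as follows (a hypothetical illustration; device names and the exact <tt>fstab</tt> entry vary by machine):<br />
<br />
```shell
# setup_repo_dir DIR USER GROUP -- sketch of preparing a LOCKSS
# repository mount point as described above. USER and GROUP are
# normally both "lockss" (which requires root); they are parameters
# here for illustration.
setup_repo_dir() {
    dir="$1"; user="$2"; group="$3"
    mkdir -p "$dir"              # e.g. /cache0/gamma, /cache1/gamma, ...
    chown "$user:$group" "$dir"  # owned by the lockss user and group
    chmod o-rwx "$dir"           # strip permissions for "other"
}

# The filesystem is then mounted with noexec via /etc/fstab, e.g.:
#   /dev/md0  /cache0/gamma  ext4  defaults,noexec  0 2
# and the path is added to PRUNEPATHS in /etc/updatedb.conf so that
# updatedb does not index the repository, e.g.:
#   PRUNEPATHS="/tmp /var/spool /media /cache0/gamma"
```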
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The naming scheme we use is clockss-<i>site</i>.clockss.org, where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no one else.<br />
|-<br />
|}<br />
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
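The keystore installation script is small; a hypothetical sketch of what it does follows (the actual script, destination path and ownership details may differ, and are illustrative here):<br />
<br />
```shell
# install_keystore SRC DEST OWNER -- sketch of installing an uploaded
# PLN keystore with restrictive permissions. OWNER is normally
# "lockss:lockss" and DEST a path under the daemon's configuration
# directory; both are parameters here for illustration.
install_keystore() {
    src="$1"; dest="$2"; owner="$3"
    mkdir -p "$(dirname "$dest")"
    cp "$src" "$dest"
    chown "$owner" "$dest"
    chmod 600 "$dest"    # readable and writable by the owner only
}
```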
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins signal the nature and severity of an issue (or lack thereof) through four return codes. For any return code other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in West Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside of the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hotswap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disk available (at the moment, 4TB), create a new RAID array and finally to move content from an existing array. The disks comprising the old array can then be removed and the process can be repeated for the remaining RAID arrays.<br />
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are readily available, cheap, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails and the CLOCKSS box is still under warranty, we contact its hardware vendor. If the box is no longer under warranty, we prefer to purchase a disk and ship it, expedited, to the remote site. However, if the remote site is not within the United States, an alternative is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that it will eventually no longer be cost effective or practical to continue to run old hardware, due to improvements in performance-per-watt, disk bandwidth, disk capacity and other measures. Software and content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent runs of the [[LOCKSS: Polling and Repair Protocol]] will also verify the integrity of the copy.<br />
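The two-pass copy can be sketched as follows (a hypothetical local illustration; in production the destination is the new machine, reached over <tt>ssh</tt>, and the LOCKSS daemon is stopped between the passes):<br />
<br />
```shell
# two_pass_copy SRC DEST -- sketch of the two-pass rsync procedure
# described above. Local paths are used here for illustration; in
# production DEST would be a remote target such as newbox:/cache0/gamma.
two_pass_copy() {
    src="$1"; dest="$2"
    # Pass 1: copy the bulk of the content while the old box's
    # LOCKSS daemon is still running.
    rsync -a "$src/" "$dest/"
    # ... stop the LOCKSS daemon on the old box and remount its
    # content filesystems read-only ...
    # Pass 2: copy any changes made during pass 1 and re-verify by
    # checksum (-c) rather than by size and modification time.
    rsync -ac "$src/" "$dest/"
}
```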
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> entry added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
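Step 3 above can be sketched as follows (a hypothetical helper; it assumes <tt>config.dat</tt> uses the shell-style <tt>KEY="value"</tt> format, and the paths in the example are illustrative):<br />
<br />
```shell
# append_disk_path CONFIG PATH -- sketch of appending a new repository
# path to the semicolon-delimited LOCKSS_DISK_PATHS entry in
# config.dat (assumes the shell-style KEY="value" format).
append_disk_path() {
    config="$1"; newpath="$2"
    if grep -q '^LOCKSS_DISK_PATHS=' "$config"; then
        # Append to the existing semicolon-delimited list.
        tmp=$(mktemp)
        sed "s|^LOCKSS_DISK_PATHS=\"\(.*\)\"|LOCKSS_DISK_PATHS=\"\1;$newpath\"|" \
            "$config" > "$tmp" && mv "$tmp" "$config"
    else
        # No entry yet: create one.
        echo "LOCKSS_DISK_PATHS=\"$newpath\"" >> "$config"
    fi
}
```
<br />
For example, appending <tt>/cache1/gamma</tt> to an entry of <tt>/cache0/gamma</tt> yields <tt>LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma"</tt>.<br />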
<br />
If a large amount of content is being moved and the daemon requires minimal disruption, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run rsync again to copy any changes to the AUs since the last rsync was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to do so. We use Nagios to monitor the LOCKSS daemon version on all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new packages will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystem is independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
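The save step can be sketched as follows (a hypothetical helper; the file list in the example is illustrative, not exhaustive):<br />
<br />
```shell
# backup_box_config ARCHIVE FILE... -- sketch of saving a CLOCKSS
# box's configuration (e.g. /etc/lockss/config.dat, account files,
# OpenSSH host keys) before the operating system is reinstalled.
backup_box_config() {
    archive="$1"; shift
    tar -czf "$archive" "$@"
}

# Typical invocation (illustrative paths; requires root):
#   backup_box_config /root/clockss-config.tar.gz \
#       /etc/lockss/config.dat /etc/passwd /etc/shadow \
#       /etc/group /etc/ssh
# After reinstalling CentOS and the LOCKSS daemon, the archive is
# unpacked and the saved configuration merged back in.
```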
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T22:09:46Z<p>Dshr: /* Re-locating Content */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to have an in-place disaster recovery plans if these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors, Iron Systems, located in Fremont, CA and eRacks in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. Business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joesph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped. <br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with 1 - 2 years of limited hardware warranty, depending on hardware vendor. The warranty on most of the CLOCKSS boxes has since expired but it proved useful for exchanging failed hardware components. Details about the warranty are in the purchase orders that have been submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Operton 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual Gbe<br />
| | Onboard dual Gbe<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMWare vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget restrictions, software needs as well as cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, both hardware vendors have an extensive and rigorous process in place to ensure a machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, it has known security vulnerabilities and so is disabled by default on all machines. This is done by disabling it in the BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
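<br />
If IPMI must be re-enabled, its exposure can also be limited at the firewall. IPMI-over-LAN uses UDP port 623 (RMCP), so a rule pair in the style of the <tt>iptables</tt> rules used elsewhere in this document can admit only the VPN subnet. This is an illustrative sketch rather than a deployed rule set, and <tt>VPN_SUBNET</tt> is a placeholder:<br />
<br />
```
-A INPUT -p udp -s VPN_SUBNET --dport 623 -j ACCEPT
-A INPUT -p udp --dport 623 -j DROP
```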
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID-based hardware RAID controllers and eight 3TB disks. These disks are organized into two groups of four, each configured as a RAID5 array. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays.<br />
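<br />
The 8.1TB figure can be sanity-checked: a four-disk RAID5 group provides the capacity of three disks, and a marketed "3TB" (3 x 10^12 bytes) shrinks when reported in binary units. A minimal sketch of the arithmetic:<br />
<br />
```shell
#!/bin/sh
# Usable space of one Ingest RAID5 group: four 3TB disks, with one
# disk's worth of capacity consumed by parity. Marketed TB = 10^12
# bytes; tools report binary (TiB) units.
DISKS=4
DISK_BYTES=3000000000000
RAW_BYTES=$(( (DISKS - 1) * DISK_BYTES ))    # 9 x 10^12 bytes
TIB=$(( 1024 * 1024 * 1024 * 1024 ))
HUNDREDTHS=$(( RAW_BYTES * 100 / TIB ))      # fixed-point, 2 decimals
echo "$(( HUNDREDTHS / 100 )).$(( HUNDREDTHS % 100 )) TiB before filesystem overhead"
```
EXT4 metadata and reserved blocks account for the remaining drop from about 8.18 TiB to the roughly 8.1TB observed.<br />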
<br />
The CLOCKSS Production machines use the Linux kernel's <tt>md</tt> software RAID driver, administered with the <tt>mdadm</tt> tool. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks, depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor <tt>mdadm</tt> RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an <tt>OK</tt>, it sends an alert to CLOCKSS engineers.<br />
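<br />
The custom plugin itself is not reproduced here; the following is a minimal sketch of the same idea (not the actual CLOCKSS plugin), reading <tt>/proc/mdstat</tt> and returning standard Nagios exit codes. The <tt>MDSTAT</tt> override is only there so the script can be exercised against a sample file:<br />
<br />
```shell
#!/bin/sh
# Minimal NRPE-style check for degraded md arrays. In /proc/mdstat a
# healthy member map looks like [UUUU]; an underscore ([UU_U]) marks a
# failed member. Exit codes follow the Nagios convention: 0=OK,
# 2=CRITICAL.
MDSTAT="${MDSTAT:-/proc/mdstat}"
if grep -q '\[U*_[U_]*\]' "$MDSTAT" 2>/dev/null; then
    echo "CRITICAL: degraded md array"
    exit 2
fi
echo "OK: all md arrays healthy"
exit 0
```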
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system, then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0, (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) are set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is set up so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> rules should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two step process:<br />
## The first is to create a new file, <tt>lockss.repo</tt> under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
##Then to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
#The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra, optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt>, <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
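<br />
The repository mount-point preparation steps above can be sketched as follows. This is illustrative, not the official procedure: it uses a scratch directory so it can run unprivileged, and the <tt>chown</tt> is shown commented out because it requires root and the <tt>lockss</tt> account:<br />
<br />
```shell
#!/bin/sh
# Prepare one LOCKSS repository mount point (illustrative paths).
ROOT="$(mktemp -d)"                 # stands in for / on a real box
mkdir -p "$ROOT/cache0/gamma"
# chown lockss:lockss "$ROOT/cache0/gamma"   # needs root; see above
chmod 750 "$ROOT/cache0/gamma"      # strip all permissions for "other"
# A matching /etc/fstab entry would add noexec, e.g.:
#   /dev/md0  /cache0  ext4  defaults,noexec  0 2
# and /cache0 would be appended to PRUNEPATHS in /etc/updatedb.conf.
stat -c '%a' "$ROOT/cache0/gamma"   # prints 750
```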
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The scheme we use is clockss-<i>site</i>.clockss.org, where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no-one else.<br />
|-<br />
|}<br />
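<br />
<tt>hostconfig</tt> records these answers in <tt>/etc/lockss/config.dat</tt>, which is referenced later in this document. As an illustration only, a fragment for a Production box might look like the following; apart from <tt>LOCKSS_DISK_PATHS</tt>, the variable names are assumptions rather than the authoritative output of <tt>hostconfig</tt>:<br />
<br />
```shell
# /etc/lockss/config.dat (illustrative fragment; names other than
# LOCKSS_DISK_PATHS are assumptions)
LOCKSS_HOSTNAME="clockss-site.clockss.org"
LOCKSS_V3_PORT="9729"
LOCKSS_PROPS_URL="http://props.lockss.org:8001/clockss/lockss.xml"
LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma"
```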
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins return one of four codes (<tt>OK</tt>, <tt>WARNING</tt>, <tt>CRITICAL</tt> or <tt>UNKNOWN</tt>) depending on the nature and severity of an issue (or lack thereof). For any return other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
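<br />
As an illustration, the combination of subnet restriction and password protection can be expressed in an Apache 2.2-style fragment like the following; the location and <tt>AuthUserFile</tt> path are assumptions, and the actual virtual host definition may differ:<br />
<br />
```
<Location /nagios>
    Order deny,allow
    Deny from all
    Allow from 171.66.236.0/24
    AuthType Basic
    AuthName "CLOCKSS Nagios"
    AuthUserFile /etc/nagios/htpasswd.users
    Require valid-user
    Satisfy all
</Location>
```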
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in Northern Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside of the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hot-swap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these bays with the largest disks available (currently 4TB), create a new RAID array, and move content from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are cheap, readily available, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails, we contact the CLOCKSS box's hardware vendor if the CLOCKSS box is still under warranty. If the box is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative option is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for a reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that it will eventually no longer be cost-effective or practical to run old hardware, due to improvements in performance-per-watt efficiency, disk bandwidth, disk capacity and other measures. Software requirements and growing content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent polls via the [[LOCKSS: Polling and Repair Protocol]] will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> entry added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write.<br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
<br />
If a large amount of content is being moved and disruption to the daemon must be minimized, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use <tt>rsync</tt> to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run <tt>rsync</tt> again to copy any changes made to the AUs since the first pass was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to install updates. We use Nagios to monitor the LOCKSS daemon version of all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new updates will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystems are independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>
http://documents.clockss.org/index.php/CLOCKSS:_Box_Operations
CLOCKSS: Box Operations
2014-04-06T22:08:27Z
<p>Dshr: /* LOCKSS Daemon Configuration */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to have in-place disaster recovery plans should these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors: Iron Systems in Fremont, CA, and eRacks in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. The business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped.<br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with 1 - 2 years of limited hardware warranty, depending on hardware vendor. The warranty on most of the CLOCKSS boxes has since expired but it proved useful for exchanging failed hardware components. Details about the warranty are in the purchase orders that have been submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Operton 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual Gbe<br />
| | Onboard dual Gbe<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMWare vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget restrictions, software needs as well as cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, both hardware vendors have an extensive and rigorous process in place to ensure a machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, there are known security vulnerabilities and so it is disabled by default on all machines. This is done by disabling it in BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID based hardware RAID controllers and eight 3TB disks. These disks were partitioned into groups of four and put into a RAID5 configuration. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated for the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from content storage arrays. <br />
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes are monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor mdadm RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an OK, it sends an alert to CLOCKSS engineers.<br />
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts, <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. Then the LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> file, taking care to ensure the permissions are setup correctly.<br />
# The mount points are created for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0, (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) are set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is setup so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two-step process:<br />
## The first step is to create a new file, <tt>lockss.repo</tt>, under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
## The second step is to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
# The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra, optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt>, <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
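The mount-point conventions above can be illustrated with hypothetical <tt>/etc/fstab</tt> and <tt>/etc/updatedb.conf</tt> fragments; the device names, filesystem count and pre-existing <tt>PRUNEPATHS</tt> entries are examples only, not an actual CLOCKSS box's configuration:

```shell
# Hypothetical /etc/fstab entries for two LOCKSS repository filesystems,
# mounted with noexec as described above (device names are examples only):
/dev/md1   /cache0   ext4   defaults,noexec   0 2
/dev/md2   /cache1   ext4   defaults,noexec   0 2

# Corresponding /etc/updatedb.conf change, keeping updatedb/locate from
# indexing the LOCKSS daemon repositories:
PRUNEPATHS="/tmp /var/spool /media /cache0/gamma /cache1/gamma"
```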
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The scheme we use is clockss-<i>site</i>.clockss.org, where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by the host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no one else.<br />
|-<br />
|}<br />
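For illustration, a <tt>hostconfig</tt> run with the settings above produces a shell-style <tt>/etc/lockss/config.dat</tt> along these lines. Apart from <tt>LOCKSS_DISK_PATHS</tt>, which this document references later, the variable names are from memory and may differ between daemon versions; all values are placeholders, not a real box's configuration:

```shell
# Hypothetical /etc/lockss/config.dat for a CLOCKSS Production box.
# Variable names may vary by daemon version; values are illustrative only.
LOCKSS_HOSTNAME="clockss-demo.clockss.org"
LOCKSS_IPADDR="192.0.2.10"
LOCKSS_ACCESS_SUBNET="192.0.2.0/24"
LOCKSS_V3_PORT="9729"
LOCKSS_PROPS_URL="http://props.lockss.org:8001/clockss/lockss.xml"
LOCKSS_TEST_GROUP="clockss"
LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma"
```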
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins report one of four return codes depending on the nature and severity of an issue (or lack thereof): <tt>OK</tt>, <tt>WARNING</tt>, <tt>CRITICAL</tt> and <tt>UNKNOWN</tt>. For any return other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
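A minimal sketch of a plugin in this style, mapping a disk-usage percentage to the four standard Nagios exit codes (0 <tt>OK</tt>, 1 <tt>WARNING</tt>, 2 <tt>CRITICAL</tt>, 3 <tt>UNKNOWN</tt>). The thresholds are illustrative, not CLOCKSS's actual configuration:

```shell
# Sketch of a Nagios-style check mapping disk usage (%) to the four standard
# plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# Thresholds are illustrative, not CLOCKSS's actual configuration.
check_usage() {
    u=$1
    case $u in
        ''|*[!0-9]*) echo "UNKNOWN - usage not a number"; return 3 ;;
    esac
    if [ "$u" -ge 95 ]; then
        echo "CRITICAL - disk ${u}% full"; return 2
    elif [ "$u" -ge 85 ]; then
        echo "WARNING - disk ${u}% full"; return 1
    else
        echo "OK - disk ${u}% full"; return 0
    fi
}

check_usage 42           # prints "OK - disk 42% full"
check_usage 90 || true   # prints "WARNING - disk 90% full"
check_usage 97 || true   # prints "CRITICAL - disk 97% full"
```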
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in West Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hotswap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these bays with the largest disks available (at the moment, 4TB), create a new RAID array, and finally move content across from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are cheap, readily available, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails, we contact the CLOCKSS box's hardware vendor if the box is still under warranty. If it is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that it will eventually no longer be cost effective or practical to continue running old hardware, due to increases in performance-per-watt efficiency, disk bandwidth, disk capacity and other measures. Software requirements and content volume are also expected to push hardware toward obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent runs of the [[LOCKSS: Polling and Repair Protocol]] will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> entry added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
<br />
If a large amount of content is being moved and the daemon requires minimal disruption, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use <tt>rsync</tt> to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run <tt>rsync</tt> again to copy any changes to the AUs since the last <tt>rsync</tt> was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to do so. We use Nagios to monitor the LOCKSS daemon version on all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new updates will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystem is independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
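The "save the configuration" step can be sketched as a simple archive of the items the text lists. A scratch directory stands in for the real system paths here (which vary by site), so the example is self-contained:

```shell
# Sketch of snapshotting configuration before a reinstall: daemon config,
# account databases and SSH keys. A scratch tree stands in for the real /etc;
# paths and contents are illustrative only.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/etc/lockss" "$ROOT/etc/ssh"
echo 'LOCKSS_HOSTNAME="clockss-demo.clockss.org"' > "$ROOT/etc/lockss/config.dat"
echo 'ssh-rsa AAAA... demo-host-key' > "$ROOT/etc/ssh/ssh_host_key.pub"

# Archive everything under etc/ so it can be restored after the reinstall
tar -C "$ROOT" -czf "$ROOT/preupgrade.tar.gz" etc
tar -tzf "$ROOT/preupgrade.tar.gz"
```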
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T22:06:54Z<p>Dshr: /* CentOS Installation and Configuration */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to have disaster recovery plans in place should these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors: Iron Systems, located in Fremont, CA, and eRacks, in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. Business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped. <br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with one to two years of limited hardware warranty, depending on the vendor. The warranty on most of the CLOCKSS boxes has since expired, but it proved useful for exchanging failed hardware components. Details about the warranty are in the purchase orders that were submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Opteron 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual Gbe<br />
| | Onboard dual Gbe<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMware vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget restrictions, software needs, and cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, our hardware vendors have an extensive and rigorous testing process in place to ensure the machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, IPMI implementations have known security vulnerabilities, so it is disabled by default on all machines. This is done by disabling it in the BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID based hardware RAID controllers and eight 3TB disks. These disks were divided into groups of four, each configured as a RAID5 array. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays. <br />
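The quoted figure can be sanity-checked: a four-disk RAID5 array stores three disks' worth of data plus one of parity, and a vendor's decimal 3TB shrinks when expressed in binary units; EXT4 overhead accounts for the remaining difference down to ~8.1TB:

```shell
# One 4 x 3 TB RAID5 array: (4 - 1) data disks of 3e12 bytes each,
# expressed in binary TiB (2^40 bytes). EXT4 overhead is not included.
awk 'BEGIN { raw = (4 - 1) * 3e12; printf "%.2f TiB before filesystem overhead\n", raw / 2^40 }'
# prints: 8.19 TiB before filesystem overhead
```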
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor <tt>mdadm</tt> RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an <tt>OK</tt>, it sends an alert to CLOCKSS engineers.<br />
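The custom plugin's exact logic is not reproduced here, but a simplified sketch of the idea is to flag a degraded array by the underscore that appears in the <tt>[UUUU]</tt>-style device status field of <tt>/proc/mdstat</tt>; a temporary sample file stands in for the real <tt>/proc/mdstat</tt>:

```shell
# Simplified sketch of an mdadm health check: a degraded RAID array shows an
# underscore in the [UUUU]-style status field of /proc/mdstat. The real NRPE
# plugin is more thorough; this only illustrates the idea.
check_mdstat() {
    if grep -Eq '\[[U_]*_[U_]*\]' "$1"; then
        echo "CRITICAL - degraded RAID array detected"; return 2
    fi
    echo "OK - all RAID arrays healthy"; return 0
}

# Sample mdstat-format input standing in for the real /proc/mdstat
sample=$(mktemp)
cat > "$sample" <<'EOF'
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      8790405120 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
EOF
check_mdstat "$sample"   # prints "OK - all RAID arrays healthy"
```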
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0, (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) are set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is setup so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two step process:<br />
## The first is to create a new file, <tt>lockss.repo</tt> under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
##Then to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
#The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra, optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt>, <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the LOCKSS daemon as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. We schema we use is clockss-<i>site</i>.clockss.org where <i>site</i> can uniquely identify the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semi-colon delineated list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no-one else.<br />
|-<br />
|}<br />
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins return one of four return codes (<tt>OK</tt>, <tt>WARNING</tt>, <tt>CRITICAL</tt> or <tt>UNKNOWN</tt>) depending on the nature and severity of an issue (or lack thereof). For any return other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
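A minimal check in this style might look like the following sketch. The path and thresholds are illustrative; this is not the actual LOCKSS repository-space plugin, just an example of the four-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).<br />

```shell
# Sketch of a Nagios-style disk-usage check; path and thresholds are
# illustrative, not the actual LOCKSS plugin.
check_repo_space() {
    path="$1"; warn="$2"; crit="$3"   # warn/crit are percent-used thresholds
    used=$(df -P "$path" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ -z "$used" ]; then
        echo "UNKNOWN - cannot stat $path"; return 3
    fi
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL - ${used}% used on $path"; return 2
    fi
    if [ "$used" -ge "$warn" ]; then
        echo "WARNING - ${used}% used on $path"; return 1
    fi
    echo "OK - ${used}% used on $path"; return 0
}
```

Nagios (directly or via NRPE) runs such a plugin, reads the first line of output for the alert text, and maps the exit status to the four states.<br />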
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in Virginia. The second Nagios instance cannot monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hot-swap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disks available (at the moment, 4TB), create a new RAID array and finally move content over from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are cheap, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails, we contact the CLOCKSS box's hardware vendor if the box is still under warranty. If the box is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative is to ask the site administrator to source replacement parts within their country and invoice CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that, eventually, it will no longer be cost-effective or practical to continue to run old hardware due to increases in performance-per-watt efficiency, disk bandwidth, disk capacity and other measures. Software requirements and content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent [[LOCKSS: Polling and Repair Protocol]] polls will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
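For example, after adding a second repository in step 3, the relevant <tt>/etc/lockss/config.dat</tt> line might read as follows (the paths follow the <tt>/cacheN/gamma</tt> naming scheme; the actual value is site-specific):<br />

```shell
# Illustrative /etc/lockss/config.dat fragment; actual paths are site-specific.
LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma"
```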
<br />
If a large amount of content is being moved and disruption to the daemon must be minimized, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run rsync again to copy any changes to the AUs since the last rsync was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to install updates. We use Nagios to monitor the LOCKSS daemon version on all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended updates, so new updates will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystem is independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
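The save step might be sketched like this. The file list is illustrative only; the real procedure also captures user accounts and SSH keys, and the exact paths vary by box.<br />

```shell
# Hypothetical sketch of saving box state before an OS reinstall; the
# real file list (accounts, SSH keys, daemon config) varies by box.
save_state() {
    out="$1"; shift      # first argument: output archive; rest: tar arguments
    tar czf "$out" "$@"
}
# Typical invocation (paths illustrative):
# save_state clockss-state.tar.gz /etc/lockss/config.dat /cache0/gamma/config/au.txt
```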
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T22:01:45Z<p>Dshr: /* Nagios plugins to monitor LOCKSS */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to have disaster recovery plans in place should these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors, Iron Systems, located in Fremont, CA and eRacks in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. Business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped. <br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with one to two years of limited hardware warranty, depending on the vendor. The warranty on most of the CLOCKSS boxes has since expired, but it proved useful for exchanging failed hardware components. Details about the warranty are in the purchase orders that were submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Opteron 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual Gbe<br />
| | Onboard dual Gbe<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMware vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years, dictated by budget constraints, software requirements, and cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, both hardware vendors have an extensive and rigorous process in place to ensure a machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, there are known security vulnerabilities and so it is disabled by default on all machines. This is done by disabling it in BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID based hardware RAID controllers and eight 3TB disks. These disks are grouped into sets of four, each set configured as a RAID5 array. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays. <br />
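The 8.1TB figure can be sanity-checked: a four-disk RAID5 array stores one disk's worth of parity, leaving three data disks, and drive capacities are quoted in decimal terabytes while reported sizes use binary units.<br />

```shell
# Sanity check of the quoted usable space: a 4-disk RAID5 keeps 3 data
# disks; 3 TB drives are 3x10^12 bytes, but sizes are reported in
# binary TiB (2^40 bytes).
awk 'BEGIN {
    bytes = 3 * 3e12                       # three data disks of 3 TB each
    printf "%.2f TiB raw\n", bytes / 2^40  # ext4 overhead reduces this to ~8.1
}'
```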
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor <tt>mdadm</tt> RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an <tt>OK</tt>, it sends an alert to CLOCKSS engineers.<br />
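A simplified version of such a check might scan <tt>/proc/mdstat</tt> for degraded arrays, where an underscore in the member-status string (e.g. <tt>[UU_U]</tt>) marks a failed disk. This is a sketch under that assumption, not the actual plugin.<br />

```shell
# Sketch of an mdadm health check in the NRPE style (not the actual
# plugin): a "_" in the [UUUU]-style status string means a failed member.
check_mdstat() {
    mdstat="${1:-/proc/mdstat}"
    if grep -q '\[[U_]*_[U_]*\]' "$mdstat"; then
        echo "CRITICAL - degraded md array"; return 2
    fi
    echo "OK - md arrays healthy"; return 0
}
```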
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system, then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the LOCKSS daemon repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0 (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) are set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is setup so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two step process:<br />
## The first is to create a new file, <tt>lockss.repo</tt> under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
##Then to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
#The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra, optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt>, <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
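The repeated per-subnet administrative-port rules in the firewall step above can be generated with a small loop rather than written by hand. The subnets shown are examples only (<tt>192.0.2.0/24</tt> is a placeholder for a site's administrative subnet):<br />

```shell
# Generate the repeated per-subnet admin-port iptables rules; the second
# subnet is a placeholder for a site's administrative subnet.
for subnet in 171.66.236.0/24 192.0.2.0/24; do
    for port in 22 8080 8081 8082 8083; do
        echo "-A INPUT -m state --state NEW -m tcp -p tcp -s $subnet --dport $port -j ACCEPT"
    done
done
```

The output can be pasted into the <tt>iptables</tt> configuration, one rule per subnet/port pair.<br />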
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the LOCKSS daemon as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The scheme we use is clockss-<i>site</i>.clockss.org where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no-one else.<br />
|-<br />
|}<br />
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins return one of four return codes (<tt>OK</tt>, <tt>WARNING</tt>, <tt>CRITICAL</tt> or <tt>UNKNOWN</tt>) depending on the nature and severity of an issue (or lack thereof). For any return other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in Virginia. The second Nagios instance cannot monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hot-swap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disks available (at the moment, 4TB), create a new RAID array and finally move content over from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are cheap, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails, we contact the CLOCKSS box's hardware vendor if the box is still under warranty. If the box is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative is to ask the site administrator to source replacement parts within their country and invoice CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that, eventually, it will no longer be cost-effective or practical to continue to run old hardware due to increases in performance-per-watt efficiency, disk bandwidth, disk capacity and other measures. Software requirements and content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent [[LOCKSS: Polling and Repair Protocol]] polls will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
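The copy-and-delete core of the steps above (step 2) can be sketched as a shell function; the daemon shutdown and the <tt>config.dat</tt>/<tt>au.txt</tt> edits appear only as comments. The function name and arguments are illustrative, not a LOCKSS tool:<br />

```shell
# Sketch of relocating one AU's directory between repository filesystems.
relocate_au() {
    au="$1"; old_repo="$2"; new_repo="$3"
    # service lockss-daemon stop   # step 1 (requires root)
    # Step 2: copy the AU, preserving ownership and timestamps, and only
    # delete the source once the copy has succeeded.
    cp -a "$old_repo/$au" "$new_repo/" || return 1
    rm -rf "$old_repo/$au"
    # Steps 3-4: append $new_repo to LOCKSS_DISK_PATHS in
    # /etc/lockss/config.dat and update the AU's lockss_repo line in
    # /cache0/gamma/config/au.txt, then restart the daemon.
}
```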
<br />
If a large amount of content is being moved and disruption to the daemon must be minimized, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run rsync again to copy any changes to the AUs since the last rsync was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to do so. We use Nagios to monitor the LOCKSS daemon version on all CLOCKSS boxes. If a CLOCKSS box has not been updated to the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new updates are installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystem is independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
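The save step might be sketched as below. The archive name and file list passed by the caller are illustrative; in production they would cover <tt>/etc/lockss</tt>, the account databases (<tt>/etc/passwd</tt>, <tt>/etc/shadow</tt>, <tt>/etc/group</tt>) and the OpenSSH host keys in <tt>/etc/ssh</tt>:<br />

```shell
# Sketch of saving a box's configuration before a CentOS reinstall.
# The caller supplies the archive name and the files to preserve.
backup_box_config() {
    archive="$1"; shift
    tar -czf "$archive" "$@"
    # After reinstalling CentOS and the lockss-daemon RPM, restore with:
    #   tar -xzf "$archive" -C /
}
```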
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T22:00:41Z<p>Dshr: /* CLOCKSS PLN Configuration */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to having disaster recovery plans in place if these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors: Iron Systems, located in Fremont, CA, and eRacks, in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. The business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among the LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped.<br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with one to two years of limited hardware warranty, depending on the vendor. The warranty on most of the CLOCKSS boxes has since expired, but it proved useful for exchanging failed hardware components. Details about the warranties are in the purchase orders that were submitted and in the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Opteron 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual GbE<br />
| | Onboard dual GbE<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMware vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget constraints, software requirements, and cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, both hardware vendors have an extensive and rigorous process in place to ensure a machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, IPMI implementations have known security vulnerabilities, so it is disabled by default on all machines. This is done by disabling it in the BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID based hardware RAID controllers and eight 3TB disks. These disks are divided into two groups of four, each configured as a RAID5 array. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays.<br />
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks, and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than a RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor <tt>mdadm</tt> RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an OK, it sends an alert to CLOCKSS engineers.<br />
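Our plugin is not reproduced here, but a minimal sketch of an NRPE/Nagios-style <tt>mdadm</tt> check, following the standard Nagios exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), might look like the following. The path argument is an assumption added so the check can be exercised against a saved copy of <tt>/proc/mdstat</tt>:<br />

```shell
# Minimal sketch of a Nagios-style health check for Linux md (mdadm)
# arrays. A degraded or rebuilding array shows an underscore in its
# status field in /proc/mdstat, e.g. [UU_] instead of [UUU].
check_mdstat() {
    mdstat="${1:-/proc/mdstat}"
    if [ ! -r "$mdstat" ]; then
        echo "UNKNOWN: cannot read $mdstat"; return 3
    fi
    if grep -q '\[U*_[U_]*\]' "$mdstat"; then
        echo "CRITICAL: degraded md array"; return 2
    fi
    echo "OK: all md arrays healthy"; return 0
}
```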
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system. Then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the LOCKSS daemon repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0, (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) are set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is set up so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> rules should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two-step process:<br />
## The first is to create a new file, <tt>lockss.repo</tt> under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
##Then to install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
#The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra, optional packages we've found to be useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt>, <tt>lynx</tt>. Their installation is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
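The mount options from step 2 above might appear in <tt>/etc/fstab</tt> as follows; the device names and the number of repository filesystems are illustrative, not a fixed configuration:<br />

```
# Illustrative /etc/fstab entries for LOCKSS repository filesystems.
/dev/md0   /cache0   ext4   defaults,noexec   0 2
/dev/md1   /cache1   ext4   defaults,noexec   0 2
```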
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the LOCKSS daemon as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The schema we use is clockss-<i>site</i>.clockss.org, where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no-one else.<br />
|-<br />
|}<br />
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]].<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
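The install script itself is not reproduced here, but its effect is essentially the following (the destination path and ownership are illustrative placeholders; the real script runs as root and also sets the <tt>lockss</tt> ownership, shown as a comment):<br />

```shell
# Sketch of installing an uploaded keystore with restrictive permissions
# so that only the owning user can read it.
install_keystore() {
    src="$1"; dst="$2"
    cp "$src" "$dst" && chmod 600 "$dst"
    # In production, run as root: chown lockss:lockss "$dst"
}
```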
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following LOCKSS daemon services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins return one of four return codes (<tt>OK</tt>, <tt>WARNING</tt>, <tt>CRITICAL</tt> or <tt>UNKNOWN</tt>) depending on the nature and severity of an issue (or lack thereof). For any return other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
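In Apache 2.2 terms (the version shipped in the CentOS 6 era), combining the subnet restriction with password protection looks roughly like this; the location and the password file path are illustrative, not our actual configuration:<br />

```
# Illustrative Apache 2.2 access control for the Nagios UI.
<Location /nagios>
    Order deny,allow
    Deny from all
    Allow from 171.66.236.0/24
    AuthType Basic
    AuthName "CLOCKSS Nagios"
    AuthUserFile /etc/nagios/htpasswd.users
    Require valid-user
    # Require BOTH the subnet match and a valid login.
    Satisfy All
</Location>
```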
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in West Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside of the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hotswap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disks available (at the moment, 4TB), create a new RAID array, and finally move content from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
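Sketched with <tt>mdadm</tt>, the upgrade looks roughly as follows. The device names, array number and mount point are placeholders, and the commands must run as root on a box with four empty bays, so they are wrapped in a function rather than executed:<br />

```shell
# Illustrative sketch of adding a RAID5 array in the spare bays and
# migrating one existing array's content onto it. Not executed here.
upgrade_array() {
    # 1. Build a RAID5 array from four new disks in the spare bays.
    mdadm --create /dev/md2 --level=5 --raid-devices=4 \
          /dev/sde /dev/sdf /dev/sdg /dev/sdh
    # 2. Create and mount a filesystem for the LOCKSS repository.
    mkfs.ext4 /dev/md2
    mkdir -p /cache2
    mount -o noexec /dev/md2 /cache2
    mkdir -p /cache2/gamma && chown lockss:lockss /cache2/gamma
    # 3. Copy content across, then retire the old array, e.g.:
    rsync -a /cache0/gamma/ /cache2/gamma/
    # umount /cache0 && mdadm --stop /dev/md0
}
```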
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are readily available, inexpensive, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes can only tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails and the CLOCKSS box is still under warranty, we contact its hardware vendor. If the box is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative option is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that it will eventually no longer be cost-effective or practical to continue running old hardware, given ongoing improvements in performance per watt, disk bandwidth, disk capacity and other measures. Growing software requirements and content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the new machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of the content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent polls under the [[LOCKSS: Polling and Repair Protocol]] will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
<br />
If a large amount of content is being moved and disruption to the daemon must be minimized, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run rsync again to copy any changes to the AUs since the last rsync was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to do so. We use Nagios to monitor the LOCKSS daemon version on all CLOCKSS boxes. If a CLOCKSS box has not been updated to the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new updates are installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystem is independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then effectively reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Box_OperationsCLOCKSS: Box Operations2014-04-06T21:58:56Z<p>Dshr: /* RAID Configuration */</p>
<hr />
<div>= CLOCKSS: Box Operations =<br />
<br />
== Requirements for CLOCKSS host sites ==<br />
<br />
A CLOCKSS host site signs an agreement with the CLOCKSS board under which they commit to providing:<br />
* Rack space in a physically secure location accessible only to authorized personnel.<br />
* Power.<br />
* Cooling.<br />
* Network bandwidth.<br />
And to having disaster recovery plans in place if these resources fail. The CLOCKSS Executive Director has copies of these agreements.<br />
<br />
The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.<br />
<br />
== Hardware Bringup ==<br />
<br />
=== Hardware Vendors ===<br />
All new systems are currently purchased from iXsystems based in San Jose, CA. The LOCKSS team's contact there is sales representative Kevin Lee ([mailto:kle@ixsystems.com kle@ixsystems.com]).<br />
<br />
The LOCKSS team previously worked with two hardware vendors: Iron Systems, located in Fremont, CA, and eRacks, in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. The business contacts are Kawaljit Nagi ([mailto:kawal@ironsystems.com kawal@ironsystems.com]) and James King ([mailto:jamesk@eracks.com jamesk@eracks.com]), respectively. The LOCKSS team also has a relationship with Joseph Wolff ([mailto:joe@eracks.com joe@eracks.com]), the owner of eRacks.<br />
<br />
=== Hardware Purchase Process ===<br />
Purchases start as a discussion among the LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for an early quote. If the quote looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates a final quote and sends a purchase order. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped.<br />
<br />
=== Hardware Warranty ===<br />
Machines purchased from Iron Systems and eRacks come with one to two years of limited hardware warranty, depending on the vendor. The warranty on most of the CLOCKSS boxes has since expired, but while in effect it proved useful for exchanging failed hardware components. Details about the warranty are in the purchase orders that were submitted as well as the invoices subsequently received.<br />
<br />
=== Recommended Hardware ===<br />
CLOCKSS hardware or virtual machines meet or exceed the following specifications:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Item<br />
! scope="col"| Ingest<br />
! scope="col"| Production<br />
! scope="col"| Triggered (VMware)<br />
|-<br />
| Chassis<br />
| | Supermicro 2U<br />
| | Supermicro 4U<br />
| | ''N/A''<br />
|-<br />
| Disk Bays<br />
| | 12<br />
| | 24<br />
| | ''N/A''<br />
|-<br />
| Processor<br />
| | AMD Opteron 6128 (eight cores)<br />
| | Dual Xeon E5504 (eight cores)<br />
| | Dual-core CPU<br />
|-<br />
| RAM<br />
| | 16GB ECC<br />
| | 24GB ECC<br />
| | 4GB<br />
|-<br />
| Disk<br />
| | 8 x 3TB SATA 6Gbps 7200RPM<br />
| | 12 x 2TB SATA 3Gbps<br />
| | 40GB disk<br />
|-<br />
| RAID<br />
| | LSI Megaraid<br />
| | Software (Linux <tt>mdadm</tt>)<br />
| | None (Underlying RAID array)<br />
|-<br />
| Network<br />
| | Onboard dual Gbe<br />
| | Onboard dual Gbe<br />
| | 10/100Mb NIC<br />
|-<br />
| Remote Access<br />
| | IPMI<br />
| | IPMI<br />
| | VMware vSphere Client<br />
|-<br />
| Power Supply<br />
| | 800W redundant<br />
| | 1600W redundant<br />
| | ''N/A''<br />
|-<br />
|}<br />
<br />
=== CLOCKSS Virtual Machines ===<br />
<br />
CLOCKSS virtual machines run on servers running the VMware vSphere ESXi 5 hypervisor.<br />
<br />
=== Hardware Service Life ===<br />
<br />
The service life of a CLOCKSS box is five to seven years. This is dictated by budget restrictions, software needs, and cost and performance efficiency. Please see the section on [[#Hardware Replacement]] later in this document.<br />
<br />
=== Building and testing ===<br />
<br />
Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. After a machine is built, each vendor has an extensive and rigorous process in place to ensure the machine will function correctly.<br />
<br />
=== Remote Access via IPMI ===<br />
<br />
Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, it has known security vulnerabilities, so it is disabled by default on all machines. This is done by disabling it in the BIOS and unplugging the network cable from the IPMI module.<br />
<br />
In the event we need to use IPMI, the following precautions should be taken:<br />
<br />
# Update the firmware, if a newer version exists.<br />
# Do not use the default username and password.<br />
# Make IPMI accessible only through a VPN or other secure connection.<br />
<br />
=== RAID Configuration ===<br />
<br />
The CLOCKSS Ingest machines have LSI MegaRAID-based hardware RAID controllers and eight 3TB disks. These disks are divided into groups of four, each configured as a RAID5 array. Each RAID array then yields approximately 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have an extra disk dedicated to the OS (called a "system disk"), the full space of the RAID arrays is dedicated to the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]]. This configuration also isolates problems on the system disk from the content storage arrays. <br />
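As a sanity check on the 8.1TB figure: a four-disk RAID5 group gives up one disk's worth of space to parity, leaving three "3TB" (decimal) drives of data, which is about 8.2TiB before the EXT4 overhead that brings it down to roughly 8.1. A quick sketch of the arithmetic:<br />

```shell
# Rough check of the usable-capacity figure above (all values from the text;
# filesystem overhead is why the observed number is slightly lower).
disks=4                          # one RAID5 group on an Ingest box
size_bytes=3000000000000         # a "3 TB" drive in decimal bytes
# RAID5 keeps one disk's worth of parity, leaving (disks - 1) disks of data:
usable_tib=$(awk -v n="$disks" -v s="$size_bytes" \
    'BEGIN { printf "%.1f", (n - 1) * s / 1024 ^ 4 }')
echo "${usable_tib} TiB usable before filesystem overhead"
```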
<br />
The CLOCKSS Production machines utilize the Linux kernel's <tt>mdadm</tt> module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID5 arrays consisting of three or four disks depending on the number of disks in the machine. In contrast to the Ingest machines, these RAID5 arrays are used by both the operating system and the LOCKSS daemon.<br />
<br />
=== RAID Health Monitoring ===<br />
<br />
The most common component failures are disks, and it is expected that all CLOCKSS boxes will experience a few disk failures over the course of their service life. The RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each. It is thus important that any failed disks be found and replaced as quickly as possible; a double disk failure requires repair from another box via the [[LOCKSS: Polling and Repair Protocol]], which is much slower than a RAID rebuild.<br />
<br />
The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE). Each machine runs an NRPE daemon and has a copy of a custom plugin we wrote to monitor mdadm RAID arrays. NRPE executes the plugin on behalf of Nagios and returns the result. If Nagios does not receive an OK, it sends an alert to CLOCKSS engineers.<br />
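The idea behind such a plugin can be sketched as a shell function. This is not the actual CLOCKSS plugin, only an illustration; a real NRPE plugin would be a standalone script whose exit status (0 for OK, 2 for CRITICAL, 3 for UNKNOWN) Nagios interprets. The optional file argument is an assumption added so the sketch can be exercised without a real array:<br />

```shell
# Sketch of an mdadm health check in the Nagios plugin style (hypothetical;
# the production plugin is a custom in-house script). A real plugin would
# use exit codes; here a function with return values keeps it testable.
check_md() {
    mdstat="${1:-/proc/mdstat}"    # optional file argument eases testing
    if [ ! -r "$mdstat" ]; then
        echo "UNKNOWN: cannot read $mdstat"; return 3
    fi
    # An underscore inside the member-status brackets (e.g. [UU_U])
    # marks a failed disk in /proc/mdstat.
    if grep -Eq '\[[U_]*_[U_]*\]' "$mdstat"; then
        echo "CRITICAL: degraded md array"; return 2
    fi
    echo "OK: md arrays healthy"
}

# Exercise it against sample mdstat-format content:
printf 'md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]\n      [4/4] [UUUU]\n' > /tmp/mdstat.sample
check_md /tmp/mdstat.sample
```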
<br />
== Software Bringup ==<br />
<br />
All the software used for the CLOCKSS infrastructure is freely available and open source.<br />
<br />
=== CentOS Installation and Configuration ===<br />
<br />
A base install of CentOS 6.x with Java options is performed on the partition or disk designated for the operating system, then<br />
the following steps are taken to configure it as a CLOCKSS box:<br />
# The system accounts <tt>lcap</tt> and <tt>lockss</tt> are created using <tt>useradd -r -m</tt> or equivalent. The LOCKSS LCAP public SSH keys are then added to their <tt>authorized_keys</tt> files, taking care to ensure the permissions are set up correctly.<br />
# The mount points are created for the LOCKSS daemon repositories:<br />
#* The preferred naming scheme is <tt>/cacheN/gamma</tt> where <tt>N</tt> is an integer starting from 0, (e.g. <tt>/cache0/gamma</tt>, <tt>/cache1/gamma</tt> and so forth).<br />
#* The mount point ownerships are changed to the <tt>lockss</tt> user and group and permissions for <tt>other</tt> are stripped.<br />
#* <tt>noexec</tt> is added to the mount point parameters.<br />
#* Finally, the mount points should be added to <tt>updatedb</tt>'s <tt>PRUNEPATHS</tt> variable to prevent the system from indexing the LOCKSS daemon repositories.<br />
# Then the firewall (<tt>iptables</tt> rules) is set up:<br />
#* The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines: This is accomplished under CentOS with this <tt>iptables</tt> rule: <pre>-A INPUT -m state --state NEW -m tcp -p tcp --dport 9729 -j ACCEPT</pre> Note: Although we have a list of IP addresses that should be granted access to this machine's LCAP port, we opt not to restrict access because it is protected by SSL certificate checks (see [[#CLOCKSS PLN Configuration]]). <br />
#* Additionally, a CLOCKSS box is set up so that the administrative ports 22 (OpenSSH) and 8080 through 8083 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). The following <tt>iptables</tt> rules should be repeated for each subnet, taking care to replace <tt>SUBNET</tt> with the subnet in CIDR form:<pre>-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 22 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8080 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8081 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8082 -j ACCEPT&#10;-A INPUT -m state --state NEW -m tcp -p tcp -s SUBNET --dport 8083 -j ACCEPT</pre>These rules should be reflected in any external firewalls at the site.<br />
# Setting up the CLOCKSS repository is a two-step process:<br />
## The first step is to create a new file, <tt>lockss.repo</tt>, under <tt>/etc/yum.repos.d</tt> containing the following lines:<pre>[lockss]&#10;name = LOCKSS Daemon Repository&#10;baseurl=http://www.lockss.org/clockss-repo/&#10;gpgcheck=1</pre><br />
## Then install the LOCKSS RPM GPG key:<pre>rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY</pre> <br />
# The LOCKSS daemon can now be installed by invoking:<pre>yum install lockss-daemon</pre><br />
# The LOCKSS daemon installs a <tt>logrotate</tt> configuration file to <tt>/etc/logrotate.d/lockss</tt>. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M: <pre>/var/log/lockss/daemon {&#10; size 20M&#10; rotate 5&#10; compress&#10; delaycompress&#10; create&#10; notifempty&#10; missingok&#10;}&#10;&#10;/var/log/lockss/stdout {&#10; size 10k&#10; rotate 5&#10; compress&#10; copytruncate&#10; notifempty&#10; missingok&#10;}&#10;</pre><br />
# To keep the time on CLOCKSS boxes synchronized, we use <tt>NTP</tt> (or preferably, <tt>OpenNTPd</tt> where available). To install <tt>ntpd</tt>:<pre>yum install ntp</pre><br />
# To automatically install package updates, we use <tt>yum-cron</tt>. To install and configure it: <pre>yum install yum-cron</pre> Then ensure it will start automatically on runlevels 3, 4 and 5: <pre>chkconfig --add yum-cron</pre> <br />
# Some extra optional packages we have found useful are <tt>screen</tt>, <tt>wget</tt>, <tt>vim</tt>, <tt>emacs</tt> and <tt>lynx</tt>. Installing them is highly recommended to ease troubleshooting.<pre>yum install screen wget vim emacs lynx</pre><br />
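The mount-point preparation in step 2 above can be sketched as follows. The sketch uses a scratch directory in place of the real <tt>/cacheN</tt> filesystems so it runs unprivileged; the <tt>chown</tt> is shown commented out, and the example <tt>fstab</tt> line is illustrative only:<br />

```shell
# Sketch of preparing LOCKSS repository mount points, exercised in a
# scratch directory so it can run without root.
ROOT="$(mktemp -d)"
for n in 0 1; do
    mp="$ROOT/cache$n/gamma"
    mkdir -p "$mp"
    # chown lockss:lockss "$mp"      # on a real box (requires root)
    chmod 0750 "$mp"                 # strip all permissions for "other"
done
# On a real box the fstab entry would also carry noexec, e.g. (illustrative):
#   /dev/md0  /cache0  ext4  defaults,noexec  0 2
ls -ld "$ROOT"/cache*/gamma
```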
<br />
=== Java Runtime Environment (JRE) ===<br />
<br />
Early CLOCKSS boxes were configured to use the Sun Java JRE 6 (now Oracle Java JRE) but we now recommend using the OpenJDK JRE 6 freely available through the CentOS repository. All CLOCKSS boxes have a 64-bit JRE to take advantage of the large amount of memory available on these machines.<br />
<br />
=== LOCKSS Daemon Configuration ===<br />
<br />
Once the CentOS environment has been configured and verified working by a CLOCKSS engineer, the next step is to configure the LOCKSS daemon as part of the CLOCKSS network. This is done using the <tt>hostconfig</tt> utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:<br />
<br />
{| class="wikitable"<br />
|-<br />
! scope="col"| Parameter<br />
! scope="col"| Description<br />
|-<br />
| <tt>Fully qualified hostname (FQDN) of this machine</tt><br />
| | Provided by CLOCKSS. The naming scheme we use is clockss-<i>site</i>.clockss.org, where <i>site</i> uniquely identifies the host institution of the machine.<br />
|-<br />
| <tt>IP address of this machine</tt><br />
| | A static public IP address provided by the host institution.<br />
|-<br />
| <tt>Initial subnet for admin UI access</tt><br />
| | The subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.<br />
|-<br />
| <tt>LCAP V3 protocol port</tt><br />
| | The TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.<br />
|-<br />
| <tt>Mail relay for this machine</tt><br />
| | This should be the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to <tt>localhost</tt> if the machine is capable of handling email. <br />
|-<br />
| <tt>E-mail address for administrator</tt><br />
| | Occasional alerts will be sent to this address by the LOCKSS daemon.<br />
|-<br />
| <tt>Path to java</tt><br />
| | The full path to a JRE. The default should suffice in most cases.<br />
|-<br />
| <tt>Java switches</tt><br />
| | Java switches to be passed to the JRE. It should be left blank in most cases.<br />
|-<br />
| <tt>Configuration URL</tt><br />
| | <ul><li>CLOCKSS Production: <tt>http://props.lockss.org:8001/clockss/lockss.xml</tt></li><li>CLOCKSS Ingest: <tt>http://props.lockss.org:8001/clockssingest/lockss.xml</tt></li><li>CLOCKSS Triggered: <tt>http://props.lockss.org:8001/clockss-triggered/lockss.xml</tt></li></ul><br />
|-<br />
| <tt>Preservation group(s)</tt><br />
| | <ul><li>CLOCKSS Production: <tt>clockss</tt></li><li>CLOCKSS Ingest: <tt>clockssingest</tt></li><li>CLOCKSS Triggered: <tt>clockss-triggered</tt></li></ul><br />
|-<br />
| <tt>Content storage directories</tt><br />
| | A semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.<br />
|-<br />
| <tt>Temporary storage directory</tt><br />
| | <tt>/cache0/gamma/tmp</tt><br />
|-<br />
| <tt>Password for web UI administration user admin</tt><br />
| | Set this to a strong password, known to the host site administrator but to no-one else.<br />
|-<br />
|}<br />
<br />
=== CLOCKSS PLN Configuration ===<br />
<br />
The production CLOCKSS network is configured via the [[LOCKSS: Property Server Operations|property server]] to:<br />
* Prevent access to the content by disabling both the proxy and content server functions of the LOCKSS daemon.<br />
* Prevent interception of communication between boxes in the network by the use of SSL for the [[LOCKSS: Polling and Repair Protocol]].<br />
* Prevent communication via the [[LOCKSS: Polling and Repair Protocol]] except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via <tt>scp</tt>. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions.<br />
<br />
== Monitoring ==<br />
<br />
=== Nagios ===<br />
<br />
The CLOCKSS infrastructure is monitored by the Nagios network monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.<br />
<br />
The CLOCKSS Network Administrator is responsible for Nagios monitoring.<br />
<br />
=== Nagios plugins to monitor LOCKSS ===<br />
<br />
Custom Nagios plugins were written to enable Nagios to monitor the following LOCKSS daemon services:<br />
* LOCKSS Daemon Version<br />
* LOCKSS Daemon Uptime<br />
* LOCKSS Repository Spaces (to monitor disk usage)<br />
* LOCKSS Web Administrative UI accessibility<br />
* LCAP Accessibility<br />
<br />
Plugins shipped with Nagios allow us to monitor, where applicable:<br />
* OpenSSH accessibility <br />
* HTTP/HTTPS accessibility<br />
<br />
=== Nagios Alerts ===<br />
<br />
Nagios plugins return one of four return codes depending on the nature and severity of an issue (or lack thereof). For any return code other than <tt>OK</tt>, Nagios has been configured to notify CLOCKSS engineers through email alerts (see [[CLOCKSS: Logging and Records]]).<br />
<br />
=== Access control to Nagios instance ===<br />
<br />
Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (<tt>171.66.236.0/24</tt>) and by username and password.<br />
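The shape of such a restriction can be sketched with an Apache 2.2-style stanza (the version shipped with CentOS 6). Only the subnet comes from the text; the ServerName, location, and password file below are hypothetical:<br />

```apache
# Hypothetical sketch only; the real vhost details differ.
<VirtualHost *:80>
    ServerName nagios.clockss.example
    <Location /nagios>
        # Both conditions must hold (Satisfy All is the Apache 2.2 default):
        Order deny,allow
        Deny from all
        Allow from 171.66.236.0/24
        AuthType Basic
        AuthName "CLOCKSS Nagios"
        AuthUserFile /etc/nagios/htpasswd.users
        Require valid-user
    </Location>
</VirtualHost>
```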
<br />
=== Nagios Redundancy ===<br />
<br />
CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside of the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.<br />
<br />
== Hardware Upgrade ==<br />
<br />
Currently, all CLOCKSS boxes have at least four spare hotswap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disks available (at the moment, 4TB), create a new RAID array and finally move content from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.<br />
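The upgrade path can be sketched as the command sequence below. Device names, mount points, and disk counts are hypothetical, and the commands are printed rather than executed (on a real box, run them directly as root):<br />

```shell
# Dry-run sketch of the in-place capacity upgrade described above.
run() { echo "+ $*"; }    # print instead of execute; use "$@" in the body to really run

run mdadm --create /dev/md2 --level=5 --raid-devices=4 \
    /dev/sde /dev/sdf /dev/sdg /dev/sdh     # four new 4TB disks (hypothetical names)
run mkfs.ext4 /dev/md2
run mount /dev/md2 /cache2/gamma
run rsync -a /cache1/gamma/ /cache2/gamma/  # move content off the old array
run mdadm --stop /dev/md1                   # then retire and remove the old disks
```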
<br />
The other components in a CLOCKSS box are not expected to need upgrading during its service life.<br />
<br />
== Hardware Replacement ==<br />
<br />
=== Component Failure ===<br />
<br />
The components in all CLOCKSS boxes are cheap, readily available, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID5 arrays employed in CLOCKSS boxes are only able to tolerate one disk failure each, so it is important that any failed disks be found and replaced as quickly as possible.<br />
<br />
If a disk fails, we contact the CLOCKSS box's hardware vendor if the box is still under warranty. If the box is no longer under warranty, we prefer to purchase a disk and ship it expedited to the remote site. However, if the remote site is not within the United States, an alternative is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for reimbursement.<br />
<br />
Other component failures are handled similarly. If a machine cannot be repaired, it is replaced.<br />
<br />
=== Planned CLOCKSS Box Replacement ===<br />
<br />
Despite the upgradability of existing hardware, we recognize that eventually it will no longer be cost-effective or practical to continue to run old hardware, due to increases in performance-per-watt efficiency, disk bandwidth, disk capacity and other measures. Software and content volume are also expected to push hardware towards obsolescence.<br />
<br />
Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes, using <tt>rsync</tt>. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent polls via the [[LOCKSS: Polling and Repair Protocol]] will also verify the integrity of the copy.<br />
<br />
== Software Updates ==<br />
<br />
=== Re-locating Content ===<br />
<br />
Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:<br />
# Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem read-only.<br />
# Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.<br />
# If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit <tt>/etc/lockss/config.dat</tt> and append the new LOCKSS repository path to <tt>LOCKSS_DISK_PATHS</tt>.<br />
# The AU needs its <tt>lockss_repo</tt> entry added or modified to reflect the new path. This is done by editing the AU's <tt>lockss_repo</tt> line in <tt>/cache0/gamma/config/au.txt</tt>. For bulk relocation, a LOCKSS tool called <tt>auconf.pl</tt> can scan the LOCKSS repositories for AUs and update <tt>au.txt</tt> as necessary.<br />
# If the source filesystem was remounted read-only, remount it as read-write. <br />
# Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.<br />
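Steps 2 and 3 of the procedure above can be sketched as follows. Everything runs in a scratch directory so the sketch is unprivileged: on a real box the repositories would be <tt>/cacheN/gamma</tt> paths and <tt>CONF</tt> would be <tt>/etc/lockss/config.dat</tt>, and <tt>au-example</tt> is a hypothetical AU directory name. The daemon stop/start and the <tt>au.txt</tt> edit are left as described above:<br />

```shell
# Sketch of copying an AU and registering the new repository path,
# exercised in a scratch directory (paths are stand-ins for the real ones).
T="$(mktemp -d)"
SRC="$T/cache0/gamma"; DST="$T/cache1/gamma"; CONF="$T/config.dat"
mkdir -p "$SRC/au-example" "$DST"
echo "preserved content" > "$SRC/au-example/file.bin"
echo "LOCKSS_DISK_PATHS=\"$SRC\"" > "$CONF"

# Copy the AU, verify the copy, and only then delete the source:
cp -a "$SRC/au-example" "$DST/" \
  && diff -r "$SRC/au-example" "$DST/au-example" \
  && rm -rf "$SRC/au-example"

# Append the new repository to the semicolon-delimited LOCKSS_DISK_PATHS:
sed -i "s|\"\$|;$DST\"|" "$CONF"
cat "$CONF"
```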
<br />
If a large amount of content is being moved and the daemon requires minimal disruption, a modified procedure is followed:<br />
<br />
# With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.<br />
# When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.<br />
# Run rsync again to copy any changes to the AUs since the last rsync was started.<br />
# Update <tt>/etc/lockss/config.dat</tt> and <tt>/cache0/gamma/config/au.txt</tt> as necessary.<br />
# Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.<br />
<br />
=== LOCKSS Daemon Updates ===<br />
<br />
All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to install updates. We use Nagios to monitor the LOCKSS daemon version of all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).<br />
<br />
=== System Package Updates ===<br />
<br />
Most CLOCKSS boxes are configured for unattended upgrades, so new packages will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:<br />
* Security updates should be downloaded and installed as soon as they are announced.<br />
* All other non-critical updates should be installed at least once a quarter.<br />
<br />
Anyone can subscribe to the CentOS Announce mailing list by visiting the [http://lists.centos.org/mailman/listinfo/centos-announce CentOS-announce Mailing List] home page.<br />
<br />
=== System Updates ===<br />
<br />
During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 5.x to 6.x). All production CLOCKSS boxes were configured so that their operating system filesystems are independent of their content filesystems. The CentOS upgrade path is then relatively simple: save the LOCKSS daemon configuration, user accounts and OpenSSH keys; unmount and take offline all content filesystems; then reinstall CentOS and the LOCKSS daemon environment and restore the CLOCKSS box's configuration.<br />
<br />
The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:<br />
<br />
{| class="wikitable" style="margin: 0 auto"<br />
|-<br />
! scope="col"| Version<br />
! scope="col"| Release Date<br />
! scope="col"| Full Updates<br />
! scope="col"| Maintenance Updates<br />
|-<br />
| CentOS 5.x<br />
| | 2007-04-12<br />
| | Q1 2014<br />
| | 2017-03-31<br />
|-<br />
| CentOS 6.x<br />
| | 2011-07-10<br />
| | Q2 2017<br />
| | 2020-11-30<br />
|-<br />
|}<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** CLOCKSS Network Administrator<br />
** LOCKSS Engineering Staff<br />
* Approval by CLOCKSS Technical Lead</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Software_Development_ProcessLOCKSS: Software Development Process2014-04-06T21:57:24Z<p>Dshr: /* Plugin Development Process */</p>
<hr />
<div>= LOCKSS: Software Development Process =<br />
<br />
The LOCKSS team operates two slightly different software development, testing and release processes, one for the core [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] code and one for the [[LOCKSS: Basic Concepts#Plugins|plugins]] that adapt the core daemon's behavior to particular content. All code is maintained in SourceForge's CVS repository. Two independent local copies of this repository are updated every 24 hours.<br />
<br />
== Daemon Development Process ==<br />
<br />
The development, testing and release process for the core daemon code has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
This process operates on a nominal 8-week release cycle; that is, the goal is to release a new version of the daemon code every 8 weeks. A release can be delayed if last-minute problems are detected, but if this happens a post-mortem is held to determine the cause and how to prevent it from recurring. The cause has usually been some major enhancement that takes longer to debug than expected.<br />
<br />
=== Requirements ===<br />
<br />
There are four main sources of requirements for changes to the daemon:<br />
* The community of LOCKSS users.<br />
* The content processing staff in the LOCKSS team, as they process content from LOCKSS and CLOCKSS publishers for preservation and either:<br />
** notice content that isn't being collected or processed properly, or<br />
** encounter situations that require new capabilities.<br />
* The whole LOCKSS team as they interact with the LOCKSS daemon in testing, and the developers as they fix bugs and refactor code.<br />
* The deliverables under contracts with funders, as for example our current grant from the Andrew W. Mellon foundation.<br />
Support interactions with the community of LOCKSS users are tracked and managed using the RT ticketing system. If, during resolution of an RT ticket, it is determined that a change to the daemon is required, an issue to track it is generated in the Roundup bug tracking system. In some cases an informal conversation with a community member results in a Roundup issue being created without an RT ticket. The same may happen if a LOCKSS team member identifies a needed change, after informal discussion within the team, as a result of a code review, or if a deliverable is required by an external funder. Detailed instructions on the use of Roundup to create requirements and track progress are [http://wiki.lockss.org/cgi-bin/wiki.pl?RoundUp here].<br />
<br />
=== Prioritization ===<br />
<br />
At the start of each release cycle the development staff meet and review the Roundup issues to select the set of issues that will be prioritized for the upcoming release.<br />
This prioritization is based on the needs of the LOCKSS Alliance members and the CLOCKSS archive, <br />
deliverables for funders, security considerations and the severity of the bug in question.<br />
At intervals during the release cycle the prioritization is reviewed with the goal of meeting the<br />
8-week cycle time with the available resources and responding rapidly to any critical bugs.<br />
<br />
=== Tracking ===<br />
<br />
Roundup issues are assigned to developers after prioritization under the supervision of the LOCKSS Technical Lead.<br />
As progress is made the developer updates the issue's status.<br />
<br />
=== Testing ===<br />
<br />
Core daemon code undergoes three types of testing:<br />
* Unit testing.<br />
* Functional testing.<br />
* Load testing.<br />
<br />
Testing occurs in three stages:<br />
* Developers perform unit and functional testing before and after committing changes to SourceForge.<br />
* Every night the entire code base is checked out from SourceForge, built and both unit and functional tests are run. Any failures are reported to the whole team by e-mail. Remedying them is a high-priority task.<br />
* Every time a release candidate is built it runs unit and functional tests, then is installed on one or more of the LOCKSS team's GLN boxes for load testing.<br />
<br />
==== Unit Testing ====<br />
<br />
The LOCKSS team makes heavy use of the JUnit unit test framework for Java in the context of the Ant automated Java build tool.<br />
As of September 2013 there were 495 files containing 4953 individual tests for 867 classes in the daemon code base. <br />
All significant methods in all classes are required to have unit tests, although this requirement has yet to be satisfied.<br />
Where practical, subsystems should have unit tests.<br />
The "where practical" caveat is made necessary by the distributed and randomized architecture of the LOCKSS system,<br />
which makes some systems impractical to unit-test in the Junit framework.<br />
Coverage of these tests is evaluated at intervals using Jcoverage; it is currently assessed as requiring improvement.<br />
A few specialized subsystems have functional tests implemented in the Junit framework.<br />
These are treated as unit tests.<br />
<br />
Developers should make every effort to reproduce reported bugs by implementing<br />
one or more unit tests that fail because of the bug, which will then verify that the bug is fixed and does<br />
not re-appear later.<br />
<br />
==== Functional Tests ====<br />
<br />
The LOCKSS team uses an in-house functional testing framework called STF (Stochastic Test Framework).<br />
This currently implements 23 different test scenarios.<br />
In each, the framework sets up a small LOCKSS network, typically of 4-5 daemons, on a single machine.<br />
The framework interacts with the Web user interface of each of these daemons as a normal LOCKSS box<br />
administrator would to direct them to ingest and poll on relatively small amounts of simulated content.<br />
The framework is capable of injecting faults,<br />
such as localized damage to, or total loss of, the simulated content.<br />
It uses the Web user interface of the daemons to monitor the results of the<br />
functional tests and detect any errors, such as failure to repair the injected faults.<br />
<br />
If a bug cannot be reproduced in the unit test environment efforts should be<br />
made to reproduce it by constructing a suitable scenario in STF that fails<br />
because of the bug, and can validate that the bug is fixed and does not re-appear.<br />
<br />
==== Load testing ====<br />
<br />
Production LOCKSS and CLOCKSS boxes operate at a scale that is logistically<br />
infeasible to reproduce in the unit and functional test environments.<br />
If a bug cannot be reproduced in the unit or functional test environments it must be reproduced manually in the internal test network.<br />
New daemon releases go through two types of load test:<br />
* They are installed in stages on a small internal network of 4 LOCKSS boxes with a substantial amount of selected real content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
* They are then installed on at least 4 of the 13 GLN LOCKSS boxes that the LOCKSS team operates, which typically have at least 5TB of content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
<br />
==== Hard-to-replicate bugs ====<br />
<br />
Sometimes bugs will be reported that the LOCKSS team fails to reproduce locally.<br />
Some diagnostic information will be available in the daemon logs at the<br />
reporting site.<br />
This is used to add self-checking and/or logging to the daemon to<br />
investigate the problem further.<br />
<br />
=== Approval ===<br />
<br />
When bug fixes and enhancements have been completed and pass unit and<br />
functional tests, the corresponding Roundup issues are set to Testing<br />
state. Once a release candidate has been built (below), each issue is<br />
checked again to confirm that it is operating properly in a production<br />
environment, then moved to Approved state. The rules for moving to<br />
Approved state vary according to the type of issue:<br />
<br />
* Changes that produce predictable behavioral differences are observed to ensure correct behavior.<br />
* Fixes for intermittent bugs (such as race conditions) remain in Testing state until the desired behavior is observed (or a suitable time elapses without observing the failure).<br />
* Changes that do not affect the daemon's behavior in observable ways (such as fixes or enhancements to test code) may be moved to Approved as soon as the test passes with the release build.<br />
<br />
=== Release ===<br />
<br />
On the freeze date, the LOCKSS build master creates a branch in the SourceForge repository,<br />
and labels it with the release name and the tag for the first release candidate.<br />
The first release candidate is then checked out from this branch and built.<br />
Subsequent fixes until the final release are made to both the release and main branches.<br />
All release candidates are tagged and built from the release branch. A release branch tag has the form:<br />
<code><br />
release-candidate_${N1}-${N2}-b${N3}<br />
</code><br />
where:<br />
* N1 is the LOCKSS daemon major version number, currently 1,<br />
* N2 is the LOCKSS daemon minor version number, currently 62,<br />
* N3 is the sequence number of the build, incremented by 1 each time a build is performed.<br />
<br />
If the build or tests fail, developers are notified, problems fixed, and a new release candidate is tagged and built.<br />
When successful the release candidate is signed using the LOCKSS code signing key,<br />
and uploaded to a test Yum repository, from where<br />
it is installed on a small internal test LOCKSS network.<br />
It is monitored carefully for at least a few days.<br />
If no serious problems are observed the candidate is installed on several internal boxes that participate in the<br />
GLN, in order to observe it under significant load.<br />
<br />
During the release testing phase each of the Roundup issues is checked<br />
to verify that it's operating as expected in the production system, and<br />
marked Approved if so. If problems are found in the release candidate,<br />
fixes are made on the branch (and the main branch if appropriate) and a<br />
new release candidate is produced and tested as above.<br />
<br />
When all issues (except possibly issues that can only be observed in production - see above) are marked Approved and the candidate has been running<br />
without significant problems for at least a week, the LOCKSS technical lead approves the release.<br />
The candidate is moved to the<br />
release Yum repository and the release announcement is sent. Users have<br />
the option to set their boxes up to install new releases automatically or<br />
to install them manually when they receive the release announcement.<br />
The Approved issues may then have their status changed to Resolved.<br />
<br />
=== Documentation ===<br />
<br />
Routine changes to the daemon are documented in Roundup, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Daemon Enhancements ==<br />
<br />
Significant enhancements, such as any architectural changes, to the daemon follow a slightly different process.<br />
An experimental branch is created in the SourceForge repository and used<br />
to preserve the various steps of development of a prototype. These steps<br />
normally include:<br />
* Requirements generation, a group discussion to identify, and document in the internal Wiki, outline requirements for the enhancement sufficient to allow development of a prototype.<br />
* Prototyping, development of a working enhancement and sufficient unit and/or functional tests to demonstrate that the enhancement meets its outline requirements.<br />
* Design review, a formal review of the design of the prototype and documentation of a design for a production implementation.<br />
* Implementation, in which a production version is developed either from scratch (in a second branch) or by evolving the prototype.<br />
* Code review, a formal review of the code proposed for addition to the main branch of the repository, which identifies a set of changes that must be made before the code is merged.<br />
* Merge, in which the implementation, with any changes required by the code review, is committed to the main branch of the SourceForge repository and the branch(es) used during development are abandoned.<br />
<br />
The enhancements then undergo the normal testing, approval and release process.<br />
<br />
=== Enhancement Documentation ===<br />
<br />
Significant enhancements to the daemon may require changes to documents such as the [[Definition of AIP]]. These changes are identified during the design review, and are the responsibility of the developer concerned.<br />
<br />
== Plugin Development Process ==<br />
<br />
The development, testing and release process for [[LOCKSS: Basic Concepts#LOCKSS Daemon|daemon]] [[LOCKSS: Basic Concepts#Plugins|plugins]] has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Development<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
The process normally operates asynchronously with the daemon development process; plugins are released when they are ready. If a plugin requires a new feature from the core daemon code, its release will be delayed until after the relevant daemon release. This is enforced by having the plugin declare a minimum required daemon version, which prevents the plugin from inadvertently being loaded by earlier daemons.<br />
<br />
=== Requirements ===<br />
<br />
The process for generating requirements for changes to the plugins is the same as for changes to the core daemon code, except that there is one additional source: new publishers joining LOCKSS and CLOCKSS. When a plugin writer is [[CLOCKSS: Ingest Pipeline#Harvest Publisher Engagement|assigned to analyze a new site]] they fill out a template that forms the specification for the new plugin and is checked in to CVS along with it. When changes are required, the appropriate changes are made to the specification and it is checked back in to CVS.<br />
<br />
=== Prioritization ===<br />
<br />
The process for prioritization of changes to the plugins differs from that for the daemon code in two respects:<br />
* It is a periodic process but it is not synchronous with the daemon release cycle. Except in special circumstances, plugin changes are released when ready, not together with daemon releases.<br />
* Some plugin developments may be urgent, if they are caused by impending cessation of publication, loss of access to the content, publisher change or move of the content between publishing platforms.<br />
<br />
=== Tracking ===<br />
<br />
Progress on plugin developments is tracked in JIRA.<br />
<br />
=== Development ===<br />
<br />
Plugin development should normally take place in the source tree of the current daemon release. Exceptionally, if the plugin needs bug fixes or new features not in the current release, development can use the head of the CVS tree, with the plugin's <tt>required_daemon_version</tt> set to the next daemon release number.<br />
<br />
=== Testing ===<br />
<br />
Plugins are implemented partly in Java and partly in XML.<br />
The Java classes have unit tests in the same way as the core daemon code does.<br />
As of September 2013 there were 221 files with 911 individual tests for 248 plugins plus 325 auxiliary plugin classes.<br />
During every build all the plugin unit tests are run,<br />
and some additional validation is performed on the XML.<br />
<br />
Once a new or changed plugin has passed these tests, it is loaded into<br />
a daemon in a test environment which is manually directed to collect<br />
one or more Archival Units (AUs) of its target content. Three checks<br />
are performed:<br />
* The status info and daemon logs are examined to detect errors such as unexpected 404s.<br />
* The collected content is browsed using the daemon's "audit proxy" (a proxy that returns only collected content and 404 for everything else) to ensure that all the desired content is collected and undesired content is not. For example, if the AU represents a volume of a journal, all articles belonging to that volume should be present, along with the common files they reference (e.g., style sheets), and there should be no articles belonging to other volumes.<br />
* A visual check of the collected content against the publisher's original.<br />
<br />
Once a new or changed plugin has passed these tests it is released to the LOCKSS or CLOCKSS content test network as appropriate.<br />
Under the control of an internally developed testing framework (AUTest) several AUs are collected, polled and the results checked for agreement.<br />
If the agreement is less than 100%, the reason is diagnosed.<br />
Either the plugin is further changed and the process repeated,<br />
or, if the diagnosis is that collection from the publisher suffered transient errors,<br />
the AUs are re-collected and the check repeated.<br />
Metadata extraction is performed on the collected AUs and checked for correctness and completeness.<br />
<br />
=== Approval ===<br />
<br />
When a plugin is ready, it's released. Plugins are released<br />
individually, independently of the daemon, except for cases where a<br />
plugin requires a daemon feature that has not yet been released, in<br />
which case it waits for a daemon release.<br />
<br />
The LOCKSS or CLOCKSS Plugin Lead (as appropriate) approves the release of a new plugin.<br />
<br />
=== Release ===<br />
<br />
A plugin release requires the following steps:<br />
* The plugin build master packages the plugin on a build machine.<br />
* The plugin build master signs the plugin with their key:<br />
** The keystore containing the keys that can sign plugins for the GLN is the daemon's default keystore, which is controlled by the LOCKSS technical lead.<br />
** PLNs, including the CLOCKSS PLN, have their own keystores. The CLOCKSS keystore is controlled by the LOCKSS technical lead.<br />
* The plugin is uploaded to the appropriate repository.<br />
* A plugin collection is triggered manually on one production or ingest box.<br />
* The logs are checked to ensure the plugin loaded correctly.<br />
* The remaining boxes will automatically fetch and load any new (or new versions of) plugins within 12 hours.<br />
For CLOCKSS, the plugin is released to the ingest and production boxes.<br />
<br />
=== Documentation ===<br />
<br />
Plugin changes are documented in JIRA, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Development Environment ==<br />
<br />
Developers are free to use the operating system and other tools of their choice on their own machines while developing LOCKSS software, provided that when performing [[LOCKSS: Software Development Process#Unit Testing|pre- and post-commit testing]] they use the currently approved versions of:<br />
* Apache Ant<br />
* The JDK<br />
* The Java libraries from CVS.<br />
Diversity in development environments assists in identifying hidden dependencies.<br />
<br />
== Dependencies ==<br />
<br />
Two types of dependency are of concern for the functioning of the LOCKSS system, and thus need to be proactively monitored:<br />
* The ability of operating systems to support the requirements of the LOCKSS software, and of the other software components used by the CLOCKSS archive.<br />
* The set of formats supported by the Web browsers in the Knowledge Base of the Designated Community.<br />
<br />
=== Operating System Dependencies ===<br />
<br />
The LOCKSS software depends upon:<br />
* A Java virtual machine, currently version 6 or 7.<br />
* A set of Java libraries.<br />
* A POSIX file system.<br />
* An SQL database.<br />
Any modern operating system can support these dependencies. Although the system is currently supported only on Red Hat-compatible Linux distributions, in development it runs on many versions of Linux, on MacOS and, with some restrictions, on Windows. Some years ago it was ported from OpenBSD with little trouble. As the LOCKSS team has the software under continuous development on a range of operating systems, any problems with operating system support rapidly become evident to the team.<br />
<br />
Other tools required are similarly situated. The main one is the Apache web server, which is both widely supported, and could be replaced by a competitor with little trouble. Again, a lack of support for these tools would rapidly become evident to the team because they are indispensable in development.<br />
<br />
=== Browser Format Support ===<br />
<br />
[[LOCKSS: Format Migration|Web formats become obsolete]] when support for them is removed from the browsers in general use. The occurrence of such obsolescence is a subject of active research, in which the LOCKSS team participates on a continuing basis. Any looming obsolescence of a member of the set of formats identified by the [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|CLOCKSS software's use of the File Identification Tool Set (FITS)]] would become known through this research network.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Engineering Staff<br />
** LOCKSS Plugin Lead<br />
** CLOCKSS Plugin Lead<br />
** LOCKSS Content Lead<br />
** CLOCKSS Content Lead<br />
** LOCKSS Build Master<br />
* Approval by LOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[LOCKSS: Format Migration]]<br />
# [[CLOCKSS: Ingest Pipeline]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Software_Development_ProcessLOCKSS: Software Development Process2014-04-06T21:55:33Z<p>Dshr: /* LOCKSS: Software Development Process */</p>
<hr />
<div>= LOCKSS: Software Development Process =<br />
<br />
The LOCKSS team operates two slightly different software development, testing and release processes, one for the core [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] code and one for the [[LOCKSS: Basic Concepts#Plugins|plugins]] that adapt the core daemon's behavior to particular content. All code is maintained in SourceForge's CVS repository. Two independent local copies of this repository are updated every 24 hours.<br />
<br />
== Daemon Development Process ==<br />
<br />
The development, testing and release process for the core daemon code has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
This process operates on a nominal 8-week release cycle; that is, the goal is to release a new version of the daemon code every 8 weeks. A release can be delayed if last-minute problems are detected, but if this happens a post-mortem is held to determine the cause and how to prevent it from recurring. The cause has usually been some major enhancement that takes longer to debug than expected.<br />
<br />
=== Requirements ===<br />
<br />
There are four main sources of requirements for changes to the daemon:<br />
* The community of LOCKSS users.<br />
* The content processing staff in the LOCKSS team, as they process content from LOCKSS and CLOCKSS publishers for preservation and either:<br />
** notice content that isn't being collected or processed properly, or<br />
** encounter situations that require new capabilities.<br />
* The whole LOCKSS team as they interact with the LOCKSS daemon in testing, and the developers as they fix bugs and refactor code.<br />
* The deliverables under contracts with funders, for example our current grant from the Andrew W. Mellon Foundation.<br />
Support interactions with the community of LOCKSS users are tracked and managed using the RT ticketing system. If, during resolution of an RT ticket, it is determined that a change to the daemon is required, an issue to track it is generated in the Roundup bug tracking system. In some cases an informal conversation with a community member results in a Roundup issue being created without an RT ticket. The same may happen if a LOCKSS team member identifies a needed change, after informal discussion within the team, as a result of a code review, or if a deliverable is required by an external funder. Detailed instructions on the use of Roundup to create requirements and track progress are [http://wiki.lockss.org/cgi-bin/wiki.pl?RoundUp here].<br />
<br />
=== Prioritization ===<br />
<br />
At the start of each release cycle the development staff meet and review the Roundup issues to select the set of issues that will be prioritized for the upcoming release.<br />
This prioritization is based on the needs of the LOCKSS Alliance members and the CLOCKSS archive, <br />
deliverables for funders, security considerations and the severity of the bug in question.<br />
At intervals during the release cycle the prioritization is reviewed with the goals of meeting the<br />
8-week cycle time with the available resources and responding rapidly to any critical bugs.<br />
<br />
=== Tracking ===<br />
<br />
Roundup issues are assigned to developers after prioritization under the supervision of the LOCKSS Technical Lead.<br />
As progress is made the developer updates the issue's status.<br />
<br />
=== Testing ===<br />
<br />
Core daemon code undergoes three types of testing:<br />
* Unit testing.<br />
* Functional testing.<br />
* Load testing.<br />
<br />
Testing occurs in three stages:<br />
* Developers perform unit and functional testing before and after committing changes to SourceForge.<br />
* Every night the entire code base is checked out from SourceForge, built and both unit and functional tests are run. Any failures are reported to the whole team by e-mail. Remedying them is a high-priority task.<br />
* Every time a release candidate is built it runs unit and functional tests, then is installed on one or more of the LOCKSS team's GLN boxes for load testing.<br />
<br />
==== Unit Testing ====<br />
<br />
The LOCKSS team makes heavy use of the JUnit unit test framework for Java, in the context of the Ant automated Java build tool.<br />
As of September 2013 there were 495 files containing 4953 individual tests for 867 classes in the daemon code base. <br />
All significant methods in all classes are required to have unit tests, although this requirement has yet to be satisfied.<br />
Where practical, subsystems should have unit tests.<br />
The "where practical" caveat is made necessary by the distributed and randomized architecture of the LOCKSS system,<br />
which makes some subsystems impractical to unit-test in the JUnit framework.<br />
Coverage of these tests is evaluated at intervals using Jcoverage; it is currently assessed as requiring improvement.<br />
A few specialized subsystems have functional tests implemented in the JUnit framework.<br />
These are treated as unit tests.<br />
<br />
Developers should make every effort to reproduce reported bugs by implementing<br />
one or more unit tests that fail because of the bug, which will then verify that the bug is fixed and does<br />
not re-appear later.<br />
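The pattern described above, a test that fails because of the bug and passes once it is fixed, can be sketched as follows. Both the <tt>UrlNormalizer</tt> class and the trailing-slash bug are hypothetical, invented only to illustrate the shape of such a regression test; real tests are written in the JUnit framework.<br />

```java
// Hypothetical illustration of encoding a bug report as a regression test.
// Neither UrlNormalizer nor the trailing-slash bug comes from the real
// LOCKSS code base; in practice such tests use the JUnit framework.
public class TestUrlNormalizer {
    // Minimal stand-in for the class under test.
    static class UrlNormalizer {
        static String normalize(String url) {
            // The fix: strip a single trailing slash so both forms of a
            // URL hash identically during a poll.
            return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        }
    }

    // The test that failed before the fix and must keep passing after it.
    public static void testTrailingSlashIsNormalized() {
        String a = UrlNormalizer.normalize("http://example.com/toc/");
        String b = UrlNormalizer.normalize("http://example.com/toc");
        if (!a.equals(b)) throw new AssertionError("URLs should normalize identically");
    }

    public static void main(String[] args) {
        testTrailingSlashIsNormalized();
        System.out.println("ok");
    }
}
```

Once committed, such a test runs in the nightly build, so a regression is reported to the whole team by e-mail.<br />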
<br />
==== Functional Tests ====<br />
<br />
The LOCKSS team uses an in-house functional testing framework called STF (Stochastic Test Framework).<br />
This currently implements 23 different test scenarios.<br />
In each, the framework sets up a small LOCKSS network, typically of 4-5 daemons, on a single machine.<br />
The framework interacts with the Web user interface of each of these daemons as a normal LOCKSS box<br />
administrator would to direct them to ingest and poll on relatively small amounts of simulated content.<br />
The framework is capable of injecting faults,<br />
such as localized damage to, or total loss of, the simulated content.<br />
It uses the Web user interface of the daemons to monitor the results of the<br />
functional tests and detect any errors, such as failure to repair the injected faults.<br />
<br />
If a bug cannot be reproduced in the unit test environment, efforts should be<br />
made to reproduce it by constructing a suitable scenario in STF that fails<br />
because of the bug; the scenario then validates that the bug is fixed and does not re-appear.<br />
<br />
==== Load testing ====<br />
<br />
Production LOCKSS and CLOCKSS boxes operate at a scale that is logistically<br />
infeasible to reproduce in the unit and functional test environments.<br />
If a bug cannot be reproduced in the unit or functional test environments it must be reproduced manually in the internal test network.<br />
New daemon releases go through two types of load test:<br />
* They are installed in stages on a small internal network of 4 LOCKSS boxes with a substantial amount of selected real content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
* They are then installed on at least 4 of the 13 GLN LOCKSS boxes that the LOCKSS team operates, which typically have at least 5TB of content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.<br />
<br />
==== Hard-to-replicate bugs ====<br />
<br />
Sometimes bugs will be reported that the LOCKSS team fails to reproduce locally.<br />
Some diagnostic information will be available in the daemon logs at the<br />
reporting site.<br />
This is used to add self-checking and/or logging to the daemon to<br />
investigate the problem further.<br />
<br />
=== Approval ===<br />
<br />
When bug fixes and enhancements have been completed and pass unit and<br />
functional tests, the corresponding Roundup issues are set to Testing<br />
state. Once a release candidate has been built (below), each issue is<br />
checked again to confirm that it is operating properly in a production<br />
environment, then moved to Approved state. The rules for moving to<br />
Approved state vary according to the type of issue:<br />
<br />
* Changes that produce predictable behavioral differences are observed to ensure correct behavior.<br />
* Fixes for intermittent bugs (such as race conditions) remain in Testing state until the desired behavior is observed (or a suitable time elapses without observing the failure).<br />
* Changes that do not affect the daemon's behavior in observable ways (such as fixes or enhancements to test code) may be moved to Approved as soon as the test passes with the release build.<br />
<br />
=== Release ===<br />
<br />
On the freeze date, the LOCKSS build master creates a branch in the SourceForge repository,<br />
and labels it with the release name and the tag for the first release candidate.<br />
The first release candidate is then checked out from this branch and built.<br />
Subsequent fixes until the final release are made to both the release and main branches.<br />
All release candidates are tagged and built from the release branch. A release branch tag has the form:<br />
<code><br />
release-candidate_${N1}-${N2}-b${N3}<br />
</code><br />
where:<br />
* N1 is the LOCKSS daemon major version number, currently 1,<br />
* N2 is the LOCKSS daemon minor version number, currently 62,<br />
* N3 is the sequence number of the build, incremented by 1 each time a build is performed.<br />
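A release tag of this form can be checked mechanically; for example, build 3 of daemon 1.62 would be tagged <tt>release-candidate_1-62-b3</tt>. The following parser is an illustrative sketch, not part of the LOCKSS build tooling:<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for release-candidate tags of the form
// release-candidate_${N1}-${N2}-b${N3}, e.g. release-candidate_1-62-b3.
public class ReleaseTag {
    private static final Pattern TAG =
        Pattern.compile("release-candidate_(\\d+)-(\\d+)-b(\\d+)");

    // Returns {major, minor, build}, or throws if the tag is malformed.
    public static int[] parse(String tag) {
        Matcher m = TAG.matcher(tag);
        if (!m.matches()) {
            throw new IllegalArgumentException("bad tag: " + tag);
        }
        return new int[] {
            Integer.parseInt(m.group(1)),  // N1: major version
            Integer.parseInt(m.group(2)),  // N2: minor version
            Integer.parseInt(m.group(3))   // N3: build sequence number
        };
    }

    public static void main(String[] args) {
        int[] v = parse("release-candidate_1-62-b3");
        System.out.println(v[0] + "." + v[1] + " build " + v[2]);
    }
}
```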
<br />
If the build or tests fail, developers are notified, problems fixed, and a new release candidate is tagged and built.<br />
When successful the release candidate is signed using the LOCKSS code signing key,<br />
and uploaded to a test Yum repository, from where<br />
it is installed on a small internal test LOCKSS network.<br />
It is monitored carefully for at least a few days.<br />
If no serious problems are observed the candidate is installed on several internal boxes that participate in the<br />
GLN, in order to observe it under significant load.<br />
<br />
During the release testing phase each of the Roundup issues is checked<br />
to verify that it's operating as expected in the production system, and<br />
marked Approved if so. If problems are found in the release candidate,<br />
fixes are made on the branch (and the main branch if appropriate) and a<br />
new release candidate is produced and tested as above.<br />
<br />
When all issues (except possibly issues that can only be observed in production - see above) are marked Approved and the candidate has been running<br />
without significant problems for at least a week, the LOCKSS technical lead approves the release.<br />
The candidate is moved to the<br />
release Yum repository and the release announcement is sent. Users have<br />
the option to set their boxes up to install new releases automatically or<br />
to install them manually when they receive the release announcement.<br />
The Approved issues may then have their status changed to Resolved.<br />
<br />
=== Documentation ===<br />
<br />
Routine changes to the daemon are documented in Roundup, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Daemon Enhancements ==<br />
<br />
Significant enhancements, such as any architectural changes, to the daemon follow a slightly different process.<br />
An experimental branch is created in the SourceForge repository and used<br />
to preserve the various steps of development of a prototype. These steps<br />
normally include:<br />
* Requirements generation, a group discussion to identify, and document in the internal Wiki, outline requirements for the enhancement sufficient to allow development of a prototype.<br />
* Prototyping, development of a working enhancement and sufficient unit and/or functional tests to demonstrate that the enhancement meets its outline requirements.<br />
* Design review, a formal review of the design of the prototype and documentation of a design for a production implementation.<br />
* Implementation, in which a production version is developed either from scratch (in a second branch) or by evolving the prototype.<br />
* Code review, a formal review of the code proposed for addition to the main branch of the repository, which identifies a set of changes that must be made before the code is merged.<br />
* Merge, in which the implementation, with any changes required by the code review, is committed to the main branch of the SourceForge repository and the branch(es) used during development are abandoned.<br />
<br />
The enhancements then undergo the normal testing, approval and release process.<br />
<br />
=== Enhancement Documentation ===<br />
<br />
Significant enhancements to the daemon may require changes to documents such as the [[Definition of AIP]]. These changes are identified during the design review, and are the responsibility of the developer concerned.<br />
<br />
== Plugin Development Process ==<br />
<br />
The development, testing and release process for daemon plugins has the following stages:<br />
# Requirements generation<br />
# Prioritization<br />
# Tracking<br />
# Development<br />
# Testing<br />
# Approval<br />
# Release<br />
<br />
The process normally operates asynchronously with the daemon development process; plugins are released when they are ready. If a plugin requires a new feature from the core daemon code, its release will be delayed until after the relevant daemon release. This is enforced by having the plugin declare a minimum required daemon version, which prevents the plugin from inadvertently being loaded by earlier daemons.<br />
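The version gate amounts to a comparison at plugin-load time between the daemon's own version and the plugin's declared minimum. A minimal sketch, assuming versions compare as (major, minor) pairs; the class and method names are illustrative, not taken from the daemon code:<br />

```java
// Illustrative check of a plugin's declared minimum daemon version.
// The real daemon performs an equivalent comparison before loading a
// plugin; this class and its version encoding are simplified assumptions.
public class VersionGate {
    // Returns true if a daemon at (major, minor) may load a plugin that
    // declares a required daemon version of reqMajor.reqMinor.
    public static boolean mayLoad(int major, int minor, int reqMajor, int reqMinor) {
        if (major != reqMajor) return major > reqMajor;
        return minor >= reqMinor;
    }

    public static void main(String[] args) {
        // A daemon at 1.61 must refuse a plugin requiring 1.62.
        System.out.println(mayLoad(1, 61, 1, 62));
        // A daemon at 1.62 may load it.
        System.out.println(mayLoad(1, 62, 1, 62));
    }
}
```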
<br />
=== Requirements ===<br />
<br />
The process for generating requirements for changes to the plugins is the same as for changes to the core daemon code, except that there is one additional source: new publishers joining LOCKSS and CLOCKSS. When a plugin writer is [[CLOCKSS: Ingest Pipeline#Harvest Publisher Engagement|assigned to analyze a new site]] they fill out a template that forms the specification for the new plugin and is checked in to CVS along with it. When changes are required, the appropriate changes are made to the specification and it is checked back in to CVS.<br />
<br />
=== Prioritization ===<br />
<br />
The process for prioritization of changes to the plugins differs from that for the daemon code in two respects:<br />
* It is a periodic process but it is not synchronous with the daemon release cycle. Except in special circumstances, plugin changes are released when ready, not together with daemon releases.<br />
* Some plugin developments may be urgent, if they are caused by impending cessation of publication, loss of access to the content, publisher change or move of the content between publishing platforms.<br />
<br />
=== Tracking ===<br />
<br />
Progress on plugin developments is tracked in JIRA.<br />
<br />
=== Development ===<br />
<br />
Plugin development should normally take place in the source tree of the current daemon release. Exceptionally, if the plugin needs bug fixes or new features not in the current release, development can use the head of the CVS tree, with the plugin's <tt>required_daemon_version</tt> set to the next daemon release number.<br />
<br />
=== Testing ===<br />
<br />
Plugins are implemented partly in Java and partly in XML.<br />
The Java classes have unit tests in the same way as the core daemon code does.<br />
As of September 2013 there were 221 files with 911 individual tests for 248 plugins plus 325 auxiliary plugin classes.<br />
During every build all the plugin unit tests are run,<br />
and some additional validation is performed on the XML.<br />
<br />
Once a new or changed plugin has passed these tests, it is loaded into<br />
a daemon in a test environment which is manually directed to collect<br />
one or more Archival Units (AUs) of its target content. Three checks<br />
are performed:<br />
* The status info and daemon logs are examined to detect errors such as unexpected 404s.<br />
* The collected content is browsed using the daemon's "audit proxy" (a proxy that returns only collected content and 404 for everything else) to ensure that all the desired content is collected and undesired content is not. For example, if the AU represents a volume of a journal, all articles belonging to that volume should be present, along with the common files they reference (e.g., style sheets), and there should be no articles belonging to other volumes.<br />
* A visual check of the collected content against the publisher's original.<br />
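The audit proxy's dispatch rule can be stated simply: serve a URL if and only if it was collected, otherwise return 404. A minimal sketch of that rule (the class is illustrative; the real proxy additionally serves the stored content itself):<br />

```java
import java.util.Set;

// Illustrative dispatch rule for the daemon's "audit proxy": requests for
// collected URLs are served, everything else gets a 404. This class is a
// simplification; the real proxy also returns the stored content bytes.
public class AuditProxyRule {
    private final Set<String> collected;

    public AuditProxyRule(Set<String> collectedUrls) {
        this.collected = collectedUrls;
    }

    // HTTP status to return for a requested URL.
    public int status(String url) {
        return collected.contains(url) ? 200 : 404;
    }

    public static void main(String[] args) {
        AuditProxyRule proxy = new AuditProxyRule(
            Set.of("http://example.com/vol1/art1", "http://example.com/style.css"));
        System.out.println(proxy.status("http://example.com/vol1/art1")); // 200
        System.out.println(proxy.status("http://example.com/vol2/art1")); // 404
    }
}
```

Browsing an AU through such a proxy makes over-collection (unexpected 200s) and under-collection (unexpected 404s) immediately visible.<br />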
<br />
Once a new or changed plugin has passed these tests it is released to the LOCKSS or CLOCKSS content test network as appropriate.<br />
Under the control of an internally developed testing framework (AUTest) several AUs are collected, polled and the results checked for agreement.<br />
If the agreement is less than 100%, the reason is diagnosed.<br />
Either the plugin is further changed and the process repeated,<br />
or, if the diagnosis is that collection from the publisher suffered transient errors,<br />
the AUs are re-collected and the check repeated.<br />
Metadata extraction is performed on the collected AUs and checked for correctness and completeness.<br />
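The agreement check can be viewed as the fraction of boxes whose AU hash matches the most common value. The following is a simplified illustration of that tally, not the actual AUTest or polling logic:<br />

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative agreement computation over per-box AU hashes, a
// simplification of the checks AUTest drives: agreement is the share of
// boxes whose hash matches the most common value.
public class AgreementCheck {
    public static double agreement(List<String> auHashes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String h : auHashes) {
            counts.merge(h, 1, Integer::sum);
        }
        int best = counts.values().stream().max(Integer::compare).orElse(0);
        return auHashes.isEmpty() ? 0.0 : (double) best / auHashes.size();
    }

    public static void main(String[] args) {
        // Three boxes agree; one collected a transient error page.
        double a = agreement(List.of("h1", "h1", "h1", "h2"));
        System.out.println(a < 1.0 ? "diagnose disagreement" : "100% agreement");
    }
}
```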
<br />
=== Approval ===<br />
<br />
When a plugin is ready, it's released. Plugins are released<br />
individually, independently of the daemon, except for cases where a<br />
plugin requires a daemon feature that has not yet been released, in<br />
which case it waits for a daemon release.<br />
<br />
The LOCKSS or CLOCKSS Plugin Lead (as appropriate) approves the release of a new plugin.<br />
<br />
=== Release ===<br />
<br />
A plugin release requires the following steps:<br />
* The plugin build master packages the plugin on a build machine.<br />
* The plugin build master signs the plugin with their key:<br />
** The keystore containing the keys that can sign plugins for the GLN is the daemon's default keystore, which is controlled by the LOCKSS technical lead.<br />
** PLNs, including the CLOCKSS PLN, have their own keystores. The CLOCKSS keystore is controlled by the LOCKSS technical lead.<br />
* The plugin is uploaded to the appropriate repository.<br />
* A plugin collection is triggered manually on one production or ingest box.<br />
* The logs are checked to ensure the plugin loaded correctly.<br />
* The remaining boxes will automatically fetch and load any new plugins (or new versions of existing plugins) within 12 hours.<br />
For CLOCKSS, the plugin is released to the ingest and production boxes.<br />
<br />
=== Documentation ===<br />
<br />
Plugin changes are documented in JIRA, and in the commit messages in SourceForge. They do not require changes to the system's architectural documents.<br />
<br />
== Development Environment ==<br />
<br />
Developers are free to use the operating system and other tools of their choice on their own machines while developing LOCKSS software, provided that when performing [[LOCKSS: Software Development Process#Unit Testing|pre- and post-commit testing]] they use the currently approved versions of:<br />
* Apache Ant<br />
* The JDK<br />
* The Java libraries from CVS.<br />
Diversity in development environments assists in identifying hidden dependencies.<br />
<br />
== Dependencies ==<br />
<br />
Two types of dependency are of concern for the functioning of the LOCKSS system, and thus need to be proactively monitored:<br />
* The ability of operating systems to support the requirements of the LOCKSS software, and of the other software components used by the CLOCKSS archive.<br />
* The set of formats supported by the Web browsers in the Knowledge Base of the Designated Community.<br />
<br />
=== Operating System Dependencies ===<br />
<br />
The LOCKSS software depends upon:<br />
* A Java virtual machine, currently version 6 or 7.<br />
* A set of Java libraries.<br />
* A POSIX file system.<br />
* An SQL database.<br />
Any modern operating system can support these dependencies. Although the system is currently supported only on Red Hat-compatible Linux distributions, in development it runs on many versions of Linux, on MacOS and, with some restrictions, on Windows. Some years ago it was ported from OpenBSD with little trouble. As the LOCKSS team has the software under continuous development on a range of operating systems, any problems with operating system support rapidly become evident to the team.<br />
<br />
Other tools required are similarly situated. The main one is the Apache web server, which is both widely supported, and could be replaced by a competitor with little trouble. Again, a lack of support for these tools would rapidly become evident to the team because they are indispensable in development.<br />
<br />
=== Browser Format Support ===<br />
<br />
[[LOCKSS: Format Migration|Web formats become obsolete]] when support for them is removed from the browsers in general use. The occurrence of such obsolescence is a subject of active research, in which the LOCKSS team participates on a continuing basis. Any looming obsolescence of a member of the set of formats identified by the [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive|CLOCKSS software's use of the File Identification Tool Set (FITS)]] would become known through this research network.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Engineering Staff<br />
** LOCKSS Plugin Lead<br />
** CLOCKSS Plugin Lead<br />
** LOCKSS Content Lead<br />
** CLOCKSS Content Lead<br />
** LOCKSS Build Master<br />
* Approval by LOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[LOCKSS: Format Migration]]<br />
# [[CLOCKSS: Ingest Pipeline]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Property_Server_OperationsLOCKSS: Property Server Operations2014-04-06T21:52:56Z<p>Dshr: /* Apache Virtual Host */</p>
<hr />
<div>= LOCKSS: Property Server Operations =<br />
<br />
== Software ==<br />
The Property Server makes LOCKSS properties available to LOCKSS and CLOCKSS boxes via the HTTP and HTTPS protocols. Although any web<br />
server would suffice, the Property Server for LOCKSS and CLOCKSS uses the free and open-source Apache HTTP Server under the Ubuntu Server<br />
LTS operating system.<br />
<br />
== Configuration ==<br />
<br />
=== Apache Virtual Host ===<br />
<br />
Currently, the Property Servers for all PLNs, including CLOCKSS, use the same Apache Virtual Host definition. Each PLN, including CLOCKSS, has separate access control definitions for the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon]] property files and [[LOCKSS: Basic Concepts#Plugins|plugins]] served by the Property Server. CLOCKSS has three different LOCKSS daemon properties; one for each class of machines it runs:<br />
* CLOCKSS Ingest<br />
* CLOCKSS Production<br />
* CLOCKSS Triggered<br />
A typical access control definition within the Property Server looks like:<br />
<pre><br />
<Directory "/home/www/props/html/clockss"><br />
Order deny,allow<br />
Deny from all<br />
Include /etc/apache2/access.d/props/lockss<br />
Include /etc/apache2/access.d/props/clockss<br />
</Directory><br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
The properties for a PLN such as CLOCKSS are accessible only by machines on the LOCKSS subnet <tt>171.66.236.0/24</tt> at Stanford and by authorized boxes in each network, such as CLOCKSS Ingest and Production boxes. A machine's IP address must be explicitly added to the ACL by a designated LOCKSS engineer before the machine can access the appropriate property files.<br />
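The effect of this access policy can be illustrated with Python's standard <tt>ipaddress</tt> module. The subnet is the one given above; the explicitly-added box address is a hypothetical placeholder:<br />

```python
import ipaddress

# The LOCKSS subnet at Stanford (from the ACL description above).
LOCKSS_SUBNET = ipaddress.ip_network("171.66.236.0/24")

# Hypothetical authorized box addresses, explicitly added by a
# designated LOCKSS engineer (illustrative placeholder only).
AUTHORIZED = {ipaddress.ip_address("192.0.2.17")}

def allowed(client_ip: str) -> bool:
    """True if the client may fetch the PLN's property files."""
    addr = ipaddress.ip_address(client_ip)
    return addr in LOCKSS_SUBNET or addr in AUTHORIZED
```

In production the equivalent checks are expressed as Apache access-control directives in the included <tt>access.d</tt> files, not in application code.<br />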
<br />
== Property Change Process ==<br />
<br />
=== Use of version control ===<br />
<br />
LOCKSS property files are static XML files and changes are made through RCS, an early revision control system still maintained by the GNU Project. Changes are made by checking out and locking the property file to be updated, making changes and finally, checking it in (with a brief description of changes) and unlocking. It is possible to revert to an earlier version of a file, if necessary, using RCS.<br />
<br />
=== Property Server Access Control ===<br />
<br />
Only authorized LOCKSS team members are allowed to log in to the Property Server. Additionally, a LOCKSS team member's user account must be part of a privileged user group to modify the CLOCKSS property files.<br />
<br />
== CLOCKSS Plugins ==<br />
<br />
The CLOCKSS property files refer the LOCKSS daemon running on CLOCKSS boxes to a URL where the daemon can retrieve CLOCKSS-specific LOCKSS daemon plugins as signed <tt>.jar</tt> files, generated as described in [[LOCKSS: Software Development Process#Plugin Development Process|LOCKSS: Software Development Process]]. These plugins are also served by the CLOCKSS Property Server and are served in a way that allows them to be preserved by the LOCKSS daemon. The Apache definition is as follows:<br />
<pre><br />
<Directory "/home/www/props/html/clockss/plugins"><br />
Options Indexes MultiViews<br />
IndexOptions IgnoreCase SuppressHTMLPreamble<br />
IndexIgnore .. held FOOTER*<br />
HeaderName HEADER.html<br />
ReadmeName FOOTER.html<br />
</Directory><br />
</pre><br />
<br />
== Cloud Mirror and Fail-over ==<br />
<br />
=== Amazon EC2 Instance ===<br />
<br />
The LOCKSS Property Server and HTTP server are mirrored to an Amazon Elastic Compute Cloud (EC2) instance nightly. If the CLOCKSS Property Server or HTTP server needs downtime for maintenance or experiences a problem (due to hardware failure or a network or power outage), the mirror can take its place and continue to provide core services.<br />
<br />
Our EC2 instance is currently running in Amazon's <tt>US-EAST-1</tt> region located in northern Virginia and should be insulated from a catastrophic event on the West Coast. Amazon Web Services (AWS) makes it easy to replicate the mirror to other regions within the United States or internationally (Ireland, Singapore, Sydney, Tokyo, Sao Paulo) if necessary.<br />
<br />
Although changes could be made to the mirror, it is treated by LOCKSS and CLOCKSS processes as <i>read-only</i>; changes made to the mirror are lost on the next nightly mirror sync.<br />
<br />
==== Access control for the EC2 instance ====<br />
<br />
The CLOCKSS Amazon EC2 instances are managed through one Amazon AWS account. The credentials are known only to the CLOCKSS systems administrators at Stanford. It is possible to fine-tune access to AWS resources using AWS Identity and Access Management (IAM); however, this is not necessary for our use case.<br />
<br />
The EC2 instance access control is done through EC2 Security Groups and Key Pairs. The former is used to configure a class of external firewall rules that can be applied to any EC2 instance under the AWS account. The latter establishes SSH public-private key pairs. The key pair can be generated by Amazon and downloaded, or a public key can be uploaded; private keys are never stored at Amazon. The Key Pair assigned to an instance is used by the Amazon Machine Image (AMI) as the initial SSH key pair to install in a newly brought-up instance.<br />
<br />
Root access is disabled on the official Ubuntu LTS AMIs we use. Instead the initial username is <tt>ubuntu</tt>.<br />
<br />
==== The nightly mirror process ====<br />
<br />
Each night the following process updates the mirror:<br />
<br />
* All files to be mirrored are copied to a staging area using <tt>rsync</tt>.<br />
* The MySQL server dumps all databases to the staging area.<br />
* The contents of the staging area are copied to a staging area on the Amazon EC2 instance using <tt>rsync</tt>.<br />
* The Amazon EC2 instance updates itself from the staging area by:<br />
** Copying the files to their proper location and modifying files, if necessary.<br />
** Re-loading its MySQL server from the dumps.<br />
** Re-starting the services.<br />
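The steps above can be sketched by the commands involved. The paths, hostname and option choices are hypothetical placeholders; the sketch only constructs the command lines rather than running them:<br />

```python
# Sketch of the nightly mirror process: build (but do not run) the
# command line for each step. All paths, hostnames and options are
# hypothetical illustrations, not the production configuration.

STAGING = "/var/staging/props"
EC2_HOST = "mirror.example.com"  # hypothetical EC2 instance hostname

def stage_files_cmd(src="/home/www/props"):
    """Step 1: copy the files to be mirrored into the staging area."""
    return ["rsync", "-a", "--delete", src + "/", STAGING + "/files/"]

def dump_databases_cmd():
    """Step 2: dump all MySQL databases into the staging area."""
    return ["mysqldump", "--all-databases",
            "--result-file", STAGING + "/db/all.sql"]

def push_to_ec2_cmd():
    """Step 3: copy the staging area to the EC2 instance's staging area."""
    return ["rsync", "-a", "--delete", STAGING + "/",
            "ubuntu@" + EC2_HOST + ":/var/staging/props/"]
```

On the EC2 side the final steps (installing the files, reloading MySQL from the dumps, and restarting services) run locally from the instance's staging area.<br />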
<br />
=== Fail-over Triggers ===<br />
<br />
Fail-over to the Amazon EC2 mirror is triggered during scheduled and unscheduled events causing interruption to core LOCKSS services, as called for by LOCKSS management. These include events such as:<br />
* Scheduled hardware upgrades or replacement<br />
* Scheduled software upgrades<br />
* Hardware failures<br />
* Software failures (kernel panics, segmentation faults, ...)<br />
* Infrastructure interruption (cooling, networking, power, ...)<br />
* Human error<br />
* Natural disaster<br />
<br />
Although the process of failing over to the mirror can be performed quickly, it is not time-critical, because all information the CLOCKSS boxes obtain from the property server that is part of the fail-over is cached on each box. They continue to operate during a property server outage using their most recent content from the property server. The cache does not persist across daemon restarts, so a box whose daemon restarts during the fail-over process will wait until the fail-over succeeds before restarting.<br />
<br />
=== The fail-over process ===<br />
<br />
The LOCKSS Property Server mirror is synced nightly, so the fail-over process only requires updating the relevant DNS records, for example those for CLOCKSS, to point to the mirror. This is done manually; the designated LOCKSS team member logs in to the CLOCKSS domain name registrar and updates the DNS records. The time-to-live (TTL) for core records is set to 30 minutes, so it will take at most 30 minutes before any changes are fully propagated through the Internet. Access during the TTL of an unplanned fail-over may be intermittent for this reason, but will not cause problems for the LOCKSS boxes.<br />
<br />
==== Access control in the fail-over process ====<br />
<br />
Only designated LOCKSS team members have login access to the Amazon EC2 instance; there should be no need for any such login during normal or fail-over operations.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Network Administrator<br />
* Approval by LOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[LOCKSS: Software Development Process]]</div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Format_MigrationLOCKSS: Format Migration2014-04-06T21:50:58Z<p>Dshr: /* Format Migration in the CLOCKSS Archive */</p>
<hr />
<div>= LOCKSS: Format Migration =<br />
<br />
== Obsolescence of Web Formats ==<br />
<br />
Jeff Rothenberg's [http://www.sciamdigital.com/index.cfm?fa=Products.ViewIssuePreview&ARTICLEID_CHAR=07FBACA7-7185-43C8-B9E7-E07B5343F87 path-breaking pre-Web analysis of digital preservation] predicted that:<br />
* The probability of an individual format going obsolete was high.<br />
* The time that would elapse between the introduction of a new format and its obsolescence would likely be short.<br />
The OAIS reference model inherited this analysis, and its implication that expending resources now to prepare for the likely rapid obsolescence of formats was desirable.<br />
<br />
The LOCKSS technology was [http://lockss.org/locksswiki/files/Freenix2000.pdf designed from the start] specifically to preserve content published on the Web. A Web format becomes obsolete when support for it is removed from browsers. [http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html Theoretical] [http://blog.dshr.org/2009/04/spring-cni-plenary-remix.html analyses] of the mechanisms by which this would happen predicted that this would be a rare occurrence, because the incentives for doing so are weak, and the [http://blog.dshr.org/2008/01/format-obsolescence-right-here-right.html disincentives are strong]. Subsequent [http://blog.dshr.org/2012/10/formats-through-time.html practical research by Matt Holden of INA] into the renderability of Web formats that were predicted to be the most likely to suffer obsolescence, audio-visual formats from the early days of the Web, showed that 15 years later format obsolescence was negligible.<br />
<br />
Further, the alternative to format migration is emulation. The argument for format migration has always been that it would be impractical to deliver emulation to end-users. Recent work has demonstrated [http://blog.dshr.org/2013/11/in-browser-emulation.html two viable paths to delivering emulation] to readers of the types of web content preserved by LOCKSS and CLOCKSS:<br />
* A team from the University of Freiburg presented papers at [http://dx.doi.org/10.2218/ijdc.v8i1.250 IDCC2013] and [http://purl.pt/24107/1/iPres2013_PDF/Cloudy%20Emulation%20%E2%80%93%20Efficient%20and%20Scaleable%20Emulation-based%20Services.pdf iPRES2013] showing that it was possible to deliver emulation-as-a-cloud service to browsers using only HTML5, with no special plugin. What delivery method could be more convenient than embedding a live emulation in a Web page simply by pasting a link into it?<br />
* Building on earlier work by, among others [http://jpc.sourceforge.net/home_home.html the University of Oxford], running Javascript emulations of obsolete environments in the reader's browser is now routine:<br />
** The Internet Archive uses JMESS to [http://blog.archive.org/2013/10/25/microcomputer-software-lives-again-this-time-in-your-browser/ deliver emulations of ancient software].<br />
** The [https://github.com/s-macke/jor1k jor1k project] has the [http://linux.slashdot.org/story/13/11/12/161227/linux-kernel-running-in-javascript-emulator-with-graphics-and-network-support Linux kernel running] in an emulated computer implemented in Javascript.<br />
** Google's Chrome can now [http://news.cnet.com/8301-1023_3-57615373-93/google-emulates-1980s-era-amiga-computer-in-chrome/ emulate the Amiga 500] via the Native Client interface.<br />
Thus it is far from clear that, even if Web formats eventually suffer obsolescence, format migration would be necessary. By the time obsolescence might happen, it might well be that delivering a transparent emulation to the reader's browser would be the preferred method.<br />
<br />
== LOCKSS Strategy for Format Obsolescence ==<br />
<br />
Thus Web archives, such as the CLOCKSS archive, have a different model of <i>when</i> to devote resources to format migration, because:<br />
* The probability of a format going obsolete is low.<br />
* If a format does go obsolete, it will be a long time after its introduction.<br />
* It may well be that emulation, rather than format migration, would be the preferred way to deliver content in an obsolete format, were obsolescence ever to occur.<br />
Further, studies have shown that:<br />
* Digital objects in archives are [http://www.ssrc.ucsc.edu/pub/adams-ssrctr-11-01.html infrequently accessed by readers].<br />
* Storage is a [http://www.lockss.org/locksswp/wp-content/uploads/2012/09/unesco2012.pdf significant proportion of the total cost] of digital preservation.<br />
* Those digital preservation systems that perform preemptive bulk format migration do not discard the original, but store both the original and the migrated copy.<br />
Given these observations, the LOCKSS system's strategy for preserving content is:<br />
* Store, and maintain the integrity of, the original bits.<br />
* Exploit the ''content negotiation'' capabilities of the Web (and presumably any successor technology to the Web) to detect when a reader's browser does not support the original format in which the bits are stored.<br />
* If this shows it to be necessary, use [http://reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-102.pdf John Ockerbloom's ''Typed Object Model'' technology] to construct a format migration pipeline capable of migrating from the original format to a format the reader's browser can render.<br />
* Use this pipeline to generate a temporary access copy of the original in a format suitable for the reader's browser.<br />
* Discard this access copy when it is no longer needed.<br />
A framework to support this strategy was implemented in the LOCKSS software and [http://dx.doi.org/10.1045/january2005-rosenthal demonstrated in 2005]. To avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area is on hold. When there is evidence that some format of content under preservation is facing obsolescence, a decision will be taken as to whether a production version of this migration strategy is the appropriate path to take, or whether (for example) an in-browser emulation strategy would be more effective.<br />
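The access-time strategy can be sketched as follows. The MIME types, the converter table, and the dispatch logic are hypothetical illustrations of the approach demonstrated in 2005, not the daemon's actual implementation:<br />

```python
# Sketch of migrate-on-access: serve the original bits if the reader's
# browser accepts their format (as learned via content negotiation);
# otherwise build a temporary access copy through a converter pipeline.
# All types and converters here are hypothetical.

def serve(original: bytes, original_type: str, accepted: set,
          converters: dict):
    """Return (mime_type, body) suitable for the requesting browser.

    converters maps (from_type, to_type) -> conversion function.
    """
    if original_type in accepted:
        return original_type, original          # no migration needed
    for (src, dst), convert in converters.items():
        if src == original_type and dst in accepted:
            return dst, convert(original)       # temporary access copy
    raise ValueError("no pipeline to a format the browser accepts")
```

The temporary access copy produced by the converter is discarded when no longer needed; the original bits are never modified.<br />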
<br />
This strategy has a number of significant advantages:<br />
* It uses the minimum amount of storage.<br />
* It does not waste resources migrating content which is unlikely to be accessed and, if ever accessed, is unlikely to have suffered format obsolescence.<br />
* It performs any format migration that is actually necessary as late as possible, when the technology for performing it is likely to be better.<br />
* It expends resources as late as possible, exploiting the time value of money to the maximum extent.<br />
* It does not commit to format migration, which may not be the appropriate strategy at the time the reader requests access.<br />
This dual strategy of being prepared for both format migration and emulation is the approach [http://www.slideshare.net/FuturePerfect_/jeff-rothenberg-digital-preservation-perspective endorsed by Jeff Rothenberg in a March, 2012 presentation (e.g. slide 41)].<br />
<br />
== Format Migration in the CLOCKSS Archive ==<br />
<br />
The CLOCKSS archive is a dark archive. No readers (Consumers in the OAIS terminology) ever "interact with [CLOCKSS] services to find preserved information of interest and to access that information in detail". If content is ever triggered from the archive, readers access it from one of a number of re-publishing systems. Dissemination of triggered content is a transaction between the archive and one or more of these republishing systems which involves construction of a [[Definition of DIP|Dissemination Information Package]] and its transmission to the re-publishing system(s). If a subsequent reader's browser is unable to render the format in which the digital object was represented in the DIP, and is thus stored in the re-publishing system the reader is accessing, the technique described above can be applied.<br />
<br />
The LOCKSS software used by the CLOCKSS archive can integrate the [https://code.google.com/p/fits/ File Identification Tool Set (FITS)], which includes [http://jhove.sourceforge.net/ JHOVE], [http://sourceforge.net/projects/droid/ DROID] and other tools. FITS is currently configured as follows:<br />
* Priority: Put DROID at top of list, then JHOVE.<br />
* Disable NLNZ Metadata Extractor, which gives Exceptions.<br />
* Turn off validation, which is very time consuming on large files.<br />
* Configure DROID to exclude html, xhtml & pdf because it performs badly on these file types.<br />
* Configure JHOVE to exclude js because it performs badly on files of this type.<br />
Various outputs from FITS are available through the [[LOCKSS: Basic Concepts#LOCKSS Daemon|LOCKSS daemon's GUI]] for every URL in an AU. Samples of FITS output are available:<br />
* [[Media:ABER6 FITS.pdf|A harvest AU]] (80-page PDF).<br />
* [[Media:LA24 FITS.pdf|A file transfer AU]] (188-page PDF).<br />
<br />
Thus, if a format in which a digital object is stored in the archive is known at the time of a trigger event to be obsolete (in that the vast majority of browsers in general use are unable to render it) the technique described above can be applied in the process of generating the DIP by emulating a browser that cannot render the format in question. In this case the format in which the digital object is stored in the re-publishing system will be different from that in the archive, the result of a format migration of the original. The original continues to be preserved in its original format in the archive. What is stored in the re-publishing system is the temporary access copy.<br />
<br />
== Availability of Format Converters ==<br />
<br />
Any strategy for format migration, not just the one taken by the LOCKSS software, depends upon the timely availability of converters capable of transforming the doomed format into a less doomed one. As regards the Web formats preserved by LOCKSS networks and the CLOCKSS archive, the sunk investment in (and thus value of) existing Web content in format A means that format A will not be rendered obsolete by format B (i.e. support for rendering format A will not be removed from, and support for format B added to, browsers in common use) unless and until there is a suitable converter from format A to format B. Thus the risk of a format going obsolete with no suitable converter is low.<br />
<br />
Further, the Web content of LOCKSS networks and the CLOCKSS archive can be satisfactorily rendered by a completely open source stack. Thus there are open source renderers for the content, which:<br />
* makes a [http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html scenario in which the format goes obsolete implausible], and<br />
* makes it [http://blog.dshr.org/2009/01/are-format-specifications-important-for.html easy to implement a converter], since the necessary syntax and semantics are available in source code form.<br />
Note that, since in the LOCKSS approach format migration takes place at access time rather than at some earlier pre-emptive migration time, any criticism of the approach on the basis that converters might not be available applies ''a fortiori'' to the pre-emptive approach.<br />
<br />
Again, to avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area, such as integration with a registry of format converters, is on hold.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
<br />
# Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# David S. H. Rosenthal and Vicky Reich. “Permanent Web Publishing”, In Proceedings of the FREENIX Track: 2000 USENIX Annual Technical Conference. June 18-23, 2000, San Diego, California. pp. 129-140. http://lockss.org/locksswiki/files/Freenix2000.pdf<br />
# David S. H. Rosenthal. "Format Obsolescence: Scenarios", April 29, 2007 http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html<br />
# David S. H. Rosenthal. "Spring CNI Plenary: The Remix". April 10, 2009 http://blog.dshr.org/2009/04/spring-cni-plenary-remix.html<br />
# David S. H. Rosenthal. "Format Obsolescence: Right Here Right Now?" January 3, 2008 http://blog.dshr.org/2008/01/format-obsolescence-right-here-right.html<br />
# John Ockerbloom. "Mediating Among Diverse Data Formats". Tech. Rep. CMU-CS-98-102, Carnegie-Mellon University, 1998. http://reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-102.pdf<br />
# David S. H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. doi:10.1045/january2005-rosenthal<br />
# Ian Adams, Ethan L. Miller, Mark W. Storer. "Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories", Tech. Rept. UCSC-SSRC-11-01, University of California, Santa Cruz, March 2011 http://www.ssrc.ucsc.edu/pub/adams-ssrctr-11-01.html<br />
# David S. H. Rosenthal, Daniel C. Rosenthal, Ethan L. Miller, Ian F. Adams, Mark W. Storer, Erez Zadok. "The Economics of Long-Term Digital Storage", Memory of the World in the Digital Age, Vancouver, BC, September 2012. http://www.lockss.org/locksswp/wp-content/uploads/2012/09/unesco2012.pdf<br />
# David S. H. Rosenthal. "Formats Through Time" October 9, 2012 http://blog.dshr.org/2012/10/formats-through-time.html<br />
# David S. H. Rosenthal. "Are Format Specifications Important For Preservation?" January 4, 2009 http://blog.dshr.org/2009/01/are-format-specifications-important-for.html<br />
# Dirk von Suchodoletz, Klaus Rechert, Isgandar Valizada "Towards Emulation-as-a-Service: Cloud Services for Versatile Digital Object Access" http://dx.doi.org/10.2218/ijdc.v8i1.250<br />
# I. Valizada, K. Rechert, K. Meier, D. Wehrle, D. v. Suchodoletz and L. Sabel "Cloudy Emulation – Efficient and Scalable Emulation-based Services" http://purl.pt/24107/1/iPres2013_PDF/Cloudy%20Emulation%20%E2%80%93%20Efficient%20and%20Scaleable%20Emulation-based%20Services.pdf<br />
# JPC project "JPC: The Pure Java x86 PC Emulator" http://jpc.sourceforge.net/home_home.html<br />
# Jason Scott "Microcomputer Software Lives Again, This Time in Your Browser" http://blog.archive.org/2013/10/25/microcomputer-software-lives-again-this-time-in-your-browser/<br />
# Sebastian Macke "jor1k project" https://github.com/s-macke/jor1k<br />
# warmflatsprite "Linux Kernel Running In JavaScript Emulator With Graphics and Network Support" http://linux.slashdot.org/story/13/11/12/161227/linux-kernel-running-in-javascript-emulator-with-graphics-and-network-support<br />
# Steven Shankland "Google emulates 1980s-era Amiga computer in Chrome" http://news.cnet.com/8301-1023_3-57615373-93/google-emulates-1980s-era-amiga-computer-in-chrome/</div>Dshrhttp://documents.clockss.org/index.php/File:ClockssTaylorAndFrancisPlugin.xml.pdfFile:ClockssTaylorAndFrancisPlugin.xml.pdf2014-04-06T21:47:04Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/LOCKSS:_Basic_ConceptsLOCKSS: Basic Concepts2014-04-06T21:45:38Z<p>Dshr: Initial version</p>
<hr />
<div>= LOCKSS: Basic Concepts =<br />
<br />
This document introduces some basic concepts of the LOCKSS technology that are needed to understand the remainder of the documentation.<br />
<br />
== LOCKSS Daemon ==<br />
<br />
The LOCKSS daemon is a large (>200K lines of code) Java program that turns a generic Linux system into a digital preservation appliance called a LOCKSS box. The LOCKSS daemon is the only application program that runs in a LOCKSS box. Every action of a LOCKSS box, for ingest, preservation, dissemination and administration is performed by the LOCKSS daemon. The LOCKSS daemon is administered via a Web interface that allows authorized administrators to direct it to collect content, control how that content is disseminated, and monitor the daemon's performance. Among the functions performed by the LOCKSS daemon are:<br />
* Ingest via Web crawling, or file import.<br />
* Preservation via the [[LOCKSS: Polling and Repair Protocol|LOCKSS: Polling and Repair Protocol]].<br />
* Dissemination by acting as both a Web server and a Web proxy, and by file export.<br />
* Administration via a Web user interface.<br />
* Status and statistics reporting.<br />
Because all access to preserved content is mediated by the LOCKSS daemon, the physical representation of its internal data structures, such as how content and metadata are stored, is essentially of academic interest only. In particular, the fundamental abstraction that the LOCKSS daemon presents is not that it preserves ''files''. It preserves ''URLs''; their content and their associated headers (metadata) as a unit, although we often casually refer to these (content, header) pairs as "files". Their internal representations are not visible to those using the system to ingest or disseminate content, but only to those administering the underlying system. Only in exceptional circumstances does an administrator log in to the underlying operating system; all routine and normal diagnostic operations are performed through the Web interface.<br />
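The fundamental abstraction described above can be sketched as a mapping from URL to its (content, header) pair. This is a minimal illustration only; the daemon's actual internal representation is, as noted, not visible to users of the system:<br />

```python
# Minimal sketch of the daemon's fundamental abstraction: each
# preserved unit is a URL whose content and associated headers are
# kept together. (Illustrative; not the daemon's real storage format.)

store = {}

def preserve(url, headers, content):
    """Record a (content, header) pair for a URL."""
    store[url] = {"headers": dict(headers), "content": bytes(content)}

def fetch(url):
    """Return the preserved (headers, content) pair for a URL."""
    entry = store[url]
    return entry["headers"], entry["content"]
```

Dissemination as a Web server or proxy then reduces to looking up the requested URL and replaying its headers and content.<br />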
<br />
== Plugins ==<br />
<br />
The behavior of the LOCKSS daemon is generic. It must be adapted to the requirements of the particular content it is to preserve. This is done via the "plugin" for that content, which is an instance of a Java class. In most cases, it is an instance of class DefinablePlugin whose behavior has been customized by parameters in an XML file; colloquially this XML file is often referred to as "the plugin" because it contains all the information that distinguishes this plugin from another that also uses DefinablePlugin. This information includes, for example, the classes which DefinablePlugin can use to extract metadata from the relevant content. A plugin (the class plus the parameters) represents a class of content, such as "content published on HighWire's H2O platform".<br />
<br />
== Archival Units ==<br />
<br />
Here is an [[Media:ClockssTaylorAndFrancisPlugin.xml.pdf|example plugin XML file]], for Taylor and Francis journals. It defines the class of content "published by Taylor and Francis". There are many journals in this class, and content is continually being added to them, so for operational convenience we divide the class into Archival Units (AUs) representing, typically, a year or a volume of a journal. Each AU is defined by the plugin class name, in this case <tt>org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin</tt>, and a set of definitional parameters defined by the XML file, in this case:<br />
* <tt>base_url</tt><br />
* <tt>journal_id</tt><br />
* <tt>volume_name</tt><br />
<br />
For example, the plugin specifies that for an AU of a particular journal identified by these parameters, crawling should start at <tt>au_start_url</tt>:<br />
<pre><br />
<entry><br />
<string>au_start_url</string><br />
<string>&quot;%sclockss/%s/%s/index.html&quot;, base_url, journal_id, volume_name</string><br />
</entry><br />
</pre><br />
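As a hedged illustration (not the LOCKSS implementation), the printf-style template above can be expanded in Python; the parameter values used are those from the Taylor and Francis example:<br />

```python
# Sketch: expand the plugin's printf-style au_start_url template with an
# AU's definitional parameters. The template string is copied from the
# plugin XML above; the function name is illustrative.
def au_start_url(base_url, journal_id, volume_name):
    return "%sclockss/%s/%s/index.html" % (base_url, journal_id, volume_name)

start = au_start_url("http://www.tandfonline.com/", "taer20", "6")
# → "http://www.tandfonline.com/clockss/taer20/6/index.html"
```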
<br />
== Title Database ==<br />
<br />
The values for these parameters come from the Title Data Base (TDB), which is not actually a database, but a knowledge base represented as a set of text files in an easy-to-edit syntax that are processed into an XML file that is obtained by the LOCKSS daemon. For each AU in the system, there is a TDB entry providing the plugin class name and a (name, value) pair for each of the parameters defined by that plugin class that are different from the default. The TDB entry for Advances in Building Energy Research, defining its AUs from 2007-2014, looks like this:<br />
<pre><br />
{<br />
<br />
publisher <<br />
name = Taylor & Francis ;<br />
info[contract] = 2008 ;<br />
info[tester] = A<br />
><br />
<br />
plugin = org.lockss.plugin.taylorandfrancis.ClockssTaylorAndFrancisPlugin<br />
param[base_url] = http://www.tandfonline.com/<br />
implicit < status ; status2 ; year ; name ; param[volume_name] ><br />
...<br />
<br />
{<br />
<br />
title <<br />
name = Advances in Building Energy Research ;<br />
issn = 1751-2549 ;<br />
eissn = 1756-2201 ;<br />
issnl = 1751-2549<br />
><br />
<br />
param[journal_id] = taer20<br />
<br />
au < manifest ; exists ; 2007 ; Advances in Building Energy Research Volume 1 ; 1 ><br />
au < manifest ; exists ; 2008 ; Advances in Building Energy Research Volume 2 ; 2 ><br />
au < manifest ; exists ; 2009 ; Advances in Building Energy Research Volume 3 ; 3 ><br />
au < zapped ; finished ; 2010 ; Advances in Building Energy Research Volume 4 ; 4 ><br />
au < finished ; crawling ; 2011 ; Advances in Building Energy Research Volume 5 ; 5 ><br />
au < finished ; crawling ; 2012 ; Advances in Building Energy Research Volume 6 ; 6 ><br />
au < crawling ; exists ; 2013 ; Advances in Building Energy Research Volume 7 ; 7 ><br />
au < expected ; exists ; 2014 ; Advances in Building Energy Research Volume 8 ; 8 ><br />
<br />
}<br />
...<br />
}<br />
</pre><br />
<br />
The definitional parameters are specified as follows:<br />
* <tt>base_url</tt> is <tt>http://www.tandfonline.com/</tt> for all Taylor and Francis journals specified at the top.<br />
* <tt>journal_id</tt> is <tt>taer20</tt> specified in the section for Advances in Building Energy Research.<br />
* <tt>volume_name</tt> is specified by the 5th column of the table.<br />
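The three sources above can be modeled as layered dictionaries, with more specific levels overriding more general ones; this is a hedged sketch of how the levels combine, not the actual TDB-processing code:<br />

```python
def resolve_au_params(publisher_level, title_level, au_level):
    # Later (more specific) levels override earlier ones: base_url from
    # the publisher block, journal_id from the title block, volume_name
    # from the AU row in the example TDB entry.
    params = {}
    for level in (publisher_level, title_level, au_level):
        params.update(level)
    return params

params = resolve_au_params(
    {"base_url": "http://www.tandfonline.com/"},
    {"journal_id": "taer20"},
    {"volume_name": "6"},
)
```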
<br />
The text form of the TDB is preserved in the LOCKSS source code repository at SourceForge, which is backed up each night to an on-site and an off-site system, both maintained by the LOCKSS team, in addition to SourceForge's backups. There is a copy of the XML form of the TDB for each LOCKSS network on each LOCKSS box in the network, in addition to the copy on the [[LOCKSS: Property Server Operations|Property Server]] and its backup in the Amazon cloud.<br />
<br />
== AUID ==<br />
<br />
Everywhere an AU needs to be uniquely identified, we use an internal name, its Archival Unit ID (AUID), as the means to do so, for example as a key in maps and databases, or in the messages of the [[LOCKSS: Polling and Repair Protocol|LOCKSS: Polling and Repair Protocol]]. The AUID for an AU is an immutable string containing an encoded representation of:<br />
* The fully-qualified Java class name of the plugin.<br />
* &<br />
* For each of the definitional parameters defined by the plugin XML:<br />
** The parameter name.<br />
** ~<br />
** The parameter value.<br />
** & (except for the last parameter)<br />
Because it contains the class of the plugin and all the definitional parameters, the AUID is unique to an AU irrespective of which box it is on.<br />
<br />
The AUID for the AU for Volume 6 of Advances in Building Energy Research, defined by the TDB entry above, and used as the example in [[Definition of AIP#Harvest AU|Definition of AIP]] is:<br />
<pre><br />
org|lockss|plugin|taylorandfrancis|ClockssTaylorAndFrancisPlugin&base_url~http%3A%2F%2Fwww%2Etandfonline%2Ecom%2F&journal_id~taer20&volume_name~6<br />
</pre><br />
<br />
Archival Units also have an external "AU name", which is a human-readable string used in Web pages and reports but for no other purpose.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist</div>Dshrhttp://documents.clockss.org/index.php/Main_PageMain Page2014-04-06T21:40:29Z<p>Dshr: /* LOCKSS Program Documents */</p>
<hr />
<div>= CLOCKSS Archive =<br />
<br />
Welcome to the documentation Wiki of the [http://www.clockss.org CLOCKSS Archive].<br />
<br />
== CLOCKSS Archive Documents ==<br />
<br />
These documents describe the organization, policies, practices and plans of the CLOCKSS Archive:<br />
* [[CLOCKSS: Mission Statement]]<br />
* [[CLOCKSS: Governance and Organization]]<br />
* [[CLOCKSS: Budget and Planning Process]]<br />
* [[CLOCKSS: Business Plan Overview]]<br />
* [[CLOCKSS: Business History]]<br />
* [[CLOCKSS: Collection Development]]<br />
* [[CLOCKSS: Preservation Strategy]]<br />
* [[CLOCKSS: Access Policy]]<br />
* [[CLOCKSS: Succession Plan]]<br />
* [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]<br />
* [http://www.clockss.org/clockss/Contribute_to_CLOCKSS CLOCKSS: Fees]<br />
<br />
== LOCKSS Program Documents ==<br />
<br />
These documents describe the [http://www.lockss.org LOCKSS technology]:<br />
* [[LOCKSS: Basic Concepts]]<br />
* [[LOCKSS: Polling and Repair Protocol]]<br />
* [[LOCKSS: Format Migration]]<br />
* [[LOCKSS: Metadata Database]] <br />
* [[LOCKSS: Extracting Bibliographic Metadata]]<br />
* [[LOCKSS: Software Development Process]]<br />
* [[LOCKSS: Property Server Operations]]<br />
<br />
== LOCKSS Adaptations to CLOCKSS Archive Documents ==<br />
<br />
These documents describe how the LOCKSS technology is used by the CLOCKSS Archive:<br />
* [[CLOCKSS: Threats and Mitigations]]<br />
* [[CLOCKSS: Logging and Records]]<br />
* [[CLOCKSS: Ingest Pipeline]]<br />
* [[CLOCKSS: Box Operations]]<br />
* [[CLOCKSS: Extracting Triggered Content]]<br />
* [[CLOCKSS: Hardware and Software Inventory]]<br />
<br />
== OAIS Conformance Documents ==<br />
<br />
These documents describe the mapping between the CLOCKSS Archive and the [http://public.ccsds.org/publications/archive/650x0m2.pdf OAIS Reference Architecture]:<br />
* [[CLOCKSS: Designated Community]]<br />
* [[CLOCKSS: Mandatory Responsibilities]]<br />
* [[Definition of SIP]]<br />
* [[Definition of AIP]]<br />
* [[Definition of DIP]]<br />
<br />
== Background ==<br />
<br />
During the preparation for the [http://www.iso.org/iso/catalogue_detail.htm?csnumber=56510 ISO 16363 audit] of the CLOCKSS Archive, it was decided to make as much documentation of the CLOCKSS Archive public as possible. This wiki is the result. The contents have been collected from a variety of sources, including:<br />
* Published papers<br />
* CLOCKSS board minutes and other Board documents<br />
* The LOCKSS team's internal wiki<br />
* The LOCKSS team's ticketing and bug tracking systems<br />
Many of these sources were not intended to be made public, and contained confidential or inappropriate material. The contents of this wiki were extracted from them and reviewed for publication as part of the audit preparations. These pages document:<br />
* The structure, policies and practices of the CLOCKSS Archive.<br />
* The conformance of the CLOCKSS Archive to the OAIS Reference Model<br />
* The policies, practices and technology of the LOCKSS Program, which operates the CLOCKSS Archive under contract to the CLOCKSS Board.<br />
* The adaptations made to the generic LOCKSS technology for the purposes of the CLOCKSS Archive.<br />
As structures, policies, practices and technologies change, these documents will be maintained so that up-to-date information on these topics is available to the public.<br />
<br />
== ISO 16363 Criteria ==<br />
<br />
For the purposes of the audit, the wiki also contains a page for each of the ISO 16363 criteria. The goal for these pages was that they be a finding aid for the auditors, allowing them to easily locate the relevant parts of the documents for each criterion, but that all actual content be in the documents. Interested readers can thus use the documents without needing to refer to the criteria pages:<br />
<br />
* [[3) Organizational Infrastructure]]<br />
* [[4) Digital Object Management]]<br />
* [[5) Infrastructure and Security Risk Management]]<br />
<br />
== Confidential Documents ==<br />
<br />
The auditors requested additional information, some of which was confidential:<br />
* Requested information that was not confidential was added to the documents in this Wiki. For example, URL lists and metadata for sample AUs were added to [[Definition of AIP]].<br />
* Requested information that was confidential was supplied in a separate Wiki which was deactivated at the end of the audit.<br />
<br />
== CLOCKSS Permission Statement ==<br />
<br />
CLOCKSS system has permission to ingest, preserve, and serve this Archival Unit.</div>Dshrhttp://documents.clockss.org/index.php/File:AU-Structure.pngFile:AU-Structure.png2014-04-05T21:38:42Z<p>Dshr: </p>
<hr />
<div></div>Dshrhttp://documents.clockss.org/index.php/Definition_of_AIPDefinition of AIP2014-04-05T21:31:54Z<p>Dshr: /* Creating an AIP from a file transfer SIP */</p>
<hr />
<div>= CLOCKSS Definition of AIP =<br />
<br />
== OAIS Archival Information Package (AIP) ==<br />
<br />
The OAIS definition of AIP is:<br />
<blockquote>Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>AIP Edition: An AIP whose Content Information or Preservation Description Information has been upgraded or improved with the intent not to preserve information, but to increase or improve it. An AIP edition is not considered to be the result of a Migration.</blockquote><br />
<blockquote>AIP Version: An AIP whose Content Information or Preservation Description Information has undergone a Transformation on a source AIP and is a candidate to replace the source AIP. An AIP version is considered to be the result of a Digital Migration.</blockquote><br />
<blockquote>Archival Information Collection (AIC): An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages.</blockquote><br />
<blockquote>Archival Information Unit (AIU): An Archival Information Package where the Archive chooses not to break down the Content Information into other Archival Information Packages. An AIU can consist of multiple digital objects (e.g., multiple files).</blockquote><br />
<blockquote>Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.</blockquote><br />
The OAIS discussion of AIP is:<br />
<blockquote>Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The AIP may also contain a collection of other AIPs, and this is discussed and modeled in section 4. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.</blockquote><br />
<br />
The OAIS definition of Representation Information is:<br />
<blockquote>'''Representation Information:''' The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard.<br />
<br />
Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>'''Representation Network:''' The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.</blockquote><br />
<blockquote>'''Representation Rendering Software:''' A type of software that displays Representation Information of an Information Object in forms understandable to humans.</blockquote><br />
The OAIS discussion of Representation Information is:<br />
<blockquote>In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.</blockquote><br />
<br />
== CLOCKSS Archival Information Package (AIP) ==<br />
<br />
CLOCKSS calls its AIPs Archival Units (AUs). They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL name, since <tt>#</tt> introduces a fragment identifier in a URL) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to each version of the content. This metadata is represented as a file containing {key, value} pairs. It will include all the HTTP headers obtained by the <tt>GET</tt> that obtained the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It will also include additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum and a timestamp. This somewhat complex representation is required because:<br />
* The URLs <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> may both contain (different) content.<br />
* The content and/or the headers obtained from <tt>http://www.example.com/foo/bar</tt> at time T(0) and at time T(1) may differ.<br />
The representation allows for easy access by tools other than the LOCKSS daemon, for example shell scripts.<br />
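A minimal sketch of the URL-to-directory mapping just described; the exact escaping rules and version naming in the real repository differ, so the layout here is illustrative only:<br />

```python
from urllib.parse import urlsplit

def au_relative_path(url):
    # Map a collected URL to a directory path under the AU's root: one
    # directory per URL component, with the current version's payload
    # kept in a '#content' subdirectory (a simplified sketch of the
    # layout described above, not the exact LOCKSS repository format).
    parts = urlsplit(url)
    components = [parts.netloc] + [c for c in parts.path.split("/") if c]
    return "/".join(components + ["#content", "current"])
```

Note how <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> map to distinct directories, each able to hold its own <tt>#content</tt>.<br />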
<br />
=== CLOCKSS AIP Examples ===<br />
<br />
==== File Transfer AU ====<br />
<br />
The Journal of Laser Applications Volume 24:<br />
* [[Media:LA24 urls.pdf|A list of URLs (87-page PDF)]].<br />
* [[Media:LA24 metadata.pdf|The metadata]].<br />
<br />
==== Harvest AU ====<br />
<br />
Advances in Building Energy Research Volume 6:<br />
* [[Media:ABER V6 urls.pdf|A list of URLs (43-page PDF)]].<br />
* [[Media:ABER V6 metadata.pdf|The metadata]].<br />
<br />
=== CLOCKSS Preservation Description Information (PDI) ===<br />
<br />
The OAIS Reference Model classifies the Preservation Description Information (PDI) included in an AIP as follows:<br />
<ul><li>'''Provenance:''' the provenance of each version of each URL in the AU can be determined as follows. The content of that version was obtained from the URL represented by the path from the root of the AU's directory hierarchy to the parent of the content directory, at the time of the timestamp. If this content was the result of a repair from another CLOCKSS box, that is recorded in the metadata. </li><br />
<li>'''Fixity:''' the metadata for each version of each URL in the AU includes a checksum computed at the time it was obtained and, if one is available, a checksum provided by the Web server from which it was obtained.</li><br />
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:<br />
<ul><br />
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the plugin ID.</li><br />
<li>The ''plugin ID'', which, in encoded form, identifies the class to be instantiated and supplies additional information.</li><br />
</ul><br />
In effect, the context for the AU is a customized instance of a Java class, normally referred to as its ''plugin''. It is thus executable, capable of performing operations on the AU such as [[Definition of AIP#Creating AIPs from SIPs|adding content and metadata from a SIP]], [[LOCKSS: Extracting Bibliographic Metadata|extracting metadata]], and taking part in [[LOCKSS: Polling and Repair Protocol|integrity checks]].<br /><br />
The context for the content of a URL in an AU consists of the associated metadata, including the <tt>Content-Type</tt> and the other HTTP headers which together provide the information a Web browser uses to render the content.<br /><br />
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li><br />
<li>'''Reference:''' Each AU has an immutable internal name, computed from its context information and stored with it. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li><br />
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul><br />
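The Fixity item above amounts to recomputing the stored digest and comparing; a sketch, with the algorithm name illustrative (the stored metadata records which algorithm was actually used):<br />

```python
import hashlib

def verify_fixity(payload, stored_digest, algorithm="sha1"):
    # Recompute the checksum recorded at ingest time for this version of
    # this URL and compare it with the stored value.
    return hashlib.new(algorithm, payload).hexdigest() == stored_digest
```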
<br />
== Creating AIPs from SIPs ==<br />
<br />
When the AU is created, its root directory is created, and the context information (plugin ID and AU parameters) is recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.<br />
<br />
Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are [[Definition of SIP|two kinds of SIP]]:<br />
* A ''harvest'' SIP represents content that the CLOCKSS archive will ingest by crawling the publisher's web site.<br />
* A ''file transfer'' SIP represents content that the publisher will package and transfer to the CLOCKSS archive via FTP, rsync, or other file transfer mechanism.<br />
<br />
The process of creating an AU (AIP) from a SIP is different for each of the two types of SIP, but each process starts by ''configuring'' the AU on the CLOCKSS boxes, which involves supplying the context information (plugin ID and the parameters it requires) via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. Each box creates a root directory for its instance of the AU at a suitable place in its POSIX file system and calls the AU's plugin to arrange attempts to collect SIPs. The plugin contains information about times when content collection is allowed. The LOCKSS daemon on each box maintains a schedule of all AUs' collection attempts; when an AU requests a collection attempt it is scheduled so as to conform to box-wide and publisher-specific limits on the number of simultaneous collection attempts.<br />
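The scheduling constraint described above (box-wide and publisher-specific caps on simultaneous collection attempts) can be sketched as follows; the class name and limit values are illustrative assumptions, not the LOCKSS scheduler's:<br />

```python
class CrawlScheduler:
    # Illustrative caps; the actual limits are configuration-dependent.
    def __init__(self, box_limit=6, per_publisher_limit=2):
        self.box_limit = box_limit
        self.per_publisher_limit = per_publisher_limit
        self.active = []  # one entry per running collection attempt

    def try_start(self, publisher):
        # Admit a collection attempt only if both limits allow it.
        if len(self.active) >= self.box_limit:
            return False
        if self.active.count(publisher) >= self.per_publisher_limit:
            return False
        self.active.append(publisher)
        return True

    def finish(self, publisher):
        self.active.remove(publisher)
```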
<br />
AIP (AU) instances in CLOCKSS boxes may be created even before the first SIP supplying content for them is available. The instance continues to accumulate content via a sequence of SIPs becoming available through time, as the LOCKSS daemon repeatedly crawls the Web server from which SIPs are collected. Nominally, each AU represents a delimited span of time, such as a year or a volume of a journal. But because SIPs containing errata or corrections can arrive even after the delimited span of time, there is in general never a point in time at which the AU can be said to be definitively complete in the sense that no further content will ever be added.<br />
<br />
Although, as documented in [[CLOCKSS: Ingest Pipeline]], the quality assurance process that content undergoes as it is ingested into the CLOCKSS archive includes some visual spot checks that the content renders properly in a web browser, these are primarily intended to assure that all necessary URLs are being harvested. At the scale of the CLOCKSS archive's operations it is not feasible for these checks to be exhaustive. Further, the Content Information in an AIP is copyrighted by the publisher. It is their responsibility to ensure that it is Independently Understandable for their readers, who are also the eventual Consumers of the content if it is ever triggered from the CLOCKSS archive. Even if these spot checks were to detect some rendering problems, the CLOCKSS archive would not be permitted to modify the Content Information to correct them.<br />
<br />
The goal of the [[CLOCKSS: Ingest Pipeline]] is not to ensure that the AUs are "complete and correct"; there are no operationally implementable definitions of those terms. If there were, they would involve second-guessing the publishers. The goal is rather to ensure that the AUs "faithfully reflect what the publisher has published on their web site (for harvested AUs) or supplied to the archive (for file transfer AUs)".<br />
<br />
=== Creating an AIP from a harvest SIP ===<br />
<br />
When the scheduled time for the AU's requested collection attempt arrives, the plugin configures the box's Web crawler, which first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site and then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly-discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].<br />
<br />
Note that for quality assurance (QA) reasons, as set out in [[CLOCKSS: Ingest Pipeline]], in practice each harvest AU is created twice:<br />
* First, a temporary AU is created on the CLOCKSS ingest machines. These machines collect the AU's current content, and then come to agreement on the content using the [[LOCKSS: Polling and Repair Protocol]].<br />
* Once agreement is reached, the permanent AU is created on each of the production CLOCKSS boxes. They then:<br />
** crawl the content from the CLOCKSS ingest boxes<br />
** start regular integrity checks with the other production CLOCKSS boxes using the [[LOCKSS: Polling and Repair Protocol]].<br />
The process the production boxes use is exactly the same as used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the publisher's URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will be that previously collected from the publisher's URL by the ingest box (see [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|Harvest Content Processing]]). Note that the publisher's web site is not involved in this process. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the Title DataBase (TDB) to ZAPPED (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.<br />
<br />
=== Creating an AIP from a file transfer SIP ===<br />
<br />
Publishers choosing to supply content via file transfer can choose either:<br />
* To ''push'' the content via <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> to a CLOCKSS-run ingest server, using a user name and password chosen by the CLOCKSS team and specific to the publisher. <br />
* To have the CLOCKSS archive's ingest server ''pull'' the content from a publisher-run <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> server using a user name and password chosen by, and specific to, the publisher.<br />
In both cases, the combination of the DNS name at the publisher's end, the user name, and the password identifies the publisher for provenance purposes.<br />
<br />
The content and metadata received via file transfer typically represents input to the journal's publishing platform. There is typically no one-to-one relationship between the files received and URLs on the publishing platform. It typically does not contain the HTML, CSS, Javascript and so on, or the URLs, that are present in harvest content and would permit a fairly accurate replica of the journal's website if the content were ever triggered. Instead, the process described in [[CLOCKSS: Extracting Triggered Content#Preparing_File_Transfer_Content_for_Dissemination|CLOCKSS: Extracting Triggered Content]] would create a new, usable website with the journal's "Intellectual Content" (as defined by Portico).<br />
<br />
Content and metadata in SIPs received via file transfer is organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the ''staging server''. The hierarchy contains a directory per publisher containing a directory per year that contains the content and metadata received from that publisher during that year, and an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information, for those publishers fixity information is computed as soon as possible after file transfer and stored with the content.<br />
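The automatically generated manifest page might look like the output of this sketch; the HTML layout and function name are assumptions, while the permission statement text is as given elsewhere in this wiki:<br />

```python
import os

PERMISSION = ("CLOCKSS system has permission to ingest, preserve, "
              "and serve this Archival Unit.")

def manifest_page(year_dir):
    # One link per file received from the publisher that year, plus the
    # CLOCKSS permission statement the crawler checks before collecting.
    links = "\n".join('<a href="%s">%s</a>' % (name, name)
                      for name in sorted(os.listdir(year_dir)))
    return "<html><body>\n%s\n<p>%s</p>\n</body></html>" % (links, PERMISSION)
```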
<br />
At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's plugin is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the URLs in the AIPs point to the files in the SIPs on the staging server.<br />
<br />
After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its [[LOCKSS: Basic Concepts#Title Database|TDB]] status changed to ZAPPED. A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved [[CLOCKSS: Logging and Records#External Reports|process for generating external reports]].<br />
<br />
== CLOCKSS Representation Information ==<br />
<br />
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS archive is preserved outside the CLOCKSS archive, primarily in open source code repositories.<br />
<br />
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS archive, in the SourceForge repository and elsewhere.<br />
<br />
Thus the Representation Information that is necessary to preserve with content objects in a CLOCKSS AIP is that information needed by web browsers to render, and the LOCKSS software to migrate, the content obtained from a URL. This information is in two parts:<br />
* The <tt>Content-Type</tt> and other HTTP headers obtained from the URL. The metadata preserved for each version of each URL within an AU includes all HTTP headers.<br />
* "Magic Number" and other information contained in the HTTP content payload itself. The content preserved for each version of each URL within an AU is the entire HTTP content payload.<br />
A web browser has no other information upon which to base its rendering of the content of a URL than its name (perhaps with a file extension) and these two parts, which must therefore be adequate for the purpose of making the content understandable to Consumers.<br />
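The "Magic Number" part of this Representation Information can be illustrated with a minimal signature check; the signature list is a small sample drawn from the formats' public specifications, not the LOCKSS format-detection code:<br />

```python
def sniff_media_type(payload):
    # Identify a payload by its leading "magic number" bytes, falling
    # back to the generic type when no signature matches.
    signatures = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff", "image/jpeg"),
        (b"GIF8", "image/gif"),
    ]
    for magic, media_type in signatures:
        if payload.startswith(magic):
            return media_type
    return "application/octet-stream"
```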
<br />
== Locating Digital Objects ==<br />
<br />
The term "digital objects" is often used in discussions of preserved digital information. In the CLOCKSS context, it might refer to either of two types of object:<br />
* AIPs, or in the CLOCKSS context AUs.<br />
* Content objects within an AU.<br />
<br />
=== Locating AIPs ===<br />
<br />
The CLOCKSS archive has a single class of AIP, called an Archival Unit (AU). As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], the Reference part of the AU's Preservation Description Information is an immutable name that is the same on each CLOCKSS box. The location of the instance of the AU on a particular CLOCKSS box may be obtained by querying a map from this internal name to the path to the AU's root directory. This map is built during box startup and subsequently maintained as new AUs are created, or AUs moved.<br />
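The map described above can be sketched as follows (class and method names are hypothetical; this is not the LOCKSS code, merely an illustration of a map from immutable internal AU names to per-box root directories):<br />

```python
# Sketch (hypothetical names): the map from an AU's immutable internal
# name to the root directory of its instance on one CLOCKSS box. The map
# is built at box startup and updated as AUs are created or moved.
import os

class AuLocator:
    def __init__(self):
        self._roots = {}  # internal AU name -> path to AU root directory

    def scan(self, repo_root, entries):
        """Build the map at startup; `entries` pairs internal AU names
        with their directories relative to the repository root."""
        for au_name, rel_dir in entries:
            self._roots[au_name] = os.path.join(repo_root, rel_dir)

    def add(self, au_name, root):
        """Record a newly created or moved AU."""
        self._roots[au_name] = root

    def root_of(self, au_name):
        """Return this box's root directory for the named AU."""
        return self._roots[au_name]
```
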
<br />
=== Locating content objects ===<br />
<br />
Content within an AU can be located in one of two ways:<br />
* Via the URL from which it was obtained. As [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|described above]], there is a reversible mapping between the URI from which the content was collected and the path from the root of the AU containing it in the POSIX file system containing the AU. This enables content within an AU to be located via the map between the AU's internal name and its root location, and the components of the URL.<br />
* Via metadata search. As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], content within an AU can be located by querying the [[LOCKSS: Metadata Database|metadata database]] using specific bibliographic metadata fields to match against the [[LOCKSS: Extracting Bibliographic Metadata|bibliographic metadata supplied by the publisher]] or derived from the AU's context. <br />
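The first mechanism relies on the URL-to-path mapping being reversible. A simplified sketch (not the actual LOCKSS repository layout, and ignoring details such as query strings) conveys the idea:<br />

```python
# Sketch, not the actual LOCKSS repository layout: a reversible mapping
# between the URL a file was collected from and its path relative to the
# root of the AU containing it. Query strings and escaping are omitted
# for simplicity.
from urllib.parse import urlsplit, urlunsplit

def url_to_relpath(url: str) -> str:
    """Map a collected URL to a path relative to the AU root."""
    parts = urlsplit(url)
    return "/".join([parts.scheme, parts.netloc]) + parts.path

def relpath_to_url(relpath: str) -> str:
    """Invert url_to_relpath, recovering the original URL."""
    scheme, netloc, path = relpath.split("/", 2)
    return urlunsplit((scheme, netloc, "/" + path, "", ""))
```

Because the mapping is reversible, the preserved files themselves record where each object was collected from, independent of any external database.<br />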
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# [[Definition of SIP]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[CLOCKSS: Designated Community]]<br />
# [[LOCKSS: Format Migration]]<br />
# David S.H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. http://dx.doi.org/10.1045/january2005-rosenthal accessed 2013.8.7</div>Dshrhttp://documents.clockss.org/index.php/Definition_of_SIPDefinition of SIP2014-04-05T19:30:07Z<p>Dshr: /* Creating AIPs from SIPs */</p>
<hr />
<div>= CLOCKSS Definition of SIP =<br />
<br />
== OAIS Submission Information Package (SIP) ==<br />
<br />
The OAIS definition of a SIP is:<br />
<blockquote>Submission Information Package (SIP): An Information Package that is delivered by the<br />
Producer to the OAIS for use in the construction or update of one or more AIPs and/or the<br />
associated Descriptive Information.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>Data Submission Session: A delivery of media or a single telecommunications session that<br />
provides Data to an OAIS. The Data Submission Session format/contents is based on a data<br />
model negotiated between the OAIS and the Producer in the Submission Agreement. This<br />
data model identifies the logical constructs used by the Producer and how they are<br />
represented on each media delivery or in the telecommunication session.</blockquote><br />
<blockquote>Submission Agreement: The agreement reached between an OAIS and the Producer that<br />
specifies a data model, and any other arrangements needed, for the Data Submission Session.<br />
This data model identifies format/contents and the logical constructs used by the Producer<br />
and how they are represented on each media delivery or in a telecommunication session.</blockquote><br />
<blockquote>Ingest Functional Entity: The OAIS functional entity that contains the services and<br />
functions that accept Submission Information Packages from Producers, prepares Archival<br />
Information Packages for storage, and ensures that Archival Information Packages and their<br />
supporting Descriptive Information become established within the OAIS.</blockquote><br />
The discussion of SIPs in OAIS is:<br />
<blockquote>The Submission Information Package (SIP) is that package that is sent to an OAIS by a<br />
Producer. Its form and detailed content are typically negotiated between the Producer and<br />
the OAIS (see related standards in 1.5). Most SIPs will have some Content Information and<br />
some PDI.</blockquote><br />
<blockquote>The relationships between SIPs and AIPs can be complex; as well as a simple one-to-one<br />
relationship in which one SIP produces one AIP, other possibilities include: one AIP being<br />
produced from multiple SIPs produced at different times by one Producer or by many<br />
Producers; one SIP resulting in a number of AIPs; and many SIPs from one or more sources<br />
being unbundled and recombined in different ways to produce many AIPs. Even in the first<br />
case, the OAIS may have to perform a number of transformations on the SIP. The Packaging<br />
Information will always be present in some form.</blockquote><br />
<br />
== CLOCKSS Submission Information Package (SIP) ==<br />
<br />
The majority of the content the CLOCKSS archive is chartered to preserve is electronic journals, which are serials. Serials publishers (Producers in OAIS terminology) emit a continuous stream of articles through time, together in most cases with metadata that organizes the articles into a logical structure of issues, volumes and journals. Decisions to trigger content from the CLOCKSS archive will normally be taken at the granularity of a journal, or perhaps of a range of volumes. It is thus important to preserve this logical structure; the CLOCKSS archive does so by organizing content into Archival Units (AUs) which generally correspond to a volume or a year of a journal, but in some cases may correspond to a year of multiple journals. An AU, together with some related information, forms the CLOCKSS archive's AIP.<br />
<br />
Publishers submitting content to the CLOCKSS archive choose one of two ways to do so:<br />
* Harvest, in which the CLOCKSS archive collects the content the publisher supplies to readers from their web site shortly after it is published (see [[CLOCKSS: Ingest Pipeline#Harvest Content Pipeline|CLOCKSS: Ingest Pipeline]]).<br />
* File transfer, in which the publisher packages up content and metadata in a form they define and arranges for the packages to be transferred to the CLOCKSS archive at a time of the publisher's choosing.<br />
Comparing these with OAIS' SIP definitions above, some important aspects are evident:<br />
* SIPs are a structure that the archive imposes on what from the publisher's point of view is a continuous stream of content. SIP is not a useful concept in communicating with publishers.<br />
* In neither case is the form of content as it is transferred to the archive defined by the archive. It is defined by the publisher; the archive has very limited ability to affect this definition during negotiation of the Submission Agreement (see [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]).<br />
* In both cases the content is constructed by the publisher.<br />
* In neither case is the content transferred a completed logical unit of content. What is transferred is the set of articles published since the last transfer. This will in almost all cases form only part of a logical unit, for example a volume of a journal, and will in some cases form part of multiple logical units. Examples are transfers that span the end of one volume and the start of its successor, and file transfers from large publishers which typically include articles from many journals. Thus CLOCKSS SIPs are "for use in the construction or update of one or more AIPs".<br />
* In neither case is a completed logical unit of content, or an entire SIP, transferred at a single point in time: <br />
** For harvested content, the timing of the transfer is under the control of the archive. The CLOCKSS archive cannot wait until a volume's successor has started and then ingest the articles the older volume contains as a completed unit, because:<br />
*** Doing so places content published early in a volume at additional risk.<br />
*** Publishers are not constrained to cease publishing content in an earlier volume once they have started publishing in its successor. Both "publish ahead of print" and corrections are examples in which this happens.<br />
** For file transfer content, the timing of the transfer is under the control of the publisher. In general, publishers arrange for the transfer to happen shortly after publication, but they are not constrained to do so, nor are they constrained to identify the point in time at which a volume is complete.<br />
The CLOCKSS archive is thus an example of how "The relationships between SIPs and AIPs can be complex".<br />
<br />
Because the CLOCKSS archive provides two different ways publishers can choose to provide content for preservation, it supports two different types of SIP, for harvest and for file transfer.<br />
<br />
=== Harvest SIP ===<br />
<br />
Publishers choosing to have the CLOCKSS archive harvest their content normally demonstrate its logical structure by providing volume and issue table-of-contents (ToC) pages. The volume ToC acquires links to the issues as they are published; the issue ToC acquires links to the articles in that issue as they are published. The CLOCKSS archive requires two things of each publisher choosing harvest; these, together with the content as it accumulates, form the SIP for that logical unit of content from that publisher:<br />
* A volume ToC page, or some page playing an equivalent role, via whose links the content that is to form the AU can be found. This page is termed the "manifest page".<br />
* A permission statement on the manifest page, or on a page with a known relation to the manifest page, containing either a Creative Commons license, or a statement granting CLOCKSS permission to collect and preserve the content pointed to by the manifest page.<br />
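The permission check above can be sketched as a simple predicate on the fetched manifest page. This is an illustration only: the exact permission statement text and the function names are assumptions, not the LOCKSS crawler's implementation.<br />

```python
# Sketch (hypothetical names; the permission statement text shown is
# illustrative): checking that a manifest page grants permission before
# harvest begins, either via a CLOCKSS permission statement or a
# Creative Commons license link.
CLOCKSS_PERMISSION = ("CLOCKSS system has permission to ingest, "
                      "preserve, and serve this Archival Unit")

def has_permission(manifest_html: str) -> bool:
    """True if the page carries a permission statement or a CC license."""
    return (CLOCKSS_PERMISSION in manifest_html
            or "creativecommons.org/licenses/" in manifest_html)
```
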
<br />
=== File Transfer SIP ===<br />
<br />
Publishers choosing to supply content via file transfer do so in some format of their choosing, subject to agreement by the CLOCKSS archive that the format is adequately specified and contains adequate metadata. The transferred files containing content and metadata form the SIP for that content from that publisher.<br />
<br />
== Creating AIPs from SIPs ==<br />
<br />
In both harvest and file transfer cases the [[CLOCKSS: Ingest Pipeline]] uses additional, internal publisher-specific information to create AUs (AIPs) from the SIPs. For details, see [[Definition of AIP]]. As described in [[CLOCKSS: Logging and Records]], the CLOCKSS Executive Director receives regular reports on the progress of this process, upon which publisher billing is based.<br />
<br />
== Content Information and Information Properties ==<br />
<br />
=== OAIS Content Information and Information Properties ===<br />
<br />
In the OAIS model, the SIP is a means to transfer Content Information containing Information Properties to the archive. The definitions of these terms are:<br />
<blockquote>Content Information: A set of information that is the original target of preservation or that<br />
includes part or all of that information. It is an Information Object composed of its Content Data<br />
Object and its Representation Information.</blockquote><br />
<blockquote>Information Property: That part of the Content Information as described by the Information<br />
Property Description. The detailed expression, or value, of that part of the information content<br />
is conveyed by the appropriate parts of the Content Data Object and its Representation<br />
Information.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>Content Data Object: The Data Object, that together with associated Representation<br />
Information, comprises the Content Information.</blockquote><br />
<blockquote>Information Property Description: The description of the Information Property. It is a<br />
description of a part of the information content of a Content Information object that is<br />
highlighted for a particular purpose.</blockquote><br />
<blockquote>Representation Information: The information that maps a Data Object into more<br />
meaningful concepts. An example of Representation Information for a bit sequence which is<br />
a FITS file might consist of the FITS standard which defines the format plus a dictionary<br />
which defines the meaning in the file of keywords which are not part of the standard.<br />
Another example is JPEG software which is used to render a JPEG file; rendering the JPEG<br />
file as bits is not very meaningful to humans but the software, which embodies an<br />
understanding of the JPEG standard, maps the bits into pixels which can then be rendered as<br />
an image for human viewing.</blockquote><br />
The discussion of Information Properties includes:<br />
<blockquote>The<br />
Producer may provide, or the Archive may itself define, as part of the Provenance<br />
Information, Information Property Descriptions of Information Properties which should<br />
be maintained over time, and indeed may provide Information Property Descriptions of<br />
Information Properties which do not need to be maintained over time. An Information<br />
Property is that part of the Content Information as described by the Information Property<br />
Description. An Information Property Description is a description of a part of the information<br />
content of a Content Information object that is highlighted for a particular purpose. The detailed<br />
expression, or value, of that part of the information content is conveyed by the appropriate parts<br />
of the Content Data Object and its Representation Information. For example, consider a simple<br />
digital book which when rendered appears as pages with margins, title, chapter headings,<br />
paragraphs, and text lines composed of words and punctuation. Information Property<br />
Descriptions for Information Properties that must be preserved could be expressed as<br />
‘paragraph identification’ and ‘characters expressing words and punctuation’. The<br />
Information Properties would consist of all the book’s paragraph identifications, words, and<br />
punctuation as expressed by the Content Data Object and its Representation Information.<br />
This means that all formatting other than the recognition of paragraphs and readable text<br />
could be altered while still maintaining required preservation. The Archive may express an<br />
evaluation of the Authenticity of its holdings, based on community practice and<br />
recommendations (including best practices, guidelines, standards, and legal requirements).<br />
For example scientific Archives may have less stringent evaluation criteria than State<br />
Archives; however, the Consumer may make his/her own judgment of the Authenticity<br />
starting with the evidence obtained from PDI.</blockquote><br />
<br />
=== CLOCKSS Content Information and Information Properties ===<br />
<br />
In effect, the OAIS defines Information Properties as properties of the Content Information that may, or may not, be the same before and after a transformation of the content as part of a preservation operation. An example would be paragraph identifications, which might or might not be preserved by a format migration of the preserved content.<br />
<br />
The CLOCKSS archive does not perform content transformations as part of <i>preservation</i> operations, only as part of <i>dissemination</i> operations. The archive preserves the content it ingests in its original format and as its original bits. This content includes bibliographic and representational metadata. As described in [[LOCKSS: Extracting Bibliographic Metadata]], subsequent to ingestion the bibliographic metadata is extracted and added to the [[LOCKSS: Metadata Database]] for ease of access and reporting. The metadata database is merely a cache of information from the preserved content. The database is not itself preserved; it can be reconstructed from the preserved content, which is unaffected by metadata extraction.<br />
<br />
If content is ever triggered, as part of the process of generating a [[Definition of DIP|DIP]] containing it, the triggered content may be transformed from its original form to a form that is understandable by applying the Knowledge Base of the eventual Consumers. [[CLOCKSS: Extracting Triggered Content#Preparing File Transfer Content for Dissemination|CLOCKSS: Extracting Triggered Content]] describes this type of transformation as applied to file transfer content currently. The reasons for this policy are:<br />
* The CLOCKSS archive is a dark archive. It is intended that the vast majority of its content will never be triggered. Devoting resources to transforming content that will never be triggered would be wasteful.<br />
* [[CLOCKSS: Designated Community]] specifies that the Knowledge Base of the eventual Consumers of any triggered content includes Web browsers, or their equivalent in some successor technology to the Web. The goal is that the eventual Consumers of any triggered content are as able to understand it as readers of the publisher's Web site were at the time of publication. To ensure this, two conditions must hold:<br />
** The original Content Information must include everything that would have been provided to a reader of the publisher's Web site at the time of publication, see [[Definition of AIP#Creating AIPs from SIPs|Definition of AIP]].<br />
** The DIP containing the triggered Content Information must include it in a format that Web browsers, or their equivalent in some successor technology to the Web, can render successfully. If this is not the original format, a format migration will be performed during generation of the DIP, see [[LOCKSS: Format Migration#Format Migration in the CLOCKSS Archive]], but this will not affect the preserved AIP.<br />
Thus the Information Properties that the CLOCKSS archive preserves are the entire set of Information Properties of the Content Information. No operations are performed on the preserved content that would affect those properties. It is not necessary for the CLOCKSS archive to specify individual Information Properties that are, or are not, to be preserved since all Information Properties are preserved. If a transformation is performed as part of generating a DIP, the properties of the Content Information may not be the same before and after the transformation, but this does not mean that the Information Properties have not been preserved. A later, different DIP generation process might result in the properties of the Content Information being the same before and after the transformation.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# [[LOCKSS: Format Migration]]<br />
# [[Definition of AIP]]<br />
# [[Definition of DIP]]<br />
# [[CLOCKSS: Designated Community]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Logging_and_RecordsCLOCKSS: Logging and Records2014-03-12T04:30:03Z<p>Dshr: /* Retention Policy */ fix format glitch</p>
<hr />
<div>= CLOCKSS: Logging and Records =<br />
<br />
The CLOCKSS system uses three types of record:<br />
* '''Logs:''' detailed logs of the operations of the LOCKSS daemon, at an extensively customizable level of detail, written to log files on the host machine. The purpose of these logs is to enable diagnosis of problems that arise. Logs are retained on the machine that generated them in <tt>/var/log</tt>.<br />
* '''Alerts:''' messages sent off-machine by the LOCKSS daemon when significant events occur. The purpose of Alerts is to draw attention to potential problems that may need diagnosis. Alerts are sent via e-mail to the <<tt>clockss-alerts@clockss.org</tt>> mail alias, and added to the log files on the host machine via the <tt>syslog</tt> mechanism.<br />
* '''Records:''' statistical summaries and business records of the operation of the system as a whole, not of individual boxes. They are provided to the CLOCKSS board and CLOCKSS member organizations, and electronic copies are stored in a system run by the Executive Director.<br />
<br />
== Retention Policy ==<br />
<br />
Although the LOCKSS daemon can generate extremely detailed logs, doing so routinely is counter-productive. It buries the signal in the noise. The goal of the logging and record policy, in the absence of a specific problem to diagnose, is to:<br />
* Generate Logs adequate to, and retain them long enough to, enable simple diagnosis.<br />
* Generate Alerts on any condition that the daemon determines is anomalous, and on other significant events, with sufficient detail to draw the system administrator's attention to problems requiring diagnosis, and to retain them indefinitely.<br />
* Generate the Records needed for business and governance, and for monitoring of the CLOCKSS network's overall performance, and to retain them indefinitely.<br />
Specific log retention policies for each CLOCKSS box are specified in <tt>/etc/logrotate.conf</tt> and the files in <tt>/etc/logrotate.d/</tt>. On each CLOCKSS Box:<br />
* System logs are retained for a month.<br />
* At least the most recent 20MB of LOCKSS daemon log data is retained.<br />
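The daemon log policy above might be expressed as a fragment in <tt>/etc/logrotate.d/</tt> along these lines (the path and exact directives are illustrative, not the deployed configuration; five rotations of 4MB each keep at least the most recent 20MB):<br />

```
/var/log/daemon/daemon.log {
    rotate 5
    size 4M
    compress
    missingok
    notifempty
}
```
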
<br />
== Ingest Alerts ==<br />
<br />
An Alert is generated at the end of each crawl of a [[Definition of AIP#Creating AIPs from SIPs|SIP]] that meets certain criteria, recording the final status of the crawl and the number of HTTP 200 results obtained (this is equivalent to the number of new URLs that were found, plus the number of existing URLs whose content was found to have been modified). An example of such an alert:<br />
<pre><br />
Date: Sat 19 Feb 2011 04:17:24 PST<br />
From: LOCKSS box ingest2.clockss.org <clockss-alert@lockss.org><br />
Subject: [lockss-alert] LOCKSS box info: CrawlEnd<br />
<br />
LOCKSS box 'ingest2.clockss.org' raised an alert at Sat Feb 19 04:12:24 PST 2011<br />
<br />
Name: CrawlEnd<br />
Severity: info<br />
AU: Nature Reviews Genetics Volume 11<br />
Explanation: Crawl ended successfully: 2276 new files<br />
<br />
Crawl ended successfully, 2276 new files, 4 warnings.<br />
</pre><br />
<br />
== Preservation Alerts ==<br />
<br />
An Alert is generated at the end of each [[LOCKSS: Polling and Repair Protocol|poll]] that detects an integrity problem:<br />
* If there were a non-zero number of URLs for which:<br />
** A repair was needed because the content failed to match the consensus.<br />
** Repair content was fetched.<br />
** The repair content matched the consensus.<br />
* If there were a non-zero number of URL versions newly flagged as suspect because their content failed to match the locally stored hash.<br />
An example of such an Alert:<br />
<pre><br />
Date: Sat Jul 20 2013 04:17:24 PST<br />
From: LOCKSS box ingest2.clockss.org <clockss-alert@lockss.org><br />
Subject: [lockss-alert] LOCKSS box info: PollEnd<br />
<br />
LOCKSS box 'ingest2.clockss.org' raised an alert at Sat Jul 20 04:12:24 PST 2013<br />
<br />
Name: PollEnd<br />
Severity: info<br />
AU: Nature Reviews Genetics Volume 11<br />
Explanation: Poll ended successfully: 99.89% agreement<br />
<br />
Poll ended successfully, 2866 URLs, 99.89% agreement, 3 suspect files found.<br />
</pre><br />
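The alert conditions reduce to checking per-poll counters when the poll completes. A minimal sketch (field and method names are hypothetical, not the LOCKSS daemon's internals):<br />

```python
# Sketch (hypothetical names): per-poll counters and the decision to
# raise a Preservation Alert, per the conditions above.
from dataclasses import dataclass

@dataclass
class PollResult:
    total_urls: int
    agreeing_urls: int     # matched the consensus without needing repair
    repaired_urls: int     # repair fetched and verified against consensus
    suspect_versions: int  # newly flagged: content fails local hash check

    def agreement_pct(self) -> float:
        """Percentage of URLs agreeing with the consensus."""
        return 100.0 * self.agreeing_urls / self.total_urls

    def raises_alert(self) -> bool:
        """Alert if any URL needed a (successful) repair, or any stored
        version was newly flagged as suspect."""
        return self.repaired_urls > 0 or self.suspect_versions > 0
```
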
<br />
== Dissemination Alerts ==<br />
<br />
The CLOCKSS archive is a dark archive; access to the content is permitted only at the direction of the CLOCKSS board. Thus, as described in [[CLOCKSS: Box Operations]], the content access mechanisms of the LOCKSS daemon are disabled, and packet filters are used to further prevent access. Nevertheless, Alerts are generated on any access to the content in order that they may be treated as [[CLOCKSS: Logging and Records#Administrative and Security Alerts|Security Alerts]].<br />
<br />
== Administrative and Security Alerts ==<br />
<br />
Alerts are generated on the following administrative actions:<br />
* Changes to the configuration files.<br />
* Changes to the access control permissions.<br />
* Adding or de-activating an AU.<br />
* Enabling or disabling the content servers.<br />
* User account added or removed or password changed.<br />
<br />
== External Communications ==<br />
<br />
=== Engagement ===<br />
<br />
Engagement with harvest content publishers before ingestion is described in [[CLOCKSS: Ingest Pipeline#Harvest Publisher Engagement|CLOCKSS: Ingest Pipeline]].<br />
<br />
Engagement with file transfer content publishers before ingestion is described in [[CLOCKSS: Ingest Pipeline#File Transfer Publisher Engagement|CLOCKSS: Ingest Pipeline]].<br />
<br />
In all cases interactions with the publisher take place through the RT ticketing system, so they are recorded permanently.<br />
<br />
=== External Reports ===<br />
<br />
The technology for generating reports is being revised; the earlier technology became too inefficient as the number of articles on each box grew, because it generated reports on each box from the [[LOCKSS: Metadata Database]] and then merged them. The new technology is a centralized database with a row for each article and a column for each of the production and ingest boxes; each cell contains the ingest timestamp of the article on that box, obtained by a regular polling process that asks each box for the articles ingested since it was last asked.<br />
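The centralized table can be sketched as follows (schema, box names and identifiers are hypothetical; the sketch uses SQLite purely for illustration): one row per article, one column per box, each cell holding that box's ingest timestamp.<br />

```python
# Sketch (hypothetical schema and names): the centralized report database
# described above. One row per article, one column per box; each cell is
# the ingest timestamp of the article on that box, filled in by the
# regular polling process.
import sqlite3

BOXES = ["ingest1", "ingest2", "prod1"]  # illustrative box list

def make_db():
    db = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{b}" TEXT' for b in BOXES)
    db.execute(f"CREATE TABLE articles (doi TEXT PRIMARY KEY, {cols})")
    return db

def record_ingest(db, doi, box, timestamp):
    """Record one article's ingest timestamp on one box, creating the
    article's row on first sight."""
    db.execute("INSERT OR IGNORE INTO articles (doi) VALUES (?)", (doi,))
    db.execute(f'UPDATE articles SET "{box}" = ? WHERE doi = ?',
               (timestamp, doi))
```

A `NULL` cell then indicates an article not yet ingested on that box, which is exactly what the external reports need to surface.<br />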
<br />
The following reports are generated for external consumption:<br />
* Monthly reports of the state of preservation of all serials committed to preservation in the CLOCKSS archive are delivered to the CLOCKSS board, the Keepers Registry and posted [http://www.clockss.org/keepers/ on the Web].<br />
* KBART reports are generated monthly and posted [http://www.clockss.org/kbart/ on the Web]. For the Global LOCKSS Network, these reports are used to update link resolver knowledge bases so that libraries can provide their readers access to the content of their LOCKSS box. Because the CLOCKSS archive is a dark archive, these reports cannot be used to update link resolvers. However, several analysis tools use KBART as an input format, so the KBART reports for CLOCKSS are made public.<br />
* The CLOCKSS Executive Director is sent an e-mail report of the article counts in the CLOCKSS archive weekly. These reports are preserved in Stanford's backup system.<br />
* The CLOCKSS archive charges publishers a small fee for each current article ingested, billed quarterly. Thus a quarterly report is generated showing for each publisher the number of their articles ingested in that quarter for each publication year. The report is submitted to the CLOCKSS Executive Director for onward transmission to the publishers. Significant discrepancies between this and the publisher's own article counts will result (and have resulted) in investigation and corrective action. To aid in this process more detailed reports, down to the article level, can be generated on request.<br />
<br />
The CLOCKSS Metadata Lead is responsible for the production and dissemination of these reports.<br />
<br />
== Monitoring ==<br />
<br />
=== Log Monitoring ===<br />
<br />
* Ingest boxes: The CLOCKSS Content Lead is responsible for monitoring logs on the ingest boxes<br />
* Production boxes: The CLOCKSS Technical Lead is responsible for monitoring logs on production boxes when needed.<br />
* Web servers: The CLOCKSS Network Administrator is responsible for monitoring web server logs.<br />
<br />
=== Alert Monitoring ===<br />
<br />
The CLOCKSS Technical Lead is responsible for monitoring the Alerts generated by CLOCKSS boxes.<br />
<br />
== Nagios ==<br />
<br />
The state of the CLOCKSS infrastructure, including the CLOCKSS boxes and the ingest machines, is monitored by Nagios as described in [[CLOCKSS: Box Operations#Nagios|CLOCKSS: Box Operations]].<br />
<br />
The CLOCKSS Network Administrator is responsible for monitoring via Nagios.<br />
<br />
== Network Diagnostics ==<br />
<br />
The LOCKSS team's internal monitoring and evaluation processes identified some areas in which the efficiency of the polling process could be improved in the context of the Global LOCKSS Network (GLN). The Andrew W. Mellon Foundation funded work to implement and evaluate improvements in these areas. This is expected to be complete by March 2014. Although these improvements will be deployed to the CLOCKSS network, because there are many fewer boxes in the CLOCKSS network than in the GLN, the areas of inefficiency are not relevant to the CLOCKSS network. Thus the improvements are not expected to make a substantial difference to the performance of the CLOCKSS network.<br />
<br />
The Mellon-funded work included development of improved instrumentation and analysis software, which polls the administrative Web UI of each LOCKSS box in a network to collect vast amounts of data about the operations of each box. For examples of the use of this software, see [[LOCKSS: Polling and Repair Protocol#Enhancements|LOCKSS: Polling and Repair Protocol]].<br />
<br />
The CLOCKSS Network Administrator is responsible for collecting and analyzing this data.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Engineering Staff<br />
** CLOCKSS Network Administrator<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Box Operations]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[Definition of AIP]]</div>Dshrhttp://documents.clockss.org/index.php/CLOCKSS:_Ingest_PipelineCLOCKSS: Ingest Pipeline2014-02-28T21:40:51Z<p>Dshr: /* Completeness of Harvest Content */ fix typos</p>
<hr />
<div>= CLOCKSS: Ingest Pipeline =<br />
<br />
The CLOCKSS Archive ingests two different types of content:<br />
* Harvest Content - the contents of a publisher's Web site, obtained by crawling it.<br />
* File Transfer content - the form of the content the publisher uses to create their Web site, obtained from the publisher typically via FTP.<br />
<br />
== Harvest Content Pipeline ==<br />
<br />
=== Harvest Publisher Engagement ===<br />
<br />
The process a new publisher signing up with CLOCKSS undergoes depends on whether the publisher uses a publishing platform already supported by the LOCKSS software:<br />
* If so, the discussion need cover only the process for turning on access for the CLOCKSS Archive's ingest machines.<br />
* Otherwise, a plugin writer from the LOCKSS team is designated to analyze the publisher's site and:<br />
** develop requirements for the publisher plugin as described in [[LOCKSS: Software Development Process]].<br />
** work with the publisher to add CLOCKSS permission pages and make any other necessary changes to their site.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Plugin Development and Testing ===<br />
<br />
The designated LOCKSS team member:<br />
* Works with the publisher to grant access to the CLOCKSS ingest machines' IP addresses.<br />
* Develops and tests any necessary software enhancements including unit tests (see [[LOCKSS: Software Development Process]]).<br />
* Works with a plugin writer to implement and test the necessary plugin and its unit tests (see [[LOCKSS: Software Development Process]]).<br />
<br />
Once the plugin passes its tests and the CLOCKSS ingest machines have access to the publisher's site with CLOCKSS permissions, the plugin writer works with the LOCKSS content team to process sample batches of publisher content for quality assurance.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Content Processing ===<br />
<br />
Once a plugin has been developed and tested with unit tests and sample content, it is released to a set of content-testing boxes, which are identical to production CLOCKSS boxes except generally smaller. Then:<br />
* Under the direction of an internally developed testing framework (AUTest), two daemons running the plugin are directed to collect two copies of a substantial amount of real content, one copy collected from Stanford's IP address range, the other from Rice or Indiana.<br />
* AUTest then directs each of the two daemons to compute the message digest of the filtered contents of each collected AU.<br />
* AUTest compares the results to ensure correct collection and correct operation of hash filters.<br />
* Metadata is extracted and checked for sanity.<br />
* A sample of the content is browsed by a human tester to verify the correctness of the plugin by ensuring that:<br />
** all the types of files that should be collected are,<br />
** and that files that shouldn't be collected, such as advertisements or articles from previous years, are properly excluded.<br />
Any problems detected are addressed by modifying and testing the plugin in a development environment, or by contacting the publisher if necessary to resolve systemic site errors, then the tests are repeated on the content-testing boxes.<br />
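The digest comparison AUTest performs over the two collected copies can be sketched as follows. This is an illustrative reconstruction in Python, not the actual AUTest code; the function names, the use of SHA-1, and the canonical URL ordering are assumptions.<br />

```python
import hashlib

def au_digest(urls_to_content, content_filter):
    """Compute one digest over the filtered content of every URL in a
    daemon's copy of an AU, visiting URLs in a canonical (sorted) order
    so the two copies can be compared digest-to-digest."""
    h = hashlib.sha1()
    for url in sorted(urls_to_content):
        h.update(url.encode("utf-8"))
        h.update(content_filter(urls_to_content[url]))
    return h.hexdigest()

# Hypothetical hash filter: real plugins strip variable content
# (ads, timestamps) before hashing; here we pass content through.
def identity_filter(content: bytes) -> bytes:
    return content

copy_a = {"http://example.com/art1": b"<html>stable content</html>"}
copy_b = {"http://example.com/art1": b"<html>stable content</html>"}

# AUTest-style check: two independently collected copies must agree,
# demonstrating correct collection and correct hash filtering.
assert au_digest(copy_a, identity_filter) == au_digest(copy_b, identity_filter)
```

If the two digests disagree, either collection differed between the two IP address ranges or the hash filters failed to exclude some variable content.<br />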
<br />
The process above is repeated on a substantial sample of new content as it becomes available (in following years/volumes, or new publications from the same publisher), in order to detect changes to the publisher's site, or the format of new titles, which require changes to the plugin.<br />
<br />
Once an AU has been successfully tested on the content-testing boxes, it is configured for collection on all boxes in the ingest network. If the plugin is new or changed it is also released to the ingest network. Each ingest box collects the content and the network runs polls to detect and resolve transient collection errors. When all the copies of an AU have come into full agreement on the ingest boxes, they are then configured on the network of CLOCKSS production boxes.<br />
<br />
Depending on the complexity and diversity of the publisher's content the "substantial sample" can be anything from quite small to the publisher's entire content. Note that the need for a "substantial sample" of content for testing means that there is a delay between the time the publisher starts adding content to the bibliographic unit represented by the AU, and the time the ingest network starts collecting it.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Harvest Process ====<br />
<br />
Once AUs of content are configured on the CLOCKSS production boxes, they schedule collection of the content from the ingest boxes, as described in [[Definition of AIP#Creating an AIP from a harvest SIP|Definition of AIP]].<br />
<br />
A harvest content AU is considered to be preserved when it has been ingested by all the production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). The AU can then be removed from the ingest boxes. A review of the process for doing so deemed it inadequate and too manual; a replacement process is under development that will integrate with the improvements being made to [[CLOCKSS: Logging and Records#External Reports|external report generation]].<br />
<br />
The CLOCKSS Content Lead is responsible for both the current and replacement processes.<br />
<br />
==== Completeness of Harvest Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content on the publisher's web site. The CLOCKSS ingest pipeline includes multiple ingest CLOCKSS boxes that individually harvest content from the publisher's website. Once collected, the ingest CLOCKSS boxes poll among themselves and repair content objects from the publisher if there is disagreement. The content testing process requires complete agreement among the ingest boxes before the content is released to the production CLOCKSS network. As a result, the content described by the CLOCKSS plugin and parameters for the [[Definition of SIP|Submission Information Package]] is preserved in the CLOCKSS PLN.<br />
<br />
To account for the possibility of errors in the rules and procedures of the CLOCKSS plugin, a second layer of testing is done by visually inspecting content submitted by a publisher to ensure that it functions as it did on the publisher's website, and that all content available from the publisher's website is also preserved in the CLOCKSS ingest boxes before being released to the CLOCKSS production network.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Correctness of Harvest Content ====<br />
<br />
Correctness in this context means that the publisher's content is Independently Understandable by the [[CLOCKSS: Designated Community]]. For the CLOCKSS archive, this is the responsibility of the publisher, not of the archive. Harvest content is used by customers of the publisher on a regular basis, so content that is not Independently Understandable on the publisher's website is rare, and is normally detected by the author. Under the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement], the publisher may submit content in any form, including proprietary formats, so full validation of content and content types is not always practical.<br />
<br />
==== Annual Harvest Content Cycle ====<br />
<br />
The processing cycle for the harvest content a publisher is publishing during a given year can be described as follows.<br />
* The first phase begins typically in the second quarter of the year.<br />
** A sample of the publisher's AUs for the year is configured and processed on the content-testing machines to ensure the plugin works as expected, and corrective action is taken if necessary.<br />
** When the plugin is deemed satisfactory, all the publisher's AUs for the year are configured on the ingest machines and allowed to crawl and poll.<br />
* The second phase continues throughout the rest of the year.<br />
** Poll results for AUs on the ingest machines are monitored, and corrective action is taken if necessary, including enhancing the plugin.<br />
** More content samples may be configured and processed on the content-testing machines to ensure the plugin continues to work as expected throughout the year.<br />
* The third phase begins at the beginning of the following year.<br />
** As the AUs for the year that is ending stop growing, final poll results are monitored until the AUs are deemed fully processed, at which point they are configured on the production machines.<br />
** Crawl and poll results of the AUs on the production machines are then monitored until the AUs are deemed fully processed.<br />
<br />
As regards subsequent errata and corrections, we assume that publishers follow the NLM ''best practice'' guidelines and at least refer and link to them in a subsequent issue. If the publisher does, they will be collected with that issue.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== e-Book Processing ====<br />
<br />
Unlike serial content, which is published incrementally over a period of time, harvest content that takes the form of books can be processed on a regular cycle. During each cycle:<br />
* A sample of the publisher's books that were published during the previous interval is configured on the content-testing machines to verify that the plugin works adequately, and corrective action is taken if necessary.<br />
* When the plugin is deemed ready, all the publisher's books from the previous interval are configured on the ingest machines and allowed to crawl and poll. Results are monitored and corrective action is taken if necessary.<br />
* When a book is deemed fully processed on the ingest machines, it is configured on the production machines. The crawl and poll results are then monitored on the production machines until the book is deemed fully processed.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
== File Transfer Content Pipeline ==<br />
<br />
=== File Transfer Publisher Engagement ===<br />
<br />
The designated LOCKSS team member's discussion with a new publisher whose content is to be received via file transfer needs to determine two things:<br />
* The means by which the file transfer content is transferred to the CLOCKSS ingest machines. This is up to the publisher; techniques that have been implemented include:<br />
** FTP from an FTP server that the publisher maintains to a CLOCKSS ingest machine.<br />
** FTP by the publisher to an FTP server on a CLOCKSS ingest machine.<br />
** <tt>rsync</tt> between a publisher machine and a CLOCKSS ingest machine<br />
* The format of the file transfer content, in particular:<br />
** How the ingest scripts can verify that the content received is correct, for example by checking manifests and checksums.<br />
** How the content can be rendered if it is ever triggered. A publisher-specific version of the abstract plan described in [[CLOCKSS: Extracting Triggered Content#Preparing File Transfer Content for Dissemination|CLOCKSS: Extracting Triggered Content]] should be drawn up.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Ingest script development ===<br />
<br />
Once the information above is available, the designated team member writes shell scripts to be executed from <tt>cron</tt> that ensure that collected content is up-to-date by:<br />
* If FTP from the publisher, collect any as-yet-uncollected content.<br />
* If <tt>rsync</tt>, run <tt>rsync</tt> against the publisher's machine.<br />
* If the format in which the publisher makes file transfer content available lacks checksums and manifests, the ingest script must generate them as the content is collected.<br />
* Otherwise, the ingest script verifies the manifest and checksums in the content, alerting the content team if any discrepancies are found.<br />
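The manifest-generation and verification logic described above can be sketched as follows. The real ingest scripts are publisher-specific shell scripts run from <tt>cron</tt>; this Python sketch, with its function names and choice of MD5, is illustrative only.<br />

```python
import hashlib
import os

def generate_manifest(content_dir):
    """Walk collected content and record a checksum per file, so that
    later verification runs (against the ingest machine or its backups)
    can detect corruption or incomplete transfers."""
    manifest = {}
    for root, _dirs, files in os.walk(content_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            manifest[os.path.relpath(path, content_dir)] = digest
    return manifest

def verify_manifest(content_dir, manifest):
    """Return the files whose current checksum disagrees with the
    manifest (including files that have gone missing); a non-empty
    result should alert the content team."""
    current = generate_manifest(content_dir)
    return sorted(p for p in manifest if current.get(p) != manifest[p])
```

Where the publisher supplies its own manifests and checksums, the verification step compares those against locally computed values instead of a locally generated manifest.<br />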
<br />
The designated team member also writes a verification script that is run against all content on the file transfer ingest machine at intervals.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Plugin Development and Testing ===<br />
<br />
File transfer content also requires a plugin. These plugins are developed and tested in the same way as harvest plugins (see [[CLOCKSS: Ingest Pipeline#Harvest Plugin Development and Testing|above]]), except that the emphasis is on [[LOCKSS: Extracting Bibliographic Metadata|metadata extraction]] because it is in some cases more complex, whereas crawling is trivial and polling requires no filters.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Content Pre-Processing ===<br />
<br />
File transfer content undergoes a subset of the testing steps described above for [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|harvest content]]. The structure of file transfer AUs is simple and consistent (typically a directory hierarchy) so testing the crawl rules on a large sample of content is unnecessary. All the CLOCKSS boxes collect the same copy of the content so few of the complexities of preserving harvest content (such as hash filters) are present.<br />
<br />
A slightly simplified workflow in AUTest directs content-testing boxes to collect a single copy of a smaller sample of AUs from the staging server, and to extract metadata and check it for sanity. After any problems are corrected, the AU and similar AUs are configured on the CLOCKSS production boxes.<br />
<br />
=== File Transfer Content Ingest ===<br />
<br />
The collection scripts are added to the ingest user's <tt>crontab</tt>, and run daily to collect any new content. Every 24 hours the entire content of the file transfer ingest machine is synchronized with (a) an on-site backup and (b) an off-site backup using <tt>rsync</tt>. The verification scripts are run against each of these backup copies at intervals.<br />
<br />
Once the ingested content is verified it is staged to a Web server that can be accessed only by the CLOCKSS boxes. The CLOCKSS boxes crawl the file transfer content staged on the Web server under control of a file transfer plugin and preserve it, as described in [[Definition of AIP#Creating an AIP from a file transfer SIP|Definition of AIP]].<br />
<br />
File transfer content is considered to be preserved when it has been ingested by all production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). <br />
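The per-URL vote hash these polls rely on (see [[LOCKSS: Polling and Repair Protocol]]) can be sketched as follows; the particular hash algorithm shown is an assumption, but the nonce construction follows the protocol description.<br />

```python
import hashlib
import os

def vote_hash(poller_nonce: bytes, voter_nonce: bytes, content: bytes) -> str:
    """Hash of the concatenation Np || Nv || content. The fresh nonces
    make the hash different in every vote in every poll, so a voter
    must re-hash its stored content each time it votes rather than
    replaying a remembered hash."""
    return hashlib.sha1(poller_nonce + voter_nonce + content).hexdigest()

np = os.urandom(20)                        # poller's fresh nonce Np
nv1, nv2 = os.urandom(20), os.urandom(20)  # two voters' nonces Nv
content = b"preserved article content"

# Same content, different nonces: the two votes differ on the wire,
# but the poller, which knows Np and each Nv, can recompute both
# hashes from its own copy of the content and tally agreement.
assert vote_hash(np, nv1, content) != vote_hash(np, nv2, content)
```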
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Completeness of File Transfer Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content delivered by the publisher. Content submitted via file transfer is either made available from a content server operated by the publisher, or delivered by the publisher to a content server hosted by CLOCKSS. In either case the content is transferred to a staging server operated by CLOCKSS. The ingest scripts include procedures that verify the files submitted by the publisher correspond to those on the content server. Where publishers provide checksum information, checksums are also compared with locally computed checksums to ensure that the content is the same. If the checksums differ, the local copy is deleted and re-collected.<br />
<br />
==== Correctness of File Transfer Content ====<br />
<br />
Correctness in this context means that, if the content is ever triggered, the result will be Independently Understandable by the [[CLOCKSS: Designated Community]]. During initial engagement with file transfer publishers, their content types are assessed to develop a plan for rendering the content if it is ever triggered. One goal of the ingest scripts is to verify that the content types submitted match this assessment; if they do not the [[CLOCKSS: Ingest Pipeline#File Transfer Publisher Engagement|plan for triggering the content]] must be revised to account for the new content types.<br />
<br />
== Feedback ==<br />
<br />
The CLOCKSS Executive Director receives regular reports of the article counts ingested, upon which publishers are billed (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]).<br />
<br />
Content AUs can be in one of three preservation states, as reported to the publisher, the CLOCKSS Board and the Keepers Registry (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]):<br />
# Committed for preservation<br />
# In process<br />
# Preserved<br />
Progress is tracked through the AU's configuration file.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Technical Staff<br />
** CLOCKSS Plugin Lead<br />
** CLOCKSS Content Lead<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Designated Community]]<br />
# [[CLOCKSS: Logging and Records]]<br />
# [[LOCKSS: Software Development Process]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[Definition of SIP]]<br />
# [[Definition of AIP]]<br />
# [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]</div>
Dshr
http://documents.clockss.org/index.php/Definition_of_AIP
Definition of AIP
2014-02-27T17:41:18Z
<p>Dshr: /* Creating an AIP from a harvest SIP */ More clarification suggested by Tom Lipkis</p>
<hr />
<div>= CLOCKSS Definition of AIP =<br />
<br />
== OAIS Archival Information Package (AIP) ==<br />
<br />
The OAIS definition of AIP is:<br />
<blockquote>Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>AIP Edition: An AIP whose Content Information or Preservation Description Information has been upgraded or improved with the intent not to preserve information, but to increase or improve it. An AIP edition is not considered to be the result of a Migration.</blockquote><br />
<blockquote>AIP Version: An AIP whose Content Information or Preservation Description Information has undergone a Transformation on a source AIP and is a candidate to replace the source AIP. An AIP version is considered to be the result of a Digital Migration.</blockquote><br />
<blockquote>Archival Information Collection (AIC): An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages.</blockquote><br />
<blockquote>Archival Information Unit (AIU): An Archival Information Package where the Archive chooses not to break down the Content Information into other Archival Information Packages. An AIU can consist of multiple digital objects (e.g., multiple files).</blockquote><br />
<blockquote>Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.</blockquote><br />
The OAIS discussion of AIP is:<br />
<blockquote>Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The AIP may also contain a collection of other AIPs, and this is discussed and modeled in section 4. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.</blockquote><br />
<br />
The OAIS definition of Representation Information is:<br />
<blockquote>'''Representation Information:''' The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard.<br />
<br />
Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>'''Representation Network:''' The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.</blockquote><br />
<blockquote>'''Representation Rendering Software:''' A type of software that displays Representation Information of an Information Object in forms understandable to humans.'''</blockquote><br />
The OAIS discussion of Representation Information is:<br />
<blockquote>In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.</blockquote><br />
<br />
== CLOCKSS Archival Information Package (AIP) ==<br />
<br />
CLOCKSS calls its AIPs Archival Units (AUs). They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL, since in a URL <tt>#</tt> introduces a fragment identifier) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to each version of the content. This metadata is represented as a file containing {key, value} pairs. It will include all the HTTP headers obtained by the <tt>GET</tt> that obtained the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It will also include additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum and a timestamp. This somewhat complex representation is required because:<br />
* The URLs <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> may both contain (different) content.<br />
* The content and/or the headers obtained from <tt>http://www.example.com/foo/bar</tt> at time T(0) and at time T(1) may differ.<br />
The representation allows for easy access by tools other than the LOCKSS daemon, for example shell scripts.<br />
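As an illustration of such access by an external tool, the sketch below walks an AU's directory hierarchy to reconstruct the preserved URLs and parses a metadata file of {key, value} pairs. The exact file names inside <tt>#content</tt> (version files and their property files) are not specified above and are treated as assumptions here.<br />

```python
import os

def list_au_urls(au_root):
    """Walk an AU's directory hierarchy and reconstruct the URLs it
    preserves: every directory containing a '#content' subdirectory
    corresponds to one collected URL, with the path components below
    the AU root being the URL's components."""
    urls = []
    for dirpath, dirnames, _files in os.walk(au_root):
        if "#content" in dirnames:
            rel = os.path.relpath(dirpath, au_root)
            urls.append(rel.replace(os.sep, "/"))
    return sorted(urls)

def read_props(props_path):
    """Parse a metadata file of {key, value} pairs (assumed here to be
    one 'key=value' per line, as in Java property files) into a dict;
    entries include the HTTP headers such as Content-Type plus
    ingest-generated values such as a checksum and timestamp."""
    props = {}
    with open(props_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                props[key] = value
    return props
```

Note how the layout naturally handles the case described above where both <tt>.../foo</tt> and <tt>.../foo/bar</tt> carry content: each gets its own <tt>#content</tt> directory, one nested inside the other.<br />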
<br />
=== CLOCKSS AIP Examples ===<br />
<br />
==== File Transfer AU ====<br />
<br />
The Journal of Laser Applications Volume 24:<br />
* [[Media:LA24 urls.pdf|A list of URLs (87-page PDF)]].<br />
* [[Media:LA24 metadata.pdf|The metadata]].<br />
<br />
==== Harvest AU ====<br />
<br />
Advances in Building Energy Research Volume 6:<br />
* [[Media:ABER V6 urls.pdf|A list of URLs (43-page PDF)]].<br />
* [[Media:ABER V6 metadata.pdf|The metadata]].<br />
<br />
=== CLOCKSS Preservation Description Information (PDI) ===<br />
<br />
The OAIS Reference Model classifies the Preservation Description Information (PDI) included in an AIP as follows:<br />
<ul><li>'''Provenance:''' the provenance of each version of each URL in the AU can be determined as follows. The content of that version was obtained from the URL represented by the path from the root of the AU's directory hierarchy to the parent of the content directory, at the time of the timestamp. If this content was the result of a repair from another CLOCKSS box, that is recorded in the metadata. </li><br />
<li>'''Fixity:''' the metadata for each version of each URL in the AU includes a checksum computed at the time it was obtained and, if one is available, a checksum provided by the Web server from which it was obtained.</li><br />
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:<br />
<ul><br />
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the plugin ID.</li><br />
<li>The ''plugin ID'', which, in encoded form, identifies the class to be instantiated and supplies additional information.</li><br />
</ul><br />
In effect, the context for the AU is a customized instance of a Java class, normally referred to as its ''plugin''. It is thus executable, capable of performing operations on the AU such as [[Definition of AIP#Creating AIPs from SIPs|adding content and metadata from a SIP]], [[LOCKSS: Extracting Bibliographic Metadata|extracting metadata]], and taking part in [[LOCKSS: Polling and Repair Protocol|integrity checks]].<br /><br />
The context for the content of a URL in an AU consists of the associated metadata, including the <tt>Content-Type</tt> and the other HTTP headers which together provide the information a Web browser uses to render the content.<br /><br />
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li><br />
<li>'''Reference:''' Each AU has an immutable internal name, computed from its context information and stored with it. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li><br />
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul><br />
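The Reference property above, an immutable name computed deterministically from the context information, can be sketched as follows. The encoding shown (pipes for dots, <tt>~</tt>-joined URL-encoded parameters) is illustrative, not necessarily the LOCKSS daemon's actual scheme.<br />

```python
from urllib.parse import quote

def au_id(plugin_id, params):
    """Compute a deterministic AU name from the context information
    (plugin ID plus definitional parameters). Sorting the parameters
    makes the name identical on every box regardless of the order in
    which the parameters were supplied, even though the AU instance's
    location in each box's file system may differ."""
    encoded = "&".join(
        f"{k}~{quote(str(v), safe='')}" for k, v in sorted(params.items()))
    return plugin_id.replace(".", "|") + "&" + encoded

aid = au_id("org.lockss.plugin.ExamplePlugin",
            {"base_url": "http://www.example.com/", "volume": 6})
# The same context information always yields the same name.
assert aid == au_id("org.lockss.plugin.ExamplePlugin",
                    {"volume": 6, "base_url": "http://www.example.com/"})
```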
<br />
== Creating AIPs from SIPs ==<br />
<br />
When the AU is created its root directory is created, and the context information (plugin ID and AU parameters) are recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.<br />
<br />
Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are [[Definition of SIP|two kinds of SIP]]:<br />
* A ''harvest'' SIP represents content that the CLOCKSS archive will ingest by crawling the publisher's web site.<br />
* A ''file transfer'' SIP represents content that the publisher will package and transfer to the CLOCKSS archive via FTP, rsync, or other file transfer mechanism.<br />
<br />
The process of creating an AU (AIP) from a SIP is different for each of the two types of SIP, but each process starts by ''configuring'' the AU on the CLOCKSS boxes, which involves supplying the context information (plugin ID and the parameters it requires) via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. Each box creates a root directory for its instance of the AU at a suitable place in its POSIX file system and calls the AU's plugin to arrange attempts to collect SIPs. The plugin contains information about times when content collection is allowed. The LOCKSS daemon on each box maintains a schedule of all AUs' collection attempts; when an AU requests a collection attempt it is scheduled so as to conform to box-wide and publisher-specific limits on the number of simultaneous collection attempts.<br />
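The admission control described in the last sentence can be sketched as follows. This is a minimal illustration of scheduling against box-wide and publisher-specific concurrency limits; the limit values and queueing policy are assumptions, not the daemon's actual scheduler.<br />

```python
from collections import defaultdict

class CrawlScheduler:
    """Minimal sketch: a requested collection attempt starts only if
    both the box-wide and the per-publisher limits on simultaneous
    attempts permit it; otherwise it remains queued for later."""

    def __init__(self, box_limit=10, publisher_limit=2):
        self.box_limit = box_limit
        self.publisher_limit = publisher_limit
        self.running = defaultdict(int)  # publisher -> active attempts

    def try_start(self, publisher):
        total = sum(self.running.values())
        if (total >= self.box_limit
                or self.running[publisher] >= self.publisher_limit):
            return False  # over a limit: attempt stays scheduled
        self.running[publisher] += 1
        return True

    def finish(self, publisher):
        self.running[publisher] -= 1

s = CrawlScheduler(box_limit=3, publisher_limit=2)
assert s.try_start("pubA") and s.try_start("pubA")
assert not s.try_start("pubA")   # publisher-specific limit reached
assert s.try_start("pubB")
assert not s.try_start("pubB")   # box-wide limit reached
```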
<br />
AIP (AU) instances in CLOCKSS boxes may be created even before the first SIP supplying content for them is available. The instance continues to accumulate content via a sequence of SIPs becoming available through time, as the LOCKSS daemon repeatedly crawls the Web server from which SIPs are collected. Nominally, each AU represents a delimited span of time, such as a year or a volume of a journal. But because SIPs containing errata or corrections can arrive even after the delimited span of time, there is in general never a point in time at which the AU can be said to be definitively complete in the sense that no further content will ever be added.<br />
<br />
Although, as documented in [[CLOCKSS: Ingest Pipeline]], the quality assurance process that content undergoes as it is ingested into the CLOCKSS archive includes some visual spot checks that the content renders properly in a web browser, these are primarily intended to assure that all necessary URLs are being harvested. At the scale of the CLOCKSS archive's operations it is not feasible for these checks to be exhaustive. Further, the Content Information in an AIP is copyrighted by the publisher. It is their responsibility to ensure that it is Independently Understandable for their readers, who are also the eventual Consumers of the content if it is ever triggered from the CLOCKSS archive. Even if these spot checks were to detect some rendering problems, the CLOCKSS archive would not be permitted to modify the Content Information to correct them.<br />
<br />
The goal of the [[CLOCKSS: Ingest Pipeline]] is not to ensure that the AUs are "complete and correct"; there are no operationally implementable definitions of those terms. If there were, they would involve second-guessing the publishers. The goal is rather to ensure that the AUs "faithfully reflect what the publisher has published on their web site (for harvested AUs) or supplied to the archive (for file transfer AUs)".<br />
<br />
=== Creating an AIP from a harvest SIP ===<br />
<br />
When the scheduled time for the AU's requested collection attempt arrives, the plugin configures the box's Web crawler, which first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site and then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly-discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].<br />
<br />
Note that for quality assurance (QA) reasons, as set out in [[CLOCKSS: Ingest Pipeline]], in practice each harvest AU is created twice:<br />
* First, a temporary AU is created on the CLOCKSS ingest machines. These machines collect the AU's current content, and then come to agreement on the content using the [[LOCKSS: Polling and Repair Protocol]].<br />
* Once agreement is reached, the permanent AU is created on each of the production CLOCKSS boxes. They then:<br />
** crawl the content from the CLOCKSS ingest boxes<br />
** start regular integrity checks with the other production CLOCKSS boxes using the [[LOCKSS: Polling and Repair Protocol]].<br />
The process the production boxes use is exactly the same as that used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the publisher's URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will be that previously collected from the publisher's URL by the ingest box (see [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|Harvest Content Processing]]). Note that the publisher's web site is not involved in this process. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the Title DataBase (TDB) to ZAPPED (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.<br />
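The ingest-box proxy behaviour described above can be sketched as follows. This is a minimal illustration, not the LOCKSS daemon's actual proxy code; the <tt>store</tt> dictionary and the example URLs stand in for the ingest box's repository.<br />

```python
# Sketch (not the actual LOCKSS daemon code) of the ingest-box proxy
# behaviour: for a requested publisher URL, return the previously
# collected content if the ingest box holds it, or a 404 if it does not.
# The 'store' dict stands in for the box's repository.

def proxy_response(store, url):
    """Return an (HTTP status, body) pair for a proxied request."""
    if url in store:
        return 200, store[url]   # serve the previously collected copy
    return 404, b""              # never fall through to the publisher

# Example: the ingest box collected one article page.
store = {"http://publisher.example.com/v6/article1.html": b"<html>...</html>"}
status, body = proxy_response(store, "http://publisher.example.com/v6/article1.html")
missing, _ = proxy_response(store, "http://publisher.example.com/v6/article2.html")
```

Because the fallback is a 404 rather than a forward to the publisher, a production box can only ever collect content that the ingest boxes have already agreed on.<br />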
<br />
=== Creating an AIP from a file transfer SIP ===<br />
<br />
Publishers choosing to supply content via file transfer can choose either:<br />
* To ''push'' the content via <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> to a CLOCKSS-run ingest server, using a user name and password chosen by the CLOCKSS team and specific to the publisher. <br />
* To have the CLOCKSS archive's ingest server ''pull'' the content from a publisher-run <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> server using a user name and password chosen by, and specific to, the publisher.<br />
In both cases, the combination of the DNS name at the publisher's end, the user name, and the password identifies the publisher for provenance purposes.<br />
<br />
Content and metadata in SIPs received via file transfer is organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the ''staging server''. The hierarchy contains a directory per publisher containing a directory per year that contains the content and metadata received from that publisher during that year, and an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information; for those publishers, fixity information is computed as soon as possible after file transfer and stored with the content.<br />
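The manifest-page generation described above can be sketched as follows. This is an illustrative sketch only; the actual shell scripts, file names, and the exact wording of the permission statement may differ.<br />

```python
# Sketch of generating the staging server's manifest page for one
# publisher/year directory: links to every received file plus a CLOCKSS
# permission statement. The statement wording and file names here are
# illustrative, not the production values.
import html

PERMISSION = ("CLOCKSS system has permission to ingest, preserve, "
              "and serve this Archival Unit")

def manifest_page(publisher, year, files):
    """Build a simple HTML manifest linking to each file in the directory."""
    links = "\n".join(
        '<a href="%s">%s</a><br/>' % (html.escape(f), html.escape(f))
        for f in sorted(files))
    return ("<html><body>\n<p>%s</p>\n<h1>%s %d</h1>\n%s\n</body></html>"
            % (PERMISSION, html.escape(publisher), year, links))

page = manifest_page("examplepub", 2014, ["issue1.zip", "issue2.zip"])
```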
<br />
At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's plugin is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server. The publisher's names or URLs for these files are contained in, or recoverable from metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.<br />
<br />
After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its status changed to ZAPPED in the TDB (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved [[CLOCKSS: Logging and Records#External Reports|process for generating external reports]].<br />
<br />
== CLOCKSS Representation Information ==<br />
<br />
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS archive is preserved outside the CLOCKSS archive, primarily in open source code repositories.<br />
<br />
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS archive, in the SourceForge repository and elsewhere.<br />
<br />
Thus the Representation Information that it is necessary to preserve with content objects in a CLOCKSS AIP is the information needed by web browsers to render, and by the LOCKSS software to migrate, the content obtained from a URL. This information has two parts:<br />
* The <tt>Content-Type</tt> and other HTTP headers obtained from the URL. The metadata preserved for each version of each URL within an AU includes all HTTP headers.<br />
* "Magic Number" and other information contained in the HTTP content payload itself. The content preserved for each version of each URL within an AU is the entire HTTP content payload.<br />
A web browser has no other information upon which to base its rendering of the content of a URL than its name (perhaps with a file extension) and these two parts, which must therefore be adequate for the purpose of making the content understandable to Consumers.<br />
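How a consumer-side tool might combine these two parts can be sketched as follows. This is an illustration, not CLOCKSS code; the magic-number table is deliberately tiny, and in practice a browser or migration tool consults far more format signatures.<br />

```python
# Sketch of the two sources of Representation Information noted above:
# the preserved HTTP headers (Content-Type) and the "magic number" at
# the start of the preserved payload. The header is preferred; the
# payload's leading bytes are a fallback.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def sniff(payload):
    """Guess a media type from the payload's leading bytes, if recognized."""
    for magic, mtype in MAGIC.items():
        if payload.startswith(magic):
            return mtype
    return None

def media_type(headers, payload):
    """Prefer the preserved Content-Type header; fall back to sniffing."""
    declared = headers.get("Content-Type", "").split(";")[0].strip()
    return declared or sniff(payload)

mt = media_type({"Content-Type": "application/pdf"}, b"%PDF-1.4 ...")
sniffed = media_type({}, b"\x89PNG\r\n\x1a\n rest-of-image")
```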
<br />
== Locating Digital Objects ==<br />
<br />
The term "digital objects" is often used in discussions of preserved digital information. In the CLOCKSS context, it might refer to either of two types of object:<br />
* AIPs, or in the CLOCKSS context AUs.<br />
* Content objects within an AU.<br />
<br />
=== Locating AIPs ===<br />
<br />
The CLOCKSS archive has a single class of AIP, called an Archival Unit (AU). As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], the Reference part of the AU's Preservation Description Information is an immutable name that is the same on each CLOCKSS box. The location of the instance of the AU on a particular CLOCKSS box may be obtained by querying a map from this internal name to the path to the AU's root directory. This map is built during box startup and subsequently maintained as new AUs are created or existing AUs are moved.<br />
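The startup scan that builds this map can be sketched as follows. The <tt>au.props</tt> file name and its contents are illustrative stand-ins for the read-only context file recorded when the AU is created; the daemon's actual repository layout may differ.<br />

```python
# Sketch of building the map from an AU's immutable internal name to its
# root directory on this box, by scanning the repository at startup.
import os, tempfile

def build_au_map(repo_root):
    """Walk the repository; each directory holding an 'au.props' file
    (hypothetical name) is taken to be an AU root."""
    au_map = {}
    for dirpath, dirnames, filenames in os.walk(repo_root):
        if "au.props" in filenames:
            with open(os.path.join(dirpath, "au.props")) as f:
                name = f.readline().strip()   # first line: internal AU name
            au_map[name] = dirpath
            dirnames[:] = []                  # AU roots do not nest
    return au_map

# Example: one AU recorded under a scratch repository.
repo = tempfile.mkdtemp()
au_dir = os.path.join(repo, "cache", "au1234")
os.makedirs(au_dir)
with open(os.path.join(au_dir, "au.props"), "w") as f:
    f.write("org|clockss|plugin|Example&volume~6\n")
aus = build_au_map(repo)
```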
<br />
=== Locating content objects ===<br />
<br />
Content within an AU can be located in one of two ways:<br />
* Via the URL from which it was obtained. As [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|described above]], there is a reversible mapping between the URI from which the content was collected and the path from the root of the AU containing it in the POSIX file system containing the AU. This enables content within an AU to be located via the map between the AU's internal name and its root location, and the components of the URL.<br />
* Via metadata search. As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], content within an AU can be located by querying the [[LOCKSS: Metadata Database|metadata database]] using specific bibliographic metadata fields to match against the [[LOCKSS: Extracting Bibliographic Metadata|bibliographic metadata supplied by the publisher]] or derived from the AU's context. <br />
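The reversible URL-to-path mapping of the first method can be sketched as follows. The encoding shown (percent-escaping each path component) is an illustration; the actual on-disk scheme used by the LOCKSS repository may differ.<br />

```python
# Sketch of a reversible mapping between a collected URL and a path
# relative to the AU's root directory, as described above.
from urllib.parse import urlsplit, quote, unquote

def url_to_path(url):
    """Map a collected URL to an AU-relative path (sketch)."""
    parts = urlsplit(url)
    # Percent-encode each component so the mapping can be inverted exactly.
    segs = [quote(s, safe="") for s in parts.path.split("/") if s]
    return "/".join([parts.scheme, parts.netloc] + segs)

def path_to_url(path):
    """Invert url_to_path."""
    scheme, host, *segs = path.split("/")
    return scheme + "://" + host + "/" + "/".join(unquote(s) for s in segs)

url = "http://www.example.com/foo/bar"
path = url_to_path(url)
```

Given the AU's root location from the internal-name map, joining it with this relative path locates the content object on disk.<br />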
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# [[Definition of SIP]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[CLOCKSS: Designated Community]]<br />
# [[LOCKSS: Format Migration]]<br />
# David S.H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. http://dx.doi.org/10.1045/january2005-rosenthal accessed 2013.08.07</div>
Dshr
http://documents.clockss.org/index.php/Definition_of_AIP
Definition of AIP
2014-02-19T22:22:17Z
<p>Dshr: /* Creating an AIP from a file transfer SIP */ Response to Site Visit Schedule</p>
<hr />
<div>= CLOCKSS Definition of AIP =<br />
<br />
== OAIS Archival Information Package (AIP) ==<br />
<br />
The OAIS definition of AIP is:<br />
<blockquote>Archival Information Package (AIP): An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>AIP Edition: An AIP whose Content Information or Preservation Description Information has been upgraded or improved with the intent not to preserve information, but to increase or improve it. An AIP edition is not considered to be the result of a Migration.</blockquote><br />
<blockquote>AIP Version: An AIP whose Content Information or Preservation Description Information has undergone a Transformation on a source AIP and is a candidate to replace the source AIP. An AIP version is considered to be the result of a Digital Migration.</blockquote><br />
<blockquote>Archival Information Collection (AIC): An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages.</blockquote><br />
<blockquote>Archival Information Unit (AIU): An Archival Information Package where the Archive chooses not to break down the Content Information into other Archival Information Packages. An AIU can consist of multiple digital objects (e.g., multiple files).</blockquote><br />
<blockquote>Preservation Description Information (PDI): The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.</blockquote><br />
The OAIS discussion of AIP is:<br />
<blockquote>Within the OAIS one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated Content Information. The AIP may also contain a collection of other AIPs, and this is discussed and modeled in section 4. The Packaging Information of the AIP will conform to OAIS internal standards, and it may vary as it is managed by the OAIS.</blockquote><br />
<br />
The OAIS definition of Representation Information is:<br />
<blockquote>'''Representation Information:''' The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard.<br />
<br />
Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.</blockquote><br />
Other relevant definitions are:<br />
<blockquote>'''Representation Network:''' The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.</blockquote><br />
<blockquote>'''Representation Rendering Software:''' A type of software that displays Representation Information of an Information Object in forms understandable to humans.</blockquote><br />
The OAIS discussion of Representation Information is:<br />
<blockquote>In general, it can be said that ‘Data interpreted using its Representation Information yields Information’, ... In order for this Information Object to be successfully preserved, it is critical for an OAIS to identify clearly and to understand clearly the Data Object and its associated Representation Information. For digital information, this means the OAIS must clearly identify the bits and the Representation Information that applies to those bits. ... As a further complication, the recursive nature of Representation Information, which typically is composed of its own data and its own Representation Information, typically leads to a network of Representation Information objects. Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base, which would be the equivalent of extending the definition of the Designated Community. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding. The choice, for an OAIS, to collect all the relevant Representation Information or to reference its existence in another trusted or partner OAIS Archive, is an implementation and organization decision.</blockquote><br />
<br />
== CLOCKSS Archival Information Package (AIP) ==<br />
<br />
CLOCKSS calls its AIPs Archival Units (AUs). They are constructed from [[Definition of SIP|CLOCKSS SIPs]] as described below. The internal representation of an AU is as a directory hierarchy in a POSIX file system in which the directory names represent the components of the name of the URL from which content was collected, including the terminal component. Each of these directories, in addition to the names of descendant components, may contain files whose names start with <tt>#</tt> (and thus cannot be components of the URL name, since <tt>#</tt> introduces a fragment identifier in a URL) containing metadata relating to the URL up to that component. Additionally each directory may contain a directory named <tt>#content</tt> containing a sequence of versions of content obtained from the URL up to that component, and metadata relating to each version of the content. This metadata is represented as a file containing {key, value} pairs. It includes all the HTTP headers returned by the <tt>GET</tt> that fetched the content, in particular the <tt>Content-Type</tt>, one component of which is the <tt>Media-Type</tt>. It also includes additional metadata generated during ingestion of that version of the content at that URL, typically including a checksum and a timestamp. This somewhat complex representation is required because:<br />
* The URLs <tt>http://www.example.com/foo</tt> and <tt>http://www.example.com/foo/bar</tt> may both contain (different) content.<br />
* The content and/or the headers obtained from <tt>http://www.example.com/foo/bar</tt> at time T(0) and at time T(1) may differ.<br />
The representation allows for easy access by tools other than the LOCKSS daemon, for example shell scripts.<br />
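The layout described above can be sketched as follows. The version and properties file names (<tt>1</tt>, <tt>1.props</tt>) and metadata key names are illustrative, not the daemon's exact on-disk scheme.<br />

```python
# Sketch of the AU's on-disk representation: each URL component is a
# directory, with a '#content' directory holding numbered versions of
# the content plus a {key, value} properties file per version.
import os, tempfile, hashlib, time

def store_version(au_root, url_components, payload, headers):
    """Append a new version of a URL's content under its #content dir."""
    node = os.path.join(au_root, *url_components, "#content")
    os.makedirs(node, exist_ok=True)
    version = len([f for f in os.listdir(node) if f.isdigit()]) + 1
    with open(os.path.join(node, str(version)), "wb") as f:
        f.write(payload)
    props = dict(headers)                     # HTTP headers from the GET
    props["X-Checksum-SHA1"] = hashlib.sha1(payload).hexdigest()
    props["X-Fetch-Time"] = str(int(time.time()))
    with open(os.path.join(node, "%d.props" % version), "w") as f:
        for k, v in sorted(props.items()):
            f.write("%s=%s\n" % (k, v))
    return version

# Two successive collections of http://www.example.com/foo yield versions 1, 2.
root = tempfile.mkdtemp()
v1 = store_version(root, ["www.example.com", "foo"], b"hello",
                   {"Content-Type": "text/html"})
v2 = store_version(root, ["www.example.com", "foo"], b"hello2",
                   {"Content-Type": "text/html"})
```

Because everything is plain files and directories, tools such as shell scripts can traverse an AU without the LOCKSS daemon, as noted above.<br />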
<br />
=== CLOCKSS AIP Examples ===<br />
<br />
==== File Transfer AU ====<br />
<br />
The Journal of Laser Applications Volume 24:<br />
* [[Media:LA24 urls.pdf|A list of URLs (87-page PDF)]].<br />
* [[Media:LA24 metadata.pdf|The metadata]].<br />
<br />
==== Harvest AU ====<br />
<br />
Advances in Building Energy Research Volume 6:<br />
* [[Media:ABER V6 urls.pdf|A list of URLs (43-page PDF)]].<br />
* [[Media:ABER V6 metadata.pdf|The metadata]].<br />
<br />
=== CLOCKSS Preservation Description Information (PDI) ===<br />
<br />
The OAIS Reference Model classifies the Preservation Description Information (PDI) included in an AIP as follows:<br />
<ul><li>'''Provenance:''' the provenance of each version of each URL in the AU can be determined as follows. The content of that version was obtained from the URL represented by the path from the root of the AU's directory hierarchy to the parent of the content directory, at the time of the timestamp. If this content was the result of a repair from another CLOCKSS box, that is recorded in the metadata. </li><br />
<li>'''Fixity:''' the metadata for each version of each URL in the AU includes a checksum computed at the time it was obtained and, if one is available, a checksum provided by the Web server from which it was obtained.</li><br />
<li>'''Context:''' the context information for an AU consists of two pieces of information which together allow the daemon to construct an instance of a Java class customized to suit the AU:<br />
<ul><br />
<li>The ''parameters'', which are a set of {name, value} pairs providing the arguments needed to construct an instance of the class selected by the plugin ID.</li><br />
<li>The ''plugin ID'', which, in encoded form, identifies the class to be instantiated and supplies additional information.</li><br />
</ul><br />
In effect, the context for the AU is a customized instance of a Java class, normally referred to as its ''plugin''. It is thus executable, capable of performing operations on the AU such as [[Definition of AIP#Creating AIPs from SIPs|adding content and metadata from a SIP]], [[LOCKSS: Extracting Bibliographic Metadata|extracting metadata]], and taking part in [[LOCKSS: Polling and Repair Protocol|integrity checks]].<br /><br />
The context for the content of a URL in an AU consists of the associated metadata, including the <tt>Content-Type</tt> and the other HTTP headers which together provide the information a Web browser uses to render the content.<br /><br />
Each AU is a stand-alone, self-contained object that, if necessary, can be disseminated (triggered) independently. Although it may be part of a larger bibliographic unit in a logical sense, it is self-contained in its representation. To ensure this, common URLs, for example a journal logo image, are replicated in each AU that refers to them.</li><br />
<li>'''Reference:''' Each AU has an immutable internal name, computed from its context information and stored with it. The name of the AU is the same on all CLOCKSS boxes, although the location of the AU instance in the box's file system may differ. The use of this name to locate the instance of an AU on a CLOCKSS box, or content within it, is [[Definition of AIP#Locating Digital Objects|described below]].</li><br />
<li>'''Access Rights:''' the CLOCKSS archive is a dark archive; access to the content of all AUs is forbidden unless in the future the [[CLOCKSS: Extracting Triggered Content|CLOCKSS board declares a trigger event]] for a specific set of AUs. Thus it is neither necessary nor possible to store the access rights to an AU with the AU itself, since they will only be determined in the future.</li></ul><br />
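The Fixity check implied above can be sketched as follows: recompute the checksum of the stored content and compare it with the checksum recorded at ingest, and with a publisher-supplied checksum if one was available. The metadata key names are illustrative.<br />

```python
# Sketch of a fixity check against the per-version metadata described
# above. Key names ('X-Checksum-SHA1', 'X-Publisher-Checksum') are
# illustrative stand-ins for the recorded metadata fields.
import hashlib

def fixity_ok(payload, metadata):
    """True iff the stored payload still matches its recorded checksums."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest != metadata["X-Checksum-SHA1"]:
        return False
    supplied = metadata.get("X-Publisher-Checksum")  # may be absent
    return supplied is None or supplied == digest

meta = {"X-Checksum-SHA1": hashlib.sha1(b"content").hexdigest()}
ok = fixity_ok(b"content", meta)
bad = fixity_ok(b"tampered", meta)
```

Note that in CLOCKSS the authoritative integrity check is the [[LOCKSS: Polling and Repair Protocol]]; a stored checksum of this kind is a local aid, not a substitute for it.<br />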
<br />
== Creating AIPs from SIPs ==<br />
<br />
When an AU is created, its root directory is created and the context information (plugin ID and AU parameters) is recorded in a read-only file in that directory. The same information is recorded in a central file in the daemon's configuration directory.<br />
<br />
Because most publishers whose materials are preserved in the CLOCKSS archive are serial publishers, emitting a continuous stream of articles, AUs are typically constructed from a series of SIPs, each one representing the articles published since its predecessor SIP. There are [[Definition of SIP|two kinds of SIP]]:<br />
* A ''harvest'' SIP represents content that the CLOCKSS archive will ingest by crawling the publisher's web site.<br />
* A ''file transfer'' SIP represents content that the publisher will package and transfer to the CLOCKSS archive via FTP, rsync, or another file transfer mechanism.<br />
<br />
The process of creating an AU (AIP) from a SIP is different for each of the two types of SIP, but each process starts by ''configuring'' the AU on the CLOCKSS boxes, which involves supplying the context information (plugin ID and the parameters it requires) via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. Each box creates a root directory for its instance of the AU at a suitable place in its POSIX file system and calls the AU's plugin to arrange attempts to collect SIPs. The plugin contains information about times when content collection is allowed. The LOCKSS daemon on each box maintains a schedule of all AUs' collection attempts; when an AU requests a collection attempt it is scheduled so as to conform to box-wide and publisher-specific limits on the number of simultaneous collection attempts.<br />
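Deriving the AU's immutable internal name from this context information (see the Reference item above) can be sketched as follows. The encoding shown, replacing dots with bars and appending the sorted, percent-escaped parameters, is an illustration of the idea; the daemon's actual encoding may differ.<br />

```python
# Sketch of computing an immutable AU name from its context information
# (plugin ID plus parameters). Sorting the parameters makes the name
# independent of the order in which they were supplied, so every box
# computes the same name.
from urllib.parse import quote

def au_name(plugin_id, params):
    key = plugin_id.replace(".", "|")
    args = "&".join("%s~%s" % (k, quote(v, safe=""))
                    for k, v in sorted(params.items()))
    return key + "&" + args

# The plugin ID and parameter names below are hypothetical examples.
name = au_name("org.clockss.plugin.ExamplePlugin",
               {"base_url": "http://www.example.com/", "volume": "6"})
same = au_name("org.clockss.plugin.ExamplePlugin",
               {"volume": "6", "base_url": "http://www.example.com/"})
```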
<br />
AIP (AU) instances in CLOCKSS boxes may be created even before the first SIP supplying content for them is available. The instance continues to accumulate content via a sequence of SIPs becoming available through time, as the LOCKSS daemon repeatedly crawls the Web server from which SIPs are collected. Nominally, each AU represents a delimited span of time, such as a year or a volume of a journal. But because SIPs containing errata or corrections can arrive even after the delimited span of time, there is in general never a point in time at which the AU can be said to be definitively complete in the sense that no further content will ever be added.<br />
<br />
Although, as documented in [[CLOCKSS: Ingest Pipeline]], the quality assurance process that content undergoes as it is ingested into the CLOCKSS archive includes some visual spot checks that the content renders properly in a web browser, these are primarily intended to ensure that all necessary URLs are being harvested. At the scale of the CLOCKSS archive's operations it is not feasible for these checks to be exhaustive. Further, the Content Information in an AIP is copyrighted by the publisher. It is the publisher's responsibility to ensure that it is Independently Understandable to their readers, who are also the eventual Consumers of the content if it is ever triggered from the CLOCKSS archive. Even if these spot checks were to detect rendering problems, the CLOCKSS archive would not be permitted to modify the Content Information to correct them.<br />
<br />
The goal of the [[CLOCKSS: Ingest Pipeline]] is not to ensure that the AUs are "complete and correct"; there are no operationally implementable definitions of those terms. If there were, they would involve second-guessing the publishers. The goal is rather to ensure that the AUs "faithfully reflect what the publisher has published on their web site (for harvested AUs) or supplied to the archive (for file transfer AUs)".<br />
<br />
=== Creating an AIP from a harvest SIP ===<br />
<br />
When the scheduled time for the AU's requested collection attempt arrives, the plugin configures the box's Web crawler, which first verifies that the appropriate [[Definition of SIP|CLOCKSS permission statement]] is present on the publisher's Web site and then crawls the site. The plugin bounds the crawl in the URL name space by determining which links to follow, rate-limits the crawl so as not to disrupt operation of the publisher's Web site, and stores all newly-discovered content and the related metadata in the appropriate place relative to the AU's root directory in the box's repository, as described [[Definition of AIP#Creating AIPs from SIPs|above]].<br />
<br />
Note that for quality assurance (QA) reasons, as set out in [[CLOCKSS: Ingest Pipeline]], in practice each harvest AU is created twice:<br />
* First, a temporary AU is created on the CLOCKSS ingest machines. These machines collect the AU's current content, and then come to agreement on the content using the [[LOCKSS: Polling and Repair Protocol]].<br />
* Once agreement is reached, the permanent AU is created on each of the production CLOCKSS boxes. They then:<br />
** crawl the content from the CLOCKSS ingest boxes<br />
** start regular integrity checks with the other production CLOCKSS boxes using the [[LOCKSS: Polling and Repair Protocol]].<br />
The process the production boxes use is exactly the same as that used by the ingest boxes, except that the production box is configured to crawl using the LOCKSS daemon on an ingest box as a proxy. The LOCKSS daemon on the ingest box is configured to act as a proxy that returns the content it has for the URL, or 404 for content it does not have. This ensures that the URLs from which the production box collects content point to the publisher, although the content will come from the ingest box. Once agreement is reached among the production boxes, the AU can be removed from the ingest boxes. Removal is recorded by changing the state of the AU in the Title DataBase (TDB) to ZAPPED (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). The current process for doing so has been deemed inadequate (see [[CLOCKSS: Ingest Pipeline#Harvest Process|CLOCKSS: Ingest Pipeline]]); a replacement process is under development.<br />
<br />
=== Creating an AIP from a file transfer SIP ===<br />
<br />
Publishers choosing to supply content via file transfer can choose either:<br />
* To ''push'' the content via <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> to a CLOCKSS-run ingest server, using a user name and password chosen by the CLOCKSS team and specific to the publisher. <br />
* To have the CLOCKSS archive's ingest server ''pull'' the content from a publisher-run <tt>FTP</tt>, <tt>SFTP</tt> or <tt>rsync</tt> server using a user name and password chosen by, and specific to, the publisher.<br />
In both cases, the combination of the DNS name at the publisher's end, the user name, and the password identifies the publisher for provenance purposes.<br />
<br />
Content and metadata in SIPs received via file transfer is organized by shell scripts into a directory hierarchy that is exported to the production CLOCKSS boxes by a Web server, called the ''staging server''. The hierarchy contains a directory per publisher containing a directory per year that contains the content and metadata received from that publisher during that year, and an automatically generated manifest page containing links to all the files and a CLOCKSS permission statement. Some publishers' file transfer SIPs do not contain fixity information; for those publishers, fixity information is computed as soon as possible after file transfer and stored with the content.<br />
<br />
At the start of each new year an AU for the new year's SIPs from each publisher supplying content via file transfer is configured on the production CLOCKSS boxes via the [[LOCKSS: Property Server Operations|CLOCKSS property server]]. The AU's plugin is configured to start crawling from the automatically generated manifest page of that year's directory for that publisher on the staging server at regular intervals. Each time the box crawls, it adds content and metadata from newly received SIPs to the box's copy of the AU. In this case the files in the AIPs have names reflecting the files in the SIPs on the staging server. The publisher's names or URLs for these files are contained in, or recoverable from metadata in the SIPs/AIPs. The content will be associated with them when and if it is triggered.<br />
<br />
After the year has ended and agreement has been reached on the AU among the production CLOCKSS boxes, the AU can be removed from the staging server and its status changed to ZAPPED in the TDB (See [[LOCKSS: Extracting Bibliographic Metadata#Metadata Database Indexing|LOCKSS: Extracting Bibliographic Metadata]]). A review deemed the initial process for doing so inadequate and error-prone. AUs will not be removed from the staging server until an acceptable process has been developed and tested. This process must integrate with the improved [[CLOCKSS: Logging and Records#External Reports|process for generating external reports]].<br />
<br />
== CLOCKSS Representation Information ==<br />
<br />
[[CLOCKSS: Designated Community]] documents that the software Knowledge Base of the eventual Consumers of triggered content includes web browsers and their associated plugins capable of rendering the formats used on the Web at the time of their publication. This part of the Representation Information for the CLOCKSS archive is preserved outside the CLOCKSS archive, primarily in open source code repositories.<br />
<br />
[[LOCKSS: Format Migration]] documents how evolution in this Knowledge Base through time can be handled using the Content Negotiation capabilities of the Web to implement [http://dx.doi.org/10.1045/january2005-rosenthal transparent, on-access format migration]. This capability is implemented in the LOCKSS software, which is preserved outside the CLOCKSS archive, in the SourceForge repository and elsewhere.<br />
<br />
Thus the Representation Information that it is necessary to preserve with content objects in a CLOCKSS AIP is the information needed by web browsers to render, and by the LOCKSS software to migrate, the content obtained from a URL. This information has two parts:<br />
* The <tt>Content-Type</tt> and other HTTP headers obtained from the URL. The metadata preserved for each version of each URL within an AU includes all HTTP headers.<br />
* "Magic Number" and other information contained in the HTTP content payload itself. The content preserved for each version of each URL within an AU is the entire HTTP content payload.<br />
A web browser has no other information upon which to base its rendering of the content of a URL than its name (perhaps with a file extension) and these two parts, which must therefore be adequate for the purpose of making the content understandable to Consumers.<br />
<br />
== Locating Digital Objects ==<br />
<br />
The term "digital objects" is often used in discussions of preserved digital information. In the CLOCKSS context, it might refer to either of two types of object:<br />
* AIPs, or in the CLOCKSS context AUs.<br />
* Content objects within an AU.<br />
<br />
=== Locating AIPs ===<br />
<br />
The CLOCKSS archive has a single class of AIP, called an Archival Unit (AU). As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], the Reference part of the AU's Preservation Description Information is an immutable name that is the same on each CLOCKSS box. The location of the instance of the AU on a particular CLOCKSS box may be obtained by querying a map from this internal name to the path to the AU's root directory. This map is built during box startup and subsequently maintained as new AUs are created or existing AUs are moved.<br />
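A minimal sketch of such a map, assuming one directory per AU named by its internal name under each repository root (the class name, method names, and on-disk layout are assumptions, not the actual LOCKSS repository format):<br />

```python
# Hypothetical sketch of the AU-name-to-path map described above:
# built by scanning the repository roots at startup, then maintained
# as AUs are created or moved between disks.
import os

class AuLocator:
    def __init__(self, repository_roots):
        self.au_paths = {}  # immutable internal AU name -> root directory
        for root in repository_roots:
            for au_name in os.listdir(root):
                self.au_paths[au_name] = os.path.join(root, au_name)

    def locate(self, au_name):
        """Return this box's path for the AU, or None if not present."""
        return self.au_paths.get(au_name)

    def au_moved(self, au_name, new_path):
        # Keep the map current when an AU is moved to another disk.
        self.au_paths[au_name] = new_path
```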
<br />
=== Locating content objects ===<br />
<br />
Content within an AU can be located in one of two ways:<br />
* Via the URL from which it was obtained. As [[Definition of AIP#CLOCKSS Archival Information Package (AIP)|described above]], there is a reversible mapping between the URI from which the content was collected and the path from the root of the AU containing it in the POSIX file system containing the AU. This enables content within an AU to be located via the map between the AU's internal name and its root location, and the components of the URL.<br />
* Via metadata search. As [[Definition of AIP#CLOCKSS Preservation Description Information (PDI)|described above]], content within an AU can be located by querying the [[LOCKSS: Metadata Database|metadata database]] using specific bibliographic metadata fields to match against the [[LOCKSS: Extracting Bibliographic Metadata|bibliographic metadata supplied by the publisher]] or derived from the AU's context. <br />
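A simplified illustration of a reversible URL-to-path mapping of the first kind, assuming the URL's scheme, host, and path components become directory components under the AU root (this ignores query strings and character escaping, which a real mapping must handle, and is not the actual LOCKSS layout):<br />

```python
# Hypothetical sketch: reversible mapping between a collected URL and
# a path under the AU's root directory in a POSIX file system.
from urllib.parse import urlparse, urlunparse

def url_to_path(au_root: str, url: str) -> str:
    u = urlparse(url)
    # Strip the leading "/" so the URL path nests under the host directory.
    return "/".join([au_root, u.scheme, u.netloc] +
                    u.path.lstrip("/").split("/"))

def path_to_url(au_root: str, path: str) -> str:
    # Invert the mapping: the first two components after the AU root
    # are the scheme and host; the remainder is the URL path.
    scheme, host, *rest = path[len(au_root) + 1:].split("/")
    return urlunparse((scheme, host, "/" + "/".join(rest), "", "", ""))
```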
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by LOCKSS Engineering Staff<br />
* Approval by LOCKSS Chief Scientist<br />
<br />
== Relevant Documents ==<br />
# OAIS (2012) CCSDS 650.0-M-2: Reference Model for an Open Archival Information System (OAIS). Magenta Book. Issue 1. June 2012 (ISO 14721:2003) http://public.ccsds.org/publications/archive/650x0m2.pdf accessed 2013.08.31<br />
# [[Definition of SIP]]<br />
# [[CLOCKSS: Extracting Triggered Content]]<br />
# [[LOCKSS: Metadata Database]]<br />
# [[LOCKSS: Extracting Bibliographic Metadata]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[CLOCKSS: Ingest Pipeline]]<br />
# [[CLOCKSS: Designated Community]]<br />
# [[LOCKSS: Format Migration]]<br />
# David S.H. Rosenthal, Thomas Lipkis, Thomas Robertson, Seth Morabito. “Transparent Format Migration of Preserved Web Content”, D-Lib Magazine, vol. 11, no. 1, January 2005. http://dx.doi.org/10.1045/january2005-rosenthal accessed 2013.08.07</div>
<hr />
<div>= CLOCKSS: Ingest Pipeline =<br />
<br />
The CLOCKSS Archive ingests two different types of content:<br />
* Harvest Content - the contents of a publisher's Web site, obtained by crawling it.<br />
* File Transfer content - the form of the content the publisher uses to create their Web site, obtained from the publisher typically via FTP.<br />
<br />
== Harvest Content Pipeline ==<br />
<br />
=== Harvest Publisher Engagement ===<br />
<br />
The process that a new publisher undergoes when signing up with CLOCKSS depends on whether the publisher uses a publishing platform already supported by the LOCKSS software:<br />
* If so, the discussion need cover only the process for turning on access for the CLOCKSS Archive's ingest machines.<br />
* Otherwise, a plugin writer from the LOCKSS team is designated to analyze the publisher's site and:<br />
** develop requirements for the publisher plugin as described in [[LOCKSS: Software Development Process]].<br />
** work with the publisher to add CLOCKSS permission pages and make any other necessary changes to their site.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Plugin Development and Testing ===<br />
<br />
The designated LOCKSS team member:<br />
* Works with the publisher to grant access to the CLOCKSS ingest machines' IP addresses.<br />
* Develops and tests any necessary software enhancements including unit tests (see [[LOCKSS: Software Development Process]]).<br />
* Works with a plugin writer to implement and test the necessary plugin and its unit tests (see [[LOCKSS: Software Development Process]]).<br />
<br />
Once the plugin passes its tests and the CLOCKSS ingest machines have access to the publisher's site with CLOCKSS permissions, the plugin writer works with the LOCKSS content team to process sample batches of publisher content for quality assurance.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Harvest Content Processing ===<br />
<br />
Once a plugin has been developed and tested with unit tests and sample content, it is released to a set of content-testing boxes, which are identical to production CLOCKSS boxes except that they are generally smaller. Then:<br />
* Under the direction of an internally developed testing framework (AUTest), two daemons with the plugin are directed to collect two copies each of a substantial amount of real content, one copy collected from Stanford's IP address range, the other from Rice or Indiana.<br />
* AUTest then directs each of the two daemons to compute the message digest of the filtered contents of each collected AU.<br />
* AUTest compares the results to ensure correct collection and correct operation of hash filters.<br />
* Metadata is extracted and checked for sanity.<br />
* A sample of the content is browsed by a human tester to verify the correctness of the plugin by ensuring that:<br />
** all the types of files that should be collected are,<br />
** and that files that shouldn't be collected, such as advertisements or articles from previous years, are properly excluded.<br />
Any problems detected are addressed by modifying and testing the plugin in a development environment, or by contacting the publisher if necessary to resolve systemic site errors, then the tests are repeated on the content-testing boxes.<br />
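The digest comparison that AUTest directs can be sketched as follows. This is hypothetical code, not the real framework; the function names and the choice of SHA-1 are illustrative assumptions. The key point is that each daemon hashes the ''filtered'' content of every URL in its copy of the AU, and the two result sets must agree.<br />

```python
# Hypothetical sketch of the AUTest digest comparison: hash the filtered
# content of each URL in two independently collected copies of an AU.
import hashlib

def au_digests(au_files: dict, content_filter) -> dict:
    """Map each URL to the digest of its filtered content."""
    return {
        url: hashlib.sha1(content_filter(content)).hexdigest()
        for url, content in au_files.items()
    }

def copies_agree(copy_a: dict, copy_b: dict, content_filter) -> bool:
    # Disagreement indicates a collection error or a broken hash filter.
    return au_digests(copy_a, content_filter) == au_digests(copy_b, content_filter)
```

The hash filters exist precisely so that per-request variation (session identifiers, timestamps, advertisements) does not cause two correct collections to disagree.<br />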
<br />
The process above is repeated on a substantial sample of new content as it becomes available (in following years/volumes, or new publications from the same publisher), in order to detect changes to the publisher's site, or the format of new titles, which require changes to the plugin.<br />
<br />
Once an AU has been successfully tested on the content-testing boxes, it is configured for collection on all boxes in the ingest network. If the plugin is new or changed it is also released to the ingest network. Each ingest box collects the content and the network runs polls to detect and resolve transient collection errors. When all the copies of an AU have come into full agreement on the ingest boxes, they are then configured on the network of CLOCKSS production boxes.<br />
<br />
Depending on the complexity and diversity of the publisher's content the "substantial sample" can be anything from quite small to the publisher's entire content. Note that the need for a "substantial sample" of content for testing means that there is a delay between the time the publisher starts adding content to the bibliographic unit represented by the AU, and the time the ingest network starts collecting it.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Harvest Process ====<br />
<br />
Once AUs of content are configured on the CLOCKSS production boxes, they schedule collection of the content from the ingest boxes, as described in [[Definition of AIP#Creating an AIP from a harvest SIP|Definition of AIP]].<br />
<br />
A harvest content AU is considered to be preserved when it has been ingested by all the production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). The AU can then be removed from the ingest boxes. A review of the process for doing so deemed it inadequate and too manual; a replacement process is under development that will integrate with the improvements being made to [[CLOCKSS: Logging and Records#External Reports|external report generation]].<br />
<br />
The CLOCKSS Content Lead is responsible for both the current and replacement processes.<br />
<br />
==== Completeness of Harvest Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content on the publisher's web site. The CLOCKSS ingest pipeline includes multiple ingest CLOCKSS boxes that individually harvest content from the publisher's website. Once collected, the ingest CLOCKSS boxes poll among themselves and repair content objects from the publisher if there is disagreement. The content testing process requires complete agreement among the ingest boxes before the content is released to the production CLOCKSS network. As a result, the content described by the CLOCKSS plugin and parameters for the [[Definition of SIP|Submission Information Package]] is preserved in the CLOCKSS PLN.<br />
<br />
To account for the possibility of errors in the rules and procedures of the CLOCKSS plugin, a second layer of testing is done by visually inspecting content submitted by a publisher to ensure that it functions as it did on the publisher's website, and that all content available from the publisher's website is preserved in the CLOCKSS ingest boxes before being released to the CLOCKSS production network.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Correctness of Harvest Content ====<br />
<br />
Correctness in this context means that the publisher's content is Independently Understandable by the [[CLOCKSS: Designated Community]]. For the CLOCKSS archive, this is the responsibility of the publisher, not of the archive. Harvest content is used by customers of the publisher on a regular basis, so content that is not Independently Understandable on the publisher's website is rare, and is normally detected by the author. Under the [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement], the publisher may submit content in any form, including proprietary formats, so full validation of content and content types is not always practical.<br />
<br />
==== Annual Harvest Content Cycle ====<br />
<br />
The processing cycle for the harvest content a publisher publishes during a given year is as follows.<br />
* The first phase begins typically in the second quarter of the year.<br />
** A sample of the publisher's AUs for the year is configured and processed on the content-testing machines to ensure the plugin works as expected, and corrective action is taken if necessary.<br />
** When the plugin is deemed satisfactory, all the publisher's AUs for the year are configured on the ingest machines and allowed to crawl and poll.<br />
* The second phase continues throughout the rest of the year.<br />
** Poll results for AUs on the ingest machines are monitored, and corrective action is taken if necessary, including enhancing the plugin.<br />
** More content samples may be configured and processed on the content-testing machines to ensure the plugin continues to work as expected throughout the year.<br />
* The third phase begins at the beginning of the following year.<br />
** As the AUs for the year that is ending stop growing, final poll results are monitored until the AUs are deemed fully processed, at which point they are configured on the production machines.<br />
** Crawl and poll results of the AUs on the production machines are then monitored until the AUs are deemed fully processed.<br />
<br />
As regards subsequent errata and corrections, we assume that publishers follow the NLM ''best practice'' guidelines and at least refer and link to them in a subsequent issue. If the publisher does, they will be collected with that issue.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== e-Book Processing ====<br />
<br />
Unlike serial content, which is published incrementally over a period of time, harvest content that takes the form of books can be processed on a regular cycle. During each cycle:<br />
* A sample of the publisher's books that were published during the previous interval is configured on the content-testing machines to verify that the plugin works adequately, and corrective action is taken if necessary.<br />
* When the plugin is deemed ready, all the publisher's books from the previous interval are configured on the ingest machines and allowed to crawl and poll. Results are monitored and corrective action is taken if necessary.<br />
* When a book is deemed fully processed on the ingest machines, it is configured on the production machines. The crawl and poll results are then monitored on the production machines until the book is deemed fully processed.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
== File Transfer Content Pipeline ==<br />
<br />
=== File Transfer Publisher Engagement ===<br />
<br />
The designated LOCKSS team member's discussion with a new publisher whose content is to be received via file transfer needs to determine two things:<br />
* The means by which the file transfer content is transferred to the CLOCKSS ingest machines. This is up to the publisher; techniques that have been implemented include:<br />
** FTP from an FTP server that the publisher maintains to a CLOCKSS ingest machine.<br />
** FTP by the publisher to an FTP server on a CLOCKSS ingest machine.<br />
** <tt>rsync</tt> between a publisher machine and a CLOCKSS ingest machine.<br />
* The format of the file transfer content, in particular:<br />
** How the ingest scripts can verify that the content received is correct, for example by checking manifests and checksums.<br />
** How the content can be rendered if it is ever triggered. A publisher-specific version of the abstract plan described in [[CLOCKSS: Extracting Triggered Content#Preparing File Transfer Content for Dissemination|CLOCKSS: Extracting Triggered Content]] should be drawn up.<br />
In all cases, all interactions with the publisher take place through the RT ticketing system and are thus recorded permanently.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== Ingest script development ===<br />
<br />
Once the information above is available, the designated team member writes shell scripts, executed from <tt>cron</tt>, that ensure the collected content is up-to-date:<br />
* If the publisher delivers via FTP, the script collects any as-yet-uncollected content.<br />
* If the publisher delivers via <tt>rsync</tt>, the script runs <tt>rsync</tt> against the publisher's machine.<br />
* If the format in which the publisher makes file transfer content available lacks checksums and manifests, the ingest script generates them as the content is collected.<br />
* Otherwise, the ingest script verifies the manifest and checksums in the content, alerting the content team to any discrepancies.<br />
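The manifest check performed by the ingest scripts can be sketched as follows. This is a hedged illustration: the manifest format (a map from file name to expected SHA-256) and the function names are assumptions, and the real scripts are shell scripts rather than Python.<br />

```python
# Hypothetical sketch: verify delivered files against a manifest of
# expected checksums, reporting any file whose content disagrees.
import hashlib

def verify_manifest(manifest: dict, read_file) -> list:
    """Return the names of files whose checksum disagrees with the manifest.

    manifest:  {file name: expected SHA-256 hex digest}
    read_file: callable returning the bytes of a named file
    """
    discrepancies = []
    for name, expected in manifest.items():
        actual = hashlib.sha256(read_file(name)).hexdigest()
        if actual != expected:
            discrepancies.append(name)
    return discrepancies
```

An empty result means the delivered content matches the manifest; a non-empty result would trigger an alert to the content team.<br />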
<br />
The designated team member also writes a verification script that is run against all content on the file transfer ingest machine at intervals.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Plugin Development and Testing ===<br />
<br />
File transfer content also requires a plugin. These plugins are developed and tested in the same way as harvest plugins (see [[CLOCKSS: Ingest Pipeline#Harvest Plugin Development and Testing|above]]), except that the emphasis is on [[LOCKSS: Extracting Bibliographic Metadata|metadata extraction]], because it is in some cases more complex, whereas crawling is trivial and polling requires no filters.<br />
<br />
The CLOCKSS Plugin Lead is responsible for this process.<br />
<br />
=== File Transfer Content Pre-Processing ===<br />
<br />
File transfer content undergoes a subset of the testing steps described above for [[CLOCKSS: Ingest Pipeline#Harvest Content Processing|harvest content]]. The structure of file transfer AUs is simple and consistent (typically a directory hierarchy) so testing the crawl rules on a large sample of content is unnecessary. All the CLOCKSS boxes collect the same copy of the content so few of the complexities of preserving harvest content (such as hash filters) are present.<br />
<br />
A slightly simplified workflow in AUTest directs content-testing boxes to collect a single copy of a smaller sample of AUs from the staging server, and to extract metadata and check it for sanity. After any problems are corrected, the AU and similar AUs are configured on the CLOCKSS production boxes.<br />
<br />
=== File Transfer Content Ingest ===<br />
<br />
The collection scripts are added to the ingest user's <tt>crontab</tt>, and run daily to collect any new content. Every 24 hours the entire content of the file transfer ingest machine is synchronized with (a) an on-site backup and (b) an off-site backup using <tt>rsync</tt>. The verification scripts are run against each of these backup copies at intervals.<br />
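The daily backup step can be sketched as follows. Paths, the backup host, and the use of Python rather than a shell script are placeholders, not the real CLOCKSS configuration; the <tt>rsync</tt> options shown are standard ones for mirroring a tree.<br />

```python
# Hypothetical sketch of the cron-driven backup step: mirror the ingest
# machine's content tree to the on-site and off-site backups with rsync.
import subprocess

BACKUPS = [
    "/backup/onsite/clockss/",               # (a) on-site backup
    "backup@offsite.example.org:/clockss/",  # (b) off-site backup
]

def rsync_command(src: str, dest: str) -> list:
    # --archive preserves permissions and timestamps;
    # --delete makes the backup mirror removals as well as additions.
    return ["rsync", "--archive", "--delete", src, dest]

def sync_backups(content_root="/clockss/content/"):
    for dest in BACKUPS:
        subprocess.run(rsync_command(content_root, dest), check=True)
```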
<br />
Once the ingested content is verified it is staged to a Web server that can be accessed only by the CLOCKSS boxes. The CLOCKSS boxes crawl the file transfer content staged on the Web server under control of a file transfer plugin and preserve it, as described in [[Definition of AIP#Creating an AIP from a file transfer SIP|Definition of AIP]].<br />
<br />
File transfer content is considered to be preserved when it has been ingested by all production CLOCKSS boxes and at least one poll on it has been successful (see [[LOCKSS: Polling and Repair Protocol]]). <br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
==== Completeness of File Transfer Content ====<br />
<br />
Completeness in this context means that the content in the CLOCKSS archive faithfully reflects the content delivered by the publisher. Content submitted via file transfer is either made available from a content server operated by the publisher, or delivered by the publisher to a content server hosted by CLOCKSS. In either case the content is transferred to a staging server operated by CLOCKSS. The ingest scripts include procedures that verify the files submitted by the publisher correspond to those on the content server. Where publishers provide checksum information, checksums are also compared with locally computed checksums to ensure that the content is the same. If the checksums differ, the local copy is deleted and re-collected.<br />
<br />
==== Correctness of File Transfer Content ====<br />
<br />
Correctness in this context means that, if the content is ever triggered, the result will be Independently Understandable by the [[CLOCKSS: Designated Community]]. During initial engagement with file transfer publishers, their content types are assessed to develop a plan for rendering the content if it is ever triggered. One goal of the ingest scripts is to verify that the content types submitted match this assessment; if they do not the [[CLOCKSS: Ingest Pipeline#File Transfer Publisher Engagement|plan for triggering the content]] must be revised to account for the new content types.<br />
<br />
== Feedback ==<br />
<br />
The CLOCKSS Executive Director receives regular reports of the article counts ingested, upon which publishers are billed (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]).<br />
<br />
Content AUs can be in one of three preservation states, as reported to the publisher, the CLOCKSS Board and the Keepers Registry (See [[CLOCKSS: Logging and Records#External Reports|CLOCKSS: Logging and Records]]):<br />
# Committed for preservation<br />
# In process<br />
# Preserved<br />
Progress is tracked through the AU's configuration file.<br />
<br />
The CLOCKSS Content Lead is responsible for this process.<br />
<br />
== Change Process ==<br />
<br />
Changes to this document require:<br />
* Review by:<br />
** LOCKSS Technical Staff<br />
** CLOCKSS Plugin Lead<br />
** CLOCKSS Content Lead<br />
* Approval by CLOCKSS Technical Lead<br />
<br />
== Relevant Documents ==<br />
# [[CLOCKSS: Designated Community]]<br />
# [[CLOCKSS: Logging and Records]]<br />
# [[LOCKSS: Software Development Process]]<br />
# [[LOCKSS: Polling and Repair Protocol]]<br />
# [[Definition of SIP]]<br />
# [[Definition of AIP]]<br />
# [https://www.clockss.org/clocksswiki/files/CLOCKSS_Participating_Publisher_Agreement.pdf CLOCKSS: Publisher Agreement]</div>Dshr