CLOCKSS: Box Operations

Requirements for CLOCKSS host sites

A CLOCKSS host site signs an agreement with the CLOCKSS board under which it commits to providing:

  • Rack space (in a physically secure facility accessible only to authorized personnel)
  • Power
  • Cooling
  • Network bandwidth

The site also commits to having disaster recovery plans in place should any of these resources fail. The CLOCKSS Executive Director has copies of these agreements.

The CLOCKSS Network Administrator is responsible for all processes involving production and ingest CLOCKSS box hardware and software infrastructure.

Hardware Bringup

Hardware Vendors

All new systems are currently purchased from 45Drives, based in Nova Scotia, Canada, or iXsystems, based in San Jose, CA. The LOCKSS team's contact at 45Drives is sales representative Allan Hiller (ahiller@45drives.com); at iXsystems it is Kevin Lee (kle@ixsystems.com).

The LOCKSS team previously worked with two hardware vendors: Iron Systems, located in Fremont, CA, and eRacks, located in Orange, CA. Most of the systems currently in production were purchased from Iron Systems. Business contacts are Kawaljit Nagi (kawal@ironsystems.com) and James King (jamesk@eracks.com), respectively. The LOCKSS team also has a relationship with Joseph Wolff (joe@eracks.com), the owner of eRacks.

Hardware Purchase Process

Purchases begin as a discussion among LOCKSS Engineering Staff, the CLOCKSS Network Administrator and the CLOCKSS Executive Director about what is needed. A specification is drawn up by the engineers and sent to our hardware vendors for bids. If a bid looks reasonable, it is sent to the CLOCKSS Executive Director, who negotiates the details and pricing and sends a purchase order to the vendor. The CLOCKSS Executive Director is advised to release payment once the machine has been built, tested, verified working by the LOCKSS team, preconfigured with software (see below) and shipped.

Hardware Warranty

Machines purchased from our hardware vendors come with one to two years of limited hardware warranty, depending on the vendor. The warranties on most of the CLOCKSS boxes currently in production have expired, but they proved useful in the past for replacing failed hardware components. Details about each warranty are in the purchase orders that were submitted and the invoices subsequently received.

Recommended Hardware

CLOCKSS hardware or virtual machines meet or exceed the following specifications:

Item          | Ingest                      | Production                | Triggered (VMware / Xen)
Chassis       | Supermicro 4U               | Supermicro 4U             | N/A
Disk Bays     | 24                          | 24                        | N/A
Processor     | AMD Opteron 6376 (16 cores) | Dual Xeon E5504 (8 cores) | Dual-core vCPU
RAM           | 64GB ECC                    | 24GB ECC                  | 4GB vRAM
Disk          | 18 x 4TB SATA 6Gbps 7200RPM | 16 x 3TB SATA 3Gbps       | 7200RPM 100GB vDisk
RAID          | Software (Linux mdadm)      | Software (Linux mdadm)    | None (underlying RAID array)
Network       | Onboard dual Gbit           | Onboard dual Gbit         | Single vEth (1 Gbit via hypervisor)
Remote Access | IPMI                        | IPMI                      | Xen hypervisor
Power Supply  | 1600W redundant             | 1600W redundant           | N/A

CLOCKSS Virtual Machines

CLOCKSS virtual machines run under the Xen hypervisor on servers running CentOS 7.x.
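For routine checks on these hosts, the Xen xl toolstack can confirm that the hypervisor is active and list the guest domains. A minimal sketch (output and domain names will vary by host):

    # confirm the hypervisor is present and list running guest domains
    xl info | grep xen_version
    xl list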

Hardware Service Life

The service life of a CLOCKSS box is five to seven years. This is dictated by budget restrictions, software needs, and cost and performance efficiency. Please see the #Hardware Replacement section later in this document.

Building and testing

Our hardware vendors are responsible for sourcing components and building the machines once a purchase order is submitted by CLOCKSS. All of our hardware vendors have an extensive and rigorous process in place to ensure a machine functions prior to delivery.

Remote Access via IPMI

Every CLOCKSS box is equipped with an Intelligent Platform Management Interface (IPMI) module that provides remote "side-band" and "out-of-band" access. Although IPMI is invaluable in the event we need to access or recover a CLOCKSS box outside the scope of its operating system, there are known security vulnerabilities and so it is disabled by default on all machines. This is done by unplugging the network cable from the IPMI module, and additionally disabling it in the BIOS.

In the event we need to use IPMI, the following precautions should be taken (see the ipmitool example after this list):

  1. Update the firmware, if a newer version exists.
  2. Do not use the default username and password.
  3. Make IPMI accessible only through a VPN or other secure connection.
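As an illustration of these precautions, the commands below use the common ipmitool utility; the channel number (1) and user ID (2) are assumptions that vary by baseboard management controller, and the password shown is a placeholder.

    # review the current IPMI LAN configuration and user accounts
    ipmitool lan print 1
    ipmitool user list 1
    # replace the default credentials before exposing the interface
    ipmitool user set password 2 'REPLACE-WITH-A-STRONG-PASSWORD'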

RAID Configuration

The CLOCKSS Ingest machines utilize RAID disk arrays for storage. RAID functions are provided either by the Linux mdadm kernel module or by an LSI MegaRAID hardware RAID controller. The disks were partitioned into groups and put into a RAID6 configuration (a handful of legacy RAID arrays use RAID5). Currently, in aggregate, the RAID arrays on each CLOCKSS Ingest machine yield at least 8.1TB of usable space after RAID and EXT4 filesystem overhead. Since the Ingest machines have extra disks dedicated to the OS (called "system disks"), the full space of the RAID arrays is dedicated to the LOCKSS daemon. The use of dedicated system disks allows problems and maintenance on the system disks to be isolated from those affecting content storage disks.

The CLOCKSS Production machines utilize the Linux kernel's mdadm module to implement software RAID. Similar to the CLOCKSS Ingest machines, the Production machines also have several RAID6 arrays (and some legacy RAID5 arrays) consisting of three to seven disks depending on the number of disks available in a given machine. In contrast to the Ingest machines, many of these machines have a RAID array serving the dual roles of system and content storage.
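A minimal sketch of building one such software RAID6 array with mdadm follows; the device names, array number, and mount point are hypothetical and differ from machine to machine.

    # assemble six disks into a RAID6 array, format it, and record it for reassembly at boot
    mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[b-g]
    mkfs.ext4 /dev/md1
    mdadm --detail --scan >> /etc/mdadm.conf
    mkdir -p /cache1/gamma
    mount /dev/md1 /cache1/gamma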

RAID Health Monitoring

The most common component failures experienced by CLOCKSS machines are disk failures, and it is expected that all CLOCKSS boxes will experience a few over the course of their service life. The RAID5 and RAID6 arrays employed in CLOCKSS boxes can tolerate only one and two disk failures each, respectively, without data loss. It is therefore important that any failed disks be detected and replaced as quickly as possible; an array failure requires repair from another box via the LOCKSS: Polling and Repair Protocol, which is significantly slower than rebuilding the RAID array.

The RAID array health on all CLOCKSS Production boxes is monitored by Nagios via the Nagios Remote Plugin Executor (NRPE) and custom Nagios plugins. The NRPE daemon running on each machine executes plugins on behalf of Nagios and returns the result. Nagios will send an alert to CLOCKSS engineers upon the detection of any problems with the individual machines. For more, see the #Monitoring section below.
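The plugin itself is internal to the LOCKSS team, but a simplified sketch of the approach is shown below: an NRPE command definition plus a small check script that parses /proc/mdstat and returns the standard Nagios exit codes. The command name and plugin path are illustrative.

    # /etc/nagios/nrpe.cfg (excerpt): expose the check to the Nagios server
    command[check_md_raid]=/usr/lib64/nagios/plugins/check_md_raid.sh

    #!/bin/bash
    # check_md_raid.sh: report degraded or rebuilding md arrays via Nagios exit codes
    if grep -qE '\[U*_+U*\]' /proc/mdstat; then
        echo "CRITICAL: one or more md arrays are degraded"; exit 2
    elif grep -qE 'resync|recover' /proc/mdstat; then
        echo "WARNING: md array rebuild in progress"; exit 1
    else
        echo "OK: all md arrays healthy"; exit 0
    fi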

Software Bringup

To simplify the software bringup and ease future system upgrades, all new CLOCKSS boxes are configured with two smaller, dedicated system disks of a fixed size. The LOCKSS engineering team is then able to use virtual machines to prepare disk images with a base installation of CentOS, LOCKSS, and other software. The images are then sent to our hardware vendor and imaged onto the system disks of each machine. A checksum can verify the integrity of the disk image.
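For example, image integrity can be verified with a cryptographic checksum before and after the vendor writes the image; the file and device names below are placeholders.

    # on the build machine: record a checksum alongside the image
    sha256sum clockss-system-disk.img > clockss-system-disk.img.sha256
    # at the vendor or remote site: verify the image, then write it to the target system disk
    sha256sum -c clockss-system-disk.img.sha256
    dd if=clockss-system-disk.img of=/dev/sdX bs=4M conv=fsync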

The steps taken by the LOCKSS engineers to prepare the contents of the system disks are described below.

Software Licensing

All the software used for the CLOCKSS infrastructure is freely available and open source.

CentOS Installation and Configuration

A base install of CentOS 7.x with Java options is performed on the partition or disk designated for the operating system; then the following steps are taken to configure it as a CLOCKSS box:

  1. The system accounts lcap and lockss are created using useradd -r -m or equivalent. The LOCKSS LCAP public SSH keys are then added to their authorized_keys files, taking care to ensure the permissions are set up correctly.
  2. Then the firewall (iptables rules managed via firewalld) is set up:
    • The LOCKSS daemon requires the LCAP port (port 9729), used to communicate with other LOCKSS daemons, to be open to all machines. This is accomplished under CentOS 7.x (and systemd) with this firewalld command:
      firewall-cmd --zone=public --add-port=9729/tcp --permanent
      Note: although we have a list of IP addresses that could additionally be whitelisted for access to the machine's LCAP port, we opt not to restrict access this way because communication on this port is established and protected by SSL certificates and verification (see #CLOCKSS PLN Configuration).
    • Additionally, a CLOCKSS box is set up so that the administrative ports 22 (OpenSSH) and 8080 through 8086 are accessible to the remote site's administrative subnet(s) and the LOCKSS subnet at Stanford (171.66.236.0/24, previously 171.66.236.0/26). This is accomplished by creating a new firewalld zone:
      firewall-cmd --permanent --new-zone=lockss
      Then adding a new source for each subnet or IP address allowed access to the machine. Opening access to the machine from the LOCKSS subnet (171.66.236.0/24) is necessary for the proper administration, maintenance, and monitoring of the CLOCKSS box:
      firewall-cmd --permanent --zone=lockss --add-source=<IP or subnet>
      And finally, specifying the ports open in this zone:
      firewall-cmd --zone=lockss --add-port=22/tcp --permanent
      firewall-cmd --zone=lockss --add-port=8080-8086/tcp --permanent
      These rules should be reflected in any external firewalls at the site.
  3. Setting up the CLOCKSS repository is a two-step process:
    1. The first is to create a new file, lockss.repo under /etc/yum.repos.d containing the following lines:
      [lockss]
      name = LOCKSS Daemon Repository
      baseurl=http://www.lockss.org/clockss-repo/
      gpgcheck=1
    2. Then to install the LOCKSS RPM GPG key:
      rpm --import http://www.lockss.org/LOCKSS-GPG-RPM-KEY
  4. The LOCKSS daemon can now be installed by invoking:
    yum install lockss-daemon
  5. The LOCKSS daemon installs a logrotate configuration file to /etc/logrotate.d/lockss. By default, the size is limited to 2M but on CLOCKSS boxes, we set the limit to 20M:
    /var/log/lockss/daemon {
        size 20M
        rotate 5
        compress
        delaycompress
        create
        notifempty
        missingok
    }
    
    /var/log/lockss/stdout {
        size 10k
        rotate 5
        compress
        copytruncate
        notifempty
        missingok
    }
    
  6. To keep the time on CLOCKSS boxes synchronized, we enable NTP synchronization via systemd's timedatectl:
    timedatectl set-ntp on
  7. To automatically install package updates, we use yum-cron. To install and configure it:
    yum install yum-cron
    Then enable the service so it starts automatically at boot:
    systemctl enable yum-cron
  8. Some extra, optional packages we've found to be useful are tmux, wget, vim, emacs, lynx. Their installation is highly recommended to ease troubleshooting.
    yum install tmux wget vim emacs lynx
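Once these steps are complete, the permanent firewalld configuration can be applied and spot-checked before handing the box over; these are standard firewalld queries.

    # apply the permanent rules to the running firewall and confirm both zones
    firewall-cmd --reload
    firewall-cmd --zone=public --list-ports
    firewall-cmd --zone=lockss --list-all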

Java Runtime Environment (JRE)

All CLOCKSS boxes have the 64-bit OpenJDK 8 installed; a handful of older machines use the official Java 8 JRE from Oracle, because an up-to-date version is not available from the official CentOS repository. The amount of memory allocated to the Java heap is set automatically to an appropriate value for each machine during LOCKSS daemon startup.
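A quick way to confirm which JRE a box is actually running is shown below; the exact OpenJDK package name can vary between CentOS point releases.

    # confirm the installed 64-bit JRE
    java -version
    rpm -qa | grep -i openjdk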

LOCKSS Daemon Configuration

This step is unique to each CLOCKSS machine. Once a machine is configured on the host institution's network and verified working by a CLOCKSS engineer, the next step is to configure the LOCKSS daemon as part of the CLOCKSS network. This is done using the hostconfig utility packaged with the LOCKSS daemon. For a CLOCKSS box, the following settings are recommended:

  • Fully qualified hostname (FQDN) of this machine: provided by CLOCKSS. The schema we use is clockss-site.clockss.org, where site uniquely identifies the host institution of the machine.
  • IP address of this machine: a static public IP address provided by the host institution.
  • Initial subnet for admin UI access: the subnet (in CIDR or X.Y.Z.* notation) that will be granted access to the web-based LOCKSS Administrative UI. The localhost is implicitly allowed. Additional subnets can be added through the LOCKSS Administrative UI.
  • LCAP V3 protocol port: the TCP port the daemon will listen on for LCAP communication from peers. Remote sites should ensure both internal and external firewalls allow all connections to this port.
  • Mail relay for this machine: the DNS name of an SMTP relay that will accept and relay mail from this machine. The script will also prompt for a username and password if the mail relay requires them. It can also be set to localhost if the machine is capable of handling email.
  • E-mail address for administrator: occasional alerts will be sent to this address by the LOCKSS daemon.
  • Path to java: the full path to a JRE. The default should suffice in most cases.
  • Java switches: Java switches to be passed to the JRE. Should be left blank in most cases.
  • Configuration URL:
  • Preservation group(s):
    • CLOCKSS Production: clockss
    • CLOCKSS Ingest: clockssingest
    • CLOCKSS Triggered: clockss-triggered
  • Content storage directories: a semicolon-delimited list of paths to the LOCKSS repositories for use by this LOCKSS daemon.
  • Temporary storage directory: /cache0/gamma/tmp
  • Password for web UI administration user admin: set this to a strong password, known to the host site administrator but to no-one else.
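The answers given to hostconfig are written to /etc/lockss/config.dat (referenced later in this document). A hypothetical excerpt is shown below; apart from LOCKSS_DISK_PATHS, which appears elsewhere in this document, the variable names and all of the values are illustrative only.

    # /etc/lockss/config.dat (illustrative excerpt)
    LOCKSS_HOSTNAME=clockss-site.clockss.org
    LOCKSS_IPADDR=192.0.2.10
    LOCKSS_V3_PORT=9729
    LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma"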

CLOCKSS PLN Configuration

The production CLOCKSS network is configured via the property server to:

  • Prevent access to the content by disabling both the proxy and content server functions of the LOCKSS daemon.
  • Prevent interception of communication between boxes in the network by the use of SSL for the LOCKSS: Polling and Repair Protocol.
  • Prevent communication via the LOCKSS: Polling and Repair Protocol except from other boxes in the same network by requiring each end of a newly established connection to verify the certificate of the other end against the corresponding public key in a keystore. During setup, or when the set of boxes in the network changes, the appropriate keystore is uploaded to a machine via scp. We also provide a small script for the CLOCKSS site administrator to run that installs the keystore into the correct location with restrictive permissions, as sketched below.
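A sketch of that keystore installation step, assuming scp access as the lockss user; the keystore filename and destination directory are placeholders, not the production paths.

    # from the LOCKSS administrative host: upload the network keystore to the target box
    scp clockss-pln.keystore lockss@clockss-site.clockss.org:/tmp/
    # on the CLOCKSS box: the helper script does the equivalent of
    install -o lockss -g lockss -m 600 /tmp/clockss-pln.keystore /etc/lockss/keys/clockss-pln.keystore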

Monitoring

Nagios

The CLOCKSS infrastructure is monitored by the Nagios monitoring system. The CLOCKSS boxes, servers and services that are monitored by Nagios include the CLOCKSS Production boxes, Ingest boxes, Triggered Content boxes and Property Server as well as the CLOCKSS HTTP, SMTP and FTP servers.

The CLOCKSS Network Administrator is responsible for Nagios monitoring.

Nagios plugins to monitor LOCKSS

Custom Nagios plugins were written to enable Nagios to monitor the following LOCKSS daemon services:

  • LOCKSS Daemon Version
  • LOCKSS Daemon Uptime
  • LOCKSS Repository Spaces (to monitor disk usage)
  • LOCKSS Web Administrative UI accessibility
  • LCAP Accessibility
  • RAID array health

Plugins shipped with Nagios allow us to monitor, where applicable:

  • OpenSSH accessibility
  • HTTP/HTTPS accessibility

Nagios Alerts

Nagios plugins return one of four standard return codes (OK, WARNING, CRITICAL, UNKNOWN) depending on the nature and severity of an issue, or the lack thereof. For any return other than OK, Nagios has been configured to notify CLOCKSS engineers through email alerts (see CLOCKSS: Logging and Records).

Access control to Nagios instance

Access to the CLOCKSS Nagios instance is restricted by an Apache Virtual Host definition to the LOCKSS subnet at Stanford (171.66.236.0/24) and by username and password.
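A minimal Apache 2.4 sketch of this kind of restriction is shown below; the server name, location, and password file are illustrative, not the production configuration.

    # illustrative virtual host fragment: require both the Stanford LOCKSS subnet and a valid login
    <VirtualHost *:80>
        ServerName nagios.example.org
        <Location "/nagios">
            AuthType Basic
            AuthName "CLOCKSS Nagios"
            AuthUserFile /etc/httpd/nagios.htpasswd
            <RequireAll>
                Require ip 171.66.236.0/24
                Require valid-user
            </RequireAll>
        </Location>
    </VirtualHost>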

Nagios Redundancy

CLOCKSS and LOCKSS have a second Nagios instance running in an Amazon EC2 instance located in Virginia. The second Nagios instance is not able to monitor CLOCKSS boxes directly. This is by design: we do not want a machine outside of the Stanford network to have the type of access necessary to monitor internal CLOCKSS boxes and processes. Due to this restriction, it is configured only to monitor the CLOCKSS website and the Nagios instance running at Stanford. The second Nagios instance will alert CLOCKSS engineers should any problems occur at Stanford.

Hardware Upgrade

Currently, all CLOCKSS boxes have at least four spare hotswap disk bays. A natural way to incrementally upgrade the disk capacity of deployed CLOCKSS boxes is to fill these disk bays with the largest disks available, create a new RAID array, and finally move content from an existing array. The disks comprising the old array can then be removed and the process repeated for the remaining RAID arrays.

Although memory and PSU modules can also be swapped, the remaining components in a CLOCKSS box are not expected to be serviceable.

Hardware Replacement

Component Failure

The components in all CLOCKSS boxes are cheap, readily available, off-the-shelf parts, and replacements can be sourced from any well-stocked computer retailer. The most common component failures are disks. As mentioned earlier, the RAID6 arrays employed in CLOCKSS boxes are able to tolerate two disk failures each, but it is important that any failed disks be found and replaced as quickly as possible to avoid irreparable data loss.

If a component fails and the CLOCKSS box is still under warranty, we will submit a warranty claim with the hardware vendor for replacement. If the box is no longer under warranty, we prefer to purchase components and ship them expedited to the remote site. However, if the remote site is not within the United States, an alternative option is to ask the site administrator to source replacement parts within their country and send an invoice to CLOCKSS for reimbursement.

If a machine cannot be repaired, it is replaced.

Planned CLOCKSS Box Replacement

Despite the upgradability of existing hardware, we recognize that, eventually, it will no longer be cost effective or practical to continue to run old hardware due to improvements in performance-per-watt efficiency, disk bandwidth, and disk capacity, as well as hardware compatibility and other considerations. Software requirements and content volume are also expected to push hardware towards obsolescence.

Replacing an entire machine is a fairly straightforward process. After purchasing and configuring the machine as described elsewhere in this document, it is brought up alongside the machine it is to replace. To minimize downtime, content is copied in two passes, using rsync. The first pass copies the bulk of the data to the new machine. Once the first pass is complete, the LOCKSS daemon on the old machine is taken offline and its filesystems are placed into read-only mode. The second pass then copies any changes to the content that occurred during the first pass, and verifies the integrity of content from the first pass. If no errors occurred during the two passes, the old machine is taken offline and the new machine takes its place. Subsequent polling via the LOCKSS: Polling and Repair Protocol will also verify the integrity of the copy.
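A sketch of the two-pass copy, assuming the content lives under /cache0/gamma and the new machine is reachable as new-box; the paths, hostname, and exact service-stop command are placeholders.

    # first pass: bulk copy while the old daemon is still running
    rsync -aH --partial /cache0/gamma/ lockss@new-box:/cache0/gamma/
    # stop the LOCKSS daemon on the old box and remount the content filesystem read-only
    systemctl stop lockss        # exact unit name may differ
    mount -o remount,ro /cache0
    # second pass: copy remaining changes and verify the first pass with checksums
    rsync -aHc --delete /cache0/gamma/ lockss@new-box:/cache0/gamma/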

Software and Content Maintenance

Re-locating Content

Relocating an AU to another LOCKSS repository filesystem, for example if a filesystem has filled up, requires the following steps:

  1. Shut down the LOCKSS daemon to prevent it from modifying the AU. As an extra precaution, temporarily remount the filesystem as read-only.
  2. Copy the AU to the new LOCKSS repository. When the copy completes successfully, delete the AU from the source filesystem.
  3. If the LOCKSS repository has not yet been made available to the LOCKSS daemon, edit /etc/lockss/config.dat and append the new LOCKSS repository path to LOCKSS_DISK_PATHS.
  4. The AU's lockss_repo entry needs to be added or modified to reflect the new path. This is done by editing the AU's lockss_repo line in /cache0/gamma/config/au.txt. For bulk relocation, a LOCKSS tool called auconf.pl can scan the LOCKSS repositories for AUs and update au.txt as necessary.
  5. If the source filesystem was remounted read-only, remount it as read-write.
  6. Finally, start the LOCKSS daemon and check that the AU's repository path points to the new location.

If a large amount of content is being moved and the daemon requires minimal disruption, a modified procedure is followed:

  1. With the LOCKSS daemon running and the filesystem mounted read-write, use rsync to copy the bulk of the content to the new filesystem.
  2. When the first pass completes, stop the LOCKSS daemon and remount the source filesystem as read-only.
  3. Run rsync again to copy any changes to the AUs since the last rsync was started.
  4. Update /etc/lockss/config.dat and /cache0/gamma/config/au.txt as necessary.
  5. Remount the filesystem as read-write. Bring up the LOCKSS daemon and check that all the AUs' repository paths have been updated.

LOCKSS Daemon Updates

All remote sites are encouraged to set up automatic updates for the LOCKSS daemon package so that updates are installed automatically within a few hours of their release to the CLOCKSS RPM repository. Additionally, all systems administrators receive announcements of newly released daemons through the CLOCKSS Administrators mailing list; systems administrators opting to install updates manually are encouraged to use the announcements as a cue to install updates. We use Nagios to monitor the LOCKSS daemon version of all CLOCKSS boxes. If a CLOCKSS box has not been updated with the latest LOCKSS daemon within a day, we notify its systems administrator(s).

System Package Updates

Most CLOCKSS boxes are configured for unattended upgrades, so new updates will be installed automatically. All CLOCKSS box systems administrators are encouraged to join the CentOS Announce mailing list to receive real-time announcements about security updates. If the box in question is not configured for unattended updates:

  • Security updates should be downloaded and installed as soon as they are announced.
  • All other non-critical updates should be installed at least once a quarter.

Anyone can subscribe to the CentOS Announce mailing list by visiting the CentOS-announce Mailing List home page.
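On CentOS 7, unattended installation of updates is controlled by yum-cron's configuration file; a minimal sketch, assuming the stock /etc/yum/yum-cron.conf layout:

    # /etc/yum/yum-cron.conf (excerpt): download and apply updates automatically
    update_cmd = default
    download_updates = yes
    apply_updates = yes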

System Updates

During the lifetime of the CLOCKSS program, the CentOS project will release new major versions of the CentOS operating system (e.g., CentOS 6.x to 7.x). When a new release is available, the LOCKSS engineers will prepare new system disk images following the process documented earlier in this document, and will instruct our remote site administrators on the disk imaging process. Configuration and other settings specific to each machine will be reconfigured, or transferred from the machine being replaced.

The end of life dates for the CentOS versions currently installed on CLOCKSS boxes are:

Version    | Release Date | Full Updates | Maintenance Updates
CentOS 6.x | 2011-07-10   | Q2 2017      | 2020-11-30
CentOS 7.x | 2014-07-07   | Q4 2020      | 2024-06-30

Change Process

Changes to this document require:

  • Review by:
    • CLOCKSS Technical Lead
    • LOCKSS Engineering Staff
  • Approval by CLOCKSS Network Administrator