LOCKSS: Software Development Process

From CLOCKSS Trusted Digital Repository Documents
Revision as of 00:22, 15 August 2019 by Ntay (Talk | contribs)

The LOCKSS team operates two slightly different software development, testing and release processes: one for the core LOCKSS daemon code and one for the plugins that adapt the core daemon's behavior to particular content. All code is maintained in Git repositories on GitHub. Two independent local copies of these repositories are updated every 24 hours.

Daemon Development Process

The development, testing and release process for the core daemon code has the following stages:

  1. Requirements generation
  2. Prioritization
  3. Tracking
  4. Testing
  5. Approval
  6. Release

This process operates on a regular release cycle; that is, the goal is to release a new version of the daemon code every few months. A release can be delayed if last-minute problems are detected, but if this happens a post-mortem is held to determine the cause and how to prevent a recurrence. The cause has usually been some major enhancement that takes longer to debug than expected.

Requirements

There are four main sources of requirements for changes to the daemon:

  • The community of LOCKSS users.
  • The content processing staff in the LOCKSS team, as they process content from LOCKSS and CLOCKSS publishers for preservation and either:
    • notice content that isn't being collected or processed properly, or
    • encounter situations that require new capabilities.
  • The whole LOCKSS team as they interact with the LOCKSS daemon in testing, and the developers as they fix bugs and refactor code.
  • The deliverables under contracts with funders, for example our current grant from the Andrew W. Mellon Foundation.

Support interactions with the community of LOCKSS users are tracked and managed using the RT ticketing system. If, during resolution of an RT ticket, it is determined that a change to the daemon is required, an issue to track it is generated in the JIRA issue tracking system. In some cases an informal conversation with a community member results in a JIRA issue being created without an RT ticket. The same may happen if a LOCKSS team member identifies a needed change, after informal discussion within the team, as a result of a code review, or if a deliverable is required by an external funder.

Prioritization

At the start of each release cycle the development staff meet and review the JIRA issues to select the set of issues that will be prioritized for the upcoming release. This prioritization is based on the needs of the LOCKSS Alliance members and the CLOCKSS Archive, deliverables for funders, security considerations and the severity of the bug in question. At intervals during the release cycle the prioritization is reviewed with the goal of meeting the release cycle time with the available resources and responding rapidly to any critical bugs.

Tracking

JIRA issues are assigned to developers after prioritization under the supervision of the LOCKSS Technical Lead. As progress is made the developer updates the issue's status.

Testing

Core daemon code undergoes three types of testing:

  • Unit testing.
  • Functional testing.
  • Load testing.

Testing occurs in three stages:

  • Developers perform unit and functional testing before and after committing changes to GitHub.
  • Every night the entire code base is checked out from GitHub and built, and both unit and functional tests are run. Any failures are reported to the whole team by e-mail. Remedying them is a high-priority task.
  • Every time a release candidate is built, unit and functional tests are run on it; it is then installed on one or more of the LOCKSS team's GLN boxes for load testing.

Unit Testing

The LOCKSS team makes heavy use of the JUnit unit test framework for Java in the context of the Ant automated Java build tool. As of October 2017, there were 746 files containing 5,385 individual tests for 1,158 classes in the daemon code base. All significant methods in all classes are required to have unit tests, although this requirement has yet to be satisfied. Where practical, subsystems should have unit tests. The "where practical" caveat is made necessary by the distributed and randomized architecture of the LOCKSS system, which makes some subsystems impractical to unit-test in the JUnit framework. Coverage of these tests is evaluated at intervals using Jcoverage; it is currently assessed as requiring improvement. A few specialized subsystems have functional tests implemented in the JUnit framework. These are treated as unit tests.

Developers should make every effort to reproduce reported bugs by implementing one or more unit tests that fail because of the bug, which will then verify that the bug is fixed and does not re-appear later.

Functional Tests

The LOCKSS team uses an in-house functional testing framework called STF (Stochastic Test Framework). This currently implements 23 different test scenarios. In each, the framework sets up a small LOCKSS network, typically of 4-5 daemons, on a single machine. The framework interacts with the Web user interface of each of these daemons as a normal LOCKSS box administrator would, directing them to ingest and poll on relatively small amounts of simulated content. The framework is capable of injecting faults, such as localized damage to, or total loss of, the simulated content. It uses the Web user interface of the daemons to monitor the results of the functional tests and detect any errors, such as failure to repair the injected faults.

If a bug cannot be reproduced in the unit test environment, efforts should be made to reproduce it by constructing a suitable scenario in STF that fails because of the bug and can then validate that the bug is fixed and does not re-appear.

Load testing

Production LOCKSS and CLOCKSS boxes operate at a scale that is logistically infeasible to reproduce in the unit and functional test environments. If a bug cannot be reproduced in the unit or functional test environments, it must be reproduced manually in the internal test network. New daemon releases go through two types of load test:

  • They are installed in stages on a small internal network of 4 LOCKSS boxes with a substantial amount of selected real content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.
  • They are then installed on at least 4 of the 13 GLN LOCKSS boxes that the LOCKSS team operates, which typically have at least 25TB of content. The behavior of these daemons is monitored closely and manual efforts are made to trigger any bugs that were only replicable under load testing.

Hard-to-replicate bugs

Sometimes bugs will be reported that the LOCKSS team fails to reproduce locally. Some diagnostic information will be available in the daemon logs at the reporting site. This is used to add self-checking and/or logging to the daemon to investigate the problem further.

Approval

When bug fixes and enhancements have been completed and pass unit and functional tests, the corresponding JIRA issues are moved to the Testing state. Once a release candidate has been built (see Release below), each issue is checked again to confirm that it is operating properly in a production environment, then moved to the Approved state. The rules for moving to the Approved state vary according to the type of issue:

  • Changes that produce predictable behavioral differences are observed to ensure correct behavior.
  • Fixes for intermittent bugs (such as race conditions) remain in Testing state until the desired behavior is observed (or a suitable time elapses without observing the failure).
  • Changes that do not affect the daemon's behavior in observable ways (such as fixes or enhancements to test code) may be moved to Approved as soon as the test passes with the release build.
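
The three rules above can be read as a small decision function. The following is only a sketch of that reading, not the way JIRA or the LOCKSS team implements it: the class, enum and parameter names are all hypothetical, and the "suitable time" for intermittent bugs is reduced to a boolean supplied by the caller.

```java
/** Illustrative model of the approval rules; every name here is hypothetical. */
final class ApprovalRules {
    enum State { TESTING, APPROVED }

    enum ChangeKind {
        PREDICTABLE_BEHAVIOR,  // change with a predictable behavioral difference
        INTERMITTENT_FIX,      // fix for an intermittent bug, e.g. a race condition
        NOT_OBSERVABLE         // change with no observable effect, e.g. test code
    }

    /**
     * Returns the state an issue currently in Testing should have.
     *
     * desiredBehaviorObserved: the correct behavior was seen in the release candidate.
     * quietPeriodElapsed: a suitable time passed without observing the failure.
     * testsPassOnReleaseBuild: the tests pass with the release build.
     */
    static State evaluate(ChangeKind kind,
                          boolean desiredBehaviorObserved,
                          boolean quietPeriodElapsed,
                          boolean testsPassOnReleaseBuild) {
        switch (kind) {
            case PREDICTABLE_BEHAVIOR:
                return desiredBehaviorObserved ? State.APPROVED : State.TESTING;
            case INTERMITTENT_FIX:
                return (desiredBehaviorObserved || quietPeriodElapsed)
                        ? State.APPROVED : State.TESTING;
            case NOT_OBSERVABLE:
                return testsPassOnReleaseBuild ? State.APPROVED : State.TESTING;
            default:
                throw new IllegalArgumentException("Unknown kind: " + kind);
        }
    }
}
```

For example, an intermittent-bug fix whose failure has not been seen for a suitable time evaluates to APPROVED even if the desired behavior was never positively observed.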

Release

On the freeze date, the LOCKSS build master creates a branch in the GitHub repository, and labels it with the release name and the tag for the first release candidate. The first release candidate is then checked out from this branch and built. Subsequent fixes until the final release are made to both the release and main branches. All release candidates are tagged and built from the release branch. A release branch tag has the form: release-candidate_${N1}-${N2}-b${N3} where:

  • N1 is the LOCKSS daemon major version number, currently 1,
  • N2 is the LOCKSS daemon minor version number, currently 73,
  • N3 is the sequence number of the build, incremented by 1 each time a build is performed.
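
As an illustration of the tag form, the sketch below parses and formats such tags in Java. The class name ReleaseTag and its methods are hypothetical and are not part of the LOCKSS build tooling; the only thing taken from this document is the form release-candidate_${N1}-${N2}-b${N3}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical name) for release-candidate tags. */
final class ReleaseTag {
    // Mirrors the documented form release-candidate_${N1}-${N2}-b${N3}.
    private static final Pattern TAG =
            Pattern.compile("release-candidate_(\\d+)-(\\d+)-b(\\d+)");

    final int major;  // N1, daemon major version number
    final int minor;  // N2, daemon minor version number
    final int build;  // N3, build sequence number

    ReleaseTag(int major, int minor, int build) {
        this.major = major;
        this.minor = minor;
        this.build = build;
    }

    /** Parses a tag, rejecting any string that does not match the form. */
    static ReleaseTag parse(String tag) {
        Matcher m = TAG.matcher(tag);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a release-candidate tag: " + tag);
        }
        return new ReleaseTag(Integer.parseInt(m.group(1)),
                              Integer.parseInt(m.group(2)),
                              Integer.parseInt(m.group(3)));
    }

    /** Tag for the next build of the same daemon version (N3 incremented by 1). */
    ReleaseTag nextBuild() {
        return new ReleaseTag(major, minor, build + 1);
    }

    @Override
    public String toString() {
        return "release-candidate_" + major + "-" + minor + "-b" + build;
    }
}
```

With this sketch, ReleaseTag.parse("release-candidate_1-73-b1").nextBuild() formats as release-candidate_1-73-b2; the build numbers in that example are made up for illustration.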

If the build or tests fail, developers are notified, problems are fixed, and a new release candidate is tagged and built. When the build and tests succeed, the release candidate is signed using the LOCKSS code signing key and uploaded to a test Yum repository, from where it is installed on a small internal test LOCKSS network. It is monitored carefully for at least a few days. If no serious problems are observed, the candidate is installed on several internal boxes that participate in the GLN, in order to observe it under significant load.

During the release testing phase, each of the JIRA issues is checked to verify that it's operating as expected in the production system, and marked Approved if so. If problems are found in the release candidate, fixes are made on the branch (and the main branch if appropriate) and a new release candidate is produced and tested as above.

When all issues (except possibly issues that can only be observed in production - see above) are marked Approved and the candidate has been running without significant problems for at least a week, the LOCKSS technical lead approves the release. The candidate is moved to the release Yum repository and the release announcement is sent. Users have the option to set their boxes up to install new releases automatically or to install them manually when they receive the release announcement. The Approved issues may then have their status changed to Resolved.

Documentation

Routine changes to the daemon are documented in JIRA, and in the commit messages in GitHub. They do not require changes to the system's architectural documents.

Daemon Enhancements

Significant enhancements, such as any architectural changes, to the daemon follow a slightly different process. An experimental branch is created in the GitHub repository and used to preserve the various steps of development of a prototype. These steps normally include:

  • Requirements generation, a group discussion to identify and document in the internal Wiki outline requirements for the enhancement sufficient to allow development of a prototype.
  • Prototyping, development of a working enhancement and sufficient unit and/or functional tests to demonstrate that the enhancement meets its outline requirements.
  • Design review, a formal review of the design of the prototype and documentation of a design for a production implementation.
  • Implementation, development of a production implementation, either from scratch (in a second branch) or by evolving the prototype.
  • Code review, a formal review of the code that is proposed for addition to the main branch of the repository that identifies a set of changes that must be made before the changes are actually applied.
  • Merge, in which the implementation with any changes required by the code review is committed to the main branch of the GitHub repository and the branch(es) used during development are abandoned.

The enhancements then undergo the normal testing, approval and release process.

Enhancement Documentation

Significant enhancements to the daemon may require changes to documents such as the Definition of AIP. These changes are identified during the design review, and are the responsibility of the developer concerned.

Plugin Development Process

The development, testing and release process for daemon plugins has the following stages:

  1. Requirements generation
  2. Prioritization
  3. Tracking
  4. Development
  5. Testing
  6. Approval
  7. Release

The process normally operates asynchronously with the daemon development process; plugins are released when they are ready. If a plugin requires a new feature from the core daemon code its release will be delayed until after the relevant daemon release. This is enforced by having the plugin declare a minimum required daemon version, which prevents the plugin inadvertently being loaded by earlier daemons.
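
A minimal sketch of such a minimum-version gate follows. It assumes dotted numeric version strings such as "1.73.0", which is an assumption of this example rather than something stated here, and the class and method names are hypothetical; the daemon's actual check may differ.

```java
/** Illustrative daemon version comparison; all names are hypothetical. */
final class DaemonVersion implements Comparable<DaemonVersion> {
    private final int[] parts;

    /** Parses a dotted numeric version string (assumed format, e.g. "1.73.0"). */
    DaemonVersion(String text) {
        String[] fields = text.split("\\.");
        parts = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            parts[i] = Integer.parseInt(fields[i]);
        }
    }

    /** Compares numerically field by field; missing trailing fields count as 0. */
    @Override
    public int compareTo(DaemonVersion other) {
        int n = Math.max(parts.length, other.parts.length);
        for (int i = 0; i < n; i++) {
            int a = i < parts.length ? parts[i] : 0;
            int b = i < other.parts.length ? other.parts[i] : 0;
            if (a != b) {
                return Integer.compare(a, b);
            }
        }
        return 0;
    }

    /** True if a daemon at version running may load a plugin that requires version required. */
    static boolean canLoad(String running, String required) {
        return new DaemonVersion(running).compareTo(new DaemonVersion(required)) >= 0;
    }
}
```

Comparing fields numerically rather than as strings matters here: under this sketch a daemon at 1.9.0 is correctly treated as older than a plugin requirement of 1.10.0.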

Requirements

The process for generating requirements for changes to the plugins is the same as for changes to the core daemon code, except that there is one additional source: new publishers joining LOCKSS and CLOCKSS. When a plugin writer is assigned to analyze a new site they fill out a template that forms the specification for the new plugin and is checked in to Git along with it. When changes are required, the appropriate changes are made to the specification and it is checked back in to Git.

Prioritization

The process for prioritization of changes to the plugins differs from that for the daemon code in two respects:

  • It is a periodic process but it is not synchronous with the daemon release cycle. Plugin changes are released when ready and not, except in special circumstances, together with daemon releases.
  • Some plugin developments may be urgent, if they are caused by impending cessation of publication, loss of access to the content, publisher change or move of the content between publishing platforms.

Tracking

Progress on plugin developments is tracked in JIRA.

Development

Plugin development should normally take place in the source tree of the current daemon release. Exceptionally, if the plugin needs bug fixes or new features not in the current release, development can use the head of the Git tree, with the plugin's required_daemon_version set to the next daemon release number.

Testing

Plugins are implemented partly in Java and partly in XML. The Java classes have unit tests in the same way as the core daemon code does. As of October 2017, there were 427 files with 1,637 individual tests for 387 plugins plus 780 auxiliary plugin classes. During every build all the plugin unit tests are run, and some additional validation is performed on the XML.

Once a new or changed plugin has passed these tests, it is loaded into a daemon in a test environment which is manually directed to collect one or more Archival Units (AUs) of its target content. Four checks are performed:

  • The status info and daemon logs are examined to detect errors such as unexpected 404s.
  • The list of excluded URLs is examined for errors of unintended exclusion.
  • The collected content is browsed using the daemon's "audit proxy" (a proxy that returns only collected content and 404 for everything else) to ensure that all the desired content is collected and undesired content is not. For example, if the AU represents a volume of a journal, all articles belonging to that volume should be present, along with the common files they reference (e.g., style sheets), and there should be no articles belonging to other volumes.
  • The collected content is visually checked against the publisher's original.

Once a new or changed plugin has passed these tests, it is released to the LOCKSS or CLOCKSS content test network as appropriate. Under the control of an internally developed testing framework (AUTest), several AUs are collected and polled, and the results are checked for agreement. If the agreement is less than 100%, the reason is diagnosed. Either the plugin is further changed and the process repeated or, if the diagnosis is that collection from the publisher suffered transient errors, the AUs are re-collected and the check repeated. Metadata extraction is performed on the collected AUs and checked for correctness and completeness.
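
The branching in this paragraph can be summarized as a small decision function. This is only a sketch with hypothetical names; it does not reflect how AUTest is implemented, and the diagnosis itself (whether collection suffered transient errors) is treated as an input that a person supplies.

```java
/** Illustrative summary of the agreement check; all names are hypothetical. */
final class AgreementCheck {
    enum Action {
        CHECK_METADATA,  // full agreement: go on to the metadata extraction check
        RECOLLECT,       // transient collection errors: re-collect the AUs and re-check
        CHANGE_PLUGIN    // otherwise: change the plugin and repeat the process
    }

    /**
     * agreementPercent: poll agreement for the collected AUs, 0 to 100.
     * transientCollectionErrors: outcome of the manual diagnosis.
     */
    static Action next(double agreementPercent, boolean transientCollectionErrors) {
        if (agreementPercent >= 100.0) {
            return Action.CHECK_METADATA;
        }
        return transientCollectionErrors ? Action.RECOLLECT : Action.CHANGE_PLUGIN;
    }
}
```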

Approval

When a plugin is ready, it's released. Plugins are released individually, independently of the daemon, except for cases where a plugin requires a daemon feature that has not yet been released, in which case it waits for a daemon release.

The LOCKSS or CLOCKSS Plugin Lead (as appropriate) approves the release of a new plugin.

Release

A plugin release requires the following steps:

  • The plugin build master packages the plugin on a build machine.
  • The plugin build master signs the plugin with their key:
    • The keystore containing the keys that can sign plugins for the GLN is the daemon's default keystore, which is controlled by the LOCKSS technical lead.
    • PLNs, including the CLOCKSS PLN, have their own keystores. The CLOCKSS keystore is controlled by the LOCKSS technical lead.
  • The plugin is uploaded to the appropriate repository.
  • A plugin collection is triggered manually on one production or ingest box.
  • The logs are checked to ensure the plugin loaded correctly.
  • The remaining boxes will automatically fetch and load any new (or new versions of) plugins within 12 hours.

For CLOCKSS, the plugin is released to the ingest and production boxes.

Documentation

Plugin changes are documented in JIRA, and in the commit messages in GitHub. They do not require changes to the system's architectural documents.

Development Environment

Developers are free to use the operating system and other tools of their choice on their own machines while developing LOCKSS software, provided that when performing pre- and post-commit testing they use the currently approved versions of:

  • Apache Ant
  • The JDK
  • The Java libraries from Git.

Diversity in development environments assists in identifying hidden dependencies.

Dependencies

Two types of dependency are of concern for the functioning of the LOCKSS system, and thus need to be proactively monitored:

  • The ability of operating systems to support the requirements of the LOCKSS software, and of the other software components used by the CLOCKSS Archive.
  • The set of formats supported by the Web browsers in the Knowledge Base of the Designated Community.

Operating System Dependencies

The LOCKSS software depends upon:

  • A Java virtual machine, currently version 7 or 8.
  • A set of Java libraries.
  • A POSIX file system.
  • An SQL database.

Any modern operating system can support these dependencies. Although the system is currently supported only on Red Hat compatible Linux distributions, in development it runs on many versions of Linux, on MacOS and with some restrictions on Windows. Some years ago it was ported from OpenBSD with little trouble. As the LOCKSS team has the software under continuous development on a range of operating systems, any problems with operating system support rapidly become evident to the team.

Other tools required are similarly situated. The main one is the Apache Web server, which is widely supported and could be replaced by a competitor with little trouble. Again, a lack of support for these tools would rapidly become evident to the team because they are indispensable in development.

Browser Format Support

Web formats become obsolete when support for them is removed from the browsers in general use. The occurrence of such obsolescence is a subject of active research, in which the LOCKSS team participates on a continuing basis. Any looming obsolescence of a member of the set of formats identified by the CLOCKSS software's use of the File Identification Tool Set (FITS) would become known through this research network.

Change Process

Changes to this document require:

  • Review by:
    • LOCKSS Engineering Staff
    • LOCKSS Plugin Lead
    • CLOCKSS Plugin Lead
    • LOCKSS Content Lead
    • CLOCKSS Content Lead
    • LOCKSS Build Master
  • Approval by LOCKSS Technical Manager

Relevant Documents

  1. LOCKSS: Format Migration
  2. CLOCKSS: Ingest Pipeline