LOCKSS: Extracting Bibliographic Metadata

From CLOCKSS Trusted Digital Repository Documents

Part of the LOCKSS preservation process is extracting and indexing bibliographic metadata. Bibliographic metadata is supplied as part of the content submitted by a publisher as Submission Information Packages (SIPs) and preserved in Archival Information Packages (AIPs) in LOCKSS preservation networks.

Publishers are required to submit adequate bibliographic metadata with the content to support the four uses the CLOCKSS Archive makes of such metadata:

  • To enable the CLOCKSS organization to verify and bill for the content submitted by the publisher. This requires reasonably accurate counts of articles per publisher, but no other metadata.
  • To track the holdings and progress in preserving content of LOCKSS networks through reports to agencies such as the Keepers registry. This requires reasonably accurate publisher and date and/or volume range metadata, but no other metadata.
  • To identify relevant content in responding to a trigger event. This requires accurate journal and volume metadata.
  • To make content disseminated from LOCKSS boxes, and triggered from the CLOCKSS network, accessible to end users using bibliographic information through online tools such as link resolvers. This requires full accurate metadata.

This document describes the kinds of bibliographic information that are preserved by LOCKSS systems, the formats and methods most commonly used for transmitting bibliographic metadata, the mechanisms used to extract and index the bibliographic metadata in a metadata database, and ways of presenting and querying extracted bibliographic metadata. Note that for the routine uses of bibliographic metadata, full metadata is not required and a small level of noise is acceptable. For the uses that arise only as part of a trigger event, full, accurate metadata is required, but there is time to detect and remedy any remaining noise. Thus the goal in normal processing is not perfect metadata, but metadata with an acceptably low level of noise.

Kinds of Bibliographic Metadata

LOCKSS can preserve many kinds of content. The mechanisms underlying the preservation process operate on storage units, typically the content of a URL on a website, or files on disk or contained within archive formats such as ZIP or TAR files. However, bibliographic metadata pertains to bibliographic units such as journal articles or book chapters. Therefore the kinds of bibliographic metadata that are supplied by publishers are related to bibliographic units rather than to storage units.

The kinds of bibliographic metadata that are extracted and stored by LOCKSS depend on the bibliographic type of the preserved content. For serials such as journals, the bibliographic metadata includes:

  • Publisher
  • Publication name (e.g. name of journal)
  • ISSN and eISSN
  • DOI
  • Volume
  • Issue
  • Publication date (preferably cover date)
  • Article title
  • Article DOI
  • Article author(s)
  • Article number
  • Article page range (start page or start/end page)
  • Article keywords
  • Article summary

For books and monographs, the bibliographic metadata includes:

  • Publisher
  • Publication name (e.g. name of book)
  • Edition
  • ISBN and eISBN
  • DOI
  • Volume
  • Publication date (preferably cover date)
  • Author(s)/Editor(s)
  • Keywords
  • Summary

For individual book or monograph chapters, the bibliographic metadata also includes:

  • Chapter title
  • Chapter DOI
  • Chapter author(s)
  • Chapter number
  • Chapter page ranges (start page or start/end page)
  • Chapter keywords
  • Chapter summary

For book or monograph series, the bibliographic metadata also includes:

  • Series name
  • Series ISSN and eISSN

In addition, certain physical metadata is also collected about the relationship of the bibliographic unit to its original Submission Information Package (SIP) and the AIP where it is preserved. This metadata is collected from the CLOCKSS system primarily for auditing purposes, and to assist in any eventual triggering of the content. It includes:

  • Publishing platform (e.g. HighWire Press)
  • Plugin ID and AUID (identifies both the SIP and the AIP)
  • URLs of bibliographic features (e.g. abstract, full-text PDF, full-text HTML, citation file)
  • Date bibliographic unit was first added to the AIP

Formats and Methods for Transmitting Bibliographic Metadata

Publishers encode and transmit bibliographic metadata in a variety of ways. How the metadata is encoded depends heavily on whether the content is being harvested from the publisher's website, or transferred to CLOCKSS by the publisher at a time of their choosing.

Harvest Content

Content that is harvested from the publisher's website generally takes the form of HTML pages with links to supporting files such as PDF and ePub files, video and audio files, and other file types. Publishers deliver metadata for harvested content using one of several formats and mechanisms:

The most common is to embed metadata as HTML META tags in the header of each article HTML page: the abstract page, the full-text HTML page, or both. The most frequently used metadata encodings are Google Scholar and Dublin Core, and some publishers supply the same metadata in both encodings in the same file. Google Scholar is the encoding designed to facilitate searching for content through Google's Scholar project. Here is an example of the Google Scholar encoding for an article of a typical journal:

<meta content="Molecular Interventions" name="citation_journal_title" />
<meta content="1534-0384" name="citation_issn" />
<meta content="1543-2548" name="citation_issn" />
<meta content="Duckles, Sue P." name="citation_authors" />
<meta content="Dial M for Molecular" name="citation_title" />
<meta content="04/01/2001" name="citation_date" />
<meta content="1" name="citation_volume" />
<meta content="1" name="citation_issue" />
<meta content="6" name="citation_firstpage" />
<meta content="1/1/6" name="citation_id" />
<meta content="molint;1/1/6" name="citation_mjid" />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full" name="citation_fulltext_html_url" />
<meta content="http://molinterv.aspetjournals.org/content/1/1/6.full.pdf" name="citation_pdf_url" />

Here is the same information encoded as Dublin Core:

<meta name="DC.Format" content="text/html" />
<meta name="DC.Language" content="en" />
<meta content="Dial M for Molecular" name="DC.Title" />
<meta content="" name="DC.Identifier" />
<meta content="2001-04-01" name="DC.Date" />
<meta content="American Society for Pharmacology and Experimental Therapeutics" name="DC.Publisher" />
<meta content="Sue P. Duckles" name="DC.Contributor" />

Dublin Core is a more limited encoding, and full information is not always available in this encoding without resorting to non-standard extensions. It is often necessary to consult both encodings to extract complete metadata.
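Consulting both encodings can be sketched as a preference-ordered lookup over the name/content pairs already read from a page. This is a minimal illustration, not the LOCKSS implementation: the internal field names and the FIELD_SOURCES table are assumptions, and the inclusion of citation_publisher is assumed rather than taken from the example above.

```python
# Hypothetical mapping from internal field names to the META tag names that
# may carry each value, in order of preference. The internal field names and
# the citation_publisher entry are assumptions for illustration.
FIELD_SOURCES = {
    "journal_title": ["citation_journal_title"],
    "article_title": ["citation_title", "DC.Title"],
    "date": ["citation_date", "DC.Date"],
    "publisher": ["citation_publisher", "DC.Publisher"],
    "author": ["citation_authors", "DC.Contributor"],
    "volume": ["citation_volume"],
}


def merge_encodings(tags):
    """Return one record, taking the first non-empty value for each field."""
    record = {}
    for field, names in FIELD_SOURCES.items():
        for name in names:
            value = tags.get(name, "").strip()
            if value:
                record[field] = value
                break
    return record


# Name/content pairs as they might be read from a page carrying both encodings.
tags = {
    "citation_journal_title": "Molecular Interventions",
    "citation_title": "Dial M for Molecular",
    "citation_date": "04/01/2001",
    "citation_volume": "1",
    "DC.Title": "Dial M for Molecular",
    "DC.Identifier": "",
    "DC.Date": "2001-04-01",
    "DC.Publisher": "American Society for Pharmacology and Experimental Therapeutics",
}
record = merge_encodings(tags)
```

Here the publisher comes from Dublin Core because the Google Scholar tags in this sample do not carry it, while the journal title and volume are available only from the Google Scholar tags.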

The other way bibliographic information is provided for harvested content is through a separate file that is linked to from the abstract or full-text HTML page and is meant for use by citation management systems. The most common format is RIS, developed by Research Information Systems as an interchange format among citation management systems. Here is the RIS representation of the same article:

TY - JOUR
PB - American Society for Pharmacology and Experimental Therapeutics
JO - Molecular Interventions
SN - 1534-0384
SN - 1543-2548
TI - Dial M for Molecular
AU - Sue P. Duckles
PY - 2001
DA - 2001-04-01
VL - 1
IS - 1
SP - 6
UR - http://molinterv.aspetjournals.org/content/1/1/6.full
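Reading a record like the one above can be sketched as follows. This is a minimal illustration that assumes one tag per line and collects repeated tags such as SN into lists; the function name parse_ris is hypothetical and the sketch does not cover multi-record files or continuation lines.

```python
import re
from collections import defaultdict

# A two-character tag, whitespace, a hyphen, then the value. The pattern
# tolerates one or two spaces before the hyphen.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])\s+-\s*(.*)$")


def parse_ris(text):
    """Parse a single RIS record into a dict mapping tag -> list of values."""
    record = defaultdict(list)
    for line in text.splitlines():
        match = RIS_LINE.match(line.strip())
        if match:
            tag, value = match.groups()
            record[tag].append(value.strip())
    return dict(record)


sample = """TY - JOUR
JO - Molecular Interventions
SN - 1534-0384
SN - 1543-2548
TI - Dial M for Molecular
VL - 1
SP - 6
"""
ris = parse_ris(sample)
```

Keeping every value in a list avoids silently dropping the second ISSN when a tag is repeated.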

File Transfer Content

Transferred content is typically in the form of "pre-publication" content that is used as input to a publication system. The content in this case consists of document files in PDF, ePub and other formats, supporting files such as image, audio, and video files, and one or more tagged text files that provide the content and metadata and refer to the other files. Many publishers have developed proprietary formats, but some have begun using evolving industry-wide formats, such as JATS (Journal Article Tag Suite), a NISO standard XML vocabulary for tagging the content and metadata of journal articles. JATS defines several tag sets for different uses, such as archiving, publishing, and authoring.

Extracting metadata from proprietary formats depends on an understanding of how the publisher has encoded the metadata. Because JATS is an open, well-understood standard, it is relatively simple to extract metadata from JATS XML.
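As a rough illustration of how simple JATS extraction can be, the sketch below reads a few fields with the Python standard library. The element names are standard JATS elements, but the internal field names, the PATHS table, and the sample document (including its placeholder DOI) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal field names mapped to paths of standard JATS elements.
PATHS = {
    "journal_title": ".//journal-meta/journal-title-group/journal-title",
    "publisher": ".//journal-meta/publisher/publisher-name",
    "article_title": ".//article-meta/title-group/article-title",
    "doi": ".//article-meta/article-id[@pub-id-type='doi']",
    "volume": ".//article-meta/volume",
    "issue": ".//article-meta/issue",
    "start_page": ".//article-meta/fpage",
}


def extract_jats(xml_text):
    """Return the fields from PATHS that are present and non-empty."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, path in PATHS.items():
        element = root.find(path)
        if element is not None:
            text = "".join(element.itertext()).strip()
            if text:
                record[field] = text
    return record


# A cut-down sample document; the DOI is a made-up placeholder.
sample = """<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Molecular Interventions</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.9999/placeholder</article-id>
      <title-group><article-title>Dial M for <italic>Molecular</italic></article-title></title-group>
      <volume>1</volume><issue>1</issue><fpage>6</fpage>
    </article-meta>
  </front>
</article>"""
jats = extract_jats(sample)
```

Using itertext() keeps text inside inline markup such as italic, which a plain .text lookup would truncate.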

Mechanisms for Extracting and Indexing Bibliographic Metadata

LOCKSS preserves content at the level of storage units, yet metadata is at the level of bibliographic units, so a mechanism is necessary to identify the storage units that contain metadata for each bibliographic unit, and then to extract the metadata and index the bibliographic units in the Metadata Database. Since the representation of metadata in preserved content depends on the publishing platform, the mechanism for performing these steps is expressed as a framework in the LOCKSS daemon. Plugins that conform to this framework embody all the rules and procedures for processing content from individual publishing platforms.

Article Iterator

The Article Iterator is slightly misnamed. It is a plugin framework that identifies files that represent different types of bibliographic units within an AIP (Archival Unit or AU). Examples of bibliographic units are abstract, full-text, supplementary material, and so on. The definition of the mapping between the various types of bibliographic units and the files within the AU is plugin-specific.

The article iterator typically identifies files using patterns that match a specific file such as an HTML article page. For each such page, the article iterator identifies all the related files that also represent bibliographic features.

For example, an article iterator whose patterns match the HTML abstract page can attempt to locate related files that represent the metadata, full-text HTML, full-text PDF, citation manager files, and other bibliographic features. The article iterator labels these feature URLs for use by the metadata extractor and other clients of the article iterator.
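The pattern-matching idea can be sketched in a few lines. This is not the LOCKSS daemon's plugin API: the function name, the feature labels, and the URL suffixes (in particular ".abstract") are assumptions, loosely modeled on the example URLs earlier in this document.

```python
import re

# Matches the full-text HTML page of an article and captures its base URL.
ARTICLE_PATTERN = re.compile(
    r"^(?P<base>https?://[^/]+/content/\d+/\d+/\w+)\.full$"
)

# Hypothetical feature labels and the URL suffix assumed for each.
FEATURE_SUFFIXES = {
    "full_text_html": ".full",
    "full_text_pdf": ".full.pdf",
    "abstract": ".abstract",
}


def iterate_articles(au_urls):
    """Yield a dict of labeled feature URLs for each article found in the AU."""
    present = set(au_urls)
    for url in sorted(present):
        match = ARTICLE_PATTERN.match(url)
        if not match:
            continue
        base = match.group("base")
        features = {}
        for label, suffix in FEATURE_SUFFIXES.items():
            candidate = base + suffix
            if candidate in present:  # only label files actually preserved
                features[label] = candidate
        yield features


au_urls = [
    "http://molinterv.aspetjournals.org/content/1/1/6.full",
    "http://molinterv.aspetjournals.org/content/1/1/6.full.pdf",
    "http://molinterv.aspetjournals.org/content/1/1.toc",
]
articles = list(iterate_articles(au_urls))
```

Each yielded dict is one bibliographic unit; a metadata extractor would then pick whichever labeled feature carries the metadata.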

Metadata Extractor

The metadata extractor is a plugin framework that extracts metadata for each bibliographic unit using features identified by the article iterator. It typically uses the feature identified as the metadata feature. The metadata extractor operates by processing the contents of the storage unit that contains the metadata, identifying specific metadata elements, validating and normalizing values, and creating a record that contains the metadata for that bibliographic unit.

The processing of the files to identify and extract the metadata values is unique to the plugin that governs content for a given AIP. For harvested journal articles that provide metadata using META tags in abstract or full-text HTML pages, the extraction process involves parsing the HTML header and extracting the corresponding "name" and "content" attribute values. Once the pairs are isolated, the values can be validated to ensure they conform to the expected type.
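The META tag case can be sketched with the Python standard library. The parser class, the VALIDATORS table, and the choice of which fields to check are assumptions for illustration; a real plugin would apply whatever rules suit its publishing platform.

```python
import re
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect (name, content) pairs from META tags inside the HTML head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.pairs.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical per-field patterns that a value must match in full.
VALIDATORS = {
    "citation_issn": re.compile(r"\d{4}-\d{3}[\dXx]"),
    "citation_volume": re.compile(r"\d+"),
}


def extract_pairs(html):
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def validate(pairs):
    """Split pairs into those that pass their validator and those that fail."""
    valid, rejected = [], []
    for name, content in pairs:
        pattern = VALIDATORS.get(name)
        if pattern is None or pattern.fullmatch(content.strip()):
            valid.append((name, content.strip()))
        else:
            rejected.append((name, content))
    return valid, rejected


html = """<html><head>
<meta content="Molecular Interventions" name="citation_journal_title" />
<meta content="1534-0384" name="citation_issn" />
<meta content="one" name="citation_volume" />
</head><body><meta content="ignored" name="citation_title" /></body></html>"""
valid, rejected = validate(extract_pairs(html))
```

Fields without a validator pass through unchanged, so the table only needs entries for values whose type is known.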

Errors in Metadata

Validation, deduplication and other forms of error correction are necessary because it is not uncommon for there to be errors in provided metadata. Pages are maintained on the LOCKSS team's internal wiki with information about metadata problems. A PDF of one of them, describing one class of problems, is here.

Normalization is necessary to ensure a common representation for each type of value. For example, an ISSN should have four digits, a hyphen, and three digits followed by a digit or the letter "X". Normalization entails ensuring the correct punctuation and ensuring that a final "X" is upper-cased. An additional example is the publisher name in the header of HTML files, which is often specified as:

  <meta name="dc.publisher" content="publisher_name" />

It's common to find inconsistent spellings, abbreviations, etc. of the same publisher name in different files. These are translated to a consistent name by a manually-maintained mapping table. Entries are added whenever an unexpected publisher name appears in a report.
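Both normalizations can be sketched briefly. The function names and the entries in the mapping table are illustrative assumptions; the real table is maintained by hand as described above. The ISSN sketch only fixes punctuation and case, and does not verify the check digit.

```python
import re


def normalize_issn(raw):
    """Return the ISSN as NNNN-NNNC with an upper-case X, or None if invalid."""
    compact = re.sub(r"[^0-9Xx]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    return compact[:4] + "-" + compact[4:]


# Hypothetical entries: variant spellings mapped to one consistent name.
PUBLISHER_NAMES = {
    "aspet": "American Society for Pharmacology and Experimental Therapeutics",
    "american society for pharmacology & experimental therapeutics":
        "American Society for Pharmacology and Experimental Therapeutics",
}


def normalize_publisher(raw):
    """Look up a case- and whitespace-insensitive key; fall back to the input."""
    key = " ".join(raw.lower().split())
    return PUBLISHER_NAMES.get(key, raw.strip())
```

For example, normalize_issn("1534 0384") gives "1534-0384", and an unmapped publisher name is returned unchanged so that it can surface in a report and be added to the table.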

A further example is missing DOIs. Some content may not have had a DOI assigned. In other cases, a DOI may have been assigned but the metadata extractor may fail to find it in the content, either because it is missing or because it is not in the place(s) that the extractor expects.

The report generators require a de-duplication step. It uses a combination of bibliographic items, including publisher, publication title, publication year, volume, issue, start page, and a computed article ID. Two metadata items are the same if all the available values are the same. The DOI is the preferred unique ID, but if it isn't available a substitute is generated using the title, if there is one, and otherwise the access URL. The article ID is computed the same way regardless of whether it is harvest or file transfer content. Because these IDs are successively less reliable as unique IDs, the de-duplication is not completely reliable in the face of noisy metadata.
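The de-duplication rule can be sketched as a comparison key. The field names are hypothetical, and the sketch makes one interpretive assumption: a value missing from both items counts as equal, while a value present in only one item makes the items differ.

```python
KEY_FIELDS = ("publisher", "publication_title", "year", "volume", "issue", "start_page")


def article_id(item):
    """Prefer the DOI, then the article title, then the access URL."""
    doi = (item.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = (item.get("article_title") or "").strip()
    if title:
        return "title:" + title
    url = (item.get("access_url") or "").strip()
    return "url:" + url if url else None


def dedup_key(item):
    return tuple(item.get(field) for field in KEY_FIELDS) + (article_id(item),)


def deduplicate(items):
    """Keep the first item seen for each key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = dedup_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


common = {"publisher": "P", "publication_title": "J", "year": "2001",
          "volume": "1", "issue": "1", "start_page": "6"}
items = [
    dict(common, doi="10.9999/ABC"),                    # placeholder DOI
    dict(common, doi="10.9999/abc"),                    # same DOI, different case
    dict(common, article_title="Dial M for Molecular"), # DOI missing
]
unique = deduplicate(items)
```

The third item survives even though it may describe the same article, which illustrates why fallback IDs make de-duplication less reliable when metadata is noisy.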

Metadata Database Indexing

Once metadata has been extracted from a bibliographic unit, the Metadata Manager in the LOCKSS box indexes it in the metadata database. The indexing operation is independent of how the bibliographic units were identified or the metadata was extracted. Parts of the indexing process are common to all bibliographic types, while other parts depend on which type is being indexed. The first section of this document shows what metadata is stored in the metadata database for each bibliographic type.

The Metadata Manager also checks that the metadata for a bibliographic unit is complete enough to index. At least basic information such as the publisher, publication title, one or more feature URLs, and the ID of the AU (AIP), indicating the plugin and the preservation parameters, is required. Some publishers supply incomplete metadata. The Metadata Manager attempts to fill in missing bibliographic information from the Title Database (TDB) entry for the AU (AIP).

The TDB is a per-publisher knowledge base that is maintained by the LOCKSS team and is used to add new AUs to a LOCKSS box. The TDB provides all the preservation parameters necessary to define the AU, plus additional, readily available bibliographic information. For harvest content, this includes the publication name, publisher name, ISSN/eISSN, ISBN/eISBN, volume, and publisher proprietary identifier. For transferred content, less bibliographic information is available in the TDB entries because each AU typically includes all content from a given publisher for a given year.

If the publisher or publication title is not available from either the metadata or the TDB, the Metadata Manager generates values that can be readily identified for all bibliographic units in the AU being indexed. These "gensyms" can later be updated once more complete bibliographic information becomes available. Flagging missing metadata in this way is the only use the system makes of "gensyms".
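The completeness check, TDB fallback, and gensym steps can be sketched together. The field names, the function names, and the gensym format are all assumptions made for illustration; the source does not specify how the Metadata Manager formats generated values.

```python
REQUIRED_FIELDS = ("publisher", "publication_title", "feature_urls", "au_id")


def fill_from_tdb(record, tdb_entry):
    """Copy TDB values into fields that the extracted metadata left empty."""
    completed = dict(record)
    for field, value in tdb_entry.items():
        if value and not completed.get(field):
            completed[field] = value
    return completed


def add_gensyms(record):
    """Generate readily identifiable stand-ins for a missing publisher or title."""
    completed = dict(record)
    for field in ("publisher", "publication_title"):
        if not completed.get(field):
            # Hypothetical format: easy to find later and tied to the AU.
            completed[field] = f"UNKNOWN_{field.upper()}_{completed.get('au_id')}"
    return completed


def is_indexable(record):
    return all(record.get(field) for field in REQUIRED_FIELDS)


extracted = {
    "au_id": "hypothetical-au-1",
    "feature_urls": {"full_text_pdf": "http://molinterv.aspetjournals.org/content/1/1/6.full.pdf"},
    "article_title": "Dial M for Molecular",
}
tdb_entry = {"publication_title": "Molecular Interventions", "publisher": ""}

indexed = add_gensyms(fill_from_tdb(extracted, tdb_entry))
```

Because the generated values follow a fixed prefix, they can be located and replaced once better bibliographic information becomes available.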

Querying and Presenting Extracted Metadata

The administrator of a LOCKSS box and, for CLOCKSS, authorized CLOCKSS staff can access the archive, including for the purposes of querying the metadata and generating various reports required to operate LOCKSS preservation networks and report the state of preservation.

There are several ways that authorized staff can query the metadata of a LOCKSS or CLOCKSS box. Since the metadata is stored in a relational metadata database, it is possible for custom and standard report generators to run SQL queries against the metadata database of a LOCKSS box, or all CLOCKSS boxes, in the preservation network.

An example of this is the monthly report submitted to the Keepers registry at the University of Edinburgh. The Keepers report shows the range of years and volumes for every title that is committed for preservation, in process, or preserved in the CLOCKSS preservation network. This report is necessary to satisfy the reporting requirements of the CLOCKSS board and the Designated Community. Custom reports can also be written that enable the CLOCKSS staff to satisfy requests from publishers under the terms of the CLOCKSS: Publisher Agreement.
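A holdings query of this kind can be sketched against a toy database. The table and column names are hypothetical and do not reflect the actual LOCKSS metadata database schema; SQLite is used here only so that the example is self-contained, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-table schema for illustration only.
conn.execute(
    "CREATE TABLE article ("
    " publisher TEXT, publication_title TEXT, year INTEGER, volume INTEGER)"
)
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?, ?)",
    [
        ("Publisher A", "Journal One", 2001, 1),
        ("Publisher A", "Journal One", 2002, 2),
        ("Publisher A", "Journal One", 2002, 2),
        ("Publisher B", "Journal Two", 2010, 5),
    ],
)


def holdings_report(connection):
    """Return the year and volume range held for each title."""
    return connection.execute(
        "SELECT publisher, publication_title,"
        "       MIN(year), MAX(year), MIN(volume), MAX(volume)"
        " FROM article"
        " GROUP BY publisher, publication_title"
        " ORDER BY publication_title"
    ).fetchall()


report = holdings_report(conn)
```

Each row of the result gives one title with its first and last year and volume, which is the shape of information the holdings report needs.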

Other custom report generators could also be written using SQL access to export the bibliographic information in the metadata database. As an example, it would be possible to create a custom report that exports the bibliographic information in METS (Metadata Encoding and Transmission Standard) XML format for further processing, since there is a close correspondence between the LOCKSS and METS schemas and types.

Relevant Documents

Change Process

Changes to this document require:

  • Review by LOCKSS Engineering Staff
  • Approval by LOCKSS Technical Lead