Using NLP to Support Dynamic Arrangement, Description, and Discovery of Born Digital Collections: The ArchExtract Experiment

By Mary W. Elings

This post is the eighth in our Spring 2016 series on processing digital materials.

———

Many of us working with archival materials are looking for tools and methods to support arrangement, description, and discovery of electronic records and born digital collections, as well as large bodies of digitized text. Natural Language Processing (NLP), which uses algorithms and mathematical models to process natural language, offers a variety of potential solutions to support this work. Several efforts have investigated using NLP solutions for analyzing archival materials, including TOME (Interactive TOpic Model and MEtadata Visualization), Ed Summers’ Fondz, and Thomas Padilla’s Woese Collection work, among others, though none have resulted in a major tool for broader use.

One of these projects, ArchExtract, was carried out at UC Berkeley’s Bancroft Library in 2014-2015. ArchExtract sought to apply several NLP tools and methods to large digital text collections and build a web application that would package these largely command-line NLP tools into an interface that would make it easy for archivists and researchers to use.

The ArchExtract project focused on facilitating analysis of the content and, via that analysis, discovery by researchers. The development work was done by an intern from the UC Berkeley School of Information, Janine Heiser, who built a web application that implements several NLP tools, including Topic Modelling, Named Entity Recognition, and Keyword Extraction to explore and present large, text-based digital collections.

The ArchExtract application extracts topics, named entities (people, places, subjects, dates, etc.), and keywords from a given collection. The application automates, implements, and extends various natural language processing software tools, such as MALLET and the Stanford Core NLP toolkit, and provides a graphical user interface designed for non-technical users.
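To make that pipeline concrete, here is a minimal, illustrative sketch of the kind of topic modelling ArchExtract automates. This is not ArchExtract's code: it substitutes scikit-learn's LDA implementation for MALLET, and the directory path and parameter values are assumptions for demonstration only.

```python
# Illustrative sketch only: scikit-learn LDA standing in for MALLET.
# The "collection_texts" directory and all parameter values are hypothetical.
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Read every plain-text document in a (hypothetical) collection directory.
docs = [p.read_text(errors="ignore") for p in Path("collection_texts").glob("*.txt")]

# Build a document-term matrix, dropping very common and very rare words.
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
dtm = vectorizer.fit_transform(docs)

# Fit a topic model; the number of topics is the sort of setting an archivist
# might vary when "dynamically" re-arranging a collection.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(dtm)

# Print the top words per topic as a rough scope-and-content summary.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```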

ArchExtract Interface Showing Topic Model Results. Elings/Heiser, 2015.

In testing the application, we found the automated text analysis tools in ArchExtract were successful in identifying major topics, as well as names, dates, and places found in the text, and their frequency, thereby giving archivists an understanding of the scope and content of a collection as part of the arrangement and description process. We called this process “dynamic arrangement and description,” as materials can be re-arranged using different text processing settings so that archivists can look critically at the collection without changing the physical or virtual arrangement.

The topic models, in particular, surfaced documents that may have been related to a topic but did not contain a specific keyword or entity. The process was akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world, wherein you might find what you seek without knowing it was there. And while topic modelling has been criticized for being inexact, it can be “immensely powerful for browsing and isolating results in thousands or millions of uncatalogued texts” (Schmidt, 2012). This, combined with the named entity and keyword extraction, can give archivists and researchers important data that could be used in describing and discovering material.
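Keyword extraction can be sketched just as simply. The toy example below scores terms with TF-IDF, one common approach to surfacing distinctive terms per document; ArchExtract's own implementation may differ, and the sample documents are invented.

```python
# Toy keyword-extraction sketch using TF-IDF scores. The document names and
# text are fictional; this is not ArchExtract's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = {
    "box01_folder03.txt": "Minutes of the water rights committee, Sacramento ...",
    "box02_folder11.txt": "Correspondence regarding the 1906 earthquake relief ...",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs.values())
terms = vectorizer.get_feature_names_out()

# For each document, print its highest-scoring (most distinctive) terms.
for name, row in zip(docs, tfidf):
    scores = row.toarray().ravel()
    top = scores.argsort()[-5:][::-1]
    print(name, [terms[i] for i in top if scores[i] > 0])
```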

ArchExtract Interface Showing Named Entity Recognition Results. Elings/Heiser, 2015.

As a demonstration project, ArchExtract was successful in achieving our goals. The code developed is documented and freely available on GitHub to anyone interested in how it was done or who might wish to take it further. We are very excited by the potential of these tools in dynamically arranging and describing large, text-based digital collections, but even more so by their application in discovery. We are particularly pleased that broad, open source projects like BitCurator and ePADD are taking this work forward and will be bringing NLP tools into environments that we can all take advantage of in processing and providing access to our born digital materials.

———

Mary W. Elings is the Principal Archivist for Digital Collections and Head of the Digital Collections Unit of The Bancroft Library at the University of California, Berkeley. She is responsible for all aspects of the digital collections, including managing digital curation activities, the born digital archives program, web archiving, digital processing, mass digitization, finding aid publication and maintenance, metadata, archival information management and digital asset management, and digital initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings co-authored the article “Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums,” and wrote a primer on linked data for LAMs. She has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York (2003-2009) and School of Library and Information Science, Catholic University, Washington, DC (2010-2014), and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program (2010-present).


Electronic Records Section call for nominations

Election season is fast approaching, and the Electronic Records Section has some exciting opportunities for service, both in elected and appointed positions!

The ERS needs to elect a new Vice Chair/Chair Elect and 2 Steering Committee members. We are also looking for a volunteer to serve as Communications Liaison. All nominations must be received by June 1st!

Vice Chair/Chair Elect (1 position open)

The Vice Chair serves a 1-year term beginning immediately after the 2016 Annual Meeting, assisting the Chair in leading the section and representing the section in the Chair’s absence. At the conclusion of the incumbent Chair’s term, the Vice Chair assumes the position of Chair, and upon completion of that 1-year term as Chair serves one final year as Past Chair.

Steering Committee (2 positions open)

Steering Committee members serve for a term of 3 years, beginning immediately following the 2016 Annual Meeting. Their responsibility is to assist the Chair and the Vice Chair in leading and organizing section activities.

Communication Liaison (1 position open)

The Communications Liaison facilitates communications between the Steering Committee and the Section membership and other audiences, including but not limited to the SAA microsite, electronic mailing list, blogs, social media, and other forms of online communication not yet in use by the Section. This role is open to all eligible Electronic Records Section members. The appointee will serve a renewable one-year term. Note that this role is appointed and not subject to election.

How can I nominate someone?

To nominate yourself or someone else, or to volunteer for appointment as Communication Liaison, please send the following to Marty Gengenbach (martin[dot]gengenbach[at]gmail[dot]com):

  • the nominee/volunteer name,
  • contact information, and
  • position (Vice Chair/Chair Elect, Steering Committee, or Communication Liaison)

Important dates:

June 1 – All nominations must be received by this date. The steering committee will review and confirm nominations the following week.

June 15 – Candidate statements due.

July 1 – Supplemental information such as candidate photos or biographies due.

For more information on Electronic Records Section leadership, please see the ERS section bylaws.

Indiana Archives and Records Administration’s Accession Profile Use in Bagger

By Tibaut Houzanme and John Scancella

This post is the seventh in our Spring 2016 series on processing digital materials. This quick report for practitioners draws from the “Bagger’s Enhancements for Digital Accessions” post prepared for the Library of Congress’ blog The Signal.

———

Context

In the past, the Indiana Archives and Records Administration (IARA) would simply receive, hash, and place digital accessions in storage, with the metadata keyed into a separate Microsoft Access database. Currently, IARA is automating many of its records processes with the APPX-based Archival Enterprise Management system (AXAEM). When the implementation concludes, this open source, integrated records management and digital preservation system will become the main accessioning tool. For now, and for accessions outside AXAEM’s reach, IARA uses Bagger. Both AXAEM and Bagger comply with the BagIt packaging standard, so accessions captured with Bagger can later be readily ingested by AXAEM. IARA anticipates time savings and a reduction in record/metadata silos.
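For readers unfamiliar with BagIt, the sketch below shows what a BagIt-compliant accession package looks like when created programmatically with the Library of Congress's bagit-python library, the same standard Bagger and AXAEM implement. The directory path and metadata values are hypothetical; IARA's actual fields are defined by its Bagger profile.

```python
# Minimal sketch of packaging an accession as a BagIt bag with bagit-python.
# The directory path and bag-info values below are hypothetical examples.
import bagit

# make_bag() restructures the directory in place, moving payload files into
# data/ and writing manifests that record a hash for every file.
bag = bagit.make_bag(
    "accessions/2016-0042",                             # hypothetical accession directory
    bag_info={
        "Source-Organization": "Example State Agency",  # illustrative values only
        "External-Identifier": "2016-0042",
        "Contact-Email": "records@example.gov",
    },
)

# Later, a repository (or AXAEM on ingest) can re-verify fixity.
print(bag.is_valid())
```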

Initial Project Scope

IARA aims to capture required metadata for each accession in a consistent manner. Bagger allows this to be done through a standard profile. IARA developed a profile inspired by the fields and drop-down menus on its State Form (SF 48883). When that profile was initially implemented, Bagger scrambled the order of the metadata fields and the accession was not easily understood. John Scancella, the lead Bagger developer at the Library of Congress, implemented a change so that Bagger now keeps the metadata sequence as originally intended in the profile. IARA then added metadata fields for preservation decisions.

Scope Expansion and Metadata Fields

Based on colleagues’ feedback, it appeared that IARA’s profile could be useful to other institutions. A generic version of the profile was then created that uses more general terms and makes all metadata fields optional. This way, each institution can decide which fields to require, making the generic profile useful to most digital records projects and collecting institutions.

The two profiles display similar metadata fields for context (provenance, records series), identity, integrity, physical, logical, inventory, administrative, digital originality, storage media or carrier types, appraisal and classification values, format openness, and curation lifecycle information for each accession. Together with the hash values and file sizes that Bagger collects, this provides a framework to more effectively evaluate, manage, and preserve digital records over the long term.

Below are the profile fields:

Figure 1: IARA Profile with Sample Accession Screen (1 of 2)

 

Figure 2: IARA Profile with Sample Accession Screen (2 of 2)

 

The fictitious metadata values in the figures above are for demonstration purposes; the corresponding bag-info.txt file below includes the hash value and size:

Figure 3: Metadata Fields and Values in the bag-info.txt File after Bag Creation

This test accession used random files accessible from the Digital Corpora and Open Preservation websites.

Adopting or Adapting Profiles

To use IARA’s profile, its generic version, or any other profile in Bagger, download the latest version of Bagger (2.5.0 as of this writing). To start an accession, select the appropriate profile from the dropdown list; this will populate the screen with the profile-specific metadata fields. Then select objects, enter values, and save your bag.

For detailed instructions on how to edit metadata fields and obligation levels, create a new profile, or change an existing one to meet your project’s or institution’s requirements, please refer to the Bagger User Guide in the “doc” folder inside your downloaded Bagger zip file.

To comment on IARA’s profiles, email erecords[at]iara[dot]in[dot]gov. For Bagger issues, open a GitHub ticket. For technical information on Bagger and these profiles, please refer to the LOC’s Blog.

———

Tibaut Houzanme is Digital Archivist with the Indiana Archives and Records Administration. John Scancella is Information Technology Specialist with the Library of Congress.

Processing Digital Research Data

By Elise Dunham

This is the sixth post in our Spring 2016 series on processing digital materials.

———

The University of Illinois at Urbana-Champaign’s (Illinois) library-based Research Data Service (RDS) will be launching an institutional data repository, the Illinois Data Bank (IDB), in May 2016. The IDB will provide University of Illinois researchers with a repository for research data that will facilitate data sharing and ensure reliable stewardship of published data. The IDB is a web application that transfers deposited datasets into Medusa, the University Library’s digital preservation service for the long-term retention and accessibility of its digital collections. Content is ingested into Medusa via the IDB’s unmediated self-deposit process.

As we conceived of and developed our dataset curation workflow for digital datasets ingested into the IDB, we turned to archivists in the University Archives to gain an understanding of their approach to processing digital materials. [Note: I am not specifying whether data deposited in the IDB is “born digital” or “digitized” because, from an implementation perspective, both types of material can be deposited via the self-deposit system in the IDB. We are not currently offering research data digitization services in the RDS.] There were a few reasons for consulting with the archivists: 1) archivists have deep, real-world curation expertise, and we anticipate that many of the challenges we face with data will have solutions whose foundations were developed by archivists; 2) if, through discussing processes, we found areas where the RDS and Archives have converging preservation or curation needs, we could communicate these to the Preservation Services Unit, which develops and manages Medusa; and 3) I’m an archivist by training and I jump on any opportunity to talk with archivists about archives!

Even though the RDS and the University Archives share a central goal–to preserve and make accessible the digital objects that we steward–we learned that there are some operational and policy differences between our approaches to digital stewardship that necessitate points of variance in our processing/curation workflow:

Appraisal and Selection

In my view, appraisal and selection are fundamental to archival practice. The archives field has developed a rich theoretical foundation when it comes to appraisal and selection, and without these functions the archives endeavor would be wholly unsustainable. Appraisal and selection ideally occur in the very early stages of the archival processing workflow. The IDB curation workflow will differ significantly–by and large, appraisal and selection procedures will not take place until at least five years after a dataset is published in the IDB–making our appraisal process more akin to that of an archives that chooses to appraise records after accessioning or even during the processing of materials for long-term storage. Our different approaches to appraisal and selection speak to the different functions the RDS and the University Archives fulfill within the Library and the University.

The University Archives is mandated to preserve University records in perpetuity by the General Rules of the University and the Illinois State Records Act. The RDS’s initiating goal, in contrast, is to provide a mechanism for Illinois researchers to comply with funder and/or journal requirements to make the results of research publicly available. There is no mandate for the IDB to accept only data deemed to have “enduring value”; in fact, the research data curation field is so new that we do not yet have a community-endorsed sense of what “enduring value” means for research data. Standards regarding the enduring value of research data may evolve over the long term in response to discipline-specific circumstances.

To support researchers’ needs and/or desires to share their data in a simple and straightforward way, the IDB ingest process is largely unmediated. Depositing privileges are open to all campus affiliates who have the appropriate University log-in credentials (e.g., faculty, graduate students, and staff), and deposited files are ingested into Medusa immediately upon deposit. RDS curators will do a cursory check of deposits, as doing so remains scalable (see workflow chart below), and the IDB reserves the right to suppress access to deposits for a “compelling reason” (e.g., failure to meet criteria for depositing as outlined in the IDB Accession Policy, violations of publisher policy, etc.). Aside from cases that we assume will be rare, the files as deposited into the IDB, unappraised, are the files that are preserved and made accessible in the IDB.

Preservation Commitment

A striking policy difference between the RDS and the University Archives is that the RDS makes a commitment to preserving and facilitating access to datasets for a minimum of five years after the date of publication in the Illinois Data Bank.

The University Archives, of course, makes a long-term commitment to preserving and making accessible records of the University. I have to say, when I learned that the five-year minimum commitment was the plan for the IDB, I was shocked and a bit dismayed! But after reflecting on the fact that files deposited in the IDB undergo no formal appraisal process at ingest, the concept began to feel more comfortable and reasonable. At a time when terabytes of data are created, oftentimes for single projects, and budgets are a universal concern, there are logistical storage issues to contend with. Now, I fully believe that for us to ensure that we are able to 1) meet current, short-term data sharing needs on our campus and 2) fulfill our commitment to stewarding research data in an effective and scalable manner over time, we have to make a circumspect minimum commitment and establish policies and procedures that enable us to assess the long-term viability of a dataset deposited into the IDB after five years.

The RDS has collaborated with archives and preservation experts at Illinois and, basing our work in archival appraisal theory, has developed guidelines and processes for reviewing published datasets after their five-year commitment ends to determine whether to retain, deaccession, or dedicate more stewardship resources to them. Enacting a systematic approach to appraising the long-term value of research data will enable us to allot resources to datasets in a way that is proportional to their value to research communities and their preservation viability.

Convergences

To show that we’re not all that different after all, I’ll briefly mention a few areas where the University Archives and the RDS are taking similar approaches or facing similar challenges:

  • We are both taking an MPLP-style approach to file conversion. In order to get preservation control of digital content, at minimum, checksums are established for all accessioned files (see the sketch after this list). As a general rule, if a file can be opened using modern technology, file conversion will not be pursued as an immediate preservation action. Establishing strategies and policies for managing a variety of file formats at scale is an area that will continue to evolve at Illinois through collaboration among the University Archives, the RDS, and the Preservation Services Unit.
  • Accruals present metadata challenges. How do we establish clear accrual relationships in our metadata when a dataset or a records series is updated annually? Are there ways to automate processes to support management of accruals?
  • Both units do as much as they can to get contextual information about the material being accessioned from its creator, and metadata is enhanced where possible throughout curation/processing.
  • The University Archives and the RDS control materials in aggregation, with the University Archives managing at the archival collection level and the RDS managing digital objects at the dataset level.
  • More? Certainly! For both the research data curation community and the archives community, continually adopting pragmatic strategies to manage the information created by humans (and machines!) is paramount, and we will continue to learn from one another.
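As a concrete illustration of the checksum step mentioned in the first bullet above, here is a minimal sketch that establishes fixity values for every file in an accession. The paths are hypothetical, and both units of course rely on their own tooling (e.g., BagIt manifests, Medusa) rather than an ad hoc script.

```python
# Minimal sketch of establishing checksums for all accessioned files.
# Paths and the manifest filename are hypothetical examples.
import csv
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

accession_dir = Path("deposits/dataset-0001")   # hypothetical deposit directory
with open("dataset-0001-manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["relative_path", "sha256", "size_bytes"])
    for f in sorted(accession_dir.rglob("*")):
        if f.is_file():
            writer.writerow([f.relative_to(accession_dir), sha256_of(f), f.stat().st_size])
```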

Research Data Alliance Interest Group

If you’re interested in further exploring the areas where the principles and practices in archives and research data curation overlap and where they diverge, join the Research Data Alliance (RDA) Archives and Records Professionals for Research Data Interest Group. You’ll need to register with the RDA (which is free!) and subscribe to the group. If you have any questions, feel free to get in touch!

IDB Curation Workflow

The following represents our planned functional workflow for handling dataset deposits in the Illinois Data Bank:

Workflow graphic created by Elizabeth Wickes.

Learn More

To learn more about the IDB policies and procedures discussed in this post, keep an eye on the Illinois Data Bank website after it launches next month. Of particular interest on the Policies page will be the Accession Policy and the Preservation Review, Retention, Deaccession, Revision, and Withdrawal Procedure document.

Acknowledgements

Bethany Anderson and Chris Prom of the University of Illinois Archives

The rest of the Research Data Preservation Review Policy/Procedures team: Bethany Anderson, Susan Braxton, Heidi Imker, and Kyle Rimkus

The rest of the RDS team: Qian Zhang, Elizabeth Wickes, Colleen Fallaw, and Heidi Imker

———

Elise Dunham is a Data Curation Specialist for the Research Data Service at the University of Illinois at Urbana-Champaign. She holds an MLS from the Simmons College Graduate School of Library and Information Science where she specialized in archives and metadata. She contributes to the development of the Illinois Data Bank in areas of metadata management, repository policy, and workflow development. Currently she co-chairs the Research Data Alliance Archives and Records Professionals for Research Data Interest Group and is leading the DACS workshop revision working group of the Society of American Archivists Technical Subcommittee for Describing Archives: A Content Standard.

Let the Entities Describe Themselves

By Josh Schneider and Peter Chan

This is the fifth post in our Spring 2016 series on processing digital materials.

———

Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals of arranging and describing materials: screening for confidential, restricted, or legally protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery of and access to the material’s intellectual content.

More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer-assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic ToolKit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects (Overview) outside of the archival community are also exploring similar technologies to make sense of large corpora of text.
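To illustrate the first two techniques, the toy sketch below combines a regular expression search for patterned identifiers with a lexicon search against a small, categorized keyword list. It is not drawn from BitCurator, FTK, or ePADD; the patterns and the lexicon are examples only.

```python
# Toy illustration of regex and lexicon screening for potentially restricted
# content. The patterns, lexicon categories, and sample text are examples only.
import re

# Patterns for identifiers that often signal restricted content.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

# A categorized lexicon an archivist might supply (in the spirit of ePADD's lexicon search).
LEXICON = {"medical": {"diagnosis", "prescription"}, "legal": {"subpoena", "settlement"}}

def screen(text: str) -> dict:
    """Return regex matches and lexicon hits for a single document's text."""
    hits = {name: pat.findall(text) for name, pat in PATTERNS.items()}
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits["lexicon"] = {cat: sorted(words & terms) for cat, terms in LEXICON.items()}
    return hits

print(screen("Re: settlement terms. My SSN is 123-45-6789."))
```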

From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.


Currently, however, there are no open source NER tools broadly tuned towards the diverse variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, trained to also identify probable matches, recognizes additional entity types as well, such as events and diseases; the latter can assist with screening for protected health information.
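As a simple illustration of what entity extraction yields, the sketch below uses spaCy as a stand-in for StanfordNER, Apache OpenNLP, or ePADD's custom NER. It assumes the small English model has been installed (python -m spacy download en_core_web_sm), and the sample sentence is fictional.

```python
# Illustrative NER sketch using spaCy as a stand-in for the Java-based tools
# named above. The sample text is fictional.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Jane Doe wrote to the Bancroft Library in Berkeley about the "
        "1918 influenza outbreak.")
doc = nlp(text)

# Tally extracted entities by type (PERSON, ORG, GPE, DATE, ...), the same
# kind of counts an archivist might browse in an NER interface.
counts = Counter((ent.label_, ent.text) for ent in doc.ents)
for (label, entity), freq in counts.most_common():
    print(f"{label:8} {entity}  ({freq})")
```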

What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.

Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
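As one illustration of that publishing step, the sketch below serializes a single extracted, disambiguated entity as schema.org JSON-LD. The entity, its authority link, and the collection reference are invented examples, not a prescribed data model.

```python
# Minimal sketch of expressing an extracted entity as linked open data
# (schema.org JSON-LD). All identifiers and names below are hypothetical.
import json

entity = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "sameAs": "https://viaf.org/viaf/0000000000",   # hypothetical authority identifier
    "subjectOf": {
        "@type": "CreativeWork",
        "name": "Example Papers, Box 3, Folder 12",  # hypothetical collection reference
    },
}

print(json.dumps(entity, indent=2))
```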

It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.

———

Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.