Let the Entities Describe Themselves

By Josh Schneider and Peter Chan

This is the fifth post in our Spring 2016 series on processing digital materials.

———

Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or on collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals of arranging and describing materials: screening for confidential, restricted, or legally protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery of, and access to, the material’s intellectual content.

More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer-assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic Toolkit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects outside of the archival community (Overview) are also exploring similar technologies to make sense of large corpora of text.
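
To make the screening step concrete, here is a minimal sketch in Python of the kind of regular expression and lexicon search such tools perform. The patterns and keyword lists are illustrative assumptions, not the actual rule sets shipped with BitCurator or ePADD:

```python
import re

# Illustrative patterns only -- production tools ship far more robust rule sets.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

# A tiny stand-in for a user-supplied, categorized lexicon of sensitive terms.
LEXICON = {
    "confidential": ["restricted", "do not distribute"],
    "medical": ["diagnosis", "prescription"],
}

def screen(text):
    """Flag regex matches and lexicon hits in a document."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    lowered = text.lower()
    for category, keywords in LEXICON.items():
        hits.extend((category, kw) for kw in keywords if kw in lowered)
    return hits

print(screen("Patient diagnosis attached. SSN: 123-45-6789. Do not distribute."))
# [('ssn', '123-45-6789'), ('confidential', 'do not distribute'), ('medical', 'diagnosis')]
```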

From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER identifies and extracts named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, geared especially toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER model to reveal the intellectual content of historical email archives.
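
As a rough illustration of how NER behaves in practice, the sketch below uses the open source spaCy library (not ePADD’s custom recognizer) to extract persons, organizations, and locations from a snippet of text:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Peter Chan discussed ePADD at Stanford University in California.")

# Each extracted entity carries a surface form and a coarse type label.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: "Peter Chan -> PERSON", "Stanford University -> ORG",
# "California -> GPE" (spaCy's label for geopolitical locations)
```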

[Image: NER.png]

Currently, however, there are no open source NER tools broadly tuned to the wide variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER model, trained to also identify probable matches, recognizes other entity types as well, such as events and diseases; the latter can assist with screening for protected health information.
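
One way to picture the fine-grained typing described above is as a gazetteer lookup layered on top of coarse NER output. The mini-gazetteer below is invented for illustration; ePADD’s actual approach, bootstrapping entity lists from Wikipedia, is considerably more sophisticated:

```python
# Toy gazetteer mapping organization names to finer-grained types.
# Entries are invented for illustration; ePADD bootstraps such lists from Wikipedia.
GAZETTEER = {
    "stanford university": "university",
    "bancroft library": "library",
    "smithsonian institution": "museum",
}

def refine(entity_text, coarse_label):
    """Refine a coarse ORG label via gazetteer lookup, when possible."""
    if coarse_label == "ORG":
        return GAZETTEER.get(entity_text.lower(), "organization (other)")
    return coarse_label

print(refine("Stanford University", "ORG"))  # -> university
print(refine("Acme Corp", "ORG"))            # -> organization (other)
```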

What if an open source NER tool like the one in development for ePADD could be refined to support processing of an even broader set of archival substrates, beyond historical email? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in the digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.

Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
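
As a sketch of what publishing extracted entities as linked open data might look like, the snippet below uses the open source rdflib library to express an extracted person as RDF and link it to an authority record. The local URI, the VIAF identifier, and the choice of schema.org vocabulary are all illustrative assumptions:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)

# Hypothetical local URI for an entity extracted from a collection, linked
# to an authority record (the VIAF identifier here is a placeholder).
entity = URIRef("http://example.org/collection/entities/jane-doe")
g.add((entity, RDF.type, SCHEMA.Person))
g.add((entity, RDFS.label, Literal("Jane Doe")))
g.add((entity, SCHEMA.sameAs, URIRef("http://viaf.org/viaf/000000000")))

print(g.serialize(format="turtle"))
```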

It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.

———

Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.

2 thoughts on “Let the Entities Describe Themselves”

  1. bancroftlibrary May 6, 2016 / 11:25 pm

    Peter and Josh,
    Very nice piece focusing on the importance of NER in processing and discovery of born digital archives. Our project, ArchExtract (https://github.com/j9recurses/archextract), actually used not only topic modeling, but also named entity recognition and keyword extraction as part of the web application. We were very excited by the potential of these tools in describing text-based digital materials but even more so by their application in discovery, which allows the user to explore the content of the documents in various ways using different processing approaches.

    I have discussed this as “Dynamic Arrangement and Description” in talks on ArchExtract as it allows a researcher to re-define the parameters and adjust the results to locate different sets of documents in context. It is somewhat akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world.

    I appreciate your focus on NLP/NER for born digital materials, as I hope to see us move more in this direction in processing our digital collections. It is a great example of MPLP for born digital collections.

    Mary Elings
    Principal Archivist for Digital Collections

  2. Josh Schneider May 18, 2016 / 3:17 pm

    Hi Mary,

    Thanks for your thoughtful feedback! We have updated the post to reflect ArchExtract’s use of NER. “Dynamic arrangement and description” is a great way to describe how these tools enable and empower archivists, researchers, and others to engage with these materials, and we agree that their application towards supporting discovery is especially transformative.

    Josh
