By Josh Schneider and Peter Chan
This is the fifth post in our Spring 2016 series on processing digital materials.
Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals to arranging and describing materials: screening for confidential, restricted, or legally-protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery and access to the material’s intellectual content.
More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic ToolKit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects (Overview) outside of the archival community are also exploring similar technologies to make sense of large corpuses of text.
From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.
Currently, however, there are no open source NER tools broadly tuned towards the diverse variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, trained to also identify probable matches, also recognizes other entity types such as diseases, which can assist with screening for protected health information, and events.
What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.
Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.
Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.
Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.