By Josh Schneider
This post is the fourth in a bloggERS series about access to born-digital materials.
The email correspondence of historical and political figures presents great potential to support scholarly research into contemporary society. Yet archival repositories of all types—whether public or private, government or cultural—currently face significant impediments to collecting and administering email collections due to concerns about privacy and copyright and the difficulty of processing large, multi-decade archives containing hundreds of thousands of messages. ePADD is an open source and freely downloadable software suite released by Stanford University and its NHPRC grant partners this past summer that expressly addresses these challenges.
With the initial release, the ePADD team harnessed machine learning, natural language processing (NLP), automated metadata extraction, and other batch processes to support three kinds of work: identifying and restricting messages that match regular expressions for sensitive patterns, such as credit card and Social Security numbers; browsing and visualizing the archive by correspondent or mentioned entity; and linking those correspondents and entities to established authority headings. We also built functionality for custom lexicons that permit tiered thematic searching of the email archive, remote access to a redacted email archive to aid in discovery, and query generation, which compares the entity index of the archive with that of any other text-based corpus.
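To make the regular-expression screening concrete, here is a minimal sketch of how messages might be flagged for restriction. It is an illustration only, not ePADD's actual code: the patterns are deliberately simplified, and the names `SSN_PATTERN`, `CARD_PATTERN`, `sensitive_hits`, and `flag_for_restriction` are hypothetical. The Luhn checksum step is an added assumption, included to cut down on false positives from arbitrary digit runs.

```python
import re

# Hypothetical, simplified patterns for illustration; not ePADD's actual rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def sensitive_hits(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for sensitive patterns found in text."""
    hits = [("ssn", m.group()) for m in SSN_PATTERN.finditer(text)]
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):
            hits.append(("card", m.group()))
    return hits


def flag_for_restriction(messages: dict[str, str]) -> dict[str, list[tuple[str, str]]]:
    """Map message IDs to their hits, keeping only messages with at least one hit."""
    flagged = {}
    for message_id, body in messages.items():
        hits = sensitive_hits(body)
        if hits:
            flagged[message_id] = hits
    return flagged
```

In a sketch like this, an archivist would still review each flagged message before restricting it, since pattern matching alone cannot tell a real account number from a look-alike string.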
At the heart of the software is a custom named entity recognizer (NER), which benefits from the email address book associated with the archive and supports browsing and visualization of named person, organization, and location entities within message headers and text. Since its initial release, ePADD has continued to expand its NER capabilities dramatically. The developer version now includes more fine-grained categories of organization and location entities, such as libraries, museums, and universities, bootstrapped from Wikipedia, which should enhance browsing accuracy. This enhanced NER—which will be rolled out in future public releases—also supports extraction, browsing, and visualization of other entity types, including events and diseases, the latter of which should improve screening for protected health information.
The ePADD team and its partners aim to keep these improvements rolling. In September 2015, IMLS generously agreed to fund three years of additional ePADD development, “ePADD Phase 2,” to further IMLS’s vision of a national digital platform. ePADD is proud to include University of Illinois Urbana-Champaign, Harvard University, University of California, Irvine, and Metropolitan New York Library Council (METRO) among its partners for this second phase of growth. We will maintain an iterative release cycle, with a major software release scheduled for every six months; the first of these is set for June 2016.
In Phase 2, we aim to greatly improve archival institutions’ ability to appraise, process, and provide access to email collections otherwise unavailable to researchers. We hope that fulfilling Phase 2 development goals will also lay the groundwork for future efforts applying similar automated workflows and functionalities to other classes of born-digital materials.
Phase 2 should bring critical enhancements to ePADD’s feature set with respect to electronic records management and access, including:
- Supporting the ability to work with archives of up to 750,000 messages;
- Ensuring integration with existing and emerging tools and services for archival workflows, including support for preservation;
- Supporting export of entities and other metadata as linked open data; and
- Improving restriction management, implementing an audit trail, and exploring the creation of a public delivery module for email archives released through FOIA and sunshine requests.
We hope you will join our growing community: Please download the software, join our user forums or mailing list, or contribute a user story to inform future development. You can also follow us on Twitter for the latest news and announcements.
Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University. He is also Community Manager for ePADD, an open-source software package that supports archival processes around the appraisal, processing, discovery, and delivery of email archives.