ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives

by Emily Higgs

On Friday, July 26, 2019, academics and practitioners met at Wilson Library at UNC Chapel Hill for “ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives.” This meeting featured expert panels and participant-driven discussions about how we can use natural language processing (using software to understand text and its meaning) and machine learning (a branch of artificial intelligence that learns to infer patterns from data) in the archives.

The meeting was hosted by the RATOM Project (Review, Appraisal, and Triage of Mail). The RATOM Project is a partnership between the State Archives of North Carolina and the School of Information and Library Science at UNC Chapel Hill. RATOM will extend the email processing capabilities currently present in the TOMES software and BitCurator environment, developing additional modules for identifying and extracting the contents of email-containing formats, performing NLP tasks, and applying machine learning approaches. RATOM and the ml4arc meeting are generously supported by the Andrew W. Mellon Foundation.

Presentations at ml4arc were split between successful applications of machine learning and problems that machine learning could potentially address in the future. In his talk, Mike Shallcross from Indiana University identified archival workflow pain points that provide opportunities for machine learning. In particular, he sees the potential for machine learning to address issues of authenticity and integrity in digital archives, PII and risk mitigation, aggregate description, and whether all these processes are scalable and sustainable. Many of the presentations addressed these key areas and how natural language processing and machine learning can aid archivists and records managers. Additionally, attendees saw presentations and demonstrations of email tools such as RATOM, TOMES, and ePADD. Euan Cochrane also gave a talk about the EaaSI sandbox and discussed potential relationships between software preservation and machine learning.

The meeting agenda had a strong focus on using machine learning in email archives; collecting and processing email is a heavy burden in many archives, and one that stands to benefit greatly from machine learning tools. For example, Joanne Kaczmarek from the University of Illinois presented a project that processed capstone email accounts using Ringtail, an e-discovery and predictive coding application. In partnership with the Illinois State Archives, Kaczmarek used Ringtail to identify groups of “archival” and “non-archival” emails from 62 capstone accounts, and to further break down the “archival” category into “restricted” and “public.” After 3-4 weeks of tagging training data with this software, the team was able to reduce the volume of emails by 45% by excluding “non-archival” messages, and to identify 1.8 million emails that met the criteria to be made available to the public. Done manually, this tagging process could easily have taken over 13 years of staff time.

After the ml4arc meeting, I am excited to see the evolution of these projects and how natural language processing and machine learning can help us with our responsibilities as archivists and records managers. From entity extraction to PII identification, there are myriad possibilities for these technologies to help speed up our processes and overcome challenges.

Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.