An Exploration of BitCurator NLP: Incorporating New Tools for Born-Digital Collections

by Morgan Goodman

Natural Language Processing (NLP) has been a buzzworthy topic for professionals working with born-digital material over the last few years. The BitCurator Project recently released a new set of natural language processing tools, and I had the opportunity to test the topic modeler, bitcurator-nlp-gentm, with a group of archivists in the Raleigh-Durham area. I was interested in exploring how NLP might help archivists perform their everyday duties more effectively and efficiently. While my goal was specifically to explore possible applications of topic modeling in archival appraisal, the discussions surrounding other possible uses were enlightening. The resulting research informed my 2019 master's paper at the University of North Carolina at Chapel Hill.

Topic modeling extracts text from files and organizes the tokenized words into topics. Imagine a set of words such as: mask, october, horror, michael, myers. Based on this grouping you might be able to determine that somewhere in the corpus there is a file about one of the Halloween franchise horror films. When I met with the archivists, I had them run the program on disk images from their own collections, and we discussed the visualization output and whether they could easily analyze and determine the nature of the topics presented.

BitCurator builds its applications on open-source tools and chose the pyLDAvis visualization for the final output of its topic modeling tool (more information about the algorithm and how it works can be found in Sievert and Shirley's paper; you can also play around with the output through this Jupyter notebook). The left side of the visualization displays topic circles in relative sizes, plotted on a two-dimensional plane. Each topic is labeled with a number in decreasing order of prevalence (circle #1 is the main topic in the overall corpus and is also the largest circle). The distance between topics reflects how closely they are related: topics that are less related are plotted further from each other. The right side contains a list of 30 terms, each with a blue bar indicating that term's frequency across the corpus. Clicking on a topic circle adds a red bar for each term, showing the term's frequency in that particular topic relative to the overall corpus.

[Picture 1: the pyLDAvis topic visualization]

The user can then manipulate a metric slider that helps decipher what a topic is about. When the slider is all the way to the right at “1”, the terms listed for a selected topic are the most prevalent ones, so words common across the whole corpus rise to the top (in our Halloween example, you might see general words like: movie, plot, character). As the slider moves closer to “0”, fewer corpus-wide terms appear and more topic-specific terms are displayed (i.e.: knife, haddonfield, strode).
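The slider corresponds to the relevance metric from Sievert and Shirley's paper: relevance = λ · log p(term|topic) + (1 − λ) · log(p(term|topic) / p(term)). A small stdlib-only sketch, with probabilities made up for illustration, shows why the ranking flips between the two ends of the slider:

```python
import math

def relevance(p_word_given_topic, p_word_overall, lam):
    """Sievert & Shirley's relevance metric, which the pyLDAvis slider
    controls: lam=1 ranks terms by in-topic frequency, lam=0 by lift
    (how much more frequent the term is in the topic than corpus-wide)."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word_overall))

# Made-up probabilities for two terms in a hypothetical "Halloween" topic:
# "movie" is common everywhere; "haddonfield" is rare but topic-specific.
terms = {
    "movie":       {"in_topic": 0.04, "overall": 0.03},
    "haddonfield": {"in_topic": 0.01, "overall": 0.0005},
}

for lam in (1.0, 0.0):
    ranked = sorted(terms,
                    key=lambda w: relevance(terms[w]["in_topic"],
                                            terms[w]["overall"], lam),
                    reverse=True)
    print(f"lambda={lam}: {ranked}")
```

At λ=1 the common word "movie" outranks "haddonfield"; at λ=0 the order reverses, because "haddonfield" is far more frequent inside the topic than in the corpus as a whole.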

While the NLP does the hard work of scanning and extracting text from the files, some analysis is still required of the user. The tool's output offers archivists a bird's-eye view of a collection, which can be helpful when little to nothing is known about its contents. However, many of the archivists I spoke to felt the tool is most effective when you already know a bit about the collection you are looking at. In that sense, it may be beneficial to let researchers use topic modeling in the reading room to explore a large collection. Researchers and others with subject matter expertise may get the most benefit from this tool: do you have to know about the Halloween movie franchise to know that Michael Myers is a fictional horror film character? Probably. Now imagine more complex topics that the archivists may not have working knowledge of. The archivist can point the researcher to the right collection and let them do the analysis. This tool may also help with description, or with identifying duplication across a collection (a common problem for people working with born-digital collections).

The next step to getting NLP tools like this off the ground is training. The information retrieval and ranking methods that create the output may not be widely understood. To unlock the value of an NLP tool, users must know how it works, how to run it, and how to perform meaningful analysis of its output. Training archivists to assist researchers in the reading room would be an excellent way to get tools like this out of the think tank and into the real world.


Morgan Goodman is a 2019 graduate of the University of North Carolina at Chapel Hill and currently resides in Denver, Colorado. She holds an MS in Information Science with a specialization in Archives and Records Management.