Student Impressions of Tech Skills for the Field

by Sarah Nguyen


Back in March, during bloggERS’ Making Tech Skills a Strategic Priority series, we distributed an open survey to MLIS, MLS, MI, and MSIS students to understand what they know and have experienced in relation to technology skills as they enter the field.

To be frank, this survey stemmed from personal interests, since I just completed an MLIS core course on Research, Assessment, and Design (re: survey to collect data on the current landscape). I am also interested in what skills I need to build and what classes I should sign up for next quarter (re: what tech skills do I need to become hireable?). While I feel comfortable with a variety of tech-related tools and tasks, I’ve been intimidated by more “high-level” computational languages for some years. This survey was helpful for exploring what skills other LIS pre-professionals are interested in and which skills will help us make these costly degrees worth the time and financial investment traditionally required to enter a stable archive or library position.

Method

The survey was open for one month on Google Forms, and distributed to SAA communities, @SAA_ERS Twitter, the Digital Curation Google Group, and a few MLIS university program listservs. There were 15 questions and we received responses from 51 participants. 

Results & Analysis

Here’s a superficial scan of the results. If you would like to come up with your own analyses, feel free to view the raw data on GitHub.

Figure 1. Technology-related skills that students want to learn

The most popular technology-related skill that students are interested in learning is data management (manipulating, querying, transforming data, etc.). This is a pretty broad topic, as it involves many tools and protocols that can range from GUIs to scripts. A separate survey that breaks down specific data management tools might be in order, especially since these types of skills can be divided into specialty courses and workshops, which then translate into specific job positions. A more specific survey could help demonstrate what types of skills need to be taught in a full semester-long course, and what skills can be covered in a day-long or multi-day workshop.

It was interesting to see that, even in this day and age where social media management can be second nature to many students’ daily lives, there was still a notable interest in understanding how to make it a part of their career. This makes me wonder what value students see in knowing how to strategically manage an archives’ social media account. How could this help with the job market, as well as an archival organization’s main mission?

Looking deeper into the popular data management category, it would be interesting to know the current landscape of knowledge or pedagogy in communicating with IT (e.g. project management and translating users’ needs). In many cases, archivists are working separately from, but dependent on, IT system administrators, and this can be frustrating since the two departments may have distinct concerns about a server or network. At June’s NYC Preservathon/Preservashare 2019, it was mentioned that IT exists to make sure servers and networks are spinning at all hours of the day. Unlike archivists, they are not concerned with the longevity of the content, the obsolescence of file formats, or the software needed to render files. Could it be useful to have a course on how to effectively communicate about, and take ownership of, issues that fall along the fuzzy lines between archives, data management, and IT? Or, as one survey respondent said, “I think more basic programming courses focusing on tech languages commonly used in archives/libraries would be very helpful.” Personally, I’ve only learned this from experience working in different tech-related jobs. This is not a subject I see in my MLIS course catalog, nor a discussion at conference workshops.

The popularity of data management skills sparked another question: what about knowledge around computer networks and servers? Even though LTO will forever be in our hearts, cloud storage is also a backup medium we’re budgeting for and relying on. Same goes for hosting a database for remote access and/or publishing digital files. A friend mentioned this networking workshop for non-tech savvy learners—Grassroots Networking: Network Administration for Small Organizations/Home Organizations—which could be helpful for multiple skill types including data management, digital forensics, web archiving, web development, etc. This is similar to a course that could be found in computer science or MLIS-adjacent information management departments.

Figure 2. Have you taken/will you take technology-focused courses in your program?
Figure 3. Do you feel comfortable defining the difference between scripting and programming?

I can’t say this is statistically significant, but the contrast between the 15.7% of respondents who have not/will not take a technology-focused course in their program and the 78.4% who are not aware of the difference between scripting and programming is eyebrow-raising. According to an article in PLOS Computational Biology, the term “script” means “something that is executed directly as is,” while a “program[… is] something that is explicitly compiled before being used. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is ‘things that are edited directly’ and ‘things that are not edited directly’” (Wilson et al. 2017). This distinction is important since more archives are acquiring, processing, and sharing collections that rely on archivists who can execute jobs such as web scraping or metadata management (scripts) or who can build and maintain a database (programming). These might be interpreted as trick questions, but the particular semantics, and what counts as technology-focused, is something modern library, archives, and information programs might want to consider.
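To make the “script” half of that distinction concrete, here is a hypothetical example of the kind of small, run-as-is script an archivist might write for metadata management: it walks a directory of acquired files and writes a checksum manifest to CSV. The directory and file names are placeholders, not anything from the survey.

```python
import csv
import hashlib
from pathlib import Path

SOURCE = Path("acquisition_2019_001")  # hypothetical accession directory

with open("manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["path", "md5", "bytes"])
    for p in sorted(SOURCE.rglob("*")):
        if p.is_file():
            # read_bytes() is fine for a sketch; stream large files in practice
            digest = hashlib.md5(p.read_bytes()).hexdigest()
            writer.writerow([str(p), digest, p.stat().st_size])
```

A database-backed application that had to be designed, built, and maintained over time would sit at the “programming” end of the same spectrum.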

Figure 4. How do you approach new technology?

Figure 4 illustrates the various ways students tackle new technologies. Reading the f* manual (RTFM) and searching forums are the most common approaches to navigating new technology. Here are quotes from a couple of students on how they tend to learn a new piece of software:

  • “break whatever I’m trying to do with a new technology into steps and look for tutorials & examples related to each of those steps (i.e. Is this step even possible with X, how to do it, how else to use it, alternatives for accomplishing that step that don’t involve X)”
  • “I tend to google “how to….” for specific tasks and learn new technology on a task-by-task basis.”

In the end, there was overwhelming interest in “more project-based courses that allow skills from other tech classes to be applied.” Unsurprisingly, many of us are looking for full-time, stable jobs after graduating, and the “more practical stuff, like CONTENTdm for archives” seems to be a pressure felt in order to get an entry-level position. It’s not just entry-level, either; as continuing-education learners, there is also a push to strive for more. Several respondents are looking for a challenge to level up their tech skills:

  • “I want more classes with hands-on experience with technical skills. A lot of my classes have been theory based or else they present technology to us in a way that is not easy to process (i.e. a lecture without much hands-on work).”
  • “Higher-level programming, etc. — everything on offer at my school is entry level. Also digital forensics — using tools such as BitCurator.”
  • “Advanced courses for the introductory courses. XML 2 and python 2 to continue to develop the skills.”
  • “A skills building survey of various code/scripting, that offers structured learning (my professor doesn’t give a ton of feedback and most learning is independent, and the main focus is an independent project one comes up with), but that isn’t online. It’s really hard to learn something without face to face interaction, I don’t know why.”

It’ll be interesting to see what skills recent MLIS, MLS, MIS, and MSIM graduates will enter the field with. While many job postings list certain software and skills as requirements, will programs follow suit? I have a feeling this might be a significant question to ask in the larger context of what the purpose of this Master’s degree is and how the curriculum can keep up with the dynamic technology needs of the field.

Disclaimer: 

  1. Potential bias: Those taking the survey might be interested in learning higher-level tech skills because they do not already have them, while those who are already tech-savvy might avoid a basic survey such as this one. This may bias the survey population toward mostly novice tech students.
  2. More data on specific computational languages and technology courses taken is available in the GitHub CSV file. As mentioned earlier, I just finished my first year as a part-time MLIS student, so I’m still learning the distinct jobs and nature of the LIS field. Feel free to submit an issue to the GitHub repo, or tweet me @snewyuen if you’d like to talk more about what this data could mean.

Bibliography

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510


Sarah Nguyen with a Uovo storage truck

Sarah Nguyen is an advocate for open, accessible, and secure technologies. While studying as an MLIS candidate at the University of Washington iSchool, she is pursuing her interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, and archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance. (Views do not represent Uovo. And I don’t even work with them. Just liked the truck.)


The Theory and Craft of Digital Preservation: An interview with Trevor Owens

bloggERS! editor Dorothy Waugh recently interviewed Trevor Owens, Head of Digital Content Management at the Library of Congress, about his recent (and award-winning) book, The Theory and Craft of Digital Preservation.


Who is this book for and how do you imagine it being used?

I attempted to write a book that would be engaging and accessible to anyone who cares about long-term access to digital content and wants to devote time and energy to helping ensure that important digital content is not lost to the ages. In that context, I imagine the primary audience as current and emerging professionals that work to ensure enduring access to cultural heritage: archivists, librarians, curators, conservators, folklorists, oral historians, etc. With that noted, I think the book can also be of use to broader conversations in information science, computer science and engineering, and the digital humanities. 

Tell us about the title of the book and, in particular, your decision to use the word “craft” to describe digital preservation.

The words “theory” and “craft” in the title of the book forecast both the structure and the two central arguments that I advance in the book. 

The first chapters focus on theory. This includes tracing the historical lineages of preservation in libraries, archives, museums, folklore, and historic preservation. I then move to explore work in new media studies and platform studies to round out a nuanced understanding of the nature of digital media. I start there because I think it’s essential that cultural heritage practitioners moor their own frameworks and approaches to digital preservation in a nuanced understanding of the varied and historically contingent nature of preservation as a concept and the complexities of digital media and digital information. 

The latter half of the book is focused on what I describe as the “craft” of digital preservation. My use of the term craft is designed to intentionally challenge the notion that work in digital preservation should be understood as “a science.” Given the complexities of both what counts as preservation in a given context and the varied nature of digital media, I believe it is essential that we explicitly distance ourselves from many of the assumptions and baggage that come along with the ideology of “digital.” 

We can’t build some super system that just solves digital preservation. Digital preservation requires making judgement calls. Digital preservation requires the applied thinking and work of professionals. Digital preservation is not simply a technical question; instead, it involves understanding the nature of the content that matters most to an intended community and making judgement calls about how best to mitigate risks of potential loss of access to that content. As a result of my focus on craft, I offer less of a “this is exactly what one should do” approach, and more of an invitation to join the community of practice that is developing knowledge and honing and refining its craft.

Reading the book, I was so happy to see you make connections between the work that we do as archivists and digital preservation. Can you speak to that relationship and why you think it is important?

Archivists are key players in making preservation happen and the emergence of digital content across all kinds of materials and media that archivists work with means that digital preservation is now a core part of the work that archivists do. 

I organize a lot of my discussion about the craft of digital preservation around archival concepts as opposed to library science or curatorial practices. For example, I talk about arrangement and description. I also draw from ideas like MPLP as key concepts for work in digital preservation and from work on community archives. 

Old Files. From XKCD: webcomic of romance, sarcasm, math, and language. 2014

Broadly speaking, in the development of digital media, I see a growing context collapse between formats that had been distinct in the past. That is, conservation of oil paintings, management and preservation of bound volumes, and organizing and managing heterogeneous sets of records have some strong similarities, but there are also a lot of differences. The born-digital incarnations of those works (digital art, digital publishing, and digital records) are all made up of digital information and file formats, and face a related set of digital preservation issues.

With that note, I think archival practice tends to be particularly well-suited for dealing with the nature of digital content. Archives have long dealt with the problem of scale that is now intensified by digital data. At the same time, archivists have also long dealt with hybrid collections and complex jumbles of formats, forms, and organizational structures, which is also increasingly the case for all types of forms that transition into born-digital content. 

You emphasize that the technical component of digital preservation is sometimes prioritized over social, ethical, and organizational components. What are the risks implicit in overlooking these other important components?

Digital preservation is not primarily a technical problem. The ideology of “digital” is that things should be faster, cheaper, and automatic. The ideology of “digital” suggests that we should need less labor, less expertise, and fewer resources to make digital stuff happen. If we let this line of thinking infect our idea of digital preservation, we are going to see major losses of important data, we will see major failures to respect ethical and privacy issues relating to digital content, and lots of money will be spent on work that fails to get us the results that we want.

In contrast, when we take as a starting point that digital preservation is about investing resources in building strong organizations and teams who participate in the community of practice and work on the complex interactions that emerge between competing library and archives values, then we have a chance of both being effective and building great and meaningful jobs for professionals.

If digital preservation work is happening in organizations that have an overly technical view of the problem, it is happening despite, not because of, their organization’s approach. That is, there are people doing the work; they just likely aren’t getting credit and recognition for doing it. Digital preservation happens because of people who understand that the fundamental nature of the work requires continual efforts to secure enough resources to meaningfully mitigate risks of loss, and thoughtful decision-making about building and curating collections of value to communities.

Considerations related to access and discovery form a central part of the book and you encourage readers to “Start simple and prioritize access,” an approach that reminded me of many similar initiatives focused on getting institutions started with the management and preservation of born-digital archives. Can you speak to this approach and tell us how you see the relationship between preservation and access?

A while back, OCLC ran an initiative called “walk before you run,” focused on working with digital archives and digital content. I know it was a major turning point for helping the field build our practices. Our entire community is learning how to do this work and we do it together. We need to try things and see which things work best and which don’t. 

It’s really important to prioritize access in this work. Preservation is fundamentally about access in the future. The best way you know that something will be accessible in the future is if you’re making it accessible now. Then your users will help you. They can tell you if something isn’t working. The more that we can work end-to-end, that is, that we accession, process, arrange, describe, and make available digital content to our users, the more that we are able to focus on how we can continually improve that process end-to-end. Without having a full end-to-end process in place, it’s impossible to zoom out and look at that whole sequence of processes to start figuring out where the bottlenecks are and where you need to focus on working to optimize things. 


Dr. Trevor Owens is a librarian, researcher, policy maker, and educator working on digital infrastructure for libraries. Owens serves as the first Head of Digital Content Management for Library Services at the Library of Congress. He previously worked as a senior program administrator at the United States Institute of Museum and Library Services (IMLS) and, prior to that, as a Digital Archivist for the National Digital Information Infrastructure and Preservation Program and as a history of science curator at the Library of Congress. Owens is the author of three books, including The Theory and Craft of Digital Preservation and Designing Online Communities: How Designers, Developers, Community Managers, and Software Structure Discourse and Knowledge Production on the Web. His research and writing have been featured in Curator: The Museum Journal, Digital Humanities Quarterly, The Journal of Digital Humanities, D-Lib, Simulation & Gaming, Science Communication, New Directions in Folklore, and American Libraries. In 2014, the Society of American Archivists granted him the Archival Innovator Award, presented annually to recognize the archivist, repository, or organization that best exemplifies the “ability to think outside the professional norm.”

Collections as Data

by Elizabeth Russey Roke


Archives put a great deal of effort into preserving the original object.  We document the context around its creation, perform conservation work on the object if necessary, and implement reading room procedures designed to limit damage or loss.  As a result, researchers can read and hold an original letter written by Alice Walker, view a series of tintypes taken before the Civil War, read marginalia written by Ted Hughes in a book from his personal library, or listen to an audio recording of one of Martin Luther King, Jr.’s speeches.  In other words, we enable researchers to encounter materials as they were originally designed to be used as much as is possible.

The nature of digital humanities research challenges these traditional modes of archival access: are these the only ways to interact with archival material?  How do we serve users who want to leverage computational techniques such as text mining, machine learning, network analysis, or computer vision in their research or teaching?  Are machines and algorithms “users”? Archivists also encounter these questions as the content of archives shifts from analog to born digital material. Digital files were created and designed to be processed by algorithms, not just encountered through experiences such as watching, viewing, or reading.  What could access for these types of materials look like if we gave access to their full functionality and not just their appearance? 

I have spent the past two years working on an IMLS grant focused on addressing these types of questions.  Collections As Data: Always Already Computational examined how digital collections are and could be used beyond analog research methodologies.  Collections as data is ordered information, stored digitally, that is inherently amenable to computation.  This could include metadata, digital text, or other digital surrogates.  Whereas a digital repository might enable researchers to read a newspaper on a computer screen, an approach grounded in collections as data would give researchers access to the OCR file the repository generated to enable keyword search. In other words, repositories should provide access beyond the viewers, page turners, and streaming servers that most current digital repositories use to replicate analog experiences.  At its core, collections as data simply asks cultural heritage organizations to make the full digital object available rather than making assumptions about how users will want to interact with it.

Collections as data implementations are not necessarily complex, nor do they involve complicated repository development.  Some of the simplest examples can be found on GitHub, where archives such as Vanderbilt and New York University publish their EAD files.  The Rockefeller Archive Center and the Museum of Modern Art go a step further and publish all of their collection data, along with a Creative Commons license.  Emory, my home institution, makes finding aid data available in both EAD and RDF from our finding aids database, which has led to a digital humanities project that harvested correspondence indexes from our Irish poetry collections to build network graphs of the Belfast Group.  More complex implementations often provide access to data through APIs instead of a bulk download.  An example of this can be found at the Carnegie Hall Archives, which allows researchers to query their data through a SPARQL endpoint.
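As a minimal, hypothetical sketch of what reuse of that published data can look like (the repository URL below is a placeholder, not one of the institutions named above), a researcher could fetch an EAD finding aid as XML and pull out every personal name, which is the kind of raw material the Belfast Group network graphs were built from.

```python
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

# Placeholder URL; substitute a real raw EAD file from a published repository.
EAD_URL = "https://raw.githubusercontent.com/example-archive/findingaids/main/collection123.xml"

with urllib.request.urlopen(EAD_URL) as resp:
    root = ET.parse(resp).getroot()

# "{*}" matches any XML namespace (Python 3.8+), so EAD 2002 and EAD3 both work.
names = Counter(
    el.text.strip()
    for el in root.iter("{*}persname")
    if el.text and el.text.strip()
)
for name, count in names.most_common(10):
    print(count, name)
```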

Chord diagram created as part of Emory’s Belfast Group Poetry project, showing the networks of people associated with the Belfast Group and their relationships with each other.

The Collections As Data: Always Already Computational final report includes more information and ideas for getting started with collections as data. It includes resources such as a set of user personas, methods profiles of common techniques used by data-driven researchers, and real-world case studies of institutions with collections as data including their motivations, technical details, and how they made the case to their administrators.  I highly recommend “50 Things,” which is a list of activities associated with collections as data work ranging from the simple to the complex. 

There are a few takeaways from this project I’d like to highlight for archivists in particular:

Collections as data approaches are archival.  Data-driven research demands authenticity and context of the data source, established and preserved through archival principles of documentation, transparency, and provenance.  This type of information was one of the most universal requests from digital humanities researchers. It was clear that they were not only interested in the object, but in how it came to be.  They wanted to understand their data as an archival object with information about its creation, provenance, and preservation. Archivists need to advocate for digital collections to be treated not just as digital surrogates, or what I like to think of as expensive photocopying, but as unique resources unto themselves deserving description, preservation, and access that may not necessarily match that of the original object.

Collections as data enhances access to archival material.  What if we could partially open restricted material to researchers?  Emory holds the papers of Salman Rushdie and his email files are largely restricted per the deed of gift.  Computational techniques being developed in ePADD could generate maps of Rushdie’s correspondents and reveal patterns in the timing and frequency of his correspondence, just through email header information and without exposing sensitive data (i.e. the content of the email) that Rushdie wanted to restrict.  Could this methodology be extrapolated to other types of restricted electronic files?  
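As an illustration of that idea, here is a standard-library sketch of the general approach rather than ePADD’s own code, with a placeholder mbox filename: correspondence patterns can be tallied from headers alone, without ever reading a message body.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses

correspondents = Counter()
for msg in mailbox.mbox("restricted_correspondence.mbox"):  # placeholder filename
    # Look only at address headers; message bodies are never accessed.
    fields = [msg.get(h, "") for h in ("From", "To", "Cc")]
    for _, addr in getaddresses(fields):
        if addr:
            correspondents[addr.lower()] += 1

for addr, count in correspondents.most_common(20):
    print(count, addr)
```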

Just start.  For digital files, trying something is always the first and best approach.  There is no one way or best way to do collections as data work. Consider your community and ask them what they need.  Unlike baseball fields, if you build it, they probably won’t come unless you ask first. Collections as data material already exists in your collection, especially if you use ArchivesSpace.  Publish it. Think broadly about what might constitute collections as data and how you might make use of it yourself; collections as data benefits us too. Follow the Computational Archival Science project at the University of Maryland, which is exploring how we think about archival collections as data.

If you want to take a deep dive into collections as data (and get funding to do so!), consider applying to be part of the second cohort of the Part to Whole Mellon grant, which aims to foster the development of broadly viable models that support implementation and use of collections as data.  The next call for proposals opens August 1: https://collectionsasdata.github.io/part2whole/. On August 5, the project team will offer a webinar with more information about the grant and opportunities to ask questions: https://collectionsasdata.github.io/part2whole/cfp2webinar/.


Elizabeth Russey Roke is a digital archivist and metadata specialist at the Stuart A. Rose Library of Emory University, Atlanta, Georgia. Primarily focused on preservation, discovery, and access to digitized and born digital assets from special collections, Elizabeth works on a variety of technology projects and initiatives related to digital repositories, metadata standards, and archival descriptive practice. She was a co-investigator on a 2016-2018 IMLS grant investigating collections as data.

A Conversation with Annalise Berdini, Digital Archivist at the Seeley G. Mudd Manuscript Library, Princeton University

Interview conducted with Annalise Berdini in May 2019 by Hannah Silverman and Tamar Zeffren

This is the eighth post in a new series of conversations between emerging professionals and archivists actively working with digital materials.


Annalise Berdini is the Digital Archivist at the Seeley G. Mudd Manuscript Library at Princeton University, a position she has held since January 2018. She is responsible for the ongoing management of the University Archives Digital Curation Program, as well as managing a collection of web archives and assisting with reference services.

Annalise’s first post-graduate school position was as a manuscripts and archives processor at the University of California, San Diego (UCSD). While she was working at UCSD, universities and archives were slowly starting to see the need for a dedicated digital archivist position. When the Special Collections department at UCSD created their first digital archivist position, Annalise applied and got the job. She explains that a good deal of her work there, and at Princeton, is graciously supported by a community of digital archivists solving similar challenges in other institutions.

As Annalise has now held a digital archivist role at two different institutions, both universities, we were interested to hear her perspectives on how colleagues and researchers have understood – or misunderstood – her role. “Because I have digital in my job title,” she noted, “people interpret that in a lot of very wide and broad ways. Really digital archives is still an emerging field…there are so many questions to answer, and it’s fun to investigate that aspect of the field.”

Given prevailing concerns among institutional archives about preserving and processing legacy media, we were keenly interested in hearing Annalise’s insights about securing stakeholder buy-in to develop a digital archives program.

“It’s a struggle everywhere,” she acknowledges. Presently, Princeton’s efforts to build up a more robust digital preservation program have led the University to a partnership with a UK-based company called Arkivum, which offers digital preservation, storage, maintenance, auditing and reporting modules and has the capacity to incorporate services from Archivematica and create a customized digital storage solution for Princeton.

“We’ve been lucky here [at Mudd]. We’re getting this great system. There is buy-in and there seems to be a pretty strong push right now. For us, the most compelling argument we’ve had is that we are mandated to collect student materials and student records that will not exist anywhere else unless we take them. The school has to keep those records, there’s not an option. Emphasizing how easily that content could be lost without a proper digital preservation system in place was very compelling to people who weren’t necessarily aware of the fact that hard drives sitting on a shelf are really not acceptable storage choices and options.”

Annalise has also found that deploying some compelling statistics can aid in building awareness around digital archives needs. In discussions about how rapidly materials can degrade, Annalise likes to cite a 2013 Journal of Western Archives article, “Capturing and Processing Born-Digital Files in the STOP AIDS Project Records,” which showcases findings that, out of a vast collection of optical storage media, “only 10% of these hundreds of DVDs were really able to be recovered, whereas, strangely, a lot of the floppy disks were better and easier to recover…I think emphasizing how fragile digital content is [can help people understand] how easily it will corrupt without you even knowing it.”

Equally important to generating momentum for such programs are the direct relationships Annalise cultivates with colleagues, within and without the archives. “My boss was really instrumental in the process, and the head of library IT helped me navigate getting approvals from the University as a whole and the University IT department.”

The complex process of sustaining and innovating a digital archives infrastructure provides ongoing opportunities for Annalise to “solve puzzles” and to unite colleagues in confronting the challenges of documenting and preserving born-digital heritage: “I have focused on trying to find one person who is maybe a level above me and to connect with them and then hopefully build up a network within my institution to build some groundswell.”


Hannah Silverman

Tamar Zeffren

Hannah Silverman and Tamar Zeffren both work at JDC Archives. Tamar is the Archival Collections Manager. Hannah is the Digitization Project Specialist and also works independently as a photo archivist. Both received SAA’s DAS certification.

An Exploration of BitCurator NLP: Incorporating New Tools for Born-Digital Collections

by Morgan Goodman

Natural Language Processing (NLP) has been a buzz-worthy topic for professionals working with born-digital material over the last few years. The BitCurator Project recently released a new set of natural language processing tools, and I had the opportunity to test out the topic modeler, Bitcurator-nlp-gentm, with a group of archivists in the Raleigh-Durham area. I was interested in exploring how NLP might help archivists perform their everyday duties more effectively and efficiently. While my goal was to explore possible applications of topic modeling in archival appraisal specifically, the discussions surrounding other possible uses were enlightening.  The resulting research informed my 2019 Master’s paper for the University of North Carolina at Chapel Hill.

Topic modeling extracts text from files and organizes the tokenized words into topics. Imagine a set of words such as: mask, october, horror, michael, myers. Based on this grouping of words, you might be able to determine that somewhere across the corpus there is a file about one of the Halloween franchise horror films. When I met with the archivists, I had them run the program with disk images from their own collections, and we discussed the visualization output and whether or not they were able to easily analyze and determine the nature of the topics presented.
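For readers who want to see the mechanics, here is a minimal, hypothetical sketch of topic modeling using gensim’s LDA implementation. The BitCurator tool extracts its text from disk images; this sketch assumes the documents have already been reduced to lists of tokens, and the toy documents below are invented.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["mask", "october", "horror", "michael", "myers"],
    ["budget", "grant", "report", "fiscal", "quarter"],
    ["rehearsal", "stage", "costume", "performance", "tour"],
]

dictionary = corpora.Dictionary(docs)              # token -> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words vectors
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=20)

for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)
```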

BitCurator utilizes open source tools in its applications and chose the pyLDAvis visualization for the final output of its topic modeling tool (more information about the algorithm and how it works can be found by reading Sievert and Shirley’s paper; you can also play around with the output through this Jupyter notebook).  The left-side view of the visualization has topic circles displayed in relative sizes and plotted on a two-dimensional plane. Each topic is labeled with a number in decreasing order of prevalence (circle #1 is the main topic in the overall corpus, and is also the largest circle). The space between topics is determined by how related the topics are, i.e. topics that are less related are plotted further from each other. The right-side view contains a list of 30 words, with a blue bar indicating each term’s frequency across the corpus. Clicking on a topic circle alters the view of the terms list by adding a red bar for each term, showing the term’s frequency in that particular topic relative to the overall corpus.
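Continuing the sketch above, this is roughly how a pyLDAvis view like the one described can be produced from a gensim model. pyLDAvis is a real library, but this is not BitCurator’s own pipeline, and the module was named pyLDAvis.gensim in older releases.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # "pyLDAvis.gensim" in older releases

# lda, corpus, and dictionary come from the gensim sketch above
vis = gensimvis.prepare(lda, corpus, dictionary)  # topic circles left, term bars right
pyLDAvis.save_html(vis, "topics.html")            # open in a browser to explore
```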

Figure: pyLDAvis visualization output from the topic modeling tool.

The user can then manipulate a metric slider, which is meant to help decipher what the topic is about. Essentially, when the slider is all the way to the right at “1,” the most prevalent terms for the entire corpus are listed. When a topic is selected and the slider is at 1, it shows the prevalent terms for the corpus in relation to that particular topic (in our Halloween example, you might see more general words like: movie, plot, character). Alternatively, the closer the slider moves to “0,” the fewer corpus-wide terms appear and the more topic-specific terms are displayed (i.e.: knife, haddonfield, strode).
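Behind that slider is Sievert and Shirley’s relevance metric, which blends how probable a term is within a topic with how distinctive it is for that topic. A small sketch of the formula follows; the probability values are placeholders you would read from a fitted model.

```python
import math

def relevance(p_term_given_topic, p_term_overall, lam):
    """Sievert & Shirley (2014): lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))."""
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(
        p_term_given_topic / p_term_overall
    )

# lam = 1.0 ranks terms by their frequency within the topic; lam near 0 favors
# terms that are distinctive to the topic relative to the whole corpus.
print(relevance(0.02, 0.001, 1.0), relevance(0.02, 0.001, 0.1))
```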

While the NLP tool does the hard work of scanning and extracting text from the files, some analysis is still required from the user. The tool’s output offers archivists a bird’s-eye view of the collection, which can be helpful when little to nothing is known about its contents. However, many of the archivists I spoke to felt this tool is most effective when you already know a bit about the collection you are looking at. In that sense, it may be beneficial to allow researchers to use topic modeling in the reading room to explore a large collection. Researchers and others with subject matter expertise may get the most benefit from this tool. Do you have to know about the Halloween movie franchise to know that Michael Myers is a fictional horror film character? Probably. Now imagine more complex topics that the archivists may not have working knowledge of. The archivist can point the researcher to the right collection and let them do the analysis. This tool may also help with description, or possibly with identifying duplication across a collection (which seems to be a common problem for people working with born-digital collections).

The next step to getting NLP tools like this off the ground is to implement training. The information retrieval and ranking methods that create the output may not be widely understood. To unlock the value of an NLP tool, users must know how it works, how to run it, and how to perform meaningful analysis.  Training archivists in the reading room to assist researchers would be an excellent way to get tools like this out of the think tank and into the real world.


Morgan Goodman is a 2019 graduate of the University of North Carolina at Chapel Hill and currently resides in Denver, Colorado. She holds an MS in Information Science with a specialization in Archives and Records Management.