Collections as Data

by Elizabeth Russey Roke


Archives put a great deal of effort into preserving the original object. We document the context around its creation, perform conservation work on the object if necessary, and implement reading room procedures designed to limit damage or loss. As a result, researchers can read and hold an original letter written by Alice Walker, view a series of tintypes taken before the Civil War, read marginalia written by Ted Hughes in a book from his personal library, or listen to an audio recording of one of Martin Luther King, Jr.’s speeches. In other words, as far as possible, we enable researchers to encounter materials as they were originally designed to be used.

The nature of digital humanities research challenges these traditional modes of archival access: are these the only ways to interact with archival material? How do we serve users who want to leverage computational techniques such as text mining, machine learning, network analysis, or computer vision in their research or teaching? Are machines and algorithms “users”? Archivists also encounter these questions as the content of archives shifts from analog to born-digital material. Digital files were created and designed to be processed by algorithms, not just encountered through experiences such as watching, viewing, or reading. What could access to these types of materials look like if we opened up their full functionality and not just their appearance?

I have spent the past two years working on an IMLS grant focused on addressing these types of questions. Collections As Data: Always Already Computational examined how digital collections are and could be used beyond analog research methodologies. Collections as data is ordered information, stored digitally, that is inherently amenable to computation. This could include metadata, digital text, or other digital surrogates. Whereas a digital repository might enable researchers to read a newspaper on a computer screen, an approach grounded in collections as data would also give researchers access to the OCR file the repository generated to enable keyword search. In other words, digital repositories should provide access that goes beyond the viewers, page turners, and streaming servers most of them currently offer, all of which replicate analog experiences. At its core, collections as data simply asks cultural heritage organizations to make the full digital object available rather than making assumptions about how users will want to interact with it.

Collections as data implementations are not necessarily complex, nor do they require complicated repository development. Some of the simplest examples can be found on GitHub, where archives such as Vanderbilt and New York University publish their EAD files. The Rockefeller Archive Center and the Museum of Modern Art go a step further and publish all of their collection data, along with a Creative Commons license. Emory, my home institution, makes finding aid data available in both EAD and RDF from our finding aids database, which has led to a digital humanities project that harvested correspondence indexes from our Irish poetry collections to build network graphs of the Belfast Group. More complex implementations often provide access to data through APIs instead of a bulk download. An example of this can be found at Carnegie Hall Archives, which allows researchers to query their data through a SPARQL endpoint.

Chord diagram created as part of Emory’s Belfast Group Poetry project, showing the networks of people associated with the Belfast Group and their relationships with each other.
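To make the idea of harvesting correspondence indexes a little more concrete, here is a minimal sketch in Python. It is not the code used by the Belfast Group project. The sample finding aid, the names in it, and the function name correspondence_edges are all invented for illustration, and the sketch assumes an EAD file without an XML namespace that records correspondents as persname elements inside index entries. Real finding aids vary, so the paths would need adjusting.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified finding aid. Real EAD files are far
# richer and may declare an XML namespace, which this sketch ignores.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Poet A papers</unittitle>
      <origination><persname>Poet A</persname></origination>
    </did>
    <index>
      <indexentry><persname>Poet B</persname></indexentry>
      <indexentry><persname>Poet C</persname></indexentry>
      <indexentry><persname>Poet B</persname></indexentry>
    </index>
  </archdesc>
</ead>"""


def correspondence_edges(ead_xml):
    """Count (creator, correspondent) pairs found in a finding aid's index."""
    root = ET.fromstring(ead_xml)
    # Assumption: the collection creator is recorded under origination/persname.
    creator = root.findtext("./archdesc/did/origination/persname")
    names = [
        el.text.strip()
        for el in root.iterfind(".//index/indexentry/persname")
        if el.text
    ]
    return Counter((creator, name) for name in names)


edges = correspondence_edges(SAMPLE_EAD)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

Each weighted pair is one edge in a network graph. Running the same function over many finding aids and merging the counters would give the raw material for a diagram like the one above.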

The Collections As Data: Always Already Computational final report includes more information and ideas for getting started with collections as data. It includes resources such as a set of user personas, methods profiles of common techniques used by data-driven researchers, and real-world case studies of institutions with collections as data including their motivations, technical details, and how they made the case to their administrators.  I highly recommend “50 Things,” which is a list of activities associated with collections as data work ranging from the simple to the complex. 

There are a few takeaways from this project I’d like to highlight for archivists in particular:

Collections as data approaches are archival.  Data-driven research demands authenticity and context of the data source, established and preserved through archival principles of documentation, transparency, and provenance.  This type of information was one of the most universal requests from digital humanities researchers. It was clear that they were not only interested in the object, but in how it came to be.  They wanted to understand their data as an archival object with information about its creation, provenance, and preservation. Archivists need to advocate for digital collections to be treated not just as digital surrogates, or what I like to think of as expensive photocopying, but as unique resources unto themselves deserving description, preservation, and access that may not necessarily match that of the original object.

Collections as data enhances access to archival material. What if we could partially open restricted material to researchers? Emory holds the papers of Salman Rushdie, and his email files are largely restricted per the deed of gift. Computational techniques being developed in ePADD could generate maps of Rushdie’s correspondents and reveal patterns in the timing and frequency of his correspondence, using only email header information and without exposing the sensitive data (i.e., the content of the email) that Rushdie wanted to restrict. Could this methodology be extrapolated to other types of restricted electronic files?
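As a rough illustration of what header-only analysis can look like, the sketch below uses Python's standard email module to count recipients and messages per month without ever reading a message body. It is not how ePADD works internally. The sample messages, the addresses, and the function name header_summary are invented for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical messages; only the To, Cc, and Date headers are ever read.
SAMPLE_MESSAGES = [
    "From: author@example.org\nTo: editor@example.com\n"
    "Date: Mon, 03 Mar 2003 10:00:00 +0000\n\nRestricted body text.",
    "From: author@example.org\nTo: Editor <editor@example.com>\n"
    "Cc: agent@example.net\n"
    "Date: Tue, 18 Mar 2003 09:30:00 +0000\n\nRestricted body text.",
    "From: author@example.org\nTo: agent@example.net\n"
    "Date: Wed, 02 Apr 2003 14:15:00 +0000\n\nRestricted body text.",
]


def header_summary(raw_messages):
    """Return (recipient counts, messages per month) from headers alone."""
    correspondents = Counter()
    by_month = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # Collect every address in the To and Cc headers.
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _name, address in recipients:
            if address:
                correspondents[address.lower()] += 1
        date_header = msg.get("Date")
        if date_header:
            sent = parsedate_to_datetime(date_header)
            by_month[sent.strftime("%Y-%m")] += 1
    return correspondents, by_month


correspondents, by_month = header_summary(SAMPLE_MESSAGES)
print(correspondents.most_common())
print(sorted(by_month.items()))
```

Whether header data alone is acceptable to release is still a question for the donor agreement. Names, dates, and frequency of contact can be sensitive in their own right.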

Just start. For digital files, trying something is always the first and best approach. There is no one way or best way to do collections as data work. Consider your community and ask them what they need. Unlike baseball fields, if you build it, they probably won’t come unless you ask first. Collections as data material already exists in your collection, especially if you use ArchivesSpace. Publish it. Think broadly about what might constitute collections as data and how you might make use of it yourself; collections as data benefits us too. Follow the Computational Archival Science project at the University of Maryland, which is exploring how we think about archival collections as data.

If you want to take a deep dive into collections as data (and get funding to do so!), consider applying to be part of the second cohort of the Part to Whole Mellon grant, which aims to foster the development of broadly viable models that support implementation and use of collections as data. The next call for proposals opens August 1: https://collectionsasdata.github.io/part2whole/. On August 5, the project team will offer a webinar with more information about the grant and opportunities to ask questions: https://collectionsasdata.github.io/part2whole/cfp2webinar/.


Elizabeth Russey Roke is a digital archivist and metadata specialist at the Stuart A. Rose Library of Emory University, Atlanta, Georgia. Primarily focused on preservation, discovery, and access to digitized and born-digital assets from special collections, Elizabeth works on a variety of technology projects and initiatives related to digital repositories, metadata standards, and archival descriptive practice. She was a co-investigator on a 2016-2018 IMLS grant investigating collections as data.
