Archival Collections as Data for Digital Scholarship

By Laurie Allen and Stewart Varner

Archives and special collections have a long history of experimenting with and embracing digital tools, so it is not surprising that they have been natural partners for digital scholarship librarians. In this blog post, we want to share a couple of experiences we’ve had in which digital scholarship and the archives came together.

Laurie Allen, Director for Digital Scholarship, University of Pennsylvania:

The Cope Evans project was an early collaboration between the Digital Scholarship group, Special Collections, and a group of students at the Haverford College Libraries. Over the years, a series of gifts had made it possible for Haverford to digitize and richly describe the Cope Evans Family Papers, which include correspondence and other documents from a connected group of Philadelphia Quaker families. While the ContentDM system used by the library allowed for searching through the digitized items, it did not take full advantage of the available metadata, including geospatial metadata. In the summer of 2014, the library employed a group of students to make use of the metadata records and associated images as a dataset. Over the following two summers, two groups of Haverford undergraduates explored the exported data from the Cope Collections to create maps, network analyses, and other visualizations and analyses of the collection. Of course, their exploration of the data led them directly back to the original materials and the resulting work represented a broader and deeper connection to the materials.
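For readers curious about what treating metadata records as a dataset can look like in practice, here is a minimal Python sketch of one step toward a network analysis: counting how often each pair of correspondents appears together. It assumes the metadata has been exported as a CSV file with "sender" and "recipient" columns. Those column names, the function name, and the sample rows are hypothetical and are not taken from the Cope Evans export.

```python
import csv
import io
from collections import Counter


def correspondence_edges(csv_file, sender_col="sender", recipient_col="recipient"):
    """Count letters exchanged between each pair of people.

    The column names are hypothetical; adjust them to match the real export.
    """
    edges = Counter()
    for row in csv.DictReader(csv_file):
        sender = (row.get(sender_col) or "").strip()
        recipient = (row.get(recipient_col) or "").strip()
        if sender and recipient:
            # Sort the pair so A->B and B->A count toward one undirected edge.
            edges[tuple(sorted((sender, recipient)))] += 1
    return edges


# Made-up sample rows standing in for an exported metadata file.
sample = io.StringIO(
    "sender,recipient\n"
    "Person A,Person B\n"
    "Person B,Person A\n"
    "Person A,Person C\n"
    "Person C,\n"
)

for (a, b), n in correspondence_edges(sample).most_common():
    print(a, "-", b, n)
```

The resulting pair counts could be loaded into a network visualization tool as a weighted edge list. Rows with a missing sender or recipient are skipped rather than guessed at.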

This experimentation with using our collections as data for student work led the Haverford Libraries to continue approaching the data and metadata of Quaker collections in data-rich ways. The Quakers and Mental Health site and the Beyond Penn’s Treaty site, both created since then, carry this work forward at Haverford.

Stewart Varner, Managing Director of Price Lab for Digital Humanities, University of Pennsylvania:

When I was the Digital Scholarship Librarian at the University of North Carolina, I worked on a project called DocSouth Data, which was designed to facilitate innovative research methods on Documenting the American South, one of the library’s most popular online collections. Documenting the American South is composed of eighteen thematic collections of digitized material. DocSouth Data takes four of the most text-heavy collections, including the heavily used North American Slave Narratives, and makes them available as .txt files as well as .xml files. With these files, scholars can start looking for patterns using simple tools like Voyant and easily experiment with text analysis methods like topic modeling and sentiment analysis.
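As a rough illustration of the kind of pattern-finding that plain text files make possible, here is a minimal Python sketch that counts the most frequent words across a folder of .txt files. The folder name "docsouth_texts", the short stopword list, and the function names are all placeholders chosen for this example. They do not describe how DocSouth Data itself is organized.

```python
import re
from collections import Counter
from pathlib import Path

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "i", "was", "that", "my"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(texts):
    """Count non-stopword tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts


def read_folder(folder):
    """Yield the contents of every .txt file under a folder."""
    for path in sorted(Path(folder).rglob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # "docsouth_texts" is a placeholder for wherever the unzipped files live.
    for word, n in word_counts(read_folder("docsouth_texts")).most_common(20):
        print(word, n)
```

A simple count like this is only a starting point, but it is the same raw material that tools like Voyant build on, and it needs nothing beyond the Python standard library.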

DocSouth Data was an exciting partnership among the Library and Information Technology team, archivists in UNC’s Special Collections, and me. The original idea came from Nick Graham, who, at the time, was the Program Coordinator for the North Carolina Digital Heritage Center (and is currently the University Archivist at UNC). I worked closely with the Library and Information Technology team, who created the plain text files, organized them into a clear folder structure, and made them available as .zip files on the library’s website. Once DocSouth Data was live, I hosted workshops at UNC and elsewhere that gave faculty, students, and librarians the chance to explore new ways to study the collections.

Since these two projects started, both Laurie and Stewart have joined the project team for the IMLS-funded Collections as Data project. The Haverford Libraries contributed two facets to that project.

3 thoughts on “Archival Collections as Data for Digital Scholarship”

  1. Kari Smith December 19, 2017 / 2:17 pm

    Thanks for writing about your work! I’m particularly interested in knowing how much work you had to do to provide the material in both .txt and .xml files — how much OCR correction did you have to do and how long did that take?


    • Stewart Varner January 10, 2018 / 10:10 am

      Hi Kari,

      Sorry for the slow reply, but I didn’t realize there were comments until this morning! The OCR correction happened before my time at UNC, but I do know that it was both time-consuming and expensive. My understanding is that, because OCR technology was so bad when the project started in the 90s, a combination of student workers and outsourced labor actually hand-keyed the collections early on. I believe that the NEH funded quite a bit, if not all, of this work.


  2. markconrad2014 December 20, 2017 / 9:39 am

    I would be interested to know if you are doing any similar work with the university’s born-digital records.

