By Nathan Gerth
Over the past several years there has been a growing conversation about “collections as data” in the archival field. Elizabeth Russey Roke underscored the impact of this movement in her recent post on the Collections As Data: Always Already Computational final report at Blo. Much like her, I have seen this computational approach to the data in our collections manifest itself at my home institution, with our decision to start providing researchers with aggregate data harvested from our born-digital collections.
Data as Collections
At the same time, in my role as a digital archivist working with congressional papers, I have seen a growing body of what I call “data as collections.” I am using the term data in this case specifically in reference to exports from relational database systems in collections. More akin to research datasets than standard born-digital acquisitions, these exports amplify the privacy and technical challenges associated with typical digital collections. However, they also embody some of the more appealing possibilities for the computational research highlighted by the “collections as data” initiative, given their structured nature and millions of data points.
Curating and supplying access to one particular type of data export has become an acute problem in the field of congressional papers. As documented in a white paper by a Congressional Papers Section Task Force in 2017, members of the U.S. House of Representatives and U.S. Senate have widely adopted proprietary Constituent Management Systems (CMS) or Constituent Services Systems (CSS) to manage constituent communications. The exports from these systems document the core interactions between the American people and their representatives in Congress. Unfortunately, these data exports have remained largely inaccessible to archivists and researchers alike.
The question of curating, preserving, and supplying access to the exports from these systems has galvanized the work of several task forces in the archival community. In recent years, congressional papers archivists have collaborated to document the problem in the white paper referenced above and to support the development of a tool to access these exports. The latter effort, spearheaded by West Virginia University Libraries, earned a Lyrasis Catalyst Fund grant in 2018 to assess the development possibilities for an open-source platform, built at WVU, for opening and searching these data exports. You can see a screenshot of the application in action below.
Screenshot of data table viewed in the West Virginia University Libraries CSS Data Tool
The project funded by the grant, America Contacts Congress, has now issued its final report, and the members of the task force that served as its advisory board are transitioning to the next stage of the project. Here is where things stand:
What We Now Know
We now know much more about the key research audiences for this data and the archival needs associated with the tool. Researchers, especially computationally minded quantitative scholars, expressed solid enthusiasm for gaining access to the data. For those of us involved in testing data in the tool, the project gave us an opportunity to become much more familiar with our data. I, for my part, now know a great deal more about the 16 million records in the relational data tables we received from the office of Senator Harry Reid, in addition to the 3 million attachments referenced by those tables. Without the ability to search and view the data in the tool, the tables and attachments from the Reid collection would have existed as little more than binary files.
While members of the grant’s advisory board know much more about how the tool might be used in the sphere of congressional papers, we would like to learn more about other cases of “data as collections” in the archival field. Who beyond congressional papers archivists is grappling with preserving and supplying access to relational databases? We know, for example, that many state and local governments are using the same Constituent Relationship Management systems, such as iConstituent and Intranet Quorum, deployed in congressional offices. Do our needs overlap with those of other archivists, and could this tool serve a broader community? While the number of CSS data exports in congressional collections is significant, the direction we take with tool development and with partnerships to supply access to the data will hinge on finding a broader audience of archivists facing similar challenges.
If any of the questions above apply to you, consider contacting the members of the America Contacts Congress project’s advisory board. We would love to hear from you and discuss how the outcomes of the grant might apply to a broader array of data exports in archival collections. Who knows, we might even help you test the tool on your own data exports! For more information about the project, visit our webpage.
Nathan Gerth is the Head of Digital Services and Digital Archivist at the University of Nevada, Reno Libraries. Splitting his time between application support and digital preservation, he is the primary custodian of the electronic records from the papers of Senator Harry Reid. Outside of the University, he is an active participant in the congressional papers community, serving as the incoming chair of the Congressional Papers Section and as a member of the Association of Centers for the Study of Congress CSS Data Task Force.