by Andy Meyer
This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hope that other institutions might find the approach helpful and that it could be generalized to other problems facing archives.
I decided to use R because it is free and open source and because I had some prior experience with it. R has a large and active user community and many relevant packages that extend its basic functions, including packages that can read Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.
The task that sparked this script arose when I inherited hundreds of finding aids with minimal collection-level information and very long and detailed box and folder lists. These were all Microsoft Word documents with the box and folder list formatted as a table within the document. We had recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate the work. The code is organized into three sections: exporting the data, transforming and cleaning the data, and creating an EAD file to load into ArchivesSpace.
After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.
This process was surprisingly straightforward: I created a variable with the path to a Word document, read the document into R, and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of the table into a data.frame. Our finding aids were sometimes inconsistent, so I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step is a table with four columns: folder title, date, box number, and folder number.
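As a rough illustration of this step (a sketch, not the code posted on GitHub), the extraction and column tidying might look something like the following. The helper names `extract_box_list` and `standardize_columns`, the table number, and the sample column names are all assumptions made for the example; only `read_docx` and `docx_extract_tbl` come from the docxtractr package.

```r
# Hypothetical helper: pull one table out of a Word finding aid.
# Needs the docxtractr package; the table number is an assumption.
extract_box_list <- function(path, tbl_number = 1) {
  doc <- docxtractr::read_docx(path)
  tbl <- docxtractr::docx_extract_tbl(doc, tbl_number = tbl_number, header = TRUE)
  as.data.frame(tbl, stringsAsFactors = FALSE)
}

# Hypothetical helper: force any extracted table into the four expected
# columns. col_map is a named character vector mapping the target name
# to the column name used in this particular finding aid.
standardize_columns <- function(df, col_map) {
  out <- data.frame(row.names = seq_len(nrow(df)))
  for (target in c("title", "date", "box", "folder")) {
    source_name <- if (target %in% names(col_map)) col_map[[target]] else NA_character_
    if (!is.na(source_name) && source_name %in% names(df)) {
      out[[target]] <- as.character(df[[source_name]])
    } else {
      # Column missing in this finding aid: fill with NA for later cleanup
      out[[target]] <- rep(NA_character_, nrow(df))
    }
  }
  out
}

# Stand-in for a table extracted from Word (made-up values, no date column)
raw <- data.frame(
  Box = c("1", "1", "2"),
  Folder = c("1", "2", "1"),
  Description = c("Corresp, general", "Minutes", "Photographs"),
  stringsAsFactors = FALSE
)

box_list <- standardize_columns(
  raw,
  c(title = "Description", box = "Box", folder = "Folder")
)
```

With a real file, `raw` would instead come from a call such as `extract_box_list("path/to/finding_aid.docx")`, where the path is a placeholder.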
This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.
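One way to sketch that flexibility is a small reader that picks an import function based on the file extension. This is only an assumption about how such a step could be organized: `read_box_list` is a hypothetical name, and the choice of base R's `read.csv` and the readxl package's `read_excel` is mine rather than something confirmed by the project code.

```r
# Hypothetical dispatcher: choose a reader based on the file extension.
# Everything is read as text so box and folder numbers keep their formatting.
read_box_list <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "csv") {
    read.csv(path, stringsAsFactors = FALSE, colClasses = "character")
  } else if (ext %in% c("xls", "xlsx")) {
    # Needs the readxl package
    as.data.frame(readxl::read_excel(path, col_types = "text"))
  } else if (ext == "docx") {
    # Needs the docxtractr package; assumes the list is the first table
    as.data.frame(docxtractr::docx_extract_tbl(docxtractr::read_docx(path), 1))
  } else {
    stop("Unsupported file type: ", ext)
  }
}

# Small demonstration with a temporary CSV file (made-up values)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(title = "Minutes", date = "1950", box = "1", folder = "2"),
  tmp,
  row.names = FALSE
)
from_csv <- read_box_list(tmp)
```

Whatever the source, the result is the same kind of data.frame, so the cleaning and EAD steps do not need to know where the data came from.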
Data Transformation and Cleaning
Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:
- Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
- Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”, “N.D.”, “no date”, “N/A”, “NDG”, “Various”, “N. D.”, “”, “??”, “n. d.”, “n. d. ”, “No date”, “-”, “N.A.”, “ND”, “NO DATE”, and “Unknown.” For all of these, I updated the date field to “Undated” as a way to standardize this field.
- Spelling out abbreviations. Occasionally, I would use regular expressions to spell out words in the title field. This could be standard terms like “Corresp” to “Correspondence” or local terms like “NPU” to “North Park University.”
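The three cleanup steps above can be sketched in base R roughly as follows. This is a simplified illustration under stated assumptions, not the project's actual code: the function names are hypothetical, missing values are treated as undated, and the date-range logic simply takes the earliest and latest four-digit year between 1600 and 2099 that it can find.

```r
# Terms for undated folders, taken from the list above
# (trailing-space variants are handled by trimws() below)
no_date_terms <- c("n.d.", "N.D.", "no date", "N/A", "NDG", "Various",
                   "N. D.", "", "??", "n. d.", "No date", "-", "N.A.",
                   "ND", "NO DATE", "Unknown")

# Replace every "no date" variant (and, by assumption, NA) with "Undated"
standardize_dates <- function(dates) {
  dates <- trimws(dates)
  dates[is.na(dates) | dates %in% no_date_terms] <- "Undated"
  dates
}

# Abbreviations to spell out; the examples come from the post
abbreviations <- c("Corresp" = "Correspondence", "NPU" = "North Park University")

# Expand whole-word abbreviations, swallowing an optional trailing period
expand_abbreviations <- function(titles) {
  for (abbr in names(abbreviations)) {
    pattern <- paste0("\\b", abbr, "\\b\\.?")
    titles <- gsub(pattern, abbreviations[[abbr]], titles, perl = TRUE)
  }
  titles
}

# Guess a collection-level date range from the folder-level dates
collection_date_range <- function(dates) {
  dates <- dates[!is.na(dates)]
  year_pattern <- "(?<![0-9])(?:1[6-9]|20)[0-9]{2}(?![0-9])"
  years <- as.integer(unlist(regmatches(dates, gregexpr(year_pattern, dates, perl = TRUE))))
  if (length(years) == 0) return(NA_character_)
  if (min(years) == max(years)) return(as.character(min(years)))
  paste(min(years), max(years), sep = "-")
}

# Made-up sample rows
box_list <- data.frame(
  title = c("Corresp, general", "NPU Board minutes", "Photographs"),
  date = c("1950-1955", "n.d.", "ca. 1948"),
  stringsAsFactors = FALSE
)

box_list$date <- standardize_dates(box_list$date)
box_list$title <- expand_abbreviations(box_list$title)
date_range <- collection_date_range(box_list$date)
```

A year-based guess like this fails on dates written without four-digit years, which is one reason some collections still needed their date range defined by hand.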
R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.
Create EAD to Load into ArchivesSpace
Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.
Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD structure in R using the “newXMLNode” function from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes and applied them to the data exported and transformed in the earlier steps. Finally, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
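To give a sense of what this looks like, here is a minimal sketch of building such a file with the XML package. It is an illustration rather than the script on GitHub: the helper names, the sample collection values, and the exact set of EAD elements are assumptions, and I have not checked this output against ArchivesSpace's import requirements (a real import may need the EAD namespace and additional required elements).

```r
library(XML)

# Hypothetical helper: one <c01> component per folder row
make_folder_node <- function(title, date, box, folder, parent) {
  c01 <- newXMLNode("c01", attrs = c(level = "file"), parent = parent)
  did <- newXMLNode("did", parent = c01)
  newXMLNode("unittitle", title, parent = did)
  newXMLNode("unitdate", date, parent = did)
  newXMLNode("container", box, attrs = c(type = "box"), parent = did)
  newXMLNode("container", folder, attrs = c(type = "folder"), parent = did)
  c01
}

# Hypothetical helper: wrap the folder list in a bare-bones EAD document
build_ead <- function(box_list, collection_title, collection_id,
                      date_range, access_note) {
  doc <- newXMLDoc()
  ead <- newXMLNode("ead", doc = doc)

  header <- newXMLNode("eadheader", parent = ead)
  newXMLNode("eadid", collection_id, parent = header)
  filedesc <- newXMLNode("filedesc", parent = header)
  titlestmt <- newXMLNode("titlestmt", parent = filedesc)
  newXMLNode("titleproper", collection_title, parent = titlestmt)

  # Collection-level elements defined by hand
  archdesc <- newXMLNode("archdesc", attrs = c(level = "collection"), parent = ead)
  did <- newXMLNode("did", parent = archdesc)
  newXMLNode("unittitle", collection_title, parent = did)
  newXMLNode("unitid", collection_id, parent = did)
  newXMLNode("unitdate", date_range, parent = did)
  access <- newXMLNode("accessrestrict", parent = archdesc)
  newXMLNode("p", access_note, parent = access)

  # One component per row of the cleaned box and folder list
  dsc <- newXMLNode("dsc", parent = archdesc)
  for (i in seq_len(nrow(box_list))) {
    make_folder_node(box_list$title[i], box_list$date[i],
                     box_list$box[i], box_list$folder[i], parent = dsc)
  }
  doc
}

# Made-up sample data and collection values
box_list <- data.frame(
  title = c("Correspondence, general", "Board minutes"),
  date = c("1950-1955", "Undated"),
  box = c("1", "1"),
  folder = c("1", "2"),
  stringsAsFactors = FALSE
)

ead_doc <- build_ead(box_list, "Example Collection", "EX-001",
                     "1950-1955", "Open for research.")
out_file <- file.path(tempdir(), "example_ead.xml")
saveXML(ead_doc, file = out_file)
```

Keeping the folder-level node builder separate from the collection-level wrapper means the same function can be reused for every finding aid while the collection values change.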
Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.
Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.