By Bethany Scott
The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.
At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.
As a working group, we discussed the data models, requirements, and capabilities of our separate access and preservation systems, but we had a hard time coming up with a single workflow that would handle both our preservation master files and our access files and metadata. For instance, our digital preservation policy required that we capture structural metadata and sequencing of digital objects so that we could reconstruct the original order of the physical materials if needed, but this information would not be easily captured or represented in the access system where keyword searches and faceted browsing would be the primary mode of interacting with digital objects. Eventually, we came up with this workflow:

[Workflow diagram]
Concurrently with the workflow design, the Digital Preservation Task Force contracted with Artefactual for Archivematica installation and support, tested and refined processing configuration settings using test data and then production data, and planned a three-phase migration of born-digital archives.
The current process for getting born-digital materials into Archivematica involves many manual, hands-on steps: disk imaging or other means of data transfer from external storage media, moving files into the appropriate directory structure for Archivematica ingest, and adding EAD finding aids or BitCurator reports as supplementary documentation. These born-digital preparation processes are not yet automated, unlike the digitization workflow, which will be automated through the workflow tools outlined in the diagram above. But precisely because the born-digital process falls outside that workflow, I have been able to start Archivematica ingests while the larger ecosystem is still under development. I have already moved AIPs for several collections into archival storage, and I’m starting transfers to the production instance for digitized film and video.
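The "appropriate directory structure" step is simple enough to script. As a rough sketch (the function name and arguments here are hypothetical, not part of UH's actual tooling), an Archivematica standard transfer puts content files at the top level of the transfer directory, while supplementary documentation such as an EAD finding aid or BitCurator reports goes under `metadata/submissionDocumentation/`:

```python
import shutil
from pathlib import Path

def build_transfer(source_dir, transfer_dir, supplements=()):
    """Arrange files for an Archivematica standard transfer.

    Content files are copied to the top level of the transfer directory;
    supplementary documentation (e.g. an EAD finding aid or BitCurator
    reports) is copied to metadata/submissionDocumentation/.
    """
    transfer = Path(transfer_dir)
    docs_dir = transfer / "metadata" / "submissionDocumentation"
    docs_dir.mkdir(parents=True, exist_ok=True)
    for item in Path(source_dir).iterdir():
        dest = transfer / item.name
        if item.is_dir():
            shutil.copytree(item, dest)
        else:
            shutil.copy2(item, dest)
    for doc in supplements:
        shutil.copy2(doc, docs_dir / Path(doc).name)
    return transfer
```

Scripting even this small step reduces the chance that a finding aid ends up in the wrong place and fails to be recognized as submission documentation at ingest.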
Of course, each time we begin transfers for a new content type, we tend to discover that we need to modify processing configuration settings—or come across new and exciting errors to troubleshoot—which brings me to the lessons learned so far from the implementation project.
First, I’ll highlight the importance of inventories in migration planning. If you’re trying to decide where to start with digital preservation, or with implementing other types of archival systems, conducting an inventory of your assets clarifies what you have and what it will take to meet those assets’ needs, which makes migration and implementation planning much easier.
For instance, in my case, Special Collections staff had been transferring born-digital archives from external storage media to a network drive for several years by the time I arrived, but metadata about the contents of those transfers was scattered across many disparate XML files that were difficult to aggregate or extract data from. By creating a thorough inventory of the directories on the network drive and conducting filetype analysis with free tools like WinDirStat, I was able to identify the range of filetypes we have to deal with, assess the extent of the digital files in each collection, and begin to prioritize accordingly.
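A GUI tool like WinDirStat works well interactively, but the same kind of filetype census can be scripted for repeatable reporting. A minimal sketch (the function name is mine, not an actual UH tool):

```python
from collections import Counter
from pathlib import Path

def filetype_census(root):
    """Tally file count and total bytes per extension under a directory tree."""
    counts, total_bytes = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            total_bytes[ext] += path.stat().st_size
    return counts, total_bytes
```

Counting by extension is only a first pass, of course; signature-based identification with tools like DROID or Siegfried (which is what Archivematica’s file format identification microservice performs later) gives firmer answers, but an extension census is usually enough to prioritize collections for migration.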
As a working group, we also discovered that we needed to question our assumptions while developing an overall workflow that could accommodate the sometimes conflicting needs of several departments. For example, our access system is based on an object model: the hierarchical relationships between individual digital objects, and the sequence of objects within a collection, may not be captured or displayed. Instead, keyword searching and faceted browsing will be the primary modes of discovery and interaction in the access system, and the data model and metadata application profile have been designed to support them.
By contrast, the preservation system, and archival description in general, follow a collection-based data model: our aim is for each AIP to represent a complete collection or accession, including all contextual and collection-level information available. Furthermore, as I mentioned above, our digital preservation policy specifically requires us to capture and preserve structural metadata so that the hierarchies and intellectual arrangement of a digitized collection can be reconstructed from an AIP if need be.
Each of these approaches has its pros and cons, and ultimately the different needs and requirements of the users of our various systems were the determining factor in adopting the appropriate data model and identifying the system of record for different types of data and files. For instance, users of our DAMS tend primarily to be looking for freely reusable images, and the ability to use keyword searches and facets to discover and “remix” specific sets of digital objects that may not necessarily come from a single provenance collection is a common use case in which the object approach is the most beneficial. Only by discussing the implications and benefits of these two approaches were we able to brainstorm workflow solutions and tools that would allow both systems to work to their best advantage. To that end, we focused mainly on two stages of a digital project carried out by separate but related units: digitization and metadata.
During digitization, both preservation and access files are captured, so it made sense to incorporate requirements for preservation metadata at this stage as well and send those files directly to Archivematica storage for long-term preservation. We created a digitization workflow utility for the digitization unit that packages preservation files and metadata for ingest into Archivematica, and that also pulls in any relevant metadata from ArchivesSpace to be passed along to the metadata unit, along with the access files, in the next phase of the project.
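The utility itself isn't shown in this post, but one part of the Archivematica handoff can be sketched. Archivematica transfers can carry descriptive metadata in a `metadata/metadata.csv` file, with a `filename` column (paths relative to the transfer root, e.g. `objects/0001.tif`) followed by Dublin Core columns. A hypothetical helper for writing that file, assuming metadata has already been gathered into a dict:

```python
import csv
from pathlib import Path

def write_metadata_csv(transfer_dir, records):
    """Write metadata/metadata.csv for an Archivematica transfer.

    `records` maps a file path relative to the transfer root (e.g.
    'objects/0001.tif') to a dict of Dublin Core fields such as
    'dc.title' and 'dc.identifier'.
    """
    fields = ["filename"] + sorted({k for dc in records.values() for k in dc})
    metadata_dir = Path(transfer_dir) / "metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    csv_path = metadata_dir / "metadata.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for filename, dc in records.items():
            writer.writerow({"filename": filename, **dc})
    return csv_path
```

In a real pipeline the `records` dict would be populated from ArchivesSpace (its REST API returns JSON for archival objects and their notes), so that description entered once by archivists flows into the AIP without rekeying.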
Finally, don’t hesitate to jump in! It’s daunting to begin a brand-new program when you haven’t worked out all the details of the new workflows or processes you’re designing. However, I found that conducting test transfers with real, production data, and troubleshooting the errors that came up, helped me familiarize myself with Archivematica’s features, and I’ve also refined the processing configuration settings by working with the files hands-on. Our Archivematica setup includes two production instances: one for born-digital materials and one for digitized materials. With this infrastructure we can configure Archivematica’s microservices specifically for each type of collection; for instance, digitized collections are never normalized, because during digitization we specifically select preservation-worthy filetypes for our master files.
By contrast, born-digital collections often include a range of filetypes requiring further analysis, so we do not preconfigure a specific file format identification or normalization command. Instead, I select the appropriate identification method according to the filetypes in the package (or sometimes by trial and error, if a transfer fails the file format identification microservice on the first try) and the appropriate normalization command based on whether we require normalization for access in addition to preservation.
This content was originally presented as a part of session 506: You Are Not Alone! Navigating the Implementation of New Archival Systems. Slides are available here.
Bethany Scott is Digital Projects Coordinator at the University of Houston Libraries’ Special Collections, where she is responsible for planning and coordinating the digitization of archival materials, creating online exhibits, and implementing Archivematica and ArchivesSpace. A cataloger in a past life, her professional interests include metadata schema, classification, and descriptive practice for both archival and bibliographic materials. She holds an MS in Information Studies from the University of Texas at Austin.