by Gregory Wiedeman
Archivists have made major progress using disk imaging to safely move content off of floppy disks and other external media. These workflows have the best tools with the most complete documentation. Yet external media only makes up a portion of born-digital records, and there is less guidance on processing and providing access to other types of digital content. At UAlbany, we certainly have the disk-in-a-box problem that most repositories face, but partly because of our institutional context, we have found that this aspect has become a minor part of “our set-up.”
Most of our born-digital accessions now come in over network shares, cloud storage, or exports from web applications. Additionally, we take in a lot of born-digital content that can be made publicly available now, without restrictions. Some of this is because we’ve been taking in a lot of institutional records, which are public records (think minutes, formal reports, and publications), but there is also a surprising amount of unrestricted digital material we collect from outside groups, like political advocacy and activist groups. Since we also digitize many records that need to be described and managed, we needed our infrastructure to support them as well. So, one of the major factors in developing our digital processing infrastructure is that we hope to make many records available online soon after we acquire them.
What all that means is that our set-up for processing both born-digital and digitized records is more of an ecosystem than a desktop station or single system. This is great, but it makes the set-up difficult to summarize, and the result was a bit too long for a single post. So, the BloggERS Editorial Board and I decided to split it into two posts. The first part will focus on the theoretical and technical foundation: the principles behind our decision-making and the servers that run all of the systems. The second post will summarize all the different systems and microservices we use, provide a sample use case of how they all work together, and discuss what we’ve learned since all this has been in place and the challenges that remain.
Use one archival descriptive system
Most systems built for managing digital content use bibliographic-style description which makes them challenging to integrate into archival processes. They assume each “thing” gets a record with a series of elements or statements, much like the archetypical library catalog. This really includes everything from Digital Asset Management Systems or “Digital Repositories” down to DFXML. Archival description instead describes groups of things at progressive levels of detail. Since we use archival description for paper materials, using one system means applying archival description to digital records as well.
In practice, this means that every digital object is connected to archival description managed by ArchivesSpace. This does not mean that we list each PDF or disk image, but merely that there is a record somewhere in ArchivesSpace that refers to it. This can be anywhere from a collection-level record that includes a big pile of disk images, or an item-level record that refers to an individual PDF. The identifier from ArchivesSpace can then help provide intellectual control without having to describe every digital file.
Some digital objects get additional detailed item-level metadata, while others rely on an identifier to pull in description of records in ArchivesSpace. Our repository assumes everyone uses a full set of Dublin Core or MODS elements out-of-the-box, but we needed most objects to rely only on the ASpace identifier. So we had to modify our repository to both be able to use less descriptive metadata and to link to metadata records in other systems.
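To make this concrete, here is a minimal sketch of what linking a digital object to ArchivesSpace by identifier can look like. The base URL, repository ID, and helper functions are hypothetical placeholders, not our actual code; the `find_by_id` endpoint is part of the ArchivesSpace REST API and resolves a stored ref_id to an archival object record, whose description the repository can then display instead of storing its own Dublin Core.

```python
import json
from urllib.parse import urlencode

# Hypothetical values -- adjust for your own ArchivesSpace instance.
ASPACE_API = "http://localhost:8089"
REPO_ID = 2

def ref_id_lookup_url(ref_id):
    """Build the ArchivesSpace find_by_id URL that resolves the ref_id
    stored with a digital object to an archival_object record."""
    query = urlencode({"ref_id[]": ref_id, "resolve[]": "archival_objects"})
    return f"{ASPACE_API}/repositories/{REPO_ID}/find_by_id/archival_objects?{query}"

def describe(archival_object):
    """Pull minimal display description out of an archival_object record,
    so the repository can reuse ASpace description rather than require
    a full set of descriptive elements for every object."""
    dates = [d.get("expression", "") for d in archival_object.get("dates", [])]
    return {"title": archival_object.get("title"),
            "dates": ", ".join(d for d in dates if d)}

# Example payload shaped like an ArchivesSpace archival_object response
sample = {"title": "Faculty Senate Minutes", "dates": [{"expression": "1998-2004"}]}
print(describe(sample))
```

The point is that the digital object carries only an identifier; everything human-readable comes from the description already maintained in ArchivesSpace.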
Build networks of systems and limit their use to what they are really good at
We try to keep our systems as boilerplate as possible, avoiding customizations and using their default processes. By systems, I mean software applications, such as ArchivesSpace, Hyrax, DSpace, Solr, or Fedora. These applications might span multiple servers, or multiple systems can run together on a single server. Most systems are really good at performing their core functions, but get less effective the more edge cases you ask them to do. Systems are also challenging to maintain, and sticking to the “main line” that everyone else uses ensures that we will have the easiest possible upgrade path.
This means we use ArchivesSpace for managing archival description, ArcLight for displaying description, and Hyrax as a digital repository. We only ask them to do these things, and use smaller tools to perform other functions. We use asInventory as a spreadsheet tool for listing archival materials instead of ArchivesSpace, ArcLight instead of the ArchivesSpace Public User Interface for discovery and display, and network storage for preservation instead of relying on Hyrax and Fedora.
When we need to adapt systems to local practices, instead of customizing them, we try to bring the customizations outside the boundaries of the system and instead rely on their openness and API connections. We create or adapt what I am calling “microservices” to fulfil these local custom functions. These are small, as-simple-as-possible tools that each perform one specific function. Theoretically, at least, they are easy to build and might not be designed to be maintained. Instead, we will adapt or replace them with another super-simple tool when they get problematic or are no longer useful. Microservices do not store or manage data themselves, so when (not if) they stop working, we are not relying on them to immediately fulfil core functions. We will still have to replace them, but we will not have to drop everything and scramble to fix them to serve the next user who walks in the physical or virtual door. In this way, microservices are sort of like the sacrificial anodes of our digital infrastructure.
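As a sketch of the microservice pattern described above, here is a small scheduled script that moves metadata between systems but stores nothing itself. The function name and file layout are hypothetical, for illustration only; the important property is that if the script breaks, the data still lives in the source system and the script can simply be rerun or replaced.

```python
#!/usr/bin/env python3
"""A "microservice" in the sense used here: one small, stateless,
replaceable tool that performs a single function on a schedule."""
import json
import pathlib

def sync_metadata(records, storage_root):
    """Write each metadata record as a sidecar JSON file inside the
    preservation package it describes. No state is kept between runs,
    so failure means a rerun, not data loss."""
    storage_root = pathlib.Path(storage_root)
    written = []
    for record in records:
        package_dir = storage_root / record["identifier"]
        package_dir.mkdir(parents=True, exist_ok=True)
        sidecar = package_dir / "metadata.json"
        sidecar.write_text(json.dumps(record, indent=2))
        written.append(sidecar)
    return written
```

A tool like this might run nightly from cron; because it manages no data of its own, it can be swapped out without touching the systems on either side of it.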
Computers don’t preserve bits, people preserve bits
Digital preservation is not something any software can do by itself. Preservation requires human attention and labor. There is no system where you can merely ingest your content to preserve it. Instead of relying on a single ideal “preservation” system, our approach is to get digital content onto “good enough” storage and plan to actively manage and maintain it over time. Preservation systems are tools and their effectiveness depends on their context and use.
While we use Fedora Commons as part of the Hyrax/Samvera stack, we do not consider our instance to be a preservation system and we do not use it as such. Hyrax is really complex and challenging to manage, particularly since it stores data in multiple places: in Fedora, in a database, and on a server’s hard disk. Were we to rely on Hyrax as our preservation system, my biggest fear would be that the database and Fedora get out of sync, which would prevent Hyrax from booting or using the Rails console. In this scary scenario, Fedora would still manage the digital content and metadata, but we’d have to try and piece together what “wd/37/6c/90/wd376c90g” means and how it connects to the human-readable metadata.
Instead, we use Hyrax as only an access system. We keep all master files and metadata in standard packages on network shares using BagIt-python. Preservation copies, like uncompressed TIFFs and WAVs, are not uploaded to Hyrax in order to limit data duplication, as we make derivatives prior to ingest. When we add metadata through Hyrax, a microservice adds it to the preservation storage overnight. This preserves the master copies in an environment we are more confident that we can maintain, as simplicity might be more important for preservation than complex functionality. It also lowers the stakes for maintaining Hyrax, as we don’t risk losing materials if something goes wrong.
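In practice, BagIt-python’s `bagit.make_bag()` handles the packaging described above. For readers unfamiliar with the format, here is a standard-library-only sketch of the package layout it produces (payload under `data/`, a bag declaration, and a checksum manifest); the function itself is illustrative, not our actual workflow code.

```python
import hashlib
import pathlib
import shutil

def make_minimal_bag(payload_files, bag_dir):
    """Stdlib sketch of the BagIt layout that bagit.make_bag() creates:
    master files under data/, a bagit.txt declaration, and a sha256
    manifest that supports later fixity checking. In real use, prefer
    the bagit-python library itself."""
    bag = pathlib.Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for src in map(pathlib.Path, payload_files):
        dest = data / src.name
        shutil.copy2(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag
```

The appeal for preservation is exactly this simplicity: the package is plain directories and text files that remain legible even if every application around them disappears.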
These are all the servers we use to process and manage digital records. They are all virtual servers that live in the university data center. Our Library Systems department services the Windows servers, and university IT supports the Linux servers.
- ArchivesSpace production server
- Oracle Linux server, 2 core, 6GB RAM
- Runs ArchivesSpace and MySQL
- ArchivesSpace development server
- Oracle Linux server, 2 core, 4GB RAM
- Runs ArchivesSpace and MySQL
- Ruby on Rails production server
- Oracle Linux, 4 core, 8GB RAM
- Runs applications that use the Ruby on Rails web application framework, including ArcLight, Hyrax, Bento search, and Jekyll website
- Serves the Ruby on Rails environment using Passenger and nginx as the webserver
- Ruby on Rails development server
- Oracle Linux, 4 core, 8GB RAM
- Has duplicate development instances of all Ruby on Rails-based applications
- Runs scheduled microservices
- Solr server
- Windows Server, 12GB RAM
- Runs the Solr search engine application which is used by ArcLight and Hyrax
- Fedora server
- Windows Server, 2 core, 4GB RAM
- Runs Fedora using Apache Tomcat to support Hyrax
- Postgres database server
- Windows Server
- Supports Hyrax and Fedora
I know this list can be very intimidating for many small and medium-sized repositories that can struggle to find support for even one web application, but I think it’s important to be upfront about the technology required to process and actually provide access to digital records. We don’t want to make promises to our donors that we can’t fulfil. Many administrators, donors, and IT staff members do not assume that archival repositories require this technology. Even at a major research university, we had to spend years changing the culture of expectations to put these tools in place. I hope that if archivists can be more transparent about these requirements, we can help each other make the case for more support.
Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports the implementation and development of the department’s archival systems.