What’s Your Set-up?: Processing Digital Records at UAlbany (part 2)

by Gregory Wiedeman


In the last post I wrote about the theoretical and technical foundations for our born-digital records set-up at UAlbany. Here, I try to show the systems we use and how they work in practice.

Systems

ArchivesSpace

ArchivesSpace manages all archival description. Accession records and top level description for collections and file series are created directly in ArchivesSpace, while lower-level description, containers, locations, and digital objects are created using asInventory spreadsheets. Overnight, all modified published records are exported using exportPublicData.py and indexed into Solr using indexNewEAD.sh. This Solr index is read by ArcLight.

ArcLight

ArcLight provides discovery and display for archival description exported from ArchivesSpace. It uses URIs from ArchivesSpace digital objects to point to digital content in Hyrax while placing that content in the context of archival description. ArcLight is also really good at systems integration because it allows any system to query it through an unauthenticated API. This allows Hyrax and other tools to easily query ArcLight for description records.

Hyrax

Hyrax manages digital objects and item-level metadata. Some objects have detailed Dublin Core-style metadata, while other objects only have an ArchivesSpace identifier. Some custom client-side JavaScript uses this identifier to query ArcLight for more description to contextualize the object and provide links to more items. This means users can discover a file that does not have detailed metadata, such as Minutes, and Hyrax will display the Scope and Content note of the parent series along with links to more series and collection-level description.

Storage

Our preservation storage uses network shares managed by our university data center. We limit write access to the SIP and AIP storage directories to one service account used only by the server that runs the scheduled microservices. This means that only tested automated processes can create, edit, or delete SIPs and AIPs. Archivists have read-only access to these directories, which contain standard bags generated by BagIt-python that are validated against BagIt Profiles. Microservices also place a copy of all SIPs in a processing directory where archivists have full access to work directly with the files. These processing packages have specific subdirectories for master files, derivatives, and metadata. This allows other microservices to be run on them with just the package identifier. So, if you needed to batch create derivatives or metadata files, the microservices know which directories to look in.

The microservices themselves have built-in checks in place, such as they will make sure a valid AIP exists before deleting a SIP. The data center also has some low-level preservation features in place, and we are working to build additional preservation services that will run asynchronously from the rest of our processing workflows. This system is far from perfect, but it works for now, and at the end of the day, we are relying on the permanent positions in our department as well as in Library Systems and university IT to keep these files available long-term.

Microservices

These microservices are the glue that keeps most of our workflows working together. Most of the links here point to code in our Github page, but we’re also trying to add public information on these processes to our documentation site.

asInventory

This is a basic Python desktop app for managing lower-level description in ArchivesSpace through Excel spreadsheets using the API. Archivists can place a completed spreadsheet in a designated asInventory input directory and double-click an .exe file to add new archival objects to ArchivesSpace. A separate .exe can export all the child records from a resource or archival object identifier. The exported spreadsheets include the identifier for each archival object, container, and location, so we can easily roundtrip data from ArchivesSpace, edit it in Excel, and push the updates back into ArchivesSpace. 

We have since built our born digital description workflow on top of asInventory. The spreadsheet has a “DAO” column and will create a digital object using a URI that is placed there. An archivist can describe digital records in a spreadsheet while adding Hyrax URLs that link to individual or groups of files.

We have been using asInventory for almost 3 years, and it does need some maintenance work. Shifting a lot of the code to the ArchivesSnake library will help make this easier, and I also hope to find a way to eliminate the need for a GUI framework so it runs just like a regular script.

Syncing scripts

The ArchivesSpace-ArcLight-Workflow Github repository is a set of scripts that keeps our systems connected and up-to-date. exportPublicData.py ensures that all published description in ArchivesSpace is exported each night, and indexNewEAD.sh indexes this description into Solr so it can be used by ArcLight. processNewUploads.py is the most complex process. This script takes all new digital objects uploaded through the Hyrax web interface, stores preservation copies as AIPs, and creates digital object records in ArchivesSpace that points to them. Part of what makes this step challenging is that Hyrax does not have an API, so the script uses Solr and a web scraper as a workaround.

These scripts sound complicated, but they have been relatively stable over the past year or so. I hope we can work on simplifying them too, by relying more on ArchivesSnake and moving some separate functions to other smaller microservices. One example is how the ASpace export script also adds a link for each collection to our website. We can simplify this by moving this task to a separate, smaller script. That way, when one script breaks or needs to be updated, it would not affect the other function.

Ingest and Processing scripts

These scripts process digital records by uploading metadata for them in our systems and moving them to our preservation storage.

  • ingest.py packages files as a SIP and optionally updates ArchivesSpace accession records by added dates and extents.
  • We have standard transfer folders for some campus offices with designated paths for new records and log files along with metadata about the transferring office. transferAccession.py runs ingest.py but uses the transfer metadata to create accession records and produces spreadsheet log files so offices can see what they transferred
  • confluence.py scrapes files from our campus’s Confluence wiki system, so for offices that use Confluence all I need is access to their page to periodically transfer records.
  • convertImages.py makes derivative files. This is mostly designed for image files, such as batch converting TIFFs to JPGs or PDFs.
  • listFiles.py is very handy. All it does is create a text file that lists all filenames and paths in a SIP. These can then be easily copied into a spreadsheet.
  • An archivist can arrange records by creating an asInventory spreadsheet that points to individual or groups of files. buildHyraxUpload.py then creates a TSV file for uploading these files to Hyrax with the relevant ArchivesSpace identifiers.
  • updateASpace.py takes the output TSV from uploading to Hyrax and updates the same inventory spreadsheets. These can then be uploaded back into ArchivesSpace which will create digital objects that point to Hyrax URLs.

SIP and AIP Package Classes

These classes are extensions of the Bagit-python library. They contain a number of methods that are used by other microservices. This lets us easily create() or load() our specific SIP or AIP packages and add files to them. They also include complex things like getting a human-readable extent and date ranges from the filesystem. My favorite feature might be clean() which removes all Thumbs.db, desktop.ini, and .DS_Store files as the package is created.

Example use case

  1. Wild records appear! A university staff member has placed records of the University Senate from the past year in a standard folder share used for transfers.
  2. An archivist runs transferAccession.py, which creates an ArchivesSpace accession record using some JSON in the transfer folder and technical metadata from the filesystem (modified dates and digital extents). It then packages the files using BagIt-python and places one copy in the read-only SIP directory and a working copy in a processing directory.
    • For outside acquisitions, the archivists usually manually download, export, or image the materials and create an accession record manually. Then, ingest.py packages these materials and adds dates and extents to the accession records when possible.
  3. The archivist makes derivative files for access or preservation. Since there is a designated derivatives directory in the processing package, the archivists can use a variety of manual tools or run other microservices using the package identifier. Scripts such as convertImages.py can batch convert or combine images and PDFs and other scripts for processing email are still being developed.
  4. The archivist then runs listFiles.py to get a list of file paths and copies them into an asInventory spreadsheet.
  5. The archivist arranges the issues within the University Senate Records. They might create a new subseries and use that identifier in an asInventory spreadsheet to upload a list of files and then download them again to get a list of ref_ids.
  6. The archivist runs buildHyraxUpload.py to create a tab-separated values (TSV) file for uploading files to Hyrax using the description and ref_ids from the asInventory spreadsheet.
  7. After uploading the files to Hyrax, the archivist runs updateASpace.py to add the new Hyrax URLs to the same asInventory spreadsheet and uploads them back to ArchivesSpace. This creates new digital objects that point to Hyrax.

Successes and Challenges

Our set-up will always be a work in progress, and we hope to simplify, replace, or improve most of these processes over time. Since Hyrax and ArcLight have been in place for almost a year, we have noticed some aspects that are working really well and others that we still need to improve on.

I think the biggest success was customizing Hyrax to rely on description pulled from ArcLight. This has proven to be dependable and has allowed us to make significant amounts of born-digital and digitized materials available online without requiring detailed item-level metadata. Instead, we rely on high-level archival description and whatever information we can use at scale from the creator or the file system.

Suddenly we have a backlog. Since description is no longer the biggest barrier to making materials available, the holdup has been the parts of the workflow that require human intervention. Even though we are doing more with each action, large amounts of materials are still held up waiting for a human to process them. The biggest bottlenecks are working with campus offices and donors as well as arrangement and description.

There is also a ton of spreadsheets. I think this is a good thing, as we have discovered many cases where born-digital records come with some kind of existing description, but it often requires data cleaning and transformation. One collection came with authors, titles, and abstracts for each of a few thousand PDF files, but that metadata was trapped in hand-encoded HTML files from the 1990s. Spreadsheets are a really good tool for straddle the divide between automated and manual processes required to save this kind of metadata, and this is a comfortable environment for many archivists to work in.[1]

You may have noticed, but the biggest needs we have now—donor relations, arrangement and description, metadata cleanup—are roles that archivists are really good and comfortable at. It turned out that once we had effective digital infrastructure in place, it created further demands on archivists and traditional archival processes.

This brings us to the biggest challenge we face now. Since our set-up often requires comfort on the command line, we have severely limited the number of archivists who can work on these materials and required non-archival skills to perform basic archival functions. We are trying to mitigate this in some respects by better distributing individual stages for each collection and providing more documentation. Still, this has clearly been a major flaw, as we need to meet users (in this case other archivists) where they are rather than place further demands on them.[2]


Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports the implementation and development of the department’s archival systems.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s