Learning a thing or two from Harpo: batch processing

Harpo Shining Shoes

I recently participated in the Fall Symposium of the Midwest Archives Conference, Hard Skills for Managing Digital Collections in Archives. The presenters covered a number of excellent tools, including ExifTool, Bagit and MediaInfo, as well as new tricks for analyzing and processing data using that old frenemy, Excel. These tools are tremendously helpful to the work of archivists and are getting better all the time. Here at the Carleton College Archives, each one has greatly increased the speed with which we can process electronic accessions.

But if we are going to keep up with the tremendous flood of new electronic records into our archives, our processing programs need to run even faster.  Three commonly used programs—ExifTool, Bagit and DROID—all process collection materials one at a time, and each program also requires additional steps and selection of options to generate the desired output. The convenience of these solutions is often overshadowed by their resource intensiveness. We need the ability to instruct all these programs to process a whole series of records, producing all the various outputs we want, with one or two steps as opposed to a dozen.

To solve this problem at Carleton, we have created a set of batch processors for the programs we regularly use in our archive. For each program in our toolbox we wrote a script in the programming language Python that applies the program not just to a single accession but to an entire directory of accessions. We also gave each batch processor instructions to generate all the reports from its program that we might need to manage our digital files.
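To give a sense of the pattern, here is a minimal sketch of a batch processor in Python. It is not the code from our repositories: the function names (find_accessions, exiftool_command, batch_process) and the report naming are hypothetical, and it assumes that each accession is its own subfolder of one directory and that ExifTool is installed and on the system path. It uses ExifTool's -csv and -r options to write one CSV metadata report per accession.

```python
import subprocess
from pathlib import Path


def find_accessions(root):
    """Return each immediate subdirectory of root, sorted by name."""
    return sorted(p for p in Path(root).iterdir() if p.is_dir())


def exiftool_command(accession):
    """Build an ExifTool call: -r recurses into subfolders, -csv prints CSV."""
    return ["exiftool", "-csv", "-r", str(accession)]


def batch_process(root, report_dir, build_command=exiftool_command,
                  run=subprocess.run):
    """Run one command per accession and save its output as a report.

    report_dir should sit outside root so it is not mistaken for an accession.
    """
    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)
    reports = []
    for accession in find_accessions(root):
        result = run(build_command(accession), capture_output=True, text=True)
        if result.returncode != 0:
            # Keep going so one problem accession does not stop the batch.
            print(f"warning: {accession.name} exited with {result.returncode}")
        report = report_dir / f"{accession.name}_report.csv"
        report.write_text(result.stdout, encoding="utf-8")
        reports.append(report)
    return reports
```

Under those assumptions, calling batch_process("accessions", "reports") would leave one file per accession in the reports folder. Passing a different build_command function is one way the same loop could be pointed at another command-line program.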

These improvements have made a big difference in our ability to process numerous digital accessions quickly and consistently. For anyone who would like to try them, all the batch processors are run from the command line (PC Command Prompt or the Mac Terminal) and can be downloaded from our GitHub repositories:

Photo credit: Harpo Shining Shoes animated GIF.  From “A Night in Casablanca” via http://wfmu.org/playlists/shows/51730.

Publication of the University of Minnesota Libraries Electronic Records Task Force (ERTF) Final Report

By Carol Kussmann

In May of 2014 the University of Minnesota Libraries charged the Electronic Records Task Force (ERTF) with developing initial capacity for ingesting and processing electronic records, and with defining ingest and processing workflows as well as tasks and workflows for both technical and archival staff. The ERTF is pleased to announce that its final report is now available through the University Digital Conservancy.

The report describes our work over the past year and covers many different areas in detail including our workstation setup, software and equipment, security concerns, minimal task selection for ingesting materials, and tool and software evaluation and selection.  The report also talks about how and why we modified workflows when we ran into situations that warranted change.  To assist in addressing staffing needs we summarized the amount of work we were able to complete, how much time it took to do so, and projections for known future work.

High level lessons learned include:

  • Each accession brings with it its own questions and issues. For example:
    • A music collection had many file names that included special characters; because of this the files didn’t transfer correctly.
    • New accessions can include new file formats or types that you are unfamiliar with and need to find the best way to address. Email was one of those for us. [Which we still haven’t fully resolved.]
    • Some collections are ‘simple’ to ingest as all files are contained on a single hard drive, others are more complicated or more time consuming. One collection we ingested had 60+ CD/DVDs that needed to be individually ingested.
  • Personnel consistency is key.
    • We had a sub-group of five Task Force members who could ingest records. We found that those who were able to spend more time working with the records didn’t have to spend as much time reviewing procedures. Those who spent more time also better understood common issues and how to work through them.

We hope that readers find this final report useful. It not only documents the work of the Task Force but also shares many of the resources that were created throughout the year, including:

  • Description of our workstation, including hardware and software
  • Processing workflow instructions for ingesting materials
  • Draft Deed of Gift Addendum for electronic records
  • Donor guides (for sharing information with donors about electronic records)

The ERTF completed the tasks set out in the first charge but understands that working with electronic records is an ongoing process and that the work must continue. To this end, we are in the process of drafting a second charge to address processing ingested records and providing access to processed records.

Thank you to the entire Task Force for their work: Lisa Calahan, Kevin Dyke, Lara Friedman-Shedlov, Lisa Johnston, Carol Kussmann (co-chair), Mary Miller, Erik Moore, Arvid Nelsen (co-chair), Jon Nichols, Justin Schell, Mike Sutliff, and sponsors John Butler and Kris Kiesling.

Carol Kussmann is a Digital Preservation Analyst with the University of Minnesota Libraries Data & Technology – Digital Preservation & Repository Technologies unit, and co-chair of the University of Minnesota Libraries – Electronic Records Task Force. Questions about the report can be sent to: lib-ertf@umn.edu.