NPR: Will Future Historians Consider These Days The Digital Dark Ages?

Nice article from NPR on digital preservation with some thoughtful comments by our colleagues in archives and museums.
Will Future Historians Consider These Days The Digital Dark Ages?
We are awash in a sea of information, but how do historians sift through the mountain of data? In the future, computer programs will be unreadable, and therefore worthless, to historians.
NPR.org | Morning Edition
There is additional discussion of this article on Bert Lyons's Twitter feed.

Learning a thing or two from Harpo: batch processing

Harpo Shining Shoes

I recently participated in the Fall Symposium of the Midwest Archives Conference, Hard Skills for Managing Digital Collections in Archives. The presenters covered a number of excellent tools, including ExifTool, BagIt, and MediaInfo, as well as new tricks for analyzing and processing data with that old frenemy, Excel. These tools are tremendously helpful to the work of archivists and are getting better all the time. Here at the Carleton College Archives, each one has greatly increased the speed with which we can process electronic accessions.

But if we are going to keep up with the tremendous flood of new electronic records into our archives, our processing needs to run even faster. Three commonly used programs (ExifTool, BagIt, and DROID) all process collection materials one accession at a time, and each requires additional steps and option choices to generate the desired output. The time and attention these tools demand often outweigh their convenience. We need the ability to point all of these programs at a whole series of records and produce every output we want in one or two steps instead of a dozen.

To solve this problem at Carleton, we created a set of batch processors for the programs we regularly use in our archives. For each program in our toolbox, we wrote a Python script that applies the program not just to a single accession but to an entire directory of accessions. We also gave each batch processor instructions to generate all the reports from that program that we might need to manage our digital files.
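In rough outline, the approach looks something like the sketch below. This is a simplified illustration rather than the actual script in our repositories: it assumes ExifTool is installed and on your PATH, that each accession is a subdirectory of one top-level folder, and the report filenames are made up for the example.

```python
"""Simplified sketch of a batch processor: run ExifTool over every
accession folder inside a directory and write one CSV report per
accession. Illustrative only; assumes ExifTool is on the PATH, and
the folder layout and report names are examples, not our real ones."""

import subprocess
import sys
from pathlib import Path


def process_accessions(accessions_dir):
    accessions_dir = Path(accessions_dir)
    for accession in sorted(p for p in accessions_dir.iterdir() if p.is_dir()):
        report = accessions_dir / f"{accession.name}_exiftool.csv"
        # ExifTool's -csv flag emits CSV output and -r recurses through
        # every file inside the accession folder.
        with open(report, "w") as out:
            subprocess.run(
                ["exiftool", "-csv", "-r", str(accession)],
                stdout=out,
                check=True,
            )
        print(f"Wrote {report}")


if __name__ == "__main__":
    process_accessions(sys.argv[1])
```

Run once against a folder of accessions (for example, python batch_exiftool.py path/to/accessions), a script like this produces a metadata report for every accession in a single pass, and the same wrapper pattern works for BagIt or DROID.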

These improvements have made a big difference in our ability to process numerous digital accessions quickly and consistently. For anyone who would like to try them, all the batch processors are run from the command line (the Windows Command Prompt or the Mac Terminal) and can be downloaded from our GitHub repositories:

Photo credit: Harpo Shining Shoes animated GIF.  From “A Night in Casablanca” via http://wfmu.org/playlists/shows/51730.

Computer Generated Archival Description


In our archive at Carleton College, we have implemented a number of automated and semi-automated tools to assist with processing our digital records. We use several batch processes to create access copies, generate checksums, validate file formats, and extract tagged metadata, and we are working on a data accessioner that can automate many of the repetitive steps we perform on our Submission Information Packages (SIPs). While these improvements have been tremendously helpful for processing collections quickly, one area of our workflow is consistently backed up: the creation of descriptive metadata. Minimal descriptive metadata has improved our processing time for electronic records, but I can already see that this will not be enough in the near future.
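To give a concrete sense of what one of those batch steps looks like, here is a stripped-down sketch of checksum generation: it walks an accession folder and writes a BagIt-style MD5 manifest. It is an illustration only, not our production tool, and the manifest filename is just an example.

```python
"""Sketch of batch checksum generation: hash every file in an accession
folder and write a BagIt-style MD5 manifest. Illustrative only; our
actual tools differ in options and output."""

import hashlib
import sys
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(accession_dir, manifest_name="manifest-md5.txt"):
    accession_dir = Path(accession_dir)
    lines = []
    for path in sorted(accession_dir.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            # Record "checksum  relative/path", one file per line.
            lines.append(f"{md5_of(path)}  {path.relative_to(accession_dir)}")
    (accession_dir / manifest_name).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_manifest(sys.argv[1])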

In light of the accelerating growth rate of digital accessions in our repositories, how sustainable will human-created descriptive metadata be in the next few years? Perhaps we should be turning to automated, computer-based methods for creating descriptions, just as we have for other processing steps. We already rely on optical character recognition (OCR) to improve access to scanned print documents, but other methods hold great promise. Voice-to-text software, while not yet fully mature, is being used by some digitization vendors to create transcriptions of video and audio files. Facial recognition could be a powerful tool for photograph identification, and I could see the same methods being applied to the recognition of buildings as well. Geospatial data based on known reference points, such as an address, can make images of locations more searchable and usable in dynamically generated maps. Analysis of text could even be used to generate subject categories.
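As a very rough illustration of that last point, even a toy script can propose candidate subject terms for a plain-text document by counting word frequencies. The sketch below uses only the Python standard library; a real system would reach for topic modeling or named-entity recognition, and the stop-word list here is deliberately tiny.

```python
"""Toy sketch of computer-generated description: suggest candidate
subject terms for a plain-text document by counting word frequencies.
Real systems would use topic modeling or entity recognition; the
stop-word list here is deliberately small and purely illustrative."""

import re
import sys
from collections import Counter
from pathlib import Path

STOP_WORDS = {
    "that", "this", "with", "from", "were", "have",
    "which", "their", "would", "about", "there", "been", "they",
}


def candidate_subjects(text_file, how_many=10):
    text = Path(text_file).read_text(errors="ignore").lower()
    words = re.findall(r"[a-z]{4,}", text)  # keep words of four or more letters
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, count in counts.most_common(how_many)]


if __name__ == "__main__":
    print(", ".join(candidate_subjects(sys.argv[1])))
```

The results would still need an archivist's review, but even this crude approach hints at how subject description could begin to scale with the size of our backlogs.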

These methods would, of course, change how we work as professionals and how users access our records. Our descriptive metadata would be much more extensive, but it would probably contain many more errors than we are currently willing to accept. To use this data, researchers might turn away from the traditional finding aid, detailed biographical descriptions, and human-assigned subject headings in favor of term searching, ranked results, and faceted displays. These new tools and changes may be unsettling, but in light of our mounting backlogs of electronic records, we may have no choice but to embrace them.

Do you have any experience with this kind of cataloging? Does the idea of trusting a machine to do this work cause you to feel dizziness or shortness of breath? Please let us know in the comments below.

Nat Wilson is a Digital Archivist at Carleton College.