By Jess Whyte
This is the second post in our Spring 2016 series on processing digital materials.
Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:
- the collections are heterogeneous in terms of format, age, media and filesystems
- the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
- the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
- the content is at risk
The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation for that content.
Identifying digital assets
The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the post-1960 .pdf finding aids for terms like “digital,” “electronic,” and “disk” (you get the idea). Once we knew how many items we were dealing with and where we could find them, we could begin.
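A minimal sketch of that kind of batch search, assuming the finding aids have already been converted from .pdf to plain text. The function name, the prefix-matching behaviour, and the sample data are illustrative assumptions, not the project's actual search.

```python
import re

# Search terms quoted in the post; extend the list as needed.
TERMS = ["digital", "electronic", "disk"]


def find_hits(texts, terms=TERMS):
    """Return {finding_aid_name: sorted list of matched terms}.

    `texts` maps a finding aid's name to its extracted plain text.
    Matching is case-insensitive and anchored at the start of a word,
    so "disk" also matches "disks" and "diskette".
    """
    patterns = {t: re.compile(r"\b" + re.escape(t), re.IGNORECASE) for t in terms}
    hits = {}
    for name, text in texts.items():
        matched = sorted(t for t, p in patterns.items() if p.search(text))
        if matched:
            hits[name] = matched
    return hits


if __name__ == "__main__":
    # Hypothetical finding-aid snippets, for illustration only.
    sample = {
        "finding_aid_a.txt": "Box 12: 3 computer disks, drafts of novel.",
        "finding_aid_b.txt": "Box 4: correspondence, 1962-1970.",
    }
    for name, matched in find_hits(sample).items():
        print(name, matched)
```

Only finding aids with at least one match appear in the result, which gives a short list of collections to check by hand.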
Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.
It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.
Our early iterations of the process went something like this:
- Use a Kryoflux and its corresponding software to take an image of the disk
- Mount the image in a tool like FTK Imager or HFSExplorer
- Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
- Bag it all up in Bagger.
This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually, and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.
Enter disktype and md5deep
I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.
Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.
This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:
I could now turn a collection of image files and corresponding imaging logs into a collection of image files, logs, DFXML files, and CSV manifests. Or, to put it another way, I could now take a single item and rapidly turn it into a ready-to-be-bagged package.
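The routing idea described in this section can be sketched roughly as follows. This is not the project's bash script. It is a Python illustration that assumes disktype's report contains a line such as "FAT12 file system" or "HFS file system", and that fiwalk's `-X` and md5deep's `-r -d` options behave as described in their man pages (worth verifying locally). The function names and the `ROUTES` table are hypothetical.

```python
import re

# Which tool handles which filesystem family. The post notes that fiwalk
# does not support HFS, so HFS images are mounted and passed to md5deep.
ROUTES = {"FAT": "fiwalk", "HFS": "md5deep"}


def detect_filesystem(disktype_output):
    """Return "FAT", "HFS", or None from disktype's text report.

    Assumption: the report has a line like "FAT12 file system" or
    "HFS file system". Adjust the pattern for your disktype version.
    """
    match = re.search(r"^\s*(FAT|HFS)\w* file system", disktype_output, re.MULTILINE)
    return match.group(1) if match else None


def plan_commands(image, fs, mount_point, dfxml_out):
    """Return the commands (as argv lists) for one image, without running them."""
    tool = ROUTES.get(fs)
    if tool == "fiwalk":
        # fiwalk reads the image directly; -X names the DFXML output file.
        return [["fiwalk", "-X", dfxml_out, image]]
    if tool == "md5deep":
        # Mount read-only, hash recursively with DFXML output (-d), unmount.
        # md5deep writes to stdout, so the caller redirects it to dfxml_out.
        return [
            ["mount", "-t", "hfs", "-o", "loop,ro", image, mount_point],
            ["md5deep", "-r", "-d", mount_point],
            ["umount", mount_point],
        ]
    return []  # unrecognized filesystem: set aside for manual handling
```

Returning the planned commands rather than executing them makes it easy to review what would run across a whole batch before touching any images.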
Challenges, Future Considerations and Questions
Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (that is, they won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this would slow it down significantly and make quick passes through a collection of images impossible, and quick passes are critical during the early testing phases of this pilot. Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images, and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
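One way to confirm that a batch scan is non-invasive is to checksum each image before and after scanning. The sketch below assumes the scanner is wrapped in a hypothetical `scan(path)` callable that returns True when an image is flagged. No particular antivirus tool is implied, and the function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_batch(image_paths, scan):
    """Run `scan(path) -> bool` (True means flagged) over each image.

    Returns (flagged, modified): the images the scanner flagged, and the
    images whose checksum changed while being scanned.
    """
    flagged, modified = [], []
    for path in image_paths:
        before = sha256_of(path)
        if scan(path):
            flagged.append(path)
        if sha256_of(path) != before:
            modified.append(path)
    return flagged, modified
```

Run overnight before bagging, this leaves two short lists to review in the morning: images to quarantine, and images the scanner altered (ideally none).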
Adding More Filesystem Recognition
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
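Two of those indicators (were the manifests created, and do they list any files?) lend themselves to an automated check. The sketch below assumes each image has one DFXML file and one .csv manifest, and that DFXML records each file as a `fileobject` element. The function names and message wording are illustrative.

```python
import os
import xml.etree.ElementTree as ET


def count_fileobjects(dfxml_path):
    """Count <fileobject> elements in a DFXML file, ignoring XML namespaces."""
    count = 0
    for _, elem in ET.iterparse(dfxml_path):
        # Namespaced tags look like "{uri}fileobject"; keep the local name.
        if elem.tag.rsplit("}", 1)[-1] == "fileobject":
            count += 1
    return count


def check_package(dfxml_path, csv_path):
    """Return a list of problems found; an empty list means nothing to flag."""
    problems = []
    for path in (dfxml_path, csv_path):
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(f"missing or empty: {path}")
    if not problems:
        # Both manifests exist, so look inside the DFXML for listed files.
        try:
            if count_fileobjects(dfxml_path) == 0:
                problems.append(f"no files listed in: {dfxml_path}")
        except ET.ParseError:
            problems.append(f"unparseable DFXML: {dfxml_path}")
    return problems
```

Any image with a non-empty problem list becomes a candidate for the slower, hands-on inspection described above.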
If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.
Jess Whyte is a master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.