Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.


Digital Preservation, Eh?

by Alexandra Jokinen

This post is the third post in our series on international perspectives on digital preservation.

___

Hello / Bonjour!

Welcome to the Canadian edition of International Perspectives on Digital Preservation. My name is Alexandra Jokinen. I am the new(ish) Digital Archives Intern at Dalhousie University in Halifax, Nova Scotia. I work closely with the Digital Archivist, Creighton Barrett, to aid in the development of policies and procedures for some key aspects of the University Libraries’ digital archives program—acquisitions, appraisal, arrangement, description, and preservation.

One of the ways in which we are beginning to tackle this very large, very complex (but exciting!) endeavour is to execute digital preservation on a small scale, focusing on the processing of digital objects within a single collection, and then using those experiences to create documentation and workflows for different aspects of the digital archives program.

The collection chosen to be our guinea pig was a recent donation of work from esteemed Canadian ecologist and environmental scientist, Bill Freedman, who taught and conducted research at Dalhousie from 1979 to 2015. The fonds is a hybrid of analogue and digital materials dating from 1988 to 2015. Digital media carriers include: 1 computer system unit, 5 laptops, 2 external hard drives, 7 USB flash drives, 5 zip disks, 57 CDs, 6 DVDs, 67 5.25 inch floppy disks and 228 3.5 inch floppy disks. This is more digital material than the archives is likely to acquire in future accessions, but the Freedman collection acted as a good test case because it provided us with a comprehensive variety of digital formats to work with.

Our first area of focus was appraisal. For the analogue material in the collection, this process was pretty straightforward: conduct macro-appraisal and functional analysis by physically reviewing material. However, as could be expected, appraisal of the digital material was much more difficult to complete. The archives recently purchased a forensic recovery of evidence device (FRED) but does not yet have all the necessary software and hardware to read the legacy formats in the collection (such as the floppy disks and zip disks), so we started by investigating the external hard drives and USB flash drives. After examining their content, we were able to get an accurate sense of the information they contained, the organizational structure of the files, and the types of formats created by Freedman. Although we were not able to examine files on the legacy media, we felt that we had enough context to perform appraisal, determine selection criteria and formulate an arrangement structure for the collection.

The next step of the project will be to physically organize the material. This will involve separating, photographing and reboxing the digital media carriers and updating a new registry of digital media that was created during a recent digital archives collection assessment modelled after OCLC’s 2012 “You’ve Got to Walk Before You Can Run” research report. Then, we will need to process the digital media, which will entail creating disk images with our FRED machine and using forensic tools to analyze the data.  Hopefully, this will allow us to apply the selection criteria used on the analogue records to the digital records and weed out what we do not want to retain. During this process, we will be creating procedure documentation on accessioning digital media as well as updating the archives’ accessioning manual.

The project’s final steps will be to take the born-digital content we have collected and ingest it using Archivematica, creating Archival Information Packages for storage and preservation and making the content accessible via the Archives Catalogue and Online Collections.

So there you have it! We have a long way to go in terms of digital preservation here at Dalhousie (and we are just getting started!), but hopefully our work over the next several months will ensure that solid policies and procedures are in place for maintaining a trustworthy digital preservation system in the future.

This internship is funded in part by a grant from the Young Canada Works Building Careers in Heritage Program, a Canadian federal government program for graduates transitioning to the workplace.

___


Alexandra Jokinen has a Master’s Degree in Film and Photography Preservation and Collections Management from Ryerson University in Toronto. Previously, she has worked as an Archivist at the Liaison of Independent Filmmakers of Toronto and completed a professional practice project at TIFF Film Reference Library and Special Collections.

Connect with me on LinkedIn!

Practical Digital Preservation: In-House Solutions to Digital Preservation for Small Institutions

By Tyler McNally

This post is the tenth post in our series on processing digital materials.

Many archives don’t have the resources to install software or subscribe to a service such as Archivematica, but still have a mandate to collect and preserve born-digital records. Below is a digital-preservation workflow created by Tyler McNally at the University of Manitoba. If you have a similar workflow at your institution, include it in the comments. 

———

Recently I completed an internship at the University of Manitoba’s College of Medicine Archives, working with Medical Archivist Jordan Bass. A large part of my work during this internship dealt with building digital infrastructure for the archive to use in its digital preservation work. As a small operation, the archive does not have the resources to pursue any kind of paid or difficult-to-use system.

Originally, our plan was to use the open-source, self-install version of Archivematica, but certain issues that cropped up made this impossible, considering the resources we had at hand. We decided that we would simply make our own digital-preservation workflow, using open-source and free software to convert our files for preservation and access, check for viruses, and create checksums—not every service that Archivematica offers, but enough to get our files stored safely. I thought other institutions of similar size and means might find the process I developed useful in thinking about their own needs and capabilities.
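As a rough illustration of the checksum step mentioned above, here is a minimal Python sketch. It is not the workflow from the post: the choice of SHA-256, the function names, and the in-memory manifest format are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are now missing or have a different checksum."""
    current = build_manifest(root)
    return [name for name, digest in manifest.items() if current.get(name) != digest]
```

Building the manifest at ingest and calling `changed_files` later gives a simple fixity check: an empty list means every recorded file still matches its original checksum.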


Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.

———

Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation on the content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the post-1960 .pdf finding aids for terms like “digital,” “electronic,” and “disk” (you get the idea). Once we knew how many items we were dealing with and where we could find them, we could begin.
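A batch search like this can be sketched in a few lines of Python. This is only an illustration, not the method used in the project: it assumes the text of each finding aid has already been extracted from the PDF into a string, and the function names and sample data are hypothetical.

```python
import re

SEARCH_TERMS = ("digital", "electronic", "disk")  # the terms named in the post


def matched_terms(text: str, terms=SEARCH_TERMS) -> list[str]:
    """Return the terms that appear at the start of any word in text, ignoring case."""
    found = []
    for term in terms:
        # \b anchors the start of a word, so "disk" also matches "disks" or "diskette"
        if re.search(rf"\b{re.escape(term)}", text, flags=re.IGNORECASE):
            found.append(term)
    return found


def flag_finding_aids(aids: dict[str, str]) -> dict[str, list[str]]:
    """Map each finding-aid name to its matched terms, skipping aids with no matches."""
    hits = {name: matched_terms(text) for name, text in aids.items()}
    return {name: terms for name, terms in hits.items() if terms}
```

Called with a dictionary of finding-aid names and extracted text, `flag_finding_aids` returns only the aids worth a closer look, along with the terms that triggered each hit.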

Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.


It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually–and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:

whyte02
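The routing idea can also be sketched in Python. This is not the bash script from the project, just a minimal illustration of the decision it describes: HFS images go to md5deep (because fiwalk does not support HFS) and FAT images go to fiwalk. The function names are hypothetical, and the sketch assumes the filesystem identification report is available as plain text that mentions "FAT" or "HFS" somewhere; the exact wording of disktype's output is not reproduced here.

```python
# Ordered routing table: (marker to look for in the report, tool to use).
# HFS is listed first; anything unrecognized falls through to manual review.
ROUTES = (
    ("HFS", "md5deep"),  # fiwalk does not support HFS
    ("FAT", "fiwalk"),
)


def choose_tool(report: str) -> str:
    """Pick a metadata tool based on the text of a filesystem identification report."""
    upper = report.upper()
    for marker, tool in ROUTES:
        if marker in upper:
            return tool
    return "manual-review"


def route_images(reports: dict[str, str]) -> dict[str, list[str]]:
    """Group image names by the tool chosen for each image's report."""
    routed: dict[str, list[str]] = {}
    for image, report in sorted(reports.items()):
        routed.setdefault(choose_tool(report), []).append(image)
    return routed
```

Keeping the routes in a small table means that supporting another filesystem is a one-line addition rather than a new branch of logic.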

I could now turn this (collection of image files and corresponding imaging logs):

Whyte03

into this (collection of image files, logs, DFXML files, and CSV manifests):

Whyte04

Or, to put it another way, I could now take one of these:

Whyte05

And rapidly turn it into this ready-to-be-bagged package:

Whyte06

Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (that is, they won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this will slow it down significantly and make quick passes through a collection of images impossible (during the early, testing phases of this pilot, quick passes are critical). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.

Adding More Filesystem Recognition
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
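Some of these indicators lend themselves to an automated exceptions report. The Python sketch below is one possible shape for that check, not part of the project's workflow. It assumes each image `name.img` sits next to `name.xml` (DFXML) and `name.csv` (manifest), and that the CSV has a header row; those naming and layout conventions are assumptions made for the example.

```python
import csv
from pathlib import Path


def manifest_row_count(csv_path: Path) -> int:
    """Count data rows in a CSV manifest, assuming the first row is a header."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.reader(handle))
    return max(len(rows) - 1, 0)


def capture_problems(image: Path) -> list[str]:
    """List the warning signs for one disk image (an empty list means none found)."""
    problems = []
    dfxml = image.with_suffix(".xml")
    manifest = image.with_suffix(".csv")
    if not dfxml.is_file() or dfxml.stat().st_size == 0:
        problems.append("missing or empty DFXML")
    if not manifest.is_file():
        problems.append("missing CSV manifest")
    elif manifest_row_count(manifest) == 0:
        problems.append("manifest lists no files")
    return problems


def exceptions_report(folder: Path, pattern: str = "*.img") -> dict[str, list[str]]:
    """Map each problematic image in folder to its list of warning signs."""
    report = {}
    for image in sorted(folder.glob(pattern)):
        problems = capture_problems(image)
        if problems:
            report[image.name] = problems
    return report
```

A report like this does not prove a capture succeeded, but it narrows human attention to the images most likely to need a closer look.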

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.

———

Jess Whyte is a master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.

Resources:

Gengenbach, M. (2012). “The way we do it here”: Mapping digital forensics workflows in collecting institutions. Unpublished master’s thesis, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

Goldman, B. (2011). Bridging the gap: Taking practical steps toward managing born-digital collections in manuscript repositories. RBM: A Journal of Rare Books, Manuscripts and Cultural Heritage, 12(1), 11-24.

Prael, A., & Wickner, A. (2015). Getting to Know FRED: Introducing Workflows for Born-Digital Content.

It May Work In Theory . . . Getting Down to Earth with Digital Workflows

Recently, Joe Coen, archivist at the Roman Catholic Diocese of Brooklyn, posted this to the ERS listserv:

I’m looking to find out what workflows you have for ingest of electronic records and what tools you are using for doing checksums, creating a wrapper for the files and metadata, etc. We will be taking in electronic and paper records from a closed high school next month and I want to do as much as I can according to best practices.

I’d appreciate any advice and suggestions you can give.
“OK. I’ve connected the Fedora-BagIt-Library-Sleuthkit to the FTK-Bitcurator-Archivematica instance using the Kryoflux-Writeblocker-Labelmaker . . . now what?” (photo by e-magic, https://www.flickr.com/photos/emagic/51069522/).
Joe said a couple of people responded to his question directly, but that means we’ve missed an opportunity to learn as a community of archivists working with digital materials about the actual practices of other archivists working with digital materials.

There are a lot of different archivists working with electronic records—some are administrators, some are temps, some are lone arrangers, some are programmers, some are born digital archivists and some have digital archivy thrust upon them—and this diversity of interests and viewpoints is, to my mind, an untapped resource.

There are so many helpful articles and white papers out there offering general guidance and warning of common pitfalls, but sometimes, when you’re trying to cobble together an ingest workflow or planning a site visit, you just think, “Yeah, but how do I actually do this?”

Why don’t we do that here?

If you’ve got links to ingest workflows, transfer guidelines, in-house best practices, digital materials surveys, or any other formal or informal procedures that just might maybe, kinda, one day be helpful to another archivist, why not post or describe them in the comments?

I know I’ve often scoured the Internet for similar advice only to find it in a comment to a blog post.