Experimenting and Digressing to the Digital Archivist Role

By Walker Sampson

___

This is the third post in the bloggERS! series Digital Archives Pathways, where archivists discuss the non-traditional, accidental, idiosyncratic, or unique paths they took to become digital archivists.

On the surface, my route to digital preservation work was by the book: I attended UT–Austin’s School of Information from 2008–2010 and graduated with a heavy emphasis on born-digital archiving. Nevertheless, I feel my path to this work has been at least partly non-traditional in that it was a) initially unplanned and b) largely informed by projects outside formal graduate coursework (though my professors were always accommodating in allowing me to tie in such work to their courses where it made sense).

I came to the UT–Austin program as a literature major from the University of Mississippi with an emphasis on Shakespeare and creative writing. I had no intention of pursuing digital archiving work and was instead gunning for coursework in manuscript and codex conservation. It took a few months, but I realized I did not relish this type of work. There’s little point in relating the details here, but I think it’s sufficient to say that at times one simply doesn’t enjoy what one thought one would enjoy.

So, a change of direction in graduate school—not unheard of, right? I began looking into other courses and projects. One that stood out was an IMLS-funded video game preservation project. I’ve always played video games, so why not? I was eventually able to serve as a graduate research assistant on this project while looking for other places at the school and around town that were doing this type of work.

One key find was a computer museum run out of the local Goodwill in Austin, which derived most of its collection from the electronics recycling stream processed at that facility. At that point, I already had an interest in and experience with old hardware and operating systems: I was fortunate enough to have a personal computer in the house growing up. That machine had a certain mystique that I wanted to explore. I read DOS for Dummies and Dan Gookin’s Guide to Underground DOS 6.0 cover to cover. I logged into bulletin board systems, scrounged for shareware games, and swapped my finds on floppy disk with a friend around the block. All of this is to say that I had a certain comfort level with computers before changing directions in graduate school.

I answered a call for volunteers at the computer museum and soon began working there. This gig would become the nexus for a lot of my learning and advancement with born-digital materials. For example, the curator wanted a database for cataloging the vintage machines and equipment, so I learned enough PHP and MySQL to put together a relational database with a web frontend. (I expect there were far better solutions to the problem in retrospect, but I was eager to try to make things at the time. That same desire would play out far less well when I tried to make a version two of the database using the Fedora framework – an ill-conceived strategy from the start. C’est la vie.)

Other students and I would also use equipment from the Goodwill museum to read old floppies. At the time, BitCurator had not yet hit 1.0, and it seemed more expedient to simply run dd and other Unix utilities from a machine to which we had attached a floppy drive pulled from the Goodwill recycling stream. I learned a great deal about imaging through this work alone. Many of the interview transcripts for the Presidential Election of 1988 at the Dolph Briscoe Center for American History were acquired in this way under the guidance of Dr. Patricia Galloway. Using vintage and ‘obsolescent’ machines from the Goodwill computer museum was not originally part of a plan to rescue archival material on legacy media, but Dr. Galloway recognized the value of such an exercise and formed the Digital Archeology Lab at UT–Austin. In this way, experimenting can open the door to new formal practices.
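
For anyone who has not done this kind of capture, a minimal sketch of the general approach (device name, block size, and file names here are illustrative, not a record of our exact commands):

  # Image a floppy attached as /dev/fd0, padding unreadable sectors with zeroes
  dd if=/dev/fd0 of=disk001.img bs=512 conv=noerror,sync

  # Record a checksum so the image can be verified down the road
  md5sum disk001.img > disk001.img.md5
  md5sum -c disk001.img.md5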

This experience and several others like it were as instrumental as coursework in establishing my career path. With that in mind, I’ll break out a couple of guiding principles that I gleaned from this process.

1: Experiment and Learn Independently

I say this coming out of one of the top ten graduate programs in the field, but the books I checked out to learn PHP and MySQL were not required reading, and the database project wasn’t for coursework at all. Learning to use a range of Unix utilities and writing scripts for batch-processing files also required self-directed learning outside of formal studies. Volunteer work is not strictly required, but the bulk of what I learned was in service of a non-profit where I had the space to learn and experiment.

Despite my early background playing with computers, I don’t feel that background ultimately matters. Provided you are interested in the field, start your own road of experimentation now and get over any initial discomfort with computers by diving in head first.

This, over and over.

 

In other words, be comfortable failing. Learning this way means failing a lot—but failing in a methodical way. Moreover, when it is just you and a computer, you can fail at a fantastic rate that would appall your friends and family—but no one will ever know! You may have to break some stuff and become almost excruciatingly frustrated at one point or another. Take a deep breath and come back to it the next day.

2: Make Stuff That Interests You

All that experimenting and independent learning can be lonely, so design projects and outputs that you are excited by and want to share with others, regardless of how well they turn out. It helps to check out books, play with code, and bump around on the command line in service of an actual project you want to complete. Learning without a destination you want to reach, however earnest, will inevitably run out of steam. While a given course may have its own direction and intention, and a given job may have its own directives and responsibilities, there is typically healthy latitude for a person to develop projects and directions that serve their interests and broadly align with the goals of the course or job.

Again, my own path has been fairly “traditional”—graduate studies in the field and volunteer gigs along with some part-time work to build experience. Even within this traditional framework, however, experimenting, exploring projects outside the usual assignments, and independently learning whatever I thought I needed to learn have been huge benefits for me.

___

Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning, and description of born-digital objects, along with the continued preservation and stewardship of all digital materials in the Libraries.

 

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.


Digital Preservation, Eh?

by Alexandra Jokinen

This is the third post in our series on international perspectives on digital preservation.

___

Hello / Bonjour!

Welcome to the Canadian edition of International Perspectives on Digital Preservation. My name is Alexandra Jokinen. I am the new(ish) Digital Archives Intern at Dalhousie University in Halifax, Nova Scotia. I work closely with the Digital Archivist, Creighton Barrett, to aid in the development of policies and procedures for some key aspects of the University Libraries’ digital archives program—acquisitions, appraisal, arrangement, description, and preservation.

One of the ways in which we are beginning to tackle this very large, very complex (but exciting!) endeavour is to execute digital preservation on a small scale, focusing on the processing of digital objects within a single collection, and then using those experiences to create documentation and workflows for different aspects of the digital archives program.

The collection chosen to be our guinea pig was a recent donation of work from esteemed Canadian ecologist and environmental scientist Bill Freedman, who taught and conducted research at Dalhousie from 1979 to 2015. The fonds is a hybrid of analogue and digital materials dating from 1988 to 2015. Digital media carriers include: 1 computer system unit, 5 laptops, 2 external hard drives, 7 USB flash drives, 5 Zip disks, 57 CDs, 6 DVDs, 67 5.25-inch floppy disks, and 228 3.5-inch floppy disks. This is more digital material than the archives is likely to acquire in future accessions, but the Freedman collection acted as a good test case because it provided us with a comprehensive variety of digital formats to work with.

Our first area of focus was appraisal. For the analogue material in the collection, this process was pretty straightforward: conduct macro-appraisal and functional analysis by physically reviewing the material. However, as could be expected, appraisal of the digital material was much more difficult to complete. The archives recently purchased a forensic recovery of evidence device (FRED) but does not yet have all the necessary software and hardware to read the legacy formats in the collection (such as the floppy disks and Zip disks), so we started by investigating the external hard drives and USB flash drives. After examining their content, we were able to get an accurate sense of the information they contained, the organizational structure of the files, and the types of formats created by Freedman. Although we were not able to examine the files on the legacy media, we felt that we had enough context to perform appraisal, determine selection criteria, and formulate an arrangement structure for the collection.

The next step of the project will be to physically organize the material. This will involve separating, photographing, and reboxing the digital media carriers and updating a new registry of digital media that was created during a recent digital archives collection assessment modelled after OCLC’s 2012 “You’ve Got to Walk Before You Can Run” research report. Then we will need to process the digital media, which will entail creating disk images with our FRED machine and using forensic tools to analyze the data. Hopefully, this will allow us to apply the selection criteria used on the analogue records to the digital records and weed out what we do not want to retain. During this process, we will be creating procedure documentation on accessioning digital media as well as updating the archives’ accessioning manual.
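
As a rough illustration of the kind of format survey that can support this sort of weeding (a sketch only, not our actual FRED-based procedure; the image name and mount point are made up):

  # Mount a disk image read-only, then tally the file formats it contains
  mkdir -p /mnt/appraisal
  mount -o ro,loop freedman_usb01.img /mnt/appraisal

  find /mnt/appraisal -type f -exec file --brief {} + | sort | uniq -c | sort -rn > freedman_usb01_formats.txt

  umount /mnt/appraisal

A summary like this gives a quick sense of what was created and in what quantities, which is a useful starting point for applying selection criteria.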

The project’s final steps will be to take the born-digital content we have collected and ingest it using Archivematica, creating Archival Information Packages for storage and preservation, with access provided via the Archives Catalogue and Online Collections.

So there you have it! We have a long way to go in terms of digital preservation here at Dalhousie (and we are just getting started!), but hopefully our work over the next several months will ensure that solid policies and procedures are in place for maintaining a trustworthy digital preservation system in the future.

This internship is funded in part by a grant from the Young Canada Works Building Careers in Heritage Program, a Canadian federal government program for graduates transitioning to the workplace.

___


Alexandra Jokinen has a Master’s Degree in Film and Photography Preservation and Collections Management from Ryerson University in Toronto. Previously, she worked as an Archivist at the Liaison of Independent Filmmakers of Toronto and completed a professional practice project at the TIFF Film Reference Library and Special Collections.

Connect with me on LinkedIn!

Practical Digital Preservation: In-House Solutions to Digital Preservation for Small Institutions

By Tyler McNally

This is the tenth post in our series on processing digital materials.

Many archives don’t have the resources to install software or subscribe to a service such as Archivematica, but still have a mandate to collect and preserve born-digital records. Below is a digital-preservation workflow created by Tyler McNally at the University of Manitoba. If you have a similar workflow at your institution, include it in the comments. 

———

Recently I completed an internship at the University of Manitoba’s College of Medicine Archives, working with Medical Archivist Jordan Bass. A large part of my work during this internship dealt with building digital infrastructure the archive could use for its digital preservation work. As a small operation, the archive does not have the resources to pursue any kind of paid or difficult-to-use system.

Originally, our plan was to use the open-source, self-install version of Archivematica, but certain issues that cropped up made this impossible, considering the resources we had at hand. We decided that we would simply make our own digital-preservation workflow, using open-source and free software to convert our files for preservation and access, check for viruses, and create checksums—not every service that Archivematica offers, but enough to get our files stored safely. I thought other institutions of similar size and means might find the process I developed useful in thinking about their own needs and capabilities.
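
As a rough sketch of the shape such an in-house workflow can take (this is illustrative rather than the exact script used at the College of Medicine Archives; the accession directory name is made up, and ClamAV and md5deep stand in for whichever free tools fit):

  ACC=accession_2017_001   # hypothetical accession directory

  # 1. Virus check: clamscan reads files without modifying them
  clamscan -r --log="$ACC-clamscan.log" "$ACC"

  # 2. Fixity: build a recursive checksum manifest with md5deep
  md5deep -r -l "$ACC" > "$ACC-manifest.md5"

  # 3. Later, list any files whose hashes no longer match the manifest
  md5deep -r -l -x "$ACC-manifest.md5" "$ACC"

Conversion of files to preservation and access formats would sit alongside steps like these, using whatever free converters suit the formats at hand.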


Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.

———

Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation for that content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the .pdf finding aids from post-1960 with terms like “digital,” “electronic,” “disk”—you get the idea. Once we knew how many items we were dealing with and where we could find them, we could begin.
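
For anyone attempting something similar, the batch search itself can be done with ordinary command-line tools; a rough sketch (pdftotext ships with poppler-utils, and the directory name and term list here are illustrative):

  # Print the name of every finding aid that mentions possible digital content
  for f in finding_aids/*.pdf; do
    if pdftotext "$f" - | grep -E -i -q 'digital|electronic|disk|diskette|cd-rom'; then
      echo "$f"    # candidate worth a closer look
    fi
  done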

Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.


It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually–and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:

[Image: workflow diagram (whyte02)]
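
The linked script is the authoritative version; in broad strokes, though, the routing logic looks something like this simplified sketch (paths, mount point, and filesystem checks are illustrative):

  #!/bin/bash
  # For each disk image: identify the filesystem with disktype, then route it
  # to fiwalk (FAT) or to a read-only mount plus md5deep pass (HFS).
  mkdir -p meta
  for img in images/*.img; do
    base=$(basename "$img" .img)
    fs=$(disktype "$img")

    if echo "$fs" | grep -q 'FAT'; then
      # fiwalk understands FAT filesystems and writes DFXML directly
      fiwalk -X "meta/$base.xml" "$img"
    elif echo "$fs" | grep -q 'HFS'; then
      # fiwalk does not support HFS, so mount the image read-only
      # and let md5deep produce DFXML (-d) for its contents instead
      mkdir -p /mnt/hfs
      mount -t hfs -o ro,loop "$img" /mnt/hfs
      md5deep -r -l -d /mnt/hfs > "meta/$base.xml"
      umount /mnt/hfs
    else
      echo "$img: unrecognized filesystem, handle manually" >> meta/exceptions.log
    fi
  done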

I could now turn this (collection of image files and corresponding imaging logs):

[Image: Whyte03]

into this (collection of image files, logs, DFXML files, and CSV manifests):

[Image: Whyte04]

Or, to put it another way, I could now take one of these:

[Image: Whyte05]

And rapidly turn it into this ready-to-be-bagged package:

[Image: Whyte06]

Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (won’t modify the image in any way).  One possibility is to include the virus check in our bash script, but this will slow it down significantly and make quick passes through a collection of images impossible (during the early, testing phases of this pilot, this is critical). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
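
One possible shape for that overnight pass, as a sketch (a read-only mount guarantees the images themselves are never modified; paths are illustrative):

  # Mount each image read-only and scan its contents with ClamAV
  mkdir -p logs
  for img in images/*.img; do
    mkdir -p /mnt/scan
    mount -o ro,loop "$img" /mnt/scan
    clamscan -r --infected --log="logs/$(basename "$img").clamscan.log" /mnt/scan
    umount /mnt/scan
  done

  # The next morning, list any images that produced a hit
  grep -l FOUND logs/*.clamscan.log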

Adding More Filesystem Recognition
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
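
As a small example of the kind of automated sanity check that can supplement human monitoring, a sketch (file names follow the earlier routing sketch and are illustrative):

  # Flag any disk image whose DFXML manifest is missing, empty,
  # or contains no <fileobject> entries
  for img in images/*.img; do
    xml="meta/$(basename "$img" .img).xml"
    if [ ! -s "$xml" ] || ! grep -q '<fileobject>' "$xml"; then
      echo "CHECK: $img" >> meta/needs-review.log
    fi
  done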

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.

———

Jess Whyte is a Master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.

Resources:

Gengenbach, M. (2012). “The way we do it here”: Mapping digital forensics workflows in collecting institutions. Unpublished master’s thesis, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

Goldman, B. (2011). Bridging the gap: Taking practical steps toward managing born-digital collections in manuscript repositories. RBM: A Journal of Rare Books, Manuscripts, and Cultural Heritage, 12(1), 11-24.

Prael, A., & Wickner, A. (2015). Getting to know FRED: Introducing workflows for born-digital content.

It May Work In Theory . . . Getting Down to Earth with Digital Workflows

Recently, Joe Coen, archivist at the Roman Catholic Diocese of Brooklyn, posted this to the ERS listserv:

I’m looking to find out what workflows you have for ingest of electronic records and what tools you are using for doing checksums, creating a wrapper for the files and metadata, etc. We will be taking in electronic and paper records from a closed high school next month and I want to do as much as I can according to best practices.

I’d appreciate any advice and suggestions you can give.
“OK. I’ve connected the Fedora-BagIt-Library-Sleuthkit to the FTK-Bitcurator-Archivematica instance using the Kryoflux-Writeblocker-Labelmaker . . . now what?” (photo by e-magic, https://www.flickr.com/photos/emagic/51069522/).
Joe said a couple of people responded to his question directly, but that means we’ve missed an opportunity to learn, as a community, about the actual practices of other archivists working with digital materials.

There are a lot of different archivists working with electronic records—some are administrators, some are temps, some are lone arrangers, some are programmers, some are born-digital archivists, and some have digital archivy thrust upon them—and this diversity of interests and viewpoints is, to my mind, an untapped resource.

There are so many helpful articles and white papers out there offering general guidance and warning of common pitfalls, but sometimes, when you’re trying to cobble together an ingest workflow or planning a site visit, you just think, “Yeah, but how do I actually do this?”

Why don’t we do that here?

If you’ve got links to ingest workflows, transfer guidelines, in-house best practices, digital materials surveys, or any other formal or informal procedures that just might maybe, kinda, one day be helpful to another archivist, why not post or describe them in the comments?

I know I’ve often scoured the Internet for similar advice only to find it in a comment to a blog post.