By Tyler McNally
This post is the tenth post in our series on processing digital materials.
Many archives don’t have the resources to install software or subscribe to a service such as Archivematica, but still have a mandate to collect and preserve born-digital records. Below is a digital-preservation workflow created by Tyler McNally at the University of Manitoba. If you have a similar workflow at your institution, include it in the comments.
Recently I completed an internship at the University of Manitoba’s College of Medicine Archives, working with Medical Archivist Jordan Bass. A large part of my work during this internship dealt with building digital infrastructure the archive could use for digital preservation. As a small operation, the archive does not have the resources to pursue a paid or difficult-to-use system.
Originally, our plan was to use the open-source, self-install version of Archivematica, but certain issues that cropped up made this impossible, considering the resources we had at hand. We decided that we would simply make our own digital-preservation workflow, using open-source and free software to convert our files for preservation and access, check for viruses, and create checksums—not every service that Archivematica offers, but enough to get our files stored safely. I thought other institutions of similar size and means might find the process I developed useful in thinking about their own needs and capabilities.
Our first step was to back up our data before doing anything else. I also like to quarantine the data at this point using an external hard drive (my workstation is not connected to the network or our final storage server). I personally have never had a problem with data loss or viruses, but people using some of these open-source programs have reported issues with file corruption and deletion, so it is a good idea to work with a copy.
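The backup step can be sketched from the shell (this is not from the original post; the paths are placeholders, and `rsync` is an assumption with a plain `cp` fallback):

```shell
# Make a working copy of the ingest before touching anything.
# All paths here are placeholders for demonstration.
mkdir -p /tmp/ingest_demo/source /tmp/ingest_demo/working_copy
echo "sample record" > /tmp/ingest_demo/source/record.txt

# rsync -a preserves timestamps and permissions; fall back to cp -a if absent.
if command -v rsync >/dev/null 2>&1; then
    rsync -a /tmp/ingest_demo/source/ /tmp/ingest_demo/working_copy/
else
    cp -a /tmp/ingest_demo/source/. /tmp/ingest_demo/working_copy/
fi
```

All later steps then operate on the working copy, never on the originals.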
Second, we used ClamAV, an open-source antivirus program available for Windows, macOS, and Linux, to deal with any viruses before opening any files. One of the reasons I really like ClamAV is that a GUI is available on Linux, so less tech-savvy users are not stuck using the command line. I had ClamAV generate a report, which I saved as a .txt file to be added to the metadata folder in the final AIP.
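For command-line users, the scan-and-report step might look like this (a sketch, not the post’s exact invocation; the paths are placeholders, and the scan only runs if `clamscan` is installed):

```shell
# clamscan ships with ClamAV; -r recurses, -l writes the scan report to a file.
# Directory names here are placeholders.
mkdir -p /tmp/clam_demo/ingest /tmp/clam_demo/metadata
echo "test file" > /tmp/clam_demo/ingest/doc.txt

if command -v clamscan >/dev/null 2>&1; then
    clamscan -r -l /tmp/clam_demo/metadata/clamav_report.txt /tmp/clam_demo/ingest \
        || echo "scan did not complete cleanly" >> /tmp/clam_demo/metadata/clamav_report.txt
else
    echo "clamscan not installed; scan skipped" > /tmp/clam_demo/metadata/clamav_report.txt
fi
```

The resulting .txt report is what goes into the AIP’s metadata folder later.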
At this point, I liked to drill down into the lowest levels of the ingest to get an idea of what file formats were there, so that a file-format conversion plan could be made. For example, video files tend to take a long time to convert, so it’s worth starting them early and letting them run in the background; and some files may already be in one of your preservation or access formats, so they don’t require conversion.
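A quick way to survey formats from the shell (not from the original post; it tallies file extensions, which is a rough proxy for format, and the paths are placeholders):

```shell
# Build a small demo ingest to survey; paths are placeholders.
mkdir -p /tmp/survey_demo/ingest/sub
touch /tmp/survey_demo/ingest/a.jpg /tmp/survey_demo/ingest/sub/b.jpg /tmp/survey_demo/ingest/sub/c.wav

# Strip each path down to its extension, then count occurrences of each.
find /tmp/survey_demo/ingest -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
```

The counts make it easy to spot the slow conversions (video, large image sets) worth starting first.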
The next step was to move forward with file normalization for access and preservation. We followed Archivematica’s standards, available here; they seem to hold up well enough, but you may want to do more research on lossy and lossless formats, considering your processing and storage resources. Depending on the software and hardware available at your institution, it may be more prudent to use a different format for access, or to look into a different lossless format that gives you a smaller file size.
Next, we created our access and preservation directories, copying the structure of the original directory. So, I would have my high-level AIP folder (following some internal naming convention) containing two sub-folders, “objects” and “metadata.” The “objects” folder had three more sub-folders: “original,” “access,” and “preservation.” I copied the original files, making sure to duplicate the original structure of the ingest, into the “original” folder, and once converted, I rebuilt that original structure with the files normalized for both preservation and access. The images below demonstrate this structure: the top left is the high-level folder; the other three images show the lowest levels of the “objects” sub-folder.
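The skeleton described above can be created in one command (a sketch; “AIP_001” and the base path are placeholder names, not the archive’s actual convention):

```shell
# Build the empty AIP structure: metadata plus the three object sub-folders.
# "AIP_001" stands in for whatever internal naming convention you use.
base=/tmp/aip_demo/AIP_001
mkdir -p "$base/metadata" \
         "$base/objects/original" \
         "$base/objects/access" \
         "$base/objects/preservation"
```

Originals are then copied into `objects/original`, and the normalized copies are rebuilt under `objects/access` and `objects/preservation` with the same internal structure.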
Having created the final structure of the AIP and normalized all the files, it was time to finish the metadata. The ClamAV report made earlier was already in the metadata folder, but we required two other pieces of metadata:
- Checksums: we used MD5 as our standard algorithm for checksums, opting to generate them with MD5Deep, a command-line tool, because it works on Windows, Mac, and Linux. It’s simple to use and can batch-generate checksums to a single .md5 file. I chose to create separate sets of MD5 sums for the original, preservation, and access files and save each as a unique .md5 file in the metadata folder.
- Descriptive metadata: we created XML files for our AIPs using Dublin Core metadata standards. We opted for a minimalist approach and used the following seven fields: Title, Creator, Subject, Description, Contributor, Date, and Format. I used Microsoft Word to write the XML, saving the document as .xml; LibreOffice is an open-source option that would also work.
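Both metadata pieces can be scripted (a sketch, not the post’s exact procedure: the paths and field values are placeholders, `md5deep -r` is the recursive batch mode, and `md5sum` is shown as a coreutils fallback where MD5Deep is not installed):

```shell
# Generate a per-set checksum file and a minimal Dublin Core record.
# The AIP name, paths, and all field values below are examples only.
base=/tmp/md_demo/AIP_001
mkdir -p "$base/metadata" "$base/objects/original"
echo "content" > "$base/objects/original/item.txt"

if command -v md5deep >/dev/null 2>&1; then
    md5deep -r "$base/objects/original" > "$base/metadata/original.md5"
else
    # coreutils fallback if md5deep is not installed
    find "$base/objects/original" -type f -exec md5sum {} + > "$base/metadata/original.md5"
fi

# A minimal seven-field Dublin Core record; every value is a placeholder.
cat > "$base/metadata/dc.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example accession</dc:title>
  <dc:creator>Unknown</dc:creator>
  <dc:subject>Sample records</dc:subject>
  <dc:description>Demonstration AIP</dc:description>
  <dc:contributor>Example contributor</dc:contributor>
  <dc:date>Unknown</dc:date>
  <dc:format>text/plain</dc:format>
</metadata>
EOF
```

Repeating the checksum command for the access and preservation folders yields the three separate .md5 files described above.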
The directory containing the original, preservation, and access folders, along with the folder containing all the metadata, was compressed using Windows’ native file archiver, 7-Zip, or WinRAR, depending on our needs. All of these tools produce formats that can be opened on Windows, Mac, and Linux and offer decent compression, which fit our purposes.
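From the command line, the compression step might look like this (a sketch with placeholder paths; `zip` is assumed because it opens everywhere, with a `tar`/`gzip` fallback):

```shell
# Compress a finished AIP directory into a single archive file.
# The AIP name and paths are placeholders.
mkdir -p /tmp/zip_demo/AIP_001/metadata
echo "report" > /tmp/zip_demo/AIP_001/metadata/clamav_report.txt

cd /tmp/zip_demo
if command -v zip >/dev/null 2>&1; then
    zip -r AIP_001.zip AIP_001
else
    tar -czf AIP_001.tar.gz AIP_001   # tar/gzip fallback if zip is absent
fi
```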
With this long list of instructions for preservation, though, I have yet to talk about the tools we used to normalize audio, video, image, and word-processing files.
Audio: I chose Audacity, a free, open-source audio editor that runs on Windows, macOS, and Linux. It has a GUI, which makes it easy to use, but it doesn’t do batch conversion (though you can open multiple files at a time). We exported files as MP3 for access and WAV for preservation. The pictures below show where the export button is and what the export dialogue box looks like.
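For larger batches, ffmpeg (a free, open-source tool not used in the post) can script the same MP3/WAV exports; this is a hypothetical sketch with placeholder paths that does nothing unless ffmpeg is installed and WAV originals are present:

```shell
# Batch the MP3 (access) and WAV (preservation) exports with ffmpeg, if available.
# Paths are placeholders; the loop is a no-op when the folder is empty.
mkdir -p /tmp/audio_demo/original /tmp/audio_demo/access /tmp/audio_demo/preservation

if command -v ffmpeg >/dev/null 2>&1; then
    for f in /tmp/audio_demo/original/*.wav; do
        [ -e "$f" ] || continue           # skip when no files match the glob
        name=$(basename "$f" .wav)
        ffmpeg -i "$f" "/tmp/audio_demo/access/$name.mp3"    # lossy access copy
        cp "$f" "/tmp/audio_demo/preservation/$name.wav"     # WAV kept as-is
    done
fi
```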
Image: I used two pieces of software for image conversion: Preview when using our Mac, and ImageMagick on our Linux machine. Preview comes with macOS, and it does support batch conversion. These pictures show where to find the export command, what the dialogue window looks like, and which settings to use to save a JPG (our access format) or a TIFF (our preservation format).
ImageMagick is a command-line tool available on all operating systems. It’s fairly easy to use for a single file at a time, but you can batch convert with it. I will cover both methods.
For a single image, open a terminal window and ‘cd’ into the folder containing the image. Enter the command “convert 1.jpg 1.tiff” and hit enter; this converts 1.jpg from a JPG to a TIFF. The batch method I used has been called unstable, but I’ve never had a problem (though it is not recommended for your only copies, as it may corrupt them). My routine was to copy the files I wanted to convert into a new folder on my desktop, convert them there, and then delete the copies. While in the directory containing the files, type the command “mogrify -format tiff *.jpg”. This converts anything with the extension “.jpg” to a TIFF. I preferred the batch method even for single files because it doesn’t require typing out exact file names in the terminal, which I personally find tedious.
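The copy-then-mogrify routine can be put together as a small script (a sketch with placeholder paths; `sample.jpg` is a tiny image generated for the demo, and nothing runs unless ImageMagick is installed):

```shell
# Batch-convert disposable JPG copies to TIFF in a scratch folder.
# The scratch path is a placeholder; copy your JPGs into it first.
mkdir -p /tmp/img_demo/scratch

if command -v mogrify >/dev/null 2>&1; then
    cd /tmp/img_demo/scratch
    convert -size 1x1 xc:white sample.jpg   # tiny placeholder image for the demo
    mogrify -format tiff *.jpg              # writes a .tiff next to each .jpg
    # move the new TIFFs into objects/preservation, then delete the copies
fi
```

Because `mogrify` writes new .tiff files rather than touching the originals in place when `-format` is used, working on copies adds a second layer of safety.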
Video: We went with HandBrake, a free and open-source program available on Mac, Windows, and Linux. One really nice thing about HandBrake is that you can queue up a bunch of conversions and leave it to run overnight or while you work on something else. It also has a GUI, which makes it easy to use. It will accept any video format but only exports MP4 and MKV, though these are the recommended access and preservation standards, respectively. For the first conversion, click the start button up top; for any subsequent conversions, click “Add to Queue.” In the image below I have directed HandBrake to my test file, set it to convert to MKV, and saved it in my preservation folder as “test.mkv.”
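HandBrake also ships a command-line version, HandBrakeCLI, so the queue can be scripted as a loop (a hypothetical sketch, not the post’s method; paths are placeholders, the output container is inferred from the .mkv extension, and the loop does nothing unless HandBrakeCLI is installed and originals are present):

```shell
# Convert every original video to MKV for preservation with HandBrakeCLI.
# Paths are placeholders; the loop is a no-op when the folder is empty.
mkdir -p /tmp/video_demo/original /tmp/video_demo/preservation

if command -v HandBrakeCLI >/dev/null 2>&1; then
    for f in /tmp/video_demo/original/*; do
        [ -e "$f" ] || continue
        name=$(basename "${f%.*}")
        HandBrakeCLI -i "$f" -o "/tmp/video_demo/preservation/$name.mkv"
    done
fi
```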
Text: The last file type I encountered was word-processing files. We used DOC as our access format and PDF as our preservation format; conversion is done by opening the originals in Microsoft Word (or the free, open-source LibreOffice) and resaving them as PDFs.
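LibreOffice also has a headless mode that can batch the resave-as-PDF step without opening the GUI (a sketch with placeholder paths; it runs only if LibreOffice is installed and .doc files are present):

```shell
# Batch-convert DOC files to PDF using LibreOffice's headless mode.
# Paths are placeholders; the loop is a no-op when the folder is empty.
mkdir -p /tmp/text_demo/original /tmp/text_demo/preservation

if command -v libreoffice >/dev/null 2>&1; then
    for f in /tmp/text_demo/original/*.doc; do
        [ -e "$f" ] || continue
        libreoffice --headless --convert-to pdf \
            --outdir /tmp/text_demo/preservation "$f"
    done
fi
```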
What I really like about this workflow is that it’s cheap, easy, and adaptable to individual institutions’ needs. The Medical Archives spent no money on new technology or software, since we used open-source programs and the hardware we already had. With little to no expense and maybe a couple of weeks of trial and error, most people with a small degree of computer skills could get this type of system working. The fact that all these parts are separate means the workflow can be customized to an institution’s needs, and if one tool breaks or stops receiving updates, it is not a critical piece of the infrastructure, meaning it can be easily substituted and the system updated. For smaller or more isolated institutions with little money and IT support, this could be a viable solution for digital preservation.

Of course, there are some issues with this setup. The metadata we generated was nowhere near as sophisticated or in-depth as the metadata Archivematica produces and records. Compression was also an issue: it took a long time and was not always successful. Working between Linux and Mac systems, formatting our external drives was also painful. Ultimately, I think the positives outweigh the negatives. If one can deal with some technical hiccups, this can be a strong, cheap, and easy digital-preservation workflow.
Tyler McNally is a master’s student in archival studies at the University of Manitoba. His research focuses on the archiving of security and intelligence agencies at the federal level in Canada, access to classified records, archives and accountability, and digital preservation.