Improving Workflows at UNC Libraries’ Wilson Special Collections Library

by Erica Titkemeyer and Jessica Venlet

This is the tenth post in the bloggERS Script It! Series.

At Wilson Special Collections Library, we are always trying to find ways to improve our digital preservation workflows. Improving our skills with the command line and using existing command line tools has played a key role in workflow improvements. So, we’ve picked a few favorite tools and tips to share.

FFmpeg

We use FFmpeg for a number of daily tasks, whether it’s generating derivatives for preservation files or analyzing a video or audio file we’ve received through a born-digital accession.

Clearing embedded metadata and uses for FFprobe:
As part of our audio digitization work, we embed metadata into all preservation WAV files. This metadata follows guidelines set out by the Federal Agencies Digital Guidelines Initiative (FADGI) and mostly relates the file back to the original item it was digitized from, including its unique identifier, title, and the curatorial unit that holds it. A few times now, we have found that this data is recorded inconsistently in the file, is incorrect, or is insufficient.

[Figure: WAV file metadata. Caption: Look at that terrible metadata!]

When large-scale issues have come up, particularly with legacy files in our backlog, we’ve made use of FFmpeg’s ‘-map_metadata’ option to batch delete the embedded metadata. Setting it to -1 tells FFmpeg not to copy any metadata from the input, and ‘-c:a copy’ copies the audio stream as-is without re-encoding. Below is a script used to batch create brand-new files without metadata, with “_clean” appended to their original file names:

for i in *.wav; do ffmpeg -i "$i" -map_metadata -1 -c:a copy "${i%.wav}_clean.wav"; done
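One thing to watch with a loop like this: on a second run, `*.wav` also matches the `_clean.wav` files produced by the first run. A minimal sketch of a more defensive variant follows, assuming bash; the function name `clean_wavs` is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: clean_wavs is a hypothetical helper name, not part of FFmpeg.
# Writes a metadata-free copy of every .wav in a directory as NAME_clean.wav.
clean_wavs() {
  local dir="${1:-.}" f out
  for f in "$dir"/*.wav; do
    [ -e "$f" ] || continue                     # glob matched nothing
    case "$f" in *_clean.wav) continue ;; esac  # skip output of an earlier run
    out="${f%.wav}_clean.wav"
    [ -e "$out" ] && continue                   # a clean copy already exists
    # -nostdin: do not read terminal input while inside the loop
    # -n: never overwrite an existing output file
    # -map_metadata -1: do not copy metadata from the input
    # -c:a copy: copy the audio stream without re-encoding
    ffmpeg -nostdin -n -i "$f" -map_metadata -1 -c:a copy "$out"
  done
}

# Usage: clean_wavs /path/to/wavs
```

Because the originals are never touched, the cleaned copies can be spot-checked before anything is deleted.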

After successfully removing metadata from the files, we use the tool BWF MetaEdit to batch embed the correct metadata that we have prepared in a .csv file.

For born-digital work, we regularly use ‘ffprobe’, a stream analyzer that is part of the FFmpeg build. It allows us to quickly see data about AV files (such as duration, file size, codecs, and aspect ratio) that is helpful in identifying files and making general appraisal decisions. As we grow our capabilities in preserving born-digital AV, we also foresee the need to document this type of file data in our ingest documentation.
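As an illustration of the kind of ffprobe call this describes, here is a minimal sketch, assuming bash. The function name `probe_summary` and the report file name are hypothetical; the selection of fields is one reasonable choice, not a fixed recipe.

```shell
#!/usr/bin/env bash
# Sketch only: probe_summary is a hypothetical helper name.
# Prints container- and stream-level facts for one AV file as key=value lines.
probe_summary() {
  # -v error: hide the banner and informational messages
  # -show_entries: limit output to the listed format and stream fields
  # -of default=noprint_wrappers=1: plain key=value lines without section headers
  ffprobe -v error \
    -show_entries format=format_name,duration,size:stream=codec_type,codec_name,width,height,display_aspect_ratio \
    -of default=noprint_wrappers=1 \
    "$1"
}

# Usage: summarize every file in an accession folder into one text report.
# for f in accession/*; do echo "== $f"; probe_summary "$f"; done > av_report.txt
```

Redirecting the output to a text file is one simple way to keep this data alongside other ingest documentation.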

walk_to_dfxml.py

In our born-digital workflows, we don’t disk image every digital storage device we receive by default. This workflow choice has benefits and disadvantages. One disadvantage is losing the ability to quickly document all timestamps associated with files. While our workflows were preserving last modified dates, other timestamps like access or creation dates were not captured as effectively. In search of a way to remedy this issue, I turned to Twitter for some advice on the capture and value of each timestamp. Several folks recommended generating DFXML (Digital Forensics XML), which is usually generated from disk images. Tim Walsh helpfully pointed to a Python script, walk_to_dfxml.py, that can generate DFXML from directories instead of disk images. Workflow challenge solved!
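A minimal sketch of how such a run might be wrapped, assuming bash. It rests on two assumptions that should be checked against your copy of the script: that walk_to_dfxml.py walks the current working directory, and that it writes DFXML to standard output. The function name `dfxml_for_dir`, the variable `WALK_TO_DFXML`, and the paths in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: walk_to_dfxml.py walks the current working
# directory and writes DFXML to standard output. WALK_TO_DFXML is a
# hypothetical variable; set it to the absolute path of your copy of the script.
WALK_TO_DFXML="${WALK_TO_DFXML:-walk_to_dfxml.py}"

dfxml_for_dir() {
  local src="$1" out="$2"
  # Run in a subshell so the caller's working directory is unchanged.
  # Keep the output file outside the source directory so it is not
  # picked up by the walk itself.
  ( cd "$src" && python3 "$WALK_TO_DFXML" ) > "$out"
}

# Usage: dfxml_for_dir /media/usb_transfer /reports/usb_transfer_dfxml.xml
```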

[Figure: DFXML output example]

Brunnhilde

Brunnhilde is another tool that was particularly helpful in consolidating tasks and tools. By kicking off Brunnhilde from the command line, we are able to check for viruses, create checksums, identify file formats, identify duplicates, create a manifest, and run a PII scan. Additionally, this tool gives us a report that is useful for digital archives specialists and also holds potential as an appraisal tool for consultations with curators. We’re still working out that aspect of the workflow, but when it comes to the technical steps, Brunnhilde and the associated command line tools it includes have really improved our processing work.
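For readers who have not run it before, a single invocation might look like the sketch below, assuming bash. The wrapper name `run_brunnhilde` and the variable `BRUNNHILDE` are hypothetical. The argument order (source, destination, basename) and the `-b` flag for the bulk_extractor PII scan are assumptions based on the Brunnhilde 1.x command line; the interface has changed between releases, so confirm with `brunnhilde.py -h` before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: run_brunnhilde and BRUNNHILDE are hypothetical names.
# Assumed interface (Brunnhilde 1.x): brunnhilde.py [options] source destination basename
# Check `brunnhilde.py -h` for the version you have installed.
BRUNNHILDE="${BRUNNHILDE:-brunnhilde.py}"

run_brunnhilde() {
  local source_dir="$1" report_dir="$2" accession_id="$3"
  # -b: also run bulk_extractor on the source for the PII scan (assumed flag)
  "$BRUNNHILDE" -b "$source_dir" "$report_dir" "$accession_id"
}

# Usage: run_brunnhilde /transfers/acc_001 /reports acc_001
```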

Learning as We Go

Like many archivists, we had limited experience with using the command line before graduate school. In the course of our careers, we’ve had to learn a lot on the fly because so many great command line tools are essential for working with digital archives.

One thing that can be tricky when you are new is moving the cursor around the terminal easily. It seems like it should be a no-brainer, but it’s really not so obvious.

  • For Macs, see the excellent Script Ahoy resource:
  • For PC, see this resource for a variety of shortcuts including moving. In general:
    • The Home key moves the cursor to the beginning of the line. The End key moves it to the end.
    • Ctrl + left or right arrow moves the cursor one word at a time.

Two other helpful commands are remove (rm) and move (mv). We use these when we’d like to quickly delete extraneous files created by quality control applications in our AV workflow, or when we need to separate derivatives (such as MP3s) from a large batch of preservation files (WAVs).

    • Important note about rm: it’s always smart to first use ‘echo’ to see what files you would be removing with your command (ex: echo rm *.lvl prints the rm command with the wildcard expanded, so you can see every .lvl file that would be removed without deleting anything).
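The two tasks above can be sketched as follows, assuming bash. The function name `move_derivatives` and the `derivatives` folder name are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: move_derivatives is a hypothetical helper name.
# Moves every .mp3 in a folder into a "derivatives" subfolder,
# leaving the .wav preservation files where they are.
move_derivatives() {
  local src="$1" dest="$1/derivatives" f
  mkdir -p "$dest"
  for f in "$src"/*.mp3; do
    [ -e "$f" ] || continue   # glob matched nothing
    mv -n -- "$f" "$dest"/    # -n: do not overwrite an existing file
  done
}

# Preview before deleting: prefixing the command with echo prints it,
# with the wildcard expanded, instead of running it.
# echo rm -- *.lvl
# rm -- *.lvl
```

The `--` tells rm and mv that everything after it is a file name, which protects against files whose names begin with a dash.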

If you are just starting out, you may consider exploring online tutorials or guides like:


Erica Titkemeyer is the Project Director and Audiovisual Conservator for the Southern Folklife Collection at the UNC Wilson Special Collections Library, coordinating in-house digitization and outsourcing of audiovisual materials for preservation. Erica also participates in the improvement of online access and digital preservation for digitized materials.

Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the acquisition and management of born-digital university records. She earned a Master of Science in Information degree from the University of Michigan.
