This year’s meeting of the Preservation and Archiving Special Interest Group (PASIG) took place this October at the Museum of Modern Art in New York. PASIG brings together an international community to share successes and challenges of digital preservation, with an emphasis on practical applications and solutions.
The conference was three days long and kicked off with a day of “Bootcamp/101” sessions focused on bringing everyone up to speed on what it is we’re preserving and how we can go about building infrastructures to support preservation. Unfortunately, I wasn’t able to arrive until Day 2, but many of the presentation slides are available online at the conference’s figshare page.
I arrived on Thursday morning, ready to jump into a morning of presentations and panel discussions on reproducibility and research data. Vicky Steeves started the presentations with an explanation of reproducibility vs. replication, a distinction well worth making, especially for those of us with less experience working with research data.
“Reproducibility independently confirms results with the same data (and/or code). Replication independently confirms results with new data (and/or code).”
Steeves pointed out that the concerns of reproducibility are really an iceberg: the environment in which the research was conducted often goes unnoticed, especially when research tools rely on a particular version of a browser, a piece of hardware, or other software. These tools may be updated or changed in ways that aren’t immediately visible.
One potential solution to this problem was presented by Fernando Chirigati of New York University. He introduced ReproZip, a tool that packages the data files, libraries, and environment variables an experiment depends on. ReproZip runs in the background while the experiment is conducted and documents the variables and technological dependencies that future researchers will need to reproduce the experiment, even after tools and browsers have changed. The packaged data and environment can be archived, then unpacked by ReproZip for future use.
Both Peter Burnhill from the University of Edinburgh and Rachel Trent from George Washington University Libraries discussed the problem of reproducing research that relies on web resources. Burnhill’s presentation, “Web Today, Gone Tomorrow,” focused on the lack of persistence in web addresses and the need for ongoing preservation of online articles and other academic resources. To get an idea of the scope of this problem: 20-30% of referenced URLs are lost within two weeks of publication. Burnhill presented the Hiberlink project, which aims to close this preservation gap through partnerships with academic publishing outlets.

Rachel Trent’s presentation, “Documenting the Demographic Imagination,” discussed the challenges of preserving social media data for reproducible research. Given the continued migration from one social media platform to another (Myspace to Facebook to Twitter, and so on), the archivist can’t assume that future researchers will understand the basis of any of these websites. Trent discussed the use of social media managers and web harvesters to automate the collection of social media data, and what metadata can be automatically extracted using these tools. Trent and her team are now looking for feedback from the community on what’s missing from their social media metadata and how researchers want to interact with it.
After a brief lunch break, we dove into the challenges of preserving complex and very large data. Karen Cariani presented on WGBH’s public broadcasting media library and archives. Because WGBH works with audio and video files, its preservation needs are significant: uncompressed preservation masters are very large, the formats are complicated, and proxy files are necessary for access. Cariani discussed how the HydraDAM2 project worked to fill this preservation gap by extending the HydraDAM system to work with the Fedora 4 repository and creating a Hydra “head” for digital A/V preservation.
Ben Fino-Radin continued the theme of preservation at scale, discussing the creation of workflows for digitized time-based media holdings at MoMA. The digital repository uses Archivematica for ingest, Arkivum for storage, and Binder for managing these digital assets. A single 120-minute film, once restored at 4K resolution, contains 4 terabytes of data, so the workflows and systems for managing these files have to move quickly and efficiently. This also means that MoMA must prioritize its film digitization efforts carefully.
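To see why speed matters at this scale, a quick back-of-envelope calculation helps (the link speeds below are my own illustration, not figures from MoMA’s infrastructure): moving a single restored film is a multi-hour job even on a fast network.

```python
def transfer_time_hours(size_bytes, link_bits_per_second):
    """Hours needed to move a file of the given size across a
    network link at the given sustained rate."""
    return size_bytes * 8 / link_bits_per_second / 3600

# ~4 TB for a single restored 120-minute film at 4K resolution
FILM_BYTES = 4 * 10**12

# Hypothetical sustained rates for common network infrastructure
for label, bps in [("1 Gbit/s network", 10**9),
                   ("10 Gbit/s network", 10**10)]:
    print(f"{label}: {transfer_time_hours(FILM_BYTES, bps):.1f} hours")
```

At 1 Gbit/s sustained, a single film takes almost nine hours to transfer, which makes clear why ingest and storage workflows at this scale must be designed for efficiency from the start.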
Day three focused on sustainability: not only sustaining our cultural and scientific heritage through digital preservation, but also sustaining our planet and communities. Eira Tansey from the University of Cincinnati pointed out the obvious but rarely discussed fact that archives require energy, digital archives especially so. She urged the audience to consider the energy required for preservation during their daily work in the archive. Some common practices of digital preservation may be a wasteful use of resources, such as preserving every derivative file as it is migrated from one format to the next, or treating file compression as the enemy of preservation. She posted the entire text of her talk, “The Voice of One Crying Out in the Wilderness: Preservation in the Anthropocene,” online.
Elvia Arroyo-Ramirez, Processing Archivist for Latin American Manuscript Collections, Princeton University, presented “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files.” She discussed how the systems we use contain the biases of the people who create them, pointing to systems that require file names be ‘cleaned’ or ‘scrubbed’ to remove ‘illegal characters’ including Spanish-language diacritic glyphs. When working with a born-digital collection created in another language, those glyphs are vital to the understanding of those records. She asked the community how we can intervene to make our tools and technologies reflect our mission to preserve the records and ‘do no harm.’
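The kind of loss Arroyo-Ramirez described is easy to demonstrate. A naive filename “scrubber” of the sort she critiqued (a hypothetical sketch, not any particular tool she named) silently discards the diacritics that distinguish Spanish words:

```python
import re
import unicodedata

def naive_scrub(filename):
    """A common but lossy 'cleaning' approach: decompose accented
    characters, drop the combining marks, then replace anything
    outside a narrow 'legal' ASCII set with underscores."""
    decomposed = unicodedata.normalize("NFKD", filename)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9._-]", "_", ascii_only)

print(naive_scrub("poesía_año_1976.txt"))  # → poesia_ano_1976.txt
```

Here “año” (year) collapses into “ano,” an entirely different word: exactly the kind of silent harm to born-digital records that default, English-centric tooling can inflict.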
The conference was concluded by Ingrid Burrington, neither an archivist nor a digital preservationist, but a self-described writer, mapmaker, and joke maker, and the author of Networks of New York: An Illustrated Field Guide to Urban Internet Infrastructure. She discussed the physical infrastructure that makes up the internet and the corporate infrastructures that keep it running. She pointed to social media as crafting communication, and to products like Google Maps as crafting our understanding of the world’s geography. Companies like Google can skew their products away from reality, be that by blurring sensitive government installations or their own data centers. Corporate interest and the public need for information do not always align.
This change of perspective was a great end to the conference, bringing us out of our technical comfort zones and making the audience consider how the work of digital preservation has larger and potentially more dire effects than we may realize.
Alice Sara Prael is the Digital Accessioning Archivist at Beinecke Rare Book & Manuscript Library at Yale University. She works with born digital archival material through a centralized accessioning service.