Digital Preservation in NYC

This year’s meeting of the Preservation and Archiving Special Interest Group (PASIG) took place this October at the Museum of Modern Art in New York.  PASIG brings together an international community to share successes and challenges of digital preservation, with an emphasis on practical applications and solutions.

The conference was three days long, and kicked off with a day of “Bootcamp/101” sessions, focused on bringing everyone up to speed on what it is we’re preserving and how we can go about building infrastructures to support preservation.  Unfortunately I wasn’t able to arrive until Day 2, but many of the presentation slides are available online at the conference’s figshare page.

I arrived on Thursday morning, ready to jump into a morning of presentations and panel discussions on reproducibility and research data.  Vicky Steeves started the presentations with an explanation of reproducibility vs. replication, a distinction well worth making, especially for those of us with less experience working with research data.

“Reproducibility independently confirms results with the same data (and/or code). Replication independently confirms results with new data (and/or code).”

Steeves pointed out that the concerns of reproducibility are really an iceberg, because the environment in which the research was conducted often goes unnoticed–especially in a technological environment where research tools may rely on a certain version of a browser, hardware, or software tools.  These tools may be updated or change in a way that isn’t immediately visible.

One potential solution to this problem was presented by Fernando Chirigati of New York University.  He introduced the tool ReproZip, which allows the researcher to package the data files, libraries, and environment variables.  ReproZip runs in the background while the experiment is conducted, and documents the variables and technological dependencies that future researchers will need to reproduce the experiment once tools and browsers have changed.  The packaged data and environment variables can be archived, then unpacked by ReproZip for future use.
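To make the idea of documenting an environment concrete, here is a minimal Python sketch. It is not ReproZip and does not use ReproZip's code or API; the function name `describe_environment`, the choice of environment variables, and the manifest layout are all assumptions made for illustration. It only records the interpreter version, the platform, a few environment variables, and a checksum for each data file.

```python
import hashlib
import os
import platform
from pathlib import Path


def describe_environment(data_files, env_vars=("PATH", "LANG")):
    """Return a JSON-serialisable record of a run's inputs and environment.

    Hypothetical helper for illustration only; a real tool such as ReproZip
    captures far more (libraries, system calls, the files actually read).
    """
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # Only record variables that are actually set on this machine.
        "environment": {k: os.environ[k] for k in env_vars if k in os.environ},
        # A SHA-256 checksum lets a future researcher confirm they hold
        # exactly the same data files.
        "files": {
            str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
            for path in data_files
        },
    }
```

Writing the returned dictionary out with `json.dumps` next to the data gives later researchers at least a starting point for rebuilding the environment.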

Both Peter Burnhill from University of Edinburgh and Rachel Trent from George Washington University Libraries discussed the problem of reproducing research reliant on web resources.  Burnhill’s presentation, “Web Today, Gone Tomorrow,” focused on the lack of persistence in web addresses, and the need for ongoing preservation of online articles and other academic resources.  To give an idea of the scope of this problem: 20-30% of referenced URLs are lost within 2 weeks of publication.  Burnhill presented the Hiberlink project, which aims to find solutions for this preservation gap through partnerships with academic publishing outlets.  Rachel Trent’s presentation, “Documenting the Demographic Imagination,” discussed the challenges of preserving social media data for reproducible research.  Given the continued migration from one social media platform to another (MySpace to Facebook to Twitter, etc.), the archivist can’t assume that future researchers will understand the basis of any of these websites.  Trent discussed the use of social media managers and web harvesters to automate the collection of social media data, and what metadata can be automatically extracted using these tools.  Trent and her team are now looking for feedback from the community on what’s missing from their social media metadata, and how researchers want to interact with this metadata.

After a brief lunch break, we dove into the challenges of preserving complex and very large data.  Karen Cariani presented on the public broadcasting media library and archives of WGBH.  Working with audio and video files, the preservation needs are significant and uncompressed preservation masters are very large.  The formats are complicated and proxy files are necessary for access purposes.  Cariani discussed how the HydraDAM2 project worked to fill this preservation gap, by extending the HydraDAM system to work with the Fedora 4 repository and creating a Hydra “head” for digital A/V preservation.   

Ben Fino-Radin continued on the theme of preservation at scale, discussing the creation of workflows for digitized time-based media holdings at the MoMA.  The digital repository uses Archivematica for ingest, Arkivum for storage, and Binder for managing these digital assets.  A single 120-minute film, once restored at 4K resolution, contains 4 terabytes of data, so the workflows and systems for managing these files have to move quickly and efficiently.  This also means that the MoMA must be efficient in prioritizing film digitization efforts.

Day three focused on sustainability, not only sustaining our cultural and scientific heritage through digital preservation, but also sustaining our planet and communities.  Eira Tansey from University of Cincinnati pointed out the obvious but rarely discussed point that archives require energy, digital archives especially so.  She urged the audience to consider the energy required for preservation during their daily work in the archive.  Some common practices of digital preservation may be a wasteful use of resources, such as preserving every derivative file as it is migrated from one format to the next, or considering file compression as the enemy of preservation.  She posted the entire text of her talk online: “The Voice of One Crying Out in the Wilderness: Preservation in the Anthropocene.”

Elvia Arroyo-Ramirez, Processing Archivist for Latin American Manuscript Collections, Princeton University, presented “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files.”  She discussed how the systems we use contain the biases of the people who create them, pointing to systems that require file names be ‘cleaned’ or ‘scrubbed’ to remove ‘illegal characters’ including Spanish-language diacritic glyphs.  When working with a born-digital collection created in another language, those glyphs are vital to the understanding of those records.  She asked the community how we can intervene to make our tools and technologies reflect our mission to preserve the records and ‘do no harm.’  
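The difference between “scrubbing” a file name and preserving it can be shown in a few lines of Python. This is only a sketch under stated assumptions: the function names `ascii_scrub` and `preserve` and the sample file name are hypothetical, and the code is not taken from any of the tools discussed in the talk. The first function mimics the lossy default of stripping everything outside ASCII; the second keeps the diacritics and only normalizes the Unicode form so that equivalent names compare equal.

```python
import unicodedata


def ascii_scrub(name):
    """Lossy default (hypothetical): strip every character outside ASCII."""
    # NFKD splits an accented letter into a base letter plus a combining mark,
    # and the ASCII encode step then silently throws the mark away.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preserve(name):
    """Keep diacritics; only unify the Unicode representation (NFC)."""
    return unicodedata.normalize("NFC", name)
```

With these definitions, `ascii_scrub("Traducción_niños.doc")` returns `"Traduccion_ninos.doc"`, while `preserve` leaves the accents in place and simply makes composed and decomposed spellings of the same name identical.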

The conference was concluded by Ingrid Burrington, neither an archivist nor a digital preservationist, but self-described writer, mapmaker and joke maker, and author of Networks of New York:  An Illustrated Field Guide to Urban Internet Infrastructure.  She discussed the physical infrastructure that makes up the internet and the corporate infrastructures that keep it running. She pointed to social media as crafting communication and products like Google Maps as crafting our understanding of the world’s geography.  Companies like Google can skew their products away from reality–be that the blurring of sensitive government installations or their own data centers. Corporate interest and the public need for information do not always align.

This change of perspective was a great end to the conference, bringing us out of our technical comfort zones and making the audience consider how the work of digital preservation has larger and potentially more dire effects than we may realize.

 


Alice Sara Prael is the Digital Accessioning Archivist at Beinecke Rare Book & Manuscript Library at Yale University.  She works with born-digital archival material through a centralized accessioning service.

Call for Posts: International Perspectives on Digital Preservation

The BloggERS editorial team is planning a series of blog posts to present an international view on digital preservation, and we would like to invite you to participate.

We like to think of our topical blog series as a chance for digital archivists to share information about issues they are facing, solutions they have implemented, and new projects they are working on. We’ve had some great series in the past on digital processing and access, so we thought it might be valuable to get perspectives on digital preservation from various countries and cultures.

We have several goals that we hope the series might reach:

  1. We want to highlight similarities across borders, which will foster information sharing and can lead to fruitful collaborations;
  2. We want to discover differences in practice based on local laws, values, practices, histories; differences in practice give fresh perspective into one’s own work as well as provide new ideas for moving forward;
  3. We want to use the ERS blog to facilitate the development of an international dialogue about the values, technologies, and practices that shape digital preservation needs across the globe;
  4. We hope to encourage future collaborative relationships by giving repositories worldwide a chance to describe their problems and solutions;
  5. We want to offer the blog as a common space for discussions of digital preservation with international points of view.

We want this series of posts to be useful to anyone working anywhere around the globe, not just in the United States. If you’ve run into issues specific to your country or culture and want to describe your issues and share your solutions, or if you’ve got a cool project that might interest an international audience, we’d love to hear from you.

Contact us with post ideas at ers.mailer.blog@gmail.com

Also, check out our Guidelines for Writers.

Get to know the candidates: Greg Wiedeman

The 2016 elections for Electronic Records Section leadership are upon us! Over the next two weeks, we will be presenting additional information provided by the 2016 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2016 ERS elections site. ERS Members: be sure to vote! Polls are open July 8 through July 22!

Candidate name: Gregory Wiedeman

Running for: Steering Committee

What made you decide you wanted to become an archivist?

As a graduate history student without funding, working in archives was an attractive alternative to poverty. Then I found out how awesome a job it is. I was lucky enough to get great hands-on experience on big projects and I found the work more complex, interesting, and enjoyable than my graduate research. I love all of the problem-solving it takes, and making all of the content we have available for public use.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

I really see the role of the section as encouraging communication about techniques and best practices. Digital records have tremendous potential to make our collections much more accessible, but there are major hurdles – many of which can be helped by sharing the work many of us are doing independently. The ERS has already done a great job with this bloggERS! series which has provided a forum for the sharing and discussion of some of the really innovative projects that are pushing the community forward. Yet, I’ve found that many electronic records efforts are smaller, ad hoc, and more continuous than formal or polished. I’d like to find a way to share all the workflows, draft policies, and small scripts so that our peers can reuse and build upon them. In addition to the continuation of the bloggERS! series, I’d like the ERS to look into a mechanism for archivists to self-submit these small, but useful, efforts in a way that promotes permissive reuse and documentation.

What is your favorite GIF?

This time of year it’s always Bartolo:

[animated GIF]

NPR: Will Future Historians Consider These Days The Digital Dark Ages?

Nice article from NPR on digital preservation with some thoughtful comments by our colleagues in archives and museums.
Will Future Historians Consider These Days The Digital Dark Ages?
We are awash in a sea of information, but how do historians sift through the mountain of data? In the future, computer programs will be unreadable, and therefore worthless, to historians.
NPR.ORG|BY MORNING EDITION
Additional discussion of this article can be found on Bert Lyons’s Twitter feed.

Learning a thing or two from Harpo: batch processing

Harpo Shining Shoes

I recently participated in the Fall Symposium of the Midwest Archives Conference, Hard Skills for Managing Digital Collections in Archives. There were a number of excellent tools that the presenters covered including ExifTool, BagIt and MediaInfo as well as new tricks for analyzing/processing data using that old frenemy, Excel. These tools are tremendously helpful to the work of archivists and are getting better all the time. Here at the Carleton College Archives, each one has greatly increased the speed with which we can process electronic accessions.

But if we are going to keep up with the tremendous flood of new electronic records into our archives, our processing programs need to run even faster.  Three commonly used programs (ExifTool, BagIt and DROID) all process collection materials one at a time, and each program also requires additional steps and selection of options to generate the desired output. The convenience of these solutions is often overshadowed by how much time and attention they demand. We need the ability to instruct all these programs to process a whole series of records, producing all the various outputs we want, with one or two steps as opposed to a dozen.

To solve this problem at Carleton, we have created a set of batch processors for programs we regularly use in our archive. For each program in our toolbox we wrote a script using the programming language Python that applies the program not just to a single accession but an entire directory of accessions. Additionally we gave the batch processor instructions to generate all the reports from each program we might need to manage our digital files.
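The general shape of such a batch processor can be sketched in a few lines of Python. This is not the Carleton code; the function name `batch_process`, the report naming, and the folder layout (one subfolder per accession) are assumptions made for illustration. The sketch runs one command-line program against every accession folder in a directory and saves whatever the program prints as a report.

```python
import subprocess
from pathlib import Path


def batch_process(accessions_dir, command, report_dir):
    """Run `command` once per accession folder and save its output as a report.

    `command` is a list such as ["exiftool", "-csv", "-r"]; the path of each
    accession folder is appended as the final argument. Keep `report_dir`
    outside `accessions_dir` so it is not mistaken for an accession.
    """
    accessions_dir = Path(accessions_dir)
    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)

    reports = []
    # Sorting keeps the processing order predictable from run to run.
    for accession in sorted(p for p in accessions_dir.iterdir() if p.is_dir()):
        result = subprocess.run(
            [*command, str(accession)],
            capture_output=True,
            text=True,
            check=False,  # keep going even if one accession fails
        )
        report = report_dir / f"{accession.name}_report.txt"
        report.write_text(result.stdout, encoding="utf-8")
        reports.append(report)
    return reports
```

Assuming ExifTool is installed and on the path, a call like `batch_process("accessions", ["exiftool", "-csv", "-r"], "reports")` would write one CSV-formatted metadata report per accession folder. A fuller version would also record each program's error output and exit code.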

These improvements have made a big difference in our ability to process numerous digital accessions quickly and consistently.  For anyone that would like to try them, all the batch processors are run via the command line (PC Command Prompt or the Mac Terminal) and can be downloaded from our GitHub repositories:

Photo credit: Harpo Shining Shoes animated GIF.  From “A Night in Casablanca” via http://wfmu.org/playlists/shows/51730.

Introducing NEW Steering Committee members!

We are happy to announce the results of the Electronic Records Section elections for 2015:

Chair-elect/Vice Chair: Kyle Henke (DePaul University)
Steering Committee: Ann Cooper (College of William and Mary)
Steering Committee: Carol Kussmann (University of Minnesota)

In order to get to know them, the bloggERS! sent out a three-question interview to each new member:

1. What made you decide you wanted to become an archivist?
2. What is one thing that you’d like to see the Electronic Records Section accomplish during your time on the steering committee?
3. What is your favorite GIF? 

Kyle Henke (DePaul University), Chair-Elect/Vice Chair

Q: What made you decide you wanted to become an archivist?

KH: My interest in archives came from an uncertain search result. As an undergraduate student at San Diego State University, majoring in History, I made a cursory search using the Library ILS and found some results that gave me a location of “Special Collections and Archives; Non-Circulating.” Having no idea what this meant, I sought out the object and had my first experience in an archival reading room. It was awkward, entering a closed-off room unsure of what I was doing. Luckily, the staff was courteous and helpful, making the process manageable. I found using primary sources was enthralling and continued to come back for nearly every paper I had to write, utilizing any primary resource I could that was relevant.

From there I took a second job in the Archives where I was treated like an intern, given projects intended to show me varying aspects of the field so I could see if this was a profession I wanted to pursue. Some of it was routine, but always mentally engaging. Fairly quickly, I was pretty certain this was the right place for me. While content is always king, I find myself interested in the properties and structure of objects. Preserving the continuity and integrity of the object became the prevailing goal, whether it is the physical properties of a paper document or the digital properties of an mp3 file.

Q: What is one thing that you’d like to see the Electronic Records Section accomplish during your time as chair?

KH: For a number of years now I’ve worked in archives, focusing on digital content and systems. I’ve managed digital repositories, cross-walked metadata, developed policies and workflows and so much more. The pivotal component to my growth in the profession has been collaborating with colleagues in the field. I see the purpose of this group as a method to facilitate communication and encourage collaboration across the profession. I would like to continue developing methods of outreach and education for those within the profession and those on the outside. I like the direction the ERS blog (BloggERS) has gone and would like to promote and use this resource to directly involve our community and gain a wider audience.

Additionally, I’m interested in investigating a way to connect one another to a project or idea that would contribute towards collaboration. I know I’ve had ideas for presentations or workshops that are halted as I become uncertain of the next steps or outcome. However, having informal talks with colleagues at work or at conferences allows fresh and different perspectives. Perhaps these informal collaborations could lead beyond discussions and to outcomes such as posters, evaluations, speaking sessions, tutorials, workshops, instruction, etc.

Q: What is your favorite GIF?

https://31.media.tumblr.com/66783baac87573d10a182246f4037e09/tumblr_inline_nqvh9xlzEj1rpcnpz_500.gif

Ann Cooper (College of William and Mary), Steering Committee

Q: What made you decide you wanted to become an archivist?

AC: I decided to become an archivist because my interests and professional strengths fit a lot better with this field than they did with being a history professor. I haven’t regretted it and I’m happy doing what I do now.

Q: What is one thing that you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

AC: I’d like to see us develop and make available some guidelines for training staff in working with electronic material in specific situations or some sample training materials for archivists to use.

Q: What is your favorite GIF?

AC:
https://media.giphy.com/media/3oEduGjJPPpPLGnDO0/giphy.gif

Carol Kussmann (University of Minnesota), Steering Committee

Q: What made you decide you wanted to become an archivist?

CK: As with a fair number of people I run into, I am an accidental archivist. My previous dream job was working in a museum; I was the Assistant Registrar at the Spurlock Museum and I loved every minute of it. One day the archives called and were looking for information on a photograph in their collection. After doing a bit of digging I found the information they were looking for. It is that problem solving that I love. I soon found myself pursuing a Masters of Library and Information Science. After an out-of-state move, I started working with electronic records on the Minnesota Historical Society’s NDIIPP project around preserving and providing access to state government records. It was my job to research, explore and problem solve many different topics relating to digital preservation. After the grant was over, I worked with the Minnesota State Archives to develop their electronic records program.

Once during an annual review focused on electronic records I was asked, “What was something you didn’t do correctly and how did you handle it?” I answered that there were many things we tried when exploring tools to use to assist with electronic records processing, but if it didn’t work, it wasn’t a failure–it was part of the learning experience. You learned from it and moved on to keep looking for a solution to the problem you were trying to solve. It is that exploration that I love and being able to put the successes into practice in my new dream job as a Digital Preservation Analyst at the University of Minnesota Libraries.

Q: What is one thing that you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

CK: I think that many of us are in the same position – we are doing a lot of exploring. We need to share our experiences with each other. Often times we wait until we are done with a project before sharing, and then usually only the “successful” parts of the exploration are shared. We can’t be afraid to share the whole experience. Facilitating ways to do this would be something I would like to see the ERS accomplish. The listserv and blog are good steps, but other methods that allow people to talk more freely or share things that are still in progress would be useful as well.

Q: What is your favorite GIF?

CK: It’s not a GIF, but…
https://wearetheryan.files.wordpress.com/2014/12/never-forget.jpg?w=540&h=319

 

Thank you to all who voted in the 2015 election for the Electronic Records Section Steering Committee members, and thanks to former Steering Committee members whose terms have ended!

The current ERS leadership roster is available here.

A Little Too Personal

The following is a post by John Rees, Archivist and Digital Resources Manager at the National Library of Medicine, based on a breakout session at the ERS meeting of last year’s SAA annual meeting.

One of the breakout sessions at the 2014 ERS section meeting convened around the topic of identifying and redacting personally identifiable information (PII) and personal health information (PHI) from born-digital content. The premise I proposed was, “Health sciences archivists working in the paper world have a relatively easy time of identifying/restricting PII/PHI content. As we move to born-digital collecting we are especially in need of tools and techniques that will allow us to easily identify/restrict/release similar data in electronic form.”

Of course this issue has broader relevance beyond the health science archives, and as we transition from paper-based models of archival processing to data processing models, machines should be able to interpret and act upon various content rules in an automated fashion. The healthcare industry is ahead of the curve in this area, building tools to anonymize any of the eighteen identifiers HIPAA defines as PHI in electronic health record data systems.

Archivists arguably face greater challenges than healthcare workers, sifting through the variety of semantic and unstructured PII found in the various formats traditionally referred to as personal papers, such as recommendation letters, correspondence with sensitive content, publication peer review commentary, etc. Human cognition can learn what these data are and identify them fairly easily during physical processing of paper material, but this requires more effort when triaging unstructured data on poorly labeled media—reading a list of filenames is not sufficient due diligence.

In general, the group felt confident in our ability to collect born-digital material but was much less confident in our ability to provide unmediated access to these records on the open web. Our discussion started off by sharing any tools we knew of that purport to locate PII/PHI in digital archival materials—the list was short:

The strength of these tools is that they can easily and quickly identify logically formatted PII such as social security numbers, email addresses, credit card numbers, phone numbers, and bank account numbers. Their weaknesses include too many false positives, expense of stand-alone proprietary software, narrow use cases, too much item-level manual intervention, and steep learning curves.
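To show why logically formatted PII is the easy case, and where false positives come from, here is a small Python sketch. It does not reproduce any of the tools the group discussed; the patterns, the names `PATTERNS`, `luhn_valid` and `find_pii`, and the US-style number formats are all assumptions made for illustration. Each pattern matches one rigid format, and candidate card numbers are additionally run through the Luhn checksum so that arbitrary digit strings are less likely to be flagged.

```python
import re

# Illustrative patterns only: they assume US-style formatting and will miss
# many real-world variants.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text):
    """Return a dict mapping each PII type to the matching strings in `text`."""
    hits = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if kind == "card":
            # Drop digit runs that cannot be real card numbers.
            matches = [m for m in matches if luhn_valid(m)]
        hits[kind] = matches
    return hits
```

Even with the checksum, a scan like this only narrows the pile for human review. It says nothing about the unstructured PII found in correspondence or peer review commentary.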

The group then talked about access protocols. Identifying PII to restrict requires significant effort, but models for access are almost nonexistent, which complicates the issue when management wants collections to be as immediate and open as possible, the common refrain being, “It’s already digital, so why can’t we put it on the web as soon as it’s acquired?”

The breakout group agreed that from a risk management perspective, outside of manual review and item-level redaction of surrogates, limiting access to data was the easiest solution. Methods of limiting access include:

  • On site-only access via read-only physical media or a disk image
  • On site online access via un-networked computer
  • Authentication paywalls or read-only virtual reading rooms on the open web

In the end we recognized the problem is complex and there are no magic solutions. However, each participant went away with the goal of making incremental progress toward a solution this year.

Welcome to bloggERS!

This is the first post on the SAA Electronic Records Section blog!

We hope that you will find bloggERS! a helpful source of information on topics related to digital preservation, electronic records management, and other important issues for those working with born-digital and digitized content.

Keep an eye on this site in the coming weeks for a variety of different types of posts, including:

  • Topics for discussion and collaboration, to help archivists communicate and learn from each other and those outside the field.
  • Aggregations of news, information, and resources on electronic records issues
  • New content including case studies, interviews, surveys, reviews, and other writings of interest to archivists and electronic records professionals

And if you’d like to get involved, we’re looking for volunteers to join the bloggERS! editorial committee! This volunteer body is made up of members of the Electronic Records Section, the ERS steering committee, and the ERS Communications Liaison, and is responsible for managing the release of content on the bloggERS! site.

What if you’d rather be writing for bloggERS? Well, we’re looking for content contributors too! Check out the ERS Blog Guidelines for Writers for content and formatting recommendations, and feel free to get in touch with us if you have any questions. We’d love to feature your case study, unpublished article, or feature on the bloggERS! site.

This is just the start of the conversation. We look forward to hearing your thoughts on the site as it develops!