Securing Our Digital Legacy: An Introduction to the Digital Preservation Coalition

by Sharon McMeekin, Head of Workforce Development


Nineteen years ago, the digital preservation community gathered in York, UK, for the Cedars Project’s Preservation 2000 conference. It was here that the first seeds were sown for what would become the Digital Preservation Coalition (DPC). Guided by Neil Beagrie, then of King’s College London and Jisc, work to establish the DPC continued over the next 18 months and, in 2002, representatives from 7 organizations signed the articles that formally constituted the DPC.

In the 17 years since its creation, the DPC has gone from strength to strength, the last 10 years under the leadership of current Executive Director, William Kilbride. The past decade has been a particular period of growth, as shown by the rise in the staff complement from 2 to 7. We now have more than 90 members who represent an increasingly diverse group of organizations from 12 countries across sectors including cultural heritage, higher education, government, banking, industry, media, research and international bodies.

DPC staff, chair, and president

Our mission at the DPC is to:

[…] enable our members to deliver resilient long-term access to digital content and services, helping them to derive enduring value from digital assets and raising awareness of the strategic, cultural and technological challenges they face.

We work to achieve this through a broad portfolio of work across six strategic areas of activity: Community Engagement, Advocacy, Workforce Development, Capacity Building, Good Practice and Standards, and Management and Governance. Everything we do is member-driven and they guide our activities through the DPC Board, Representative Council, and Sub-Committees which oversee each strategic area.

Although the DPC is driven primarily by the needs of our members, we do also aim to contribute to the broader digital preservation community. As such, many of the resources we develop are made publicly available. In the remainder of this blog post, I’ll be taking a quick look at each of the DPC’s areas of activity and pointing out resources you might find useful.

1 | Community Engagement

First up is our work in the area of Community Engagement. Here our aim is to enable “a growing number of agencies and individuals in all sectors and in all countries to participate in a dynamic and mutually supportive digital preservation community”. Collaboration is a key to digital preservation success, and we hope to encourage and support it by helping build an inclusive and active community. An important step in achieving this aim was the publication of our ‘Inclusion and Diversity Policy’ in 2018.

Webinars are key to building community engagement amongst our members. We invite speakers to talk to our members about particular topics and share experiences through case studies. These webinars are recorded and made available for members to watch at a later date. We also run a monthly ‘Members Lounge’ to allow informal sharing of current work and discussion of issues as they arise and, on the public end of the website, a popular blog, covering case studies, new innovations, thought pieces, recaps of events and more.

2 | Advocacy

Our advocacy work campaigns “for a political and institutional climate more responsive and better informed about the digital preservation challenge”, as well as “raising awareness about the new opportunities that resilient digital assets create”. This tends to happen on several levels, from enabling and aiding members’ advocacy efforts within their own organizations, through raising legislators’ and policy makers’ awareness of digital preservation, to educating the wider populace.

To help those advocating for digital preservation within their own context, we have recently published our Executive Guide. The Guide provides a grab bag of statements and facts to help make the case for digital preservation, including key messages, motivators, opportunities to be gained and risks faced. We welcome any suggestions for additions or changes to this resource!

Our longest-running advocacy activity is the biennial Digital Preservation Awards, last held in 2018. The Awards aim to celebrate excellence and innovation in digital preservation across a range of categories. This high-profile event has been joined in recent years by two other activities with a broad remit and engagement. The first is the Bit List of Digitally Endangered Species, which highlights at-risk digital information, showing both where preservation work is needed and where efforts have been successful. The second is World Digital Preservation Day (WDPD), a day to showcase digital preservation around the globe. Response to WDPD since its inauguration in 2017 has been exceptionally positive. There have been tweets, blogs, events, webinars, and even a song and dance! This year WDPD is scheduled for 7th November, and we encourage everyone to get involved.

The nominees, winners, and judges for the 2018 Digital Preservation Awards

3 | Workforce Development

Workforce Development activities at the DPC focus on “providing opportunities for our members to acquire, develop and retain competent and responsive workforces that are ready to address the challenges of digital preservation”. There are many threads to this work, but key for our members are the scholarships we provide through our Career Development Fund and free access to the training courses we run.

At the moment we offer three training courses: ‘Getting Started with Digital Preservation’, ‘Making Progress with Digital Preservation’ and ‘Advocacy for Digital Preservation’, but we have plans to expand the portfolio in the coming year. All of our training courses are available to non-members for a modest fee, but at the moment are mostly held face to face in the UK and Ireland. A move to online training provision is, however, planned for 2020. We are also happy to share training resources and have set up a Slack workspace to enable this and greater collaboration with regards to digital preservation training.

Other resources under our Workforce Development heading that may prove helpful include the ‘Digital Preservation Handbook’, a free online publication covering digital preservation in the broadest sense. The Handbook aims to be a comprehensive guide for those starting with digital preservation, whilst also offering links to additional resources. The content for the Handbook was crowd-sourced from experts and has all been peer reviewed. Another useful and slightly less well-known series of publications is our ‘Topical Notes’, originally funded by the National Archives of Ireland and intended to introduce key digital preservation issues to a non-specialist audience (particularly record creators). Each note is only two pages long and jargon-free, so it is a great resource to help raise awareness.

4 | Capacity Building

Perhaps the biggest area of DPC work covers Capacity Building, that is “supporting and assuring our members in the delivery and maintenance of high quality and sustainable digital preservation services through knowledge exchange, technology watch, research and development.” This can take the form of direct member support, helping with tasks such as policy development and procurement, as well as participation in research projects.

Our more advanced publication series, the Technology Watch Reports, also sits under the Capacity Building heading. Written by experts and peer reviewed, each report takes a deeper dive into a particular digital preservation issue. Our latest report on Email Preservation is currently available for member preview but will be publicly released shortly. Some other ‘classics’ include Preserving Social Media, Personal Digital Archiving, and the always popular The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition). (I always tell those new to OAIS to start here rather than the 200+ dry pages of the full standard!)

We also run around six thematic Briefing Day events a year on topical issues. As with the training, these are largely held in the UK and Ireland, but they are now also live-streamed for members. We support a number of Thematic Task Forces and Working Groups, with the ‘Web Archiving and Preservation Working Group’ being particularly active at the moment.

DPC members engaged in a brainstorming session

5 | Good Practice and Standards

Our Good Practice and Standards stream of work was a new addition as of the publication of our latest Strategic Plan (2018-22). Here we are contributing work towards “identifying and developing good practice and standards that make digital preservation achievable, supporting efforts to ensure services are tightly matched to shifting requirements.”

We hope this work will allow us to input into standards with the needs of our members in mind and facilitate the sharing of good practice that already happens across the coalition. This has already borne fruit in the shape of the forthcoming DPC Rapid Assessment Model, a maturity model to help with benchmarking digital preservation progress within your organization. You can read a bit more about it in this blog post by Jen Mitcham and the model will be released publicly in late September.

We also work with vendors through our Supporter Program and events like our ‘Digital Futures’ series to help bridge the gap between practice and solutions.

6 | Management and Governance

Our final stream of work is less focused on digital preservation and instead on “ensuring the DPC is a sustainable, competent organization focussed on member needs, providing a robust and trusted platform for collaboration within and beyond the Coalition.” This obviously relates to both the viability of the organization and good governance. It is essential that everything we do is transparent and that the members can both direct what we do and ensure accountability.

The Future

Before I depart, I thought I would share a little bit about some of our plans for the future. In the next few years we’ll be taking steps to further internationalize as an organization. At the moment our membership is roughly 75% UK and Ireland and 25% international, but those numbers are gradually moving closer and we hope that continues. With that in mind we will be investigating new ways to deliver services and resources online, as well as in languages beyond English. We’re starting this year with the publication of our prospectus in German, French and Spanish.

We’re also beginning to look forward to our 20th anniversary in 2022. It’s a Digital Preservation Awards Year, so that’s reason enough for a celebration, but we will also be welcoming the digital preservation community to Glasgow, Scotland, as hosts of iPRES 2022. Plans are already afoot for the conference, and we’re excited to make it a showcase for both the community and one of our home cities. Hopefully we’ll see you there, but I encourage you to make use of our resources and to get in touch soon!

Access our Knowledge Base: https://www.dpconline.org/knowledge-base

Follow us on Twitter: https://twitter.com/dpc_chat

Find out how to join us: https://www.dpconline.org/about/join-us


Sharon McMeekin is Head of Workforce Development with the Digital Preservation Coalition and leads on work including training workshops and the DPC’s scholarship program. She is also Managing Editor of the ‘Digital Preservation Handbook’. With Master’s degrees in Information Technology and Information Management and Preservation, both from the University of Glasgow, Sharon is an archivist by training, specializing in digital preservation. She is also an ILM qualified trainer. Before joining the DPC she spent five years as Digital Archivist with RCAHMS. As an invited speaker, Sharon presents on digital preservation at a wide variety of training events, conferences and university courses.

The Theory and Craft of Digital Preservation: An interview with Trevor Owens

BloggERS! editor Dorothy Waugh recently interviewed Trevor Owens, Head of Digital Content Management at the Library of Congress, about his recent (and award-winning) book, The Theory and Craft of Digital Preservation.


Who is this book for and how do you imagine it being used?

I attempted to write a book that would be engaging and accessible to anyone who cares about long-term access to digital content and wants to devote time and energy to helping ensure that important digital content is not lost to the ages. In that context, I imagine the primary audience as current and emerging professionals that work to ensure enduring access to cultural heritage: archivists, librarians, curators, conservators, folklorists, oral historians, etc. With that noted, I think the book can also be of use to broader conversations in information science, computer science and engineering, and the digital humanities. 

Tell us about the title of the book and, in particular, your decision to use the word “craft” to describe digital preservation.

The words “theory” and “craft” in the title of the book forecast both the structure and the two central arguments that I advance in the book. 

The first chapters focus on theory. This includes tracing the historical lineages of preservation in libraries, archives, museums, folklore, and historic preservation. I then move to explore work in new media studies and platform studies to round out a nuanced understanding of the nature of digital media. I start there because I think it’s essential that cultural heritage practitioners moor their own frameworks and approaches to digital preservation in a nuanced understanding of the varied and historically contingent nature of preservation as a concept and the complexities of digital media and digital information. 

The latter half of the book is focused on what I describe as the “craft” of digital preservation. My use of the term craft is designed to intentionally challenge the notion that work in digital preservation should be understood as “a science.” Given the complexities of both what counts as preservation in a given context and the varied nature of digital media, I believe it is essential that we explicitly distance ourselves from many of the assumptions and baggage that come along with the ideology of “digital.” 

We can’t build some super system that just solves digital preservation. Digital preservation requires making judgement calls. Digital preservation requires the applied thinking and work of professionals. Digital preservation is not simply a technical question; instead, it involves understanding the nature of the content that matters most to an intended community and making judgement calls about how best to mitigate risks of potential loss of access to that content. As a result of my focus on craft, I offer less of a “this is exactly what one should do” approach, and more of an invitation to join the community of practice that is developing knowledge and honing and refining their craft.

Reading the book, I was so happy to see you make connections between the work that we do as archivists and digital preservation. Can you speak to that relationship and why you think it is important?

Archivists are key players in making preservation happen and the emergence of digital content across all kinds of materials and media that archivists work with means that digital preservation is now a core part of the work that archivists do. 

I organize a lot of my discussion about the craft of digital preservation around archival concepts as opposed to library science or curatorial practices. For example, I talk about arrangement and description. I also draw from ideas like MPLP as key concepts for work in digital preservation and from work on community archives. 

Old Files. From XKCD: webcomic of romance, sarcasm, math, and language. 2014

Broadly speaking, in the development of digital media, I see a growing context collapse between formats that had been distinct in the past. That is, conservation of oil paintings, management and preservation of bound volumes, and organizing and managing heterogeneous sets of records have some strong similarities but there are also a lot of differences. The born-digital incarnations of those works (digital art, digital publishing, and digital records) are all made up of digital information and file formats, and face a related set of digital preservation issues.

With that note, I think archival practice tends to be particularly well-suited for dealing with the nature of digital content. Archives have long dealt with the problem of scale that is now intensified by digital data. At the same time, archivists have also long dealt with hybrid collections and complex jumbles of formats, forms, and organizational structures, which is also increasingly the case for all types of forms that transition into born-digital content. 

You emphasize that the technical component of digital preservation is sometimes prioritized over social, ethical, and organizational components. What are the risks implicit in overlooking these other important components?

Digital preservation is not primarily a technical problem. The ideology of “digital” is that things should be faster, cheaper, and automatic. The ideology of “digital” suggests that we should need less labor, less expertise, and fewer resources to make digital stuff happen. If we let this line of thinking infect our idea of digital preservation, we are going to see major losses of important data, we will see major failures to respect ethical and privacy issues relating to digital content, and lots of money will be spent on work that fails to get us the results that we want.

In contrast, when we take as a starting point that digital preservation is about investing resources in building strong organizations and teams who participate in the community of practice and work on the complex interactions that emerge between competing library and archives values then we have a chance of both being effective but also building great and meaningful jobs for professionals.

If digital preservation work is happening in organizations that have an overly technical view of the problem, it is happening despite, not because of, their organization’s approach. That is, there are people doing the work, they just likely aren’t getting credit and recognition for doing that work. Digital preservation happens because of people who understand that the fundamental nature of the work requires continual efforts to get enough resources to meaningfully mitigate risks of loss, and thoughtful decision making about building and curating collections of value to communities.

Considerations related to access and discovery form a central part of the book and you encourage readers to “Start simple and prioritize access,” an approach that reminded me of many similar initiatives focused on getting institutions started with the management and preservation of born-digital archives. Can you speak to this approach and tell us how you see the relationship between preservation and access?

A while back, OCLC ran an initiative called “walk before you run,” focused on working with digital archives and digital content. I know it was a major turning point for helping the field build our practices. Our entire community is learning how to do this work and we do it together. We need to try things and see which things work best and which don’t. 

It’s really important to prioritize access in this work. Preservation is fundamentally about access in the future. The best way you know that something will be accessible in the future is if you’re making it accessible now. Then your users will help you. They can tell you if something isn’t working. The more that we can work end-to-end, that is, that we accession, process, arrange, describe, and make available digital content to our users, the more that we are able to focus on how we can continually improve that process end-to-end. Without having a full end-to-end process in place, it’s impossible to zoom out and look at that whole sequence of processes to start figuring out where the bottlenecks are and where you need to focus on working to optimize things. 


Dr. Trevor Owens is a librarian, researcher, policy maker, and educator working on digital infrastructure for libraries. Owens serves as the first Head of Digital Content Management for Library Services at the Library of Congress. He previously worked as a senior program administrator at the United States Institute of Museum and Library Services (IMLS) and, prior to that, as a Digital Archivist for the National Digital Information Infrastructure and Preservation Program and as a history of science curator at the Library of Congress. Owens is the author of three books, including The Theory and Craft of Digital Preservation and Designing Online Communities: How Designers, Developers, Community Managers, and Software Structure Discourse and Knowledge Production on the Web. His research and writing have been featured in Curator: The Museum Journal, Digital Humanities Quarterly, The Journal of Digital Humanities, D-Lib, Simulation & Gaming, Science Communication, New Directions in Folklore, and American Libraries. In 2014 the Society of American Archivists granted him the Archival Innovator Award, presented annually to recognize the archivist, repository, or organization that best exemplifies the “ability to think outside the professional norm.”

A Conversation with Annalise Berdini, Digital Archivist at the Seeley G. Mudd Manuscript Library, Princeton University

Interview conducted with Annalise Berdini in May 2019 by Hannah Silverman and Tamar Zeffren

This is the eighth post in a new series of conversations between emerging professionals and archivists actively working with digital materials.


Annalise Berdini is the Digital Archivist at the Seeley G. Mudd Manuscript Library at Princeton University, a position she has held since January 2018. She is responsible for the ongoing management of the University Archives Digital Curation Program, as well as managing a collection of web archives and assisting with reference services.

Annalise’s first post-graduate school position was as a manuscripts and archives processor at the University of California, San Diego (UCSD). While she was working at UCSD, universities and archives were slowly starting to see the need for a dedicated digital archivist position. When the Special Collections department at UCSD created their first digital archivist position, Annalise applied and got the job. She explains that a good deal of her work there, and at Princeton, is graciously supported by a community of digital archivists solving similar challenges in other institutions.

As Annalise has now held a digital archivist role at two different institutions, both universities, we were interested to hear her perspectives on how colleagues and researchers have understood – or misunderstood – her role. “Because I have digital in my job title,” she noted, “people interpret that in a lot of very wide and broad ways. Really digital archives is still an emerging field…there are so many questions to answer, and it’s fun to investigate that aspect of the field.”

Given prevailing concerns among institutional archives about preserving and processing legacy media, we were keenly interested in hearing Annalise’s insights about securing stakeholder buy-in to develop a digital archives program.

“It’s a struggle everywhere,” she acknowledges. Presently, Princeton’s efforts to build up a more robust digital preservation program have led the University to a partnership with a UK-based company called Arkivum, which offers digital preservation, storage, maintenance, auditing and reporting modules and has the capacity to incorporate services from Archivematica and create a customized digital storage solution for Princeton.

“We’ve been lucky here [at Mudd]. We’re getting this great system. There is buy-in and there seems to be a pretty strong push right now. For us, the most compelling argument we’ve had is that we are mandated to collect student materials and student records that will not exist anywhere else unless we take them. The school has to keep those records, there’s not an option. Emphasizing how easily that content could be lost without a proper digital preservation system in place was very compelling to people who weren’t necessarily aware of the fact that hard drives sitting on a shelf are really not acceptable storage choices and options.”

Annalise has also found that deploying some compelling statistics can aid in building awareness around digital archives needs. In discussions about how rapidly materials can degrade, Annalise likes to cite a 2013 Western Archives article, “Capturing and Processing Born-Digital Files in the STOP AIDS Project Records,” which showcases findings that out of a vast collection of optical storage media, “only 10% of these hundreds of DVDs were really able to be recovered, whereas, strangely, a lot of the floppy disks were better and easier to recover…I think emphasizing how fragile digital content is [can help people understand] how easily it will corrupt without you even knowing it.”

Equally important to generating momentum for such programs are the direct relationships Annalise cultivates with colleagues, within and without the archives. “My boss was really instrumental in the process, and the head of library IT helped me navigate getting approvals from the University as a whole and the University IT department.”

The complex process of sustaining and innovating a digital archives infrastructure provides ongoing opportunities for Annalise to “solve puzzles” and to unite colleagues in confronting the challenges of documenting and preserving born-digital heritage: “I have focused on trying to find one person who is maybe a level above me and to connect with them and then hopefully build up a network within my institution to build some groundswell.”


Hannah Silverman

Tamar Zeffren

Hannah Silverman and Tamar Zeffren both work at JDC Archives. Tamar is the Archival Collections Manager. Hannah is the Digitization Project Specialist and also works independently as a photo archivist. Both received SAA’s DAS certification.

PASIG (Preservation and Archiving Special Interest Group) 2019 Recap

by Kelly Bolding

PASIG 2019 met the week of February 11th at El Colegio de México (commonly known as Colmex) in Mexico City. PASIG stands for Preservation and Archiving Special Interest Group, and the group’s meeting brings together an international group of practitioners, industry experts, vendors, and researchers to discuss practical digital preservation topics and approaches. This meeting was particularly special because it was the first time the group convened in Latin America (past meetings have generally been held in Europe and the United States). Excellent real-time bilingual translation for presentations given in both English and Spanish enabled conversations across geographical and linguistic boundaries and made room to center Latin American preservationists’ perspectives and transformative post-custodial archival practice.

Perla Rodriguez of the Universidad Nacional Autónoma de México (UNAM) discusses an audiovisual preservation case study.

The conference began with broad overviews of digital preservation topics and tools to create a common starting ground, followed by more focused deep-dives on subsequent days. I saw two major themes emerge over the course of the week. The first was the importance of people over technology in digital preservation. From David Minor’s introductory session to Isabel Galina Russell’s overview of the digital preservation landscape in Mexico, presenters continuously surfaced examples of the “people side” of digital preservation (think: preservation policies, appraisal strategies, human labor and decision-making, keeping momentum for programs, communicating to stakeholders, ethical partnerships). One point that struck me during the community archives session was Verónica Reyes-Escudero’s discussion of “cultural competency as a tool for front-end digital preservation.” By conceptualizing interpersonal skills as a technology for facilitating digital preservation, we gain a broader and more ethically grounded idea of what it is we are really trying to do by preserving bits in the first place. Software and hardware are part of the picture, but they are certainly not the whole view.

The second major theme was that digital preservation is best done together. Distributed digital preservation platforms, consortial preservation models, and collaborative research networks were also well-represented by speakers from LOCKSS, Texas Digital Library (TDL), Duraspace, Open Preservation Foundation, Software Preservation Network, and others. The takeaway from these sessions was that the sheer resource-intensiveness of digital preservation means that institutions, both large and small, are going to have to collaborate in order to achieve their goals. PASIG seemed to be a place where attendees could foster and strengthen these collective efforts. Throughout the conference, presenters also highlighted failures of collaborative projects and the need for sustainable financial and governance models, particularly in light of recent developments at the Digital Preservation Network (DPN) and Digital Public Library of America (DPLA). I was particularly impressed by Mary Molinaro’s honest and informative discussion about the factors that led to the shuttering of DPN. Molinaro indicated that DPN would soon be publishing a final report in order to transparently share their model, flaws and all, with the broader community.

Touching on both of these themes, Carlos Martínez Suárez of Video Trópico Sur gave a moving keynote about his collaboration with Natalie M. Baur, Preservation Librarian at Colmex, to digitize and preserve video recordings he made while living with indigenous groups in the Mexican state of Chiapas. The question and answer portion of this session highlighted some of the ethical issues surrounding rights and consent when providing access to intimate documentation of people’s lives. While Colmex is not yet focusing on access to this collection, it was informative to hear Baur and others talk a bit about the ongoing technical, legal, and ethical challenges of a work-in-progress collaboration.

Presenters also provided some awesome practical tools for attendees to take home with them. One of the many great open resources session leaders shared was Frances Harrell (NEDCC) and Alexandra Chassanoff (Educopia)’s DigiPET: A Community Built Guide for Digital Preservation Education + Training Google document, a living resource for compiling educational tools that you can add to using this form. Julian Morley also shared a Preservation Storage Cost Model Google sheet that contains a template with a wealth of information about estimating the cost of different digital preservation storage models, including comparisons for several cloud providers. Amy Rudersdorf (AVP), Ben Fino-Radin (Small Data Industries), and Frances Harrell (NEDCC) also discussed helpful frameworks for conducting self-assessments.

Selina Aragon, Daina Bouquin, Don Brower, and Seth Anderson discuss the challenges of software preservation.

PASIG closed out by spending some time on the challenges involved with preserving emerging and complex formats. On the last afternoon of sessions, Amelia Acker (University of Texas at Austin) spoke about the importance of preserving APIs, terms of service, and other “born-networked” formats when archiving social media. She was followed by a panel of software preservationists who discussed different use cases for preserving binaries, source code, and other software artifacts.

Conference slides are all available online.

Thanks to the wonderful work of the PASIG 2019 steering, program, and local arrangements committees!


Kelly Bolding is the Project Archivist for Americana Manuscript Collections at Princeton University Library, as well as the team leader for bloggERS! She is interested in developing workflows for processing born-digital and audiovisual materials and making archival description more accurate, ethical, and inclusive.

GVSU Scripted IR Curation in the Cloud

by Matt Schultz and Kyle Felker

This is the seventh post in the bloggERS Script It! Series.


In 2016, bepress launched a new service to assist subscribers of Digital Commons with archiving the contents of their institutional repository. Using Amazon Web Services (AWS), this new service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket setup that is hosted by the institution. Subscribers thus control a real-time copy of all their institutional repository content outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket, all on their own schedules, and for whatever purposes they deem appropriate.

Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.

Leveraging the ease of launching Amazon’s EC2 server instances and the AWS command line interface (CLI), GVSU Libraries was able to readily configure an EC2 “curation server” that syncs copies of their Digital Commons data directly from S3. On that server, the installed tools mentioned above build preservation packages and send them back to S3 and to Glacier for nearline and long-term storage. The entire process now begins and ends in the Cloud, rather than involving the download of data to a local workstation for processing.

Creating a digital preservation copy of GVSU’s ScholarWorks files involves four steps:

  1. Syncing data from S3 to EC2:  No processing can actually be done on files in-place on S3, so they must be copied to the EC2 “curation server”. As part of this process, we wanted to reorganize the files into logical directories (alphabetized), so that it would be easier to locate and process files, and to better organize the virus reports and the “bags” the process generates.
  2. Virus and format reports with Brunnhilde:  The synced files are then run through Brunnhilde’s suite of tools.  Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files.  The reports stay with the files as they move through the process.
  3. Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool.  This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
  4. Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage.  From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.
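
As a rough illustration of the lifecycle rule in step 4, the sketch below builds an S3 lifecycle configuration in Python. The rule ID, the empty prefix, and the 90-day figure standing in for "quarterly" are assumptions for illustration, not GVSU's actual settings. S3 lifecycle transitions are driven by object age in days rather than by a calendar schedule, so a quarterly migration is approximated here with an age threshold.

```python
import json


def glacier_lifecycle(days: int = 90, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that moves objects to Glacier.

    `days` and `prefix` are illustrative assumptions, not GVSU's settings.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": "bags-to-glacier",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix covers the whole bucket
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_lifecycle()
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call.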

After the process has been completed for the first time, new files are incorporated and re-synced on a quarterly basis to the BagIt data directories and re-checked with Brunnhilde.  The BagIt metadata must then be updated and re-verified using BagIt, and the changes synced to the destination S3 bucket.
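
To make the "metadata" concrete: a BagIt payload manifest is a text file that lists a checksum and a relative path for every file under the bag's data directory, so adding files means the manifest has to be regenerated. The sketch below does this with the Python standard library only. It is a simplified illustration of the idea, not the BagIt tool or the scripts in the repository, and `write_manifest` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> list[str]:
    """Rewrite manifest-sha256.txt so it matches the current contents of data/."""
    data_dir = bag_dir / "data"
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Manifest paths are relative to the bag root and use forward slashes.
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

A real bag also carries a `bagit.txt` declaration and usually tag manifests, which a BagIt implementation maintains alongside the payload manifest.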

Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).

These scripts can be found at our GitHub repo: https://github.com/gvsulib/curationscripts

Process_backup is the main script. It handles each of the four processing stages outlined above.  As it does so, it stores the output of those tasks in log files so they can be examined later.  In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.
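
The pattern described here (run each stage, keep its output in a log file, then notify staff) can be sketched in Python as follows. This is not the Process_backup script itself, which is a shell script; the function names, log layout, and message wording are assumptions, and the notification is only constructed, since sending it would depend on a local mail setup.

```python
import subprocess
import sys
from email.message import EmailMessage
from pathlib import Path


def run_stage(name: str, command: list[str], log_dir: Path) -> bool:
    """Run one stage and save its combined output to <log_dir>/<name>.log."""
    log_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
        text=True,
    )
    (log_dir / f"{name}.log").write_text(result.stdout, encoding="utf-8")
    return result.returncode == 0


def run_pipeline(stages: list[tuple[str, list[str]]], log_dir: Path) -> list[str]:
    """Run stages in order, stop at the first failure, return completed names."""
    completed = []
    for name, command in stages:
        if not run_stage(name, command, log_dir):
            break
        completed.append(name)
    return completed


def build_notification(completed: list[str], total: int,
                       sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) a status message for curation staff."""
    message = EmailMessage()
    message["Subject"] = f"Curation run: {len(completed)} of {total} stages completed"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("Completed stages: " + ", ".join(completed))
    return message
```

Stopping at the first failed stage is a design choice in this sketch: it avoids bagging files that were never virus-checked.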

After the first time the process is run, the metadata that BagIt generates has to be updated to reflect new data.  The version of BagIt we are using (Python) can’t do this from the command line, but it does have an API with a command that will update existing “bag” metadata. So, we created a small Python script to do this (regen_BagIt_manifest.py).  The shell script invokes this script at the third stage if bags have previously been created.

Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.

GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.
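
The core of such a validation audit is a comparison between a bag's payload and its checksum manifest. The sketch below shows that comparison with the standard library, assuming a `manifest-sha256.txt` file whose lines each hold a checksum followed by a path relative to the bag root. It illustrates the concept only; it is not the planned bagit_check.py, and it leaves out retrieval from S3.

```python
import hashlib
from pathlib import Path


def audit_bag(bag_dir: Path) -> dict[str, list[str]]:
    """Compare data/ against manifest-sha256.txt and report discrepancies."""
    expected = {}
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, relative = line.split(maxsplit=1)
            expected[relative] = checksum

    report = {"missing": [], "altered": [], "unlisted": []}
    for relative, checksum in sorted(expected.items()):
        path = bag_dir / relative
        if not path.is_file():
            report["missing"].append(relative)
        # read_bytes() loads the whole file; fine for a sketch, not for huge files
        elif hashlib.sha256(path.read_bytes()).hexdigest() != checksum:
            report["altered"].append(relative)

    # Files present in data/ that the manifest does not mention
    for path in sorted((bag_dir / "data").rglob("*")):
        relative = path.relative_to(bag_dir).as_posix()
        if path.is_file() and relative not in expected:
            report["unlisted"].append(relative)
    return report
```

A bag passes this audit when all three lists in the report are empty.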


Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.

Kyle Felker was born feral in the swamps of Louisiana, and a career in technology librarianship didn’t do much to domesticate him.  He currently works as an Application Developer at GVSU Libraries.

DLF Forum & Digital Preservation 2017 Recap

By Kelly Bolding


The 2017 DLF Forum and NDSA’s Digital Preservation took place this October in Pittsburgh, Pennsylvania. Each year the DLF Forum brings together a variety of digital library practitioners, including librarians, archivists, museum professionals, metadata wranglers, technologists, digital humanists, and scholars in support of the Digital Library Federation’s mission to “advance research, learning, social justice, and the public good through the creative design and wise application of digital library technologies.” The National Digital Stewardship Alliance follows up the three-day main forum with Digital Preservation (DigiPres), a day-long conference dedicated to the “long-term preservation and stewardship of digital information and cultural heritage.” While there were a plethora of takeaways from this year’s events for the digital archivist community, for the sake of brevity, this recap will focus on a few broad themes, followed by some highlights related to electronic records specifically.

As an early career archivist and a first-time DLF/DigiPres attendee, I was impressed by the DLF community’s focus on inclusion and social justice. While technology was central to all aspects of the conference, the sessions centered the social and ethical aspects of digital tools in a way that I found both refreshing and productive. (The theme for this year’s DigiPres was, in fact, “Preservation is Political.”) Rasheedah Phillips, a Philadelphia-based public interest attorney, activist, artist, and science fiction writer opened the forum with a powerful keynote about the Community Futures Lab, a space she co-founded and designed around principles of Afrofuturism and Black Quantum Futurism. By presenting an alternate model of archiving deeply grounded in the communities affected, Phillips’s talk and Q&A responses brought to light an important critique of the restrictive nature of archival repositories. I left Phillips’s talk thinking about how we might allow the liberatory “futures” she envisions to shape how we design online spaces for engaging with born-digital archival materials, as opposed to modeling these virtual spaces after the physical reading rooms that have alienated many of our potential users.

Other conference sessions echoed Phillips’s challenge to archivists to better engage and center the communities they document, especially those who have been historically marginalized. Ricky Punzalan noted in his talk on access to dispersed ethnographic photographs that collaboration with documented communities should now be a baseline expectation for all digital projects. Rosalie Lack and T-Kay Sangwand spoke about UCLA’s post-custodial approach to ethically developing digital collections across international borders using a collaborative partnership framework. Martha Tenney discussed concrete steps taken by archivists at Barnard College to respect the digital and emotional labor of students whose materials the archives is collecting to fill in gaps in the historical record.

Eira Tansey, Digital Archivist and Records Manager at the University of Cincinnati and organizer for Project ARCC, gave her DigiPres keynote about how our profession can develop an ethic of environmental justice. Weaving stories about the environmental history of Pittsburgh throughout her talk, Tansey called for archivists to commit firmly to ensuring the preservation and usability of environmental information. Related themes of transparency and accountability in the context of preserving and providing access to government and civic data (which is nowadays largely born-digital) were also present throughout the conference sessions. Regarding advocacy and awareness initiatives, Rachel Mattson and Brandon Locke spoke about Endangered Data Week, and several sessions discussed the PEGI Project. Others presented on the challenges of preserving born-digital civic and government information, including how federal institutions and smaller universities are tackling digital preservation given their often limited budgets, as well as how repositories are acquiring and preserving born-digital congressional records.

Collaborative workflow development for born-digital processing was another theme that emerged in a variety of sessions. Annalise Berdini, Charlie Macquarie, Shira Peltzman, and Kate Tasker, all digital archivists representing different University of California campuses, spoke about their process in coming together to create a standardized set of UC-wide guidelines for describing born-digital materials. Representatives from the OSSArcFlow project also presented some initial findings regarding their research into how repositories are integrating open source tools including BitCurator, Archivematica, and ArchivesSpace within their born-digital workflows; they reported on concerns about the scalability of various tools and standards, as well as desires to transition from siloed workflows to a more holistic approach and to reduce the time spent transforming the output of one tool to be compatible with another tool in the workflow. Elena Colón-Marrero of the Computer History Museum’s Center for Software History provided a thorough rundown of building a software preservation workflow from the ground up, from inventorying software and establishing a controlled vocabulary for media formats to building a set of digital processing workstations, developing imaging workflows for different media formats, and eventually testing everything out on a case study collection (and she kindly placed her whole talk online!).

Also during the forum, the DLF Born-Digital Access Group met over lunch for an introduction and discussion. The meeting was well-attended, and the conversation was lively as members shared their current born-digital access solutions, both pretty and not so pretty (but never perfect); their wildest hopes and dreams for future access models; and their ideas for upcoming projects the group could tackle together. While technical challenges certainly figured into the discussion about impediments to providing better born-digital access, many of the problems participants reported had to do with their institutions being unwilling to take on perceived legal risks. The main action item that came out of the meeting is that the group plans to take steps to expand NDSA’s Levels of Preservation framework to include Levels of Access, as well as corresponding tiers of rights issues. The goal would be to help archivists assess the state of existing born-digital access models at their institutions, as well as give them tools to advocate for more robust, user-friendly, and accessible models moving forward.

For additional reports on the conference, reflections from several DLF fellows are available on the DLF blog. In addition to the sessions I mentioned, there are plenty more gems to be found in the openly available community notes (DLF, DigiPres) and OSF Repository of slides (DLF, DigiPres), as well as in the community notes for the Liberal Arts Colleges/HBCU Library Alliance unconference that preceded DLF.


Kelly Bolding is a processing archivist for the Manuscripts Division at Princeton University Library, where she is responsible for the arrangement and description of early American history collections and has been involved in the development of born-digital processing workflows. She holds an MLIS from Rutgers University and a BA in English Literature from Reed College.

OSS4Pres 2.0: Developing functional requirements/features for digital preservation tools

By Heidi Elaine Kelly

____

This is the final post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing functional requirements/features for OSS tools the community would like to see built/developed (e.g. tools that could be used during ‘pre-ingest’ stage).”

The Functional Requirements for New Tools and Features Group of the OSS4Pres workshop aimed to write user stories focused on new features that developers can build out to better support digital preservation and archives work. The group was composed largely of practitioners who work with digital curation tools regularly, and was facilitated by Carl Wilson of the Open Preservation Foundation. While their work largely involved writing user stories for development, the group also came up with requirement lists for specific areas of tool development, outlined below. We hope that these lists help continue to bridge the gap between digital preservation professionals and open source developers by providing a deeper perspective on user needs.

Basic Requirements for Tools:

  • Mostly needed for Mac environment
  • No software installation on donor computer
  • No software dependencies requiring installation (e.g., Java)
  • Must be GUI-based, as most archivists are not skilled with the command line
  • Graceful failure

Descriptive Metadata Extraction Needs (using Apache Tika):

  • Archival date
  • Author
  • Authorship location
  • Subject location
  • Subject
  • Document type
  • Removal of spelling errors to improve extracted text

Technical Metadata Extraction Needs:

  • All datetime information available should be retained (minimum of LastModified Date)
  • Technical manifest report
  • File permissions and file ownership permissions
  • Information about the tool that generated the technical manifest report:
    • tool – name of the tool used to gather the disk image
    • tool version – the version of the tool
    • signature version – if the tool uses ‘signatures’ or other add-ons, e.g. which virus scanner software signature – such as signature release July 2014 or v84
    • datetime process run – the datetime information of when the process ran (usually tools will give you when the process was completed) – for each tool that you use
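
One possible shape for such a technical manifest is sketched below, using only the Python standard library. The field names, the `ToolRun` record, and the JSON-friendly output are assumptions for illustration; the group did not prescribe a format.

```python
import stat
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ToolRun:
    """Information about the tool that produced the report."""
    tool: str
    tool_version: str
    signature_version: str  # empty string when the tool uses no signatures
    run_at: str             # ISO 8601 timestamp of when the process ran


def describe_file(path: Path) -> dict:
    """Collect datetime, permission, and ownership details for one file."""
    info = path.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "path": path.as_posix(),
        "size_bytes": info.st_size,
        "last_modified": modified.isoformat(),
        "permissions": stat.filemode(info.st_mode),  # e.g. "-rw-r--r--"
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
    }


def build_manifest(paths: list[Path], tool_run: ToolRun) -> dict:
    """Combine per-file details with a record of the tool that gathered them."""
    return {
        "tool_run": asdict(tool_run),
        "files": [describe_file(path) for path in paths],
    }
```

Only the last-modified time is captured here because it is the stated minimum; creation and access times vary by platform and file system.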

Data Transfer Tool Requirements:

  • Run from portable external device
  • BagIt standard compliant (build into a “bag”)
  • Able to select a subset of data – not disk image the whole computer
  • GUI-based tool
  • Original file name (also retained in tech manifest)
  • Original file path (also retained in tech manifest)
  • Directory structure (also retained in tech manifest)
  • Address these issues in filenames (record the actual filename in the tech manifest): Diacritics (e.g. naïve), Illegal characters ( \ / : * ? " < > | ), Spaces, Em dashes, en dashes, Missing file extensions, Excessively long file and folder names, etc.
  • Possibly able to connect to “your” FTP site/cloud thingy and send the data there when ready for transfer
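
The filename issues listed above lend themselves to simple automated checks. The following is a minimal sketch; the issue labels and the 255-character limit are assumptions, and the diacritics test only catches characters that decompose into a base letter plus a combining mark.

```python
import os
import unicodedata

ILLEGAL_CHARACTERS = set('\\/:*?"<>|')
MAX_NAME_LENGTH = 255  # assumed limit; adjust for the target file system


def filename_issues(name: str) -> list[str]:
    """Return labels for the issues found in a single file name (not a path)."""
    issues = []
    decomposed = unicodedata.normalize("NFD", name)
    if any(unicodedata.combining(ch) for ch in decomposed):
        issues.append("diacritics")
    if any(ch in ILLEGAL_CHARACTERS for ch in name):
        issues.append("illegal characters")
    if " " in name:
        issues.append("spaces")
    if "\u2014" in name or "\u2013" in name:  # em dash, en dash
        issues.append("em or en dashes")
    if not os.path.splitext(name)[1]:
        issues.append("missing extension")
    if len(name) > MAX_NAME_LENGTH:
        issues.append("excessively long name")
    return issues
```

A transfer tool could record the original name in the technical manifest together with this list, rather than renaming files silently.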

Checksum Verification Requirements:

  • File-by-file checksum hash generation
  • Ability to validate the contents of the transfer

Reporting Requirements:

  • Ability to highlight/report on possibly problematic files/folders in a separate file

Testing Requirements:

  • Access to a test corpus, with known issues, to test the tool

Smart Selection & Appraisal Tool Requirements:

  • DRM/TPMs detection
  • Regular expressions/fuzzy logic for finding certain terms – e.g. phone numbers, security numbers, other predefined personal data
  • Blacklisting of files – configurable list of blacklist terms
  • Shortlisting a set of “questionable” files based on parameters that could then be flagged for a human to do further QA/QC
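
A rough sketch of the pattern-matching and shortlisting ideas follows. The two regular expressions cover only North American style phone numbers and the US Social Security number layout, and both they and the function names are illustrative assumptions; real appraisal tools would need broader and more careful patterns.

```python
import re

PATTERNS = {
    # Phone numbers such as 616-555-0100 or (616) 555-0100
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    # US Social Security number layout, e.g. 123-45-6789
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
}


def flag_text(text: str, blacklist: list[str]) -> dict[str, list[str]]:
    """Return pattern matches and blacklisted terms found in `text`."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    lowered = text.lower()
    terms = [term for term in blacklist if term.lower() in lowered]
    if terms:
        hits["blacklist"] = terms
    return hits


def shortlist(documents: dict[str, str], blacklist: list[str]) -> dict[str, dict]:
    """Map each questionable document name to the reasons it was flagged."""
    flagged = {}
    for name, text in documents.items():
        hits = flag_text(text, blacklist)
        if hits:
            flagged[name] = hits
    return flagged
```

The output is a list of candidates for human review, not a decision: patterns like these produce false positives, which is why the requirement pairs them with QA/QC.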

Specific Features Needed by the Community:

  • Gathering/generating quantitative metrics for web harvests
  • Mitigation strategies for FFMPEG obsolescence
  • TESSERACT language functionality

____

Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.

OSS4Pres 2.0: Sharing is Caring: Developing an online community space for sharing workflows

By Sam Meister

____

This is the third post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing requirements for an online community space for sharing workflows, OSS tool integrations, and implementation experiences.” See our other posts for information on the groups that focused on feature development and design requirements for FOSS tools.

Cultural heritage institutions, from small museums to large academic libraries, have made significant progress developing and implementing workflows to manage local digital curation and preservation activities. Many institutions are at different stages in the maturity of these workflows. Some are just getting started, and others have had established workflows for many years. Documentation assists institutions in representing current practices and functions as a benchmark for future organizational decision-making and improvements. Additionally, sharing documentation assists in creating cross-institutional understanding of digital curation and preservation activities and can facilitate collaborations amongst institutions around shared needs.

One of the most commonly voiced recommendations from iPRES 2015 OSS4PRES workshop attendees was the desire for a centralized location for technical and instructional documentation, end-to-end workflows, case studies, and other resources related to the installation, implementation, and use of OSS tools. This resource could serve as a hub that would enable practitioners to freely and openly exchange information, user requirements, and anecdotal accounts of OSS initiatives and implementations.

At the OSS4Pres 2.0 workshop, the group looking at developing an online space for sharing workflows and implementation experience started by defining a simple goal and deliverable for the two-hour session:

Develop a list of minimal levels of content that should be included in an open online community space for sharing workflows and other documentation

The group then began a discussion on developing this list of minimal levels by thinking about the potential value of user stories in informing these levels. We spent a bit of time proposing a short list of user stories, just enough to provide some insight into the basic structures that would be needed for sharing workflow documentation.

User stories

  • I am using tool 1 and tool 2 and want to know how others have joined them together into a workflow
  • I have a certain type of data to preserve and want to see what workflows other institutions have in place to preserve this data
  • There is a gap in my workflow — a function that we are not carrying out — and I want to see how others have filled this gap
  • I am starting from scratch and need to see some example workflows for inspiration
  • I would like to document my workflow and want to find out how to do this in a way that is useful for others
  • I would like to know why people are using particular tools – is there evidence that they tried another tool, for example, that wasn’t successful?

The group then proceeded to define a workflow object as a series of workflow steps with its own attributes, a visual representation, and organizational context:

Workflow step
Title / name
Description
Tools / resources
Position / role

Visual workflow diagrams / model
Organizational Context
            Institution type
            Content type
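
One way to picture this workflow object is as a small set of data classes. The sketch below is an interpretation rather than anything the group specified: the class and field names are hypothetical, and content types are modeled as a list on the assumption that a workflow may cover more than one.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowStep:
    title: str
    description: str
    tools: list[str] = field(default_factory=list)  # tools / resources
    role: str = ""                                   # position / role


@dataclass
class OrganizationalContext:
    institution_type: str
    content_types: list[str] = field(default_factory=list)


@dataclass
class Workflow:
    steps: list[WorkflowStep]
    context: OrganizationalContext
    diagram: str = ""  # path or link to a visual diagram; empty if none

    def steps_using(self, tool: str) -> list[str]:
        """Titles of the steps that rely on a given tool."""
        return [step.title for step in self.steps if tool in step.tools]
```

Structuring steps this way would support the first user story directly: a lookup such as `steps_using("tool 2")` shows where a given tool sits in someone else's workflow.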

Next, we started to draft out the different elements that would be part of an initial minimal level for workflow objects:

Level 1:

Title
Description
Institution / organization type
Contact
Content type(s)
Status
Link to external resources
Download workflow diagram objects
Workflow concerns / reflections / gaps
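
If a shared space adopted these Level 1 elements, a submission could be checked for completeness along the following lines. The key names are hypothetical, and treating every element as required is an assumption; the group's list does not say which elements are mandatory.

```python
LEVEL_1_FIELDS = [
    "title",
    "description",
    "institution_type",
    "contact",
    "content_types",
    "status",
    "external_links",
    "diagram_files",
    "reflections",  # workflow concerns / reflections / gaps
]


def missing_level_1(record: dict) -> list[str]:
    """Return the Level 1 elements that are absent or empty in `record`."""
    return [name for name in LEVEL_1_FIELDS if not record.get(name)]
```

An empty result means the record meets the minimal level as modeled here.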

After this effort the group focused on discussing next steps and how an online community space for sharing workflows could be realized. This discussion led towards pursuing the expansion of COPTR to support sharing of workflow documentation. We outlined a roadmap for next steps toward pursuing this goal:

  • Propose / approach COPTR steering group on adding workflows space to COPTR
  • Develop home page and workflow template
  • Add examples
  • Group review
  • Promote / launch
  • Evaluation

The group has continued this work post-workshop and has made good progress setting up a Community Owned Workflows section in COPTR and developing an initial workflow template. We are in the midst of creating and evaluating sample workflows to help with revising and tweaking as needed. Based on this process we hope to launch and start promoting this new online space for sharing workflows in the months ahead. So stay tuned!

____

Sam Meister is the Preservation Communities Manager, working with the MetaArchive Cooperative and BitCurator Consortium communities. Previously, he worked as Digital Archivist and Assistant Professor at the University of Montana. Sam holds a Master of Library and Information Science degree from San Jose State University and a B.A. in Visual Arts from the University of California San Diego. Sam is also an Instructor in the Library of Congress Digital Preservation Education and Outreach Program.

 

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.
