Mary Kidd (MLIS ’14) and Dana Gerber-Margie (MLS ’13) first met at a Radio Preservation Task Force meeting in 2016. They bonded over experiences of conference fatigue, but quickly moved on to topics near and dear to both of their hearts: podcasts and audio archiving. Dana has been a long-time podcast super-listener: she subscribes to over 1,400 podcasts and regularly listens to 40-50 of them. While getting her MLS, she launched a podcast recommendation newsletter called “The Audio Signal,” which has grown into a popular podcast publication, The Bello Collective. Mary was a National Digital Stewardship Resident at WNYC, where she created a born-digital preservation strategy for the station’s archives. She had worked on analog archives projects in the past — scanning and transferring collections of tapes — but she has embraced the madness and importance of preserving born-digital audio. Mary and Dana stayed in touch and continued to brainstorm ideas, which blossomed into a workshop on podcast preservation that they taught at the Personal Digital Archives conference at Stanford in 2017, along with Anne Wootton (co-founder of Popup Archive, now at Apple Podcasts).
Then Mary and I connected at the National Digital Stewardship Residency symposium in Washington, DC in 2017. I got my MLS back in 2013, but since then I’ve been working more at the intersection of media, storytelling, and archives. I had started a podcast and was really interested, for selfish reasons, in learning the most up-to-date best practices for born-digital audio preservation. I marched straight up to Mary and said something like, “hey, let’s work together on an audio preservation project.” Mary set up a three-way Skype call with Dana on the line, and pretty soon we were talking about podcasts. How we love them. How they are at risk because most podcasters host their files on commercial third-party platforms. And how we would love to do a massive outreach and education program where we teach podcasters that their digital files are at risk and give them techniques for preserving them. We wrote these ideas into a grant proposal, with a few numbers and a budget attached, and the Andrew W. Mellon Foundation gave us $142,000 to make it happen. We started working on this grant project, called “Preserve This Podcast,” back in February 2018. We’ve been able to hire people who are just as excited about the idea to help us make it happen. Like Sarah Nguyen, a current MLIS student at the University of Washington and our amazing Project Coordinator.
One moral of this story is that digital archives conferences really can bring people together and inspire them to advance the field. The other moral is that, after months of consulting audio preservation experts, interviewing podcasters, getting 556 podcasters to take a survey, and reading about the history of podcasting, we can confirm that podcasts are disappearing and that podcast producers are not adequately equipped to preserve their work against the many forces threatening the long-term endurance of digital information. There is more information about the project on our website (preservethispodcast.org) and in our report on the survey findings. Please reach out to firstname.lastname@example.org or email@example.com if you have any thoughts or ideas.
Molly Schwartz is the Studio Manager at the Metropolitan New York Library Council (METRO). She is the host and producer of two podcasts about libraries and archives — Library Bytegeist and Preserve This Podcast. Molly did a Fulbright grant at the Aalto University Media Lab in Helsinki, was part of the inaugural cohort of National Digital Stewardship Residents in Washington, D.C., and worked at the U.S. State Department as a data analyst. She holds an MLS with a specialization in Archives, Records and Information Management from the University of Maryland at College Park and a BA/MA in History from the Johns Hopkins University.
by Richard Marciano, Victoria Lemieux, and Mark Hedges
The 3rd workshop on Computational Archival Science (CAS) was held on December 12, 2018, in Seattle, following two earlier CAS workshops, in 2016 in Washington, DC and in 2017 in Boston. It also built on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 IEEE Big Data conferences, and more directly on a symposium held in April 2016 at the University of Maryland. The current working definition of CAS is:
A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.
The workshop featured five sessions and thirteen papers with international presenters and authors from the US, Canada, Germany, the Netherlands, the UK, Bulgaria, South Africa, and Portugal. All details (photos, abstracts, slides, and papers) are available at: http://dcicblog.umd.edu/cas/ieee-big-data-2018-3rd-cas-workshop/. The keynote focused on using digital archives to preserve the history of WWII Japanese-American incarceration and featured Geoff Froh, Deputy Director at Densho.org in Seattle.
This workshop explored the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. The aim was to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we addressed the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
Computational Thinking and Computational Archival Science
#1: Introducing Computational Thinking into Archival Science Education [William Underwood et al.]
#2: Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camp Records [Richard Marciano et al.]
#3: Computational Archival Practice: Towards a Theory for Archival Engineering [Kenneth Thibodeau]
#4: Stirring The Cauldron: Redefining Computational Archival Science (CAS) for The Big Data Domain [Nathaniel Payne]
Machine Learning in Support of Archival Functions
#5: Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records [Tim Hutchinson]
#6: Computer-Assisted Appraisal and Selection of Archival Materials [Cal Lee]
Metadata and Enterprise Architecture
#7: Measuring Completeness as Metadata Quality Metric in Europeana [Péter Király et al.]
#8: In-place Synchronisation of Hierarchical Archival Descriptions [Mike Bryant et al.]
#9: The Utility Enterprise Architecture for Records Professionals [Shadrack Katuu]
#10: Framing the scope of the common data model for machine-actionable Data Management Plans [João Cardoso et al.]
#11: The Blockchain Litmus Test [Tyler Smith]
Social and Cultural Institution Archives
#12: A Case Study in Creating Transparency in Using Cultural Big Data: The Legacy of Slavery Project [Ryan Cox, Sohan Shah et al.]
#13: Jupyter Notebooks for Generous Archive Interfaces [Mari Wigham et al.]
Finally, we are planning a 4th CAS Workshop in December 2019 at the 2019 IEEE International Conference on Big Data (IEEE BigData 2019) in Los Angeles, CA. Stay tuned for an upcoming CAS#4 workshop call for proposals, where we would welcome SAA member contributions!
Richard Marciano is a professor at the University of Maryland iSchool, where he directs the Digital Curation Innovation Center (DCIC). He previously conducted research at the San Diego Supercomputer Center at the University of California, San Diego for over a decade. His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data. He is also the 2017 recipient of the Emmett Leahy Award for achievements in records and information management. Marciano holds degrees in Avionics and Electrical Engineering, as well as a Master’s and Ph.D. in Computer Science from the University of Iowa. In addition, he conducted postdoctoral research in Computational Geography.
Victoria Lemieux is an associate professor of archival science at the iSchool and lead of the Blockchain research cluster, Blockchain@UBC at the University of British Columbia – Canada’s largest and most diverse research cluster devoted to blockchain technology. Her current research is focused on risk to the availability of trustworthy records, in particular in blockchain record keeping systems, and how these risks impact upon transparency, financial stability, public accountability and human rights. She has organized two summer institutes for Blockchain@UBC to provide training in blockchain and distributed ledgers, and her next summer institute is scheduled for May 27-June 7, 2019. She has received many awards for her professional work and research, including the 2015 Emmett Leahy Award for outstanding contributions to the field of records management, a 2015 World Bank Big Data Innovation Award, a 2016 Emerald Literati Award and a 2018 Britt Literary Award for her research on blockchain technology. She is also a faculty associate at multiple units within UBC, including the Peter Wall Institute for Advanced Studies, Sauder School of Business, and the Institute for Computers, Information and Cognitive Systems.
Mark Hedges is a Senior Lecturer in the Department of Digital Humanities at King’s College London, where he teaches on the MA in Digital Asset and Media Management, and is also Departmental Research Lead. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London before embarking on a 17-year career in the software industry; he joined King’s in 2005. His research is concerned primarily with digital archives, research infrastructures, and computational methods, and he has led a range of projects in these areas over the last decade. Most recently he has been working in Rwanda on initiatives relating to digital archives and the transformative impact of digital technologies.
Where: Metropolitan New York Library Council (METRO), New York, NY
Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (CUNY)
Ashley Blewer, AV Preservation Specialist at Artefactual
Kelly Stewart, Digital Preservation Services Manager at Artefactual
On December 3, 2018, the Metropolitan New York Library Council (METRO)’s Digital Preservation Interest Group hosted an informative (and impeccably titled) presentation about how the CUNY Graduate Center (GC) plans to incorporate Archivematica, a web-based, open-source digital asset management system (DAMS) developed by Artefactual, into its document management strategy for student dissertations. Speakers included Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (GC); Ashley Blewer, AV Preservation Specialist at Artefactual; and Kelly Stewart, Digital Preservation Services Manager at Artefactual. The presentation began with an overview from Stephen about the GC’s needs and why it chose Archivematica as a DAMS, followed by an introduction to and demo of Archivematica and DuraCloud, an open-source cloud storage service, led by Ashley and Kelly (who presented via video-conference call). While this post provides a general summary of the presentation, I would recommend reaching out to any of the presenters for more detailed information about their work. They were all great!
Every year the GC Library receives between 400 and 500 dissertations, theses, and capstones. These submissions can include a wide variety of digital materials, from PDF, video, and audio files to websites and software. Preservation of these materials is essential if the GC is to provide access to emerging scholarship and retain a record of students’ work towards their degrees. Prior to implementing a DAMS, however, the GC’s strategy for managing digital files of student work focused primarily on access, not preservation. Access copies of student work were available on CUNY Academic Works, a site that uses Bepress Digital Commons as a CMS. Missing from the workflow, however, was the creation, storage, and management of archival originals. As Stephen explained, if the Open Archival Information System (OAIS) model is a guide for a proper digital preservation workflow, the GC was missing the middle, Archival Information Package (AIP), portion of it. Some of the qualities the GC liked about Archivematica were that it is open-source and highly customizable, comes with strong customer support from Artefactual, and has an API that can integrate with tools already in use at the library. GC Library staff hope that Archivematica can eventually integrate with both the library’s electronic submission system (Vireo) and CUNY Academic Works, making the submission, preservation, and access of digital dissertations a much more streamlined, automated, and OAIS-compliant process.
Next, Ashley and Kelly introduced and demoed Archivematica and Duracloud. I was very pleased to see several features of the Archivematica software that were made intentionally intuitive. The design of the interface is very clean and easily customizable to fit different workflows. Also, each AIP that is processed includes a plain-text, human-readable file which serves as extra documentation explaining what Archivematica did to each file. Artefactual recommends pairing Archivematica with Duracloud, although users can choose to integrate the software with local storage or with other cloud services like those offered by Google or Amazon. One of the features I found really interesting about Duracloud is that it comes with various data visualization graphs that show the user how much storage is available and what materials are taking up the most space.
I will close by referencing something Ashley wrote in her recent bloggERS post (conveniently, she also contributed to this event). She makes an excellent point about how different skill sets are needed to do digital preservation, from the developers who create the tools that automate digital archival processes to the archivists who advocate for and implement those tools at their institutions. I think this talk was successful precisely because it included both the practitioner and vendor perspectives, along with the unique expertise that comes with each role. Both are needed if we are to meet the challenges and tap into the potential that digital archives present. I hope to see more of these “meetings of the minds” in the future.
Regina Carra is the Archive Project Metadata and Cataloging Coordinator at Mark Morris Dance Group. She is a recent graduate of the Dual Degree MLS/MA program in Library Science and History at Queens College – CUNY.
Scripting and working in the command line have become increasingly important skills for archivists, particularly for those who work with digital materials — at the same time, approaching these tools as a beginner can be intimidating. This series hopes to help break down barriers by allowing archivists to learn from their peers. We want to hear about how you use or are learning to use scripts (Bash, Python, Ruby, etc.) or the command line (one-liners, a favorite command line tool) in your day-to-day work, how scripts play into your processes and workflows, and how you are developing your knowledge in this area. How has this changed the way you think about your work? How has this changed your relationship with your colleagues or other stakeholders?
We’re particularly interested in posts that consist of a walk-through of a simple script (or one-liner) used in your digital archives workflow. Show us your script or command and tell us how it works.
A few other potential topics and themes for posts:
Stories of success or failure with scripting for digital archives
General “tips and tricks” for the command line/scripting
Independent or collaborative learning strategies for developing “tech” skills
A round-up of resources about a particular scripting language or related topic
Applying computational thinking to digital archives
Writing for bloggERS! “Script It!” Series
We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
Written content should be roughly 600-800 words in length
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
At this year’s New England Archivists Spring Meeting, archivists who work with born-digital materials had the opportunity to attend the inaugural Born-Digital Access Bootcamp. The bootcamp was an idea generated at the born-digital hackfest, part of a session at SAA 2015, where a group of about 50 archivists came together to tackle the problem facing most archival repositories: How do we provide access to born-digital records, which can have different technical and ethical requirements than digitized materials? Since 2015, a team has come together to form a bootcamp curriculum, reach out to organizations outside of SAA, and organize bootcamps at various conferences.
Alison Clemens and Jessica Farrell facilitated the day-long camp, which had about 30 people in attendance from institutions of all sizes and types, though the majority were academic. The attendees also brought a broad range of experience to the camp, from those just starting out thinking about this issue, to those who have implemented access solutions.
We all have dreams… Have you ever found yourself enamored with the idea of digitizing? How fast and easy it would be to just scan an image and post it online (or email it) instead of pulling item after item after awkward map off the shelf in the stacks, day in and day out? Have you ever imagined a world where small institutions have metadata as good as the Digital Commonwealth’s? While these are all great dreams to have, the reality is that many repositories cannot digitize entire collections, and on-demand digitizing requires consistent metadata at both the collection and item level. This is the story of backtracking through years of unlabeled scans, and what I learned about part-time, lone-arranger digitization practices.
For every good intention, there is an equally powerful Murphy’s Law reaction…
In a small institution with limited personnel and resources, it is very easy to leave a long trail of to-do tasks that simply do not get done on time. For example:
A researcher asks for a scan of a photograph they found on a finding aid posted on your website.
You scan the photograph, email it to them, and then get pulled off task because your intern needs help with some loose papers in the box she is processing.
Before you get back to the computer, your co-worker asks to meet about pulling materials for an event.
You don’t get back to the computer for another hour, and by then there are six more emails waiting for you and new reference requests. You dive into those.
You haven’t even looked at your to-do list from yesterday.
Maybe you remember to put the photograph into the correct collection folder. Maybe you labeled the photograph 3.014 LLH but never uploaded it to the server. Maybe you forgot about it altogether until someone asked for it again and you remembered doing it before. Do you take the time to find it? Do you rescan it?
I came into years of unlabeled scans, almost a terabyte of images either completely unlabeled or images organized into collection folders, and I am sure I am not alone. I’m sure I’ve also contributed to this type of problem over the years.
By not labeling the images, the person scanning is either relying on visual recognition in the future or not thinking about the future at all. What if you get sick and need someone to cover for you, or leave the institution? What if you forget the number you assigned it in the finding aid? This all raises the question: can anyone keep thousands of images straight? I sure can’t. By not organizing and labeling the scans from the get-go, the person wasted time and resources, exposed the originals to adverse conditions, and flooded the server with useless materials that just make searching harder. Looking at a sea of scans and not knowing where they might belong is difficult and intimidating. Scanning an image and emailing it, then forgetting about it because of other goings-on, is quite easy.
With many scans comes great responsibility… On one hand, someone already took the time to scan all those materials, pulling them out of their controlled environment and subjecting them to the scanner. On the other hand, deleting everything without metadata and starting from scratch with specific standards in place would eliminate significant time spent backtracking. I chose to act on a case-by-case basis, working through the images that were sorted into collection folders and relegating the completely unlabeled photographs to a folder to delete or deal with later. Going collection by collection through the finding aids to determine which scan was which, I ended up with 25 collections whose scans are identified by item number, in addition to the collections that were scanned in their entirety.
Making the best of it all… With all these scans identified, I have been able to enrich the metadata of our collections posted on Flickr and on our website — well, once our new website is actually up and running. Digitization is only step one of the process…and only if you set the correct dpi the first time around…
Irina Sandler graduated in May 2017 with her MLIS from Simmons College. She is currently an archivist at the Baker Library of Harvard Business School as well as the Cambridge Historical Society.
I’m the Records Coordinator for a global energy engineering, procurement, and construction contractor, herein referred to as the “Company.” The Company does design, fabrication, installation, and commissioning of upstream and downstream technologies for operators. I manage the program for our hard copy and electronic records produced from our Houston office.
A few years ago our Records Management team was asked by the IT department to help create a process to archive digital records of closed projects created out of the Houston office. I saw the effort as an opportunity to expand the scope and authority of our records program to include digital records. Up to this point, our practice only covered paper records, and we asked employees to apply the paper record policies to their own electronic records.
The Records Management team’s role was limited to providing IT with advice on how to deploy a software tool where files could be stored for a long-term period. We were not included in the discussions on which software tool to use. It took us over a year to develop the new process with IT and standardize it into a published procedure. We had many areas of triumph and failure throughout the process. Here is a synopsis of the project.
Objective: IT was told that retaining closed projects files on the local server was an unnecessary cost and was tasked with removing them. IT reached out to Records Management to develop a process to maintain the project files for the long-term in a more cost-effective solution that was nearline or offline, where records management policies could be applied.
Vault: The software chosen was a proprietary cloud-based file storage center or “vault.” It has search, tagging, and records disposition capabilities. It is more cost-effective than storing files on the local server.
Process: At 80% project completion, Records Management reaches out to active projects to learn how they store their files and what the project completion schedule is. Eighty percent engineering completion is an important milestone because most of the project team is still involved and the bulk of the work is complete. Knowing the project schedule also lets us accurately apply the two-year timespan after which files will be migrated off the local server and into the vault. The two-year span was chosen to ensure that all project files remain available to the project team during the typical warranty period. Two years after a project is closed, all technical files and data are exported from the current management system and ingested into the vault, and access groups are created so employees can view and download the files for reference as needed.
Deployment: Last year, we began to apply the process to large active projects that had passed 80% engineering completion. Large projects are those with greater than $5 million in revenue.
Observations: Recently we have begun to audit the whole project with IT, and are just now identifying our areas of failure and triumph. We will conduct an analysis of these areas and assess where we can make improvements.
Our big areas of failure were related to stakeholder involvement in the development, deployment, and utilization of the vault.
Stakeholders, including the Records Management team, were not involved in the selection or development of the vault software tool. As a result, the vault development project lacked the resources required to make it as successful as possible.
In the deployment of the vault, we did not create an outreach campaign with training courses that would introduce the tool across our very large company. Due to this, many employees are still unaware of the vault. When we talk with departments and projects about methods to save old files for less money they are reluctant to try the solution because it seems like another way for IT to save money from their budget without thinking about the greater needs of the company. IT is still viewed as a support function that is inessential to the Company’s philosophy.
Lastly, we did not have methods to export project files from all systems for ingest into the vault; nor did we, in North America, have the authority to develop that solution. To be effective, that type of decision and process can only be developed by our corporate office in another country. The Company also does not make information about project closure available to most employees. A project end date can be determined by several factors, including when the final invoice was received or the end of the warranty period. This type of information is essential to the information lifecycle of a project, and since we had no involvement from upper level management, we were not able to devise a solution for easily discovering this information.
We had some triumphs throughout the process, though. Our biggest triumph is that this project gave Records Management an opportunity to showcase our knowledge of records retention and its value as a method to save money and maintain business continuity. We were able to collaborate with IT and promulgate a process. It gave us a great opportunity to grow by building better relationships with the business lines. Although some departments and teams are still skeptical about the value of the vault, when we advertise it to other project teams, they see the vault as evidence that the Company cares about preserving their work. We earned our seat at the table with these players, but we still have to work on winning over more projects and departments. We’ve also preserved more than 30 TB of records and saved the Company several thousand dollars by ingesting inactive project files into the vault.
I am optimistic that when we have support from upper management, we will be able to improve the vault process and infrastructure, and create an effective solution for utilizing records management policies to ensure legal compliance, maintain business continuity, and save money.
Sarah Dushkin earned her MSIS from the University of Texas at Austin School of Information with a focus in Archival Enterprise and Records Management. Afterwards, she sought to diversify her expertise by working outside of the traditional archival setting and moved to Houston to work in oil and gas. She has collaborated with management from across her company to refine their records management program and develop a process that includes the retention of electronic records and data. She lives in Sugar Land, Texas with her husband.
When was the last time you totally, completely, utterly loused up a project or a report or some other task in your professional life? When was the last time you dissected that failure, in meticulous detail, in front of a room full of colleagues? Let’s face it: we’ve all had the first experience, and I’d wager that most of us would pay good money to avoid the second.
It’s a given that we’ll all encounter failure professionally, but there’s a strong cultural disincentive to talk about it. Failure is bad. It is to be avoided at all costs. And should one fail, that failure should be buried away in a dark closet with one’s other skeletons. At the same time, it’s well acknowledged that failure is a critical step on the path to success. It’s only through failing and learning from that experience that we can make the necessary course corrections. In that sense, refusing to acknowledge or unpack failure is a disservice: failure is more valuable when well-understood than when ignored.
This philosophy — that we can gain value from failure by acknowledging and understanding it openly — is the underlying principle behind Fail4Lib, the perennial preconference workshop that takes place at the annual Code4Lib conference, and which completed its fifth iteration (Fail5Lib!) at Code4Lib 2017 in Los Angeles. Jason Casden (now of UNC Libraries) originally conceived of the Fail4Lib idea, and together he and I developed the concept into a workshop about understanding, analyzing, and coming to terms with professional failure in a safe, collegial environment.
Participants in a Fail4Lib workshop engage in a number of activities to foster a healthier relationship with failure: case study discussions to analyze high-profile failures such as the Challenger disaster and the Volkswagen diesel emissions scandal; lightning talks where brave souls share their own professional failures and talk about the lessons they learned; and an open bull session about risk, failure, and organizational culture, to brainstorm on how we can identify and manage failure, and how to encourage our organizations to become more failure-tolerant.
Fail4Lib’s goal is to help its participants get better at failing. By practicing talking and thinking about failure, we position ourselves to learn more from the failures of others as well as from our own future failures. By sharing and talking through our failures, we maximize the value of our experiences, we normalize the practice of openly acknowledging and discussing failure, and we reinforce the message that failure happens to all of us. And by brainstorming approaches to help our institutions become more failure-tolerant, we can begin making meaningful organizational change toward accepting failure as part of the development process.
The principles I’ve outlined here not only form the framework for the Fail4Lib workshop, they also represent a philosophy for engaging with professional failure in a constructive and blameless way. It’s only by normalizing the experience of failure that we can gain the most from it; in so doing, we make failure more productive, we accelerate our successes, and we make ourselves more resilient.
Andreas Orphanides is Associate Head, User Experience at the NCSU Libraries, where he develops user-focused solutions to support teaching, learning, and information discovery. He has facilitated Fail4Lib workshops at the annual Code4Lib conference since 2013. He holds a BA from Oberlin College and an MSLS from UNC-Chapel Hill.
This is the second post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “drafting a design guide and requirements for Free and Open Source Software (FOSS) tools, to ensure that they integrate easily with digital preservation institutional systems and processes.”
The FOSS Development Requirements Group set out to create a design guide for FOSS tools to ensure easier adoption of open-source tools by the digital preservation community, including their integration with common end-to-end software and tools supporting digital preservation and access that are now in use by that community.
The group included representatives of large digital preservation and access projects such as Fedora and Archivematica, as well as tool developers and practitioners, ensuring a range of perspectives were represented. The group’s initial discussion led to the creation of a list of minimum necessary requirements for developing open source tools for digital preservation, based on similar examples from the Open Preservation Foundation (OPF) and from other fields. Below is the draft list that the group came up with, followed by some intended future steps. We welcome feedback or additions to the list, as well as suggestions for where such a list might be hosted long term.
Minimum Necessary Requirements for FOSS Digital Preservation Tool Development
Provide publicly accessible documentation and an issue tracker
Have a documented process for how people can contribute to development, report bugs, and suggest new documentation
Every tool should do the smallest possible task really well; if you are developing an end-to-end system, develop it in a modular way in keeping with this principle
Follow established standards and practices for development and use of the tool
Keep documentation up-to-date and versioned
Follow test-driven development philosophy
Don’t develop a tool without use cases, and stakeholders willing to validate those use cases
Use an open and permissive software license to allow for integrations and broader use
Have a mailing list, Slack or IRC channel, or other means for community interaction
Establish community guidelines
Provide a well-documented mechanism for integration with other tools/systems in different languages
Provide functionality of tool as a library, separating out the GUI and the actual functions
Package the tool in an easy-to-use way; the more broadly you want the tool to be used, the more operating systems you should package it for
Use a packaging format that supports any dependencies
Provide examples of functionality for potential users
Consider the organizational home or archive for the tool for long-term sustainability; develop your tool based on potential organizations’ guidelines
Consider providing a mechanism for internationalization of your tool (this is a broader community need as well, to identify the tools that exist and to incentivize this)
Digital preservation is an operating system-agnostic field
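To make a couple of the principles above concrete, here is a minimal sketch of the "provide functionality as a library, separating out the interface from the actual functions" idea, using a fixity-checking task common in digital preservation. All names here (`checksum_file` and so on) are hypothetical illustrations, not part of any real tool: the core function is importable on its own, and the command-line interface is a thin wrapper around it, so other systems can integrate the functionality without going through the CLI.

```python
"""Sketch of a library-first digital preservation tool: the checksum
logic lives in an importable function, and the CLI is a separate,
thin layer. All names are illustrative, not from a real project."""
import argparse
import hashlib


def checksum_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks so that
    arbitrarily large files can be handled without loading them fully
    into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main():
    # The CLI only parses arguments and prints results; any other
    # tool or workflow can import checksum_file() directly instead.
    parser = argparse.ArgumentParser(
        description="Compute file checksums for fixity checking.")
    parser.add_argument("paths", nargs="+", help="files to checksum")
    parser.add_argument("--algorithm", default="sha256",
                        choices=sorted(hashlib.algorithms_guaranteed))
    args = parser.parse_args()
    for path in args.paths:
        print(f"{checksum_file(path, args.algorithm)}  {path}")


if __name__ == "__main__":
    main()
```

Because the function does one small task well and carries no interface code, it could just as easily be wrapped by a GUI, a web service, or an end-to-end preservation system, which is exactly the kind of integration the requirements list is aiming for.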
Feedback and Perspectives. Because of the expense of the iPRES conference (and its location in Switzerland), all of the group members were from relatively large and well-resourced institutions. The perspective of under-resourced institutions is very often left out of open-source development communities, as they are unable to support and contribute to such projects; in this case, this design guide would greatly benefit from the perspective of such institutions as to how FOSS tools can be developed to better serve their digital preservation needs. The group was also largely from North America and Europe, so this work would eventually benefit greatly from adding perspectives from the FOSS and digital preservation communities in South America, Asia, and Africa.
Institutional Home and Stewardship. When finalized, the FOSS development requirements list should live somewhere permanently and develop based on the ongoing needs of our community. As this line of communication between practitioners and tool developers is key to the continual development of better and more user-friendly digital preservation tools, we should continue to build on the work of this group.
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.
Organized by Sam Meister (Educopia), Shira Peltzman (UCLA), Carl Wilson (Open Preservation Foundation), and Heidi Kelly (Indiana University), OSS4PRES 2.0 was a half-day workshop that took place during the 13th annual iPRES 2016 conference in Bern, Switzerland. The workshop aimed to bring together digital preservation practitioners, developers, and administrators in order to discuss the role of open source software (OSS) tools in the field.
Although several months have passed since the workshop wrapped up, we are sharing this information now in an effort to raise awareness of the excellent work completed during this event, to continue the important discussion that took place, and to hopefully broaden involvement in some of the projects that developed. First, however, a bit of background: The initial OSS4PRES workshop was held at iPRES 2015. Attended by over 90 digital preservation professionals from all areas of the open source community, the workshop featured reports on specific issues related to open source tools, followed by small group discussions about the opportunities, challenges, and gaps that attendees observed. The energy from this initial workshop led to both the proposal of a second workshop and a report published in Code4Lib Journal, OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements.
The overarching goal for the 2016 workshop was to build bridges and fill gaps within the open source community at large. In order to facilitate a focused and productive discussion, OSS4PRES 2.0 was organized into three groups, each led by one of the workshop’s organizers. Additionally, Shira Peltzman floated between groups to minimize overlap and ensure that each group remained on task. Splitting into groups not only maximized our output but also allowed each group to focus on disparate yet complementary aspects of the open source community.
Develop user stories for existing tools (group leader: Carl Wilson)
Carl’s group was composed principally of digital preservation practitioners. The group scrutinized existing pain points associated with the day-to-day management of digital material, identified tools needed by the open source community that had not yet been built, and began to fill this gap by drafting functional requirements for these tools.
Define requirements for online communities to share information about local digital curation and preservation workflows (group leader: Sam Meister)
With an aim to strengthen the overall infrastructure around open source tools in digital preservation, Sam’s group focused on the larger picture by addressing the needs of the open source community at large. The group drafted a list of requirements for an online community space for sharing workflows, tool integrations, and implementation experiences, to facilitate connections between disparate groups, individuals, and organizations that use and rely upon open source tools.
Define requirements for new tools (group leader: Heidi Kelly)
Heidi’s group looked at how the development of open source digital preservation tools could be improved by implementing a set of minimal requirements to make them more user-friendly. Since a list of these requirements specifically for the preservation community had not existed previously, this list both fills a gap and facilitates the building of bridges, by enabling developers to create tools that are easier to use, implement, and contribute to.
Ultimately OSS4PRES 2.0 was an effort to make the open source community more open and diverse, and in the coming weeks we will highlight what each group managed to accomplish towards that end. The blog posts will provide an in-depth summary of the work completed both during and since the event took place, as well as a summary of next steps and potential project outcomes. Stay tuned!
Shira Peltzman is the Digital Archivist for the UCLA Library where she leads the development of a sustainable preservation program for born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts and was a member of the inaugural class of the National Digital Stewardship Residency in New York (NDSR-NY).
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.