OSS4Pres 2.0: Developing functional requirements/features for digital preservation tools

By Heidi Elaine Kelly

____

This is the final post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing functional requirements/features for OSS tools the community would like to see built/developed (e.g. tools that could be used during ‘pre-ingest’ stage).”

The Functional Requirements for New Tools and Features Group of the OSS4Pres workshop aimed to write user stories focused on new features that developers can build out to better support digital preservation and archives work. The group was composed largely of practitioners who work with digital curation tools regularly, and was facilitated by Carl Wilson of the Open Preservation Foundation. While their work largely involved writing user stories for development, the group also drew up requirements lists for specific areas of tool development, outlined below. We hope that these lists help continue to bridge the gap between digital preservation professionals and open source developers by providing a deeper perspective on user needs.

Basic Requirements for Tools:

  • Primarily needed for the Mac environment
  • No software installation on the donor computer
  • No software dependencies requiring installation (e.g., Java)
  • Must be GUI-based, as most archivists are not skilled with the command line
  • Graceful failure (clear error reporting rather than a silent crash)

Descriptive Metadata Extraction Needs (using Apache Tika; a short extraction sketch follows the list):

  • Archival date
  • Author
  • Authorship location
  • Subject location
  • Subject
  • Document type
  • Correction of spelling errors to improve the quality of extracted text
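To ground the Tika-based extraction in something concrete, here is a minimal sketch using the tika Python bindings. Note the tension with the basic requirements above: the bindings launch a local Tika server and therefore depend on Java, so this illustrates the extraction step rather than a dependency-free tool. The metadata keys queried below are common ones but vary by file format, and the file name is a placeholder.

```python
# A minimal sketch: extract descriptive metadata with Apache Tika via the
# `tika` Python package (pip install tika). The package starts a local
# Tika server, which requires Java; this illustrates the extraction step,
# not a zero-install tool.
from tika import parser

def extract_descriptive_metadata(path):
    parsed = parser.from_file(path)            # {'metadata': ..., 'content': ...}
    metadata = parsed.get("metadata") or {}
    text = parsed.get("content") or ""
    # Key names differ across formats; these are common but not guaranteed.
    return {
        "author": metadata.get("Author") or metadata.get("dc:creator"),
        "created": metadata.get("Creation-Date") or metadata.get("dcterms:created"),
        "content_type": metadata.get("Content-Type"),
        "text_preview": text.strip()[:200],    # raw text, before spell-correction
    }

if __name__ == "__main__":
    print(extract_descriptive_metadata("example.docx"))  # placeholder file
```

Fields such as authorship location and subject would still require natural-language processing on top of the raw extraction; Tika alone does not supply them.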

Technical Metadata Extraction Needs (see the manifest sketch after this list):

  • All datetime information available should be retained (minimum of LastModified Date)
  • Technical manifest report
  • File permissions and file ownership permissions
  • Information about the tool that generated the technical manifest report:
    • tool – name of the tool used to gather the disk image
    • tool version – the version of the tool
    • signature version – if the tool uses signatures or other add-ons, the version of those, e.g. a virus scanner’s signature release such as July 2014 or v84
    • datetime process run – when the process ran (tools usually report when the process completed), recorded for each tool used
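As a rough sketch of what one entry in such a technical manifest might contain, the following uses only the Python standard library; the tool-provenance values are placeholders to be filled in for a real tool chain.

```python
# A rough sketch of one file's entry in a technical manifest, standard
# library only. Tool-provenance values are placeholders.
import datetime
import getpass
import os
import platform
import stat

def manifest_entry(path):
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": st.st_size,
        # Retain all available datetimes, at minimum last-modified:
        "last_modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
        "last_accessed": datetime.datetime.fromtimestamp(st.st_atime).isoformat(),
        "permissions": stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
        "owner_uid": st.st_uid,                    # numeric owner (POSIX)
        # Provenance of the manifest itself:
        "tool": "example-manifest-tool",           # placeholder tool name
        "tool_version": "0.1.0",                   # placeholder version
        "run_datetime": datetime.datetime.now().isoformat(),
        "run_by": getpass.getuser(),
        "platform": platform.platform(),
    }
```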

Data Transfer Tool Requirements (a bagging sketch follows the list):

  • Run from portable external device
  • BagIt standard compliant (content is built into a “bag”)
  • Able to select a subset of data rather than disk-imaging the whole computer
  • GUI-based tool
  • Original file name (also retained in tech manifest)
  • Original file path (also retained in tech manifest)
  • Directory structure (also retained in tech manifest)
  • Address these issues in filenames (record the actual filename in the tech manifest):
    • Diacritics (e.g. naïve)
    • Illegal characters ( \ / : * ? “ < > | )
    • Spaces
    • Em dashes and en dashes
    • Missing file extensions
    • Excessively long file and folder names
  • Possibly able to connect to the institution’s own FTP site or cloud storage and send the data there when ready for transfer
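The bagging requirement maps closely onto the Library of Congress’s bagit-python library; here is a minimal sketch (the directory name and bag-info values are invented examples):

```python
# A minimal sketch of building a BagIt-compliant "bag" from an
# already-selected subset of data, using bagit-python (pip install bagit).
# The directory name and bag-info values are invented examples.
import bagit

bag = bagit.make_bag(
    "transfer_2016_001",                  # hypothetical transfer directory
    {"Source-Organization": "Example Library",
     "Contact-Name": "Example Archivist"},
    checksums=["md5", "sha256"],          # file-by-file checksum manifests
)
print(bag.is_valid())                     # quick post-creation check
```

Note that make_bag restructures the directory in place, moving its contents into a data/ payload folder alongside the generated manifests.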

Checksum Verification Requirements (a validation sketch follows the list):

  • File-by-file checksum hash generation
  • Ability to validate the contents of the transfer
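File-by-file hashing needs nothing beyond the standard library; here is a sketch (the chunk size is an arbitrary choice), with transfer validation reducing to a comparison of two manifests. If the content is already bagged as above, bagit-python’s bag.validate() performs the same verification.

```python
# A sketch of file-by-file checksum generation using only the standard
# library; validating a transfer reduces to comparing two manifests.
import hashlib
import pathlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_manifest(root):
    """Map each file's path relative to root onto its SHA-256 hash."""
    root = pathlib.Path(root)
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

# Validate by comparing source and destination manifests:
# assert checksum_manifest("source") == checksum_manifest("copy")
```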

Reporting Requirements (a filename-screening sketch follows):

  • Ability to highlight/report on possibly problematic files/folders in a separate file
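As one way to meet this, here is a sketch that screens for the problematic filename patterns listed under the transfer requirements and writes the hits to a separate report file. The specific checks and the 255-character limit are assumptions, not an authoritative rule set.

```python
# A sketch that flags possibly problematic file and folder names and
# writes them to a separate report. The checks mirror the filename issues
# listed earlier and are assumptions, not an authoritative rule set.
import pathlib
import re

ILLEGAL = re.compile(r'[\\/:*?"<>|]')
MAX_NAME = 255   # a common filesystem limit; adjust for your environment

def problems_for(path):
    name = path.name
    issues = []
    if ILLEGAL.search(name):
        issues.append("illegal characters")
    if " " in name:
        issues.append("spaces")
    if any(ord(c) > 127 for c in name):
        issues.append("diacritics or other non-ASCII (incl. em/en dashes)")
    if path.is_file() and "." not in name:
        issues.append("missing file extension")
    if len(name) > MAX_NAME:
        issues.append("excessively long name")
    return issues

def write_report(root, report_path="problem_files.txt"):
    with open(report_path, "w", encoding="utf-8") as out:
        for p in pathlib.Path(root).rglob("*"):
            issues = problems_for(p)
            if issues:
                out.write(f"{p}\t{', '.join(issues)}\n")
```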

Testing Requirements:

  • Access to test corpora with known issues, for testing tools

Smart Selection & Appraisal Tool Requirements (a pattern-matching sketch follows the list):

  • DRM/TPMs detection
  • Regular expressions/fuzzy logic for finding certain terms – e.g. phone numbers, Social Security numbers, other predefined personal data
  • Blacklisting of files – configurable list of blacklist terms
  • Shortlisting a set of “questionable” files based on parameters that could then be flagged for a human to do further QA/QC
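A simplistic sketch of the pattern-matching, blacklisting, and shortlisting items follows. The regular expressions are US-centric illustrations rather than production-quality personal-data detection (dedicated tools such as bulk_extractor are far more robust), the term list is an invented example, and DRM/TPM detection is format-specific and not attempted here.

```python
# A simplistic sketch of shortlisting "questionable" files for human QA/QC.
# The regexes are illustrative and US-centric; the blacklist is an invented,
# configurable example. Dedicated tools (e.g. bulk_extractor) are more robust.
import pathlib
import re

PATTERNS = {
    "phone number": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
BLACKLIST_TERMS = ["confidential", "do not distribute"]   # configurable

def flag_file(path):
    """Return the reasons a file should be shortlisted for human review."""
    try:
        text = pathlib.Path(path).read_text(errors="ignore").lower()
    except OSError:
        return ["unreadable"]
    reasons = [label for label, rx in PATTERNS.items() if rx.search(text)]
    reasons += [f"blacklisted term: {t}" for t in BLACKLIST_TERMS if t in text]
    return reasons
```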

Specific Features Needed by the Community:

  • Gathering/generating quantitative metrics for web harvests
  • Mitigation strategies for FFmpeg obsolescence
  • Tesseract language functionality (see the OCR sketch below)
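On the last item, here is a minimal sketch of Tesseract’s language support through the pytesseract wrapper; it assumes the Tesseract binary and the named language data pack (German, in this invented example) are installed, and the input image is a placeholder.

```python
# A minimal sketch of Tesseract's language functionality via pytesseract
# (pip install pytesseract pillow). Assumes the Tesseract binary and the
# 'deu' (German) language data pack are installed.
from PIL import Image
import pytesseract

print(pytesseract.get_languages())        # list the installed language packs
text = pytesseract.image_to_string(
    Image.open("scan.png"),               # placeholder input image
    lang="deu",                           # OCR with German language data
)
print(text[:200])
```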

____

Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.


OSS4Pres 2.0: Sharing is Caring: Developing an online community space for sharing workflows

By Sam Meister

____

This is the third post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing requirements for an online community space for sharing workflows, OSS tool integrations, and implementation experiences.” See our other posts for information on the groups that focused on feature development and design requirements for FOSS tools.

Cultural heritage institutions, from small museums to large academic libraries, have made significant progress developing and implementing workflows to manage local digital curation and preservation activities. Many institutions are at different stages in the maturity of these workflows. Some are just getting started, and others have had established workflows for many years. Documentation assists institutions in representing current practices and functions as a benchmark for future organizational decision-making and improvements. Additionally, sharing documentation assists in creating cross-institutional understanding of digital curation and preservation activities and can facilitate collaborations amongst institutions around shared needs.

One of the most commonly voiced recommendations from iPRES 2015 OSS4PRES workshop attendees was the desire for a centralized location for technical and instructional documentation, end-to-end workflows, case studies, and other resources related to the installation, implementation, and use of OSS tools. This resource could serve as a hub that would enable practitioners to freely and openly exchange information, user requirements, and anecdotal accounts of OSS initiatives and implementations.

At the OSS4Pres 2.0 workshop, the group looking at developing an online space for sharing workflows and implementation experiences started by defining a simple goal and deliverable for the two-hour session:

Develop a list of minimal levels of content that should be included in an open online community space for sharing workflows and other documentation

The group then began developing this list of minimal levels by considering how user stories might inform them. We spent a bit of time drafting a short list of user stories, just enough to provide some insight into the basic structures that would be needed for sharing workflow documentation.

User stories

  • I am using tool 1 and tool 2 and want to know how others have joined them together into a workflow
  • I have a certain type of data to preserve and want to see what workflows other institutions have in place to preserve this data
  • There is a gap in my workflow — a function that we are not carrying out — and I want to see how others have filled this gap
  • I am starting from scratch and need to see some example workflows for inspiration
  • I would like to document my workflow and want to find out how to do this in a way that is useful for others
  • I would like to know why people are using particular tools – for example, is there evidence that they tried another tool that wasn’t successful?

The group then proceeded to define a workflow object as a series of workflow steps, each with its own attributes, plus a visual representation and organizational context:

  • Workflow step
    • Title / name
    • Description
    • Tools / resources
    • Position / role
  • Visual workflow diagrams / model
  • Organizational context
    • Institution type
    • Content type

Next, we started to draft the elements that would make up an initial minimal level for workflow objects (a sketch of such a record follows the list):

Level 1:

  • Title
  • Description
  • Institution / organization type
  • Contact
  • Content type(s)
  • Status
  • Link to external resources
  • Download workflow diagram objects
  • Workflow concerns / reflections / gaps
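To make the template concrete, here is an invented sketch of what a Level 1 workflow record might look like as a simple data structure; all field values are placeholders, and the actual COPTR workflow template is a wiki page, not code.

```python
# An invented example of a "Level 1" workflow record as a plain data
# structure. Field names follow the list above; values are placeholders.
workflow_record = {
    "title": "Example born-digital accessioning workflow",
    "description": "Steps from media receipt through ingest.",
    "institution_type": "Academic library",
    "contact": "digital-preservation@example.edu",
    "content_types": ["disk images", "office documents"],
    "status": "In production since 2015",
    "external_resources": ["https://example.edu/workflow-docs"],
    "diagram": "workflow_diagram_v2.pdf",   # downloadable diagram object
    "concerns_reflections_gaps": "No email capture step yet.",
}
```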

After this effort the group focused on next steps and on how an online community space for sharing workflows could be realized. This discussion led toward pursuing an expansion of COPTR to support sharing of workflow documentation. We outlined a roadmap of next steps toward this goal:

  • Approach the COPTR steering group about adding a workflows space to COPTR
  • Develop home page and workflow template
  • Add examples
  • Group review
  • Promote / launch
  • Evaluation

The group has continued this work post-workshop and has made good progress setting up a Community Owned Workflows section within COPTR and developing an initial workflow template. We are in the midst of creating and evaluating sample workflows to help with revising and tweaking the template as needed. Based on this process, we hope to launch and begin promoting this new online space for sharing workflows in the months ahead. So stay tuned!

____

Sam Meister is the Preservation Communities Manager, working with the MetaArchive Cooperative and BitCurator Consortium communities. Previously, he worked as Digital Archivist and Assistant Professor at the University of Montana. Sam holds a Master of Library and Information Science degree from San Jose State University and a B.A. in Visual Arts from the University of California San Diego. Sam is also an Instructor in the Library of Congress Digital Preservation Education and Outreach Program.

 

Building Bridges and Filling Gaps: OSS4Pres 2.0 at iPRES 2016

By Heidi Elaine Kelly and Shira Peltzman

____

This is the first post in a bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016.

Organized by Sam Meister (Educopia), Shira Peltzman (UCLA), Carl Wilson (Open Preservation Foundation), and Heidi Kelly (Indiana University), OSS4PRES 2.0 was a half-day workshop held during the 13th iPRES conference (iPRES 2016) in Bern, Switzerland. The workshop aimed to bring together digital preservation practitioners, developers, and administrators to discuss the role of open source software (OSS) tools in the field.

Although several months have passed since the workshop wrapped up, we are sharing this information now in an effort to raise awareness of the excellent work completed during this event, to continue the important discussion that took place, and to broaden involvement in some of the projects that developed. First, however, a bit of background: the initial OSS4PRES workshop was held at iPRES 2015 and attended by over 90 digital preservation professionals from all areas of the open source community. Individuals reported on specific issues related to open source tools, followed by small-group discussions about the opportunities, challenges, and gaps they observed. The energy from this initial workshop led both to the proposal of a second workshop and to a report published in the Code4Lib Journal, OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements.

The overarching goal of the 2016 workshop was to build bridges and fill gaps within the open source community at large. To keep the discussion focused and productive, OSS4PRES 2.0 was organized into three groups, each led by one of the workshop’s organizers, while Shira Peltzman floated between groups to minimize overlap and keep each group on task. Beyond maximizing our output, splitting into groups let each group focus on distinct but complementary aspects of the open source community.

Develop user stories for existing tools (group leader: Carl Wilson)

Carl’s group was made up principally of digital preservation practitioners. The group scrutinized existing pain points in the day-to-day management of digital material, identified tools needed by the open source community that had not yet been built, and began to fill this gap by drafting functional requirements for those tools.

Define requirements for online communities to share information about local digital curation and preservation workflows (group leader: Sam Meister)

With an aim to strengthen the overall infrastructure around open source tools in digital preservation, Sam’s group focused on the larger picture by addressing the needs of the open source community at large. The group drafted a list of requirements for an online community space for sharing workflows, tool integrations, and implementation experiences, to facilitate connections between disparate groups, individuals, and organizations that use and rely upon open source tools.

Define requirements for new tools (group leader: Heidi Kelly)

Heidi’s group looked at how the development of open source digital preservation tools could be improved by implementing a set of minimal requirements to make them more user-friendly. Since a list of these requirements specifically for the preservation community had not existed previously, this list both fills a gap and facilitates the building of bridges, by enabling developers to create tools that are easier to use, implement, and contribute to.

Ultimately OSS4PRES 2.0 was an effort to make the open source community more open and diverse, and in the coming weeks we will highlight what each group managed to accomplish towards that end. The blog posts will provide an in-depth summary of the work completed both during and since the event took place, as well as a summary of next steps and potential project outcomes. Stay tuned!

____

Shira Peltzman is the Digital Archivist for the UCLA Library, where she leads the development of a sustainable preservation program for born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts and was a member of the inaugural class of the National Digital Stewardship Residency in New York (NDSR-NY).

Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.

The Best of BDAX: Five Themes from the 2016 Born Digital Archiving & eXchange

By Kate Tasker

———

Put 40 digital archivists, programmers, technologists, curators, scholars, and managers in a room together for three days, give them unlimited cups of tea and coffee, and get ready for some seriously productive discussions.

This magic happened at the Born Digital Archiving & eXchange (BDAX) unconference, held at Stanford University on July 18-20, 2016. I joined the other BDAX attendees to tackle the continuing challenges of acquiring, discovering, delivering and preserving born-digital materials.

The discussions highlighted five key themes to me:

1) Born-digital workflows are, generally, specific

We’re all coping with the general challenges of born-digital archiving, but we’re encountering individual collections that need to be addressed with local solutions and resources. BDAXers generously shared examples of use cases and successful workflows, and although these guidelines couldn’t always translate across diverse institutions (big/small, private/public, IT help/no IT help), they’re a foundation for building best practices that can be adapted to specific needs.

2) We need tools

We need reliable tools that will persist over time to help us understand collections, to record consistent metadata and description, and to discover the characteristics of new content types. Project demos including ePADD, BitCurator Access, bwFLA – Emulation as a Service, UC Irvine’s Virtual Reading Room, the Game Metadata and Citation Project, and the University of Michigan’s ArchivesSpace-Archivematica-DSpace Integration project gave encouragement that tools are maturing and will enable us to work with more confidence and efficiency. (Thanks to all the presenters!)

3) Smart people are on this

A lot of people are doing a lot of work to guide and document efforts in born-digital archiving. We need to share these efforts widely, find common points of application, and build momentum – especially for proposed guidelines, best guesses, and continually changing procedures. (We’re laying this train track as we go, but everybody can get on board!) A brilliant resource from BDAX is a “Topical Brain Dump” Google doc where everyone can share tips related to what we each know about born-digital archives (hat-tip to Kari Smith for creating the doc, and to all BDAXers for their contributions).

4) Talking to each other helps!

Chatting with BDAX colleagues over coffee or lunch provided space to compare notes, seek advice, make connections, and find reassurance that we’re not alone in this difficult endeavor. Published literature is continually emerging on born-digital archiving topics (for example, born-digital description), but if we’re not quite ready to commit our own practices to paper (or rather, magnetic storage media), informal conversations allow us to share ideas and experiences.

5) Born-digital archiving needs YOU

BDAX attendees brainstormed a wide range of topics for discussion, illustrating that born-digital archiving collides with traditional processes at all stages of stewardship, from appraisal to access. All of these functions need to be re-examined and potentially re-imagined. It’s a big job (*understatement*) but brings with it the opportunity to gather perspective and expertise from individuals across different roles. We need to make sure everyone is invited to this party.

How to Get Involved

So, what’s next? The BDAX organizers and attendees recognize that there are many, many more colleagues out there who need to be included in these conversations. Continuing efforts are coalescing around processing levels and metrics for born-digital collections; accurately measuring and recording extent statements for digital content; and managing security and storage needs for unprocessed digital accessions. Please, join in!

You can read extensive notes for each session in this shared Google Drive folder (yes, we did talk about how to archive Google docs!) or catch up on Tweets at #bdax2016.

To subscribe to the BDAX email listserv, please email Michael Olson (mgolson[at]stanford[dot]edu), or, to join the new BDAX Slack channel, email Shira Peltzman (speltzman[at]library[dot]ucla[dot]edu).

———

Kate Tasker works with born-digital collections and information management systems at The Bancroft Library, University of California, Berkeley. She has an MLIS from San Jose State University and is a member of the Academy of Certified Archivists. Kate attended Capture Lab in 2015 and is currently designing workflows to provide access to born-digital collections.

Digital Preservation in the News: Copyright and Abandonware

Heads up for anyone with an interest in video game preservation…

The Electronic Frontier Foundation (EFF) is seeking an exemption to the Prohibition on Circumvention of Copyright Protection Systems for Access Control Technologies (17 U.S.C. § 1201(a)(1)). The exemption is proposed for users who want to modify “videogames that are no longer supported by the developer, and that require communication with a server,” in order to serve player communities who want to maintain the functionality of their games, as well as “archivists, historians, and other academic researchers who preserve and study video games[.]” The proposal emphasizes that the games impacted by this exemption would not be persistent worlds (think World of Warcraft or Eve Online), but rather those games “that must communicate with a remote computer (a server) in order to enable core functionality, and that are no longer supported by the developer.”

The exemption is opposed by the Entertainment Software Association (ESA), which represents major American game publishers and platform providers. The ESA’s response to the EFF proposal argues that the scope of the proposed exemption is too broadly defined, and that “permitting circumvention of video game access controls would increase piracy, significantly reduce users’ options to access copyrighted works on platforms and devices, and decrease the value of these works for copyright owners[.]”

In addition to the comments by the EFF, ESA, and their respective supporters, a number of articles go into much greater detail on this issue.

What do you think? Should there be a legal exemption for modifying unsupported (but still copyright-protected) video games to ensure their enduring usability?

The latest round of public comment on the proposed exemption closes on May 1, 2015. To voice your opinion, follow this link to Copyright.gov, where you can learn more and submit a comment on this and other existing proposals.

Martin Gengenbach is an Assistant Archivist at the Gates Archive.