OSS4Pres 2.0: Developing functional requirements/features for digital preservation tools

By Heidi Elaine Kelly

____

This is the final post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing functional requirements/features for OSS tools the community would like to see built/developed (e.g. tools that could be used during ‘pre-ingest’ stage).”

The Functional Requirements for New Tools and Features Group of the OSS4Pres workshop aimed to write user stories focused on new features that developers can build out to better support digital preservation and archives work. The group was composed largely of practitioners who work with digital curation tools regularly, and was facilitated by Carl Wilson of the Open Preservation Foundation. While their work largely involved writing user stories for development, the group also came up with requirement lists for specific areas of tool development, outlined below. We hope that these lists help continue to bridge the gap between digital preservation professionals and open source developers by providing a deeper perspective on user needs.

Basic Requirements for Tools:

  • Mostly needed for Mac environment
  • No software installation on donor computer
  • No software dependencies requiring installation (e.g., Java)
  • Must be GUI-based, as most archivists are not skilled with the command line
  • Graceful failure

Descriptive Metadata Extraction Needs (using Apache Tika):

  • Archival date
  • Author
  • Authorship location
  • Subject location
  • Subject
  • Document type
  • Removal of spelling errors to improve extracted text

Technical Metadata Extraction Needs:

  • All datetime information available should be retained (minimum of LastModified Date)
  • Technical manifest report
  • File permissions and file ownership permissions
  • Information about the tool that generated the technical manifest report:
    • tool – name of the tool used to gather the disk image
    • tool version – the version of the tool
    • signature version – if the tool uses ‘signatures’ or other add-ons, e.g. which virus scanner software signature – such as signature release July 2014 or v84
    • datetime process run – the datetime information of when the process ran (usually tools will give you when the process was completed) – for each tool that you use

Data Transfer Tool Requirements:

  • Run from portable external device
  • Bag-It standard compliant (build into a “bag”)
  • Able to select a subset of data – not disk image the whole computer
  • GUI-based tool
  • Original file name (also retained in tech manifest)
  • Original file path (also retained in tech manifest)
  • Directory structure (also retained in tech manifest)
  • Address these issues in filenames (record the actual filename in the tech manifest): diacritics (e.g. naïve), illegal characters ( \ / : * ? " < > | ), spaces, em dashes, en dashes, missing file extensions, excessively long file and folder names, etc.
  • Possibly able to connect to “your” FTP site/cloud thingy and send the data there when ready for transfer
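
As a rough illustration of the filename checks listed above, here is a minimal Python sketch of how a transfer tool might flag problem names before recording them in the tech manifest. The function name, the issue labels, and the 255-character threshold are assumptions made for this example; the group did not define what counts as "excessively long."

```python
import unicodedata

# Characters the list above calls illegal in filenames.
ILLEGAL_CHARS = set('\\/:*?"<>|')
# Assumed threshold; the requirements do not give a number.
MAX_NAME_LENGTH = 255


def filename_issues(name: str) -> list[str]:
    """Return labels for the potential problems found in one file name."""
    issues = []
    # Decompose characters so accents show up as separate combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    if any(unicodedata.combining(ch) for ch in decomposed):
        issues.append("diacritics")
    if any(ch in ILLEGAL_CHARS for ch in name):
        issues.append("illegal characters")
    if " " in name:
        issues.append("spaces")
    # U+2014 is the em dash and U+2013 is the en dash.
    if "\u2014" in name or "\u2013" in name:
        issues.append("dashes")
    # A name with no dot after its first character has no extension.
    if "." not in name[1:]:
        issues.append("missing extension")
    if len(name) > MAX_NAME_LENGTH:
        issues.append("long name")
    return issues
```

A tool could run a check like this on every name in the selected subset and write the results to the separate report of problematic files, while leaving the original names untouched.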

Checksum Verification Requirements:

  • File-by-file checksum hash generation
  • Ability to validate the contents of the transfer
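
The two checksum requirements above can be sketched with Python's standard library. This is only an illustration under stated assumptions: the function names are invented for the example, SHA-256 is an arbitrary default, and the manifest is a plain dictionary rather than a BagIt manifest file.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash one file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, algorithm: str = "sha256") -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def validate_transfer(
    root: Path, manifest: dict[str, str], algorithm: str = "sha256"
) -> list[str]:
    """Return relative paths that are missing, changed, or unexpected."""
    current = build_manifest(root, algorithm)
    problems = [
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    ]
    problems += [name for name in current if name not in manifest]
    return sorted(problems)
```

The manifest would be built on the donor side before transfer, and `validate_transfer` run on the receiving side; an empty list means every file arrived unchanged.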

Reporting Requirements:

  • Ability to highlight/report on possibly problematic files/folders in a separate file

Testing Requirements:

  • Access to test corpora, with known issues, for testing the tool

Smart Selection & Appraisal Tool Requirements:

  • DRM/TPMs detection
  • Regular expressions/fuzzy logic for finding certain terms – e.g. phone numbers, security numbers, other predefined personal data
  • Blacklisting of files – configurable list of blacklist terms
  • Shortlisting a set of “questionable” files based on parameters that could then be flagged for a human to do further QA/QC
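
To make the regular-expression and blacklist items above more concrete, here is a small Python sketch. The two patterns are assumptions chosen for illustration (North American phone numbers, and US Social Security numbers as one reading of "security numbers"); a real tool would need broader, tested patterns, and the fuzzy matching mentioned in the list is not attempted here.

```python
import re

# Assumed patterns, for illustration only.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def flag_text(text: str, blacklist: list[str]) -> dict[str, list[str]]:
    """Return matches per category so a person can review the file."""
    hits = {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
    lowered = text.lower()
    # The blacklist is a configurable list of terms, matched case-insensitively.
    hits["blacklist"] = [term for term in blacklist if term.lower() in lowered]
    # Keep only categories that actually matched something.
    return {label: found for label, found in hits.items() if found}
```

Any file for which `flag_text` returns a non-empty result could go on the shortlist of "questionable" files for human QA/QC.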

Specific Features Needed by the Community:

  • Gathering/generating quantitative metrics for web harvests
  • Mitigation strategies for FFMPEG obsolescence
  • TESSERACT language functionality

____

Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.

OSS4Pres 2.0: Sharing is Caring: Developing an online community space for sharing workflows

By Sam Meister

____

This is the third post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing requirements for an online community space for sharing workflows, OSS tool integrations, and implementation experiences.” See our other posts for information on the groups that focused on feature development and design requirements for FOSS tools.

Cultural heritage institutions, from small museums to large academic libraries, have made significant progress developing and implementing workflows to manage local digital curation and preservation activities. Many institutions are at different stages in the maturity of these workflows. Some are just getting started, and others have had established workflows for many years. Documentation assists institutions in representing current practices and functions as a benchmark for future organizational decision-making and improvements. Additionally, sharing documentation assists in creating cross-institutional understanding of digital curation and preservation activities and can facilitate collaborations amongst institutions around shared needs.

One of the most commonly voiced recommendations from iPRES 2015 OSS4PRES workshop attendees was the desire for a centralized location for technical and instructional documentation, end-to-end workflows, case studies, and other resources related to the installation, implementation, and use of OSS tools. This resource could serve as a hub that would enable practitioners to freely and openly exchange information, user requirements, and anecdotal accounts of OSS initiatives and implementations.

At the OSS4Pres 2.0 workshop, the group of folks looking at developing an online space for sharing workflows and implementation experience started by defining a simple goal and deliverable for the two hour session:

Develop a list of minimal levels of content that should be included in an open online community space for sharing workflows and other documentation

The group then began a discussion on developing this list of minimal levels by thinking about the potential value of user stories in informing these levels. We spent a bit of time proposing a short list of user stories, just enough to provide some insight into the basic structures that would be needed for sharing workflow documentation.

User stories

  • I am using tool 1 and tool 2 and want to know how others have joined them together into a workflow
  • I have a certain type of data to preserve and want to see what workflows other institutions have in place to preserve this data
  • There is a gap in my workflow — a function that we are not carrying out — and I want to see how others have filled this gap
  • I am starting from scratch and need to see some example workflows for inspiration
  • I would like to document my workflow and want to find out how to do this in a way that is useful for others
  • I would like to know why people are using particular tools – is there evidence that they tried another tool, for example, that wasn’t successful?

The group then proceeded to define a workflow object as a series of workflow steps with its own attributes, a visual representation, and organizational context:

Workflow step
Title / name
Description
Tools / resources
Position / role

Visual workflow diagrams / model
Organizational Context
            Institution type
            Content type

Next, we started to draft out the different elements that would be part of an initial minimal level for workflow objects:

Level 1:

Title
Description
Institution / organization type
Contact
Content type(s)
Status
Link to external resources
Download workflow diagram objects
Workflow concerns / reflections / gaps
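
One way to picture the workflow object and its Level 1 elements is as a pair of simple record types. The Python sketch below is only an illustration: the class names, field names, and the `tools_used` helper are assumptions made for this example, not the template the group went on to develop.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowStep:
    """One step in a workflow, following the attributes listed above."""
    title: str
    description: str
    tools: list[str] = field(default_factory=list)
    role: str = ""


@dataclass
class Workflow:
    """A workflow object carrying the Level 1 elements."""
    title: str
    description: str
    institution_type: str
    contact: str
    content_types: list[str]
    status: str
    external_links: list[str] = field(default_factory=list)
    diagram_files: list[str] = field(default_factory=list)
    concerns: str = ""
    steps: list[WorkflowStep] = field(default_factory=list)

    def tools_used(self) -> list[str]:
        """List every tool named in any step, in first-seen order."""
        seen: list[str] = []
        for step in self.steps:
            for tool in step.tools:
                if tool not in seen:
                    seen.append(tool)
        return seen
```

Keeping tools attached to steps is what would let a shared space answer the first user story: finding workflows in which tool 1 and tool 2 appear together.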

After this effort the group focused on discussing next steps and how an online community space for sharing workflows could be realized. This discussion led toward pursuing the expansion of COPTR to support sharing of workflow documentation. We outlined a roadmap for next steps toward pursuing this goal:

  • Propose / approach COPTR steering group on adding workflows space to COPTR
  • Develop home page and workflow template
  • Add examples
  • Group review
  • Promote / launch
  • Evaluation

The group has continued this work post-workshop and has made good progress setting up a Community Owned Workflows section in COPTR and developing an initial workflow template. We are in the midst of creating and evaluating sample workflows to help with revising and tweaking as needed. Based on this process we hope to launch and start promoting this new online space for sharing workflows in the months ahead. So stay tuned!

____

Sam Meister is the Preservation Communities Manager, working with the MetaArchive Cooperative and BitCurator Consortium communities. Previously, he worked as Digital Archivist and Assistant Professor at the University of Montana. Sam holds a Master of Library and Information Science degree from San Jose State University and a B.A. in Visual Arts from the University of California San Diego. Sam is also an Instructor in the Library of Congress Digital Preservation Education and Outreach Program.

 

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.

Digital Preservation, Eh?

by Alexandra Jokinen

This post is the third post in our series on international perspectives on digital preservation.

___

Hello / Bonjour!

Welcome to the Canadian edition of International Perspectives on Digital Preservation. My name is Alexandra Jokinen. I am the new(ish) Digital Archives Intern at Dalhousie University in Halifax, Nova Scotia. I work closely with the Digital Archivist, Creighton Barrett, to aid in the development of policies and procedures for some key aspects of the University Libraries’ digital archives program—acquisitions, appraisal, arrangement, description, and preservation.

One of the ways in which we are beginning to tackle this very large, very complex (but exciting!) endeavour is to execute digital preservation on a small scale, focusing on the processing of digital objects within a single collection, and then using those experiences to create documentation and workflows for different aspects of the digital archives program.

The collection chosen to be our guinea pig was a recent donation of work from esteemed Canadian ecologist and environmental scientist, Bill Freedman, who taught and conducted research at Dalhousie from 1979 to 2015. The fonds is a hybrid of analogue and digital materials dating from 1988 to 2015. Digital media carriers include: 1 computer system unit, 5 laptops, 2 external hard drives, 7 USB flash drives, 5 zip disks, 57 CDs, 6 DVDs, 67 5.25 inch floppy disks and 228 3.5 inch floppy disks. This is more digital material than the archives is likely to acquire in future accessions, but the Freedman collection acted as a good test case because it provided us with a comprehensive variety of digital formats to work with.

Our first area of focus was appraisal. For the analogue material in the collection, this process was pretty straightforward: conduct macro-appraisal and functional analysis by physically reviewing material. However, as could be expected, appraisal of the digital material was much more difficult to complete. The archives recently purchased a forensic recovery of evidence device (FRED) but does not yet have all the necessary software and hardware to read the legacy formats in the collection (such as the floppy disks and zip disks), so we started by investigating the external hard drives and USB flash drives. After examining their content, we were able to get an accurate sense of the information they contained, the organizational structure of the files, and the types of formats created by Freedman. Although we were not able to examine files on the legacy media, we felt that we had enough context to perform appraisal, determine selection criteria and formulate an arrangement structure for the collection.

The next step of the project will be to physically organize the material. This will involve separating, photographing and reboxing the digital media carriers and updating a new registry of digital media that was created during a recent digital archives collection assessment modelled after OCLC’s 2012 “You’ve Got to Walk Before You Can Run” research report. Then, we will need to process the digital media, which will entail creating disk images with our FRED machine and using forensic tools to analyze the data.  Hopefully, this will allow us to apply the selection criteria used on the analogue records to the digital records and weed out what we do not want to retain. During this process, we will be creating procedure documentation on accessioning digital media as well as updating the archives’ accessioning manual.

The project’s final steps will be to take the born-digital content we have collected and ingest it using Archivematica to create Archival Information Packages for storage and preservation, and to provide access via the Archives Catalogue and Online Collections.

So there you have it! We have a long way to go in terms of digital preservation here at Dalhousie (and we are just getting started!), but hopefully our work over the next several months will ensure that solid policies and procedures are in place for maintaining a trustworthy digital preservation system in the future.

This internship is funded in part by a grant from the Young Canada Works Building Careers in Heritage Program, a Canadian federal government program for graduates transitioning to the workplace.

___

Alexandra Jokinen has a Master’s Degree in Film and Photography Preservation and Collections Management from Ryerson University in Toronto. Previously, she has worked as an Archivist at the Liaison of Independent Filmmakers of Toronto and completed a professional practice project at TIFF Film Reference Library and Special Collections.

Connect with me on LinkedIn!

Practical Digital Preservation: In-House Solutions to Digital Preservation for Small Institutions

By Tyler McNally

This post is the tenth post in our series on processing digital materials.

Many archives don’t have the resources to install software or subscribe to a service such as Archivematica, but still have a mandate to collect and preserve born-digital records. Below is a digital-preservation workflow created by Tyler McNally at the University of Manitoba. If you have a similar workflow at your institution, include it in the comments. 

———

Recently I completed an internship at the University of Manitoba’s College of Medicine Archives, working with Medical Archivist Jordan Bass. A large part of my work during this internship dealt with building digital infrastructure for the archive to use in its digital preservation work. As a small operation, the archive does not have the resources to pursue any kind of paid or difficult-to-use system.

Originally, our plan was to use the open-source, self-install version of Archivematica, but certain issues that cropped up made this impossible, considering the resources we had at hand. We decided that we would simply make our own digital-preservation workflow, using open-source and free software to convert our files for preservation and access, check for viruses, and create checksums—not every service that Archivematica offers, but enough to get our files stored safely. I thought other institutions of similar size and means might find the process I developed useful in thinking about their own needs and capabilities.

Using NLP to Support Dynamic Arrangement, Description, and Discovery of Born Digital Collections: The ArchExtract Experiment

By Mary W. Elings

This post is the eighth in our Spring 2016 series on processing digital materials.

———

Many of us working with archival materials are looking for tools and methods to support arrangement, description, and discovery of electronic records and born digital collections, as well as large bodies of digitized text. Natural Language Processing (NLP), which uses algorithms and mathematical models to process natural language, offers a variety of potential solutions to support this work. Several efforts have investigated using NLP solutions for analyzing archival materials, including TOME (Interactive TOpic Model and MEtadata Visualization), Ed Summers’ Fondz, and Thomas Padilla’s Woese Collection work, among others, though none have resulted in a major tool for broader use.

One of these projects, ArchExtract, was carried out at UC Berkeley’s Bancroft Library in 2014-2015. ArchExtract sought to apply several NLP tools and methods to large digital text collections and build a web application that would package these largely command-line NLP tools into an interface that would make it easy for archivists and researchers to use.

The ArchExtract project focused on facilitating analysis of the content and, via that analysis, discovery by researchers. The development work was done by an intern from the UC Berkeley School of Information, Janine Heiser, who built a web application that implements several NLP tools, including Topic Modelling, Named Entity Recognition, and Keyword Extraction to explore and present large, text-based digital collections.

The ArchExtract application extracts topics, named entities (people, places, subjects, dates, etc.), and keywords from a given collection. The application automates, implements, and extends various natural language processing software tools, such as MALLET and the Stanford Core NLP toolkit, and provides a graphical user interface designed for non-technical users.

 

ArchExtract Interface Showing Topic Model Results. Elings/Heiser, 2015.

In testing the application, we found the automated text analysis tools in ArchExtract were successful in identifying major topics, as well as names, dates, and places found in the text, and their frequency, thereby giving archivists an understanding of the scope and content of a collection as part of the arrangement and description process. We called this process “dynamic arrangement and description,” as materials can be re-arranged using different text processing settings so that archivists can look critically at the collection without changing the physical or virtual arrangement.

The topic models, in particular, surfaced documents that may have been related to a topic but did not contain a specific keyword or entity. The process was akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world, wherein you might find what you seek without knowing it was there. And while topic modelling has been criticized for being inexact, it can be “immensely powerful for browsing and isolating results in thousands or millions of uncatalogued texts” (Schmidt, 2012). This, combined with the named entity and keyword extraction, can give archivists and researchers important data that could be used in describing and discovering material.

ArchExtract Interface Showing Named Entity Recognition Results. Elings/Heiser, 2015.

As a demonstration project, ArchExtract was successful in achieving our goals. The code developed is documented and freely available on GitHub to anyone interested in how it was done or who might wish to take it further. We are very excited by the potential of these tools in dynamically arranging and describing large, text-based digital collections, but even more so by their application in discovery. We are particularly pleased that broad, open source projects like BitCurator and ePADD are taking this work forward and will be bringing NLP tools into environments that we can all take advantage of in processing and providing access to our born digital materials.

———

Mary W. Elings is the Principal Archivist for Digital Collections and Head of the Digital Collections Unit of The Bancroft Library at the University of California, Berkeley. She is responsible for all aspects of the digital collections, including managing digital curation activities, the born digital archives program, web archiving, digital processing, mass digitization, finding aid publication and maintenance, metadata, archival information management and digital asset management, and digital initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings co-authored the article “Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums,” and wrote a primer on linked data for LAMs. She has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York (2003-2009) and School of Library and Information Science, Catholic University, Washington, DC (2010-2014), and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program (2010-present).

Let the Entities Describe Themselves

By Josh Schneider and Peter Chan

This is the fifth post in our Spring 2016 series on processing digital materials.

———

Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals in arranging and describing materials: screening for confidential, restricted, or legally-protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by optimizing discovery of and access to the material’s intellectual content.

More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic ToolKit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects (Overview) outside of the archival community are also exploring similar technologies to make sense of large corpuses of text.

From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.

Currently, however, there are no open source NER tools broadly tuned towards the diverse variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, which is also trained to identify probable matches, recognizes other entity types such as events and diseases, the latter of which can assist with screening for protected health information.

What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.

Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.

It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.

———

Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.

Preservation Beyond the Bits: An Interview with Linda Tadic

Linda Tadic is founder and CEO of Digital Bedrock (www.digitalbedrock.com), a managed digital preservation and consulting service in Los Angeles. A founding member and former president of the Association of Moving Image Archivists, she has written and given lectures on AV metadata, copyright, and digital asset management and preservation. She is adjunct professor in UCLA’s Department of Information Studies’ Media Archives Studies program.

We asked Tadic about her research into the environmental consequences of digital preservation. Her presentation “The Environmental Impact of Digital Preservation,” which she’s given in Portland, OR, Singapore, and Paris, describes the relationship of digital preservation to ongoing environmental degradation and outlines ways archivists and archival institutions can lessen their impact. Slides and notes for the presentation can be found at www.digitalbedrock.com/resources.

This interview was conducted over email.

Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.

———

Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation on the content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the .pdf finding aids from post-1960 with terms like “digital,” “electronic,” “disk,” —you get the idea. Once we knew how many items we were dealing with and where we could find them, we could begin.

Early days, testing and fails

When I first started, I optimistically thought I would just fire up BitCurator and everything would work.

It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually, and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:

[Image: whyte02]
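
For readers without access to the script, the routing idea can be sketched in Python. This is not the project’s bash script. It assumes disktype prints lines such as “FAT12 file system” or “HFS file system”, and it assumes (based on the description above) that FAT images go straight to fiwalk while HFS images are mounted read-only and walked with md5deep. The function names and mount point are hypothetical, and the commands are only built, not run.

```python
import re


def detect_filesystem(disktype_output):
    """Return 'fat', 'hfs', or None from the text printed by disktype."""
    if re.search(r"\bFAT(12|16|32) file system", disktype_output):
        return "fat"
    if re.search(r"\bHFS file system", disktype_output):
        return "hfs"
    return None


def plan_commands(image, filesystem, mount_point="/mnt/image"):
    """Build (but do not run) the commands for one disk image."""
    dfxml = image + ".xml"
    if filesystem == "fat":
        # fiwalk reads the image directly; -X names the DFXML output file.
        return [["fiwalk", "-X", dfxml, image]]
    if filesystem == "hfs":
        return [
            ["mount", "-t", "hfs", "-o", "loop,ro", image, mount_point],
            # md5deep: -r recurses into directories, -d prints DFXML to
            # stdout, so a real script would redirect it to `dfxml`.
            ["md5deep", "-r", "-d", mount_point],
            ["umount", mount_point],
        ]
    # Unrecognized filesystems are left for manual handling.
    return []


# Assumed sample of disktype output, for illustration only.
sample_output = "--- disk001.img\nFAT12 file system (hints score 5 of 5)\n"
fs = detect_filesystem(sample_output)
print(fs, plan_commands("disk001.img", fs))
```

Returning an empty plan for anything unrecognized keeps outliers visible instead of silently processing them with the wrong tool.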

I could now turn this (collection of image files and corresponding imaging logs):

[Image: Whyte03]

into this (collection of image files, logs, DFXML files, and CSV manifests):

[Image: Whyte04]

Or, to put it another way, I could now take one of these:

[Image: Whyte05]

And rapidly turn it into this ready-to-be-bagged package:

[Image: Whyte06]

Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (that is, they won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this would slow it down significantly and make quick passes through a collection of images impossible (during the early testing phases of this pilot, quick passes are critical). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
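
One way to test the “not invasive” requirement is to checksum each image before and after a scan and compare the two values. The sketch below is an assumption-laden illustration, not part of the project’s workflow: `scan_unchanged` is a hypothetical helper, and `scanner` stands in for whatever virus checker is eventually chosen.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_unchanged(image_path, scanner):
    """Run `scanner(image_path)` and report whether the image stayed intact.

    `scanner` is any callable that returns True when the image is flagged.
    Returns a (flagged, unchanged) tuple.
    """
    before = md5_of(image_path)
    flagged = bool(scanner(image_path))
    after = md5_of(image_path)
    return flagged, before == after


# Demonstration with a throwaway file and a stand-in scanner that flags nothing.
with tempfile.NamedTemporaryFile(delete=False, suffix=".img") as tmp:
    tmp.write(b"\x00" * 512)
flagged, unchanged = scan_unchanged(tmp.name, lambda path: False)
print(flagged, unchanged)
os.remove(tmp.name)
```

Because the check wraps the scanner rather than living inside it, the same comparison works whether scans run inside the processing script or as a separate overnight pass.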

Adding More Filesystem Recognition:
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
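
The manifest-based indicators (was the DFXML created, and does it list any files?) lend themselves to an automated pass. The following is a rough sketch under stated assumptions: it treats any element whose local name is `fileobject` as one file entry, and the function names and status labels are hypothetical rather than taken from the project.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def count_fileobjects(dfxml_path):
    """Count <fileobject> entries in a DFXML file, ignoring XML namespaces."""
    count = 0
    for _, element in ET.iterparse(dfxml_path):
        # Tags look like "{namespace}fileobject" when a namespace is declared.
        if element.tag.rsplit("}", 1)[-1] == "fileobject":
            count += 1
    return count


def check_manifest(dfxml_path):
    """Return 'missing', 'unreadable', 'empty', or 'ok' for one DFXML file."""
    if not os.path.exists(dfxml_path):
        return "missing"
    try:
        found = count_fileobjects(dfxml_path)
    except ET.ParseError:
        return "unreadable"
    return "ok" if found else "empty"


# Demonstration with a tiny hand-written manifest.
sample = "<dfxml><fileobject><filename>README.TXT</filename></fileobject></dfxml>"
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".xml") as tmp:
    tmp.write(sample)
print(check_manifest(tmp.name))
print(check_manifest(tmp.name + ".nope"))
os.remove(tmp.name)
```

Anything other than “ok” would be a cue to open that image by hand, which keeps human attention on the exceptions.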

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.

———

Jess Whyte is a Master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.

Resources:

Gengenbach, M. (2012). “The way we do it here”: Mapping digital forensics workflows in collecting institutions. Unpublished master’s thesis, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

Goldman, B. (2011). Bridging the gap: Taking practical steps toward managing born-digital collections in manuscript repositories. RBM: A Journal of Rare Books, Manuscripts and Cultural Heritage, 12(1), 11-24.

Prael, A., & Wickner, A. (2015). Getting to Know FRED: Introducing Workflows for Born-Digital Content.

Request for contributors to a new series on bloggERS!

The editors at bloggERS! HQ are looking for authors to write for a new series of posts, and we’d like to hear from YOU.

The topic of the next series on the Electronic Records Section blog is processing digital materials: what it is, how practitioners are doing it, and how they are measuring their work.

How are you processing digital materials? And how do you define “digital processing,” anyway?

The what and how of digital processing are dependent upon a variety of factors: available resources and technical expertise, the tools, systems, and infrastructure that are particular to an organization, and the nature of the digital materials themselves.

  • What tools are you using, and how do they integrate with your physical arrangement and description practices?
  • Are you leveraging automation, topic modeling, text analysis, named entity recognition, or other technologies in your processing workflows?
  • How are you working with different types of digital content, such as email, websites, documents, and digital images?
  • What are the biggest challenges that you have encountered? What is your biggest recent digital processing success? What would you like to be able to do, and what are your blockers?

If you have answers to any of these questions, or you are thinking of other questions we haven’t asked here, then consider writing a post to share your experiences (good or bad) processing digital materials.

Quantifying and tracking digital processing activities

Many organizations maintain processing metrics, such as hours per linear foot. In processing digital materials, the level of effort may be more dependent upon the type and format of the materials than their extent.

  • What metrics make sense for quantifying digital processing activities?  
  • How does your organization track the pace and efficiency of digital processing activities?
  • Have you explored any alternative ways of documenting digital processing activity?

If you have been working to answer any of these questions for yourself or your institution, we’d like to hear from you!

Writing for bloggERS!

  • Posts should be between 200 and 600 words in length
  • Posts can take many forms: instructional guides, in-depth tool explorations, surveys, dialogues, and point-counterpoint debates are all welcome!
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in early April, so let us know ASAP if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!