Latest #bdaccess Twitter Chat Recap

By Daniel Johnson and Seth Anderson

This post is the eighteenth in a bloggERS series about access to born-digital materials.

____

In preparation for the Born Digital Access Bootcamp: A Collaborative Learning Forum at the New England Archivists spring meeting, an ad-hoc born-digital access group with the Digital Library Federation recently held a set of #bdaccess Twitter chats. The discussions aimed to gain insight into issues that archives and library staff face when providing access to born-digital materials.

Here are a few ideas that were discussed during the two chats:

  • Backlogs, workflows, delivery mechanisms, a lack of known standards, appraisal, and unfamiliarity with software were identified as major barriers to providing access.
  • Participants were eager to learn more about new tools, existing functioning systems, providing access to restricted material and complicated objects, which institutions are already providing access to data, what researchers want and need, and whether any user testing has been done.
  • Access is being prioritized based on user demand, donor concerns, and fragile formats, as well as a general mandate that born-digital records are not considered preserved unless access is provided.
  • Very little user testing has been done.
  • A variety of archivists, IT staff and services librarians are needed to provide access.

You can search #bdaccess on Twitter to see how the conversation evolves or view the complete conversation from these chats on Storify.

The Twitter chats were organized by a group formed at the 2015 SAA annual meeting. Stay tuned for future chats and other ways to get involved!

____

Daniel Johnson is the digital preservation librarian at the University of Iowa, exploring, adapting, and implementing digital preservation policies and strategies for the long-term protection of, and access to, digital materials.

Seth Anderson is the project manager of the MoMA Electronic Records Archive initiative, overseeing the implementation of policy, procedures, and tools for the management and preservation of the Museum of Modern Art’s born-digital records.

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.

The Best of BDAX: Five Themes from the 2016 Born Digital Archiving & eXchange

By Kate Tasker

———

Put 40 digital archivists, programmers, technologists, curators, scholars, and managers in a room together for three days, give them unlimited cups of tea and coffee, and get ready for some seriously productive discussions.

This magic happened at the Born Digital Archiving & eXchange (BDAX) unconference, held at Stanford University on July 18-20, 2016. I joined the other BDAX attendees to tackle the continuing challenges of acquiring, discovering, delivering and preserving born-digital materials.

The discussions highlighted five key themes for me:

1) Born-digital workflows are, generally, specific

We’re all coping with the general challenges of born-digital archiving, but we’re encountering individual collections which need to be addressed with local solutions and resources. BDAXers generously shared examples of use cases and successful workflows, and, although these guidelines couldn’t always translate across diverse institutions (big/small, private/public, IT help/no IT help), they’re a foundation for building best practices which can be adapted to specific needs.

2) We need tools

We need reliable tools that will persist over time to help us understand collections, to record consistent metadata and description, and to discover the characteristics of new content types. Project demos including ePADD, BitCurator Access, bwFLA – Emulation as a Service, UC Irvine’s Virtual Reading Room, the Game Metadata and Citation Project, and the University of Michigan’s ArchivesSpace-Archivematica-DSpace Integration project gave encouragement that tools are maturing and will enable us to work with more confidence and efficiency. (Thanks to all the presenters!)

3) Smart people are on this

A lot of people are doing a lot of work to guide and document efforts in born-digital archiving. We need to share these efforts widely, find common points of application, and build momentum – especially for proposed guidelines, best guesses, and continually changing procedures. (We’re laying this train track as we go, but everybody can get on board!) A brilliant resource from BDAX is a “Topical Brain Dump” Google doc where everyone can share tips related to what we each know about born-digital archives (hat-tip to Kari Smith for creating the doc, and to all BDAXers for their contributions).

4) Talking to each other helps!

Chatting with BDAX colleagues over coffee or lunch provided space to compare notes, seek advice, make connections, and find reassurance that we’re not alone in this difficult endeavor. Published literature is continually emerging on born-digital archiving topics (for example, born-digital description), but if we’re not quite ready to commit our own practices to paper (or rather, to magnetic storage media), then informal conversations allow us to share ideas and experiences.

5) Born-digital archiving needs YOU

BDAX attendees brainstormed a wide range of topics for discussion, illustrating that born-digital archiving collides with traditional processes at all stages of stewardship, from appraisal to access. All of these functions need to be re-examined and potentially re-imagined. It’s a big job (*understatement*) but brings with it the opportunity to gather perspective and expertise from individuals across different roles. We need to make sure everyone is invited to this party.

How to Get Involved

So, what’s next? The BDAX organizers and attendees recognize that there are many, many more colleagues out there who need to be included in these conversations. Continuing efforts are coalescing around processing levels and metrics for born-digital collections; accurately measuring and recording extent statements for digital content; and managing security and storage needs for unprocessed digital accessions. Please, join in!

You can read extensive notes for each session in this shared Google Drive folder (yes, we did talk about how to archive Google docs!) or catch up on Tweets at #bdax2016.

To subscribe to the BDAX email listserv, please email Michael Olson (mgolson[at]stanford[dot]edu), or, to join the new BDAX Slack channel, email Shira Peltzman (speltzman[at]library[dot]ucla[dot]edu).

———

Kate Tasker works with born-digital collections and information management systems at The Bancroft Library, University of California, Berkeley. She has an MLIS from San Jose State University and is a member of the Academy of Certified Archivists. Kate attended Capture Lab in 2015 and is currently designing workflows to provide access to born-digital collections.

Recent Changes in How Stanford University Libraries is Documenting Born-Digital Processing

By Michael G. Olson

This post is the third in our Spring 2016 series on processing digital materials.

———

Stanford University Libraries is in the process of changing how it documents its digital processing activities and records lab statistics. This is the third iteration in six years of how we track our born-digital work, and it is a collaborative effort between Digital Library Systems and Services, our Digital Archivist Peter Chan, and Glynn Edwards, who manages our Born-Digital Program and is the Director of the ePADD project.

Initially we documented our statistics in a library-hosted FileMaker Pro database, focusing on tracking media counts and media failure rates. After a single year of using the database we decided that the data structure and the data entry templates needed significant modification, but our staff found the database too time-consuming and cumbersome to modify.

We decided to simplify and replaced the database with a spreadsheet stored with our collection data. Our digital archivist and hourly lab employees were responsible for updating this spreadsheet when they had finished working with a collection. This was a simple solution that was easy to edit and update, and it worked well for four years until we realized we needed more data for our fiscal year-end reports. As our born-digital program has grown and matured, we discovered we were missing key data points that documented important processing decisions in our workflows. It was time to again improve how we documented our work.

[Image: Stanford Statistics Spreadsheet version 2]

For the newest version of our work tracking we have decided to continue using a spreadsheet, but we have migrated our data to Google Drive to better facilitate updates and versioning of our documentation. New data points have been included to track specific types of born-digital content, such as email, and to document the processing lifecycle of our born-digital collections. To that end, we have added the following data points:

  • Number of email messages
  • Email in ePADD.stanford.edu
  • File count in media cart
  • File size on media cart (GB)
  • SearchWorks (materials discoverable / available in library catalog)
  • SpotLight Exhibit (a virtual exhibit)

[Image: Stanford Statistics Spreadsheet version 3]

We anticipate that evolving library administrative needs, the continually changing nature of born-digital data, and new methodologies for processing these materials will make it necessary to again change how we document our work. Our solution is not perfect but is flexible enough to allow us to reimagine our documentation strategy in a few short years. If anyone is interested in learning more about what we are documenting and why, please do let us know, as we would be happy to provide further information and may learn something from our colleagues in the process.

———

Michael G. Olson is the Service Manager for the Born-Digital / Forensics Labs at Stanford University Libraries. In this capacity he is responsible for working with library stakeholders to develop services for acquiring, preserving and accessing born-digital library materials. Michael holds a Masters in Philosophy in History and Computing from the University of Glasgow. He can be reached at mgolson [at] Stanford [dot] edu.

Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.

———

Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and to establish a baseline level of preservation for that content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch-searching all the post-1960 .pdf finding aids for terms like “digital,” “electronic,” and “disk” (you get the idea). Once we knew how many items we were dealing with and where we could find them, we could begin.
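
A batch search like this can be scripted with standard command-line tools. As a minimal sketch (assuming the finding aids sit in a local folder and that pdftotext from poppler-utils is installed; this is an illustration, not the exact method used):

    #!/bin/bash
    # Sketch: flag finding aids that mention born-digital media.
    # Assumes the PDFs live in ./finding-aids/ and that pdftotext
    # (poppler-utils) is installed; terms and paths are illustrative.
    for f in finding-aids/*.pdf; do
        # Convert each PDF to text on stdout and search it quietly.
        if pdftotext "$f" - | grep -qiE 'digital|electronic|disk(ette)?|floppy|cd-rom'; then
            echo "possible digital content: $f"
        fi
    done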

Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.

It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually–and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:

[Diagram: disk images are identified with disktype and routed by filesystem to either fiwalk or md5deep]
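
In outline, the routing logic looks something like the sketch below. This is an illustration of the approach rather than the script linked above; the mount point and output naming are assumptions, and the CSV manifest step is left out.

    #!/bin/bash
    # Sketch of the routing logic: identify the filesystem with disktype,
    # then send FAT images to fiwalk and HFS images to md5deep.
    # Mount point and output naming are illustrative assumptions.
    set -eu

    IMAGE="$1"                    # e.g. accession-001.img
    OUT="${IMAGE%.img}.xml"       # DFXML manifest written alongside the image
    MNT="/mnt/diskimage"          # assumed read-only mount point

    FS="$(disktype "$IMAGE")"

    if echo "$FS" | grep -q "FAT"; then
        # fiwalk understands FAT and writes DFXML straight from the image.
        fiwalk -X "$OUT" "$IMAGE"
    elif echo "$FS" | grep -q "HFS"; then
        # fiwalk does not support HFS, so mount the image read-only and
        # let md5deep walk the mounted files, emitting DFXML (-d).
        sudo mount -t hfs -o ro,loop "$IMAGE" "$MNT"
        (cd "$MNT" && md5deep -r -d .) > "$OUT"
        sudo umount "$MNT"
    else
        echo "Unrecognized filesystem for $IMAGE; handle manually" >&2
    fi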

I could now turn a collection of image files and their corresponding imaging logs into a collection of image files, logs, DFXML files, and CSV manifests. Or, to put it another way, I could take a single disk and rapidly turn it into a ready-to-be-bagged package.

Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (i.e., they won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this would slow it down significantly and make quick passes through a collection of images impossible (and quick passes are critical during the early, testing phases of this pilot). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images, and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
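
As a sketch of the second option (an overnight scan before bagging), ClamAV can be pointed at a read-only mount of each image so nothing is modified; the paths and naming below are assumptions, not our finished workflow:

    #!/bin/bash
    # Sketch: overnight ClamAV scan of captured disk images, run before
    # bagging. Each image is mounted read-only, so the scan never writes
    # to it. Paths and naming are illustrative assumptions.
    LOG="virus-scan-$(date +%F).log"

    for IMAGE in images/*.img; do
        MNT="$(mktemp -d)"
        if sudo mount -o ro,loop "$IMAGE" "$MNT" 2>/dev/null; then
            # --infected prints only files that match a signature.
            clamscan -r --infected --no-summary "$MNT" >> "$LOG"
            sudo umount "$MNT"
        else
            echo "could not mount $IMAGE for scanning" >> "$LOG"
        fi
        rmdir "$MNT"
    done

Any image that shows up in the log can then be set aside for a closer look before bagging.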

Adding More Filesystem Recognition
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
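
Some of those checks lend themselves to automation. As a minimal sketch (assuming one DFXML manifest sits alongside each image, as in the routing sketch above, rather than reflecting our actual checks), a post-capture pass could flag manifests that are missing, empty, or that record no files:

    #!/bin/bash
    # Sketch: flag captures whose DFXML manifest is missing, empty, or
    # records no files. Assumes a <name>.xml manifest next to each
    # <name>.img, as in the routing sketch above.
    for IMAGE in images/*.img; do
        XML="${IMAGE%.img}.xml"
        if [ ! -s "$XML" ]; then
            echo "CHECK: no manifest for $IMAGE"
            continue
        fi
        # Count the <fileobject> entries recorded in the DFXML.
        COUNT="$(grep -c '<fileobject>' "$XML")"
        if [ "$COUNT" -eq 0 ]; then
            echo "CHECK: manifest for $IMAGE lists no files"
        fi
    done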

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.

———

Jess Whyte is a Master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.

Resources:

Gengenbach, M. (2012). “The way we do it here”: Mapping digital forensics workflows in collecting institutions. Unpublished master’s thesis, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

Goldman, B. (2011). Bridging the gap: taking practical steps toward managing born-digital collections in manuscript repositories. RBM: A Journal of Rare Books, Manuscripts and Cultural Heritage, 12(1), 11-24

Prael, A., & Wickner, A. (2015). Getting to Know FRED: Introducing Workflows for Born-Digital Content.

Request for contributors to a new series on bloggERS!

The editors at bloggERS! HQ are looking for authors to write for a new series of posts, and we’d like to hear from YOU.

The topic of the next series on the Electronic Records Section blog is processing digital materials: what it is, how practitioners are doing it, and how they are measuring their work.

How are you processing digital materials? And how do you define “digital processing,” anyway?

The what and how of digital processing are dependent upon a variety of factors: available resources and technical expertise; the tools, systems, and infrastructure that are particular to an organization; and the nature of the digital materials themselves.

  • What tools are you using, and how do they integrate with your physical arrangement and description practices?
  • Are you leveraging automation, topic modeling, text analysis, named entity recognition, or other technologies in your processing workflows?
  • How are you working with different types of digital content, such as email, websites, documents, and digital images?
  • What are the biggest challenges that you have encountered? What is your biggest recent digital processing success? What would you like to be able to do, and what are your blockers?

If you have answers to any of these questions, or you are thinking of other questions we haven’t asked here, then consider writing a post to share your experiences (good or bad) processing digital materials.

Quantifying and tracking digital processing activities

Many organizations maintain processing metrics, such as hours per linear foot. In processing digital materials, the level of effort may be more dependent upon the type and format of the materials than their extent.

  • What metrics make sense for quantifying digital processing activities?  
  • How does your organization track the pace and efficiency of digital processing activities?
  • Have you explored any alternative ways of documenting digital processing activity?

If you have been working to answer any of these questions for yourself or your institution, we’d like to hear from you!

Writing for bloggERS!

  • Posts should be between 200 and 600 words in length
  • Posts can take many forms; instructional guides, in-depth tool explorations, surveys, dialogues, and point-counterpoint debates are all welcome!
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in early April, so let us know ASAP if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

Publication of the University of Minnesota Libraries Electronic Records Task Force (ERTF) Final Report

By Carol Kussmann

In May of 2014 the University of Minnesota Libraries charged the Electronic Records Task Force (ERTF) with developing initial capacity for ingesting and processing electronic records, and with defining ingest and processing workflows as well as tasks and workflows for both technical and archival staff. The ERTF is pleased to announce that the final report is now available here through the University Digital Conservancy.

The report describes our work over the past year and covers many different areas in detail including our workstation setup, software and equipment, security concerns, minimal task selection for ingesting materials, and tool and software evaluation and selection.  The report also talks about how and why we modified workflows when we ran into situations that warranted change.  To assist in addressing staffing needs we summarized the amount of work we were able to complete, how much time it took to do so, and projections for known future work.

High level lessons learned include:

  • Each accession brings with it its own questions and issues. For example:
    • A music collection had many file names containing special characters, which caused those files to fail to transfer correctly (a quick way to spot such names before transfer is sketched after this list).
    • New accessions can include new file formats or types that you are unfamiliar with and need to find the best way to address. Email was one of those for us. [Which we still haven’t fully resolved.]
    • Some collections are ‘simple’ to ingest because all the files are contained on a single hard drive; others are more complicated or more time-consuming. One collection we ingested had 60+ CDs/DVDs that needed to be individually ingested.
  • Personnel consistency is key.
    • We had a sub-group of five Task Force members who could ingest records. We found that those who were able to spend more time working with the records didn’t have to spend as much time reviewing procedures. Those who spent more time also better understood common issues and how to work through them.
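
On the special-character problem mentioned above, one quick check is to scan file names before a transfer. The sketch below is an illustration rather than something from the report; it lists any file whose name contains characters outside the POSIX portable set, which are the usual culprits in failed transfers:

    #!/bin/bash
    # Sketch: list files whose names contain characters outside the
    # POSIX portable filename set (letters, digits, ".", "_", "-").
    # The target directory is an assumption; pass it as the first argument.
    TARGET="${1:-.}"

    find "$TARGET" -depth -print0 | while IFS= read -r -d '' path; do
        name="$(basename "$path")"
        if printf '%s' "$name" | LC_ALL=C grep -q '[^A-Za-z0-9._-]'; then
            printf 'non-portable name: %s\n' "$path"
        fi
    done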

We hope that readers find this final report useful: not only does it document the work of the Task Force, but it also shares many resources created throughout the year, including:

  • Description of our workstation, including hardware and software
  • Processing workflow instructions for ingesting materials
  • Draft Deed of Gift Addendum for electronic records
  • Donor guides (for sharing information with donors about electronic records)

The ERTF completed the tasks set out in its first charge, but we understand that working with electronic records is an ongoing process and that the work must continue.  To this end, we are in the process of drafting a second charge to address processing ingested records and providing access to processed records.

Thank you to the entire Task Force for its work: Lisa Calahan, Kevin Dyke, Lara Friedman-Shedlov, Lisa Johnston, Carol Kussmann (co-chair), Mary Miller, Erik Moore, Arvid Nelsen (co-chair), Jon Nichols, Justin Schell, Mike Sutliff, and sponsors John Butler and Kris Kiesling.

Carol Kussmann is a Digital Preservation Analyst with the University of Minnesota Libraries Data & Technology – Digital Preservation & Repository Technologies unit, and co-chair of the University of Minnesota Libraries – Electronic Records Task Force. Questions about the report can be sent to: lib-ertf@umn.edu.