The Best of BDAX: Five Themes from the 2016 Born Digital Archiving & eXchange

By Kate Tasker


Put 40 digital archivists, programmers, technologists, curators, scholars, and managers in a room together for three days, give them unlimited cups of tea and coffee, and get ready for some seriously productive discussions.

This magic happened at the Born Digital Archiving & eXchange (BDAX) unconference, held at Stanford University on July 18-20, 2016. I joined the other BDAX attendees to tackle the continuing challenges of acquiring, discovering, delivering and preserving born-digital materials.

The discussions highlighted five key themes for me:

1) Born-digital workflows are, generally, specific

We’re all coping with the general challenges of born-digital archiving, but we’re encountering individual collections that need to be addressed with local solutions and resources. BDAXers generously shared examples of use cases and successful workflows, and, although these guidelines couldn’t always translate across diverse institutions (big/small, private/public, IT help/no IT help), they’re a foundation for building best practices that can be adapted to specific needs.

2) We need tools

We need reliable tools that will persist over time to help us understand collections, to record consistent metadata and description, and to discover the characteristics of new content types. Project demos including ePADD, BitCurator Access, bwFLA – Emulation as a Service, UC Irvine’s Virtual Reading Room, the Game Metadata and Citation Project, and the University of Michigan’s ArchivesSpace-Archivematica-DSpace Integration project gave encouragement that tools are maturing and will enable us to work with more confidence and efficiency. (Thanks to all the presenters!)

3) Smart people are on this

A lot of people are doing a lot of work to guide and document efforts in born-digital archiving. We need to share these efforts widely, find common points of application, and build momentum – especially for proposed guidelines, best guesses, and continually changing procedures. (We’re laying this train track as we go, but everybody can get on board!) A brilliant resource from BDAX is a “Topical Brain Dump” Google doc where everyone can share tips related to what we each know about born-digital archives (hat-tip to Kari Smith for creating the doc, and to all BDAXers for their contributions).

4) Talking to each other helps!

Chatting with BDAX colleagues over coffee or lunch provided space to compare notes, seek advice, make connections, and find reassurance that we’re not alone in this difficult endeavor. Published literature is continually emerging on born-digital archiving topics (for example, born-digital description), but if we’re not quite ready to commit our own practices to paper (er, magnetic storage media), then informal conversations allow us to share ideas and experiences.

5) Born-digital archiving needs YOU

BDAX attendees brainstormed a wide range of topics for discussion, illustrating that born-digital archiving collides with traditional processes at all stages of stewardship, from appraisal to access. All of these functions need to be re-examined and potentially re-imagined. It’s a big job (*understatement*) but brings with it the opportunity to gather perspective and expertise from individuals across different roles. We need to make sure everyone is invited to this party.

How to Get Involved

So, what’s next? The BDAX organizers and attendees recognize that there are many, many more colleagues out there who need to be included in these conversations. Continuing efforts are coalescing around processing levels and metrics for born-digital collections; accurately measuring and recording extent statements for digital content; and managing security and storage needs for unprocessed digital accessions. Please, join in!

You can read extensive notes for each session in this shared Google Drive folder (yes, we did talk about how to archive Google docs!) or catch up on Tweets at #bdax2016.

To subscribe to the BDAX email listserv, please email Michael Olson (mgolson[at]stanford[dot]edu), or, to join the new BDAX Slack channel, email Shira Peltzman (speltzman[at]library[dot]ucla[dot]edu).


Kate Tasker works with born-digital collections and information management systems at The Bancroft Library, University of California, Berkeley. She has an MLIS from San Jose State University and is a member of the Academy of Certified Archivists. Kate attended Capture Lab in 2015 and is currently designing workflows to provide access to born-digital collections.

Mecha-Archivists Revisited: An Interview with Trevor Owens and Emily Reynolds

BloggERS! recently reached out to Trevor Owens and Emily Reynolds from IMLS to discuss the role of digital processing activities in archives in the context of the IMLS national digital platform funding priority.

This interview was conducted asynchronously in text.


BloggERS!: Tell us about the national digital platform funding priority. Why was it introduced, and what are its primary objectives and deliverables?

Trevor: The national digital platform is a framework for approaching all the digital tools, services, infrastructure, and human effort libraries and archives use to meet the needs of their users across the country. In this respect, the platform isn’t an individual thing. It isn’t a piece of software or a website. Instead, it’s the combination of software applications, social and technical infrastructure, and staff expertise that provide library and archive content and services to all users in the US. For more on the concept, see this article Maura Marx and I wrote for American Libraries magazine.

The National Digital Platform concept was developed and refined through two IMLS Focus convenings (reports from the 2014 and 2015 convenings are available online). Those convenings brought together a diverse array of experts and stakeholders from across library and archives contexts who urged IMLS (and other funders) to focus more on making investments in digital tools and services that could have catalytic national impacts. The results of those meetings have informed the development of a specific national digital platform portfolio in the National Leadership Grants for Libraries Program (NLG) and the Laura Bush 21st Century Librarian Program (LB21).


Trevor Owens, Senior Program Officer at IMLS, opening the “Defining and Funding the National Digital Platform” panel, April 2015

BloggERS!: What do you think have been the biggest impacts of the national digital platform so far, especially for archivists working with digital materials?

Trevor: It is still rather early to see the range of impacts and outcomes the first projects in this portfolio are having. The initial four projects funded from the first cycle of grants last year still haven’t been running for a year yet, the second cycle of grants from last year has only been going for about six months, and the first cycle of grants from this year has just been awarded. With that said, I would suggest that the national digital platform as a framework has already made a significant shift in terms of the projects we are funding. In comparison to many previous projects, these efforts have stronger and clearer plans and approaches to building communities of practice and coalitions to work together to tackle challenges. In that vein, I’m excited about the prospect of librarians and archivists increasingly seeing their work having direct local impacts in their institutions while also contributing to national and international networks of teams building, refining, documenting, and improving the field through their involvement in the tools that enable our work. If you take any of those individual projects, like ePADD or Hydra-in-a-Box, you already see a flurry of activity and engagement with a lot of different stakeholders.

Emily: In addition to the tools and services that we’ve funded through NLG, we’ve also seen several LB21 projects have an impact in terms of creating training opportunities for librarians and archivists working with digital materials. The LB21 program has a history of funding projects related to digital skills, even before we conceptualized it as part of the national digital platform. The National Digital Stewardship Residency (NDSR) program is one example; at this point 35 residents have participated in the program in DC, Boston, and New York, and we’ve funded several additional cohorts. Those projects have had real impacts on the careers of the residents themselves (full disclosure: I’m an alumna of the program), as well as the institutions they worked in.

BloggERS!: Where are the current gaps in our national digital capacity with regard to the processing of digital materials?

Emily: One of the interesting things about our jobs as Program Officers is that we aren’t necessarily the ones answering that question. We rely on applicants, peer reviewers, and professionals working in the field to let us know what the pain points are, and where additional capacity is needed. Conferences and blogs (like this one!) are incredibly useful for us to see what topics are of interest in the community, so that we can look at those needs in the context of IMLS’ broader strategic goals and grant programs.

To me, our capacity to manage born-digital and digitized audiovisual materials at scale seems like one of the most critical gaps. IMLS has funded a fair amount of work in this area, from oral history projects like OHMS and Oral History in the Digital Age, to computational approaches to providing access to audio from WGBH and Pop Up Archive, to tools like Avalon Media System. Even with all of this great work, it still remains challenging to adequately manage this complex content and provide meaningful access to it.

Trevor: Completely agree with Emily’s first point on this. The national digital platform is, in many ways, a challenge to the field and to applicants to make the case for what those gaps are and to establish and launch the coalitions that are going to fill them. With that noted, alongside Emily’s points about AV materials, I would also add that it seems like things are really beginning to converge and coalesce around the potential for open source tools to support emulation and virtualization as modes of access and preservation. I’ll talk a bit more about some of the projects pointing in this direction in a bit.

BloggERS!: How will the national digital platform help to support the wide variety of different tools and technologies used to acquire, process, and preserve digital materials in archives, libraries, and museums?

Emily: Overall, I think there are a few key themes in our approach to the national digital platform. We’re looking for tools and services that can be implemented and used by many different institutions, across the spectrum in terms of size and resources. I think your use of the phrase “wide variety” in the question also points to another important consideration for us: interoperability. So many institutions are using bespoke approaches to the same problems, with slight variations in tools and methods. Creating linkages between tools and building communities of practice will lower barriers to entry and raise baseline capacity.

Trevor: I would also add that, conceptually, all of those tools and technologies that libraries, archives, and museums are using with digital content are already part of the national digital platform. I’m not just trying to be cute or clever in saying that, either. When we step back and look at all of those tools and services that exist now, and the skills and knowledge it takes to make them work, you start to see all the places where we need to interject resources to improve them. So along with Emily’s great points on interoperability and connecting tools and services, I would again stress that a huge part of this is about skilling up the library and archives workforce to be able to use the range of tools, many of which only work at the command line, that we can piece together to do this work.

BloggERS!: Which current projects being funded through the national digital platform priority are you most excited about?

Trevor: I’m excited about all of them! Seriously, our review process is very competitive and anything that makes it through is really exciting work. With that said, there is a good bit of work that we aren’t able to fund that I would also be excited about. I am happy to share some examples of projects that are particularly relevant to archivists.

Several projects are already making important inroads in this area. I’ll mention a few, and then I am sure Emily has some in her portfolio that she can share. For each of these projects, anyone can read the proposal narrative online, so I will be brief and link out to where you can find the proposals.

The Software Preservation Network (LG-73-15-0133-15) is holding a national forum alongside this year’s SAA conference to work toward establishing a network of archives working together to develop a strategy for using historical software to provide access to and process digital archival material. In a related effort, through A Re-enactment Tool for Collections of Digital Artifacts (LG-70-16-0079-16), Rhizome, in partnership with Yale University and the University of Freiburg, is working to enhance a set of open source software tools connecting archives of digital artifacts and emulation frameworks. Together these projects are positioned both to help refine the toolset for this kind of work and to build and establish the community of practice and networks necessary to support archivists doing this work.

Through Systems Interoperability and Collaborative Development for Web Archiving (LG-71-15-0174-15), the Internet Archive, with the University of North Texas, Rutgers University, and Stanford University Library, is working to improve systems interoperability and to model enhanced access to, and research use of, web archives. This applied research project is well positioned to refine ways for institutions to interact with Archive-It. Given that over 400 institutions are using Archive-It, improvements to this system will be very useful to many institutions in the field.

The last project I’ll mention, Improving Access to Time-Based Media through Crowdsourcing and Machine Learning (LG-71-15-0208-15), is a neat example of how an applied research project can significantly impact the field. In this project, the archives at WGBH and the Pop Up Archive are exploring approaches for metadata creation by leveraging scalable computation and engaging the public to improve access through crowdsourcing games for time-based media. This includes an interesting mixture of speech-to-text and audio analysis tools and open source web-based tools to improve transcripts by engaging the public in a crowdsourced, participatory cataloging project. So the process has potential to inform future work, as well as test the tools that are created as part of the project. Lastly, by working with a massive archive of public broadcasting AV content, the project partners are also going to create and distribute a public dataset of audiovisual metadata for use by other projects.

Emily: Like Trevor said, it’s really hard for us to pick only a few projects to highlight, since we’re so excited about all of the work we’re funding. The development of ePADD (LG-70-15-0242-15) is one great example of IMLS-funded work that will help support digital processing activities. I’m happy to see that a few different people have mentioned it already in this blog series, because it really is an exciting advance in archivists’ capacity to manage email archives.

We also have funded a few interesting national forum projects recently. National forum grants support a meeting or series of meetings, bringing together stakeholders and experts in a topic. Those relationships and networks can persist long past the end of the grant. I’m really looking forward to seeing the results of On the Record, All the Time (RE-43-16-0053-16), a national forum grant to UCLA. They’re addressing the management of digital audiovisual evidence used by law enforcement, and I think the project has the potential to bring about some really interesting partnerships and relationships with other sectors. Like I mentioned earlier, audiovisual content is a huge challenge, and this is an interesting subset of it.

Another exciting national forum grant was awarded to the Amistad Research Center for a project called Diversifying the Digital Historical Record (LG-73-16-0003-16). This project will include a series of meetings with participants including community archives practitioners, scholars, community members, and digital collections experts. It’s an incredibly important issue and the range of partners on the project is amazing.

BloggERS!: Trevor, in a 2014 blog post, Mecha-Archivists: Envisioning the Role of Software in the Future of Archives, you highlighted the potential value of computational techniques, such as topic modeling and named entity recognition, to help “extend and amplify the seasoned judgement, ethics, wisdom, and expertise [of archivists]” to support making materials available to the user. What progress have you seen in this space since 2014, and how would you rate the development of and training around these tools as a priority within IMLS?

Trevor: I see a lot of the ideas I explored in that Radcliffe Workshop on Technology and Archival Processing as fitting very well with the idea of the National Digital Platform. The key concept in that talk is that we need to approach the work of cultural heritage institutions as complex systems which deploy enabling technologies that support, amplify, and extend the abilities of archivists, librarians, and curators to do their work. I realize that’s a mouthful. So I can talk through some examples.

All too often, I have seen folks approach some computational tool and say, “Oh, we could use this to automate classification or description” or a variety of other activities. This sets the bar way too high for the machines. It also is part of longstanding, problematic and flawed notions about expertise, efficiency and labor that devalue what it means to be a professional and an expert. The judgement of professionals and experts is really hard to beat, and it isn’t something we should be trying to beat. Instead of erasing or ignoring all of the accumulated wisdom and expertise of professional librarians, archivists and curators, we should be working to build from and amplify it.

In my mind, the solution is rather simple. Instead of trying to replace the work of experts, it is much better to think through how we can enable and extend that judgement through tools. The example I used in the Radcliffe talk involved topic modeling, but I think the same process can and should work for other natural language processing tools, like named entity extraction, or, for that matter, for tools and services that automatically derive data about audio or image files.
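One way to make that idea concrete is a suggest-and-review pattern: the tool proposes, the archivist disposes. Here is a minimal, hypothetical Python sketch of that pattern (the names, scores, and threshold are all invented for illustration, not drawn from any particular tool):

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    entity: str        # extracted string, e.g., "Rockefeller Foundation"
    entity_type: str   # e.g., "ORGANIZATION"
    confidence: float  # model-assigned score between 0.0 and 1.0
    source_doc: str    # file the entity was extracted from

def build_review_queue(extracted, threshold=0.75):
    """Turn raw extractor output into a queue for archivist review.

    Nothing is written into the description automatically; the archivist
    accepts, edits, or rejects each suggestion.
    """
    queue = [s for s in extracted if s.confidence >= threshold]
    return sorted(queue, key=lambda s: -s.confidence)

# Hypothetical extractor output:
extracted = [
    Suggestion("Rockefeller Foundation", "ORGANIZATION", 0.93, "box12/letter04.txt"),
    Suggestion("Ithaca", "LOCATION", 0.61, "box12/letter04.txt"),
]
for s in build_review_queue(extracted):
    print(f"[{s.confidence:.2f}] {s.entity_type}: {s.entity} ({s.source_doc})")
```

The point of the design is that the machine's output enters the workflow as evidence for professional judgement, never as finished description.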

I see all of this fitting into the national digital platform in a few clear ways. First off, the platform is defined not as a set of tools and services, but as the combined effects of those tools and services and the professionals that animate and operate them. In that vein, the platform is as much about empowering, training and supporting professionals to do the work as it is about giving them the tools to enable them to do the work.

BloggERS!: How can our readers learn more about the national digital platform?

Emily: There is a national digital platform page on the IMLS website, where you can see links to related blog posts, press releases, and publications. That page also links to information about the national digital platform convenings we held in 2014 and 2015. When we post Notices of Funding Opportunity for NLG and LB21, those documents will also have specific information about the funding categories and topics of interest for that specific program.

As part of ongoing efforts towards increased transparency, we’ve begun to publish several documents from successful grant applications online. Trevor and I recently did a series of blog posts highlighting recent awards; each of the projects mentioned in these posts has a link to view some of their proposal documents.

Of course, we also strongly encourage potential applicants and others interested in the national digital platform to contact us! We’re happy to talk through project ideas and provide any additional information about IMLS programs.


Emily Reynolds is a Program Officer in the Office of Library Services at the Institute of Museum and Library Services. She manages a portfolio of grants within IMLS’ national digital platform priority area, primarily focusing on projects related to the education and training of librarians and archivists. Prior to joining IMLS, Emily was a National Digital Stewardship Resident at the World Bank Group Archives. Emily has a master’s degree in information science from the University of Michigan School of Information, and was the recipient of a 2014 National Digital Stewardship Alliance Innovation Award.

Trevor Owens serves as the Senior Program Officer responsible for the development of the national digital platform portfolio for the Office of Library Services in the Institute of Museum and Library Services. He steers an overall strategy encompassing research, grant making, and policy agendas, as well as communications initiatives, in support of the development of national digital services and resources in libraries. From 2010-2015, Trevor served as a Digital Archivist with the National Digital Information Infrastructure and Preservation Program (NDIIPP) in the Office of Strategic Initiatives at the Library of Congress. Before that, he was the community manager for the Zotero project at the Center for History and New Media. Trevor has a doctorate in social science research methods and educational technology from the Graduate School of Education at George Mason University, a bachelor’s degree in the history of science from the University of Wisconsin, and a master’s degree in American history with an emphasis on digital history from George Mason University. He teaches graduate seminars on digital curation, digital preservation and digital history for the University of Maryland’s iSchool and American University’s history department. In 2014 the Society of American Archivists gave him the Archival Innovator Award, an award granted annually to recognize the archivist, repository, or organization that best exemplifies the “ability to think outside the professional norm.”

Using NLP to Support Dynamic Arrangement, Description, and Discovery of Born Digital Collections: The ArchExtract Experiment

By Mary W. Elings

This post is the eighth in our Spring 2016 series on processing digital materials.


Many of us working with archival materials are looking for tools and methods to support arrangement, description, and discovery of electronic records and born digital collections, as well as large bodies of digitized text. Natural Language Processing (NLP), which uses algorithms and mathematical models to process natural language, offers a variety of potential solutions to support this work. Several efforts have investigated using NLP solutions for analyzing archival materials, including TOME (Interactive TOpic Model and MEtadata Visualization), Ed Summers’ Fondz, and Thomas Padilla’s Woese Collection work, among others, though none have resulted in a major tool for broader use.

One of these projects, ArchExtract, was carried out at UC Berkeley’s Bancroft Library in 2014-2015. ArchExtract sought to apply several NLP tools and methods to large digital text collections and to build a web application that would package these largely command-line tools into an interface that archivists and researchers could easily use.

The ArchExtract project focused on facilitating analysis of the content and, via that analysis, discovery by researchers. The development work was done by an intern from the UC Berkeley School of Information, Janine Heiser, who built a web application that implements several NLP tools, including topic modeling, named entity recognition, and keyword extraction, to explore and present large, text-based digital collections.

The ArchExtract application extracts topics, named entities (people, places, subjects, dates, etc.), and keywords from a given collection. The application automates, implements, and extends various natural language processing software tools, such as MALLET and the Stanford Core NLP toolkit, and provides a graphical user interface designed for non-technical users.
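As a rough illustration of the topic-modeling step (ArchExtract itself wraps MALLET behind its interface, so this is not its actual code), a minimal sketch using the gensim library on toy data might look like this:

```python
from gensim import corpora, models

# Toy stand-ins for text extracted from a collection's documents.
docs = [
    "grant funding report foundation budget fiscal",
    "travel letter family europe voyage steamship",
    "budget allocation fiscal grant report foundation",
]
texts = [d.split() for d in docs]

# Build a vocabulary and bag-of-words corpus, then fit a small LDA model.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Each topic is a weighted word list an archivist can inspect and label.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

On a real collection the documents number in the thousands and the topic count becomes a tuning knob, which is exactly the kind of setting ArchExtract exposes in its interface.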


ArchExtract Interface Showing Topic Model Results. Elings/Heiser, 2015.

In testing the application, we found the automated text analysis tools in ArchExtract were successful in identifying major topics, as well as names, dates, and places found in the text, and their frequency, thereby giving archivists an understanding of the scope and content of a collection as part of the arrangement and description process. We called this process “dynamic arrangement and description,” as materials can be re-arranged using different text processing settings so that archivists can look critically at the collection without changing the physical or virtual arrangement.

The topic models, in particular, surfaced documents that may have been related to a topic but did not contain a specific keyword or entity. The process was akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world, wherein you might find what you seek without knowing it was there. And while topic modeling has been criticized for being inexact, it can be “immensely powerful for browsing and isolating results in thousands or millions of uncatalogued texts” (Schmidt, 2012). This, combined with the named entity and keyword extraction, can give archivists and researchers important data that could be used in describing and discovering material.

ArchExtract Interface Showing Named Entity Recognition Results. Elings/Heiser, 2015.

As a demonstration project, ArchExtract was successful in achieving our goals. The code developed is documented and freely available on GitHub to anyone interested in how it was done or who might wish to take it further. We are very excited by the potential of these tools in dynamically arranging and describing large, text-based digital collections, but even more so by their application in discovery. We are particularly pleased that broad, open source projects like BitCurator and ePADD are taking this work forward and will be bringing NLP tools into environments that we can all take advantage of in processing and providing access to our born digital materials.


Mary W. Elings is the Principal Archivist for Digital Collections and Head of the Digital Collections Unit of The Bancroft Library at the University of California, Berkeley. She is responsible for all aspects of the digital collections, including managing digital curation activities, the born digital archives program, web archiving, digital processing, mass digitization, finding aid publication and maintenance, metadata, archival information management and digital asset management, and digital initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings co-authored the article “Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums,” and wrote a primer on linked data for LAMs. She has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York (2003-2009) and School of Library and Information Science, Catholic University, Washington, DC (2010-2014), and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program (2010-present).

Let the Entities Describe Themselves

By Josh Schneider and Peter Chan

This is the fifth post in our Spring 2016 series on processing digital materials.


Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals in arranging and describing materials: screening for confidential, restricted, or legally protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery and access to the material’s intellectual content.

More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer-assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic Toolkit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects (Overview) outside of the archival community are also exploring similar technologies to make sense of large corpuses of text.
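To give a sense of what regular expression screening involves (the platforms named above implement far more robust, validated versions), a minimal Python sketch might look like the following; the patterns and directory layout are purely illustrative:

```python
import re
from pathlib import Path

# Illustrative patterns only; production screening tools use larger,
# validated pattern sets and check digit validation.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def screen_file(path):
    """Return (pattern_name, match, offset) hits for one text file."""
    hits = []
    text = Path(path).read_text(errors="ignore")
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((name, m.group(), m.start()))
    return hits

if __name__ == "__main__":
    for f in Path("extracted_files").rglob("*.txt"):
        for name, match, offset in screen_file(f):
            print(f"{f}\t{name}\t{match!r}\toffset {offset}")
```

The output is a hit list for an archivist to review, since pattern matches inevitably include false positives.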

From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.


Currently, however, there are no open source NER tools broadly tuned to the variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, trained to identify probable matches as well, also recognizes other entity types, such as events and diseases; the latter can assist with screening for protected health information.
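For readers unfamiliar with what NER output looks like, here is a minimal sketch using spaCy as a convenient open source stand-in (ePADD uses its own custom recognizer, and StanfordNER and OpenNLP are the tools named above):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("The letter from the Bancroft Library arrived in Berkeley "
        "on March 4, 1972, care of the University of California.")

# Off-the-shelf models extract persons, organizations, locations, dates, etc.;
# finer-grained types (libraries vs. museums, diseases) require custom training,
# which is the gap described above.
for ent in nlp(text).ents:
    print(ent.label_, "->", ent.text)
```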

What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.

Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
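As a sketch of what publishing extracted entities as linked open data could look like, here is a minimal example using rdflib; the namespace and the VIAF identifier are placeholders, and a real workflow would first reconcile entities against authority files:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF, OWL

# Placeholder namespace; a real implementation would mint stable URIs.
ENT = Namespace("https://example.org/entities/")

g = Graph()
person = ENT["doe_jane"]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Doe, Jane")))
# Link the disambiguated entity to an authority record (placeholder VIAF ID).
g.add((person, OWL.sameAs, URIRef("http://viaf.org/viaf/0000000")))

print(g.serialize(format="turtle"))
```

Publishing triples like these, rather than full text, is one way to surface the intellectual content of a collection while privacy and copyright questions are still being worked out.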

It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.


Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.

Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.


Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation for that content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the PDF finding aids from post-1960 with terms like “digital,” “electronic,” and “disk”—you get the idea. Once we knew how many items we were dealing with and where we could find them, we could begin.
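A batch search along those lines is easy to script; here is a rough Python sketch using Poppler's pdftotext (the keyword list and paths are examples, not the exact terms or tooling used in the project):

```python
import subprocess
from pathlib import Path

KEYWORDS = ["digital", "electronic", "disk", "diskette", "cd-rom", "floppy"]

def search_finding_aid(pdf_path):
    """Extract text with pdftotext (Poppler) and return matching keywords."""
    text = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],   # "-" sends extracted text to stdout
        capture_output=True, text=True, check=True,
    ).stdout.lower()
    return [kw for kw in KEYWORDS if kw in text]

if __name__ == "__main__":
    for pdf in Path("finding_aids").glob("*.pdf"):
        hits = search_finding_aid(pdf)
        if hits:
            print(f"{pdf.name}: {', '.join(hits)}")
```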

Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.


It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually – and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents, and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:
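The script itself is bash, but the routing logic is simple enough to sketch in Python for illustration (the mount step is stubbed out, since it depends on local platform and tooling, and the disktype output matching is simplified):

```python
import subprocess
from pathlib import Path

def identify_filesystem(image):
    """Run disktype and return a crude filesystem label."""
    out = subprocess.run(["disktype", str(image)],
                         capture_output=True, text=True).stdout
    if "HFS" in out:
        return "HFS"
    if "FAT" in out:
        return "FAT"
    return "UNKNOWN"

def mount_hfs(image):
    """Hypothetical helper: mount an HFS image read-only, return mount point."""
    raise NotImplementedError("depends on local platform and tooling")

def process_image(image, outdir):
    fs = identify_filesystem(image)
    if fs == "FAT":
        # fiwalk can walk FAT images directly and emit DFXML (-X).
        subprocess.run(["fiwalk", "-X", str(outdir / f"{image.stem}.xml"), str(image)])
    elif fs == "HFS":
        # fiwalk can't walk HFS; mount the image and have md5deep
        # generate DFXML (-d) over the mounted tree instead.
        mount_point = mount_hfs(image)
        with open(outdir / f"{image.stem}.xml", "w") as f:
            subprocess.run(["md5deep", "-r", "-d", str(mount_point)], stdout=f)
    else:
        print(f"{image}: unrecognized filesystem, handle manually")
```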


I could now turn this (collection of image files and corresponding imaging logs):


into this (collection of image files, logs, DFXML files, and CSV manifests):


Or, to put it another way, I could now take one of these:


And rapidly turn it into this ready-to-be-bagged package:


Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (i.e., that won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this will slow it down significantly and make quick passes through a collection of images impossible (critical during the early testing phases of this pilot). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images, and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
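As one possible shape for that overnight, pre-bagging pass, here is a minimal sketch using ClamAV's clamscan, which only reads its targets and so leaves the images unmodified (paths are illustrative):

```python
import subprocess

def scan_images(image_dir, log_path="clamscan.log"):
    """Recursively scan a directory of disk images with ClamAV.

    clamscan exit codes: 0 = clean, 1 = virus(es) found, 2 = errors.
    """
    result = subprocess.run(
        ["clamscan", "-r", "--infected", f"--log={log_path}", image_dir]
    )
    if result.returncode == 1:
        print(f"Infections found; see {log_path} and flag those images.")
    elif result.returncode == 2:
        print("clamscan reported errors; check the log.")
    else:
        print("All images scanned clean.")

scan_images("disk_images")
```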

Adding More Filesystem Recognition
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
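One cheap automated check along these lines is to confirm that each DFXML manifest was created and actually lists file objects. A minimal sketch (the output directory layout is assumed):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def fileobject_count(dfxml_path):
    """Count <fileobject> elements in a DFXML file, ignoring namespaces."""
    root = ET.parse(dfxml_path).getroot()
    return sum(1 for el in root.iter() if el.tag.split("}")[-1] == "fileobject")

for dfxml in Path("output").glob("*.xml"):
    if fileobject_count(dfxml) == 0:
        print(f"FLAG {dfxml.name}: no file objects found; verify the capture")
```

A check like this doesn't prove a capture succeeded, but it cheaply surfaces the images that deserve a human's closer look.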

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.


Jess Whyte is a Masters Student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.  



Digital Processing at the Rockefeller Archive Center

By Bonnie Gordon

This is the first post in our Spring 2016 series on processing digital materials, exploring how archivists conceive of, implement, and track activities to arrange and describe digital materials in archival collections. If you are interested in contributing to bloggERS!, check out our guidelines for writers or contact us at

At the Rockefeller Archive Center, we’re working to get “digital processing” out of the hands of “digital” archivists and into the realm of “regular” archivists. We are using “digital processing” to mean description, arrangement, and initial preservation of born digital archival content stored on removable storage media. Our definition will likely expand over time, as we start to receive more born digital materials via network transfer and fewer acquisitions of floppy disks and CDs.

The vast majority of our born digital materials are on removable storage media and currently inaccessible to our researchers, donors, and staff. We have content on over 3,000 digital storage media items, which are rapidly deteriorating. Our backlog of digital media items includes over 2,500 optical disks, almost 200 3.5″ floppy disks, and almost 100 5.25″ floppy disks. There are also a handful of USB flash drives, hard drives, and older and unusual media (Bernoulli disks, Sy-Quest cartridges, 8″ floppy disks). This is a lot of work for one digital archivist! Having multiple “regular” archivists process these materials distributes the work, which means we can get through the backlog much more quickly. Additionally, integrating digital processing into regular processing work will prevent a future backlog from being created.

In order to help our processing archivists establish and enhance intellectual control of our born digital holdings, I’m working to provide them with the tools, workflows, and competencies needed to process digital materials.  Over the next several months, a core group of processing archivists will be trained and provided with documentation on digital media inventorying, digital forensics, and other born digital workflows. After training, archivists will be able to use the skills they gained in their “normal” processing projects. The core group of archivists trained on dealing with born digital materials will then be able to train other archivists. This will help digital processing be perceived as just another aspect of “regular” processing. Additionally, providing good workflow documentation gives our processing archivists the tools and competencies to do their jobs.

Streamlining our digital processing workflows is also a really important part of this. One step in this direction is to create a digital media inventory and disk imaging log that will be able to “talk” to our collections management system (ArchivesSpace). We currently have an inventory and imaging log, but they’re in a Microsoft Access database, which has a number of limitations, one of the primary ones being that it can’t integrate with our other systems. Integrating with ArchivesSpace reduces duplicate data entry and inconsistent data, and further embeds digital processing into our “regular” processing work.
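One plausible shape for that integration, sketched against the standard ArchivesSpace backend API (the host, credentials, repository ID, and record fields here are placeholders; the RAC's actual implementation may differ):

```python
import requests

ASPACE = "http://localhost:8089"  # backend API endpoint (placeholder)

def get_session(user, password):
    """Authenticate and return an ArchivesSpace session token."""
    r = requests.post(f"{ASPACE}/users/{user}/login", params={"password": password})
    r.raise_for_status()
    return r.json()["session"]

def create_digital_object(session, repo_id, title, identifier):
    """Push one inventoried media item into ArchivesSpace as a digital object."""
    record = {
        "jsonmodel_type": "digital_object",
        "title": title,
        "digital_object_id": identifier,
    }
    r = requests.post(
        f"{ASPACE}/repositories/{repo_id}/digital_objects",
        headers={"X-ArchivesSpace-Session": session},
        json=record,
    )
    r.raise_for_status()
    return r.json()

session = get_session("admin", "admin")  # placeholder credentials
create_digital_object(session, 2, "3.5-inch floppy disk 001", "rac_media_0001")
```

Writing inventory rows through the API, rather than keeping them in a standalone database, is what eliminates the duplicate data entry described above.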

The RAC’s processing archivists establish and enhance intellectual and physical control of our archival holdings, regardless of format, in order to facilitate user access. By fully integrating digital processing into “normal” processing activities, we will be able to preserve and provide access to unique born digital content stored on obsolete and decaying media.

Bonnie Gordon is an Assistant Digital Archivist at the Rockefeller Archive Center, where she works primarily with born digital materials and digital preservation workflows. She received her M.A. in Archives and Public History, with a concentration in Archives, from New York University.

Request for contributors to a new series on bloggERS!

The editors at bloggERS! HQ are looking for authors to write for a new series of posts, and we’d like to hear from YOU.

The topic of the next series on the Electronic Records Section blog is processing digital materials: what it is, how practitioners are doing it, and how they are measuring their work.

How are you processing digital materials? And how do you define “digital processing,” anyway?

The what and how of digital processing depend upon a variety of factors: available resources and technical expertise; the tools, systems, and infrastructure that are particular to an organization; and the nature of the digital materials themselves.

  • What tools are you using, and how do they integrate with your physical arrangement and description practices?
  • Are you leveraging automation, topic modeling, text analysis, named entity recognition, or other technologies in your processing workflows?
  • How are you working with different types of digital content, such as email, websites, documents, and digital images?
  • What are the biggest challenges that you have encountered? What is your biggest recent digital processing success? What would you like to be able to do, and what are your blockers?

If you have answers to any of these questions, or you are thinking of other questions we haven’t asked here, then consider writing a post to share your experiences (good or bad) processing digital materials.

Quantifying and tracking digital processing activities

Many organizations maintain processing metrics, such as hours per linear foot. In processing digital materials, the level of effort may be more dependent upon the type and format of the materials than their extent.

  • What metrics make sense for quantifying digital processing activities?  
  • How does your organization track the pace and efficiency of digital processing activities?
  • Have you explored any alternative ways of documenting digital processing activity?

If you have been working to answer any of these questions for yourself or your institution, we’d like to hear from you!

Writing for bloggERS!

  • Posts should be between 200 and 600 words in length
  • Posts can take many forms: instructional guides, in-depth tool explorations, surveys, dialogues, and point-counterpoint debates are all welcome!
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in early April, so let us know ASAP if you are interested in contributing by sending an email to!

Digital Preservation System Integration at the University of Michigan’s Bentley Historical Library

By Michael Shallcross

At SAA 2015, Courtney Mumma (formerly of Artefactual Systems) and I participated in a panel discussion at the Electronic Records Section meeting on “implementing digital preservation tools and systems,” with a focus on “the lessons learned through the planning, development, testing, and production of digital preservation applications.”  

The University of Michigan’s Bentley Historical Library is in the midst of a two-year project (2014-2016) funded by the Andrew W. Mellon Foundation to integrate ArchivesSpace, Archivematica, and DSpace in an end-to-end digital archives workflow (for more information on the project itself, see our blog).

Artefactual Systems is responsible for the development work on the project, which has involved adding new functionality to the Archivematica digital preservation system to permit the appraisal and arrangement of digital archives, as well as integrating ArchivesSpace functionality within Archivematica so that users can create and edit archival description and associate digital objects with that information.
