A Case Study in Failure (and Triumph!) from the Records Management Perspective

By Sarah Dushkin

____

This is the sixth post in the bloggERS series #digitalarchivesfail: A Celebration of Failure.

I’m the Records Coordinator for a global energy engineering, procurement, and construction contractor, herein referred to as the “Company.” The Company provides design, fabrication, installation, and commissioning of upstream and downstream technologies for operators. I manage the program for the hard copy and electronic records produced by our Houston office.

A few years ago, our Records Management team was asked by the IT department to help create a process for archiving the digital records of closed projects created out of the Houston office. I saw the effort as an opportunity to expand the scope and authority of our records program to include digital records. Until then, our practice covered only paper records, and we asked employees to apply the paper-record policies to their own electronic records.

The Records Management team’s role was limited to advising IT on how to deploy a software tool where files could be stored for the long term. We were not included in the discussions about which software tool to use. It took us over a year to develop the new process with IT and standardize it in a published procedure. We had many areas of triumph and failure along the way. Here is a synopsis of the project.

Objective:
IT was told that retaining closed-project files on the local server was an unnecessary cost and was tasked with removing them. IT reached out to Records Management to develop a process for maintaining the project files for the long term in a more cost-effective nearline or offline solution where records management policies could be applied.

Vault:
The software chosen was a proprietary cloud-based file storage center or “vault.” It has search, tagging, and records disposition capabilities. It is more cost-effective than storing files on the local server.

Process:
At 80% project completion, Records Management reaches out to active projects to learn how they store their files and what the project completion schedule is. Eighty percent engineering completion is an important milestone because most of the project team is still involved and the bulk of the work is complete. Knowing the project schedule also lets us accurately apply the two-year time span before the files are migrated off the local server and into the vault. The two-year span was chosen to ensure that all project files remain available to the project team during the typical warranty period. Two years after a project is closed, all technical files and data are exported from the current management system and ingested into the vault, and access groups are created so employees can view and download the files for reference as needed.

Deployment:
Last year, we began to apply the process to large active projects that had passed 80% engineering completion. Large projects are those with more than 5 million in revenue.

Observations:
Recently we have begun to audit the whole project with IT, and are just now identifying our areas of failure and triumph. We will conduct an analysis of these areas and assess where we can make improvements.

Our big areas of failure were related to stakeholder involvement in the development, deployment, and utilization of the vault.

Stakeholders, including the Records Management team, were not involved in the selection or development of the vault software tool. As a result, the vault development project lacked the resources required to make it as successful as possible.

In deploying the vault, we did not create an outreach campaign with training courses to introduce the tool across our very large company. As a result, many employees are still unaware of the vault. When we talk with departments and projects about ways to store old files for less money, they are reluctant to try the solution because it seems like another way for IT to cut costs from their budgets without considering the greater needs of the company. IT is still viewed as a support function that is inessential to the Company’s philosophy.

Lastly, we did not have methods to export project files from all systems for ingest into the vault, nor did we, in North America, have the authority to develop that solution; that type of decision and process can only be developed by our corporate office in another country. The Company also does not make information about project closure available to most employees. A project end date can be determined by several factors, including when the final invoice was received or when the warranty period ended. This type of information is essential to the information lifecycle of a project, and because we had no involvement from upper-level management, we were not able to devise a solution for easily discovering it.

We had some triumphs throughout the process, though. Our biggest triumph is that this project gave Records Management an opportunity to showcase our knowledge of records retention and its value as a way to save money and maintain business continuity. We were able to collaborate with IT and promulgate a process, and the effort gave us a great opportunity to grow by building better relationships with the business lines. Although some departments and teams are still skeptical about the value of the vault, when we advertise it to other project teams, they see it as evidence that the Company cares about preserving their work. We earned our seat at the table with these players, but we still have to work on winning over more projects and departments. We’ve also preserved more than 30 TB of records and saved the Company several thousand dollars by ingesting inactive project files into the vault.

I am optimistic that when we have support from upper management, we will be able to improve the vault process and infrastructure, and create an effective solution for utilizing records management policies to ensure legal compliance, maintain business continuity, and save money.

____

Sarah Dushkin earned her MSIS from the University of Texas at Austin School of Information with a focus in Archival Enterprise and Records Management. Afterwards, she sought to diversify her expertise by working outside of the traditional archival setting and moved to Houston to work in oil and gas. She has collaborated with management from across her company to refine their records management program and develop a process that includes the retention of electronic records and data. She lives in Sugar Land, Texas with her husband.

bloggERS! has gone fishin’

We’re off to SAA! Will you be there too? Check out our list of ERS-recommended sessions on Sched.

If you can’t make it this year, then follow along on Twitter with #SAA16!

People fishing on Green Lake, circa 1950s. Item 31415, Ben Evans Recreation Program Collection (Record Series 5801-02), Seattle Municipal Archives

 

We’ll be back soon with recaps from recent conferences and plenty of other good stuff.

 

Building a “Computational Archival Science” Community

By Richard Marciano

———

When the bloggERS! series started at the beginning of 2015, some of the very first posts featured work on “computer generated archival description” and “big data and big challenges for archives,” so it seems appropriate to revisit this theme of automation and management of records at scale and provide an update on a recent symposium and several upcoming events.

Richard Marciano co-hosted the recent “Archival Records in the Age of Big Data” symposium; for more information, visit http://dcicblog.umd.edu/cas/. The three-day program is listed online with links to all the videos and slides, and a list of participants can be found at http://dcicblog.umd.edu/cas/attendees. The objectives of the symposium were to:

  • address the challenges of big data for digital curation,
  • explore the conjunction of emerging digital methods and technologies,
  • identify and evaluate current trends,
  • determine possible research agendas, and
  • establish a community of practice.

Richard Marciano and Bill Underwood will explore these themes further at SAA in Atlanta (session 311, Friday, August 5, 9:30–10:45 am) for those ERS aficionados interested in contributing to this emerging conversation. See: https://archives2016.sched.org/event/7f9D/311-archival-records-in-the-age-of-big-data

On April 26-28, 2016, the Digital Curation Innovation Center (DCIC) at the University of Maryland’s College of Information Studies (iSchool) convened a symposium in collaboration with King’s College London. This invitation-only symposium, entitled Finding New Knowledge: Archival Records in the Age of Big Data, featured 52 participants from the UK, Canada, South Africa, and the U.S. Among the participants were researchers, students, and representatives from federal agencies, cultural institutions, and consortia.

This group of experts gathered at Maryland’s iSchool to discuss and try to define computational archival science (CAS): an interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity, and precision in support of appraisal, arrangement and description, preservation, and access decisions, and engaging and undertaking research with archival material.

This event, co-sponsored by Richard Marciano, Mark Hedges from King’s College London, and Michael Kurtz from UMD’s iSchool, brought together thought leaders in this emerging CAS field: Maria Esteva from the Texas Advanced Computing Center (TACC), Victoria Lemieux from the University of British Columbia School of Library, Archival and Information Studies (SLAIS), and Bill Underwood from the Georgia Tech Research Institute (GTRI). There is growing interest in large-scale management, automation, and analysis of archival content, and in the enhanced possibilities for scholarship that come from integrating ‘computational thinking’ and ‘archival thinking.’

To capitalize on the April symposium, a follow-up workshop entitled Computational Archival Science: Digital Records in the Age of Big Data will take place in Washington, D.C., during the second week of December 2016 at the 2016 IEEE International Conference on Big Data. For information on the upcoming workshop, please visit: http://dcicblog.umd.edu/cas/ieee_big_data_2016_cas-workshop/. Paper contributions will be accepted until October 3, 2016.

———

Richard is a professor at Maryland’s iSchool and director of the Digital Curation Innovation Center (DCIC). His research interests include digital preservation, archives and records management, computational archival science, and big data. He holds degrees in Avionics and Electrical Engineering and a Master’s and Ph.D. in Computer Science from the University of Iowa, and he completed a postdoc in Computational Geography.

Annual meeting session recommendations, courtesy of ERS

Having trouble deciding between two tantalizing-looking sessions at the Society of American Archivists annual meeting this year? Looking for some recommendations that might tip the scales? Look no further!

The Electronic Records Section has produced a schedule for this year’s conference through its online scheduling tool, Sched. Now you can see the sessions that may be of interest to ERS members in one place.

The Electronic Records Section mega-schedule is available here.

See something we may have missed? Comment below or email bloggERS! at ers.mailer.blog@gmail.com!

 

The ERS annual meeting approaches!

The Electronic Records Section will meet at the SAA annual meeting in Atlanta on Thursday, August 4, 3:30 pm – 5:00 pm. Following a short business meeting, the session will feature a presentation from Mike Strom, Wyoming State Archivist, who will provide an update on the Council of State Archivists’ State Electronic Records Initiative (SERI) and the PERTTS (Program for Electronic Records Training, Tools and Standards) Portal.

Following the business meeting and presentation, the Electronic Records Section will break into an interactive unconference-style small group discussion.

This is where we need your help! Do you have an intractable electronic records problem you would like to discuss? Or a hot new topic in digital preservation that you’re excited to share with like-minded archivists and electronic records professionals? Add your ideas to the list of discussion topics for the unconference!

Session topics are being collected here: http://bit.ly/SAA16-ERS-Unconf-Ideas

Submissions will be considered until the day of the section meeting, when participants will select discussion topics.

Have any questions? Email ERS Chair Dan Noonan at noonan[dot]37[at]osu[dot]edu.

 

Using NLP to Support Dynamic Arrangement, Description, and Discovery of Born Digital Collections: The ArchExtract Experiment

By Mary W. Elings

This post is the eighth in our Spring 2016 series on processing digital materials.

———

Many of us working with archival materials are looking for tools and methods to support arrangement, description, and discovery of electronic records and born digital collections, as well as large bodies of digitized text. Natural Language Processing (NLP), which uses algorithms and mathematical models to process natural language, offers a variety of potential solutions to support this work. Several efforts have investigated using NLP solutions for analyzing archival materials, including TOME (Interactive TOpic Model and MEtadata Visualization), Ed Summers’ Fondz, and Thomas Padilla’s Woese Collection work, among others, though none have resulted in a major tool for broader use.

One of these projects, ArchExtract, was carried out at UC Berkeley’s Bancroft Library in 2014-2015. ArchExtract sought to apply several NLP tools and methods to large digital text collections and to build a web application that packages these largely command-line NLP tools behind an interface that archivists and researchers can use easily.

The ArchExtract project focused on facilitating analysis of the content and, via that analysis, discovery by researchers. The development work was done by Janine Heiser, an intern from the UC Berkeley School of Information, who built a web application that implements several NLP tools, including topic modelling, named entity recognition, and keyword extraction, to explore and present large, text-based digital collections.

The ArchExtract application extracts topics, named entities (people, places, subjects, dates, etc.), and keywords from a given collection. The application automates, implements, and extends various natural language processing software tools, such as MALLET and the Stanford Core NLP toolkit, and provides a graphical user interface designed for non-technical users.
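This is not the ArchExtract code itself, but a minimal sketch of the kind of pipeline it wraps, using scikit-learn as a stand-in for MALLET; the collection path and the number of topics are illustrative only.

```python
# Rough sketch of a topic-modelling and keyword pipeline similar in spirit to
# what ArchExtract wraps; scikit-learn stands in for MALLET here.
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [p.read_text(errors="ignore") for p in Path("collection_texts").glob("*.txt")]

# Term counts also double as crude keyword extraction (frequent terms per document).
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
term_matrix = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Fit a topic model and print the top terms for each topic.
lda = LatentDirichletAllocation(n_components=15, random_state=0)
lda.fit(term_matrix)
for topic_id, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_id}: {', '.join(top_terms)}")
```

ArchExtract’s contribution is essentially to package steps like these, plus named entity recognition, behind a web interface so that no command line is needed.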

 

ArchExtract Interface Showing Topic Model Results. Elings/Heiser, 2015.

In testing the application, we found the automated text analysis tools in ArchExtract were successful in identifying major topics, as well as names, dates, and places found in the text, and their frequency, thereby giving archivists an understanding of the scope and content of a collection as part of the arrangement and description process. We called this process “dynamic arrangement and description,” as materials can be re-arranged using different text processing settings so that archivists can look critically at the collection without changing the physical or virtual arrangement.

The topic models, in particular, surfaced documents that may have been related to a topic but did not contain a specific keyword or entity. The process was akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world, wherein you might find what you seek without knowing it was there. And while topic modelling has been criticized for being inexact, it can be “immensely powerful for browsing and isolating results in thousands or millions of uncatalogued texts” (Schmidt, 2012). This, combined with the named entity and keyword extraction, can give archivists and researchers important data that could be used in describing and discovering material.

ArchExtract Interface Showing Named Entity Recognition Results. Elings/Heiser, 2015.

As a demonstration project, ArchExtract was successful in achieving our goals. The code developed is documented and freely available on GitHub to anyone interested in how it was done or who might wish to take it further. We are very excited by the potential of these tools in dynamically arranging and describing large, text-based digital collections, but even more so by their application in discovery. We are particularly pleased that broad, open source projects like BitCurator and ePADD are taking this work forward and will be bringing NLP tools into environments that we can all take advantage of in processing and providing access to our born digital materials.

———

Mary W. Elings is the Principal Archivist for Digital Collections and Head of the Digital Collections Unit of The Bancroft Library at the University of California, Berkeley. She is responsible for all aspects of the digital collections, including managing digital curation activities, the born digital archives program, web archiving, digital processing, mass digitization, finding aid publication and maintenance, metadata, archival information management and digital asset management, and digital initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings co-authored the article “Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums,” and wrote a primer on linked data for LAMs. She has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York (2003-2009) and School of Library and Information Science, Catholic University, Washington, DC (2010-2014), and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program (2010-present).

Let the Entities Describe Themselves

By Josh Schneider and Peter Chan

This is the fifth post in our Spring 2016 series on processing digital materials.

———

Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals in arranging and describing them: screening for confidential, restricted, or legally protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery of and access to the material’s intellectual content.

More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer-assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic Toolkit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects outside of the archival community (Overview) are also exploring similar technologies to make sense of large corpora of text.
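To make the regular expression approach concrete, here is a minimal Python sketch that flags strings shaped like U.S. Social Security numbers; it is not drawn from any of the tools named above, and a production screen would use a much broader, institution-specific set of patterns.

```python
import re
from pathlib import Path

# Pattern for strings shaped like U.S. Social Security numbers (###-##-####).
# Real screening tools combine many such patterns (credit cards, account numbers, etc.).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flag_possible_ssns(directory: str) -> None:
    """Print every file and line where an SSN-shaped string appears."""
    for path in Path(directory).rglob("*.txt"):
        for line_no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if SSN_PATTERN.search(line):
                print(f"{path}:{line_no}: possible SSN, review before release")

flag_possible_ssns("extracted_files")  # placeholder directory of extracted text
```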

From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.


Currently, however, there are no open source NER tools broadly tuned to the diverse variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as Stanford NER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, trained to also identify probable matches, recognizes other entity types as well, such as events and diseases; the latter can assist with screening for protected health information.
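As a small illustration of what entity extraction returns, the sketch below uses spaCy, a general-purpose off-the-shelf library, rather than ePADD’s custom NER; the model name and sample sentence are placeholders.

```python
# Sketch of off-the-shelf named entity recognition with spaCy (not ePADD's custom NER).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane Doe wrote to the Bancroft Library in Berkeley on 12 March 1987.")

# Each entity comes back with a coarse type label (PERSON, ORG, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```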

What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.

Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
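As a sketch of what publishing extracted entities as linked open data could look like, the snippet below uses rdflib to serialize a single hypothetical, already-disambiguated person entity; the URIs, the schema.org property choices, and the authority link are placeholders rather than a recommended model.

```python
# Sketch: expose one extracted (and already disambiguated) entity as RDF.
# All URIs below are placeholders. Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)

person = URIRef("http://example.org/entities/jane-doe")        # local entity URI
authority = URIRef("http://example.org/authority/REPLACE-ME")  # e.g., a VIAF/LCNAF URI

g.add((person, RDF.type, SCHEMA.Person))
g.add((person, RDFS.label, Literal("Jane Doe")))
g.add((person, SCHEMA.sameAs, authority))  # link to the chosen authority heading

print(g.serialize(format="turtle"))
```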

It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.

———

Peter Chan is Digital Archivist in the Department of Special Collections and University Archives at Stanford University, is a member of GAMECIP, and is Project Manager for ePADD.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University, and is Community Manager for ePADD.

Recent Changes in How Stanford University Libraries is Documenting Born-Digital Processing

By Michael G. Olson

This post is the third in our Spring 2016 series on processing digital materials.

———

Stanford University Libraries is in the process of changing how it documents its digital processing activities and records lab statistics. This is the third iteration in six years of how we track our born-digital work, and it is a collaborative effort between Digital Library Systems and Services, our Digital Archivist Peter Chan, and Glynn Edwards, who manages our Born-Digital Program and directs the ePADD project.

Initially we documented our statistics using a library-hosted FileMaker Pro database. In this first iteration we focused on tracking media counts and media failure rates. After a single year of using the database, we decided that we needed to modify the data structure and the data-entry templates significantly; our staff found the database too time-consuming and cumbersome to modify.

We decided to simplify, replacing the database with a spreadsheet stored with our collection data. Our digital archivist and hourly lab employees were responsible for updating this spreadsheet when they finished working with a collection. This simple solution was easy to edit and update, and it worked well for four years, until we realized we needed more data for our fiscal year-end reports. As our born-digital program has grown and matured, we discovered we were missing key data points that documented important processing decisions in our workflows. It was time to improve, once again, how we documented our work.

Stanford Statistics Spreadsheet version 2

For the brand-new version of our work tracking, we have decided to continue using a spreadsheet but have migrated our data to Google Drive to better facilitate updates and versioning of our documentation. New data points have been included to better track specific types of born-digital content, such as email. This new version also allows us to better document the processing lifecycle of our born-digital collections. To do this, we have created the following additional data points:

  • Number of email messages
  • Email in ePADD.stanford.edu
  • File count in media cart
  • File size on media cart (GB)
  • SearchWorks (materials discoverable / available in library catalog)
  • SpotLight Exhibit (a virtual exhibit)

Stanford Statistics Spreadsheet version 3
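Purely as a hypothetical illustration (this is not Stanford’s actual template), a single tracking row built around the new data points listed above might be written out to a spreadsheet-friendly CSV like this:

```python
# Hypothetical sketch of one tracking row using the new data points;
# column names and values are illustrative, not Stanford's actual headers.
import csv

FIELDS = [
    "collection_id",
    "email_message_count",
    "email_in_epadd",              # ePADD.stanford.edu
    "file_count_on_media_cart",
    "file_size_on_media_cart_gb",
    "in_searchworks",              # discoverable/available in library catalog
    "spotlight_exhibit",           # virtual exhibit, if any
]

row = {
    "collection_id": "M0000",      # placeholder identifier
    "email_message_count": 12450,
    "email_in_epadd": True,
    "file_count_on_media_cart": 38200,
    "file_size_on_media_cart_gb": 512.7,
    "in_searchworks": True,
    "spotlight_exhibit": "",
}

with open("bd_lab_stats.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)
```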

We anticipate that evolving library administrative needs, the continually changing nature of born-digital data, and new methodologies for processing these materials will make it necessary to again change how we document our work. Our solution is not perfect but is flexible enough to allow us to reimagine our documentation strategy in a few short years. If anyone is interested in learning more about what we are documenting and why, please do let us know, as we would be happy to provide further information and may learn something from our colleagues in the process.

———

Michael G. Olson is the Service Manager for the Born-Digital / Forensics Labs at Stanford University Libraries. In this capacity he is responsible for working with library stakeholders to develop services for acquiring, preserving, and accessing born-digital library materials. Michael holds a Master of Philosophy in History and Computing from the University of Glasgow. He can be reached at mgolson [at] Stanford [dot] edu.

Clearing the digital backlog at the Thomas Fisher Rare Book Library

By Jess Whyte

This is the second post in our Spring 2016 series on processing digital materials.

———

Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:

  • the collections are heterogeneous in terms of format, age, media and filesystems
  • the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
  • the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
  • the content is at risk

The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation for that content.

Identifying digital assets

The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the .pdf finding aids from post-1960 with terms like “digital,” “electronic,” “disk,” —you get the idea. Once we knew how many items we were dealing with and where we could find them, we could begin.

Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.


It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.

Our early iterations of the process went something like this:

  1. Use a Kryoflux and its corresponding software to take an image of the disk
  2. Mount the image in a tool like FTK Imager or HFSExplorer
  3. Export a list of the files in a somewhat consistent manner to serve as a manifest, metadata and de facto finding aid
  4. Bag it all up in Bagger.

This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually–and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.

Enter disktype and md5deep

I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.

Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.

This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:

[Workflow diagram: disk images are routed by filesystem to either fiwalk or md5deep]
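The linked bash script is the authoritative version; purely as a rough illustration of the same routing idea (not the actual script), a Python sketch might look like the following, with the tool flags worth double-checking against your local man pages.

```python
# Rough sketch of the routing idea: identify the filesystem with disktype,
# then send FAT images to fiwalk and HFS images (mounted read-only) to md5deep.
# Flags shown are from memory; verify them before relying on this.
import subprocess
import sys
from pathlib import Path

MOUNT_POINT = "/mnt/diskimage"   # assumed to already exist

def filesystem_of(image: Path) -> str:
    """Return 'FAT', 'HFS', or 'unknown' based on disktype's output."""
    output = subprocess.run(["disktype", str(image)],
                            capture_output=True, text=True).stdout
    if "FAT" in output:
        return "FAT"
    if "HFS" in output:
        return "HFS"
    return "unknown"

def process(image: Path) -> None:
    fs = filesystem_of(image)
    dfxml = image.with_suffix(".xml")
    if fs == "FAT":
        # fiwalk reads the image directly and writes DFXML.
        subprocess.run(["fiwalk", "-X", str(dfxml), str(image)], check=True)
    elif fs == "HFS":
        # fiwalk doesn't handle HFS, so mount read-only and hash with md5deep.
        subprocess.run(["sudo", "mount", "-t", "hfs", "-o", "loop,ro",
                        str(image), MOUNT_POINT], check=True)
        with open(dfxml, "w") as out:
            subprocess.run(["md5deep", "-r", "-d", MOUNT_POINT],
                           stdout=out, check=True)
        subprocess.run(["sudo", "umount", MOUNT_POINT], check=True)
    else:
        print(f"{image}: unrecognized filesystem, handle manually")

for img in Path(sys.argv[1]).glob("*.img"):
    process(img)
```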

I could now turn a collection of image files and corresponding imaging logs into a collection of image files, logs, DFXML files, and CSV manifests. Or, to put it another way, I could now take a single disk image and rapidly turn it into a ready-to-be-bagged package.

Challenges, Future Considerations and Questions

Are we going too fast?
Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.

Virus Checks:
We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (won’t modify the image in any way).  One possibility is to include the virus check in our bash script, but this will slow it down significantly and make quick passes through a collection of images impossible (during the early, testing phases of this pilot, this is critical). Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
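As one possible sketch of the pre-bagging option described above (not an adopted workflow; the mount point and log path are placeholders), an overnight ClamAV pass could be scripted along these lines:

```python
# Sketch: overnight ClamAV pass over read-only mounts of disk-image contents
# before bagging. clamscan's -r recurses and --infected prints only hits.
import subprocess
from pathlib import Path

SCAN_ROOT = Path("/mnt/mounted_images")   # placeholder: read-only mounts of the images
LOG_FILE = "virus_scan.log"

result = subprocess.run(
    ["clamscan", "-r", "--infected", f"--log={LOG_FILE}", str(SCAN_ROOT)],
    capture_output=True, text=True,
)

# clamscan exit codes: 0 = clean, 1 = virus(es) found, 2 = error.
if result.returncode == 1:
    print(f"Flagged images need review before bagging; see {LOG_FILE}")
elif result.returncode == 2:
    print("Scanner error; check the log")
else:
    print("No viruses detected")
```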

Adding More Filesystem Recognition:
Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.

Access Copies:
This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.

Error Checking:
Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?

If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.

———

Jess Whyte is a Masters Student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.  

Resources:

Gengenbach, M. (2012). “The way we do it here”: Mapping digital forensics workflows in collecting institutions. Unpublished master’s thesis, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

Goldman, B. (2011). Bridging the gap: taking practical steps toward managing born-digital collections in manuscript repositories. RBM: A Journal of Rare Books, Manuscripts and Cultural Heritage, 12(1), 11-24

Prael, A., & Wickner, A. (2015). Getting to Know FRED: Introducing Workflows for Born-Digital Content.

Digital Processing at the Rockefeller Archive Center

By Bonnie Gordon

This is the first post in our Spring 2016 series on processing digital materials, exploring how archivists conceive of, implement, and track activities to arrange and describe digital materials in archival collections. If you are interested in contributing to bloggERS!, check out our guidelines for writers or contact us at ers.mailer.blog@gmail.com

———

At the Rockefeller Archive Center, we’re working to get “digital processing” out of the hands of “digital” archivists and into the realm of “regular” archivists. We are using “digital processing” to mean description, arrangement, and initial preservation of born digital archival content stored on removable storage media. Our definition will likely expand over time, as we start to receive more born digital materials via network transfer and fewer acquisitions of floppy disks and CDs.

The vast majority of our born digital materials are on removable storage media and currently inaccessible to our researchers, donors, and staff. We have content on over 3,000 digital storage media items, which are rapidly deteriorating. Our backlog of digital media items includes over 2,500 optical disks, almost 200 3.5″ floppy disks, and almost 100 5.25″ floppy disks. There are also a handful of USB flash drives, hard drives, and older and unusual media (Bernoulli disks, Sy-Quest cartridges, 8″ floppy disks). This is a lot of work for one digital archivist! Having multiple “regular” archivists process these materials distributes the work, which means we can get through the backlog much more quickly. Additionally, integrating digital processing into regular processing work will prevent a future backlog from being created.

In order to help our processing archivists establish and enhance intellectual control of our born digital holdings, I’m working to provide them with the tools, workflows, and competencies needed to process digital materials.  Over the next several months, a core group of processing archivists will be trained and provided with documentation on digital media inventorying, digital forensics, and other born digital workflows. After training, archivists will be able to use the skills they gained in their “normal” processing projects. The core group of archivists trained on dealing with born digital materials will then be able to train other archivists. This will help digital processing be perceived as just another aspect of “regular” processing. Additionally, providing good workflow documentation gives our processing archivists the tools and competencies to do their jobs.

Streamlining our digital processing workflows is also a really important part of this. One step in this direction is to create a digital media inventory and disk imaging log that will be able to “talk” to our collections management system (ArchivesSpace). We currently have an inventory and imaging log, but they’re in a Microsoft Access database, which has a number of limitations, one of the primary ones being that it can’t integrate with our other systems. Integrating with ArchivesSpace reduces duplicate data entry, inconsistent data, and further integrates digital processing into our “regular” processing work.
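As a very rough sketch of what “talking” to ArchivesSpace might eventually look like (the hostname, repository number, credentials, and field values below are all placeholders, and our actual integration may differ), a disk-image record from an inventory could be pushed through the ArchivesSpace REST API with a few lines of Python:

```python
# Hypothetical sketch: push one media-inventory entry to ArchivesSpace as a
# minimal digital object. Host, repository id, credentials, and values are placeholders.
import requests

BASE = "http://archivesspace.example.org:8089"   # backend API address
USER, PASSWORD = "admin", "change-me"

# Authenticate and grab a session token.
session = requests.post(f"{BASE}/users/{USER}/login",
                        params={"password": PASSWORD}).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

# One row from the media inventory, expressed as a minimal digital object record.
digital_object = {
    "jsonmodel_type": "digital_object",
    "title": "Floppy disk 42, imaged 2016-03-01",
    "digital_object_id": "rac_disk_0042",
}

response = requests.post(f"{BASE}/repositories/2/digital_objects",
                         headers=headers, json=digital_object)
print(response.json())
```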

The RAC’s processing archivists establish and enhance intellectual and physical control of our archival holdings, regardless of format, in order to facilitate user access. By fully integrating digital processing into “normal” processing activities, we will be able to preserve and provide access to unique born digital content stored on obsolete and decaying media.

———

Bonnie Gordon is an Assistant Digital Archivist at the Rockefeller Archive Center, where she works primarily with born digital materials and digital preservation workflows. She received her M.A. in Archives and Public History, with a concentration in Archives, from New York University.