The University of Minnesota Archives manages the web archiving program for the Twin Cities campus. We use Archive-It to capture the bulk of our online content, but as we have discovered, managing subsets of our web content and bringing it into our collections has its unique challenges and requires creative approaches. We increasingly face requests to provide a permanent, accessible home for files that would otherwise be difficult to locate in a large archived website. Some content, like newsletters, is created in HTML and is not well-suited for upload into the institutional repository (IR) we use to handle most of our digital content. Our success in managing web content that is created for the web (as opposed to uploaded and linked PDF files, for example) has been mixed.
In 2016, a department informed us that one of their web domains was going to be cleared of its current content and redirected. Since that website contained six years of University Relations press releases, available solely in HTML format, we were pretty keen on retrieving that content before it disappeared from the live web.
The department also wanted these releases saved, so they downloaded the contents of the website for us, converted each release into a PDF, and emailed them to us before that content was removed. Although we did have crawls of the press releases through Archive-It, we intended to use our institutional repository, the University Digital Conservancy (UDC), to preserve and provide access to the PDF files derived from the website.
So, when faced with the 2,920 files included in the transfer, labeled in no particularly helpful way, in non-chronological order, and with extraneous files included, I rolled up my sleeves and got to work. With the application of some helpful programs and a little more spreadsheet data entry than I would like to admit to, I ended up with some 2,000 articles renamed in chronological order. I grouped and combined the files by year, which was in keeping with the way we have previously provided access to press releases available in the UDC.
All that was left was to OCR and upload, right?
And everything screeched to a halt. Because of the way the files had been downloaded and converted, every page of every file contained renderable text from the original stylesheet hidden within an additional layer that prevented OCR’ing with our available tools, and we were unable to invest more time to find an acceptable solution.
Thus, these news releases now sit in the UDC as six 1,000-page documents that cannot be full-text searched but are, mercifully, in chronological order. The irony of having our born-digital materials return to the same limitations that plagued our analogue press releases, prior to the adoption of the UDC, has not been lost on us.
But this failure shines a light on the sometimes murky boundaries between archiving the web and managing web content in our archive. I have a website sitting on my desk, burned to a CD. The site is gone from the live web, and Archive-It never crawled it. We have a complete download of an intranet site sitting on our network drive–again, Archive-It never crawled that site. We handle increasing amounts of web content that never made it into Archive-It. But, using our IR to handle these documents is imperfect, too, and can require significant hands-on work when the content has to be stripped out of its original context (the website), and manipulated to meet the preservation requirements of that IR (file format, OCR).
Cross-pollination between our IR and our web archive is inevitable when they are both capturing the born-digital content of the University of Minnesota. Assisting departments with archiving their websites and web-based materials usually involves using a combination of the two, but raises questions of scalability. But even in our failure to bring those press releases all the way to the finish line, we were able to get pretty close using the tools we had available to us and were able to make the files available, and frankly, that’s an almost-success I can live with.
And, while we were running around with those press releases, another department posted a web-based multimedia annual report only to ask later whether it could be uploaded to the IR, with their previous annual reports. Onward!
Valerie Collins is a Digital Repositories & Records Archivist at the University of Minnesota Archives.
In January 2017, I began a position with the National Digital Stewardship Residency (NDSR) hosted by the Georgian Papers Programme, an international collaborative project based on transforming scholarly and personal access to unpublished collections at the Royal Library and Archives from the Hanoverian dynasty through online access and scholarly exploration.
My role within this project is to conduct a comparative analysis of descriptive metadata for the collections at the Royal Library and Royal Archives and related collections at the Library of Congress. While NDSR is a program that focuses more broadly on training individuals within the field of digital preservation and curation, my project is specifically tasked with understanding how metadata can aid in providing access to digitized collections. My analysis will help the Georgian Papers Programme partners determine their readiness for data sharing and will inform the development of a shared platform that aims to provide interoperable access for the collections.
Although still in its early stages, my initial findings are related to the schemas used as well as the syntaxes employed within each schema. One goal of this analysis is to find a common data model for the various collections within the project. My initial findings, expressed with the assistance of Elizabeth Barrett Browning, include:
How do I describe thee? Let me count the ways: These collections have been described using many different standards throughout the years. The standards used by the partner institutions include: Encoded Archival Description (EAD) with DACS as the model for description for archival collections in the United States; ISAD(G) for archival collections in the United Kingdom; MARC for bibliographic, map, serial, and print collections; and Dublin Core employed for certain digital collections records. There are also collections that have been described using additional library and museum standards that need to be analyzed further. Luckily, most of these standards work well together because either they are international standards or there are established crosswalks.
I describe thee to the depth and breadth and height / My soul can reach, when feeling out of sight / For the ends of searching and ideal access: Key access points that I have highlighted as needing review include: subject headings; dates; languages; and place, personal, and corporate names. Syntactical inconsistency of these fields can lead to difficulties in the future when using a shared access platform. Names have proven to be particularly challenging as royal names can change throughout an individual’s life based on succession, titles, and changes in empire. Having only worked in American archival institutions up until this point, this is the first time I have encountered such complicated names.
I describe thee to the level of every day’s / most quiet need: The level of description between collections varies based on whether the materials are from archival collections or library collections. Many have item level description while some archival collections are described at the file or series level. The difficulties faced when trying to represent different levels of description in a shared digital library environment have been explored previously by other archivists, most recently in Aggregating and Representing Collections in the Digital Public Library of America published in November 2016.
The need for interoperability between collections that use different data models is in no way unique to international collaborative projects. Within a large institution such as the Library of Congress, there is an emphasis on making all collections accessible in a single viewer rather than maintaining multiple sites. I hope that with international collaborative projects such as my own, the digital heritage community will continue to work towards the common goal of interoperability.
Allison-Bunnell, Jodi et al. Digital Public Library of America. Aggregating and Representing Collections in the Digital Public Library of America. Boston, Massachusetts: Digital Public Library of America, 2016.
Charlotte Kostelic is the National Digital Stewardship Resident for the Georgian Papers Programme hosted by the Library of Congress and the Royal Collection Trust. She is a recent graduate from Queens College Graduate School of Library and Information Studies, and has previously held positions at StoryCorps and the Barnard Archives and Special Collections.
The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.
At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.
In Mexico, I was able to conduct interviews with nearly thirty organizations working on building, managing, sharing and preserving their digital collections. The types of organizations I visited were diverse in several areas: geographic location (i.e. outside of heavily centralized Mexico City), organization size, organization mission, and industry sector.
Cultural Heritage organizations (galleries, libraries, archives, museums)
College and University archives and libraries
Because of the diversity of the types of institutions that I visited, the results and conclusions I drew were also varied, and I noticed distinct trends within each area or category of institutions. For the sake of brevity, I have taken the liberty of abbreviating my findings in the following bullet points. These are not meant to be definitive or exhaustive, as I am still compiling, codifying and quantifying interview data.
The focus on digital collection building and preservation in business and government tends toward records management approaches. Retention schedules are dictated by the federal government and administered and enforced by the National Archives. All federal and state government entities are obligated to follow these guidelines for retention and transfer of records and archives. While the guidelines and processes for paper records are robust, many institutions are only beginning to implement and use electronic records management platforms. Long-term digital preservation of records designated for permanent deposit is an ongoing challenge.
In cultural heritage institutions and college and university archives, digital collection work is focused on building digitization and digital collection management programs. The primary focus of the majority of institutions is still on digitization, storage and diffusion of digitized assets, and wrangling issues related to long-term, sustainable maintenance of digital collections platforms and backups on precarious physical media formats like optical disks and (non-redundant) hard drives.
While digital preservation issues are still in the nascent stages of being worked through and solved everywhere around the globe, in some areas strong national and regional groups have been formed to help share strategies, create standards and think through local solutions. In Mexico and Latin America, this has mostly been done through participation in the InterPARES project, but a national Mexican digital preservation consortium, similar to the National Digital Stewardship Alliance (NDSA) in the United States, has yet to be established. In the meantime, several Mexican academic and government institutions have taken the lead on digital preservation issues, and through those initiatives, a more cohesive, intentional organization similar to the NDSA may be able to take root in the near future.
My opportunity to live and do research in Mexico was life-changing. It is now more crucial than ever for librarians, archivists, developers, administrators, and program leaders to look outside of the United States for collaborations and opportunities to learn with and from colleagues abroad. The work we have at hand is critical, and we need to share all the resources we have, especially those resources money cannot buy: a different perspective, diversity of language, and the shared desire to make the whole world, not just our little corner of it, a better place for all.
Natalie Baur is currently the Preservation Librarian at El Colegio de México in Mexico City, an institution of higher learning specializing in the humanities and social sciences. Previously, she served as the Archivist for the Cuban Heritage Collection at the University of Miami Libraries and was a 2015-2016 Fulbright-García Robles fellowship recipient, looking at digital preservation issues in Mexican libraries, archives and museums. She holds an M.A. in History and a certificate in Museum Studies from the University of Delaware and an M.L.S. with a concentration in Archives, Information and Records Management from the University of Maryland. She is also co-founder of the Desmantelando Fronteras webinar series and the Itinerant Archivists project. You can read more about her Fulbright-García Robles fellowship here.
One key area of collaboration is digital preservation. We have jointly used the Goportis Digital Archive, based on Ex Libris’s Rosetta, since 2010. The certification of our digital archive is part of our quality management, since all workflows are evaluated. Beyond that, a certification seal signals to external parties, like stakeholders and customers, that the long-term availability of the data is ensured and the digital archive is trustworthy.
So far TIB and ZBW have successfully completed the certification processes for the Data Seal of Approval (DSA) and are currently working on the application for the nestor Seal. Here are some key facts about the seals:
In general, we are equal partners. For digital preservation, though, TIB is the consortium leader, since it is the software licensee and hosts the computing center.
Due to the terms of the DSA—as well as those of the nestor Seal—a consortium cannot be certified as a whole, but only each partner individually. For that reason each partner drew up its own application. However, for some aspects of the certification ZBW had to refer to the answers of TIB, which functions as its service provider.
Besides these external requirements, we also organized the distribution of tasks on the basis of internal goals. We interpreted the certification process as an opportunity to gain deeper insight into the workflows, policies and dependencies of our partner institutions. That is why we analyzed the DSA guidelines together. Moreover, we discussed the progress of the application process regularly in telephone conferences and matched our answers to each guideline. As a positive side effect, this way of proceeding not only strengthened our teamwork but also led to a better understanding of the guidelines and more elaborate answers for the DSA application.
The documentation for the DSA was created in more detail than recommended in order to facilitate further use of the documents for the nestor Seal.
The certification process for the DSA extended over six months (12/2014–08/2015). In each institution one employee was in charge of the certification process. Other staff members added special information about their respective areas of work. This included technical development, data specialists, legal professionals, team leaders, and system administration (TIB only). The costs of applying for the seal can be measured in person months:
Based on our positive experiences with the DSA certification, we plan to acquire the nestor Seal following the same procedures. The DSA application has prepared the ground for this task, since important documents, such as policies, have already been drafted.
Franziska Schwab has worked as a Preservation Analyst in the Digital Preservation team at the German National Library of Science and Technology (TIB) since 2014. She’s responsible for Pre-Ingest data analysis, Ingest, process documentation, policies, and certification.
Yvonne Tunnat has been the Digital Preservation Manager for the Leibniz Information Centre for Economics in Kiel/Hamburg (ZBW) since 2011. Her key working areas are format identification, validation, and preservation planning.
Dr. Thomas Gerdes has been part of the Digital Preservation team of the Leibniz Information Centre for Economics in Kiel/Hamburg (ZBW) since 2015. His interests are in the field of certification methods.
The University of Illinois at Urbana-Champaign’s (Illinois) library-based Research Data Service (RDS) will be launching an institutional data repository, the Illinois Data Bank (IDB), in May 2016. The IDB will provide University of Illinois researchers with a repository for research data that will facilitate data sharing and ensure reliable stewardship of published data. The IDB is a web application that transfers deposited datasets into Medusa, the University Library’s digital preservation service for the long-term retention and accessibility of its digital collections. Content is ingested into Medusa via the IDB’s unmediated self-deposit process.
As we conceived of and developed our dataset curation workflow for digital datasets ingested in the IDB, we turned to archivists in the University Archives to gain an understanding of their approach to processing digital materials. [Note: I am not specifying whether data deposited in the IDB is “born digital” or “digitized” because, from an implementation perspective, both types of material can be deposited via the self-deposit system in the IDB. We are not currently offering research data digitization services in the RDS.] There were a few reasons for consulting with the archivists: 1) archivists have deep, real-world curation expertise, and we anticipate that many of the challenges we face with data will have solutions whose foundations were developed by archivists; 2) if, through discussing processes, we found areas where the RDS and Archives have converging preservation or curation needs, we could communicate these to the Preservation Services Unit, which develops and manages Medusa; and 3) I’m an archivist by training and I jump on any opportunity to talk with archivists about archives!
Even though the RDS and the University Archives share a central goal–to preserve and make accessible the digital objects that we steward–we learned that there are some operational and policy differences between our approaches to digital stewardship that necessitate points of variance in our processing/curation workflow:
Appraisal and Selection
In my view, appraisal and selection are fundamental to the archives practice. The archives field has developed a rich theoretical foundation when it comes to appraisal and selection, and without these functions the archives endeavor would be wholly unsustainable. Appraisal and selection ideally tend to occur in the very early stages of the archival processing workflow. The IDB curation workflow will differ significantly–by and large, appraisal and selection procedures will not take place until at least five years after a dataset is published in the IDB–making our appraisal process more akin to that of an archives that chooses to appraise records after accessioning or even during the processing of materials for long-term storage. Our different approaches to appraisal and selection speak to the different functions the RDS and the University Archives fulfill within the Library and the University.
The University Archives is mandated to preserve University records in perpetuity by the General Rules of the University and the Illinois State Records Act. The RDS’s initiating goal, in contrast, is to provide a mechanism for Illinois researchers to be compliant with funder and/or journal requirements to make results of research publicly available. Here, there is no mandate for the IDB to accept solely what data is deemed to have “enduring value” and, in fact, the research data curation field is so new that we do not yet have a community-endorsed sense of what “enduring value” means for research data. Standards regarding the enduring value of research data may evolve over the long-term in response to discipline-specific circumstances.
To support researchers’ needs and/or desires to share their data in a simple and straightforward way, the IDB ingest process is largely unmediated. Depositing privileges are open to all campus affiliates who have the appropriate University log-in credentials (e.g., faculty, graduate students, and staff), and deposited files are ingested into Medusa immediately upon deposit. RDS curators will do a cursory check of deposits, as doing so remains scalable (see workflow chart below), and the IDB reserves the right to suppress access to deposits for a “compelling reason” (e.g., failure to meet criteria for depositing as outlined in the IDB Accession Policy, violations of publisher policy, etc.). Aside from cases that we assume will be rare, the files as deposited into the IDB, unappraised, are the files that are preserved and made accessible in the IDB.
A striking policy difference between the RDS and the University Archives is that the RDS makes a commitment to preserving and facilitating access to datasets for a minimum of five years after the date of publication in the Illinois Data Bank.
The University Archives, of course, makes a long-term commitment to preserving and making accessible records of the University. I have to say, when I learned that the five-year minimum commitment was the plan for the IDB, I was shocked and a bit dismayed! But after reflecting on the fact that files deposited in the IDB undergo no formal appraisal process at ingest, the concept began to feel more comfortable and reasonable. At a time when terabytes of data are created, oftentimes for single projects, and budgets are a universal concern, there are logistical storage issues to contend with. Now, I fully believe that for us to ensure that we are able to 1) meet current, short-term data sharing needs on our campus and 2) fulfill our commitment to stewarding research data in an effective and scalable manner over time, we have to make a circumspect minimum commitment and establish policies and procedures that enable us to assess the long-term viability of a dataset deposited into the IDB after five years.
The RDS has collaborated with archives and preservation experts at Illinois and, basing our work in archival appraisal theory, has developed guidelines and processes for reviewing published datasets after their five-year commitment ends to determine whether to retain, deaccession, or dedicate more stewardship resources to datasets. Enacting a systematic approach to appraising the long-term value of research data will enable us to allot resources to datasets in a way that is proportional to the datasets’ value to research communities and their preservation viability.
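The five-year arithmetic behind that review can be sketched in a few lines of Python. This is only an illustration of the date calculation, not the IDB's actual code: the function names are hypothetical, and treating a 29 February publication date as 28 February in a non-leap year is an assumption made here for the sake of the example.

```python
from datetime import date

# The five-year minimum commitment comes from the policy described above.
MINIMUM_COMMITMENT_YEARS = 5


def review_eligible_on(published, years=MINIMUM_COMMITMENT_YEARS):
    """Return the date on which the minimum commitment for a dataset ends."""
    try:
        return published.replace(year=published.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year.
        # Assumption for this sketch: fall back to 28 February.
        return published.replace(year=published.year + years, day=28)


def is_due_for_review(published, today):
    """True once the minimum commitment period has elapsed."""
    return today >= review_eligible_on(published)
```

A curator could run a check like `is_due_for_review(date(2016, 5, 1), date.today())` over a list of publication dates to build a review queue. Whether a dataset is then retained, deaccessioned, or given more resources remains a human appraisal decision.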
To show that we’re not all that different after all, I’ll briefly mention a few areas where the University Archives and the RDS are taking similar approaches or facing similar challenges:
We are both taking an MPLP-style approach to file conversion. In order to get preservation control of digital content, at minimum, checksums are established for all accessioned files. As a general rule, if the file can be opened using modern technology, file conversion will not be pursued as an immediate preservation action. Establishing strategies and policies for managing a variety of file formats at scale is an area that will be evolving at Illinois through collaboration of the University Archives, the RDS, and the Preservation Services Unit.
Accruals present metadata challenges. How do we establish clear accrual relationships in our metadata when a dataset or a records series is updated annually? Are there ways to automate processes to support management of accruals?
Both units do as much as they can to get contextual information about the material being accessioned from the creator, and metadata is enhanced as possible throughout curation/processing.
The University Archives and the RDS control materials in aggregation, with the University Archives managing at the archival collection level and the RDS managing digital objects at the dataset level.
More? Certainly! For both the research data curation community and the archives community, continually adopting pragmatic strategies to manage the information created by humans (and machines!) is paramount, and we will continue to learn from one another.
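As an illustration of the checksum step mentioned in the first point above, here is a minimal Python sketch of establishing checksums for every file in an accession and later checking them. It is not the Medusa or University Archives implementation. The SHA-256 algorithm and the function names are assumptions chosen for the example; the post does not say which algorithm or tools are used.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.new(algorithm)  # SHA-256 is an assumed default
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder, algorithm="sha256"):
    """Map each file's path (relative to the folder) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): file_checksum(path, algorithm)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest, algorithm="sha256"):
    """List files from the manifest that are now missing or hash differently."""
    current = build_manifest(folder, algorithm)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

Storing the manifest at accession and rerunning `changed_files` later gives a basic fixity check without converting or even opening the files in their native applications, which fits the minimal-intervention approach described above.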
The following represents our planned functional workflow for handling dataset deposits in the Illinois Data Bank:
To learn more about the IDB policies and procedures discussed in this post, keep an eye on the Illinois Data Bank website after it launches next month. Of particular interest on the Policies page will be the Accession Policy and the Preservation Review, Retention, Deaccession, Revision, and Withdrawal Procedure document.
Bethany Anderson and Chris Prom of the University of Illinois Archives
The rest of the Research Data Preservation Review Policy/Procedures team: Bethany Anderson, Susan Braxton, Heidi Imker, and Kyle Rimkus
The rest of the RDS team: Qian Zhang, Elizabeth Wickes, Colleen Fallaw, and Heidi Imker
Elise Dunham is a Data Curation Specialist for the Research Data Service at the University of Illinois at Urbana-Champaign. She holds an MLS from the Simmons College Graduate School of Library and Information Science where she specialized in archives and metadata. She contributes to the development of the Illinois Data Bank in areas of metadata management, repository policy, and workflow development. Currently she co-chairs the Research Data Alliance Archives and Records Professionals for Research Data Interest Group and is leading the DACS workshop revision working group of the Society of American Archivists Technical Subcommittee for Describing Archives: A Content Standard.
When it comes to working to process large sets of electronic records, it’s all too easy to get so wrapped up in the task at hand that when you finally come up for air you look at the clock and think to yourself, “Where did the time go? How long was I gone?” Okay, that may sound rather apocalyptic, but tracking time spent is an important yet easily elided step in electronic records processing.
At the University of Minnesota Libraries, the members of the Electronic Records Task Force are charged with developing workflows and making estimates for future capacity and personnel needs. In an era of very tight budgets, making a strong, well-documented case for additional personnel and resources is critical. To that end, we’ve made some efforts to more systematically track our time as we pilot our workflows.
Chief among those efforts has been a customization of the Data Accessioner tool. Originally written for internal use at the David M. Rubenstein Rare Book & Manuscript Library at Duke University, the project has since become open source, with support for recent releases coming from the POWRR Project. Written in Java and utilizing the common logging library log4j, Data Accessioner is structured in a way that made it possible for someone like me (familiar with programming, but without much Java experience) to enhance the time logging functionality. As we know, some accession tasks take a few minutes, while others can run for many hours (if not days). Enhancing the logging functionality of Data Accessioner allows staff to see accurately how long any data transfer takes, without needing to be physically present. The additional functionality was in itself pretty minor: log the time and folder name before starting to accession a folder and again upon completion. The most complex part of this process was not writing the additional code, but rather modifying the log4j configuration. Luckily, with an existing configuration file, solid documentation, and countless examples in the wild, I was able to produce a version of Data Accessioner that outputs a daily log as a plain text file, which makes tracking the time spent on accessioning jobs much easier. You can see more description of the changes I made and the log output formatting on GitHub. You can download a ZIP file with the application with this addition from that page as well, or use this download link.
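For readers whose tools do not log timing at all, the same idea can be sketched outside of Data Accessioner. The Python example below is not the Java/log4j change described above. It is a rough analogue using Python's standard `logging` module, with a plain folder copy standing in for an accessioning job. The logger name, function names, and log line format are all hypothetical.

```python
import logging
import shutil
import time
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def make_daily_logger(log_path):
    """Return a logger that writes plain text and starts a new file at midnight."""
    logger = logging.getLogger("accession_timing")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = TimedRotatingFileHandler(
            str(log_path), when="midnight", encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def timed_copy(source, destination, logger):
    """Copy a folder, logging its name and the time before and after the copy."""
    source = Path(source)
    logger.info("START %s", source.name)
    started = time.monotonic()
    shutil.copytree(source, destination)  # stand-in for the real accession step
    elapsed = time.monotonic() - started
    logger.info("END %s (%.1f seconds)", source.name, elapsed)
    return elapsed
```

Because each line carries a timestamp and the folder name, the elapsed time for a long-running job can be read from the log afterwards, with no one watching the transfer.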
Screenshots and a sample log file:
With this change, we are now able to better estimate the time it takes to use Data Accessioner. Do the tools you use keep track of the time it takes to run? If not, how are you doing this? Questions or comments can be sent to lib-ertf [at] umn [dot] edu.
Kevin Dyke is the spatial data analyst/curator at the University of Minnesota’s John R. Borchert Map Library. He’s a member of the University of Minnesota Libraries’ Electronic Records Task Force, works as a data curator for the Data Repository for the University of Minnesota (DRUM), and is also part of the Committee on Institutional Cooperation’s (CIC) Geospatial Data Discovery Project. He received a Master’s degree in Geography from the University of Minnesota and can be reached at dykex005 [at] umn.edu.
Stanford University Libraries is in the process of changing how it documents its digital processing activities and records lab statistics. This is our third iteration of how we track our born-digital work in six years and is a collaborative effort between Digital Library Systems and Services, our Digital Archivist Peter Chan, and Glynn Edwards, who manages our Born-Digital Program and is the Director of the ePADD project.
Initially we documented our statistics using a library-hosted FileMaker Pro database. In this initial iteration we were focused on tracking media counts and media failure rates. After a single year of using the database we decided that we needed to modify the data structure and the data entry templates significantly. Our staff found the database too time consuming and cumbersome to modify.
We decided to simplify and replaced the database with a spreadsheet stored with our collection data. Our digital archivist and hourly lab employees were responsible for updating this spreadsheet when they had finished working with a collection. This was a simple solution that was easy to edit and update, and it worked well for four years until we realized we needed more data for our fiscal year-end reports. As our born-digital program has grown and matured, we discovered we were missing key data points that documented important processing decisions in our workflows. It was time to again improve how we documented our work.
Stanford Statistics Spreadsheet version 2
For our brand new version of work tracking we have decided to continue to use a spreadsheet but have migrated our data to Google Drive to better facilitate updates and versioning of our documentation. New data points have been included to better track specific types of born-digital content like email. This new version also allows us to better document the processing lifecycle of our born-digital collections. In order to better do this we have created the following additional data points:
Number of email messages
Email in ePADD.stanford.edu
File count in media cart
File size on media cart (GB)
SearchWorks (materials discoverable / available in library catalog)
SpotLight Exhibit (a virtual exhibit)
Stanford Statistics Spreadsheet version 3
We anticipate that evolving library administrative needs, the continually changing nature of born-digital data, and new methodologies for processing these materials will make it necessary to again change how we document our work. Our solution is not perfect but is flexible enough to allow us to reimagine our documentation strategy in a few short years. If anyone is interested in learning more about what we are documenting and why, please do let us know, as we would be happy to provide further information and may learn something from our colleagues in the process.
Michael G. Olson is the Service Manager for the Born-Digital / Forensics Labs at Stanford University Libraries. In this capacity he is responsible for working with library stakeholders to develop services for acquiring, preserving and accessing born-digital library materials. Michael holds a Masters in Philosophy in History and Computing from the University of Glasgow. He can be reached at mgolson [at] Stanford [dot] edu.
Tucked away in the manuscript collections at the Thomas Fisher Rare Book Library, there are disks. They’ve been quietly hiding out in folders and boxes for the last 30 years. As the University of Toronto Libraries develops its digital preservation policies and workflows, we identified these disks as an ideal starting point to test out some of our processes. The Fisher was the perfect place to start:
the collections are heterogeneous in terms of format, age, media and filesystems
the scope is manageable (we identified just under 2000 digital objects in the manuscript collections)
the content has relatively clear boundaries (we’re dealing with disks and drives, not relational databases, software or web archives)
the content is at risk
The Thomas Fisher Rare Book Library Digital Preservation Pilot Project was born. Its purpose: to evaluate the extent of the content at risk and establish a baseline level of preservation on the content.
Identifying digital assets
The project started by identifying and listing all the known digital objects in the manuscript collections. I did this by batch searching all the .pdf finding aids from post-1960 with terms like “digital,” “electronic,” and “disk” (you get the idea). Once we knew how many items we were dealing with and where we could find them, we could begin.
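A batch search like this can be sketched in a few lines. The example below is a minimal illustration in Python, not the method actually used for the project: it assumes the text of each finding aid has already been extracted from its PDF (that step is left out), it uses only the three terms named above because the full term list is not given, and the function name `find_term_hits` and the sample file names are hypothetical.

```python
import re

# The three terms named in the post; the full list that was used is not given.
SEARCH_TERMS = ("digital", "electronic", "disk")


def find_term_hits(finding_aids, terms=SEARCH_TERMS):
    """Map each finding aid name to the sorted list of terms found in its text.

    `finding_aids` maps a file name to text already extracted from the PDF.
    Matching is case-insensitive and also catches longer forms such as "disks".
    Finding aids with no matches are left out of the result.
    """
    patterns = {term: re.compile(re.escape(term), re.IGNORECASE) for term in terms}
    hits = {}
    for name, text in finding_aids.items():
        found = sorted(term for term, pattern in patterns.items() if pattern.search(text))
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = {
        "collection_a.pdf": "Box 12: 3.5 inch floppy disks, electronic drafts",
        "collection_b.pdf": "Box 3: correspondence and photographs",
    }
    for name, found in find_term_hits(sample).items():
        print(name, found)
```

The result is a list of finding aids worth opening by hand, which is about as far as a keyword search can go: a hit on “electronic” still needs a person to confirm there is a disk in the box.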
Early days, testing and fails
When I first started, I optimistically thought I would just fire up BitCurator and everything would work.
It didn’t, but that’s okay. All of the reasons we chose these collections in the first place (format, media, filesystem and age diversity) also posed a variety of challenges to our workflow for capture and analysis. There was also a question of scalability – could I really expect to create preservation copies of ~2000 disks along with accompanying metadata within a target 18-month window? By processing each object one-by-one in a graphical user interface? While working on the project part-time? No, I couldn’t. Something needed to change.
Our early iterations of the process went something like this:
Use a Kryoflux and its corresponding software to take an image of the disk
This was slow, inconsistent, and not well-suited to the project timetable. I tried using fiwalk (included with BitCurator) to walk through a series of images and automatically generate manifests of their contents, but fiwalk does not support HFS and other, older filesystems. Considering 40% of our disks thus far were HFS (at this point, I was 100 disks in), fiwalk wasn’t going to save us. I could automate the process for 60% of the disks, but the remainder would still need to be handled individually–and I wouldn’t have those beautifully formatted DFXML (Digital Forensics XML) files to accompany them. I needed a fix.
Enter disktype and md5deep
I needed a way to a) mount a series of disk images, b) look inside, c) generate metadata on the file contents and d) produce a more human-readable manifest that could serve as a finding aid.
Ideally, the format of all that metadata would be consistent. Critically, the whole process would be as automated as possible.
This is where disktype and md5deep come in. I could use disktype to identify an image’s filesystem, mount it accordingly and then use md5deep to generate DFXML and .csv files. The first iteration of our script did just that, but md5deep doesn’t produce as much metadata as fiwalk. While I don’t have the skills to rewrite fiwalk, I do have the skills to write a simple bash script that routes disk images based on their filesystem to either md5deep or fiwalk. You can find that script here, and a visualization of how it works below:
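As a rough illustration of the routing idea (this is not the script linked above), here is a minimal sketch. It is written in Python rather than bash so the logic can be read and tested without the tools installed. It assumes that disktype reports the filesystem in a line such as “FAT12 file system” or “HFS file system,” that FAT images go to fiwalk, and that HFS images are mounted read-only and hashed with md5deep. The function names, output paths, and mount point are hypothetical, and the commands are only assembled here, not run.

```python
import re
from pathlib import PurePosixPath

# Assumption: disktype names the filesystem in a line like "FAT12 file system".
FILESYSTEM_LINE = re.compile(r"\b(HFS|FAT)\d* file system", re.IGNORECASE)


def detect_filesystem(disktype_output):
    """Return "HFS", "FAT", or None from the text printed by disktype."""
    match = FILESYSTEM_LINE.search(disktype_output)
    return match.group(1).upper() if match else None


def plan_commands(image, filesystem, out_dir, mount_point="/mnt/diskimage"):
    """Build (but do not run) the commands for one disk image.

    Paths are treated as POSIX paths because the tools run on Linux.
    """
    image = PurePosixPath(image)
    dfxml = PurePosixPath(out_dir) / (image.stem + ".xml")
    if filesystem == "FAT":
        # fiwalk reads the image directly; -X writes its DFXML to a file.
        return [["fiwalk", "-X", str(dfxml), str(image)]]
    if filesystem == "HFS":
        # fiwalk does not support HFS, so mount read-only and hash the files.
        # md5deep -r recurses and -d prints DFXML to standard output, which
        # the caller would redirect to a file.
        return [
            ["mount", "-t", "hfs", "-o", "loop,ro", str(image), mount_point],
            ["md5deep", "-r", "-d", mount_point],
            ["umount", mount_point],
        ]
    raise ValueError(f"unrecognized filesystem for {image}: {filesystem!r}")


if __name__ == "__main__":
    report = "--- disk01.img\nRegular file\nFAT12 file system (hints score 5 of 5)\n"
    for command in plan_commands("images/disk01.img", detect_filesystem(report), "out"):
        print(" ".join(command))
```

Keeping detection separate from command building means a new filesystem only needs a new branch, and an unrecognized image fails loudly instead of being skipped silently.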
I could now turn a collection of image files and corresponding imaging logs into a collection of image files, logs, DFXML files, and CSV manifests.
Or, to put it another way, I could now take one of these:
And rapidly turn it into this ready-to-be-bagged package:
Challenges, Future Considerations and Questions
Are we going too fast? Do we really want to do this quickly? What discoveries or insights will we miss by automating this process? There is value in slowing down and spending time with an artifact and learning from it. Opportunities to do this will likely come up thanks to outliers, but I still want to carve out some time to play around with how these images can be used and studied, individually and as a set.
Virus Checks: We’re still investigating ways to run virus checks that are efficient and thorough, but not invasive (that is, they won’t modify the image in any way). One possibility is to include the virus check in our bash script, but this would slow it down significantly and make quick passes through a collection of images impossible, and quick passes are critical during the early testing phases of this pilot. Another possibility is running virus checks before the images are bagged. This would let us run the virus checks overnight and then address any flagged images (so far, we’ve found viruses in ~3% of our disk images and most were boot-sector viruses). I’m curious to hear how others fit virus checks into their workflows, so please comment if you have suggestions or ideas.
Adding More Filesystem Recognition: Right now, the processing script only recognizes FAT and HFS filesystems and then routes them accordingly. So far, these are the only two filesystems that have come up in my work, but the plan is to add other filesystems to the script on an as-needed basis. In other words, if I happen to meet an Amiga disk on the road, I can add it then.
Access Copies: This project is currently focused on creating preservation copies. For now, access requests are handled on an as-needed basis. This is definitely something that will require future work.
Error Checking: Automating much of this process means we can complete the work with available resources, but it raises questions about error checking. If a human isn’t opening each image individually, poking around, maybe extracting a file or two, then how can we be sure of successful capture? That said, we do currently have some indicators: the Kryoflux log files, human monitoring of the imaging process (are there “bad” sectors? Is it worth taking a closer look?), and the DFXML and .csv manifest files (were they successfully created? Are there files in the image?). How are other archives handling automation and exception handling?
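Two of the indicators above (were the manifests created, and are there files in the image?) are easy to check automatically. The following is a minimal sketch in Python, not part of the project’s actual script. It assumes a naming convention where disk01.img sits beside disk01.xml (DFXML) and disk01.csv (a manifest with one header row); that convention and the function names are hypothetical. It does not read the Kryoflux logs.

```python
import csv
from pathlib import Path


def check_image_outputs(image_path):
    """Return a list of problems found for one disk image (empty if none).

    Assumed naming: disk01.img is accompanied by disk01.xml and disk01.csv
    in the same directory, and the CSV starts with a single header row.
    """
    image = Path(image_path)
    dfxml = image.with_suffix(".xml")
    manifest = image.with_suffix(".csv")
    problems = []

    if not dfxml.is_file() or dfxml.stat().st_size == 0:
        problems.append("DFXML missing or empty")

    if not manifest.is_file():
        problems.append("CSV manifest missing")
    else:
        with manifest.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Only a header row (or nothing at all) means no files were listed.
        if len(rows) <= 1:
            problems.append("CSV manifest lists no files")
    return problems


def flag_for_review(image_dir, pattern="*.img"):
    """Map image file names to their problems, skipping images with none."""
    flagged = {}
    for image in sorted(Path(image_dir).glob(pattern)):
        problems = check_image_outputs(image)
        if problems:
            flagged[image.name] = problems
    return flagged


if __name__ == "__main__":
    for name, problems in flag_for_review(".").items():
        print(name, "; ".join(problems))
```

A check like this only narrows down which images deserve a human look. An empty manifest can mean a failed capture, but it can also mean a blank disk.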
If you’d like to see our evolving workflow or follow along with our project timeline, you can do so here. Your feedback and comments are welcome.
Jess Whyte is a Master’s student in the Faculty of Information at the University of Toronto. She holds a two-year digital preservation internship with the University of Toronto Libraries and also works as a Research Assistant with the Digital Curation Institute.
At the Rockefeller Archive Center, we’re working to get “digital processing” out of the hands of “digital” archivists and into the realm of “regular” archivists. We are using “digital processing” to mean description, arrangement, and initial preservation of born digital archival content stored on removable storage media. Our definition will likely expand over time, as we start to receive more born digital materials via network transfer and fewer acquisitions of floppy disks and CDs.
The vast majority of our born digital materials are on removable storage media and currently inaccessible to our researchers, donors, and staff. We have content on over 3,000 digital storage media items, which are rapidly deteriorating. Our backlog of digital media items includes over 2,500 optical disks, almost 200 3.5″ floppy disks, and almost 100 5.25″ floppy disks. There are also a handful of USB flash drives, hard drives, and older and unusual media (Bernoulli disks, SyQuest cartridges, 8″ floppy disks). This is a lot of work for one digital archivist! Having multiple “regular” archivists process these materials distributes the work, which means we can get through the backlog much more quickly. Additionally, integrating digital processing into regular processing work will prevent a future backlog from being created.
In order to help our processing archivists establish and enhance intellectual control of our born digital holdings, I’m working to provide them with the tools, workflows, and competencies needed to process digital materials. Over the next several months, a core group of processing archivists will be trained and provided with documentation on digital media inventorying, digital forensics, and other born digital workflows. After training, archivists will be able to use the skills they gained in their “normal” processing projects. The core group of archivists trained on dealing with born digital materials will then be able to train other archivists. This will help digital processing be perceived as just another aspect of “regular” processing. Additionally, providing good workflow documentation gives our processing archivists the tools and competencies to do their jobs.
Streamlining our digital processing workflows is also a really important part of this. One step in this direction is to create a digital media inventory and disk imaging log that will be able to “talk” to our collections management system (ArchivesSpace). We currently have an inventory and imaging log, but they’re in a Microsoft Access database, which has a number of limitations, one of the primary ones being that it can’t integrate with our other systems. Integrating with ArchivesSpace reduces duplicate data entry and inconsistent data, and it further integrates digital processing into our “regular” processing work.
The RAC’s processing archivists establish and enhance intellectual and physical control of our archival holdings, regardless of format, in order to facilitate user access. By fully integrating digital processing into “normal” processing activities, we will be able to preserve and provide access to unique born digital content stored on obsolete and decaying media.
Bonnie Gordon is an Assistant Digital Archivist at the Rockefeller Archive Center, where she works primarily with born digital materials and digital preservation workflows. She received her M.A. in Archives and Public History, with a concentration in Archives, from New York University.