Put 40 digital archivists, programmers, technologists, curators, scholars, and managers in a room together for three days, give them unlimited cups of tea and coffee, and get ready for some seriously productive discussions.
This magic happened at the Born Digital Archiving & eXchange (BDAX) unconference, held at Stanford University on July 18-20, 2016. I joined the other BDAX attendees to tackle the continuing challenges of acquiring, discovering, delivering and preserving born-digital materials.
The discussions highlighted five key themes to me:
1) Born-digital workflows are, generally, specific
We’re all coping with the general challenges of born-digital archiving, but we’re encountering individual collections which need to be addressed with local solutions and resources. BDAXers generously shared examples of use cases and successful workflows, and, although these guidelines couldn’t always translate across diverse institutions (big/small, private/public, IT help/no IT help), they’re a foundation for building best practices which can be adapted to specific needs.
A lot of people are doing a lot of work to guide and document efforts in born-digital archiving. We need to share these efforts widely, find common points of application, and build momentum – especially for proposed guidelines, best guesses, and continually changing procedures. (We’re laying this train track as we go, but everybody can get on board!) A brilliant resource from BDAX is a “Topical Brain Dump” Google doc where everyone can share tips related to what we each know about born-digital archives (hat-tip to Kari Smith for creating the doc, and to all BDAXers for their contributions).
4) Talking to each other helps!
Chatting with BDAX colleagues over coffee or lunch provided space to compare notes, seek advice, make connections, and find reassurance that we’re not alone in this difficult endeavor. Published literature is continually emerging on born-digital archiving topics (for example, born-digital description), but if we’re not quite ready to commit our own practices to paper magnetic storage media, then informal conversations allow us to share ideas and experiences.
5) Born-digital archiving needs YOU
BDAX attendees brainstormed a wide range of topics for discussion, illustrating that born-digital archiving collides with traditional processes at all stages of stewardship, from appraisal to access. All of these functions need to be re-examined and potentially re-imagined. It’s a big job (*understatement*) but brings with it the opportunity to gather perspective and expertise from individuals across different roles. We need to make sure everyone is invited to this party.
How to Get Involved
So, what’s next? The BDAX organizers and attendees recognize that there are many, many more colleagues out there who need to be included in these conversations. Continuing efforts are coalescing around processing levels and metrics for born-digital collections; accurately measuring and recording extent statements for digital content; and managing security and storage needs for unprocessed digital accessions. Please, join in!
To subscribe to the BDAX email listserv, please email Michael Olson (mgolson[at]stanford[dot]edu), or, to join the new BDAX Slack channel, email Shira Peltzman (speltzman[at]library[dot]ucla[dot]edu).
Kate Tasker works with born-digital collections and information management systems at The Bancroft Library, University of California, Berkeley. She has an MLIS from San Jose State University and is a member of the Academy of Certified Archivists. Kate attended Capture Lab in 2015 and is currently designing workflows to provide access to born-digital collections.
The University of Illinois at Urbana-Champaign’s (Illinois) library-based Research Data Service (RDS) will be launching an institutional data repository, the Illinois Data Bank (IDB), in May 2016. The IDB will provide University of Illinois researchers with a repository for research data that will facilitate data sharing and ensure reliable stewardship of published data. The IDB is a web application that transfers deposited datasets into Medusa, the University Library’s digital preservation service for the long-term retention and accessibility of its digital collections. Content is ingested into Medusa via the IDB’s unmediated self-deposit process.
As we conceived of and developed our dataset curation workflow for digital datasets ingested in the IDB, we turned to archivists in the University Archives to gain an understanding of their approach to processing digital materials. [Note: I am not specifying whether data deposited in the IDB is “born digital” or “digitized” because, from an implementation perspective, both types of material can be deposited via the self-deposit system in the IDB. We are not currently offering research data digitization services in the RDS.] There were a few reasons for consulting with the archivists: 1) Archivists have deep, real-world curation expertise and we anticipate that many of the challenges we face with data will have solutions whose foundations were developed by archivists and 2) If, through discussing processes, we found areas where the RDS and Archives have converging preservation or curation needs, we could communicate these to the Preservation Services Unit, who develops and manages Medusa, and 3) I’m an archivist by training and I jump on any opportunity to talk with archivists about archives!
Even though the RDS and the University Archives share a central goal–to preserve and make accessible the digital objects that we steward–we learned that there are some operational and policy differences between our approaches to digital stewardship that necessitate points of variance in our processing/curation workflow:
Appraisal and Selection
In my view, appraisal and selection are fundamental to the archives practice. The archives field has developed a rich theoretical foundation when it comes to appraisal and selection, and without these functions the archives endeavor would be wholly unsustainable. Appraisal and selection ideally tend to occur in the very early stages of the archival processing workflow. The IDB curation workflow will differ significantly–by and large, appraisal and selection procedures will not take place until at least five years after a dataset is published in the IDB–making our appraisal process more akin to that of an archives that chooses to appraise records after accessioning or even during the processing of materials for long-term storage. Our different approaches to appraisal and selection speak to the different functions the RDS and the University Archives fulfill within the Library and the University.
The University Archives is mandated to preserve University records in perpetuity by the General Rules of the University, the Illinois State Records Act. The RDS’s initiating goal, in contrast, is to provide a mechanism for Illinois researchers to be compliant with funder and/or journal requirements to make results of research publicly available. Here, there is no mandate for the IDB to accept solely what data is deemed to have “enduring value” and, in fact, the research data curation field is so new that we do not yet have a community-endorsed sense of what “enduring value” means for research data. Standards regarding the enduring value of research data may evolve over the long-term in response to discipline-specific circumstances.
To support researchers’ needs and/or desires to share their data in a simple and straightforward way, the IDB ingest process is largely unmediated. Depositing privileges are open to all campus affiliates who have the appropriate University log-in credentials (e.g., faculty, graduate students, and staff), and deposited files are ingested into Medusa immediately upon deposit. RDS curators will do a cursory check of deposits, as doing so remains scalable (see workflow chart below), and the IDB reserves the right to suppress access to deposits for a “compelling reason” (e.g., failure to meet criteria for depositing as outlined in the IDB Accession Policy, violations of publisher policy, etc.). Aside from cases that we assume will be rare, the files as deposited into the IDB, unappraised, are the files that are preserved and made accessible in the IDB.
A striking policy difference between the RDS and the University Archives is that the RDS makes a commitment to preserving and facilitating access to datasets for a minimum of five years after the date of publication in the Illinois Data Bank.
The University Archives, of course, makes a long-term commitment to preserving and making accessible records of the University. I have to say, when I learned that the five-year minimum commitment was the plan for the IDB, I was shocked and a bit dismayed! But after reflecting on the fact that files deposited in the IDB undergo no formal appraisal process at ingest, the concept began to feel more comfortable and reasonable. At a time when terabytes of data are created, oftentimes for single projects, and budgets are a universal concern, there are logistical storage issues to contend with. Now, I fully believe that for us to ensure that we are able to 1) meet current, short-term data sharing needs on our campus and 2) fulfill our commitment to stewarding research data in an effective and scalable manner over time, we have to make a circumspect minimum commitment and establish policies and procedures that enable us to assess the long-term viability of a dataset deposited into the IDB after five years.
The RDS has collaborated with archives and preservation experts at Illinois and, basing our work in archival appraisal theory, have developed guidelines and processes for reviewing published datasets after their five-year commitment ends to determine whether to retain, deaccession, or dedicate more stewardship resources to datasets. Enacting a systematic approach to appraising the long-term value of research data will enable us to allot resources to datasets in a way that is proportional to the datasets’ value to research communities and its preservation viability.
To show that we’re not all that different after all, I’ll briefly mention a few areas where the University Archives and the RDS are taking similar approaches or facing similar challenges:
We are both taking an MPLP-style approach to file conversion. In order to get preservation control of digital content, at minimum, checksums are established for all accessioned files. As a general rule, if the file can be opened using modern technology, file conversion will not be pursued as an immediate preservation action. Establishing strategies and policies for managing a variety of file formats at scale is an area that will be evolving at Illinois through collaboration of the University Archives, the RDS, and the Preservation Services Unit.
Accruals present metadata challenges. How do we establish clear accrual relationships in our metadata when a dataset or a records series is updated annually? Are there ways to automate processes to support management of accruals?
Both units do as much as they can to get contextual information about the material being accessioned from the creator, and metadata is enhanced as possible throughout curation/processing.
The University Archives and the RDS control materials in aggregation, with the University Archives managing at the archival collection level and the RDS managing digital objects at the dataset level.
More? Certainly! For both the research data curation community and the archives community, continually adopting pragmatic strategies to manage the information created by humans (and machines!) is paramount, and we will continue to learn from one another.
The following represents our planned functional workflow for handling dataset deposits in the Illinois Data Bank:
To learn more about the IDB policies and procedures discussed in this post, keep an eye on the Illinois Data Bank website after it launches next month. Of particular interest on the Policies page will be the Accession Policy and the Preservation Review, Retention, Deaccession, Revision, and Withdrawal Procedure document.
Bethany Anderson and Chris Prom of the University of Illinois Archives
The rest of the Research Data Preservation Review Policy/Procedures team: Bethany Anderson, Susan Braxton, Heidi Imker, and Kyle Rimkus
The rest of the RDS team: Qian Zhang, Elizabeth Wickes, Colleen Fallaw, and Heidi Imker
Elise Dunham is a Data Curation Specialist for the Research Data Service at the University of Illinois at Urbana-Champaign. She holds an MLS from the Simmons College Graduate School of Library and Information Science where she specialized in archives and metadata. She contributes to the development of the Illinois Data Bank in areas of metadata management, repository policy, and workflow development. Currently she co-chairs the Research Data Alliance Archives and Records Professionals for Research Data Interest Group and is leading the DACS workshop revision working group of the Society of American Archivists Technical Subcommittee for Describing Archives: A Content Standard.
Why do we process archival materials? Do our processing goals differ based on whether the materials are paper or digital? Processing objectives may depend in part upon institutional priorities, policies, and donor agreements, or collection-specific issues. Yet, irrespective of the format of the materials, we recognize two primary goals to arranging and describing materials: screening for confidential, restricted, or legally-protected information that would impede repositories from providing ready access to the materials; and preparing the files for use by researchers, including by efficiently optimizing discovery and access to the material’s intellectual content.
More and more of the work required to achieve these two goals for electronic records can be performed with the aid of computer assisted technology, automating many archival processes. To help screen for confidential information, for instance, several software platforms utilize regular expression search (BitCurator, AccessData Forensic ToolKit, ePADD). Lexicon search (ePADD) can also help identify confidential information by checking a collection against a categorized list of user-supplied keywords. Additional technologies that may harness machine learning and natural language processing (NLP), and that are being adopted by the profession to assist with arrangement and description, include: topic modeling (ArchExtract); latent semantic analysis (GAMECIP); predictive coding (University of Illinois); and named entity recognition (Linked Jazz, ArchExtract, ePADD). For media, automated transcription and timecoding services (Pop Up Archive) already offer richer access. Likewise, computer vision, including pattern recognition and face recognition, has the potential to help automate image and video description (Stanford Vision Lab, IBM Watson Visual Recognition). Other projects (Overview) outside of the archival community are also exploring similar technologies to make sense of large corpuses of text.
From an archivist’s perspective, one of the most game-changing technologies to support automated processing may be named entity recognition (NER). NER works by identifying and extracting named entities across a corpus, and is in widespread commercial use, especially in the fields of search, advertising, marketing, and litigation discovery. A range of proprietary tools, such as Open Calais, Semantria, and AlchemyAPI, offer entity extraction as a commercial service, especially geared toward facilitating access to breaking news across these industries. ePADD, an open source tool being developed to promote the appraisal, processing, discovery, and delivery of email archives, relies upon a custom NER to reveal the intellectual content of historical email archives.
Currently, however, there are no open source NER tools broadly tuned towards the diverse variety of other textual content collected and shared by cultural heritage institutions. Most open source NER tools, such as StanfordNER and Apache OpenNLP, focus on extracting named persons, organizations, and locations. While ePADD also initially focused on just these three categories, an upcoming release will improve browsing accuracy by including more fine-grained categories of organization and location entities bootstrapped from Wikipedia, such as libraries, museums, and universities. This enhanced NER, trained to also identify probable matches, also recognizes other entity types such as diseases, which can assist with screening for protected health information, and events.
What if an open source NER like that in development for ePADD for historical email could be refined to support processing of an even broader set of archival substrates? Expanding the study and use of NLP in this fashion stands to benefit the public and an ever-growing body of researchers, including those in digital humanities, seeking to work with the illuminative and historically significant content collected by cultural heritage organizations.
Of course, entity extraction algorithms are not perfect, and questions remain for archivists regarding how best to disambiguate entities extracted from a corpus, and link disambiguated entities to authority headings. Some of these issues reflect technical hurdles, and others underscore the need for robust institutional policies around what constitutes “good enough” digital processing. Yet, the benefits of NER, especially when considered in the context of large text corpora, are staggering. Facilitating browsing and visualization of a corpus by entity type provides new ways for researchers to access content. Publishing extracted entities as linked open data can enable new content discovery pathways and uncover trends across institutional holdings, while also helping balance outstanding privacy and copyright concerns that may otherwise limit online material sharing.
It is likely that “good enough” processing will remain a moving target as researcher practices and expectations continue to evolve with emerging technologies. But we believe entity extraction fulfills an ongoing need to enable researchers to gain quick access to archival collections’ intellectual content, and that its broader application would greatly benefit both repositories and researchers.
Stanford University Libraries is in the process of changing how it documents its digital processing activities and records lab statistics. This is our third iteration of how we track our born-digital work in six years and is a collaborative effort between Digital Library Systems and Services, our Digital Archivist Peter Chan, and Glynn Edwards, who manages our Born-Digital Program and is the Director of the ePADD project.
Initially we documented our statistics using a library-hosted FileMaker Pro database. In this initial iteration we were focused on tracking media counts and media failure rates. After a single year of using the database we decided that we needed to modify the data structure and the data entry templates significantly. Our staff found the database too time consuming and cumbersome to modify.
We decided to simplify and replaced the database with a spreadsheet stored with our collection data. Our digital archivist and hourly lab employees were responsible for updating this spreadsheet when they had finished working with a collection. This was a simple solution that was easy to edit and update, and it worked well for four years until we realized we needed more data for our fiscal year-end reports. As our born-digital program has grown and matured, we discovered we were missing key data points that documented important processing decisions in our workflows. It was time to again improve how we documented our work.
Stanford Statistics Spreadsheet version 2
For our brand new version of work tracking we have decided to continue to use a spreadsheet but have migrated our data to Google Drive to better facilitate updates and versioning of our documentation. New data points have been included to better track specific types of born-digital content like email. This new version also allows us to better document the processing lifecycle of our born-digital collections. In order to better do this we have created the following additional data points:
Number of email messages
Email in ePADD.stanford.edu
File count in media cart
File size on media cart (GB)
SearchWorks (materials discoverable / available in library catalog)
SpotLight Exhibit (a virtual exhibit)
Stanford Statistics Spreadsheet version 3
We anticipate that evolving library administrative needs, the continually changing nature of born-digital data, and new methodologies for processing these materials will make it necessary to again change how we document our work. Our solution is not perfect but is flexible enough to allow us to reimagine our documentation strategy in a few short years. If anyone is interested in learning more about what we are documenting and why, please do let us know, as we would be happy to provide further information and may learn something from our colleagues in the process.
Michael G. Olson is the Service Manager for the Born-Digital / Forensics Labs at Stanford University Libraries. In this capacity he is responsible for working with library stakeholders to develop services for acquiring, preserving and accessing born-digital library materials. Michael holds a Masters in Philosophy in History and Computing from the University of Glasgow. He can be reached at mgolson [at] Stanford [dot] edu.
At the Rockefeller Archive Center, we’re working to get “digital processing” out of the hands of “digital” archivists and into the realm of “regular” archivists. We are using “digital processing” to mean description, arrangement, and initial preservation of born digital archival content stored on removable storage media. Our definition will likely expand over time, as we start to receive more born digital materials via network transfer and fewer acquisitions of floppy disks and CDs.
The vast majority of our born digital materials are on removable storage media and currently inaccessible to our researchers, donors, and staff. We have content on over 3,000 digital storage media items, which are rapidly deteriorating. Our backlog of digital media items includes over 2,500 optical disks, almost 200 3.5″ floppy disks, and almost 100 5.25″ floppy disks. There are also a handful of USB flash drives, hard drives, and older and unusual media (Bernoulli disks, Sy-Quest cartridges, 8″ floppy disks). This is a lot of work for one digital archivist! Having multiple “regular” archivists process these materials distributes the work, which means we can get through the backlog much more quickly. Additionally, integrating digital processing into regular processing work will prevent a future backlog from being created.
In order to help our processing archivists establish and enhance intellectual control of our born digital holdings, I’m working to provide them with the tools, workflows, and competencies needed to process digital materials. Over the next several months, a core group of processing archivists will be trained and provided with documentation on digital media inventorying, digital forensics, and other born digital workflows. After training, archivists will be able to use the skills they gained in their “normal” processing projects. The core group of archivists trained on dealing with born digital materials will then be able to train other archivists. This will help digital processing be perceived as just another aspect of “regular” processing. Additionally, providing good workflow documentation gives our processing archivists the tools and competencies to do their jobs.
Streamlining our digital processing workflows is also a really important part of this. One step in this direction is to create a digital media inventory and disk imaging log that will be able to “talk” to our collections management system (ArchivesSpace). We currently have an inventory and imaging log, but they’re in a Microsoft Access database, which has a number of limitations, one of the primary ones being that it can’t integrate with our other systems. Integrating with ArchivesSpace reduces duplicate data entry, inconsistent data, and further integrates digital processing into our “regular” processing work.
The RAC’s processing archivists establish and enhance intellectual and physical control of our archival holdings, regardless of format, in order to facilitate user access. By fully integrating digital processing into “normal” processing activities, we will be able to preserve and provide access to unique born digital content stored on obsolete and decaying media.
Bonnie Gordon is an Assistant Digital Archivist at the Rockefeller Archive Center, where she works primarily with born digital materials and digital preservation workflows. She received her M.A. in Archives and Public History, with a concentration in Archives, from New York University.