Researcher Interactions with Born-Digital: Out of the Frying Pan and into the Reading Room

By Julia Kim

This post is the ninth in a bloggERS series about access to born-digital materials.

At SAA 2015 this past August, I co-presented in the panel “Out of the Frying Pan and into the Reading Room: Approaches to Serving Electronic Records.” My talk focused on a study of researcher interactions with born-digital collections at NYU’s Fales Library and Special Collections, including unprocessed directories of files, emulated works, and migrated software. Given the enormous resources required for born-digital access, our preliminary study sought to understand how archival researchers received this work, a question that remains relatively unexplored in our field.

NYU began processing the Jeremy Blake Papers and the Exit Art collection in fall 2014 as a way to test and model “access-driven” born-digital workflows. Throughout the year, we focused on ensuring researchers would be able to access collections before the year’s end.

Exit Art
The archives of Exit Art, a non-profit cultural center in Manhattan, were donated to NYU’s Fales Library and Special Collections in 2012 and included analog content, time-based materials, and the organization’s 2TB server. Within that server, we narrowed the focus of our study to a directory called “Alternative Histories,” the name of a major 2010 survey exhibition organized by Exit Art. This small sliver of born-digital files measured 47.6 GB: 18,642 files in 1,132 folders. Contents included administrative files and correspondence, photographs, promotional materials, and published works. Due to time constraints, the Alternative Histories files were not arranged.

Exit Art finding aid

Jeremy Blake
Jeremy Aaron Blake (1971-2007) was an American digital artist known for his digital C-prints and for the looped animated sequences he termed “time-based paintings.” The Jeremy Blake Papers included approximately 400 pieces of media, spanning optical media, digital linear tape, and Jaz drives, as well as external hard drives and copied files. The majority of his working files are Adobe Photoshop files created from the late 1990s to the mid-2000s, so the archive affords a detailed analysis of his working process (more background information here).

Researcher Interviews
Five seasoned Fales Library and Special Collections archival researchers participated in evaluating new forms of digital access through an on-site visit to New York University’s Digital Forensics Lab for one to three hours. They were invited to participate due to their extensive familiarity with archival research.

While a few of them had some familiarity with the artistic content of the collections, their disciplinary focuses included contemporary art history, 18th-century literature, digital humanities, platform studies, and computer science.

Each researcher agreed to be audio- and video-recorded and to the use of Think Out Loud Protocol (TOP), in which they verbalized anything they saw, did, or noticed. Afterward, the interviews were transcribed and the recordings were deleted to ensure anonymity.

While we allotted an hour, most researchers stayed longer to explore and discuss. After an initial overview, they were shown the existing Alternative Histories portion of the Exit Art finding aid. They were then introduced to the QuickView Plus software and the Alternative Histories files on our designated locked-down researcher laptop. After 30 minutes, the researchers switched to another laptop running an emulated environment for access to the Blake collection.

Jeremy Blake emulation

Researchers were encouraged to explore the files using the contemporary Windows PC with a contemporary version of Photoshop, as well as the software program Forensic Toolkit (FTK), which included a preliminary “bookmarked” arrangement of the imaged Blake files. For comparison, researchers were also able to view the same files on older computers.

Jeremy Blake FTK bookmarks

Below are a few interesting researcher responses from the interviews:

“This seems to principally be a record of how an institution planned events.”

In exploring “Alternative Histories,” some noted that future interest in the records could be in the realm of organizational management and administrative studies, rather than in the artistic content of the exhibition files. An arrangement, then, might lose this record of how a small arts organization planned and executed a major exhibition. The directory organization, the many drafts of letters, the records of the organization’s own archival research, for example, might be obscured by processing.

The researcher quoted above deeply appreciated the possible value of the “original order” of the unprocessed administrative files found in Exit Art, in spite of some confusion about how to navigate the unprocessed organizational directories and files. Our evaluation of this researcher’s experience lends credence to the idea of multiple, interactive, and flexible arrangements to support multiple understandings of the collection material. These options are not only increasingly possible, but expeditious and in keeping with patron expectations of random access and keyword-driven searches.

“Given the choice between emulated access and contemporary access, I would use contemporary unless I was doing a book-length study.”

Most of the researchers emphasized that, given the choice between immediate access to partially processed files and a significant delay while materials were fully processed or emulated, they would take the unprocessed and relatively inauthentic files. The majority stressed access by any means, and ease of access. All found the emulation’s authentically slow processing speed and instability impediment enough to prefer access on a contemporary computing system. With the exception of one researcher with prior archival processing experience, authenticity was not a concern. While all researchers appreciated the emulation, hearing that they ultimately did not prefer using it was a surprise to me.

“This type of art, born-digital, is not taught. If it had been available, I might have changed my course of study.”

This researcher was thrilled at the opportunity emulation represented for access to born-digital, technology-dependent artwork. She reflected that, had she had the opportunity, she might have studied more art “after the 1990s.” Her enthusiasm was both a validation and a call for further work with researchers to study these newly available collections. Because we publicized this user study online in blogs, more researchers have since come in to work with these collections, despite their “imperfect” and “unfinished” status.

While not conclusive or generalizable by any means, these interviews were necessary first steps to begin understanding how these new forms of access were interpreted and received. More information on this research is coming soon!

Many thanks to Lisa Darms and Donald Mennerich. Their support and encouragement were essential to this project.

Julia Kim is the Digital Assets Manager with the American Folklife Center at the Library of Congress. Previously, she was a National Digital Stewardship Resident at New York University, where she worked on the project detailed in this post. She received her B.A. from Columbia University and her M.A. in Moving Image Archiving and Preservation from New York University. She can be contacted at juliakim [at] loc [dot] gov, and her Twitter handle is @jy_kim29.

When It Comes to Born-Digital, How Well Do We Know Our Users?

By Wendy Hagenmaier, with insights and inspirations from the Understanding Users for Access Hackfest Team

This post is the eighth in a bloggERS series about access to born-digital materials.

What do we know about the needs, motivations, and experiences of users of born-digital archival materials?

"Computing laboratories," courtesy of Georgia Tech Archives via DPLA
“Computing laboratories,” courtesy of Georgia Tech Archives via DPLA

The archival profession has been processing and preserving born-digital collections for years, but as we begin to engage in more conversations about designing solutions for providing access to those materials, it seems we need to ask ourselves how much we actually know about what our users want. Or perhaps, how much we don’t know. In an era of user-centered design, what do archivists and their IT allies still need to learn about users of born-digital materials in order to engineer intuitive mechanisms for providing access to those collections? And how can we go about learning those unknowns? Can we gather data that might help anticipate future access needs or encourage more people to discover and reuse our born-digital materials?

These are the complex and captivating questions the members of the Understanding Users for Access Hackfest Team have been tackling since the 2015 SAA Annual Meeting. The Team’s mission was to develop a proposal for a long-term collaborative project that would empower archivists to better understand users of born-digital materials, and would thereby help to address current obstacles to born-digital access. Archivists from around the country self-selected to form the Team, and Elizabeth Keathley, owner of Atlanta Metadata Authority, volunteered to serve as our Leader. As a member of the research group behind the Born-Digital Access in Archival Repositories study, I served as the Team’s Researcher. We started by exploring data and themes related to understanding users, gathered through the Born-Digital Access in Archival Repositories study. For example, here are two anonymized quotes we examined from the study:

“We’ve done a number of usability studies with our finding aids. But I don’t […] know that anybody’s done this yet with born-digital. And part of the barrier might be that not a lot of places are making it available. Or we’ve just not seen the demand for it. I know of other institutions where they already have reading room access to born-digital materials, but nobody’s asking for it, you know. [Our software developer was] very reluctant to do any sort of real development without knowing what it was that users wanted. So […] we definitely see that it’s a need.”

“I would like to know what people are actually using, or interested in using, how […] they know our material exists, and what’s driving them to make their requests in the first place? […] If they are working on an annotated copy of a literary work, knowing that we actually do maybe need to provide them access to as close to original Word documents as possible for literary manuscripts, versus whether they just need access to a fixed form of the document that we could just port to PDF […]. So knowing both their research question and also the larger project they are working on would certainly be helpful.”

Rachael Dreyer, Head of Research Services for Special Collections at the Pennsylvania State University, and Katie Pierce Meyer, Humanities Librarian for Architecture & Planning at the University of Texas at Austin, documented the Team’s discussion as we identified our research questions and debated various strategies for our project proposal.

In the months following SAA, the Team worked together to articulate and iterate over a proposal for a “Community-Wide Mixed-Methods Needs Assessment of Users of Born-Digital Archives.” We invite you to read and comment on our proposal, and to join us in implementing the ideas it outlines.

As described in the proposal, the roughly three-year-long Needs Assessment would include:

  • the development of mixed-methods research instruments to explore the needs, backgrounds, experiences, and skill sets of users of born-digital archives;
  • a community-wide data gathering effort;
  • a rigorous analysis of the data;
  • and the publication of findings and next steps for improving born-digital access.

The insights gained through the Needs Assessment would empower practitioners to design interfaces for access that are tailored to user needs, to communicate user needs more effectively to software developers, and to provide improved access services to users. The study would also provide opportunities for busy practitioners to participate in important research in a manageable but meaningful way, and for the community to work together to ensure that archives remain agile, relevant, and ready to meet the needs of 21st-century users.

Products of the Needs Assessment might include, but not be limited to, the following:

  • A website that would act as a hub for information about the project as well as results gathered from it. The website would provide access to project documentation and links to the research instruments, scrubbed data set, and reports (hosted in an open access repository).
  • A web form to serve as the interview script and notes document (hosted by an IRB-certified service, such as a university’s instance of Qualtrics).
  • An online survey (also hosted by an IRB-certified service).
    • Interview and survey questions would be broad enough to apply to a wide variety of institutional contexts yet specific enough to gather useful data to describe user needs and experiences. They could be used before, during, or after accessing born-digital materials–this would be determined by the research team during the first six months of the project.
  • Instructions and training materials for using the interview and survey instruments.
  • Open access, (possibly peer-reviewed) published analysis of the results of the study, along with an anonymized version of the data set. The analysis could also include user personas and user stories created from the generalized data from the study.

The project team could include:

  • Project Leads: An IRB-certified research team, composed of a diverse group of approximately 10 individuals with varying expertise and from several different institutions (archivists, software developers, user experience designers, PhD students, etc.).
  • Project Participants: Practitioners (ideally, at least 25) who opt-in to use the research instruments in their local institutions and contribute data to the project.

Additional details about proposed timeline, budget, dissemination and preservation plans, and sustainability strategy are available (and ready for comment!) in the proposal.

The Team hopes that results from the Needs Assessment would suggest next steps for improving born-digital access and could be cited in the future when archivists apply for further grant funding to expand access to born-digital materials. The research instruments could be revised for a second round of the study, which could be completed in approximately ten years, when many more archives will be providing access to born-digital materials. Results from the first and second rounds could be compared to track born-digital access over time.

Want to help? Let’s do this. We are seeking volunteers for a team that can make this Needs Assessment a reality. Please feel welcome to comment on the proposal and contact me if you’re interested in getting involved (wendy [dot] hagenmaier [at] library [dot] gatech [dot] edu).

Wendy Hagenmaier is the Digital Collections Archivist at the Georgia Tech Archives, where she develops policies and workflows for digital processing, preservation, and access. She received her M.S.I.S. with a focus on digital archives from the University of Texas at Austin. She is Vice President of the Society of Georgia Archivists, Chair of the SAA Issues and Advocacy Roundtable, and steering committee member for the SAA Electronic Records Section and Architectural Records Roundtable.

Processing in ~~Tears~~ Tiers: Applying a Flexible Approach to Born-Digital Materials

By Dorothy Waugh

This post is the seventh in a bloggERS series about access to born-digital materials.

As a digital archivist, I am always looking for ways to streamline my processing workflows. Because, when faced with a multi-gigabyte hard drive or pile of aging 3.5” floppy disks, time is rarely on my side. The diverse nature of our collections, however, can be a hindrance when trying to build workflows based on efficiency—in spite of our best efforts, there are frequently cases where a collection’s particular characteristics demand greater levels of resources and time. During the past year, archivists at the Stuart A. Rose Manuscript, Archives, and Rare Book Library have developed a tiered approach to processing and access designed to flex in response to a collection’s limitations without obscuring its salient characteristics.

First, unprocessed collections are evaluated based on four criteria:

  1. Quality of data, defined by us as the totality, scope, and viability of the acquired material. Our focus here is on the completeness of the acquisition (an entire hard drive, for example, provides more content and context than a directory of files), the number of years spanned, and the extent to which data can be rendered using modern software.
  2. Authenticity of data, which refers to our ability to establish that it was indeed the donor who created, used, or managed the digital content.
  3. The number of donor restrictions and any concern regarding intellectual property rights.
  4. The extent to which we anticipate particularly high levels of use. Crucially, we want to avoid postponing questions about access until processing is complete, instead evaluating expected use up front so that access will inform processing from the very start.

Based on this evaluation, we assign collections to one of three processing tiers:

Tier One: Low complexity, tool-driven
Collections assigned to this tier will primarily include familiar, homogenous file formats and will have few donor restrictions or intellectual property rights concerns. The limited complexity of the data will allow for largely tool-driven processing.

Tier Two: Average complexity, combination of tool-driven and manual approaches
Collections assigned to this tier will primarily include familiar file formats, although there may be instances of more challenging file formats. Some donor restrictions may apply.

Tier Three: High complexity, high manual effort
Collections assigned to this tier will include a large number of heterogeneous and challenging file formats. There may be a high level of donor restrictions. The scope of the collection and anticipated high levels of use may demand a more involved approach to arrangement, description, and access.
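
As a deliberately toy illustration of how an assessment might map onto a tier, consider the sketch below. The numeric scores, names, and thresholds are hypothetical; in practice, tier assignment is an archivist’s judgment call, not a script.

```python
# Illustrative only: hypothetical scores and thresholds for discussing
# tier assignment. Real assignments are made by archivists, not code.
def assign_tier(format_complexity: int, restrictions: int, expected_use: int) -> int:
    """Map rough 1-5 scores for the assessment criteria to a processing tier."""
    score = max(format_complexity, restrictions, expected_use)
    if score <= 2:
        return 1  # low complexity, tool-driven
    if score == 3:
        return 2  # average complexity, mixed tool-driven and manual work
    return 3      # high complexity, high manual effort

# e.g., heterogeneous formats (5), some restrictions (3), high expected use (4)
print(assign_tier(format_complexity=5, restrictions=3, expected_use=4))  # 3
```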

This diagram illustrates the application of this approach to born-digital materials acquired as part of Lucille Clifton’s papers:

Information in the three left-hand columns provides a broad assessment of the collection’s born-digital material based on our four criteria. This assessment is then used to determine that the material should be processed as a tier three collection, meaning that it is a complex collection and will likely require high manual effort during processing.

At the same time, our assessment guides decision-making about access. We’ve very loosely labelled the three levels at which we make collections available to researchers as standard, emulation only, and optimal—although it is important to note that we define these levels based not only on what we are currently able to do but also on what we hope to be able to implement in the future. As a result, consideration of these levels is as much about processing in a way that does not preclude improved future models of access as it is about providing access right now. So, while “optimal” is for now more aspiration than reality, these are collections that we have deemed good candidates for a more advanced access point once we have the necessary resources in place. We determined, based on our assessment, that the Lucille Clifton collection warranted an “optimal” approach to access.

Rather than automating our decision making, the initial assessment and subsequent assignment of a processing tier and access level provide a structured vocabulary by which recurring project considerations can be discussed, and a comprehensive rubric by which new projects can be prioritized and planned. Once assigned, these tiers influence decision making at almost every stage of our born-digital workflows, including how processed collections are made available to our researchers. As we continue to apply this approach, we also hope to better track our work in order to more accurately allocate the time and resources required to process a collection at each tier.

Note: This blog post borrows in part from a forthcoming article, “Flexible Processing and Diverse Collections: A Tiered Approach to Delivering Born Digital Archives,” written in collaboration by Dorothy Waugh, Elizabeth Roke, and Erika Farr, to be published in the journal Archives and Records. The article will offer additional information on how the tiered approach described in this post has been applied in practice at Emory.

Dorothy Waugh is Digital Archivist at the Stuart A. Rose Manuscript, Archives, and Rare Book Library at Emory University, where she is responsible for the management of born-digital manuscript and archival material.

Access to Born-Digital Archives at the Russell Library

By Adriane Hanson

This post is the sixth in a bloggERS series about access to born-digital materials.

Institutional Context
The Richard B. Russell Library for Political Research and Studies at the University of Georgia has been providing access to digital archives for about a year and a half. We needed something that was free, web-based (for broader access), and integrated with our existing workflows for paper (to keep it simpler). We ended up using our finding aids for description, our existing circulation system (Aeon) to track requests, and Google Drive to provide access to files.

Finding Aids
Researchers learn about digital files through our finding aids. If there is a series with related papers, we list the digital files at the end of that series. We are trying to balance the need to keep things together intellectually (rather than creating a separate “electronic records” series) against the labor-intensive work of fully integrating the folder lists for papers and for digital files.

Container list from the Davison Papers, showing description of digital files.

Digital files are described in the aggregate, not at the item level. For instance, when describing files from a server, only the first one or two levels of folders are included in the finding aid. In addition to being a time-saving measure, this makes the finding aid more usable: researchers get an overview of what the collection contains rather than an overwhelmingly long list of filenames. When folder titles are insufficient for description, we add a scope/content note for the folder and/or link to a directory print of the folder’s contents. For examples, see the Eric Johnson Papers or the Eleanor Smith Papers.

The Workflow
The process for providing access to digital files is summarized below and described in more detail in our access policy. This policy will soon be updated to reflect changes in how we use Google Drive.

  1. The researcher requests digital files from the finding aid, just like they do for paper.
  2. The request is routed to a queue in Aeon that I monitor daily.
  3. After some communication with the researcher, I upload a copy of the files to a Google Drive account and share them with the researcher.
  4. The request is changed to “checked out” in Aeon.
  5. The researcher has two weeks to view the files. After that, I delete the files from Drive and mark the request finished in Aeon.

Google Drive as a Virtual Reading Room
In the first iteration of this process, we used Google Drive like a virtual reading room. Permissions were set to view-only, and files could not be downloaded or printed. We started with this strategy to address concerns at our library about properly protecting copyright. It worked well for researchers who needed basic access to files, but it limited the functionality of some file formats (e.g., spreadsheets were frozen as static tables) and did not allow researchers to save search results or integrate what they were finding with copies obtained from other institutions.
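
For those curious how such a share could be scripted, here is a rough sketch using the Google Drive v3 API via the google-api-python-client library. This is an illustration, not our actual procedure (the same settings can be applied by hand in the Drive web interface), and the file name, credentials, and patron address are placeholders.

```python
# Sketch only: sharing an access copy as view-only with the Google Drive
# v3 API. Credentials, names, and addresses below are placeholders.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def share_view_only(creds, local_path, patron_email):
    drive = build("drive", "v3", credentials=creds)
    # Upload the access copy to the staff Drive account.
    uploaded = drive.files().create(
        body={"name": "davison_photos.zip"},      # hypothetical file name
        media_body=MediaFileUpload(local_path),
        fields="id",
    ).execute()
    file_id = uploaded["id"]
    # Block viewers from downloading, printing, or copying the file.
    drive.files().update(
        fileId=file_id, body={"copyRequiresWriterPermission": True}
    ).execute()
    # Grant the researcher read-only access for the two-week loan.
    drive.permissions().create(
        fileId=file_id,
        body={"type": "user", "role": "reader", "emailAddress": patron_email},
    ).execute()
    return file_id  # later: drive.files().delete(fileId=file_id).execute()
```

Moving to the delivery-system model described below would simply mean leaving the copy-restriction flag unset so that downloads are allowed.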

Photographs from Davison Papers shared with a patron.

Google Drive as a Delivery System
This year, we developed a policy to allow digital cameras in our reading room. During those conversations, we decided that providing copies of born-digital archival materials for personal research use would be permissible under the same fair-use provision of copyright law that allows the cameras. So now, the researcher signs a form agreeing to abide by copyright law and our policies, and then I provide full access to the files via Google Drive, including allowing downloads.

Conclusion
We are happy with this process, at least for now. Ultimately, I would like a system that can pull automatically from our access-copy storage and offer researchers tools for viewing and analysis. But while we work on that, this workflow lets us provide reasonable access to everything in our holdings.

Adriane Hanson is Digital Curation and Processing Archivist at the Richard B. Russell Library for Political Research and Studies at the University of Georgia, a position she has held for 3 years. She can be reached at ahanson [at] uga [dot] edu.

Agile for Access: Iterative Approaches to Solving Born-Digital Access

By Jessica Meyerson

This post is the fifth in a bloggERS series about access to born-digital materials.

At the 2015 SAA conference in Cleveland, the Agile for Access Hackfest Team focused on creating a collaborative project that introduces agile development principles as a strategy for overcoming obstacles to born-digital access.

To start the discussion, the Born-Digital Access Research Team provided a baseline understanding of agile and its growth in popularity throughout the 1990s. The Manifesto for Agile Software Development, written and published in 2001, emphasizes “individuals and interactions,” “working software” (working solutions), “customer collaboration,” and “responding to change.” In its most abstract and broadly applicable form, agile shares many of the basic tenets of design thinking or design research: empowerment, collaboration, rapid/frequent iterations, and continual planning, in place of one monolithic plan executed from start to finish.

Our Hackfest Team consisted of a mix of archivists with different levels of experience and exposure to agile development principles, so one focus of our discussion was how to communicate what agile is and how to apply it in an archival setting. Erin Faulder (Archivist for Digital Collections at Tufts University) volunteered to serve as our fearless Hackfest Team Leader, a role responsible for leading the discussion during the in-session activity and working with research team members to complete the proposal during Phase II. Sarah Bost (Student Success Archivist at the University of Arkansas) and Amy Wickner (Digital Projects Graduate Assistant at the University of Maryland) graciously volunteered to take notes and record observations, which were later compiled into the first draft of the proposal.

The Agile for Access Hackfest Team collaboration resulted in a project proposal entitled “Why Agile Works in Archives.” The purpose of this project will be to provide a set of resources for archivists to learn about and implement agile in their own institutions, emphasizing rapid iteration to improve digital access solutions and embracing “good enough” over “perfect.” To make this toolkit useful for archivists, the project would highlight real-world agile case studies and best practices for working with born-digital archival materials, and include the following deliverables:

  • Agile toolkit:
    • Tool for determining whether agile is a good fit for your project–this could take the form of a checklist for project assessment
    • Use cases and case studies covering a range of professional settings, from large government and/or educational institutions to lone arrangers working without the support of information technology professionals
    • Agile quick-start guide, covering fundamental concepts, guidelines, and FAQs
    • Foundational readings
  • Platform for sharing experiences with implementing agile:
    • Reports on outcomes of agile projects in institutions
    • Remixes of the toolkit for particular audiences or contexts

You’re invited to view and comment on the full project proposal here.

Reflecting on my own participation in Phases I and II of the Born-Digital Access Hackfest, I felt that even though it was challenging to balance Phase II participation against other professional commitments, the Hackfest model proved to be an effective way to incubate collaboration, providing a well-defined structure in which Hackfest team members could explore strategies and exchange ideas.

As Daniel Johnson wrote in his Archivist Bootcamp for Access post, “There is still a lot of work to do.”

We are looking for a project team to develop this Agile for Access proposal. This project team will be responsible for developing/designing the agile toolkit; identifying possible hosts/distribution platforms; documenting audience use cases that may correspond to toolkit modules (agile for administration, agile for processing archivists, etc.); designing a project sustainability plan; locating funding sources; and promoting the project. At this time, we are seeking volunteers for the project team, as well as feedback on all aspects of the proposal. We are in the beginning stages of this project and want it to accurately assess the needs of the community to provide access to born-digital materials. Please feel welcome to send comments, ideas, and questions to the Agile for Access Hackfest Team Leader, Erin Faulder (erin.faulder [at] tufts [dot] edu), and Researcher, Jessica Meyerson (j.meyerson [at] austin [dot] utexas [dot] edu).

Many thanks to Agile for Access Hackfest Team member Martin Gengenbach for his contributions to this post.

Jessica Meyerson is the Digital Archivist at the Dolph Briscoe Center for American History at the University of Texas at Austin, focused on research-in-practice and building community infrastructure to support long-term access to digital material on and off campus. Meyerson currently serves as a steering committee member for Texas Archival Resources Online and co-investigator on the IMLS-funded Software Preservation Network project.

Viewing Email through a New Lens: Screening, Managing, and Providing Access to Historical Email using ePADD

By Josh Schneider

This post is the fourth in a bloggERS series about access to born-digital materials.

The email correspondence of historical and political figures presents great potential to support scholarly research into contemporary society. Yet archival repositories of all types—whether public or private, government or cultural—currently face significant impediments to collecting and administering email collections due to concerns about privacy and copyright and the difficulty of processing large, multi-decade archives containing hundreds of thousands of messages. ePADD is an open source and freely downloadable software suite released by Stanford University and its NHPRC grant partners this past summer that expressly addresses these challenges.

With the initial release, the ePADD team harnessed machine learning, natural language processing (NLP), automated metadata extraction, and other batch processes to support the identification and restriction of messages matching regular expressions for sensitive content, such as credit card and Social Security numbers; browsing and visualization of the archive by correspondent or mentioned entities; and the ability to link these with established authority headings. We also built functionality for custom lexicons that permit tiered thematic searching of the email archive, remote access to a redacted email archive to aid in discovery, and query generation, which compares the entity index of the archive with that of any other text-based corpus.
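
As a generic illustration of this kind of screening (the patterns below are simplified examples, not ePADD’s actual rules), a regular-expression pass over message bodies might look like:

```python
# Generic illustration of regex screening for sensitive numbers; these
# simplified patterns are examples, not ePADD's actual rules.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def flag_message(text):
    """Return the names of the sensitive patterns found in a message body."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

print(flag_message("Per our call, my SSN is 078-05-1120."))  # ['ssn']
```

In practice, flagged messages would be routed to an archivist for review and restriction rather than redacted automatically.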

At the heart of the software is a custom named entity recognizer (NER), which benefits from the email address book associated with the archive and supports browsing and visualization of named person, organization, and location entities within message headers and text. Since its initial release, ePADD has continued to expand its NER capabilities dramatically. The developer version now includes more fine-grained categories of organization and location entities, such as libraries, museums, and universities, bootstrapped from Wikipedia, which should enhance browsing accuracy. This enhanced NER—which will be rolled out in future public releases—also supports extraction, browsing, and visualization of other entity types, including events and diseases, the latter of which should improve screening for protected health information.
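
ePADD ships its own email-aware recognizer, so the snippet below is only a generic stand-in to show what this kind of entity extraction yields over a message body; it uses spaCy’s off-the-shelf English model, which is an assumption for illustration and not part of ePADD.

```python
# Generic stand-in for entity extraction (not ePADD's custom NER).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
body = "Lunch with Robert Creeley at Stanford University next Tuesday."

for ent in nlp(body).ents:
    print(ent.text, ent.label_)
# e.g., "Robert Creeley PERSON", "Stanford University ORG",
#       "next Tuesday DATE"
```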

The ePADD team and its partners aim to keep these improvements rolling. In September 2015, IMLS generously agreed to fund three years of additional ePADD development, “ePADD Phase 2,” to further IMLS’s vision of a national digital platform. ePADD is proud to include University of Illinois Urbana-Champaign, Harvard University, University of California, Irvine, and Metropolitan New York Library Council (METRO) among its partners for this second phase of growth. We will maintain an iterative release cycle, with a major software release scheduled for every six months; the first of these is set for June 2016.

In Phase 2, we aim to greatly improve archival institutions’ ability to appraise, process, and provide access to email collections otherwise unavailable to researchers. We hope that fulfilling Phase 2 development goals will also lay the groundwork for future efforts applying similar automated workflows and functionalities to other classes of born-digital materials.

Phase 2 should bring critical enhancements to ePADD’s feature set with respect to electronic records management and access, including:

  • Supporting the ability to work with archives of up to 750,000 messages;
  • Ensuring integration with existing and emerging tools and services for archival workflows, including support for preservation;
  • Supporting export of entities and other metadata as linked open data; and
  • Improving restriction management, implementing an audit trail, and exploring the creation of a public delivery module for email archives released through FOIA and sunshine requests.

We will also build a community of users and developers through sustained outreach, including a symposium for government repositories and a developer hackathon, to be held later in the grant cycle.

We hope you will join our growing community: Please download the software, join our user forums or mailing list, or contribute a user story to inform future development. You can also follow us on Twitter for the latest news and announcements.

Josh Schneider is Assistant University Archivist in the Department of Special Collections and University Archives at Stanford University. He is also Community Manager for ePADD, an open-source software package that supports archival processes around the appraisal, processing, discovery, and delivery of email archives.

Let the Bits Describe Themselves

By Brian Dietz

This post is the third in a bloggERS series about access to born-digital materials.

Over the last three years, the Special Collections Research Center (SCRC) at NCSU Libraries has undertaken an initiative to enhance our capacity for managing born-digital archival materials. For two years, the initiative was staffed half-time by a Libraries’ Fellow and guided by an advisory group representing the department’s core functional areas, with input from colleagues in our Digital Library Initiatives (DLI) department. The initiative has been focused on a set of minimally viable tools and workflows for ingest, processing, and access. For ingest, we’re using a combination of tools, like FTK Imager, BitCurator, and FITS; for access, we are providing materials, in most cases, on a laptop in the SCRC’s reading room.

For processing, we decided that we did not want to dedicate time and money to arranging folders and files below the level of the object, be it a floppy disk, optical disc, hard drive, or set of files. In ArchivesSpace, we create an archival object record in the appropriate series to which the files on the media belong; the archival object is given a title as descriptive as possible, based in part on information found on the object itself. If an appropriate series does not exist, we create one. If the media contains content that fits into more than one series, we create a new series, “Electronic Media,” in which the record for the media object goes.
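
To make the ArchivesSpace step concrete, below is a hedged sketch of creating such a media-level record through the ArchivesSpace REST API. Whether this step is scripted or done in the staff interface is a local choice, and the backend URL, repository and record IDs, and example title here are all placeholders.

```python
# Sketch: creating a media-level archival object via the ArchivesSpace
# REST API. URL, IDs, and title are placeholders, not NCSU's values.
import requests

ASPACE = "https://aspace.example.edu/api"  # hypothetical backend URL

def create_media_record(user, password, title):
    # Authenticate and grab a session token.
    session = requests.post(
        f"{ASPACE}/users/{user}/login", params={"password": password}
    ).json()["session"]
    record = {
        "jsonmodel_type": "archival_object",
        "title": title,            # as descriptive as the object allows
        "level": "item",
        "resource": {"ref": "/repositories/2/resources/42"},        # the collection
        "parent": {"ref": "/repositories/2/archival_objects/99"},   # e.g., an "Electronic Media" series
    }
    return requests.post(
        f"{ASPACE}/repositories/2/archival_objects",
        headers={"X-ArchivesSpace-Session": session},
        json=record,
    ).json()

print(create_media_record("admin", "admin", "3.5-inch floppy: Correspondence files, 1994"))
```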

The decision not to rearrange files has numerous advantages, including saving staff time, maintaining data authenticity, and allowing users to see the environment in which the creator worked. Most compelling is the fact that we have access to all sorts of metadata about files and their computing environment that we can leverage to make materials discoverable by researchers and to provide them with the resources necessary to do their own arrangement.

For instance, from most media, and for the bulk of files we’re currently dealing with, we can easily grab:

  • File name
  • File path
  • File type
  • Document size
  • Dates

If we’re lucky, we might get from embedded metadata:

  • Creator
  • Title

And with a little extra work, information about the computing environment can be gathered, like:

  • Software used to create the files
  • Operating system information
  • Word lists

So, what can be done with this metadata? One thing is to combine it into a CSV file that is available for download via our finding aids. There is real potential benefit in offering the researcher the ability to do her own arrangement, sorting by file path, date, document type, or other criteria that might give sense and order to the materials. With staff in DLI, we are developing this feature in our finding aids.
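
As a bare-bones stand-in for that harvest (in practice the metadata comes out of ingest tools such as FITS rather than an ad hoc script), the sketch below walks a mounted directory and writes the fields listed above to a sortable CSV. The paths are hypothetical.

```python
# Bare-bones stand-in for the metadata harvest: walk a directory and
# write file-level metadata to a CSV a researcher can sort and filter.
import csv
import datetime
import os

def write_manifest(root_dir, out_csv):
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name", "file_path", "file_type", "size_bytes", "modified"])
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                info = os.stat(path)
                writer.writerow([
                    name,
                    path,
                    os.path.splitext(name)[1].lstrip(".").lower() or "unknown",
                    info.st_size,
                    datetime.datetime.fromtimestamp(info.st_mtime).isoformat(),
                ])

write_manifest("/mnt/media_0042", "media_0042_manifest.csv")  # hypothetical paths
```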

Another tool we’re developing to help researchers explore the contents of electronic media is a virtual filesystem browser. Using the file and system metadata gathered during ingest, it recreates a file browsing environment—like Explorer (Windows), Finder (Mac), or Nautilus (Ubuntu)—in a web browser, allowing a researcher, from within the context of a finding aid, to virtually navigate the contents of a media object or a set of folders and files. At the file level, there is additional file metadata available for the researcher to consider.
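
At its heart, such a browser needs little more than the manifest’s file paths nested into a tree that a web front end can render. The sketch below is not our implementation, just the underlying idea; the paths are made up.

```python
# Nest flat file paths into a tree; a web front end can then render the
# tree as an Explorer/Finder-style view.
def build_tree(paths):
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree

paths = [
    "My Documents/letters/draft1.doc",
    "My Documents/letters/draft2.doc",
    "My Documents/budget.xls",
]
print(build_tree(paths))
# {'My Documents': {'letters': {'draft1.doc': {}, 'draft2.doc': {}},
#                   'budget.xls': {}}}
```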

Directories and files in “My Documents,” viewed through the filesystem browser.
File representation, with metadata, viewed through the filesystem browser.

We currently do not plan to provide researchers with access to files through the filesystem browser (we are exploring tiers of access, from restricted to unmediated web access, and the filesystem browser will likely play a role). Still, it will allow researchers to get a sense of what kinds of content may be on a disk—which may inform their decisions regarding requests for access—without the expense of symbolically arranging folders or files and tracking that work.

Leveraging this metadata for description and resource discovery initially seemed like a minimally viable product, but, as we go along, I think we’ll find that it’s much better than that.

 

Linda Sellars and Trevor Thornton provided insightful suggestions and edits for this post.

Brian Dietz is the Digital Program Librarian for Special Collections at NCSU Libraries, where he manages digitization, born-digital processing, and web and social media archiving.

NPR: Will Future Historians Consider These Days The Digital Dark Ages?

Nice article from NPR on digital preservation with some thoughtful comments by our colleagues in archives and museums.
Additional discussion of this article can be found on Bert Lyons’s Twitter feed.

Creating an Archivist Bootcamp for Born-Digital Access

By Daniel Johnson

This post is the second in a bloggERS series about access to born-digital materials.

At the 2015 SAA conference in Cleveland, I participated in a born-digital hackfest alongside about 50 other people, with the purpose of developing proposals for collaborative projects that would help address current obstacles to born-digital access in the following areas: Advocacy for Access, Understanding Users for Access, Agile for Access, and an Archivist Bootcamp.

My group was assigned the task of putting together a proposal for a hands-on Archivist Bootcamp for Born-Digital Access that would train attendees to provide access to born-digital content. Our group discussed a variety of topics, including the growing need to provide access to born-digital material, how to design a bootcamp for an audience with differing skill levels, topics to be included in the bootcamp, other similar existing programs, and the pros and cons of an in-person bootcamp.

After the hackfest, we began an iterative process to create a formal proposal. You can view and comment on the proposal here. The proposed two-day bootcamp would cover a full range of expertise necessary to facilitate access to born-digital materials, from hard technical skills to soft people skills. We also established some broad topics that such a bootcamp could cover:

  • Establishing policies
  • Technical implementation case studies and best practices
  • Hands-on technical sessions
  • Reference for born-digital materials
  • Advocacy/program development — Interactions with managers
  • Workflows for access
  • Donor relations
  • User relations — Working with researchers
  • Risk assessment — Access vs. security/privacy
  • Processing for access

Although the bootcamp itself is designed to be a hands-on two-day experience, additional infrastructure will be needed to ensure the viability of the training. A centralized wiki, in conjunction with online forums with a specific focus on accessing born-digital material, will provide information to bootcamp participants prior to attending and will also serve the wider community as a resource in its own right.

Information added to the wiki and forums will inform the curriculum of the bootcamp. New information and ideas uncovered during bootcamp sessions will bolster the online tools, which could eventually develop into something like a virtual bootcamp or ongoing webinar series. In other words, the bootcamp and online infrastructure would help sustain each other.

There is still a lot of work to do. Most importantly, we need to put together a project team. This will include 5-10 people to develop/design the bootcamp and identify possible instructors, funding sources, and spaces to hold the bootcamps. At this time, we are seeking volunteers for the project team, as well as feedback on all aspects of the proposal. We are in the beginning stages of this project and want it to accurately assess the needs of the community to provide access to born-digital materials. Please feel welcome to send comments, ideas, and questions to the Born-Digital Access Hackfest Team Leader, Daniel Johnson (daniel-h-johnson [at] uiowa [dot] edu), and Researcher, Alison Clemens (alison.clemens [at] yale [dot] edu).

Daniel Johnson is the Digital Preservation Librarian at The University of Iowa focusing on the long-term preservation of digital and born-digital material. Previously Johnson worked as a project archivist at Brown University running a CLIR grant to process The Gordon Hall and Grace Hoag Collection of Dissenting and Extremist Printed Propaganda and as a digital archivist at The HistoryMakers African American Video Oral History Archive.