We all have dreams… Have you ever found yourself enamored with the idea of digitizing? How fast and easy it would be to just scan an image and post it online (or email it) instead of pulling item after item after awkward map off the shelf in the stacks day in and day out? Have you ever imagined a world where small institutions have metadata as good as the Digital Commonwealth? While these are all great dreams to have, the reality remains that many repositories cannot digitize entire collections, and on-demand digitizing requires consistent metadata at both the collection and item level. This is the story of backtracking through years of unlabeled scans, and what I learned about part-time, lone-arranger digitization practices.
For every good intention, there is an equally powerful Murphy’s Law reaction…
In a small institution with limited personnel and resources, it is very easy to leave a long trail of to-do tasks that simply do not get done on time. For example:
A researcher asks for a scan of a photograph they found on a finding aid posted on your website.
You scan the photograph, email it to them, and then get pulled off task because your intern needs help with some loose papers in the box she is processing.
Before you get back to the computer, your co-worker asks to meet about pulling materials for an event.
You don’t get back to the computer for another hour, and by then there are six more emails waiting for you and new reference requests. You dive into those.
You haven’t even looked at your to-do list from yesterday.
Maybe you remember to put the photograph into the correct collection folder. Maybe you labeled the photograph 3.014 LLH but never uploaded it onto the server. Maybe you forgot about it altogether until someone asked for it again and you remembered doing it before. Do you take the time to find it? Do you rescan it?
I came into years of unlabeled scans, almost a terabyte of images either completely unlabeled or images organized into collection folders, and I am sure I am not alone. I’m sure I’ve also contributed to this type of problem over the years.
By not labeling the images, the person scanning is either relying on visual recognition for the future, or not thinking about the future at all. What if you get sick and need someone to cover, or leave the institution? What if you forget the number you assigned it in the finding aid? This all raises the following question: can anyone keep thousands of images straight? I sure can’t. By not organizing and labeling the scans from the get-go, the person wasted time and resources, exposed the originals to adverse conditions, and flooded the server with useless materials that just make searching harder. Looking at a sea of scans and not knowing where they might belong is difficult and intimidating. Scanning an image and emailing it, then forgetting about it because of other goings-on, is quite easy.
With many scans comes great responsibility… On one hand, someone already took the time to scan all those materials, pulling them out of their controlled environment and subjecting them to the scanner. On the other hand, deleting everything without metadata and starting from scratch with specific standards in place would eliminate significant time spent backtracking. I chose to act on a case by case basis, working through the images that were sorted into collection folders, and relegating the completely unlabeled photographs into a folder to delete or deal with later. I ended up going collection by collection through the finding aids to determine which scan was which, ending up with 25 collections with a significant number of scans identified with their item number in addition to the collections that were scanned in their entirety.
Making the best of it all
With all these scans identified, I have been able to enrich the metadata of our collections posted on Flickr and on our website. Well, once our new website is actually up and running. Digitization is only step one of the process…and only if you set the correct dpi the first time around…
Irina Sandler graduated in May 2017 with her MLIS from Simmons College. She is currently an archivist at the Baker Library of Harvard Business School as well as the Cambridge Historical Society.
I’m the Records Coordinator for a global energy engineering, procurement, and construction contractor, herein referred to as the “Company.” The Company does design, fabrication, installation, and commissioning of upstream and downstream technologies for operators. I manage the program for our hard copy and electronic records produced from our Houston office.
A few years ago our Records Management team was asked by the IT department to help create a process to archive digital records of closed projects created out of the Houston office. I saw the effort as an opportunity to expand the scope and authority of our records program to include digital records. Up to this point, our practice only covered paper records, and we asked employees to apply the paper record policies to their own electronic records.
The Records Management team’s role was limited to providing IT with advice on how to deploy a software tool where files could be stored for a long-term period. We were not included in the discussions on which software tool to use. It took us over a year to develop the new process with IT and standardize it into a published procedure. We had many areas of triumph and failure throughout the process. Here is a synopsis of the project.
Objective: IT was told that retaining closed projects files on the local server was an unnecessary cost and was tasked with removing them. IT reached out to Records Management to develop a process to maintain the project files for the long-term in a more cost-effective solution that was nearline or offline, where records management policies could be applied.
Vault: The software chosen was a proprietary cloud-based file storage center or “vault.” It has search, tagging, and records disposition capabilities. It is more cost-effective than storing files on the local server.
Process: At 80% project completion, Records Management reaches out to active projects to discover their methods for storing files and the project completion schedule. 80% engineering completion is an important timeline for projects because most of the project team is still involved and the bulk of the work is complete. Records Management also gains knowledge of the project schedule so we can accurately apply the two-year timespan to when the files will be migrated off the local server and to the vault. The two-year time span was created to ensure that all project files would be available to the project team during the typical warranty period. Two years after a project is closed, all technical files and data are exported from the current management system and ingested into the vault, and access groups are created so employees can view and download the files for reference as needed.
Deployment: Last year, we began to apply the process to large active projects that had passed 80% engineering completion. Large projects are those that have greater than 5 million in revenue.
Observations: Recently we have begun to audit the whole project with IT, and are just now identifying our areas of failure and triumph. We will conduct an analysis of these areas and assess where we can make improvements.
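The two-year rule in the Process step above is simple date arithmetic, and a short sketch makes the edge cases visible. This is only an illustration: the function names are hypothetical, and it assumes a single known closure date per project, which (as discussed below) is not always easy to determine in practice.

```python
from datetime import date

RETENTION_YEARS = 2  # from the process description: two years after project closure


def vault_migration_date(closure: date, years: int = RETENTION_YEARS) -> date:
    """Return the earliest date a project's files would move to the vault."""
    try:
        return closure.replace(year=closure.year + years)
    except ValueError:
        # The closure date was 29 February and the target year is not a leap year.
        return closure.replace(year=closure.year + years, day=28)


def ready_for_vault(closure: date, today: date) -> bool:
    """True once the two-year window after closure has passed."""
    return today >= vault_migration_date(closure)
```

A project closed on 30 June 2015 would become eligible for migration on 30 June 2017 under this sketch.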
Our big areas of failure were related to stakeholder involvement in the development, deployment, and utilization of the vault.
Stakeholders, including the Records Management team, were not involved in the selection or development of the vault software tool. As a result, the vault development project lacked the resources required to make it as successful as possible.
In the deployment of the vault, we did not create an outreach campaign with training courses that would introduce the tool across our very large company. Due to this, many employees are still unaware of the vault. When we talk with departments and projects about methods to save old files for less money they are reluctant to try the solution because it seems like another way for IT to save money from their budget without thinking about the greater needs of the company. IT is still viewed as a support function that is inessential to the Company’s philosophy.
Lastly, we did not have methods to export project files from all systems for ingest into the vault; nor did we, in North America, have the authority to develop that solution. To be effective, that type of decision and process can only be developed by our corporate office in another country. The Company also does not make information about project closure available to most employees. A project end date can be determined by several factors, including when the final invoice was received or the end of the warranty period. This type of information is essential to the information lifecycle of a project, and since we had no involvement from upper level management, we were not able to devise a solution for easily discovering this information.
We had some triumphs throughout the process, though. Our biggest triumph is that this project gave Records Management an opportunity to showcase our knowledge of records retention and its value as a method to save money and maintain business continuity. We were able to collaborate with IT and promulgate a process. It gave us a great opportunity to grow by building better relationships with the business lines. Although some departments and teams are still skeptical about the value of the vault, when we advertise it to other project teams, they see the vault as evidence that the Company cares about preserving their work. We earned our seat at the table with these players, but we still have to work on winning over more projects and departments. We’ve also preserved more than 30 TB of records and saved the Company several thousand dollars by ingesting inactive project files into the vault.
I am optimistic that when we have support from upper management, we will be able to improve the vault process and infrastructure, and create an effective solution for utilizing records management policies to ensure legal compliance, maintain business continuity, and save money.
Sarah Dushkin earned her MSIS from the University of Texas at Austin School of Information with a focus in Archival Enterprise and Records Management. Afterwards, she sought to diversify her expertise by working outside of the traditional archival setting and moved to Houston to work in oil and gas. She has collaborated with management from across her company to refine their records management program and develop a process that includes the retention of electronic records and data. She lives in Sugar Land, Texas with her husband.
When was the last time you totally, completely, utterly loused up a project or a report or some other task in your professional life? When was the last time you dissected that failure, in meticulous detail, in front of a room full of colleagues? Let’s face it: we’ve all had the first experience, and I’d wager that most of us would pay good money to avoid the second.
It’s a given that we’ll all encounter failure professionally, but there’s a strong cultural disincentive to talk about it. Failure is bad. It is to be avoided at all costs. And should one fail, that failure should be buried away in a dark closet with one’s other skeletons. At the same time, it’s well acknowledged that failure is a critical step on the path to success. It’s only through failing and learning from that experience that we can make the necessary course corrections. In that sense, refusing to acknowledge or unpack failure is a disservice: failure is more valuable when well-understood than when ignored.
This philosophy — that we can gain value from failure by acknowledging and understanding it openly — is the underlying principle behind Fail4Lib, the perennial preconference workshop that takes place at the annual Code4Lib conference, and which completed its fifth iteration (Fail5Lib!) at Code4Lib 2017 in Los Angeles. Jason Casden (now of UNC Libraries) originally conceived of the Fail4Lib idea, and together he and I developed the concept into a workshop about understanding, analyzing, and coming to terms with professional failure in a safe, collegial environment.
Participants in a Fail4Lib workshop engage in a number of activities to foster a healthier relationship with failure: case study discussions to analyze high-profile failures such as the Challenger disaster and the Volkswagen diesel emissions scandal; lightning talks where brave souls share their own professional failures and talk about the lessons they learned; and an open bull session about risk, failure, and organizational culture, to brainstorm on how we can identify and manage failure, and how to encourage our organizations to become more failure-tolerant.
Fail4Lib’s goal is to help its participants to get better at failing. By practicing talking about and thinking about failure, we position ourselves to learn more from the failures of others as well as our own future failures. By sharing and talking through our failures we maximize the value of our experiences, we normalize the practice of openly acknowledging and discussing failure, and we reinforce the message to participants that it happens to all of us. And by brainstorming approaches to allow our institutions to be more failure-tolerant, we can begin making meaningful organizational change towards accepting failure as part of the development process.
The principles I’ve outlined here not only form the framework for the Fail4Lib workshop, they also represent a philosophy for engaging with professional failure in a constructive and blameless way. It’s only by normalizing the experience of failure that we can gain the most from it; in so doing, we make failure more productive, we accelerate our successes, and we make ourselves more resilient.
Andreas Orphanides is Associate Head, User Experience at the NCSU Libraries, where he develops user-focused solutions to support teaching, learning, and information discovery. He has facilitated Fail4Lib workshops at the annual Code4Lib conference since 2013. He holds a BA from Oberlin College and an MSLS from UNC-Chapel Hill.
This is the second post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “drafting a design guide and requirements for Free and Open Source Software (FOSS) tools, to ensure that they integrate easily with digital preservation institutional systems and processes.”
The FOSS Development Requirements Group set out to create a design guide for FOSS tools to ensure easier adoption of open-source tools by the digital preservation community, including their integration with common end-to-end software and tools supporting digital preservation and access that are now in use by that community.
The group included representatives of large digital preservation and access projects such as Fedora and Archivematica, as well as tool developers and practitioners, ensuring a range of perspectives were represented. The group’s initial discussion led to the creation of a list of minimum necessary requirements for developing open source tools for digital preservation, based on similar examples from the Open Preservation Foundation (OPF) and from other fields. Below is the draft list that the group came up with, followed by some intended future steps. We welcome feedback or additions to the list, as well as suggestions for where such a list might be hosted long term.
Minimum Necessary Requirements for FOSS Digital Preservation Tool Development
Provide publicly accessible documentation and an issue tracker
Have a documented process for how people can contribute to development, report bugs, and suggest new documentation
Every tool should do the smallest possible task really well; if you are developing an end-to-end system, develop it in a modular way in keeping with this principle
Follow established standards and practices for development and use of the tool
Keep documentation up-to-date and versioned
Follow test-driven development philosophy
Don’t develop a tool without use cases, and stakeholders willing to validate those use cases
Use an open and permissive software license to allow for integrations and broader use
Have a mailing list, Slack or IRC channel, or other means for community interaction
Establish community guidelines
Provide a well-documented mechanism for integration with other tools/systems in different languages
Provide functionality of tool as a library, separating out the GUI and the actual functions
Package tool in an easy-to-use way; the more broadly you want the tool to be used, the more operating systems you should package it for
Use a packaging format that supports any dependencies
Provide examples of functionality for potential users
Consider the organizational home or archive for the tool for long-term sustainability; develop your tool based on potential organizations’ guidelines
Consider providing a mechanism for internationalization of your tool (this is a broader community need as well, to identify the tools that exist and to incentivize this)
Digital preservation is an operating system-agnostic field
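To make the "library first, interface second" requirement in the list above concrete, here is a minimal sketch. It is not drawn from any of the tools the group discussed; the names `file_checksum` and `main` are hypothetical, and checksumming is just a convenient stand-in for whatever a real tool does.

```python
import argparse
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Library function: no printing or argument parsing, so other tools can import it."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main(argv=None) -> int:
    """Thin command-line wrapper; a GUI could call file_checksum in the same way."""
    parser = argparse.ArgumentParser(description="Print a checksum for each file.")
    parser.add_argument("paths", nargs="+", type=Path)
    parser.add_argument("--algorithm", default="sha256")
    args = parser.parse_args(argv)
    for path in args.paths:
        print(f"{file_checksum(path, args.algorithm)}  {path}")
    return 0
```

Because the work lives in `file_checksum`, a workflow system can import and call it directly, while `main` stays small enough to replace with a different front end.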
Feedback and Perspectives. Because of the expense of the iPRES conference (and its location in Switzerland), all of the group members were from relatively large and well-resourced institutions. The perspective of under-resourced institutions is very often left out of open-source development communities, as they are unable to support and contribute to such projects; in this case, this design guide would greatly benefit from the perspective of such institutions as to how FOSS tools can be developed to better serve their digital preservation needs. The group was also largely from North America and Europe, so this work would eventually benefit greatly from adding perspectives from the FOSS and digital preservation communities in South America, Asia, and Africa.
Institutional Home and Stewardship. When finalized, the FOSS development requirements list should live somewhere permanently and develop based on the ongoing needs of our community. As this line of communication between practitioners and tool developers is key to the continual development of better and more user-friendly digital preservation tools, we should continue to build on the work of this group.
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.
Organized by Sam Meister (Educopia), Shira Peltzman (UCLA), Carl Wilson (Open Preservation Foundation), and Heidi Kelly (Indiana University), OSS4PRES 2.0 was a half-day workshop that took place during the 13th annual iPRES 2016 conference in Bern, Switzerland. The workshop aimed to bring together digital preservation practitioners, developers, and administrators in order to discuss the role of open source software (OSS) tools in the field.
Although several months have passed since the workshop wrapped up, we are sharing this information now in an effort to raise awareness of the excellent work completed during this event, to continue the important discussion that took place, and to hopefully broaden involvement in some of the projects that developed. First, however, a bit of background: The initial OSS4PRES workshop was held at iPRES 2015. It was attended by over 90 digital preservation professionals from all areas of the open source community. Individuals reported on specific issues related to open source tools, and these reports were followed by small group discussions about the opportunities, challenges, and gaps that they observed. The energy from this initial workshop led both to the proposal of a second workshop and to a report that was published in Code4Lib Journal, OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements.
The overarching goal for the 2016 workshop was to build bridges and fill gaps within the open source community at large. In order to facilitate a focused and productive discussion, OSS4PRES 2.0 was organized into three groups, each of which was led by one of the workshop’s organizers. Additionally, Shira Peltzman floated between groups to minimize overlap and ensure that each group remained on task. In addition to maximizing our output, one of the benefits of splitting up into groups was that each group was able to focus on disparate but complementary aspects of the open source community.
Develop user stories for existing tools (group leader: Carl Wilson)
Carl’s group was composed principally of digital preservation practitioners. The group scrutinized existing pain points associated with the day-to-day management of digital material, identified tools that had not yet been built that were needed by the open source community, and began to fill this gap by drafting functional requirements for these tools.
Define requirements for online communities to share information about local digital curation and preservation workflows (group leader: Sam Meister)
With an aim to strengthen the overall infrastructure around open source tools in digital preservation, Sam’s group focused on the larger picture by addressing the needs of the open source community at large. The group drafted a list of requirements for an online community space for sharing workflows, tool integrations, and implementation experiences, to facilitate connections between disparate groups, individuals, and organizations that use and rely upon open source tools.
Define requirements for new tools (group leader: Heidi Kelly)
Heidi’s group looked at how the development of open source digital preservation tools could be improved by implementing a set of minimal requirements to make them more user-friendly. Since a list of these requirements specifically for the preservation community had not existed previously, this list both fills a gap and facilitates the building of bridges, by enabling developers to create tools that are easier to use, implement, and contribute to.
Ultimately OSS4PRES 2.0 was an effort to make the open source community more open and diverse, and in the coming weeks we will highlight what each group managed to accomplish towards that end. The blog posts will provide an in-depth summary of the work completed both during and since the event took place, as well as a summary of next steps and potential project outcomes. Stay tuned!
Shira Peltzman is the Digital Archivist for the UCLA Library where she leads the development of a sustainable preservation program for born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts and was a member of the inaugural class of the National Digital Stewardship Residency in New York (NDSR-NY).
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.
The Electronic Records Task Force (ERTF) at the University of Minnesota just completed its second year of work. This year’s focus was on the processing of electronic records for units of the Archives and Special Collections. The Archives and Special Collections (ASC) is home to the University of Minnesota’s collection of rare books, personal papers, and organizational archives. ASC is composed of 17 separate collecting units, each focusing on a specific subject area. Each unit is run separately with some processing activities being done centrally through the Central Processing department.
We realized quickly that even more than traditional analog records processing, electronic records work would be greatly facilitated by utilizing Central Processing, rather than relying on each ASC unit to ingest and process this material. In keeping with traditional archival best practices, Central Processing typically creates a processing plan for each collection. The processing plan form records useful information to use during processing, which may be done immediately or at a later date, and assigns a level of processing to the incoming records. This procedure and form work very well with analog records, and the Task Force initially established the same practice for electronic records. However, we learned that it is not always efficient to follow current practices and that processes and procedures must be evaluated on a continual basis.
The processing plan is a form about a page and a half long with fields to fill out describing the condition and processing needs of the accession. Prior to being used for electronic records, fields included: Collection Name, Collection Date Span, Collection Number, Location(s), Extent (pre-processing), Desired Level of Processing, Restrictions/Redactions Needed, Custodial History, Separated Materials*, Related Materials, Preservation Concerns, Languages Other than English, Existing Order, Does the Collection need to be Reboxed/Refoldered, Are there Significant Pockets of Duplicates, Supplies Needed, Potential Series, Notes to Processors, Anticipated Time for Processing, Historical/Bibliographical Notes, Questions/Comments.
A few changes were made to include information about electronic records. The definitions for Level of Processing were modified to include what was expected for Minimal, Intermediate, or Full level of processing of electronic records. Preservation Concerns specifically asked if the collection included digital formats that are currently not supported or that are known to be problematic. After these minor changes were made, the Task Force completed a processing plan for each new electronic accession.
After several months’ experience using the form, Task Force members began questioning the value of the processing plan for electronic records. In the relatively few instances where accessions were initially reviewed for processing at a later date, it captured useful information for the processor to refer back to without having to start from the beginning. However, the majority of electronic records that were being ingested were also being processed immediately, and only a handful of the fields were relevant to the electronic records. The only piece of information being captured on the Processing Plan that was not recorded elsewhere was the expected “level of processing” for the accession. To address this, the level of processing was added to existing documentation for the electronic accessions, eliminating the need to create a Processing Plan for accessions that were to be processed immediately.
The level of processing itself soon became a point of contention among some of the Electronic Records Task Force members. For electronic records, the level of processing only defined the level at which the collection would be described on a finding aid – collection, series, sub-series, or potentially to the item. The following definitions were created based on Describing Archives: A Content Standard (DACS).
Minimal: There will be no file arrangement or renaming done for the purpose of description/discovery enhancement. File formats will not be normalized. Action will generally not be taken to address duplicate files or PII identified during ingest. Description will meet the requirements for DACS single level description (collection level).
Intermediate: Top level folder arrangement and top-level folder renaming for the purpose of description/discovery enhancement will be done as needed. File formats will not be normalized. Some duplicates may be weeded and redaction of PII done. Description will meet DACS multi-level elements: described to the series level with high research value series complemented with scope and content notes.
Full: Top level folder arrangement and renaming will be done as needed, but where appropriate renaming and arrangement may also be done down to the item level. File normalization may be conducted as necessary or appropriate. Identified duplicates will be removed as appropriate and PII will be redacted as needed. Description will meet DACS multi-level elements: described to series, subseries, or item level where appropriate with high research value components complemented with additional scope and content notes.
Discussions between the ERTF and unit staff about each accession assisted with assigning the appropriate level of processing. This “level of processing,” however, did not always correlate with the amount of effort that was being given to an accession. For example, a collection assigned a minimal level of processing could take days to address while a collection assigned a full level of processing might only take hours based on a number of factors. Just because the minimal level of processing says that there will be no file arranging or renaming done – for the purpose of description/discovery – does not mean that no file arranging or renaming will be done for preservation or ingest purposes. File renaming must often be done for preservation purposes. If unsupported characters are found in file names these must be addressed. If file names are too long this must also be addressed.
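As an illustration of the kind of preservation-driven renaming described above, here is a minimal sketch. It is not the Task Force's actual tool or policy: the set of "supported" characters and the 100-character limit are assumptions chosen for the example, and `preservation_name` is a hypothetical helper.

```python
import re
import unicodedata

# Assumption: anything other than ASCII letters, digits, dot, underscore,
# and hyphen counts as an "unsupported" character.
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]+")
MAX_LENGTH = 100  # assumption: an illustrative limit, not a policy value


def preservation_name(filename: str, max_length: int = MAX_LENGTH) -> str:
    """Return filename with unsupported characters replaced and length capped."""
    stem, dot, ext = filename.rpartition(".")
    if not dot or not stem:
        # No extension, or a dotfile such as ".profile": treat it all as the stem.
        stem, ext = filename, ""
    # Fold accented letters to plain ASCII where possible (e.g. "é" becomes "e").
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    stem = UNSAFE_CHARS.sub("_", stem).strip("_") or "unnamed"
    ext = UNSAFE_CHARS.sub("", ext)
    suffix = f".{ext}" if ext else ""
    # Trim the stem, not the extension, so the file type stays recognizable.
    return stem[: max_length - len(suffix)] + suffix
```

In practice the original name would also be recorded somewhere before renaming, so the change remains documented; this sketch only computes the new name and leaves the file untouched.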
Other tasks that might be necessary to assist with the long-term management of these materials include:
Removing empty directories
Removing .DS_Store and Thumbs.db files
Removing identified PII. While not necessary for description, this better protects the University: the less PII we need to manage, the less risk we take on.
Deleting duplicates. As much as the “level of processing” tries to limit this, from the perspective of someone who needs to manage the storage space, continually adding duplicates will cause problems down the line. We have a program that easily removes them, so we use it.
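Several of the tasks in this list can be surveyed in a single pass over an accession. The sketch below is not the program the Task Force uses; it is a hypothetical `survey` function that only reports candidates (empty directories, system files, and byte-identical duplicates) and deletes nothing. It assumes the system files of interest are macOS `.DS_Store` and Windows `Thumbs.db`.

```python
import hashlib
import os
from collections import defaultdict

# Assumption: these are the operating-system files to flag.
SYSTEM_FILES = {".DS_Store", "Thumbs.db"}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def survey(root: str) -> dict:
    """Walk root and report cleanup candidates without changing anything."""
    empty_dirs, system_files = [], []
    by_digest = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            empty_dirs.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name in SYSTEM_FILES:
                system_files.append(path)
            else:
                by_digest[sha256_of(path)].append(path)
    # Files sharing a checksum are byte-identical; keep only groups of two or more.
    duplicates = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    return {"empty_dirs": empty_dirs, "system_files": system_files, "duplicates": duplicates}
```

Note that a directory containing only a system file is not reported as empty here; it would show up on a second pass after the system files are removed. Keeping the survey separate from any deletion leaves the decision about what to weed with the archivist.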
Therefore the “level of processing”, while helpful in setting expectations for final description, does not provide accurate insight into the amount of work that is being done to make electronic records accessible. In order to address the lack of correlation between the processing level assigned to an accession and the actual level of effort being given to the processing of the accession, a Levels of Effort document was drafted to help categorize the amount of staff time and resources put forth when working with electronic materials. The expected level of effort may be more useful for setting priorities than a level of processing, as there is a closer one-to-one relationship with the amount of time required to complete the processing.
This is another example of how we were not able to directly apply procedures developed for analog records towards electronic records. The key is finding the balance between not reinventing the wheel and doing things the way they have always been done.
Carol Kussmann is a Digital Preservation Analyst with the University of Minnesota Libraries Data & Technology – Digital Preservation & Repository Technologies unit, and co-chair of the University of Minnesota Libraries – Electronic Records Task Force (ERTF). Questions about the activities of the ERTF can be sent to: email@example.com.
Lara Friedman-Shedlov is Description and Access Archivist for the Kautz Family YMCA Archives at the University of Minnesota Libraries. Her current interests include born digital archives and diverse and inclusive metadata.
The human element of digital archiving has lately been covered very well, with well-known professionals like Bergis Jules and Hillel Arnold taking on various pieces of the topic. At Indiana University (IU), we have a tacit commitment in most of our collections to the concept of a culture of care: taking care of those who entrust us with the longevity of their materials. However, sometimes even the best of intentions can lead to failures to achieve that goal, especially when the project involves a fraught issue, like downsizing and the loss or fundamental change of employment for people whose materials we want to bring into the collection. We wanted to share a story of one of our failures, as part of the #digitalarchivesfail series, because we hope that others can learn from our oversights. We hope that by sharing this out we can contribute to the conversation that projects like Documenting the Now have really mobilized.
Introduction: Indiana University Archives and the Born Digital Preservation Lab
Mary Mellon (MM): The mission of the Indiana University Archives is to collect, organize, preserve and make accessible records documenting Indiana University’s origins and development and the activities and achievements of its officers, faculty, students, alumni and benefactors. We often work with internal units and donors to preserve the institution’s legacy. In fulfilling our mission, we are beginning to receive increasing amounts of born digital material without in-house resources or specialized staff for maintaining a comprehensive digital preservation program.
Heidi Kelly (HK): The Indiana University Libraries Born Digital Preservation Lab (BDPL) is a sub-unit of Digital Preservation. It started last January as a service modeled off of the Libraries’ Digitization Services unit, which works with various collection owners to digitize and make accessible books and other analog materials. In terms of daily management of the BDPL, I generally do outreach with collection units to plan new projects, and the Digital Preservation Technician, Luke Menzies, creates the workflows and solves any technical problems. The University Archives has been our main partner so far, since they have ongoing requests from both internal and external donors with a lot of born-digital materials.
Developing a Project With a Downsized Unit
MM: In the middle of 2016, IU Archives staff were contacted by another unit on campus about transferring its records and faculty papers. Unfortunately, the academic unit was facing impending closure, and they were expected to move out of their physical location within a few months. After an initial assessment, we worked with the unit’s administrative staff and faculty members, the IU Libraries’ facilities officer, and Auxiliary Library Facility staff to inventory, pack, and transfer boxes of papers, A/V material, and other physical media to archival storage.
HK: The unit’s staff also suggested that they transfer, along with their paper records, all of the content from their server, which is how the BDPL got involved. The server contained a relatively small amount of content compared to the paper records, but “accessioning a server” was a new challenge for us.
What We Did
MM: I should probably mention that the job of coordinating the transfer of the unit’s material landed on my desk on my second day of work at Indiana University (thanks, boss!), so I was not involved in the initial consultation process. While the IU Archives typically asks campus offices and donors to prepare boxes for transfer, the unit’s shrinking staff had many competing priorities resulting from the approaching closure (the former office manager had already left for a new job). They reached out for additional help about four weeks prior to closure, and the IU Archives assumed the responsibility of packing and physically transferring records at that point. As a result, the large volume of paper records to accession was the immediate focus of our work with the unit.
In the end, we boxed about 48 linear feet of material, facing several unanticipated challenges along the way, namely coordinating with personnel who were transferring or leaving employment. The building that housed the unit was not regularly staffed anyway during the summer, requiring a new email back-and-forth to gain access every time we needed to resume file packing. This need for access also necessitated special trips to the office by unit staff. The staff were obviously stretched a bit thin with the closure in general, so any difficulties or setbacks in terms of transferring materials, whether paper or digital, just added to their stress.
In addition, despite consulting with IU Archives staff and our transfer policies, the unit’s faculty exhibited an unfortunate level of modesty over the enduring value of their papers, which significantly slowed the process of acquiring paper materials. To top it all off, during a stretch of relentless rain, the basement, where most of the paper files were stored prior to transfer, flooded. The level of humidity necessitated moving the paper out as soon as possible, which again made the analog materials our main focus in the IU Archives.
HK: Accessioning the server, on the other hand, didn’t involve many trips to the unit’s physical location. Our main challenges were determining how best to ensure that everything got properly transferred, and how to actually transfer the digital content.
The server ran Windows, which posed the first big issue for us. Our main workflow focuses on creating a disk image, which in this case was not optimal. We had a bash script that we were regularly using on our main Linux machines to inventory any large external hard drives, but we weren’t prepared for running it on a Windows operating system. Luke, having no experience with Python, amazingly adapted our script within a short period of time, but unfortunately this still set us back since the unit’s staff were working against the clock. I also didn’t realize that in explaining the technical issue to them, I had inadvertently encouraged them to search out their own solutions to effectively capture all of the information that our bash inventory was generating. This was a fundamental misstep, as I didn’t think about their attenuated timeline and how that might push a different response. While the staff’s proactivity was helpful, it compounded the amount of time they ultimately spent and left them feeling frustrated.
All of that said, we got the inventory working and then we faced a new set of challenges in terms of the actual transfer. First, we requested read-only access to the server through the unit’s IT department, but were unable to obtain the user privileges. We then tried to move things using Box, the university’s cloud storage, but discovered some major limitations. In comparing the inventory we generated on the user’s end with what we received in the BDPL, there were a lot of files missing. As we found out, the donor had set everything to upload, but received no notifications when the system failed to complete the process. Box had already been less than ideal in that some of the metadata we wanted to keep for every file was wiped as soon as the files were uploaded, so this made us certain that it was not the right solution. Finally, we landed on using external media for the transfer–an option that we had considered earlier but we did not have any media available early on in the project.
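The inventory comparison described above, checking what left the donor's machine against what arrived in the lab, can be sketched in a few lines of Python. This is only an illustration and not the BDPL's actual script: the function names (build_inventory, compare_inventories, demo) are hypothetical, and the choice of file size plus a SHA-256 checksum as the things to record is an assumption about what a reasonable inventory might hold.

```python
import hashlib
import os
import tempfile


def build_inventory(root):
    """Map each file's path (relative to root) to its (size, SHA-256) pair."""
    inventory = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            # Normalize separators so Windows and Linux inventories compare cleanly.
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            inventory[rel_path] = (os.path.getsize(full_path), digest.hexdigest())
    return inventory


def compare_inventories(source, received):
    """Return (missing, changed): paths absent from received, and paths that differ."""
    missing = sorted(path for path in source if path not in received)
    changed = sorted(
        path for path in source
        if path in received and source[path] != received[path]
    )
    return missing, changed


def demo():
    """Build two small throwaway trees (hypothetical data) and compare them."""
    with tempfile.TemporaryDirectory() as donor, tempfile.TemporaryDirectory() as lab:
        for name, text in [("a.txt", "alpha"), ("b.txt", "beta"), ("c.txt", "gamma")]:
            with open(os.path.join(donor, name), "w") as handle:
                handle.write(text)
        # Simulate an incomplete upload: b.txt never arrives, c.txt is altered.
        with open(os.path.join(lab, "a.txt"), "w") as handle:
            handle.write("alpha")
        with open(os.path.join(lab, "c.txt"), "w") as handle:
            handle.write("gamma!")
        return compare_inventories(build_inventory(donor), build_inventory(lab))
```

Running the same inventory on both ends and comparing the results is what surfaces a silent upload failure like the one we hit with Box. Note that a sketch like this only checks file contents; it would not catch the stripped file metadata (such as timestamps) that we also cared about.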
What We Learned
HK: In digital archiving, and digital curation more broadly, we are constantly talking about building our <insert preferred mode of transport here> as we <corresponding verb> it. Our fumblings to figure out better and faster ways to preserve obsolete media are, often, comical. But they also have an impact in several ways that we hadn’t really considered prior to this project. Beyond the fact of ensuring better odds for the longevity of the content that we’re preserving, we’re preserving the tangible histories of Indiana University and of its staff. Our work means that people’s legacies here will persist, and our work with the unit really laid that bare.
Going forward, there are several parts of the BDPL’s workflow that will improve based on the failures of this particular project. First, we’ve learned that our communication of technical issues is not optimal. As a tech geek, I think that the way in which I frame questions can sometimes push people to answer and act in different ways than if the questions are framed by a non-tech person. I spend every day with this stuff, so it’s hard to step back from, “How good are you at command line?” to “Do you know what command line is?” But that’s key to effective relationship management, and that caused a problem in this situation. My goal in this case is to further rely on Mary and other Archives staff to communicate directly with donors. To me, this makes the most sense because we in the BDPL aren’t front-facing people; our role is as advisors to the collection managers, to the people who have been regularly interfacing with donors for years. They’re much better at that element, and we’ve got the technical stuff covered, so everybody’s happy if we can continue building out our service model in that way. Right now we’re focusing on training for the archivists, librarians and curators that are going to be working directly with donors. Another improvement we’ve made is the creation of a decision matrix for BDPL projects in order to better define how we make decisions about new projects. This will again help archivists and other staff as they work with donors. It will also help us to continue focusing on workflows rather than one-offs–building on the knowledge we gain from other, similar projects, instead of starting fresh every time. The necessity of focusing on pathways rather than solutions is again something that became clearer after this project.
MM: Truism alert: Born-digital materials really need to be an early part of the conversations with donors. In this case, we would have gained a few weeks that could have contributed to a smoother experience for all parties and a more ideal solution preservation-wise than what we ended up with. Despite the fact that our guidelines for transferring records and paper state that we want electronic records as well as analog, we cannot count on donors to be aware of this policy, or to be proactive about offering up born-digital content if it will almost certainly lead to more complications.
Seconding Heidi’s point, we at the Archives need to assume more responsibility in interactions with donors regarding transfer of materials, instead of leaving everything digital on the BDPL’s plate. The last thing we want is for donors to be discouraged from transferring born-digital material because it is too much of a hassle, or for the Archives to miss out on any contextual information about the transfers due to lack of involvement. We at the Archives are using this experience to develop formal workflows and policies in conjunction with the BDPL to optimize division of responsibilities between our offices based on expertise and resources. We’re not going to be bagging any files anytime soon on our own workstations, but we can certainly sit down with a donor to walk them through an initial born-digital survey and discuss transfer procedures and technical requirements as needed.
HK: In the end, we’re archiving the legacies of people and institutions. Creating a culture of care is easy to talk about, but when dealing with the legacies of staff who are being displaced or let go, it is crucial and much harder than we could have imagined. Empathy and communication are key, as is the understanding that failure is a fact in our field. We have to embrace it in order to learn from it. We hope that staff at other institutions can learn from our failures in this case.
Mary Mellon is the Assistant Archivist at Indiana University Archives, where she manages digitization and encoding projects and provides research and outreach support. She previously worked at the Rubenstein Library at Duke University and the University Libraries at UNC-Chapel Hill. She holds an M.S. in Information Science from UNC-Chapel Hill and a B.A. from Duke University.
Heidi Kelly is the Digital Preservation Librarian at Indiana University. Her current focuses involve infrastructure development and services for born-digital objects. Previously Heidi worked at Huygens ING, Library of Congress, and Nazarbayev University. She holds a Master’s Degree in Library and Information Science from Wayne State University.
Here are a few ideas that were discussed during the two chats:
Backlogs, workflows, delivery mechanisms, lack of known standards, appraisal and familiarity with software were major barriers to providing access.
Participants were eager to learn more about new tools, existing functioning systems, providing access to restricted material and complicated objects, which institutions are already providing access to data, what researchers want/need, and if any user testing has been done.
Access is being prioritized by user demand, donor concerns, fragile formats and a general mandate that born-digital records are not preserved unless access is provided.
Very little user testing has been done.
A variety of archivists, IT staff and services librarians are needed to provide access.
You can search #bdaccess on Twitter to see how the conversation evolves or view the complete conversation from these chats on Storify.
Daniel Johnson is the digital preservation librarian at the University of Iowa, exploring, adapting, and implementing digital preservation policies and strategies for the long-term protection and access to digital materials.
Seth Anderson is the project manager of the MoMA Electronic Records Archive initiative, overseeing the implementation of policy, procedures, and tools for the management and preservation of the Museum of Modern Art’s born-digital records.
They say you learn more from failure than from success. FIAT was a great teacher.
This is a story about never giving up, until you do: about the project where nothing went right, and just kept going. It takes place at the University of Texas, Austin, School of Information (iSchool) in the Digital Archiving course. A big part of that class is the hands-on technology project, where students apply archival theory to legacy hardware, digital records, or a mix of both. Our class had three mothballed servers and three ancient personal computers available; my group was assigned the largest (and oldest) of the retired School of Information servers, a monster tower-chassis Dell PowerEdge 4400 called FIAT.
Our assignment was clear: following archival principles, gain access to the machine’s filesystem, determine dates of service, inventory the contents, and image the disks or otherwise retrieve the data. We had an advantage in that we knew what FIAT had been used for: the backbone server for the iSchool, serving the public-facing website and the home directories for faculty, staff, and students. In light of this, we had one additional task: locate the old website directory. Hopefully, at the end of the semester, we would have a result to present for the iSchool Open House.
As the only one in the group with Linux server experience (I’d been the school’s deputy systems administrator for about a year), I volunteered as technical lead and immediately began worrying about what would go wrong.
It would be easier to list what went right.
We got access to the machine. We estimated manufacturing and usage dates. We determined the drive configuration. We were able to view the file directory, and we located the old iSchool website.
The catalog of dead ends, surprises, and failures is rather longer, and began almost immediately. None of us had done anything like this before, but I had enough experience with servers to develop specific fears, which may or may not have been better than the general anxiety my group members suffered, and turned out to be largely misplaced.
I was sure that FIAT had been set up in a RAID array, but I didn’t know the specifications or how to image it. (1) To find out without directly accessing the machine– which might have compromised the integrity of its filesystem– we needed its Dell service tag number. If we could give that to the iSchool’s IT coordinator (my boss), Dell’s lookup tool would tell us what we needed to know.
The service tag had been scraped off.
That was annoying, but not fatal. Since we had the model number, we could find a manual; with access to the iSchool’s IT inventory, I could look up the IT control tag and see what information we had. From this, we determined that FIAT was produced between 1999 and 2003, could have been set up for either hardware or software RAID depending on a hardware feature, and was probably running the operating system Red Hat v2.7. That gave us a ballpark for service life. It didn’t move us forward, though, so while my compatriots researched RAID imaging strategies, I looked for another route.
Best practice for computer accessions calls for accessing the machine from a “dead” state, so that metadata doesn’t get overwritten and the machine can be preserved in its shutdown state. For us, that meant booting from a Live CD, a distribution of Linux which runs in the RAM and mounts, or attaches, to the filesystem without engaging the operating system, allowing us to see everything without altering the data. My thought was that we could boot that way and then check for a RAID configuration at the system level: open the box with the crowbar inside it.
And it would have worked, too, if it weren’t for Murphy.
After making the live CD, we turned FIAT on and adjusted the boot order in the BIOS so we could boot from the disk. We learned three things from this: first, and most frighteningly, one of the drive ports didn’t show up in the boot sequence (and another spun up with the telltale whine of a Winchester drive going bad, increasing the pressure to get this done). Second, the battery on FIAT’s internal clock must have died, because it displayed a date in May 2000 (which we figured was probably when the board had been installed). Third, neither the service tag number nor the processor serial number appeared in the BIOS, so we still couldn’t look it up.
Carrying merrily on, we went ahead with the live CD boot. What happened next was our mistake, and I only realized it later. Though Knoppix (the Linux OS we were running from the live CD) started and ran, the commands for displaying partitions and drives returned no results, and navigating to /dev (where Linux lists drives as device files) didn’t reveal any of FIAT’s disks. Nothing in the filesystem looked right, either.
What had happened (and a second attempt made this apparent) was that Knoppix hadn’t mounted at all. It was just running in the RAM. We hadn’t noticed the error message that told us this because we were too excited that the CD drive had worked. Knocked back but hardly defeated, we took a week off to email smarter people and regroup.
The next thing we did involved a screwdriver.
Popping the side off to read the diagram and locate the RAID controller key– or not, as it happened–was mildly cathartic and hideously dusty. I spent the next three days sneezing. Without a hardware controller, I was certain that the machine had been set up with a software RAID; since our attempts to boot from the CD had failed, I proposed that we pull the drives and image them separately with the forensic hardware we had available. My theory was that, since the RAID was configured in the software, we could rebuild it from disk images. This theory did not have a chance to be disproved.
Unscrewing the faceplate and pulling the drives gave me a certain amount of satisfaction, I’ll admit. It also solved the mystery of the missing drive: the reason why one of the SCSI ports wasn’t coming up on the boot screen was that it was empty. With that potential catastrophe averted, we imagined ourselves well set on our way to imaging the disks. Until we discovered that the Forensic Recovery of Evidence Device Laptop (or FRED for short) in the Digital Archaeology Lab didn’t have cables capable of connecting our 80-pin SCSI-2 drives to its 68-pin SCSI-3 write blocker. And that, despite having a morgue’s worth of old computer cables and connectors, there wasn’t anything in the lab with the right ends. That’s when I started fantasizing about making FIAT into a lamp table.
So, while my comrades returned to preparing a controlled vocabulary for our pictures and drafting up metadata files (remember, we never actually gave up on getting the data), I called or drove to every electronics store in town, including the Goodwill computer store. I found a lot of legacy components and machines, but nothing that would convert SCSI-2 to SCSI-3; so I put out a call to my nerd friends to find me something that would work.
With their help, I found an adaptor with SCSI-2 on one side and SCSI-3 on the other. When it arrived, I met up with one of my groupmates at the Digital Archaeology Lab, where the two of us daisy-chained the FRED cables, write blocker, our connector, and the (newly labeled according to our naming convention) drives to see what would happen.
The short version is: nothing.
The longer version is: complicated nothing. Some of the drives, attached to the power supply and write blocker, didn’t turn on at first, then did later without us changing anything. The write blocker’s SCSI connection light never lit up. FRED never registered an attached drive. We tried several jumper combinations, all with the same result: when we could get the drives to turn on at all, the write blocker couldn’t see them, and neither could FRED.
Having exhausted our options for doing it the right way, we explained the situation to our professor, Dr. Pat Galloway (who I think was enjoying our object lesson in Special Problems), and got permission to just turn FIAT on and access it directly. I put the drives back in, we tried booting with Knoppix again just in case (revealing the error), then changed the boot order back and watched it slowly come back to life.
Of course no one had the password.
Illustrating the adage “if physical access has been achieved, consider security compromised,” I put FIAT into Single User Mode, allowing me to reset the root password (we put it in the report, don’t worry) and become its new boss. (2)
This is where it got weird. And exciting! Prior to this, we’d been frustrated; afterwards, we upgraded to baffled. After watching FIAT rebuild itself as a RAID 5 array, we had to figure out what to do next: how to image the machine, and onto what.
We made three attempts to connect FIAT to something, or something to FIAT, each of which resulted in its own unique kind of failure.
After noticing a SCSI-3 port on the back of FIAT–a match to the one on FRED’s write blocker cables–and with no idea if this would even work for a live machine, I proposed plugging the two together to see what happened.
Again, the short answer is: nothing. We tried it both through FRED’s write-blocker and directly to the laptop, but neither FRED nor FIAT registered a connection. Checking for drives showed no new devices, and no connection events appeared in the log file or the kernel message buffer. (3)
Our next bright idea was to attach a USB storage device and run a disk dump to it. We formatted a drive, plugged it in, and prepared for nothing to happen. For once, it didn’t. Instead, FIAT reported an error addressing the device that even Stack Overflow didn’t recognize. I found the error class, however: kernel issues. We thought that maybe the drive was too big, or too new, so we hunted up an older USB drive and tried again. Same result. Then we rebooted. The error messages stopped, but no new connection registered. I tried stopping the service, removing the USB module, and restarting the service, but both ports continued throwing errors.
Servers are meant to be networked. Hoping that FIAT’s core functions remained intact, I acquired some cables and a switch and rigged up a local area network (LAN) between FIAT and my work laptop. If it worked, we could send the disk dump command to a destination drive attached to my Mac. Or we could stare at the screen while FIAT dumped line after line of ethernet errors, which seemed like more fun. Again, I stopped the service, cleared the jobs on the controller (FIAT only had one ethernet port), restarted it, then restarted the service, but the timeout errors persisted (4).
FIAT’s total solipsism suggested a dead south bridge as well as serious kernel problems. While it might have been possible for an electrical engineer (i.e., not us) to overcome the former, the latter presented a catch-22: even if we were willing to alter the operating system (no), the only way to fix FIAT would have been with an update, which, thanks to the kernel errors, we couldn’t perform.
To be clear: those three things all happened in about two hours.
During this project, we kept reflective journals, accessible only to ourselves and our professor. The final entry in mine simply reads: “I think I’m so smart.”
With about a week left, I had an idea. Other than how to convert a tower-chassis server into an end table. I discussed it with the group, and when we couldn’t find anything wrong with it, I suggested they start the final report and presentation: if this worked, we’d have something to turn in. If not, I’d write my section of the report and we’d call it done.
We had found the website while exploring the filesystem. FIAT had a CD drive. I could compress and copy the website directory to a CD and we could turn that in. It wasn’t ideal, but we’d have something to show for a semester’s worth of work.
So while my compatriots got the project documentation ready for ingest to the iSchool’s DSpace, I went to work on FIAT one last time. I covered my bases–researched the compression protocols Red Hat 2.7 supported, the commands I’d need, how to find the hardware location so I could write the file out once it was created.
FIAT being FIAT, I hit a snag: the largest CD I had available was 700MB, but the real size of the website directory was 751MB. After a little investigation, I decided which files and folders we could live without (and put locations and my reasoning in the report): excluding them, I created an .iso smaller than 700MB.
That file still resides in the directory where I put it. The final indignity, FIAT’s sting in the tail, was this: it had a CD drive, which I found with cdrecord -scanbus. What it did not have was a CD burner: the drive was read-only. Attempting to write the .iso to disc, cdrecord returned an error: unsupported drive at bus location. FIAT can read, but it can’t write.
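For anyone facing the same arithmetic, the trimming step (getting a 751MB directory under a 700MB disc) can be sketched in Python. This is a hypothetical reconstruction, not what I ran on FIAT: the function names are made up for illustration, and the capacity constant assumes "700MB" means 700 × 1024 × 1024 bytes. A real .iso also carries some filesystem overhead, so it would be wise to leave a margin.

```python
import os
import tempfile

# Assumption: a "700MB" disc holds 700 * 1024 * 1024 bytes.
CD_CAPACITY_BYTES = 700 * 1024 * 1024


def directory_sizes(root):
    """Return {top-level entry name: total bytes} for everything directly under root."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat().st_size
    return sizes


def overage(sizes, excluded, capacity=CD_CAPACITY_BYTES):
    """Bytes still over capacity after dropping excluded entries (zero or less means it fits)."""
    kept = sum(size for name, size in sizes.items() if name not in excluded)
    return kept - capacity


def demo():
    """Scaled-down stand-in: 751 bytes of 'website' against a 700-byte 'disc'."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "pages"))
        os.mkdir(os.path.join(root, "cache"))
        with open(os.path.join(root, "pages", "index.html"), "wb") as handle:
            handle.write(b"x" * 600)
        with open(os.path.join(root, "cache", "old.tmp"), "wb") as handle:
            handle.write(b"x" * 151)
        sizes = directory_sizes(root)
        return sizes, overage(sizes, set(), capacity=700), overage(sizes, {"cache"}, capacity=700)
```

Listing sizes per top-level folder makes it easy to see which exclusions buy the most room, and to record the reasoning for each one in the report.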
And then we were out of time. After my last idea fizzled, we gave up on FIAT and put together our final report, including suggestions for further work and server decommissioning recommendations for never letting this happen again. I presented the project at the Open House anyway, sitting on FIAT (the casters on the tower model made moving its 115lb bulk a breeze) for four hours, showing people the file structure and regaling them with tales of failure. Talking over the fantastic noise it made, while the other members of my group held down their own project posters, I found that people appreciate a good comedy of errors. I’ve embraced FIAT as my shaggy-dog story, my archival albatross. And now I know what to say in job interviews when they ask me to “talk about a time you didn’t succeed.”
The author would like to acknowledge the efforts and contributions of their fellow-travelers, Arely Alejandro, Maria Fernandez, Megan Moltrup, and Olivia Solis, as well as the guidance and assistance of Dr. Patricia Galloway, Sam Burns, and members of the UT Austin storage ITS team, without whom none of this would have happened.
1 Redundant Array of Inexpensive (or Independent) Disks, a storage virtualization method which uses either hardware or software to combine multiple physical drives into a single logical unit, improving read/write speeds and providing redundancy to protect against drive failure. RAID arrays can be set up at a number of levels depending on user need, all of which have their own implications for preservation and data recovery.
2 Something I had no prior experience with- certainly not with a Red Hat 2.7 machine! I spent more time looking up error codes, troubleshooting, and searching for workarounds than interacting with FIAT.
3 Throughout, I used the command df -h in FIAT to display drives, and read the kernel message buffer, where information about the operating status of the machine can be read, with dmesg.
4 I cannot emphasize enough how much on-the-fly learning occurred during this part- even as a low-level systems administrator, trying to get FIAT to talk to something involved a lot of new material for me.
A.L. Carson, MSIS UT ‘16, is the only archivist on Earth who is allergic to cats. Trained as a digital archivist, they now apply those perspectives on metadata and digital repositories as a Library Fellow at the University of Nevada, Las Vegas. Twitter: @mdxCarson
In Mexico, I was able to conduct interviews with nearly thirty organizations working on building, managing, sharing and preserving their digital collections. The types of organizations I visited were diverse in several areas: geographic location (i.e. outside of heavily centralized Mexico City), organization size, organization mission, and industry sector.
Cultural Heritage organizations (galleries, libraries, archives, museums)
College and University archives and libraries
Because of the diversity of the types of institutions that I visited, the results and conclusions I drew were also varied, and I noticed distinct trends within each area or category of institutions. For the sake of brevity, I have condensed my findings into the following points. These are not meant to be definitive or exhaustive, as I am still compiling, codifying and quantifying interview data.
The focus on digital collection building and preservation in business and government tends toward records management approaches. Retention schedules are dictated by the federal government and administered and enforced by the National Archives. All federal and state government entities are obligated to follow these guidelines for retention and transfer of records and archives. While the guidelines and processes for paper records are robust, many institutions are only beginning to implement and use electronic records management platforms. Long-term digital preservation of records designated for permanent deposit is an ongoing challenge.
In cultural heritage institutions and college and university archives, digital collection work is focused on building digitization and digital collection management programs. The primary focus of the majority of institutions is still on digitization, storage and diffusion of digitized assets, and wrangling issues related to long-term, sustainable maintenance of digital collections platforms and backups on precarious physical media formats like optical disks and (non-redundant) hard drives.
While digital preservation issues are still in the nascent stages of being worked through and solved everywhere around the globe, in some areas strong national and regional groups have been formed to help share strategies, create standards and think through local solutions. In Mexico and Latin America, this has mostly been done through participation in the InterPARES project, but a national Mexican digital preservation consortium, similar to the National Digital Stewardship Alliance (NDSA) in the United States, has yet to be established. In the meantime, several Mexican academic and government institutions have taken the lead on digital preservation issues, and through those initiatives, a more cohesive, intentional organization similar to the NDSA may be able to take root in the near future.
My opportunity to live and do research in Mexico was life-changing. It is now more crucial than ever for librarians, archivists, developers, administrators, and program leaders to look outside of the United States for collaborations and opportunities to learn with and from colleagues abroad. The work we have at hand is critical, and we need to share all the resources we have, especially those resources money cannot buy: a different perspective, diversity of language, and the shared desire to make the whole world, not just our little corner of it, a better place for all.
Natalie Baur is currently the Preservation Librarian at El Colegio de México in Mexico City, an institution of higher learning specializing in the humanities and social sciences. Previously, she served as the Archivist for the Cuban Heritage Collection at the University of Miami Libraries and was a 2015-2016 Fulbright-García Robles fellowship recipient, looking at digital preservation issues in Mexican libraries, archives and museums. She holds an M.A. in History and a certificate in Museum Studies from the University of Delaware and an M.L.S. with a concentration in Archives, Information and Records Management from the University of Maryland. She is also co-founder of the Desmantelando Fronteras webinar series and the Itinerant Archivists project. You can read more about her Fulbright-García Robles fellowship here.