Sarah Barsness: Digital Archivist’s Manual

When I joined the staff of the Minnesota Historical Society, the institution was in the process of morphing from a series of special projects to a programmatic approach to digital collecting.  The two archivists who held my position before me laid an excellent foundation for a sustainable program, and I was tasked with continuing the work they started.  I decided that I ought to kick things off by creating a manual for my job, to be written in between collections work and other projects.  

As you might imagine, the process was painfully slow.  Each time I began to think I was nearing the end, something would change.  Staff retired, departments were reorganized, or systems were upgraded — I began to despair of ever creating an actual finished product.  My many drafts also chart the change and growth of our communities of practice; the constantly evolving work of my fellow archivists, conservationists, and digital specialists of all stripes has profoundly shaped each version of this document.

Despite all this change, each iteration of this manual has expanded upon a few core ideas that have remained central to my approach:

  1. What if we treat digital collections just like we do any other collection that has preservation concerns?  What does that look like?  What tools and skills will be needed to support that approach?
  2. What if we assume that most details of the work will change regularly? Staffing, systems, workloads, formats collected, personal skill levels, even department organizations all change.  How can we structure this work to be maximally flexible?
  3. How can we make a framework to guide decisions for each collection, and to help the institution navigate more fundamental changes to the program over time?

The manual I began nearly seven years ago still isn’t finished in any traditional sense, but I’ve decided that’s how it ought to be.  Change is constant, and just because something isn’t static doesn’t mean it’s not ready to be used or shared.  I hope you find this document useful, and if you have feedback, ideas, or manuals of your own (finished or not), I sincerely hope you’ll share them with me.

https://docs.google.com/document/d/1ZghkuLRQRLEqVkTDe2BCldKpvNGMOfaDXWDmYEP5KpA/edit?usp=sharing


Sarah Barsness has been the Digital Collections Archivist at the Minnesota Historical Society for nearly 7 years, where she works with staff across the institution to process, store, preserve, and provide access to digital collections.  Sarah has worked previously at the Wisconsin Historical Society and at Cargill Corporate Archives.  For tweets about digital archives and Dungeons & Dragons, follow her on Twitter @SarahRBarsness.

Estimating Energy Use for Digital Preservation, Part II

by Bethany Scott

This post is part of our BloggERS Another Kind of Glacier series. Part I was posted last week.


Conclusions

While the findings of the carbon footprint analysis are predicated on our institutional context and practices, and therefore may be difficult to directly extrapolate to other organizations’ preservation programs, there are several actionable steps and recommendations that sustainability-minded digital preservationists can implement right away. Getting in touch with any campus sustainability officers and investigating environmental sustainability efforts currently underway can provide enlightening information – for instance, you may discover that a portion of the campus energy grid is already renewable-powered, or that your institution is purchasing renewable energy credits (RECs). In my case, I was previously not aware that UH’s Office of Sustainability has published an improvement plan outlining its sustainability goals, including a 10% total campus waste reduction, a 15% campus water use reduction, and a 35% reduction in energy expenditures for campus buildings – all of which will require institutional support from the highest level of UH administration as well as partners among students, faculty, and staff across campus. I am proud to consider myself a partner in UH campus sustainability and look forward to promoting awareness of and advocating for our sustainability goals in the future.

As Keith Pendergrass highlighted in the first post of this series, there are other methods by which digital preservation practitioners can reduce their power draw and carbon footprint, thereby increasing the sustainability of their digital preservation programs – from turning off machines when not in use or scheduling resource-intensive tasks for off-peak times, to making broader policy changes that incorporate sustainability principles and practices.

At UHL, one such policy change I would like to implement is a tiered approach to file format selection, through which we match the file formats and resolution of files created to the scale and scope of the project, the informational and research value of the content, the discovery and access needs of end users, and so on. Existing digital preservation policy documentation outlines file formats and specifications for preservation-quality archival masters for images, audio, and video files that are created through our digitization unit. However, as UHL conducts a greater number of mass digitization projects – and accumulates an ever larger number of high-resolution archival master files – greater flexibility is needed. By choosing to create lower-resolution files for some projects, we would reduce the total storage for our digital collections, thereby reducing our carbon footprint.

For instance, we may choose to retain large, high-resolution archival TIFFs for each page image of a medieval manuscript book, because researchers study minute details in the paper quality, ink and decoration, and the scribe’s lettering and handwriting. By contrast, a digitized UH thesis or dissertation from the mid-20th century could be stored long-term as one relatively small PDF, since the informational value of its contents (and not its physical characteristics) is what we are really trying to preserve. Similarly, we are currently discussing the workflow implications of providing an entire archival folder as a single PDF in our access system. Although the initial goal of this initiative was to make a larger amount of archival material quickly available online for patrons, the much smaller amount of storage needed to store one PDF vs. dozens or hundreds of high-res TIFF masters would also have a positive impact on the sustainability of the digital preservation and access systems.

UHL’s digital preservation policy also includes requirements for monthly fixity checking of a random sample of preservation packages stored in Archivematica, with a full fixity check of all packages to be conducted every three years during an audit of the overall digital preservation program. Frequent fixity checking is computationally intensive, though, and adds to the total energy expenditure of an institution’s digital preservation program. But in UHL’s local storage infrastructure, storage units run on the ZFS filesystem, which includes self-healing features such as internal checksum checks each time a read/write action is performed. This storage infrastructure was put in place in 2019, but we have not yet updated our policies and procedures for fixity checking to reflect the improved baseline durability of assets in storage.

Best practices calling for frequent fixity checks were developed decades ago – but modern technology like ZFS may be able to passively address our need for file integrity and durability in a less resource-intensive way. Through considered analysis matching the frequency of fixity checking to the features of our storage infrastructure, we may come to the conclusion that less frequent hands-on fixity checks, on a smaller random sample of packages, are sufficient moving forward. Since this is a new area of inquiry for me, I would love to hear thoughts from other digital preservationists about the pros and cons of such an approach – is hands-on fixity checking really the be-all and end-all, or could we use additional technological elements as part of a broader file integrity strategy over time?
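
To make the idea concrete, here is a minimal sketch of what a sampled fixity check over BagIt-style preservation packages could look like. The storage path and five-percent sample rate are purely illustrative, and this is not a description of UHL's actual procedure:

    import random
    import bagit
    from pathlib import Path

    AIP_ROOT = Path("/mnt/preservation/aips")   # hypothetical storage mount
    SAMPLE_RATE = 0.05                          # illustrative: check 5% of packages per run

    packages = [p for p in AIP_ROOT.iterdir() if p.is_dir()]
    sample_size = min(len(packages), max(1, round(len(packages) * SAMPLE_RATE)))
    sample = random.sample(packages, sample_size)

    failures = []
    for package in sample:
        bag = bagit.Bag(str(package))
        try:
            bag.validate()   # recomputes payload checksums and compares them to the manifests
        except bagit.BagValidationError as err:
            failures.append((package.name, str(err)))

    print(f"Checked {len(sample)} of {len(packages)} packages; {len(failures)} failure(s)")
    for name, err in failures:
        print(f"FIXITY FAILURE: {name}: {err}")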

Future work

I eagerly anticipate refining this electricity consumption research with exact figures and values (rather than estimates) when we are able to more consistently return to campus. We would like to investigate overhead costs such as lighting and HVAC in UHL’s server room, and we plan to capture point-in-time readings directly from the power distribution units in the racks. Also, there may be additional power statistics that our Sys Admin can capture from the VMware hosts – which would allow us to begin this portion of the research remotely in the interim. Furthermore, I plan to explore additional factors to provide a broader understanding of the impact of UHL’s energy consumption for digital systems and initiatives. By gaining more details on our total storage capacity, percentage of storage utilization, and GHG emissions per TB, we will be able to communicate about our carbon footprint in a way that will allow other libraries and archives to compare or estimate the environmental impact of their digital programs as well.

I would also like to investigate whether changes in preservation processes, such as the reduced hands-on fixity strategy outlined above, can have a positive impact on our energy expenditure – and whether this strategy can still provide a high level of integrity and durability for our digital assets over time. Finally, as a longer-term initiative I would like to take a deeper look at sustainability factors beyond energy expenditure, such as current practices for recycling e-waste on campus or a possible future life-cycle assessment for our hardware infrastructure. Through these efforts, I hope to help improve the long-term sustainability of UHL’s digital initiatives, and to aid other digital preservationists to undertake similar assessments of their programs and institutions as well.


Bethany Scott is Digital Projects Coordinator at the University of Houston Libraries, where she is a contributor to the development of the BCDAMS ecosystem incorporating Archivematica, ArchivesSpace, Hyrax, and Avalon. As a representative of UH Special Collections, she contributes knowledge on digital preservation, born-digital archives, and archival description to the BCDAMS team.

Estimating Energy Use for Digital Preservation, Part I

by Bethany Scott

This post is part of our BloggERS Another Kind of Glacier series. Part II will be posted next week.


Although the University of Houston Libraries (UHL) has taken steps over the last several years to initiate and grow an effective digital preservation program, until recently we had not yet considered the long-term sustainability of our digital preservation program from an environmental standpoint. As the leader of UHL’s digital preservation program, I aimed to address this disconnect by gathering information on the technology infrastructure used for digital preservation activities and its energy expenditures in collaboration with colleagues from UHL Library Technology Services and the UH Office of Sustainability. I also reviewed and evaluated the requirements of UHL’s digital preservation policy to identify areas where the overall sustainability of the program may be improved in the future by modifying current practices.

Inventory of equipment

I am fortunate to have a close collaborator in UHL’s Systems Administrator, who was instrumental in the process of implementing the technical/software elements of our digital preservation program over the past few years. He provided a detailed overview of our hardware and software infrastructure, both for long-term storage locations and for processing and workflows.

UHL’s digital access and preservation environment is almost 100% virtualized, with all of the major servers and systems for digital preservation – notably, the Archivematica processing location and storage service – running as virtual machines (VMs). The virtual environment runs on VMware ESXi and consists of five physical host servers that are part of a VMware vSAN cluster, which aggregates the disks across all five host servers into a single storage datastore.

VMs where Archivematica’s OS and application data reside may have their virtual disk data spread across multiple hosts at any given time. Therefore, exact resource use for digital preservation processes running via Archivematica is difficult to distinguish or pinpoint from other VM systems and processes, including UHL’s digital access systems. After discussing possible approaches for calculating the energy usage, we decided to take a generalized or blanket approach and include all five hosts. This calculation thus represents the energy expenditure for not only the digital preservation system and storage, but also for the A/V Repository and Digital Collections access systems. At UHL, digital access and preservation are strongly linked components of a single large ecosystem, so the decision to look at the overall energy expenditure makes sense from an ecosystem perspective.

In addition to the VM infrastructure described above, all user and project data is housed in the UHL storage environment. The storage environment includes both local shared network drive storage for digitized and born-digital assets in production, and additional shares that are not accessible to content producers or other end users, where data is processed and stored to be later served up by the preservation and access systems. Specifically, with the Archivematica workflow, preservation assets are processed through a series of automated preservation actions including virus scanning, file format characterization, fixity checking, and so on, and are then transferred and ingested to secure preservation storage.

UHL’s storage environment consists of two servers: a production unit and a replication unit. Archivematica’s processing shares are not replicated, but the end storage share is replicated. Again, for purposes of simplification, we generalized that both of these resources are being used as part of the digital preservation program when analyzing power use. Finally, within UHL’s server room there is a pair of redundant network switches that tie all the virtual and storage components together.

The specific hardware components that make up the digital access and preservation infrastructure described above include:

  • One (1) production storage unit: iXsystems TrueNAS M40 HA (Intel Xeon Silver 4114 CPU @ 2.2 GHz and 128 GB RAM)
  • One (1) replication storage unit: iXsystems FreeNAS IXC-4224 P-IXN (Intel Xeon CPU E5-2630 v4 @ 2.2 GHz and 128 GB RAM)
  • Two (2) disk expansion shelves: iXsystems ES60
  • Five (5) VMware ESXi hosts: Dell PowerEdge R630 (Intel Xeon CPU E5-2640 v4 @ 2.4 GHz and 192 GB RAM)
  • Two (2) network switches: HPE Aruba 3810M 16SFP+ 2-slot

Electricity usage

Each of the hardware components listed above has two power supplies. However, the power draw is not always running at the maximum available for those power supplies and is dependent on current workloads, how many disks are in the units, and so on. Therefore, the power being drawn can be quantified but will vary over time.

With the unexpected closure of the campus due to COVID-19, I conducted this analysis remotely with the help of the UH campus Sustainability Coordinator. We compared the estimated maximum power draw based on the technical specifications for the hardware components, the draw when idle, and several partial power draw scenarios, with the understanding that the actual numbers will likely fall somewhere in this range.

Estimated power use and greenhouse gas emissions

Scenario   Daily Usage Total (Watts)   Annual Total (kWh)   Annual GHG (lbs)
Max        9,094                       79,663.44            124,175.71
95%        8,639.3                     75,680.268           117,966.92
90%        8,184.6                     71,697.096           111,758.14
85%        7,729.9                     67,713.924           105,549.35
80%        7,275.2                     63,730.752           99,340.565
Idle       5,365.46                    47,001.43            73,263.666

The estimated maximum annual greenhouse gas emissions derived from power use for the digital access and preservation hardware is over 124,000 pounds, or approximately 56.3 metric tons. To put this in perspective, it’s equivalent to the GHG emissions from nearly 140,000 miles driven by an average passenger vehicle, and to the carbon dioxide emissions from 62,063 pounds of coal burned or 130 barrels of oil consumed. While I hope to refine this analysis further in the future, for now these figures can serve as an entry point to discussions on the importance of environmental sustainability actions – and our plans to reduce our consumption – with Libraries administration, colleagues in the Office of Sustainability, and other campus leaders.
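
For anyone who wants to reproduce the arithmetic, the short sketch below shows how a continuous wattage figure becomes an annual kWh total and a GHG estimate. The emissions factor here is simply the one implied by the table above (roughly 1.56 lbs CO2e per kWh); your local grid factor will differ:

    LBS_CO2E_PER_KWH = 124175.71 / 79663.44   # emissions factor implied by the table above (~1.559)
    LBS_PER_METRIC_TON = 2204.62

    def annual_kwh(watts):
        """Continuous draw in watts -> annual kilowatt-hours."""
        return watts * 24 * 365 / 1000

    def annual_ghg_lbs(watts):
        """Annual greenhouse gas estimate in pounds of CO2 equivalent."""
        return annual_kwh(watts) * LBS_CO2E_PER_KWH

    max_draw = 9094                     # estimated maximum draw in watts, from the table
    kwh = annual_kwh(max_draw)          # ~79,663 kWh
    ghg = annual_ghg_lbs(max_draw)      # ~124,176 lbs
    print(f"{kwh:,.0f} kWh/yr, {ghg:,.0f} lbs CO2e/yr, {ghg / LBS_PER_METRIC_TON:.1f} metric tons")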

Part II, including conclusions and future work, will be posted next week.


Bethany Scott is Digital Projects Coordinator at the University of Houston Libraries, where she is a contributor to the development of the BCDAMS ecosystem incorporating Archivematica, ArchivesSpace, Hyrax, and Avalon. As a representative of UH Special Collections, she contributes knowledge on digital preservation, born-digital archives, and archival description to the BCDAMS team.

An intern’s experience: Preserving Jok Church’s Beakman

by Matt McShane

I seem to have a real penchant for completing my schoolings in the middle of “once-in-a-lifetime” economic crises: first the 2008 housing recession, and now a global pandemic. And while the current situation has somewhat altered the final semester of my MLIS program—not to mention many others’ situations much more intensely—I was still able to have a very engaging and rewarding practicum experience at The Ohio State University Libraries, working on an incredible digital collection accessioned by the Billy Ireland Cartoon Library and Museum. 

U Can with Beakman and Jax might be best known to a lot of people as the predecessor of the short-lived Saturday morning live-action science show Beakman’s World, but the comic is arguably more successful than the television show it spawned. With an international readership and a run that lasted more than 25 years, it was a success story that entertained and educated readers over generations. It was also the first syndicated newspaper comic to be entirely digitally drawn and distributed. Jok Church, the creator and author, used Adobe Illustrator throughout the run of the comic, and saved and migrated the files in various stages of creation through multiple hard drives. With the exception of a few gaps, the entire run was saved on Jok’s hard drive at the time of his death in April 2016. These are the files we received at University Libraries.

Richard Bolingbroke, a friend of Jok’s and executor of his estate, donated the collection to the Billy Ireland. He also provided us with an in-progress biography of Jok, which gave insight into who he was as a person beyond his work with Beakman and Jax, as well as a condensed history of the publication. This will be useful as the Billy Ireland creates author metadata and information for the collection.

Richard provided us access to direct copies of twenty-four folders via Dropbox, containing nearly 10,000 files, which we downloaded to our local processing server. Each folder contained a year’s worth of comics, from 1993 to 2016, though the folders for 1995 and 1996 were empty due to a hard drive failure Jok had experienced. We’re still exploring whether any backups from those years survive. In the existing folders, though, we found not only the many years’ worth of terrific comic content, but also a glimpse into Jok’s creative and organizational process. An initial DROID scan of the contents found over 2,000 duplicate files scattered throughout. After speaking with Richard about this, we determined it to be the result of an accidental copy/paste. Rather than manipulate the existing archival collection, we decided to create a distribution collection better organized for user access to the works, while maintaining the archival integrity of the donated collection.

Before either of those goals could be reached, though, we had to deal with a second major issue: file extensions. We found nearly 1,300 files without extensions in the collection, which we determined to be a result of older Mac OS versions not appending extensions to files. Adobe Illustrator produces both .ai and .eps file types. There are other file types in the collection, but these are the primary types for each work. It was impossible to determine which files were .ai versus .eps at a batch level, so we manually examined the EXIF metadata of each extensionless file to determine its proper extension. Using Bulk Rename Utility, we were able to semi-batch the extension appending, but it still required a fair amount of manual labor due to the intermingled nature of the different file types within subfolders.
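
For illustration only, a rough header-based triage of extensionless Illustrator files might look something like the sketch below (this is not the Bulk Rename Utility process described above, and the path and header heuristics are assumptions); genuinely ambiguous files would still need manual review:

    from pathlib import Path

    SOURCE = Path("/data/beakman_working")   # hypothetical path to the working copy, not the archival set

    def guess_extension(path):
        """Very rough header-based guess; anything ambiguous still needs human review."""
        with path.open("rb") as f:
            header = f.read(16)
        if header.startswith(b"%PDF"):
            return ".ai"    # newer Illustrator files are PDF-based
        if header.startswith(b"%!PS-Adobe") or header.startswith(b"\xc5\xd0\xd3\xc6"):
            return ".eps"   # ASCII PostScript or DOS EPS binary header
        return ""           # unknown: leave for manual EXIF inspection

    for f in SOURCE.rglob("*"):
        if f.is_file() and not f.suffix:
            print(f"{f}\t{guess_extension(f) or 'REVIEW MANUALLY'}")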

Even though create dates within the EXIF metadata were unreliable (different versions of Illustrator had been used to open the files over the years), Jok named and organized his files by publication date, which gave us reliable organizational metadata for our distribution collection. His file and folder organization did shift throughout the years, which is understandable over two decades and who knows how many machines, so creating and organizing the distribution collection in a standardized file-name and subfolder format required a fair bit of manual labor. The comic was published weekly, albeit with some breaks, and there are typically four versions of each finished strip: portrait and landscape layouts, each in black and white and in color. We created a year\month\date folder tree based on how the largest portion of Jok’s files were organized. Once that was completed, we shifted focus to Ohio State’s accessibility standards and investigated a batch workflow to convert the comic files to PDF/A. Unfortunately, we could not achieve PDF/A compliance due to the nature of the original files; additionally, the “batch” processing requires significant human interaction.

Further complicating matters, the COVID-19 global pandemic hit Ohio while we were working through this. In response, Ohio State directed all non-essential personnel to move to telework, which cut off my access to the server behind the University’s firewall for the remainder of my internship. As a result, we had to put the completion of this project on indefinite hold. Despite these extreme circumstances preventing me from seeing the collection all the way through to public hands, I was able to leave it in an organized state, ready for file conversion and metadata creation.

I learned a lot by being able to handle the collection from the beginning, untouched. One of the biggest takeaways was the importance of gathering information about the collection and its creator up front. Creating a manifest of the objects within the collection is clearly necessary for knowing how the collection should be preserved and accessed, but it also allowed us to see problems in the collection, such as the significant number of duplicates and the files without extensions. Having this knowledge up front allowed us to better plan our approach to the collection. Based on my experience with this project, I have suggested to my program’s faculty that students get more exposure to “messy” digital collections.

The other key takeaway I discovered was that sometimes it is best to dirty your hands and perform tasks manually. Digital preservation can have a lot of automated shortcuts compared to processing traditional analog collections, but not everything can or should be done through batch processes. While it may be technically possible to program a process, it may not be the best use of time or effort. Part of workflow development is recognizing when building an automated solution costs more time and effort than performing the task manually. It may have been possible to code a script to identify and append the file extensions for our objects missing them, but the effort and time to learn, write, and troubleshoot that script likely would have been greater than the somewhat tedious work of doing it by hand in this instance. Alternatively, it might be worth looking into automated scripting if this were a significantly larger collection of mislabeled or disorganized objects. Having a good understanding of cost and benefit is important when approaching a problem that can have multiple solutions.

My time on-site with The Ohio State University Libraries was a bit shorter than I had intended, but it still provided me with a great experience and helped to solidify my love for the digital preservation process and work. The fact that U Can with Beakman and Jax is the first digitally created syndicated newspaper comic makes the whole experience that much more apt and impactful. Even though some aspects of work are in limbo at the moment, I am confident that this terrific collection of Jok’s work will be available for the public to enjoy and learn from. Even if I am not able to fully carry the work over the finish line, I am thankful for the opportunity to work on it as much as I did. 


Matt McShane, a recent MLIS graduate from Kent State University, is currently focused on landing a role with a cultural heritage institution where he can work hands-on with digital collections, digital preservation, and influence broader preservation policy.

Integrating Environmental Sustainability into Policies and Workflows

by Keith Pendergrass

This is the first post in the BloggERS Another Kind of Glacier series.


Background and Challenges

My efforts to integrate environmental sustainability and digital preservation in my organization—Baker Library Special Collections at Harvard Business School—began several years ago when we were discussing the long-term preservation of forensic disk images in our collections. We came to the conclusion that keeping forensic images instead of (or in addition to) the final preservation file set can have ethical, privacy, and environmental issues. We decided that we would preserve forensic images only in use cases where there was a strong need to do so, such as a legal mandate in our records management program. I talked about our process and results at the BitCurator Users Forum 2017.

From this presentation grew a collaboration with three colleagues who heard me speak that day: Walker Sampson, Tessa Walsh, and Laura Alagna. Together, we reframed my initial inquiry to focus on environmental sustainability and enlarged the scope to include all digital preservation practices and the standards that guide them. The result was our recent article and workshop protocol.

During this time, I began aligning our digital archives work at Baker Library with this research as well as our organization-wide sustainability goals. My early efforts mainly took the form of the stopgap measures that we suggest in our article: turning off machines when not in use; scheduling tasks for off-peak network and electricity grid periods; and purchasing renewable energy certificates that promote additionality, which is done for us by Harvard University as part of its sustainability goals. As these were either unilateral decisions or were being done for me, they were straightforward and quick to implement.

To make more significant environmental gains along the lines of the paradigm shift we propose in our article, however, requires greater change. This, in turn, requires more buy-in and collaboration within and across departments, which often slows the process. In the face of immediate needs and other constraints, it can be easy for decision makers to justify deprioritizing the work required to integrate environmental sustainability into standard practices. With the urgency of the climate and other environmental crises, this can be quite frustrating. However, with repeated effort and clear reasoning, you can make progress on these larger sustainability changes. I found success most often followed continual reiteration of why I wanted to change policy, procedure, or standard practice, with a focus on how the changes would better align our work and department with organizational sustainability goals. Another key argument was showing how our efforts for environmental sustainability would also result in financial and staffing sustainability.

Below, I share examples of the work we have done at Baker Library Special Collections to include environmental sustainability in some of our policies and workflows. While the details may be specific to our context, the principles are widely applicable: integrate sustainability into your policies so that you have a strong foundation for including environmental concerns in your decision making; and start your efforts with appraisal as it can have the most impact for the time that you put in.

Policies

The first policy in which we integrated environmental sustainability was our technology change management policy, which controls our decision making around the hardware and software we use in our digital archives workflows. The first item we added to the policy was that we must dispose of all hardware following environmental standards for electronic waste and, for items other than hard drives, that we must donate them for reuse whenever possible. The second item involved more collaboration with our IT department, which controls computer refresh cycles, so that we could move away from the standard five-year replacement timeframe for desktop computers. The workstations that we use to capture, appraise, and process digital materials are designed for long service lives, heavy and sustained workloads, and easy component change out. We made our case to IT—as noted above, this was an instance where the complementarity of environmental and financial sustainability was key—and received an exemption for our workstations, which we wrote into our policy to ensure that it becomes standard practice.

We can now keep the workstations as long as they remain serviceable and work with IT to swap out components as they fail or need upgrading. For example, we replaced our current workstations’ six-year-old spinning disk drives with solid state drives when we updated from Windows 7 to Windows 10, improving performance while maintaining compliance with IT’s security requirements. Making changes like this allows us to move from the standard five-year to an expected ten-year service life for these workstations (they are currently at 7.5 years). While the policy change and subsequent maintenance actions are small, they add up over time to provide substantial reductions in the full life-cycle environmental and financial costs of our hardware.

We also integrated environmental sustainability into our new acquisition policy. The policy outlines the conditions and terms of several areas that affect the acquisition of materials in any format: appraisal, agreements, transfer, accessioning, and documentation. For appraisal, we document the value and costs of a potential acquisition, but previously had been fairly narrow in our definition of costs. With the new policy, we broadened the costs that were in scope for our acquisition decisions and as part of this included environmental costs. While only a minor point in the policy, it allows us to determine environmental costs in our archival and technical appraisals, and then take those costs into account when making an acquisition decision. Our next step is to figure out how best to measure or estimate environmental impacts for consistency across potential acquisitions. I am hopeful that explicitly integrating environmental sustainability into our first decision point—whether to acquire a collection—will make it easier to include sustainability in other decision points throughout the collection’s life cycle.

Workflows

In a parallel track, we have been integrating environmental sustainability into our workflows, focusing on the appraisal of born-digital and audiovisual materials. This is a direct result of the research article noted above, in which we argue that focusing on selective appraisal can be the most consequential action because it affects the quantity of digital materials that an organization stewards for the remainder of those materials’ life cycle and provides an opportunity to assign levels of preservation commitment. While conducting in-depth appraisal prior to physical or digital transfer is ideal, it is not always practical, so we altered our workflows to increase the opportunities for appraisal after transfer.

For born-digital materials, we added an appraisal point during the initial collection inventory, screening out storage media whose contents are wholly outside of our collecting policy. We then decide on a capture method based on the type of media: we create disk images of smaller-capacity media but often package the contents of larger-capacity media using the bagit specification (unless we have a use case that requires a forensic image) to reduce the storage capacity needed for the collection and to avoid the ethical and privacy issues previously mentioned. When we do not have control of the storage media—for network attached storage, cloud storage, etc.—we make every attempt to engage with donors and departments to conduct in-depth appraisal prior to capture, streamlining the remaining appraisal decision points.

After capture, we conduct another round of appraisal now that we can more easily view and analyze the digital materials across the collection. This tends to be a higher-level appraisal during which we make decisions about entire disk images or bagit bags, or large groupings within them. Finally (for now), we conduct our most granular and selective appraisal during archival processing when processing archivists, curators, and I work together to determine what materials should be part of the collection’s preservation file set. As our digital archives program is still young, we have not yet explored re-appraisal at further points of the life cycle such as access, file migration, or storage refresh.

For audiovisual materials, we follow a similar approach as we do for born-digital materials. We set up an audiovisual viewing station with equipment for reviewing audiocassettes, microcassettes, VHS and multiple Beta-formatted video tapes, multiple film formats, and optical discs. We first appraise the media items based on labels and collection context, and with the viewing station can now make a more informed appraisal decision before prioritizing for digitization. After digitization, we appraise again, making decisions on retention, levels of preservation commitment, and access methods.

While implementing multiple points of selective appraisal throughout workflows is more labor intensive than simply conducting an initial appraisal, several arguments moved us to take this approach: it is a one-time labor cost that helps us reduce on-going storage and maintenance costs; it allows us to target our resources to those materials that have the most value for our community; it decreases the burden of reappraisal and other information maintenance work that we are placing on future archivists; and, not least, it reduces the on-going environmental impact of our work.


Keith Pendergrass is the digital archivist for Baker Library Special Collections at Harvard Business School, where he develops and oversees workflows for born-digital materials. His research and general interests include integration of sustainability principles into digital archives standard practice, systems thinking, energy efficiency, and clean energy and transportation. He holds an MSLIS from Simmons College and a BA from Amherst College.

What’s Your Set-up?: Processing Digital Records at UAlbany (part 2)

by Gregory Wiedeman


In the last post I wrote about the theoretical and technical foundations for our born-digital records set-up at UAlbany. Here, I try to show the systems we use and how they work in practice.

Systems

ArchivesSpace

ArchivesSpace manages all archival description. Accession records and top level description for collections and file series are created directly in ArchivesSpace, while lower-level description, containers, locations, and digital objects are created using asInventory spreadsheets. Overnight, all modified published records are exported using exportPublicData.py and indexed into Solr using indexNewEAD.sh. This Solr index is read by ArcLight.

ArcLight

ArcLight provides discovery and display for archival description exported from ArchivesSpace. It uses URIs from ArchivesSpace digital objects to point to digital content in Hyrax while placing that content in the context of archival description. ArcLight is also really good at systems integration because it allows any system to query it through an unauthenticated API. This allows Hyrax and other tools to easily query ArcLight for description records.
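
As a rough illustration of that kind of integration, a client could pull description out of ArcLight's search endpoint with something like the sketch below. The URL, parameters, and response handling are assumptions based on a generic Blacklight-style JSON API and will vary with the ArcLight version and local configuration:

    import requests

    ARCLIGHT_URL = "https://archives.example.edu/catalog"   # hypothetical ArcLight instance

    def lookup_component(ref_id):
        """Ask ArcLight's public search endpoint for description matching an ArchivesSpace ref_id."""
        response = requests.get(
            ARCLIGHT_URL,
            params={"q": ref_id, "format": "json"},   # Blacklight-style JSON response (assumed)
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    results = lookup_component("aspace_0ab12cd34ef")   # illustrative identifier
    print(list(results.keys()))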

Hyrax

Hyrax manages digital objects and item-level metadata. Some objects have detailed Dublin Core-style metadata, while other objects only have an ArchivesSpace identifier. Some custom client-side JavaScript uses this identifier to query ArcLight for more description to contextualize the object and provide links to more items. This means users can discover a file that does not have detailed metadata, such as Minutes, and Hyrax will display the Scope and Content note of the parent series along with links to more series and collection-level description.

Storage

Our preservation storage uses network shares managed by our university data center. We limit write access to the SIP and AIP storage directories to one service account used only by the server that runs the scheduled microservices. This means that only tested automated processes can create, edit, or delete SIPs and AIPs. Archivists have read-only access to these directories, which contain standard bags generated by BagIt-python that are validated against BagIt Profiles. Microservices also place a copy of all SIPs in a processing directory where archivists have full access to work directly with the files. These processing packages have specific subdirectories for master files, derivatives, and metadata. This allows other microservices to be run on them with just the package identifier. So, if you needed to batch create derivatives or metadata files, the microservices know which directories to look in.

The microservices themselves have built-in checks in place; for example, they verify that a valid AIP exists before deleting a SIP. The data center also has some low-level preservation features in place, and we are working to build additional preservation services that will run asynchronously from the rest of our processing workflows. This system is far from perfect, but it works for now, and at the end of the day we are relying on the permanent positions in our department, in Library Systems, and in university IT to keep these files available long-term.

Microservices

These microservices are the glue that keeps most of our workflows working together. Most of the links here point to code on our GitHub page, but we’re also trying to add public information on these processes to our documentation site.

asInventory

This is a basic Python desktop app for managing lower-level description in ArchivesSpace through Excel spreadsheets using the API. Archivists can place a completed spreadsheet in a designated asInventory input directory and double-click an .exe file to add new archival objects to ArchivesSpace. A separate .exe can export all the child records from a resource or archival object identifier. The exported spreadsheets include the identifier for each archival object, container, and location, so we can easily roundtrip data from ArchivesSpace, edit it in Excel, and push the updates back into ArchivesSpace. 

We have since built our born digital description workflow on top of asInventory. The spreadsheet has a “DAO” column and will create a digital object using a URI that is placed there. An archivist can describe digital records in a spreadsheet while adding Hyrax URLs that link to individual or groups of files.

We have been using asInventory for almost 3 years, and it does need some maintenance work. Shifting a lot of the code to the ArchivesSnake library will help make this easier, and I also hope to find a way to eliminate the need for a GUI framework so it runs just like a regular script.
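
For a sense of what an ArchivesSnake-based rewrite might look like, here is a stripped-down sketch that reads rows from a spreadsheet and posts archival objects through the API. The spreadsheet columns, repository number, and field mapping are invented for illustration; the real asInventory templates and code are more involved:

    from asnake.client import ASnakeClient
    from openpyxl import load_workbook

    client = ASnakeClient(
        baseurl="https://aspace.example.edu/api",   # hypothetical ArchivesSpace API endpoint
        username="inventory_bot",
        password="********",
    )
    client.authorize()

    # Hypothetical four-column spreadsheet: title, date expression, parent URI, resource URI
    wb = load_workbook("inventory.xlsx")
    for title, date, parent_uri, resource_uri in wb.active.iter_rows(min_row=2, values_only=True):
        record = {
            "title": title,
            "level": "file",
            "resource": {"ref": resource_uri},   # e.g. /repositories/2/resources/123
        }
        if date:
            record["dates"] = [{"expression": str(date), "label": "creation", "date_type": "inclusive"}]
        if parent_uri:
            record["parent"] = {"ref": parent_uri}
        resp = client.post("repositories/2/archival_objects", json=record)
        print(resp.status_code, resp.json())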

Syncing scripts

The ArchivesSpace-ArcLight-Workflow GitHub repository is a set of scripts that keeps our systems connected and up-to-date. exportPublicData.py ensures that all published description in ArchivesSpace is exported each night, and indexNewEAD.sh indexes this description into Solr so it can be used by ArcLight. processNewUploads.py is the most complex process. This script takes all new digital objects uploaded through the Hyrax web interface, stores preservation copies as AIPs, and creates digital object records in ArchivesSpace that point to them. Part of what makes this step challenging is that Hyrax does not have an API, so the script uses Solr and a web scraper as a workaround.
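
Because Hyrax lacks a public API, the Solr side of that workaround might look roughly like the sketch below. The Solr URL and field names (has_model_ssim, system_create_dtsi, and so on) are assumptions about a typical Samvera index rather than documentation of our exact setup:

    import requests

    SOLR = "http://localhost:8983/solr/hyrax"   # hypothetical Solr core behind Hyrax

    params = {
        "q": "has_model_ssim:GenericWork",                # assumed work type
        "fq": "system_create_dtsi:[NOW-1DAY TO NOW]",     # objects created in the last day
        "fl": "id,title_tesim",
        "wt": "json",
        "rows": 100,
    }
    response = requests.get(f"{SOLR}/select", params=params, timeout=30)
    response.raise_for_status()
    for doc in response.json()["response"]["docs"]:
        print(doc["id"], doc.get("title_tesim"))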

These scripts sound complicated, but they have been relatively stable over the past year or so. I hope we can work on simplifying them too, by relying more on ArchivesSnake and moving some separate functions to other smaller microservices. One example is how the ASpace export script also adds a link for each collection to our website. We can simplify this by moving this task to a separate, smaller script. That way, when one script breaks or needs to be updated, it would not affect the other function.

Ingest and Processing scripts

These scripts process digital records by uploading metadata for them in our systems and moving them to our preservation storage.

  • ingest.py packages files as a SIP and optionally updates ArchivesSpace accession records by adding dates and extents.
  • We have standard transfer folders for some campus offices, with designated paths for new records and log files along with metadata about the transferring office. transferAccession.py runs ingest.py but uses the transfer metadata to create accession records and produces spreadsheet log files so offices can see what they transferred.
  • confluence.py scrapes files from our campus’s Confluence wiki system, so for offices that use Confluence all I need is access to their page to periodically transfer records.
  • convertImages.py makes derivative files. This is mostly designed for image files, such as batch converting TIFFs to JPGs or PDFs.
  • listFiles.py is very handy. All it does is create a text file that lists all filenames and paths in a SIP. These can then be easily copied into a spreadsheet (a sketch of this kind of helper follows this list).
  • An archivist can arrange records by creating an asInventory spreadsheet that points to individual or groups of files. buildHyraxUpload.py then creates a TSV file for uploading these files to Hyrax with the relevant ArchivesSpace identifiers.
  • updateASpace.py takes the output TSV from uploading to Hyrax and updates the same inventory spreadsheets. These can then be uploaded back into ArchivesSpace which will create digital objects that point to Hyrax URLs.
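
To give a sense of how small some of these helpers are, a listFiles.py-style script can be little more than a directory walk; this is an illustrative sketch, not the actual code in the repository:

    import os
    import sys

    package_dir = sys.argv[1]   # path to the SIP's processing copy

    with open("filelist.txt", "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(package_dir):
            for name in sorted(files):
                # One path per line, ready to paste into an asInventory spreadsheet
                out.write(os.path.join(root, name) + "\n")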

SIP and AIP Package Classes

These classes are extensions of the Bagit-python library. They contain a number of methods that are used by other microservices. This lets us easily create() or load() our specific SIP or AIP packages and add files to them. They also include complex things like getting a human-readable extent and date ranges from the filesystem. My favorite feature might be clean() which removes all Thumbs.db, desktop.ini, and .DS_Store files as the package is created.
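
A heavily simplified sketch of this kind of class is below, showing only clean(), create(), and load(); the real classes in the repository do much more (extents, date ranges, profile validation), so treat this strictly as an outline:

    import os
    import bagit

    JUNK_FILES = {"Thumbs.db", "desktop.ini", ".DS_Store"}

    class Package:
        """Minimal stand-in for the SIP/AIP classes built on top of BagIt-python."""

        def __init__(self, path):
            self.path = path
            self.bag = None

        def clean(self):
            # Strip OS cruft before the payload gets bagged
            for root, dirs, files in os.walk(self.path):
                for name in files:
                    if name in JUNK_FILES:
                        os.remove(os.path.join(root, name))

        def create(self, identifier):
            self.clean()
            self.bag = bagit.make_bag(
                self.path,
                bag_info={"External-Identifier": identifier},
                checksums=["sha256"],
            )
            return self.bag

        def load(self):
            self.bag = bagit.Bag(self.path)
            return self.bag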

Example use case

  1. Wild records appear! A university staff member has placed records of the University Senate from the past year in a standard folder share used for transfers.
  2. An archivist runs transferAccession.py, which creates an ArchivesSpace accession record using some JSON in the transfer folder and technical metadata from the filesystem (modified dates and digital extents). It then packages the files using BagIt-python and places one copy in the read-only SIP directory and a working copy in a processing directory.
    • For outside acquisitions, the archivists usually download, export, or image the materials themselves and create an accession record manually. Then, ingest.py packages these materials and adds dates and extents to the accession records when possible.
  3. The archivist makes derivative files for access or preservation. Since there is a designated derivatives directory in the processing package, the archivists can use a variety of manual tools or run other microservices using the package identifier. Scripts such as convertImages.py can batch convert or combine images and PDFs; other scripts for processing email are still being developed.
  4. The archivist then runs listFiles.py to get a list of file paths and copies them into an asInventory spreadsheet.
  5. The archivist arranges the records within the University Senate Records. They might create a new subseries and use that identifier in an asInventory spreadsheet to upload a list of files and then download them again to get a list of ref_ids.
  6. The archivist runs buildHyraxUpload.py to create a tab-separated values (TSV) file for uploading files to Hyrax using the description and ref_ids from the asInventory spreadsheet.
  7. After uploading the files to Hyrax, the archivist runs updateASpace.py to add the new Hyrax URLs to the same asInventory spreadsheet and uploads them back to ArchivesSpace. This creates new digital objects that point to Hyrax.

Successes and Challenges

Our set-up will always be a work in progress, and we hope to simplify, replace, or improve most of these processes over time. Since Hyrax and ArcLight have been in place for almost a year, we have noticed some aspects that are working really well and others that we still need to improve on.

I think the biggest success was customizing Hyrax to rely on description pulled from ArcLight. This has proven to be dependable and has allowed us to make significant amounts of born-digital and digitized materials available online without requiring detailed item-level metadata. Instead, we rely on high-level archival description and whatever information we can use at scale from the creator or the file system.

Suddenly we have a backlog. Since description is no longer the biggest barrier to making materials available, the holdup has been the parts of the workflow that require human intervention. Even though we are doing more with each action, large amounts of materials are still held up waiting for a human to process them. The biggest bottlenecks are working with campus offices and donors as well as arrangement and description.

There is also a ton of spreadsheets. I think this is a good thing, as we have discovered many cases where born-digital records come with some kind of existing description, but it often requires data cleaning and transformation. One collection came with authors, titles, and abstracts for each of a few thousand PDF files, but that metadata was trapped in hand-encoded HTML files from the 1990s. Spreadsheets are a really good tool for straddling the divide between the automated and manual processes required to save this kind of metadata, and they are a comfortable environment for many archivists to work in.[1]

You may have noticed that the biggest needs we have now—donor relations, arrangement and description, metadata cleanup—are things archivists are already really good at and comfortable with. It turned out that once we had effective digital infrastructure in place, it created further demands on archivists and traditional archival processes.

This brings us to the biggest challenge we face now. Since our set-up often requires comfort on the command line, we have severely limited the number of archivists who can work on these materials and required non-archival skills to perform basic archival functions. We are trying to mitigate this in some respects by better distributing individual stages for each collection and providing more documentation. Still, this has clearly been a major flaw, as we need to meet users (in this case other archivists) where they are rather than place further demands on them.[2]


Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports the implementation and development of the department’s archival systems.

What’s Your Setup?: National Library of New Zealand Te Puna Mātauranga o Aotearoa

By Valerie Love

Introduction

The Alexander Turnbull Library holds the archives and special collections for the National Library of New Zealand Te Puna Mātauranga o Aotearoa (NLNZ). While digital materials have existed in the Turnbull Library’s collections since the 1980s, the National Library began to formalise its digital collecting and digital preservation policies in the early 2000s, and established the first Digital Archivist roles in New Zealand. In 2008, the National Library launched the National Digital Heritage Archive (NDHA), which now holds 27 million files across 222 different formats, totalling 311 terabytes.

Since the launch of the NDHA, there has been a marked increase in the size and complexity of incoming digital collections. Collections currently come to the Library on a combination of obsolete and contemporary media, as well as electronic transfer, such as email or File Transfer Protocol (FTP).

Digital Archivists’ workstation setup and equipment

Most staff at the National Library use either a Windows 10 Microsoft Surface Pro or HP EliteBook i5 at a docking station with two monitors to allow for flexibility in where they work. However, the Library’s Digital Archivists have specialised setups to support their work with large digital collections. The computers and workstations below are listed in order of frequency of usage. 

Computers and workstations

  1. HP Z200 i7 workstation tower

The Digital Archivists’ main device for ingesting and processing digital collections is an HP Z220 i7 workstation tower. The Z220s have a built-in read/write optical disc drive, as well as USB and FireWire ports.

  2. HP EliteBook i7

Our second most frequently used device is an HP EliteBook i7, which we use for electronic transfers of contemporary content. Our web archivists also use these for harvesting websites and running social media crawls. As there are only a handful of digital archivists in Aotearoa New Zealand, we do a significant amount of training and outreach to archives and organisations that don’t have a dedicated digital specialist on staff. Having a portable device as well as our desktop setups is extremely useful for meetings and workshops offsite.

  3. MacBook Pro (15-inch, 2017)

The Alexander Turnbull Library is a collecting institution, and we often receive creative works from authors, composers, and artists. We regularly encounter portable hard drives, floppy disks, zip disks, and even optical discs which have been formatted for a Mac operating system, and are incompatible with our corporate Windows machines. And so, MacBook Pro to the rescue! Unfortunately, the MacBook Pro only has ports for USB-C, so we keep several USB-C to USB adapters on hand. The MacBook has access to staff wifi, but is not connected to the corporate network. We’ve recently begun to investigate using HFS+ for Windows software in order to be able to see Macintosh file structures on our main ingest PCs.

  4. Digital Intelligence FRED Forensic Workstation

If we can’t read content on either our corporate machines or the MacBook Pro, then our friend FRED is our next port of call. FRED is a Forensic Recovery of Evidence Device, and includes a variety of ports and drives with write blockers built in. We have a 5.25 inch floppy disk drive attached to the FRED, and also use it to mount internal hard drives removed from donated computers and laptops. We don’t create disk images by default on our other workstations, but if a collection is tricky enough to merit the FRED, we will create disk images for it, generally using FTK Imager. The FRED has its own isolated network connection separate from the corporate network so we can analyse high risk materials without compromising the Library’s security.

  5. Standalone PC

Adjacent to the FRED, we have an additional non-networked PC (also an HP Z200 i7 workstation tower) where we can analyse materials, download software, test scripts, and generally experiment separate from the corporate network. It is still running a Windows 7 build, as some of the drivers we use with legacy media carriers were not compatible with Windows 10 during the initial testing and rollout of Windows 10 devices to Library staff.

  6. A ragtag bunch of computer misfits

[link to https://natlib.govt.nz/blog/posts/a-ragtag-bunch-of-computer-misfits]

Over the years, the Library has collected vintage computers with a variety of hardware and software capabilities and each machine offers different applications and tools in order to help us process and research legacy digital collections. We are also sometimes gifted computers from donors in order to support the processing of their legacy files, and allow us to see exactly what software and programmes they used, and their file systems.

  7. KryoFlux (located at Archives New Zealand)

And for the really tricky legacy media, we are fortunate to be able to call on our colleagues down the road at Archives New Zealand Te Rua Mahara o te Kāwanatanga, who have a KryoFlux set up in their digital preservation lab to read 3.5 inch and 5.25 inch floppy disks. We recently went over there to try to image a set of double-sided, double-density, 3.5 inch Macintosh floppy disks from 1986-1989 that we had been unable to read on our legacy Power Macintosh 7300/180. We were able to create disk image files for them using the KryoFlux, but unfortunately the disks contained bad sectors, so we weren’t able to render the files from them.

Drives and accessories

In addition to our hardware and workstation setup, we use a variety of drives and accessories to aid in processing of born-digital materials.

  1. Tableau Forensic USB 3.0 Bridge write blocker
  2. 3.5 inch floppy drive
  3. 5.25 inch floppy drive
  4. Optical media drive 
  5. Zip drive 
  6. Memory card readers (CompactFlash cards, Secure Digital (SD) cards, Smart Media cards)
  7. Various connectors and converters

Some of our commonly used software and other processing tools

  1. SafeMover Python script (created in-house at NLNZ to transfer and check fixity for digital collections; a sketch of the general pattern follows this list)
  2. DROID file profiling tool
  3. Karen’s Directory Printer
  4. Free Commander/Double Commander
  5. File List Creator
  6. FTK Imager
  7. OSF Mount
  8. IrfanView
  9. Hex Editor Neo
  10. Duplicate Cleaner
  11. ePADD
  12. HFS+ for Windows
  13. System Centre Endpoint Protection
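
The sketch below is not NLNZ's SafeMover code, but it shows the general transfer-and-verify pattern such a script implements: hash the source file, copy it, hash the copy, and compare. The paths are placeholders:

    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path):
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def transfer(source_dir, dest_dir):
        """Copy every file and confirm each copy's checksum matches its source."""
        mismatches = []
        for src in Path(source_dir).rglob("*"):
            if not src.is_file():
                continue
            dest = Path(dest_dir) / src.relative_to(source_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            expected = sha256(src)
            shutil.copy2(src, dest)
            if sha256(dest) != expected:
                mismatches.append(str(src))
        return mismatches

    bad = transfer("E:/donor_drive", "//server/ingest/accession_001")   # placeholder paths
    print("Fixity mismatches:", bad or "none")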


Valerie Love is the Senior Digital Archivist Kaipupuri Pūranga Matihiko Matua at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa.

What’s Your Set-up? Born-Digital Processing at NC State University Libraries

by Brian Dietz


Background

Until January 2018 the NC State University Libraries did our born-digital processing using the BitCurator VM running on a Windows 7 machine. The BCVM bootstrapped our operations, and much of what I think we’ve accomplished over the last several years would not have been possible without this setup. Two years ago, we shifted our workflows to be run mostly at the command line on a Mac computer. Moving to the CLI meant we needed a *nix environment. Cygwin for Windows was not a realistic option, and the Windows Subsystem for Linux, available on Windows 10, had not yet been released. A dedicated Linux computer wasn’t an ideal option due to the IT support it would require. I no longer wanted to manage virtual machine distributions, and a dual-boot machine seemed too inefficient. Also, of the three major operating systems, I’m most familiar and comfortable with macOS, which is UNIX under the hood and certified as such. Additionally, Homebrew, a package manager for Mac, made installing and updating the programs we needed, as well as their dependencies, relatively simple. In addition to Homebrew, we use pip to update brunnhilde, and freshclam, included in ClamAV, to keep the virus database up to date. HFS Explorer, necessary for exploring Mac-formatted disks, is a manual install and update, and it might be the main pain point (though not too painful yet). With the exception of HFS Explorer, updating is done at the time of processing, so the environment is always fresh.

Current workstation

We currently have one workstation where we process born-digital materials. We do our work on a Mac Pro:

  • macOS 10.13 High Sierra
  • 3.7 GHz processor
  • 32GB memory
  • 1TB hard drive
  • 5TB NFS-mounted networked storage
  • 5TB Western Digital external drive

We have a number of peripherals:

  • 2 consumer grade Blu-ray optical drives (LG and Samsung)
  • 2 iomega USB-powered ZIP drives (100MB and 250MB)
  • Several 3.5” floppy drives (salvaged from surplused computers), but our go-to is a Sony 83 track drive (model MPF920)
  • One TEAC 5.25” floppy drive (salvaged from a local scrap exchange)
  • KryoFlux board with power supply and ribbon cable with various connectors
  • Wiebetech USB and Forensic UltraPack v4 write blockers
  • Apple iPod (for taking pictures of media, usually transferred via AirDrop)

The tools that we use for exploration/appraisal, extraction, and reporting are largely command-line tools; a rough sketch of typical invocations follows each list.

Exploration

  • diskutil (finding where a volume is mounted)
  • gls (finding volume names; the GNU version of ls shows escape characters ("\") in its output)
  • hdiutil (mounting disk image files)
  • mmls (finding partition layout of disk images)
  • drutil status (showing information about optical media)
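
Roughly, an exploration pass looks like the following. Treat this as a sketch only: the device identifiers, volume names, and image file names are placeholders, not values from our workflow.

```
# Exploration sketch -- names and device paths are placeholders.
diskutil list                        # see attached disks and where volumes are mounted
gls -la /Volumes/UNTITLED            # GNU ls, which escapes awkward characters in file names
hdiutil attach -readonly floppy.img  # mount a disk image read-only for a look around
mmls floppy.img                      # print the partition layout of a disk image (Sleuth Kit)
drutil status                        # report on whatever optical media is in the drive
```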

Packaging

  • tar (packaging content from media not being imaged)
  • ddrescue (disk imaging)
  • cdparanoia (packaging content from audio discs)
  • KryoFlux GUI (floppy imaging)
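
A packaging pass might look something like this. Again, this is a sketch with placeholder volume, device, and output names; the KryoFlux step happens in its GUI, so it isn't shown.

```
# Packaging sketch -- volume, device, and output names are placeholders.
tar -cf transfer.tar -C /Volumes/DONOR_DISK .   # package content from media we aren't imaging
ddrescue /dev/rdisk2 carrier.img carrier.map    # image a device, tracking progress in a map file
                                                # (unmount the volume first, e.g. diskutil unmountDisk)
cdparanoia -B                                   # rip an audio disc to one WAV file per track
```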

Reporting

  • brunnhilde (file and disk image profiling, duplication)
  • bulk_extractor (PII scanning)
  • clamav (virus scanning)
  • Exiftool (metadata)
  • Mediainfo (metadata)
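
And reporting runs along these lines. This is a sketch with placeholder paths; in particular, brunnhilde's arguments and options have changed between releases, so check brunnhilde.py -h for the version you have.

```
# Reporting sketch -- paths are placeholders; check each tool's help for current options.
brunnhilde.py source/ reports/              # profile a directory (argument order varies by version)
bulk_extractor -o be_reports/ carrier.img   # scan a disk image for PII such as email addresses
clamscan -r source/                         # recursive virus scan with ClamAV
exiftool -r source/ > exif.txt              # dump embedded metadata for the files
mediainfo source/interview.mov              # technical metadata for an audiovisual file
```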

Additionally, we perform archival description using ArchivesSpace, and we’ve developed an application called DAEV (“Digital Assets of Enduring Value”) that, among other things, guides processors through a session and interacts with ArchivesSpace to record certain descriptive metadata. 

Working with IT

We have worked closely with our Libraries Information Technology department to acquire and maintain hardware and peripherals, just as we have worked closely with our Digital Library Initiatives department on the development and maintenance of DAEV. For purchasing, we submit larger requests, with justifications, to IT annually, and smaller requests as needs arise, e.g., our ZIP drive broke and we need a new one. Our computer is on the refresh cycle, meaning once it reaches a certain age, it will be replaced with a comparable computer. Especially with peripherals, we provide exact technical specifications and anticipated costs, e.g., iomega 250MB ZIP drive, and IT determines the purchasing process.

I think it’s easy to assume that, because people in IT are among our most knowledgeable colleagues about computing technology, they understand what it is we’re trying to do and what we’ll need to do it. While they are certainly capable of understanding our needs, their specializations lie elsewhere, and that assumption can result in a less than ideal computing situation. My experience is that my coworkers in IT are eager to understand our problems and to help us solve them, but that they really don’t know what our problems are.

The counter-assumption is that we ourselves are supposed to know everything about computing. That’s probably more counterproductive than assuming IT knows everything, because 1) we feel bad when we don’t know everything, and 2) in trying to hide what we don’t know, we end up not getting what we need. I think the ideal situation is for us to know what processes we need to run (and why), and to share those with IT, who should be able to say what sort of processor and how much RAM is needed. If your institution has a division of labor, i.e., specializations, take advantage of it.

So, rather than saying, “we need a computer to support digital archiving,” or “I need a computer with exactly these specs,” we’ll be better off requesting a consultation and explaining what sort of work we need a computer to support. Of course, the first computer we requested for a born-digital workstation, which was intended to support a large initiative, came at a late hour and was in the form of “We need a computer to support digital archiving,” with the additional assumption of “I thought you knew this was happening.” We got a pretty decent Windows 7 computer that worked well enough.

I also recognize that I may be describing a situation that does not exist in many other institutions. In those cases, perhaps that’s something to be worked toward through personal and inter-departmental relationship building. At any rate, I recognize and am grateful for the support my institution has extended to my work.

Challenges and opportunities

I’ve got two challenges coming up. Campus IT has required that all Macs be upgraded to macOS Mojave to “meet device security requirements.” From a security perspective, I’m all on board for this. However, in our testing the KryoFlux is not compatible with Mojave. This appears to be related to a security measure Mojave has in place for controlling USB communication. After several conversations with Libraries IT, they’ve recommended assigning us a Windows 10 computer for use with the KryoFlux. Setting aside the fact that we’ll now have two computers, I see obvious benefits to this. One is that I’ll be able to install the Windows Subsystem for Linux on Windows 10 and explore whether going full-out Linux might be an option for us. Another is that I’ll have ready access to FTK Imager again, which comes in handy from time to time.

The other challenge we have is working with our optical drives. We have consumer-grade drives, and they work inconsistently: Drive 1 may read Disc X but not Disc Y, while Drive 2 will do the opposite. At the 2019 BitCurator Users Forum, Kam Woods discussed higher-grade optical drives in the “There Are No Dumb Questions” session. (By the way, everyone should consider attending the Forum. It’s a great meeting that’s heavily focused on practice, and it gets better each year. This year, the Forum will be hosted by Arizona State University, October 12-13. The call for proposals will be coming out in early March.)

In the coming months we’ll be making some significant changes to our workflow: tweaking a few things, reordering some steps, introducing new tools (e.g., walk_to_dfxml and Bulk Reviewer), and, I hope, introducing more automation into the process. We’re also due for a computer refresh, and, while we’re sticking with Macs for the time being, we’ll again work with our IT to review computer specifications.


Brian Dietz is the Digital Program Librarian for Special Collections at NC State University Libraries, where he manages born-digital processing, web archiving, and digitization.

What’s Your Set-Up?: Establishing a Born-Digital Records Program at Brooklyn Historical Society

by Maggie Schreiner and Erica López


In establishing a born-digital records program at Brooklyn Historical Society, one of our main challenges was scaling the recommendations and best practices, which thus far have been primarily articulated by large and well-funded research universities, to fit our reality: a small historical society with limited funding, a very small staff, and no in-house IT support. In navigating this process, we’ve attempted to strike a balance that will allow us to responsibly steward the born-digital records in our collections, be sustainable for our staffing and financial realities, and allow us to engage with and learn from our colleagues doing similar work.

We started our process with research and learning. Our Digital Preservation Committee, which meets monthly, held a reading group. We read and discussed SAA’s Digital Preservation Essentials, reached out to colleagues at local institutions with born-digital records programs for advice, and read widely on the internet (including bloggERS!). Our approach was also strongly influenced by Bonnie Weddle’s presentation “Born Digital Collections: Practical First Steps for Institutions,” given at the Conservation Center for Art & Historic Artifacts’ 2018 conference at the Center for Jewish History. Bonnie’s presentation focused on iterative processes that can be implemented by smaller institutions. It empowered us to envision a BHS-sized program: to start small, to iterate when possible, and to do so in the ways that make sense for our staff and our collections.

We first enacted this approach in our equipment decisions. We assembled a workstation that consists of an air-gapped desktop computer and a set of external drives based on our known and anticipated needs (3.5 inch floppy, CD/DVD, Zip drives, and memory card readers). Our most expensive piece of equipment was our write blocker (a Tableau TK8u USB 3.0 Bridge), which, based on our research, seemed like the most important place to splurge. We based our equipment decisions on background reading, informal conversations with colleagues about equipment possibilities, and an existing survey of born-digital carriers in our collections. We were also limited by our small budget; the total cost for our workstation was approximately $1,500.

Born digital records workstation at the Brooklyn Historical Society

A grant from the Gardiner Foundation allowed us to create a paid Digital Preservation Fellowship, and hire the amazing Erica López for the position. The goals and timeline for Erica’s position were developed to allow lots of time for research, learning through trial and error, and mistakes. For a small staff, it is often difficult for us to create the time and space necessary for experimentation. Erica began by crafting processes for imaging and appraisal: testing software, researching, adapting workflows from other institutions, creating test disk images, and drafting appraisal reports. We opted to use BitCurator, due to the active user community. We also reached out to Bonnie Weddle, who generously agreed to answer our questions and review draft workflows. Bonnie’s feedback and support gave us additional confidence that we were on the right track.

Starting from an existing inventory of legacy media in our collections, Erica created disk images of the majority of items, and created appraisal assessments for each collection. Ultimately, Erica imaged eighty-seven born-digital objects (twelve 3.5 inch floppy disks, thirty-eight DVDs, and thirty-seven CDs), which contained a total of seventy-seven different file formats. Although these numbers may seem very small for some (or even most) institutions, these numbers are big for us! Our archives program is maintained by two FTE staff with multiple responsibilities, and vendor IT with no experience supporting the unique needs of archives and special collections. 

We encountered a few big bumps during the process! The first was that we unexpectedly had to migrate our archival storage server, and as a result did not have read-write access for several months. This interrupted our planned storage workflow for the disk images that Erica was creating. In hindsight, we made what is now a glaring mistake: we kept the disk images inside the virtual machine running BitCurator. Inevitably, we had a day when we were no longer able to launch the virtual machine. After several days of failed attempts to recover the disk images, we decided that Erica would re-image the media. Fortunately, by this time Erica was very proficient, and it took less than two weeks!

We had also hoped to do a case study on a hard drive in our collection, as Erica’s work had otherwise been limited to smaller removable media. After some experimentation, we discovered that our system would not be able to connect to the drive, and that we would need to use a FRED to access the content. We booked time at the Metropolitan New York Library Council’s Studio to use their FRED. Erica spent a day imaging the drive, and brought back a series of disk images… which to date we have not successfully opened in our BitCurator environment at BHS! After spending several weeks troubleshooting the technical difficulties and reaching out to colleagues, we decided to table the case study. Although disappointing, we also recognized that we have made huge strides in our ability to steward born-digital materials, and that we will continually iterate on this work in the future.
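
If we ever revisit those images, a hedged first troubleshooting step might be to interrogate them with The Sleuth Kit tools that ship with BitCurator, just to confirm the image format and partition layout before blaming the environment. The file name, format, and sector offset below are assumptions for illustration, not details from our actual images.

```
# First-pass checks on a disk image that won't open -- file name and offset are placeholders.
img_stat harddrive.E01     # report the image format (raw, E01, etc.) and size
mmls harddrive.E01         # list the partition table, if one can be found
fls -o 2048 harddrive.E01  # try listing files from the partition whose start sector mmls reported
```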

What have we learned about creating a BHS-sized born-digital records program? We learned that our equipment meets the majority of our use-case scenarios, that we have access to additional equipment at METRO when needed, and that maybe we aren’t quite ready to tackle more complex legacy media anyway. We learned that’s okay! We haven’t read everything, we don’t have the fanciest equipment, and we didn’t start with any in-house expertise. We did our research, did our best work, made mistakes, and in the end we are much more equipped to steward the born-digital materials in our collections. 


Maggie Schreiner is the Manager of Archives and Special Collections at the Brooklyn Historical Society, an adjunct faculty member in New York University’s Archives and Public History program, and a long-time volunteer at Interference Archive. She has previously held positions at the Fashion Institute of Technology (SUNY),  NYU, and Queens Public Library. Maggie holds an MA in Archives and Public History from NYU.

Erica López was born and raised in California by undocumented parents. Education was important, but exploring Los Angeles’s colorful nightlife was more important. After doing hair for over a decade, Erica started studying to be a Spanish teacher at UC Berkeley. Eventually, Erica quit the Spanish-teacher dream and found first film theory and then the archival world. Soon, Erica was finishing up an MA at NYU and working to become an archivist. Erica worked with Brooklyn Historical Society to set up workflows for born-digital collections, and is currently finishing up an internship at The Riverside Church translating and cataloging audio files.

What’s Your Set-up?: Curation on a Shoestring

by Rachel MacGregor


At the Modern Records Centre, University of Warwick in the United Kingdom we have been making steady progress in our digital preservation work. Jessica Venlet from UNC Chapel Hill wrote recently about being in the lucky position of finding “an excellent stock of hardware and two processors” when she started in 2016. We’re a little further behind than this—when I began in 2017 I had a lot less!

What we want is FRED. Who’s he? He’s a Forensic Recovery of Evidence Device (a forensic workstation), but at several thousand dollars, he’s beyond the reach of many of us.

What I had in 2017: 

  • A Tableau T8-R2 write blocker. Write blockers are very important when working with rewritable media (USB drives, hard drives, etc.) because they prevent accidental alteration of material by blocking overwriting or deletion.
  • A (fingers crossed) working 3.5 inch external floppy disk drive.
  • A lot of enthusiasm.

What I didn’t have: 

  • A budget.
[Image: Dell monitor and computer, keyboard, mouse, and write blocker on an office desk, with BitCurator open on the screen.]
My digital curation workstation – not fancy but it works for me. Photo taken by MacGregor, under a CC BY license.

Whilst doing background research for tackling our born-digital collections, I got interested in the BitCurator toolkit, which is designed to help with the forensic recovery of digital materials.  It interested me particularly because:

  • It’s free.
  • It’s open source.
  • It’s created and managed by archivists for archivists.
  • There’s a great user community.
  • There are loads of training materials online and an online discussion group.

I found this excellent blog post by Porter Olsen to help get started. He suggests starting with a standard workstation with a relatively high specification (e.g. 8 GB of RAM). So, I asked our IT folk for one, which they had in stock (yay!). I specified a Windows operating system and installed a virtual machine running a Linux operating system, on which to run BitCurator.

I’m still exploring BitCurator; it’s a powerful suite of tools with lots of features. However, when trialling it on the papers of the eminent historian Eric Hobsbawm, I found that it was a bit like using a hammer to crack a nut. Whilst it was possible to produce all sorts of sophisticated reports identifying email addresses and so on, this isn’t much use on drafts of published articles from the late 1990s to early 2000s. I turned to FTK Imager, which is proprietary but free software. It is widely used in the preservation community, but not designed by, with, or for archivists (as BitCurator is). I guess its popularity derives from the fact that it’s easy to use and will let you image a carrier (i.e. take a complete copy of the whole medium, including deleted files and empty space) or just extract the files, without too much time spent learning the tool. There are standard options for disk image output (e.g. a raw byte-for-byte image, the E01 Expert Witness format, SMART, and AFF). However, I would like to spend some more time getting to know BitCurator and becoming part of its community. There is always room for new and different tools, and I suspect the best approaches are those which embrace diversity.

Another tool that looks useful for disk imaging is diskimgr, created by Johan van der Knijff of the National Library of the Netherlands (KB). It will only run on a Linux operating system (not in a virtual machine), so now I am wondering about getting a separate Linux workstation. BitCurator also works more effectively in a native Linux environment than in a virtual machine; it does stall sometimes with larger collections. I wonder if I should have opted for a Linux machine to start with... it’s certainly something to consider when creating a specification for a digital curation workstation.

Once the content is extracted, we need further tools to help us manage and process it. BitCurator does a lot, but there may be extra things you need depending on your intended workflow. I never go anywhere without DROID. DROID is useful for loads of stuff, like file format identification, creating checksums, deduplication, and lots more. My standard workflow is to create a DROID profile and then use it as part of the appraisal process further down the line. What I don’t yet have is some sort of file viewer; Quick View Plus is the one I have in mind (it’s not free, and as I think I mentioned, my resources are limited!). I would also like to get LibreOffice installed, as it deals quite well with old word-processed documents.
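
For anyone who prefers to script that step, DROID also has a command-line interface alongside the GUI. The lines below are only a rough sketch: they assume the droid command-line launcher is on your path, the flags should be double-checked against droid -h for your version, and the paths and profile names are placeholders.

```
# Rough DROID command-line sketch -- verify flags with "droid -h"; paths are placeholders.
droid -R -a /path/to/accession -p accession.droid   # recursively add a folder to a new profile
droid -p accession.droid -e accession_droid.csv     # export the profile results to CSV for appraisal
```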

I guess I’ll keep adding to it as I go along. I now need to work out the most efficient ways of using the tools I have and capturing the relevant metadata that is produced. I would encourage everyone to take some time experimenting with some of the tools out there and I’d love to hear about how people get on.


Rachel MacGregor is Digital Preservation Officer at the Modern Records Centre, University of Warwick, United Kingdom. Rachel is responsible for developing and implementing digital preservation processes at the Modern Records Centre, including developing cataloguing guidelines, policies and workflows. She is particularly interested in workforce digital skills development.