Scalability, Automation, and Impostor Syndrome! Oh My!

By Kathryn Slover


From October 13-16, 2020, BitCurator hosted their annual Users Forum virtually. The forum consisted of presentations on a variety of digital preservation topics, but one that stood out to me was the first panel on scalability and automation. This session consisted of three presentations.

In August I started a new job at the University of Texas at Arlington, and this was my first BitCurator Users Forum (BUF). I was very excited to hear from other professionals in digital preservation and looked forward to learning about new resources I could use in my new position. While watching the first presentation, I was hit with some serious impostor syndrome as a lot of the terms flew over my head! I started rapidly writing down terms to Google later, like DPX film scans, RAWcooked, and FFV1 Matroska. Joanna White’s presentation about the project to convert 3 PB of DPX film scans into FFV1 Matroska video files using automation scripts at the British Film Institute sounded fascinating, but I must admit I was a tad overwhelmed (and by a tad I mean completely). I couldn’t help but think of all the things I didn’t know.

After a moment (or several moments) of panic, the second presentation in this session restored my faith that I was, in fact, a Digital Archivist who did know things about digital preservation. Lynn Moulton’s presentation really resonated with me. As it turns out, during last year’s BUF, she struggled with the same feelings I had while watching the previous presentation. She spoke about her own experience with impostor syndrome and reminded me that I, like most Digital Archivists, come from an archives background and not a computer science one.

As someone relatively new to the world of digital archives, it was comforting to hear that I’m not the only one who sometimes gets overwhelmed and feels like a phony. Luckily, there are people like Lynn who share their own experience and detail how they overcame those feelings. She talked about her process, which includes looking at others’ documentation, testing solutions (and testing them again), the need for support, and the fact that failure is an important part of the process. By the end of her presentation, my overwhelmed feelings had subsided a bit. Even though I don’t know everything, there is an amazing community of digital preservation professionals out there that are dealing with similar issues and are always there to help.  

With my renewed energy, I was a bit more prepared for the final presentation of the session. David Cirella and Greta Graf presented on automating the packaging and ingest process of electronic resources at Yale University. This presentation felt a little bit more in my wheelhouse. They focused primarily on scripting using Python and the Bash shell. Some automation had been implemented by my predecessor using Python, so I was particularly excited about this subject. Even though I wasn’t familiar with everything, I came out of this presentation with a few ideas about implementing automation at my new institution.

I wasn’t expecting a session on scalability and automation to be such an emotional roller coaster, but it was an informative ride! In this session alone, I came to realize that there are varying levels of skill and expertise when it comes to the work of digital preservation. We all bring something to the table. The rest of the BUF sessions reinforced the fact that digital preservation professionals (no matter the project) are all trying to do the best they can. We all face obstacles in our work but, with a solid network of hard-working colleagues, we can do a lot. I learned about so many helpful tools and educational resources that can hopefully conquer my own feelings of impostor syndrome as they pop up. Overall, the BUF was an amazing experience and I am grateful I was able to learn from this conference (even if I do have a notebook filled with terms and tools to look up)! 


Kathryn Slover is the Digital Archivist at the University of Texas at Arlington Special Collections. She has an M.A. in Public History from Middle Tennessee State University and previously held the role of Electronic Processing Archivist at the South Carolina Department of Archives and History.

Dispatches from a Distance: Work/Work Balance

by Marcella Huggard

This post is part of Dispatches from a Distance, a series of short posts to provide a forum for those of us facing disruption in our professional lives, whether that’s working from home or something else, to stay engaged with the community. Now that so many of us are returning to full- or part-time on-site work, we’d like to extend this series to include reflections on reopening, returning to work, and other anxieties facing the profession due to COVID-19. There is no specific topic or theme for submissions; rather, this is a space to share your thoughts on current projects or ideas you’d like to share with other readers of the Electronic Records Section blog. Dispatches should be between 200 and 500 words and can be submitted here.


My special collections and archives library has started the reopening process, in preparation for the fall semester. We’re not open to the public yet but expect we will be in a limited fashion for the fall, and in the meantime those of us on staff in processing and conservation are coming into the building regularly to get back to working with the collections.

Transitioning to working strictly from home was one set of processes—physical, emotional, and mental. Transitioning to a hybrid situation is another set of processes. My staff are working approximately 50% in the office, 50% at home.  This means getting back to processing projects they haven’t really looked at since March, and it means continuing data cleanup projects they started in March, or starting new data cleanup projects from home. It means possibly inconsistent schedules depending on when the building is open (for some, this is good—variety is the spice of life!—for others, routine is essential and this is a disruption). It means adjusting to long stretches wearing a mask and getting sweaty extra quickly when schlepping boxes or archival supplies around. It means still not seeing some co-workers in person as we continue to work split shifts to lower the numbers of people in our building.

I’m taking our university administration’s direction to work from home as much as possible seriously, and I find that a lot of my regular work can be done remotely. Reviewing finding aids? Check. Ongoing data cleanup projects? Check. Research involving materials I’ve already retrieved from other archives and from electronically available resources? Check. Meetings with colleagues to plan projects and determine what we’ll do this fall?  Check. Professional reading, conferences, and workshops? Check. Data entry for processing projects? Check. This means extra disruption, though—“I’ll be able to get a full 4-hour shift in processing collections tomorrow afternoon,” I think happily to myself, until somebody schedules a meeting smack in the middle of what would have been that shift, and I’m adjusting yet again.

The guiding principles for this pandemic have been adaptability and flexibility, and I don’t see that changing anytime soon.

Estimating Energy Use for Digital Preservation, Part II

by Bethany Scott

This post is part of our BloggERS Another Kind of Glacier series. Part I was posted last week.


Conclusions

While the findings of the carbon footprint analysis are predicated on our institutional context and practices, and therefore may be difficult to directly extrapolate to other organizations’ preservation programs, there are several actionable steps and recommendations that sustainability-minded digital preservationists can implement right away. Getting in touch with any campus sustainability officers and investigating environmental sustainability efforts currently underway can provide enlightening information – for instance, you may discover that a portion of the campus energy grid is already renewable-powered, or that your institution is purchasing renewable energy credits (RECs). In my case, I was previously not aware that UH’s Office of Sustainability has published an improvement plan outlining its sustainability goals, including a 10% total campus waste reduction, a 15% campus water use reduction, and a 35% reduction in energy expenditures for campus buildings – all of which will require institutional support from the highest level of UH administration as well as partners among students, faculty, and staff across campus. I am proud to consider myself a partner in UH campus sustainability and look forward to promoting awareness of and advocating for our sustainability goals in the future.

As Keith Pendergrass highlighted in the first post of this series, there are other methods by which digital preservation practitioners can reduce their power draw and carbon footprint, thereby increasing the sustainability of their digital preservation programs – from turning off machines when not in use or scheduling resource-intensive tasks for off-peak times, to making broader policy changes that incorporate sustainability principles and practices.

At UHL, one such policy change I would like to implement is a tiered approach to file format selection, through which we match the file formats and resolution of files created to the scale and scope of the project, the informational and research value of the content, the discovery and access needs of end users, and so on. Existing digital preservation policy documentation outlines file formats and specifications for preservation-quality archival masters for images, audio, and video files that are created through our digitization unit. However, as UHL conducts a greater number of mass digitization projects – and accumulates an ever larger number of high-resolution archival master files – greater flexibility is needed. By choosing to create lower-resolution files for some projects, we would reduce the total storage for our digital collections, thereby reducing our carbon footprint.

For instance, we may choose to retain large, high-resolution archival TIFFs for each page image of a medieval manuscript book, because researchers study minute details in the paper quality, ink and decoration, and the scribe’s lettering and handwriting. By contrast, a digitized UH thesis or dissertation from the mid-20th century could be stored long-term as one relatively small PDF, since the informational value of its contents (and not its physical characteristics) is what we are really trying to preserve. Similarly, we are currently discussing the workflow implications of providing an entire archival folder as a single PDF in our access system. Although the initial goal of this initiative was to make a larger amount of archival material quickly available online for patrons, the much smaller amount of storage needed to store one PDF vs. dozens or hundreds of high-res TIFF masters would also have a positive impact on the sustainability of the digital preservation and access systems.

UHL’s digital preservation policy also includes requirements for monthly fixity checking of a random sample of preservation packages stored in Archivematica, with a full fixity check of all packages to be conducted every three years during an audit of the overall digital preservation program. Frequent fixity checking is computationally intensive, though, and adds to the total energy expenditure of an institution’s digital preservation program. But in UHL’s local storage infrastructure, storage units run on the ZFS filesystem, which includes self-healing features such as internal checksum checks each time a read/write action is performed. This storage infrastructure was put in place in 2019, but we have not yet updated our policies and procedures for fixity checking to reflect the improved baseline durability of assets in storage.

Best practices calling for frequent fixity checks were developed decades ago – but modern technology like ZFS may be able to passively address our need for file integrity and durability in a less resource-intensive way. Through considered analysis matching the frequency of fixity checking to the features of our storage infrastructure, we may come to the conclusion that less frequent hands-on fixity checks, on a smaller random sample of packages, are sufficient moving forward. Since this is a new area of inquiry for me, I would love to hear thoughts from other digital preservationists about the pros and cons of such an approach – is fixity checking really the end-all, or could we use additional technological elements as part of a broader file integrity strategy over time?
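
To make the idea of a reduced random-sample check concrete, here is a minimal Python sketch. It is not how Archivematica or ZFS performs fixity checking. The manifest layout (a mapping of file path to stored SHA-256 digest), the function names sha256_of and check_random_sample, and the sampling fraction are all hypothetical choices made for illustration.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_random_sample(manifest, fraction, seed=None):
    """Verify a random fraction of the files listed in a manifest.

    manifest: hypothetical dict mapping file path -> stored SHA-256 digest.
    fraction: share of entries to check, between 0 and 1.
    Returns a dict mapping each sampled path to True (match) or False.
    """
    paths = sorted(manifest)
    if not paths:
        return {}
    # Always check at least one file, and never more than exist.
    sample_size = min(len(paths), max(1, round(len(paths) * fraction)))
    sampled = random.Random(seed).sample(paths, sample_size)
    return {path: sha256_of(path) == manifest[path] for path in sampled}
```

Lowering the fraction reduces the number of files that must be read and hashed on each run, which is where any energy savings would come from. Whether a smaller sample is sufficient remains the open policy question.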

Future work

I eagerly anticipate refining this electricity consumption research with exact figures and values (rather than estimates) when we are able to more consistently return to campus. We would like to investigate overhead costs such as lighting and HVAC in UHL’s server room, and we plan to grab point-in-time values physically from the power distribution units in the racks. Also, there may be additional power statistics that our Sys Admin can capture from the VMware hosts – which would allow us to begin on this portion of the research remotely in the interim. Furthermore, I plan to explore additional factors to provide a broader understanding of the impact of UHL’s energy consumption for digital systems and initiatives. By gaining more details on our total storage capacity, percentage of storage utilization, and GHG emissions per TB, we will be able to communicate about our carbon footprint in a way that will allow other libraries and archives to compare or estimate the environmental impact of their digital programs as well.

I would also like to investigate whether changes in preservation processes, such as the reduced hands-on fixity strategy outlined above, can have a positive impact on our energy expenditure – and whether this strategy can still provide a high level of integrity and durability for our digital assets over time. Finally, as a longer-term initiative I would like to take a deeper look at sustainability factors beyond energy expenditure, such as current practices for recycling e-waste on campus or a possible future life-cycle assessment for our hardware infrastructure. Through these efforts, I hope to help improve the long-term sustainability of UHL’s digital initiatives, and to aid other digital preservationists to undertake similar assessments of their programs and institutions as well.


Bethany Scott is Digital Projects Coordinator at the University of Houston Libraries, where she is a contributor to the development of the BCDAMS ecosystem incorporating Archivematica, ArchivesSpace, Hyrax, and Avalon. As a representative of UH Special Collections, she contributes knowledge on digital preservation, born-digital archives, and archival description to the BCDAMS team.

Estimating Energy Use for Digital Preservation, Part I

by Bethany Scott

This post is part of our BloggERS Another Kind of Glacier series. Part II will be posted next week.


Although the University of Houston Libraries (UHL) has taken steps over the last several years to initiate and grow an effective digital preservation program, until recently we had not yet considered the long-term sustainability of our digital preservation program from an environmental standpoint. As the leader of UHL’s digital preservation program, I aimed to address this disconnect by gathering information on the technology infrastructure used for digital preservation activities and its energy expenditures in collaboration with colleagues from UHL Library Technology Services and the UH Office of Sustainability. I also reviewed and evaluated the requirements of UHL’s digital preservation policy to identify areas where the overall sustainability of the program may be improved in the future by modifying current practices.

Inventory of equipment

I am fortunate to have a close collaborator in UHL’s Systems Administrator, who was instrumental in the process of implementing the technical/software elements of our digital preservation program over the past few years. He provided a detailed overview of our hardware and software infrastructure, both for long-term storage locations and for processing and workflows.

UHL’s digital access and preservation environment is almost 100% virtualized, with all of the major servers and systems for digital preservation – notably, the Archivematica processing location and storage service – running as virtual machines (VMs). The virtual environment runs on VMware ESXi and consists of five physical host servers that are part of a VMware vSAN cluster, which aggregates the disks across all five host servers into a single storage datastore.

VMs where Archivematica’s OS and application data reside may have their virtual disk data spread across multiple hosts at any given time. Therefore, exact resource use for digital preservation processes running via Archivematica is difficult to distinguish or pinpoint from other VM systems and processes, including UHL’s digital access systems. After discussing possible approaches for calculating the energy usage, we decided to take a generalized or blanket approach and include all five hosts. This calculation thus represents the energy expenditure for not only the digital preservation system and storage, but also for the A/V Repository and Digital Collections access systems. At UHL, digital access and preservation are strongly linked components of a single large ecosystem, so the decision to look at the overall energy expenditure makes sense from an ecosystem perspective.

In addition to the VM infrastructure described above, all user and project data is housed in the UHL storage environment. The storage environment includes both local shared network drive storage for digitized and born-digital assets in production, and additional shares that are not accessible to content producers or other end users, where data is processed and stored to be later served up by the preservation and access systems. Specifically, with the Archivematica workflow, preservation assets are processed through a series of automated preservation actions including virus scanning, file format characterization, fixity checking, and so on, and are then transferred and ingested to secure preservation storage.

UHL’s storage environment consists of two servers: a production unit and a replication unit. Archivematica’s processing shares are not replicated, but the end storage share is replicated. Again, for purposes of simplification, we generalized that both of these resources are being used as part of the digital preservation program when analyzing power use. Finally, within UHL’s server room there is a pair of redundant network switches that tie all the virtual and storage components together.

The specific hardware components that make up the digital access and preservation infrastructure described above include:

  • One (1) production storage unit: iXsystems TrueNAS M40 HA (Intel Xeon Silver 4114 CPU @ 2.2 GHz and 128 GB RAM)
  • One (1) replication storage unit: iXsystems FreeNAS IXC-4224 P-IXN (Intel Xeon CPU E5-2630 v4 @ 2.2 GHz and 128 GB RAM)
  • Two (2) disk expansion shelves: iXsystems ES60
  • Five (5) VMware ESXi hosts: Dell PowerEdge R630 (Intel Xeon CPU E5-2640 v4 @ 2.4 GHz and 192 GB RAM)
  • Two (2) network switches: HPE Aruba 3810M 16SFP+ 2-slot

Electricity usage

Each of the hardware components listed above has two power supplies. However, the power draw is not always running at the maximum available for those power supplies and is dependent on current workloads, how many disks are in the units, and so on. Therefore, the power being drawn can be quantified but will vary over time.

With the unexpected closure of the campus due to COVID-19, I conducted this analysis remotely with the help of the UH campus Sustainability Coordinator. We compared the estimated maximum power draw based on the technical specifications for the hardware components, the draw when idle, and several partial power draw scenarios, with the understanding that the actual numbers will likely fall somewhere in this range.

Estimated power use and greenhouse gas emissions

Load level   Daily Usage Total (Watts)   Annual Total (kWh)   Annual GHG (lbs)
Max          9,094                       79,663.44            124,175.71
95%          8,639.3                     75,680.268           117,966.92
90%          8,184.6                     71,697.096           111,758.14
85%          7,729.9                     67,713.924           105,549.35
80%          7,275.2                     63,730.752           99,340.565
Idle         5,365.46                    47,001.43            73,263.666

The estimated maximum annual greenhouse gas emissions derived from power use for the digital access and preservation hardware is over 124,000 pounds, or approximately 56.3 metric tons. To put this in perspective, it’s equivalent to the GHG emissions from nearly 140,000 miles driven by an average passenger vehicle, and to the carbon dioxide emissions from 62,063 pounds of coal burned or 130 barrels of oil consumed. While I hope to refine this analysis further in the future, for now these figures can serve as an entry point to discussions on the importance of environmental sustainability actions – and our plans to reduce our consumption – with Libraries administration, colleagues in the Office of Sustainability, and other campus leaders.
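
For readers who want to reproduce or adapt these figures, the arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact worksheet used for the analysis: it assumes continuous operation for 8,760 hours per year, and the emission factor is not quoted from any source but inferred from the ratio of the table's own "Annual GHG" and "Annual Total" columns (roughly 1.56 lbs per kWh). The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes the hardware runs continuously

# Assumption: emission factor inferred from the table's "Max" row
# (annual GHG in lbs divided by annual kWh), not taken from a published source.
LBS_GHG_PER_KWH = 124_175.71 / 79_663.44

LBS_PER_METRIC_TON = 2204.62


def annual_kwh(watts):
    """Convert a constant power draw in watts to kilowatt-hours per year."""
    return watts * HOURS_PER_YEAR / 1000


def annual_ghg_lbs(watts):
    """Estimate annual greenhouse gas emissions in pounds for a constant draw."""
    return annual_kwh(watts) * LBS_GHG_PER_KWH


def lbs_to_metric_tons(pounds):
    """Convert pounds to metric tons."""
    return pounds / LBS_PER_METRIC_TON


# Power draw scenarios taken from the table above.
for label, watts in [("Max", 9094.0), ("80%", 7275.2), ("Idle", 5365.46)]:
    kwh = annual_kwh(watts)
    ghg = annual_ghg_lbs(watts)
    print(f"{label:<5}{kwh:>12,.2f} kWh{ghg:>14,.2f} lbs")
```

Swapping in a local utility's emission factor for LBS_GHG_PER_KWH would let another institution estimate its own footprint from measured or specified wattage.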

Part II, including conclusions and future work, will be posted next week.


Bethany Scott is Digital Projects Coordinator at the University of Houston Libraries, where she is a contributor to the development of the BCDAMS ecosystem incorporating Archivematica, ArchivesSpace, Hyrax, and Avalon. As a representative of UH Special Collections, she contributes knowledge on digital preservation, born-digital archives, and archival description to the BCDAMS team.

Call for bloggERS: Blog Posts on the BitCurator Users Forum

With only a few short weeks to go before the virtual 2020 BitCurator Users Forum (October 13-16), bloggERS is seeking attendees who are interested in writing a re-cap or a blog post covering a particular session, theme, or topic relevant to SAA Electronic Records Section members. The program for the Forum is available here.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com! You can also let us know if you’re interested in writing a general re-cap or if you’d like to cover something more specific.

Writing for bloggERS!

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc.!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.