From Aspirational to Actionable: Working through the OSSArcFlow Guide

by Elizabeth Stauber


Before I begin extolling the virtues of the OSSArcFlow Guide to Documenting Born-Digital Archival Workflows, I must confess that I created an aspirational digital archiving workflow four years ago, and for its entire life it has existed purely as a decorative piece of paper hanging next to my computer. This workflow was extensive and contained as many open source tools as I could find. It was my attempt to follow every digital archiving best practice that has ever existed.

In actual practice, I never had time to follow this workflow. As a lone arranger at the Hogg Foundation for Mental Health, my attention is constantly divided. Instead, I found ways to incorporate aspects of digital archiving into my records management and archival description work, which left the documentation fragmented. A bird's-eye view of the entire lifecycle of the digital record was not captured – the transition points between accession, processing, and description were unaccounted for.

Over the summer, a colleague suggested we go through the OSSArcFlow Guide to Documenting Born-Digital Archival Workflows together. Initially, I was skeptical, but my new home office needed some sprucing up, so I decided to go along. Immediately, I saw that the biggest difference between working through this guide and my prior, ill-fated attempt is that the OSSArcFlow Guide systematically helps you document what you already do. It does not shame you for failing to convert every file type to the most archivally sound format or for not completing fixity checks every month. Rather, it showed me that I am doing the best I can as one person managing an entire organization's records – and look how far I have come!

Taking the time to work through a structured approach for developing a workflow helped organize my digital archiving priorities and thoughts. It is easy to be haphazard as a lone arranger with so many competing projects. Following the guide allowed me to be systematic in my development and led to a better understanding of what I currently do with regard to digital archiving. For example, the act of categorizing my activities as appraisal, pre-accessioning, accessioning, arrangement, description, preservation, and access parceled out the disparate but coexisting work into manageable pieces. It connected the different processes I already had and revealed the overlaps and gaps in my workflow.

As I continued mapping out my activities, I was also able to more easily see the natural “pause” points in my workflow. This is important because digital archiving is often fit in around other work, and knowing when I can break from the workflow allows me to manage my time more efficiently – making it more likely that I will achieve progress on my digital archiving work. Having this workflow that documents my actual activities rather than my aspirational activities allows for easier future adaptability. Now I can spot more readily what needs to be added or removed. This is helpful in a lone arranger archive as it allows for flexibility and the opportunity for improvement over time.

The Hogg Foundation was established in 1940 by Ima Hogg. The Foundation’s archive houses many types of records from its 80 years of existence – newspapers, film, cassette tapes, and increasingly born-digital records. As the Foundation continues to make progress in transforming how communities promote mental health in everyday life, it is important to develop robust digital archiving workflows that capture this progress.

Now I understand my workflow as an evolving document that serves as the documentation of the connections between different activities, as well as a visualization to pinpoint areas for growth. My digital processing workflow is no longer simply a decorative piece of paper hanging next to my computer.


Elizabeth Stauber stewards the Hogg Foundation’s educational mission to document, archive and share the foundation’s history, which has become an important part of the histories of mental and public health in Texas, and the evolution of mental health discourse nationally and globally. Elizabeth provides access to the Hogg Foundation’s research, programs, and operations through the publicly accessible archive. Learn more about how to access our records here.

Laying Out the Horizon of Possibilities: Reflections on Developing the OSSArcFlow Guide to Documenting Born-Digital Archival Workflows

by Alexandra Chassanoff and Hannah Wang


OSSArcFlow (2017-2020) was an IMLS-funded grant initiative that began as a collaboration between the Educopia Institute and the University of North Carolina School of Library and Information Science. The goal of the project was to investigate, model, and synchronize born-digital curation workflows for collecting institutions that were using three leading open source software (OSS) platforms – BitCurator, Archivematica, and ArchivesSpace. The team recruited a diverse group of twelve partner institutions, ranging from a state historical society to public libraries to academic archives and special collections units at large research universities and consortia.

OSSArcFlow partners at in-person meeting in Chapel Hill, NC (December 2017)
Creator: Educopia Institute

Early on in the project, it became clear that many institutions were planning for and carrying out digital preservation activities ad hoc rather than as part of fully formed workflows. The lack of “best practice” workflow models to consult also seemed to hinder institutions’ abilities to articulate what shape their ideal workflows might take. Creating visual workflow diagrams for each institution provided an important baseline from which to compare and contrast workflow steps, tools, software, roles, and other factors across institutions. It also played an important, if unexpected, role in helping the project team understand the sociotechnical challenges underlying digital curation work. While configuring systems and processing born-digital content, institutions make many important decisions – what to do, how to do it, when to do it, and why – that influence the contours of their workflows. These decisions and underlying challenges, however, are often hidden from view, and can only be made visible by articulating and documenting the actions taken at each stage of the process. Similarly, while partners noted that automation in workflows was highly desirable, the documented workflows revealed the highly customized local implementations at each institution, which prevented the team from writing generalizable scripts for metadata handoffs that could apply to more than one institution’s use case.

Another unexpected but important pivot in the project was a shift towards breakout group discussions to focus on shared challenges or “pain points” identified in our workflow analysis. For partners, talking through shared challenges and hearing suggested approaches proved immensely helpful in advancing their own digital preservation planning. Our observation echoes similar findings by Clemens et al. (2020) in “Participatory Archival Research and Development: The Born-Digital Access Initiative,” who note that “the engagement and vulnerability involved in sharing works in progress resonates with people, particularly practitioners who are working to determine and achieve best practices in still-developing areas of digital archives and user services.” These conversations not only helped to build camaraderie and a community of practice around digital curation, but also revealed that planning for more mature workflows seemed to ultimately depend on understanding more about what was possible.

Overall, our research on the OSSArcFlow project led us to understand more about how gaps in coordinated work practices and knowledge sharing can impact the ability of institutions to plan and advance their workflows. These gaps are not just technical but also social, and crucially, often embedded in the work practices themselves. Diagramming current practices helps to make these gaps more visible so that they can be addressed programmatically. 

At the same time, the use of research-in-practice approaches that prioritize practitioner experiences in knowledge pursuits can help institutions bridge these gaps between where they are today and where they want to be tomorrow. As Clemens et al. (2020) point out, "much of digital practice itself is research, as archivists test new methods and gather information about emerging areas of the field." Our project findings show a significant difference between how the digital preservation literature conceptualizes workflow development and how boots-on-the-ground practitioners actually do the work of constructing workflows. Archival research and development projects should build in iterative practitioner reflections as a component of the R&D process, an important step for continuing to advance the work of doing digital preservation.

Initially, we imagined that the Implementation Guide we would produce would focus on strategies used to synchronize workflows across three common OSS environments. Based on our project findings, however, it became clear that helping institutions articulate a plan for digital preservation through shared and collaborative documentation of workflows would provide an invaluable resource for institutions as they undertake similar activities. Our hope in writing the Guide to Documenting Born-Digital Archival Workflows is to provide a resource that focuses on common steps, tools, and implementation examples in service of laying out the “horizon of possibilities” for practitioners doing this challenging work.  

The authors would like to recognize and extend immense gratitude to the rest of the OSSArcFlow team and the project partners who helped make the project and its deliverables a success. The Guide to Documenting Born-Digital Archival Workflows was authored by Alexandra Chassanoff and Colin Post and edited by Katherine Skinner, Jessica Farrell, Brandon Locke, Caitlin Perry, Kari Smith, and Hannah Wang, with contributions from Christopher A. Lee, Sam Meister, Jessica Meyerson, Andrew Rabkin, and Yinglong Zhang, and design work from Hannah Ballard.


Alexandra Chassanoff is an Assistant Professor at the School of Library and Information Sciences at North Carolina Central University. Her research focuses on the use and users of born-digital cultural heritage. From 2017 to 2018, she was the OSSArcFlow Project Manager. Previously, she worked with the BitCurator and BitCurator Access projects while pursuing her doctorate in Information Science at UNC-Chapel Hill. She co-authored (with Colin Post and Katherine Skinner) the Guide to Documenting Born-Digital Archival Workflows.    

Hannah Wang is currently the Project Manager for BitCuratorEdu (IMLS, 2018-2021), where she manages the development of open learning objects for digital forensics and facilitates a community of digital curation educators. She served as the Project Manager for the final stage of OSSArcFlow and co-edited the Guide to Documenting Born-Digital Archival Workflows.

Integrating Environmental Sustainability into Policies and Workflows

by Keith Pendergrass

This is the first post in the bloggERS Another Kind of Glacier series.


Background and Challenges

My efforts to integrate environmental sustainability and digital preservation in my organization—Baker Library Special Collections at Harvard Business School—began several years ago when we were discussing the long-term preservation of forensic disk images in our collections. We came to the conclusion that keeping forensic images instead of (or in addition to) the final preservation file set can raise ethical, privacy, and environmental issues. We decided that we would preserve forensic images only in use cases where there was a strong need to do so, such as a legal mandate in our records management program. I talked about our process and results at the BitCurator Users Forum 2017.

From this presentation grew a collaboration with three colleagues who heard me speak that day: Walker Sampson, Tessa Walsh, and Laura Alagna. Together, we reframed my initial inquiry to focus on environmental sustainability and enlarged the scope to include all digital preservation practices and the standards that guide them. The result was our recent article and workshop protocol.

During this time, I began aligning our digital archives work at Baker Library with this research as well as our organization-wide sustainability goals. My early efforts mainly took the form of the stopgap measures that we suggest in our article: turning off machines when not in use; scheduling tasks for off-peak network and electricity grid periods; and purchasing renewable energy certificates that promote additionality, which is done for us by Harvard University as part of its sustainability goals. As these were either unilateral decisions or were being done for me, they were straightforward and quick to implement.

To make more significant environmental gains along the lines of the paradigm shift we propose in our article, however, requires greater change. This, in turn, requires more buy-in and collaboration within and across departments, which often slows the process. In the face of immediate needs and other constraints, it can be easy for decision makers to justify deprioritizing the work required to integrate environmental sustainability into standard practices. With the urgency of the climate and other environmental crises, this can be quite frustrating. However, with repeated effort and clear reasoning, you can make progress on these larger sustainability changes. I found success most often followed continual reiteration of why I wanted to change policy, procedure, or standard practice, with a focus on how the changes would better align our work and department with organizational sustainability goals. Another key argument was showing how our efforts for environmental sustainability would also result in financial and staffing sustainability.

Below, I share examples of the work we have done at Baker Library Special Collections to include environmental sustainability in some of our policies and workflows. While the details may be specific to our context, the principles are widely applicable: integrate sustainability into your policies so that you have a strong foundation for including environmental concerns in your decision making; and start your efforts with appraisal as it can have the most impact for the time that you put in.

Policies

The first policy in which we integrated environmental sustainability was our technology change management policy, which controls our decision making around the hardware and software we use in our digital archives workflows. The first item we added to the policy was that we must dispose of all hardware following environmental standards for electronic waste and, for items other than hard drives, that we must donate them for reuse whenever possible. The second item involved more collaboration with our IT department, which controls computer refresh cycles, so that we could move away from the standard five-year replacement timeframe for desktop computers. The workstations that we use to capture, appraise, and process digital materials are designed for long service lives, heavy and sustained workloads, and easy component change out. We made our case to IT—as noted above, this was an instance where the complementarity of environmental and financial sustainability was key—and received an exemption for our workstations, which we wrote into our policy to ensure that it becomes standard practice.

We can now keep the workstations as long as they remain serviceable and work with IT to swap out components as they fail or need upgrading. For example, we replaced our current workstations’ six-year-old spinning disk drives with solid state drives when we updated from Windows 7 to Windows 10, improving performance while maintaining compliance with IT’s security requirements. Making changes like this allows us to move from the standard five-year to an expected ten-year service life for these workstations (they are currently at 7.5 years). While the policy change and subsequent maintenance actions are small, they add up over time to provide substantial reductions in the full life-cycle environmental and financial costs of our hardware.

We also integrated environmental sustainability into our new acquisition policy. The policy outlines the conditions and terms of several areas that affect the acquisition of materials in any format: appraisal, agreements, transfer, accessioning, and documentation. For appraisal, we document the value and costs of a potential acquisition, but previously had been fairly narrow in our definition of costs. With the new policy, we broadened the costs that were in scope for our acquisition decisions and as part of this included environmental costs. While only a minor point in the policy, it allows us to determine environmental costs in our archival and technical appraisals, and then take those costs into account when making an acquisition decision. Our next step is to figure out how best to measure or estimate environmental impacts for consistency across potential acquisitions. I am hopeful that explicitly integrating environmental sustainability into our first decision point—whether to acquire a collection—will make it easier to include sustainability in other decision points throughout the collection’s life cycle.

Workflows

In a parallel track, we have been integrating environmental sustainability into our workflows, focusing on the appraisal of born-digital and audiovisual materials. This is a direct result of the research article noted above, in which we argue that focusing on selective appraisal can be the most consequential action because it affects the quantity of digital materials that an organization stewards for the remainder of those materials’ life cycle and provides an opportunity to assign levels of preservation commitment. While conducting in-depth appraisal prior to physical or digital transfer is ideal, it is not always practical, so we altered our workflows to increase the opportunities for appraisal after transfer.

For born-digital materials, we added an appraisal point during the initial collection inventory, screening out storage media whose contents are wholly outside of our collecting policy. We then decide on a capture method based on the type of media: we create disk images of smaller-capacity media but often package the contents of larger-capacity media using the BagIt specification (unless we have a use case that requires a forensic image) to reduce the storage capacity needed for the collection and to avoid the ethical and privacy issues previously mentioned. When we do not have control of the storage media—for network attached storage, cloud storage, etc.—we make every attempt to engage with donors and departments to conduct in-depth appraisal prior to capture, streamlining the remaining appraisal decision points.
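
For readers less familiar with that packaging step, here is a minimal sketch using the bagit-python library; the staging path is hypothetical and the exact checksum arguments may vary with the library version:

import bagit

# Hypothetical working copy of a transfer from a larger-capacity carrier.
# make_bag() restructures the directory in place and writes checksum manifests,
# so fixity is recorded without keeping a full forensic image of the carrier.
bag = bagit.make_bag("staging/acc2020-015", checksums=["md5", "sha256"])

# Confirm that the payload files and the manifests agree after packaging.
bag.validate()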

After capture, we conduct another round of appraisal now that we can more easily view and analyze the digital materials across the collection. This tends to be a higher-level appraisal during which we make decisions about entire disk images or BagIt bags, or large groupings within them. Finally (for now), we conduct our most granular and selective appraisal during archival processing, when processing archivists, curators, and I work together to determine what materials should be part of the collection's preservation file set. As our digital archives program is still young, we have not yet explored re-appraisal at further points of the life cycle such as access, file migration, or storage refresh.

For audiovisual materials, we follow a similar approach as we do for born-digital materials. We set up an audiovisual viewing station with equipment for reviewing audiocassettes, microcassettes, VHS and multiple Beta-formatted video tapes, multiple film formats, and optical discs. We first appraise the media items based on labels and collection context, and with the viewing station can now make a more informed appraisal decision before prioritizing for digitization. After digitization, we appraise again, making decisions on retention, levels of preservation commitment, and access methods.

While implementing multiple points of selective appraisal throughout workflows is more labor intensive than simply conducting an initial appraisal, several arguments moved us to take this approach: it is a one-time labor cost that helps us reduce on-going storage and maintenance costs; it allows us to target our resources to those materials that have the most value for our community; it decreases the burden of reappraisal and other information maintenance work that we are placing on future archivists; and, not least, it reduces the on-going environmental impact of our work.


Keith Pendergrass is the digital archivist for Baker Library Special Collections at Harvard Business School, where he develops and oversees workflows for born-digital materials. His research and general interests include integration of sustainability principles into digital archives standard practice, systems thinking, energy efficiency, and clean energy and transportation. He holds an MSLIS from Simmons College and a BA from Amherst College.

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.

The Digital Processing Framework's target audience is folks who process born-digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born-digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers. These tiers include: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

Small-Scale Scripts for Large-Scale Analysis: Python at the Alexander Turnbull Library

by Flora Feltham

This is the third post in the bloggERS Script It! series.

The Alexander Turnbull is a research library that holds archives and special collections within the National Library of New Zealand. This means exactly what you’d expect: manuscripts and archives, music, oral histories, photographs, and paintings, but also artefacts such as Katherine Mansfield’s typewriter and a surprising amount of hair. In 2008, the National Library established the National Digital Heritage Archive (NDHA), and has been actively collecting and managing born-digital materials since. I am one of two Digital Archivists who administer the transfer of born-digital heritage material to the Library. We also analyse files to ensure they have all the components needed for long-term preservation and ingest collections to the NDHA. We work closely with our digital preservation system administrators and the many other teams responsible for appraisal, arrangement and description, and providing access to born-digital heritage.

Why Scripting?

As archivists responsible for safely handling and managing born-digital heritage, we use scripts to work safely and sanely at scale. Python provides a flexible yet reliable platform for our work: we don't have to download and learn a new piece of software every time we need to accomplish a different task. The increasing size and complexity of digital collections often means that machine processing is the only way to get our work done. A human could not reliably identify every duplicate file name in a collection of 200,000 files… but Python can (see the sketch after the list below). To protect original material, any scripting we do during appraisal and technical analysis is run against copies of the collections. Here are some of the extremely useful tasks my team scripted recently:

  • Transfer
    • Generating a list of files on original storage media
    • Transferring files off the original digital media to our storage servers
  • Appraisal
    • Identifying duplicate files across different locations
    • Adding file extensions so material opens in the correct software
    • Flattening complex folder structures to support easy assessment
  • Technical Analysis
    • Sorting files into groups based on file extension to isolate unknown files
    • Extracting file signature information from unknown files
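
For a flavour of what these look like, here is a rough sketch of duplicate-file-name detection; the collection path below is hypothetical, and a production version would likely compare checksums as well as names:

import os
from collections import defaultdict

# Hypothetical path to a working copy of a collection (never the original media).
top = "working_copies/MSX-1234"

locations = defaultdict(list)
for root, dir_list, file_list in os.walk(top):
    for name in file_list:
        # Group full paths by file name so repeats are easy to spot.
        locations[name.lower()].append(os.path.join(root, name))

for name, paths in locations.items():
    if len(paths) > 1:
        print(name, "appears in", len(paths), "locations:")
        for path in paths:
            print("   ", path)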

Our most-loved Python script even has a name: Safe Mover. Developed and maintained by our Digital Preservation Analyst, Safe Mover will generate file checksums, maintain metadata integrity, and check file authenticity, all the while copying material off digital storage media. Running something somebody else wrote was a great introduction to scripting. I finally understood that a) you can do nimble computational work from a text editor; and b) a ‘script’ is just a set of instructions you write for the computer to follow.
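
Safe Mover itself is not reproduced here, but the basic pattern behind a checksum-verified copy (hash the original, copy it, hash the copy, compare the two) can be sketched in a few lines of Python. The paths below are purely hypothetical:

import hashlib
import shutil
from pathlib import Path

def sha256(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verified_copy(source, destination):
    """Copy a file and confirm the copy's checksum matches the original."""
    source, destination = Path(source), Path(destination)
    original = sha256(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 keeps timestamps where the OS allows
    if sha256(destination) != original:
        raise IOError("Checksum mismatch after copying " + str(source))
    return original

# Hypothetical paths: a mounted carrier and a staging area on our storage servers.
print(verified_copy("/media/carrier/letter.doc", "staging/MSX-1234/letter.doc"))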

Developing Skills Slowly, Systematically, and as Part of a Group

Once we recognised that we couldn’t do our work without scripting, my team started regular ‘Scripting Sessions’ with other colleagues who code. At each meeting we solve a genuine processing challenge from someone’s job, which allows us to apply what we learn immediately. We also write the code together on a big screen, which, as with learning any language, helps us become comfortable making mistakes. Recently, I accidentally copied 60,000 spreadsheets to my desktop and then wondered aloud why my laptop ground to a halt.

Outside of these meetings, learning to script has been about deliberately choosing to problem-solve using Python rather than doing it ‘by hand’. Initially, this felt counter-intuitive because I was painfully slow: I really could have counted 200 folders on my fingers faster than I could write a script to do the same thing. But luckily for me, my team recognised the overall importance of this skill set, and I regularly remind myself: “this will be so useful the next 5,000 times I inevitably need to do this task”.

The First Important Thing I Remembered How to Write

Practically every script I write starts with import os. It became so simple once I’d done it a few times: import is a command, and ‘os’ is the name of the thing I want. os is a Python module that allows you to interact with and leverage the functionality of your computer’s operating system. In general terms, a Python module is just a pre-existing code library for you to use. Modules are usually grouped around a particular theme or set of tasks.

The function I use the most is os.walk(). You tell os.walk() where to start and then it systematically traverses every folder beneath that. For every folder it finds, os.walk() will record three things: 1: the path to that folder, 2: a list of any sub-folders it contains, and 3: a list of any files it contains. Once os.walk() has completed its… well… walk, you have access to the name and location of every folder and file in your collection.

You then use Python to do something with this data: print it to the screen, write it in a spreadsheet, ask where something is, or open a file. Having access to this information becomes relevant to archivists very quickly. Just think about our characteristic tasks and concerns: identifying and recording original order, gaining intellectual control, maintaining authenticity. I often need to record or manipulate file paths, move actual files around the computer, or extract metadata stored in the operating system.

An Actual Script

Recently, we received a Mac-formatted 1TB hard drive from a local composer and performer. When our #1 script, Safe Mover, stopped in the middle of transferring files, we wondered if there was a file path clash. Generally speaking, in a Windows-formatted file system there’s a limit of 255 characters to a file path (“D:\so_it_pays_to\keep_your_file_paths\niceand\short.doc”).

Some older Mac systems have no limit on the number of file path characters so, if they’re really long, there can be a conflict when you’re transferring material to a Windows environment. To troubleshoot this problem we wrote a small script:

import os

top = r"D:\example_folder_01\collection"
for root, dir_list, file_list in os.walk(top):
    for item in file_list:
        file_path = os.path.join(root,item)
        if len(file_path) > 255:
            print (file_path)

So that’s it – and running it over the hard drive gave us the answer we needed. But what is this little script actually doing?

# import the Python module 'os'.
import os

# tell our script where to start
top = r"D:\example_folder_01\collection"
# now os.walk will systematically traverse the directory tree starting from 'top' and retain information in the variables root, dir_list, and file_list.
# remember: root is 'path to our current folder', dir_list is 'a list of sub-folder names', and file_list is 'a list of file names'.
for root, dir_list, file_list in os.walk(top):
    # for every file in those lists of files...
    for item in file_list:
        # ...store its location and name in a variable called 'file_path'.
        # os.path.join joins the folder path (root) with the file name for every file (item).
        file_path = os.path.join(root,item)
        # now we do the actual analysis. We want to know 'if the length of any file path is greater than 255 characters'.
        if len(file_path) > 255:
            # if so: print that file path to the screen so we can see it!
            print (file_path)

 

All it does is find every file path longer than 255 characters and print it to the screen. The archivists can then eyeball the list and decide how to proceed. Or, in the case of our 1TB hard drive, exclude that as a problem because it didn’t contain any really long file paths. But at least we now know that’s not the problem. So maybe we need… another script?


Flora Feltham is the Digital Archivist at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa. In this role she supports the acquisition, ingest, management, and preservation of born-digital heritage collections. She lives in Wellington.

Restructuring and Uploading ZIP Files to the Internet Archive with Bash

by Lindsey Memory and Nelson Whitney

This is the second post in the bloggERS Script It! series.

This blog is for anyone interested in uploading ZIP files into the Internet Archive.

The Harold B. Lee Library at Brigham Young University has been a scanning partner with the Internet Archive since 2009. Any loose-leaf or oversized items go through the convoluted workflow depicted below, which can take hours of painstaking mouse-clicking if you have a lot of items like we do (think 14,000 archival issues of the student newspaper). Items must be scanned individually as JPEGs, each JPEG must be reformatted into a JP2, the JP2s must all be zipped into ZIP files, and then the ZIPs are (finally) uploaded one by one into the Internet Archive.

The old workflow for uploading ZIP files to the Internet Archive.

Earlier this year, the engineers at the Internet Archive published a single line of Python code that allows scan centers to upload multiple ZIP files into the Internet Archive at once (see “Bulk Uploads”). My department has long dreamed of a script that could reformat Internet-Archive-bound items and upload them automatically. The arrival of the Python code got us moving. I enlisted the help of the library’s senior software engineer, and we discussed ways to compress scans, how Python scripts communicate with the Internet Archive, and ways to reorganize the scans’ directory structure in a way conducive to a basic Bash script.
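
The Internet Archive’s snippet is not reproduced here, but the same capability is exposed through its internetarchive Python library; as a rough illustration (the identifier, file names, and metadata below are hypothetical), a bulk upload can be as compact as:

from internetarchive import upload

# Hypothetical identifier, ZIP files, and metadata; real values depend on the project.
responses = upload(
    "byu_student_newspaper_1956_01",
    files=["1956-01-09_images.zip", "1956-01-16_images.zip"],
    metadata={"collection": "test_collection", "mediatype": "texts"},
    retries=10,
)
print([r.status_code for r in responses])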

The project was delegated to Nelson Whitney, a student developer. Nelson wrote the script, troubleshot it with me repeatedly, and helped author this blog. Below we present his final script in two parts, written in Bash for macOS in spring 2018.

Part 1: makeDirectories.sh

This simple command, executed through Terminal on a Mac, takes a list of identifiers (list.txt) and generates a set of organized subdirectories for each item on that list. These subdirectories house the JPEGs and are structured such that later they streamline the uploading process.

#! /bin/bash

# Move into the directory "BC-100" (the name of our quarto scanner), then move into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory.
cd BC-100/$1/$2

# Takes the plain text "list.txt," which is saved inside the staging subdirectory, and makes a new subdirectory for each identifier on the list.
cat list.txt | xargs mkdir
# For each identifier subdirectory,
for d in */; do
  # Terminal moves into that directory and
  cd $d
  # creates three subdirectories inside named "01_JPGs_cropped,"
  mkdir 01_JPGs_cropped
  # "02_JP2s,"
  mkdir 02_JP2s
  # and "03_zipped_JP2_file," respectively.
  mkdir 03_zipped_JP2_file
  # It also deposits a blank text file in each identifier folder for employees to write notes in.
  touch Notes.txt

  cd ..
done

The file structure created by makeDirectories.sh.

Part 2: macjpzipcreate.sh

This Terminal command can recursively move through subdirectories, going into 01_JPGs_cropped first and turning all the JPEGs therein into JP2s. Terminal saves the JP2s into the subdirectory 02_JP2s, then zips the JP2s into a zip file and saves the zip in subdirectory 03_zipped_JP2_file. Finally, Terminal uploads the zip into the Internet Archive. Note that for the bulk upload to work, you must have configured Terminal with the “./ia configure” command and entered your IA admin login credentials.

#! /bin/bash

# Places the binary file "ia" into the project directory (this enables upload to Internet Archive)
cp ia BC-100/$1/$2
# Move into the directory "BC-100" (the name of our quarto scanner), then moves into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory
cd BC-100/$1/$2

  # For each identifier subdirectory
  for d in */; do
  # Terminal moves into that identifier's directory and then the directory containing all the JPG files
  cd $d
  cd 01_JPGs_cropped

  # For each jpg file in the directory
  for jpg in *.jpg; do
    # Terminal converts the jpg files into jp2 format using the sips command in MAC OS terminals
    sips -s format jp2 --setProperty formatOptions best $jpg --out ../02_JP2s/$jpg.jp2
  done

  cd ../02_JP2s
  # The directory variable contains a trailing slash. Terminal removes the trailing slash,
  d=${d%?}
  # gives the correct name to the zip file,
  im="_images"
  # and zips up all JP2 files.
  zip $d$im.zip *
  # Terminal moves the zip files into the intended zip file directory.
  mv $d$im.zip ../03_zipped_JP2_file

  # Terminal moves back up the project directory to where the ia script exists
  cd ../..
  # Uses the Internet-Archive-provided Python Script to upload the zip file to the internet
  ./ia upload $d $d/03_zipped_JP2_file/$d$im.zip --retries 10
  # Change the repub_state of the online identifier to 4, which marks the item as done in Internet Archive.
  ./ia metadata $d --modify=repub_state:4
done

The script has reduced the labor devoted to the Internet Archive by a factor of four. Additionally, it has bolstered Digital Initiatives’ relationship with IT. It was a pleasure working with Nelson; he gained real-time engineering experience working with a “client,” and I gained a valuable working knowledge of Terminal and the design of basic scripts.


 

Lindsey Memory is the Digital Initiatives Workflows Supervisor at the Harold B. Lee Library at Brigham Young University. She loves old books and the view from her backyard.

 

 


Nelson Whitney is an undergraduate pursuing a degree in Computer Science at Brigham Young University. He enjoys playing soccer, solving Rubik’s cubes, and spending time with his family and friends. One day, he hopes to do cyber security for the Air Force.

There’s a First Time for Everything

By Valencia Johnson

This is the fourth post in the bloggERS series on Archiving Digital Communication.


This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Due to their longevity with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archive collected an archival Outlook .pst file of 15.8 GB! This was my first time working with emails. This was also the first time some of the tools discussed below were used in the workflow at my institution. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions on the tools I used in this acquisition.

My original workflow:

  1. Convert the .pst file into an .mbox file.
  2. Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
  3. Follow and complete our accessioning procedures.

Things were moving smoothly; I was able to use Emailchemy, a tool that converts email from closed, proprietary file formats, such as the .pst files used by Outlook, to standard, portable formats that any application can use, such as .mbox files, which can be read using Thunderbird, Mozilla’s open source email client. I used a Windows laptop that had Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy; the instructions in the owner’s manual were clear and the process was easy. Next, I uploaded the Email folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I use for accessioning is a FRED, a powerful forensic recovery tool used by law enforcement and some archivists. Our FRED runs BitCurator, which is a Linux environment. This is an important fact to remember because .pst files will not open on a Linux machine.

At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.

Yeah Bulk Extractor I’ll just pick up more cores during lunch.

The program was unable to complete 4 threads within the Email folder and timed out. The picture above is part of the explanation message I received. In my understanding and research (aka Google, because I did not understand the message), the program was unable to completely execute the task with the amount of processing power available. So the message is essentially saying, “I don’t know why this is taking so long. It’s you, not me. You need a better computer.” From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran the scan solely on the Email folder, and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.

I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged (e.g., /home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Email folder looked like this:

(Some text has been blacked out for privacy.)

Even though Bulk Extractor Viewer does display the content, it displays it like a text editor (e.g., Notepad), with all the coding alongside the content of the message rather than as an email, because all the results came from the .mbox file. This is simply how .mbox content appears without an email client to translate it into a human-readable format. The output makes it hard, though not impossible, to locate an individual message within the .pst, because the date or subject line of the email is buried in the coding. This was my first time encountering results like this, and it freaked me out a bit.

Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of an SSN or credit card number. So in lieu of Social Security numbers, I found FedEx tracking numbers and mistyped phone numbers (though, to be fair, a mistyped phone number may be someone’s SSN). For credit card numbers, the program picked up email coding and other non-financial number patterns.

The scan found an SSN I had to remove from the .pst and the .mbox. Remember, .pst files only work with Microsoft Outlook. At this point in processing, I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought maybe I could run a keyword search within the .mbox to locate and remove the flagged material, because you can open .mbox files with a text editor. Remember when I said the .pst was over 15 GB? Well, the .mbox was just as large, and this caused the text editor to stall and eventually give up opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a text editor designed for large files. This whole process took a couple of days, and in the end the results from Bulk Extractor’s search indicated the email files contained one SSN and no credit card numbers.
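
For anyone facing a similar search, another possible route is Python’s built-in mailbox module, which steps through an .mbox one message at a time rather than opening the whole file at once; a rough sketch (the path and the SSN-like pattern below are purely illustrative) might look like this:

import mailbox
import re

# Illustrative path and pattern; a pattern this simple will also flag
# tracking numbers and other look-alike strings, just as Bulk Extractor did.
MBOX_PATH = "Emails/acquisition.mbox"
ssn_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

for key, message in mailbox.mbox(MBOX_PATH).iteritems():
    # Walk every MIME part of the message, including text-based attachments.
    for part in message.walk():
        payload = part.get_payload(decode=True)
        if payload is None:  # multipart containers have no payload of their own
            continue
        text = payload.decode("utf-8", errors="replace")
        if ssn_pattern.search(text):
            print(key, message.get("Date"), message.get("Subject"))
            break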

While discussing my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor but created with .pst files in mind, to fulfill our due diligence in looking for sensitive information, since the Bulk Extractor scan had timed out before finishing. FileLocator Pro operates on Windows, so unfortunately we couldn’t run the scan on the FRED, but it was able to catch real SSNs hidden in attachments that did not appear in the Bulk Extractor results.

Within FileLocator Pro, I was able to view each email with the flagged content highlighted, much as in Bulk Extractor. There is also the option to open the attachments or emails in their respective programs, so a .pdf file opened in Adobe and the email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It didn’t perform as well searching for credit card numbers; I had some error messages stating that certain attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of its primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.

As emails continue to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process even larger files, regardless of how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails or attachments, or even complete a scan. There’s no such thing as a perfect program; I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve, and with forums like this blog we can learn from one another.


Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.