Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.

The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections, or parts of collections, considered to have higher value, risk, or access needs; and Intensive, which includes tasks that should only be done for collections with exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, it should also be done as part of any higher tier's work.

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.


The Top 10 Things We Learned from Building the Queer Omaha Archives, Part 2 – Lessons 6 to 10

by Angela Kroeger and Yumi Ohira

The Queer Omaha Archives (QOA) is an ongoing effort by the University of Nebraska at Omaha Libraries’ Archives and Special Collections to collect and preserve Omaha’s LGBTQIA+ history. This is still a fairly new initiative at the UNO Libraries, having been launched in June 2016. This blog post is adapted and expanded from a presentation entitled “Show Us Your Omaha: Combatting LGBTQ+ Archival Silences,” originally given at the June 2017 Nebraska Library Association College & University Section spring meeting. The QOA was only a year old at that point, and now that another year (plus change) has passed, the collection has continued to grow, and we’ve learned some new lessons.

So here are the top takeaways from UNO’s experience with the QOA.

#6. Words have power, and sometimes also baggage.

Words sometimes mean different things to different people. Each person’s life experience lends context to the way they interpret the words they hear. And certain words have baggage.

We named our collecting initiative the Queer Omaha Archives because in our case, “queer” was the preferred term for all LGBTQIA+ people as well as referring to the academic discipline of queer studies. In the early 1990s, the community in Omaha most commonly referred to themselves as “gays and lesbians.” Later on, bisexuals were included, and the acronym “GLB” came into more common use. Eventually, when trans people were finally acknowledged, it became “GLBT.” Then there was a push to switch the order to “LGBT.” And then more letters started piling on, until we ended up with the LGBTQIA+ commonly used today. Sometimes, it is taken even further, and we’ve seen LGBTQIAPK+, LGBTQQIP2SAA, LGBTQIAGNC, and other increasingly long and difficult-to-remember variants. (Although, Angela confesses to finding QUILTBAG to be quite charming.) The acronym keeps shifting, but we didn’t want our name to shift, so we followed the students’ lead (remember the QTS “Cuties”?) and just went with “queer.” “Queer” means everyone.

Except . . . “queer” has baggage. Heavy, painful baggage. At Pride 2016, an older man, who had donated materials to our archive, stopped by our booth and we had a conversation about the word. For him, it was the slur his enemies had been verbally assaulting him with for decades. The word still had a sharp edge for him. Angela (being younger than this donor, but older than most of the students on campus) acknowledged that they were just old enough for the word to be startling and sometimes uncomfortable. But many Millennials and Generation Z youths, as well as some older adults, have embraced “queer” as an identity. Notably, many of the younger people on campus have expressed their disdain for being put into boxes. Identifying as “gay” or “lesbian” or “bi” seems too limiting to them. Our older patron left our booth somewhat comforted by the knowledge that for much of the population, especially the younger generations, “queer” has lost its sting and taken on a positive, liberating openness.

But what about other LGBTQIA+ people who haven’t stopped by to talk to us, to learn what we mean when we call our archives “queer”? Who feels sufficiently put off by this word that they might choose against sharing their stories with our archive? We aren’t planning to change our name, but we are aware that our choice of word may give some potential donors and potential users a reason to hesitate before approaching us.

So whatever community your archive serves, think about the words that community uses to describe themselves, and the words others use to describe them, and whether those words might have connotations you don’t intend.

#7. Find your community. Partnerships, donors, and volunteers are the keys to success.

It goes without saying that archives are built from communities. We don’t (usually) create the records. We invite them, gather them, describe them, preserve them, and make them available to all, but the records (usually) come from somewhere else.

Especially if you’re building an archive for an underrepresented community, you need buy-in from the members of that community if you want your project to be successful. You need to prove that you’re going to be trustworthy, honorable, reliable stewards of the community’s resources. You need someone in your archive who is willing and able to go out into that community and engage with them. For us, that was UNO Libraries’ Archives and Special Collections Director Amy Schindler, who has a gift for outreach. Though herself cis and straight, she has put in the effort to prove herself a friend to Omaha’s LGBTQIA+ community.

You also need members of that community to be your advocates. For us, our advocates were our first donors, the people who took that leap of faith and trusted us with their resources. We started with our own university. The work of the archivist and the archives would not have been possible without the collaboration and support of campus partners. UNO GSRC Director Dr. Jessi Hitchins and UNO Sociology Professor Dr. Jay Irwin together provided the crucial mailing list for the QOA’s initial publicity and networking. Dr. Irwin and his students collected the interviews which launched the LGBTQ+ Voices oral history project. Retired UNO professor Dr. Meredith Bacon donated her personal papers and extensive library of trans resources. From outside the UNO community, Terry Sweeney, who with his partner Pat Phalen had served as editor of The New Voice of Nebraska, donated a complete set of that publication, along with a substantial collection of papers, photographs, and artifacts, and he volunteered in the archives for many weeks, creating detailed and accurate descriptions of the items. These four people, and many others, have become our advocates, friends, and champions within the Omaha LGBTQIA+ community.

Our lesson here: Find your champions. Prove your trustworthiness, and your champions will help open doors into the community.

#8. Be respectful, be interested, and be present.

Outreach is key to building connections, bringing in both donors and users for the collection. This isn’t Field of Dreams, where “If you build it, they will come.” You need to forge the partnerships first, in order to even be able to build it. And they won’t come if they don’t know about it and don’t believe in its value. (“They” in this case meaning the community or communities your archives serve, and “it” of course meaning your archives or special collections for that community.)

Fig. 3: UNO Libraries table at a Transgender Day of Remembrance event.

Yumi and Angela are both behind-the-scenes technical services types, so we don’t have quite as much contact with patrons and donors as some others in our department, but we’ve helped staff tables at events such as Pride, Transgender Day of Remembrance, and Transgender Day of Visibility. We also work to create a welcoming atmosphere for guests who come to the archives for events, tours, and research. We recognize the critical importance of the work our director does, going out and meeting people, attending events, talking to everyone, and inviting everyone to visit. As our director Amy Schindler said in the article “Collaborative Approaches to Digital Projects,” “Engagement with community partners is key . . .”

There’s also something to be said for simply ensuring that folks within the archives, and the library as a whole for that matter, have a basic awareness of the QOA and other collecting initiatives, so that we can better fulfill our mission of connecting people to the resources they need. After all, when someone walks into the library, before they even reach the archives, any staff member might be their first point of contact. So be sure to do outreach within your own institution, as well.

#9. Let me sing you the song of my administrative support.

The QOA was conceived through the efforts of UNO students, UNO employees, and Omaha communities to address the underrepresentation of LGBTQIA+ communities in Omaha.

The QOA initiative was inspired by Josh Burford, who delivered a presentation about collecting and archiving historical materials related to queering history, co-hosted by UNO’s Gender and Sexuality Resource Center during LGBTQIA+ History Month. After this event, the UNO community became keenly interested in collecting and preserving historical materials and oral history interviews about “Queer Omaha,” and began collaborating with our local LGBTQIA+ communities. In Summer 2016, the QOA was officially launched to preserve the legacy of LGBTQIA+ communities in the greater Omaha area. The QOA is an effort to combat an archival silence in the community, and digital collections and digital engagement are especially effective tools for making LGBTQIA+ communities aware that the archives welcome their records!

But none of this would have been possible without administrative support. If the library administration or the university administration had been indifferent, or worse, hostile, to the idea of building an LGBTQIA+ archive, we might not have been allowed to allocate staff time and other resources to this project. We might have been told “no.” Thank goodness, our library dean is 100% on board. And UNO is deeply committed to inclusion as one of our core values, which has created a campus-wide openness to the LGBTQIA+ community, resulting in an environment perfectly conducive to building this archive. In fact, in November 2017, UNO was identified as the most LGBTQIA+-friendly college in the state of Nebraska by the Campus Pride Index in partnership with BestColleges.com. An initiative like the QOA will always go much more smoothly when your administration is on your side.

#10. The Neverending Story.

We recognize that we still have a long way to go. There are quite a few gaps within our collection. We have the papers of a trans woman, but only oral histories from trans men. We don’t yet have anything from intersex folks or asexuals. We have very little from queer people of color or queer immigrants, although we do have some oral histories from those groups, thanks to the efforts of Dr. Jay Irwin, who launched the oral history project, and Luke Wegener, who was hired as a dedicated oral history associate for UNO Libraries. A major focus of the LGBTQIA+ oral history interview project is filling identified gaps within the collection, actively seeking more voices of color and other underrepresented groups within the LGBTQIA+ community. However, despite our efforts to increase the diversity within the collection, we haven’t successfully identified potential donors or interviewees to represent all of the letters within LGBTQIA, much less the +.

This isn’t, and should never be, a series of checkboxes on a list. “Oh, we have a trans woman. We don’t need any more trans women.” No, that’s not how it works. We seek to fill the gaps, while continuing to welcome additional material from groups already represented. We are absolutely not going to turn away a white cis gay man just because we already have multiple resources from white cis gay men. Every individual is different. Every individual brings a new perspective. We want our collection to include as many voices as possible. So we need to promote our collection more. We need to do more outreach. We need to attract more donors, users, and champions. This will remain an ongoing effort without an endpoint. There is always room for growth.


Angela Kroeger is the Archives and Special Collections associate at the University of Nebraska at Omaha and a lifelong Nebraskan. They received their B.A. in English from the University of Nebraska at Omaha and their Master’s in Library and Information Science from the University of Missouri.

Yumi Ohira is the Digital Initiatives Librarian at the University of Nebraska at Omaha. Ohira is originally from Japan where she received a B.S. in Applied Physics from Fukuoka University. Ohira moved to the United States to attend University of Kansas and Southern Illinois University-Carbondale where she was awarded an M.F.A. in Studio Art. Ohira went on to study at Emporia State University, Kansas, where she received an M.L.S. and Archive Studies Certification.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.
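
A minimal sketch of this step, assuming ArchivesSnake is installed, might look like the following; the baseurl is a placeholder standing in for the value read from the config file:

# Sketch of the login step; not the full script.
from getpass import getpass
from asnake.client import ASnakeClient

username = input("ArchivesSpace username: ")
password = getpass("ArchivesSpace password: ")

client = ASnakeClient(
    baseurl="http://archivesspace.example.org:8089",  # placeholder from the config file
    username=username,
    password=password,
)
client.authorize()  # starts the session used by later API calls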

Step 2: Get refID and number to start appending to file

The script then starts a while loop and asks if the user would like to input a new refID. If the user types “yes” or “y,” the script asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip’s filename. This is because the filename for each clip is the original refID, followed by an underscore and a number; asking where to start the numbering allows more clips to be made from the same original file when the script is run again later.

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
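
A simplified sketch of this step is below. The original script runs ffmpeg through the os module; subprocess is used here for clarity, and the filenames and times are examples only:

# Simplified sketch of the clip step; filenames and times are placeholders.
import subprocess

source = "rac_refid_abc123.mp4"   # access copy named by its ArchivesSpace refID
clip = "rac_refid_abc123_1.mp4"   # refID + underscore + clip number

# Get the duration of the original file (in seconds) with ffprobe; the script
# uses this to decide whether to prompt the user for an hours value.
duration = float(subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", source],
    capture_output=True, text=True, check=True).stdout)
print(f"Original file is {duration:.0f} seconds long")

start = "00:12:30"    # clip start time entered by the user
length = "00:01:45"   # clip duration entered by the user

# Copy the streams without re-encoding so the clip is created quickly.
subprocess.run(["ffmpeg", "-ss", start, "-i", source, "-t", length,
                "-c", "copy", clip], check=True)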

Clip from Rockefeller Family at Pocantico – Part I, circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
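
A hedged sketch of this lookup, using the same ArchivesSnake client from step 1; the repository number and refID are placeholders:

# Hedged sketch of the metadata lookup; `client` is the authorized ASnakeClient
# from step 1, and the repository number and refID are placeholders.
results = client.get(
    "repositories/2/find_by_id/archival_objects",
    params={"ref_id[]": "abc123def456"},
).json()

item_uri = results["archival_objects"][0]["ref"]
item = client.get(item_uri).json()

title = item.get("title")
dates = item.get("dates", [])
resource = client.get(item["resource"]["ref"]).json()
collection_title = resource.get("title")
collection_id = resource.get("id_0")
print(title, collection_title, collection_id)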

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
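
A minimal sketch of this step; the column layout and values below are illustrative only, not the script’s actual spreadsheet schema:

# Minimal sketch of the output step with placeholder values.
import csv

row = {
    "clip_file": "rac_refid_abc123_1.mp4",
    "title": "Example film title",
    "date": "circa 1920",
    "scope": "Example scope and contents note\nwith a line break.",
    "collection": "Example collection title",
    "clip_length": "00:01:45",
}

# Notes from ArchivesSpace can contain line breaks, which would break the CSV rows.
row["scope"] = row["scope"].replace("\r", " ").replace("\n", " ")

with open("clip_library.csv", "a", newline="") as f:
    csv.writer(f).writerow([row["clip_file"], row["title"], row["date"],
                            row["scope"], row["collection"], row["clip_length"]])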

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

Call for Contributions: Making Tech Skills a Strategic Priority

As a follow-up to our popular Script It! Series — which attempted to break down barriers and demystify scripting with walkthroughs of simple scripts — we’re interested in learning more about how archival institutions (as such) encourage their archivists to develop and promote their technical literacy more generally. As Trevor Owens notes in his forthcoming book, The Theory and Craft of Digital Preservation, “the scale and inherent structures of digital information suggest working more with a shovel than with a tweezers.” Encouraging archivists to develop and promote their technical literacy is one such way to use a metaphorical shovel!

Maybe you work for an institution that explicitly encourages its employees to learn new technical skills. Maybe your team or institution has made technical literacy a strategic priority. Maybe you’ve formed a collaborative study group with your peers to learn a programming language. Whatever the case, we want to hear about it!

Writing for bloggERS! “Making Tech Skills a Strategic Priority” Series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in late November or December, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro to both writing and running bash shell scripts (one of the most popular Unix shells, and the one BitCurator defaults to), check out this tutorial by Tania Rascia.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.
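
To make the batch idea above concrete, here is a tiny, hypothetical Python example (not one of the BitCurator scripts) that runs the same action over every file under a directory and records the results:

# Hypothetical batch example: checksum every file in a directory tree.
import hashlib
import pathlib

source = pathlib.Path("/mnt/transfer")   # placeholder directory of acquired files

with open("checksums.txt", "w") as report:
    for path in sorted(source.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            report.write(f"{digest}  {path}\n")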

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script grouping multiple commands together and running that group over a set of items.

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes both bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.

Improving Workflows at UNC Libraries’ Wilson Special Collections Library

by Erica Titkemeyer and Jessica Venlet

This is the tenth post in the bloggERS Script It! Series.

At Wilson Special Collections Library, we are always trying to find ways to improve our digital preservation workflows. Improving our skills with the command line and using existing command line tools has played a key role in workflow improvements. So, we’ve picked a few favorite tools and tips to share.

FFmpeg

We use FFmpeg for a number of daily tasks, whether it’s generating derivatives for preservation files or analyzing a video or audio file we’ve received through a born-digital accession.

Clearing embedded metadata and uses for FFprobe:
As part of our audio digitization work, we embed metadata into all preservation WAV files. This metadata follows guidelines set out by the Federal Agencies Digital Guidelines Initiative (FADGI) and mostly relates the file back to the original item it was digitized from, including its unique identifier, title, and the curatorial unit that holds it. A few times now we have found that this data is reflected inconsistently across files, is incorrect, or is insufficient.

WAV file metadata example. Look at that terrible metadata!

When large-scale issues have come up, particularly with legacy files in our backlog, we’ve made use of FFmpeg’s ‘-map_metadata’ command to batch delete the embedded metadata. Below is a script used to batch create brand new files without metadata, with “_clean” attached to their original file name:

for i in *.wav; do ffmpeg -i "$i" -map_metadata -1 -c:a copy "${i%.wav}_clean.wav"; done

After successfully removing metadata from the files, we use the tool BWF MetaEdit to batch embed the correct metadata that we have prepared in a .csv file.

For born-digital work, we regularly use the tool/command ‘ffprobe’, a stream analyzer that is part of the FFmpeg build. It allows us to quickly see data about AV files (such as duration, file size, codecs, aspect ratio, etc.) that are helpful in identifying files and making general appraisal decisions. As we grow our capabilities in preserving born digital AV, we also foresee the need to document this type of file data in our ingest documentation.
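
As a rough illustration (not one of our workflow scripts), ffprobe’s JSON output can also be read from Python, which makes it easy to fold this kind of data into ingest documentation; the file name below is hypothetical:

# Rough illustration: reading ffprobe's JSON output from Python.
import json
import subprocess

def probe(path):
    """Return duration (seconds), size (bytes), and codec names for an A/V file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True)
    info = json.loads(result.stdout)
    duration = float(info["format"]["duration"])
    size = int(info["format"]["size"])
    codecs = [s.get("codec_name") for s in info["streams"]]
    return duration, size, codecs

print(probe("accession_1234_video.mov"))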

walk_to_dfxml.py

In our born-digital workflows we don’t disk image every digital storage device we receive by default. This workflow choice has benefits and disadvantages. One disadvantage is losing the ability to quickly document all timestamps associated with files. While our workflows were preserving last modified dates, other timestamps like access or creation dates were not as effectively captured. In search of a way to remedy this issue, I turned to Twitter for some advice on the capture and value of each timestamp. Several folks recommended generating DFXML which is usually used on disk images. Tim Walsh helpfully pointed to a python script walk_to_dfxml.py that can generate DFXML on directories instead of disk images. Workflow challenge solved!

DFXML output example

Brunnhilde

Brunnhilde is another tool that was particularly helpful in consolidating tasks and tools. By kicking off Brunnhilde in the command line, we are able to: check for viruses, create checksums, identify file formats, identify duplicates, create a manifest, and run a PII scan. Additionally, this tool gives us a report that is useful for digital archives specialists, but also holds potential as an appraisal tool for consultations with curators. We’re still working out that aspect of the workflow, but when it comes to the technical steps, Brunnhilde and the associated command line tools it includes have really improved our processing work.

Learning as We Go

Like many archivists, we had limited experience with using the command line before graduate school. In the course of our careers, we’ve had to learn a lot on the fly because so many great command line tools are essential for working with digital archives.

One thing that can be tricky when you are new is moving the cursor around the terminal easily. It seems like it should be a no-brainer, but it’s really not so obvious.

  • For Macs, see the excellent Script Ahoy resource:
  • For PC, see this resource for a variety of shortcuts, including cursor movement. In general:
    • Home key moves to beginning. End key moves to the end.
    • Ctrl + left or right arrow moves the cursor around in chunks

Another helpful set of commands are remove (rm) and move (mv). We use these when dealing with extraneous files created through quality control applications in our AV workflow that we’d like to delete quickly, or when we need to separate derivatives (such as mp3s) from a large batch of preservation files (wav).

    • Important note about rm: it’s always smart to first use ‘echo’ to see what files you would be removing with your command (ex: echo rm *.lvl would list all the .lvl files that would be removed by your command).

If you are just starting out, you may consider exploring online tutorials or guides like:


Erica Titkemeyer is the Project Director and Audiovisual Conservator for the Southern Folklife Collection at the UNC Wilson Special Collections Library, coordinating in-house digitization and outsourcing of audiovisual materials for preservation. Erica also participates in the improvement of online access and digital preservation for digitized materials.

Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the acquisition and management of born-digital university records. She earned a Master of Science in Information degree from the University of Michigan.

Of Python and Pandas: Using Programming to Improve Discovery and Access

by Kate Topham

This is the ninth post in the bloggERS Script It! Series.

Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?

This required some programming chops, so the second half of this post is targeted towards those who have a working knowledge of Python as well as the pandas library.

If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.

Cleaning your data

A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled “African-American History” and “History, African American,” respectively. When asked to compare these two strings, python will consider them different. Therefore, when you ask python to count all the books with the subject “African American History,” it counts only those whose subjects match that string exactly.
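
A quick illustration of the problem:

# Python compares strings exactly, so these two subject headings do not match,
# and an exact count of "African American History" misses both variants.
a = "African-American History"
b = "History, African American"
print(a == b)  # False

subjects = [a, b]
print(subjects.count("African American History"))  # 0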

Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.


Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a…

it will display all the different values in that column in a box on the left.

If we choose one of these facets, we can view all the rows that have that value in the “a” column.

Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.

Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting each string alphabetically and matching them that way. Since “History, African American” and “African American History” both evaluate to “aaaaccefhiiimnnorrrsty,” OpenRefine will group them together and allow you to change all of the matching fields to either (or something totally different!) according to your preference.
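
A rough illustration of that idea in Python (OpenRefine’s own key functions differ in the details):

# Lowercasing, dropping punctuation, and sorting the letters gives both
# variants the same key, so they cluster together.
def letter_key(value):
    return "".join(sorted(c for c in value.lower() if c.isalpha()))

print(letter_key("African American History"))   # aaaaccefhiiimnnorrrsty
print(letter_key("History, African American"))  # aaaaccefhiiimnnorrrsty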

This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.

Analyzing your data

In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.

First, I pivoted the table from wide form to long so that we could count separate pairs of subject tags. The pandas ‘melt’ method keyed each row by bibliographic id and subject so that books with multiple subjects would be counted in each of their categories.

Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.

I grouped the data by subject using .groupby(). The .agg() function communicates how we want to count within the subject groups.

Because of the vast number of subjects, we chose to focus on the ten most numerous. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.

Once I had our top ten list, I could select the count by decade for each subject.
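
A condensed, hypothetical sketch of these pandas steps, using toy data in place of the real MARC export (the actual notebook starts from the melted table described above):

# Toy data standing in for the melted subject table and the bib_data date table.
import pandas as pd

subjects = pd.DataFrame({
    "bib_id": [1, 1, 2, 3],
    "650_a": ["Dogs", "Fiction", "African American History", "Dogs"],
})
bib_data = pd.DataFrame({
    "bib_id": [1, 2, 3],
    "decade": [1890, 1860, 1920],
})

# Merge dates onto the subject rows, then count titles per subject and decade.
merged = subjects.merge(bib_data, on="bib_id")
by_subject_decade = (merged.groupby(["650_a", "decade"])
                           .agg(count=("bib_id", "nunique"))
                           .reset_index())

# The ten most numerous subjects overall.
top_ten = (merged.groupby("650_a")["bib_id"].nunique()
                 .sort_values(ascending=False).head(10).index)

print(by_subject_decade[by_subject_decade["650_a"].isin(top_ten)])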

Visualizing your data

In order to visualize this data, I used a python library called plotly. Plotly generates graphs from your data. It plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in a for loop to create a line on the graph for each subject.
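
A hedged sketch of that plotting loop with toy data and current plotly syntax (not the original notebook code):

# One line per subject, decades on the x-axis; data values are placeholders.
import pandas as pd
import plotly.graph_objects as go

counts = pd.DataFrame({
    "subject": ["Dogs", "Dogs", "African American History", "African American History"],
    "decade": [1890, 1900, 1860, 1900],
    "count": [5, 3, 2, 6],
})

fig = go.Figure()
for subject in counts["subject"].unique():
    rows = counts[counts["subject"] == subject]
    fig.add_trace(go.Scatter(x=rows["decade"], y=rows["count"],
                             mode="lines+markers", name=subject))
fig.update_layout(title="Books per decade by subject",
                  xaxis_title="Decade", yaxis_title="Number of books")
fig.write_html("subjects_by_decade.html")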

Some of the interesting patterns we noticed were the spike in African-American books soon after 1865, the end of the Civil War, and another at the end of the 20th century, following the Civil Rights movement. Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.

Acknowledgments

I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.

GVSU Scripted IR Curation in the Cloud

by Matt Schultz and Kyle Felker

This is the seventh post in the bloggERS Script It! Series.


In 2016, bepress launched a new service to assist subscribers of Digital Commons with archiving the contents of their institutional repository. Using Amazon Web Services (AWS), this new service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket hosted by the institution. Subscribers thus control a real-time copy of all their institutional repository content outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket, all on their own schedules, and for whatever purposes they deem appropriate.

Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.

Leveraging the ease of launching Amazon EC2 server instances and the AWS command line interface (CLI), GVSU Libraries was able to readily configure an EC2 “curation server” that directly syncs copies of their Digital Commons data from S3 to the new server, where the installed tools mentioned above do their work of building preservation packages and sending them back to S3 and to Glacier for nearline and long-term storage. The entire process now begins and ends in the Cloud, rather than involving the download of data to a local workstation for processing.

Creating a digital preservation copy of GVSU’s ScholarWorks files involves four steps (a condensed sketch in code follows the list):

  1. Syncing data from S3 to EC2:  No processing can actually be done on files in-place on S3, so they must be copied to the EC2 “curation server”. As part of this process, we wanted to reorganize the files into logical directories (alphabetized), so that it would be easier to locate and process files, and to better organize the virus reports and the “bags” the process generates.
  2. Virus and format reports with Brunnhilde:  The synced files are then run through Brunnhilde’s suite of tools.  Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files.  The reports stay with the files as they move through the process.
  3. Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool.  This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
  4. Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage.  From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.
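
As a compressed, hypothetical illustration of these four stages, the Python below shows the general shape of the pipeline; GVSU’s actual process_backup is a shell script (see the GitHub repository below), and the bucket names and paths here are placeholders:

# Hypothetical illustration of the four stages; not GVSU's process_backup script.
import subprocess
import bagit

WORKDIR = "/data/scholarworks"

# 1. Sync the bepress archive bucket down to the EC2 curation server.
subprocess.run(["aws", "s3", "sync", "s3://example-bepress-archive", WORKDIR],
               check=True)

# 2. Run virus and format reporting (Brunnhilde/ClamAV/Siegfried) over WORKDIR
#    here; the exact invocation depends on the installed version, so it is omitted.

# 3. Package the checked files for preservation with BagIt.
bagit.make_bag(WORKDIR, {"Source-Organization": "GVSU Libraries"})

# 4. Sync the bagged content to the nearline bucket, where lifecycle rules
#    migrate it to Glacier on a quarterly schedule.
subprocess.run(["aws", "s3", "sync", WORKDIR, "s3://example-gvsu-nearline"],
               check=True)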

After the first complete run, new files are incorporated and re-synced on a quarterly basis to the BagIt data directories and re-checked with Brunnhilde. The BagIt metadata must then be updated and re-verified using BagIt, and the changes synced to the destination S3 bucket.

Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).

These scripts can be found at our github repo: https://github.com/gvsulib/curationscripts

Process_backup is the main script. It handles each of the four processing stages outlined above.  As it does so, it stores the output of those tasks in log files so they can be examined later.  In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.

After the first time the process is run, the metadata that BagIt generates has to be updated to reflect new data.  The version of BagIt we are using (Python) can’t do this from the command line, but it does have an API with a command that will update existing “bag” metadata. So, we created a small Python script to do this (regen_BagIt_manifest.py).  The shell script invokes this script at the third stage if bags have previously been created.
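
A minimal sketch of what such a helper might do with the bagit-python API (this is not GVSU’s regen_BagIt_manifest.py, and the bag path is a placeholder):

# Reopen an existing bag and regenerate its manifests after files have changed.
import bagit

bag = bagit.Bag("/data/scholarworks/A/example-bag")  # open an existing bag
bag.save(manifests=True)   # recalculate checksums and rewrite the manifests
print("bag is valid:", bag.is_valid())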

Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.

GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.


Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.

Kyle Felker was born feral in the swamps of Louisiana, and a career in technology librarianship didn’t do much to domesticate him.  He currently works as an Application Developer at GVSU Libraries.