An Exploration of BitCurator NLP: Incorporating New Tools for Born-Digital Collections

by Morgan Goodman

Natural Language Processing (NLP) has been a buzz-worthy topic for professionals working with born-digital material over the last few years. The BitCurator Project recently released a new set of natural language processing tools, and I had the opportunity to test out the topic modeler, Bitcurator-nlp-gentm, with a group of archivists in the Raleigh-Durham area. I was interested in exploring how NLP might help archivists perform their everyday duties more effectively and efficiently. While my goal was to explore possible applications of topic modeling in archival appraisal specifically, the discussions surrounding other possible uses were enlightening. The resulting research informed my 2019 Master’s paper for the University of North Carolina at Chapel Hill.

Topic modeling extracts text from files and organizes the tokenized words into topics. Imagine a set of words such as: mask, october, horror, michael, myers. Based on this grouping of words you might be able to determine that somewhere across the corpus there is a file about one of the Halloween franchise horror films. When I met with the archivists, I had them run the program with disk images from their own collections, and we discussed the visualization output and whether or not they were able to easily analyze and determine the nature of the topics presented.

BitCurator utilizes open source tools in their applications and chose the pyLDAvis visualization for the final output of their topic modeling tool (more information about the algorithm and how it works can be found in Sievert and Shirley’s paper, and you can also explore the output through this Jupyter notebook). The left-side view of the visualization has topic circles displayed in relative sizes and plotted on a two-dimensional plane. Each topic is labeled with a number in decreasing order of prevalence (circle #1 is the main topic in the overall corpus, and is also the largest circle). The distance between topics reflects how closely they are related, i.e., topics that are less related are plotted further away from each other. The right-side view contains a list of 30 words with a blue bar indicating each term’s frequency across the corpus. Clicking on a topic circle will alter the view of the terms list by adding a red bar for each term, showing the term frequency in that particular topic in relation to the overall corpus.


The user can then manipulate a metric slider, which is meant to help decipher what the topic is about. Essentially, when the slider is all the way to the right at “1”, the most prevalent terms for the entire corpus are listed. When a topic is selected and the slider is at 1, it shows all the prevalent terms for the corpus in relation to that particular topic (in our Halloween example, you might see more general words like movie, plot, and character). Alternatively, the closer to “0” the slider moves, the fewer corpus-wide terms appear and the more topic-specific terms are displayed (e.g., knife, haddonfield, strode).
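
The slider corresponds to the relevance weight (lambda) defined in Sievert and Shirley’s paper, where the relevance of a term w to a topic t is lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The following is a minimal Python sketch of that ranking, not pyLDAvis’s own code; the probabilities are invented numbers for the Halloween example, and the function names are hypothetical.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic, per Sievert and Shirley's definition."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Return the topic's terms ordered from most to least relevant."""
    scores = {
        term: relevance(p, corpus_probs[term], lam)
        for term, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up probabilities: p(term | topic) and p(term) across the corpus.
topic_probs = {"movie": 0.20, "plot": 0.15, "knife": 0.10, "haddonfield": 0.05}
corpus_probs = {"movie": 0.10, "plot": 0.08, "knife": 0.01, "haddonfield": 0.002}

general_first = rank_terms(topic_probs, corpus_probs, lam=1.0)
specific_first = rank_terms(topic_probs, corpus_probs, lam=0.0)
```

With these numbers, a slider value of 1 puts the general words (movie, plot) at the top of the list, while a value of 0 promotes the words that are rare elsewhere in the corpus (haddonfield, knife).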

While the NLP does the hard work to scan and extract text from the files, some analysis is still required by the user. The tool’s output offers archivists a bird’s eye view of the collection, which can be helpful when little to nothing is known about its contents. However, many of the archivists I spoke to felt this tool is most effective when you already know a bit about the collection you are looking at. In that sense, it may be beneficial to allow researchers to use topic modeling in the reading room to explore a large collection. Researchers and others with subject matter expertise may get the most benefit from this tool – do you have to know about the Halloween movie franchise to know that Michael Myers is a fictional horror film character? Probably. Now imagine more complex topics that the archivists may not have working knowledge of. The archivist can point the researcher to the right collection and let them do the analysis. This tool may also help for description or possibly identifying duplication across a collection (which seems to be a common problem for people working with born-digital collections).

The next step in getting NLP tools like this off the ground is to implement training. The information retrieval and ranking methods that create the output may not be widely understood. To unlock the value within an NLP tool, users must know how it works, how to run it, and how to perform meaningful analysis. Training archivists in the reading room to assist researchers would be an excellent way to get tools like this out of the think tank and into the real world.


Morgan Goodman is a 2019 graduate of the University of North Carolina at Chapel Hill and currently resides in Denver, Colorado. She holds an MS in Information Science with a specialization in Archives and Records Management.

 

 

 


Using R to Migrate Box and Folder Lists into EAD

by Andy Meyer

Introduction

This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hope that another institution might find this approach helpful and that it could be generalized to other issues facing archives.

I decided to use R because it is a free and open source programming language that I had some prior experience using. R has a large and active user community as well as a large number of relevant packages that extend the basic functions of R, including libraries that can deal with Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.

This script was sparked when I inherited hundreds of finding aids with minimal collection-level information and very long and detailed box and folder lists. These were all Microsoft Word documents with the box and folder list formatted as a table within the Word document. We recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate this work. The code is generally organized into three sections: data export, transforming and cleaning the data, and finally, creating an EAD file to load into ArchivesSpace.

Data Export

After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.

This process was surprisingly straightforward; I created a variable with a path to a Word Document and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of that table into a data.frame in R. Sometimes our finding aids were inconsistent so I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step of the process is four columns that contain folder title, date, box number, and folder number.

This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.

Data Transformation and Cleaning

Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:

  1. Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
  2. Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”, “N.D.”, “no date”, “N/A”, “NDG”, “Various”, “N. D.”, “” (blank), “??”, “n. d.”, “n. d. ” (with a trailing space), “No date”, “-”, “N.A.”, “ND”, “NO DATE”, and “Unknown.” For all of these, I updated the date field to “Undated” as a way to standardize this field.
  3. Spelling out abbreviations. Occasionally, I would use regular expressions to spell out words in the title field. This could be standard terms like “Corresp” to “Correspondence” or local terms like “NPU” to “North Park University.”

R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.
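
The cleanup steps listed above can be sketched in a few lines. This is a Python analogue for illustration only, not the author’s R code; the function names, the year-matching pattern, and the sample values are assumptions.

```python
import re

# Lower-cased variants of the "no date" terms listed in the post.
NO_DATE_VALUES = {
    "n.d.", "no date", "n/a", "ndg", "various", "n. d.", "", "??",
    "-", "n.a.", "nd", "unknown",
}

# Abbreviation patterns mapped to their spelled-out forms.
ABBREVIATIONS = {
    r"\bCorresp\b\.?": "Correspondence",
    r"\bNPU\b": "North Park University",
}


def standardize_date(value):
    """Return 'Undated' for any known no-date variant, else the trimmed value."""
    trimmed = value.strip()
    return "Undated" if trimmed.lower() in NO_DATE_VALUES else trimmed


def expand_abbreviations(title):
    """Spell out known abbreviations in a folder title."""
    for pattern, full in ABBREVIATIONS.items():
        title = re.sub(pattern, full, title)
    return title


def collection_date_range(dates):
    """Guess (earliest, latest) year from folder dates; None if no years found."""
    years = [
        int(year)
        for date in dates
        for year in re.findall(r"\b(1[6-9]\d{2}|20\d{2})\b", date)
    ]
    return (min(years), max(years)) if years else None
```

As in the post, the date-range guess only works when the folder dates contain recognizable years; anything else would still need to be defined by hand.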

Create EAD to Load into ArchivesSpace

Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.

Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD structure in R using the “newXMLNode” function from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes, which I applied to the data exported and transformed in earlier steps. Lastly, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
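
To make the shape of such a file concrete, here is a rough Python sketch using the standard library’s xml.etree.ElementTree rather than R’s XML package. It is not the author’s script. The element names are standard EAD 2002 tags, but the skeleton is an assumption for illustration and has not been validated against the EAD schema or the ArchivesSpace importer, which may require more (for example, a header).

```python
import xml.etree.ElementTree as ET


def build_ead(collection_title, collection_id, date_range, rows):
    """Build a bare-bones EAD tree with one file-level component per row."""
    ead = ET.Element("ead")
    archdesc = ET.SubElement(ead, "archdesc", level="collection")

    # Collection-level identification.
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = collection_title
    ET.SubElement(did, "unitid").text = collection_id
    ET.SubElement(did, "unitdate").text = date_range

    # One component per folder in the box and folder list.
    dsc = ET.SubElement(archdesc, "dsc")
    for row in rows:
        component = ET.SubElement(dsc, "c01", level="file")
        c_did = ET.SubElement(component, "did")
        ET.SubElement(c_did, "unittitle").text = row["title"]
        ET.SubElement(c_did, "unitdate").text = row["date"]
        ET.SubElement(c_did, "container", type="box").text = row["box"]
        ET.SubElement(c_did, "container", type="folder").text = row["folder"]
    return ead


# Hypothetical sample data in the four-column shape described earlier.
sample_rows = [
    {"title": "Correspondence", "date": "1950-1955", "box": "1", "folder": "1"},
    {"title": "Photographs", "date": "Undated", "box": "1", "folder": "2"},
]
sample_ead = build_ead("Sample Papers", "SC-001", "1950-1955", sample_rows)
sample_xml = ET.tostring(sample_ead, encoding="unicode")
```

Writing sample_xml to a file would give something to test against an importer, with collection-level description still to be added afterwards as the post describes.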

Conclusion

Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.


Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.

More skills, less pain with Library Carpentry

By Jeffrey C. Oliver, Ph.D

This is the second post in the bloggERS Making Tech Skills a Strategic Priority series.

Remember that scene in The Matrix where Neo wakes and says “I know kung fu”? Library Carpentry is like that. Almost. Do you need to search lots of files for pieces of text and tire of using Ctrl-F? In the UNIX shell lesson you’ll learn to automate tasks and rapidly extract data from files. Are you managing datasets with not-quite-standardized data fields and formats? In the OpenRefine lesson you’ll easily wrangle data into standard formats for easier processing and de-duplication. There are also Library Carpentry lessons for Python (a popular scripting programming language), Git (a powerful version control system), SQL (a commonly used relational database interface), and many more.

But let me back up a bit.

Library Carpentry is part of the Carpentries, an organization designed to provide training to scientists, researchers, and information professionals on the computational skills necessary for work in this age of big data.

The goals of Library Carpentry align with this series’ initial call for contributions, providing resources for those in data- or information-related fields to work “more with a shovel than with a tweezers.” Library Carpentry workshops are primarily hands-on experiences with tools to make work more efficient and less prone to mistakes when performing repeated tasks.

One of the greatest parts about Library Carpentry workshops is that they begin at the beginning. That is, the first lesson is an Introduction to Data, which is a structured discussion and exercise session that breaks down jargon (“What is a version control system?”) and sets down some best practices (naming things is hard).

Not only are the lessons designed for those working in library and information professions, but they’re also designed by “in the trenches” folks who are dealing with these data and information challenges daily. As part of the Mozilla Global Sprint, Library Carpentry ran a two-day hackathon in May 2018 where lessons were developed, revised, remixed, and made pretty darn shiny by contributors at ten different sites. For some, the hackathon itself was an opportunity to learn how to use GitHub as a collaboration tool.

Furthermore, Library Carpentry workshops are led by librarians, like the most recent workshop at the University of Arizona, where lessons were taught by our Digital Scholarship Librarian, our Geospatial Specialist, our Liaison Librarian to Anthropology (among other domains), and our Research Data Management Specialist.

Now, a Library Carpentry workshop won’t make you an expert in Python or the UNIX command line in two days. Even Neo had to practice his kung fu a bit. But workshops are designed to be inclusive and accessible, myth-busting, and – I’ll say it – fun. Don’t take my word for it; here’s a sampling of comments from our most recent workshop:

  • Loved the hands-on practice on regular expressions
  • Really great lesson – I liked the challenging exercises, they were fun! It made SQL feel fun instead of scary
  • Feels very powerful to be able to navigate files this way, quickly & in bulk.

So regardless of how you work with data, Library Carpentry has something to offer. If you’d like to host a Library Carpentry workshop, you can use our request a workshop form. You can also connect to Library Carpentry through social media, the web, or good old fashioned e-mail. And since you’re probably working with data already, you have something to offer Library Carpentry. This whole endeavor runs on the multi-faceted contributions of the community, so join us, we have cookies. And APIs. And a web scraping lesson. The terrible puns are just a bonus.

The Archivist’s Guide to KryoFlux

by [Matthew] Farrell and Shira Peltzman

As cultural icons go, the floppy disk continues to persist in the contemporary information technology landscape. Though digital storage has moved beyond the 80 KB – 1.44 MB storage capacity of the floppy disk, its image is often shorthand for the concept of saving one’s work (to wit: Microsoft Word 2016 still uses an icon of a 3.5″ floppy disk to indicate save in its user interface). Likewise, floppy disks make up a sizable portion of many archival collections, in number of objects if not storage footprint. If a creator of personal papers or institutional records maintained their work in electronic form in the 1980s or 1990s, chances are high that these are stored on floppy disks. But the persistent image of the ubiquitous floppy disk conceals a long list of challenges that come into play as archivists attempt to capture their data.

For starters, we often grossly underestimate the extent to which the technology was in active development during its heyday. One would be forgiven the assumption that there existed only a small number of floppy disk formats: namely 5.25″ and 3.5″, plus their 8″ forebears. But within each of these sizes there existed myriad variations of density and encoding, all of which complicate the archivist’s task now that these disks have entered our stacks. This is to say nothing of the hardware: 8″ and 5.25″ drives and standard controller boards are no longer made, and the only 3.5″ drive currently manufactured is a USB-enabled device capable only of reading disks with the more recent encoding methods storing file systems compatible with the host computer. And, of course, none of the above accounts for media stability over time for obsolete carriers.

Enter KryoFlux, a floppy disk controller board first made available in 2009. KryoFlux is very powerful, allowing users of contemporary Windows, Mac, and Linux machines to interface with legacy floppy drives via a USB port. The KryoFlux does not attempt to mount a floppy disk’s file system to the host computer, granting two chief affordances: users can acquire data (a) independent of their host computer’s file system, and (b) without necessarily knowing the particulars of the disk in question. The latter is particularly useful when attempting to analyze less stable media.

Despite the powerful utility of KryoFlux, uptake among archives and digital preservation programs has been hampered by a lack of accessible documentation and training resources. The official documentation and user forums assume a level of technical knowledge largely absent from traditional archival training. Following several informal conversations at Stanford University’s Born-Digital Archives eXchange events in 2015 and 2016, as well as discussions at various events hosted by the BitCurator Consortium, we formed a working group that included archivists and archival studies students from Emory University, the University of California Los Angeles, Yale University, Duke University, and the University of Texas at Austin to create user-friendly documentation aimed specifically at archivists.

Development of The Archivist’s Guide to KryoFlux began in 2016, with a draft released on Google Docs in Spring 2017. The working group invited feedback over a 6-month comment period and were gratified to receive a wide range of comments and questions from the community. Informed by this incredible feedback, a revised version of the Guide is now hosted in GitHub and available for anyone to use, though the use cases described are generally those encountered by archivists working with born-digital collections in institutional and manuscript repositories.

The Guide is written in two parts. “Part One: Getting Started” provides practical guidance on how to set up and begin using the KryoFlux and aims to be as inclusive and user-friendly as possible. It includes instructions for running KryoFlux using both Mac and Windows operating systems. Instructions for running KryoFlux using Linux are also provided, allowing repositories that use BitCurator (an Ubuntu-based open-source suite of digital archives tools) to incorporate the KryoFlux into their workflows.

“Part Two: In Depth” examines KryoFlux features and floppy disk technology in more detail. This section introduces the variety of floppy disk encoding formats and provides guidance as to how KryoFlux users can identify them. Readers can also find information about working with 40-track floppy disks. Part Two covers KryoFlux-specific output too, including log files and KryoFlux stream files, and suggests ways in which archivists might make use of these files to support digital preservation best practices. Short case studies documenting the experiences of archivists at other institutions are also included here, providing real-life examples of KryoFlux in action.

As with any technology, the KryoFlux hardware and software will undergo updates and changes in the future which will, if we are not careful, have an effect on the currency of the Guide. In an attempt to address this possibility, the working group have chosen to host the guide as a public GitHub repository. This platform supports versioning and allows for easy collaboration between members of the working group. Perhaps most importantly, GitHub supports the integration of community-driven contributions, including revisions, corrections, and updates. We have established a process for soliciting and reviewing additional contributions and corrections (short answer: submit a pull request via GitHub!), and will annually review the membership of an ongoing working group responsible for monitoring this work to ensure that the Guide remains actively maintained for as long as humanly possible.


On this year’s World Digital Preservation Day, the Digital Preservation Coalition presented The Archivist’s Guide to KryoFlux with the 2018 Digital Preservation Award for Teaching and Communications. It was truly an honor to be recognized alongside the other very worthy finalists, and a cherry-on-top for what we hope will remain a valuable resource for years to come.


[Matthew] Farrell is the Digital Records Archivist in Duke University’s David M. Rubenstein Rare Book & Manuscript Library. Farrell holds an MLS from the University of North Carolina at Chapel Hill.


Shira Peltzman is the Digital Archivist for the UCLA Library where she leads a preservation program for Library Special Collections’ born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts, and was a member of the inaugural cohort of the National Digital Stewardship Residency in New York (NDSR-NY).

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual Meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity “Survey the collection” includes tasks like “Determine total extent of digital material” and “Determine estimated date range.”

The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the “Establish physical control over removable media” activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers. These tiers include: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.
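
A login step along these lines might look like the following sketch. It is not the script itself: the config section and key names (“ArchivesSpace”, “baseurl”) are hypothetical, and the ASnakeClient usage follows the pattern in the ArchivesSnake README as I understand it. ArchivesSnake is a third-party package, so it is imported only inside the function that needs it.

```python
import configparser
from getpass import getpass


def read_baseurl(config_path):
    """Read the ArchivesSpace address from a config file (hypothetical layout)."""
    config = configparser.ConfigParser()
    config.read(config_path)
    return config["ArchivesSpace"]["baseurl"]


def start_session(baseurl, username, password):
    """Open an authorized ArchivesSpace session with ArchivesSnake."""
    from asnake.client import ASnakeClient  # third-party: ArchivesSnake

    client = ASnakeClient(baseurl=baseurl, username=username, password=password)
    client.authorize()
    return client


def login_from_prompt(config_path):
    """Ask for credentials on the command line and return a client."""
    username = input("ArchivesSpace username: ")
    password = getpass("ArchivesSpace password: ")
    return start_session(read_baseurl(config_path), username, password)
```

Using getpass keeps the password from being echoed to the terminal, which seems a reasonable default for a script that colleagues run interactively.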

Step 2: Get refID and number to start appending to file

The script then starts a while loop, and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script then asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The script asks for this number because the filename for each clip is the original refID, followed by an underscore, followed by a number; supplying the starting number allows more clips to be made from the same original file when the script is run again later.
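
The naming convention can be illustrated with a short sketch. The function name and the sample refID are hypothetical, and the .mp4 extension is assumed from the access copies mentioned earlier.

```python
def clip_filenames(refid, start_number, count, extension=".mp4"):
    """Name clips as refID, underscore, then a running number."""
    return [
        f"{refid}_{number}{extension}"
        for number in range(start_number, start_number + count)
    ]


# Two clips already exist for this (made-up) refID, so numbering resumes at 3.
new_names = clip_filenames("a1b2c3", start_number=3, count=2)
```

Here new_names continues the sequence with a1b2c3_3.mp4 and a1b2c3_4.mp4, so clips made in an earlier session are not overwritten.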

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
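
The two ffmpeg tasks in this step could be sketched as follows. This is an illustration, not the script’s actual commands: it uses Python’s subprocess module where the post describes using os, and it assumes ffprobe (which ships with ffmpeg) for reading the duration. The helper names are hypothetical.

```python
import subprocess


def to_timestamp(hours, minutes, seconds):
    """Format a time as HH:MM:SS for ffmpeg's -ss and -t options."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def duration_command(source):
    """ffprobe command that prints only the duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        source,
    ]


def clip_command(source, start, length, destination):
    """ffmpeg command that copies a segment without re-encoding."""
    return ["ffmpeg", "-ss", start, "-i", source, "-t", length,
            "-c", "copy", destination]


def needs_hours_prompt(duration_seconds):
    """Only ask for hours when the original runs an hour or longer."""
    return duration_seconds >= 3600


def probe_duration(source):
    """Run ffprobe and return the duration as a float (requires ffmpeg)."""
    result = subprocess.run(duration_command(source), capture_output=True,
                            text=True, check=True)
    return float(result.stdout.strip())


def make_clip(source, start, length, destination):
    """Run ffmpeg to create the clip (requires ffmpeg)."""
    subprocess.run(clip_command(source, start, length, destination), check=True)
```

Because “-c copy” avoids re-encoding, cuts land on keyframes and may be off by a moment; re-encoding instead would trade speed for frame accuracy.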

Clip from Rockefeller Family at Pocantico – Part I, circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
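
A lookup of this kind might be sketched as below. It assumes a client object with a requests-style get method (as ArchivesSnake’s client provides) and the “ref_id[]” and “resolve[]” parameters of the find_by_id endpoint as I understand them from the ArchivesSpace API documentation. The function names and the repository ID argument are hypothetical, and the field names reflect my reading of the archival object JSON rather than the script itself.

```python
def find_archival_object(client, repo_id, refid):
    """Look up an archival object by refID; return its JSON, or None."""
    response = client.get(
        f"repositories/{repo_id}/find_by_id/archival_objects",
        params={"ref_id[]": refid, "resolve[]": "archival_objects"},
    )
    results = response.json().get("archival_objects", [])
    return results[0]["_resolved"] if results else None


def summarize(archival_object):
    """Pull the title, date expressions, and scope note text from the JSON."""
    dates = [d.get("expression", "") for d in archival_object.get("dates", [])]
    scope_notes = [
        subnote.get("content", "")
        for note in archival_object.get("notes", [])
        if note.get("type") == "scopecontent"
        for subnote in note.get("subnotes", [])
    ]
    return {
        "title": archival_object.get("title", ""),
        "date": "; ".join(dates),
        "scope_and_contents": " ".join(scope_notes),
    }
```

The collection title and identifier would need a further request for the parent resource, which is left out of this sketch.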

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
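
The formatting and writing step could be sketched with the standard csv module. The column names and helper names here are hypothetical, not taken from the script.

```python
import csv

# Hypothetical spreadsheet columns.
FIELDNAMES = ["filename", "title", "date", "duration", "scope_and_contents"]


def clean_note(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(text.split())


def format_duration(total_seconds):
    """Turn a number of seconds into HH:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def append_row(csv_path, row):
    """Append one clip's metadata to the spreadsheet."""
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.DictWriter(handle, fieldnames=FIELDNAMES).writerow(row)
```

Opening the file in append mode means each run adds to the existing spreadsheet rather than replacing it; a header row would need to be written once when the file is first created.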

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.
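
The control flow of the loop can be sketched as below. The function names are hypothetical, the prompt and clip-making steps are passed in as functions so the logic can be read on its own, and asking for a refID when none has been entered yet is my assumption rather than documented behavior.

```python
def next_action(answer):
    """Map the user's reply to the loop's next step."""
    answer = answer.strip().lower()
    if answer in ("y", "yes"):
        return "new_refid"
    if answer in ("n", "no"):
        return "same_refid"
    if answer in ("q", "quit"):
        return "quit"
    return "ask_again"


def run_loop(ask, make_clip):
    """Keep making clips until the user quits."""
    refid = None
    while True:
        action = next_action(ask("New refID? y/n/q: "))
        if action == "quit":
            break
        if action == "ask_again":
            continue
        # Ask for a refID on request, or if none has been entered yet.
        if action == "new_refid" or refid is None:
            refid = ask("refID: ")
        make_clip(refid)
```

In an interactive session, ask would simply be Python’s built-in input function.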

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro to both writing and running bash shell scripts, check out this tutorial by Tania Rascia. Bash is one of the most popular Unix shells, if not the most popular, and it is the one BitCurator uses by default.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.
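To make the idea of a batch operation a little more concrete, here is a small Python sketch that repeats one command over every matching file in a directory. The function names are made up for this example, and the command you pass in is whatever tool you would otherwise type by hand for each file.

```python
import subprocess
from pathlib import Path


def build_commands(directory, pattern, command_prefix):
    """Return one argument list per matching file, in sorted order."""
    return [
        [*command_prefix, str(path)]
        for path in sorted(Path(directory).glob(pattern))
    ]


def run_batch(directory, pattern, command_prefix):
    """Run the command once per matching file, stopping at the first failure."""
    for command in build_commands(directory, pattern, command_prefix):
        subprocess.run(command, check=True)
```

For instance, run_batch("images", "*.dd", ["md5sum"]) would run md5sum on each .dd file in a (hypothetical) images directory. Building the command lists separately from running them makes it easy to print and inspect them first.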

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script grouping multiple commands together and running that group over a set of items.

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes both bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.

Improving Workflows at UNC Libraries’ Wilson Special Collections Library

by Erica Titkemeyer and Jessica Venlet

This is the tenth post in the bloggERS Script It! Series.

At Wilson Special Collections Library, we are always trying to find ways to improve our digital preservation workflows. Improving our skills with the command line and using existing command line tools has played a key role in workflow improvements. So, we’ve picked a few favorite tools and tips to share.

FFmpeg

We use FFmpeg for a number of daily tasks, whether it’s generating derivatives for preservation files or analyzing a video or audio file we’ve received through a born-digital accession.

Clearing embedded metadata and uses for FFprobe:
As part of our audio digitization work, we embed metadata into all preservation WAV files. This metadata follows guidelines set out by the Federal Agencies Digital Guidelines Initiative (FADGI) and mostly relates the file back to the original item it was digitized from, including its unique identifier, title, and the curatorial unit it is held by. A few times now, we have found that this data is reflected inconsistently in the file, that the data itself is incorrect, or that it is insufficient.

WAV file metadata
Look at that terrible metadata!

When large-scale issues have come up, particularly with legacy files in our backlog, we’ve made use of FFmpeg’s ‘-map_metadata’ option to batch delete the embedded metadata. Below is a script used to batch create brand new files without metadata, with “_clean” appended to their original file name:

for i in *.wav; do ffmpeg -i "$i" -map_metadata -1 -c:a copy "${i%.wav}_clean.wav"; done

After successfully removing metadata from the files, we use the tool BWF MetaEdit to batch embed the correct metadata that we have prepared in a .csv file.

For born-digital work, we regularly use the tool/command ‘ffprobe’, a stream analyzer that is part of the FFmpeg build. It allows us to quickly see data about AV files (such as duration, file size, codecs, aspect ratio, etc.) that is helpful in identifying files and making general appraisal decisions. As we grow our capabilities in preserving born-digital AV, we also foresee the need to document this type of file data in our ingest documentation.
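If you would rather capture ffprobe's output in a script than read it in the terminal, something like the sketch below may be a starting point. It assumes ffprobe is installed and on your PATH; the function names and the choice of fields to keep are ours for illustration, not part of our documented workflow.

```python
import json
import subprocess


def probe(path):
    """Run ffprobe on a file and return its JSON report as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", "-show_streams",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(report):
    """Pick out a few appraisal-relevant fields from an ffprobe report."""
    fmt = report.get("format", {})
    return {
        "duration": fmt.get("duration"),
        "size": fmt.get("size"),
        "codecs": [s.get("codec_name") for s in report.get("streams", [])],
    }
```

Calling summarize(probe("example.wav")) on a (hypothetical) file would return a small dictionary that could be written into ingest documentation.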

walk_to_dfxml.py

In our born-digital workflows we don’t disk image every digital storage device we receive by default. This workflow choice has benefits and disadvantages. One disadvantage is losing the ability to quickly document all timestamps associated with files. While our workflows were preserving last modified dates, other timestamps like access or creation dates were not as effectively captured. In search of a way to remedy this issue, I turned to Twitter for some advice on the capture and value of each timestamp. Several folks recommended generating DFXML which is usually used on disk images. Tim Walsh helpfully pointed to a python script walk_to_dfxml.py that can generate DFXML on directories instead of disk images. Workflow challenge solved!
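To get a feel for which timestamps can be read from a plain directory, here is a small standard-library sketch. It is not walk_to_dfxml.py and does not produce DFXML; the function names and dictionary keys are invented for this illustration.

```python
import os
from datetime import datetime, timezone


def _iso(timestamp):
    """Convert a POSIX timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def collect_timestamps(directory):
    """Walk a directory and record the timestamps os.stat reports per file."""
    rows = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            info = os.stat(path)
            rows.append({
                "path": path,
                "modified": _iso(info.st_mtime),
                "accessed": _iso(info.st_atime),
                # st_ctime is the metadata-change time on Unix
                # and the creation time on Windows
                "changed_or_created": _iso(info.st_ctime),
            })
    return rows
```

Note that what counts as a "creation" date depends on the operating system and file system, which is part of why a structured format like DFXML is useful for documenting these values.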

DFXML output example

Brunnhilde

Brunnhilde is another tool that was particularly helpful in consolidating tasks and tools. By kicking off Brunnhilde in the command line, we are able to: check for viruses, create checksums, identify file formats, identify duplicates, create a manifest, and run a PII scan. Additionally, this tool gives us a report that is useful for digital archives specialists, but also holds potential as an appraisal tool for consultations with curators. We’re still working out that aspect of the workflow, but when it comes to the technical steps, Brunnhilde and the associated command line tools it includes have really improved our processing work.

Learning as We Go

Like many archivists, we had limited experience with using the command line before graduate school. In the course of our careers, we’ve had to learn a lot on the fly because so many great command line tools are essential for working with digital archives.

One thing that can be tricky when you are new is moving the cursor around the terminal easily. It seems like it should be a no brainer, but it’s really not so obvious.

  • For Macs, see the excellent Script Ahoy resource:
  • For PC, see this resource for a variety of shortcuts including moving. In general:
    • Home key moves to beginning. End key moves to the end.
    • Ctrl + left or right arrow moves the cursor around in chunks

Another helpful set of commands are remove (rm) and move (mv). We use these when dealing with extraneous files created through quality control applications in our AV workflow that we’d like to delete quickly, or when we need to separate derivatives (such as mp3s) from a large batch of preservation files (wav).

    • Important note about rm: it’s always smart to first use ‘echo’ to see what files you would be removing with your command (ex: echo rm *.lvl would list all the .lvl files that would be removed by your command).
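The same "look before you delete" habit carries over if you script this kind of cleanup. Below is a minimal Python sketch of a dry run; the function name and its dry_run parameter are hypothetical, chosen for this example.

```python
from pathlib import Path


def remove_matching(directory, pattern, dry_run=True):
    """List files matching pattern; delete them only when dry_run is False."""
    matches = sorted(Path(directory).glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling remove_matching("qc_output", "*.lvl") on a (hypothetical) qc_output directory only returns the list of files that would be removed. Passing dry_run=False actually deletes them.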

If you are just starting out, you may consider exploring online tutorials or guides like:


Erica Titkemeyer is the Project Director and Audiovisual Conservator for the Southern Folklife Collection at the UNC Wilson Special Collections Library, coordinating in-house digitization and outsourcing of audiovisual materials for preservation. Erica also participates in the improvement of online access and digital preservation for digitized materials.

Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the acquisition and management of born-digital university records. She earned a Master of Science in Information degree from the University of Michigan.

Of Python and Pandas: Using Programming to Improve Discovery and Access

by Kate Topham

This is the ninth post in the bloggERS Script It! Series.

Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?

This required some programming chops–so the second half of this post is targeted towards those who have a working knowledge of python, as well as the pandas library.

If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.

Cleaning your data

A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled “African-American History” and “History, African American,” respectively. When asked to compare these two strings, python will consider them different. Therefore, when you ask python to count all the books with the subject “African American History,” it counts only those whose subjects match that string exactly.

Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.

Picture1

Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a…Picture1.png

it will display all the different values in that column in a box on the left.Picture1.png

If we choose one of these facets, we can view all the rows that have that value in the “a” column.

Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.

Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting the letters of each string alphabetically and matching strings whose sorted letters are identical. Since “History, African American” and “African American History” both evaluate to “aaaaccefhiiimnnorrrsty,” OpenRefine will group them together and allow you to change all of the matching fields to either one (or something totally different!) according to your preference.
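For the Python-curious, the matching idea described above can be sketched in a few lines. This follows the description in this post (lowercase, letters only, sorted alphabetically); OpenRefine's own keying methods may normalize strings differently, and the function names here are made up.

```python
from collections import defaultdict


def sorted_letters_key(value):
    """Lowercase the string, keep only letters, and sort them alphabetically."""
    return "".join(sorted(ch for ch in value.lower() if ch.isalpha()))


def cluster(values):
    """Group values whose keys match; return only groups with two or more members."""
    groups = defaultdict(list)
    for value in values:
        groups[sorted_letters_key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

With this key, “Dogs” and “Dogs.” land in the same group because the trailing period is dropped before sorting.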

This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.

Analyzing your data

In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.

First, I pivoted the table from wide form to long so that we could count separate pairs of subject tags. The pandas ‘melt’ method set the index by bibliographic id and subject so that books with multiple subjects would be counted in each of their categories.
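Since the original screenshot of this step is not reproduced here, the following is a rough sketch of what a melt can look like. The column names (bib_id, subject_1, subject_2) and the data are hypothetical, and the exact arguments used in the project may have differed.

```python
import pandas as pd

# Hypothetical wide table: one row per book, one column per subject tag
wide = pd.DataFrame({
    "bib_id": [101, 102],
    "subject_1": ["Dogs", "Cats"],
    "subject_2": ["Fiction", None],
})

# One row per (book, subject) pair; empty subject slots are dropped
long = (
    wide.melt(id_vars="bib_id", value_name="subject")
        .dropna(subset=["subject"])
        .drop(columns="variable")
        .sort_values(["bib_id", "subject"])
        .reset_index(drop=True)
)
```

Book 101 now appears twice, once under “Dogs” and once under “Fiction,” so it will be counted in both subjects.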

Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.
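A merge of this kind might look like the sketch below. The bib_data name comes from the project, but its columns here (bib_id, year), the sample values, and the derived decade column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the melted subject table and the MARC-derived dates
subjects = pd.DataFrame({
    "bib_id": [101, 101, 102],
    "subject": ["Dogs", "Fiction", "Cats"],
})
bib_data = pd.DataFrame({
    "bib_id": [101, 102],
    "year": [1868, 1994],
})

# A left merge keeps every (book, subject) row and attaches that book's year
merged = subjects.merge(bib_data, on="bib_id", how="left")

# Assumption: decades are derived from the year, e.g. 1868 becomes 1860
merged["decade"] = merged["year"] // 10 * 10
```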

I grouped the data by subject using .groupby(). The .agg() function specifies how we want to count within the subject groups.
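As a sketch of grouping and aggregating, assuming a merged table with hypothetical bib_id, subject, and decade columns (grouping by decade as well as subject is our guess at how the per-decade counts were produced):

```python
import pandas as pd

# Hypothetical merged table: one row per (book, subject) pair with its decade
merged = pd.DataFrame({
    "bib_id": [101, 102, 103, 104],
    "subject": ["Dogs", "Dogs", "Dogs", "Cats"],
    "decade": [1860, 1860, 1990, 1990],
})

# Count distinct books within each subject and decade
counts = (
    merged.groupby(["subject", "decade"])
          .agg(book_count=("bib_id", "nunique"))
          .reset_index()
)
```

Using "nunique" rather than "count" guards against counting the same book twice if a subject tag is repeated on one record.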

Because of the vast number of subjects, we chose to focus on the ten most numerous. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.

Once I had our top ten list, I could select the count by decade for each subject.
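The two steps above, finding the most numerous subjects and then selecting the count by decade for one of them, could be sketched like this. The data is invented, the decade column is an assumption, and TOP_N is set to 2 only to keep the example small.

```python
import pandas as pd

# Hypothetical cleaned subject data: one row per (book, subject) pair
subject_rows = pd.DataFrame({
    "bib_id": [1, 2, 3, 4, 5, 6],
    "650a": ["Dogs", "Dogs", "Dogs", "Cats", "Cats", "Birds"],
    "decade": [1860, 1860, 1990, 1990, 1990, 1900],
})

TOP_N = 2  # the project used ten

# Group by 650a, count by bib_id, and sort by that count
top_subjects = (
    subject_rows.groupby("650a")
                .agg(bib_id_count=("bib_id", "count"))
                .sort_values("bib_id_count", ascending=False)
                .head(TOP_N)
)

# Count by decade for one subject from the top list
dogs_by_decade = (
    subject_rows[subject_rows["650a"] == "Dogs"]
    .groupby("decade")["bib_id"]
    .count()
)
```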

Visualizing your data

In order to visualize this data, I used a python library called plotly. Plotly generates graphs from your data. It plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in a for loop to create a line on the graph for each subject.

Some of the interesting patterns we noticed were the spikes in African-American books soon after 1865, the end of the Civil War, and at the end of the 20th century, following the Civil Rights movement. Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.

Acknowledgments

I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.