This is the twelfth post in the bloggERS Script It! Series.
Context
Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.
All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.
The script asks for user input on the command line. This is how it works:
Step 1: Log into ArchivesSpace
First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.
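For readers who want to see what that looks like, here is a minimal sketch of the login step (the API URL is a placeholder; in the real script it comes from the config file):

from getpass import getpass
from asnake.client import ASnakeClient

# Prompt for credentials and open an ArchivesSpace API session with ArchivesSnake.
client = ASnakeClient(
    baseurl="http://localhost:8089",          # placeholder; read from the config file in practice
    username=input("ArchivesSpace username: "),
    password=getpass("ArchivesSpace password: "),
)
client.authorize()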
Step 2: Get refID and number to start appending to file
The script then starts a while loop and asks if the user would like to input a new refID. If the user types back "yes" or "y," the script asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The filename for each clip is the original refID, followed by an underscore, followed by a number; asking for a starting number allows more clips to be made from the same original file when the script is run again later.
Step 3: Get clip length and create clip
The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
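A minimal sketch of this step, using subprocess rather than the os module the original script uses; the file name, start time, and clip length are placeholders standing in for the user's answers:

import subprocess

source_file = "example_refid.mp4"   # access copy named with its ArchivesSpace refID

# Ask ffprobe for the duration of the original file, in seconds.
duration = float(subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", source_file],
    capture_output=True, text=True, check=True).stdout)
print(f"Original duration: {duration} seconds")

# Cut a clip without re-encoding; start time and length come from the prompts.
start, length = "00:05:30", "00:01:00"
subprocess.run(["ffmpeg", "-i", source_file, "-ss", start, "-t", length,
                "-c", "copy", "example_refid_1.mp4"], check=True)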
Clip from Rockefeller Family at Pocantico – Part I, circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.
Step 4: Get information about clip from ArchivesSpace
Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
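A hedged sketch of that lookup, reusing the client from the Step 1 sketch; the repository number and the simplified field handling are assumptions:

ref_id = "example_refid"   # entered by the user in Step 2

# Look up the archival object by refID and resolve the full record in one call.
results = client.get(
    "repositories/2/find_by_id/archival_objects",
    params={"ref_id[]": ref_id, "resolve[]": "archival_objects"},
).json()

item = results["archival_objects"][0]["_resolved"]
title = item.get("title", "")
dates = [d.get("expression", "") for d in item.get("dates", [])]
scope_notes = [n for n in item.get("notes", []) if n.get("type") == "scopecontent"]
# The collection title and identifier live on the resource record, reachable via item["resource"]["ref"].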
Step 5: Format data and write to csv
The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
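A minimal sketch of the write step; the column layout and example values are illustrative, not the script's exact schema:

import csv
from pathlib import Path

header = ["clip_filename", "item_title", "date", "identifier",
          "scope_and_contents", "collection_title", "clip_duration"]
row = ["example_refid_1.mp4", "Example film title", "circa 1920", "FA000",
       "Scope note text\nwith a line break".replace("\n", " "),   # strip line breaks before writing
       "Example collection", "0:01:00"]

# Append to the spreadsheet, writing the header only if the file is new.
csv_path = Path("clip_library.csv")
write_header = not csv_path.exists()
with csv_path.open("a", newline="") as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(header)
    writer.writerow(row)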
Step 6: Decide how to continue
The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.
Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.
As a follow-up to our popular Script It! Series — which attempted to break down barriers and demystify scripting with walkthroughs of simple scripts — we’re interested in learning more about how archival institutions (as such) encourage their archivists to develop and promote their technical literacy more generally. As Trevor Owens notes in his forthcoming book, The Theory and Craft of Digital Preservation, “the scale and inherent structures of digital information suggest working more with a shovel than with a tweezers.” Encouraging archivists to develop and promote their technical literacy is one such way to use a metaphorical shovel!
Maybe you work for an institution that explicitly encourages its employees to learn new technical skills. Maybe your team or institution has made technical literacy a strategic priority. Maybe you’ve formed a collaborative study group with your peers to learn a programming language. Whatever the case, we want to hear about it!
Writing for bloggERS! “Making Tech Skills a Strategic Priority” Series
We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
Written content should be roughly 600-800 words in length
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
Posts for this series will start in late November or December, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!
This is the eleventh post in the bloggERS Script It! Series.
One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.
What is a script? I’m going to shamelessly pull from my book for a brief overview:
A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.
In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.
For an intro to both writing and running bash shell scripts (bash being one of the most popular Unix shells, if not the most popular, and the one BitCurator defaults to), check out this tutorial by Tania Rascia.
There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.
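To make the idea of a batch operation concrete, here is a minimal Python sketch (the directory name is a placeholder): it walks a folder and prints an MD5 checksum for every file it finds, the kind of repetitive job that scripting handles well.

import hashlib
from pathlib import Path

# Walk a directory and print an MD5 checksum for every file found.
for path in sorted(Path("transfer_directory").rglob("*")):   # placeholder directory name
    if path.is_file():
        checksum = hashlib.md5(path.read_bytes()).hexdigest()
        print(f"{checksum}  {path}")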
The BitCurator Scripts Library
The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.
If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor, and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script grouping multiple commands together and running that group over a set of items.
For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes both bash shell scripts along with Python and Perl scripts.
Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.
Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.
Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.
This is the tenth post in the bloggERS Script It! Series.
At Wilson Special Collections Library, we are always trying to find ways to improve our digital preservation workflows. Improving our skills with the command line and using existing command line tools has played a key role in workflow improvements. So, we’ve picked a few favorite tools and tips to share.
FFmpeg
We use FFmpeg for a number of daily tasks, whether it’s generating derivatives for preservation files or analyzing a video or audio file we’ve received through a born-digital accession.
Clearing embedded metadata and uses for FFprobe:
As part of our audio digitization work, we embed metadata into all preservation WAV files. This metadata follows guidelines set out by the Federal Agencies Digital Guidelines Initiative (FADGI) and mostly relates the file back to the original item it was digitized from, including its unique identifier, title, and the curatorial unit that holds it. A few times now, we have found inconsistencies in how this data is reflected in the file, discovered that the data itself is incorrect, or realized that it is insufficient.
Look at that terrible metadata!
When large-scale issues have come up, particularly with legacy files in our backlog, we’ve made use of FFmpeg’s ‘-map_metadata’ command to batch delete the embedded metadata. Below is a script used to batch create brand new files without metadata, with “_clean” attached to their original file name:
for i in *.wav; do ffmpeg -i "$i" -map_metadata -1 -c:a copy "${i%.wav}_clean.wav"; done
After successfully removing metadata from the files, we use the tool BWF MetaEdit to batch embed the correct metadata that we have prepared in a .csv file.
For born-digital work, we regularly use the tool/command ‘ffprobe’, a stream analyzer that is part of the FFmpeg build. It allows us to quickly see data about AV files (such as duration, file size, codecs, aspect ratio, etc.) that are helpful in identifying files and making general appraisal decisions. As we grow our capabilities in preserving born digital AV, we also foresee the need to document this type of file data in our ingest documentation.
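As a rough sketch of how that data can be captured for documentation, the JSON that ffprobe emits is easy to parse from Python (the file name is a placeholder):

import json
import subprocess

# Run ffprobe and parse its JSON report on the container and streams.
probe = json.loads(subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "example.mov"],   # placeholder file name
    capture_output=True, text=True, check=True).stdout)

print(probe["format"].get("duration"), probe["format"].get("size"))
for stream in probe["streams"]:
    print(stream.get("codec_type"), stream.get("codec_name"))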
walk_to_dfxml.py
In our born-digital workflows we don’t disk image every digital storage device we receive by default. This workflow choice has benefits and disadvantages. One disadvantage is losing the ability to quickly document all timestamps associated with files. While our workflows were preserving last modified dates, other timestamps like access or creation dates were not as effectively captured. In search of a way to remedy this issue, I turned to Twitter for some advice on the capture and value of each timestamp. Several folks recommended generating DFXML which is usually used on disk images. Tim Walsh helpfully pointed to a python script walk_to_dfxml.py that can generate DFXML on directories instead of disk images. Workflow challenge solved!
DFXML output example
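walk_to_dfxml.py does the real work here, but the underlying idea can be illustrated in a few lines of Python (the directory name is a placeholder, and this simple listing is not a substitute for proper DFXML):

import datetime
import os

def iso(timestamp):
    return datetime.datetime.fromtimestamp(timestamp).isoformat()

# Record modified, access, and change/creation times for every file under a directory.
for root, dirs, files in os.walk("accession_directory"):
    for name in files:
        st = os.stat(os.path.join(root, name))
        print(name,
              iso(st.st_mtime),   # last modified
              iso(st.st_atime),   # last accessed
              iso(st.st_ctime))   # metadata change (creation time on Windows)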
Brunnhilde
Brunnhilde is another tool that was particularly helpful in consolidating tasks and tools. By kicking off Brunnhilde in the command line, we are able to: check for viruses, create checksums, identify file formats, identify duplicates, create a manifest, and run a PII scan. Additionally, this tool gives us a report that is useful for digital archives specialists, but also holds potential as an appraisal tool for consultations with curators. We're still working out that aspect of the workflow, but when it comes to the technical steps, Brunnhilde and the associated command line tools it includes have really improved our processing work.
Learning as We Go
Like many archivists, we had limited experience with using the command line before graduate school. In the course of our careers, we’ve had to learn a lot on the fly because so many great command line tools are essential for working with digital archives.
One thing that can be tricky when you are new is moving the cursor around the terminal easily. It seems like it should be a no brainer, but it’s really not so obvious.
For PC, see this resource for a variety of shortcuts, including ones for moving the cursor. In general:
Home key moves to beginning. End key moves to the end.
Ctrl + left or right arrow moves the cursor around in chunks.
Another helpful set of commands are remove (rm) and move (mv). We use these when dealing with extraneous files created through quality control applications in our AV workflow that we’d like to delete quickly, or when we need to separate derivatives (such as mp3s) from a large batch of preservation files (wav).
Important note about rm: it’s always smart to first use ‘echo’ to see what files you would be removing with your command (ex: echo rm *.lvl would list all the .lvl files that would be removed by your command).
If you are just starting out, you may consider exploring online tutorials or guides to the command line.
Erica Titkemeyer is the Project Director and Audiovisual Conservator for the Southern Folklife Collection at the UNC Wilson Special Collections Library, coordinating in-house digitization and outsourcing of audiovisual materials for preservation. Erica also participates in the improvement of online access and digital preservation for digitized materials.
Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the acquisition and management of born-digital university records. She earned a Master of Science in Information degree from the University of Michigan.
This is the ninth post in the bloggERS Script It! Series.
Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?
This required some programming chops, so the second half of this post is targeted towards those who have a working knowledge of Python, as well as the pandas library.
If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.
Cleaning your data
A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled "African-American History" and "History, African American," respectively. When asked to compare these two strings, Python will consider them different. Therefore, when you ask Python to count all the books with the subject "African American History," it counts only those whose subjects match that string exactly.
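A quick example of the problem:

# Variant forms of the same heading are not equal, so they are counted separately.
print("African-American History" == "History, African American")        # False
subjects = ["African-American History", "History, African American"]
print(subjects.count("African American History"))                        # 0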
Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.
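A rough sketch of how such a table can be built with pymarc; the file name and the use of field 001 as the bibliographic id are assumptions:

from pymarc import MARCReader

rows = []
with open("rare_books.mrc", "rb") as fh:
    for record in MARCReader(fh):
        id_fields = record.get_fields("001")
        bib_id = id_fields[0].value() if id_fields else ""
        for field in record.get_fields("650"):            # one row per subject heading
            rows.append({"bib_id": bib_id,
                         "a": " ".join(field.get_subfields("a"))})
print(rows[:5])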
Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a…
it will display all the different values in that column in a box on the left.
If we choose one of these facets, we can view all the rows that have that value in the “a” column.
Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.
Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting each string alphabetically and matching strings that sort identically. Since "History, African American" and "African American History" both evaluate to "aaaaccefhiiimnnorrrsty," OpenRefine will group them together and allow you to change all of the matching fields to either (or something totally different!) according to your preference.
This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.
Analyzing your data
In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.
First, I pivoted the table from wide form to long so that each subject tag could be counted separately. The pandas 'melt' method keeps the bibliographic id with each subject, so that books with multiple subjects are counted in every applicable category.
Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.
I grouped the data by subject using .groupby(). The .agg() function specifies how we want to count within the subject groups.
Because of the vast number of subjects, we chose to focus on the ten most numerous. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.
Once I had our top ten list, I could select the count by decade for each subject.
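A condensed sketch of those steps, with stand-in data and column names assumed for illustration:

import pandas as pd

# Stand-in tables: subject columns from the cleaned OpenRefine export, plus the dates from bib_data.
subject_data = pd.DataFrame({
    "bib_id": [1, 2, 3],
    "a1": ["Dogs", "African American History", "Dogs"],
    "a2": [None, "Dogs", None],
})
bib_data = pd.DataFrame({"bib_id": [1, 2, 3], "date": [1867, 1892, 1901]})

# Wide to long: one row per (book, subject) pair, so multi-subject books count in each category.
melted = subject_data.melt(id_vars="bib_id", value_name="subject").dropna(subset=["subject"])

# Merge the dates back on and derive a decade for each row.
merged = melted.merge(bib_data, on="bib_id")
merged["decade"] = (merged["date"] // 10) * 10

# Count distinct books per subject, keep the ten largest subjects, then count by decade.
top_subjects = (merged.groupby("subject")
                      .agg(count=("bib_id", "nunique"))
                      .sort_values("count", ascending=False)
                      .head(10)
                      .index)
by_decade = (merged[merged["subject"].isin(top_subjects)]
                   .groupby(["subject", "decade"])
                   .agg(count=("bib_id", "nunique"))
                   .reset_index())
print(by_decade)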
Visualizing your data
In order to visualize this data, I used a Python library called plotly. Plotly generates graphs from your data and plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in a for loop to create a line on the graph for each subject.
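A minimal sketch of that loop, with made-up counts standing in for the real per-decade data:

import pandas as pd
import plotly.graph_objects as go

# Illustrative counts per subject and decade.
by_decade = pd.DataFrame({
    "subject": ["Dogs", "Dogs", "African American History", "African American History"],
    "decade":  [1860, 1870, 1860, 1870],
    "count":   [2, 5, 4, 9],
})

fig = go.Figure()
for subject, subset in by_decade.groupby("subject"):   # one line per subject
    fig.add_trace(go.Scatter(x=subset["decade"], y=subset["count"],
                             mode="lines+markers", name=subject))
fig.update_layout(xaxis_title="Decade", yaxis_title="Number of books")
fig.show()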
Among the interesting patterns we noticed were a spike in African-American books soon after 1865, the end of the Civil War, and another at the end of the 20th century, following the Civil Rights movement. Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.
Acknowledgments
I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.
This is the eighth post in the bloggERS Script It! Series.
In the Photo Archives, we receive a lot of digital photographs directly from the creators. Most of these photographers are using either PhotoShelter or Adobe Photoshop to embed their descriptive and technical metadata in an IPTC format. Since we often receive hundreds to thousands of digital files, we do not have the time to manually pull the information from each individual file. We needed to develop a system that would allow us to extract the metadata and transform it into a MODS file.
The first step is to extract the metadata. We do this using a couple of free JavaScript plugins for Adobe Bridge, a tool we already use as part of our day to day operations. There are two tools we use in conjunction with each other. The first is the VRA Bridge Metadata Palette. This allows us to view the embedded metadata within the Bridge interface and to quickly verify whether or not a photographer has included metadata and, if they have, what types of metadata they are using. Please note in the attached screenshot below that we have the option to view both VRA and IPTC metadata.
Metadata Palette Interface
The second tool that we use is the VRA Bridge Export-Import Tool. This tool is critical for our workflow because it allows us to pull the data from the files into an Excel spreadsheet. To use this tool, we go to the export-import interface, which allows us to select specific images or entire folders and subfolders. We typically extract an entire collection (depending on the size of the collection) or folder at one time. The export-import tool lets us use either a default or a custom set of field names, so the IPTC fields can be matched directly to our MODS spreadsheet template headings. This saves us the time of having to rename the fields in the Excel spreadsheet and also prevents empty fields from being imported into the spreadsheet.
Customizing Field Names
The tool also preserves the original dates and removes line breaks that can be problematic in Microsoft Excel or Google Sheets (our preferred interface).
Export-Import Interface
Once the metadata has been exported into a spreadsheet, it still needs to be cleaned up to meet our digitization standards and then transformed into individual MODS files. This is where OpenRefine comes in.
OpenRefine Main Interface
OpenRefine (http://openrefine.org/) is a fantastic open source tool that allows users to manipulate large quantities of data. It can do a lot of things and we use it in a lot of different ways to clean up data and to validate our subject headings against vocabularies like the Library of Congress. However, for our purposes today, I’m only going to explain how to convert the spreadsheet to MODS. OpenRefine has an export tool called templating where, similar to the Adobe Bridge tool, you can customize how the spreadsheet rows are exported. There are four input areas in the templating export: prefix, row template, row separator, and suffix. In our case, we are going to tell the system how to transform the spreadsheet into a MODS record but you can transform it into whatever schema your institution uses.
OpenRefine Templating Interface
Since we are taking one spreadsheet and breaking each row into an individual MODS record, we have to tell the system how to do that. In prefix and suffix, we tell the system to create an overall MODS collection that will contain individual MODS records inside of it.
In the row template is the transformation for each individual MODS file. It tells the system to look for a spreadsheet heading, for example, title, and then put it into the corresponding MODS field. In the example below, the spreadsheet column header is highlighted in green and the corresponding MODS field is highlighted in yellow.
Example of Row Template Field
It is basically set up the same as your standard MODS document but instead of putting the individual content inside each element, you put the following bit of code: {{jsonize(cells["column_header"].value).replace('"', '').trim()}}. The row separator is not needed and is simply left blank.
Once you have set all of your fields and exported your file from OpenRefine, you will have a large text file that you will save as an XML file in your favorite XML editor. Now, you have a MODS collection file that contains lots of individual MODS files. You can see in the image below the MODS collection and then the start of an individual MODS record just below it.
Example of a MODS collection with individual MODS files
So now, we have to separate out this single file into individual MODS records. We do this using XML_Split, which is part of a larger set of XML manipulation tools, XML_Twig. XML_Split runs on ActivePerl. Once you have ActivePerl running on your machine, move your XML file to wherever your Perl software is located on your computer. At our institution, it is our policy for it to live in a folder on our C: drive called "devtools." Open the computer's command prompt and enter the following commands (note that you will have to adjust the commands to where it lives at your institution).
U:\>C:
C:\>cd devtools
C:\devtools>xml_split -c mods [file name].xml
The [file name] is whatever your XML file is called. -c mods tells the system where to split the file. So for an XML file named Example, it would look like the screenshot below.
XML_Split Command Prompt
What you end up with is a series of individual XML files all with the original file name plus a numerical extension. So in this case, Example-1, Example-2 etc. The next step for us is to batch rename them in anticipation of batch uploading our XML files and their corresponding images into our digital asset management system. For our batch uploading procedures, we need our XML files to be named with their accession number, a field that exists within the XML file itself. To go in and manually rename hundreds of files would again take too much time and resources, so we turned to Ruby and a renaming script developed by Rachel Donahue. We download Ruby and the script to the same devtools folder. Your Ruby rename.rb file should look like the screenshot below. We also created an XML folder within the devtools folder where we place our files that need to be renamed.
Rename.rb File
In our case, we want to rename each file with the accession number, so we are asking Ruby to look for the identifier field and replace the file name with whatever is in that field. You will need to make sure that you configure your Ruby files to search for whatever field you need to pull from. Once you have finished your configuration, open up your command prompt and enter the following.
U:\>C:
C:\>cd devtools
C:\devtools>cd ruby
C:\devtools\ruby>rename.rb
"Please enter a directory of XML Files…"
devtools/xml
You can see in the screenshot below that all of our files have been renamed.
Command Prompt Showing Rename
And now our files are ready for batch upload. While this method was developed to deal with large acquisitions of born-digital material, we also use Google Sheets with this OpenRefine, XML_Split, and Ruby procedure anytime we are working on large digitization projects, and it can be applied anytime you are manipulating large quantities of data.
Kelli Bogan is the Photo Archivist at the National Baseball Hall of Fame and Museum. In this position, she oversees all aspects of a photograph’s lifecycle, from acquisition to dissemination. She works closely with the Digital Strategy department to develop new workflows and strategies for ingesting both born-digital and digitized materials into the museum’s DAMS.