Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.

Step 2: Get refID and number to start appending to file

The script then starts a while loop, and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script then asks for the the ArchivesSpace refID, followed by the number to start appending to the end of each clip. This is because the filename for each clip is the original refID, followed by an underscore, followed by a number, and to allow for more clips to be made from the same original file when the script is run again later.

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.

Clip from Rockefeller Family at Pocantico – Part I , circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

Advertisements

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro on both writing and running bash shell scripts, one of if not the most popular Unix shell – and the one BitCurator defaults with check out this tutorial by Tania Rascia.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of shell script grouping multiple commands together and running that group over a set of items.

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes both bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.

Of Python and Pandas: Using Programming to Improve Discovery and Access

by Kate Topham

This is the ninth post in the bloggERS Script It! Series.

Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?

This required some programming chops–so the second half of this post is targeted towards those who have a working knowledge of python, as well as the pandas library.

If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.

Cleaning your data

A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled “African-American History” and “History, African American,” respectively. When asked to compare these two strings, python will consider them different. Therefore, when you ask python to count all the books with the subject “African American History,” it counts only those whose subjects match that string exactly.

Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.

Picture1

Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a…Picture1.png

it will display all the different values in that column in a box on the left.Picture1.png

If we choose one of these facets, we can view all the rows that have that value in the “a” column.Picture1.png

Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.

Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting each string alphabetically and match them that way. Since “History, African American” and “African American History” both evaluate to “aaaaccefhiiimnnorrrsty,” OpenRefine will group them together and allow you to change all of the matching fields to either (or something totally different!) according to your preference.Picture1.png

This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.

Analyzing your data

In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.

First, I pivoted the table from short form to long so that we could count  separate pairs of subject tags. The pandas ‘melt’ method set the index by bibliographic id and subject so that books with multiple subjects would be counted in both categories.Picture1.png

Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.Picture1.png

I grouped the data by subject using .group(). The .agg() function communicates how we want to count within the subject groups.Picture1.png

Because of the vast number of subjects we chose to focus on the ten most numerous subjects. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.Picture1.png

Once I had our top ten list, I could select the count by decade for each subject.Picture1.png

Visualizing your data

In order to visualize this data, I used a python library called plotly. Plotly generates graphs from your data. It plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in for loop create a line on the graph for each subject.  Picture1.pngPicture1.png

Some of the interesting patterns we noticed was the spike in African-American books soon after 1865, the end of the Civil War; and at the end of the 20th century, following the Civil Rights movement.Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.

Acknowledgments

I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.

GVSU Scripted IR Curation in the Cloud

by Matt Schultz and Kyle Felker

This is the seventh post in the bloggERS Script It! Series.


In 2016, bepress launched a new service to assist subscribers of Digital Commons with archiving the contents of their institutional repository. Using Amazon Web Services (AWS), this new service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket setup that is hosted by the institution. Subscribers thus control a real-time copy of all their institutional repository content outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket, all on their own schedules, and for whatever purposes they deem appropriate.

Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.

Leveraging the ease of launching Amazon’s EC2 server instances and use of their AWS command line interface (CLI), GVSU Libraries was able to readily configure an EC2 “curation server” that directly syncs copies of their Digital Commons data from S3 to the new server where the installed tools mentioned above do their work of building preservation packages and sending them back to S3 and to Glacier for nearline and long-term storage. The entire process now begins and ends in the Cloud, rather than involving the download of data to a local workstation for processing.

Creating a digital preservation copy of the GVSU’s ScholarWorks files involves four steps:

  1. Syncing data from S3 to EC2:  No processing can actually be done on files in-place on S3, so they must be copied to the EC2 “curation server”. As part of this process, we wanted to reorganize the files into logical directories (alphabetized), so that it would be easier to locate and process files, and to better organize the virus reports and the “bags” the process generates.
  2. Virus and format reports with Brunnhilde:  The synced files are then run through Brunnhilde’s suite of tools.  Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files.  The reports stay with the files as they move through the process.
  3. Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool.  This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
  4. Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage.  From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.

Once the process has been completed once, new files are incorporated and re-synced on a quarterly basis to the BagIt data directories and re-checked with Brunnhilde.  The BagIt metadata must then be updated and re-verified using BagIt, and the changes synced to the destination S3 bucket.

Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).

These scripts can be found at our github repo: https://github.com/gvsulib/curationscripts

Process_backup is the main script. It handles each of the four processing stages outlined above.  As it does so, it stores the output of those tasks in log files so they can be examined later.  In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.

After the first time the process is run, the metadata that BagIt generates has to be updated to reflect new data.  The version of BagIt we are using (Python) can’t do this from the command line, but it does have an API with a command that will update existing “bag” metadata. So, we created a small Python script to do this (regen_BagIt_manifest.py).  The shell script invokes this script at the third stage if bags have previously been created.

Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.

GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.


Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.

Kyle Felker was born feral in the swamps of Louisiana, and a career in technology librarianship didn’t do much to domesticate him.  He currently works as an Application Developer at GVSU Libraries.

Small-Scale Scripts for Large-Scale Analysis: Python at the Alexander Turnbull Library

by Flora Feltham

This is the third post in the bloggERS Script It! Series.

The Alexander Turnbull is a research library that holds archives and special collections within the National Library of New Zealand. This means exactly what you’d expect: manuscripts and archives, music, oral histories, photographs, and paintings, but also artefacts such as Katherine Mansfield’s typewriter and a surprising amount of hair. In 2008, the National Library established the National Digital Heritage Archive (NDHA), and has been actively collecting and managing born-digital materials since. I am one of two Digital Archivists who administer the transfer of born-digital heritage material to the Library. We also analyse files to ensure they have all the components needed for long-term preservation and ingest collections to the NDHA. We work closely with our digital preservation system administrators and the many other teams responsible for appraisal, arrangement and description, and providing access to born-digital heritage.

Why Scripting?

As archivists responsible for safely handling and managing born-digital heritage, we use scripts to work safely and sanely at scale. Python provides a flexible yet reliable platform for our work: we don’t have to download and learn a new piece of software every time we need to accomplish a different task. The increasing size and complexity of digital collections often means that machine processing is the only way to get our work done. A human could not reliably identify every duplicate file name in a collection of 200,000 files… but Python can. To protect original material, too, the scripting we do during appraisal and technical analysis is done using copies of collections. Here are some of the extremely useful tasks my team scripted recently:

  • Transfer
    • Generating a list of files on original storage media
    • Transferring files off the original digital media to our storage servers
  • Appraisal
    • Identifying duplicate files across different locations
    • Adding file extensions so material opens in the correct software
    • Flattening complex folder structures to support easy assessment
  • Technical Analysis
    • Sorting files into groups based on file extension to isolate unknown files
    • Extracting file signature information from unknown files

Our most-loved Python script even has a name: Safe Mover. Developed and maintained by our Digital Preservation Analyst, Safe Mover will generate file checksums, maintain metadata integrity, and check file authenticity, all the while copying material off digital storage media. Running something somebody else wrote was a great introduction to scripting. I finally understood that: a) you can do nimble computational work from a text editor; and b) a ‘script’ is just a set of instructions you write for the computer to follow.

Developing Skills Slowly, Systematically, and as Part of a Group

Once we recognised that we couldn’t do our work without scripting, my team started regular ‘Scripting Sessions’ with other colleagues who code. At each meeting we solve a genuine processing challenge from someone’s job, which allows us to apply what we learn immediately. We also write the code together on a big screen which, like learning any language, helped us become comfortable making mistakes. Recently, I accidentally copied 60,000 spreadsheets to my desktop and then wondered aloud why my laptop ground to a halt.

Outside of these meetings, learning to script has been about deliberately choosing to problem-solve using Python rather than doing it ‘by-hand’. Initially, this felt counter-intuitive because I was painfully slow: I really could have counted 200 folders on my fingers faster than I wrote a script to do the same thing.  But luckily for me, my team recognised the overall importance of this skill set and I also regularly remind myself: “this will be so useful the next 5000 times I need to inevitably do this task”.

The First Important Thing I Remembered How to Write

Practically every script I write starts with import os. It became so simple once I’d done it a few times: import is a command, and ‘os’ is the name of the thing I want. os is a Python module that allows you to interact with and leverage the functionality of your computer’s operating system. In general terms, a Python module is a just pre-existing code library for you to use. They are usually grouped around a particular theme or set of tasks.

The function I use the most is os.walk(). You tell os.walk() where to start and then it systematically traverses every folder beneath that. For every folder it finds, os.walk() will record three things: 1: the path to that folder, 2: a list of any sub-folders it contains, and 3: a list of any files it contains. Once os.walk() has completed its… well… walk, you have access to the name and location of every folder and file in your collection.

You then use Python to do something with this data: print it to the screen, write it in a spreadsheet, ask where something is, or open a file. Having access to this information becomes relevant to archivists very quickly. Just think about our characteristic tasks and concerns: identifying and recording original order, gaining intellectual control, maintaining authenticity. I often need to record or manipulate file paths, move actual files around the computer, or extract metadata stored in the operating system.

An Actual Script

Recently, we received a Mac-formatted 1TB hard drive from a local composer and performer. When #1 script Safe Mover stopped in the middle of transferring files, we wondered if there was a file path clash. Generally speaking, in a Windows formatted file system there’s a limit of 255 characters to a file path (“D:\so_it_pays_to\keep_your_file_paths\niceand\short.doc”).

Some older Mac systems have no limit on the number of file path characters so, if they’re really long, there can be a conflict when you’re transferring material to a Windows environment. To troubleshoot this problem we wrote a small script:

import os

top = "D:\example_folder_01\collection"
for root, dir_list, file_list in os.walk(top):
    for item in file_list:
    file_path = os.path.join(root,item)
    if len(file_path) > 255:
        print (file_path)

So that’s it– and running it over the hard drive gave us the answer we needed. But what is this little script actually doing?

# import the Python module 'os'.
import os

# tell our script where to start
top = " D:\example_folder_01\collection"
# now os.walk will systematically traverse the directory tree starting from 'top' and retain information in the variables root, dir_list, and file_list.
# remember: root is 'path to our current folder', dir_list is 'a list of sub-folder names', and file_list is 'a list of file names'.
for root, dir_list, file_list in os.walk(top):
    # for every file in those lists of files...
    for item in file_list:
        # ...store its location and name in a variable called 'file_path'.
        # os.path.join joins the folder path (root) with the file name for every file (item).
        file_path = os.path.join(root,item)
        # now we do the actual analysis. We want to know 'if the length of any file path is greater than 255 characters'.
        if len(file_path) > 255:
            # if so: print that file path to the screen so we can see it!
            print (file_path)

 

All it does is find every file path longer than 255 characters and print it to the screen. The archivists can then eyeball the list and decide how to proceed. Or, in the case of our 1TB hard drive, exclude that as a problem because it didn’t contain any really long file paths. But at least we now know that’s not the problem. So maybe we need… another script?


Flora Feltham is the Digital Archivist at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa. In this role she supports the acquisition, ingest, management, and preservation of born-digital heritage collections. She lives in Wellington.

Restructuring and Uploading ZIP Files to the Internet Archive with Bash

by Lindsey Memory and Nelson Whitney

This is the second post in the bloggERS Script It! series.

This blog is for anyone interested in uploading ZIP files into the Internet Archive.

The Harold B. Lee Library at Brigham Young University has been a scanning partner with the Internet Archive since 2009. Any loose-leaf or oversized items go through the convoluted workflow depicted below, which can take hours of painstaking mouse-clicking if you have a lot of items like we do (think 14,000 archival issues of the student newspaper). Items must be scanned individually as JPEGS, each JPEG must be reformatted into a JP2, the JP2s must all be zipped into ZIP files, then ZIPs are (finally) uploaded one-by-one into the Internet Archive.

old workflow
Workflow for uploading ZIP files to the Internet Archive.

Earlier this year, the engineers at Internet Archive published a single line of Python code that allows scan centers to upload multiple ZIP files into the Internet Archive at once (see “Bulk Uploads”) . My department has long dreamed of a script that could reformat Internet-Archive-bound items and upload them automatically. The arrival of the Python code got us moving.  I enlisted the help of the library’s senior software engineer and we discussed ways to compress scans, how Python scripts communicate with the Internet Archive, and ways to reorganize the scans’ directory files in a way conducive to a basic Bash script.

The project was delegated to Nelson Whitney, a student developer. Nelson wrote the script, troubleshot it with me repeatedly, and helped author this blog. Below we present his final script in two parts, written in Bash for iOS in Spring 2018.

Part 1: makeDirectories.sh

This simple command, executed through Terminal on a Mac, takes a list of identifiers (list.txt) and generates a set of organized subdirectories for each item on that list. These subdirectories house the JPEGs and are structured such that later they streamline the uploading process.

#! /bin/bash

# Move into the directory "BC-100" (the name of our quarto scanner), then move into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory.
cd BC-100/$1/$2

# Takes the plain text "list.txt," which is saved inside the staging subdirectory, and makes a new subdirectory for each identifier on the list.
cat list.txt | xargs mkdir
# For each identifier subdirectory,
for d in */; do
  # Terminal moves into that directory and
  cd $d
  # creates three subdirectories inside named "01_JPGs_cropped,"
mkdir 01_JPGs_cropped
  # "02_JP2s,"
mkdir 02_JP2s
  # and "03_zipped_JP2_file," respectively.
mkdir 03_zipped_JP2_file
  # It also deposits a blank text file in each identifier folder for employees to write notes in.
  touch Notes.txt

  cd ..
done
file structure copy
Workflow for uploading ZIP files to the Internet Archive.

Part 2: macjpzipcreate.sh

This Terminal command can recursively move through subdirectories, going into 01_JPEG_cropped first and turning all JPEGs therein into JP2s. Terminal saves the JP2s into the subdirectory 02_JP2s, then zips the JP2s into a zip file and saves the zip in subdirectory 03_zipped_JP2_file. Finally, Terminal uploads the zip into the Internet Archive. Note that for the bulk upload to work, you must have configured Terminal with the “./ia configure” command and entered your IA admin login credentials.

#! /bin/bash

# Places the binary file "ia" into the project directory (this enables upload to Internet Archive)
cp ia BC-100/$1/$2
# Move into the directory "BC-100" (the name of our quarto scanner), then moves into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory
cd BC-100/$1/$2

  # For each identifier subdirectory
  for d in */; do
  # Terminal moves into that identifier's directory and then the directory containing all the JPG files
  cd $d
  cd 01_JPGs_cropped

  # For each jpg file in the directory
  for jpg in< *.jpg; do
    # Terminal converts the jpg files into jp2 format using the sips command in MAC OS terminals
    sips -s format jp2 --setProperty formatOptions best $jpg --out ../02_JP2s/$jpg.jp2
  done

  cd ../02_JP2s
  # The directory variable contains a trailing slash. Terminal removes the trailing slash,
  d=${d%?}
  # gives the correct name to the zip file,
  im="_images"
  # and zips up all JP2 files.
  zip $d$im.zip *
  # Terminal moves the zip files into the intended zip file directory.
  mv $d$im.zip ../03_zipped_JP2_file

  # Terminal moves back up the project directory to where the ia script exists
  cd ../..
  # Uses the Internet-Archive-provided Python Script to upload the zip file to the internet
  ./ia upload $d $d/03_zipped_JP2_file/$d$im.zip --retries 10
  # Change the repub_state of the online identifier to 4, which marks the item as done in Internet Archive.
  ./ia metadata $d --modify=repub_state:4
done

The script has reduced the labor devoted to the Internet Archive by a factor of four. Additionally, it has bolstered Digital Initiatives’ relationship with IT. It was a pleasure working with Nelson; he gained real-time engineering experience working with a “client,” and I gained a valuable working knowledge of Terminal and the design of basic scripts.


 

lindsey_memoryLindsey Memory is the Digital Initiatives Workflows Supervisor at the Harold B. Lee Library at Brigham Young University. She loves old books and the view from her backyard.

 

 

Nelson Whitney

Nelson Whitney is an undergraduate pursuing a degree in Computer Science at Brigham Young University. He enjoys playing soccer, solving Rubik’s cubes, and spending time with his family and friends. One day, he hopes to do cyber security for the Air Force.