Small-Scale Scripts for Large-Scale Analysis: Python at the Alexander Turnbull Library

by Flora Feltham

This is the third post in the bloggERS Script It! Series.

The Alexander Turnbull is a research library that holds archives and special collections within the National Library of New Zealand. This means exactly what you’d expect: manuscripts and archives, music, oral histories, photographs, and paintings, but also artefacts such as Katherine Mansfield’s typewriter and a surprising amount of hair. In 2008, the National Library established the National Digital Heritage Archive (NDHA), and has been actively collecting and managing born-digital materials since. I am one of two Digital Archivists who administer the transfer of born-digital heritage material to the Library. We also analyse files to ensure they have all the components needed for long-term preservation and ingest collections to the NDHA. We work closely with our digital preservation system administrators and the many other teams responsible for appraisal, arrangement and description, and providing access to born-digital heritage.

Why Scripting?

As archivists responsible for safely handling and managing born-digital heritage, we use scripts to work safely and sanely at scale. Python provides a flexible yet reliable platform for our work: we don’t have to download and learn a new piece of software every time we need to accomplish a different task. The increasing size and complexity of digital collections often means that machine processing is the only way to get our work done. A human could not reliably identify every duplicate file name in a collection of 200,000 files… but Python can. To protect original material, too, the scripting we do during appraisal and technical analysis is done using copies of collections. Here are some of the extremely useful tasks my team scripted recently:

  • Transfer
    • Generating a list of files on original storage media
    • Transferring files off the original digital media to our storage servers
  • Appraisal
    • Identifying duplicate files across different locations
    • Adding file extensions so material opens in the correct software
    • Flattening complex folder structures to support easy assessment
  • Technical Analysis
    • Sorting files into groups based on file extension to isolate unknown files
    • Extracting file signature information from unknown files

Our most-loved Python script even has a name: Safe Mover. Developed and maintained by our Digital Preservation Analyst, Safe Mover will generate file checksums, maintain metadata integrity, and check file authenticity, all the while copying material off digital storage media. Running something somebody else wrote was a great introduction to scripting. I finally understood that a) you can do nimble computational work from a text editor, and b) a ‘script’ is just a set of instructions you write for the computer to follow.
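Safe Mover itself is an in-house tool, but its core idea (checksum a file before and after copying, then compare) can be sketched roughly like this. This is a simplified illustration of the technique, not the actual script:

```python
import hashlib
import shutil

def md5sum(path):
    """Return the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def safe_copy(source, destination):
    """Copy a file, preserving timestamps, and verify the copy is identical."""
    before = md5sum(source)
    shutil.copy2(source, destination)  # copy2 keeps modification dates
    if md5sum(destination) != before:
        raise ValueError("checksum mismatch while copying " + source)
    return before
```

If the checksums ever differ, the script stops rather than silently leaving you with a corrupted copy.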

Developing Skills Slowly, Systematically, and as Part of a Group

Once we recognised that we couldn’t do our work without scripting, my team started regular ‘Scripting Sessions’ with other colleagues who code. At each meeting we solve a genuine processing challenge from someone’s job, which allows us to apply what we learn immediately. We also write the code together on a big screen which, as with learning any language, helps us become comfortable making mistakes. Recently, I accidentally copied 60,000 spreadsheets to my desktop and then wondered aloud why my laptop ground to a halt.

Outside of these meetings, learning to script has been about deliberately choosing to problem-solve using Python rather than doing it by hand. Initially, this felt counter-intuitive because I was painfully slow: I really could have counted 200 folders on my fingers faster than I wrote a script to do the same thing. But luckily for me, my team recognised the overall importance of this skill set, and I also regularly remind myself: “this will be so useful the next 5,000 times I inevitably need to do this task”.

The First Important Thing I Remembered How to Write

Practically every script I write starts with import os. It became so simple once I’d done it a few times: import is a command, and ‘os’ is the name of the thing I want. os is a Python module that allows you to interact with and leverage the functionality of your computer’s operating system. In general terms, a Python module is just a pre-existing code library for you to use. Modules are usually grouped around a particular theme or set of tasks.

The function I use the most is os.walk(). You tell os.walk() where to start and then it systematically traverses every folder beneath that. For every folder it finds, os.walk() will record three things: 1: the path to that folder, 2: a list of any sub-folders it contains, and 3: a list of any files it contains. Once os.walk() has completed its… well… walk, you have access to the name and location of every folder and file in your collection.

You then use Python to do something with this data: print it to the screen, write it in a spreadsheet, ask where something is, or open a file. Having access to this information becomes relevant to archivists very quickly. Just think about our characteristic tasks and concerns: identifying and recording original order, gaining intellectual control, maintaining authenticity. I often need to record or manipulate file paths, move actual files around the computer, or extract metadata stored in the operating system.
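For example, writing that data to a spreadsheet, say a basic file manifest with sizes and modification dates pulled from the operating system, takes only a few lines (the starting folder and output name here are placeholders):

```python
import csv
import os
import time

top = "collection_copy"  # placeholder: a working copy of the collection
with open("manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file_path", "size_bytes", "last_modified"])
    for root, dir_list, file_list in os.walk(top):
        for item in file_list:
            file_path = os.path.join(root, item)
            info = os.stat(file_path)  # metadata held by the operating system
            modified = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
            writer.writerow([file_path, info.st_size, modified])
```

The resulting CSV opens in any spreadsheet software, which makes it easy to share with colleagues doing arrangement and description.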

An Actual Script

Recently, we received a Mac-formatted 1TB hard drive from a local composer and performer. When our #1 script, Safe Mover, stopped in the middle of transferring files, we wondered whether there was a file path clash. Generally speaking, a Windows-formatted file system limits a file path to 255 characters (“D:\so_it_pays_to\keep_your_file_paths\niceand\short.doc”).

Some older Mac systems have no limit on the number of file path characters so, if they’re really long, there can be a conflict when you’re transferring material to a Windows environment. To troubleshoot this problem we wrote a small script:

import os

top = r"D:\example_folder_01\collection"
for root, dir_list, file_list in os.walk(top):
    for item in file_list:
        file_path = os.path.join(root, item)
        if len(file_path) > 255:
            print(file_path)

So that’s it. Running it over the hard drive gave us the answer we needed. But what is this little script actually doing?

# import the Python module 'os'.
import os

# tell our script where to start. The 'r' prefix stops Python from treating the backslashes as escape characters.
top = r"D:\example_folder_01\collection"
# now os.walk will systematically traverse the directory tree starting from 'top' and retain information in the variables root, dir_list, and file_list.
# remember: root is 'path to our current folder', dir_list is 'a list of sub-folder names', and file_list is 'a list of file names'.
for root, dir_list, file_list in os.walk(top):
    # for every file in those lists of files...
    for item in file_list:
        # ...store its location and name in a variable called 'file_path'.
        # os.path.join joins the folder path (root) with the file name for every file (item).
        file_path = os.path.join(root, item)
        # now we do the actual analysis. We want to know 'if the length of any file path is greater than 255 characters'.
        if len(file_path) > 255:
            # if so: print that file path to the screen so we can see it!
            print(file_path)

All it does is find every file path longer than 255 characters and print it to the screen. The archivists can then eyeball the list and decide how to proceed. Or, in the case of our 1TB hard drive, rule that out as the problem, because it didn’t contain any really long file paths. So maybe we need… another script?


Flora Feltham is the Digital Archivist at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa. In this role she supports the acquisition, ingest, management, and preservation of born-digital heritage collections. She lives in Wellington.

Electronic Records at SAA 2018

With just weeks to go before the 2018 SAA Annual Meeting hits the capital, here’s a round-up of the sessions that might interest ERS members in particular. This year’s schedule offers plenty for archivists who deal with the digital, tackling current gnarly issues around transparency and access, and format-specific challenges like Web archiving and social media. Other sessions anticipate the opportunities and questions posed by new technologies: blockchain, artificial intelligence, and machine learning.

And of course, be sure to mark your calendars for the ERS annual meeting! This year’s agenda includes lightning talks from representatives from the IMLS-funded OSSArcFlow and Collections as Data projects, and the DLF Levels of Access research group. There will also be a mini-unconference session focused on problem-solving current challenges associated with the stewardship of electronic records. If you would like to propose an unconference topic or facilitate a breakout group, sign up here.

Wednesday, August 15

2:30-3:45

Electronic Records Section annual business meeting (https://archives2018.sched.com/event/ESmz/electronic-records-section)

Thursday, August 16

10:30-11:45

105 – Opening the Black Box: Transparency and Complexity in Digital Preservation (https://archives2018.sched.com/event/ESlh)

12:00-1:15

Open Forums: Preservation of Electronic Government Information (PEGI) Project (https://archives2018.sched.com/event/ETNi/open-forums-preservation-of-electronic-government-information-pegi-project)

Open Forums: Safe Search Among Sensitive Content: Investigating Archivist and Donor Conceptions of Privacy, Secrecy, and Access (https://archives2018.sched.com/event/ETNh/open-forums-safe-search-among-sensitive-content-investigating-archivist-and-donor-conceptions-of-privacy-secrecy-and-access)

1:30-2:30

201 – Email Archiving Comes of Age (https://archives2018.sched.com/event/ESlo/201-email-archiving-comes-of-age)

204 – Scheduling the Ephemeral: Creating and Implementing Records Management Policy for Social Media (https://archives2018.sched.com/event/ESls/204-scheduling-the-ephemeral-creating-and-implementing-records-management-policy-for-social-media)

Friday, August 17

2:00 – 3:00

501 – The National Archives Aims for Digital Future: Discuss NARA Strategic Plan and Future of Archives with NARA Leaders (https://archives2018.sched.com/event/ESmP/501-the-national-archives-aims-for-digital-future-discuss-nara-strategic-plan-and-future-of-archives-with-nara-leaders)

502 – This is not Skynet (yet): Why Archivists should care about Artificial Intelligence and Machine Learning (https://archives2018.sched.com/event/ESmQ/502-this-is-not-skynet-yet-why-archivists-should-care-about-artificial-intelligence-and-machine-learning)

504 – Equal Opportunities: Physical and Digital Accessibility of Archival Collections (https://archives2018.sched.com/event/ESmS/504-equal-opportunities-physical-and-digital-accessibility-of-archival-collections)

508 – Computing Against the Grain: Capturing and Appraising Underrepresented Histories of Computing (https://archives2018.sched.com/event/ESmW/508-computing-against-the-grain-capturing-and-appraising-underrepresented-histories-of-computing)

Saturday, August 18

8:30 – 10:15

605 – Taming the Web: Perspectives on the Transparent Management and Appraisal of Web Archives (https://archives2018.sched.com/event/ESme/605-taming-the-web-perspectives-on-the-transparent-management-and-appraisal-of-web-archives)

606 – Let’s Be Clear: Transparency and Access to Complex Digital Objects (https://archives2018.sched.com/event/ESmf/606-lets-be-clear-transparency-and-access-to-complex-digital-objects)

10:30 – 11:30

704 – Blockchain: What Is It and Why Should You Care (https://archives2018.sched.com/event/ESmo/704-blockchain-what-is-it-and-why-should-we-care)

Restructuring and Uploading ZIP Files to the Internet Archive with Bash

by Lindsey Memory and Nelson Whitney

This is the second post in the bloggERS Script It! series.

This blog is for anyone interested in uploading ZIP files into the Internet Archive.

The Harold B. Lee Library at Brigham Young University has been a scanning partner with the Internet Archive since 2009. Any loose-leaf or oversized items go through the convoluted workflow depicted below, which can take hours of painstaking mouse-clicking if you have a lot of items like we do (think 14,000 archival issues of the student newspaper). Items must be scanned individually as JPEGs, each JPEG must be reformatted into a JP2, the JP2s must all be zipped into ZIP files, and then the ZIPs are (finally) uploaded one by one into the Internet Archive.

Workflow for uploading ZIP files to the Internet Archive.

Earlier this year, the engineers at the Internet Archive published a single line of Python code that allows scan centers to upload multiple ZIP files into the Internet Archive at once (see “Bulk Uploads”). My department has long dreamed of a script that could reformat Internet-Archive-bound items and upload them automatically, and the arrival of the Python code got us moving. I enlisted the help of the library’s senior software engineer, and we discussed ways to compress scans, how Python scripts communicate with the Internet Archive, and ways to reorganize the scans’ directories in a way conducive to a basic Bash script.

The project was delegated to Nelson Whitney, a student developer. Nelson wrote the script, troubleshot it with me repeatedly, and helped author this blog. Below we present his final script in two parts, written in Bash for macOS in Spring 2018.

Part 1: makeDirectories.sh

This simple command, executed through Terminal on a Mac, takes a list of identifiers (list.txt) and generates a set of organized subdirectories for each item on that list. These subdirectories house the JPEGs and are structured such that later they streamline the uploading process.
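For readers more comfortable in Python than Bash, the same directory structure could be built with pathlib. This is a rough sketch of the equivalent logic, where the staging path is a placeholder:

```python
from pathlib import Path

staging = Path("BC-100/project/staging")  # placeholder for the real staging path
subdirs = ["01_JPGs_cropped", "02_JP2s", "03_zipped_JP2_file"]

list_file = staging / "list.txt"
if list_file.exists():
    # One folder per identifier on the list, each with the three working
    # subdirectories and an empty Notes.txt for employees' notes.
    for identifier in list_file.read_text().split():
        for sub in subdirs:
            (staging / identifier / sub).mkdir(parents=True, exist_ok=True)
        (staging / identifier / "Notes.txt").touch()
```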

#! /bin/bash

# Move into the directory "BC-100" (the name of our quarto scanner), then move into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory.
cd BC-100/$1/$2

# Takes the plain text "list.txt," which is saved inside the staging subdirectory, and makes a new subdirectory for each identifier on the list.
cat list.txt | xargs mkdir
# For each identifier subdirectory,
for d in */; do
  # Terminal moves into that directory and
  cd "$d"
  # creates three subdirectories inside named "01_JPGs_cropped,"
  mkdir 01_JPGs_cropped
  # "02_JP2s,"
  mkdir 02_JP2s
  # and "03_zipped_JP2_file," respectively.
  mkdir 03_zipped_JP2_file
  # It also deposits a blank text file in each identifier folder for employees to write notes in.
  touch Notes.txt

  cd ..
done
File structure created by makeDirectories.sh.

Part 2: macjpzipcreate.sh

This Terminal command can recursively move through subdirectories, going into 01_JPGs_cropped first and turning all JPEGs therein into JP2s. Terminal saves the JP2s into the subdirectory 02_JP2s, then zips the JP2s into a zip file and saves the zip in subdirectory 03_zipped_JP2_file. Finally, Terminal uploads the zip into the Internet Archive. Note that for the bulk upload to work, you must have configured Terminal with the “./ia configure” command and entered your IA admin login credentials.

#! /bin/bash

# Places the binary file "ia" into the project directory (this enables upload to Internet Archive)
cp ia BC-100/$1/$2
# Move into the directory "BC-100" (the name of our quarto scanner), then move into the subdirectory named after whichever project we're scanning, then move into a staging subdirectory
cd BC-100/$1/$2

# For each identifier subdirectory
for d in */; do
  # Terminal moves into that identifier's directory and then the directory containing all the JPG files
  cd "$d"
  cd 01_JPGs_cropped

  # For each jpg file in the directory
  for jpg in *.jpg; do
    # Terminal converts the jpg files into jp2 format using the sips command in the macOS Terminal
    sips -s format jp2 --setProperty formatOptions best "$jpg" --out "../02_JP2s/$jpg.jp2"
  done

  cd ../02_JP2s
  # The directory variable contains a trailing slash. Terminal removes the trailing slash,
  d=${d%?}
  # gives the correct name to the zip file,
  im="_images"
  # and zips up all JP2 files.
  zip "$d$im.zip" *
  # Terminal moves the zip files into the intended zip file directory.
  mv "$d$im.zip" ../03_zipped_JP2_file

  # Terminal moves back up to the project directory, where the ia script exists
  cd ../..
  # Uses the Internet-Archive-provided Python script to upload the zip file to the Internet Archive
  ./ia upload "$d" "$d/03_zipped_JP2_file/$d$im.zip" --retries 10
  # Change the repub_state of the online identifier to 4, which marks the item as done in Internet Archive.
  ./ia metadata "$d" --modify=repub_state:4
done

The script has reduced the labor devoted to the Internet Archive by a factor of four. Additionally, it has bolstered Digital Initiatives’ relationship with IT. It was a pleasure working with Nelson; he gained real-time engineering experience working with a “client,” and I gained a valuable working knowledge of Terminal and the design of basic scripts.


Lindsey Memory is the Digital Initiatives Workflows Supervisor at the Harold B. Lee Library at Brigham Young University. She loves old books and the view from her backyard.

Nelson Whitney

Nelson Whitney is an undergraduate pursuing a degree in Computer Science at Brigham Young University. He enjoys playing soccer, solving Rubik’s cubes, and spending time with his family and friends. One day, he hopes to do cyber security for the Air Force.

Meet the 2018 Candidates: Scott Kirycki

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Scott Kirycki

Running for: Steering Committee

What made you decide you wanted to become an archivist?

My journey to becoming an archivist began somewhere that did not, strictly speaking, exist: the fictional worlds of radio programs from the 1940s. When I was a boy, a local radio station replayed old shows such as The Lone Ranger, The Shadow, and The Jack Benny Program. The shows pulled me in with their appeal to the imagination, and I wanted to learn more about them. My parents and teachers had taught me well about the library, and my interest in radio programs (and soon my interest in the historical period that produced them) provided a new focus for the use of library resources. I checked out books, records, and tapes and studied the non-circulating reference material. Later, as I worked on more research projects for school assignments, I learned how to use The Reader’s Guide to Periodical Literature, and that led me to the treasures contained in bound volumes of magazines and on microfilm.

After starting college and choosing English as my major, I spent more time in libraries, particularly academic libraries with their special collections and emphasis on research. I grew to enjoy the research part of college work especially, which prompted me to pursue a master’s in English literature. When the time came for me to do a thesis for my degree, I picked a bibliographic research project over a literary interpretation or analysis. I created an annotated bibliography of the books advertised in The Tatler, an eighteenth-century British periodical.

Given my interest in research and the amount of time I spent in libraries, I considered following my master’s in English with a degree in library science. Since I liked historical material, I thought of studying to work in an archive. Although I looked into applying at some schools of library science, I was not wholeheartedly enthusiastic about additional years of schooling at that point in my life, so my journey took a turn to the business world.

The company where I began working after grad school turned out to be a good fit for me. The work connected to my interest in research because it involved making computerized databases for lawyers. I learned about database software, indexing, document scanning, and some electronic records management. In time, I became a department manager and later moved to project management. Regrettably for me, the company was eventually sold, and the new owners started a course of restructuring that culminated with the elimination of my position.

After exploring the job market, I reached the conclusion that further education would be a rewarding pursuit – rewarding not just from the standpoint of increasing the likelihood of landing a job, but also rewarding for personal growth and the opportunity to learn from other people. Because I had considered library science before and still had an interest in the things of the past, I decided to return to school to earn a degree in library science with a focus on archives. I was drawn to courses on digital content where I could continue to use and build on the experience and data-handling skills that I gained during my first career.

Though my decision to become an archivist was a long time coming, I am glad to have made it and look forward to continuing to discover the rewards of working in a field where I can benefit others by helping them connect to information.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

As I have been working on projects with the Records Management Team at the University of Notre Dame, I have become increasingly aware of how many electronic records consist of data points in enterprise-size content management systems rather than discrete files such as Word docs and PDFs. I anticipate that archiving databases as well as material from systems that were not necessarily designed with long-term preservation in mind will be a growing challenge for archivists. I would like to see the Electronic Records Section put forward guidance on best practices for meeting this challenge.

What cartoon character do you model yourself after?

The Tick (from the 1994 – 1996 Fox animated series)

Meet the 2018 Candidates: Kelsey O’Connell

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Kelsey O’Connell

Running for: Steering Committee

What made you decide you wanted to become an archivist?

As a history/English major in college, I knew I didn’t want to become a teacher or a lawyer so I began exploring other career options. I landed a position as a student assistant in my college library’s Special Collections and Archives department where I began processing collections. I immediately loved the organization, research, and learning that I participated in daily and realized I just wanted to do this for the rest of my life.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

My primary interest is employing ERS’s platform to influence SAA and related organizations to begin creating and formalizing documentation for electronic records. We all talk about documentation a lot, but we still haven’t rolled it out. I think a strategic approach to initiating much of this documentation is for ERS to survey, index, and prioritize the various policies, guidelines, standards, and frameworks needed for robust documentation on the management and care of electronic records. We’d be able to utilize Section members’ input and participation in the discussion, allowing us to have a comprehensive yet diverse perspective for our recommendations.

What cartoon character do you model yourself after?

I have to admit I had to ask my family and friends for help identifying this one. More than a few of them said Velma from Scooby Doo. I laughed because I assumed it was because I really can’t see without my glasses – like when my cat enjoys knocking them off my nightstand in the middle of the night and I have to search for them in the morning. But they all said it’s because I like figuring stuff out. Although mystery isn’t my favorite genre, there are some clear parallels to researching and employing some trial and error tactics with electronic records.

Meet the 2018 Candidates: Jane Kelly

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Jane Kelly

Running for: Steering Committee

What made you decide you wanted to become an archivist?

My first contact with archives was as a college intern. The position was unpaid, required an expensive ninety minute commute each way, and wasn’t exactly thrilling. I spent most of that job with my headphones in, dusting red rot off of books, waiting to go home. This was not my dream job. It wasn’t until several years after I finished college that I found myself truly engaged with archives as a career path. I studied history as an undergrad but chose not to pursue a PhD, so I was happy to find myself in a job where understanding the past really matters. I began to appreciate the ways in which archives and archivists construct narratives of the past in their everyday work. For better or worse, there’s a lot of power in what we do. More importantly, it was the people I worked with who made me want to become an archivist. Having coworkers who took the time to teach me on the job, especially before I started grad school, has been invaluable. Working with people who encouraged me to attend conferences, grapple with big questions, and take on responsibility made me want to keep working and learning. Without that, I’m not sure that I would have stuck around.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

I would love to see the Electronic Records Section become an even greater resource for other SAA sections. It seems inevitable that everyone who works in archives will need to understand electronic records and born-digital material, at least at a basic level. ERS seems like the obvious hub for those resources. I want other archivists to see that they are capable of understanding issues unique to electronic records and that they don’t need to be intimidated by this part of the field. As a young professional, I’m also particularly interested in partnering with SNAP. Access to mentorship has been really important for me, both in terms of choosing to stay in the archives profession and learning how to do the work. I would like to see deeper connections between these two groups and find ways to support folks who don’t have resources to pay for SAA courses to supplement what they learn in graduate school.

What cartoon character do you model yourself after?

This is a hard question. I’ll go with Eliza Thornberry because she’s a smart kid, and I also wish I could talk to my cat.

Meet the 2018 Candidates: Susan Malsbury

The 2018 elections for Electronic Records Section leadership are upon us! To support your getting to know the candidates, we will be presenting additional information provided by the 2018 nominees for ERS leadership positions. For more information about the slate of candidates, you can check out the full 2018 ERS elections site. ERS Members: be sure to vote! Polls are open through July 17!

Candidate name: Susan Malsbury

Running for: Vice-Chair / Chair-Elect

What made you want to become an archivist?

When I was initially applying to library schools, I wanted to be sure that I was choosing the right career path. To that end, I volunteered at the Portland Public Library and the Maine Historical Society, both in Portland, Maine. While I enjoyed my time at the public library, I immediately fell in love with the archival work at the historical society. My project there was helping an archivist process the Portland Press Herald glass plate negative collection and scan select negatives for inclusion in the Maine Memory Network. It was magic seeing all these early-20th century images unwrapped from their cracked, yellowed envelopes and reintroduced to the world after so many years via description and scanning. It was extremely fulfilling to help preserve the negatives for future researchers. I returned to New York City and was fortunate enough to get a job in the Manuscripts and Archives Division at the New York Public Library. I was able to supplement my graduate work with hands-on experience processing some truly incredible collections such as the 1939/1940 New York World’s Fair papers and the Truman Capote papers. While digital archives are a far cry from glass plate negatives, I feel a similar fulfillment knowing that I’m helping ensure the preservation and future accessibility of unique born-digital records.

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

There are a lot of exciting initiatives, programs, and ad hoc groups developing in the digital archives and digital preservation communities. I would love for ERS to build on its mandate to be the locus of expertise for SAA by serving as a platform for these projects to reach SAA’s general membership. Additionally, I’d like to work to expand participation as an ever-greater number of archivists are working with born-digital material (even if “digital” isn’t in their job title).

What cartoon character do you model yourself after?

As a child of the ‘90s I’ve always strongly identified with Lisa Simpson.