Small-Scale Scripts for Large-Scale Analysis: Python at the Alexander Turnbull Library

by Flora Feltham

This is the third post in the bloggERS Script It! Series.

The Alexander Turnbull is a research library that holds archives and special collections within the National Library of New Zealand. This means exactly what you’d expect: manuscripts and archives, music, oral histories, photographs, and paintings, but also artefacts such as Katherine Mansfield’s typewriter and a surprising amount of hair. In 2008, the National Library established the National Digital Heritage Archive (NDHA), and has been actively collecting and managing born-digital materials since. I am one of two Digital Archivists who administer the transfer of born-digital heritage material to the Library. We also analyse files to ensure they have all the components needed for long-term preservation and ingest collections to the NDHA. We work closely with our digital preservation system administrators and the many other teams responsible for appraisal, arrangement and description, and providing access to born-digital heritage.

Why Scripting?

As archivists responsible for safely handling and managing born-digital heritage, we use scripts to work safely and sanely at scale. Python provides a flexible yet reliable platform for our work: we don’t have to download and learn a new piece of software every time we need to accomplish a different task. The increasing size and complexity of digital collections often means that machine processing is the only way to get our work done. A human could not reliably identify every duplicate file name in a collection of 200,000 files… but Python can. To protect original material, too, the scripting we do during appraisal and technical analysis is done using copies of collections. Here are some of the extremely useful tasks my team scripted recently:

  • Transfer
    • Generating a list of files on original storage media
    • Transferring files off the original digital media to our storage servers
  • Appraisal
    • Identifying duplicate files across different locations
    • Adding file extensions so material opens in the correct software
    • Flattening complex folder structures to support easy assessment
  • Technical Analysis
    • Sorting files into groups based on file extension to isolate unknown files
    • Extracting file signature information from unknown files

Our most-loved Python script even has a name: Safe Mover. Developed and maintained by our Digital Preservation Analyst, Safe Mover will generate file checksums, maintain metadata integrity, and check file authenticity, all the while copying material off digital storage media. Running something somebody else wrote was a great introduction to scripting. I finally understood that: a) you can do nimble computational work from a text editor; and b) a ‘script’ is just a set of instructions you write for the computer to follow.

Developing Skills Slowly, Systematically, and as Part of a Group

Once we recognised that we couldn’t do our work without scripting, my team started regular ‘Scripting Sessions’ with other colleagues who code. At each meeting we solve a genuine processing challenge from someone’s job, which allows us to apply what we learn immediately. We also write the code together on a big screen which, like learning any language, helped us become comfortable making mistakes. Recently, I accidentally copied 60,000 spreadsheets to my desktop and then wondered aloud why my laptop ground to a halt.

Outside of these meetings, learning to script has been about deliberately choosing to problem-solve using Python rather than doing it ‘by-hand’. Initially, this felt counter-intuitive because I was painfully slow: I really could have counted 200 folders on my fingers faster than I wrote a script to do the same thing.  But luckily for me, my team recognised the overall importance of this skill set and I also regularly remind myself: “this will be so useful the next 5000 times I need to inevitably do this task”.

The First Important Thing I Remembered How to Write

Practically every script I write starts with import os. It became so simple once I’d done it a few times: import is a command, and ‘os’ is the name of the thing I want. os is a Python module that allows you to interact with and leverage the functionality of your computer’s operating system. In general terms, a Python module is a just pre-existing code library for you to use. They are usually grouped around a particular theme or set of tasks.

The function I use the most is os.walk(). You tell os.walk() where to start and then it systematically traverses every folder beneath that. For every folder it finds, os.walk() will record three things: 1: the path to that folder, 2: a list of any sub-folders it contains, and 3: a list of any files it contains. Once os.walk() has completed its… well… walk, you have access to the name and location of every folder and file in your collection.

You then use Python to do something with this data: print it to the screen, write it in a spreadsheet, ask where something is, or open a file. Having access to this information becomes relevant to archivists very quickly. Just think about our characteristic tasks and concerns: identifying and recording original order, gaining intellectual control, maintaining authenticity. I often need to record or manipulate file paths, move actual files around the computer, or extract metadata stored in the operating system.

An Actual Script

Recently, we received a Mac-formatted 1TB hard drive from a local composer and performer. When #1 script Safe Mover stopped in the middle of transferring files, we wondered if there was a file path clash. Generally speaking, in a Windows formatted file system there’s a limit of 255 characters to a file path (“D:\so_it_pays_to\keep_your_file_paths\niceand\short.doc”).

Some older Mac systems have no limit on the number of file path characters so, if they’re really long, there can be a conflict when you’re transferring material to a Windows environment. To troubleshoot this problem we wrote a small script:

import os

top = "D:\example_folder_01\collection"
for root, dir_list, file_list in os.walk(top):
    for item in file_list:
    file_path = os.path.join(root,item)
    if len(file_path) > 255:
        print (file_path)

So that’s it– and running it over the hard drive gave us the answer we needed. But what is this little script actually doing?

# import the Python module 'os'.
import os

# tell our script where to start
top = " D:\example_folder_01\collection"
# now os.walk will systematically traverse the directory tree starting from 'top' and retain information in the variables root, dir_list, and file_list.
# remember: root is 'path to our current folder', dir_list is 'a list of sub-folder names', and file_list is 'a list of file names'.
for root, dir_list, file_list in os.walk(top):
    # for every file in those lists of files...
    for item in file_list:
        # ...store its location and name in a variable called 'file_path'.
        # os.path.join joins the folder path (root) with the file name for every file (item).
        file_path = os.path.join(root,item)
        # now we do the actual analysis. We want to know 'if the length of any file path is greater than 255 characters'.
        if len(file_path) > 255:
            # if so: print that file path to the screen so we can see it!
            print (file_path)

 

All it does is find every file path longer than 255 characters and print it to the screen. The archivists can then eyeball the list and decide how to proceed. Or, in the case of our 1TB hard drive, exclude that as a problem because it didn’t contain any really long file paths. But at least we now know that’s not the problem. So maybe we need… another script?


Flora Feltham is the Digital Archivist at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa. In this role she supports the acquisition, ingest, management, and preservation of born-digital heritage collections. She lives in Wellington.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s