by Alejandra Dean
This is the first post in the BloggERS Script It! Series.
When I joined the Massachusetts Archives last September as Assistant Digital Records Archivist, I began working on a project to transfer digital surrogate records temporarily stored on networked server space to Preservica, the Archives’ digital repository and preservation environment. Existing descriptive metadata could be exported from our archival content management system as a .csv file and included alongside the ~12,000 TIFF files on the server, but I first needed to convert each row in the .csv into an individual XML file.
Enter: Python. At that point I’d already had some experience using the Python programming language and knew there was a csv module included in the Python standard library that could handle data in .csv files. Modules in Python are pre-written code organized as discrete files. Instead of re-defining the same function over and over again in every script you use or write, Python makes it easy to just import the bit of code you need as a module. I was also interested in using the os module, which provides a portable way to use operating system functionality such as working with files and directories. Python’s OS independence is a plus, because there are both Windows 7 workstations and Mac OS workstations at the Archives (and I use a Mac at home). For example, the simple Python script below is equivalent to using the dir /s command in the Windows Command Prompt or the ls -R command in Mac Terminal to recursively list contents in a directory:
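A minimal sketch of such a script, assuming nothing beyond the standard library, might look like the following. The function name walk_tree and the "." starting directory are illustrative placeholders, not part of Python itself:

```python
import os


def walk_tree(top):
    """Print and return what os.walk finds for every directory under `top`."""
    results = []
    for root, dirs, files in os.walk(top):
        # root: path of the directory currently being visited
        # dirs: names of the subdirectories inside root
        # files: names of the non-directory files inside root
        print(root, dirs, files)
        results.append((root, dirs, files))
    return results


if __name__ == "__main__":
    # "." means the current working directory; substitute any path here.
    walk_tree(".")
```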
If you execute this script in Python’s shell, called the Integrated Development and Learning Environment or IDLE, the output “walks” through a directory tree and, for each directory it visits, retrieves three pieces of information: the path to that directory, the names of all the subdirectories in it, and the names of all the non-directory files in it. The data is formatted differently from what’s returned in either CMD or Terminal, but this is where scripting with Python gets stellar. The Python 3.7.0 documentation explains that os.walk yields this information as a tuple, an ordered, read-only sequence. The syntax for a tuple uses parentheses and looks like:
mytuple = ("thing1", "thing2", "thing3")
Since os.walk automatically takes my input data and turns it into a tuple, a data type that I can act upon, there’s a lot of room to repurpose the data and build scripts into larger programs.
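As one hedged illustration of repurposing that tuple, the sketch below unpacks each result from os.walk and keeps only the TIFF files. The function name find_tiffs and the choice of file extensions are assumptions made for this example:

```python
import os


def find_tiffs(top):
    """Return a sorted list of full paths to .tif/.tiff files under `top`."""
    matches = []
    # Each tuple from os.walk unpacks into three names.
    for root, dirs, files in os.walk(top):
        for name in files:
            # Compare in lowercase so "SCAN.TIF" and "scan.tif" both match.
            if name.lower().endswith((".tif", ".tiff")):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

A list like this could then be counted, compared against the filenames in a .csv export, or handed to another function.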
That principle guides the script I now use to convert .csv files to XML. I found a starting template for the code on Stack Overflow (never reinvent the wheel if you don’t have to) and I’ve commented my current working version to explain each line. The concept of a Python list again provides the basis for what’s going on here. A list is essentially a modifiable tuple, and is useful when managing delimited data because we can iterate, or sequentially “loop,” through each element in the list. We can then take our list and perform some operation on it, which in this case is converting each element into its corresponding XML tag. The idea of order here is important; the order of the columns in my .csv (Filename, Record Group, Series, etc.) is maintained in my ordered list, which is mapped to a specific sequence of XML tags (<filename>, <recordGroup>, <series>, etc.). My final step after running csv_convert.py was to transform its output XML file into individual Dublin Core-based metadata files per TIFF using a separate XSLT stylesheet.
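The general idea can be sketched roughly as follows. This is not the csv_convert.py script itself, only an illustration built on the standard library’s csv and xml.etree.ElementTree modules; the function name csv_to_xml, the <records> and <record> wrapper elements, and the three-item tag list are all assumptions drawn from the column examples above:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tag list, in the same order as the .csv columns
# (Filename, Record Group, Series).
TAGS = ["filename", "recordGroup", "series"]


def csv_to_xml(csv_text, tags=TAGS):
    """Convert .csv text into one XML string with a <record> per row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    root = ET.Element("records")
    for row in reader:  # each row is a list of column values
        record = ET.SubElement(root, "record")
        # zip pairs the first tag with the first value, and so on,
        # so column order determines which tag each value receives.
        for tag, value in zip(tags, row):
            ET.SubElement(record, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "Filename,Record Group,Series\nimg001.tif,RG1,Series A\n"
    print(csv_to_xml(sample))
```

Building the elements with ElementTree rather than pasting strings together means characters such as & in a field are escaped automatically.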
Ultimately, as someone without a computer science background, I’d say the effort to learn a programming language has been worth it. In addition to expediting what could otherwise snowball into weeks of repetitive data entry, writing scripts allows me to work towards short-term objectives while developing a framework or methodology that I can apply to different projects in the long term. Learning to script also opens up a lot of flexibility when making decisions about managing data.
But that’s not to say I haven’t encountered problems along the way, or that Python is the only solution. Since September, my biggest lesson has been realizing that there are multiple ways to achieve the same goal, and that it’s just as important to invest time in deciding which tools make the most sense for a given task as it is to learn new skills. In my quest to learn Python, I gained a deeper appreciation for formulas in Excel, something I was already familiar with but hadn’t taken the time to really explore as an alternative. I also think it’s important to point out that Python mandates the use of indentation (significant white space) to parse blocks of code, which can make scripting difficult for people who use assistive technologies. A positive outcome of non-programmers learning to code is the perspectives we can provide as newcomers to challenge and think critically about what might ordinarily be considered the status quo.
Alejandra Dean is the Assistant Digital Records Archivist at the Massachusetts Archives. She received an M.S. in archives management from the Simmons School of Library and Information Science in 2017 and a B.A. in History of Art and Architecture from Harvard College in 2013.