What’s Your Set-up?: Processing Digital Records at UAlbany (part 2)

by Gregory Wiedeman


In the last post I wrote about the theoretical and technical foundations for our born-digital records set-up at UAlbany. Here, I try to show the systems we use and how they work in practice.

Systems

ArchivesSpace

ArchivesSpace manages all archival description. Accession records and top level description for collections and file series are created directly in ArchivesSpace, while lower-level description, containers, locations, and digital objects are created using asInventory spreadsheets. Overnight, all modified published records are exported using exportPublicData.py and indexed into Solr using indexNewEAD.sh. This Solr index is read by ArcLight.

ArcLight

ArcLight provides discovery and display for archival description exported from ArchivesSpace. It uses URIs from ArchivesSpace digital objects to point to digital content in Hyrax while placing that content in the context of archival description. ArcLight is also well suited to systems integration because it allows any system to query it through an unauthenticated API. This lets Hyrax and other tools easily query ArcLight for description records.
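
Because that API is unauthenticated, a few lines of code are enough to pull description out of ArcLight. Below is a minimal sketch, assuming a Blacklight-style JSON search endpoint, a hypothetical base URL, and an invented query value; the exact parameters and response shape depend on the ArcLight version and local indexing rules, and this is not UAlbany's actual integration code.

```python
import requests

# Hypothetical ArcLight base URL; Blacklight-based apps expose a JSON view of search results.
ARCLIGHT_URL = "https://archives.example.edu/catalog"

# Query for a component by an identifier (the query value is illustrative;
# the right field and syntax depend on local ArcLight indexing rules).
params = {"q": "ua950.009_ref123", "format": "json"}
response = requests.get(ARCLIGHT_URL, params=params, timeout=30)
response.raise_for_status()

results = response.json()
# The response shape varies by Blacklight version; inspect it before relying on specific keys.
print(list(results.keys()))
```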

Hyrax

Hyrax manages digital objects and item-level metadata. Some objects have detailed Dublin Core-style metadata, while other objects only have an ArchivesSpace identifier. Some custom client-side JavaScript uses this identifier to query ArcLight for more description to contextualize the object and provide links to more items. This means users can discover a file that does not have detailed metadata, such as Minutes, and Hyrax will display the Scope and Content note of the parent series along with links to more series and collection-level description.

Storage

Our preservation storage uses network shares managed by our university data center. We limit write access to the SIP and AIP storage directories to one service account used only by the server that runs the scheduled microservices. This means that only tested automated processes can create, edit, or delete SIPs and AIPs. Archivists have read-only access to these directories, which contain standard bags generated by BagIt-python that are validated against BagIt Profiles. Microservices also place a copy of all SIPs in a processing directory where archivists have full access to work directly with the files. These processing packages have specific subdirectories for master files, derivatives, and metadata. This allows other microservices to be run on them with just the package identifier. So, if you needed to batch create derivatives or metadata files, the microservices know which directories to look in.
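
For readers unfamiliar with BagIt-python, here is a minimal sketch of the bag-and-validate idea; the paths, metadata fields, and checksum choice are illustrative assumptions rather than our actual profile or microservice code.

```python
import bagit

# Bag a transfer in place, writing manifests and bag-info.txt.
bag = bagit.make_bag(
    "/path/to/processing/ua950.009_001",                   # hypothetical SIP directory
    {"Source-Organization": "University at Albany, SUNY"},  # illustrative metadata
    checksums=["sha256"],
)

# On a later run (e.g., before deleting a SIP whose AIP exists), re-load and validate.
bag = bagit.Bag("/path/to/SIPs/ua950.009_001")
if bag.is_valid():
    print("Bag validates; payload and checksums match the manifests.")
```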

The microservices themselves have built-in checks in place, such as verifying that a valid AIP exists before deleting a SIP. The data center also has some low-level preservation features in place, and we are working to build additional preservation services that will run asynchronously from the rest of our processing workflows. This system is far from perfect, but it works for now, and at the end of the day, we are relying on the permanent positions in our department as well as in Library Systems and university IT to keep these files available long-term.

Microservices

These microservices are the glue that holds most of our workflows together. Most of the links here point to code on our GitHub page, but we’re also trying to add public information on these processes to our documentation site.

asInventory

This is a basic Python desktop app for managing lower-level description in ArchivesSpace through Excel spreadsheets using the API. Archivists can place a completed spreadsheet in a designated asInventory input directory and double-click an .exe file to add new archival objects to ArchivesSpace. A separate .exe can export all the child records from a resource or archival object identifier. The exported spreadsheets include the identifier for each archival object, container, and location, so we can easily roundtrip data from ArchivesSpace, edit it in Excel, and push the updates back into ArchivesSpace. 
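
asInventory itself wraps the ArchivesSpace API; the sketch below only shows the general shape of creating a single archival object from one spreadsheet row with the requests library. The URL, credentials, repository and resource IDs, and field values are all hypothetical, and this is not asInventory's actual code.

```python
import requests

# Hypothetical connection details; the endpoint paths follow the ArchivesSpace backend API.
API = "https://aspace.example.edu/api"
auth = requests.post(f"{API}/users/admin/login", params={"password": "admin"})
session = auth.json()["session"]
headers = {"X-ArchivesSpace-Session": session}

# One spreadsheet row, already parsed elsewhere (values are illustrative).
row = {"title": "Minutes", "date": "2019", "resource_uri": "/repositories/2/resources/101"}

record = {
    "jsonmodel_type": "archival_object",
    "title": row["title"],
    "level": "file",
    "resource": {"ref": row["resource_uri"]},
    "dates": [{"date_type": "inclusive", "label": "creation", "expression": row["date"]}],
}
r = requests.post(f"{API}/repositories/2/archival_objects", headers=headers, json=record)
r.raise_for_status()
print(r.json())  # includes the new record's URI on success
```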

We have since built our born-digital description workflow on top of asInventory. The spreadsheet has a “DAO” column, and asInventory will create a digital object from any URI placed there. An archivist can describe digital records in a spreadsheet while adding Hyrax URLs that link to individual files or groups of files.

We have been using asInventory for almost 3 years, and it does need some maintenance work. Shifting a lot of the code to the ArchivesSnake library will help make this easier, and I also hope to find a way to eliminate the need for a GUI framework so it runs just like a regular script.

Syncing scripts

The ArchivesSpace-ArcLight-Workflow GitHub repository is a set of scripts that keeps our systems connected and up to date. exportPublicData.py ensures that all published description in ArchivesSpace is exported each night, and indexNewEAD.sh indexes this description into Solr so it can be used by ArcLight. processNewUploads.py is the most complex process. This script takes all new digital objects uploaded through the Hyrax web interface, stores preservation copies as AIPs, and creates digital object records in ArchivesSpace that point to them. Part of what makes this step challenging is that Hyrax does not have an API, so the script uses Solr and a web scraper as a workaround.
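
In spirit, the nightly export boils down to asking ArchivesSpace for recently modified resources and writing out their EAD. A rough sketch follows, using the public ArchivesSpace API; the base URL, session token, repository ID, output path, and the modified_since parameter are assumptions, and this is not the actual exportPublicData.py.

```python
import time
import requests

API = "https://aspace.example.edu/api"                      # hypothetical
headers = {"X-ArchivesSpace-Session": "session-token-here"}  # obtained from a prior login call

# Ask for resources modified in the last 24 hours (epoch seconds).
since = int(time.time()) - 86400
ids = requests.get(
    f"{API}/repositories/2/resources",
    params={"all_ids": "true", "modified_since": since},
    headers=headers,
).json()

for rid in ids:
    # Export each modified resource as EAD for the Solr/ArcLight indexer to pick up.
    ead = requests.get(f"{API}/repositories/2/resource_descriptions/{rid}.xml", headers=headers)
    with open(f"/path/to/ead-exports/{rid}.xml", "wb") as f:
        f.write(ead.content)
```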

These scripts sound complicated, but they have been relatively stable over the past year or so. I hope we can simplify them too, by relying more on ArchivesSnake and moving some separate functions to other smaller microservices. One example is how the ASpace export script also adds a link for each collection to our website. We can simplify this by moving that task to a separate, smaller script. That way, when one script breaks or needs to be updated, it will not affect the other.

Ingest and Processing scripts

These scripts process digital records by uploading metadata for them in our systems and moving them to our preservation storage.

  • ingest.py packages files as a SIP and optionally updates ArchivesSpace accession records by adding dates and extents.
  • We have standard transfer folders for some campus offices with designated paths for new records and log files, along with metadata about the transferring office. transferAccession.py runs ingest.py but uses the transfer metadata to create accession records and produces spreadsheet log files so offices can see what they transferred.
  • confluence.py scrapes files from our campus’s Confluence wiki system, so for offices that use Confluence all I need is access to their page to periodically transfer records.
  • convertImages.py makes derivative files. This is mostly designed for image files, such as batch converting TIFFs to JPGs or PDFs.
  • listFiles.py is very handy. All it does is create a text file that lists all filenames and paths in a SIP, which can then be easily copied into a spreadsheet (a minimal sketch of this idea appears after this list).
  • An archivist can arrange records by creating an asInventory spreadsheet that points to individual or groups of files. buildHyraxUpload.py then creates a TSV file for uploading these files to Hyrax with the relevant ArchivesSpace identifiers.
  • updateASpace.py takes the output TSV from uploading to Hyrax and updates the same inventory spreadsheets. These can then be uploaded back into ArchivesSpace which will create digital objects that point to Hyrax URLs.
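
As referenced above, here is a minimal sketch of the listFiles.py idea: walk a package and write out relative paths that can be pasted into a spreadsheet. This is not the actual script, and the paths shown are hypothetical.

```python
import os

def list_files(sip_path, output_file):
    """Write one line per file in a package, as a path relative to the package root."""
    with open(output_file, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(sip_path):
            for name in sorted(files):
                rel = os.path.relpath(os.path.join(root, name), sip_path)
                out.write(rel + "\n")

# Hypothetical package identifier and output location.
list_files(
    "/path/to/processing/ua950.009_001",
    "/path/to/processing/ua950.009_001/metadata/filelist.txt",
)
```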

SIP and AIP Package Classes

These classes are extensions of the BagIt-python library. They contain a number of methods that are used by other microservices. This lets us easily create() or load() our specific SIP or AIP packages and add files to them. They also include more complex operations, like deriving a human-readable extent and date ranges from the filesystem. My favorite feature might be clean(), which removes all Thumbs.db, desktop.ini, and .DS_Store files as the package is created.
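
As an illustration of the clean() idea, a stripped-down sketch is below. The real SIP and AIP classes wrap BagIt-python and do considerably more, so treat this as a standalone approximation of just the junk-file removal step.

```python
import os

# Operating-system cruft to strip before the bag manifests are written.
JUNK_FILES = {"Thumbs.db", "desktop.ini", ".DS_Store"}

def clean(package_path):
    """Remove known junk files anywhere in the package and return what was removed."""
    removed = []
    for root, dirs, files in os.walk(package_path):
        for name in files:
            if name in JUNK_FILES:
                path = os.path.join(root, name)
                os.remove(path)
                removed.append(path)
    return removed
```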

Example use case

  1. Wild records appear! A university staff member has placed records of the University Senate from the past year in a standard folder share used for transfers.
  2. An archivist runs transferAccession.py, which creates an ArchivesSpace accession record using some JSON in the transfer folder and technical metadata from the filesystem (modified dates and digital extents). It then packages the files using BagIt-python and places one copy in the read-only SIP directory and a working copy in a processing directory.
    • For outside acquisitions, the archivist typically downloads, exports, or images the materials and creates an accession record manually. Then, ingest.py packages these materials and adds dates and extents to the accession records when possible.
  3. The archivist makes derivative files for access or preservation. Since there is a designated derivatives directory in the processing package, the archivist can use a variety of manual tools or run other microservices using the package identifier. Scripts such as convertImages.py can batch convert or combine images and PDFs; other scripts, such as ones for processing email, are still being developed.
  4. The archivist then runs listFiles.py to get a list of file paths and copies them into an asInventory spreadsheet.
  5. The archivist arranges the files within the University Senate Records. They might create a new subseries and use that identifier in an asInventory spreadsheet to upload a list of files and then download them again to get a list of ref_ids.
  6. The archivist runs buildHyraxUpload.py to create a tab-separated values (TSV) file for uploading files to Hyrax using the description and ref_ids from the asInventory spreadsheet.
  7. After uploading the files to Hyrax, the archivist runs updateASpace.py to add the new Hyrax URLs to the same asInventory spreadsheet and uploads them back to ArchivesSpace. This creates new digital objects that point to Hyrax.

Successes and Challenges

Our set-up will always be a work in progress, and we hope to simplify, replace, or improve most of these processes over time. Since Hyrax and ArcLight have been in place for almost a year, we have noticed some aspects that are working really well and others that we still need to improve on.

I think the biggest success was customizing Hyrax to rely on description pulled from ArcLight. This has proven to be dependable and has allowed us to make significant amounts of born-digital and digitized materials available online without requiring detailed item-level metadata. Instead, we rely on high-level archival description and whatever information we can use at scale from the creator or the file system.

Suddenly we have a backlog. Since description is no longer the biggest barrier to making materials available, the holdup has been the parts of the workflow that require human intervention. Even though we are doing more with each action, large amounts of materials are still held up waiting for a human to process them. The biggest bottlenecks are working with campus offices and donors as well as arrangement and description.

There are also a ton of spreadsheets. I think this is a good thing, as we have discovered many cases where born-digital records come with some kind of existing description, but it often requires data cleaning and transformation. One collection came with authors, titles, and abstracts for each of a few thousand PDF files, but that metadata was trapped in hand-encoded HTML files from the 1990s. Spreadsheets are a really good tool for straddling the divide between the automated and manual processes required to save this kind of metadata, and they are a comfortable environment for many archivists to work in.[1]

You may have noticed that the biggest needs we have now—donor relations, arrangement and description, metadata cleanup—are roles that archivists are really good at and comfortable with. It turned out that once we had effective digital infrastructure in place, it created further demands on archivists and traditional archival processes.

This brings us to the biggest challenge we face now. Since our set-up often requires comfort on the command line, we have severely limited the number of archivists who can work on these materials and required non-archival skills to perform basic archival functions. We are trying to mitigate this in some respects by better distributing individual stages for each collection and providing more documentation. Still, this has clearly been a major flaw, as we need to meet users (in this case other archivists) where they are rather than place further demands on them.[2]


Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports the implementation and development of the department’s archival systems.

What’s Your Setup?: National Library of New Zealand Te Puna Mātauranga o Aotearoa

By Valerie Love

Introduction

The Alexander Turnbull Library holds the archives and special collections for the National Library of New Zealand Te Puna Mātauranga o Aotearoa (NLNZ). While digital materials have existed in the Turnbull Library’s collections since the 1980s, the National Library began to formalise its digital collecting and digital preservation policies in the early 2000s, and established the first Digital Archivist roles in New Zealand. In 2008, the National Library launched the National Digital Heritage Archive (NDHA), which now holds 27 million files spanning 222 different formats and totaling 311 terabytes.

Since the launch of the NDHA, there has been a marked increase in the size and complexity of incoming digital collections. Collections currently come to the Library on a combination of obsolete and contemporary media, as well as electronic transfer, such as email or File Transfer Protocol (FTP).

Digital Archivists’ workstation setup and equipment

Most staff at the National Library use either a Windows 10 Microsoft Surface Pro or HP EliteBook i5 at a docking station with two monitors to allow for flexibility in where they work. However, the Library’s Digital Archivists have specialised setups to support their work with large digital collections. The computers and workstations below are listed in order of frequency of usage. 

Computers and workstations

  1. HP Z220 i7 workstation tower

The Digital Archivists’ main device for ingesting and processing digital collections is an HP Z220 i7 workstation tower. The Z220s have a built-in read/write optical disc drive, as well as USB and FireWire ports.

  2. HP EliteBook i7

The device we use second most frequently is an HP EliteBook i7, which we use for electronic transfers of contemporary content. Our web archivists also use these for harvesting websites and running social media crawls. As there are only a handful of digital archivists in Aotearoa New Zealand, we do a significant amount of training and outreach to archives and organisations that don’t have a dedicated digital specialist on staff. Having a portable device as well as our desktop setups is extremely useful for meetings and workshops offsite.

  3. MacBook Pro (15-inch, 2017)

The Alexander Turnbull Library is a collecting institution, and we often receive creative works from authors, composers, and artists. We regularly encounter portable hard drives, floppy disks, zip disks, and even optical discs which have been formatted for a Mac operating system, and are incompatible with our corporate Windows machines. And so, MacBook Pro to the rescue! Unfortunately, the MacBook Pro only has ports for USB-C, so we keep several USB-C to USB adapters on hand. The MacBook has access to staff wifi, but is not connected to the corporate network. We’ve recently begun to investigate using HFS+ for Windows software in order to be able to see Macintosh file structures on our main ingest PCs.

  4. Digital Intelligence FRED Forensic Workstation

If we can’t read content on either our corporate machines or the MacBook Pro, then our friend FRED is our next port of call. FRED is a Forensic Recovery of Evidence Device, and includes a variety of ports and drives with write blockers built in. We have a 5.25 inch floppy disk drive attached to the FRED, and also use it to mount internal hard drives removed from donated computers and laptops. We don’t create disk images by default on our other workstations, but if a collection is tricky enough to merit the FRED, we will create disk images for it, generally using FTK Imager. The FRED has its own isolated network connection separate from the corporate network so we can analyse high-risk materials without compromising the Library’s security.

  5. Standalone PC

Adjacent to the FRED, we have an additional non-networked PC (also an HP Z220 i7 workstation tower) where we can analyse materials, download software, test scripts, and generally experiment separate from the corporate network. This machine is currently still operating under a Windows 7 build, as some of the drivers we use with legacy media carriers were not compatible with Windows 10 during the initial testing and rollout of Windows 10 devices to Library staff.

  6. A ragtag bunch of computer misfits

[link to https://natlib.govt.nz/blog/posts/a-ragtag-bunch-of-computer-misfits]

Over the years, the Library has collected vintage computers with a variety of hardware and software capabilities, and each machine offers different applications and tools to help us process and research legacy digital collections. We are also sometimes gifted computers from donors to support the processing of their legacy files; these machines allow us to see exactly what software and programmes the donors used, as well as their file systems.

  7. KryoFlux (located at Archives New Zealand)

And for the really tricky legacy media, we are fortunate to be able to call on our colleagues down the road at Archives New Zealand Te Rua Mahara o te Kāwanatanga, who have a KryoFlux set up in their digital preservation lab to read 3.5 inch and 5.25 inch floppy disks. We recently went over there to try to image a set of double-sided, double-density, 3.5 inch Macintosh floppy disks from 1986-1989 that we had been unable to read on our legacy Power Macintosh 7300/180. We were able to create disk image files for them using the KryoFlux, but unfortunately, the disks contained bad sectors, so we weren’t able to render the files from them.

Drives and accessories

In addition to our hardware and workstation setup, we use a variety of drives and accessories to aid in processing of born-digital materials.

  1. Tableau Forensic USB 3.0 Bridge write blocker
  2. 3.5 inch floppy drive
  3. 5.25 inch floppy drive
  4. Optical media drive 
  5. Zip drive 
  6. Memory card readers (CompactFlash cards, Secure Digital (SD) cards, Smart Media cards)
  7. Various connectors and converters

Some of our commonly used software and other processing tools

  1. SafeMover Python script (created in-house at NLNZ to transfer and check fixity for digital collections; a generic sketch of the copy-and-verify idea appears after this list)
  2. DROID file profiling tool
  3. Karen’s Directory Printer
  4. Free Commander/Double Commander
  5. File List Creator
  6. FTK Imager
  7. OSF Mount
  8. IrfanView
  9. Hex Editor Neo
  10. Duplicate Cleaner
  11. ePADD
  12. HFS+ for Windows
  13. System Centre Endpoint Protection


Valerie Love is the Senior Digital Archivist Kaipupuri Pūranga Matihiko Matua at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa.

What’s Your Set-up? Born-Digital Processing at NC State University Libraries

by Brian Dietz


Background

Until January 2018, the NC State University Libraries did our born-digital processing using the BitCurator VM running on a Windows 7 machine. The BCVM bootstrapped our operations, and much of what I think we’ve accomplished over the last several years would not have been possible without this setup. Two years ago, we shifted our workflows to run mostly at the command line on a Mac computer. The desire to move to the CLI meant a need for a *nix environment. Cygwin for Windows was not a realistic option, and the Linux subsystem, available on Windows 10, had not yet been released. A dedicated Linux computer wasn’t ideal due to IT support, I no longer wanted to manage virtual machine distributions, and a dual-boot machine seemed too inefficient. Also, of the three major operating systems, I’m most familiar and comfortable with macOS, which is UNIX under the hood and certified as such. Additionally, Homebrew, a package manager for the Mac, made installing and updating the programs we needed, as well as their dependencies, relatively simple. In addition to Homebrew, we use pip to update brunnhilde, and freshclam, included in ClamAV, to keep the virus database up to date. HFS Explorer, necessary for exploring Mac-formatted disks, is a manual install and update, but it might be the main pain point (and not too painful yet). With the exception of HFS Explorer, updating is done at the time of processing, so the environment is always fresh.

Current workstation

We currently have one workstation where we process born-digital materials. We do our work on a Mac Pro:

  • macOS 10.13 High Sierra
  • 3.7 GHz processor
  • 32GB memory
  • 1TB hard drive
  • 5TB NFS-mounted networked storage
  • 5TB Western Digital external drive

We have a number of peripherals:

  • 2 consumer grade Blu-ray optical drives (LG and Samsung)
  • 2 iomega USB-powered ZIP drives (100MB and 250MB)
  • Several 3.5” floppy drives (salvaged from surplused computers), but our go-to is a Sony 83 track drive (model MPF920)
  • One TEAC 5.25” floppy drive (salvaged from a local scrap exchange)
  • KryoFlux board with power supply and ribbon cable with various connectors
  • Wiebetech USB and Forensic UltraPack v4 write blockers
  • Apple iPod (for taking pictures of media, usually transferred via AirDrop)

The tools that we use for exploration/appraisal, extraction, and reporting are largely command-line tools (a short scripted example of chaining a couple of them follows these lists):

Exploration

  • diskutil (finding where a volume is mounted)
  • gls (finding volume name, where the GNU version shows escapes (“\”) in print outs)
  • hdiutil (mounting disk image files)
  • mmls (finding partition layout of disk images)
  • drutil status (showing information about optical media)

Packaging

  • tar (packaging content from media not being imaged)
  • ddrescue (disk imaging)
  • cdparanoia (packaging content from audio discs)
  • KryoFlux GUI (floppy imaging)

Reporting

  • brunnhilde (file and disk image profiling, duplication)
  • bulk_extractor (PII scanning)
  • clamav (virus scanning)
  • Exiftool (metadata)
  • Mediainfo (metadata)
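
As noted above, here is a short scripted example of chaining two of these command-line tools from Python: GNU ddrescue for imaging and ClamAV's clamscan for virus scanning. The device node and working directory are hypothetical, this is not our actual workflow code, and it assumes both tools are installed and on the PATH.

```python
import subprocess
from pathlib import Path

device = "/dev/disk3"                 # hypothetical device node for the attached media
work = Path("/tmp/accession-001")     # hypothetical working directory for this session
work.mkdir(parents=True, exist_ok=True)

image = work / "disk.img"
mapfile = work / "disk.map"

# Image the media with GNU ddrescue (infile, outfile, mapfile).
subprocess.run(["ddrescue", device, str(image), str(mapfile)], check=True)

# Virus-scan the working directory recursively with ClamAV, logging results.
subprocess.run(["clamscan", "-r", f"--log={work / 'clamscan.log'}", str(work)], check=True)
```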

Additionally, we perform archival description using ArchivesSpace, and we’ve developed an application called DAEV (“Digital Assets of Enduring Value”) that, among other things, guides processors through a session and interacts with ArchivesSpace to record certain descriptive metadata. 

Working with IT

We have worked closely with our Libraries Information Technology department to acquire and maintain hardware and peripherals, just as we have worked closely with our Digital Library Initiatives department on the development and maintenance of DAEV. For purchasing, we submit larger requests, with justifications, to IT annually, and smaller requests as needs arise, e.g., our ZIP drive broke and we need a new one. Our computer is on the refresh cycle, meaning once it reaches a certain age, it will be replaced with a comparable computer. Especially with peripherals, we provide exact technical specifications and anticipated costs, e.g., iomega 250MB ZIP drive, and IT determines the purchasing process.

I think it’s easy to assume that, because people in IT are among our most knowledgeable colleagues about computing technology, they understand what it is we’re trying to do and what we’ll need to do it. While they are capable of understanding our needs, their specializations lie elsewhere, and it’s a bad assumption that can result in a less-than-ideal computing situation. My experience is that my coworkers in IT are eager to understand our problems and to help us solve them, but they really don’t know what our problems are.

The counter assumption is that we ourselves are supposed to know everything about computing. That’s probably more counterproductive than assuming IT knows everything, because 1) we feel bad when we don’t know everything and 2) in trying to hide what we don’t know, we end up not getting what we need. I think the ideal situation is for us to know what processes we need to run (and why), and to share those with IT, who should be able to say what sort of processor and how much RAM is needed. If your institution has a division of labor, i.e., specializations, take advantage of it.

So, rather than saying, “we need a computer to support digital archiving,” or “I need a computer with exactly these specs,” we’ll be better off requesting a consultation and explaining what sort of work we need a computer to support. Of course, the first computer we requested for a born-digital workstation, which was intended to support a large initiative, came at a late hour and was in the form of “We need a computer to support digital archiving,” with the additional assumption of “I thought you knew this was happening.” We got a pretty decent Windows 7 computer that worked well enough.

I also recognize that I may be describing a situation that does not exist in many other institutions. In those cases, perhaps that’s something to be worked toward, through personal and inter-departmental relationship building. At any rate, I recognize and am grateful for the support my institution has extended to my work.

Challenges and opportunities

I’ve got two challenges coming up. Campus IT has required that all Macs be upgraded to macOS Mojave to “meet device security requirements.” From a security perspective, I’m all on board for this. However, in our testing the KryoFlux is not compatible with Mojave. This appears to be related to a security measure Mojave has in place for controlling USB communication. After several conversations with Libraries IT, they’ve recommended assigning us a Windows 10 computer for use with the KryoFlux. Beyond the inconvenience of having two computers, I see obvious benefits to this. One is that I’ll be able to install the Linux subsystem on Windows 10 and explore whether going full-out Linux might be an option for us. Another is that I’ll have ready access to FTK Imager again, which comes in handy from time to time.

The other challenge we have is working with our optical drives. We have consumer-grade drives, and they work inconsistently. While Drive 1 may read Disc X but not Disc Y, Drive 2 will do the reverse. At the 2019 BitCurator Users Forum, Kam Woods discussed higher-grade optical drives in the “There Are No Dumb Questions” session. (By the way, everyone should consider attending the Forum. It’s a great meeting that’s heavily focused on practice, and it gets better each year. This year, the Forum will be hosted by Arizona State University, October 12-13. The call for proposals will be coming out in early March.)

In the coming months we’ll be making some significant changes to our workflow, which will include tweaking a few things, reordering some steps, introducing new tools (e.g., walk_to_dfxml and Bulk Reviewer), and, I hope, introducing more automation into the process. We’re also due for a computer refresh, and, while we’re sticking with Macs for the time being, we’ll again work with our IT to review computer specifications.


Brian Dietz is the Digital Program Librarian for Special Collections at NC State University Libraries, where he manages born-digital processing, web archiving, and digitization.

Recap: Emulation in the Archives Workshop – UVA, July 18, 2019

By Brenna Edwards

The Emulation in the Archives workshop took place at the University of Virginia (UVA) July 18, 2019, as part of Software Preservation Network’s Fostering a Community of Practice grant cohort. This one-day workshop explored various aspects of emulation in archives, from the legal challenges to access, and included an overview of what UVA is currently doing in this area. The workshop featured talks from people across departments at UVA, as well as people from the Library of Congress. In addition to the talks, there was also a chance to sign up for wireframe testing for UVA’s current access methods for emulated material in their collections. This process was optional, but people could also sign up for distance testing after the workshop if they preferred. 

The day was split into four different parts: an introduction to software preservation and emulation, including legal information; an overview of UVA’s current work in emulation; a look into the metadata for emulations and video game preservation; and considerations for access and user experience. Breaking up the day into these chunks defined a flow for the day, walking through the steps and considerations needed to emulate software and born digital materials. It also helped contain these topics, though of course certain themes and aspects kept appearing throughout the day in other presentations. 

The first portion of the day covered an introduction to software preservation and emulation, and the legal landscape. After explaining more of what Software Preservation Network’s Fostering a Community of Practice grant is, Lauren Work provided some definitions of emulation, software, and curatorial for use throughout the day. 

  • Emulation: digital technique that allows new computers to run legacy systems so older software appears the way it was originally designed
  • Software: source code, executables, applications, and other related components that are sets of instructions for computers
  • Curatorial: responsibility and practice of appraising, acquiring, describing

Work then talked more about the Peter Sheeran papers, a collection from an architectural firm based in Charlottesville and the main collection for this project. As a hybrid collection, it included Computer Aided Design (CAD) files and Building Information Modeling (BIM) software, which posed the question of what to do with them. The answer? Emulation! Since CAD/BIM files are very dependent on which versions of the software and files are being used, UVA first did an inventory of what they had, down to license keys and how compatible each item was with other software. To do this, they used the FCOP Collections Inventory Exercise to help guide them through what they needed to consider. They also looked at what potential troubleshooting and legal issues they might run into. This led nicely into the next presentation, all about the legal landscape for software preservation, presented by Brandon Butler of UVA. Butler talked about copyright and The Copyright Permissions Culture in Software Preservation and Its Implications for the Cultural Record report done by ARL, as well as the idea of fair use, which is often an underutilized solution. He also talked about digital rights management, and how groups like SPN are bringing people together to ask questions that haven’t been asked before and working to get exemptions granted every three years to help seek permission to crack locks. Overall, he said that you should generally be fine legally, but to do your research just to be on the safe side.

This was followed by an overview of what UVA is currently doing. After reiterating “Access is everything” to the room, Michael Durbin demonstrated the current working pieces of their emulation system using Archivematica, Apollo, and a custom Curio display interface. He also demonstrated some of the EaaSI platform (which now has a sandbox available!), showing VectorWorks files and how they might be used. Durbin then explained how UVA, in their transition to ArchivesSpace, plans to use the Digital Object function to link to the external emulation, as well as display the metadata that goes along with it. UVA is also taking into consideration description that can’t yet be stored in any of UVA’s systems and how they might incorporate Wikidata in the future. Next, Lauren Work and Elizabeth Wilkinson talked about the curation workflows for software at UVA, which included a revamped Deed of Gift, as well as additional checklists and questionnaires. Their main advice was to talk with donors early, early, early to get all the information you can and to work with the donor to help make preservation and access decisions, though they also acknowledged this is not always possible. Work and Wilkinson are still working on integrating these steps into the curation workflow at UVA, and also plan to start working more on their appraisal and processing workflows. Have thoughts on the checklist and questionnaire? Feel free to comment on their documents and make suggestions!

After lunch, we got more into the technical side of things and talked about metadata! Elizabeth Wilkinson and Jeremy Bartczak presented on how UVA is handling archival metadata for software, including questions of how much information is enough, and whether ArchivesSpace could accommodate that amount of description. While heavily influenced by the University of California Guidelines for Born-Digital Archival Description, they also consulted the Software Preservation Network Emulation as a Service Infrastructure Metadata Model. The result? UVA Archival Description Strategies for Emulated Software, which presents two different approaches to describing software, and UVA MARC Field Look-up for Software Description in ArchivesSpace, which has suggestions on where to put the description in ArchivesSpace. To find out information about the software, they suggested using Google, WorldCat, and Wikidata (for which Yale has created a guide).

The second portion of this block was about description and preservation of video games, presented by Laura Drake Davis and David Gibson of the Library of Congress. The LOC has been collecting video games since they were introduced, the first being Pac-Man. The copyright registry requires a description of the item and some sort of visual documentation or representation of game play (a video, source code, etc.). The LOC keeps the original packaging for the game if possible, and they also collect strategy guides and periodicals related to video games. They also take source code; the first and last 25 pages of source code are required to be printed out and sent as documentation. Right now, they are reworking their workflows for processing, cataloging, and describing video games, working on relationships with game developers and distributors and with the LC General Counsel Office to assess risks associated with providing access to actual games, and looking into ways to emulate the games themselves.

The final part of the day was all about access and user experience. First, Lauren Work and Elizabeth Wilkinson talked about how UVA is considering user access to emulated environments. As of now, they plan to have reading room access only, taking into consideration the staff training required to do this and the computer station requirements. They are also considering what is important about access via emulated environments, a topic discussed at the Architecture, Design, and Engineering Summit at the Library of Congress in 2017. Currently, they are doing wireframe testing with ArchivesSpace to see how users navigate through ArchivesSpace, as well as what types of information are needed for researchers, such as troubleshooting tips, links to related collections, instructions or a note about what to expect within the emulated environment, and how to cite the emulation.

The final talk of the day was by Julia Kim of the Library of Congress. Kim talked about her study on user experience with born-digital materials at NYU from 2014 to 2015, and compared it to Tim Walsh’s 2017 survey on the same topic at the Canadian Centre for Architecture. Kim found that there is a very fine line between researcher responsibilities and digital archivist responsibilities, that users got frustrated with the slowness of the emulations, and that there is a learning curve. Overall, Kim found that it’s only somewhat worth it to do emulations, but she thinks the EaaSI project will help with this, along with a lot of outreach and education on what these materials are and how to use them effectively.

Overall, I found the workshop to be highly informative, and I feel more confident considering emulation for future projects. The use of shared community notes helped everyone ask for clarification without disrupting the presenters and allowed questions to be typed out and asked at the end. It’s also been helpful to look back on these notes, as slides and links to resources have been added by both presenters and attendees. It’s nice that there is a cohort of people out there working on this and willing to share resources and talk as needed! If you’d like to learn more about the workshop, you can visit their website, where the community notes, presentations, and Twitter stream are also available.


Brenna Edwards is currently Project Digital Archivist at the Stuart A. Rose Library at Emory University, Atlanta, GA. Her main responsibility is imaging and processing born digital materials, while also researching the best tools and practices to make them available. 

An Exploration of BitCurator NLP: Incorporating New Tools for Born-Digital Collections

by Morgan Goodman

Natural Language Processing (NLP) has been a buzz-worthy topic for professionals working with born-digital material over the last few years. The BitCurator Project recently released a new set of natural language processing tools, and I had the opportunity to test out the topic modeler, bitcurator-nlp-gentm, with a group of archivists in the Raleigh-Durham area. I was interested in exploring how NLP might assist archivists to more effectively and efficiently perform their everyday duties. While my goal was to explore possible applications of topic modeling in archival appraisal specifically, the discussions surrounding other possible uses were enlightening. The resulting research informed my 2019 Master’s paper for the University of North Carolina at Chapel Hill.

Topic modeling extracts text from files and organizes the tokenized words into topics. Imagine a set of words such as: mask, october, horror, michael, myers. Based on this grouping of words, you might be able to determine that somewhere across the corpus there is a file about one of the Halloween franchise horror films. When I met with the archivists, I had them run the program with disk images from their own collections, and we discussed the visualization output and whether or not they were able to easily analyze and determine the nature of the topics presented.

BitCurator utilizes open-source tools in its applications and chose the pyLDAvis visualization for the final output of its topic modeling tool (more information about the algorithm and how it works can be found by reading Sievert and Shirley’s paper; you can also play around with the output through this Jupyter notebook). The left-side view of the visualization has topic circles displayed in relative sizes and plotted on a two-dimensional plane. Each topic is labeled with a number in decreasing order of prevalence (circle #1 is the main topic in the overall corpus, and is also the largest circle). The space between topics is determined by how related the topics are, i.e., topics that are less related are plotted further away from each other. The right-side view contains a list of 30 words with a blue bar indicating each term’s frequency across the corpus. Clicking on a topic circle will alter the view of the terms list by adding a red bar for each term, showing the term’s frequency in that particular topic in relation to the overall corpus.
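
To make the pipeline concrete, here is a minimal sketch of producing a pyLDAvis view from already-tokenized text with gensim. It is not the bitcurator-nlp-gentm code itself, the toy corpus is invented, and the pyLDAvis submodule name (gensim_models vs. gensim) varies by version.

```python
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # older pyLDAvis releases use pyLDAvis.gensim

# Tiny illustrative corpus; a real run would use text extracted from files on a disk image.
docs = [
    ["mask", "october", "horror", "michael", "myers"],
    ["budget", "meeting", "minutes", "october", "agenda"],
    ["horror", "film", "sequel", "knife", "haddonfield"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit a small LDA model; num_topics and passes are illustrative.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)

# Build the interactive pyLDAvis view and save it as a standalone HTML page.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "topics.html")
```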

[Figure: pyLDAvis visualization, with topic circles plotted on the left and term-frequency bars on the right]

The user can then manipulate a metric slider that is meant to help decipher what the topic is about. Essentially, when the slider is all the way to the right at “1”, the most prevalent terms for the entire corpus are listed. When a topic is selected and the slider is at 1, it shows all the prevalent terms for the corpus in relation to that particular topic (in our Halloween example, you might see more general words like: movie, plot, character). Alternatively, the closer the slider moves to “0”, the fewer corpus-wide terms appear and the more topic-specific terms are displayed (i.e.: knife, haddonfield, strode).

While the NLP does the hard work of scanning and extracting text from the files, some analysis is still required by the user. The tool’s output offers archivists a bird’s-eye view of the collection, which can be helpful when little to nothing is known about its contents. However, many of the archivists I spoke to felt this tool is most effective when you already know a bit about the collection you are looking at. In that sense, it may be beneficial to allow researchers to use topic modeling in the reading room to explore a large collection. Researchers and others with subject matter expertise may get the most benefit from this tool – do you have to know about the Halloween movie franchise to know that Michael Myers is a fictional horror film character? Probably. Now imagine more complex topics that the archivists may not have working knowledge of. The archivist can point the researcher to the right collection and let them do the analysis. This tool may also help with description or with identifying duplication across a collection (which seems to be a common problem for people working with born-digital collections).

The next step to getting NLP tools like this off the ground is to implement training. The information retrieval and ranking methods that create the output may not be widely understood. To unlock the value within an NLP tool, users must know how it works, how to run it, and how to perform meaningful analysis. Training archivists in the reading room to assist researchers would be an excellent way to get tools like this out of the think tank and into the real world.


Morgan Goodman is a 2019 graduate of the University of North Carolina at Chapel Hill and currently resides in Denver, Colorado. She holds an MS in Information Science with a specialization in Archives and Records Management.


Using R to Migrate Box and Folder Lists into EAD

by Andy Meyer

Introduction

This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hopes that other institutions might find this approach helpful and that it could be generalized to other issues facing archives.

I decided to use the programming language R because it is a free and open-source programming language that I had some prior experience using. R has a large and active user community as well as a large number of relevant packages that extend its basic functions, including libraries that can deal with Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.

The specific task that sparked this script was inheriting hundreds of finding aids with minimal collection-level information and very long and detailed box and folder lists. These were all Microsoft Word documents with the box and folder list formatted as a table within the Word document. We recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate the work. The code is generally organized into three sections: data export, transforming and cleaning the data, and finally, creating an EAD file to load into ArchivesSpace.

Data Export

After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.

This process was surprisingly straightforward; I created a variable with a path to a Word Document and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of that table into a data.frame in R. Sometimes our finding aids were inconsistent so I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step of the process is four columns that contain folder title, date, box number, and folder number.

This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.

Data Transformation and Cleaning

Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:

  1. Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
  2. Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”, “N.D.”, “no date”, “N/A”, “NDG”, “Various”, “N. D.”, “”, “??”, “n. d.”, “n. d. ”, “No date”, “-”, “N.A.”, “ND”, “NO DATE”, and “Unknown.” For all of these, I updated the date field to “Undated” as a way to standardize this field.
  3. Spelling out abbreviations. Occasionally, I would use regular expressions to spell out words in the title field. This could be standard terms like “Corresp” to “Correspondence” or local terms like “NPU” to “North Park University.”

R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.

Create EAD to Load into ArchivesSpace

Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.

Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD schema in R using the “newXMLNode” functions from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes, which were applied to the data exported and transformed in earlier steps. Lastly, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
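
The post’s implementation is in R, but the core move (turning cleaned table rows into a skeletal set of EAD components) is easy to picture in any language. Below is an equivalent sketch in Python with lxml, using invented rows; it is deliberately skeletal and omits the EAD header and collection-level elements a real import would need.

```python
from lxml import etree

# Rows as they might come out of the cleaned table: title, date, box, folder (invented values).
rows = [
    ("Correspondence", "1985-1990", "1", "1"),
    ("Meeting minutes", "Undated", "1", "2"),
]

ead = etree.Element("ead")
dsc = etree.SubElement(etree.SubElement(ead, "archdesc", level="collection"), "dsc")

for title, date, box, folder in rows:
    c = etree.SubElement(dsc, "c", level="file")
    did = etree.SubElement(c, "did")
    etree.SubElement(did, "unittitle").text = title
    etree.SubElement(did, "unitdate").text = date
    etree.SubElement(did, "container", type="box").text = box
    etree.SubElement(did, "container", type="folder").text = folder

etree.ElementTree(ead).write("minimal_ead.xml", pretty_print=True,
                             xml_declaration=True, encoding="utf-8")
```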

Conclusion

Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.


Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.

More skills, less pain with Library Carpentry

By Jeffrey C. Oliver, Ph.D

This is the second post in the bloggERS Making Tech Skills a Strategic Priority series.

Remember that scene in The Matrix where Neo wakes and says “I know kung fu”? Library Carpentry is like that. Almost. Do you need to search lots of files for pieces of text and tire of using Ctrl-F? In the UNIX shell lesson you’ll learn to automate tasks and rapidly extract data from files. Are you managing datasets with not-quite-standardized data fields and formats? In the OpenRefine lesson you’ll easily wrangle data into standard formats for easier processing and de-duplication. There are also Library Carpentry lessons for Python (a popular scripting programming language), Git (a powerful version control system), SQL (a commonly used relational database interface), and many more.

But let me back up a bit.

Library Carpentry is part of the Carpentries, an organization designed to provide training to scientists, researchers, and information professionals on the computational skills necessary for work in this age of big data.

The goals of Library Carpentry align with this series’ initial call for contributions, providing resources for those in data- or information-related fields to work “more with a shovel than with a tweezers.” Library Carpentry workshops are primarily hands-on experiences with tools to make work more efficient and less prone to mistakes when performing repeated tasks.

One of the greatest parts of a Library Carpentry workshop is that it begins at the beginning. That is, the first lesson is an Introduction to Data, a structured discussion and exercise session that breaks down jargon (“What is a version control system?”) and sets down some best practices (naming things is hard).

Not only are the lessons designed for those working in library and information professions, but they’re also designed by “in the trenches” folks who are dealing with these data and information challenges daily. As part of the Mozilla Global Sprint, Library Carpentry ran a two-day hackathon in May 2018 where lessons were developed, revised, remixed, and made pretty darn shiny by contributors at ten different sites. For some, the hackathon itself was an opportunity to learn how to use GitHub as a collaboration tool.

Furthermore, Library Carpentry workshops are led by librarians, like the most recent workshop at the University of Arizona, where lessons were taught by our Digital Scholarship Librarian, our Geospatial Specialist, our Liaison Librarian to Anthropology (among other domains), and our Research Data Management Specialist.

Now, a Library Carpentry workshop won’t make you an expert in Python or the UNIX command line in two days. Even Neo had to practice his kung fu a bit. But workshops are designed to be inclusive and accessible, myth-busting, and – I’ll say it – fun. Don’t take my word for it, here’s a sampling of comments from our most recent workshop:

  • Loved the hands-on practice on regular expressions
  • Really great lesson – I liked the challenging exercises, they were fun! It made SQL feel fun instead of scary
  • Feels very powerful to be able to navigate files this way, quickly & in bulk.

So regardless of how you work with data, Library Carpentry has something to offer. If you’d like to host a Library Carpentry workshop, you can use our request a workshop form. You can also connect to Library Carpentry through social media, the web, or good old fashioned e-mail. And since you’re probably working with data already, you have something to offer Library Carpentry. This whole endeavor runs on the multi-faceted contributions of the community, so join us, we have cookies. And APIs. And a web scraping lesson. The terrible puns are just a bonus.

The Archivist’s Guide to KryoFlux

by [Matthew] Farrell and Shira Peltzman

As cultural icons go, the floppy disk continues to persist in the contemporary information technology landscape. Though digital storage has moved beyond the 80 KB – 1.44 MB storage capacity of the floppy disk, its image is often shorthand for the concept of saving one’s work (to wit: Microsoft Word 2016 still uses an icon of a 3.5″ floppy disk to indicate save in its user interface). Likewise, floppy disks make up a sizable portion of many archival collections, in number of objects if not storage footprint. If a creator of personal papers or institutional records maintained their work in electronic form in the 1980s or 1990s, chances are high that these are stored on floppy disks. But the persistent image of the ubiquitous floppy disk conceals a long list of challenges that come into play as archivists attempt to capture their data.

For starters, we often grossly underestimate the extent to which the technology was in active development during its heyday. One would be forgiven for assuming that there existed only a small number of floppy disk formats: namely 5.25″ and 3.5″, plus their 8″ forebears. But within each of these sizes there existed myriad variations of density and encoding, all of which complicate the archivist’s task now that these disks have entered our stacks. This is to say nothing of the hardware: 8″ and 5.25″ drives and standard controller boards are no longer made, and the only 3.5″ drive currently manufactured is a USB-enabled device capable only of reading disks that use the more recent encoding methods and store file systems compatible with the host computer. And, of course, none of the above accounts for media stability over time for obsolete carriers.

Enter KryoFlux, a floppy disk controller board first made available in 2009. KryoFlux is very powerful, allowing users of contemporary Windows, Mac, and Linux machines to interface with legacy floppy drives via a USB port. The KryoFlux does not attempt to mount a floppy disk’s file system to the host computer, granting two chief affordances: users can acquire data (a) independent of their host computer’s file system, and (b) without necessarily knowing the particulars of the disk in question. The latter is particularly useful when attempting to analyze less stable media.

Despite the powerful utility of KryoFlux, uptake among archives and digital preservation programs has been hampered by a lack of accessible documentation and training resources. The official documentation and user forums assume a level of technical knowledge largely absent from traditional archival training. Following several informal conversations at Stanford University’s Born-Digital Archives eXchange events in 2015 and 2016, as well as discussions at various events hosted by the BitCurator Consortium, we formed a working group that included archivists and archival studies students from Emory University, the University of California Los Angeles, Yale University, Duke University, and the University of Texas at Austin to create user-friendly documentation aimed specifically at archivists.

Development of The Archivist’s Guide to KryoFlux began in 2016, with a draft released on Google Docs in spring 2017. The working group invited feedback over a six-month comment period and was gratified to receive a wide range of comments and questions from the community. Informed by this incredible feedback, a revised version of the Guide is now hosted on GitHub and available for anyone to use, though the use cases described are generally those encountered by archivists working with born-digital collections in institutional and manuscript repositories.

The Guide is written in two parts. “Part One: Getting Started” provides practical guidance on how to set-up and begin using the KryoFlux and aims to be as inclusive and user-friendly as possible. It includes instructions for running KryoFlux using both Mac and Windows operating systems. Instructions for running KryoFlux using Linux are also provided, allowing repositories that use BitCurator (an Ubuntu-based open-source suite of digital archives tools) to incorporate the KryoFlux into their workflows.

“Part Two: In Depth” examines KryoFlux features and floppy disk technology in more detail. This section introduces the variety of floppy disk encoding formats and provides guidance as to how KryoFlux users can identify them. Readers can also find information about working with 40-track floppy disks. Part Two covers KryoFlux-specific output too, including log files and KryoFlux stream files, and suggests ways in which archivists might make use of these files to support digital preservation best practices. Short case studies documenting the experiences of archivists at other institutions are also included here, providing real-life examples of KryoFlux in action.

As with any technology, the KryoFlux hardware and software will undergo updates and changes in the future which will, if we are not careful, affect the currency of the Guide. To address this possibility, the working group has chosen to host the Guide as a public GitHub repository. This platform supports versioning and allows for easy collaboration between members of the working group. Perhaps most importantly, GitHub supports community-driven contributions, including revisions, corrections, and updates. We have established a process for soliciting and reviewing additional contributions and corrections (short answer: submit a pull request via GitHub!), and will annually review the membership of an ongoing working group responsible for monitoring this work, to ensure that the Guide remains actively maintained for as long as humanly possible.


On this year’s World Digital Preservation Day, the Digital Preservation Coalition presented The Archivist’s Guide to KryoFlux with the 2018 Digital Preservation Award for Teaching and Communications. It was truly an honor to be recognized alongside the other very worthy finalists, and a cherry on top for what we hope will remain a valuable resource for years to come.


[Matthew] Farrell is the Digital Records Archivist in Duke University’s David M. Rubenstein Rare Book & Manuscript Library. Farrell holds an MLS from the University of North Carolina at Chapel Hill.


Shira Peltzman is the Digital Archivist for the UCLA Library where she leads a preservation program for Library Special Collections’ born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts, and was a member of the inaugural cohort of the National Digital Stewardship Residency in New York (NDSR-NY).

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.

The Digital Processing Framework’s target audience is folks who process born-digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born-digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers: Baseline, which we recommend as the minimum level of processing for born-digital content; Moderate, which includes tasks that may be done on collections, or parts of collections, considered to have higher value, risk, or access needs; and Intensive, which includes tasks that should only be done for collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the additional work that could be done to support more nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, it should also be done as part of any higher tier’s work.
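
For illustration only, here is one way the tier logic might be modeled in code. The activity and task names come from the example above, but the tier assigned to each task and the representation itself are hypothetical, not part of the framework.

    TIERS = ["Baseline", "Moderate", "Intensive"]

    # Hypothetical representation of one activity; the tier assigned to each task
    # below is invented for illustration and is not taken from the framework.
    activity = {
        "name": "Establish physical control over removable media",
        "tasks": {
            "Assign a unique identifier to each piece of digital media": "Baseline",
            "Create suitable housing for digital media": "Moderate",
        },
    }

    def tasks_for_tier(activity, tier):
        """Return the tasks to perform at a tier, including all lower-tier tasks."""
        allowed = TIERS[: TIERS.index(tier) + 1]
        return [task for task, t in activity["tasks"].items() if t in allowed]

    print(tasks_for_tier(activity, "Moderate"))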

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.
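
A minimal sketch of this step, assuming ArchivesSnake’s ASnakeClient and a placeholder base URL (the actual script reads the instance address from its config file):

    # Minimal sketch of the login step using ArchivesSnake's ASnakeClient.
    # The base URL below is a placeholder; the actual script reads the
    # instance address from a config file.
    from getpass import getpass
    from asnake.client import ASnakeClient

    username = input("ArchivesSpace username: ")
    password = getpass("ArchivesSpace password: ")

    client = ASnakeClient(
        baseurl="http://aspace.example.edu:8089",  # placeholder address
        username=username,
        password=password,
    )
    client.authorize()  # starts an authenticated ArchivesSpace session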

Step 2: Get refID and number to start appending to file

The script then starts a while loop and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The filename for each clip is the original refID, followed by an underscore, followed by a number; asking for the starting number allows more clips to be made from the same original file when the script is run again later.
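
A hedged sketch of that prompt loop follows; the prompts and variable names are illustrative rather than the script’s own, and the same loop also covers the continue/quit branching described in Step 6.

    # Illustrative sketch of the prompt loop; prompts and variable names are
    # not taken from the actual script.
    ref_id = None
    clip_number = 1

    while True:
        answer = input("New refID? y/n/q ").strip().lower()
        if answer in ("q", "quit"):
            break
        if answer in ("y", "yes") or ref_id is None:
            ref_id = input("ArchivesSpace refID: ").strip()
            clip_number = int(input("Number to start appending to clips: "))
        # clip filenames are the refID, an underscore, and an incrementing number
        clip_filename = "{}_{}.mp4".format(ref_id, clip_number)
        clip_number += 1
        # ... clip creation and metadata gathering happen here (Steps 3-5) ...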

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
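
As a rough sketch of this step, ffprobe can report the source file’s duration and ffmpeg can cut the clip using stream copying. The filenames, start time, and clip length below are placeholders, and the os calls stand in for however the script actually invokes these commands.

    # Hedged sketch of the duration check and clip creation; filenames, start
    # time, and clip length are placeholders.
    import os

    source = "refid.mp4"   # access copy named with its ArchivesSpace refID
    clip = "refid_1.mp4"   # clip filename: refID, underscore, number

    # ffprobe prints the duration of the source file in seconds
    duration = float(
        os.popen(
            "ffprobe -v error -show_entries format=duration "
            "-of default=noprint_wrappers=1:nokey=1 " + source
        ).read()
    )
    ask_for_hours = duration >= 3600  # only prompt for hours on long files

    # cut a 45-second clip starting at 00:01:30, copying streams rather than
    # re-encoding
    os.system("ffmpeg -ss 00:01:30 -i {} -t 45 -c copy {}".format(source, clip))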

Clip from Rockefeller Family at Pocantico – Part I, circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
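
A sketch of that lookup using the authorized client from Step 1; the repository ID and the specific fields pulled out here are assumptions for illustration.

    # Sketch of the find_by_id lookup; the repository ID (2) and the fields
    # extracted below are assumptions.
    results = client.get(
        "repositories/2/find_by_id/archival_objects",
        params={"ref_id[]": ref_id},
    ).json()

    # find_by_id returns the URIs of matching archival objects
    item_uri = results["archival_objects"][0]["ref"]
    item = client.get(item_uri).json()

    title = item.get("title", "")
    dates = item.get("dates", [])
    notes = item.get("notes", [])

    # the linked resource record holds the collection-level title and identifier
    resource = client.get(item["resource"]["ref"]).json()
    collection_title = resource.get("title", "")
    collection_id = resource.get("id_0", "")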

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
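
A small sketch of the formatting and CSV write; the column layout and placeholder values are illustrative rather than the script’s actual output.

    # Illustrative sketch of formatting values and appending a row to the CSV;
    # the placeholder values stand in for the metadata gathered in Step 4.
    import csv

    def clean(text):
        """Collapse line breaks so multi-line notes stay in one spreadsheet cell."""
        return " ".join(str(text).split())

    row = [
        "refid_1.mp4",                                     # clip filename
        clean("Rockefeller Family at Pocantico\nPart I"),  # item title
        "circa 1920",                                      # item date
        clean("Scope and contents note text ..."),         # note from Step 4
        "Rockefeller Family Home Movies",                  # collection title
        "00:00:45",                                        # formatted clip duration
    ]

    with open("clip_library.csv", "a", newline="") as f:
        csv.writer(f).writerow(row)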

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.