What’s Your Set-up?: Processing Digital Records at UAlbany (part 2)

by Gregory Wiedeman


In the last post I wrote about the theoretical and technical foundations for our born-digital records set-up at UAlbany. Here, I try to show the systems we use and how they work in practice.

Systems

ArchivesSpace

ArchivesSpace manages all archival description. Accession records and top-level description for collections and file series are created directly in ArchivesSpace, while lower-level description, containers, locations, and digital objects are created using asInventory spreadsheets. Overnight, all modified published records are exported using exportPublicData.py and indexed into Solr using indexNewEAD.sh. This Solr index is read by ArcLight.
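
To make the nightly export concrete, here is a minimal sketch of what an exportPublicData.py-style job could look like using the ArchivesSnake client. The repository ID, credentials, output path, and the one-day modification window are illustrative assumptions, not UAlbany's actual script.

```python
# A hypothetical nightly export along the lines of exportPublicData.py.
# Repository ID, credentials, paths, and the one-day window are assumptions.
from datetime import datetime, timedelta
from pathlib import Path

from asnake.client import ASnakeClient  # ArchivesSnake

client = ASnakeClient(baseurl="http://localhost:8089",
                      username="admin", password="admin")
client.authorize()

REPO = 2                       # assumed repository ID
OUT = Path("/data/ead")        # assumed export directory
OUT.mkdir(parents=True, exist_ok=True)
since = datetime.utcnow() - timedelta(days=1)

for rid in client.get(f"repositories/{REPO}/resources",
                      params={"all_ids": "true"}).json():
    resource = client.get(f"repositories/{REPO}/resources/{rid}").json()
    # Only export published resources modified in the last day.
    if not resource.get("publish"):
        continue
    modified = datetime.fromisoformat(resource["system_mtime"].rstrip("Z"))
    if modified < since:
        continue
    ead = client.get(f"repositories/{REPO}/resource_descriptions/{rid}.xml",
                     params={"include_daos": "true", "numbered_cs": "true"})
    (OUT / f"{resource['id_0']}.xml").write_bytes(ead.content)
```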

ArcLight

ArcLight provides discovery and display for archival description exported from ArchivesSpace. It uses URIs from ArchivesSpace digital objects to point to digital content in Hyrax while placing that content in the context of archival description. ArcLight is also really good at systems integration because it allows any system to query it through an unauthenticated API. This allows Hyrax and other tools to easily query ArcLight for description records.
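
Because ArcLight is a Blacklight application, its search interface can be queried as JSON without authentication. Below is a hedged sketch of such a query; the hostname is a placeholder, and the exact response shape depends on the ArcLight/Blacklight version, so treat this as an assumption-laden illustration rather than a documented API contract.

```python
# A hedged sketch of querying ArcLight's public JSON search endpoint
# (standard Blacklight behavior). Hostname and fields are placeholders.
import requests

ARCLIGHT = "https://archives.example.edu"  # placeholder ArcLight URL

def search_description(query, per_page=10):
    """Run a keyword search and return the matching documents."""
    resp = requests.get(f"{ARCLIGHT}/catalog.json",
                        params={"q": query, "per_page": per_page},
                        timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # Blacklight 6 returns {"response": {"docs": [...]}}; newer versions
    # return a JSON:API-style {"data": [...]}. Handle either shape.
    return data.get("response", {}).get("docs") or data.get("data", [])

for doc in search_description("University Senate minutes"):
    print(doc.get("id"))
```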

Hyrax

Hyrax manages digital objects and item-level metadata. Some objects have detailed Dublin Core-style metadata, while other objects only have an ArchivesSpace identifier. Some custom client-side JavaScript uses this identifier to query ArcLight for more description to contextualize the object and provide links to more items. This means users can discover a file that does not have detailed metadata, such as Minutes, and Hyrax will display the Scope and Content note of the parent series along with links to more series and collection-level description.

Storage

Our preservation storage uses network shares managed by our university data center. We limit write access to the SIP and AIP storage directories to one service account used only by the server that runs the scheduled microservices. This means that only tested automated processes can create, edit, or delete SIPs and AIPs. Archivists have read-only access to these directories, which contain standard bags generated by bagit-python that are validated against BagIt Profiles. Microservices also place a copy of all SIPs in a processing directory where archivists have full access to work directly with the files. These processing packages have specific subdirectories for master files, derivatives, and metadata. This allows other microservices to be run on them with just the package identifier. So, if you need to batch create derivatives or metadata files, the microservices know which directories to look in.

The microservices themselves have built-in checks, such as making sure a valid AIP exists before deleting a SIP. The data center also has some low-level preservation features in place, and we are working to build additional preservation services that will run asynchronously from the rest of our processing workflows. This system is far from perfect, but it works for now, and at the end of the day, we are relying on the permanent positions in our department as well as in Library Systems and university IT to keep these files available long-term.
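
As an illustration of that kind of guard, here is a minimal sketch that refuses to delete a SIP unless a matching AIP exists and validates with bagit-python. The directory layout and naming convention are assumptions for the example, not our actual package classes.

```python
# A minimal sketch of the guard described above: refuse to delete a SIP
# unless a matching AIP exists and validates. Paths are assumptions.
import shutil
from pathlib import Path

import bagit

SIP_ROOT = Path("/backlog/SIPs")   # assumed storage layout
AIP_ROOT = Path("/backlog/AIPs")

def delete_sip(package_id: str) -> None:
    sip_path = SIP_ROOT / package_id
    aip_path = AIP_ROOT / package_id
    if not aip_path.is_dir():
        raise RuntimeError(f"No AIP found for {package_id}; SIP not deleted.")
    # bagit.Bag() raises an error if the AIP is not a bag; validate() checks
    # manifests, checksums, and completeness before anything is removed.
    bagit.Bag(str(aip_path)).validate()
    shutil.rmtree(sip_path)
```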

Microservices

These microservices are the glue that holds most of our workflows together. Most of the links here point to code on our GitHub page, but we’re also trying to add public information on these processes to our documentation site.

asInventory

This is a basic Python desktop app for managing lower-level description in ArchivesSpace through Excel spreadsheets using the API. Archivists can place a completed spreadsheet in a designated asInventory input directory and double-click an .exe file to add new archival objects to ArchivesSpace. A separate .exe can export all the child records from a resource or archival object identifier. The exported spreadsheets include the identifier for each archival object, container, and location, so we can easily roundtrip data from ArchivesSpace, edit it in Excel, and push the updates back into ArchivesSpace. 
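
The core of that spreadsheet-to-ArchivesSpace step can be sketched in a few lines with openpyxl and ArchivesSnake. This is a hypothetical, stripped-down illustration: the column layout, repository ID, and credentials are assumptions, and the real asInventory also handles containers, locations, and round-tripping.

```python
# A stripped-down, hypothetical version of the spreadsheet-to-ArchivesSpace
# step. Column order, repository ID, and credentials are assumptions.
from asnake.client import ASnakeClient  # ArchivesSnake
from openpyxl import load_workbook

client = ASnakeClient(baseurl="http://localhost:8089",
                      username="admin", password="admin")
client.authorize()

REPO = 2
sheet = load_workbook("inventory.xlsx").active

# Assumed columns: A = title, B = level, C = resource ID, D = parent object ID
for row in sheet.iter_rows(min_row=2, values_only=True):
    title, level, resource_id, parent_id = row[:4]
    record = {
        "title": title,
        "level": level or "file",
        "resource": {"ref": f"/repositories/{REPO}/resources/{resource_id}"},
        "publish": True,
    }
    if parent_id:
        record["parent"] = {
            "ref": f"/repositories/{REPO}/archival_objects/{parent_id}"}
    resp = client.post(f"repositories/{REPO}/archival_objects", json=record)
    print(resp.status_code, resp.json())
```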

We have since built our born-digital description workflow on top of asInventory. The spreadsheet has a “DAO” column, and asInventory will create a digital object from any URI placed there. An archivist can describe digital records in a spreadsheet while adding Hyrax URLs that link to individual files or groups of files.

We have been using asInventory for almost 3 years, and it does need some maintenance work. Shifting a lot of the code to the ArchivesSnake library will help make this easier, and I also hope to find a way to eliminate the need for a GUI framework so it runs just like a regular script.

Syncing scripts

The ArchivesSpace-ArcLight-Workflow GitHub repository is a set of scripts that keeps our systems connected and up-to-date. exportPublicData.py ensures that all published description in ArchivesSpace is exported each night, and indexNewEAD.sh indexes this description into Solr so it can be used by ArcLight. processNewUploads.py is the most complex process. This script takes all new digital objects uploaded through the Hyrax web interface, stores preservation copies as AIPs, and creates digital object records in ArchivesSpace that point to them. Part of what makes this step challenging is that Hyrax does not have an API, so the script uses Solr and a web scraper as a workaround.
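
For a sense of the Solr half of that workaround, here is a hedged sketch that asks a Hyrax Solr core for works created in the last day. The Solr URL, core name, work type, and field names follow common Hyrax/Samvera conventions but are assumptions here rather than our production configuration.

```python
# A hedged sketch of the Solr half of a processNewUploads.py-style check.
# The Solr URL, core name, work type, and field names are assumptions that
# follow common Hyrax/Samvera conventions.
import requests

SOLR = "http://localhost:8983/solr/hyrax/select"  # assumed Solr endpoint

params = {
    "q": "has_model_ssim:Dao",                     # assumed work type
    "fq": "system_create_dtsi:[NOW-1DAY TO NOW]",  # created in the last day
    "fl": "id,title_tesim,system_create_dtsi",
    "rows": 100,
    "wt": "json",
}
docs = requests.get(SOLR, params=params, timeout=30).json()["response"]["docs"]
for doc in docs:
    print(doc["id"], doc.get("title_tesim"))
```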

These scripts sound complicated, but they have been relatively stable over the past year or so. I hope we can simplify them too, by relying more on ArchivesSnake and moving some separate functions into other, smaller microservices. One example is how the ASpace export script also adds a link for each collection to our website. We can simplify this by moving that task to a separate, smaller script. That way, when one script breaks or needs to be updated, it will not affect the other.

Ingest and Processing scripts

These scripts process digital records by adding metadata for them to our systems and moving the files into our preservation storage.

  • ingest.py packages files as a SIP and optionally updates ArchivesSpace accession records by adding dates and extents.
  • We have standard transfer folders for some campus offices, with designated paths for new records and log files along with metadata about the transferring office. transferAccession.py runs ingest.py but uses the transfer metadata to create accession records and produces spreadsheet log files so offices can see what they transferred.
  • confluence.py scrapes files from our campus’s Confluence wiki system, so for offices that use Confluence all I need is access to their page to periodically transfer records.
  • convertImages.py makes derivative files. This is mostly designed for image files, such as batch converting TIFFs to JPGs or PDFs.
  • listFiles.py is very handy. All it does is create a text file listing every filename and path in a SIP, which can then be easily copied into a spreadsheet (a minimal sketch follows this list).
  • An archivist can arrange records by creating an asInventory spreadsheet that points to individual or groups of files. buildHyraxUpload.py then creates a TSV file for uploading these files to Hyrax with the relevant ArchivesSpace identifiers.
  • updateASpace.py takes the output TSV from uploading to Hyrax and updates the same inventory spreadsheets. These can then be uploaded back into ArchivesSpace, which will create digital objects that point to Hyrax URLs.
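
Here is the minimal sketch promised above for a listFiles.py-style helper: it walks a package directory and writes one relative path per line so the list can be pasted into a spreadsheet. The package path is a hypothetical example, not one of our identifiers.

```python
# A minimal sketch of a listFiles.py-style helper. The package path below
# is a hypothetical example.
from pathlib import Path

def list_files(package_path: str, output: str = "filelist.txt") -> None:
    """Write one relative file path per line for pasting into a spreadsheet."""
    package = Path(package_path)
    paths = sorted(str(p.relative_to(package))
                   for p in package.rglob("*") if p.is_file())
    Path(output).write_text("\n".join(paths), encoding="utf-8")

list_files("/backlog/processing/ua500_2020_001")  # hypothetical package ID
```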

SIP and AIP Package Classes

These classes are extensions of the bagit-python library. They contain a number of methods that are used by other microservices. This lets us easily create() or load() our specific SIP or AIP packages and add files to them. They also include more complex things, like deriving a human-readable extent and date ranges from the filesystem. My favorite feature might be clean(), which removes all Thumbs.db, desktop.ini, and .DS_Store files as the package is created.
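
A clean()-style step can be sketched in a few lines. This mirrors the behavior described above but is not our actual package class, and the junk-file list is just the three names mentioned.

```python
# A sketch of a clean()-style step that strips OS cruft files before the
# bag manifests are written. Not our actual package class.
from pathlib import Path

JUNK = {"Thumbs.db", "desktop.ini", ".DS_Store"}

def clean(package_path: str) -> int:
    """Remove known junk files and return how many were deleted."""
    removed = 0
    for path in Path(package_path).rglob("*"):
        if path.is_file() and path.name in JUNK:
            path.unlink()
            removed += 1
    return removed
```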

Example use case

  1. Wild records appear! A university staff member has placed records of the University Senate from the past year in a standard folder share used for transfers.
  2. An archivist runs transferAccession.py, which creates an ArchivesSpace accession record using some JSON in the transfer folder and technical metadata from the filesystem (modified dates and digital extents). It then packages the files using BagIt-python and places one copy in the read-only SIP directory and a working copy in a processing directory.
    • For outside acquisitions, the archivists usually manually download, export, or image the materials and create an accession record manually. Then, ingest.py packages these materials and adds dates and extents to the accession records when possible.
  3. The archivist makes derivative files for access or preservation. Since there is a designated derivatives directory in the processing package, the archivists can use a variety of manual tools or run other microservices using the package identifier. Scripts such as convertImages.py can batch convert or combine images and PDFs; other scripts, such as those for processing email, are still being developed.
  4. The archivist then runs listFiles.py to get a list of file paths and copies them into an asInventory spreadsheet.
  5. The archivist arranges the records within the University Senate Records. They might create a new subseries and use that identifier in an asInventory spreadsheet to upload a list of files and then download them again to get a list of ref_ids.
  6. The archivist runs buildHyraxUpload.py to create a tab-separated values (TSV) file for uploading files to Hyrax using the description and ref_ids from the asInventory spreadsheet.
  7. After uploading the files to Hyrax, the archivist runs updateASpace.py to add the new Hyrax URLs to the same asInventory spreadsheet and uploads them back to ArchivesSpace. This creates new digital objects that point to Hyrax.

Successes and Challenges

Our set-up will always be a work in progress, and we hope to simplify, replace, or improve most of these processes over time. Since Hyrax and ArcLight have been in place for almost a year, we have noticed some aspects that are working really well and others that we still need to improve on.

I think the biggest success was customizing Hyrax to rely on description pulled from ArcLight. This has proven to be dependable and has allowed us to make significant amounts of born-digital and digitized materials available online without requiring detailed item-level metadata. Instead, we rely on high-level archival description and whatever information we can use at scale from the creator or the file system.

Suddenly we have a backlog. Since description is no longer the biggest barrier to making materials available, the holdup has been the parts of the workflow that require human intervention. Even though we are doing more with each action, large amounts of materials are still held up waiting for a human to process them. The biggest bottlenecks are working with campus offices and donors as well as arrangement and description.

There are also a ton of spreadsheets. I think this is a good thing, as we have discovered many cases where born-digital records come with some kind of existing description, but it often requires data cleaning and transformation. One collection came with authors, titles, and abstracts for each of a few thousand PDF files, but that metadata was trapped in hand-encoded HTML files from the 1990s. Spreadsheets are a really good tool for straddling the divide between the automated and manual processes required to save this kind of metadata, and they are a comfortable environment for many archivists to work in.[1]

You may have noticed that the biggest needs we have now—donor relations, arrangement and description, metadata cleanup—are roles that archivists are really good at and comfortable with. It turned out that once we had effective digital infrastructure in place, it created further demands on archivists and traditional archival processes.

This brings us to the biggest challenge we face now. Since our set-up often requires comfort on the command line, we have severely limited the number of archivists who can work on these materials and required non-archival skills to perform basic archival functions. We are trying to mitigate this in some respects by better distributing individual stages for each collection and providing more documentation. Still, this has clearly been a major flaw, as we need to meet users (in this case other archivists) where they are rather than place further demands on them.[2]


Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports the implementation and development of the department’s archival systems.

What’s Your Setup?: National Library of New Zealand Te Puna Mātauranga o Aotearoa

By Valerie Love

Introduction

The Alexander Turnbull Library holds the archives and special collections for the National Library of New Zealand Te Puna Mātauranga o Aotearoa (NLNZ). While digital materials have existed in the Turnbull Library’s collections since the 1980s, the National Library began to formalise its digital collecting and digital preservation policies in the early 2000s, and established the first Digital Archivist roles in New Zealand. In 2008, the National Library launched the National Digital Heritage Archive (NDHA), which now holds 27 million files across 222 different formats, totalling 311 terabytes.

Since the launch of the NDHA, there has been a marked increase in the size and complexity of incoming digital collections. Collections currently come to the Library on a combination of obsolete and contemporary media, as well as electronic transfer, such as email or File Transfer Protocol (FTP).

Digital Archivists’ workstation setup and equipment

Most staff at the National Library use either a Windows 10 Microsoft Surface Pro or HP EliteBook i5 at a docking station with two monitors to allow for flexibility in where they work. However, the Library’s Digital Archivists have specialised setups to support their work with large digital collections. The computers and workstations below are listed in order of frequency of usage. 

Computers and workstations

  1. HP Z200 i7 workstation tower

The Digital Archivists’ main ingest and processing device for digital collections is an HP Z220 i7 workstation tower. The Z220s have a built-in read/write optical disc drive, as well as USB and FireWire ports.

  2. HP EliteBook i7

The device we use second most frequently is an HP EliteBook i7, which we use for electronic transfers of contemporary content. Our web archivists also use these for harvesting websites and running social media crawls. As there are only a handful of digital archivists in Aotearoa New Zealand, we do a significant amount of training and outreach to archives and organisations that don’t have a dedicated digital specialist on staff. Having a portable device as well as our desktop setups is extremely useful for meetings and workshops offsite.

  3. MacBook Pro 15-inch, 2017

The Alexander Turnbull Library is a collecting institution, and we often receive creative works from authors, composers, and artists. We regularly encounter portable hard drives, floppy disks, zip disks, and even optical discs which have been formatted for a Mac operating system, and are incompatible with our corporate Windows machines. And so, MacBook Pro to the rescue! Unfortunately, the MacBook Pro only has ports for USB-C, so we keep several USB-C to USB adapters on hand. The MacBook has access to staff wifi, but is not connected to the corporate network. We’ve recently begun to investigate using HFS+ for Windows software in order to be able to see Macintosh file structures on our main ingest PCs.

  4. Digital Intelligence FRED Forensic Workstation

If we can’t read content on either our corporate machines or the MacBook Pro, then our friend FRED is our next port of call. FRED is a Forensic Recovery of Evidence Device, and includes a variety of ports and drives with write blockers built in. We have a 5.25 inch floppy disk drive attached to the FRED, and also use it to mount internal hard drives removed from donated computers and laptops. We don’t create disk images by default on our other workstations, but if a collection is tricky enough to merit the FRED, we will create disk images for it, generally using FTK Imager. The FRED has its own isolated network connection separate from the corporate network so we can analyse high risk materials without compromising the Library’s security.

  5. Standalone PC

Adjacent to the FRED, we have an additional non-networked PC (also an HP Z200 i7 workstation tower) where we can analyse materials, download software, test scripts, and generally experiment separately from the corporate network. It is currently still operating under a Windows 7 build, as some of the drivers we use with legacy media carriers were not compatible with Windows 10 during the initial testing and rollout of Windows 10 devices to Library staff.

  6. A ragtag bunch of computer misfits

[link to https://natlib.govt.nz/blog/posts/a-ragtag-bunch-of-computer-misfits]

Over the years, the Library has collected vintage computers with a variety of hardware and software capabilities, and each machine offers different applications and tools to help us process and research legacy digital collections. We are also sometimes gifted computers from donors in order to support the processing of their legacy files and to allow us to see exactly what software and programmes they used, and their file systems.

  7. KryoFlux (located at Archives New Zealand)

And for the really tricky legacy media, we are fortunate to be able to call on our colleagues down the road at Archives New Zealand Te Rua Mahara o te Kāwanatanga, who have a KryoFlux set up in their digital preservation lab to read 3.5 inch and 5.25 inch floppy disks. We recently went over there to try to image a set of double-sided, double-density, 3.5 inch Macintosh floppy disks from 1986-1989 that we had been unable to read on our legacy Power Macintosh 7300/180. We were able to create disk image files for them using the KryoFlux, but unfortunately, the disks contained bad sectors so we weren’t able to render the files from them.

Drives and accessories

In addition to our hardware and workstation setup, we use a variety of drives and accessories to aid in processing of born-digital materials.

  1. Tableau Forensic USB 3.0 Bridge write blocker
  2. 3.5 inch floppy drive
  3. 5.25 inch floppy drive
  4. Optical media drive 
  5. Zip drive 
  6. Memory card readers (CompactFlash cards, Secure Digital (SD) cards, Smart Media cards)
  7. Various connectors and converters

Some of our commonly used software and other processing tools

  1. SafeMover Python script (created in-house at NLNZ to transfer and check fixity for digital collections)
  2. DROID file profiling tool
  3. Karen’s Directory Printer
  4. Free Commander/Double Commander
  5. File List Creator
  6. FTK Imager
  7. OSF Mount
  8. IrfanView
  9. Hex Editor Neo
  10. Duplicate Cleaner
  11. ePADD
  12. HFS+ for Windows
  13. System Centre Endpoint Protection


Valerie Love is the Senior Digital Archivist Kaipupuri Pūranga Matihiko Matua at the Alexander Turnbull Library, National Library of New Zealand Te Puna Mātauranga o Aotearoa.

What’s Your Set-up? Born-Digital Processing at NC State University Libraries

by Brian Dietz


Background

Until January 2018 the NC State University Libraries did our born-digital processing using the BitCurator VM running on a Windows 7 machine. The BCVM bootstrapped our operations, and much of what I think we’ve accomplished over the last several years would not have been possible without this set-up. Two years ago, we shifted our workflows to be run mostly at the command line on a Mac computer. The desire to move to the CLI meant a need for a *nix environment. Cygwin for Windows is not a realistic option, and the Linux subsystem, available on Windows 10, had not been released. A dedicated Linux computer wasn’t an ideal option due to IT support. I no longer wanted to manage virtual machine distributions, and a dual-boot machine seemed too inefficient. Also, of the three major operating systems, I’m most familiar and comfortable with macOS, which is UNIX under the hood, and certified as such. Additionally, Homebrew, a package manager for Mac, made installing and updating the programs we needed, as well as their dependencies, relatively simple. In addition to Homebrew, we use pip to update brunnhilde, and freshclam, included in ClamAV, to keep the virus database up to date. HFS Explorer, necessary for exploring Mac-formatted disks, is a manual install and update, but it might be the main pain point (and not too painful yet). With the exception of HFS Explorer, updating is done at the time of processing, so the environment is always fresh.

Current workstation

We currently have one workstation where we process born-digital materials. We do our work on a Mac Pro:

  • macOS 10.13 High Sierra
  • 3.7 GHz processor
  • 32GB memory
  • 1TB hard drive
  • 5TB NFS-mounted networked storage
  • 5TB Western Digital external drive

We have a number of peripherals:

  • 2 consumer grade Blu-ray optical drives (LG and Samsung)
  • 2 iomega USB-powered ZIP drives (100MB and 250MB)
  • Several 3.5” floppy drives (salvaged from surplused computers), but our go-to is a Sony 83 track drive (model MPF920)
  • One TEAC 5.25” floppy drive (salvaged from a local scrap exchange)
  • Kryoflux board with power supply and ribbon cable with various connectors
  • WiebeTech USB and Forensic UltraDock v4 write blockers
  • Apple iPod (for taking pictures of media, usually transferred via AirDrop)

The tools that we use for exploration/appraisal, extraction, and reporting are largely command line tools:

Exploration

  • diskutil (finding where a volume is mounted)
  • gls (finding volume name, where the GNU version shows escapes (“\”) in print outs)
  • hdiutil (mounting disk image files)
  • mmls (finding partition layout of disk images)
  • drutil status (showing information about optical media)

Packaging

  • tar (packaging content from media not being imaged)
  • ddrescue (disk imaging; a brief scripted example follows this list)
  • cdparanoia (packaging content from audio discs)
  • KryoFlux GUI (floppy imaging)
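
Here is the brief scripted example promised above: a small, hedged sketch of driving the ddrescue step from Python. The device path and output names are placeholders; confirm the device (and use a write blocker where applicable) before running anything like this.

```python
# A hedged example of scripting the ddrescue step. The device path and
# output names are placeholders; confirm the device before running.
import subprocess

def image_disk(device="/dev/rdisk2", image="disk.img", mapfile="disk.map"):
    """-d uses direct disc access; -r3 retries bad sectors three times."""
    subprocess.run(["ddrescue", "-d", "-r3", device, image, mapfile],
                   check=True)

image_disk()
```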

Reporting

  • brunnhilde (file and disk image profiling, duplication)
  • bulk_extractor (PII scanning)
  • ClamAV (virus scanning)
  • ExifTool (metadata)
  • MediaInfo (metadata)

Additionally, we perform archival description using ArchivesSpace, and we’ve developed an application called DAEV (“Digital Assets of Enduring Value”) that, among other things, guides processors through a session and interacts with ArchivesSpace to record certain descriptive metadata. 

Working with IT

We have worked closely with our Libraries Information Technology department to acquire and maintain hardware and peripherals, just as we have worked closely with our Digital Library Initiatives department on the development and maintenance of DAEV. For purchasing, we submit larger requests, with justifications, to IT annually, and smaller requests as needs arise, e.g., our ZIP drive broke and we need a new one. Our computer is on the refresh cycle, meaning once it reaches a certain age, it will be replaced with a comparable computer. Especially with peripherals, we provide exact technical specifications and anticipated costs, e.g., iomega 250MB ZIP drive, and IT determines the purchasing process.

I think it’s easy to assume that, because people in IT are among our most knowledgeable colleagues about computing technology, they understand what it is we’re trying to do and what we’ll need to do it. I think that, while they are capable of understanding our needs, their specializations lie elsewhere, and it’s a bad assumption that can result in a less than ideal computing situation. My experience is that my coworkers in IT are eager to understand our problems and to help us solve them, but that they really don’t know what our problems are.

The counter assumption is that we ourselves are supposed to know everything about computing. That’s probably more counterproductive than assuming IT knows everything, because 1) we feel bad when we don’t know everything and 2) in trying to hide what we don’t know, we end up not getting what we need. I think the ideal situation is for us to know what processes we need to run (and why), and to share those with IT, who should be able to say what sort of processor and how much RAM is needed. If your institution has a division of labor, i.e., specializations, take advantage of it.

So, rather than saying, “we need a computer to support digital archiving,” or “I need a computer with exactly these specs,” we’ll be better off requesting a consultation and explaining what sort of work we need a computer to support. Of course, the first computer we requested for a born-digital workstation, which was intended to support a large initiative, came at a late hour and was in the form of “We need a computer to support digital archiving,” with the additional assumption of “I thought you knew this was happening.” We got a pretty decent Windows 7 computer that worked well enough.

I also recognize that I may be describing a situation that does not exist in many other institutions. In those cases, perhaps that’s something to be worked toward, through personal and inter-departmental relationship building. At any rate, I recognize and am grateful for the support my institution has extended to my work.

Challenges and opportunities

I’ve got two challenges coming up. Campus IT has required that all Macs be upgraded to macOS Mojave to “meet device security requirements.” From a security perspective, I’m all onboard for this. However, in our testing the KryoFlux is not compatible with Mojave. This appears to be related to a security measure Mojave has in place for controlling USB communication. After several conversations with Libraries IT, they’ve recommended assigning us a Windows 10 computer for use with the KryoFlux. Beyond now having two computers to manage, I see obvious benefits to this. One is that I’ll be able to install the Linux subsystem on Windows 10 and explore whether going full-out Linux might be an option for us. Another is that I’ll have ready access to FTK Imager again, which comes in handy from time to time.

The other challenge we have is working with our optical drives. We have consumer-grade drives, and they work inconsistently. While Drive 1 may read Disc X but not Disc Y, Drive 2 will do the reverse. At the 2019 BitCurator Users Forum, Kam Woods discussed higher-grade optical drives in the “There Are No Dumb Questions” session. (By the way, everyone should consider attending the Forum. It’s a great meeting that’s heavily focused on practice, and it gets better each year. This year, the Forum will be hosted by Arizona State University, October 12-13. The call for proposals will be coming out in early March.)

In the coming months we’ll be making some significant changes to our workflow, which will include tweaking a few things, reordering some steps, introducing new tools, e.g., walk_to_dfxml and Bulk Reviewer, and, I hope, introducing more automation into the process. We’re also due for a computer refresh, and, while we’re sticking with Macs for the time being, we’ll again work with our IT to review computer specifications.


Brian Dietz is the Digital Program Librarian for Special Collections at NC State University Libraries, where he manages born-digital processing, web archiving, and digitization.

What’s Your Set-Up?: Establishing a Born-Digital Records Program at Brooklyn Historical Society

by Maggie Schreiner and Erica López


In establishing a born-digital records program at Brooklyn Historical Society, one of our main challenges was scaling the recommendations and best practices, which thus far have been primarily articulated by large and well-funded research universities, to fit our reality: a small historical society with limited funding, a very small staff, and no in-house IT support. In navigating this process, we’ve attempted to strike a balance that will allow us to responsibly steward the born-digital records in our collections, be sustainable for our staffing and financial realities, and allow us to engage with and learn from our colleagues doing similar work.

We started our process with research and learning. Our Digital Preservation Committee, which meets monthly, held a reading group. We read and discussed SAA’s Digital Preservation Essentials, reached out to colleagues at local institutions with born-digital records programs for advice, and read widely on the internet (including bloggERS!). Our approach was also strongly influenced by Bonnie Weddle’s presentation “Born Digital Collections: Practical First Steps for Institutions,” given at the Conservation Center for Art & Historic Artifacts’ 2018 conference at the Center for Jewish History. Bonnie’s presentation focused on iterative processes that can be implemented by smaller institutions. Her presentation empowered us to envision a BHS-sized program: to start small, iterate when possible, and work in the ways that make sense for our staff and our collections.

We first enacted this approach in our equipment decisions. We assembled a workstation that consists of an air-gapped desktop computer and a set of external drives based on our known and anticipated needs (3.5 inch floppy, CD/DVD, Zip drives, and memory card readers). Our most expensive piece of equipment was our write blocker (a Tableau TK8u USB 3.0 Bridge), which, based on our research, seemed like the most important place to splurge. We based our equipment decisions on background reading, informal conversations with colleagues about equipment possibilities, and an existing survey of born-digital carriers in our collections. We were also limited by our small budget; the total cost for our workstation was approximately $1,500.

Born digital records workstation at the Brooklyn Historical Society

A grant from the Gardiner Foundation allowed us to create a paid Digital Preservation Fellowship, and hire the amazing Erica López for the position. The goals and timeline for Erica’s position were developed to allow lots of time for research, learning through trial and error, and mistakes. For a small staff, it is often difficult for us to create the time and space necessary for experimentation. Erica began by crafting processes for imaging and appraisal: testing software, researching, adapting workflows from other institutions, creating test disk images, and drafting appraisal reports. We opted to use BitCurator, due to the active user community. We also reached out to Bonnie Weddle, who generously agreed to answer our questions and review draft workflows. Bonnie’s feedback and support gave us additional confidence that we were on the right track.

Starting from an existing inventory of legacy media in our collections, Erica created disk images of the majority of items, and created appraisal assessments for each collection. Ultimately, Erica imaged eighty-seven born-digital objects (twelve 3.5 inch floppy disks, thirty-eight DVDs, and thirty-seven CDs), which contained a total of seventy-seven different file formats. Although these numbers may seem very small for some (or even most) institutions, these numbers are big for us! Our archives program is maintained by two FTE staff with multiple responsibilities, and vendor IT with no experience supporting the unique needs of archives and special collections. 

We encountered a few big bumps during the process! The first was that we unexpectedly had to migrate our archival storage server, and as a result did not have read-write access for several months. This interrupted our planned storage workflow for the disk images that Erica was creating. In hindsight, it was a glaring mistake to keep the disk images in the virtual machine running BitCurator. Inevitably, we had a day when we were no longer able to launch the virtual machine. After several days of failed attempts to recover the disk images, we decided that Erica would re-image the media. Fortunately, by this time, Erica was very proficient and it took less than two weeks!

We had also hoped to do a case study on a hard drive in our collection, as Erica’s work had otherwise been limited to smaller removable media. After some experimentation, we discovered that our system would not be able to connect to the drive, and that we would need to use a FRED to access the content. We booked time at the Metropolitan New York Library Council’s Studio to use their FRED. Erica spent a day imaging the drive, and brought back a series of disk images… which to date we have not successfully opened in our BitCurator environment at BHS! After spending several weeks troubleshooting the technical difficulties and reaching out to colleagues, we decided to table the case study. Although disappointing, we also recognized that we have made huge strides in our ability to steward born-digital materials, and that we will continually iterate on this work in the future.

What have we learned about creating a BHS-sized born-digital records program? We learned that our equipment meets the majority of our use-case scenarios, that we have access to additional equipment at METRO when needed, and that maybe we aren’t quite ready to tackle more complex legacy media anyway. We learned that’s okay! We haven’t read everything, we don’t have the fanciest equipment, and we didn’t start with any in-house expertise. We did our research, did our best work, made mistakes, and in the end we are much more equipped to steward the born-digital materials in our collections. 


Maggie Schreiner is the Manager of Archives and Special Collections at the Brooklyn Historical Society, an adjunct faculty member in New York University’s Archives and Public History program, and a long-time volunteer at Interference Archive. She has previously held positions at the Fashion Institute of Technology (SUNY),  NYU, and Queens Public Library. Maggie holds an MA in Archives and Public History from NYU.

Erica López was born and raised in California by undocumented parents. Education was important but exploring Los Angeles’s colorful nightlife was more important. After doing hair for over a decade, Erica started studying to be a Spanish teacher at UC-Berkeley. Eventually, Erica quit the Spanish teacher dream, and found first film theory and then the archival world. Soon, Erica was finishing up an MA at NYU and working to become an archivist. Erica worked with Brooklyn Historical Society to setup workflows for born-digital collections, and is currently finishing up an internship at The Riverside Church translating and cataloging audio files.

What’s Your Set-up?: Curation on a Shoestring

by Rachel MacGregor


At the Modern Records Centre, University of Warwick in the United Kingdom we have been making steady progress in our digital preservation work. Jessica Venlet from UNC Chapel Hill wrote recently about being in the lucky position of finding “an excellent stock of hardware and two processors” when she started in 2016. We’re a little further behind than this—when I began in 2017 I had a lot less!

What we want is FRED. Who’s he? He’s your Forensic Recovery of Evidence Device (a forensic workstation), but at several thousand dollars, it’s beyond the reach of many of us.

What I had in 2017: 

  • A Tableau T8-R2 write blocker. Write blockers are very important when working with rewritable media (USB drives, hard drives, etc.) because they prevent accidental alteration of material by blocking overwriting or deletion.
  • A (fingers crossed) working 3.5 inch external floppy disk drive.
  • A lot of enthusiasm.

What I didn’t have: 

  • A budget.
My digital curation workstation – not fancy but it works for me. Photo taken by MacGregor, under CC-BY license.

Whilst doing background research for tackling our born-digital collections, I got interested in the BitCurator toolkit, which is designed to help with the forensic recovery of digital materials. It interested me particularly because:

  • It’s free.
  • It’s open source.
  • It’s created and managed by archivists for archivists.
  • There’s a great user community.
  • There are loads of training materials online and an online discussion group.

I found this excellent blog post by Porter Olsen to help get started. He suggests starting with a standard workstation with a relatively high specification (e.g. 8 GB of RAM). So, I asked our IT folk for one, which they had in stock (yay!).  I specified a Windows operating system and installed a virtual machine, which runs a Linux operating system on which to run BitCurator. 

I’m still exploring BitCurator—it’s a powerful suite of tools with lots of features. However, when trialing it on the papers of the eminent historian Eric Hobsbawm, I found that it was a bit like using a hammer to crack a nut. Whilst it was possible to produce all sorts of sophisticated reports identifying email addresses etc., this isn’t much use on drafts of published articles from the late 1990s to early 2000s. I turned to FTK Imager, which is proprietary but free software. It is widely used in the preservation community, but not designed by, with, or for archivists (as BitCurator is). I guess its popularity derives from the fact that it’s easy to use and will allow you to image (i.e. take a complete copy of the whole media, including deleted and empty space) or just extract the files, without too much time spent learning to use it. There are standard options for disk image output (e.g. a raw byte-for-byte image, the E01 Expert Witness format, SMART, and AFF formats). However, I would like to spend some more time getting to know BitCurator and becoming part of its community. There is always room for new and different tools and I suspect the best approaches are those which embrace diversity.

Another tool that looks useful for disk imaging is diskimgr, created by Johan van der Knijff of the KB, National Library of the Netherlands. It will only run on a Linux operating system (not on a virtual machine), so now I am wondering about getting a separate Linux workstation. BitCurator also works more effectively in a Linux environment as opposed to a virtual machine; it does stall sometimes with larger collections. I wonder if I should have opted for a Linux machine to start with... it’s certainly something to consider when creating a specification for a digital curation workstation.

Once the content is extracted, we need further tools to help us manage and process it. BitCurator does a lot, but there may be extra things that you need depending on your intended workflow. I never go anywhere without DROID. DROID is useful for loads of stuff like file format identification, creating checksums, deduplication, and lots more. My standard workflow is to create a DROID profile and then use this as part of the appraisal process further down the line. What I don’t yet have is some sort of file viewer—Quick View Plus is the one I have in mind (it’s not free and, as I think I mentioned, my resources are limited!). I would also like to get LibreOffice installed, as it deals quite well with old word-processed documents.
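
For anyone curious what that DROID step looks like when scripted, here is a hedged sketch that creates a profile for a directory of extracted files and exports it to CSV. The launcher path, directory names, and flags are assumptions and should be checked against your DROID version; they are not a recommendation from the workflow described above.

```python
# A hedged sketch of driving DROID from Python: build a profile for a
# directory of extracted files, then export it to CSV for appraisal.
# The launcher path and flags are assumptions; check them against your
# DROID version.
import subprocess

DROID = "droid.sh"            # or droid.bat / full path to the launcher
SOURCE = "extracted_files/"   # hypothetical directory of extracted files
PROFILE = "accession.droid"
REPORT = "accession.csv"

# Add the directory to a new profile (-R recurses into subfolders)...
subprocess.run([DROID, "-R", "-a", SOURCE, "-p", PROFILE], check=True)
# ...then export the profile's results to CSV, one row per file.
subprocess.run([DROID, "-p", PROFILE, "-e", REPORT], check=True)
```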

I guess I’ll keep adding to it as I go along. I now need to work out the most efficient ways of using the tools I have and capturing the relevant metadata that is produced. I would encourage everyone to take some time experimenting with some of the tools out there and I’d love to hear about how people get on.


Rachel MacGregor is Digital Preservation Officer at the Modern Records Centre, University of Warwick, United Kingdom. Rachel is responsible for developing and implementing digital preservation processes at the Modern Records Centre, including developing cataloguing guidelines, policies and workflows. She is particularly interested in workforce digital skills development.

What’s Your Set-up?: Managing Electronic Records at University of Alaska Anchorage

by Veronica Denison


For years, archivists at the University of Alaska Anchorage Archives and Special Collections have known that we would have to grapple with how to store our electronic records. We were increasingly receiving donations that contained born-digital items, and I also knew I wanted to apply for grants to digitize audio, video, and film in our holdings. The grants would not provide funding for our storage system, which also meant we had to come up with alternative means to pay for it.

Prior to 2018, anything we digitized was saved on what we called “scribe drives,” which were shared network drives mapped to the computers in the archives. These drives could store about 4TB of data. In 2017, we digitized ten ¼-inch audio reels as .mp3 and .wav files, which gave us both access and master copies of those recordings. At the time, we had enough storage space for everything, but in 2018, we received a request to digitize two 16mm film reels from the Dorothy and Grenold Collins papers. Unfortunately, we could not save both the access and master copies to the drive since the files were too large (about 350GB per film for the master copy).

Around the same time, we also received the Anne Nevaldine papers. The collection contained 4 boxes of 35mm slides as well as multiple CDs that in total contained 64,932 files within 1,446 folders, for a total of 322GB. I had a volunteer run each of the CDs through the Archives segregation machine to check for viruses and then transfer the digital files onto an external hard drive. We thought we had time to figure out a more permanent solution than the external hard drive, but two weeks after I made the finding aid available online, a researcher came in wanting to look at the digital photographs in the collection. This created an issue, as my only option was to give her the external hard drive to look at the images. While she was in the Archives Research Room, I watched her closely to make sure nothing was deleted or moved.

We decided that we needed a system where we could save and access all of our digital content, while also having it backed up, and have the option to make read-only copies available to researchers in the Research Room. We knew we would probably end up having at least 5 TB of data right away if we factored in our current digital items and the possibility of future ones. We initially approached the University’s IT Department to learn about our options. Unfortunately, we were quoted a very high cost (over $10,000 a year for 20TB), so we approached the Library’s IT Department for suggestions. After some discussion about what would be appropriate for our needs, Brad, the Library’s PC and Network Administrator, presented us with some options.

We ultimately decided on a Synology DiskStation DS1817+, which cost $848, with WD Gold 10TB HDD drives. We settled on 8 drives for a total of 80TB (to provide growth space), which were $375 each for a cost of $3,000. Then we needed a system to hook it to. For that we used a Windows 10 desktop, which cost $1,065. The total cost for the hardware was $4,913; however, we also needed a cloud service provider to back up the files. We decided to go with Backblaze, which costs $5 per TB per month. This whole system is a network-attached storage system, which means it is a file-level computer data storage server connected to a computer network. We took to calling it “the NAS” for short. Thankfully, when we presented the need for electronic storage to the Library’s Dean, he was willing to provide us with the funding needed to purchase the items.

Once everything was installed, we had to transfer the files and develop a new system for arranging and saving them. We decided on having three separate drives, two of which would be on the NAS (Master and Access), and one a separate network drive (Reference_Access). The Master drive is the only one that is backed up to Backblaze. The Master drive acts as a dark archive, meaning once items are saved to it, they will not be accessed. Therefore, we created the Access drive where archivists can retrieve the digital contents for reference and use purposes. The Access drive is essentially a copy of the Master drive. There is also a Reference_Access drive, which is mapped separately to each computer within the Archives, and not on the NAS. Reference_Access is the drive researchers use to access digital content in the Research Room and contains the access copies and low-resolution .jpgs of photographs that may be high resolution on the Master and Access drives.
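
As a rough illustration of how low-resolution access copies like those on Reference_Access might be generated from high-resolution masters, here is a hedged sketch using Pillow. The drive paths, image size cap, and JPEG quality are assumptions for the example, not UAA's actual process.

```python
# A hedged sketch of generating low-resolution access JPGs from
# high-resolution TIFF masters with Pillow. Drive paths, the size cap,
# and the JPEG quality are assumptions.
from pathlib import Path

from PIL import Image

MASTER = Path("M:/Master")              # assumed drive mappings
REFERENCE = Path("R:/Reference_Access")

for tiff in MASTER.rglob("*.tif"):
    out = (REFERENCE / tiff.relative_to(MASTER)).with_suffix(".jpg")
    out.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(tiff) as img:
        img.thumbnail((2000, 2000))     # cap the long edge at 2000 px
        img.convert("RGB").save(out, "JPEG", quality=80)
```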

The next step was mapping the Reference_Access drive to the researcher computer in the Research Room, and to make it read-only, but only for that computer. After working with the University’s IT Department, Brad was able to make it work. Since establishing this system in Spring 2019, the Reference_Access drive has been used by multiple researchers and it works great! They are able to access digital content of collections as easily as looking through a box on a table. We are grateful for all those who helped the Archives have a great mechanism for saving our electronic records, at a relatively low cost.


Veronica Denison is currently the Assistant University Archivist at Kansas State University where she has been since September 2019. Prior to being hired at K-State, she was an archivist for six years at the University of Alaska Anchorage. She holds an MLIS with a Concentration in Archives Management from Simmons College.

What’s Your Set-Up? Born-Digital Processing at UNC Chapel Hill

by Jessica Venlet


At UNC-Chapel Hill Libraries’ Wilson Special Collections Library, our workflow and technology set-up for born-digital processing have evolved over many years and under the direction of a few different archivists. This post provides a look at what technology was here when I started work in 2016 and how we’ve built on those resources in the last three years. Our set-up for processing and appraisal centers on getting collections ready for ingest into the Libraries’ Digital Collections Repository, where other file-level preservation actions occur.

What We Had 

I arrived at UNC in 2016 and was happy to find an excellent stock of hardware and two processing computers. Thank you to my predecessors! 

The computers available for processing were an iMac (10.11.6, 8GB RAM, 500GB storage) and a Lenovo PC (Windows 7, 8GB RAM, 465GB storage, 2.4GHz processor). These computers were not used for storing collection material. Collections were temporarily stored and processed on a server before ingest to the repository. While I’m not sure how these machines were selected, I was glad to have dedicated born-digital processing workstations.

In addition to the computers, we had a variety of other devices including:

  • One Device Side Data FC0525 5.25” floppy controller and a 5.25” disk drive
  • One Tableau USB write blocker
  • One Tableau SATA/IDE write blocker
  • Several USB connectable 3.5” floppy drives 
  • Two memory card readers (SanDisk 12 in 1 and Delkin)
  • Several zip disk drives (which turned out to be broken)
  • External CD/DVD player 
  • 3 external hard drives and several USB drives
  • Camera for photographing storage devices
  • A variety of other cords and adapters, most of which are used infrequently, such as extra SATA/IDE adapters, Molex power adapters and power cords, and a USB adapter kit.

The primary programs in use at the time were FTK Imager, Exact Audio Copy, and Bagger. 

What We Have Now

Since 2016, our workflow has evolved to include more appraisal and technical review before ingest. As a result, our software set-up expanded to include actions like virus scanning and file format identification. While it was great to have two dedicated workstations, our computers definitely needed an upgrade, so we worked on securing replacements.

The iMac was replaced with a Mac Mini (10.14.6, 16 GB RAM, 251 GB flash storage). Our PC was upgraded to a Lenovo P330 tower (Windows 10, 16 GB RAM, 476 GB storage). The Mini was a special request, but the PC request fit into a normal upgrade cycle. We continue to temporarily store collections on a server for processing before ingest.

Our peripheral devices remain largely the same as above, but we have added new (functional) zip drives and another Tableau USB write blocker used for appraisal outside of the processing space (e.g. offsite for a donor visit). We also purchased a KryoFlux, which can be used for imaging floppies. While not strictly required for processing, the KryoFlux may be useful to have if you encounter frequent issues accessing floppies. To learn more about the KryoFlux, check out the excellent Archivist’s Guide to the KryoFlux resource.

The software and tools that we’ve used have changed more often than our hardware set-up. Since about May 2018, we’ve settled on a pretty stable selection of software to get things done. Our commonly used tools are Bagger, Brunnhilde (and the dependencies that go with it, like Siegfried and ClamAV), bulk_extractor, Exact Audio Copy, ffmpeg, IsoBuster, LibreOffice, Quick View Plus, rsync, text editors (TextWrangler or BBEdit), and VLC Media Player.

Recommended Extras

  • USB hub. Having extra USB ports has proven useful. 
  • A basic repair toolkit. This isn’t something we use often, but we have had a few older external hard drives come through that we needed to remove from an enclosure to connect to the write blocker. 
  • Training Collection Materials. One of the things I recommend most for a digital archives set-up is a designated set of storage devices and files that are for training and testing only. This way you have some material ready to go for testing new tools or training colleagues. Our training and testing collection includes a few 3.5” and 5.25” floppies, optical discs, and a USB drive that is loaded with files (including files with information that will get caught by our PII scanning tools). Many of the storage devices were deaccessioned and destined for recycling.

So, that’s how our set-up has changed over the last several years. As we continue to understand our needs for born-digital processing and as born-digital collections grow, we’ll continue to improve our hardware and software set-up.


Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC-Chapel Hill Libraries’ Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the processing and management of born-digital special collections. She earned a Master of Science in Information degree from the University of Michigan.

Welcome to the newest series on bloggERS, “What’s Your Set-Up?”

By Emily Higgs


Welcome to the newest series on bloggERS, “What’s Your Set-Up?” In the coming weeks, bloggERS will feature posts from digital archives professionals exploring the question: what equipment do you need to get your job done?

This series was born from personal need: as the first Digital Archivist at my institution, one of my responsibilities has been setting up a workstation to ingest and process our born-digital collections. It’s easy to be overwhelmed by the range of hardware and software needed, the variety of options for different equipment types, and where to obtain everything. In my context, some groundwork had already been done by forward-thinking former employees, who set up a computer with the BitCurator environment and also purchased a WiebeTech USB WriteBlocker. While this was a good first step for a born-digital workstation, we had much farther to go.

The first question I asked was: what do I need to buy?

My initial list of equipment was pretty easy to compile: 3.5” floppy drive, 5.25” floppy drive, optical drive, memory card reader, etc. etc. Then it started to get more complicated: 

  • Do I need to purchase disk controllers now or should I wait until I’m more familiar with the collections and know what I need? 
  • How much will a KryoFlux cost us over time vs. hiring an outside vendor to read our difficult floppies? 
  • Is it feasible to share one workstation among multiple departments? Should some of this equipment be shared consortially, like much of our collections structure? 
  • What brands and models of all this stuff are appropriate for our use case? What is quality and what is not?

The second question was: where do I buy all this stuff? This question contained myriad sub-questions: 

  • How do I balance quality and cost? 
  • Can I buy this equipment from Amazon? Should I buy equipment from Amazon? 
  • Will our budget structure allow for me to use vendors like eBay? 
  • Which sellers on eBay can I trust to send us legacy equipment that’s in working condition?

As with most of my work, I have taken an iterative approach to this process. The majority of our unprocessed born-digital materials were stored on CDs and 3.5” floppy disks, so those were the focus of our first round of purchasing a few weeks ago. In addition to the basic USB write blocker and BitCurator machine we already had, we now have a Dell external USB CD drive, a Tendak USB 3.5” floppy drive, and an Aluratek multimedia card reader to read the most common media in our unprocessed collections. We chose the Tendak drive mainly because of its price point, but it has not been the most reliable hardware and we will likely try something else in the future. As I’ve gone through old boxes from archivists past, I have found additional readers such as an Iomega Jaz drive, which I’m very glad we have; there are a number of Jaz disks in our unprocessed collections as well.

As I went about this process, I started by emailing many of my peers in the field to solicit their opinions and learn more about the equipment at their institutions. The range of responses I got was extremely helpful for my decision-making process. The team at bloggERS wanted to share that knowledge out to the rest of our readership, helping them learn from their peers at a variety of institutions. We hope you glean some useful information from this series, and we look forward to your comments and discussions on this important topic.


Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.

Data As Collections

By Nathan Gerth

Over the past several years there has been a growing conversation about “collections as data” in the archival field. Elizabeth Russey Roke underscored the growing impact of this movement in her recent post on the Collections As Data: Always Already Computational final report. Much like her, I have seen this computational approach to the data in our collections manifest itself at my home institution, with our decision to start providing researchers with aggregate data harvested from our born-digital collections.

Data as Collections

At the same time, in my role as a digital archivist working with congressional papers, I have seen a growing body of what I call “data as collections.” I am using the term data in this case specifically in reference to exports from relational database systems in collections. More akin to research datasets than standard born-digital acquisitions, these exports amplify the privacy and technical challenges associated with typical digital collections. However, they also embody some of the more appealing possibilities for the computational research highlighted by the “collections as data” initiative, given their structured nature and millions of data points.   

Curating and supplying access to a particular type of data export has become an acute problem in the field of congressional papers. As documented in a white paper by a Congressional Papers Section Task Force in 2017, members of the U.S. House of Representatives and U.S. Senate have widely adopted proprietary Constituent Management Systems (CMS) or Constituent Services Systems (CSS) to manage constituent communications. The exports from these systems document the core interactions between the American people and their representatives in Congress. Unfortunately, these data exports have remained largely inaccessible to archivists and researchers alike.

The question of curating, preserving, and supplying access to the exports from these systems has galvanized the work of several task forces in the archival community. In recent years, congressional papers archivists have collaborated to document the problem in the white paper referenced above and to support the development of a tool to access these exports. The latter effort, spearheaded by West Virginia University Libraries, earned a Lyrasis Catalyst Fund grant in 2018 to assess the development possibilities for an open-source platform built at WVU to open and search these data exports. You can see a screenshot of the application in action below.


Screenshot of data table viewed in the West Virginia University Libraries CSS Data Tool

The project funded by the grant, America Contacts Congress, has now issued its final report, and the members of the task force that served as its advisory board are transitioning to the next stage of the project. Here is where things stand:

What We Now Know

We now know much more about the key research audiences for this data and the archival needs associated with the tool. Researchers, especially computationally minded quantitative scholars, expressed solid enthusiasm for gaining access to the data. For those of us involved in testing data in the tool, the project gave us a moment to become much more familiar with our data. I, for my part, also know a great deal more about the 16 million records in the relational data tables we received from the office of Senator Harry Reid, in addition to the 3 million attachments referenced by those tables. Without the ability to search and view the data in the tool, the tables and attachments from the Reid collection would have existed as little more than binary files.
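
To make the scale concrete, here is a minimal sketch of the kind of aggregate query such a tool makes possible. It assumes, purely for illustration, that a CSS export has been loaded into a SQLite database with hypothetical messages and attachments tables; real exports are proprietary and their schemas vary by vendor.

import sqlite3

# Hypothetical SQLite copy of a CSS export; actual vendor exports use
# different table layouts, so the table and column names here are made up.
conn = sqlite3.connect("css_export.db")

# Count constituent messages per issue code, along with how many
# attachments those messages reference.
query = """
    SELECT m.issue_code,
           COUNT(DISTINCT m.message_id) AS message_count,
           COUNT(a.attachment_id)       AS attachment_count
    FROM messages AS m
    LEFT JOIN attachments AS a ON a.message_id = m.message_id
    GROUP BY m.issue_code
    ORDER BY message_count DESC
    LIMIT 10
"""

for issue_code, message_count, attachment_count in conn.execute(query):
    print(f"{issue_code}: {message_count} messages, {attachment_count} attachments")

conn.close()

Even a simple aggregate view like this is something the raw binary files could never provide on their own.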

Unresolved Questions

While members of the grant’s advisory board know much more about how the tool might be used in the sphere of congressional papers, we would like to learn more about other cases of “data in collections” in the archival field. Who, beyond congressional papers archivists, is grappling with supplying access to and preserving relational databases? We know, for example, that many state and local governments use the same Constituent Relationship Management systems, such as iConstituent and Intranet Quorum, deployed in congressional offices. Do our needs overlap with those of other archivists, and could this tool serve a broader community? While the amount of CSS data in congressional collections is significant, the direction we take with tool development and partnerships to supply access to the data will hinge on finding a broader audience of archivists facing similar challenges.

If any of the questions above apply to you, consider contacting the members of the America Contacts Congress project’s advisory board. We would love to hear from you and discuss how the outcomes of the grant might apply to a broader array of data exports in archival collections. Who knows, we might even help you test the tool on your own data exports! For more information about the project, visit our webpage.

Nathan Gerth

Nathan Gerth is the Head of Digital Services and Digital Archivist at the University of Nevada, Reno Libraries. Splitting his time between application support and digital preservation, he is the primary custodian of the electronic records from the papers of Senator Harry Reid. Outside of the University, he is an active participant in the congressional papers community, serving as the incoming chair of the Congressional Papers Section and as a member of the Association of Centers for the Study of Congress CSS Data Task Force.

Student Impressions of Tech Skills for the Field

by Sarah Nguyen


Back in March, during bloggERS’ Making Tech Skills a Strategic Priority series, we distributed an open survey to MLIS, MLS, MI, and MSIS students to understand what they know and have experienced in relation to technology skills as they enter the field.

To be frank, this survey stemmed from personal interests, since I just completed an MLIS core course on Research, Assessment, and Design (re: surveys to collect data on the current landscape). I am also interested in what skills I need to build and which classes I should sign up for next quarter (re: what tech skills do I need to become hireable?). While I feel comfortable with a variety of tech-related tools and tasks, I’ve been intimidated by more “high-level” computational languages for some years. This survey was helpful for exploring what skills other LIS pre-professionals are interested in and which skills will help us make these costly degrees worth the time and financial investment traditionally required to enter a stable archives or library position.

Method

The survey was open for one month on Google Forms, and distributed to SAA communities, @SAA_ERS Twitter, the Digital Curation Google Group, and a few MLIS university program listservs. There were 15 questions and we received responses from 51 participants. 

Results & Analysis

Here’s a superficial scan of the results. If you would like to come up with your own analyses, feel free to view the raw data on GitHub.

Figure 1. Technology-related skills that students want to learn

The most popular technology-related skill that students are interested in learning is data management (manipulating, querying, transforming data, etc.). This is a pretty broad topic, as it involves many tools and protocols that can range from GUI applications to scripts. A separate survey that breaks down specific data management tools might be in order, especially since these skills can be split across specialty courses and workshops, which in turn translate into specific job positions. A more specific survey could help demonstrate which skills need to be taught in a full semester-long course and which can be covered in a day-long or multi-day workshop.
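
For a sense of what “manipulating, querying, transforming data” can look like at the script level rather than in a GUI, here is a minimal sketch using only Python’s standard library. The file name and column headings are hypothetical, not drawn from any survey response.

import csv
from collections import Counter

# Hypothetical inventory export; the file name and the "format" and
# "extent_mb" column headings are made up for illustration.
with open("accession_inventory.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Query: how many files of each format, and roughly how much space they use.
counts = Counter(row["format"] for row in rows)
sizes = Counter()
for row in rows:
    sizes[row["format"]] += float(row["extent_mb"] or 0)

# Transform: write a small summary table back out for a report.
with open("format_summary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["format", "file_count", "total_mb"])
    for fmt, count in counts.most_common():
        writer.writerow([fmt, count, round(sizes[fmt], 1)])

Nothing here goes beyond the standard library, but it is the kind of task that sits between a spreadsheet GUI and a full application.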

It was interesting to see that, even in this day and age when social media management can be second nature to many students’ daily lives, there was still notable interest in understanding how to make it part of their career. This makes me wonder what value students see in knowing how to strategically manage an archives’ social media account. How could this help with the job market, as well as with an archival organization’s main mission?

Looking deeper into the popular data management category, it would be interesting to know the current landscape of knowledge or pedagogy around communicating with IT (e.g., project management and translating users’ needs). In many cases, archivists work separately from, but depend on, IT system administrators, and this can be frustrating since each department has distinct concerns about a given server or network. At June’s NYC Preservathon/Preservashare 2019, there was mention that IT exists to make sure servers and networks are spinning at all hours of the day. Unlike archivists, they are not concerned about the longevity of the content, the obsolescence of file formats, or the software needed to render files. Could it be useful to have a course on how to effectively communicate and take ownership of issues that fall along the fuzzy lines between archives, data management, and IT? Or, as one survey respondent said, “I think more basic programming courses focusing on tech languages commonly used in archives/libraries would be very helpful.” Personally, I’ve only learned this from experience working in different tech-related jobs; it is not a subject I see in my MLIS course catalog, nor a discussion at conference workshops.

The popularity of data management skills sparked another question: what about knowledge of computer networks and servers? Even though LTO will forever be in our hearts, cloud storage is also a backup medium we’re budgeting for and relying on. The same goes for hosting a database for remote access and/or publishing digital files. A friend mentioned a networking workshop for non-tech-savvy learners, Grassroots Networking: Network Administration for Small Organizations/Home Organizations, which could be helpful for multiple skill areas, including data management, digital forensics, web archiving, and web development. This is similar to a course that could be found in computer science or MLIS-adjacent information management departments.

Figure 2. Have you taken/will you take technology-focused courses in your program?
Figure 3. Do you feel comfortable defining the difference between scripting and programming?

I can’t say this is statistically significant, but the contrast between the 15.7% of respondents who have not taken and will not take a technology-focused course in their program and the 78.4% who do not feel comfortable defining the difference between scripting and programming is eyebrow-raising. According to an article in PLOS Computational Biology, the term “script” means “something that is executed directly as is,” while a “program [… is] something that is explicitly compiled before being used. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is ‘things that are edited directly’ and ‘things that are not edited directly’” (Wilson et al. 2017). This distinction is important since more archives are acquiring, processing, and sharing collections that rely on archivists who can execute jobs such as web scraping or metadata management (scripts) or who can build and maintain a database (programming). These might be interpreted as trick questions, but the particular semantics, and what counts as technology-focused, are things modern library, archives, and information programs might want to consider.
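
As a small illustration of the “script” side of that divide, here is the sort of web-scraping one-off the paragraph above has in mind, written with only the standard library: it is edited directly and run directly, never explicitly compiled. The URL and the assumption that item titles live in <h3> headings are hypothetical.

from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical finding-aid page; in practice, point this at a page you
# have permission to scrape.
URL = "https://example.org/finding-aid"

class TitleScraper(HTMLParser):
    """Collect the text of <h3> headings (assumed here to hold item titles)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(urlopen(URL).read().decode("utf-8"))
print("\n".join(scraper.titles))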

Figure 4. How do you approach new technology?

Figure 4 illustrates the various ways students tackle new technologies. Reading the f* manual (RTFM) and searching forums are the most common approaches to navigating new technology. Here are quotes from a couple of students on how they tend to learn a new piece of software:

  • “break whatever I’m trying to do with a new technology into steps and look for tutorials & examples related to each of those steps (i.e. Is this step even possible with X, how to do it, how else to use it, alternatives for accomplishing that step that don’t involve X)”
  • “I tend to google “how to….” for specific tasks and learn new technology on a task-by-task basis.”

In the end, there was overwhelming interest in “more project-based courses that allow skills from other tech classes to be applied.” Unsurprisingly, many of us are looking for full-time, stable jobs after graduating, and the “more practical stuff, like CONTENTdm for archives” feels like a prerequisite for an entry-level position. And it isn’t just about entry-level work: as continuing-education learners, there is also a push to strive for more, and several respondents are looking for a challenge to level up their tech skills:

  • “I want more classes with hands-on experience with technical skills. A lot of my classes have been theory based or else they present technology to us in a way that is not easy to process (i.e. a lecture without much hands-on work).”
  • “Higher-level programming, etc. — everything on offer at my school is entry level. Also digital forensics — using tools such as BitCurator.”
  • “Advanced courses for the introductory courses. XML 2 and python 2 to continue to develop the skills.”
  • “A skills building survey of various code/scripting, that offers structured learning (my professor doesn’t give a ton of feedback and most learning is independent, and the main focus is an independent project one comes up with), but that isn’t online. It’s really hard to learn something without face to face interaction, I don’t know why.”

It’ll be interesting to see what skills recent MLIS, MLS, MIS, and MSIM graduates will enter the field with. While many job postings list certain software and skills as requirements, will programs follow suit? I have a feeling this might be a significant question to ask in the larger context of what the purpose of this Master’s degree is and how the curriculum can keep up with the dynamic technology needs of the field.

Disclaimer: 

  1. Potential bias: Those taking the survey might be interested in learning higher-level tech skills because they do not already have them, while those who are already tech-savvy might avoid a basic survey such as this one. This may bias the survey population toward mostly novice tech students.
  2. More data on specific computational languages and technology courses taken are available in the GitHub csv file. As mentioned earlier, I just finished my first year as a part-time MLIS student, so I’m still learning the distinct jobs and nature of the LIS field. Feel free to submit an issue to the GitHub repo, or tweet me @snewyuen if you’d like to talk more about what this data could mean.

Bibliography

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510


Sarah Nguyen with a Uovo storage truck

Sarah Nguyen is an advocate for open, accessible, and secure technologies. While studying as an MLIS candidate with the University of Washington iSchool, she is expressing interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, and archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance. (Views do not represent Uovo. And I don’t even work with them. Just liked the truck.)