A Recap of “DAM if you do and DAM if you don’t!”

by Regina Carra

When: December 3, 2018

Where: Metropolitan New York Library Council (METRO), New York, NY

Speakers:

  • Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (CUNY)
  • Ashley Blewer, AV Preservation Specialist at Artefactual
  • Kelly Stewart, Digital Preservation Services Manager at Artefactual

On December 3, 2018, the Metropolitan New York Library Council (METRO)’s Digital Preservation Interest Group hosted an informative (and impeccably titled) presentation about how the CUNY Graduate Center (GC) plans to incorporate Archivematica, a web-based, open-source digital asset management software (DAMs) developed by Artefactual, into its document management strategy for student dissertations. Speakers included Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (GC); Ashley Blewer, AV Preservation Specialist at Artefactual; and Kelly Stewart, Digital Preservation Services Manager at Artefactual. The presentation began with an overview from Stephen about the GC’s needs and why they chose Archivematica as a DAMs, followed by an introduction to and demo of Archivematica and Duracloud, an open-source cloud storage service, led by Ashley and Kelly (who was presenting via video-conference call). While this post provides a general summary of the presentation, I would recommend reaching out to any of the presenters for more detailed information about their work. They were all great!

Every year the GC Library receives between 400 and 500 dissertations, theses, and capstones. These submissions can include a wide variety of digital materials, from PDF, video, and audio files, to websites and software. Preservation of these materials is essential if the GC is to provide access to emerging scholarship and retain a record of students’ work towards their degrees. Prior to implementing a DAMs, however, the GC’s strategy for managing digital files of student work was focused primarily on access, not preservation. Access copies of student work were available on CUNY Academic Works, a site that uses Bepress Digital Commons as a CMS. Missing from the workflow, however, was the creation, storage, and management of archival originals. As Stephen explained, if the Open Archival Information System (OAIS) model is a guide for a proper digital preservation workflow, the GC was without the middle, Archival Information Package (AIP), portion of it. Some of the qualities that the GC liked about Archivematica were that it was open-source and highly customizable, came with strong customer support from Artefactual, and had an API that could integrate with tools already in use at the library. GC Library staff hope that Archivematica can eventually integrate with both the library’s electronic submission system (Vireo) and CUNY Academic Works, making the submission, preservation, and access of digital dissertations a much more streamlined, automated, and OAIS-compliant process.

A sample of one of Duracloud’s data visualization graphs from the presentation slides.

Next, Ashley and Kelly introduced and demoed Archivematica and Duracloud. I was very pleased to see several features of the Archivematica software that were made intentionally intuitive. The design of the interface is very clean and easily customizable to fit different workflows. Also, each AIP that is processed includes a plain-text, human-readable file which serves as extra documentation explaining what Archivematica did to each file. Artefactual recommends pairing Archivematica with Duracloud, although users can choose to integrate the software with local storage or with other cloud services like those offered by Google or Amazon. One of the features I found really interesting about Duracloud is that it comes with various data visualization graphs that show the user how much storage is available and what materials are taking up the most space.

I close by referencing something Ashley wrote in her recent bloggERS post (conveniently she also contributed to this event). She makes an excellent point about how different skill-sets are needed to do digital preservation, from the developers that create the tools that automate digital archival processes to the archivists that advocate for and implement said tools at their institutions. I think this talk was successful precisely because it included the practitioner and vendor perspectives, as well as the unique expertise that comes with each role. Both are needed if we are to meet the challenges and tap into the potential that digital archives present. I hope to see more of these “meetings of the minds” in the future.

(For more info: Stephen and Ashley and Kelly have generously shared their slides!)


Regina Carra is the Archive Project Metadata and Cataloging Coordinator at Mark Morris Dance Group. She is a recent graduate of the Dual Degree MLS/MA program in Library Science and History at Queens College – CUNY.


The Archivist’s Guide to KryoFlux

by [Matthew] Farrell and Shira Peltzman

As cultural icons go, the floppy disk continues to persist in the contemporary information technology landscape. Though digital storage has moved beyond the 80 KB – 1.44 MB storage capacity of the floppy disk, its image is often shorthand for the concept of saving one’s work (to wit: Microsoft Word 2016 still uses an icon of a 3.5″ floppy disk to indicate save in its user interface). Likewise, floppy disks make up a sizable portion of many archival collections, in number of objects if not storage footprint. If a creator of personal papers or institutional records maintained their work in electronic form in the 1980s or 1990s, chances are high that these are stored on floppy disks. But the persistent image of the ubiquitous floppy disk conceals a long list of challenges that come into play as archivists attempt to capture their data.

For starters, we often grossly underestimate the extent to which the technology was in active development during its heyday. One would be forgiven the assumption that there existed only a small number of floppy disk formats: namely 5.25″ and 3.5″, plus their 8″ forebears. But within each of these sizes there existed myriad variations of density and encoding, all of which complicate the archivist’s task now that these disks have entered our stacks. This is to say nothing of the hardware: 8″ and 5.25″ drives and standard controller boards are no longer made, and the only 3.5″ drive currently manufactured is a USB-enabled device capable only of reading disks with the more recent encoding methods storing file systems compatible with the host computer. And, of course, none of the above accounts for media stability over time for obsolete carriers.

Enter KryoFlux, a floppy disk controller board first made available in 2009. KryoFlux is very powerful, allowing users of contemporary Windows, Mac, and Linux machines to interface with legacy floppy drives via a USB port. The KryoFlux does not attempt to mount a floppy disk’s file system to the host computer, granting two chief affordances: users can acquire data (a) independent of their host computer’s file system, and (b) without necessarily knowing the particulars of the disk in question. The latter is particularly useful when attempting to analyze less stable media.

Despite the powerful utility of KryoFlux, uptake among archives and digital preservation programs has been hampered by a lack of accessible documentation and training resources. The official documentation and user forums assume a level of technical knowledge largely absent from traditional archival training. Following several informal conversations at Stanford University’s Born-Digital Archives eXchange events in 2015 and 2016, as well as discussions at various events hosted by the BitCurator Consortium, we formed a working group that included archivists and archival studies students from Emory University, the University of California Los Angeles, Yale University, Duke University, and the University of Texas at Austin to create user-friendly documentation aimed specifically at archivists.

Development of The Archivist’s Guide to KryoFlux began in 2016, with a draft released on Google Docs in Spring 2017. The working group invited feedback over a 6-month comment period and were gratified to receive a wide range of comments and questions from the community. Informed by this incredible feedback, a revised version of the Guide is now hosted on GitHub and available for anyone to use, though the use cases described are generally those encountered by archivists working with born-digital collections in institutional and manuscript repositories.

The Guide is written in two parts. “Part One: Getting Started” provides practical guidance on how to set up and begin using the KryoFlux and aims to be as inclusive and user-friendly as possible. It includes instructions for running KryoFlux using both Mac and Windows operating systems. Instructions for running KryoFlux using Linux are also provided, allowing repositories that use BitCurator (an Ubuntu-based open-source suite of digital archives tools) to incorporate the KryoFlux into their workflows.

“Part Two: In Depth” examines KryoFlux features and floppy disk technology in more detail. This section introduces the variety of floppy disk encoding formats and provides guidance as to how KryoFlux users can identify them. Readers can also find information about working with 40-track floppy disks. Part Two covers KryoFlux-specific output too, including log files and KryoFlux stream files, and suggests ways in which archivists might make use of these files to support digital preservation best practices. Short case studies documenting the experiences of archivists at other institutions are also included here, providing real-life examples of KryoFlux in action.

As with any technology, the KryoFlux hardware and software will undergo updates and changes in the future which will, if we are not careful, have an effect on the currency of the Guide. In an attempt to address this possibility, the working group have chosen to host the guide as a public GitHub repository. This platform supports versioning and allows for easy collaboration between members of the working group. Perhaps most importantly, GitHub supports the integration of community-driven contributions, including revisions, corrections, and updates. We have established a process for soliciting and reviewing additional contributions and corrections (short answer: submit a pull request via GitHub!), and will annually review the membership of an ongoing working group responsible for monitoring this work to ensure that the Guide remains actively maintained for as long as humanly possible.


On this year’s World Digital Preservation Day, the Digital Preservation Coalition presented The Archivist’s Guide to KryoFlux with the 2018 Digital Preservation Award for Teaching and Communications. It was truly an honor to be recognized alongside the other very worthy finalists, and a cherry-on-top for what we hope will remain a valuable resource for years to come.


[Matthew] Farrell is the Digital Records Archivist in Duke University’s David M. Rubenstein Rare Book & Manuscript Library. Farrell holds an MLS from the University of North Carolina at Chapel Hill.


Shira Peltzman is the Digital Archivist for the UCLA Library where she leads a preservation program for Library Special Collections’ born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts, and was a member of the inaugural cohort of the National Digital Stewardship Residency in New York (NDSR-NY).

Trained in Classification, Without Classification

by Ashley Blewer

This is the first post in the bloggERS Making Tech Skills a Strategic Priority series.

Hi, SAA ERS readers! My name is Ashley Blewer, and I am sort of an archivist, sort of a developer, and sort of something else I haven’t quite figured out what to call. I work for a company called Artefactual Systems, and we make digital preservation and access software called Archivematica and AtoM (Access to Memory), respectively. My job title is AV Preservation Specialist, which is true, that is what I specialize in, and maybe that fulfills part of that “sort of something else I haven’t quite figured out.” I’ve held a lot of different roles in my career, as digital preservation consultant, open source software builder and promoter, developer at a big public library, archivist at a small public film archive, and other things. This, however, is my first time working for an open source technology company that makes software used by libraries, archives, museums, and other organizations in the cultural heritage sector. I think this is a rare vantage point from which to look at the field and its relationship to technology, and I think that even within this rare position, we have an even more unique culture and mentality around archives and technology that I’d like to talk about.

Within Archivematica, we have a few loosely defined types of jobs. There are systems archivists, whom we speak of internally as analysts; there are developers (software engineers); and there are also systems operations folks (systems administrators and production support engineers). We have a few other roles that sit more at the executive level, but there isn’t a wall between any of these roles, as even those who are classified as being “in management” also work as analysts or systems engineers when called upon to do so. My role also sits between a lot of these loosely defined roles. I suppose I am technically classified as an analyst, and I run with the fellow analyst crew: I attend their meetings, work directly with clients, and handle other preservation-specific duties. But I also have software development skills, and can perform more traditionally technical tasks like writing code, changing how things function at an infrastructure level, and reviewing and testing the code that has been written by others. I’m still learning the ropes (I have been at the organization full-time for 4 months), but I am increasingly able to do some simple system administration tasks too, mostly for clients that need me to log in and check out what’s going on with their systems. This seems to be a way in which roles at my company and within the field (I hope) are naturally evolving. Another example is my brilliant colleague Ross Spencer, who works as a software engineer but has a long-established career working within the digital preservation space, so he definitely lends a hand providing crucial insight when doing “analyst-style” work.

We are a technical company, and everyone on staff brings some of the skills that are essential to a well-rounded digital preservation systems infrastructure. For example, all of us know how to use Git (a version control system made popular by GitHub) and we use it as a regular part of our job, whether we are writing code or writing documentation for how to use our software. But “being technical” or having technical literacy skills involves much, much more than writing code. My fellow analysts have to do highly complex and nuanced workflow development and data mapping work, figure out niche bugs associated with some of the microservices we run, and articulate in common human language some of the very technical parts of a large software system. I think Artefactual’s success as a company comes from the collective ability to foster a safe, warm, and collaborative environment that allows anyone on the team to get the advice or support they need to understand a technical problem, and use that knowledge to better support our software, every Archivematica user (client or non-client), and the larger digital preservation community. This is the most important part of any technical initiative or training, and it is the most fundamental component of any system.

I don’t write this as a representative for Artefactual, but as myself, a person who has held many different roles at many different institutions all with different relationships to technology, and this has by far been the most healthy and on-the-job educational experience I have had, and I think those two things go hand-in-hand. I can only hope that other organizations begin to narrow the line between “person who does archives work” and “technical person” in a way that supports collaboration and cross-training between people coming into the field with different backgrounds and experiences. We are all in this together, and the only way we are gonna get things done is if we work closely together.



Ashley works at Artefactual Systems as their AV Preservation Specialist, primarily on the Archivematica project. She specializes in time-based media preservation, digital repository management, infrastructure/community building, computer-to-human interpretation, and teaching technical concepts. She is an active contributor to MediaArea’s MediaConch, an open source digital video file conformance checker software project, and Bay Area Video Coalition’s QCTools, an open source digitized video analysis software project. She holds Master of Library and Information Science (Archives) and Bachelor of Arts (Graphic Design) degrees from the University of South Carolina.

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual Meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity “Survey the collection” includes tasks like “Determine total extent of digital material” and “Determine estimated date range.”

The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files
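
To make one of these activities concrete, here is a minimal sketch of what “Create checksums for transfer, preservation, and access copies” could look like in practice. It is not part of the framework, which deliberately does not recommend specific tools; the function names and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths from `before` that are missing or altered in `after`."""
    return sorted(path for path, checksum in before.items()
                  if after.get(path) != checksum)
```

Building a manifest at transfer and again after each copy, then comparing the two with `changed_files`, is one simple way to confirm that nothing was altered or lost along the way.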

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers. These tiers include: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.
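
For readers who think in code, the tier logic described above can be sketched roughly as follows. The tier names come from the framework, but the task-to-tier assignments here are hypothetical placeholders (the framework itself is the authority on which task sits where), and treating every tier as cumulative is an assumption that extends the Baseline rule stated above.

```python
from enum import IntEnum


class Tier(IntEnum):
    BASELINE = 1   # recommended minimum for born-digital content
    MODERATE = 2   # higher value, risk, or access needs
    INTENSIVE = 3  # exceptional warrant only


# Hypothetical assignments, for illustration only.
TASK_TIERS = {
    "Assign a unique identifier to each piece of digital media": Tier.BASELINE,
    "Create suitable housing for digital media": Tier.MODERATE,
    "Placeholder task for exceptional collections": Tier.INTENSIVE,
}


def tasks_for(tier: Tier) -> list[str]:
    """Return every task at or below the chosen tier (tiers are cumulative)."""
    return [task for task, level in TASK_TIERS.items() if level <= tier]
```

Calling `tasks_for(Tier.MODERATE)` returns the Baseline task as well as the Moderate one. An institution that re-defines which tasks fall into which tier would only need to edit the `TASK_TIERS` mapping.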

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

The Top 10 Things We Learned from Building the Queer Omaha Archives, Part 2 – Lessons 6 to 10

by Angela Kroeger and Yumi Ohira

The Queer Omaha Archives (QOA) is an ongoing effort by the University of Nebraska at Omaha Libraries’ Archives and Special Collections to collect and preserve Omaha’s LGBTQIA+ history. This is still a fairly new initiative at the UNO Libraries, having been launched in June 2016. This blog post is adapted and expanded from a presentation entitled “Show Us Your Omaha: Combatting LGBTQ+ Archival Silences,” originally given at the June 2017 Nebraska Library Association College & University Section spring meeting. The QOA was only a year old at that point, and now that another year (plus change) has passed, the collection has continued to grow, and we’ve learned some new lessons.

So here are the top takeaways from UNO’s experience with the QOA.

#6. Words have power, and sometimes also baggage.

Words sometimes mean different things to different people. Each person’s life experience lends context to the way they interpret the words they hear. And certain words have baggage.

We named our collecting initiative the Queer Omaha Archives because in our case, “queer” was the preferred term for all LGBTQIA+ people as well as referring to the academic discipline of queer studies. In the early 1990s, the community in Omaha most commonly referred to themselves as “gays and lesbians.” Later on, bisexuals were included, and the acronym “GLB” came into more common use. Eventually, when trans people were finally acknowledged, it became “GLBT.” Then there was a push to switch the order to “LGBT.” And then more letters started piling on, until we ended up with the LGBTQIA+ commonly used today. Sometimes, it is taken even further, and we’ve seen LGBTQIAPK+, LGBTQQIP2SAA, LGBTQIAGNC, and other increasingly long and difficult-to-remember variants. (Although, Angela confesses to finding QUILTBAG to be quite charming.) The acronym keeps shifting, but we didn’t want our name to shift, so we followed the students’ lead (remember the QTS “Cuties”?) and just went with “queer.” “Queer” means everyone.

Except . . . “queer” has baggage. Heavy, painful baggage. At Pride 2016, an older man, who had donated materials to our archive, stopped by our booth and we had a conversation about the word. For him, it was the slur his enemies had been verbally assaulting him with for decades. The word still had a sharp edge for him. Angela (being younger than this donor, but older than most of the students on campus) acknowledged that they were just old enough for the word to be startling and sometimes uncomfortable. But many Millennials and Generation Z youths, as well as some older adults, have embraced “queer” as an identity. Notably, many of the younger people on campus have expressed their disdain for being put into boxes. Identifying as “gay” or “lesbian” or “bi” seems too limiting to them. Our older patron left our booth somewhat comforted by the knowledge that for much of the population, especially the younger generations, “queer” has lost its sting and taken on a positive, liberating openness.

But what about other LGBTQIA+ people who haven’t stopped by to talk to us, to learn what we mean when we call our archives “queer”? Who feels sufficiently put off by this word that they might choose against sharing their stories with our archive? We aren’t planning to change our name, but we are aware that our choice of word may give some potential donors and potential users a reason to hesitate before approaching us.

So whatever community your archive serves, think about the words that community uses to describe themselves, and the words others use to describe them, and whether those words might have connotations you don’t intend.

#7. Find your community. Partnerships, donors, and volunteers are the keys to success.

It goes without saying that archives are built from communities. We don’t (usually) create the records. We invite them, gather them, describe them, preserve them, and make them available to all, but the records (usually) come from somewhere else.

Especially if you’re building an archive for an underrepresented community, you need buy-in from the members of that community if you want your project to be successful. You need to prove that you’re going to be trustworthy, honorable, reliable stewards of the community’s resources. You need someone in your archive who is willing and able to go out into that community and engage with them. For us, that was UNO Libraries’ Archives and Special Collections Director Amy Schindler, who has a gift for outreach. Though herself cis and straight, she has put in the effort to prove herself a friend to Omaha’s LGBTQIA+ community.

You also need members of that community to be your advocates. For us, our advocates were our first donors, the people who took that leap of faith and trusted us with their resources. We started with our own university. The work of the archivist and the archives would not have been possible without the collaboration and support of campus partners. UNO GSRC Director Dr. Jessi Hitchins and UNO Sociology Professor Dr. Jay Irwin together provided the crucial mailing list for the QOA’s initial publicity and networking. Dr. Irwin and his students collected the interviews which launched the LGBTQ+ Voices oral history project. Retired UNO professor Dr. Meredith Bacon donated her personal papers and extensive library of trans resources. From outside the UNO community, Terry Sweeney, who with his partner Pat Phalen had served as editor of The New Voice of Nebraska, donated a complete set of that publication, along with a substantial collection of papers, photographs, and artifacts, and he volunteered in the archives for many weeks, creating detailed and accurate descriptions of the items. These four people, and many others, have become our advocates, friends, and champions within the Omaha LGBTQIA+ community.

Our lesson here: Find your champions. Prove your trustworthiness, and your champions will help open doors into the community.

#8. Be respectful, be interested, and be present.

Outreach is key to building connections, bringing in both donors and users for the collection. This isn’t Field of Dreams, where “If you build it, they will come.” You need to forge the partnerships first, in order to even be able to build it. And they won’t come if they don’t know about it and don’t believe in its value. (“They” in this case meaning the community or communities your archives serve, and “it” of course meaning your archives or special collections for that community.)

Fig. 3: UNO Libraries table at a Transgender Day of Remembrance event.

Yumi and Angela are both behind-the-scenes technical services types, so we don’t have quite as much contact with patrons and donors as some others in our department, but we’ve helped out staffing tables at events, such as Pride, Transgender Day of Remembrance, and Transgender Day of Visibility events. We also work to create a welcoming atmosphere for guests who come to the archives for events, tours, and research. We recognize the critical importance of the work our director does, going out and meeting people, attending events, talking to everyone, and inviting everyone to visit. As our director Amy Schindler said in the article “Collaborative Approaches to Digital Projects,” “Engagement with community partners is key . . .”

There’s also something to be said for simply ensuring that folks within the archives, and the library as a whole for that matter, have a basic awareness of the QOA and other collecting initiatives, so that we can better fulfill our mission of connecting people to the resources they need. After all, when someone walks into the library, before they even reach the archives, any staff member might be their first point of contact. So be sure to do outreach within your own institution, as well.

#9. Let me sing you the song of my administrative support.

The QOA was conceived through the efforts of UNO students, UNO employees, and Omaha communities to address the underrepresentation of the LGBTQIA+ communities in Omaha.

The QOA initiative was inspired by Josh Burford, who delivered a presentation about collecting and archiving historical materials related to queering history. This presentation was co-hosted by UNO’s Gender and Sexuality Resource Center during LGBTQIA+ History Month. After this event, the UNO community became keenly interested in collecting and preserving historical materials and oral history interviews about “Queer Omaha,” and began collaborating with our local LGBTQIA+ communities. In Summer 2016, the QOA was officially launched to preserve the enduring value of the legacy of LGBTQIA+ communities in greater Omaha. The QOA is an effort to combat an archival silence in the community, and digital collections and digital engagement are especially effective tools for making LGBTQIA+ communities aware that the archives welcome their records!

But none of this would have been possible without administrative support. If the library administration or the university administration had been indifferent (or worse, hostile) to the idea of building an LGBTQIA+ archive, we might not have been allowed to allocate staff time and other resources to this project. We might have been told “no.” Thank goodness, our library dean is 100% on board. And UNO is deeply committed to inclusion as one of our core values, which has created a campus-wide openness to the LGBTQIA+ community, resulting in an environment perfectly conducive to building this archive. In fact, in November 2017, UNO was identified as the most LGBTQIA+-friendly college in the state of Nebraska by the Campus Pride Index in partnership with BestColleges.com. An initiative like the QOA will always go much more smoothly when your administration is on your side.

#10. The Neverending Story.

We recognize that we still have a long way to go. There are quite a few gaps within our collection. We have the papers of a trans woman, but only oral histories from trans men. We don’t yet have anything from intersex folks or asexuals. We have very little from queer people of color or queer immigrants, although we do have some oral histories from those groups, thanks to the efforts of Dr. Jay Irwin, who launched the oral history project, and Luke Wegener, who was hired as a dedicated oral history associate for UNO Libraries. A major focus of the LGBTQIA+ oral history interview project is filling identified gaps within the collection, actively seeking more voices of color and other underrepresented groups within the LGBTQIA+ community. However, despite our efforts to increase the diversity within the collection, we haven’t successfully identified potential donors or interviewees to represent all of the letters within LGBTQIA, much less the +.

This isn’t, and should never be, a series of checkboxes on a list. “Oh, we have a trans woman. We don’t need any more trans women.” No, that’s not how it works. We seek to fill the gaps, while continuing to welcome additional material from groups already represented. We are absolutely not going to turn away a white cis gay man just because we already have multiple resources from white cis gay men. Every individual is different. Every individual brings a new perspective. We want our collection to include as many voices as possible. So we need to promote our collection more. We need to do more outreach. We need to attract more donors, users, and champions. This will remain an ongoing effort without an endpoint. There is always room for growth.


Angela Kroeger is the Archives and Special Collections associate at the University of Nebraska at Omaha and a lifelong Nebraskan. They received their B.A. in English from the University of Nebraska at Omaha and their Master’s in Library and Information Science from the University of Missouri.

Yumi Ohira is the Digital Initiatives Librarian at the University of Nebraska at Omaha. Ohira is originally from Japan where she received a B.S. in Applied Physics from Fukuoka University. Ohira moved to the United States to attend University of Kansas and Southern Illinois University-Carbondale where she was awarded an M.F.A. in Studio Art. Ohira went on to study at Emporia State University, Kansas, where she received an M.L.S. and Archive Studies Certification.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.

Step 2: Get refID and number to start appending to file

The script then starts a while loop, and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script then asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The starting number is needed because the filename for each clip is the original refID, followed by an underscore, followed by a number; asking for it allows more clips to be made from the same original file when the script is run again later.
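To make the naming rule concrete, here is a minimal sketch of how clip filenames could be built from a refID and a starting number. The function names, the .mp4 extension for clips, and the lack of zero-padding are assumptions for illustration; the actual script may do this differently.

```python
def clip_filename(ref_id, number, extension=".mp4"):
    """Build a clip filename: refID, underscore, number, extension.

    The ".mp4" default is an assumption based on the access copies
    being .mp4 files; it is not confirmed for the clips themselves.
    """
    return f"{ref_id}_{number}{extension}"


def clip_filenames(ref_id, start_number, count):
    """Name `count` consecutive clips, starting at `start_number`."""
    return [clip_filename(ref_id, start_number + offset) for offset in range(count)]
```

If three clips already exist for a given refID, a later run would pass 4 as the starting number so that no earlier clip is overwritten.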

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
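The duration check and clip creation might look roughly like the sketch below. Note two swaps: it uses the subprocess module rather than os, and the specific ffprobe/ffmpeg arguments are one common way to do this, not necessarily the ones in the script. All function names are hypothetical.

```python
import subprocess


def probe_duration_seconds(path):
    """Ask ffprobe (installed alongside ffmpeg) for a file's duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def needs_hours_prompt(duration_seconds):
    """Only ask for hours when the original runs an hour or longer."""
    return duration_seconds >= 3600


def to_timestamp(hours, minutes, seconds):
    """Format user input as the HH:MM:SS string that ffmpeg accepts."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def build_clip_command(source, destination, start, length):
    """Build an ffmpeg command that copies `length` of video starting at `start`."""
    return ["ffmpeg", "-ss", start, "-i", source,
            "-t", length, "-c", "copy", destination]
```

A command built this way can be run with subprocess.run(command, check=True). Because "-c copy" avoids re-encoding, cuts land on the nearest keyframe, so the clip may start slightly off the requested time; dropping "-c copy" trades speed for precision.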

Clip from Rockefeller Family at Pocantico – Part I , circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
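As a rough illustration of this lookup, the sketch below assumes `client` is an already-authorized ArchivesSnake client (or anything with a compatible `get` method) and that the repository ID is known. The function names and the returned dictionary keys are hypothetical, and the field choices (for example, using the first date expression, or `component_id` as the item identifier) are assumptions about how the metadata is recorded.

```python
def first_note_text(notes, note_type):
    """Join the subnote text of the first note of the given type."""
    for note in notes:
        if note.get("type") == note_type:
            parts = [sub.get("content", "") for sub in note.get("subnotes", [])]
            return " ".join(part for part in parts if part)
    return ""


def fetch_item_metadata(client, repo_id, ref_id):
    """Look up an archival object by refID and gather descriptive fields."""
    response = client.get(
        f"repositories/{repo_id}/find_by_id/archival_objects",
        params={"ref_id[]": ref_id, "resolve[]": "archival_objects"},
    )
    matches = response.json().get("archival_objects", [])
    if not matches:
        raise LookupError(f"No archival object found for refID {ref_id}")
    item = matches[0]["_resolved"]
    # The item points to its parent resource (the collection) by URI.
    collection = client.get(item["resource"]["ref"]).json()
    dates = item.get("dates", [])
    return {
        "title": item.get("title", ""),
        "date": dates[0].get("expression", "") if dates else "",
        "identifier": item.get("component_id", ""),
        "scope_and_contents": first_note_text(item.get("notes", []), "scopecontent"),
        "collection_title": collection.get("title", ""),
        "collection_identifier": collection.get("id_0", ""),
    }
```

Passing the client in as an argument keeps the lookup easy to test with a stand-in object, without a live ArchivesSpace instance.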

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
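A minimal sketch of this formatting step, using Python's built-in csv module, could look like the following. The function names and column names are hypothetical, and the HH:MM:SS duration format is an assumption about what “formatting duration length” means here.

```python
import csv


def flatten_note(text):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return " ".join(text.split())


def format_duration(total_seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def write_row(handle, row, fieldnames, write_header=False):
    """Write one row of clip metadata to an open csv file handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    writer.writerow(row)
```

When appending to an existing spreadsheet, the file would be opened with mode "a" and newline="" so that the csv module controls line endings and earlier rows are preserved.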

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.
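The branching at the top of the loop can be sketched as a small function that maps the user's answer to the next action. The function name and the returned labels are hypothetical, and re-asking on unrecognized input is an assumption; the post does not say how the script handles other answers.

```python
def interpret_choice(answer):
    """Map the answer to the "New refID? y/n/q" prompt onto the next action."""
    normalized = answer.strip().lower()
    if normalized in {"y", "yes"}:
        return "new_ref_id"    # ask for a refID and a starting number
    if normalized in {"n", "no"}:
        return "same_ref_id"   # go straight to the clip prompts
    if normalized in {"q", "quit"}:
        return "quit"          # end the script
    return "ask_again"         # assumption: unrecognized input re-prompts
```

In a while loop, the script would call something like interpret_choice(input("New refID? y/n/q ")) and break when the result is "quit".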

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

Call for Contributions: Making Tech Skills a Strategic Priority

As a follow-up to our popular Script It! Series — which attempted to break down barriers and demystify scripting with walkthroughs of simple scripts — we’re interested in learning more about how archival institutions (as such) encourage their archivists to develop and promote their technical literacy more generally. As Trevor Owens notes in his forthcoming book, The Theory and Craft of Digital Preservation, “the scale and inherent structures of digital information suggest working more with a shovel than with a tweezers.” Encouraging archivists to develop and promote their technical literacy is one such way to use a metaphorical shovel!

Maybe you work for an institution that explicitly encourages its employees to learn new technical skills. Maybe your team or institution has made technical literacy a strategic priority. Maybe you’ve formed a collaborative study group with your peers to learn a programming language. Whatever the case, we want to hear about it!

Writing for bloggERS! “Making Tech Skills a Strategic Priority” Series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in late November or December, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro to both writing and running bash shell scripts (bash being one of the most popular Unix shells, if not the most popular, and the one BitCurator uses by default), check out this tutorial by Tania Rascia.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script grouping multiple commands together and running that group over a set of items.

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.

Improving Workflows at UNC Libraries’ Wilson Special Collections Library

by Erica Titkemeyer and Jessica Venlet

This is the tenth post in the bloggERS Script It! Series.

At Wilson Special Collections Library, we are always trying to find ways to improve our digital preservation workflows. Improving our skills with the command line and using existing command line tools has played a key role in workflow improvements. So, we’ve picked a few favorite tools and tips to share.

FFmpeg

We use FFmpeg for a number of daily tasks, whether it’s generating derivatives for preservation files or analyzing a video or audio file we’ve received through a born-digital accession.

Clearing embedded metadata and uses for FFprobe:
As part of our audio digitization work, we embed metadata into all preservation WAV files. This metadata follows guidelines set out by the Federal Agencies Digital Guidelines Initiative (FADGI) and mostly relates the file back to the original item it was digitized from, including its unique identifier, title, and the curatorial unit it is held by. A few times now, we have found that this data is reflected inconsistently in the file, that the data itself is incorrect, or that the data is insufficient.

WAV file metadata
Look at that terrible metadata!

When large-scale issues have come up, particularly with legacy files in our backlog, we’ve made use of FFmpeg’s ‘-map_metadata’ option to batch delete the embedded metadata. Below is a script used to batch create brand new files without metadata, with “_clean” attached to their original file name:

for i in *.wav; do ffmpeg -i "$i" -map_metadata -1 -c:a copy "${i%.wav}_clean.wav"; done
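For anyone more comfortable in Python than in the shell, the same batch job can be sketched as below. This is an illustration rather than what we run in production: the function name is hypothetical, and unlike the shell loop it skips files already ending in “_clean” so that a second run does not produce “_clean_clean” copies.

```python
from pathlib import Path


def clean_copy_commands(directory):
    """Build one ffmpeg command per .wav file to write a metadata-free copy."""
    commands = []
    for source in sorted(Path(directory).glob("*.wav")):
        if source.stem.endswith("_clean"):
            continue  # assumption: leave earlier output alone on a re-run
        target = source.with_name(f"{source.stem}_clean.wav")
        commands.append([
            "ffmpeg", "-i", str(source),
            "-map_metadata", "-1",  # drop all embedded metadata
            "-c:a", "copy",         # copy the audio stream without re-encoding
            str(target),
        ])
    return commands
```

Each command can then be run with subprocess.run(command, check=True). Building the list first makes it easy to print and review the commands before anything is written.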

After successfully removing metadata from the files, we use the tool BWF MetaEdit to batch embed the correct metadata that we have prepared in a .csv file.

For born-digital work, we regularly use the tool/command ‘ffprobe’, a stream analyzer that is part of the FFmpeg build. It allows us to quickly see data about AV files (such as duration, file size, codecs, aspect ratio, etc.) that are helpful in identifying files and making general appraisal decisions. As we grow our capabilities in preserving born digital AV, we also foresee the need to document this type of file data in our ingest documentation.
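If this kind of file data needs to be recorded rather than just read off the screen, ffprobe can emit JSON that a short script can summarize. The sketch below is one possible approach, with hypothetical function names and an arbitrary choice of summary fields.

```python
import json
import subprocess


def ffprobe_json(path):
    """Run ffprobe on a file and return its format and stream data as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(probe):
    """Pull a few appraisal-friendly values out of ffprobe's JSON output."""
    fmt = probe.get("format", {})
    streams = probe.get("streams", [])
    return {
        "duration_seconds": float(fmt["duration"]) if "duration" in fmt else None,
        "size_bytes": int(fmt["size"]) if "size" in fmt else None,
        "codecs": [s.get("codec_name") for s in streams],
        "aspect_ratios": [s["display_aspect_ratio"] for s in streams
                          if "display_aspect_ratio" in s],
    }
```

Calling summarize(ffprobe_json("example.mov")) on each file in an accession would give rows that could be dropped into ingest documentation.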

walk_to_dfxml.py

In our born-digital workflows we don’t disk image every digital storage device we receive by default. This workflow choice has benefits and disadvantages. One disadvantage is losing the ability to quickly document all timestamps associated with files. While our workflows were preserving last modified dates, other timestamps like access or creation dates were not as effectively captured. In search of a way to remedy this issue, I turned to Twitter for some advice on the capture and value of each timestamp. Several folks recommended generating DFXML, which is usually used on disk images. Tim Walsh helpfully pointed to a Python script, walk_to_dfxml.py, that can generate DFXML on directories instead of disk images. Workflow challenge solved!
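To show what “capturing timestamps from a directory” involves, here is a small sketch that walks a folder and records each file's timestamps with Python's os module. This is not walk_to_dfxml.py and does not produce DFXML; the function names and output keys are hypothetical.

```python
import os
from datetime import datetime, timezone


def iso(timestamp):
    """Convert a POSIX timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def collect_timestamps(root):
    """Record modified, accessed, and changed/created times for every file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "path": os.path.relpath(path, root),
                "modified": iso(info.st_mtime),
                "accessed": iso(info.st_atime),
                # st_ctime is metadata-change time on Unix, creation time on Windows
                "changed_or_created": iso(info.st_ctime),
            })
    return records
```

One caveat worth remembering: copying files or even opening them can alter access times, so timestamps are best captured before other processing touches the files.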

DFXML output example

Brunnhilde

Brunnhilde is another tool that was particularly helpful in consolidating tasks and tools. By kicking off Brunnhilde in the command line, we are able to: check for viruses, create checksums, identify file formats, identify duplicates, create a manifest, and run a PII scan. Additionally, this tool gives us a report that is useful for digital archives specialists, but also holds potential as an appraisal tool for consultations with curators. We’re still working out that aspect of the workflow, but when it comes to the technical steps, Brunnhilde and the associated command line tools it includes have really improved our processing work.

Learning as We Go

Like many archivists, we had limited experience with using the command line before graduate school. In the course of our careers, we’ve had to learn a lot on the fly because so many great command line tools are essential for working with digital archives.

One thing that can be tricky when you are new is moving the cursor around the terminal easily. It seems like it should be a no brainer, but it’s really not so obvious.

  • For Macs, see the excellent Script Ahoy resource:
  • For PC, see this resource for a variety of shortcuts including moving. In general:
    • Home key moves to beginning. End key moves to the end.
    • Ctrl + left or right arrow moves the cursor around in chunks

Another helpful set of commands are remove (rm) and move (mv). We use these when dealing with extraneous files created through quality control applications in our AV workflow that we’d like to delete quickly, or when we need to separate derivatives (such as mp3s) from a large batch of preservation files (wav).

    • Important note about rm: it’s always smart to first use ‘echo’ to see what files you would be removing with your command (ex: echo rm *.lvl would list all the .lvl files that would be removed by your command).
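The same “look before you delete” habit carries over to scripting. As a minimal sketch with hypothetical function names, a Python cleanup routine can default to a dry run that only lists matching files, and delete them only when explicitly told to.

```python
from pathlib import Path


def files_matching(directory, pattern):
    """List the files in a directory that match a glob pattern such as '*.lvl'."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())


def remove_files(directory, pattern, dry_run=True):
    """Return the matching files, deleting them only when dry_run is False."""
    matches = files_matching(directory, pattern)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Running remove_files("qc_output", "*.lvl") first and reviewing the returned list plays the same role as putting echo in front of rm; the folder name here is only an example.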

If you are just starting out, you may consider exploring online tutorials or guides like:


Erica Titkemeyer is the Project Director and Audiovisual Conservator for the Southern Folklife Collection at the UNC Wilson Special Collections Library, coordinating in-house digitization and outsourcing of audiovisual materials for preservation. Erica also participates in the improvement of online access and digital preservation for digitized materials.

Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the acquisition and management of born-digital university records. She earned a Master of Science in Information degree from the University of Michigan.