Using R to Migrate Box and Folder Lists into EAD

by Andy Meyer

Introduction

This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hope that other institutions might find the approach helpful and that it could be generalized to other issues facing archives.

I decided to use R because it is a free and open source programming language that I had some prior experience using. R has a large and active user community as well as a large number of relevant packages that extend its basic functions, including libraries that can read Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.

The task that sparked this script arose when I inherited hundreds of finding aids with minimal collection-level information and very long and detailed box and folder lists. These were all Microsoft Word documents with the box and folder list formatted as a table within the document. We had recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate the work. The code is organized into three sections: exporting the data, transforming and cleaning the data, and creating an EAD file to load into ArchivesSpace.

Data Export

After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.

This process was surprisingly straightforward: I created a variable with the path to a Word document and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of the table into a data.frame in R. Because our finding aids were sometimes inconsistent, I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step is a table with four columns: folder title, date, box number, and folder number.
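As a rough illustration of this step, the sketch below reads one table from a Word file with docxtractr and coerces it to the four standard columns. The file name, the assumed column order, and the helper names `standardize_columns()` and `extract_box_list()` are hypothetical choices made for this example; they are not taken from the project's repository.

```r
# Hypothetical helper: coerce an extracted table to the four expected columns.
# Assumes the columns arrive in the order title, date, box, folder.
standardize_columns <- function(tbl) {
  tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
  expected <- c("title", "date", "box", "folder")
  # Pad with empty columns when a finding aid omitted one (e.g. folder numbers).
  while (ncol(tbl) < length(expected)) {
    tbl[[paste0("missing_", ncol(tbl) + 1)]] <- NA_character_
  }
  tbl <- tbl[, seq_along(expected), drop = FALSE]
  names(tbl) <- expected
  # Trim stray whitespace left over from the Word table cells.
  tbl[] <- lapply(tbl, function(x) trimws(as.character(x)))
  tbl
}

# Hypothetical wrapper around docxtractr; which table to read is an assumption.
extract_box_list <- function(path, tbl_number = 1) {
  doc <- docxtractr::read_docx(path)
  raw <- docxtractr::docx_extract_tbl(doc, tbl_number = tbl_number, header = TRUE)
  standardize_columns(raw)
}

# Usage with a made-up file name:
# folders <- extract_box_list("finding_aids/example_collection.docx")
```

Keeping the column clean-up in its own function means the same helper can be reused on tables that come from other sources.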

This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.

Data Transformation and Cleaning

Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:

  1. Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
  2. Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”, “N.D.”, “no date”, “N/A”, “NDG”, “Various”, “N. D.”, “” (an empty field), “??”, “n. d.”, “n. d. ” (with a trailing space), “No date”, “-”, “N.A.”, “ND”, “NO DATE”, and “Unknown”. For all of these, I updated the date field to “Undated” as a way to standardize this field.
  3. Spelling out abbreviations. Occasionally, I would use regular expressions to spell out words in the title field. This could be standard terms like “Corresp” to “Correspondence” or local terms like “NPU” to “North Park University.”
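The three cleanup steps above could be sketched in base R roughly as follows. The function names, the regular expressions, and the rule of taking the earliest and latest four-digit year as the collection date range are assumptions for illustration only; the list of “no date” terms comes from the post.

```r
# Variants of "no date" found in the legacy finding aids (list from the post).
no_date_terms <- c("n.d.", "N.D.", "no date", "N/A", "NDG", "Various", "N. D.",
                   "", "??", "n. d.", "No date", "-", "N.A.", "ND", "NO DATE",
                   "Unknown")

# Replace any "no date" variant (or a missing value) with "Undated".
standardize_dates <- function(dates) {
  dates <- trimws(dates)
  dates[is.na(dates) | dates %in% no_date_terms] <- "Undated"
  dates
}

# Spell out abbreviations in titles; these two patterns are illustrative only.
expand_abbreviations <- function(titles) {
  titles <- gsub("\\bCorresp\\b\\.?", "Correspondence", titles, perl = TRUE)
  titles <- gsub("\\bNPU\\b", "North Park University", titles, perl = TRUE)
  titles
}

# Guess a collection date range from every four-digit year in the date column.
collection_date_range <- function(dates) {
  dates <- dates[!is.na(dates)]
  hits <- regmatches(dates, gregexpr("\\b(1[6-9]|20)[0-9]{2}\\b", dates, perl = TRUE))
  years <- as.integer(unlist(hits))
  if (length(years) == 0) return(NA_character_)
  if (min(years) == max(years)) return(as.character(min(years)))
  paste0(min(years), "-", max(years))
}

dates <- standardize_dates(c("1950-1955", "n.d.", "March 3, 1948", " N. D. "))
titles <- expand_abbreviations(c("Corresp., 1950", "NPU Board minutes"))
date_range <- collection_date_range(dates)
```

A year-based guess like this will fail on unstandardized dates, which matches the post's note that some collections needed the range defined manually.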

R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.

Create EAD to Load into ArchivesSpace

Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.

Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD structure in R using the “newXMLNode” function from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes, and applied them to the data exported and transformed in earlier steps. Finally, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
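A minimal sketch of this kind of node-building, assuming the XML package is installed, might look like the following. The element layout loosely follows EAD 2002 (`archdesc`, `did`, `dsc`, `c`, `container`), but the function names and sample values are hypothetical, and the tree omits the `eadheader` and other elements that a real ArchivesSpace import would likely need.

```r
library(XML)

# Build one <c> component for a folder and attach it to the given parent node.
make_folder_node <- function(title, date, box, folder, parent) {
  c_node <- newXMLNode("c", attrs = c(level = "file"), parent = parent)
  did <- newXMLNode("did", parent = c_node)
  newXMLNode("unittitle", title, parent = did)
  newXMLNode("unitdate", date, parent = did)
  newXMLNode("container", box, attrs = c(type = "box"), parent = did)
  newXMLNode("container", folder, attrs = c(type = "folder"), parent = did)
  c_node
}

# Assemble a skeletal EAD tree from collection-level values and a folder table.
build_ead <- function(coll_title, coll_id, coll_dates, folders) {
  ead <- newXMLNode("ead")
  archdesc <- newXMLNode("archdesc", attrs = c(level = "collection"), parent = ead)
  did <- newXMLNode("did", parent = archdesc)
  newXMLNode("unittitle", coll_title, parent = did)
  newXMLNode("unitid", coll_id, parent = did)
  newXMLNode("unitdate", coll_dates, parent = did)
  dsc <- newXMLNode("dsc", parent = archdesc)
  for (i in seq_len(nrow(folders))) {
    make_folder_node(folders$title[i], folders$date[i],
                     folders$box[i], folders$folder[i], parent = dsc)
  }
  ead
}

# Made-up sample data standing in for the cleaned box and folder list.
folders <- data.frame(title = c("Correspondence", "Minutes"),
                      date = c("1950", "Undated"),
                      box = c("1", "1"),
                      folder = c("1", "2"),
                      stringsAsFactors = FALSE)
ead <- build_ead("Example Collection", "EX-001", "1950", folders)
xml_text <- saveXML(ead)  # serialize the tree to a character string
# writeLines(xml_text, "example_ead.xml")  # uncomment to write a file
```

Passing the parent node into `make_folder_node()` keeps each folder self-contained, so the same function could later be nested under series-level components.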

Conclusion

Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.


Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.

Just do it: Building technical capacity among Princeton’s Archival Description and Processing Team

by Alexis Antracoli

This is the fifth post in the bloggERS Making Tech Skills a Strategic Priority series.

ArchivesSpace, Archivematica, BitCurator, EAD, the list goes on! The contemporary archivist is tasked with not only processing paper collections, but also with processing digital records and managing the descriptive data we create. This work requires technical skills that archivists twenty or even ten years ago didn’t need to master. It’s also rare that archivists get extensive training in the technical aspects of the field during their graduate programs. So, how can a team of archivists build the skills they’ll need to meet the needs of an increasingly technical field? At the Princeton University Library, the newly formed Archival Description and Processing Team (ADAPT) is committed to meeting these challenges by building technical capacity across the team. We are achieving this by working on real-world projects that require technical skills, leveraging existing knowledge and skills in the organization, seeking outside training, and championing supervisor support for using time to grow our technical skills.

One of the most important requirements for growing technical capacity on the processing team is supervisor support for the effort. Workshops, training, and solving technical problems take a significant amount of time. Without management support for the time needed to develop technical skills, the team would not be able to experiment, attend trainings, or practice writing code. As the manager of ADAPT, I make this possible by encouraging staff to set specific goals related to developing technical skills on their yearly performance evaluations; I also accept that it might take us a little longer to complete all of our processing. To fit this work into my own schedule, I identify real-world problems and block out time on my schedule to work on them or arrange meetings with colleagues who can assist me. Blocking out time in advance helps me stick to my commitment to building my technical skills. While the time needed to develop these skills means that some work happens more slowly today, the benefit of having a team that can manipulate data and automate processes is an investment in the future that will result in a more productive and efficient processing team.

With the support to devote time to building technical skills, ADAPT staff use a number of resources to improve their skills. Working with internal staff who already have skills they want to learn has been one successful approach. This has generally paired well with the need to solve real-world data problems. For example, we recently identified the need to move some old container information to individual component-level scope and content notes in a finding aid. We were able to complete this after several in-house training sessions on XPath and XQuery taught by a Library staff member. This introductory training helped us realize that the problem could be solved with XQuery scripting and we took on the project, while drawing on the in-house XQuery expert for assistance. This combination of identifying real-world problems and leveraging existing knowledge within the organization leads both to increased technical skills and projects getting done. It also builds confidence and knowledge that can be more easily applied to the next situation that requires a particular kind of technical expertise.

Finally, building in-house expertise requires allowing staff to determine what technical skills they want to build and how they might go about doing it. Often that requires outside training. Over the past several years, we have brought workshops to campus on working with the command line and using the ArchivesSpace API. Staff have also identified online courses and classes offered by the Office of Information Technology as important resources for building their technical skills. Providing support and time to attend these various trainings or complete online courses during the work day creates an environment where individuals can explore their interests and the team can build a variety of technical skills that complement each other.

As archival work evolves, having deeper technology skills across the team improves our ability to get our work done. By securing the right support, tapping into in-house resources, and seeking out additional training, it’s possible to build technological capability within the processing team. In turn, the team will be able to tackle more efficiently the day-to-day technical challenges involved in managing digital records and descriptive data.


Alexis Antracoli is Assistant University Archivist for Technical Services at Princeton University Library, where she leads the Archival Description and Processing Team. She has published on web archiving and the archiving of born-digital audiovisual content. Alexis is active in the Society of American Archivists, where she serves as Chair of the Web Archiving Section and on the Finance Committee. She is also active in Archives for Black Lives in Philadelphia, an informal group of local archivists who work on projects that engage issues at the intersection of the archival profession and the Black Lives Matter movement. She is especially interested in applying user experience research and user-centered design to archival discovery systems, developing and applying inclusive description practices, and web archiving. She holds an M.S.I. in Archives and Records Management from the University of Michigan, a Ph.D. in American History from Brandeis University, and a B.A. in History from Boston College.

Digitizing the Stars: Harvard University’s Glass Plate Collection

by Shana Scott

When our team of experts at Anderson Archival isn’t busy with our own historical collection preservation projects, we like to dive into researching other preservation and digitization undertakings. We usually dedicate ourselves to the intimate collections of individuals or private institutions, so we relish opportunities to investigate projects like Harvard University’s Glass Plate Collection.

For most of the sciences, century-old information would be considered at best a historical curiosity and at worst obsolete. But for the last hundred and forty years, Harvard College’s Observatory has housed one of the most comprehensive collections of photographs of the night’s sky as seen from planet Earth, and this data is more than priceless—it’s breakable. For nearly a decade, Harvard has been working to not only protect the historical collection but to bring it—and its enormous amount of underutilized data—into the digital age.

Star Gazing in Glass

Before computers and cameras, the only way to see the stars was to look up with the naked eye or through a telescope. With the advent of the camera, a whole new way to study the stars was born, but taking photographs of the heavens isn’t as easy as pointing and clicking. Photographs taken by telescopes were produced on 8″x10″ or 8″x14″ glass plates coated in a silver emulsion exposed over a period of time. This created a photographic negative on the glass that could be studied during the day.

(DASCH Portion of Plate b41215) Halley’s comet taken on April 21, 1910 from Arequipa, Peru.

This allowed a far more thorough study of the stars than one night of stargazing could offer. By adjusting the telescopes used and exposure times, stars too faint for the human eye to see could be recorded and analyzed. It was Henry Draper who took this technology to the next level.

In 1872, amateur astronomer Dr. Henry Draper used a prism over the glass plate and became the first to successfully record a star’s spectrum. Dr. Draper and his wife, Anna, intended to devote his retirement to the study of stellar spectroscopy, but he died before they could begin. To continue her husband’s work, Anna Draper donated much of her fortune and Dr. Draper’s equipment to the Harvard Observatory for the study of stellar spectroscopy. Harvard had already begun photographing on glass plates, but with Anna Draper’s continued contributions, Harvard expanded its efforts, photographing both the stars and their spectra.

Harvard now houses over 500,000 glass plates of both the northern and southern hemispheres, starting in 1882 and ending in 1992 when digital methods outpaced traditional photography. This collection of nightly recordings, which began as the Henry Draper Memorial, has been the basis for many of astronomy’s advancements in understanding the universe.

The Women of Harvard’s Observatory

Edward C. Pickering was the director of the Harvard Observatory when the Henry Draper Memorial was formed, but he did more than merely advance the field through photographing of the stars. He fostered the education and professional study of some of astronomy’s most influential members—women who, at that time, might never have received the chance—or credit—Pickering provided.

Instead of hiring men to study the plates during the day, Pickering hired women. He felt they were more detail-oriented, patient, and, he admitted, cheaper. Williamina Fleming was one of those female computers. She developed the Henry Draper Catalogue of Stellar Spectra and is credited with being the first to see the Horsehead nebula through her work examining the plates.

The Horsehead nebula taken by the Hubble Space Telescope in infrared light in 2013.
Image Credit: NASA/ESA/Hubble Heritage Team
(DASCH Portion of Plate b2312) The collection’s first image of the Horsehead Nebula taken on February 7, 1888 from Cambridge.

The Draper Catalogue included the first classification of stars based on stellar spectra, as created by Fleming. Later, this classification system would be modified by another notable female astronomer at Harvard, Annie Jump Cannon. Cannon’s classification and organizational scheme became the official method of cataloguing stars by the International Solar Union in 1910, and it continues to be used today.

Another notable female computer was Henrietta Swan Leavitt, who figured out a way to judge the distance of stars by relating the brightness of variable stars in the Small Magellanic Cloud to the period of their pulsation. Leavitt’s Law is still used to determine astronomical distances. The Glass Universe by Dava Sobel chronicles the stories of many of the female computers and the creation of Harvard Observatory’s plate collection.

Digital Access to a Sky Century @ Harvard (DASCH)

The Harvard Plate Collection is one of the most comprehensive records of the night’s sky, but less than one percent of it has been studied. For all of the great work done by the Harvard women and the astronomers who followed them, the fragility of the glass plates meant someone had to travel to Harvard to see them, and then the study of even a single star over a hundred years required a great deal of time. For every discovery made from the plate collection, like finding Pluto, hundreds or thousands more are waiting to be found.

(DASCH Single scan tile from Plate mc24889) First discovery image of Pluto with Clyde Tombaugh’s notes written on the plate. Taken at Cambridge on April 23, 1930.
Initial enhanced color image of Pluto released in July 2015 during New Horizons’ flyby.
Source: NASA/JHUAPL/SwRI
This is a more accurate image of the natural colors of Pluto as the human eye would see it. Taken by New Horizons in July 2015.
Source: NASA/Johns Hopkins University Applied Physics Laboratory/Southwest Research Institute/Alex Parker

With all of this unused, breakable data and advances in computing ability, Professor Jonathan Grindlay began organizing and funding DASCH in 2003 in an effort to digitize the entire hundred-year historical plate collection. But Grindlay had an extra obstacle to overcome. Many of the plates carried handwritten notes by the female computers and other astronomers. Grindlay had to balance the historical significance of the collection with the vast data it offered. To do this, the plates are scanned at low resolution with the marks in place, then cleaned and rescanned at the extremely high resolution necessary for data recording.

A custom scanner had to be designed and constructed specifically for the glass plates and new software was created to bring the digitized image into line with current astronomical data methods. The project hasn’t been without its setbacks, either. Finding funding for the project is a constant problem, and in January 2016, the Observatory’s lowest level flooded. Around 61,000 glass plates were submerged and had to be frozen immediately to prevent mold from damaging the negatives. While the plates are intact, many still need to be unfrozen and restored before being scanned. The custom scanner also had to be replaced because of the flooding.

George Champine Logbook Archive

In conjunction with the plate scanning, a second project is necessary to make the plates useable for extended study. The original logbooks of the female computers contain more than their observations of the plates. These books record the time, date, telescope, emulsion type, and a host of other identifying information necessary to place and digitally extrapolate the stars on the plates. Over 800 logbooks (nearly 80,000 images in total) were photographed by volunteer George Champine.

Those images are now in the time-consuming process of being manually transcribed. Harvard Observatory partnered with the Smithsonian Institution to enlist volunteers who work every day reading and transcribing the vital information in these logbooks. Without this data, the software can’t accurately use the star data scanned from the plates.

Despite all the challenges and setbacks, 314,797 plates have been scanned as of December 2018. The data released and analyzed from the DASCH project has already led to new discoveries about variable stars. Once the entire collection of historical documents is digitized, more than a hundred years will be added to the digital collection of astronomical data, and they will be free for anyone to access and study, professional or amateur.

The Harvard Plate Collection is a great example of an extraordinary resource to its community being underused due to the medium. Digital conversion of data is a great way to help any field of research. While Harvard’s plate digitization project provides a model for the conversion of complex data into digital form, not all institutions have the resources to attempt such a large enterprise. If you have a collection in need of digitization, contact Anderson Archival today at 314.259.1900 or email us at info@andersonarchival.com.


Shana Scott is a Digital Archivist and Content Specialist with Anderson Archival, and has been digitally preserving historical materials for over three years. She is involved in every level of the archiving process, creating collections that are relevant, accessible, and impactful. Scott has an MA in Professional Writing and Publishing from Southeast Missouri State University and is a member of SFWA.

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual Meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity “Survey the collection” includes tasks like “Determine total extent of digital material” and “Determine estimated date range.”

The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the “Establish physical control over removable media” activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

The Top 10 Things We Learned from Building the Queer Omaha Archives, Part 2 – Lessons 6 to 10

by Angela Kroeger and Yumi Ohira

The Queer Omaha Archives (QOA) is an ongoing effort by the University of Nebraska at Omaha Libraries’ Archives and Special Collections to collect and preserve Omaha’s LGBTQIA+ history. This is still a fairly new initiative at the UNO Libraries, having been launched in June 2016. This blog post is adapted and expanded from a presentation entitled “Show Us Your Omaha: Combatting LGBTQ+ Archival Silences,” originally given at the June 2017 Nebraska Library Association College & University Section spring meeting. The QOA was only a year old at that point, and now that another year (plus change) has passed, the collection has continued to grow, and we’ve learned some new lessons.

So here are the top takeaways from UNO’s experience with the QOA.

#6. Words have power, and sometimes also baggage.

Words sometimes mean different things to different people. Each person’s life experience lends context to the way they interpret the words they hear. And certain words have baggage.

We named our collecting initiative the Queer Omaha Archives because in our case, “queer” was the preferred term for all LGBTQIA+ people as well as referring to the academic discipline of queer studies. In the early 1990s, the community in Omaha most commonly referred to themselves as “gays and lesbians.” Later on, bisexuals were included, and the acronym “GLB” came into more common use. Eventually, when trans people were finally acknowledged, it became “GLBT.” Then there was a push to switch the order to “LGBT.” And then more letters started piling on, until we ended up with the LGBTQIA+ commonly used today. Sometimes, it is taken even further, and we’ve seen LGBTQIAPK+, LGBTQQIP2SAA, LGBTQIAGNC, and other increasingly long and difficult-to-remember variants. (Although, Angela confesses to finding QUILTBAG to be quite charming.) The acronym keeps shifting, but we didn’t want our name to shift, so we followed the students’ lead (remember the QTS “Cuties”?) and just went with “queer.” “Queer” means everyone.

Except . . . “queer” has baggage. Heavy, painful baggage. At Pride 2016, an older man, who had donated materials to our archive, stopped by our booth and we had a conversation about the word. For him, it was the slur his enemies had been verbally assaulting him with for decades. The word still had a sharp edge for him. Angela (being younger than this donor, but older than most of the students on campus) acknowledged that they were just old enough for the word to be startling and sometimes uncomfortable. But many Millennials and Generation Z youths, as well as some older adults, have embraced “queer” as an identity. Notably, many of the younger people on campus have expressed their disdain for being put into boxes. Identifying as “gay” or “lesbian” or “bi” seems too limiting to them. Our older patron left our booth somewhat comforted by the knowledge that for much of the population, especially the younger generations, “queer” has lost its sting and taken on a positive, liberating openness.

But what about other LGBTQIA+ people who haven’t stopped by to talk to us, to learn what we mean when we call our archives “queer”? Who might feel sufficiently put off by this word that they choose not to share their stories with our archive? We aren’t planning to change our name, but we are aware that our choice of word may give some potential donors and potential users a reason to hesitate before approaching us.

So whatever community your archive serves, think about the words that community uses to describe themselves, and the words others use to describe them, and whether those words might have connotations you don’t intend.

#7. Find your community. Partnerships, donors, and volunteers are the keys to success.

It goes without saying that archives are built from communities. We don’t (usually) create the records. We invite them, gather them, describe them, preserve them, and make them available to all, but the records (usually) come from somewhere else.

Especially if you’re building an archive for an underrepresented community, you need buy-in from the members of that community if you want your project to be successful. You need to prove that you’re going to be trustworthy, honorable, reliable stewards of the community’s resources. You need someone in your archive who is willing and able to go out into that community and engage with them. For us, that was UNO Libraries’ Archives and Special Collections Director Amy Schindler, who has a gift for outreach. Though herself cis and straight, she has put in the effort to prove herself a friend to Omaha’s LGBTQIA+ community.

You also need members of that community to be your advocates. For us, our advocates were our first donors, the people who took that leap of faith and trusted us with their resources. We started with our own university. The work of the archivist and the archives would not have been possible without the collaboration and support of campus partners. UNO GSRC Director Dr. Jessi Hitchins and UNO Sociology Professor Dr. Jay Irwin together provided the crucial mailing list for the QOA’s initial publicity and networking. Dr. Irwin and his students collected the interviews which launched the LGBTQ+ Voices oral history project. Retired UNO professor Dr. Meredith Bacon donated her personal papers and extensive library of trans resources. From outside the UNO community, Terry Sweeney, who with his partner Pat Phalen had served as editor of The New Voice of Nebraska, donated a complete set of that publication, along with a substantial collection of papers, photographs, and artifacts, and he volunteered in the archives for many weeks, creating detailed and accurate descriptions of the items. These four people, and many others, have become our advocates, friends, and champions within the Omaha LGBTQIA+ community.

Our lesson here: Find your champions. Prove your trustworthiness, and your champions will help open doors into the community.

#8. Be respectful, be interested, and be present.

Outreach is key to building connections, bringing in both donors and users for the collection. This isn’t Field of Dreams, where “If you build it, they will come.” You need to forge the partnerships first, in order to even be able to build it. And they won’t come if they don’t know about it and don’t believe in its value. (“They” in this case meaning the community or communities your archives serve, and “it” of course meaning your archives or special collections for that community.)

Fig. 3: UNO Libraries table at a Transgender Day of Remembrance event.

Yumi and Angela are both behind-the-scenes technical services types, so we don’t have quite as much contact with patrons and donors as some others in our department, but we’ve helped out staffing tables at events, such as Pride, Transgender Day of Remembrance, and Transgender Day of Visibility events. We also work to create a welcoming atmosphere for guests who come to the archives for events, tours, and research. We recognize the critical importance of the work our director does, going out and meeting people, attending events, talking to everyone, and inviting everyone to visit. As our director Amy Schindler said in the article “Collaborative Approaches to Digital Projects,” “Engagement with community partners is key . . .”

There’s also something to be said for simply ensuring that folks within the archives, and the library as a whole for that matter, have a basic awareness of the QOA and other collecting initiatives, so that we can better fulfill our mission of connecting people to the resources they need. After all, when someone walks into the library, before they even reach the archives, any staff member might be their first point of contact. So be sure to do outreach within your own institution, as well.

#9. Let me sing you the song of my administrative support.

The QOA was conceived through the efforts of UNO students, UNO employees, and Omaha communities to address the underrepresentation of the LGBTQIA+ communities in Omaha.

The QOA initiative was inspired by Josh Burford, who delivered a presentation about collecting and archiving historical materials related to queering history. This presentation was co-hosted by UNO’s Gender and Sexuality Resource Center during LGBTQIA+ History Month. After this event, the UNO community became keenly interested in collecting and preserving historical materials and oral history interviews about “Queer Omaha,” and began collaborating with our local LGBTQIA+ communities. In Summer 2016, the QOA was officially launched to preserve the enduring value of the legacy of LGBTQIA+ communities in greater Omaha. The QOA is an effort to combat an archival silence in the community, and digital collections and digital engagement are especially effective tools for making LGBTQIA+ communities aware that the archives welcome their records!

But none of this would have been possible without administrative support. If the library administration or the university administration had been indifferent (or worse, hostile) to the idea of building an LGBTQIA+ archive, we might not have been allowed to allocate staff time and other resources to this project. We might have been told “no.” Thank goodness, our library dean is 100% on board. And UNO is deeply committed to inclusion as one of our core values, which has created a campus-wide openness to the LGBTQIA+ community, resulting in an environment perfectly conducive to building this archive. In fact, in November 2017, UNO was identified as the most LGBTQIA+-friendly college in the state of Nebraska by the Campus Pride Index in partnership with BestColleges.com. An initiative like the QOA will always go much more smoothly when your administration is on your side.

#10. The Neverending Story.

We recognize that we still have a long way to go. There are quite a few gaps within our collection. We have the papers of a trans woman, but only oral histories from trans men. We don’t yet have anything from intersex folks or asexuals. We have very little from queer people of color or queer immigrants, although we do have some oral histories from those groups, thanks to the efforts of Dr. Jay Irwin, who launched the oral history project, and Luke Wegener, who was hired as a dedicated oral history associate for UNO Libraries. A major focus of the LGBTQIA+ oral history interview project is filling identified gaps within the collection, actively seeking more voices of color and other underrepresented groups within the LGBTQIA+ community. However, despite our efforts to increase the diversity within the collection, we haven’t successfully identified potential donors or interviewees to represent all of the letters within LGBTQIA, much less the +.

This isn’t, and should never be, a series of checkboxes on a list. “Oh, we have a trans woman. We don’t need any more trans women.” No, that’s not how it works. We seek to fill the gaps, while continuing to welcome additional material from groups already represented. We are absolutely not going to turn away a white cis gay man just because we already have multiple resources from white cis gay men. Every individual is different. Every individual brings a new perspective. We want our collection to include as many voices as possible. So we need to promote our collection more. We need to do more outreach. We need to attract more donors, users, and champions. This will remain an ongoing effort without an endpoint. There is always room for growth.


Angela Kroeger is the Archives and Special Collections associate at the University of Nebraska at Omaha and a lifelong Nebraskan. They received their B.A. in English from the University of Nebraska at Omaha and their Master’s in Library and Information Science from the University of Missouri.

Yumi Ohira is the Digital Initiatives Librarian at the University of Nebraska at Omaha. Ohira is originally from Japan where she received a B.S. in Applied Physics from Fukuoka University. Ohira moved to the United States to attend University of Kansas and Southern Illinois University-Carbondale where she was awarded an M.F.A. in Studio Art. Ohira went on to study at Emporia State University, Kansas, where she received an M.L.S. and Archive Studies Certification.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.

Step 2: Get refID and number to start appending to file

The script then starts a while loop, and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script then asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The number is needed because the filename for each clip is the original refID, followed by an underscore, followed by a number; asking for a starting number allows more clips to be made from the same original file when the script is run again later.
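The naming convention described in this step can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the actual script: the function name `clip_filenames`, the sample refID, and the `.mp4` default are assumptions made for the example.

```python
from itertools import count


def clip_filenames(ref_id, start_number, extension=".mp4"):
    """Yield clip filenames: the refID, an underscore, then an increasing number."""
    for number in count(start_number):
        yield f"{ref_id}_{number}{extension}"


# Illustrative refID (not a real ArchivesSpace identifier). Numbering resumes
# at 3 on the assumption that clips _1 and _2 were made in an earlier run.
names = clip_filenames("a1b2c3d4", 3)
first_two = [next(names), next(names)]
print(first_two)
```

Starting from a user-supplied number keeps a later run from overwriting clips that already exist for the same original file.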

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
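As a rough sketch of this step, the snippet below builds an ffmpeg command from a start time and a clip length. It is not the script's own code: the post says the script runs ffmpeg through the os module, while this example uses the standard library's subprocess module instead, and every function name and filename here is hypothetical.

```python
import subprocess


def to_timestamp(hours, minutes, seconds):
    """Format a time as HH:MM:SS, a form ffmpeg accepts for -ss and -t."""
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def needs_hours_prompt(duration_seconds):
    """Only ask for hours when the original file runs an hour or longer."""
    return duration_seconds >= 3600


def build_clip_command(source, output, start, duration):
    """Build (but do not run) an ffmpeg command that cuts one clip."""
    # -ss before -i seeks within the input; -t sets the clip length;
    # -c copy copies the streams without re-encoding, so cut points may
    # snap to the nearest keyframe.
    return ["ffmpeg", "-ss", start, "-i", source, "-t", duration,
            "-c", "copy", output]


def make_clip(source, output, start, duration):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_clip_command(source, output, start, duration), check=True)


# Hypothetical filenames: a 45-second clip starting 1 minute 30 seconds in.
command = build_clip_command(
    "a1b2c3d4.mp4", "a1b2c3d4_1.mp4",
    to_timestamp(0, 1, 30), to_timestamp(0, 0, 45),
)
print(" ".join(command))
```

Passing the command as a list avoids shell quoting problems when a filename contains spaces.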

Clip from Rockefeller Family at Pocantico – Part I , circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
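To make the lookup more concrete, here is a small standard-library sketch of the request path for the find_by_id endpoint and of reading the references it returns. It does not use ArchivesSnake and is not taken from the script; the repository ID, the refID, and the sample response are invented for illustration, and the response shape should be checked against your own ArchivesSpace instance.

```python
from urllib.parse import urlencode


def find_by_id_path(repo_id, ref_id):
    """Build the path and query string for a refID lookup."""
    query = urlencode({"ref_id[]": ref_id})
    return f"/repositories/{repo_id}/find_by_id/archival_objects?{query}"


def extract_uris(response):
    """Pull archival object URIs out of a find_by_id style response."""
    return [item["ref"] for item in response.get("archival_objects", [])]


# Hypothetical repository ID and refID.
path = find_by_id_path(2, "a1b2c3d4")

# Invented sample response, shaped as a list of references to matching records.
sample_response = {"archival_objects": [{"ref": "/repositories/2/archival_objects/42"}]}
uris = extract_uris(sample_response)
print(path)
print(uris)
```

Each returned URI can then be requested in turn to read the item's title, dates, and notes.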

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
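The formatting described here might look something like the sketch below, which uses only the standard library. The helper names, the column order, and the sample values are assumptions for illustration, not the script's actual layout.

```python
import csv
import io


def clean_note(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(text.split())


def format_duration(total_seconds):
    """Turn a number of seconds into HH:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def write_clip_row(handle, filename, title, note, duration_seconds):
    """Append one row of clip metadata to an open CSV file handle."""
    csv.writer(handle).writerow(
        [filename, title, clean_note(note), format_duration(duration_seconds)]
    )


# An in-memory buffer stands in for the real CSV file in this example.
buffer = io.StringIO()
write_clip_row(buffer, "a1b2c3d4_1.mp4", "Sample title",
               "First line.\nSecond line.\r\n", 45)
print(buffer.getvalue())
```

Removing line breaks before writing keeps each clip on a single spreadsheet row.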

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.
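The branching at the top of each loop could be sketched as follows. This is one plausible shape for the logic, not the script's code: the action labels are made up, and a list of scripted answers stands in for the interactive prompt so that the example runs on its own.

```python
def parse_choice(answer):
    """Map a typed answer to one of the loop's actions."""
    normalized = answer.strip().lower()
    if normalized in ("y", "yes"):
        return "new_refid"
    if normalized in ("n", "no"):
        return "same_refid"
    if normalized in ("q", "quit"):
        return "quit"
    return "ask_again"


def plan_session(answers):
    """Return the actions taken for a scripted series of answers."""
    actions = []
    for answer in answers:
        choice = parse_choice(answer)
        if choice == "quit":
            break
        actions.append(choice)
    return actions


# In an interactive script, each answer would come from
# input("New refID? y/n/q ") instead of this list.
session = plan_session(["y", "n", "No", "q", "y"])
print(session)
```

Normalizing case and whitespace before comparing means "Y", "yes ", and "y" are all treated the same way.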

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro to both writing and running bash shell scripts (bash being one of the most popular Unix shells, if not the most popular, and the one BitCurator uses by default), check out this tutorial by Tania Rascia.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script grouping multiple commands together and running that group over a set of items.
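The "same command over multiple files" pattern from the last bullet can also be written in Python. The sketch below is a generic illustration and not a port of any script in the library: the function names are hypothetical, and `echo` stands in for whichever tool you would actually run.

```python
import subprocess
from pathlib import Path


def batch_commands(base_command, paths):
    """Return one command per file, with the file path appended as the last argument."""
    return [list(base_command) + [str(path)] for path in paths]


def run_over_directory(base_command, directory, pattern="*"):
    """Run the base command once for every matching file in a directory."""
    paths = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    for command in batch_commands(base_command, paths):
        subprocess.run(command, check=True)


# "echo" is a harmless placeholder; the disk image names are invented.
commands = batch_commands(["echo", "processing"], ["disk1.img", "disk2.img"])
print(commands)
```

Building the command list separately from running it makes it easy to print the commands for a dry run before processing a whole directory.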

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes both bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphic frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable, filling in gaps in process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing scripts often sparks an idea for one you may actually benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.

Of Python and Pandas: Using Programming to Improve Discovery and Access

by Kate Topham

This is the ninth post in the bloggERS Script It! Series.

Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?

This required some programming chops–so the second half of this post is targeted towards those who have a working knowledge of python, as well as the pandas library.

If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.

Cleaning your data

A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled “African-American History” and “History, African American,” respectively. When asked to compare these two strings, python will consider them different. Therefore, when you ask python to count all the books with the subject “African American History,” it counts only those whose subjects match that string exactly.
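A short example makes the problem concrete. The subject strings below are made up for illustration; the point is only that an exact-match count treats each spelling as its own subject.

```python
from collections import Counter

# Three hypothetical subject headings that a person would read as one topic.
subjects = [
    "African American History",
    "African-American History",
    "History, African American",
]

counts = Counter(subjects)
same = "African-American History" == "History, African American"

print(same)
print(counts["African American History"])
```

Only one of the three records is counted under "African American History", even though all three describe the same subject.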

Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.

Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a, it will display all the different values in that column in a box on the left.

If we choose one of these facets, we can view all the rows that have that value in the “a” column.

Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.

Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting the characters of each string alphabetically and matching the results. Since “History, African American” and “African American History” both evaluate to “aaaaccefhiiimnnorrrsty,” OpenRefine will group them together and allow you to change all of the matching fields to either (or something totally different!) according to your preference.
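The idea can be sketched in plain Python. This follows the description above (lowercase the string, drop punctuation and spaces, sort the remaining characters) and is a simplified stand-in, not OpenRefine's own code; OpenRefine's built-in clustering methods differ in detail, and its default fingerprint method compares normalized word tokens rather than individual letters. The function names are hypothetical.

```python
from collections import defaultdict


def sorted_letters_key(value):
    """Lowercase, keep only letters and digits, and sort the characters."""
    return "".join(sorted(ch for ch in value.lower() if ch.isalnum()))


def cluster(values):
    """Group values that share a key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[sorted_letters_key(value)].append(value)
    return {key: members for key, members in groups.items() if len(members) > 1}


key_a = sorted_letters_key("History, African American")
key_b = sorted_letters_key("African American History")
dog_clusters = cluster(["Dogs", "Dogs.", "Cats"])

print(key_a)
print(dog_clusters)
```

Because punctuation is dropped before sorting, “Dogs.” and “Dogs” land in the same group, which is what allows them to be merged into one value.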

This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.

Analyzing your data

In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.

First, I pivoted the table from wide form to long so that we could count separate pairs of subject tags. The pandas ‘melt’ method set the index by bibliographic id and subject so that books with multiple subjects would be counted in both categories.
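A minimal sketch of this reshaping step is below. The column names (`subject_1`, `subject_2`) and the sample rows are assumptions, since the layout of the real table is not shown here; the goal is only to show a book with two subjects becoming two rows.

```python
import pandas as pd

# Hypothetical wide table: one row per book, one column per subject slot.
wide = pd.DataFrame(
    {
        "bib_id": [101, 102, 103],
        "subject_1": ["Dogs", "African American History", "Dogs"],
        "subject_2": ["Fiction", None, None],
    }
)

# Melt to long form: one row per (book, subject) pair, dropping empty slots.
long_form = (
    wide.melt(
        id_vars=["bib_id"],
        value_vars=["subject_1", "subject_2"],
        var_name="slot",
        value_name="subject",
    )
    .dropna(subset=["subject"])
    .drop(columns=["slot"])
    .sort_values(["bib_id", "subject"])
    .reset_index(drop=True)
)
print(long_form)
```

Book 101 now appears twice, once under each of its subjects, so it will be counted in both categories.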

Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.
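The merge might look like the sketch below. Only the name bib_data comes from the post; the `year` column, the sample values, and the derived `decade` column are assumptions added so the example is complete.

```python
import pandas as pd

# Hypothetical melted table: one row per (book, subject) pair.
subjects = pd.DataFrame(
    {"bib_id": [101, 101, 102], "subject": ["Dogs", "Fiction", "Dogs"]}
)

# Hypothetical slice of the original MARC table with a publication year.
bib_data = pd.DataFrame({"bib_id": [101, 102, 103], "year": [1868, 1972, 1901]})

# A left merge keeps every subject row and attaches the matching year.
merged = subjects.merge(bib_data[["bib_id", "year"]], on="bib_id", how="left")

# Assumed helper column: round each year down to its decade.
merged["decade"] = (merged["year"] // 10) * 10
print(merged)
```

A left merge is used so that no subject row is lost if a date happens to be missing from the original table.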

I grouped the data by subject using .groupby(). The .agg() function communicates how we want to count within the subject groups.
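Here is a small sketch of grouping and counting with pandas. The column names and rows are invented, and grouping by both subject and decade is an assumption based on the later step that reads counts by decade.

```python
import pandas as pd

# Hypothetical merged table: one row per (book, subject) pair with a decade.
merged = pd.DataFrame(
    {
        "bib_id": [1, 2, 3, 4],
        "subject": ["Dogs", "Dogs", "Dogs", "Fiction"],
        "decade": [1860, 1860, 1970, 1860],
    }
)

# Group by subject and decade, then count the bib_ids in each group.
counts = (
    merged.groupby(["subject", "decade"])
    .agg({"bib_id": "count"})
    .rename(columns={"bib_id": "book_count"})
    .reset_index()
)
print(counts)
```

The dictionary passed to .agg() names the column to summarize and the function to apply to it, which is how the counting rule is spelled out.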

Because of the vast number of subjects, we chose to focus on the ten most numerous subjects. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.

Once I had our top ten list, I could select the count by decade for each subject.
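The top-subjects step and the selection that follows can be sketched together. The data is invented, and TOP_N is set to 2 only so the tiny example has something to exclude; the analysis described above used ten.

```python
import pandas as pd

# Hypothetical cleaned subject data: one row per (book, subject) pair.
subjects = pd.DataFrame(
    {
        "bib_id": [1, 2, 3, 4, 5, 6],
        "650a": ["Dogs", "Dogs", "Dogs", "Fiction", "Fiction", "Cats"],
    }
)

# Hypothetical counts per subject and decade from the earlier grouping step.
by_decade = pd.DataFrame(
    {
        "650a": ["Dogs", "Dogs", "Fiction", "Cats"],
        "decade": [1860, 1970, 1860, 1900],
        "bib_id": [2, 1, 2, 1],
    }
)

TOP_N = 2  # the post used 10; 2 keeps this example small

# Group by 650a, count by bib_id, sort by that count, keep the top N subjects.
top_subjects = (
    subjects.groupby("650a")
    .agg({"bib_id": "count"})
    .sort_values("bib_id", ascending=False)
    .head(TOP_N)
    .index.tolist()
)

# Keep only the decade counts for those subjects.
top_by_decade = by_decade[by_decade["650a"].isin(top_subjects)]
print(top_subjects)
print(top_by_decade)
```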

Visualizing your data

In order to visualize this data, I used a python library called plotly. Plotly generates graphs from your data. It plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in a for loop to create a line on the graph for each subject.
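The loop might be sketched as follows. To keep the example runnable without plotly installed, it builds the figure as plain dictionaries in plotly's figure format rather than calling the library; the function name, titles, and counts are all invented for illustration.

```python
def build_figure_spec(counts_by_subject):
    """Build a figure description with one line trace per subject."""
    traces = []
    for subject, by_decade in counts_by_subject.items():
        decades = sorted(by_decade)
        traces.append(
            {
                "type": "scatter",
                "mode": "lines",
                "name": subject,
                "x": decades,
                "y": [by_decade[decade] for decade in decades],
            }
        )
    layout = {
        "title": {"text": "Books per decade by subject"},
        "xaxis": {"title": {"text": "Decade"}},
        "yaxis": {"title": {"text": "Number of books"}},
    }
    return {"data": traces, "layout": layout}


# Invented counts: subject -> {decade: number of books}.
spec = build_figure_spec(
    {
        "Dogs": {1870: 3, 1860: 5},
        "Fiction": {1860: 2, 1870: 4},
    }
)
print(len(spec["data"]))

# With plotly installed, this spec should be renderable with something like:
#   import plotly.graph_objects as go
#   go.Figure(spec).show()
```

One trace per subject is what puts a separate line on the graph for each of the top subjects.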

Some of the interesting patterns we noticed were the spike in African-American books soon after 1865, the end of the Civil War, and another at the end of the 20th century, following the Civil Rights movement. Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.

Acknowledgments

I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.