Data As Collections

By Nathan Gerth

Over the past several years there has been a growing conversation about “collections as data” in the archival field. Elizabeth Russey Roke underscored the growing impact of this movement in her recent post on this blog about the Collections As Data: Always Already Computational final report. Much like her, I have seen this computational approach to the data in our collections take hold at my home institution, with our decision to start providing researchers with aggregate data harvested from our born-digital collections.

Data as Collections

At the same time, in my role as a digital archivist working with congressional papers, I have seen a growing body of what I call “data as collections.” I am using the term data in this case specifically in reference to exports from relational database systems in collections. More akin to research datasets than standard born-digital acquisitions, these exports amplify the privacy and technical challenges associated with typical digital collections. However, they also embody some of the more appealing possibilities for the computational research highlighted by the “collections as data” initiative, given their structured nature and millions of data points.   

Curating and supplying access to a particular type of data export has become an acute problem in the field of congressional papers. As documented in a white paper by a Congressional Papers Section Task Force in 2017, members of the U.S. House of Representatives and U.S. Senate have widely adopted proprietary Constituent Management Systems (CMS) or Constituent Services Systems (CSS) to manage constituent communications. The exports from these systems document the core interactions between the American people and their representatives in Congress. Unfortunately, these data exports have remained largely inaccessible to archivists and researchers alike.

The question of curating, preserving, and supplying access to the exports from these systems has galvanized the work of several task forces in the archival community. In recent years, congressional papers archivists have collaborated to document the problem in the white paper referenced above and to support the development of a tool for accessing these exports. The latter effort, spearheaded by West Virginia University Libraries, earned a Lyrasis Catalyst Fund grant in 2018 to assess the development possibilities for an open-source platform, created at WVU, to open and search these data exports. You can see a screenshot of the application in action below.


Screenshot of data table viewed in the West Virginia University Libraries CSS Data Tool

The project funded by the grant, America Contacts Congress, has now issued its final report, and the members of the task force that served as its advisory board are transitioning to the next stage of the project. Here is where things stand:

What We Now Know

We now know much more about the key research audiences for this data and the archival needs associated with the tool. Researchers, especially computationally minded quantitative scholars, expressed strong enthusiasm for gaining access to the data. For those of us involved in testing data in the tool, the project gave us the chance to become much more familiar with our holdings. I, for my part, also know a great deal more about the 16 million records in the relational data tables we received from the office of Senator Harry Reid, in addition to the 3 million attachments referenced by those tables. Without the ability to search and view the data in the tool, the tables and attachments from the Reid collection would have existed as little more than binary files.

Unresolved Questions

While members of the grant’s advisory board know much more about how the tool might be used in the sphere of congressional papers, we would like to learn more about other cases of “data as collections” in the archival field. Who beyond congressional papers archivists is grappling with supplying access to and preserving relational databases? We know, for example, that many state and local governments use the same Constituent Relationship Management systems, such as iConstituent and Intranet Quorum, deployed in congressional offices. Do our needs overlap with those of other archivists, and could this tool serve a broader community? While the amount of CSS data in congressional collections is significant, the direction we take with tool development and with partnerships to supply access to the data will hinge on finding a broader audience of archivists facing similar challenges.

If any of the questions above apply to you, consider contacting the members of the America Contacts Congress project’s advisory board. We would love to hear from you and discuss how the outcomes of the grant might apply to a broader array of data exports in archival collections. Who knows, we might even help you test the tool on your own data exports! For more information about the project, visit our webpage.

Nathan Gerth

Nathan Gerth is the Head of Digital Services and Digital Archivist at the University of Nevada, Reno Libraries. Splitting his time between application support and digital preservation, he is the primary custodian of the electronic records from the papers of Senator Harry Reid. Outside of the University, he is an active participant in the congressional papers community, serving as the incoming chair of the Congressional Papers Section and as a member of the Association of Centers for the Study of Congress CSS Data Task Force.

Student Impressions of Tech Skills for the Field

by Sarah Nguyen


Back in March, during bloggERS’ Making Tech Skills a Strategic Priority series, we distributed an open survey to MLIS, MLS, MI, and MSIS students to understand what they know and have experienced in relation to technology skills as they enter the field.

To be frank, this survey stemmed from personal interests, since I just completed an MLIS core course on Research, Assessment, and Design (re: survey to collect data on the current landscape). I am also interested in what skills I need to build and what class I should sign up for next quarter (re: what tech skills do I need to become hireable?). While I feel comfortable with a variety of tech-related tools and tasks, I’ve been intimidated by more “high-level” computational languages for some years. This survey was helpful for exploring what skills other LIS pre-professionals are interested in and which skills will help us make these costly degrees worth the time and financial investment traditionally required to enter a stable archives or library position.

Method

The survey was open for one month on Google Forms, and distributed to SAA communities, @SAA_ERS Twitter, the Digital Curation Google Group, and a few MLIS university program listservs. There were 15 questions and we received responses from 51 participants. 

Results & Analysis

Here’s a superficial scan of the results. If you would like to come up with your own analyses, feel free to view the raw data on GitHub.

Figure 1. Technology-related skills that students want to learn

The most popular technology-related skill that students are interested in learning is data management (manipulating, querying, transforming data, etc.). This is a pretty broad topic, as it involves many tools and protocols that can vary between GUI applications and scripts. A separate survey that breaks down specific data management tools might be in order, especially since these skills can be divided into specialty courses and workshops, which then translate into specific job positions. A more specific survey could help demonstrate what types of skills need to be taught in a full semester-long course, and what skills can be covered in a day-long or multi-day workshop.

It was interesting to see that even in this day and age, when social media management is second nature to many students’ daily lives, there was still a notable interest in understanding how to make this a part of their career. This makes me wonder what value students see in knowing how to strategically manage an archives’ social media account. How could this help with the job market, as well as an archival organization’s main mission?

Looking deeper into the popular data management category, it would be interesting to know the current landscape of knowledge or pedagogy around communicating with IT (e.g., project management and translating users’ needs). In many cases, archivists work separately from, but depend on, IT system administrators, and this can be frustrating, since each department has distinct concerns about a server or other network infrastructure. At June’s NYC Preservathon/Preservashare 2019, someone mentioned that IT exists to make sure servers and networks are spinning at all hours of the day. Unlike archivists, they are not concerned about the longevity of the content, the obsolescence of file formats, or the software needed to render files. Could it be useful to have a course on how to effectively communicate and take ownership of issues that fall along the fuzzy lines between archives, data management, and IT? Or, as one survey respondent said, “I think more basic programming courses focusing on tech languages commonly used in archives/libraries would be very helpful.” Personally, I’ve only learned this from experience working in different tech-related jobs. It is not a subject I see in my MLIS course catalog, nor a discussion at conference workshops.

The popularity of data management skills sparked another question: what about knowledge around computer networks and servers? Even though LTO will forever be in our hearts, cloud storage is also a backup medium we’re budgeting for and relying on. Same goes for hosting a database for remote access and/or publishing digital files. A friend mentioned this networking workshop for non-tech savvy learners—Grassroots Networking: Network Administration for Small Organizations/Home Organizations—which could be helpful for multiple skill types including data management, digital forensics, web archiving, web development, etc. This is similar to a course that could be found in computer science or MLIS-adjacent information management departments.

Figure 2. Have you taken/will you take technology-focused courses in your program?
Figure 3. Do you feel comfortable defining the difference between scripting and programming?

I can’t say this is statistically significant, but the contrast between the 15.7% who have not taken or will not take a technology-focused course in their program and the 78.4% of respondents who are not aware of the difference between scripting and programming is eyebrow-raising. According to an article in PLOS Computational Biology, the term “script” means “something that is executed directly as is,” while a “program” is “something that is explicitly compiled before being used. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is ‘things that are edited directly’ and ‘things that are not edited directly’” (Wilson et al. 2017). This distinction matters, since more archives are acquiring, processing, and sharing collections that rely on archivists who can execute jobs such as web scraping or metadata management (scripting) or who can build and maintain a database (programming). These might be interpreted as trick questions, but the particular semantics, and what counts as technology-focused, is something modern library, archives, and information programs might want to consider.
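To make the distinction concrete, here is a minimal, hypothetical Python sketch (the file name and function are invented for illustration, not drawn from the survey): the same file acts as a script when executed directly and as a small library when imported.

```python
# clean_dates.py -- a hypothetical example: run it directly and it's a "script";
# import it from another program and it behaves like a small "library."

def standardize_date(value):
    """Return 'Undated' for common no-date placeholders, otherwise the value unchanged."""
    no_date_terms = {"n.d.", "no date", "nd", "unknown", ""}
    return "Undated" if value.strip().lower() in no_date_terms else value

if __name__ == "__main__":
    # This branch only runs when the file is executed directly "as is."
    for sample in ["n.d.", "1947", "Unknown"]:
        print(sample, "->", standardize_date(sample))
```

Run as `python clean_dates.py`, the file executes top to bottom; imported with `import clean_dates`, the `__main__` block is skipped and only the function is made available (and, as Wilson et al. note, quietly compiled to bytecode at that point).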

Figure 4. How do you approach new technology?

Figure 4 illustrates the various ways students tackle new technologies. Reading the f* manual (RTFM) and searching forums are the most common approaches to navigating new technology. Here are quotes from a couple of students on how they tend to learn a new piece of software:

  • “break whatever I’m trying to do with a new technology into steps and look for tutorials & examples related to each of those steps (i.e. Is this step even possible with X, how to do it, how else to use it, alternatives for accomplishing that step that don’t involve X)”
  • “I tend to google “how to….” for specific tasks and learn new technology on a task-by-task basis.”

In the end, there was overwhelming interest in “more project-based courses that allow skills from other tech classes to be applied.” Unsurprisingly, many of us are looking for full-time, stable jobs after graduating, and the “more practical stuff, like CONTENTdm for archives” seems to be a pressure felt in order to land an entry-level position. And not just at entry level: as continuing education learners, there is also a push to strive for more. Several respondents are looking for a challenge to level up their tech skills:

  • “I want more classes with hands-on experience with technical skills. A lot of my classes have been theory based or else they present technology to us in a way that is not easy to process (i.e. a lecture without much hands-on work).”
  • “Higher-level programming, etc. — everything on offer at my school is entry level. Also digital forensics — using tools such as BitCurator.”
  • “Advanced courses for the introductory courses. XML 2 and python 2 to continue to develop the skills.”
  • “A skills building survey of various code/scripting, that offers structured learning (my professor doesn’t give a ton of feedback and most learning is independent, and the main focus is an independent project one comes up with), but that isn’t online. It’s really hard to learn something without face to face interaction, I don’t know why.”

It will be interesting to see what skills recent MLIS, MLS, MIS, and MSIM graduates bring to the field. While many job postings list certain software and skills as requirements, will programs follow suit? I have a feeling this is a significant question to ask in the larger context of what the purpose of this Master’s degree is and how the curriculum can keep up with the dynamic technology needs of the field.

Disclaimer: 

  1. Potential bias: Those taking the survey might be interested in learning higher-level tech skills because they do not already have them, while those who are already tech-savvy might avoid a basic survey such as this one. This may bias the survey population toward mostly novice tech students.
  2. More data on specific computational languages and technology courses taken are available in the GitHub csv file. As mentioned earlier, I just finished my first year as a part-time MLIS student, so I’m still learning the distinct jobs and nature of the LIS field. Feel free to submit an issue to the GitHub repo, or tweet me @snewyuen if you’d like to talk more about what this data could mean.

Bibliography

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510


Sarah Nguyen with a Uovo storage truck

Sarah Nguyen is an advocate for open, accessible, and secure technologies. While studying as an MLIS candidate with the University of Washington iSchool, she is expressing interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, and archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance. (Views do not represent Uovo. And I don’t even work with them. Just liked the truck.)

Using R to Migrate Box and Folder Lists into EAD

by Andy Meyer

Introduction

This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hope that another institution might find the approach helpful and that it could be generalized to other issues facing archives.

I decided to use the programming language R because it is a free and open source language that I had some prior experience using. R has a large and active user community as well as a large number of relevant packages that extend its basic functions, including libraries that can deal with Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.

The specific task that sparked this script was inheriting hundreds of finding aids with minimal collection-level information and very long, detailed box and folder lists. These were all Microsoft Word documents, with the box and folder list formatted as a table within each document. We recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate the work. The code is organized into three sections: data export, transforming and cleaning the data, and finally, creating an EAD file to load into ArchivesSpace.

Data Export

After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.

This process was surprisingly straightforward; I created a variable with a path to a Word Document and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of that table into a data.frame in R. Sometimes our finding aids were inconsistent so I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step of the process is four columns that contain folder title, date, box number, and folder number.
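As a rough illustration of this export step, here is a minimal R sketch using the docxtractr package. The file path is a placeholder, and it assumes the box and folder list is the first table in the document; the full working script is the one posted on GitHub.

```r
# Simplified sketch of the export step.
library(docxtractr)

# Open a legacy finding aid (placeholder path).
finding_aid <- read_docx("finding_aids/example_collection.docx")

# Extract the first table into a data.frame; header = TRUE uses the
# first row of the Word table as column names.
box_folder_list <- docx_extract_tbl(finding_aid, tbl_number = 1, header = TRUE)

# Normalize to the four columns used in the rest of the workflow.
names(box_folder_list) <- c("title", "date", "box", "folder")
```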

This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.

Data Transformation and Cleaning

Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:

  1. Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
  2. Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”,”N.D.”,”no date”,”N/A”,”NDG”,”Various”, “N. D.”,””,”??”,”n. d.”,”n. d. “,”No date”,”-“,”N.A.”,”ND”, “NO DATE”, “Unknown.” For all of these, I updated the date field to “Undated” as a way to standardize this field.
  3. Spelling out abbreviations. Occasionally, I used regular expressions to spell out words in the title field. These ranged from standard terms like “Corresp” for “Correspondence” to local terms like “NPU” for “North Park University.” (A sketch of these cleanup steps appears after this list.)
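Here is a rough sketch of what steps 2 and 3 might look like in R. The list of “no date” variants and the abbreviations come from the description above, but the data.frame and column names are carried over from the export sketch and are assumptions rather than the actual code.

```r
# Sketch of the cleanup steps, assuming the data.frame from the export sketch.

# 2. Standardize the "no date" variants found across the finding aids.
no_date_terms <- c("n.d.", "N.D.", "no date", "N/A", "NDG", "Various",
                   "N. D.", "", "??", "n. d.", "n. d. ", "No date", "-",
                   "N.A.", "ND", "NO DATE", "Unknown")
box_folder_list$date[box_folder_list$date %in% no_date_terms] <- "Undated"

# 3. Spell out abbreviations in the title field with regular expressions.
box_folder_list$title <- gsub("\\bCorresp\\b", "Correspondence", box_folder_list$title)
box_folder_list$title <- gsub("\\bNPU\\b", "North Park University", box_folder_list$title)
```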

R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.

Create EAD to Load into ArchivesSpace

Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.

Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD structure in R using the “newXMLNode” functions from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes, which were applied to the data exported and transformed in earlier steps. Lastly, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
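The EAD-building step might look something like the stripped-down sketch below, using the XML package. The element set is far more minimal than a real EAD file, the collection-level pieces are omitted, and the object names again follow the earlier sketches rather than the actual script.

```r
# Highly simplified sketch of building an EAD skeleton with the XML package.
library(XML)

ead      <- newXMLNode("ead")
archdesc <- newXMLNode("archdesc", parent = ead)
dsc      <- newXMLNode("dsc", parent = archdesc)

# Create one <c> component per row of the cleaned box/folder list.
make_component <- function(title, date, box, folder, parent) {
  c_node <- newXMLNode("c", attrs = c(level = "file"), parent = parent)
  did    <- newXMLNode("did", parent = c_node)
  newXMLNode("unittitle", title, parent = did)
  newXMLNode("unitdate", date, parent = did)
  newXMLNode("container", box, attrs = c(type = "box"), parent = did)
  newXMLNode("container", folder, attrs = c(type = "folder"), parent = did)
}

invisible(mapply(make_component,
                 box_folder_list$title, box_folder_list$date,
                 box_folder_list$box, box_folder_list$folder,
                 MoreArgs = list(parent = dsc)))

saveXML(ead, file = "example_collection_ead.xml")
```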

Conclusion

Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.


Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.

Just do it: Building technical capacity among Princeton’s Archival Description and Processing Team

by Alexis Antracoli

This is the fifth post in the bloggERS Making Tech Skills a Strategic Priority series.

ArchivesSpace, Archivematica, BitCurator, EAD, the list goes on! The contemporary archivist is tasked not only with processing paper collections, but also with processing digital records and managing the descriptive data we create. This work requires technical skills that archivists twenty or even ten years ago didn’t need to master. It’s also rare that archivists get extensive training in the technical aspects of the field during their graduate programs. So, how can a team of archivists build the skills they’ll need to meet the needs of an increasingly technical field? At the Princeton University Library, the newly formed Archival Description and Processing Team (ADAPT) is committed to meeting these challenges by building technical capacity across the team. We are achieving this by working on real-world projects that require technical skills, leveraging existing knowledge and skills in the organization, seeking outside training, and championing supervisor support for using work time to grow our technical skills.

One of the most important requirements for growing technical capacity on the processing team is supervisor support for the effort. Workshops, training, and solving technical problems take a significant amount of time. Without management support for the time needed to develop technical skills, the team would not be able to experiment, attend trainings, or practice writing code. As the manager of ADAPT, I make this possible by encouraging staff to set specific goals related to developing technical skills in their yearly performance evaluations; I also accept that it might take us a little longer to complete all of our processing. To fit this work into my own schedule, I identify real-world problems and block out time to work on them or arrange meetings with colleagues who can assist me. Blocking out time in advance helps me stick to my commitment to building my technical skills. While the time needed to develop these skills means that some work happens more slowly today, having a team that can manipulate data and automate processes is an investment in the future that will result in a more productive and efficient processing team.

With the support to devote time to building technical skills, ADAPT staff use a number of resources to improve their skills. Working with internal staff who already have skills they want to learn has been one successful approach. This has generally paired well with the need to solve real-world data problems. For example, we recently identified the need to move some old container information to individual component-level scope and content notes in a finding aid. We were able to complete this after several in-house training sessions on XPath and XQuery taught by a Library staff member. This introductory training helped us realize that the problem could be solved with XQuery scripting and we took on the project, while drawing on the in-house XQuery expert for assistance. This combination of identifying real-world problems and leveraging existing knowledge within the organization leads both to increased technical skills and projects getting done. It also builds confidence and knowledge that can be more easily applied to the next situation that requires a particular kind of technical expertise.

Finally, building in-house expertise requires allowing staff to determine what technical skills they want to build and how they might go about doing it. Often that requires outside training. Over the past several years, we have brought workshops to campus on working with the command line and using the ArchivesSpace API. Staff have also identified online courses and classes offered by the Office of Information Technology as important resources for building their technical skills. Providing support and time to attend these various trainings or complete online courses during the work day creates an environment where individuals can explore their interests and the team can build a variety of technical skills that complement each other.

As archival work evolves, having deeper technology skills across the team improves our ability to get our work done. With the right support, tapping into in-house resources, and seeking out additional training, it’s possible to build technological capability within the processing team. In turn, the team will increasingly be able to tackle efficiently the day-to-day technical challenges involved in managing digital records and descriptive data.


Alexis Antracoli is Assistant University Archivist for Technical Services at Princeton University Library, where she leads the Archival Description and Processing Team. She has published on web archiving and the archiving of born-digital audiovisual content. Alexis is active in the Society of American Archivists, where she serves as Chair of the Web Archiving Section and on the Finance Committee. She is also active in Archives for Black Lives in Philadelphia, an informal group of local archivists who work on projects that engage issues at the intersection of the archival profession and the Black Lives Matter movement. She is especially interested in applying user experience research and user-centered design to archival discovery systems, developing and applying inclusive description practices, and web archiving. She holds an M.S.I. in Archives and Records Management from the University of Michigan, a Ph.D. in American History from Brandeis University, and a B.A. in History from Boston College.

Digitizing the Stars: Harvard University’s Glass Plate Collection

by Shana Scott

When our team of experts at Anderson Archival isn’t busy with our own historical collection preservation projects, we like to dive into researching other preservation and digitization undertakings. We usually dedicate ourselves to the intimate collections of individuals or private institutions, so we relish opportunities to investigate projects like Harvard University’s Glass Plate Collection.

For most of the sciences, century-old information would be considered at best a historical curiosity and at worst obsolete. But for the last hundred and forty years, the Harvard College Observatory has housed one of the most comprehensive collections of photographs of the night sky as seen from Earth, and this data is more than priceless—it’s breakable. For nearly a decade, Harvard has been working not only to protect the historical collection but to bring it—and its enormous amount of underutilized data—into the digital age.

Star Gazing in Glass

Before computers and cameras, the only way to see the stars was to look up with the naked eye or through a telescope. With the advent of the camera, a whole new way to study the stars was born, but taking photographs of the heavens isn’t as easy as pointing and clicking. Photographs taken by telescopes were produced on 8″x10″ or 8″x14″ glass plates coated in a silver emulsion exposed over a period of time. This created a photographic negative on the glass that could be studied during the day.

(DASCH Portion of Plate b41215) Halley’s comet taken on April 21, 1910 from Arequipa, Peru.

This allowed a far more thorough study of the stars than one night of stargazing could offer. By adjusting the telescopes used and exposure times, stars too faint for the human eye to see could be recorded and analyzed. It was Henry Draper who took this technology to the next level.

In 1872, amateur astronomer Dr. Henry Draper used a prism over the glass plate and became the first person to successfully record a star’s spectrum. Dr. Draper and his wife, Anna, intended to devote his retirement to the study of stellar spectroscopy, but he died before they could begin. To continue her husband’s work, Anna Draper donated much of her fortune and Dr. Draper’s equipment to the Harvard Observatory for the study of stellar spectroscopy. Harvard had already begun photographing on glass plates, but with Anna Draper’s continuing contributions, Harvard expanded its efforts, photographing both the stars and their spectra.

Harvard now houses over 500,000 glass plates of both the northern and southern hemispheres, starting in 1882 and ending in 1992 when digital methods outpaced traditional photography. This collection of nightly recordings, which began as the Henry Draper Memorial, has been the basis for many of astronomy’s advancements in understanding the universe.

The Women of Harvard’s Observatory

Edward C. Pickering was the director of the Harvard Observatory when the Henry Draper Memorial was formed, but he did more than merely advance the field through photographing of the stars. He fostered the education and professional study of some of astronomy’s most influential members—women who, at that time, might never have received the chance—or credit—Pickering provided.

Instead of hiring men to study the plates during the day, Pickering hired women. He felt they were more detail-oriented, more patient, and, he admitted, cheaper. Williamina Fleming was one of those female computers. She developed the Henry Draper Catalogue of Stellar Spectra and is credited as the first person to see the Horsehead Nebula, through her work examining the plates.

The Horsehead nebula taken by the Hubble Space Telescope in infrared light in 2013.
Image Credit: NASA/ESA/Hubble Heritage Team

(DASCH Portion of Plate b2312) The collection’s first image of the Horsehead Nebula taken on February 7, 1888 from Cambridge.

The Draper Catalogue included the first classification of stars based on stellar spectra, as created by Fleming. Later, this classification system would be modified by another notable female astronomer at Harvard, Annie Jump Cannon. Cannon’s classification and organizational scheme became the official method of cataloguing stars by the International Solar Union in 1910, and it continues to be used today.

Another notable female computer was Henrietta Swan Leavitt, who worked out a way to judge the distance of stars based on the brightness of variable stars in the Small Magellanic Cloud. Leavitt’s Law is still used to determine astronomical distances. The Glass Universe by Dava Sobel chronicles the stories of many of the female computers and the creation of Harvard Observatory’s plate collection.

Digital Access to a Sky Century @ Harvard (DASCH)

The Harvard Plate Collection is one of the most comprehensive records of the night sky, but less than one percent of it has been studied. For all of the great work done by the Harvard women and the astronomers who followed them, the fragility of the glass plates meant someone had to travel to Harvard to see them, and even then the study of a single star over a hundred years required a great deal of time. For every discovery made from the plate collection, like finding Pluto, hundreds or thousands more are waiting to be found.

(DASCH Single scan tile from Plate mc24889) First discovery image of Pluto with Clyde Tombaugh’s notes written on the plate. Taken at Cambridge on April 23, 1930.

Initial enhanced color image of Pluto released in July 2015 during New Horizon’s flyby.
Source: NASA/JHUAPL/SwRI

This is a more accurate image of the natural colors of Pluto as the human eye would see it. Taken by New Horizons in July 2015.
Source: NASA/Johns Hopkins University Applied Physics Laboratory/Southwest Research Institute/Alex Parker

With all of this unused, breakable data and advances in computing power, Professor Jonathan Grindlay began organizing and funding DASCH in 2003 in an effort to digitize the entire hundred-year collection of plates. But Grindlay had an extra obstacle to overcome. Many of the plates carry handwritten notes from the female computers and other astronomers, so Grindlay had to balance the historical significance of the collection with the vast data it offered. To do this, the plates are scanned at low resolution with the marks in place, then cleaned and rescanned at the extremely high resolution necessary for data recording.

A custom scanner had to be designed and constructed specifically for the glass plates and new software was created to bring the digitized image into line with current astronomical data methods. The project hasn’t been without its setbacks, either. Finding funding for the project is a constant problem, and in January 2016, the Observatory’s lowest level flooded. Around 61,000 glass plates were submerged and had to be frozen immediately to prevent mold from damaging the negatives. While the plates are intact, many still need to be unfrozen and restored before being scanned. The custom scanner also had to be replaced because of the flooding.

George Champine Logbook Archive

In conjunction with the plate scanning, a second project is necessary to make the plates usable for extended study. The original logbooks of the female computers contain more than their observations of the plates. These books record the time, date, telescope, emulsion type, and a host of other identifying information necessary to locate the stars on the plates and digitally extract their data. Over 800 logbooks (nearly 80,000 images in total) were photographed by volunteer George Champine.

Those images are now in the time-consuming process of being manually transcribed. Harvard Observatory partnered with the Smithsonian Institution to enlist volunteers who work every day reading and transcribing the vital information in these logbooks. Without this data, the software can’t accurately use the star data scanned from the plates.

Despite all the challenges and setbacks, 314,797 plates had been scanned as of December 2018. The data released and analyzed from the DASCH project has already yielded new discoveries about variable stars. Once the entire collection is digitized, more than a hundred years of observations will be added to the digital record of astronomical data, free for anyone, professional or amateur, to access and study.

The Harvard Plate Collection is a striking example of an extraordinary resource that has gone underused because of its medium. Digital conversion of data is a great way to help any field of research. While Harvard’s plate digitization project provides a model for the conversion of complex data into digital form, not all institutions have the resources to attempt such a large enterprise. If you have a collection in need of digitization, contact Anderson Archival today at 314.259.1900 or email us at info@andersonarchival.com.


Shana Scott is a Digital Archivist and Content Specialist with Anderson Archival, and has been digitally preserving historical materials for over three years. She is involved in every level of the archiving process, creating collections that are relevant, accessible, and impactful. Scott has an MA in Professional Writing and Publishing from Southeast Missouri State University and is a member of SFWA.

Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.

The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers. These tiers include: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.
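One way to picture the tier logic is as a simple lookup in which every task is assigned a tier and each higher tier inherits everything recommended below it. The sketch below is purely illustrative and is not part of the framework; the tier assignments for the last two tasks are invented for the example.

```python
# Illustrative sketch of tier inheritance: a higher processing tier includes
# all tasks recommended at the tiers beneath it. The last two tier
# assignments are invented for illustration only.
TIER_ORDER = ["Baseline", "Moderate", "Intensive"]

TASKS = {
    "Assign a unique identifier to each piece of digital media": "Baseline",
    "Create suitable housing for digital media": "Baseline",
    "Photograph media labels": "Moderate",                   # invented assignment
    "Image each piece of media individually": "Intensive",   # invented assignment
}

def tasks_for_tier(tier):
    """Return every task recommended at `tier`, including lower-tier tasks."""
    included = TIER_ORDER[: TIER_ORDER.index(tier) + 1]
    return [task for task, assigned in TASKS.items() if assigned in included]

print(tasks_for_tier("Moderate"))
```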

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

The Top 10 Things We Learned from Building the Queer Omaha Archives, Part 2 – Lessons 6 to 10

by Angela Kroeger and Yumi Ohira

The Queer Omaha Archives (QOA) is an ongoing effort by the University of Nebraska at Omaha Libraries’ Archives and Special Collections to collect and preserve Omaha’s LGBTQIA+ history. This is still a fairly new initiative at the UNO Libraries, having been launched in June 2016. This blog post is adapted and expanded from a presentation entitled “Show Us Your Omaha: Combatting LGBTQ+ Archival Silences,” originally given at the June 2017 Nebraska Library Association College & University Section spring meeting. The QOA was only a year old at that point, and now that another year (plus change) has passed, the collection has continued to grow, and we’ve learned some new lessons.

So here are the top takeaways from UNO’s experience with the QOA.

#6. Words have power, and sometimes also baggage.

Words sometimes mean different things to different people. Each person’s life experience lends context to the way they interpret the words they hear. And certain words have baggage.

We named our collecting initiative the Queer Omaha Archives because in our case, “queer” was the preferred term for all LGBTQIA+ people as well as referring to the academic discipline of queer studies. In the early 1990s, the community in Omaha most commonly referred to themselves as “gays and lesbians.” Later on, bisexuals were included, and the acronym “GLB” came into more common use. Eventually, when trans people were finally acknowledged, it became “GLBT.” Then there was a push to switch the order to “LGBT.” And then more letters started piling on, until we ended up with the LGBTQIA+ commonly used today. Sometimes, it is taken even further, and we’ve seen LGBTQIAPK+, LGBTQQIP2SAA, LGBTQIAGNC, and other increasingly long and difficult-to-remember variants. (Although, Angela confesses to finding QUILTBAG to be quite charming.) The acronym keeps shifting, but we didn’t want our name to shift, so we followed the students’ lead (remember the QTS “Cuties”?) and just went with “queer.” “Queer” means everyone.

Except . . . “queer” has baggage. Heavy, painful baggage. At Pride 2016, an older man, who had donated materials to our archive, stopped by our booth and we had a conversation about the word. For him, it was the slur his enemies had been verbally assaulting him with for decades. The word still had a sharp edge for him. Angela (being younger than this donor, but older than most of the students on campus) acknowledged that they were just old enough for the word to be startling and sometimes uncomfortable. But many Millennials and Generation Z youths, as well as some older adults, have embraced “queer” as an identity. Notably, many of the younger people on campus have expressed their disdain for being put into boxes. Identifying as “gay” or “lesbian” or “bi” seems too limiting to them. Our older patron left our booth somewhat comforted by the knowledge that for much of the population, especially the younger generations, “queer” has lost its sting and taken on a positive, liberating openness.

But what about other LGBTQIA+ people who haven’t stopped by to talk to us, to learn what we mean when we call our archives “queer”? Who feels sufficiently put off by this word that they might choose against sharing their stories with our archive? We aren’t planning to change our name, but we are aware that our choice of word may give some potential donors and potential users a reason to hesitate before approaching us.

So whatever community your archive serves, think about the words that community uses to describe themselves, and the words others use to describe them, and whether those words might have connotations you don’t intend.

#7. Find your community. Partnerships, donors, and volunteers are the keys to success.

It goes without saying that archives are built from communities. We don’t (usually) create the records. We invite them, gather them, describe them, preserve them, and make them available to all, but the records (usually) come from somewhere else.

Especially if you’re building an archive for an underrepresented community, you need buy-in from the members of that community if you want your project to be successful. You need to prove that you’re going to be trustworthy, honorable, reliable stewards of the community’s resources. You need someone in your archive who is willing and able to go out into that community and engage with them. For us, that was UNO Libraries’ Archives and Special Collections Director Amy Schindler, who has a gift for outreach. Though herself cis and straight, she has put in the effort to prove herself a friend to Omaha’s LGBTQIA+ community.

You also need members of that community to be your advocates. For us, our advocates were our first donors, the people who took that leap of faith and trusted us with their resources. We started with our own university. The work of the archivist and the archives would not have been possible without the collaboration and support of campus partners. UNO GSRC Director Dr. Jessi Hitchins and UNO Sociology Professor Dr. Jay Irwin together provided the crucial mailing list for the QOA’s initial publicity and networking. Dr. Irwin and his students collected the interviews which launched the LGBTQ+ Voices oral history project. Retired UNO professor Dr. Meredith Bacon donated her personal papers and extensive library of trans resources. From outside the UNO community, Terry Sweeney, who with his partner Pat Phalen had served as editor of The New Voice of Nebraska, donated a complete set of that publication, along with a substantial collection of papers, photographs, and artifacts, and he volunteered in the archives for many weeks, creating detailed and accurate descriptions of the items. These four people, and many others, have become our advocates, friends, and champions within the Omaha LGBTQIA+ community.

Our lesson here: Find your champions. Prove your trustworthiness, and your champions will help open doors into the community.

#8. Be respectful, be interested, and be present.

Outreach is key to building connections, bringing in both donors and users for the collection. This isn’t Field of Dreams, where “If you build it, they will come.” You need to forge the partnerships first, in order to even be able to build it. And they won’t come if they don’t know about it and don’t believe in its value. (“They” in this case meaning the community or communities your archives serve, and “it” of course meaning your archives or special collections for that community.)

Fig. 3: UNO Libraries table at a Transgender Day of Remembrance event.

Yumi and Angela are both behind-the-scenes technical services types, so we don’t have quite as much contact with patrons and donors as some others in our department, but we’ve helped out staffing tables at events, such as Pride, Transgender Day of Remembrance, and Transgender Day of Visibility events. We also work to create a welcoming atmosphere for guests who come to the archives for events, tours, and research. We recognize the critical importance of the work our director does, going out and meeting people, attending events, talking to everyone, and inviting everyone to visit. As our director Amy Schindler said in the article “Collaborative Approaches to Digital Projects,” “Engagement with community partners is key . . .”

There’s also something to be said for simply ensuring that folks within the archives, and the library as a whole for that matter, have a basic awareness of the QOA and other collecting initiatives, so that we can better fulfill our mission of connecting people to the resources they need. After all, when someone walks into the library, before they even reach the archives, any staff member might be their first point of contact. So be sure to do outreach within your own institution, as well.

#9. Let me sing you the song of my administrative support.

The QOA grew out of the efforts of UNO students, UNO employees, and Omaha communities to address the underrepresentation of LGBTQIA+ communities in Omaha.

The QOA initiative was inspired by Josh Burford, who delivered a presentation about collecting and archiving historical materials related to queering history. The presentation was co-hosted by UNO’s Gender and Sexuality Resource Center during LGBTQIA+ History Month. After this event, the UNO community became keenly interested in collecting and preserving historical materials and oral history interviews about “Queer Omaha” and began collaborating with our local LGBTQIA+ communities. In Summer 2016, the QOA was officially launched to preserve the legacy of LGBTQIA+ communities in greater Omaha. The QOA is an effort to combat an archival silence in the community, and digital collections and digital engagement are especially effective tools for making LGBTQIA+ communities aware that the archives welcome their records!

But none of this would have been possible without administrative support. If the library administration or the university administration had been indifferent, or worse, hostile, to the idea of building an LGBTQIA+ archive, we might not have been allowed to allocate staff time and other resources to this project. We might have been told “no.” Thank goodness, our library dean is 100% on board. And UNO is deeply committed to inclusion as one of our core values, which has created a campus-wide openness to the LGBTQIA+ community, resulting in an environment perfectly conducive to building this archive. In fact, in November 2017, UNO was identified as the most LGBTQIA+-friendly college in the state of Nebraska by the Campus Pride Index in partnership with BestColleges.com. An initiative like the QOA will always go much more smoothly when your administration is on your side.

#10. The Neverending Story.

We recognize that we still have a long way to go. There are quite a few gaps within our collection. We have the papers of a trans woman, but only oral histories from trans men. We don’t yet have anything from intersex or asexual folks. We have very little from queer people of color or queer immigrants, although we do have some oral histories from those groups, thanks to the efforts of Dr. Jay Irwin, who launched the oral history project, and Luke Wegener, who was hired as a dedicated oral history associate for UNO Libraries. A major focus of the LGBTQIA+ oral history interview project is filling identified gaps within the collection, actively seeking more voices of color and other underrepresented groups within the LGBTQIA+ community. However, despite our efforts to increase the diversity within the collection, we haven’t successfully identified potential donors or interviewees to represent all of the letters within LGBTQIA, much less the +.

This isn’t, and should never be, a series of checkboxes on a list. “Oh, we have a trans woman. We don’t need any more trans women.” No, that’s not how it works. We seek to fill the gaps, while continuing to welcome additional material from groups already represented. We are absolutely not going to turn away a white cis gay man just because we already have multiple resources from white cis gay men. Every individual is different. Every individual brings a new perspective. We want our collection to include as many voices as possible. So we need to promote our collection more. We need to do more outreach. We need to attract more donors, users, and champions. This will remain an ongoing effort without an endpoint. There is always room for growth.


Angela Kroeger is the Archives and Special Collections associate at the University of Nebraska at Omaha and a lifelong Nebraskan. They received their B.A. in English from the University of Nebraska at Omaha and their Master’s in Library and Information Science from the University of Missouri.

Yumi Ohira is the Digital Initiatives Librarian at the University of Nebraska at Omaha. Ohira is originally from Japan where she received a B.S. in Applied Physics from Fukuoka University. Ohira moved to the United States to attend University of Kansas and Southern Illinois University-Carbondale where she was awarded an M.F.A. in Studio Art. Ohira went on to study at Emporia State University, Kansas, where she received an M.L.S. and Archive Studies Certification.

Using Python, FFMPEG, and the ArchivesSpace API to Create a Lightweight Clip Library

by Bonnie Gordon

This is the twelfth post in the bloggERS Script It! Series.

Context

Over the past couple of years at the Rockefeller Archive Center, we’ve digitized a substantial portion of our audiovisual collection. Our colleagues in our Research and Education department wanted to create a clip library using this digitized content, so that they could easily find clips to use in presentations and on social media. Since the scale would be somewhat small and we wanted to spin up a solution quickly, we decided to store A/V clips in a folder with an accompanying spreadsheet containing metadata.

All of our (processed) A/V materials are described at the item level in ArchivesSpace. Since this description existed already, we wanted a way to get information into the spreadsheet without a lot of copying-and-pasting or rekeying. Fortunately, the access copies of our digitized A/V have ArchivesSpace refIDs as their filenames, so we’re able to easily link each .mp4 file to its description via the ArchivesSpace API. To do so, I wrote a Python script that uses the ArchivesSpace API to gather descriptive metadata and output it to a spreadsheet, and also uses the command line tool ffmpeg to automate clip creation.

The script asks for user input on the command line. This is how it works:

Step 1: Log into ArchivesSpace

First, the script asks the user for their ArchivesSpace username and password. (The script requires a config file with the IP address of the ArchivesSpace instance.) It then starts an ArchivesSpace session using methods from ArchivesSnake, an open-source Python library for working with the ArchivesSpace API.
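Roughly, that login step with ArchivesSnake might look like the sketch below. The base URL is a placeholder standing in for whatever address the config file supplies; variable names are illustrative rather than the script’s own.

    from asnake.client import ASnakeClient

    # Prompt for credentials on the command line.
    username = input("ArchivesSpace username: ")
    password = input("ArchivesSpace password: ")

    # The base URL is a placeholder; the real script reads the address from its config file.
    client = ASnakeClient(
        baseurl="https://aspace.example.org/api",
        username=username,
        password=password,
    )
    client.authorize()  # starts an authenticated ArchivesSpace session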

Step 2: Get refID and number to start appending to file

The script then starts a while loop, and asks if the user would like to input a new refID. If the user types back “yes” or “y,” the script then asks for the ArchivesSpace refID, followed by the number to start appending to the end of each clip. The filename for each clip is the original refID, followed by an underscore and a number; asking for the starting number allows more clips to be made from the same original file when the script is run again later.
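As a rough illustration of that naming pattern (the variable names are made up, not the script’s own):

    # Each clip is named <refID>_<number>.mp4. Prompting for a starting number
    # lets a later run pick up at, say, _4 without overwriting clips _1 through _3.
    ref_id = input("ArchivesSpace refID: ").strip()
    clip_number = int(input("Number to start appending to clips: "))
    clip_filename = "{}_{}.mp4".format(ref_id, clip_number)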

Step 3: Get clip length and create clip

The script then calculates the duration of the original file, in order to determine whether to ask the user to input the number of hours for the start time of the clip, or to skip that prompt. The user is then asked for the number of minutes and seconds of the start time of the clip, then the number of minutes and seconds for the duration of the clip. Then the clip is created. In order to calculate the duration of the original file and create the clip, I used the os Python module to run ffmpeg commands. Ffmpeg is a powerful command line tool for manipulating A/V files; I find ffmprovisr to be an extremely helpful resource.
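For readers who haven’t used ffmpeg this way, here is a minimal sketch of both operations. It uses the subprocess module rather than os, and the exact flags in the actual script may differ.

    import subprocess

    # Ask ffprobe (bundled with ffmpeg) for the source file's duration in seconds.
    duration = float(subprocess.check_output([
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        source_file,
    ]))

    # Cut a clip starting at start_time (HH:MM:SS) lasting clip_length (HH:MM:SS),
    # copying the streams rather than re-encoding them.
    subprocess.run([
        "ffmpeg", "-ss", start_time, "-i", source_file,
        "-t", clip_length, "-c", "copy", clip_filename,
    ])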

Clip from Rockefeller Family at Pocantico – Part I, circa 1920, FA1303, Rockefeller Family Home Movies. Rockefeller Archive Center.

Step 4: Get information about clip from ArchivesSpace

Now that the clip is made, the script uses the ArchivesSnake library again and the find_by_id endpoint of the ArchivesSpace API to get descriptive metadata. This includes the original item’s title, date, identifier, and scope and contents note, and the collection title and identifier.
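In ArchivesSnake terms, that lookup might look something like the following sketch (the repository ID is a placeholder, and only a couple of the fields the script gathers are shown):

    # find_by_id returns a ref to the matching archival object; fetch its full JSON.
    results = client.get(
        "repositories/2/find_by_id/archival_objects",   # repo ID is a placeholder
        params={"ref_id[]": ref_id},
    ).json()
    item = client.get(results["archival_objects"][0]["ref"]).json()

    title = item.get("title")
    # Collection-level title and identifier come from the linked resource record.
    resource = client.get(item["resource"]["ref"]).json()
    collection_title = resource.get("title")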

Step 5: Format data and write to csv

The script then takes the data it’s gathered, formats it as needed—such as by removing line breaks in notes from ArchivesSpace, or formatting duration length—and writes it to the csv file.
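A stripped-down version of that step might look like this; the output filename and column order are illustrative only.

    import csv

    # Remove line breaks so multi-line notes don't break the spreadsheet rows.
    scope_note = scope_note.replace("\r", " ").replace("\n", " ").strip()

    with open("clip_library.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            clip_filename, title, date, identifier,
            scope_note, collection_title, collection_id,
        ])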

Step 6: Decide how to continue

The loop starts again, and the user is asked “New refID? y/n/q.” If the user inputs “n” or “no,” the script skips asking for a refID and goes straight to asking for information about how to create the clip. If the user inputs “q” or “quit,” the script ends.
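The control flow amounts to something like this simplified sketch; the prompts and branching in the real script are more involved.

    while True:
        answer = input("New refID? y/n/q: ").strip().lower()
        if answer in ("q", "quit"):
            break                      # end the script
        if answer in ("y", "yes"):
            ref_id = input("ArchivesSpace refID: ").strip()
        # ...prompt for start time and duration, create the clip, write metadata...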

The script is available on GitHub. Issues and pull requests welcome!


Bonnie Gordon is a Digital Archivist at the Rockefeller Archive Center, where she focuses on digital preservation, born digital records, and training around technology.

The BitCurator Script Library

by Walker Sampson

This is the eleventh post in the bloggERS Script It! Series.

One of the strengths of the BitCurator Environment (BCE) is the open-ended potential of the toolset. BCE is a customized version of the popular (granted, insofar as desktop Linux can be popular) Ubuntu distribution, and as such it remains a very configurable working environment. While there is a basic notion of a default workflow found in the documentation (i.e., acquire content, run analyses on it, and increasingly, do something based on those analyses, then export all of it to another spot), the range of tools and prepackaged scripts in BCE can be used in whatever order fits the needs of the user. But even aside from this configurability, there is the further option of using new scripts to achieve different or better outcomes.

What is a script? I’m going to shamelessly pull from my book for a brief overview:

A script is a set of commands that you can write and execute in order to automatically run through a sequence of actions. A script can support a number of variations and branching paths, thereby supporting a considerable amount of logic inside it – or it can be quite straightforward, depending upon your needs. A script creates this chain of actions by using the commands and syntax of a command line shell, or by using the commands and functions of a programming language, such as Python, Perl or PHP.

In short, scripts allow the user to string multiple actions together in some defined way. This can open the door to batch operations – repeating the same job automatically for a whole set of items – that speed up processing. Alternatively, a user may notice that they are repeating a chain of actions for a single item in a very routine way. Again, a script may fill in here, grouping together all those actions into a single script that the user need only initiate once. Scripting can also bridge the gap between two programs, or adjust the output of one process to make it fit into the input of another. If you’re interested in scripting, there are basically two (non-exclusive) routes to take: shell scripting or scripting with a programming language.

  • For an intro to both writing and running bash shell scripts (bash being one of the most popular Unix shells, and the one BitCurator defaults to), check out this tutorial by Tania Rascia.
  • There are many programming languages that can be used in scripts; Python is a very common one. Learning how to script with Python is tantamount to simply learning Python, so it’s probably best to set upon that path first. Resources abound for this endeavor, and the book Automate the Boring Stuff with Python is free under a Creative Commons license.
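To make the batch idea above concrete, here is a minimal, hypothetical Python sketch that runs the same tool over every disk image in a folder. The directory, file pattern, and fiwalk flags are only examples.

    import subprocess
    from pathlib import Path

    # Run fiwalk over every raw image in the folder, writing DFXML alongside each one.
    for image in Path("disk-images").glob("*.img"):
        subprocess.run(["fiwalk", "-X", str(image.with_suffix(".xml")), str(image)])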

The BitCurator Scripts Library

The BitCurator Scripts Library is a spot we designed to help connect users with scripts for the environment. Most scripts are already available online somewhere (typically GitHub), but a single page that inventories these resources can further their use. A brief look at a few of the scripts available will give a better idea of the utility of the library.

  • If you’ve ever had the experience of repeatedly trying every listed disk format in the KryoFlux GUI (which numbers well over a dozen) in an attempt to resolve stream files into a legible disk image, the DiskFormatID program can automate that process.
  • fiwalk, which is used to identify files and their metadata, doesn’t operate on Hierarchical File System (HFS) disk images. This prevents the generation of a DFXML file for HFS disks as well. Given the utility and the volume of metadata located in that single document, along with the fact that other disk images receive the DFXML treatment, this stands out as a frustrating process gap. Dianne Dietrich has fortunately whipped up a script to generate just such a DFXML for all your HFS images!
  • The shell scripts available at rl-bitcurator-scripts are a great example of running the same command over multiple files: multi_annofeatures.sh, multi_be.sh, and multifiwalk.sh run identify_filenames.py, bulk_extractor, and fiwalk over a directory, respectively. Conversely, simgen_prod.sh is an example of a shell script that groups multiple commands together and runs that group over a set of items.

For every script listed, we provide a link (where applicable) to any related resources, such as a paper that explains the thinking behind a script, a webinar or slides where it is discussed, or simply a blog post that introduces the code. Presently, the list includes bash shell scripts along with Python and Perl scripts.

Scripts have a higher bar to use than programs with a graphical frontend, and some familiarity or comfort with the command line is required. The upside is that scripts can be amazingly versatile and customizable: filling in gaps in a process, corralling disparate data into a single presentable sheet, or saving time by automating commands. Along with these strengths, viewing other people’s scripts often sparks an idea for one you may benefit from or want to write yourself.

Following from this, if you have a script you would like added to the library, please contact us (select ‘Website Feedback’) or simply post in our Google Group. Bear one aspect in mind, however: scripts do not need to be perfect. Scripts are meant to be used and adjusted over time, so if you are considering a script to include, please know that it doesn’t need to accommodate every user or situation. If you have a quick and dirty script that completes the task, it will likely be beneficial to someone else, even if, or especially if, they need to adjust it for their work environment.


Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries, where he is responsible for the acquisition, accessioning and description of born digital materials, along with the continued preservation and stewardship of all digital materials in the Libraries. He is the coauthor of The No-nonsense Guide to Born-digital Content.