This is the first post in a new series of conversations between emerging professionals and archivists actively working with digital materials.
Amy Berish is an Assistant Archivist at the Rockefeller Archive Center in Sleepy Hollow, New York. There, she is a member of the Processing Team, working on processing collections that cover a wide range of philanthropic history and a variety of materials. A recent graduate of the University of Pittsburgh Master of Library and Information Science program, Amy has generously shared her path and experiences with bloggERS!
Amy began working in her local library when she was 14 and went on to major in library and information science as an undergraduate. While there and throughout graduate school, she worked at the university library, took various internships, and worked for school credit at the preservation lab, all in an effort to find her place in the library and archives world.
In her current role at the Rockefeller Archive Center, she works as part of a larger staff to process incoming collections in both paper and digital formats. The Rockefeller Archive Center collects materials related not only to the Rockefeller family but also to several large philanthropic organizations, including the Ford Foundation, the Near East Foundation, the Commonwealth Fund, the Rockefeller Brothers Fund, the Henry Luce Foundation, and the W. T. Grant Foundation, among others. While she shied away from working with digital formats and learning coding skills during college, she has had the opportunity to pursue that work in her current role and has embraced the challenges that have come with it.
“I feel like digital work is the biggest challenge right now, in both the work I am doing and the work of the broader archival profession,” she said. “Learning to navigate the technical skills required to do some of the work we are doing can be especially daunting. Having a positive attitude about change and a willingness to learn is often easier said than done – but I also think these two factors could help make this type of work seem more doable.”
Amy has found support in her teams at the Rockefeller Archive Center and in the archives community in and around New York City. For example, Digital Team members at the Rockefeller Archive Center reminded her that it would be ok to break things in the code, and that they would be able to fix it if she wanted to experiment with a new way of scripting. She has also found support in online forums, which have allowed her to connect to others doing related work across the country.
Beyond scripting, part of her position requires her to deal with formats that might be obsolete or nearly so, and to face policy questions regarding proprietary information and copyright. As with coding, however, Amy has used her enthusiasm for learning new skills as an asset in facing these challenges.
“I love learning new things and as a processing archivist, it’s part of my job to continue to learn more about various topics through each collection I process,” Amy said. “I also get the opportunity to learn through some of the digital projects I am working on. I have learned to automate processes by writing scripts. I have also had a lot of experience lately working with legacy digital media – from optical disks and floppies to zip disks and Bernoulli disks – it has been a challenge trying to get 10-year-old media to function properly!”
As a new professional, Amy was quick to mention some of the challenges that archivists can face at the beginning of their career. Still, she said, a pat on the back for each small step you take is well-deserved. She cited one of her graduate school professors, who encouraged her to cultivate an “ethos of fearlessness” when facing technology; she said the phrase has become a mantra in her current position. Because that, Amy acknowledged, is easier said than done, especially while you’re still in school, she shared three other pieces of advice for those just starting out in digital archives work: take the opportunities you’re given, always be ready to learn, and don’t be afraid of digital work.
Georgia Westbrook is an MSLIS student at Syracuse University. She’s interested in visual resources, oral histories, digital publishing, and open access. Connect with her on LinkedIn or on her website.
This post is a case study about how I used the statistical programming language R to help export, transform, and load data from legacy finding aids into ArchivesSpace. I’m sharing this workflow in the hope that other institutions might find the approach helpful and that it could be generalized to other issues facing archives.
I decided to use the programming language R because it is a free and open source programming language that I had some prior experience using. R has a large and active user community as well as a large number of relevant packages that extend the basic functions of R, including libraries that can deal with Microsoft Word tables and read and write XML. All of the code for this project is posted on GitHub.
This script grew out of a specific task: I inherited hundreds of finding aids with minimal collection-level information and very long and detailed box and folder lists. These were all Microsoft Word documents with the box and folder list formatted as a table within the document. We recently adopted ArchivesSpace as our archival content management system, so the challenge was to reformat this data and upload it into ArchivesSpace. I considered manual approaches but eventually opted to develop this code to automate the work. The code is organized into three sections: exporting the data, transforming and cleaning the data, and finally, creating an EAD file to load into ArchivesSpace.
After installing the appropriate libraries, the first step of the process was to extract the data from the Microsoft Word tables. Given the nature of our finding aids, I focused on extracting only the box and folder list; collection-level information would be added manually later in the process.
This process was surprisingly straightforward; I created a variable with a path to a Word Document and used the “docx_extract_tbl” function from the docxtractr package to extract the contents of that table into a data.frame in R. Sometimes our finding aids were inconsistent so I occasionally had to tweak the data to rearrange the columns or add missing values. The outcome of this step of the process is four columns that contain folder title, date, box number, and folder number.
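The project's code is written in R. For readers who work in Python, the same extraction step can be sketched roughly as follows, using only the standard library. This is an illustrative sketch, not the code from the project: it relies on the fact that a .docx file is a zip archive containing word/document.xml, and the function names and the assumed column order (title, date, box, folder) are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_first_table(docx_path):
    """Return the first table in a .docx file as a list of rows (lists of cell text)."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    table = root.find(f".//{W}tbl")
    if table is None:
        return []
    rows = []
    for tr in table.findall(f"{W}tr"):
        cells = []
        for tc in tr.findall(f"{W}tc"):
            # A cell can hold several runs of text; join them into one string.
            cells.append("".join(t.text or "" for t in tc.iter(f"{W}t")).strip())
        rows.append(cells)
    return rows


def rows_to_records(rows, columns=("title", "date", "box", "folder")):
    """Map each row to a dict, padding short rows with empty strings.

    The column order is an assumption; inconsistent finding aids would
    need their columns rearranged before this step.
    """
    records = []
    for row in rows:
        padded = list(row) + [""] * (len(columns) - len(row))
        records.append(dict(zip(columns, padded)))
    return records
```

Padding short rows mirrors the occasional manual tweak described above, where missing values had to be added before the data lined up into four columns.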
This data export process is remarkably flexible. Using other R functions and libraries, I have extended this process to export data from CSV files or Excel spreadsheets. In theory, this process could be extended to receive a wide variety of data including collection-level descriptions and digital objects from a wider variety of sources. There are other tools that can also do this work (Yale’s Excel to EAD process and Harvard’s Aspace Import Excel plugin), but I found this process to be easier for my institution’s needs.
Data Transformation and Cleaning
Once I extracted the data from the Microsoft Word document, I did some minimal data cleanup, a sampling of which included:
Extracting a date range for the collection. Again, past practice focused on creating folder-level descriptions and nearly all of our finding aids lacked collection-level information. From the box/folder list, I tried to extract a date range for the entire collection. This process was messy but worked a fair amount of the time. In cases when the data were not standardized, I defined this information manually.
Standardizing “No Date” text. Over the course of this project, I discovered the following terms for folders that didn’t have dates: “n.d.”, “N.D.”, “no date”, “N/A”, “NDG”, “Various”, “N. D.”, “” (blank), “??”, “n. d.”, “n. d. ” (with a trailing space), “No date”, “-”, “N.A.”, “ND”, “NO DATE”, and “Unknown.” For all of these, I updated the date field to “Undated” as a way to standardize this field.
Spelling out abbreviations. Occasionally, I would use regular expressions to spell out words in the title field. This could be standard terms like “Corresp” to “Correspondence” or local terms like “NPU” to “North Park University.”
R is a powerful tool and provides many options for data cleaning. We did pretty minimal cleaning but this approach could be extended to do major transformations to the data.
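The three cleanup steps described above can be sketched in Python as follows. This is a rough illustration rather than the project's R code: the function names are hypothetical, the abbreviation table contains only the two examples mentioned above, and the year pattern (four-digit years from 1600 to 2099) is an assumption about what the date column contains.

```python
import re

# Variants listed in the post, lowercased and trimmed; "" covers blank cells.
NO_DATE_TERMS = {
    "n.d.", "no date", "n/a", "ndg", "various", "n. d.",
    "", "??", "-", "n.a.", "nd", "unknown",
}

# Hypothetical lookup table; both entries are the examples from the post.
ABBREVIATIONS = {
    r"\bCorresp\b\.?": "Correspondence",
    r"\bNPU\b": "North Park University",
}

# Assumed pattern: a four-digit year not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[6-9]\d{2}|20\d{2})(?!\d)")


def standardize_date(value):
    """Replace any known 'no date' variant with 'Undated'."""
    cleaned = value.strip()
    return "Undated" if cleaned.lower() in NO_DATE_TERMS else cleaned


def expand_abbreviations(title):
    """Spell out abbreviations in a folder title using regular expressions."""
    for pattern, replacement in ABBREVIATIONS.items():
        title = re.sub(pattern, replacement, title)
    return title


def collection_date_range(dates):
    """Derive a collection-level range from folder dates, or None if no years are found."""
    years = [int(year) for date in dates for year in YEAR.findall(date)]
    if not years:
        return None
    first, last = min(years), max(years)
    return str(first) if first == last else f"{first}-{last}"
```

Returning None when no year is found leaves room for the manual fallback described above for collections whose dates were not standardized.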
Create EAD to Load into ArchivesSpace
Lastly, with the data cleaned, I could restructure the data into an XML file. Because the goal of this project was to import into ArchivesSpace, I created an extremely basic EAD file meant mainly to enter the box and folder information into ArchivesSpace; collection-level information would be added manually within ArchivesSpace. In order to get the cleaned data to import, I first needed to define a few collection-level elements including the collection title, collection ID, and date range for the collection. I also took this as an opportunity to apply a standard conditions governing access note for all collections.
Next, I used the XML package in R to create the minimally required nodes and attributes. For this section, I relied on examples from the book XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang. I created the basic EAD structure in R using the “newXMLNode” function from the XML package. This section of code is very minimal, and I would welcome suggestions from the broader community about how to improve it. I then defined functions that make the title, date, box, and folder nodes and applied them to the data exported and transformed in earlier steps. Finally, the script saves everything as an XML file that I then uploaded into ArchivesSpace.
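To give a sense of what such a minimal file looks like, here is a rough Python equivalent built with the standard library's ElementTree rather than R's XML package. It is a sketch under several assumptions: the element names follow EAD 2002, no namespace is declared, the function name and input dictionary keys are hypothetical, and the output has not been validated against the EAD schema or tested against the ArchivesSpace importer, which may require additional elements.

```python
import xml.etree.ElementTree as ET


def build_ead(collection_id, collection_title, date_range, access_note, folders):
    """Build a bare-bones EAD tree from collection-level values and folder records."""
    ead = ET.Element("ead")

    # Minimal header: identifier and title only.
    header = ET.SubElement(ead, "eadheader")
    ET.SubElement(header, "eadid").text = collection_id
    filedesc = ET.SubElement(header, "filedesc")
    titlestmt = ET.SubElement(filedesc, "titlestmt")
    ET.SubElement(titlestmt, "titleproper").text = collection_title

    # Collection-level description, including a standard access note.
    archdesc = ET.SubElement(ead, "archdesc", level="collection")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = collection_title
    ET.SubElement(did, "unitid").text = collection_id
    ET.SubElement(did, "unitdate").text = date_range
    access = ET.SubElement(archdesc, "accessrestrict")
    ET.SubElement(access, "p").text = access_note

    # One file-level component per row of the cleaned box and folder list.
    dsc = ET.SubElement(archdesc, "dsc")
    for folder in folders:
        component = ET.SubElement(dsc, "c01", level="file")
        c_did = ET.SubElement(component, "did")
        ET.SubElement(c_did, "unittitle").text = folder["title"]
        ET.SubElement(c_did, "unitdate").text = folder["date"]
        ET.SubElement(c_did, "container", type="box").text = folder["box"]
        ET.SubElement(c_did, "container", type="folder").text = folder["folder"]
    return ET.ElementTree(ead)
```

The returned tree can be saved with tree.write("collection.xml", encoding="utf-8", xml_declaration=True), where the file name is a placeholder.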
Although this script was designed to solve a very specific problem—extracting box and folder information from a Microsoft Word table and importing that information into ArchivesSpace—I think this approach could have wide and varied usage. The import process can accept loosely formatted data in a variety of different formats including Microsoft Word, plain text, CSV, and Excel and reformat the underlying data into a standard table. R offers an extremely robust set of packages to update, clean, and reformat this data. Lastly, you can define the export process to reformat the data into a suitable file format. Given the nature of this programming language, it is easy to preserve your original data source as well as document all the transformations you perform.
Andy Meyer is the director (and lone arranger) of the F.M. Johnson Archives and Special Collections at North Park University. He is interested in archival content management systems, digital preservation, and creative ways to engage communities with archival materials.
San José is in many ways an apt location for a tech-centered library conference like Code4Lib. It is the largest city in Santa Clara Valley (aka Silicon Valley) and home to San José State University, which has one of the biggest library science programs in the country. Yet the tone of the 14th annual Code4Lib conference, which convened on February 19-22, 2019, was cautious and at times critical of the “big tech” landscape. In her opening keynote, Sarah Roberts, Assistant Professor of Information Studies at UCLA, talked about her research on social media content moderation. She said that while this work is deemed critical by social media companies to manage lewd or disturbing content, it is also emotionally taxing, low-paying, and executed by a mostly invisible global labor force. Because this work is kept hidden, consumers are led to believe either that social media content is unmediated or that content moderation is somehow automated. This push towards transparency and openness, in how we manipulate our code, technologies, content, and even our labor practices, was a recurring theme throughout the conference.
There were a number of archivists and archives-adjacent folks attending the conference and a handful of interesting sessions related to digital archives. In a talk entitled “Natural Language Processing for Discovery of Born-Digital Records,” NCSU Libraries Fellow Emily Higgs discussed her exploration of named entity recognition (NER) to aid in describing digital collections. Using the open source natural language processing software, spaCy, Higgs extracted personal names to a CSV file, with entities ranked by frequency, and included the top five to ten names in the Scope and Content section of the finding aid. She also tested a discovery tool, Open Semantic Desktop Search, to enable researchers to more easily browse through a digital collection using the reading room computer. She noted that while it offered faceted browsing as well as fuzzy and semantic search capabilities, the major drawback was the long indexing time for larger digital collections.
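The ranking step described in that talk can be sketched in a few lines of Python. This is not Higgs's code: it assumes entity extraction has already been run by an NER tool and that results arrive as (text, label) pairs, with "PERSON" as the label for personal names (the label used by spaCy's English models). The function names and sample data are hypothetical.

```python
import csv
from collections import Counter


def rank_person_names(entities, top_n=10):
    """Count PERSON entities and return the most frequent names with their counts."""
    counts = Counter(text.strip() for text, label in entities if label == "PERSON")
    return counts.most_common(top_n)


def write_ranked_csv(ranked, handle):
    """Write (name, count) pairs to an open text file handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["name", "count"])
    writer.writerows(ranked)
```

A processing archivist could then review the top rows of the CSV by hand before copying names into a Scope and Content note, since NER output typically includes variant spellings and false positives.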
In the realm of web archiving, Ilya Kreymer of Rhizome presented a demo of Webrecorder, a set of free and open source tools for creating and viewing web archives. Funded by two Mellon Foundation grants, Webrecorder is a browser-based application that focuses on capturing high-fidelity web archives. Unlike the more traditional web crawlers, Webrecorder is meant to support a more curated approach to web archiving: think quality over quantity. In his demo, Kreymer quickly and easily archived audio files from a SoundCloud library as well as the most recent Code4Lib conference hashtag posts on Twitter. One of Webrecorder’s most impressive features is its ability to emulate legacy browsers to record things like Flash-based websites. Webrecorder has a lot going for it: it’s free and easy to use, with an attractive and intuitive interface. While Kreymer was quick to point out that they haven’t solved web archiving, it was nonetheless exciting to see a concentrated effort towards refining it.
As a metadata librarian, I am probably a little biased here, but one of the most exciting talks of the conference was given by Dhanushka Samarakoon and Harish Maringanti of the University of Utah’s Marriott Library. Inspired by a story they heard on NPR about PoetiX, a sonnet-writing competition where judges are asked to determine if a sonnet was written by man or machine, Samarakoon and Maringanti began to think about the implications of machine learning for metadata creation. Recognizing that metadata is typically where the bottleneck occurs in digital library workflows, they wanted to explore how machine learning technology might simplify descriptive metadata creation for historical image collections. To do this they created a model using data from ImageNet, a database of over 14 million images designed for use in visual object recognition software research, and over 470 photographs with high quality human-generated metadata from their own digital library collections. Once this data was introduced into a pre-trained neural network, they ran a collection of photographs through the system to see how well the model worked. It wasn’t perfect. For instance, a photo of a man standing next to a cow was described as “Mary Jane standing by a cow,” apparently due to the many people identified as “Mary Jane” in the original digital library dataset. However, it was exciting to see the possibilities of AI in image analysis and the implications this might have for future metadata automation.
At one point during the conference someone took a quick visual poll of how many first-time attendees were in the audience. There were a lot of us. But there were also a lot of Code4Lib veterans. During a lightning talk about the origin of the conference, Karen Coombs, Ryan Wick, and Roy Tennant recalled wanting to create a conference with a “no spectators” motto, where attendees had ample opportunities to engage, participate, and have their voices heard. Unlike most other library conferences, Code4Lib doesn’t have competing programming. Everyone gathers in one large room and attends the same talks and sessions. It was this model of inclusivity, equality, and innovation that I found most appealing about Code4Lib, and it will no doubt draw me back in coming years.
For more information about the conference, including streaming video and slides, visit the Code4Lib 2019 website.
Nicole Shibata is the Metadata Librarian at California State University, Northridge.
Over the past couple of months, we’ve heard a lot on bloggERS about how current students, recent grads, and mid-career professionals have made tech skills a strategic priority in their development plans. I like to think about the problem of “gaining tech skills” as being similar to “saving the environment”: individual action is needed and necessary, but it is most effective when it feeds clearly into systemic action.
So that raises the question: what root changes might educators of all types suggest and support to help GLAM professionals prioritize tech skills development? What are educator communities and systems – iSchools, faculty, and continuing education instructors – doing to achieve this? These questions are among those addressed by the BitCuratorEdu research project.
The BitCuratorEdu project is a three-year effort funded by the Institute of Museum and Library Services (IMLS) to study and advance the adoption of born-digital archiving and digital forensics tools and methods in libraries and archives through a range of professional education efforts. The project is a partnership between the School of Information and Library Science at the University of North Carolina at Chapel Hill and the Educopia Institute, along with the Council of State Archivists (CoSA) and nine universities that are educating future information professionals.
We’re addressing two main research questions:
What are the primary institutional and technological factors that influence adoption of digital forensics tools and methods in different educational settings?
What are the most viable mechanisms for sustaining collaboration among LIS programs on the adoption of digital forensics tools and methods?
The project started in September 2018 and will conclude in Fall 2021, and Educopia and UNC SILS will be conducting ongoing research and releasing open educational resources on a rolling basis. With the help of our Advisory Board made up of nine iSchools and our Professional Experts Panel composed of leaders in the GLAM sector, we’re:
Piloting instruction to produce and disseminate a publicly accessible set of learning objects that can be used by education providers to administer hands-on digital forensics education
Gathering information and centralizing existing educational content to produce guides and other resources, such as this (still-in-development) guide to datasets that can be used to learn new digital forensics skills or test digital archives software/processes
Investigating and reporting on institutional factors that facilitate, hinder and shape adoption of digital forensics educational offerings
Through this work and intentional community cultivation, we hope to advance a community of practice around digital forensics education through partner collaboration, wider engagement, and exploration of community sustainability mechanisms.
To support our research and steer the direction of the project, we have conducted and analyzed nine advisory board interviews with current faculty who have taught or are developing a curriculum for digital forensics education. So far we’ve learned that:
instructors want and need access to example datasets to use in the classroom (especially cultural heritage datasets);
many want lesson plans and activities for teaching born-digital archiving tools and environments like BitCurator in one or two weeks because few courses are devoted solely to digital forensics;
they want further guidance on how to facilitate hands-on digital forensics instruction in distributed online learning environments; and
they face challenges related to IT support at their home institutions, just like those grappled with by practitioners in the field.
This list barely scratches the surface of our exploration into the experiences and needs of instructors for providing more effective digital forensics education, and we’re excited to tackle the tough job of creating resources and instructional modules that address these and many other topics. We’re also interested in exploring how the resources we produce may also support continuing education needs across libraries, archives, and museums.
We recently conducted a Twitter chat with SAA’s SNAP Section to learn about students’ experiences in digital forensics learning environments. We heard a range of experiences, from students who reported they had no opportunity to learn about digital forensics in some programs, to students who received effective instruction that remained useful post-graduation. We hope that the learning modules released at the conclusion of our project will address students’ learning needs just as much as their instructors’ teaching needs.
Later this year, we’ll be conducting an educational provider survey that will gather information on barriers to adoption of digital forensics instruction in continuing education. We hope to present to and conduct workshops for a broader set of audiences including museum and public records professionals.
Our deliverables, from conference presentations to learning modules, will be released openly and freely through a variety of outlets including the project website, the BitCurator Consortium wiki, and YouTube (for recorded webinars). Follow along at the project website or contact email@example.com if you have feedback or want to share your insights with the project team.
Jess Farrell is the project manager for BitCuratorEdu and community coordinator for the Software Preservation Network at Educopia Institute. Katherine Skinner is the Executive Director of Educopia Institute, and Christopher (Cal) Lee is Associate Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill, teaching courses on archival administration, records management, and digital curation. Katherine and Cal are Co-PIs on the BitCuratorEdu project, funded by the Institute of Museum and Library Services.
PASIG 2019 met the week of February 11th at El Colegio de México (commonly known as Colmex) in Mexico City. PASIG stands for Preservation and Archiving Special Interest Group, and the group’s meeting brings together an international group of practitioners, industry experts, vendors, and researchers to discuss practical digital preservation topics and approaches. This meeting was particularly special because it was the first time the group convened in Latin America (past meetings have generally been held in Europe and the United States). Excellent real-time bilingual translation for presentations given in both English and Spanish enabled conversations across geographical and linguistic boundaries and made room to center Latin American preservationists’ perspectives and transformative post-custodial archival practice.
The conference began with broad overviews of digital preservation topics and tools to create a common starting ground, followed by more focused deep-dives on subsequent days. I saw two major themes emerge over the course of the week. The first was the importance of people over technology in digital preservation. From David Minor’s introductory session to Isabel Galina Russell’s overview of the digital preservation landscape in Mexico, presenters continuously surfaced examples of the “people side” of digital preservation (think: preservation policies, appraisal strategies, human labor and decision-making, keeping momentum for programs, communicating to stakeholders, ethical partnerships). One point that struck me during the community archives session was Verónica Reyes-Escudero’s discussion of “cultural competency as a tool for front-end digital preservation.” By conceptualizing interpersonal skills as a technology for facilitating digital preservation, we gain a broader and more ethically grounded idea of what it is we are really trying to do by preserving bits in the first place. Software and hardware are part of the picture, but they are certainly not the whole view.
The second major theme was that digital preservation is best done together. Distributed digital preservation platforms, consortial preservation models, and collaborative research networks were well represented by speakers from LOCKSS, Texas Digital Library (TDL), DuraSpace, Open Preservation Foundation, Software Preservation Network, and others. The takeaway from these sessions was that the sheer resource-intensiveness of digital preservation means that institutions, both large and small, are going to have to collaborate in order to achieve their goals. PASIG seemed to be a place where attendees could foster and strengthen these collective efforts. Throughout the conference, presenters also highlighted failures of collaborative projects and the need for sustainable financial and governance models, particularly in light of recent developments at the Digital Preservation Network (DPN) and Digital Public Library of America (DPLA). I was particularly impressed by Mary Molinaro’s honest and informative discussion about the factors that led to the shuttering of DPN. Molinaro indicated that DPN would soon be publishing a final report in order to transparently share their model, flaws and all, with the broader community.
Touching on both of these themes, Carlos Martínez Suárez of Video Trópico Sur gave a moving keynote about his collaboration with Natalie M. Baur, Preservation Librarian at Colmex, to digitize and preserve video recordings he made while living with indigenous groups in the Mexican state of Chiapas. The question and answer portion of this session highlighted some of the ethical issues surrounding rights and consent when providing access to intimate documentation of people’s lives. While Colmex is not yet focusing on access to this collection, it was informative to hear Baur and others talk a bit about the ongoing technical, legal, and ethical challenges of a work-in-progress collaboration.
Presenters also provided some awesome practical tools for attendees to take home with them. One of the many great open resources session leaders shared was Frances Harrell (NEDCC) and Alexandra Chassanoff (Educopia)’s DigiPET: A Community Built Guide for Digital Preservation Education + Training Google document, a living resource for compiling educational tools that you can add to using this form. Julian Morley also shared a Preservation Storage Cost Model Google sheet that contains a template with a wealth of information about estimating the cost of different digital preservation storage models, including comparisons for several cloud providers. Amy Rudersdorf (AVP), Ben Fino-Radin (Small Data Industries), and Frances Harrell (NEDCC) also discussed helpful frameworks for conducting self-assessments.
PASIG closed out by spending some time on the challenges involved with preserving emerging and complex formats. On the last afternoon of sessions, Amelia Acker (University of Texas at Austin) spoke about the importance of preserving APIs, terms of service, and other “born-networked” formats when archiving social media. She was followed by a panel of software preservationists who discussed different use cases for preserving binaries, source code, and other software artifacts.
Thanks to the wonderful work of the PASIG 2019 steering, program, and local arrangements committees!
Kelly Bolding is the Project Archivist for Americana Manuscript Collections at Princeton University Library, as well as the team leader for bloggERS! She is interested in developing workflows for processing born-digital and audiovisual materials and making archival description more accurate, ethical, and inclusive.
Please take this short survey to contribute to the 2019 ERS Community Project! The survey closes on Friday, March 29.
In December 2018, the ERS Steering Committee put out a call for ideas for a 2019 ERS community project. We’re thankful for the community input and are pleased to announce that we’re building a master list of digital archives and digital preservation resources that can be used for reference, or to provide a resource overlay for existing best practice and workflow documentation. The Committee has begun compiling resources and thinking about how they connect, but broader input is essential to this project’s success.
At this stage, we are interested in getting a sense of what the most useful resources are in our community. Please take our survey to share your top three go-to resources as well as any areas of electronic records work that you feel lack guidance and documentation. We are thinking of resources broadly, so feel free to suggest your three favorite journal articles, blogs, handbooks, workflows, tools and manuals, or any other style of resource that helps you process and preserve born-digital collections.
After the survey closes on Friday, March 29, we’ll compile and share the results. We also hope to eventually open up a community documentation space where anyone can add to our current list of resources. Once the data collection period is over, we’ll determine the best way to share a more polished version of this resource list.
On behalf of the ERS Steering Committee, thank you for participating!
ArchivesSpace, Archivematica, BitCurator, EAD, the list goes on! The contemporary archivist is tasked with not only processing paper collections, but also with processing digital records and managing the descriptive data we create. This work requires technical skills that archivists twenty or even ten years ago didn’t need to master. It’s also rare that archivists get extensive training in the technical aspects of the field during their graduate programs. So, how can a team of archivists build the skills they’ll need to meet the needs of an increasingly technical field? At the Princeton University Library, the newly formed Archival Description and Processing Team (ADAPT) is committed to meeting these challenges by building technical capacity across the team. We are achieving this by working on real-world projects that require technical skills, and by leveraging existing knowledge and skills in the organization, seeking outside training, and championing supervisor support for using time to grow our technical skills.
One of the most important requirements for growing technical capacity on the processing team is supervisor support for the effort. Workshops, training, and solving technical problems take a significant amount of time. Without management support for the time needed to develop technical skills, the team would not be able to experiment, attend trainings, or practice writing code. As the manager of ADAPT, I make this possible by encouraging staff to set specific goals related to developing technical skills on their yearly performance evaluations; I also accept that it might take us a little longer to complete all of our processing. To fit this work into my own schedule, I identify real-world problems and block out time on my schedule to work on them or arrange meetings with colleagues who can assist me. Blocking out time in advance helps me stick to my commitment to building my technical skills. While the time needed to develop these skills means that some work happens more slowly today, the benefit of having a team that can manipulate data and automate processes is an investment in the future that will result in a more productive and efficient processing team.
With the support to devote time to building technical skills, ADAPT staff use a number of resources to improve their skills. Working with internal staff who already have skills they want to learn has been one successful approach. This has generally paired well with the need to solve real-world data problems. For example, we recently identified the need to move some old container information to individual component-level scope and content notes in a finding aid. We were able to complete this after several in-house training sessions on XPath and XQuery taught by a Library staff member. This introductory training helped us realize that the problem could be solved with XQuery scripting and we took on the project, while drawing on the in-house XQuery expert for assistance. This combination of identifying real-world problems and leveraging existing knowledge within the organization leads both to increased technical skills and projects getting done. It also builds confidence and knowledge that can be more easily applied to the next situation that requires a particular kind of technical expertise.
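As a rough illustration of this kind of data fix, the sketch below moves old container values into component-level scope and content notes. It is written in Python with ElementTree rather than XQuery, and it is not the team's script. The details are assumptions: components are unnumbered <c> elements without a namespace, old containers are marked with a hypothetical type value of "old-box", and the note wording is invented for the example.

```python
import xml.etree.ElementTree as ET


def move_old_containers(root, old_type="old-box"):
    """Move <container type=old_type> values into <scopecontent> notes.

    Returns the number of containers moved.
    """
    moved = 0
    # Copy the component list first so the tree is not modified while iterating.
    for component in list(root.iter("c")):
        did = component.find("did")
        if did is None:
            continue
        for container in did.findall("container"):
            if container.get("type") != old_type:
                continue
            note = ET.Element("scopecontent")
            value = (container.text or "").strip()
            ET.SubElement(note, "p").text = f"Former container: {value}"
            # Place the note directly after <did>, ahead of any child components.
            component.insert(list(component).index(did) + 1, note)
            did.remove(container)
            moved += 1
    return moved
```

Returning a count makes it easy to compare the number of moved values against the number expected before saving the edited finding aid.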
Finally, building in-house expertise requires allowing staff to determine what technical skills they want to build and how they might go about doing it. Often that requires outside training. Over the past several years, we have brought workshops to campus on working with the command line and using the ArchivesSpace API. Staff have also identified online courses and classes offered by the Office of Information Technology as important resources for building their technical skills. Providing support and time to attend these various trainings or complete online courses during the work day creates an environment where individuals can explore their interests and the team can build a variety of technical skills that complement each other.
As archival work evolves, having deeper technology skills across the team improves our ability to get our work done. By securing the right support, tapping into in-house resources, and seeking out additional training, it’s possible to build technological capability within the processing team. In turn, the team will be increasingly able to tackle the day-to-day technical challenges of managing digital records and descriptive data efficiently.
Alexis Antracoli is Assistant University Archivist for Technical Services at Princeton University Library where she leads the Archival Processing and Description Team. She has published on web archiving and the archiving of born-digital audio visual content. Alexis is active in the Society of American Archivists, where she serves as Chair of the Web Archiving Section and on the Finance Committee. She is also active in Archives for Black Lives in Philadelphia, an informal group of local archivists who work on projects that engage issues at the intersection of the archival profession and the Black Lives Matter movement. She is especially interested in applying user experience research and user-centered design to archival discovery systems, developing and applying inclusive description practices, and web archiving. She holds an M.S.I. in Archives and Records Management from the University of Michigan, a Ph.D. in American History from Brandeis University, and a B.A. in History from Boston College.
When our team of experts at Anderson Archival isn’t busy with our own historical collection preservation projects, we like to dive into researching other preservation and digitization undertakings. We usually dedicate ourselves to the intimate collections of individuals or private institutions, so we relish opportunities to investigate projects like Harvard University’s Glass Plate Collection.
For most of the sciences, century-old information would be considered at best a historical curiosity and at worst obsolete. But for the last hundred and forty years, the Harvard College Observatory has housed one of the most comprehensive collections of photographs of the night sky as seen from planet Earth, and this data is more than priceless: it’s breakable. For nearly a decade, Harvard has been working not only to protect the historical collection but also to bring it, and its enormous amount of underutilized data, into the digital age.
Star Gazing in Glass
Before computers and cameras, the only way to see the stars was to look up with the naked eye or through a telescope. With the advent of the camera, a whole new way to study the stars was born, but taking photographs of the heavens isn’t as easy as pointing and clicking. Photographs taken by telescopes were produced on 8″x10″ or 8″x14″ glass plates coated in a silver emulsion exposed over a period of time. This created a photographic negative on the glass that could be studied during the day.
This allowed a far more thorough study of the stars than one night of stargazing could offer. By adjusting the telescopes used and exposure times, stars too faint for the human eye to see could be recorded and analyzed. It was Henry Draper who took this technology to the next level.
In 1872, amateur astronomer Dr. Henry Draper used a prism over the glass plate and became the first to successfully record a star’s spectrum. Dr. Draper and his wife, Anna, intended to devote his retirement to the study of stellar spectroscopy, but he died before they could begin. To continue her husband’s work, Anna Draper donated much of her fortune and Dr. Draper’s equipment to the Harvard Observatory for the study of stellar spectroscopy. Harvard had already begun photographing on glass plates, but with Anna Draper’s continual contributions, Harvard expanded its efforts, photographing both the stars and their spectra.
Harvard now houses over 500,000 glass plates of both the northern and southern hemispheres, starting in 1882 and ending in 1992 when digital methods outpaced traditional photography. This collection of nightly recordings, which began as the Henry Draper Memorial, has been the basis for many of astronomy’s advancements in understanding the universe.
The Women of Harvard’s Observatory
Edward C. Pickering was the director of the Harvard Observatory when the Henry Draper Memorial was formed, but he did more than merely advance the field by photographing the stars. He fostered the education and professional study of some of astronomy’s most influential members: women who, at that time, might never have received the chance, or the credit, that Pickering provided.
Instead of hiring men to study the plates during the day, Pickering hired women. He felt they were more detail-oriented, patient, and, he admitted, cheaper. Williamina Fleming was one of those female computers. She developed the Henry Draper Catalogue of Stellar Spectra and is credited with being the first to see the Horsehead nebula through her work examining the plates.
The Draper Catalogue included the first classification of stars based on stellar spectra, as created by Fleming. Later, this classification system would be modified by another notable female astronomer at Harvard, Annie Jump Cannon. Cannon’s classification and organizational scheme became the official method of cataloguing stars by the International Solar Union in 1910, and it continues to be used today.
Another notable female computer was Henrietta Swan Leavitt, who figured out a way to judge the distance of stars based on the brightness of variable stars in the Small Magellanic Cloud. Leavitt’s Law is still used to determine astronomical distances. The Glass Universe by Dava Sobel chronicles the stories of many of the female computers and the creation of Harvard Observatory’s plate collection.
Digital Access to a Sky Century @ Harvard (DASCH)
The Harvard Plate Collection is one of the most comprehensive records of the night sky, but less than one percent of it has been studied. For all of the great work done by the Harvard women and the astronomers who followed them, the fragility of the glass plates meant someone had to travel to Harvard to see them, and even then the study of a single star over a hundred years required a great deal of time. For every discovery made from the plate collection, like finding Pluto, hundreds or thousands more are waiting to be found.
With all of this unused, breakable data and advances in computing ability, Professor Jonathan Grindlay began organizing and funding DASCH in 2003 in an effort to digitize the entire hundred-year historical plate collection. But Grindlay had an extra obstacle to overcome. Many of the plates had handwritten notes made by the female computers and other astronomers. Grindlay had to balance the historical significance of the collection with the vast data it offered. To do this, the plates are scanned at low resolution with the marks in place, then they are cleaned and rescanned at the extremely high resolution necessary for data recording.
A custom scanner had to be designed and constructed specifically for the glass plates and new software was created to bring the digitized image into line with current astronomical data methods. The project hasn’t been without its setbacks, either. Finding funding for the project is a constant problem, and in January 2016, the Observatory’s lowest level flooded. Around 61,000 glass plates were submerged and had to be frozen immediately to prevent mold from damaging the negatives. While the plates are intact, many still need to be unfrozen and restored before being scanned. The custom scanner also had to be replaced because of the flooding.
George Champine Logbook Archive
In conjunction with the plate scanning, a second project is necessary to make the plates useable for extended study. The original logbooks of the female computers contain more than their observations of the plates. These books record the time, date, telescope, emulsion type, and a host of other identifying information necessary to place and digitally extrapolate the stars on the plates. Over 800 logbooks (nearly 80,000 images in total) were photographed by volunteer George Champine.
Those images are now in the time-consuming process of being manually transcribed. Harvard Observatory partnered with the Smithsonian Institution to enlist volunteers who work every day reading and transcribing the vital information in these logbooks. Without this data, the software can’t accurately use the star data scanned from the plates.
Despite all the challenges and setbacks, 314,797 plates have been scanned as of December 2018. The data released and analyzed from the DASCH project has already made new discoveries about variable stars. Once the entire collection of historical documents is digitized, more than a hundred years will be added to the digital collection of astronomical data, and they will be free for anyone to access and study, professional or amateur.
The Harvard Plate Collection is a great example of an extraordinary resource to its community being underused due to the medium. Digital conversion of data is a great way to help any field of research. While Harvard’s plate digitization project provides a model for the conversion of complex data into digital form, not all institutions have the resources to attempt such a large enterprise. If you have a collection in need of digitization, contact Anderson Archival today at 314.259.1900 or email us at firstname.lastname@example.org.
Shana Scott is a Digital Archivist and Content Specialist with Anderson Archival, and has been digitally preserving historical materials for over three years. She is involved in every level of the archiving process, creating collections that are relevant, accessible, and impactful. Scott has an MA in Professional Writing and Publishing from Southeast Missouri State University and is a member of SFWA.
Digital skills have become increasingly important for both new and established archivists of all stripes, not just those with “digital” in their job titles. This series aims to foster relationships and facilitate the sharing of knowledge between archivists who are already working with born-digital records and those who are interested in building their digital skills. In collaboration with SAA’s Students and New Professionals (SNAP) section, the Electronic Records Section seeks students and new professionals to conduct brief interviews of people working with born-digital records about what it’s like on a daily basis, as well as career pathways, helpful skill sets, and other topics. Students/new professionals will then write up the interviews for publication on both the ERS and SNAP section blogs. We are currently seeking volunteers for both interviewers and interviewees. Please see additional information about both roles below, and fill out this short Google form to sign up!
Call for interviewers:
You are: a student or new professional (or anyone else interested in learning more about what digital archivists do)
Get paired with an archivist who is well-versed in digital records
Schedule and conduct a brief interview (via chat/email, video, phone, etc.), using your own interview questions (plus a few we’ll suggest)
Write up the interview into a blog post and run it by your interviewee for review
Build a relationship with a cool archivist; learn and help others learn about born-digital archives work
Call for interviewees:
You are: a digital archivist or an archivist with any job title who works with born-digital records
Get paired with a student/new professional
Participate in a brief interview (via chat/email, video, phone, etc.)
Review the interview write-up prior to publication
Build a relationship with an awesome student/new professional; generously share your expertise/wisdom with others in the field
Writing for bloggERS! “Conversations” Series
Written content should be roughly 600-800 words in length (ok to exceed a bit)
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
As part of our “Making Tech Skills a Strategic Priority” series, the bloggERS team asked five current or recent MLIS/MSIS students to reflect on how they have learned the technology skills necessary to tackle their careers after school. In this post, Anna Speth and Jane Kelly reflect thoughtfully on adapting their mindsets to embrace new challenges and learn from failure.
Anna Speth, 2017 graduate, Simmons College
I am about to celebrate a year in my first full-time position, Librarian for Emerging Technology and Digital Projects at Pepperdine University. In this role I work on digital initiatives, often in tandem with the archive, and direct our emerging technology makerspace. By choosing to center my graduate career on digital archiving, I felt well prepared for the digital initiatives piece. However, running the makerspace has been a whirlwind of grappling with the world of emerging tech. My best piece of advice (which we’ve all heard a million times) is to maintain a “learner mindset.” I’m a traditional learner who has mastered the lecture-memorize-regurgitate academic system. This approach doesn’t do much when it comes to hands-on tech. I am faced with 3D printers, VR systems, Arduinos, Ozobots, CONTENTdm, and more with minimal instruction. I watch tutorials, but these rarely offer a path to in-depth understanding. Instead, I’ve had to overcome the mindset that I’m not a tech person and will make something worse by messing with it. If the 3D printer doesn’t work, you certainly aren’t going to make it worse by taking it apart and trying to put it back together. If you don’t know how to reorder a multipage object on the backend of CONTENTdm, create a hidden sandbox collection and start experimenting. Remember that the internet – Google, user forums, Reddit, company reps – is your friend. Also remember (and I tell this to kids in the makerspace just as often as I tell it to myself) that failure is your friend. If you mess something up, then all you’ve done is learn more about how the system works by learning how it doesn’t work. Iteration and perseverance are key. And, as this traditional learner has realized, a whole lot of fun!
Jane Kelly, 2018 grad, University of Illinois at Urbana-Champaign
Developing new tech skills has, at least for me, been a process of learning to fail. The intensive Introduction to Computer Science course I took several years ago was supposed to be fun – a benefit of being able to take college courses for almost nothing as a staff member on campus. It might have been fun for the first three weeks of the semester, but that was followed by a lot of agonizing, handwringing, and tears.
I now reflect on my time in that course as an intensive introduction to failure. This shift in mentality – learning how to fail, and how to accept it – has been key for me in being open to developing my tech skills on the job. I don’t worry so much about messing up, not knowing the answer, or the possibility of breaking my computer.
As a humanities student, it simply was never acceptable to me to turn in an assignment incomplete or “wrong.” In that computer science class, and in the information processing course I took at the iSchool at the University of Illinois a couple years later, an incomplete assignment could be a stellar attempt, proof of lessons learned, and an indication of where help is required. The rubric for good work is different for a computer science problem set than a history paper. It has been a valuable lesson to revisit as I try to develop my skills independently and in the workplace.
I have acquired and maintained my tech skills through a combination of computer science coursework before and during library school, in-person SAA pre-conference sessions that my employer paid for, and, of course, the internet. Apps like Learn to Code with Python or free online courses can be an introduction to a programming language or a quick refresher, since I inevitably forget much of what I learn in class before I can put it to work at a job. Google and Stack Exchange are lifesavers, both because I can often find the answer to my question about the mysterious error code I see in the terminal window and because I can reassure myself that I’m not the first person to pose the question.
More than anything, my openness to what I once thought of as failure has been pivotal to my development. It can take a long time to learn and understand exactly what is going on under the hood with some new software or process, but that’s okay. Sometimes a fake-it-til-you-make-it mentality is exactly what’s needed to push yourself to tackle a new challenge. For me, learning tech skills is learning to be okay with failure as a learning process.
Anna Speth is the Librarian for Emerging Technology and Digital Projects at Pepperdine’s Payson Library where she co-directs a makerspace and works with digital initiatives. Anna focuses on the point of connection between technology and history. She holds a BA from Duke University and an MLIS from Simmons College.