Software Preservation Network: Community Roadmapping for Moving Forward

By Susan Malsbury

This is the fifth post in our series on the Software Preservation Network 2016 Forum.

Software Preservation Network logo

The final session of the Software Preservation Forum was a community roadmapping activity with two objectives: to synthesize topics, patterns, and projects that came up during the forum, and to articulate steps and a time frame for future work. This session built on two earlier activities in the day: an icebreaker in the morning and a brainstorming activity in the afternoon.

For the morning icebreaker, participants, armed with blank index cards and a pen, found someone in the room they hadn’t met before. After brief introductions, they each shared one challenge that their organization faced with software and/or software preservation, and each wrote their partner’s challenge on their own index card. After five rounds of this, participants returned to their tables for opening remarks from Jessica Meyerson, Zach Vowell, and Cal Lee.

For the afternoon brainstorming activity, participants took the cards from the morning icebreaker, along with fresh cards, and again paired with someone they hadn’t met. Each pair looked over their notes from the morning and wrote out goals, tasks, and projects that could respond to the challenges. By that point, we had had three excellent sessions, as well as casual conversations over lunch and coffee breaks, to further inform potential projects.

I paired with Amy Stevenson from the Microsoft Corporation. Even though her organization is very different from mine (the New York Public Library), we easily identified projects that would address our own challenges as well as the challenges we had gathered in the morning. The projects we identified included a software registry, educational resources, and a clearinghouse to provide discovery for software. We then placed our cards on a butcher-paper timeline at the front of the room that spanned from right now to 2022, a six-year time frame with the first full year being 2017.

During the fourth session on partnerships, Jessica Meyerson entered the goals, projects, and ideas from the timeline into a spreadsheet, so that by the fifth session we were ready to get roadmapping. For this session we broke into three groups to discuss the roadmap and to work on our own group’s copy of the spreadsheet. Our group subdivided into smaller teams, each of which took one year of the timeline to edit and comment on. While we each focused on our own year, conversation between the teams flowed freely, and people felt comfortable moving projects into other years or streamlining ideas across the entire time frame. Links to the master spreadsheet and our three versions can be found here.

Despite the three separate groups, it was remarkable how closely our edited roadmaps aligned with one another. Not surprisingly, most people felt it was important to front-load steps involving research, developing platforms for sharing information, and identifying similar projects in order to form partnerships. Projects in the later years would grow from this earlier research: creating the registry, establishing a coalition, and developing software metadata models.

I found the forum, and this session in particular, energizing. I had attended the talk that Jessica Meyerson and Zach Vowell gave at SAA in 2014, when they first formed the Software Preservation Network. While I was intrigued by the idea of software preservation, it seemed a far-off concept to me. At that time, there were still many other issues in digital archives that seemed far more pressing. When I heard other people’s challenges at the forum, and had space to think about my own, I realized how important and timely software preservation is. As digital archives best practices are codified, we are realizing more and more how dependent we are on (often obsolete) software to do our work.


Susan Malsbury is the Digital Archivist for The New York Public Library, working with born digital archival material across the three research centers of the Library. In this role, she assists curators with acquisitions; oversees technical services staff handling ingest and processing; and coordinates with public service staff to design and implement access systems for born digital content. Susan has worked with archives at NYPL in various capacities since 2007.

Should We Collect It Because We Can?

The following is a post by Dan Noonan, Digital Resources Archivist at Ohio State University, based on a breakout session at the ERS section meeting of last year’s SAA annual meeting.

With an expanding capacity to store information in the digital age, do archivists still need to consider the size of collections when making appraisal decisions? Is it more compelling to accept a collection that can be held on a few CDs than one that occupies 30 cubic feet of climate-controlled compact shelving? Should archivists make different acquisition decisions for digital and physical collections? These questions were the topics of a breakout discussion at the Electronic Records Section meeting of the 2014 Society of American Archivists annual meeting in Washington, DC.

Participants identified many examples of document sets not typically accessioned as a whole, either subject to sampling or outright rejection: timecards and attendance records, correspondence (email), financial records (besides annual reports, budgets, and general ledgers), policies, promotion and tenure files, research data, resumes, and syllabi. Appraisal and selection of these types of materials have traditionally been justified by a lack of resources: space to store documents, supplies to house them, workers to process them. The presence (or potential presence) of sensitive or confidential information has often led archivists to select out whole categories of documents to avoid the risk of disclosure.

It could be argued that digital files counter many of the standard arguments for selection and appraisal. With appropriate indexing and metadata, it may be easier to understand and appraise large volumes of digital content. Likewise, in a digital environment, locating and redacting sensitive or confidential information could be automated. New tools and systems for managing large collections help support the argument that size may no longer apply as an appraisal criterion for digital content.

However, session participants also noted that digital files pose their own special problems. Digital storage may be cheap and getting cheaper, but institutions with digital collections will still require server space for storage, and staff and resources to process, preserve, and provide access to them. And what about more complex digital objects, like audio and video files, research data, and web archives? Preservation-quality versions of these files can be enormous and quickly consume all available storage space. Maintaining these types of content at scale may require powerful, expensive processing workstations, as well as more sophisticated metadata and indexing, to ensure their long-term preservation and accessibility.

Ultimately, the participants agreed that any decision to acquire (or not acquire) a collection should align with an organization’s core collection development policies. Organizations still struggling with these decisions may want to create a decision matrix that weighs the costs and benefits of acquiring different types of content alongside those collection development policies. Such tools, along with staff training, would help personnel decide whether to accept potential digital acquisitions. Archives also need to appropriately plan for, and allocate resources to, the long-term preservation and management of collections, digital and physical alike. For digital collections, this type of planning accounts not only for the one-time costs of hardware and software purchases, but also for equipment replacement, upgrades, and migration; the human resources to operate and manage the equipment; and other overhead. These costs should be annualized and accounted for in the same way as annual plant operations and maintenance fees, facility rental or lease fees, or mortgage payments.
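To make the decision-matrix idea concrete, here is a minimal sketch in Python. The criteria, weights, and scores are all invented for illustration; an actual matrix would use whatever criteria an institution’s collection development policy names, and the session did not prescribe any particular tool or scoring scheme.

```python
# Hypothetical appraisal decision matrix: each criterion gets a weight
# (summing to 1.0) and each candidate acquisition is scored 1-5 per
# criterion. All names and numbers below are invented for illustration.
CRITERIA = {
    "fit_with_collection_policy": 0.35,
    "research_value": 0.25,
    "sensitive_content_risk": 0.15,   # higher score = lower risk
    "long_term_storage_cost": 0.15,   # higher score = lower cost
    "staff_capacity_to_process": 0.10,
}

def appraisal_score(scores: dict) -> float:
    """Weighted average of per-criterion scores (each on a 1-5 scale)."""
    return sum(CRITERIA[name] * scores[name] for name in CRITERIA)

# Example: scoring a proposed born-digital acquisition.
candidate = {
    "fit_with_collection_policy": 5,
    "research_value": 4,
    "sensitive_content_risk": 2,
    "long_term_storage_cost": 3,
    "staff_capacity_to_process": 3,
}
print(round(appraisal_score(candidate), 2))  # 3.8 on a 1-5 scale
```

A spreadsheet would serve the same purpose; the point is simply that writing the weights down forces an institution to state, in advance, how much each policy consideration counts.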

Forever is a long time, and can be difficult to conceptualize in a digital environment. Archival collection policies should be subject to reconsideration, and collection decisions to reappraisal. One participant noted that in the past, archivists regularly excluded things based on format, such as turning down a paper collection that was too voluminous to handle, so archivists can anticipate that the conversation will continue in the digital realm.

Big Data and Big Challenges for Archives

The following is a post by Glen McAninch based on a breakout session at the ERS section meeting of last year’s SAA annual meeting.

What is “big data” and how does it relate to what archivists do? Many of us, particularly those outside the federal government, private technology companies, and research-based universities, may doubt that we will ever have to deal with “big data,” but the topic raises issues that those of us who manage electronic records collections face more and more. No doubt most archivists are beginning to acquire increasingly large collections of electronic records that challenge our abilities to process, preserve, and provide access to them.

Image courtesy of Stefano Bertolo.

So, what is big data? According to Gartner Research, it is “high-volume, high-velocity, high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Traditionally, computers excelled at manipulating structured data, but increasingly businesses need to integrate and make sense of data from a wide range of sources, in a variety of formats, at various levels of structure and cleanliness, and to do so in a timely fashion.[1]

Most archivists are acquiring collections that are rapidly increasing in volume and complexity and may fit this definition of big data. It is chiefly a matter of scale that separates most of us from the data tsunami faced by some institutions. Can archivists learn processes and acquire tools from those who are using big data sets for non-archival purposes?

The current use of big data is chiefly for analytics or research rather than to document specific activities. Data analytics tools allow researchers to manipulate and analyze data stored in multiple formats. The issues with big data of greatest concern to archivists are:

  • Appraisal involves selection of records, some of which may be useful for analytic research. When acquiring unstructured and structured records, it is important for archivists to carefully document the context of the original record so that researchers who use big data tools have the proper framework to do analysis.
  • Searching, one of the long suits of “big data” tools, can be leveraged to help archivists improve access to massive amounts of records in multiple formats.
  • Big data sets are often not managed in the way that archivists and records managers traditionally select records for long-term retention. The emphasis of big data tools such as Apache Hadoop is on the analysis of objects rather than on the management of records, as is done in structured databases. This makes many big data tools unsuited to archival management and preservation needs.
  • How will you and your institution acquire big data? From whom? Will this data come directly from those who collect it, for instance, your university’s department of institutional research? If so, do you want to acquire the full set of raw data or are you only interested in the different outputs and analyses performed on that data? Or will you acquire the work of researchers who had obtained copies of electronic records, extracted selected content from those records, mashed that content up with data from other sources, and then performed analyses on that data? Is the goal of acquisition to allow future researchers to reinterpret and reanalyze the data or is your goal to document the information that informed certain decisions at an institution?
  • Privacy, the fear that Big Brother is watching us, is a popular issue often associated with big data, and archivists need to address that fear through access restrictions and redaction. Visualization tools are increasingly being used to appraise records and establish links between them, particularly in large email projects. Additionally, users of high-volume data have made advances in crowdsourcing, face recognition, and other techniques that archivists are adapting.

Projects to watch:

  • Brown Dog is a collaborative big data management project based on the integration of heterogeneous datasets and multi-source historical and digital collections.
  • Tools like the CI-BER treemap GIS interface to NARA records, the visual analysis tools being developed by Maria Esteva at the Texas Advanced Computing Center, and Kenton McHenry’s 1940 Census big data analysis, indexing, and visualization at the National Center for Supercomputing Applications (now part of Brown Dog) provide good examples of adapting big data techniques to the mission and spirit of the archival profession.

[1] Accessed 12/17/2014.