SAA 2019 recap | Session 504: Building Community History Web Archives: Lessons Learned from the Community Webs Program

by Steven Gentry


Introduction

Session 504 focused on the Community Webs program and the experiences of archivists who worked at either the Schomburg Center for Research in Black Culture or the Grand Rapids Public Library. The panelists consisted of Sylvie Rollason-Cass (Web Archivist, Internet Archive), Makiba Foster (Manager, African American Research Library and Cultural Center, formerly the Assistant Chief Librarian, the Schomburg Center for Research in Black Culture), and Julie Tabberer (Head of Grand Rapids History & Special Collections).

Note: The content of this recap has been paraphrased from the panelists’ presentations and all quoted content is drawn directly from the panelists’ presentations.

Session summary

Sylvie Rollason-Cass began with an overview of web archiving and web archives, including:

  • The definition of web archiving.
  • The major components of web archives, including relevant capture tools (e.g. web crawlers, such as Wget or Heritrix) and playback software (e.g. Webrecorder Player).
  • The ARC and WARC web archive file formats (see the brief example after this list). 
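As a small aside not drawn from the session: for readers curious what a WARC capture looks like in miniature, the sketch below uses the warcio Python library from the Webrecorder project to record a single page fetch into a WARC file, the same format that crawlers like Wget and Heritrix produce at scale. The URL and output file name are placeholders.

```python
# A minimal, illustrative WARC capture (pip install warcio requests).
from warcio.capture_http import capture_http
import requests  # warcio's docs note requests should be imported after capture_http

# Record the HTTP request and response for a single page into a WARC file,
# ready for playback tools such as Webrecorder Player.
with capture_http("example.warc.gz"):
    requests.get("https://example.com/")
```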

Rollason-Cass then noted both the necessity of web archiving—especially due to the web’s ephemeral nature—and that many organizations archiving web content are higher education institutions. The Community Webs program was therefore designed to get more public library institutions involved in web archiving, which is critical given that these institutions often collect unique local and/or regional material.

After a brief description of the issues facing public libraries and public library archives—such as a lack of relevant case studies—Rollason-Cass provided information about the institutions that joined the program, the resources provided by the Internet Archive as part of the program (e.g. a multi-year subscription to Archive-It), and the project’s results, including:

  • The creation of more than 200 diverse web archives (see the Remembering 1 October web archive for one example).
  • Institutions’ creation of collection development policies pertaining specifically to web archives, in addition to other local resources.
  • The production of an online course entitled “Web Archiving for Public Libraries.” 
  • The creation of the Community Webs program website.

Rollason-Cass concluded by noting that although some issues—such as resource limitations—may continue to limit public libraries’ involvement in web archiving, the Community Webs program has greatly improved other institutions’ ability to archive web content confidently.

Makiba Foster then addressed her experiences as a Community Webs program member. After a brief description of the Schomburg Center, its mission, and its unique place where “collections, community, and current events converge,” Foster highlighted her specific reasons for becoming more engaged with web archiving:

  • Like many other institutions, the Schomburg Center has long collected clippings files—and web archiving would allow this practice to continue.
  • Materials that document the experiences of the Black community are prominent on the World Wide Web.
  • Marginalized community members often publish content on the Web.

Foster then described the #HashtagSyllabusMovement collection, a web archive of educational material “related to publicly produced and crowd-sourced content highlighting race, police violence, and other social justice issues within the Black community.” Foster had known this content could be lost, so—even before participating in the Community Webs program—she began collecting URLs. Upon joining the program, she used Archive-It to archive relevant materials (e.g. Google Docs and blog posts) dating from 2014 to the present. Although some content has been lost, the collection continues to grow—Foster hopes it will eventually include international educational content—and demonstrates the value of web archiving.

In her conclusion, Foster addressed various successes, challenges, and future endeavors:

  • Challenges:
    • Learning web archiving technology and having confidence in one’s decisions.
    • Curating content for the Center’s five divisions.
    • “Getting institutional support.”
  • Future Directions:
    • A new digital archivist will work with each division to collect and advocate for web archives.
    • Considering how to both do outreach for and catalog web archives.
    • Ideally, working alongside community groups to help them implement web archiving practices.

The final speaker, Julie Tabberer, addressed the value of public libraries’ involvement in web archives. After a brief overview of the Grand Rapids Public Library, the necessity of archives, and the importance of public libraries’ unique collecting efforts, Tabberer posited the following question: “Does it matter if public libraries are doing web archiving?” 

To test her hypothesis that “public libraries document mostly community web content [unlike academic archives],” Tabberer analyzed the seed URLs of fifty academic and public libraries to answer two specific questions:

  • “Is the institution crawling their own website?”
  • “What type of content [e.g. domain types] is being crawled [by each institution]?”

After acknowledging some caveats with her sampling and analysis—such as the fact that data analysis is still ongoing and that only Archive-It websites were examined—Tabberer showed audience members several graphics revealing that academic libraries (1) typically crawled their own websites more than public libraries did and (2) captured more academic websites than public libraries did.
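Tabberer did not share her code, but the core of such a seed-list analysis is small. As a rough, hypothetical sketch in Python (the input file name is a placeholder), tallying the top-level domains in a file of seed URLs might look like this:

```python
# Hypothetical sketch of a seed-URL analysis like the one described above:
# read a plain-text file of seed URLs (one per line) and tally their
# top-level domains (edu, gov, org, com, ...). The file name is a placeholder.
from collections import Counter
from urllib.parse import urlparse

counts = Counter()
with open("seed_urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        host = urlparse(url).netloc
        counts[host.rsplit(".", 1)[-1]] += 1  # last label of the hostname

for tld, n in counts.most_common():
    print(f"{tld}: {n}")
```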

Tabberer then concluded with several questions and arguments for the audience to consider:

  • In addition to encouraging more public libraries to archive web content—especially given their values of access and social justice—what other information institutions are underrepresented in this community?
  • Are librarians and archivists really collecting content that represents the community?
  • Even though resource limitations are problematic, academic institutions must expand their web archiving efforts.

Steven Gentry currently serves as a Project Archivist at the Bentley Historical Library. His responsibilities include assisting with accessioning efforts, processing complex collections, and building various finding aids. He previously worked at St. Mary’s College of Maryland, Tufts University, and the University of Baltimore.


SAA 2019 recap | Session 204: Demystifying the Digital: Providing User Access to Born-Digital Records in Varying Contexts

by Steven Gentry


Introduction

Session 204 addressed how three dissimilar institutions—North Carolina State University (NCSU), the Wisconsin Historical Society (WHS), and the Canadian Centre for Architecture (CCA)—are connecting their patrons with born-digital archival content. The panelists consisted of Emily Higgs (NCSU Libraries Fellow, North Carolina State University), Hannah Wang (Electronic Records & Digital Preservation Archivist, Wisconsin Historical Society), and Stefana Breitwieser (Digital Archivist, Canadian Centre for Architecture). In addition, Kelly Stewart (Director of Archival and Digital Preservation Services, Artefactual Systems) briefly spoke about the development of SCOPE, the tool featured in Breitwieser’s presentation.

Note: The content of this recap has been paraphrased from the panelists’ presentations and all quoted content is drawn directly from the panelists’ presentations.

Session summary

Emily Higgs’s presentation focused on the different ways that NCSU’s Special Collections Research Center (SCRC) staff enhance access to their born-digital archives. After a brief overview of NCSU’s collections, Higgs first described their lightweight workflow for connecting researchers with requested digital content, a process in which SCRC staff access an administrator account on a reading room MacBook; transfer copies of requested content to a read-only folder shared with a researcher account; and limit the computer’s overall capabilities, such as restricting its internet access and ports (the latter is accomplished via Endpoint Protector Basic). Should a patron want copies of the material, they simply drag and drop those resources into another folder for SCRC staff to review.

Higgs then described an experimental Named Entity Recognition (NER) workflow that employs spaCy and allows archivists to better describe files in NCSU’s finding aids. The workflow employs a Jupyter notebook (see her GitHub repository for more information) to automate the following process:

  • “Define directory [to be analyzed by spaCy].”
  • “Walk directory…[to retrieve] text files [such as PDFs].”
  • “Extract text (textract).”
  • “Process and NER (spaCy).”
  • “Data cleaning.”
  • “Ranked output of entities (csv) [which is based on the number of times a particular name appears in the files].”

Once the process is completed, the most frequent 5-10 names are placed in an ArchivesSpace scope and content note. Higgs concluded by emphasizing this workflow’s overall ease of use and noting that—in the future—staff will integrate application programming interfaces (APIs) to enhance the workflow’s efficiency.
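Higgs’s actual notebook is in her GitHub repository; as a condensed, hypothetical sketch of the steps above (the directory path and spaCy model are assumptions), the core loop might look like this in Python:

```python
# Condensed sketch of the NER workflow described above, assuming spaCy and
# textract are installed (pip install spacy textract) and the en_core_web_sm
# model has been downloaded (python -m spacy download en_core_web_sm).
import csv
import os
from collections import Counter

import spacy
import textract

nlp = spacy.load("en_core_web_sm")

def rank_entities(directory, output_csv="entities.csv", label="PERSON"):
    counts = Counter()
    # Walk the directory and extract text from each file (PDFs, etc.).
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                text = textract.process(path).decode("utf-8", errors="ignore")
            except Exception:
                continue  # skip files textract cannot parse
            # Run spaCy NER and tally entities of the requested type.
            doc = nlp(text)
            counts.update(ent.text.strip() for ent in doc.ents if ent.label_ == label)
    # Write a ranked CSV of entities; data cleaning would happen before this step.
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["entity", "count"])
        writer.writerows(counts.most_common())

rank_entities("path/to/collection")
```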

Next to speak was Hannah Wang, who addressed how the Wisconsin Historical Society (WHS) has made its born-digital state government records more accessible. Wang began her presentation by discussing the Wisconsin State Preservation of Electronic Records Project (WiSPER) and its two goals:

  • “Ingest a minimum of 75 GB of scheduled [and processed] electronic records from state agencies.”
  • “Develop public access interface.” 

She then explained the reasons behind Preservica’s selection:

  • WHS’s lack of significant IT support meant an easily implementable tool was preferred over open-source and/or homegrown solutions.
  • Preservica allowed WHS to simultaneously preserve and provide (tiered) access to digital records.
  • Preservica has a public-facing WordPress site, which fulfilled the second WiSPER grant objective.

Wang then addressed how WHS staff appropriately restricted access to digital records by placing records into one of three groupings:

  • “Content that has a legal restriction.”
  • “Content that requires registration and onsite viewing [such as email addresses].”
  • “Open, unrestricted content.” 

WHS staff achieved this goal by employing different methods to locate and restrict digital records:

  • For identification: 
    • Reviewing “[record] retention schedules…[and consulting with] agency [staff who would notify WHS personnel of sensitive content].” 
    • Using resources like bulk_extractor.
    • Reading records if necessary.
  • For restricting records:
    • Employing scripts—such as batch scripts—to transfer and restrict individual files and whole SIPs (see the sketch after this list).
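Wang did not share the scripts themselves, so the following is only a rough illustration of the idea rather than WHS’s actual code (the file names and the permission scheme are assumptions): a Python version of such a transfer-and-restrict step.

```python
# Illustrative only: move a list of flagged files into a restricted area
# and strip group/world read permissions. File names are hypothetical.
import os
import shutil
import stat

RESTRICTED_DIR = "restricted"
os.makedirs(RESTRICTED_DIR, exist_ok=True)

with open("flagged_files.txt") as f:
    for line in f:
        src = line.strip()
        if not src:
            continue
        dest = os.path.join(RESTRICTED_DIR, os.path.basename(src))
        shutil.move(src, dest)                       # transfer the file
        os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # owner-only access
```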

Wang demonstrated how WHS makes its restricted content accessible via Preservica:

  • “Content that has a legal restriction”: Only higher levels of description can be searched by external researchers, although patrons have information concerning how to access this content.
  • “Content that requires registration and onsite viewing”: Individual files can be located by external researchers, although researchers still need to visit the WHS to view materials. Again, information concerning how to access this content is provided.

Wang concluded her presentation by describing efforts to link materials in Preservica with other descriptive resources, such as WHS’s MARC records; expressing hope that WHS will integrate Preservica with their new ArchivesSpace instance; and discussing the usability testing that resulted in several upgrades to the WHS Electronic Records Portal prior to its release.

The penultimate speaker was Stefana Breitwieser, who spoke about SCOPE and its features. Breitwieser first discussed the “Archaeology of the Digital” project and how—through this project—the CCA acquired the bulk of its digital content: more than “660,000 files (3.5 TB).” To enhance access to these resources, Breitwieser stressed that two problems had to be addressed:

  • “[A] long access workflow [that involved twelve steps].”
  • “Low discoverability.” Breitwieser noted that issues with the existing access tool included its inability to search across collections and its failure to use the metadata generated in Archivematica.

CCA staff ultimately decided to work with Artefactual Systems to build SCOPE, “an access interface for DIPs from Archivematica.” The goals of this project included:

  • “Direct user access to access copies of digital archives from [the] reading room.”
  • “Minimal reference intervention [by CCA staff].”
  • “Maximum discoverability using [granular] Archivematica-generated metadata.”
  • “Item-level searching with filtering and facetting.” 

To illustrate SCOPE’s capabilities, Breitwieser demonstrated the tool and its features (e.g. its ability to download DIPs) for the audience. During the presentation, she emphasized that although incredibly useful, SCOPE will ultimately supplement—rather than replace—the CCA’s finding aids. 

Breitwieser concluded by describing the CCA’s reading room—which includes computers that offer a variety of useful software (e.g. computer-aided design, or CAD, software) and, like NCSU’s workstation, deliberately limited technical capabilities—and by highlighting the CCA’s much simpler five-step access workflow.

The final speaker, Kelly Stewart, spoke about SCOPE’s development process. She heavily emphasized Artefactual’s use of CCA user stories to develop “feature files”—“logic-based, structured descriptions” of those user stories—which Artefactual staff used to build SCOPE. Stewart noted that “user acceptance testing” was then repeated until SCOPE was deemed ready. She concluded her presentation with the hope that other archivists will implement and improve upon SCOPE.


Steven Gentry currently serves as a Project Archivist at the Bentley Historical Library. His responsibilities include assisting with accessioning efforts, processing complex collections, and building various finding aids. He previously worked at St. Mary’s College of Maryland, Tufts University, and the University of Baltimore.

Meet our newest ERS steering committee members!

All this week we’ll be featuring introductions to our newest ERS steering committee members! Today, meet Elizabeth Carron, one of our new steering committee members.

Tell us a little bit about yourself.

“I graduated from the University of Massachusetts Amherst with a background in Early Modern Literature and French Studies and finished my master’s at Simmons College in 2014. I didn’t take the archives track – rather, I was more focused on subject librarianship and digital scholarship. I made amazing connections in the Five College area as a student and as a librarian; in 2014, shortly after graduating, I was offered a project at the Smith College Archives – and I’ve been in archives and archives management ever since! After my project at Smith ended, I moved to Ann Arbor to be a project archivist in collections development at the Bentley Historical Library. Eventually, the position of Archivist for Records Management was created and I transitioned into that role. It’s been my responsibility to develop the program by establishing and communicating best RIM practices to the University community and to push forward acquisition procedures that will support description, arrangement, and access further down the road.” 

What made you decide you wanted to become an archivist?

“Honestly, no one thing. I studied Early Modern language and literature and got involved with several digital humanities projects, which in turn led to a deeper exposure to libraries and archives. From there, I explored graduate programs while working for a cultural heritage org on the admin side and just felt a click with archival programs. Being an archivist means I get to learn about a variety of topics, to meet new people and communities. It also means I get a hand in history-making. Whether I’m collecting or advocating for resources and partnerships, preservation is a profound responsibility.”

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

“I do a lot of acquiring of electronic/digital records and not so much processing; I’d like to explore this process of acquisition and perhaps work on perspectives to assist with understanding e-records/e-record concerns in this process. “

What three people, alive or dead, would you invite to dinner?

“George Sand and Dolly Parton to keep things lively; and my grampa, who was an amazing cook with a never-ending cache of dad jokes.”

Meet our newest ERS steering committee members!

All this week we’ll be featuring introductions to our newest ERS steering committee members! Today, meet Andrea Belair, one of our new steering committee members.

“My name is Andrea Belair. I am from rural western Massachusetts, and I earned my BA from Marlboro College in Vermont, where my focus was Literature and Creative Writing. After taking time off to travel and work in various jobs, I decided to pursue librarianship and went on to earn my MLIS from Rutgers University in 2012. I am currently the Librarian for Archives and Special Collections at Union College in Schenectady, NY, where I started in July 2018. Before this current role, I was the Archivist for the Office of the President at Yale University for 5.5 years. I have a broad set of duties here at Union College, since we have numerous collections that include rare books and archival collections, but I have been actively involved in records policy and retention for the campus.”

What made you decide you wanted to become an archivist?

“After a part-time job shelving in the stacks of a large university, I decided to pursue a graduate degree to try for a career as an academic librarian. An archivist was always an ideal position that seemed fascinating but perhaps too much of a dream job, so I acquired many broad skills with archival experience “just in case.” I did ultimately land a job as an assistant archivist, and now I am living the dream.”

What is one thing you’d like to see the Electronic Records Section accomplish during your time on the steering committee?

“I always like to bring the importance of records management to light in the profession, since this subject can be an excellent basis to streamline the rest of the workflows and processes within an archive. Records management is often undervalued and under-rated, or it just seems pretty uninteresting, and archivists do not always take time to understand it fully, which can lead to issues down the road. Perhaps some emphasis on records management and records retention would be interesting to explore.”

What three people, alive or dead, would you invite to dinner?

“The Ghost of Christmas Past, the Ghost of Christmas Present, and the Ghost of Christmas Future.”

Meet our newest ERS steering committee members!

All this week we’ll be featuring introductions to our newest ERS steering committee members! Today, meet Annalise Berdini, our new vice-chair and chair-elect.

Annalise Berdini is the Digital Archivist for University Archives at Princeton University. She is responsible for leading the ongoing development of the University Archives digital curation program. As part of this role, she accessions and processes born-digital collections, offers digital preservation consultation and education to students and staff, and collaborates with Public Services to improve born-digital access practices. She also manages the web archives program, processes analog collections, and provides reference services. She was previously the Digital Archivist for Special Collections and Archives at UC San Diego, where she instituted a brand new digital curation program and co-authored the UC Guidelines for Born-Digital Archival Description.

What made you decide you wanted to become an archivist?

“Honestly, I sort of fell into it. I had just started looking into library school and started researching my options after about 6 years of post-undergrad job hopping, and the program I was most interested in had an archives concentration. I remembered a really great archivist that I encountered during some research I did during undergrad, and started asking questions of archivists about the field and what they did. Mostly, the response I got was that there weren’t many jobs! But the archivists I spoke to were also so passionate about the work they did, and talked about all the ways they felt it was important — and they were so happy to answer my questions and offer help and advice — to connect me with more people in the field. That may actually be the main reason I chose it. Up to that point, my experience in my other career(s) had been that people were generally reluctant to offer help or support. That was never my experience with archivists. Once I started classes and some initial processing work, I knew it was where I wanted to be. Constantly changing work, lively academic discourse, exciting new opportunities in applying technology and leveraging data — it’s exactly the kind of job I hoped I’d find. I’m doing work I never thought I would do, and I get to work with such incredible people who challenge me to do more and better every day.”

What is one thing you’d like to see the Electronic Records Section accomplish during your time as vice-chair?

“I’m really excited about the ongoing work the section is already doing to centralize and make easily discoverable favorite resources for practitioners. I’d also like to see the membership get involved in partnering with other sections to talk about the ways/offer guidance on how electronic records can make more discoverable resources/voices traditionally left out of the archives.”

What three people, alive or dead, would you invite to dinner?

“Janelle Monae, Neil Gaiman, and Carrie Fisher.”

Recap: Emulation in the Archives Workshop – UVA, July 18, 2019

By Brenna Edwards

The Emulation in the Archives workshop took place at the University of Virginia (UVA) on July 18, 2019, as part of the Software Preservation Network’s Fostering a Community of Practice grant cohort. This one-day workshop explored various aspects of emulation in archives, ranging from legal challenges to access considerations, and included an overview of what UVA is currently doing in this area. The workshop featured talks from people across departments at UVA, as well as people from the Library of Congress. In addition to the talks, there was also a chance to sign up for wireframe testing of UVA’s current access methods for emulated material in their collections. This process was optional, and people could instead sign up for distance testing after the workshop if they preferred.

The day was split into four different parts: an introduction to software preservation and emulation, including legal information; an overview of UVA’s current work in emulation; a look into the metadata for emulations and video game preservation; and considerations for access and user experience. Breaking up the day into these chunks defined a flow for the day, walking through the steps and considerations needed to emulate software and born digital materials. It also helped contain these topics, though of course certain themes and aspects kept appearing throughout the day in other presentations. 

The first portion of the day covered an introduction to software preservation and emulation, and the legal landscape. After explaining more of what Software Preservation Network’s Fostering a Community of Practice grant is, Lauren Work provided some definitions of emulation, software, and curatorial for use throughout the day. 

  • Emulation: digital technique that allows new computers to run legacy systems so older software appears the way it was originally designed
  • Software: source code, executables, applications, and other related components that are set of instructions for computers
  • Curatorial: responsibility and practice of appraising, acquiring, describing

Work then talked more about the Peter Sheeran papers, a collection from an architectural firm based in Charlottesville and the main collection for this project. As a hybrid collection, it included Computer Aided Design (CAD) files and Building Information Modeling (BIM) software, which posed the question of what to do with them. The answer? Emulation! Since CAD/BIM files are very dependent on which versions of the software and file formats are being used, UVA first did an inventory of what they had, down to license keys and how compatible each item is with other software. To do this, they used the FCOP Collections Inventory Exercise to guide them through what they needed to consider. They also looked at potential troubleshooting and legal issues they might run into. This led nicely into the next presentation, all about the legal landscape for software preservation, presented by Brandon Butler of UVA. Butler talked about copyright and the ARL report The Copyright Permissions Culture in Software Preservation and Its Implications for the Cultural Record, as well as the idea of fair use, which is an often underutilized solution. He also talked about digital rights management, and how groups like SPN are bringing people together to ask questions that haven’t been asked before and working to get exemptions granted every three years that allow cracking digital locks. Overall, he said that you should be good legally, but to do your research just to be on the safe side.

This was followed by an overview of what UVA is currently doing. After reiterating “Access is everything” to the room, Michael Durbin demonstrated the current working pieces of their emulation system using Archivematica, Apollo, and a custom Curio display interface. He also demonstrated some of the EaaSI platform (which now has a sandbox available!), showing VectorWorks files and how they might be used. Durbin then explained how UVA, in its transition to ArchivesSpace, plans to use the Digital Object function to link to the external emulation, as well as display the metadata that goes along with it. UVA is also considering description that can’t yet be stored in any of its systems and how it might incorporate Wikidata in the future. Next, Lauren Work and Elizabeth Wilkinson talked about the curation workflows for software at UVA, which include a revamped Deed of Gift as well as additional checklists and questionnaires. Their main advice was to talk with donors early, early, early to get all the information you can and to work with the donor to help make preservation and access decisions, though they acknowledged this is not always possible. Work and Wilkinson are still working on integrating these steps into the curation workflow at UVA, but also plan to start working more on their appraisal and processing workflows. Have thoughts on the checklist and questionnaire? Feel free to comment on their documents and make suggestions!

After lunch, we got more into the technical side of things and talked about metadata! Elizabeth Wilkinson and Jeremy Bartczak presented on how UVA is handling archival metadata for software, including questions of how much information is enough and whether ArchivesSpace could accommodate that amount of description. While heavily influenced by the University of California Guidelines for Born-Digital Archival Description, they also consulted the Software Preservation Network Emulation as a Service Infrastructure Metadata Model. The result? UVA Archival Description Strategies for Emulated Software, which presents two different approaches to describing software, and UVA MARC Field Look-up for Software Description in ArchivesSpace, which suggests where to put the description in ArchivesSpace. To find information about the software, they suggested using Google, WorldCat, and Wikidata (for which Yale has created a guide).

The second portion of this block was about description and preservation of video games, presented by Laura Drake Davis and David Gibson of the Library of Congress. The LOC has been collecting video games since they were introduced, the first being Pac-Man. The copyright registry requires a description of the item and some sort of visual documentation or representation of gameplay (a video, source code, etc.). The LOC keeps the original packaging for the game if possible, and they also collect strategy guides and periodicals related to video games. They also take source code; the first and last 25 pages of source code are required to be printed out and sent as documentation. Right now, they are reworking their workflows for processing, cataloging, and describing video games; working on relationships with game developers and distributors and with the LC General Counsel Office to assess risks associated with providing access to actual games; and looking into ways to emulate the games themselves.

The final part of the day was all about access and user experience. First, Lauren Work and Elizabeth Wilkinson talked about how UVA is considering user access to emulated environments. As of now, they plan to offer reading room access only, taking into consideration the staff training and computer station requirements this entails. They are also considering what is important about access via emulated environments, a topic discussed at the Architecture, Design, and Engineering Summit at the Library of Congress in 2017. Currently, they are doing wireframe testing to see how users navigate through ArchivesSpace, as well as what types of information researchers need, such as troubleshooting tips, links to related collections, instructions or a note about what to expect within the emulated environment, and how to cite the emulation.

The final talk of the day was by Julia Kim of the Library of Congress. Kim talked about her study on user experience with born-digital materials at NYU from 2014 to 2015 and compared it to Tim Walsh’s 2017 survey on the same topic at the Canadian Centre for Architecture. Kim found that there is a very fine line between researcher responsibilities and digital archivist responsibilities, that users got frustrated with the slowness of the emulations, and that there is a learning curve. Overall, Kim found that emulation is only somewhat worth the effort, but she thinks the EaaSI project will help with this, along with a lot of outreach and education on what these materials are and how to use them effectively.

Overall, I found the workshop highly informative, and I feel more confident considering emulation for future projects. The use of shared community notes helped everyone ask for clarification without disrupting the presenters and allowed questions to be typed out and asked at the end. It has also been helpful to look back on these notes, as slides and links to resources have been added by both presenters and attendees. It’s nice that there is a cohort of people out there working on this and willing to share resources and talk as needed! If you’d like to learn more about the workshop, you can visit their website here; if you’d like to see the community notes and presentations, you can click here, with the Twitter stream here.


Brenna Edwards is currently Project Digital Archivist at the Stuart A. Rose Library at Emory University, Atlanta, GA. Her main responsibility is imaging and processing born-digital materials, while also researching the best tools and practices to make them available.

ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives

by Emily Higgs


On Friday, July 26, 2019, academics and practitioners met at Wilson Library at UNC Chapel Hill for “ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives.” This meeting featured expert panels and participant-driven discussions about how we can use natural language processing – using software to understand text and its meaning – and machine learning – a branch of artificial intelligence that learns to infer patterns from data – in the archives.

The meeting was hosted by the RATOM (Review, Appraisal, and Triage of Mail) Project, a partnership between the State Archives of North Carolina and the School of Information and Library Science at UNC Chapel Hill. RATOM will extend the email processing capabilities currently present in the TOMES software and the BitCurator environment, developing additional modules for identifying and extracting the contents of email-containing formats, NLP tasks, and machine learning approaches. RATOM and the ml4arc meeting are generously supported by the Andrew W. Mellon Foundation.

Presentations at ml4arc were split between successful applications of machine learning and problems that could potentially be addressed by machine learning in the future. In his talk, Mike Shallcross from Indiana University identified archival workflow pain points that provide opportunities for machine learning. In particular, he sees the potential for machine learning to address issues of authenticity and integrity in digital archives, PII and risk mitigation, aggregate description, and how all these processes are (or are not) scalable and sustainable. Many of the presentations addressed these key areas and how natural language processing and machine learning can lend aid to archivists and records managers. Additionally, attendees got to see presentations and demonstrations from tools for email such as RATOM, TOMES, and ePADD. Euan Cochrane also gave a talk about the EaaSI sandbox and discussed potential relationships between software preservation and machine learning.

The meeting agenda had a strong focus on using machine learning in email archives; collecting and processing email is a large burden in many archives and stands to benefit greatly from machine learning tools. For example, Joanne Kaczmarek from the University of Illinois presented a project processing capstone email accounts using e-discovery and predictive-coding software called Ringtail. In partnership with the Illinois State Archives, Kaczmarek used Ringtail to identify groups of “archival” and “non-archival” emails from 62 capstone accounts, and to further break down the “archival” category into “restricted” and “public.” After 3-4 weeks of tagging training data with this software, the team was able to reduce the volume of emails by 45% by excluding “non-archival” messages and to identify 1.8 million emails that met the criteria to be made available to the public. Done manually, this tagging could easily have taken over 13 years of staff time.
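Ringtail is commercial e-discovery software, so the sketch below is not what Kaczmarek’s team actually ran; it is a generic, minimal illustration of the predictive-coding idea in Python with scikit-learn, using made-up example emails and labels:

```python
# Minimal predictive-coding illustration: train a text classifier on a few
# hand-tagged emails, then use it to triage untagged messages in bulk.
# The example data and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-tagged training data: email bodies and archival/non-archival labels.
emails = [
    "Attached is the final draft of the agency's annual report.",
    "Lunch order for Friday: who wants pizza?",
    "Minutes from the March board meeting are enclosed.",
    "Reminder: the parking garage closes early today.",
]
labels = ["archival", "non-archival", "archival", "non-archival"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

# Predicted labels can then be applied to the remaining, untagged messages.
print(model.predict(["Enclosed are the meeting minutes for April."]))
```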

After the ml4arc meeting, I am excited to see the evolution of these projects and how natural language processing and machine learning can help us with our responsibilities as archivists and records managers. From entity extraction to PII identification, there are myriad possibilities for these technologies to help speed up our processes and overcome challenges.


Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.


Securing Our Digital Legacy: An Introduction to the Digital Preservation Coalition

by Sharon McMeekin, Head of Workforce Development


Nineteen years ago, the digital preservation community gathered in York, UK, for the Cedars Project’s Preservation 2000 conference. It was here that the first seeds were sown for what would become the Digital Preservation Coalition (DPC). Guided by Neil Beagrie, then of King’s College London and Jisc, work to establish the DPC continued over the next 18 months and, in 2002, representatives from 7 organizations signed the articles that formally constituted the DPC.

In the 17 years since its creation, the DPC has gone from strength to strength, the last 10 years under the leadership of current Executive Director William Kilbride. The past decade has been a particular period of growth, as shown by the rise in the staff complement from 2 to 7. We now have more than 90 members representing an increasingly diverse group of organizations from 12 countries, across sectors including cultural heritage, higher education, government, banking, industry, media, research, and international bodies.

DPC staff, chair, and president

Our mission at the DPC is to:

[…] enable our members to deliver resilient long-term access to digital content and services, helping them to derive enduring value from digital assets and raising awareness of the strategic, cultural and technological challenges they face.

We work to achieve this through a broad portfolio of work across six strategic areas of activity: Community Engagement, Advocacy, Workforce Development, Capacity Building, Good Practice and Standards, and Management and Governance. Everything we do is member-driven: members guide our activities through the DPC Board, Representative Council, and the Sub-Committees that oversee each strategic area.

Although the DPC is driven primarily by the needs of our members, we do also aim to contribute to the broader digital preservation community. As such, many of the resources we develop are made publicly available. In the remainder of this blog post, I’ll be taking a quick look at each of the DPC’s areas of activity and pointing out resources you might find useful.

1 | Community Engagement

First up is our work in the area of Community Engagement. Here our aim is to enable “a growing number of agencies and individuals in all sectors and in all countries to participate in a dynamic and mutually supportive digital preservation community”. Collaboration is a key to digital preservation success, and we hope to encourage and support it by helping build an inclusive and active community. An important step in achieving this aim was the publication of our ‘Inclusion and Diversity Policy’ in 2018.

Webinars are key to building community engagement amongst our members. We invite speakers to talk to our members about particular topics and share experiences through case studies. These webinars are recorded and made available for members to watch at a later date. We also run a monthly ‘Members Lounge’ to allow informal sharing of current work and discussion of issues as they arise and, on the public end of the website, a popular blog, covering case studies, new innovations, thought pieces, recaps of events and more.

2 | Advocacy

Our advocacy work campaigns “for a political and institutional climate more responsive and better informed about the digital preservation challenge”, as well as “raising awareness about the new opportunities that resilient digital assets create”. This tends to happen on several levels, from enabling and aiding members’ advocacy efforts within their own organizations, through raising legislators’ and policy makers’ awareness of digital preservation, to educating the wider populace.

To help those advocating for digital preservation within their own context, we have recently published our Executive Guide. The Guide provides a grab bag of statements and facts to help make the case for digital preservation, including key messages, motivators, opportunities to be gained and risks faced. We welcome any suggestions for additions or changes to this resource!

Our longest running advocacy activity is the biennial Digital Preservation Awards, last held in 2018. The Awards aim to celebrate excellence and innovation in digital preservation across a range of categories. This high-profile event has been joined in recent years by two other activities with a broad remit and engagement. The first is the Bit List of Digitally Endangered Species, which highlights at-risk digital information, showing both where preservation work is needed and where efforts have been successful. Finally, there is World Digital Preservation Day (WDPD), a day to showcase digital preservation around the globe. Response to WDPD since its inauguration in 2017 has been exceptionally positive. There have been tweets, blogs, events, webinars, and even a song and dance! This year WDPD is scheduled for 7th November, and we encourage everyone to get involved.

The nominees, winners, and judges for the 2018 Digital Preservation Awards

3 | Workforce Development

Workforce Development activities at the DPC focus on “providing opportunities for our members to acquire, develop and retain competent and responsive workforces that are ready to address the challenges of digital preservation”. There are many threads to this work, but key for our members are the scholarships we provide through our Career Development Fund and free access to the training courses we run.

At the moment we offer three training courses: ‘Getting Started with Digital Preservation’, ‘Making Progress with Digital Preservation’ and ‘Advocacy for Digital Preservation’, but we have plans to expand the portfolio in the coming year. All of our training courses are available to non-members for a modest fee, but at the moment are mostly held face to face in the UK and Ireland. A move to online training provision is, however, planned for 2020. We are also happy to share training resources and have set up a Slack workspace to enable this and greater collaboration with regards to digital preservation training.

Other helpful resources under our Workforce Development heading include the ‘Digital Preservation Handbook’, a free online publication covering digital preservation in the broadest sense. The Handbook aims to be a comprehensive guide for those starting with digital preservation, whilst also offering links to additional resources. The content for the Handbook was crowd-sourced from experts and has all been peer reviewed. Another useful and slightly less well-known series of publications are our ‘Topical Notes’, originally funded by the National Archives of Ireland and intended to introduce key digital preservation issues to a non-specialist audience (particularly record creators). Each note is only two pages long and jargon-free, so a great resource to help raise awareness.

4 | Capacity Building

Perhaps the biggest area of DPC work covers Capacity Building, that is “supporting and assuring our members in the delivery and maintenance of high quality and sustainable digital preservation services through knowledge exchange, technology watch, research and development.” This can take the form of direct member support, helping with tasks such as policy development and procurement, as well as participation in research projects.

Our more advanced publication series, the Technology Watch Reports, also sits under the Capacity Building heading. Written by experts and peer reviewed, each report takes a deeper dive into a particular digital preservation issue. Our latest report, on Email Preservation, is currently available for member preview but will be publicly released shortly. Some other ‘classics’ include Preserving Social Media, Personal Digital Archiving, and the always popular The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition). (I always tell those new to OAIS to start here rather than with the 200+ dry pages of the full standard!)

We also run around six thematic Briefing Day events a year on topical issues. As with the training, these are largely held in the UK and Ireland, but they are now also live-streamed for members. We support a number of Thematic Task Forces and Working Groups, with the ‘Web Archiving and Preservation Working Group’ being particularly active at the moment.

DPC members engaged in a brainstorming session

5 | Good Practice and Standards

Our Good Practice and Standards stream of work was a new addition as of the publication of our latest Strategic Plan (2018-22). Here we are contributing work towards “identifying and developing good practice and standards that make digital preservation achievable, supporting efforts to ensure services are tightly matched to shifting requirements.”

We hope this work will allow us to input into standards with the needs of our members in mind and facilitate the sharing of good practice that already happens across the coalition. This has already borne fruit in the shape of the forthcoming DPC Rapid Assessment Model, a maturity model to help with benchmarking digital preservation progress within your organization. You can read a bit more about it in this blog post by Jen Mitcham and the model will be released publicly in late September.

We also work with vendors through our Supporter Program and events like our ‘Digital Futures’ series to help bridge the gap between practice and solutions.

6 | Management and Governance

Our final stream of work is focused less on digital preservation and more on “ensuring the DPC is a sustainable, competent organization focussed on member needs, providing a robust and trusted platform for collaboration within and beyond the Coalition.” This relates both to the viability of the organization and to good governance. It is essential that everything we do is transparent and that the members can both direct what we do and ensure accountability.

The Future

Before I depart, I thought I would share a little bit about some of our plans for the future. In the next few years we’ll be taking steps to further internationalize as an organization. At the moment our membership is roughly 75% UK and Ireland and 25% international, but those numbers are gradually moving closer and we hope that continues. With that in mind we will be investigating new ways to deliver services and resources online, as well as in languages beyond English. We’re starting this year with the publication of our prospectus in German, French and Spanish.

We’re also beginning to look forward to our 20th anniversary in 2022. It’s a Digital Preservation Awards Year, so that’s reason enough for a celebration, but we will also be welcoming the digital preservation community to Glasgow, Scotland, as hosts of iPRES 2022. Plans are already afoot for the conference, and we’re excited to make it a showcase for both the community and one of our home cities. Hopefully we’ll see you there, but I encourage you to make use of our resources and to get in touch soon!

Access our Knowledge Base: https://www.dpconline.org/knowledge-base

Follow us on Twitter: https://twitter.com/dpc_chat

Find out how to join us: https://www.dpconline.org/about/join-us


Sharon McMeekin is Head of Workforce Development with the Digital Preservation Coalition and leads on work including training workshops and their scholarship program. She is also Managing Editor of the ‘Digital Preservation Handbook’. With Masters degrees in Information Technology and Information Management and Preservation, both from the University of Glasgow, Sharon is an archivist by training, specializing in digital preservation. She is also an ILM qualified trainer. Before joining the DPC she spent five years as Digital Archivist with RCAHMS. As an invited speaker, Sharon presents on digital preservation at a wide variety of training events, conferences and university courses.

Student Impressions of Tech Skills for the Field

by Sarah Nguyen


Back in March, during bloggERS’ Making Tech Skills a Strategic Priority series, we distributed an open survey to MLIS, MLS, MI, and MSIS students to understand what they know and have experienced in relation to technology skills as they enter the field.

To be frank, this survey stemmed from personal interests, since I just completed an MLIS core course on Research, Assessment, and Design (re: survey to collect data on the current landscape). I am also interested in what skills I need to build and what classes I should sign up for next quarter (re: what tech skills do I need to become hire-able?). While I feel comfortable with a variety of tech-related tools and tasks, I’ve been intimidated by more “high-level” computational languages for some years. This survey was helpful for exploring what skills other LIS pre-professionals are interested in and which skills will help us make these costly degrees worth the time and financial investment that is traditionally required to enter a stable archive or library position.

Method

The survey was open for one month on Google Forms, and distributed to SAA communities, @SAA_ERS Twitter, the Digital Curation Google Group, and a few MLIS university program listservs. There were 15 questions and we received responses from 51 participants. 

Results & Analysis

Here’s a superficial scan of the results. If you would like to come up with your own analyses, feel free to view the raw data on GitHub.

Figure 1. Technology-related skills that students want to learn

The most popular technology-related skill that students are interested in learning is data management (manipulating, querying, transforming data, etc.). This is a pretty broad topic, as it involves many tools and protocols that can range from GUIs to scripts. A separate survey that breaks down specific data management tools might be in order, especially since these skills can be divided into specialty courses and workshops, which then translate into specific job positions. A more specific survey could help demonstrate what types of skills need to be taught in a full semester-long course and what skills can be covered in a day-long or multi-day workshop.

It was interesting to see that, even in this day and age where social media management can be second nature to many students’ daily lives, there was still a notable interest in understanding how to make it part of a career. This makes me wonder what value students see in knowing how to strategically manage an archives’ social media account. How could this help with the job market, as well as with an archival organization’s main mission?

Looking deeper into the popular data management category, it would be interesting to know the current landscape of knowledge or pedagogy around communicating with IT (e.g. project management and translating users’ needs). In many cases, archivists work separately from, but dependently on, IT system administrators, and it can be frustrating since either department may have distinct concerns about a server or other networks. At June’s NYC Preservathon/Preservashare 2019, there was mention that IT exists to make sure servers and networks are spinning at all hours of the day. Unlike archivists, they are not concerned about the longevity of the content, obsolescence of file formats, or the software to render files. Could it be useful to have a course on how to effectively communicate and take control of issues that fall on the fuzzy lines between archives, data management, and IT? Or, as one survey respondent said, “I think more basic programming courses focusing on tech languages commonly used in archives/libraries would be very helpful.” Personally, I’ve only learned this from experience working in different tech-related jobs. This is not a subject I see on my MLIS course catalog, nor a discussion at conference workshops.

The popularity of data management skills sparked another question: what about knowledge around computer networks and servers? Even though LTO will forever be in our hearts, cloud storage is also a backup medium we’re budgeting for and relying on. Same goes for hosting a database for remote access and/or publishing digital files. A friend mentioned this networking workshop for non-tech savvy learners—Grassroots Networking: Network Administration for Small Organizations/Home Organizations—which could be helpful for multiple skill types including data management, digital forensics, web archiving, web development, etc. This is similar to a course that could be found in computer science or MLIS-adjacent information management departments.

Figure 2. Have you taken/will you take technology-focused courses in your program?
Figure 3. Do you feel comfortable defining the difference between scripting and programming?

I can’t say this is statistically significant, but the contrast between the 15.7% who have not/will not take a technology-focused course in their program and the 78.4% of respondents who are not aware of the difference between scripting and programming is eyebrow-raising. According to an article in PLOS Computational Biology, a “script” is “something that is executed directly as is,” while a “program” is “something that is explicitly compiled before being used. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is ‘things that are edited directly’ and ‘things that are not edited directly’” (Wilson et al. 2017). This distinction is important, since more archives are acquiring, processing, and sharing collections that rely on the archivist to execute jobs such as web-scraping or metadata management (scripts) or to build and maintain a database (programming). These might be interpreted as trick questions, but the particular semantics, and what is considered technology-focused, is something modern library, archives, and information programs might want to consider.
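To make the “script” side of that distinction concrete, here is a small, hypothetical example of the kind of metadata-management job mentioned above: a Python script, edited and executed directly as is, that inventories a collection directory (the paths are placeholders).

```python
# A one-off metadata-management "script": walk a collection directory and
# record each file's path and size in a CSV. Paths are hypothetical.
import csv
import os

with open("file_inventory.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["path", "bytes"])
    for root, _, files in os.walk("path/to/collection"):
        for name in files:
            full = os.path.join(root, name)
            writer.writerow([full, os.path.getsize(full)])
```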

Figure 4. How do you approach new technology?

Figure 4 illustrates the various ways students tackle new technologies. Reading the f* manual (RTFM) and Searching forums are the most common approaches to navigating technology. Here are quotes from a couple students on how they tend to learn a new piece of software:

  • “break whatever I’m trying to do with a new technology into steps and look for tutorials & examples related to each of those steps (i.e. Is this step even possible with X, how to do it, how else to use it, alternatives for accomplishing that step that don’t involve X)”
  • “I tend to google “how to….” for specific tasks and learn new technology on a task-by-task basis.”

In the end, there was overwhelming interest in “more project-based courses that allow skills from other tech classes to be applied.” Unsurprisingly, many of us are looking for full-time, stable jobs after graduating, and the “more practical stuff, like CONTENTdm for archives” seems to be a pressure felt in order to get an entry-level position. And not just entry-level: as continuing education learners, there is also a push to strive for more—several respondents are looking for a challenge to level up their tech skills:

  • “I want more classes with hands-on experience with technical skills. A lot of my classes have been theory based or else they present technology to us in a way that is not easy to process (i.e. a lecture without much hands-on work).”
  • “Higher-level programming, etc. — everything on offer at my school is entry level. Also digital forensics — using tools such as BitCurator.”
  • “Advanced courses for the introductory courses. XML 2 and python 2 to continue to develop the skills.”
  • “A skills building survey of various code/scripting, that offers structured learning (my professor doesn’t give a ton of feedback and most learning is independent, and the main focus is an independent project one comes up with), but that isn’t online. It’s really hard to learn something without face to face interaction, I don’t know why.”

It’ll be interesting to see what skills recent MLIS, MLS, MIS, and MSIM graduates will enter the field with. While many job postings list certain software and skills as requirements, will programs follow suit? I have a feeling this might be a significant question to ask in the larger context of what the purpose of this Master’s degree is and how the curriculum can keep up with the dynamic technology needs of the field.

Disclaimer: 

  1. Potential bias: Those taking the survey might be interested in learning higher-level tech skills because they do not already have them, while students who are already tech-savvy might avoid a basic survey such as this one. This may bias the survey population toward mostly novice tech students.
  2. More data on specific computational languages and technology courses taken are available in the GitHub csv file. As mentioned earlier, I just finished my first year as a part-time MLIS student, so I’m still learning the distinct jobs and nature of the LIS field. Feel free to submit an issue to the GitHub repo, or tweet me @snewyuen if you’d like to talk more about what this data could mean.

Bibliography

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510


Sarah Nguyen with a Uovo storage truck

Sarah Nguyen is an advocate for open, accessible, and secure technologies. While studying as an MLIS candidate at the University of Washington iSchool, she is pursuing these interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, and archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance. (Views do not represent Uovo. And I don’t even work with them. Just liked the truck.)

The Theory and Craft of Digital Preservation: An interview with Trevor Owens

BloggERS! editor Dorothy Waugh recently interviewed Trevor Owens, Head of Digital Content Management at the Library of Congress, about his recent, award-winning book, The Theory and Craft of Digital Preservation.


Who is this book for and how do you imagine it being used?

I attempted to write a book that would be engaging and accessible to anyone who cares about long-term access to digital content and wants to devote time and energy to helping ensure that important digital content is not lost to the ages. In that context, I imagine the primary audience as current and emerging professionals who work to ensure enduring access to cultural heritage: archivists, librarians, curators, conservators, folklorists, oral historians, etc. With that noted, I think the book can also be of use to broader conversations in information science, computer science and engineering, and the digital humanities.

Tell us about the title of the book and, in particular, your decision to use the word “craft” to describe digital preservation.

The words “theory” and “craft” in the title of the book forecast both the structure and the two central arguments that I advance in the book. 

The first chapters focus on theory. This includes tracing the historical lineages of preservation in libraries, archives, museums, folklore, and historic preservation. I then move to explore work in new media studies and platform studies to round out a nuanced understanding of the nature of digital media. I start there because I think it’s essential that cultural heritage practitioners moor their own frameworks and approaches to digital preservation in a nuanced understanding of the varied and historically contingent nature of preservation as a concept and the complexities of digital media and digital information. 

The latter half of the book is focused on what I describe as the “craft” of digital preservation. My use of the term craft is designed to intentionally challenge the notion that work in digital preservation should be understood as “a science.” Given the complexities of both what counts as preservation in a given context and the varied nature of digital media, I believe it is essential that we explicitly distance ourselves from many of the assumptions and baggage that come along with the ideology of “digital.” 

We can’t build some super system that just solves digital preservation. Digital preservation requires making judgement calls. Digital preservation requires the applied thinking and work of professionals. Digital preservation is not simply a technical question; instead, it involves understanding the nature of the content that matters most to an intended community and making judgement calls about how best to mitigate risks of potential loss of access to that content. As a result of my focus on craft, I offer less of a “this is exactly what one should do” approach and more of an invitation to join the community of practice that is developing knowledge and honing and refining its craft.

Reading the book, I was so happy to see you make connections between the work that we do as archivists and digital preservation. Can you speak to that relationship and why you think it is important?

Archivists are key players in making preservation happen and the emergence of digital content across all kinds of materials and media that archivists work with means that digital preservation is now a core part of the work that archivists do. 

I organize a lot of my discussion about the craft of digital preservation around archival concepts as opposed to library science or curatorial practices. For example, I talk about arrangement and description. I also draw from ideas like MPLP as key concepts for work in digital preservation and from work on community archives. 

“Old Files,” from xkcd: a webcomic of romance, sarcasm, math, and language (2014).

Broadly speaking, in the development of digital media, I see a growing context collapse between formats that had been distinct in the past. That is, conservation of oil paintings, management and preservation of bound volumes, and organizing and managing heterogeneous sets of records have some strong similarities, but there are also a lot of differences. The born-digital incarnations of those works (digital art, digital publishing, and digital records) are all made up of digital information and file formats, and they face a related set of digital preservation issues.

With that noted, I think archival practice tends to be particularly well suited to dealing with the nature of digital content. Archives have long dealt with the problem of scale that is now intensified by digital data. At the same time, archivists have also long dealt with hybrid collections and complex jumbles of formats, forms, and organizational structures, which is increasingly the case for all the types of forms that transition into born-digital content.

You emphasize that the technical component of digital preservation is sometimes prioritized over social, ethical, and organizational components. What are the risks implicit in overlooking these other important components?

Digital preservation is not primarily a technical problem. The ideology of “digital” is that things should be faster, cheaper, and automatic. The ideology of “digital” suggests that we should need less labor, less expertise, and fewer resources to make digital stuff happen. If we let this line of thinking infect our idea of digital preservation, we are going to see major losses of important data, we will see major failures to respect ethical and privacy issues relating to digital content, and lots of money will be spent on work that fails to get us the results we want.

In contrast, when we take as a starting point that digital preservation is about investing resources in building strong organizations and teams, ones that participate in the community of practice and work through the complex interactions that emerge between competing library and archives values, then we have a chance of being effective while also building great and meaningful jobs for professionals.

If digital preservation work is happening in organizations that have an overly technical view of the problem, it is happening despite, not because of, their organization’s approach. That is, there are people doing the work; they just likely aren’t getting credit and recognition for doing it. Digital preservation happens because of people who understand that the fundamental nature of the work requires continual efforts to secure enough resources to meaningfully mitigate risks of loss, along with thoughtful decision making about building and curating collections of value to communities.

Considerations related to access and discovery form a central part of the book and you encourage readers to “Start simple and prioritize access,” an approach that reminded me of many similar initiatives focused on getting institutions started with the management and preservation of born-digital archives. Can you speak to this approach and tell us how you see the relationship between preservation and access?

A while back, OCLC ran an initiative called “walk before you run,” focused on working with digital archives and digital content. I know it was a major turning point in helping the field build its practices. Our entire community is learning how to do this work, and we do it together. We need to try things and see which work best and which don’t.

It’s really important to prioritize access in this work. Preservation is fundamentally about access in the future, and the best way to know that something will be accessible in the future is if you’re making it accessible now. Then your users will help you: they can tell you if something isn’t working. The more we can work end to end, that is, accession, process, arrange, describe, and make digital content available to our users, the more we can focus on continually improving that whole process. Without a full end-to-end process in place, it’s impossible to zoom out, look at the whole sequence of steps, and start figuring out where the bottlenecks are and where to focus your optimization efforts.


Dr. Trevor Owens is a librarian, researcher, policy maker, and educator working on digital infrastructure for libraries. Owens serves as the first Head of Digital Content Management for Library Services at the Library of Congress. He previously worked as a senior program administrator at the United States Institute of Museum and Library Services (IMLS) and, prior to that, as a Digital Archivist for the National Digital Information Infrastructure and Preservation Program and as a history of science curator at the Library of Congress. Owens is the author of three books, including The Theory and Craft of Digital Preservation and Designing Online Communities: How Designers, Developers, Community Managers, and Software Structure Discourse and Knowledge Production on the Web. His research and writing have been featured in Curator: The Museum Journal, Digital Humanities Quarterly, The Journal of Digital Humanities, D-Lib, Simulation & Gaming, Science Communication, New Directions in Folklore, and American Libraries. In 2014 the Society of American Archivists granted him the Archival Innovator Award, presented annually to recognize the archivist, repository, or organization that best exemplifies the “ability to think outside the professional norm.”