by Lauren Work and Jeremy Bartczak
This is the fifth post in the BloggERS Embedded Series.
Web Archiving Workflows
The Web Archiving Working Group at the University of Virginia Library focuses on building technical and administrative workflows, creating policy for everything from collection development to description, and developing education for web archives and their function at the university. This cross-departmental group includes an archivist, a digital preservation librarian, a digital projects coordinator, a curator, and a metadata librarian. While emphasizing our local Archives & Special Collections policies, we intentionally embedded non-archivists in our team to leverage different skill sets and share distributed project responsibilities.
Recently, we have been working to create a standards-based application profile for descriptive metadata that will allow us to build our web archiving program in a consistent and scalable way. This includes thinking carefully about how web archives description can become better embedded with other library and archives descriptive workflows, as well as thinking about future integrations of web archives metadata external to UVa (the Cobweb project is one example where we could eventually contribute standardized metadata about our web archives).
The Library uses the Archive-It subscription service as one of our main tools for web archiving. Archive-It describes preserved web content using the Dublin Core element set. While Dublin Core is a flexible and lean specification with wide implementation, the description of web resources is relatively new for libraries and archives, so standardizing metadata workflows in a way that is consistent with broader efforts in the community was challenging.
Environmental Survey & Community Best Practices
To meet the challenges of alignment and standardization, we began by conducting an environmental survey of those who have been working in this space before us (Columbia, Stanford, and the University at Albany are a few academic examples), and by reviewing and adapting existing community best practices, tools, and myriad reports. We also consulted content standards such as DACS and RDA as well as the individual Dublin Core element definitions. Other valuable resources included the Digital Public Library of America Metadata Application Profile (DPLAMAP) and various published institutional guidelines.
The group developed expertise with metadata design for web archives by experimenting directly with the different approaches encountered in the environmental survey. We created test records using the University of Michigan Bentley Historical Library Guidelines and also experimented with building our own element set by applying DACS and RDA against Dublin Core elements.
In May 2017 the OCLC Web Archiving Metadata Working Group published a draft data dictionary of 14 elements for describing web resources. This was an essential tool that provided us with conceptual alignment with our previously consulted resources in addition to some helpful crosswalks to Dublin Core, EAD, MARC, and MODS fields. The OCLC recommendations ultimately provided a more concrete roadmap for navigating these different approaches. In particular, the crosswalks detailed in the report and examples from existing descriptions and guidelines were critical for informing our own approaches and prioritizations.
Prioritizing the Element Set
At its core, our profile is based on mappings from OCLC’s 14 Data Dictionary elements to Dublin Core and EAD. From there we were able to prioritize fields by applying DACS requirements to data dictionary elements (see Table 1, “Mappings and Element Requirements from DACS,” in “University of Virginia Web Archiving Element Set Prioritization Version 1.0”). Note that since Archive-It uses a Dublin Core metadata editor, our element set is expressed in Dublin Core.
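A minimal sketch of how a crosswalk like this could be represented and checked in code follows. The class and function names are hypothetical, and the rows and requirement levels are illustrative placeholders, not the values from Table 1. It only shows the general pattern: re-key a description from data dictionary element names to Dublin Core names, then report which required elements are still empty.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ElementMapping:
    """One crosswalk row (hypothetical structure, for illustration only)."""
    source_element: str   # element name as used in the data dictionary
    dublin_core: str      # Dublin Core element it is expressed in
    requirement: str      # "required", "recommended", or "optional"


# Placeholder rows: the requirement levels here are assumptions,
# not the prioritization published in the UVA element set.
CROSSWALK = [
    ElementMapping("Title", "title", "required"),
    ElementMapping("Creator", "creator", "required"),
    ElementMapping("Date", "date", "required"),
    ElementMapping("Description", "description", "required"),
    ElementMapping("Language", "language", "required"),
    ElementMapping("Subject", "subject", "optional"),
    ElementMapping("URL", "identifier", "recommended"),
]


def to_dublin_core(described: dict[str, list[str]]) -> dict[str, list[str]]:
    """Re-key a record from source element names to Dublin Core names.

    Elements that have no row in CROSSWALK are dropped.
    """
    lookup = {m.source_element: m.dublin_core for m in CROSSWALK}
    record: dict[str, list[str]] = {}
    for name, values in described.items():
        if name in lookup:
            record.setdefault(lookup[name], []).extend(values)
    return record


def missing_required(record: dict[str, list[str]]) -> list[str]:
    """List required Dublin Core elements that are absent or blank."""
    return [
        m.dublin_core
        for m in CROSSWALK
        if m.requirement == "required"
        and not any(value.strip() for value in record.get(m.dublin_core, []))
    ]
```

Calling `missing_required(to_dublin_core({...}))` on a draft description gives a quick list of gaps before a record is entered in a metadata editor. Keeping the requirement level on each row means the same table can drive both the mapping and the completeness check.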
The results of this first attempt provided a lean element set with flexibility to incorporate web archive metadata into a DACS-compliant collection description such as a resource record in ArchivesSpace. However, we also wanted to be mindful of display in the Archive-It public portal, to get guidance on which additional Dublin Core elements to prioritize, and to think broadly about existing profiles that might allow us to extend interoperability with external community efforts. The DPLAMAP was an excellent resource for helping us address these goals, since DPLA uses it to aggregate large amounts of library and archival metadata with properties that map closely to Dublin Core. Table 2, “Mapping and supplemental elements per DPLAMAP recommendations/requirements,” illustrates the additional elements that we prioritized based on an analysis of the DPLAMAP.
The Application Profile and Next Steps
The complete first version of the University of Virginia Web Archiving Metadata Application Profile is available here and includes an element set, definitions, local guidelines, and further reference to related content standards when applicable. We look forward to implementing the profile as we ramp up web archiving workflows at UVa. We welcome feedback from the community on what we’ve come up with so far and have enabled comments on our shared Google Docs.
We’re also already thinking about several next steps: increasing our ability to automate and extract information from WARCs to aid with processing and technical context, using APIs more regularly to scale and automate some of our work, conducting internal feedback and outreach sessions, and assessing how our use of other web archiving tools like webrecorder.io might affect some of our metadata workflow.
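To give a sense of what extracting information from WARCs can involve, here is a minimal sketch that reads only the record headers from an uncompressed WARC stream using the Python standard library. The function names are hypothetical, and it simplifies deliberately: it ignores folded header lines and assumes every record carries a correct Content-Length. Compressed .warc.gz files would need to be decompressed first (for example by opening them with `gzip.open`), and a dedicated WARC library would be a better fit for real processing.

```python
import io
from collections import Counter


def iter_warc_headers(stream):
    """Yield a dict of named header fields for each WARC record.

    `stream` is a binary file-like object containing uncompressed WARC data.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the content block so the next read starts at the separator.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def count_record_types(stream):
    """Count records by their WARC-Type header (e.g. response, metadata)."""
    return Counter(h.get("WARC-Type", "unknown") for h in iter_warc_headers(stream))
```

Header fields such as WARC-Target-URI and WARC-Date are the kind of technical context that could feed into description, for instance to confirm the date range of captures for a seed.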
Lauren Work is the Digital Preservation Librarian at the University of Virginia Library, where she is responsible for the implementation of digital preservation strategy, systems, and workflows.
Jeremy Bartczak is the Metadata Librarian at the University of Virginia Library, where he works on metadata assessment, remediation, and strategic workflows for library collections.