Call for Contributions: What’s Your Set-Up?

bloggERS!, the blog of the Electronic Records Section of SAA, is accepting proposals for blog posts on the theme “What’s Your Set-Up?” These posts will address the question: what equipment do you need to get your job done in digital archives? We’re particularly interested in posts that consist of a detailed description of hardware, software, and other equipment used in your institution’s digital archives workflow (computers, readers, drives, etc.), as well as more general posts about equipment needs in digital archives.

See our call for posts below and email any proposals to ers.mailer.blog@gmail.com.

We look forward to hearing from all of you.

—The bloggERS! editorial subcommittee

Call for Posts

When starting a digital archives program from scratch, archivists can easily be overwhelmed by the range of hardware and software needed to effectively manage and preserve digital media, the variety of options for each equipment type, and the question of where to obtain everything needed. As our practice evolves, so does the required equipment, and archivists are constantly replacing and improving their equipment according to their needs and resources.

This series hopes to help break down barriers by allowing archivists to learn from their peers at a variety of institutions. We want to hear about the specific equipment you use in your day-to-day workflows, addressing questions such as: what do your workstations consist of? How many do you have? What readers and drives work reliably for your workflows? How did you obtain them? What doesn’t work? What is on your wish list for equipment acquisition?

We welcome posts from staff at institutions with all levels of budgetary resources. 

Other potential topics and themes for posts:

  • Creating a low-cost digital archives workstation
  • Stories of assembling workstations iteratively
  • Strategies for obtaining the necessary equipment, and preferred vendors
  • Working with IT to establish and support digital archives hardware and software
  • Stories of success or failure with advanced equipment such as the FRED Forensic Workstation or the Kryoflux

Writing for bloggERS! “What’s Your Set-Up?” Series

  • We encourage visual representations: posts can include or largely consist of comics, flowcharts, a series of memes, etc.!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

Data As Collections

By Nathan Gerth

Over the past several years there has been a growing conversation about “collections as data” in the archival field. Elizabeth Russey Roke underscored the growing impact of this movement in her recent post on this blog about the Collections as Data: Always Already Computational final report. Much like her, I have seen this computational approach to the data in our collections manifest itself at my home institution, with our decision to start providing researchers with aggregate data harvested from our born-digital collections.

Data as Collections

At the same time, in my role as a digital archivist working with congressional papers, I have seen a growing body of what I call “data as collections.” I am using the term data in this case specifically in reference to exports from relational database systems in collections. More akin to research datasets than standard born-digital acquisitions, these exports amplify the privacy and technical challenges associated with typical digital collections. However, they also embody some of the more appealing possibilities for the computational research highlighted by the “collections as data” initiative, given their structured nature and millions of data points.   

Curating and supplying access to one particular type of data export has become an acute problem in the field of congressional papers. As documented in a 2017 white paper by a Congressional Papers Section Task Force, members of the U.S. House of Representatives and U.S. Senate have widely adopted proprietary Constituent Management Systems (CMS) or Constituent Services Systems (CSS) to manage constituent communications. The exports from these systems document the core interactions between the American people and their representatives in Congress. Unfortunately, these data exports have remained largely inaccessible to archivists and researchers alike.

The question of curating, preserving, and supplying access to the exports from these systems has galvanized the work of several task forces in the archival community. In recent years, congressional papers archivists have collaborated to document the problem in the white paper referenced above and to support the development of a tool to access these exports. The latter effort, spearheaded by West Virginia University (WVU) Libraries, earned a Lyrasis Catalyst Fund grant in 2018 to assess the development possibilities for an open-source platform developed at WVU to open and search these data exports. You can see a screenshot of the application in action below.


Screenshot of data table viewed in the West Virginia University Libraries CSS Data Tool
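At its heart, a tool like the one described above loads tabular export data into a queryable store and searches across it. Here is a minimal stand-alone sketch using SQLite; this is not the WVU tool itself, and the table and column names are invented for illustration:

```python
import csv
import io
import sqlite3

def load_export(conn, table, csv_text):
    """Load one CSV table from a CSS/CMS-style export into an SQLite table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

def search(conn, table, column, term):
    """Substring search over one column: the kind of lookup an access tool
    must run across millions of exported rows."""
    cur = conn.execute(
        f'SELECT * FROM "{table}" WHERE "{column}" LIKE ?', (f"%{term}%",)
    )
    return cur.fetchall()

# Invented sample data standing in for a constituent-correspondence export
conn = sqlite3.connect(":memory:")
load_export(conn, "correspondence", "id,subject\n1,Water rights\n2,Mining claims\n")
rows = search(conn, "correspondence", "subject", "Water")
```

A production tool would index the columns and handle attachments, but the load-then-query shape is the same.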

The project funded by the grant, America Contacts Congress, has now issued its final report, and the members of the task force that served as its advisory board are transitioning to the next stage of the project. Here is where things stand:

What We Now Know

We now know much more about the key research audiences for this data and the archival needs associated with the tool. Researchers, especially computationally minded quantitative scholars, expressed strong enthusiasm for gaining access to the data. For those of us involved in testing data in the tool, the project offered a chance to become much more familiar with our own data. I, for my part, now know a great deal more about the 16 million records in the relational data tables we received from the office of Senator Harry Reid, in addition to the 3 million attachments referenced by those tables. Without the ability to search and view the data in the tool, the tables and attachments from the Reid collection would have existed as little more than binary files.

Unresolved Questions

While members of the grant’s advisory board know much more about how the tool might be used in the sphere of congressional papers, we would like to learn more about other cases of “data as collections” in the archival field. Who beyond congressional papers archivists is grappling with preserving and supplying access to relational databases? We know, for example, that many state and local governments use the same Constituent Relationship Management systems, such as iConstituent and Intranet Quorum, deployed in congressional offices. Do our needs overlap with those of other archivists, and could this tool serve a broader community? While the volume of CSS data exports in congressional collections is significant, the direction we take tool development, and the partnerships we form to supply access to the data, will hinge on finding a broader audience of archivists facing similar challenges.

If any of the questions above apply to you, consider contacting the members of the America Contacts Congress project’s advisory board. We would love to hear from you and discuss how the outcomes of the grant might apply to a broader array of data exports in archival collections. Who knows, we might even help you test the tool on your own data exports! For more information about the project, visit our webpage.

Nathan Gerth

Nathan Gerth is the Head of Digital Services and Digital Archivist at the University of Nevada, Reno Libraries. Splitting his time between application support and digital preservation, he is the primary custodian of the electronic records from the papers of Senator Harry Reid. Outside of the University, he is an active participant in the congressional papers community, serving as the incoming chair of the Congressional Papers Section and as a member of the Association of Centers for the Study of Congress CSS Data Task Force.

SAA 2019 recap | Session 504: Building Community History Web Archives: Lessons Learned from the Community Webs Program

by Steven Gentry


Introduction

Session 504 focused on the Community Webs program and the experiences of archivists who worked at either the Schomburg Center for Research in Black Culture or the Grand Rapids Public Library. The panelists consisted of Sylvie Rollason-Cass (Web Archivist, Internet Archive), Makiba Foster (Manager, African American Research Library and Cultural Center, formerly the Assistant Chief Librarian, the Schomburg Center for Research in Black Culture), and Julie Tabberer (Head of Grand Rapids History & Special Collections).

Note: The content of this recap has been paraphrased from the panelists’ presentations and all quoted content is drawn directly from the panelists’ presentations.

Session summary

Sylvie Rollason-Cass began with an overview of web archiving and web archives, including:

  • The definition of web archiving.
  • The major components of web archives, including relevant capture tools (e.g. web crawlers, such as Wget or Heritrix) and playback software (e.g. Webrecorder Player).
  • The ARC and WARC web archive file formats. 
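The WARC format mentioned above is, at its simplest, a sequence of records, each a block of named headers followed by the captured payload. The following standard-library-only sketch illustrates that record structure; real-world tools rely on dedicated libraries such as warcio, and a full response record carries additional headers:

```python
import io
import uuid
from datetime import datetime, timezone

def write_warc_record(stream, uri, payload: bytes):
    """Write one minimal WARC/1.0 response record to a binary stream."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Length: {len(payload)}",
    ]
    stream.write(("\r\n".join(headers) + "\r\n\r\n").encode("utf-8"))
    stream.write(payload)
    stream.write(b"\r\n\r\n")  # blank lines separate records in a WARC file

def read_warc_headers(stream):
    """Read the next record's header block and return it as a dict."""
    block = b""
    while not block.endswith(b"\r\n\r\n"):
        block += stream.read(1)
    lines = block.decode("utf-8").strip().split("\r\n")
    return dict(line.split(": ", 1) for line in lines[1:])  # skip version line

buf = io.BytesIO()
write_warc_record(buf, "http://example.com/", b"<html>hello</html>")
buf.seek(0)
headers = read_warc_headers(buf)
```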

Rollason-Cass then noted both the necessity of web archiving—especially due to the web’s ephemeral nature—and that many organizations archiving web content are higher education institutions. The Community Webs program was therefore designed to get more public library institutions involved in web archiving, which is critical given that these institutions often collect unique local and/or regional material.

After a brief description of the issues facing public libraries and public library archives—such as a lack of relevant case studies—Rollason-Cass provided information about the institutions that joined the program, the resources provided by the Internet Archive as part of the program (e.g. a multi-year subscription to Archive-It), and the project’s results, including:

  • The creation of more than 200 diverse web archives (see the Remembering 1 October web archive for one example).
  • Institutions’ creation of collection development policies pertaining specifically to web archives, in addition to other local resources.
  • The production of an online course entitled “Web Archiving for Public Libraries.” 
  • The creation of the Community Webs program website.

Rollason-Cass concluded by noting that although some issues—such as resource limitations—may continue to limit public libraries’ involvement in web archiving, the Community Webs program has made it much easier for other institutions to confidently archive web content. 

Makiba Foster then addressed her experiences as a Community Webs program member. After a brief description of the Schomburg Center, its mission, and its unique place where “collections, community, and current events converge,” Foster highlighted her specific reasons for becoming more engaged with web archiving:

  • Like many other institutions, the Schomburg Center has long collected clippings files—and web archiving would allow this practice to continue.
  • Materials that document the experiences of the black community are prominent on the World Wide Web.
  • Marginalized community members often publish content on the Web.

Foster then described the #HashtagSyllabusMovement collection, a web archive of educational material “related to publicly produced and crowd-sourced content highlighting race, police violence, and other social justice issues within the Black community.” Foster had known this content could be lost, so—even before participating in the Community Webs program—she began collecting URLs. Upon joining the Community Webs program, Foster used Archive-It to archive various relevant materials (e.g. Google docs, blog posts, etc.) dating from 2014 to the present. Although some content was lost, the #HashtagSyllabusMovement collection continues to grow—Foster hopes it will eventually include international educational content—and demonstrates the value of web archiving. 

In her conclusion, Foster addressed various successes, challenges, and future directions:

  • Challenges:
    • Learning web archiving technology and having confidence in one’s decisions.
    • Curating content for the Center’s five divisions.
    • “Getting institutional support.”
  • Future Directions:
    • A new digital archivist will work with each division to collect and advocate for web archives.
    • Considering how to both do outreach for and catalog web archives.
    • Ideally, working alongside community groups to help them implement web archiving practices.

The final speaker, Julie Tabberer, addressed the value of public libraries’ involvement in web archives. After a brief overview of the Grand Rapids Public Library, the necessity of archives, and the importance of public libraries’ unique collecting efforts, Tabberer posited the following question: “Does it matter if public libraries are doing web archiving?” 

To test her hypothesis that “public libraries document mostly community web content [unlike academic archives],” Tabberer analyzed the seed URLs of fifty academic and public libraries to answer two specific questions:

  • “Is the institution crawling their own website?”
  • “What type of content [e.g. domain types] is being crawled [by each institution]?”
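An analysis along these lines can be approximated by parsing each seed URL’s hostname. Here is a hypothetical sketch; the institution domain and seed list are invented:

```python
from urllib.parse import urlparse
from collections import Counter

def classify_seeds(seeds, institution_domain):
    """Answer the two questions above for one institution's seed list:
    how many seeds point at the institution's own site, and what
    top-level domains the seeds fall into."""
    own_site = 0
    tlds = Counter()
    for seed in seeds:
        host = urlparse(seed).hostname or ""
        if host == institution_domain or host.endswith("." + institution_domain):
            own_site += 1
        tlds[host.rsplit(".", 1)[-1]] += 1  # crude bucket: com, org, edu, gov...
    return own_site, tlds

# Invented seed list for a hypothetical public library
seeds = [
    "https://www.grpl.org/",
    "https://www.grandrapidsmi.gov/news",
    "https://localband.com/shows",
]
own, tlds = classify_seeds(seeds, "grpl.org")
```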

After acknowledging some caveats with her sampling and analysis—such as the fact that data analysis is still ongoing and that only Archive-It websites were examined—Tabberer showed audience members several graphics revealing that academic libraries (1) typically crawled their own websites more often than public libraries did and (2) captured more academic websites than public libraries did.

Tabberer then concluded with several questions and arguments for the audience to consider:

  • In addition to encouraging more public libraries to archive web content—especially given their values of access and social justice—what other information institutions are underrepresented in this community?
  • Are librarians and archivists really collecting content that represents the community?
  • Even though resource limitations are problematic, academic institutions must expand their web archiving efforts.

Steven Gentry currently serves as a Project Archivist at the Bentley Historical Library. His responsibilities include assisting with accessioning efforts, processing complex collections, and building various finding aids. He previously worked at St. Mary’s College of Maryland, Tufts University, and the University of Baltimore.

SAA 2019 recap | Session 204: Demystifying the Digital: Providing User Access to Born-Digital Records in Varying Contexts

by Steven Gentry


Introduction

Session 204 addressed how three dissimilar institutions—North Carolina State University (NCSU), the Wisconsin Historical Society (WHS), and the Canadian Centre for Architecture (CCA)—are connecting their patrons with born-digital archival content. The panelists consisted of Emily Higgs (NCSU Libraries Fellow, North Carolina State University), Hannah Wang (Electronic Records & Digital Preservation Archivist, Wisconsin Historical Society), and Stefana Breitwieser (Digital Archivist, Canadian Centre for Architecture). In addition, Kelly Stewart (Director of Archival and Digital Preservation Services, Artefactual Systems) briefly spoke about the development of SCOPE, the tool featured in Breitwieser’s presentation.

Note: The content of this recap has been paraphrased from the panelists’ presentations and all quoted content is drawn directly from the panelists’ presentations.

Session summary

Emily Higgs’s presentation focused on the different ways that NCSU’s Special Collections Research Center (SCRC) staff enhance access to their born-digital archives. After a brief overview of NCSU’s collections, Higgs first described their lightweight workflow for connecting researchers with requested digital content: SCRC staff access an administrator account on a reading room MacBook, transfer copies of requested content to a read-only folder shared with a researcher account, and limit the computer’s overall capabilities, such as restricting its internet access and ports (the latter via Endpoint Protector Basic). Should a patron want copies of the material, they simply drag and drop those resources into another folder for SCRC staff to review.
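The staging step in that workflow (copying requested files into the shared folder and stripping write permissions) can be sketched roughly as follows; the folder names are hypothetical, and the account and Endpoint Protector configuration are out of scope:

```python
import os
import shutil
import stat
from pathlib import Path

def stage_for_reading_room(requested_files, share_dir):
    """Copy requested items into the researcher-visible folder and remove
    all write bits, a rough analogue of the read-only share described above."""
    share = Path(share_dir)
    share.mkdir(parents=True, exist_ok=True)
    for src in requested_files:
        dest = share / Path(src).name
        shutil.copy2(src, dest)
        # owner/group/other read-only (mode 444): no accidental edits
        os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return sorted(p.name for p in share.iterdir())
```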

Higgs then described an experimental Named Entity Recognition (NER) workflow that employs spaCy and allows archivists to better describe files in NCSU’s finding aids. The workflow employs a Jupyter notebook (see her GitHub repository for more information) to automate the following process:

  • “Define directory [to be analyzed by spaCy].”
  • “Walk directory…[to retrieve] text files [such as PDFs].”
  • “Extract text (textract).”
  • “Process and NER (spaCy).”
  • “Data cleaning.”
  • “Ranked output of entities (csv) [which is based on the number of times a particular name appears in the files].”
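The steps above can be sketched end to end in Python. In this simplified stand-in, a naive capitalized-phrase matcher takes the place of spaCy’s statistical NER and plain-text reading replaces textract, but the walk, extract, count, and rank shape is the same:

```python
import csv
import re
from collections import Counter
from pathlib import Path

def rank_entities(directory, output_csv, top_n=10):
    """Walk a directory of text files, pull candidate names, and write a
    CSV ranked by how often each name appears. Runs of capitalized words
    stand in here for spaCy's NER model."""
    counts = Counter()
    for path in Path(directory).rglob("*.txt"):          # "walk directory"
        text = path.read_text(errors="ignore")           # "extract text"
        for match in re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text):
            counts[match] += 1                           # "process and NER"
    with open(output_csv, "w", newline="") as f:         # "ranked output (csv)"
        writer = csv.writer(f)
        writer.writerow(["entity", "count"])
        writer.writerows(counts.most_common(top_n))
    return counts.most_common(top_n)
```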

Once the process is completed, the most frequent 5-10 names are placed in an ArchivesSpace scope and content note. Higgs concluded by emphasizing this workflow’s overall ease of use and noting that—in the future—staff will integrate application programming interfaces (APIs) to enhance the workflow’s efficiency.

Next to speak was Hannah Wang, who addressed how the Wisconsin Historical Society (WHS) has made its born-digital state government records more accessible. Wang began her presentation by discussing the Wisconsin State Preservation of Electronic Records (WiSPER) Project and its two goals:

  • “Ingest a minimum of 75 GB of scheduled [and processed] electronic records from state agencies.”
  • “Develop public access interface.” 

Wang then explained the reasons behind Preservica’s selection:

  • WHS’s lack of significant IT support meant an easily implementable tool was preferred over open-source and/or homegrown solutions.
  • Preservica allowed WHS to simultaneously preserve and provide (tiered) access to digital records.
  • Preservica has a public-facing WordPress site, which fulfilled the second WiSPER grant objective.

Wang then addressed how WHS staff appropriately restricted access to digital records by placing records into one of three groupings:

  • “Content that has a legal restriction.”
  • “Content that requires registration and onsite viewing [such as email addresses].”
  • “Open, unrestricted content.” 

WHS staff achieved this by employing different methods to locate and restrict digital records:

  • For identification: 
    • Reviewing “[record] retention schedules…[and consulting with] agency [staff who would notify WHS personnel of sensitive content].” 
    • Using tools like bulk_extractor.
    • Reading records if necessary.
  • For restricting records:
    • Employing scripts—such as batch scripts—to transfer and restrict individual files and whole SIPs.
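A toy version of that restriction scripting might route files into tiered folders from a review manifest. In this sketch the manifest columns and tier names are invented, loosely modeled on the three access groupings above:

```python
import csv
import shutil
from pathlib import Path

# Invented tier names mirroring the three access groupings
TIERS = {"legal", "onsite", "open"}

def apply_restrictions(manifest_csv, dest_root):
    """Read a review manifest with 'path' and 'tier' columns and move each
    file into the folder for its tier, e.g. dest_root/onsite/report.pdf."""
    moved = []
    with open(manifest_csv, newline="") as f:
        for row in csv.DictReader(f):
            tier = row["tier"]
            if tier not in TIERS:
                raise ValueError(f"unknown access tier: {tier}")
            dest_dir = Path(dest_root) / tier
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / Path(row["path"]).name
            shutil.move(row["path"], dest)
            moved.append(str(dest))
    return moved
```

In practice a batch or shell script applying access-control settings would serve the same role; the point is that the review decision, not the file location alone, drives the restriction.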

Wang demonstrated how WHS makes its restricted content accessible via Preservica:

  • “Content that has a legal restriction”: Only higher levels of description can be searched by external researchers, although patrons have information concerning how to access this content.
  • “Content that requires registration and onsite viewing”: Individual files can be located by external researchers, although researchers still need to visit the WHS to view materials. Again, information concerning how to access this content is provided.

Wang concluded her presentation by describing efforts to link materials in Preservica with other descriptive resources, such as WHS’s MARC records; expressing hope that WHS will integrate Preservica with their new ArchivesSpace instance; and discussing the usability testing that resulted in several upgrades to the WHS Electronic Records Portal prior to its release.

The penultimate speaker was Stefana Breitwieser, who spoke about SCOPE and its features. Breitwieser first discussed the “Archaeology of the Digital” project and how—through this project—the CCA acquired the bulk of its digital content, more than “660,000 files (3.5 TB).” In order to better enhance access to these resources, Breitwieser stressed that two problems had to be addressed:

  • “[A] long access workflow [that involved twelve steps].”
  • “Low discoverability.” Breitwieser noted that issues with their existing access tool included its inability to search across collections and its failure to use metadata generated in Archivematica.

CCA staff ultimately decided to work alongside Artefactual Systems to build SCOPE, “an access interface for DIPs from Archivematica.” The goals of this project included:

  • “Direct user access to access copies of digital archives from [the] reading room.”
  • “Minimal reference intervention [by CCA staff].”
  • “Maximum discoverability using [granular] Archivematica-generated metadata.”
  • “Item-level searching with filtering and faceting.” 

To illustrate SCOPE’s capabilities, Breitwieser demonstrated the tool and its features (e.g. its ability to download DIPs) for the audience. During the presentation, she emphasized that although incredibly useful, SCOPE will ultimately supplement—rather than replace—the CCA’s finding aids. 

Breitwieser concluded by describing the CCA’s reading room—which includes computers that possess a variety of useful software (e.g. computer-aided design, or CAD, software) and, like NCSU’s workstation, only limited technical capabilities—and highlighting CCA’s much simpler 5-step access workflow.

The final speaker, Kelly Stewart, spoke about SCOPE’s development process. She emphasized Artefactual’s use of CCA user stories to develop “feature files”—“logic-based, structured descriptions” of these user stories—that Artefactual staff used to build SCOPE. Stewart noted that after its completion, “user acceptance testing” occurred repeatedly until SCOPE was deemed ready. She concluded her presentation with the hope that other archivists will implement and improve upon SCOPE.


Steven Gentry currently serves as a Project Archivist at the Bentley Historical Library. His responsibilities include assisting with accessioning efforts, processing complex collections, and building various finding aids. He previously worked at St. Mary’s College of Maryland, Tufts University, and the University of Baltimore.