User Centered Collaboration for Archival Discovery (Part 2)

By the SAA 2017 Session 403 Team: James Bullen, Alison Clemens, Wendy Hagenmaier, Adriane Hanson, Emilie Hardman, Carrie Hintz, Mark Matienzo, Jessica Meyerson, Amanda Pellerin, Susan Pyzynski, Mike Shallcross, Seth Shaw, Sally Vermaaten, Tim Walsh

Insights from Discussion Groups

  • In discussion group 1, we went through a few different discussion areas. We had an interesting conversation about how to navigate using a user-centered approach. We talked about a) how to balance changing user needs with professional practices; b) the difficulty of being user-centered within an MPLP paradigm; c) the contrasting difficulty of being *too* detailed in our description, and that detail getting in the way of discovery; and d) shifting our reference model so that public services staff are more adept at using finding aids and assisting users in navigating minimal description.
  • Discussion group 4 began by discussing the range of concerns relating to the state of discovery at the participants’ institutions. Everyone recognized that their current discovery systems for archives were not ideal, and there was a common interest across the group in centralizing discovery within an institution or consortium. The group also spent a significant amount of time discussing specific known barriers to implementing a new discovery system, including issues related to system integration, the reality that information about archival materials is spread across multiple platforms, and the fact that abrupt transitions across platforms are jarring for users. We also discussed challenges to undertaking user-centered design and collaborative work, including barriers related to administrative support, systemic IT issues, limited knowledge of user experience design methodologies, and scarce resources for these projects.
  • Discussion group 5 began by discussing archival discovery at participants’ institutions. A wide range of strategies was being used to facilitate archival discovery, but none of the participants were happy with their current state. In most cases, the discovery systems were too deeply tied to library technologies like the OPAC, or relied on static HTML websites. Participants were frustrated that their systems didn’t meet potential users where they were (the open web) and didn’t provide the desired opportunities for users to find materials or make more effective use of finding aid data. Participants saw flaws in their systems that negatively impact users, and all saw user testing as something that should be done when designing and maintaining archival discovery systems. There was, however, some concern that user testing is resource intensive, that many archives don’t have the tools or training to do it effectively, and that, while we feel good about our attempts to include users in product development, we don’t have a good sense of what the return on that investment actually is in most cases.
  • Discussion group 6 first considered how to go from having good description in multiple tools to providing an effective user experience searching across all of those tools. A user-centered design approach was appealing, but there was concern about lacking the staff time and expertise to take it on, as well as the challenges of knowing the demographics of your users well enough to establish meaningful user personas and of prioritizing across different user groups’ needs. Collaboration and sharing the work seemed to be the answer to lacking staff time and expertise, although formal collaborative efforts do require overhead to manage the logistics of the collaboration. We discussed ways to collaborate more effectively, such as sharing the results of our work (like user personas) online rather than in journal articles, making room for smaller institutions in the conversation, and allowing for different levels of time commitment and expertise within a collaboration. Roles for institutions without technical expertise include providing feedback or replicating a test locally using another institution’s method.

 

Working on a user-centered design project for your archives? Have questions about the topic? Chime in via the comments below!


User Centered Collaboration for Archival Discovery (Part 1)

By the SAA 2017 Session 403 Team: James Bullen, Alison Clemens, Wendy Hagenmaier, Adriane Hanson, Emilie Hardman, Carrie Hintz, Mark Matienzo, Jessica Meyerson, Amanda Pellerin, Susan Pyzynski, Mike Shallcross, Seth Shaw, Sally Vermaaten, Tim Walsh

At the SAA Annual Meeting, a group of archivists and technologists organized a session on collaborative user-centered design processes across project and institutional boundaries: ArcLight (based at Stanford University), the ArchivesSpace public user interface enhancement project, and New York University’s archival discovery work. Using community-oriented approaches that foreground user experience design and usability testing, these initiatives seek to respond to the documented needs and requirements of archivists and researchers. In an effort to continue the conversation about user-centered design in archives, we wanted to share a recap of the session and discussion reflections with the community.

 

Sally Vermaaten started off the presentations by outlining NYU’s staged design work on a new archival discovery layer. In the first phase of work, a team of archivists, technologists, and librarians conducted a literature review on usability of archival discovery systems, held a blue-sky requirements workshop with stakeholders, assessed several systems in use by other archives, and drafted personas and high-level requirements. In parallel with this design work, a Blacklight-based proof-of-concept site was set up that utilized code developed by Adam Wead. The results of the pilot and design work were promising but also highlighted the growing need to upgrade other archival systems, including migrating to ArchivesSpace. Because implementing ArchivesSpace would offer new mechanisms to access metadata via API and would change underlying data structures, it made sense to migrate to ArchivesSpace before a full redesign of the discovery layer.

 

At the same time, the team at NYU knew that the proof-of-concept Blacklight-based site already running in a test environment included several tangible improvements to search and browse functionality that could be polished and rolled out with minimal investment. NYU therefore decided to take a phased approach in order to put those usability improvements in the hands of users earlier. First, they used rapid user-centered design techniques, including a heuristic analysis and wireframing, to quickly iterate on the proof-of-concept site, and were able to deploy significantly improved search functionality to users within a few months. Next, they focused on the ArchivesSpace migration; once that system was live, they kicked off ‘Phase II’ of archival discovery work, a holistic rethink of the archival discovery layer. Sally wrapped up her presentation by sharing some of the aspects of the NYU approach that proved most helpful in their process and encouraged other institutions to consider these strategies in archival discovery work:

  • sharing and re-using existing resources (user research, design work, and code)
  • documenting user needs as an impetus for and input into future systems projects
  • incremental improvement and proof-of-concept approaches.

 

Susan Pyzynski and Emilie Hardman discussed the ongoing collaborative work toward an enhanced public user interface (PUI) for ArchivesSpace. The first Design Phase took place between March and December 2015. This phase produced a set of wireframes and a report by the design firm, Cherry Hill, which was contracted to establish initial plans for the PUI. The Development Phase, which spanned January–June 2017, took this initial planning into account and fleshed out the firm’s work with both fully exploratory and structured comparative user tests. A selection of these tests and findings may be found here. This work yielded a 2.0 test release of the ASpace PUI: https://github.com/archivesspace/archivesspace/releases/tag/PUBLIC-BETA. Though its name sounds conclusive, the Release Phase (summer 2017) is not the end of the work, though it has put forth a user-informed product:

With this new release, Harvard University is pursuing an aggressive timeline, looking at a January release for the ASpace PUI, and plans to engage in further and more specific community-centered user testing.

 

Finally, Mark Matienzo presented on the design process used for ArcLight, a project initiated by Stanford University Libraries to develop a purpose-built discovery and delivery system for archival collections. ArcLight’s design process followed a similar model to Stanford’s design process for the Spotlight exhibits platform, but with a greater degree of community input and participation. The ArcLight design process included input from thirteen institutions and significant contributions from eleven individuals, including both user experience designers and archivists. After providing an overview of the design process, Mark presented on how requirements were identified and evolved over time. Using the example of the delivery of digital objects within the description of a specific collection component, this review included looking at early stakeholder goals and investigating existing functionality in their environmental scan; identifying questions to ask in user interviews, and the subsequent analysis of the answers; how those insights were reflected in design documents like personas and wireframes; and their eventual implementation in the ArcLight minimum viable product. Mark closed his presentation by discussing lessons learned about the highly collaborative process, including the value of very broad input, the time and effort needed to organize collaboration, and the importance of professional knowledge and expertise in user experience when creating certain kinds of design artifacts.

Archival Collections as Data for Digital Scholarship

By Laurie Allen and Stewart Varner


Archives and special collections have a long history of experimenting with and embracing digital tools, so it is not surprising that they have been natural partners for digital scholarship librarians. In this blog post, we want to share a couple of experiences we’ve had where digital scholarship and the archives came together.

Laurie Allen, Director for Digital Scholarship, University of Pennsylvania:

The Cope Evans project was an early collaboration between the Digital Scholarship group, Special Collections, and a group of students at the Haverford College Libraries. Over the years, a series of gifts had made it possible for Haverford to digitize and richly describe the Cope Evans Family Papers, which include correspondence and other documents from a connected group of Philadelphia Quaker families. While the ContentDM system used by the library allowed for searching through the digitized items, it did not take full advantage of the available metadata, including geospatial metadata. In the summer of 2014, the library employed a group of students to make use of the metadata records and associated images as a dataset. Over the following two summers, two groups of Haverford undergraduates explored the exported data from the Cope Collections to create maps, network analyses, and other visualizations and analyses of the collection. Of course, their exploration of the data led them directly back to the original materials and the resulting work represented a broader and deeper connection to the materials.
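As a rough illustration of what treating collection metadata as a dataset can look like, here is a minimal Python sketch; the CSV file name and the sender/recipient columns are hypothetical stand-ins for the exported Cope Evans metadata, not its actual schema, and this is not the students’ code:

import csv

import networkx as nx

graph = nx.Graph()
# Hypothetical export: one row per letter, with "sender" and "recipient" columns.
with open("cope_evans_metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sender = row.get("sender", "").strip()
        recipient = row.get("recipient", "").strip()
        if sender and recipient:
            weight = graph.get_edge_data(sender, recipient, {}).get("weight", 0) + 1
            graph.add_edge(sender, recipient, weight=weight)

# Most-connected correspondents: a natural starting point for network analysis
print(sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:10])

From a graph like this, the visualizations and analyses the students produced follow naturally.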

This experimentation with using our collections as data for student work led the Haverford Libraries to continue approaching the data and metadata of Quaker collections in data-rich ways. The Quakers and Mental Health and Beyond Penn’s Treaty sites, built since then, carry this work forward at Haverford.

Stewart Varner, Managing Director of Price Lab for Digital Humanities, University of Pennsylvania:

When I was the Digital Scholarship Librarian at the University of North Carolina, I worked on a project called DocSouth Data which was designed to facilitate innovative research methods on Documenting the American South, one of the library’s most popular online collections. Documenting the American South is composed of eighteen thematic collections of digitized material. DocSouth Data takes four of the most text-heavy collections, including the heavily used North American Slave Narrative, and makes them available as .txt files as well as .xml files. With these files, scholars can start looking for patterns using simple tools like Voyant and easily experiment with text analysis methods like topic modeling and sentiment analysis.
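To give a sense of how low the barrier to entry is, here is a minimal sketch of a first pass over the DocSouth Data plain-text files: a simple word-frequency count, the kind of pattern-finding that tools like Voyant automate. The folder path is illustrative rather than the actual layout of the downloaded .zip files.

import re
from collections import Counter
from pathlib import Path

counts = Counter()
# Point this at an unzipped DocSouth Data collection of .txt files (path is illustrative).
for txt_file in Path("docsouth-data/texts").glob("*.txt"):
    text = txt_file.read_text(encoding="utf-8", errors="ignore").lower()
    counts.update(re.findall(r"[a-z]+", text))

print(counts.most_common(25))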

DocSouth Data was an exciting partnership between me, the Library and Information Technology team, and archivists in UNC’s Special Collections. The original idea came from Nick Graham who, at the time, was the Program Coordinator for the North Carolina Digital Heritage Center (and is currently the University Archivist at UNC). I worked closely with Library and Information Technology, who created the plain text files, organized them into a clear folder structure, and made them available as .zip files on the library’s website. Once DocSouth Data was live, I hosted workshops at UNC and elsewhere that gave faculty, students, and librarians the chance to explore new ways to study the collections.


Since these two projects started, both Laurie and Stewart have joined the project team for the IMLS funded Collections as Data project. The Haverford Libraries contributed two facets to that project.

Philly Born Digital Access Bootcamp

by Faith Charlton

As Princeton University Library’s Manuscripts Division processing team continues to move forward in managing its born-digital materials, much of its focus as of late has been on providing access to this content (else, why preserve it?). So the timing of the Born Digital Access bootcamp held in Philadelphia this past summer was very opportune. Among other takeaways, it was helpful and comforting to learn how other institutions are grappling with the issue of providing or restricting access, and how their approaches compare with what Princeton is currently doing.

The bootcamp, led by Alison Clemens from Yale and Greg Weideman from SUNY Albany, was well organized and very informative, and I really appreciate how community-driven and participatory this initiative is, down to the community notes prepared by one of its organizers, Rachel Appel, who was in attendance. I also appreciated that the content provided a holistic and comprehensive approach to access, including reinforcement of the fact that the ability to provide access to born-digital materials starts at the point of record creation, and that, once implemented, the effectiveness of the means by which institutions provide access should be assessed through frequent user testing.

One point in particular that Alison and Greg emphasized, and that stood out to me, is how the discovery of born-digital content is often almost as difficult as the delivery of that content. This was exemplified during the user testing portion of the bootcamp, where attendees had the opportunity to interact with several discovery platforms that describe and/or provide access to digital records. The testing demonstrated that significant barriers to locating and accessing digital content remain.

The issues surrounding discovery and delivery are something that archivists at Princeton are trying to manage and improve upon. For example, I’m part of two working groups that are tackling these issues from different angles: the Description and Access to Born Digital Archival Collections working group and the User Experience working group. The latter has begun both formal and informal user testing of our finding aids site. One aspect that we’re paying particular attention to is the ease with which users can locate and access digital content. I had the opportunity to contribute one of Princeton’s finding aids as a use case for the user testing portion of the workshop, and I received helpful feedback, both positive and negative, from bootcamp attendees about the description and delivery methods found on our site. Although one can access the digital records from this collection, there are some impediments to actually viewing the files; namely, one would have to download a program like Thunderbird in order to view the mbox file of emails, a fact that’s not evident to the user.


Technical Services archivists at Princeton are also collaborating with colleagues in Public Services and Systems to determine how we might best provide various methods of access to our born-digital records. Because much of the content in Manuscripts Division collections is (at the moment) restricted due to issues related to copyright, privacy, and donor concerns, we’re trying to determine how we can provide mediated access to content both on and off-site. I was somewhat relieved to learn that, like Princeton, many institutions represented at the bootcamp are still relying on non-networked “frankenstein” computers in the reading room as the only other means of providing access aside from having content openly available online. Hopefully Princeton will be able to provide better forms of mediated access in the near future as we intend to implement a pilot version of networked access in the reading room for various forms of digital content, including text, image, and AV files. The next step could be to implement a “virtual reading room” where users can access content via authentication. As these initiatives are realized, we’ll continue to conduct user testing to make sure that what we’re providing is actually useful to patrons. Princeton staff look forward to continuing to participate in the initiatives of the Born Digital Access group as a way to both learn from and share our experiences with this community.    



Faith Charlton is Lead Processing Archivist for Manuscripts Division Collections at Princeton University Library. She is a certified archivist and holds an MLIS from Drexel University, an MA in History from Villanova University, and a BA in History from The College of New Jersey.

DLF Forum & Digital Preservation 2017 Recap

By Kelly Bolding


The 2017 DLF Forum and NDSA’s Digital Preservation took place this October in Pittsburgh, Pennsylvania. Each year the DLF Forum brings together a variety of digital library practitioners, including librarians, archivists, museum professionals, metadata wranglers, technologists, digital humanists, and scholars in support of the Digital Library Federation’s mission to “advance research, learning, social justice, and the public good through the creative design and wise application of digital library technologies.” The National Digital Stewardship Alliance follows up the three-day main forum with Digital Preservation (DigiPres), a day-long conference dedicated to the “long-term preservation and stewardship of digital information and cultural heritage.” While there were a plethora of takeaways from this year’s events for the digital archivist community, for the sake of brevity, this recap will focus on a few broad themes, followed by some highlights related to electronic records specifically.

As an early career archivist and a first-time DLF/DigiPres attendee, I was impressed by the DLF community’s focus on inclusion and social justice. While technology was central to all aspects of the conference, the sessions centered the social and ethical aspects of digital tools in a way that I found both refreshing and productive. (The theme for this year’s DigiPres was, in fact, “Preservation is Political.”) Rasheedah Phillips, a Philadelphia-based public interest attorney, activist, artist, and science fiction writer, opened the forum with a powerful keynote about the Community Futures Lab, a space she co-founded and designed around principles of Afrofuturism and Black Quantum Futurism. By presenting an alternate model of archiving deeply grounded in the communities affected, Phillips’s talk and Q&A responses brought to light an important critique of the restrictive nature of archival repositories. I left Phillips’s talk thinking about how we might allow the liberatory “futures” she envisions to shape how we design online spaces for engaging with born-digital archival materials, as opposed to modeling these virtual spaces after the physical reading rooms that have alienated many of our potential users.

Other conference sessions echoed Phillips’s challenge to archivists to better engage and center the communities they document, especially those who have been historically marginalized. Ricky Punzalan noted in his talk on access to dispersed ethnographic photographs that collaboration with documented communities should now be a baseline expectation for all digital projects. Rosalie Lack and T-Kay Sangwand spoke about UCLA’s post-custodial approach to ethically developing digital collections across international borders using a collaborative partnership framework. Martha Tenney discussed concrete steps taken by archivists at Barnard College to respect the digital and emotional labor of students whose materials the archives is collecting to fill in gaps in the historical record.

Eira Tansey, Digital Archivist and Records Manager at the University of Cincinnati and organizer for Project ARCC, gave her DigiPres keynote about how our profession can develop an ethic of environmental justice. Weaving stories about the environmental history of Pittsburgh throughout her talk, Tansey called for archivists to commit firmly to ensuring the preservation and usability of environmental information. Related themes of transparency and accountability in the context of preserving and providing access to government and civic data (which is nowadays largely born-digital) were also present throughout the conference sessions. Regarding advocacy and awareness initiatives, Rachel Mattson and Brandon Locke spoke about Endangered Data Week, and several sessions discussed the PEGI Project. Others presented on the challenges of preserving born-digital civic and government information, including how federal institutions and smaller universities are tackling digital preservation given their often limited budgets, as well as how repositories are acquiring and preserving born-digital congressional records.

Collaborative workflow development for born-digital processing was another theme that emerged in a variety of sessions. Annalise Berdini, Charlie Macquarie, Shira Peltzman, and Kate Tasker, all digital archivists representing different University of California campuses, spoke about their process of coming together to create a standardized set of UC-wide guidelines for describing born-digital materials. Representatives from the OSSArcFlow project also presented initial findings from their research into how repositories are integrating open source tools, including BitCurator, Archivematica, and ArchivesSpace, within their born-digital workflows; they reported on concerns about the scalability of various tools and standards, as well as desires to transition from siloed workflows to a more holistic approach and to reduce the time spent transforming the output of one tool to be compatible with another tool in the workflow. Elena Colón-Marrero of the Computer History Museum’s Center for Software History provided a thorough rundown of building a software preservation workflow from the ground up, from inventorying software and establishing a controlled vocabulary for media formats to building a set of digital processing workstations, developing imaging workflows for different media formats, and eventually testing everything out on a case study collection (and she kindly placed her whole talk online!).

Also during the forum, the DLF Born-Digital Access Group met over lunch for an introduction and discussion. The meeting was well-attended, and the conversation was lively as members shared their current born-digital access solutions, both pretty and not so pretty (but never perfect); their wildest hopes and dreams for future access models; and their ideas for upcoming projects the group could tackle together. While technical challenges certainly figured into the discussion about impediments to providing better born-digital access, many of the problems participants reported had to do with their institutions being unwilling to take on perceived legal risks. The main action item that came out of the meeting is that the group plans to take steps to expand NDSA’s Levels of Preservation framework to include Levels of Access, as well as corresponding tiers of rights issues. The goal would be to help archivists assess the state of existing born-digital access models at their institutions, as well as give them tools to advocate for more robust, user-friendly, and accessible models moving forward.

For additional reports on the conference, reflections from several DLF fellows are available on the DLF blog. In addition to the sessions I mentioned, there are plenty more gems to be found in the openly available community notes (DLF, DigiPres) and OSF Repository of slides (DLF, DigiPres), as well as in the community notes for the Liberal Arts Colleges/HBCU Library Alliance unconference that preceded DLF.


Kelly Bolding is a processing archivist for the Manuscripts Division at Princeton University Library, where she is responsible for the arrangement and description of early American history collections and has been involved in the development of born-digital processing workflows. She holds an MLIS from Rutgers University and a BA in English Literature from Reed College.

Supervised versus Unsupervised Machine Learning for Text-based Datasets

By Aarthi Shankar

This is the fifth post in the bloggERS series on Archiving Digital Communication.


I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.

During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited for our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, which involved building a system that uses real-time camera feeds to detect human-vehicle interactions and sound alarms if a collision is imminent. I used a Supervised Machine Learning algorithm, a Support Vector Machine (SVM), to train a model to identify the car and the human in individual data frames. In that project, Supervised Machine Learning worked well when applied to identifying objects embedded in images. But I do not believe it will be as beneficial for our current project, which works with text-only data. Here is my argument for my position.

In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to train it. Training is based on that input set, and the algorithms used to process it produce the required output. In my project, I created a training dataset which contained pictures of specific attributes of cars (windshields, wheels) and specific attributes of humans (faces, hands, and legs). I needed a training set of 900-1,000 images to achieve a ~92% level of accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would struggle to accurately distinguish windshield glass from other glass (e.g., building windows) present in many places across a whole data frame.
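To make the supervised setup concrete, here is an illustrative scikit-learn sketch rather than my actual pipeline: an SVM trained on labeled feature vectors, where random vectors stand in for the image descriptors a real system would extract from the frames.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in feature vectors: in the real project these would be image
# descriptors computed from labeled car/human attribute crops.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))
y = rng.integers(0, 2, size=1000)      # 0 = car attribute, 1 = human attribute

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))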

For Supervised Machine Learning to work well, the expected output of an algorithm should be known and the data used to train the algorithm should be properly labeled. This takes a great deal of effort. A huge volume of words along with their synonyms would be needed as a training set. But this implies we know what we are looking for in the data. I believe that for the purposes of our project the expected output (all “sensitive” content) is not so clearly known, and therefore a reliable training set and algorithm would be difficult to create.

In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together based on likely relevance. Unsupervised Machine Learning tools can still allow humans to add their own input text or data for the algorithm to understand and train itself. I believe this approach is better than Supervised Machine Learning for our purposes. Through the use of clustering mechanisms in Unsupervised Machine Learning, the input data can first be divided into clusters, and the test data can then be identified using those clusters.
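As a rough sketch of what that clustering step could look like for text, assuming a TF-IDF representation and k-means (our project is still evaluating its actual tools, so this is illustrative only):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages standing in for email text; real input would be message bodies.
emails = [
    "Please review the attached budget spreadsheet before Friday.",
    "Budget figures for the fiscal year are attached for your review.",
    "Lunch next week? Let me know what day works for you.",
    "Are you free for lunch on Tuesday or Wednesday?",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(emails)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, text in zip(labels, emails):
    print(label, text)

Messages that land in the same cluster can then be reviewed together, which is the behavior described above.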

In summary, a Supervised Machine Learning tool learns to apply the labels provided in the training data, but creating a reliable training dataset takes significant effort and is not easy to do from our textual data. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project, particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.


Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.

Stanford Hosts Pivotal Session of Personal Digital Archiving Conference

By Mike Ashenfelder


In March, Stanford University Libraries hosted Personal Digital Archiving 2017, a conference about preservation and access of digital stuff for individuals and for aggregations of individuals. Presenters included librarians, data scientists, academics, data hobbyists, researchers, humanitarian citizens and more. PDA 2017 differed from previous PDA conferences though, when an honest, intense discussion erupted about race, privilege and bias.

Topics did not fall into neat categories. Some people collected data, some processed it, some managed it, some analyzed it. But often the presenters’ interests overlapped. Here are just some of the presentations, grouped by loosely related themes.

  • Joan Jeffri’s (Research Center for Arts & Culture/The Actors Fund) project archives the legacy of older performing artists. Jessica Moran (National Library of New Zealand) talked about the digital archives of a contemporary New Zealand composer and Shu-Wen Lin (NYU) talked about archiving an artist’s software-based installation.
  • In separate projects, Andrea Prichett (Berkeley Copwatch), Stacy Wood and Robin Margolis (UCLA), and Ina Kelleher (UC Berkeley) talked about citizens tracking the actions of police officers and holding the police accountable.
  • Stace Maples (Stanford) helped digitize 500,000+ consecutive photos of buildings along Sunset Strip. Pete Schreiner (North Carolina State University) archived the debris from a van that three bands shared for ten years. Schreiner said, “(The van) accrued the detritus of low-budget road life.”
  • Adam Lefloic Lebel (University of Montreal) talked about archiving video games and Eric Kaltman (UC Santa Cruz) talked about the game-research tool, GISST.
  • Robert Douglas Ferguson (McGill) examined personal financial information management among young adults. Chelsea Gunn (University of Pittsburgh) talked about the personal data that service providers collect from their users.
  • Rachel Foss (The British Library) talked about users of born-digital archives. Dorothy Waugh and Elizabeth Russey Roke (Emory) talked about how digital archiving has evolved since Emory acquired the Salman Rushdie collection.
  • Jean-Yves Le Meur (CERN) called for a non-profit international collaboration to archive personal data, “…so that individuals would know they have deposited stuff in a place where they know it is safe for the long term.”
  • Sarah Slade (State Library Victoria) talked about Born Digital 2016, an Australasian public-outreach program. Natalie Milbrodt (Queens Public Library) talked about helping her community archive personal artifacts and oral histories. Russell Martin (DC Public Library) talked about helping the DC community digitize their audio, video, photos and documents. And Jasmyn Castro (Smithsonian African American History Museum) talked about digitizing AV stuff for the general public.
  • Melody Condron (University of Houston) reviewed tools for handling and archiving social media and Wendy Hagenmaier (Georgia Tech) introduced several custom-built resources for preservation and emulation.
  • Sudheendra Hangal and Abhilasha Kumar (Ashoka University) talked about using personal email as a tool to research memory. And Stanford University Libraries demonstrated their ePADD software for appraisal, ingest, processing, discovery and delivery of email archives. Stanford also hosted a hackathon.
  • Carly Dearborn (Purdue) talked about data analysis and management for researchers. Leisa Gibbons (Kent State) analyzed interactions between YouTube and its users, and Nancy Van House (UC Berkeley) and Smiljana Antonijevic Ubois (Penn State) talked about digital scholarly workflow. Gary Wolf (Quantified Self) talked about himself.

Some presentations addressed cultural sensitivity and awareness. Barbara Jenkins (University of Oregon) discussed a collaborative digital project in Afghanistan. Kim Christen (Washington State) demonstrated Mukurtu, built with indigenous communities, and Traditional Knowledge Labels, a metadata tool for adding local cultural protocols. Katrina Vandeven (University of Denver) talked about a transformative insight she had during a Women’s March project, where she suddenly became aware of the bias and “privileged understanding” she brought to it.

The conference ended with observations from a group of panelists who have been involved with the PDA conferences since the beginning: Cathy Marshall, Howard Besser (New York University), Jeff Ubois (MacArthur Foundation), Cliff Lynch (Coalition for Networked Information) and me.

Marshall said, “I still think there’s a long-term tendency toward benign neglect, unless you’re a librarian, in which case you tend to take better care of your stuff than other people do.” She added that cloud services have improved to the point where people can trust online backup. Marshall said, “Besides, it’s much more fun to create new stuff than to be the steward of your own mess.”

Lynch agreed about automated backup. “There used to be a view that catastrophic data loss was part of life and you’d just lose your stuff and start over,” Lynch said. “It was cleansing and terrifying at the same time.” He said the possibility of data loss is still real but less urgent.

Marshall spoke of backing the same stuff up again and again, and how it’s “all over the place.”

Besser described a conversation he had with his sister that they carried out over voicemail, WhatsApp, text and email. “All this is one conversation,” Besser said. “And it’s all over the place.” Lynch predicted that the challenge of organizing digital things is “…going to shift as we see more and more…automatic correlation of things.”

Ubois said, “I think emulation has been proven as something that we can trust.” He also indicated the “cognitive diversity” around the room. He said, “Many of the best talks at PDA over the years have been by persistent fanatics who had something and they were going to grab it and make it available.”

Besser said, “Things that we were talking about…years ago, that were far out ideas, have entered popular discourse…One example is what happens to your digital stuff when you die…Now we have laws in some states about it…and the social media have stepped up somewhat.”

I noted that the first PDA conference included presentations about personal medical records and about genealogy, but those two areas haven’t been covered since. Lynch made a similar statement about how genealogy “…richly deserves a bit more exploration.” I also noted that the general public still needs expert information about digitizing and digital preservation, and we see more examples of university and public librarians taking the initiative to help their communities with PDA.

In a Q&A session, Charles Ransom (University of Michigan) raised the bias issue again when he said, “I was wondering…how privilege plays a part in all of this. Look at the audience and it’s clear that it does,” referring to the overwhelmingly white audience.

Besser said that NYU reaches out to activist, ethnic and neighborhood communities. “Most of us (at this conference) …work with disenfranchised communities,” said Besser. “It doesn’t bring people here to this meeting…but it does mean that at least some of those voices are being heard through outreach.” Besser said that when NYU hosted PDA 2015, they worked toward diversity. “We still had a pretty white audience,” Besser said. “But…it’s more than just who gets the call for proposal…It’s a big society problem that is not really easy to solve and we all have to really work on it.”

I said it was a challenge for PDA conference organizers to reach a wide, diverse audience just through the host institution’s social media tools and a few newsgroups.  When asked what the racial mix was at the PDA 2016 conference (which University of Michigan hosted), Ransom said it was about the same as this conference. He said, “I specifically reached out to presenters that I knew and the pushback I got from them was ‘We don’t have a budget to go to Ann Arbor for four days and pay for hotel and travel and registration fees.’ “

Audience members suggested having the PDA host invite more local community organizations, so travel and lodging won’t matter, and possibly waiving fees. The University of Houston will host PDA 2018; Melody Condron said UH has a close relationship with Houston community organizations and she will explore ways to involve them in the conference.

Lynch, whose continuous conference travels afford him a global technological perspective, said of the PDA 2017 conference, “I’m really encouraged, particularly by the way we seem to be moving the deeper and harder problems into focus…We’re just now starting to understand the contours of the digital environment.”

The conference videos are available online at https://archive.org/details/pda2017.

Call for Contributions: Collaborating Beyond the Archival Profession

The work of archivists is highly collaborative in nature. While the need for and benefits of collaboration are widely recognized, the practice of collaboration can be, well, complicated.

This year’s ARCHIVES 2017 program featured a number of sessions on collaboration: archivists collaborating with users, Indigenous communities, secondary school teachers, etc. We would like to continue that conversation in a series of posts that cover the practical issues that arise when collaborating with others outside of the archival profession at any stage of the archival enterprise. Give us your stories about working with technologists, videogame enthusiasts, artists, musicians, activists, or anyone else with whom you find yourself collaborating!

A few potential topics and themes for posts:

  • Posts written by non-traditional archivists or others working to preserve heritage materials outside of traditional archival repositories
  • Posts co-written by archivists and collaborators
  • Tips for translating archive jargon, and suggestions for working with others in general
  • Incorporating researcher feedback into archival work
  • The role of empathy in digital archives practice

Writing for bloggERS! Collaborating Beyond the Archival Profession series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in November, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

There’s a First Time for Everything

By Valencia Johnson

This is the fourth post in the bloggERS series on Archiving Digital Communication.


This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Due to their longevity with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archives collected an Outlook archive (.pst) file of 15.8 GB! This was my first time working with emails. This was also the first time some of the tools discussed below were used in the workflow at my institution. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions of the tools I used in this acquisition.

My original workflow:

  1. Convert the .pst file into an .mbox file.
  2. Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
  3. Follow and complete our accessioning procedures.

Things were moving smoothly. I was able to use Emailchemy, a tool that converts email from closed, proprietary file formats, such as the .pst files used by Outlook, to standard, portable formats that any application can use, such as .mbox files, which can be read using Thunderbird, Mozilla’s open source email client. I used a Windows laptop that had Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy; the instructions in the owner’s manual were clear, and the process was easy. Next, I uploaded the Email folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I used to accession is a FRED, a powerful forensic recovery tool used by law enforcement and some archivists. Our FRED runs BitCurator, which is a Linux environment. This is an important fact to remember because .pst files will not open on a Linux machine.
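As a side note that is not part of our official workflow, one quick way to confirm that a converted .mbox is readable on a Linux machine is Python’s standard-library mailbox module; the file path below is illustrative.

import mailbox

box = mailbox.mbox("Emails/archive.mbox")   # illustrative path to the converted file
print("messages:", len(box))
for i, message in enumerate(box):
    # Peek at the first few headers to confirm the conversion looks sane.
    print(message.get("Date"), "|", message.get("Subject"))
    if i >= 4:
        break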

At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.

Yeah Bulk Extractor I’ll just pick up more cores during lunch.

The program was unable to complete 4 threads within the Email folder and timed out. The picture above is part of the explanation message I received. In my understanding and research (a.k.a. Google, because I did not understand the message), the program was unable to completely execute the task with the amount of processing power available. So the message is essentially saying, “I don’t know why this is taking so long. It’s you, not me. You need a better computer.” From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran the scan solely on the Email folder, and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.

I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged, e.g. (/home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Email folder looked like this:

(Some text has been blacked out for privacy.)

Even though Bulk Extractor Viewer does display the content, it displays it like a text editor (e.g., Notepad), with all the encoding alongside the content of the message, rather than as a rendered email, because all the results were from the .mbox file. This is just the way .mbox content appears without an email client to translate the material into a human-readable format. That output makes it hard to locate an individual message within the .pst, because it is difficult, though not impossible, to find the date or subject of the email amongst the encoding. But this was my first time encountering results like this, and it freaked me out a bit.

Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of an SSN or a credit card number without being either. So in lieu of social security numbers, I found that the results were FedEx tracking numbers or mistyped phone numbers, though to be fair, a mistyped number could be someone’s SSN. For credit card numbers, the program picked up email encoding and non-financial number patterns.
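To illustrate why these false positives happen, here is a small sketch using illustrative patterns (not Bulk Extractor’s actual ones): a bare SSN-shaped regular expression matches any 3-2-4 digit string, including mistyped phone numbers, while a Luhn checksum weeds out most strings that merely look like credit card numbers.

import re

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_valid(number: str) -> bool:
    # Standard Luhn check: double every second digit from the right.
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    if len(digits) < 13:
        return False
    total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
    return total % 10 == 0

text = ("Call 555-12-3456 about order 4111 1111 1111 1111 "
        "and invoice 1234 5678 9012 3456.")
print(SSN_LIKE.findall(text))                  # a mistyped phone number matches too
for candidate in re.findall(r"\b(?:\d[ -]?){13,16}\b", text):
    print(candidate.strip(), luhn_valid(candidate))   # only the real test card passes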

The scan found an SSN I had to remove from the .pst and the .mbox. Remember, .pst files only work with Microsoft Outlook. At this point in processing, I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought maybe I could use a keyword search within the .mbox to locate and remove the flagged material, because you can open .mbox files using a text editor. Remember when I said the .pst was over 15 GB? Well, the .mbox was just as large, and this caused the text editor to stall and eventually give up opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a text editor built for very large files. This whole process took a couple of days, and in the end the results from Bulk Extractor’s search indicated the email files contained one SSN and no credit card numbers.
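In hindsight, another way to hunt for flagged material without opening a 15 GB file in a text editor would be to stream through the .mbox with Python’s standard-library mailbox module, which reads one message at a time. This is a hedged sketch, not part of our workflow; the path and the flagged value are placeholders.

import mailbox

FLAGGED = "123-45-6789"   # placeholder for the flagged string, not a real SSN

for key, message in mailbox.mbox("Emails/archive.mbox").iteritems():
    # Search the raw message text; note that base64-encoded attachments
    # would not match a plain-string search like this.
    if FLAGGED in message.as_string():
        print(key, message.get("Date"), message.get("Subject"))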

While discussing my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor but created with .pst files in mind, to fulfill our due diligence in looking for sensitive information, since the Bulk Extractor scan timed out before finishing. FileLocator Pro operates on Windows, so, unfortunately, we couldn’t run the scan on the FRED, but FileLocator Pro was able to catch real SSNs hidden in attachments that did not appear in the Bulk Extractor results.

Like Bulk Extractor, FileLocator Pro let me view the email with the flagged content highlighted. There is also the option to open attachments or emails in their respective programs, so a .pdf file opened in Adobe and the email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It didn’t perform as well searching for credit card numbers; I had some error messages stating that some attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of the primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.

As emails continue to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process even larger files, regardless of how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails or attachments, or even to complete a scan. There’s no such thing as a perfect program; I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve, and with forums like this blog we can learn from one another.


Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.