DLF Forum & Digital Preservation 2017 Recap

By Kelly Bolding


The 2017 DLF Forum and NDSA’s Digital Preservation took place this October in Pittsburgh, Pennsylvania. Each year the DLF Forum brings together a variety of digital library practitioners, including librarians, archivists, museum professionals, metadata wranglers, technologists, digital humanists, and scholars in support of the Digital Library Federation’s mission to “advance research, learning, social justice, and the public good through the creative design and wise application of digital library technologies.” The National Digital Stewardship Alliance follows up the three-day main forum with Digital Preservation (DigiPres), a day-long conference dedicated to the “long-term preservation and stewardship of digital information and cultural heritage.” While this year’s events offered a plethora of takeaways for the digital archivist community, for the sake of brevity this recap will focus on a few broad themes, followed by some highlights related specifically to electronic records.

As an early career archivist and a first-time DLF/DigiPres attendee, I was impressed by the DLF community’s focus on inclusion and social justice. While technology was central to all aspects of the conference, the sessions centered the social and ethical aspects of digital tools in a way that I found both refreshing and productive. (The theme for this year’s DigiPres was, in fact, “Preservation is Political.”) Rasheedah Phillips, a Philadelphia-based public interest attorney, activist, artist, and science fiction writer, opened the forum with a powerful keynote about the Community Futures Lab, a space she co-founded and designed around principles of Afrofuturism and Black Quantum Futurism. By presenting an alternate model of archiving deeply grounded in the communities affected, Phillips’s talk and Q&A responses brought to light an important critique of the restrictive nature of archival repositories. I left Phillips’s talk thinking about how we might allow the liberatory “futures” she envisions to shape how we design online spaces for engaging with born-digital archival materials, as opposed to modeling these virtual spaces after the physical reading rooms that have alienated many of our potential users.

Other conference sessions echoed Phillips’s challenge to archivists to better engage and center the communities they document, especially those who have been historically marginalized. Ricky Punzalan noted in his talk on access to dispersed ethnographic photographs that collaboration with documented communities should now be a baseline expectation for all digital projects. Rosalie Lack and T-Kay Sangwand spoke about UCLA’s post-custodial approach to ethically developing digital collections across international borders using a collaborative partnership framework. Martha Tenney discussed concrete steps taken by archivists at Barnard College to respect the digital and emotional labor of students whose materials the archives is collecting to fill in gaps in the historical record.

Eira Tansey, Digital Archivist and Records Manager at the University of Cincinnati and organizer for Project ARCC, gave her DigiPres keynote about how our profession can develop an ethic of environmental justice. Weaving stories about the environmental history of Pittsburgh throughout her talk, Tansey called for archivists to commit firmly to ensuring the preservation and usability of environmental information. Related themes of transparency and accountability in the context of preserving and providing access to government and civic data (which today is largely born-digital) also ran throughout the conference sessions. On the advocacy and awareness front, Rachel Mattson and Brandon Locke spoke about Endangered Data Week, and several sessions discussed the PEGI Project. Others presented on the challenges of preserving born-digital civic and government information, including how federal institutions and smaller universities are tackling digital preservation given their often limited budgets, as well as how repositories are acquiring and preserving born-digital congressional records.

Collaborative workflow development for born-digital processing was another theme that emerged in a variety of sessions. Annalise Berdini, Charlie Macquarie, Shira Peltzman, and Kate Tasker, all digital archivists representing different University of California campuses, spoke about their process of coming together to create a standardized set of UC-wide guidelines for describing born-digital materials. Representatives from the OSSArcFlow project also presented initial findings from their research into how repositories are integrating open source tools, including BitCurator, Archivematica, and ArchivesSpace, into their born-digital workflows. They reported on concerns about the scalability of various tools and standards, as well as desires to move from siloed workflows to a more holistic approach and to reduce the time spent transforming the output of one tool to be compatible with the next tool in the workflow. Elena Colón-Marrero of the Computer History Museum’s Center for Software History provided a thorough rundown of building a software preservation workflow from the ground up: inventorying software and establishing a controlled vocabulary for media formats, building a set of digital processing workstations, developing imaging workflows for different media formats, and eventually testing everything out on a case study collection (she also kindly placed her whole talk online!).

Also during the forum, the DLF Born-Digital Access Group met over lunch for an introduction and discussion. The meeting was well-attended, and the conversation was lively as members shared their current born-digital access solutions, both pretty and not so pretty (but never perfect); their wildest hopes and dreams for future access models; and their ideas for upcoming projects the group could tackle together. While technical challenges certainly figured into the discussion about impediments to providing better born-digital access, many of the problems participants reported stemmed from their institutions being unwilling to take on perceived legal risks. The main action item to come out of the meeting is a plan to expand NDSA’s Levels of Preservation framework to include Levels of Access, along with corresponding tiers of rights issues. The goal would be to help archivists assess the state of existing born-digital access models at their institutions and to give them tools to advocate for more robust, user-friendly, and accessible models moving forward.

For additional reports on the conference, reflections from several DLF fellows are available on the DLF blog. In addition to the sessions I mentioned, there are plenty more gems to be found in the openly available community notes (DLF, DigiPres) and OSF Repository of slides (DLF, DigiPres), as well as in the community notes for the Liberal Arts Colleges/HBCU Library Alliance unconference that preceded DLF.


Kelly Bolding is a processing archivist for the Manuscripts Division at Princeton University Library, where she is responsible for the arrangement and description of early American history collections and has been involved in the development of born-digital processing workflows. She holds an MLIS from Rutgers University and a BA in English Literature from Reed College.


Supervised versus Unsupervised Machine Learning for Text-based Datasets

By Aarthi Shankar

This is the fifth post in the bloggERS series on Archiving Digital Communication.


I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.

During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited to our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, which involved building a system that used real-time camera feeds to detect human-vehicle interactions and sound alarms if a collision was imminent. I used a Supervised Machine Learning algorithm, a Support Vector Machine (SVM), to train a model to identify cars and humans in individual data frames. In that project, Supervised Machine Learning worked well when applied to identifying objects embedded in images, but I do not believe it will be as beneficial for our current project, which works with text-only data. Here is my argument for that position.

In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to train it. The tool learns from this input set, and the algorithms that process it produce the required output. In my project, I created a training dataset that contained pictures of specific attributes of cars (windshields, wheels) and specific attributes of humans (faces, hands, and legs). I needed a training set of 900-1,000 images to achieve roughly 92% accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would struggle to distinguish windshield glass from other glass (e.g., building windows) that appears in many places across a whole data frame.
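For illustration, here is a minimal sketch of this kind of supervised image classification with an SVM. It is not my original system: the directory layout, class folders, and feature choices (scikit-image HOG features with scikit-learn) are hypothetical placeholders.

```python
# Minimal sketch of supervised image classification with an SVM.
# Assumes a hypothetical directory layout: training_images/car/*.png and
# training_images/human/*.png containing labeled image crops.
import glob
from skimage.io import imread
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = [], []
for label in ["car", "human"]:  # hypothetical class folders
    for path in glob.glob(f"training_images/{label}/*.png"):
        img = resize(imread(path, as_gray=True), (128, 64))
        X.append(hog(img))      # HOG feature vector for each labeled image
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = SVC(kernel="linear").fit(X_train, y_train)  # train on the labeled set
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```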

For Supervised Machine Learning to work well, the expected output of the algorithm must be known, and the data used to train it must be properly labeled, which takes a great deal of effort. For text, a huge volume of words along with their synonyms would be needed as a training set, and this implies that we already know what we are looking for in the data. For the purposes of our project, I believe the expected output (all “sensitive” content) is not so clearly known, and therefore a reliable training set and algorithm would be difficult to create.
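To show why labeling matters for text, here is a minimal, hypothetical sketch of a supervised text classifier built with scikit-learn. The example messages and the “sensitive”/“not_sensitive” labels are invented for illustration; a real training set would need to be far larger and carefully curated.

```python
# Minimal sketch of supervised text classification.
# The example texts and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "Employee SSN and home address attached.",
    "Agenda for next week's budget meeting.",
    "Medical leave request for review.",
    "Reminder: office closed Friday.",
]
train_labels = ["sensitive", "not_sensitive", "sensitive", "not_sensitive"]

# The classifier can only learn the labels it is given, so the quality of
# the hand-labeled training set determines what it will find.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["Please forward the personnel file."]))
```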

In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together by likely relevance. Unsupervised Machine Learning tools can still allow humans to add their own input text or data for the algorithm to incorporate as it trains itself. I believe this approach is better suited than Supervised Machine Learning to our purposes. Using the clustering mechanisms of Unsupervised Machine Learning, the input data can first be divided into clusters, and the test data can then be identified against those clusters.
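As a rough illustration of this clustering approach (not the specific tools we are evaluating), here is a minimal sketch using scikit-learn’s TF-IDF vectorizer and k-means. The example messages and the number of clusters are assumptions; no labels are required up front.

```python
# Minimal sketch of unsupervised clustering over email text.
# `emails` stands in for message bodies already extracted from the records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "Employee SSN and home address attached.",
    "Agenda for next week's budget meeting.",
    "Medical leave request for review.",
    "Reminder: office closed Friday.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(emails)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(vectors)

# Group messages by cluster so a reviewer can inspect each cluster for
# likely-sensitive content instead of labeling documents in advance.
for doc, cluster in zip(emails, km.labels_):
    print(cluster, doc)
```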

In summary, a Supervised Machine Learning tool learns to apply the labels it is given in the training data, but creating a reliable training dataset from our textual data would take significant effort. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project, particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.


Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.