The fifth annual BitCurator Users Forum was held at Yale University on October 24-25, bringing library, archives, and museum practitioners together to learn about and discuss many aspects of digital forensics work. Over two days of workshops, lightning talks, and panels, the Forum covered a range of topics around acquisition, processing, and access for born-digital materials. In addition to traditional panels and conference sessions, attendees participated in hands-on workshops on digital forensics techniques and tools, including the BitCurator environment.
Throughout the workshops, sessions, and discussions, one of the most dominant themes to emerge was the question of how archivists and institutions should address the environmental unsustainability of digital preservation. Attendees were quick to highlight recent work in this area, including the article “Toward Environmentally Sustainable Digital Preservation” by Keith L. Pendergrass, Walker Sampson, Tim Walsh, and Laura Alagna, among others. The prevalence of this topic at the Forum, at other conferences, and in our professional literature points to the urgency that archivists feel about continuing to preserve our digital holdings while minimizing negative environmental impact as much as possible.
The role of appraisal in the environmental sustainability of digital preservation was a major focus of the Forum. One attendee remarked that the “low cost of storage has outpaced the ability to appraise content,” summing up the situation that many institutions find themselves in, where the ever-decreasing cost of digital storage, anxiety about discarding potentially valuable collection material, and a lack of time and guidance on appraisal of digital materials have resulted in the ballooning of their digital holdings.
Participants challenged the notion that “keeping everything forever” should be our default preservation strategy. One common thread to emerge was the need to be more thoughtful about what we choose to retain and to develop and share appraisal criteria for born digital materials to help us make those decisions.
Also related to concerns about the environmental impact of digital preservation, presenters posed questions about how much data and related metadata for digital collections should be captured in the first place. Kelsey O’Connell, digital archivist at Northwestern University, proposed defining levels of digital forensics rather than applying the same workflow to every collection. Taking this type of approach to acquisition and metadata creation for born digital collection materials could help institutions minimize the storage of unnecessary collection data.
The BitCurator Users Forum provides an excellent opportunity for library and archives practitioners to learn new skills and discuss the many challenges and opportunities in the field of digital archiving. This year’s Forum was no exception and I have no doubt that it will continue to serve as a valuable resource for experienced practitioners as well as those just starting out.
Sally DeBauche is a Digital Archivist at Stanford University and the ePADD Project Manager.
Email Archiving: Strategies, Tools, Techniques was a one-day workshop held on August 1, 2019. Chris Prom (University of Illinois) and Tricia Patterson (Harvard University) taught the workshop, which gave a broad overview of the opportunities and challenges of email archiving and some tools that can be used to make this daunting task easier.
As a processing archivist, I see email sitting squarely within the electronic records processing workflow I’m helping develop, so I took this class to build my digital archiving skills and to learn about techniques for managing email archives. Attending this class while my department is developing a digital archiving workflow helped me think ahead about technical limitations, ethical considerations, storage, and access issues related to email.
For me, the class was a good introduction to the opportunities and challenges of preserving this ephemeral and widespread communication. The class was divided into three sections: Assessing Needs and Challenges, Understanding Tools and Techniques, and Implementing Workflows. These sections were based on the Lifecycle of Email Model from The Future of Email Archives CLIR report.
During the first portion of the class, we discussed the types of communication that occur through email and the functions that fall under the creation and use as well as the appraisal and selection categories of the email lifecycle. This section featured an interesting group activity asking us to list all of the email accounts we had used in our lifetime, the type of correspondence that occurred on the platform, an estimated size of the collection, and the scope and contents. This exercise helped illustrate how large, multifaceted, and varied even a single email collection can be, and I found it effective for thinking about the complexities of archiving email.
In the second section, Prom and Patterson walked the class through seven tools for capturing and processing emails. The instructors gave a brief description of each tool’s functions and where it fits in the lifecycle model before giving a demo. Unfortunately, the demo portion was the weakest part of this workshop for me: instead of a live demonstration, the instructors used screenshots and a video recording. The screenshots were difficult to read, and the slides containing them do not have any explanatory text, so unless you took good notes, it would be difficult to understand how these tools work once the class was over. If SAA offers this class again, I would suggest the instructors do a live demo and provide more notes on how the tools work so that we can use class materials as a resource when we are doing this work at our own institutions.
The group activity for this class was to export a small portion of our own email and use one of the tools discussed in class to begin processing. During this activity, we discovered that Yahoo makes it difficult or impossible to export email. I think this activity would have been more effective if we had been told before the class began to download our own emails, and how to do so. Most of the time allotted for this activity was spent figuring out how to download our emails and waiting for them to download, so we never got the chance to use the programs we discussed.
Overall, I thought the class provided a good introduction to the complexities of preserving email and introduced open-source and hosted tools that help with different parts of the email lifecycle. I would recommend this class to people who are exploring how to archive email and what would work for their institution.
Kahlee Leingang is a Processing Archivist at Iowa State University, where she works on creating guidelines and workflows for processing, preservation, and access of born-digital records as well as processing collections in the backlog.
bloggERS!, the blog of the Electronic Records Section of SAA, is accepting proposals for blog posts on the theme “What’s Your Set-Up?” These posts will address the question: what equipment do you need to get your job done in digital archives? We’re particularly interested in posts that consist of a detailed description of hardware, software, and other equipment used in your institution’s digital archives workflow (computers, readers, drives, etc.), as well as more general posts about equipment needs in digital archives.
See our call for posts below and email any proposals to firstname.lastname@example.org.
We look forward to hearing from all of you.
—The bloggERS! editorial subcommittee
Call for Posts
When starting a digital archives program from scratch, archivists can be easily overwhelmed by the range of hardware and software needed to effectively manage and preserve digital media, the variety of options for different equipment types, and the question of where to obtain everything needed. As our practice evolves, so does the required equipment, and we are constantly replacing and improving our equipment according to our needs and resources.
This series hopes to help break down barriers by allowing archivists to learn from their peers at a variety of institutions. We want to hear about the specific equipment you use in your day-to-day workflows, addressing questions such as: what do your workstations consist of? How many do you have? What readers and drives work reliably for your workflows? How did you obtain them? What doesn’t work? What is on your wish list for equipment acquisition?
We welcome posts from staff at institutions with all levels of budgetary resources.
Other potential topics and themes for posts:
Creating a low-cost digital archives workstation
Stories of assembling workstations iteratively
Strategies for obtaining the necessary equipment, and preferred vendors
Working with IT to establish and support digital archives hardware and software
Stories of success or failure with advanced equipment such as the FRED Forensic Workstation or the Kryoflux
Writing for bloggERS! “What’s Your Set-Up?” Series
We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
Written content should be roughly 600-800 words in length
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
On Friday, July 26, 2019, academics and practitioners met at Wilson Library at UNC Chapel Hill for “ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives.” This meeting featured expert panels and participant-driven discussions about how we can use natural language processing – using software to understand text and its meaning – and machine learning – a branch of artificial intelligence that learns to infer patterns from data – in the archives.
The meeting was hosted by the RATOM Project (Review, Appraisal, and Triage of Mail). The RATOM project is a partnership between the State Archives of North Carolina and the School of Information and Library Science at UNC Chapel Hill. RATOM will extend the email processing capabilities currently present in the TOMES software and BitCurator environment, developing additional modules for identifying and extracting the contents of email-containing formats, NLP tasks, and machine learning approaches. RATOM and the ml4arc meeting are generously supported by the Andrew W. Mellon Foundation.
Presentations at ml4arc were split between successful applications of machine learning and problems that could potentially be addressed by machine learning in the future. In his talk, Mike Shallcross from Indiana University identified archival workflow pain points that provide opportunities for machine learning. In particular, he sees the potential for machine learning to address issues of authenticity and integrity in digital archives, PII and risk mitigation, aggregate description, and how all these processes are (or are not) scalable and sustainable. Many of the presentations addressed these key areas and how natural language processing and machine learning can aid archivists and records managers. Additionally, attendees got to see presentations and demonstrations of tools for email such as RATOM, TOMES, and ePADD. Euan Cochrane also gave a talk about the EaaSI sandbox and discussed potential relationships between software preservation and machine learning.
The meeting agenda had a strong focus on using machine learning in email archives; collecting and processing emails is a large encumbrance in many archives that can stand to benefit greatly from machine learning tools. For example, Joanne Kaczmarek from the University of Illinois presented a project processing capstone email accounts using an e-discovery and predictive coding software called Ringtail. In partnership with the Illinois State Archives, Kaczmarek used Ringtail to identify groups of “archival” and “non-archival” emails from 62 capstone accounts, and to further break down the “archival” category into “restricted” and “public.” After 3-4 weeks of tagging training data with this software, the team was able to reduce the volume of emails by 45% by excluding “non-archival” messages, and identify 1.8 million emails that met the criteria to be made available to the public. Manually, this tagging process could have easily taken over 13 years of staff time.
After the ml4arc meeting, I am excited to see the evolution of these projects and how natural language processing and machine learning can help us with our responsibilities as archivists and records managers. From entity extraction to PII identification, there are myriad possibilities for these technologies to help speed up our processes and overcome challenges.
Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.
Nineteen years ago, the digital preservation community gathered in York, UK, for the Cedars Project’s Preservation 2000 conference. It was here that the first seeds were sown for what would become the Digital Preservation Coalition (DPC). Guided by Neil Beagrie, then of King’s College London and Jisc, work to establish the DPC continued over the next 18 months and, in 2002, representatives from 7 organizations signed the articles that formally constituted the DPC.
In the 17 years since its creation, the DPC has gone from strength to strength, the last 10 years under the leadership of current Executive Director, William Kilbride. The past decade has been a particular period of growth, as shown by the rise in the staff complement from 2 to 7. We now have more than 90 members who represent an increasingly diverse group of organizations from 12 countries across sectors including cultural heritage, higher education, government, banking, industry, media, research and international bodies.
Our mission at the DPC is to:
[…] enable our members to deliver resilient long-term access to digital content and services, helping them to derive enduring value from digital assets and raising awareness of the strategic, cultural and technological challenges they face.
We work to achieve this through a broad portfolio of work across six strategic areas of activity: Community Engagement, Advocacy, Workforce Development, Capacity Building, Good Practice and Standards, and Management and Governance. Everything we do is member-driven: our members guide our activities through the DPC Board, Representative Council, and Sub-Committees which oversee each strategic area.
Although the DPC is driven primarily by the needs of our members, we do also aim to contribute to the broader digital preservation community. As such, many of the resources we develop are made publicly available. In the remainder of this blog post, I’ll be taking a quick look at each of the DPC’s areas of activity and pointing out resources you might find useful.
1 | Community Engagement
First up is our work in the area of Community Engagement. Here our aim is to enable “a growing number of agencies and individuals in all sectors and in all countries to participate in a dynamic and mutually supportive digital preservation community”. Collaboration is a key to digital preservation success, and we hope to encourage and support it by helping build an inclusive and active community. An important step in achieving this aim was the publication of our ‘Inclusion and Diversity Policy’ in 2018.
Webinars are key to building community engagement amongst our members. We invite speakers to talk to our members about particular topics and share experiences through case studies. These webinars are recorded and made available for members to watch at a later date. We also run a monthly ‘Members Lounge’ to allow informal sharing of current work and discussion of issues as they arise and, on the public end of the website, a popular blog, covering case studies, new innovations, thought pieces, recaps of events and more.
2 | Advocacy
Our advocacy work campaigns “for a political and institutional climate more responsive and better informed about the digital preservation challenge”, as well as “raising awareness about the new opportunities that resilient digital assets create”. This tends to happen on several levels, from enabling and aiding members’ advocacy efforts within their own organizations, through raising legislators’ and policy makers’ awareness of digital preservation, to educating the wider populace.
To help those advocating for digital preservation within their own context, we have recently published our Executive Guide. The Guide provides a grab bag of statements and facts to help make the case for digital preservation, including key messages, motivators, opportunities to be gained and risks faced. We welcome any suggestions for additions or changes to this resource!
Our longest-running advocacy activity is the biennial Digital Preservation Awards, last held in 2018. The Awards aim to celebrate excellence and innovation in digital preservation across a range of categories. This high-profile event has been joined in recent years by two other activities with a broad remit and engagement. The first is the Bit List of Digitally Endangered Species, which highlights at-risk digital information, showing both where preservation work is needed and where efforts have been successful. The second is World Digital Preservation Day (WDPD), a day to showcase digital preservation around the globe. Response to WDPD since its inauguration in 2017 has been exceptionally positive. There have been tweets, blogs, events, webinars, and even a song and dance! This year WDPD is scheduled for 7th November, and we encourage everyone to get involved.
3 | Workforce Development
Workforce Development activities at the DPC focus on “providing opportunities for our members to acquire, develop and retain competent and responsive workforces that are ready to address the challenges of digital preservation”. There are many threads to this work, but key for our members are the scholarships we provide through our Career Development Fund and free access to the training courses we run.
At the moment we offer three training courses: ‘Getting Started with Digital Preservation’, ‘Making Progress with Digital Preservation’ and ‘Advocacy for Digital Preservation’, but we have plans to expand the portfolio in the coming year. All of our training courses are available to non-members for a modest fee, but at the moment are mostly held face to face in the UK and Ireland. A move to online training provision is, however, planned for 2020. We are also happy to share training resources and have set up a Slack workspace to enable this and greater collaboration with regards to digital preservation training.
Other resources under our Workforce Development heading that may prove helpful include the ‘Digital Preservation Handbook’, a free online publication covering digital preservation in the broadest sense. The Handbook aims to be a comprehensive guide for those starting with digital preservation, whilst also offering links to additional resources. The content for the Handbook was crowd-sourced from experts and has all been peer reviewed. Another useful and slightly less well-known series of publications is our ‘Topical Notes’, originally funded by the National Archives of Ireland and intended to introduce key digital preservation issues to a non-specialist audience (particularly record creators). Each note is only two pages long and jargon-free, so it is a great resource to help raise awareness.
4 | Capacity Building
Perhaps the biggest area of DPC work covers Capacity Building, that is “supporting and assuring our members in the delivery and maintenance of high quality and sustainable digital preservation services through knowledge exchange, technology watch, research and development.” This can take the form of direct member support, helping with tasks such as policy development and procurement, as well as participation in research projects.
We also run around six thematic Briefing Day events a year on topical issues. As with the training, these are largely held in the UK and Ireland, but they are now also live-streamed for members. We support a number of Thematic Task Forces and Working Groups, with the ‘Web Archiving and Preservation Working Group’ being particularly active at the moment.
5 | Good Practice and Standards
Our Good Practice and Standards stream of work was a new addition as of the publication of our latest Strategic Plan (2018-22). Here we are contributing work towards “identifying and developing good practice and standards that make digital preservation achievable, supporting efforts to ensure services are tightly matched to shifting requirements.”
We hope this work will allow us to input into standards with the needs of our members in mind and facilitate the sharing of good practice that already happens across the coalition. This has already borne fruit in the shape of the forthcoming DPC Rapid Assessment Model, a maturity model to help with benchmarking digital preservation progress within your organization. You can read a bit more about it in this blog post by Jen Mitcham and the model will be released publicly in late September.
We also work with vendors through our Supporter Program and events like our ‘Digital Futures’ series to help bridge the gap between practice and solutions.
6 | Management and Governance
Our final stream of work is focused less on digital preservation and more on “ensuring the DPC is a sustainable, competent organization focussed on member needs, providing a robust and trusted platform for collaboration within and beyond the Coalition.” This obviously relates to both the viability of the organization and good governance. It is essential that everything we do is transparent and that the members can both direct what we do and ensure accountability.
Before I depart, I thought I would share a little bit about some of our plans for the future. In the next few years we’ll be taking steps to further internationalize as an organization. At the moment our membership is roughly 75% UK and Ireland and 25% international, but those numbers are gradually moving closer and we hope that continues. With that in mind we will be investigating new ways to deliver services and resources online, as well as in languages beyond English. We’re starting this year with the publication of our prospectus in German, French and Spanish.
We’re also beginning to look forward to our 20th anniversary in 2022. It’s a Digital Preservation Awards Year, so that’s reason enough for a celebration, but we will also be welcoming the digital preservation community to Glasgow, Scotland, as hosts of iPRES 2022. Plans are already afoot for the conference, and we’re excited to make it a showcase for both the community and one of our home cities. Hopefully we’ll see you there, but I encourage you to make use of our resources and to get in touch soon!
Sharon McMeekin is Head of Workforce Development with the Digital Preservation Coalition and leads on work including training workshops and its scholarship program. She is also Managing Editor of the ‘Digital Preservation Handbook’. With Master’s degrees in Information Technology and Information Management and Preservation, both from the University of Glasgow, Sharon is an archivist by training, specializing in digital preservation. She is also an ILM qualified trainer. Before joining the DPC she spent five years as Digital Archivist with RCAHMS. As an invited speaker, Sharon presents on digital preservation at a wide variety of training events, conferences and university courses.
The Midwest Archivists Conference 2019 meeting, held April 3-6 in Detroit (in the GM Renaissance Center, which may have the distinction, with its concentric circle design, of being the most bewildering conference center I’ve ever been in), chose “Innovations, Transformation, Resurgence” as its theme. The organizers put out a call for participants to “consider the ways they have transformed their local communities and the world,” and it seems to have struck a chord: the sessions reflected a sense of rootedness as well as a desire to increase and deepen connections between repositories, their holdings, the communities they represent, and (crucially) those they haven’t.
Two standouts on the technical practice and electronic records side were “Computer-Assisted Appraisal of Electronic Records” and “Archival Revitalization: Transforming Technical Services with Innovative Workflows,” both of which were relevant to my (new) position as a processing archivist. For a play-by-play of some of these sessions, you can check out my MAC Twitter feed (yes, I live-tweet conferences). Both sessions emphasized balancing competing priorities and unequal capacities, familiar themes for anyone working in archives. Leading off “Computer-Assisted Appraisal,” Cal Lee reminded everyone that there is no such thing as a perfect machine system (one that would remove the human labor from appraisal), and that the goal should never be to create one: machines are tools, not agents. That emphasis on human action, particularly communicating across and about technological divides, was echoed in “Archival Revitalization,” which focused on instances of implementation (new processes, tools, and workflows) that were made possible through, and in turn assisted, human collaboration. Both sessions, too, spoke to the importance of understanding iteration as an integral part of workflows (whether appraisal, processing, or providing access) rather than something to be engineered out of a process.
Thanks to scholarship and grant programs (of which we can always have more), a number of paraprofessionals and short-term or project archivists were able to attend and present, which enriched the programming significantly. There was a strong showing from the regional LIS students, both in their poster session on Friday and the general programming. Having just started my position at Iowa State, this was my first MAC; it was also my first time in Detroit, and overall I was favorably impressed. While the conference center itself is a marvel of hostile architecture (which made literal accessibility a real and not-to-be-downplayed challenge), the intellectual content of the presentations and general attitude of the attendees made it a fairly easy space in which to be a newcomer.
A.L. Carson is a processing archivist at Iowa State University, where they are engaged in developing processing, preservation, and access guidelines for digital records as well as increasing the availability of the traditional collections.
Caitlin Birch is the Digital Collections and Oral History Archivist for the Rauner Special Collections Library at Dartmouth College in Hanover, New Hampshire. She sat down with Juli Folk, a graduate student at the University of Maryland-College Park iSchool who is pursuing an archives-focused MLIS and a certificate in Museum Scholarship and Material Culture. Caitlin’s descriptions of her career path, her roles and achievements, and her insights into the challenges she faces helped frame a discussion of helpful skill sets for working with born-digital archival records on a daily basis.
Caitlin’s Career Path
As an undergraduate, Caitlin majored in English, concentrating in journalism with minors in history and Irish studies. After a few years working as a reporter and editor, she began to consider a different career path, looking for other fields that emphasize constant learning, storytelling, and contributions to the historical record. In time, she decided on a dual degree (MA/MSLIS) in history and archives management from Simmons College (now Simmons University). Throughout grad school, her studies focused on both historical methods and original research as well as archival theory and practice.
When asked about the path to her current position, Caitlin responded, “To the extent that my program allowed, I tried to take courses with a digital focus whenever I could. I also completed two internships and worked in several paraprofessional positions, which were really invaluable to preparing me for professional work in the field. I finished my degrees in December 2013 and landed my job at Dartmouth a few months later.” She now works as the Digital Collections and Oral History Archivist for Rauner Special Collections Library, the home of Dartmouth College’s rare books, manuscripts, and archives, compartmentalized within the larger academic research library.
Favorite Aspects of Being an Archivist
For Caitlin, the best aspects of being an archivist are working at the intersection of history and technology; teaching and interacting with people every day; and having new opportunities to create, innovate, and learn. Her position includes roles in both oral history and born-digital records, and on any given day she may be juggling tasks like teaching students oral history methodology, working on the implementation of a digital repository, building Dartmouth’s web archiving program, managing staff, sharing reference desk duty, and staying abreast of the profession via involvement with the SAA and the New England Archivists Executive Board. “I like that no two days are the same,” she shared, adding, “I like that my work can have a positive impact on others.”
Challenges of Being an Archivist
Caitlin pointed out that aspects of the profession change and evolve at a pace that can make it difficult to keep up, especially when job- or project-related tasks demand so much attention. She also noted other challenges: “More and more we’re grappling with issues like the ethical implications of digital archives and the environmental impact of digital preservation.” That said, she finds that “the biggest challenge is also the biggest opportunity: most of what I do hasn’t been done before at Dartmouth. I’m the first digital archivist to be hired at my institution, so everything—infrastructure, policies, workflows, etc.—has been/is being built from the ground up. It’s exciting and often very daunting, especially because this corner of the archives field is dynamic.”
Advice for Students and Young Professionals
As a result, Caitlin emphasized the importance of experimentation and failure. “Traditional archival practice is well-defined and there are standards to guide it, but digital archives present all kinds of unique challenges that didn’t exist until very recently. Out of necessity, you have to innovate and try new things and learn from failure in order to get anywhere.” For this reason, she recommended building a good professional network and finding time to keep up with the professional literature. “It’s really key to cultivate a community of practice with colleagues at other institutions.”
When asked whether she sets aside time specified for these tasks or if she finds that networking and research are natural outputs of her daily work, Caitlin stated that networking comes more easily because of her involvement with professional organizations. However, finding time for professional literature and research proved more difficult, a concern Caitlin brought to her manager. In response, he encouraged her to block 1-2 hours on her calendar at the same time every week to catch up on reading and professional news. She remains grateful for that support: “I would hope that every manager in this profession encourages time for regular professional development. It may seem like it’s taking time away from job responsibilities, but in actuality it’s helping you to build the skills and knowledge you need for future innovation.”
Juli Folk is finishing the MLIS program at the University of Maryland-College Park iSchool, specializing in Archives and Digital Curation. Previously a corporate editor and project manager, Juli’s graduate work supplements her passions for writing, art, and technology with formal archival training, to refocus her career on cultural heritage institutions.
San José is in many ways an apt location for a tech-centered library conference like Code4Lib. It is the largest city in Santa Clara Valley (aka Silicon Valley) and home to San José State University, which has one of the biggest library science programs in the country. Yet the tone of the 14th annual Code4Lib conference, which convened on February 19-22, 2019, was cautious and at times critical of the “big tech” landscape. In her opening keynote, Sarah Roberts, Assistant Professor of Information Studies at UCLA, talked about her research on social media content moderation. She said that while social media companies deem this work critical for managing lewd or disturbing content, it is also emotionally taxing, low-paying, and executed by a mostly invisible global labor force. Because this work is kept hidden, consumers are led to believe that social media content is either unmediated or moderated by some automated process. This push towards transparency and openness (in how we manipulate our code, technologies, content, and even our labor practices) was a recurring theme throughout the conference.
There were a number of archivists and archives-adjacent folks attending the conference and a handful of interesting sessions related to digital archives. In a talk entitled “Natural Language Processing for Discovery of Born-Digital Records,” NCSU Libraries Fellow Emily Higgs discussed her exploration of named entity recognition (NER) to aid in describing digital collections. Using the open source natural language processing software, spaCy, Higgs extracted personal names to a CSV file, with entities ranked by frequency, and included the top five to ten names in the Scope and Content section of the finding aid. She also tested a discovery tool, Open Semantic Desktop Search, to enable researchers to more easily browse through a digital collection using the reading room computer. She noted that while it offered faceted browsing as well as fuzzy and semantic search capabilities, the major drawback was the long indexing time for larger digital collections.
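The name-ranking step Higgs described could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not her actual code: the function names, the `en_core_web_sm` model, and the CSV column headings are all hypothetical, and the sketch relies only on spaCy's standard `spacy.load`, `doc.ents`, and `ent.label_` interfaces.

```python
import csv
import io
from collections import Counter


def extract_person_names(texts, model="en_core_web_sm"):
    """Yield PERSON entity strings that spaCy finds in each text.

    Assumes spaCy and the named model are installed; the model choice is
    illustrative, not necessarily the one used in the talk.
    """
    import spacy  # imported here so the ranking code below works without it

    nlp = spacy.load(model)
    for text in texts:
        for ent in nlp(text).ents:
            if ent.label_ == "PERSON":
                yield ent.text.strip()


def rank_names(names, top_n=10):
    """Return (name, count) pairs, most frequent first, limited to top_n."""
    counts = Counter(name for name in names if name)
    return counts.most_common(top_n)


def names_to_csv(ranked):
    """Serialize ranked (name, count) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "count"])  # hypothetical column headings
    writer.writerows(ranked)
    return buffer.getvalue()
```

Chaining the three, as in `names_to_csv(rank_names(extract_person_names(texts)))`, would produce a short frequency table from which an archivist could pick the top five to ten names for a Scope and Content note, after reviewing them for false positives and variant spellings.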
In the realm of web archiving, Ilya Kreymer of Rhizome presented a demo of Webrecorder, a set of free and open source tools for creating and viewing web archives. Funded by two Mellon Foundation grants, Webrecorder is a browser-based application that focuses on capturing high-fidelity web archives. Unlike more traditional web crawlers, Webrecorder is meant to support a more curated approach to web archiving: think quality over quantity. In his demo, Kreymer quickly and easily archived audio files from a SoundCloud library as well as the most recent Code4Lib conference hashtag posts on Twitter. One of Webrecorder’s most impressive features is its ability to emulate legacy browsers to record things like Flash-based websites. Webrecorder has a lot going for it: it’s free and easy to use, with an attractive and intuitive interface. While Kreymer was quick to point out that they haven’t solved web archiving, it was nonetheless exciting to see a concentrated effort towards refining it.
As a metadata librarian, I am probably a little biased here, but one of the most exciting talks of the conference was given by Dhanushka Samarakoon and Harish Maringanti of the University of Utah’s Marriott Library. Inspired by a story they heard on NPR about PoetiX, a sonnet-writing competition where judges are asked to determine if a sonnet was written by man or machine, Samarakoon and Maringanti began to think about the implications of machine learning for metadata creation. Recognizing that metadata is typically where the bottleneck occurs in digital library workflows, they wanted to explore how machine learning technology might simplify descriptive metadata creation for historical image collections. To do this they created a model using two sources: data from ImageNet, a database of over 14 million images designed for use in visual object recognition software research, and over 470 photographs with high quality human-generated metadata from their own digital library collections. Once this data was introduced into a pre-trained neural network, they ran a collection of photographs through the system to see how well the model worked. It wasn’t perfect: for instance, a photo of a man standing next to a cow was described as “Mary Jane standing by a cow,” apparently due to the many people identified as “Mary Jane” in the original digital library dataset. However, it was exciting to see the possibilities of AI in image analysis and the implications this might have for future metadata automation.
At one point during the conference someone took a quick visual poll of how many first-time attendees were in the audience. There were a lot of us. But there were also a lot of Code4Lib veterans. During a lightning talk about the origin of the conference, Karen Coombs, Ryan Wick, and Roy Tennant recalled wanting to create a conference with a “no spectators” motto, where attendees had ample opportunities to engage, participate, and have their voices heard. Unlike most other library conferences, Code4Lib doesn’t have competing programming. Everyone gathers in one large room and attends the same talks and sessions. It was this model of inclusivity, equality, and innovation that I found most appealing about Code4Lib, and that will no doubt draw me back in coming years.
For more information about the conference, including streaming video and slides, visit the Code4Lib 2019 website.
Nicole Shibata is the Metadata Librarian at California State University, Northridge.
PASIG 2019 met the week of February 11th at El Colegio de México (commonly known as Colmex) in Mexico City. PASIG stands for Preservation and Archiving Special Interest Group, and its meeting brings together an international mix of practitioners, industry experts, vendors, and researchers to discuss practical digital preservation topics and approaches. This meeting was particularly special because it was the first time the group convened in Latin America (past meetings have generally been held in Europe and the United States). Excellent real-time bilingual translation for presentations given in both English and Spanish enabled conversations across geographical and linguistic boundaries and made room to center Latin American preservationists’ perspectives and transformative post-custodial archival practice.
The conference began with broad overviews of digital preservation topics and tools to create a common starting ground, followed by more focused deep-dives on subsequent days. I saw two major themes emerge over the course of the week. The first was the importance of people over technology in digital preservation. From David Minor’s introductory session to Isabel Galina Russell’s overview of the digital preservation landscape in Mexico, presenters continuously surfaced examples of the “people side” of digital preservation (think: preservation policies, appraisal strategies, human labor and decision-making, keeping momentum for programs, communicating to stakeholders, ethical partnerships). One point that struck me during the community archives session was Verónica Reyes-Escudero’s discussion of “cultural competency as a tool for front-end digital preservation.” By conceptualizing interpersonal skills as a technology for facilitating digital preservation, we gain a broader and more ethically grounded idea of what it is we are really trying to do by preserving bits in the first place. Software and hardware are part of the picture, but they are certainly not the whole view.
The second major theme was that digital preservation is best done together. Distributed digital preservation platforms, consortial preservation models, and collaborative research networks were also well-represented by speakers from LOCKSS, Texas Digital Library (TDL), Duraspace, Open Preservation Foundation, Software Preservation Network, and others. The takeaway from these sessions was that the sheer resource-intensiveness of digital preservation means that institutions, both large and small, are going to have to collaborate in order to achieve their goals. PASIG seemed to be a place where attendees could foster and strengthen these collective efforts. Throughout the conference, presenters also highlighted failures of collaborative projects and the need for sustainable financial and governance models, particularly in light of recent developments at the Digital Preservation Network (DPN) and Digital Public Library of America (DPLA). I was particularly impressed by Mary Molinaro’s honest and informative discussion about the factors that led to the shuttering of DPN. Molinaro indicated that DPN would soon be publishing a final report in order to transparently share their model, flaws and all, with the broader community.
Touching on both of these themes, Carlos Martínez Suárez of Video Trópico Sur gave a moving keynote about his collaboration with Natalie M. Baur, Preservation Librarian at Colmex, to digitize and preserve video recordings he made while living with indigenous groups in the Mexican state of Chiapas. The question and answer portion of this session highlighted some of the ethical issues surrounding rights and consent when providing access to intimate documentation of people’s lives. While Colmex is not yet focusing on access to this collection, it was informative to hear Baur and others talk a bit about the ongoing technical, legal, and ethical challenges of a work-in-progress collaboration.
Presenters also provided some awesome practical tools for attendees to take home with them. One of the many great open resources session leaders shared was Frances Harrell (NEDCC) and Alexandra Chassanoff (Educopia)’s DigiPET: A Community Built Guide for Digital Preservation Education + Training Google document, a living resource for compiling educational tools that you can add to using this form. Julian Morley also shared a Preservation Storage Cost Model Google sheet that contains a template with a wealth of information about estimating the cost of different digital preservation storage models, including comparisons for several cloud providers. Amy Rudersdorf (AVP), Ben Fino-Radin (Small Data Industries), and Frances Harrell (NEDCC) also discussed helpful frameworks for conducting self-assessments.
PASIG closed out by spending some time on the challenges involved with preserving emerging and complex formats. On the last afternoon of sessions, Amelia Acker (University of Texas at Austin) spoke about the importance of preserving APIs, terms of service, and other “born-networked” formats when archiving social media. She was followed by a panel of software preservationists who discussed different use cases for preserving binaries, source code, and other software artifacts.
Thanks to the wonderful work of the PASIG 2019 steering, program, and local arrangements committees!
Kelly Bolding is the Project Archivist for Americana Manuscript Collections at Princeton University Library, as well as the team leader for bloggERS! She is interested in developing workflows for processing born-digital and audiovisual materials and making archival description more accurate, ethical, and inclusive.
Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.
An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual Meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.
The Digital Processing Framework’s target audience is folks who process born digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.
Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:
Survey the collection
Create processing plan
Establish physical control over removable media
Create checksums for transfer, preservation, and access copies
Determine level of description
Identify restricted material based on copyright/donor agreement
Gather metadata for description
Add description about electronic material to finding aid
Record technical metadata
Run virus scan
Organize electronic files according to intellectual arrangement
Address presence of duplicate content
Perform file format analysis
Identify deleted/temporary/system files
Manage personally identifiable information (PII) risk
Create DIP for access
Publish finding aid
Publish catalog record
Delete work copies of files
Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from MPLP and extensible processing methods, the framework assigns these associated tasks to one of three processing tiers. These tiers include: Baseline, which we recommend as the minimum level of processing for born digital content; Moderate, which includes tasks that may be done on collections or parts of collections that are considered as having higher value, risk, or access needs; and Intensive, which includes tasks that should only be done to collections that have exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the volume of work that could happen for nuanced user access. When reading the framework, know that if a task is recommended at the Baseline tier, then it should also be done as part of any higher tier’s work.
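One way to picture the cumulative tiers is as an ordered scale, where choosing a tier selects every task at or below it. The sketch below is only an illustration: the tier assignments in `TASKS` are hypothetical placeholders rather than the framework's actual assignments, and the names `Tier`, `TASKS`, and `tasks_for` are invented for this example.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The framework's three processing tiers, ordered by effort."""
    BASELINE = 1
    MODERATE = 2
    INTENSIVE = 3


# Hypothetical tier assignments for illustration only; consult the
# framework itself for the actual tier of each task.
TASKS = {
    "Assign a unique identifier to each piece of digital media": Tier.BASELINE,
    "Create suitable housing for digital media": Tier.MODERATE,
    "Example task reserved for exceptional collections": Tier.INTENSIVE,
}


def tasks_for(tier, tasks=TASKS):
    """Return every task at or below the chosen tier.

    Lower-tier tasks are included because work recommended at Baseline
    should also be done as part of any higher tier.
    """
    return [name for name, assigned in tasks.items() if assigned <= tier]
```

Calling `tasks_for(Tier.MODERATE)` would return the Baseline and Moderate tasks together. An institution that re-defines which tasks fall into which tier would only need to edit the mapping, not the selection logic.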
We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.
Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.