On Friday, July 26, 2019, academics and practitioners met at Wilson Library at UNC Chapel Hill for “ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives.” This meeting featured expert panels and participant-driven discussions about how we can use natural language processing – using software to understand text and its meaning – and machine learning – a branch of artificial intelligence that learns to infer patterns from data – in the archives.
The meeting was hosted by the RATOM Project (Review, Appraisal, and Triage of Mail). The RATOM project is a partnership between the State Archives of North Carolina and the School of Information and Library Science at UNC Chapel Hill. RATOM will extend the email processing capabilities currently present in the TOMES software and BitCurator environment, developing additional modules for identifying and extracting the contents of email-containing formats, NLP tasks, and machine learning approaches. RATOM and the ml4arc meeting are generously supported by the Andrew W. Mellon Foundation.
Presentations at ml4arc were split between successful applications of machine learning and problems that could potentially be addressed by machine learning in the future. In his talk, Mike Shallcross from Indiana University identified archival workflow pain points that provide opportunities for machine learning. In particular, he sees the potential for machine learning to address issues of authenticity and integrity in digital archives, PII and risk mitigation, aggregate description, and how all these processes are (or are not) scalable and sustainable. Many of the presentations addressed these key areas and how natural language processing and machine learning can lend aid to archivists and records managers. Additionally, attendees got to see presentations and demonstrations from tools for email such as RATOM, TOMES, and ePADD. Euan Cochrane also gave a talk about the EaaSI sandbox and discussed potential relationships between software preservation and machine learning.
The meeting agenda had a strong focus on using machine learning in email archives; collecting and processing email is a heavy burden in many archives and stands to benefit greatly from machine learning tools. For example, Joanne Kaczmarek from the University of Illinois presented a project processing capstone email accounts using an e-discovery and predictive coding software called Ringtail. In partnership with the Illinois State Archives, Kaczmarek used Ringtail to identify groups of “archival” and “non-archival” emails from 62 capstone accounts, and to further break down the “archival” category into “restricted” and “public.” After 3-4 weeks of tagging training data with this software, the team was able to reduce the volume of emails by 45% by excluding “non-archival” messages, and to identify 1.8 million emails that met the criteria to be made available to the public. Done manually, this tagging process could easily have taken over 13 years of staff time.
After the ml4arc meeting, I am excited to see the evolution of these projects and how natural language processing and machine learning can help us with our responsibilities as archivists and records managers. From entity extraction to PII identification, there are myriad possibilities for these technologies to help speed up our processes and overcome challenges.
Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.
Nineteen years ago, the digital preservation community gathered in York, UK, for the Cedars Project’s Preservation 2000 conference. It was here that the first seeds were sown for what would become the Digital Preservation Coalition (DPC). Guided by Neil Beagrie, then of King’s College London and Jisc, work to establish the DPC continued over the next 18 months and, in 2002, representatives from 7 organizations signed the articles that formally constituted the DPC.
In the 17 years since its creation, the DPC has gone from strength to strength, the last 10 years under the leadership of current Executive Director, William Kilbride. The past decade has been a particular period of growth, as shown by the rise in staff complement from 2 to 7. We now have more than 90 members who represent an increasingly diverse group of organizations from 12 countries across sectors including cultural heritage, higher education, government, banking, industry, media, research and international bodies.
Our mission at the DPC is to:
[…] enable our members to deliver resilient long-term access to digital content and services, helping them to derive enduring value from digital assets and raising awareness of the strategic, cultural and technological challenges they face.
We work to achieve this through a broad portfolio of work across six strategic areas of activity: Community Engagement, Advocacy, Workforce Development, Capacity Building, Good Practice and Standards, and Management and Governance. Everything we do is member-driven: members guide our activities through the DPC Board, Representative Council, and the Sub-Committees which oversee each strategic area.
Although the DPC is driven primarily by the needs of our members, we do also aim to contribute to the broader digital preservation community. As such, many of the resources we develop are made publicly available. In the remainder of this blog post, I’ll be taking a quick look at each of the DPC’s areas of activity and pointing out resources you might find useful.
1 | Community Engagement
First up is our work in the area of Community Engagement. Here our aim is to enable “a growing number of agencies and individuals in all sectors and in all countries to participate in a dynamic and mutually supportive digital preservation community”. Collaboration is a key to digital preservation success, and we hope to encourage and support it by helping build an inclusive and active community. An important step in achieving this aim was the publication of our ‘Inclusion and Diversity Policy’ in 2018.
Webinars are key to building community engagement amongst our members. We invite speakers to talk to our members about particular topics and share experiences through case studies. These webinars are recorded and made available for members to watch at a later date. We also run a monthly ‘Members Lounge’ to allow informal sharing of current work and discussion of issues as they arise and, on the public end of the website, a popular blog, covering case studies, new innovations, thought pieces, recaps of events and more.
2 | Advocacy
Our advocacy work campaigns “for a political and institutional climate more responsive and better informed about the digital preservation challenge”, as well as “raising awareness about the new opportunities that resilient digital assets create”. This tends to happen on several levels, from enabling and aiding members’ advocacy efforts within their own organizations, through raising legislators’ and policy makers’ awareness of digital preservation, to educating the wider populace.
To help those advocating for digital preservation within their own context, we have recently published our Executive Guide. The Guide provides a grab bag of statements and facts to help make the case for digital preservation, including key messages, motivators, opportunities to be gained and risks faced. We welcome any suggestions for additions or changes to this resource!
Our longest running advocacy activity is the biennial Digital Preservation Awards, last held in 2018. The Awards aim to celebrate excellence and innovation in digital preservation across a range of categories. This high-profile event has been joined in recent years by two other activities with a broad remit and engagement. The first is the Bit List of Digitally Endangered Species, which highlights at-risk digital information, showing both where preservation work is needed and where efforts have been successful. The second is World Digital Preservation Day (WDPD), a day to showcase digital preservation around the globe. Response to WDPD since its inauguration in 2017 has been exceptionally positive. There have been tweets, blogs, events, webinars, and even a song and dance! This year WDPD is scheduled for 7th November, and we encourage everyone to get involved.
3 | Workforce Development
Workforce Development activities at the DPC focus on “providing opportunities for our members to acquire, develop and retain competent and responsive workforces that are ready to address the challenges of digital preservation”. There are many threads to this work, but key for our members are the scholarships we provide through our Career Development Fund and free access to the training courses we run.
At the moment we offer three training courses: ‘Getting Started with Digital Preservation’, ‘Making Progress with Digital Preservation’ and ‘Advocacy for Digital Preservation’, but we have plans to expand the portfolio in the coming year. All of our training courses are available to non-members for a modest fee, but at the moment are mostly held face to face in the UK and Ireland. A move to online training provision is, however, planned for 2020. We are also happy to share training resources and have set up a Slack workspace to enable this and greater collaboration with regards to digital preservation training.
Other resources that may prove helpful under our Workforce Development heading include the ‘Digital Preservation Handbook’, a free online publication covering digital preservation in the broadest sense. The Handbook aims to be a comprehensive guide for those starting with digital preservation, whilst also offering links to additional resources. The content for the Handbook was crowd-sourced from experts and has all been peer reviewed. Another useful and slightly less well-known series of publications are our ‘Topical Notes’, originally funded by the National Archives of Ireland, and intended to create resources that introduce key digital preservation issues to a non-specialist audience (particularly record creators). Each note is only two pages long and jargon-free, so a great resource to help raise awareness.
4 | Capacity Building
Perhaps the biggest area of DPC work covers Capacity Building, that is “supporting and assuring our members in the delivery and maintenance of high quality and sustainable digital preservation services through knowledge exchange, technology watch, research and development.” This can take the form of direct member support, helping with tasks such as policy development and procurement, as well as participation in research projects.
We also run around six thematic Briefing Day events a year on topical issues. As with the training, these are largely held in the UK and Ireland, but they are now also live-streamed for members. We support a number of Thematic Task Forces and Working Groups, with the ‘Web Archiving and Preservation Working Group’ being particularly active at the moment.
5 | Good Practice and Standards
Our Good Practice and Standards stream of work was a new addition as of the publication of our latest Strategic Plan (2018-22). Here we are contributing work towards “identifying and developing good practice and standards that make digital preservation achievable, supporting efforts to ensure services are tightly matched to shifting requirements.”
We hope this work will allow us to input into standards with the needs of our members in mind and facilitate the sharing of good practice that already happens across the coalition. This has already borne fruit in the shape of the forthcoming DPC Rapid Assessment Model, a maturity model to help with benchmarking digital preservation progress within your organization. You can read a bit more about it in this blog post by Jen Mitcham and the model will be released publicly in late September.
We also work with vendors through our Supporter Program and events like our ‘Digital Futures’ series to help bridge the gap between practice and solutions.
6 | Management and Governance
Our final stream of work is less focused on digital preservation and instead on “ensuring the DPC is a sustainable, competent organization focussed on member needs, providing a robust and trusted platform for collaboration within and beyond the Coalition.” This obviously relates to both the viability of the organization and good governance. It is essential that everything we do is transparent and that the members can both direct what we do and ensure accountability.
Before I depart, I thought I would share a little bit about some of our plans for the future. In the next few years we’ll be taking steps to further internationalize as an organization. At the moment our membership is roughly 75% UK and Ireland and 25% international, but those numbers are gradually moving closer and we hope that continues. With that in mind we will be investigating new ways to deliver services and resources online, as well as in languages beyond English. We’re starting this year with the publication of our prospectus in German, French and Spanish.
We’re also beginning to look forward to our 20th anniversary in 2022. It’s a Digital Preservation Awards Year, so that’s reason enough for a celebration, but we will also be welcoming the digital preservation community to Glasgow, Scotland, as hosts of iPRES 2022. Plans are already afoot for the conference, and we’re excited to make it a showcase for both the community and one of our home cities. Hopefully we’ll see you there, but I encourage you to make use of our resources and to get in touch soon!
Sharon McMeekin is Head of Workforce Development with the Digital Preservation Coalition and leads on work including training workshops and their scholarship program. She is also Managing Editor of the ‘Digital Preservation Handbook’. With Masters degrees in Information Technology and Information Management and Preservation, both from the University of Glasgow, Sharon is an archivist by training, specializing in digital preservation. She is also an ILM qualified trainer. Before joining the DPC she spent five years as Digital Archivist with RCAHMS. As an invited speaker, Sharon presents on digital preservation at a wide variety of training events, conferences and university courses.
Back in March, during bloggERS’ Making Tech Skills a Strategic Priority series, we distributed an open survey to MLIS, MLS, MI, and MSIS students to understand what they know and have experienced in relation to technology skills as they enter the field.
To be frank, this survey stemmed from personal interests since I just completed an MLIS core course on Research, Assessment, and Design (re: survey to collect data on current landscape). I am also interested in what skills I need to build/what class I should sign up for next quarter (re: what tech skills do I need to become hire-able?). While I feel comfortable with a variety of tech-related tools and tasks, I’ve been intimidated by more “high-level” computational languages for some years. This survey was helpful for exploring what skills other LIS pre-professionals are interested in and which skills will help us make these costly degrees worth the time and financial investment that is traditionally required to enter a stable archive or library position.
The survey was open for one month on Google Forms, and distributed to SAA communities, @SAA_ERS Twitter, the Digital Curation Google Group, and a few MLIS university program listservs. There were 15 questions and we received responses from 51 participants.
Results & Analysis
Here’s a superficial scan of the results. If you would like to come up with your own analyses, feel free to view the raw data on GitHub.
The most popular technology-related skill that students are interested in learning is data management (manipulating, querying, transforming data, etc.). This is a pretty broad topic, as it involves many tools and protocols that can range from GUIs to scripts. A separate survey that breaks down specific data management tools might be in order, especially since these skills can be divided across specialty courses and workshops, which then translate into specific job positions. A more specific survey could help demonstrate what types of skills need to be taught in a full semester-long course, and what skills can be covered in a day-long or multi-day workshop.
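To give a flavor of what “manipulating, querying, transforming data” can mean in practice, here is a minimal, hypothetical Python sketch: parsing a small CSV of item-level metadata, filtering it, and counting items per year. The field names and identifiers are invented for the example; a real collection export would of course be larger and messier.

```python
import csv
import io
from collections import Counter

# Invented sample: a small CSV of item-level metadata, standing in for
# an export from a collection management system.
raw = """identifier,format,year
MSS-001,email,1998
MSS-002,tiff,2001
MSS-003,email,1999
MSS-004,wav,2001
"""

# Manipulate: parse the CSV into dictionaries.
rows = list(csv.DictReader(io.StringIO(raw)))

# Query: filter to a single format.
emails = [r for r in rows if r["format"] == "email"]

# Transform: aggregate items per year.
per_year = Counter(r["year"] for r in rows)

print(len(emails))       # number of email items -> 2
print(per_year["2001"])  # items dated 2001 -> 2
```

Everything here is in the Python standard library, which is part of why tasks like this sit on the “script” end of the tooling spectrum rather than requiring a dedicated application.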
It was interesting to see that even in this day and age, when social media management can be second nature to many students’ daily lives, there was still a notable interest in understanding how to make this a part of their career. This makes me wonder what value students see in knowing how to strategically manage an archives’ social media account. How could this help with the job market, as well as an archival organization’s main mission?
Looking deeper into the popular data management category, it would be interesting to know the current landscape of knowledge or pedagogy in communicating with IT (e.g. project management and translating users’ needs). In many cases, archivists are working separately from but dependently on IT system administrators, and it can be frustrating since either department may have distinct concerns about a server or other networks. In June’s NYC Preservathon/Preservashare 2019, there was mention that IT exists to make sure servers and networks are spinning at all hours of the day. Unlike archivists, they are not concerned about the longevity of the content, obsolescence of file formats, or the software to render files. Could it be useful to have a course on how to effectively communicate and take control of issues that can be fuzzy lines between archives, data management, and IT? Or as one survey respondent said, “I think more basic programming courses focusing on tech languages commonly used in archives/libraries would be very helpful.” Personally, I’ve only learned this from experience working in different tech-related jobs. This is not a subject I see on my MLIS course catalog, nor a discussion at conference workshops.
The popularity of data management skills sparked another question: what about knowledge around computer networks and servers? Even though LTO will forever be in our hearts, cloud storage is also a backup medium we’re budgeting for and relying on. Same goes for hosting a database for remote access and/or publishing digital files. A friend mentioned this networking workshop for non-tech savvy learners—Grassroots Networking: Network Administration for Small Organizations/Home Organizations—which could be helpful for multiple skill types including data management, digital forensics, web archiving, web development, etc. This is similar to a course that could be found in computer science or MLIS-adjacent information management departments.
I can’t say this is statistically significant, but the relationship between the 15.7% who have not/will not take a technology-focused course in their program and the 78.4% of respondents who are not aware of the difference between scripting and programming is eyebrow-raising. According to an article in PLOS Computational Biology, the term “script” means “something that is executed directly as is,” while a “program” is “something that is explicitly compiled before being used. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is ‘things that are edited directly’ and ‘things that are not edited directly’” (Wilson et al. 2017). This distinction is important since more archives are acquiring, processing, and sharing collections that rely on the archivist to execute jobs such as web-scraping or metadata management (scripts), or on archivists who can build and maintain a database (programming). These might be interpreted as trick questions, but the particular semantics, and what counts as technology-focused, are something modern library, archives, and information programs might want to consider.
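To make the “script” side of that distinction concrete, here is a hypothetical example of the kind of metadata-management job the post mentions: a short piece of code that is edited directly and run as-is, with no separate build or compile step. The sample records and field names are invented for illustration.

```python
# A minimal edit-and-run metadata cleanup script: a "thing that is
# edited directly." To change its behavior, you edit the code and run
# it again; there is no separate compilation step.

def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and strip trailing punctuation."""
    cleaned = " ".join(title.split())
    return cleaned.rstrip(" .,;")

# Invented sample records, standing in for rows pulled from a finding aid.
records = [
    {"id": "A1", "title": "  Letters,   1914-1918 . "},
    {"id": "A2", "title": "Minutes of the Board;"},
]

for record in records:
    record["title"] = normalize_title(record["title"])

print(records[0]["title"])  # prints "Letters, 1914-1918"
print(records[1]["title"])  # prints "Minutes of the Board"
```

A database application built and maintained over time, by contrast, would sit toward the “program” end of the same spectrum: more structure, more tooling, and code that users generally do not edit directly.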
Figure 4 illustrates the various ways students tackle new technologies. Reading the f* manual (RTFM) and Searching forums are the most common approaches to navigating technology. Here are quotes from a couple students on how they tend to learn a new piece of software:
“break whatever I’m trying to do with a new technology into steps and look for tutorials & examples related to each of those steps (i.e. Is this step even possible with X, how to do it, how else to use it, alternatives for accomplishing that step that don’t involve X)”
“I tend to google “how to….” for specific tasks and learn new technology on a task-by-task basis.”
In the end, there was overwhelming interest in “more project-based courses that allow skills from other tech classes to be applied.” Unsurprisingly, many of us are looking for full-time, stable jobs after graduating, and the “more practical stuff, like CONTENTdm for archives” seems to be a pressure felt in order to get an entry-level position. And not just at entry level: as continuing-education learners, there is also a push to strive for more. Several respondents are looking for a challenge to level up their tech skills:
“I want more classes with hands-on experience with technical skills. A lot of my classes have been theory based or else they present technology to us in a way that is not easy to process (i.e. a lecture without much hands-on work).”
“Higher-level programming, etc. — everything on offer at my school is entry level. Also digital forensics — using tools such as BitCurator.”
“Advanced courses for the introductory courses. XML 2 and python 2 to continue to develop the skills.”
“A skills building survey of various code/scripting, that offers structured learning (my professor doesn’t give a ton of feedback and most learning is independent, and the main focus is an independent project one comes up with), but that isn’t online. It’s really hard to learn something without face to face interaction, I don’t know why.”
It’ll be interesting to see what skills recent MLIS, MLS, MIS, and MSIM graduates will enter the field with. While many job postings list certain software and skills as requirements, will programs follow suit? I have a feeling this might be a significant question to ask in the larger context of what is the purpose of this Master’s degree and how can the curriculum keep up with the dynamic technology needs of the field.
Potential bias: Those taking the survey might be interested in learning higher-level tech skills because they do not already know the skills, while those who are already tech-savvy might avoid a basic survey such as this one since they already know the skills. This may bias the survey population toward mostly novice tech students.
More data on specific computational languages and technology courses taken are available in the GitHub csv file. As mentioned earlier, I just finished my first year as a part-time MLIS student, so I’m still learning the distinct jobs and nature of the LIS field. Feel free to submit an issue to the GitHub repo, or tweet me @snewyuen if you’d like to talk more about what this data could mean.
Sarah Nguyen is an advocate for open, accessible, and secure technologies. While studying as an MLIS candidate with the University of Washington iSchool, she is expressing interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, and archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance. (Views do not represent Uovo. And I don’t even work with them. Just liked the truck.)
BloggERS! editor, Dorothy Waugh recently interviewed Trevor Owens, Head of Digital Content Management at the Library of Congress about his recent–and award-winning–book, The Theory and Craft of Digital Preservation.
Who is this book for and how do you imagine it being used?
I attempted to write a book that would be engaging and accessible to anyone who cares about long-term access to digital content and wants to devote time and energy to helping ensure that important digital content is not lost to the ages. In that context, I imagine the primary audience as current and emerging professionals that work to ensure enduring access to cultural heritage: archivists, librarians, curators, conservators, folklorists, oral historians, etc. With that noted, I think the book can also be of use to broader conversations in information science, computer science and engineering, and the digital humanities.
Tell us about the title of the book and, in particular, your decision to use the word “craft” to describe digital preservation.
The words “theory” and “craft” in the title of the book forecast both the structure and the two central arguments that I advance in the book.
The first chapters focus on theory. This includes tracing the historical lineages of preservation in libraries, archives, museums, folklore, and historic preservation. I then move to explore work in new media studies and platform studies to round out a nuanced understanding of the nature of digital media. I start there because I think it’s essential that cultural heritage practitioners moor their own frameworks and approaches to digital preservation in a nuanced understanding of the varied and historically contingent nature of preservation as a concept and the complexities of digital media and digital information.
The latter half of the book is focused on what I describe as the “craft” of digital preservation. My use of the term craft is designed to intentionally challenge the notion that work in digital preservation should be understood as “a science.” Given the complexities of both what counts as preservation in a given context and the varied nature of digital media, I believe it is essential that we explicitly distance ourselves from many of the assumptions and baggage that come along with the ideology of “digital.”
We can’t build some super system that just solves digital preservation. Digital preservation requires making judgement calls. Digital preservation requires the applied thinking and work of professionals. Digital preservation is not simply a technical question, instead digital preservation involves understanding the nature of the content that matters most to an intended community and making judgement calls about how best to mitigate risks of potential loss of access to that content. As a result of my focus on craft, I offer less of a “this is exactly what one should do” approach, and more of an invitation to join the community of practice that is developing knowledge and honing and refining their craft.
Reading the book, I was so happy to see you make connections between the work that we do as archivists and digital preservation. Can you speak to that relationship and why you think it is important?
Archivists are key players in making preservation happen and the emergence of digital content across all kinds of materials and media that archivists work with means that digital preservation is now a core part of the work that archivists do.
I organize a lot of my discussion about the craft of digital preservation around archival concepts as opposed to library science or curatorial practices. For example, I talk about arrangement and description. I also draw from ideas like MPLP as key concepts for work in digital preservation and from work on community archives.
Broadly speaking, in the development of digital media, I see a growing context collapse between formats that had been distinct in the past. That is, conservation of oil paintings, management and preservation of bound volumes, and organizing and managing heterogeneous sets of records have some strong similarities, but there are also a lot of differences. The born-digital incarnations of those works (digital art, digital publishing, and digital records) are all made up of digital information and file formats, and face a related set of digital preservation issues.
With that note, I think archival practice tends to be particularly well-suited for dealing with the nature of digital content. Archives have long dealt with the problem of scale that is now intensified by digital data. At the same time, archivists have also long dealt with hybrid collections and complex jumbles of formats, forms, and organizational structures, which is also increasingly the case for all types of forms that transition into born-digital content.
You emphasize that the technical component of digital preservation is sometimes prioritized over social, ethical, and organizational components. What are the risks implicit in overlooking these other important components?
Digital preservation is not primarily a technical problem. The ideology of “digital” is that things should be faster, cheaper, and automatic. The ideology of “digital” suggests that we should need less labor, less expertise, and less resources to make digital stuff happen. If we let this line of thinking infect our idea of digital preservation we are going to see major losses of important data, we will see major failures to respect ethical and privacy issues relating to digital content, and lots of money will be spent on work that fails to get us the results that we want.
In contrast, when we take as a starting point that digital preservation is about investing resources in building strong organizations and teams who participate in the community of practice and work on the complex interactions that emerge between competing library and archives values then we have a chance of both being effective but also building great and meaningful jobs for professionals.
If digital preservation work is happening in organizations that have an overly technical view of the problem, it is happening despite, not because of, their organization’s approach. That is, there are people doing the work, they just likely aren’t getting credit and recognition for doing that work. Digital preservation happens because of people who understand that the fundamental nature of the work requires continual efforts to get enough resources to meaningfully mitigate risks of loss, and thoughtful decision making about building and curating collections of value to communities.
Considerations related to access and discovery form a central part of the book and you encourage readers to “Start simple and prioritize access,” an approach that reminded me of many similar initiatives focused on getting institutions started with the management and preservation of born-digital archives. Can you speak to this approach and tell us how you see the relationship between preservation and access?
A while back, OCLC ran an initiative called “walk before you run,” focused on working with digital archives and digital content. I know it was a major turning point for helping the field build our practices. Our entire community is learning how to do this work and we do it together. We need to try things and see which things work best and which don’t.
It’s really important to prioritize access in this work. Preservation is fundamentally about access in the future. The best way you know that something will be accessible in the future is if you’re making it accessible now. Then your users will help you. They can tell you if something isn’t working. The more that we can work end-to-end, that is, that we accession, process, arrange, describe, and make available digital content to our users, the more that we are able to focus on how we can continually improve that process end-to-end. Without having a full end-to-end process in place, it’s impossible to zoom out and look at that whole sequence of processes to start figuring out where the bottlenecks are and where you need to focus on working to optimize things.
Dr. Trevor Owens is a librarian, researcher, policy maker, and educator working on digital infrastructure for libraries. Owens serves as the first Head of Digital Content Management for Library Services at the Library of Congress. He previously worked as a senior program administrator at the United States Institute of Museum and Library Services (IMLS) and, prior to that, as a Digital Archivist for the National Digital Information Infrastructure and Preservation Program and as a history of science curator at the Library of Congress. Owens is the author of three books, including The Theory and Craft of Digital Preservation and Designing Online Communities: How Designers, Developers, Community Managers, and Software Structure Discourse and Knowledge Production on the Web. His research and writing have been featured in: Curator: The Museum Journal, Digital Humanities Quarterly, The Journal of Digital Humanities, D-Lib, Simulation & Gaming, Science Communication, New Directions in Folklore, and American Libraries. In 2014 the Society for American Archivists granted him the Archival Innovator Award, presented annually to recognize the archivist, repository, or organization that best exemplifies the “ability to think outside the professional norm.”
Archives put a great deal of effort into preserving the original object. We document the context around its creation, perform conservation work on the object if necessary, and implement reading room procedures designed to limit damage or loss. As a result, researchers can read and hold an original letter written by Alice Walker, view a series of tintypes taken before the Civil War, read marginalia written by Ted Hughes in a book from his personal library, or listen to an audio recording of one of Martin Luther King, Jr.’s speeches. In other words, we enable researchers to encounter materials as they were originally designed to be used as much as is possible.
The nature of digital humanities research challenges these traditional modes of archival access: are these the only ways to interact with archival material? How do we serve users who want to leverage computational techniques such as text mining, machine learning, network analysis, or computer vision in their research or teaching? Are machines and algorithms “users”? Archivists also encounter these questions as the content of archives shifts from analog to born digital material. Digital files were created and designed to be processed by algorithms, not just encountered through experiences such as watching, viewing, or reading. What could access for these types of materials look like if we gave access to their full functionality and not just their appearance?
I have spent the past two years working on an IMLS grant focused on addressing these types of questions. Collections As Data: Always Already Computational examined how digital collections are and could be used beyond analog research methodologies. Collections as data is ordered information, stored digitally, that is inherently amenable to computation. This could include metadata, digital text, or other digital surrogates. Whereas a digital repository might enable researchers to read a newspaper on a computer screen, an approach grounded in collections as data would give researchers access to the OCR file the repository generated to enable keyword search. In other words, digital repositories should provide access beyond the viewers, page turners, and streaming servers of most current digital repositories that replicate analog experiences. At its core, collections as data simply asks cultural heritage organizations to make the full digital object available rather than making assumptions about how users will want to interact with it.
Collections as data implementations are not necessarily complex, nor do they involve complicated repository development. Some of the simplest examples can be found on GitHub, where archives such as Vanderbilt and New York University publish their EAD files. The Rockefeller Archive Center and the Museum of Modern Art go a step further and publish all of their collection data, along with a Creative Commons license. Emory, my home institution, makes finding aid data available in both EAD and RDF from our finding aids database, which has led to a digital humanities project that harvested correspondence indexes from our Irish poetry collections to build network graphs of the Belfast Group. More complex implementations often provide access to data through APIs instead of a bulk download. An example of this can be found at Carnegie Hall Archives, which allows researchers to query their data through a SPARQL endpoint.
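Publishing EAD files is such a low barrier precisely because researchers can immediately compute over them with standard tools. As a loose sketch of what that looks like on the researcher's side, the short Python script below pulls a collection title and its series titles out of a tiny, invented EAD 2002 fragment using only the standard library (real finding aids are, of course, far richer than this):

```python
import xml.etree.ElementTree as ET

# The EAD 2002 XML namespace.
EAD_NS = "urn:isbn:1-931666-22-9"

# A tiny, illustrative finding aid fragment (hypothetical, not a real collection).
ead_xml = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection">
    <did><unittitle>Example Papers</unittitle></did>
    <dsc>
      <c01 level="series"><did><unittitle>Correspondence</unittitle></did></c01>
      <c01 level="series"><did><unittitle>Writings</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""

def series_titles(xml_text):
    """Return the collection title and its series-level titles from EAD XML."""
    root = ET.fromstring(xml_text)
    ns = {"e": EAD_NS}
    collection = root.find(".//e:archdesc/e:did/e:unittitle", ns).text
    series = [c.text for c in root.findall(".//e:c01/e:did/e:unittitle", ns)]
    return collection, series

collection, series = series_titles(ead_xml)
print(collection)  # Example Papers
print(series)      # ['Correspondence', 'Writings']
```

From titles like these a researcher could build the kind of network graph or correspondence index described above, without the archive doing anything beyond posting the files.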
The Collections As Data: Always Already Computational final report includes more information and ideas for getting started with collections as data. It includes resources such as a set of user personas, methods profiles of common techniques used by data-driven researchers, and real-world case studies of institutions with collections as data including their motivations, technical details, and how they made the case to their administrators. I highly recommend “50 Things,” which is a list of activities associated with collections as data work ranging from the simple to the complex.
There are a few takeaways from this project I’d like to highlight for archivists in particular:
Collections as data approaches are archival. Data-driven research demands authenticity and context of the data source, established and preserved through archival principles of documentation, transparency, and provenance. This type of information was one of the most universal requests from digital humanities researchers. It was clear that they were not only interested in the object, but in how it came to be. They wanted to understand their data as an archival object with information about its creation, provenance, and preservation. Archivists need to advocate for digital collections to be treated not just as digital surrogates, or what I like to think of as expensive photocopying, but as unique resources unto themselves deserving description, preservation, and access that may not necessarily match that of the original object.
Collections as data enhances access to archival material. What if we could partially open restricted material to researchers? Emory holds the papers of Salman Rushdie and his email files are largely restricted per the deed of gift. Computational techniques being developed in ePADD could generate maps of Rushdie’s correspondents and reveal patterns in the timing and frequency of his correspondence, just through email header information and without exposing sensitive data (i.e. the content of the email) that Rushdie wanted to restrict. Could this methodology be extrapolated to other types of restricted electronic files?
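The key insight in the Rushdie example is that email headers alone carry rich correspondence data. As a minimal sketch of the general idea (this is not ePADD's actual implementation, and the messages here are invented), Python's standard `email` module can tally correspondents while never inspecting a message body:

```python
from email import message_from_string
from collections import Counter

# Three illustrative raw messages; the bodies stand in for restricted content.
raw_messages = [
    "From: alice@example.org\nDate: Mon, 1 Feb 1999 10:00:00 +0000\n\nRESTRICTED BODY",
    "From: bob@example.org\nDate: Tue, 2 Feb 1999 09:30:00 +0000\n\nRESTRICTED BODY",
    "From: alice@example.org\nDate: Wed, 3 Feb 1999 16:45:00 +0000\n\nRESTRICTED BODY",
]

def correspondent_counts(messages):
    """Count senders using header data alone, never touching message bodies."""
    senders = Counter()
    for raw in messages:
        msg = message_from_string(raw)
        senders[msg["From"]] += 1  # only the header field is read
    return senders

print(correspondent_counts(raw_messages))
```

Swap `From` for `Date` and the same pattern yields the timing and frequency maps mentioned above, all without exposing a single restricted sentence.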
Just start. For digital files, trying something is always the first and best approach. There is no one way or best way to do collections as data work. Consider your community and ask them what they need. Unlike baseball fields, if you build it, they probably won’t come unless you ask first. Collections as data material already exists in your collection, especially if you use ArchivesSpace. Publish it. Think broadly about what might constitute collections as data and how you might make use of it yourself; collections as data benefits us too. Follow the Computational Archival Science project at the University of Maryland, which is exploring how we think about archival collections as data.
If you want to take a deep dive into collections as data (and get funding to do so!) consider applying to be part of the second cohort of the Part to Whole Mellon grant, which aims to foster the development of broadly viable models that support implementation and use of collections as data. The next call for proposals opens August 1: https://collectionsasdata.github.io/part2whole/. On August 5, the project team will offer a webinar with more information about the grant and opportunities to ask questions: https://collectionsasdata.github.io/part2whole/cfp2webinar/.
Elizabeth Russey Roke is a digital archivist and metadata specialist at the Stuart A. Rose Library of Emory University, Atlanta, Georgia. Primarily focused on preservation, discovery, and access to digitized and born digital assets from special collections, Elizabeth works on a variety of technology projects and initiatives related to digital repositories, metadata standards, and archival descriptive practice. She was a co-investigator on a 2016-2018 IMLS grant investigating collections as data.
Over the past couple of months, we’ve heard a lot on bloggERS about how current students, recent grads, and mid-career professionals have made tech skills a strategic priority in their development plans. I like to think about the problem of “gaining tech skills” as being similar to “saving the environment”: individual action is needed and necessary, but it is most effective when it feeds clearly into systemic action.
That raises the question: what root changes might educators of all types suggest and support to help GLAM professionals prioritize tech skills development? What are educator communities and systems – iSchools, faculty, and continuing education instructors – doing to achieve this? These questions are among those addressed by the BitCuratorEdu research project.
The BitCuratorEdu project is a three-year effort funded by the Institute of Museum and Library Services (IMLS) to study and advance the adoption of born-digital archiving and digital forensics tools and methods in libraries and archives through a range of professional education efforts. The project is a partnership between the School of Information and Library Science at the University of North Carolina at Chapel Hill and the Educopia Institute, along with the Council of State Archivists (CoSA) and nine universities that are educating future information professionals.
We’re addressing two main research questions:
What are the primary institutional and technological factors that influence adoption of digital forensics tools and methods in different educational settings?
What are the most viable mechanisms for sustaining collaboration among LIS programs on the adoption of digital forensics tools and methods?
The project started in September 2018 and will conclude in Fall 2021; Educopia and UNC SILS will conduct ongoing research and release open educational resources on a rolling basis. With the help of our Advisory Board, made up of nine iSchools, and our Professional Experts Panel, composed of leaders in the GLAM sector, we’re:
Piloting instruction to produce and disseminate a publicly accessible set of learning objects that can be used by education providers to administer hands-on digital forensics education
Gathering information and centralizing existing educational content to produce guides and other resources, such as this (still-in-development) guide to datasets that can be used to learn new digital forensics skills or test digital archives software/processes
Investigating and reporting on institutional factors that facilitate, hinder, and shape adoption of digital forensics educational offerings
Through this work and intentional community cultivation, we hope to advance a community of practice around digital forensics education through partner collaboration, wider engagement, and exploration of community sustainability mechanisms.
To support our research and steer the direction of the project, we have conducted and analyzed nine advisory board interviews with current faculty who have taught or are developing a curriculum for digital forensics education. So far we’ve learned that:
instructors want and need access to example datasets to use in the classroom (especially cultural heritage datasets);
many want lesson plans and activities for teaching born-digital archiving tools and environments like BitCurator in one or two weeks because few courses are devoted solely to digital forensics;
they want further guidance on how to facilitate hands-on digital forensics instruction in distributed online learning environments; and
they face challenges related to IT support at their home institutions, just like those grappled with by practitioners in the field.
This list barely scratches the surface of our exploration into the experiences and needs of instructors for providing more effective digital forensics education, and we’re excited to tackle the tough job of creating resources and instructional modules that address these and many other topics. We’re also interested in exploring how the resources we produce may also support continuing education needs across libraries, archives, and museums.
We recently conducted a Twitter chat with SAA’s SNAP Section to learn about students’ experiences in digital forensics learning environments. We heard a range of experiences, from students who reported they had no opportunity to learn about digital forensics in some programs, to students who received effective instruction that remained useful post-graduation. We hope that the learning modules released at the conclusion of our project will address students’ learning needs just as much as their instructors’ teaching needs.
Later this year, we’ll be conducting an educational provider survey that will gather information on barriers to adoption of digital forensics instruction in continuing education. We hope to present to and conduct workshops for a broader set of audiences including museum and public records professionals.
Our deliverables, from conference presentations to learning modules, will be released openly and freely through a variety of outlets including the project website, the BitCurator Consortium wiki, and YouTube (for recorded webinars). Follow along at the project website or contact firstname.lastname@example.org if you have feedback or want to share your insights with the project team.
Jess Farrell is the project manager for BitCuratorEdu and community coordinator for the Software Preservation Network at Educopia Institute. Katherine Skinner is the Executive Director of Educopia Institute, and Christopher (Cal) Lee is Associate Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill, teaching courses on archival administration, records management, and digital curation. Katherine and Cal are Co-PIs on the BitCuratorEdu project, funded by the Institute of Museum and Library Services.
Mary Kidd (MLIS ’14) and Dana Gerber-Margie (MLS ’13) first met at a Radio Preservation Task Force meeting in 2016. They bonded over experiences of conference fatigue, but quickly moved onto topics near and dear to both of their hearts: podcasts and audio archiving. Dana Gerber-Margie has been a long-time podcast super-listener. She is subscribed to over 1400 podcasts, and she regularly listens to 40-50 of them. She launched a podcast recommendation newsletter when she was getting her MLS, called “The Audio Signal,” which has grown into a popular podcast publication called The Bello Collective. Mary was a National Digital Stewardship Resident at WNYC, where she was creating a born-digital preservation strategy for their archives. She had worked on analog archives projects in the past — scanning and transferring collections of tapes — but she’s embraced the madness and importance of preserving born-digital audio. Mary and Dana stayed in touch and continued to brainstorm ideas, which blossomed into a workshop about podcast preservation that they taught at the Personal Digital Archives conference at Stanford in 2017, along with Anne Wootton (co-founder of Popup Archive, now at Apple Podcasts).
Then Mary and I connected at the National Digital Stewardship Residency symposium in Washington, DC in 2017. I got my MLS back in 2013, but since then I’ve been working more at the intersection of media, storytelling, and archives. I had started a podcast and was really interested, for selfish reasons, in learning the most up-to-date best practices for born-digital audio preservation. I marched straight up to Mary and said something like, “hey, let’s work together on an audio preservation project.” Mary set up a three-way Skype call with Dana on the line, and pretty soon we were talking about podcasts. How we love them. How they are at risk because most podcasters host their files on commercial third-party platforms. And how we would love to do a massive outreach and education program where we teach podcasters that their digital files are at risk and give them techniques for preserving them. We wrote these ideas into a grant proposal, with a few numbers and a budget attached, and the Andrew W. Mellon Foundation gave us $142,000 to make it happen. We started working on this grant project, called “Preserve This Podcast,” back in February 2018. We’ve been able to hire people who are just as excited about the idea to help us make it happen. Like Sarah Nguyen, a current MLIS student at the University of Washington and our amazing Project Coordinator.
One moral of this story is that digital archives conferences really can bring people together and inspire them to advance the field. The other moral of the story is that, after months of consulting audio preservation experts and interviewing podcasters and getting 556 podcasters to take a survey and reading about the history of podcasting, we can confirm that podcasts are disappearing and podcast producers are not adequately equipped to preserve their work against the forces that threaten the long-term endurance of digital information. There is more information on our website about the project (preservethispodcast.org) and in the report about the survey findings. Please reach out to email@example.com or firstname.lastname@example.org if you have any thoughts or ideas.
Molly Schwartz is the Studio Manager at the Metropolitan New York Library Council (METRO). She is the host and producer of two podcasts about libraries and archives — Library Bytegeist and Preserve This Podcast. Molly did a Fulbright grant at the Aalto University Media Lab in Helsinki, was part of the inaugural cohort of National Digital Stewardship Residents in Washington, D.C., and worked at the U.S. State Department as a data analyst. She holds an MLS with a specialization in Archives, Records and Information Management from the University of Maryland at College Park and a BA/MA in History from the Johns Hopkins University.
by Richard Marciano, Victoria Lemieux, and Mark Hedges
The 3rd workshop on Computational Archival Science (CAS) was held on December 12, 2018, in Seattle, following two earlier CAS workshops in 2016 in Washington DC and in 2017 in Boston. It also built on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland. The current working definition of CAS is:
A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.
The workshop featured five sessions and thirteen papers with international presenters and authors from the US, Canada, Germany, the Netherlands, the UK, Bulgaria, South Africa, and Portugal. All details (photos, abstracts, slides, and papers) are available at: http://dcicblog.umd.edu/cas/ieee-big-data-2018-3rd-cas-workshop/. The keynote focused on using digital archives to preserve the history of WWII Japanese-American incarceration and featured Geoff Froh, Deputy Director at Densho.org in Seattle.
This workshop explored the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. The aim was to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we addressed the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
Computational Thinking and Computational Archival Science
#1: Introducing Computational Thinking into Archival Science Education [William Underwood et al.]
#2: Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camp Records [Richard Marciano et al.]
#3: Computational Archival Practice: Towards a Theory for Archival Engineering [Kenneth Thibodeau]
#4: Stirring The Cauldron: Redefining Computational Archival Science (CAS) for The Big Data Domain [Nathaniel Payne]
Machine Learning in Support of Archival Functions
#5: Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records [Tim Hutchinson]
#6: Computer-Assisted Appraisal and Selection of Archival Materials [Cal Lee]
Metadata and Enterprise Architecture
#7: Measuring Completeness as Metadata Quality Metric in Europeana [Péter Király et al.]
#8: In-place Synchronisation of Hierarchical Archival Descriptions [Mike Bryant et al.]
#9: The Utility Enterprise Architecture for Records Professionals [Shadrack Katuu]
#10: Framing the scope of the common data model for machine-actionable Data Management Plans [João Cardoso et al.]
#11: The Blockchain Litmus Test [Tyler Smith]
Social and Cultural Institution Archives
#12: A Case Study in Creating Transparency in Using Cultural Big Data: The Legacy of Slavery Project [Ryan Cox, Sohan Shah et al.]
#13: Jupyter Notebooks for Generous Archive Interfaces [Mari Wigham et al.]
Finally, we are planning a 4th CAS Workshop in December 2019 at the 2019 IEEE International Conference on Big Data (IEEE BigData 2019) in Los Angeles, CA. Stay tuned for an upcoming CAS#4 workshop call for proposals, where we would welcome SAA member contributions!
Richard Marciano is a professor at the University of Maryland iSchool where he directs the Digital Curation Innovation Center (DCIC). He previously conducted research at the San Diego Supercomputer Center at the University of California San Diego for over a decade. His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data. He is also the 2017 recipient of the Emmett Leahy Award for achievements in records and information management. Marciano holds degrees in Avionics and Electrical Engineering, and a Master’s and Ph.D. in Computer Science from the University of Iowa. In addition, he conducted postdoctoral research in Computational Geography.
Victoria Lemieux is an associate professor of archival science at the iSchool and lead of the Blockchain research cluster, Blockchain@UBC at the University of British Columbia – Canada’s largest and most diverse research cluster devoted to blockchain technology. Her current research is focused on risk to the availability of trustworthy records, in particular in blockchain record keeping systems, and how these risks impact upon transparency, financial stability, public accountability and human rights. She has organized two summer institutes for Blockchain@UBC to provide training in blockchain and distributed ledgers, and her next summer institute is scheduled for May 27-June 7, 2019. She has received many awards for her professional work and research, including the 2015 Emmett Leahy Award for outstanding contributions to the field of records management, a 2015 World Bank Big Data Innovation Award, a 2016 Emerald Literati Award and a 2018 Britt Literary Award for her research on blockchain technology. She is also a faculty associate at multiple units within UBC, including the Peter Wall Institute for Advanced Studies, Sauder School of Business, and the Institute for Computers, Information and Cognitive Systems.
Mark Hedges is a Senior Lecturer in the Department of Digital Humanities at King’s College London, where he teaches on the MA in Digital Asset and Media Management, and is also Departmental Research Lead. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London before embarking on a 17-year career in the software industry; he joined King’s in 2005. His research is concerned primarily with digital archives, research infrastructures, and computational methods, and he has led a range of projects in these areas over the last decade. Most recently he has been working in Rwanda on initiatives relating to digital archives and the transformative impact of digital technologies.
Where: Metropolitan New York Library Council (METRO), New York, NY
Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (CUNY)
Ashley Blewer, AV Preservation Specialist at Artefactual
Kelly Stewart, Digital Preservation Services Manager at Artefactual
On December 3, 2018, the Metropolitan New York Library Council (METRO)’s Digital Preservation Interest Group hosted an informative (and impeccably titled) presentation about how the CUNY Graduate Center (GC) plans to incorporate Archivematica, a web-based, open-source digital asset management software (DAMs) developed by Artefactual, into its document management strategy for student dissertations. Speakers included Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (GC); Ashley Blewer, AV Preservation Specialist at Artefactual; and Kelly Stewart, Digital Preservation Services Manager at Artefactual. The presentation began with an overview from Stephen about the GC’s needs and why they chose Archivematica as a DAMs, followed by an introduction to and demo of Archivematica and Duracloud, an open-source cloud storage service, led by Ashley and Kelly (who was presenting via video-conference call). While this post provides a general summary of the presentation, I would recommend reaching out to any of the presenters for more detailed information about their work. They were all great!
Every year the GC Library receives between 400 and 500 dissertations, theses, and capstones. These submissions can include a wide variety of digital materials, from PDF, video, and audio files, to websites and software. Preservation of these materials is essential if the GC is to provide access to emerging scholarship and retain a record of students’ work towards their degrees. Prior to implementing a DAMs, however, the GC’s strategy for managing digital files of student work was focused primarily on access, not preservation. Access copies of student work were available on CUNY Academic Works, a site that uses Bepress Digital Commons as a CMS. Missing from the workflow, however, was the creation, storage, and management of archival originals. As Stephen explained, if the Open Archival Information System (OAIS) model is a guide for a proper digital preservation workflow, the GC was without the middle, Archival Information Package (AIP), portion of it. Some of the qualities the GC liked about Archivematica were that it is open-source and highly customizable, comes with strong customer support from Artefactual, and has an API that could integrate with tools already in use at the library. GC Library staff hope that Archivematica can eventually integrate with both the library’s electronic submission system (Vireo) and CUNY Academic Works, making the submission, preservation, and access of digital dissertations a much more streamlined, automated, and OAIS-compliant process.
Next, Ashley and Kelly introduced and demoed Archivematica and Duracloud. I was very pleased to see several features of the Archivematica software that were made intentionally intuitive. The design of the interface is very clean and easily customizable to fit different workflows. Also, each AIP that is processed includes a plain-text, human-readable file which serves as extra documentation explaining what Archivematica did to each file. Artefactual recommends pairing Archivematica with Duracloud, although users can choose to integrate the software with local storage or with other cloud services like those offered by Google or Amazon. One of the features I found really interesting about Duracloud is that it comes with various data visualization graphs that show the user how much storage is available and what materials are taking up the most space.
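The pairing of fixity data with a plain-text, human-readable processing log is a pattern worth borrowing even without Archivematica. As a loose sketch of the idea (this is an invented layout, not Archivematica's actual AIP structure, which is considerably richer), the Python function below copies files into a simple package, records SHA-256 checksums, and writes both a machine-readable manifest and a readable log of what was done:

```python
import hashlib
import json
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory

def make_package(src_files, dest):
    """Copy files into an AIP-like package with checksums and a readable log.

    Layout (hypothetical, for illustration only):
        dest/objects/...        the copied files
        dest/manifest.json      filename -> sha256 digest
        dest/processing-log.txt plain-text record of each action
    """
    dest = Path(dest)
    (dest / "objects").mkdir(parents=True)
    manifest = {}
    log_lines = []
    for src in map(Path, src_files):
        target = dest / "objects" / src.name
        shutil.copy2(src, target)  # copy2 preserves timestamps
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        manifest[src.name] = digest
        log_lines.append(f"copied {src.name}; sha256={digest}")
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    # The plain-text log mirrors the human-readable record Archivematica
    # includes, so a future archivist can see what happened without tooling.
    (dest / "processing-log.txt").write_text("\n".join(log_lines) + "\n")
    return manifest
```

Even this toy version gives a future custodian two independent ways to answer "what is in this package and has it changed?", which is the heart of the AIP idea.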
I close by referencing something Ashley wrote in her recent bloggERS post (conveniently she also contributed to this event). She makes an excellent point about how different skill-sets are needed to do digital preservation, from the developers that create the tools that automate digital archival processes to the archivists that advocate for and implement said tools at their institutions. I think this talk was successful precisely because it included the practitioner and vendor perspectives, as well as the unique expertise that comes with each role. Both are needed if we are to meet the challenges and tap into the potential that digital archives present. I hope to see more of these “meetings of the minds” in the future.
Regina Carra is the Archive Project Metadata and Cataloging Coordinator at Mark Morris Dance Group. She is a recent graduate of the Dual Degree MLS/MA program in Library Science and History at Queens College – CUNY.
Scripting and working in the command line have become increasingly important skills for archivists, particularly for those who work with digital materials — at the same time, approaching these tools as a beginner can be intimidating. This series hopes to help break down barriers by allowing archivists to learn from their peers. We want to hear about how you use or are learning to use scripts (Bash, Python, Ruby, etc.) or the command line (one-liners, a favorite command line tool) in your day-to-day work, how scripts play into your processes and workflows, and how you are developing your knowledge in this area. How has this changed the way you think about your work? How has this changed your relationship with your colleagues or other stakeholders?
We’re particularly interested in posts that consist of a walk-through of a simple script (or one-liner) used in your digital archives workflow. Show us your script or command and tell us how it works.
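To give a flavor of the scale we mean, here is a hypothetical example of the kind of short script a post might walk through: a quick "format census" that tallies files by extension in a transfer directory, a common first question when appraising born-digital material:

```python
import sys
from collections import Counter
from pathlib import Path

def format_census(root):
    """Tally files by lowercased extension under a directory tree."""
    return Counter(
        p.suffix.lower() or "(none)"          # extensionless files get a bucket
        for p in Path(root).rglob("*")        # walk the tree recursively
        if p.is_file()
    )

if __name__ == "__main__":
    # Usage: python format_census.py /path/to/transfer
    for ext, n in format_census(sys.argv[1]).most_common():
        print(f"{n:6d}  {ext}")
```

A post walking through a script like this would explain each line, what surprised you when you ran it on real records, and what you did next.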
A few other potential topics and themes for posts:
Stories of success or failure with scripting for digital archives
General “tips and tricks” for the command line/scripting
Independent or collaborative learning strategies for developing “tech” skills
A round-up of resources about a particular scripting language or related topic
Applying computational thinking to digital archives
Writing for bloggERS! “Script It!” Series
We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
Written content should be roughly 600-800 words in length
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA