Diving into Computational Archival Science

by Jane Kelly

In December 2017, the IEEE Big Data conference came to Boston, and with it came the second annual computational archival science workshop! Workshop participants were generous enough to come share their work with the local library and archives community during a one-day public unconference held at the Harvard Law School. After some sessions from Harvard librarians that touched on how they use computational methods to explore archival collections, the unconference continued with lightning talks from CAS workshop participants and discussions about what participants need to learn to engage with computational archival science in the future.

So, what is computational archival science? It is defined by CAS scholars as:

“An interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.”

Lightning round talks from CAS workshop participants (and they really did strike like a dozen 90-second bolts of lightning, I promise!) ranged from computational curation of digitized records to blockchain to topic modeling for born-digital collections. Following a voting session, participants broke into two rounds of large group discussions to dig deeper into lightning round topics. These discussions considered natural language processing, computational curation of cultural heritage archives, blockchain, and computational finding aids. Slides from lightning round presenters and community notes can be found on the CAS Unconference website.

Lightning round talks.

What did we learn? (What questions do we have now?)

Beyond learning a bit about specific projects that leverage computational methods to explore archival material, we discussed some of the challenges that archivists may bump up against when they want to engage with this work. More questions were raised than answered, but the questions can help us build a solid foundation for future study.

First, and perhaps most important for some of us in attendance, is the need to familiarize ourselves with computational methods. Do we have the specific technical knowledge to understand what it really means to say we want to use topic modeling to describe digital records? If not, how can we build our skills with community support? Are our electronic records suitable for computational processes? How might these issues change the way we need to conceptualize or approach appraisal, processing, and access to electronic records?

Conversations repeatedly turned to bias, privacy, and ethics. How do our biases shape the tools we build and use? What skills do we need to develop in order to recognize and dismantle biases in technology?

Word cloud from the unconference created by event co-organizer Ceilyn Boyd.

What do we need?

The unconference was intended to provide a space to bring more voices into conversations about computational methods in archives and, more specifically, to connect those currently engaged in CAS with other library and archives practitioners. At the end of the day, we worked together to compile a list of things that we felt many of us would need to learn in order to engage with CAS.

These needs include lists of methodologies and existing tools, canonical data and/or open datasets to use in testing such tools, a robust community of practice, postmortem analysis of current/existing projects, and much more. Building a community of practice and developing skills among folks without strong programming skills were identified as both particularly important and especially challenging.

Be sure to check out some of the lightning round slides and community notes to learn more about CAS as a field as well as specific projects!

Interested in connecting with the CAS community? Join the CAS Google Group at: computational-archival-science@googlegroups.com!

The Harvard CAS unconference was planned and administered by Ceilyn Boyd, Jane Kelly, and Jessica Farrell of Harvard Library, with help from Richard Marciano and Bill Underwood from the Digital Curation Innovation Center (DCIC) at the University of Maryland’s iSchool. Many thanks to all the organizers, presenters, and participants!


Jane Kelly is the Historical & Special Collections Assistant at the Harvard Law School Library. She will complete her MSLIS from the iSchool at the University of Illinois, Urbana-Champaign in December 2018.

Partnerships in Advancing Digital Archival Education

by Sohan Shah, Michael J. Kurtz, and Richard Marciano

This is the fourth post in the BloggERS series on Collaborating Beyond the Archival Profession.

The mission of the Digital Curation Innovation Center (DCIC) at the University of Maryland’s iSchool is to integrate archival education with research and technology. The Center does this through innovative instructional design, integrated with student-based project experience. A key element in these projects is forming collaborations with academic, public sector, and industry partners. The DCIC fosters these interdisciplinary partnerships through the use of Big Records and Archival Analytics.

DCIC Lab space at the University of Maryland.

The DCIC works with a wide variety of U.S. and foreign academic research partners. These include, among others, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the University of British Columbia, King’s College London, and the Texas Advanced Computing Center at the University of Texas at Austin. Federal and state agencies who partner by providing access to Big Records collections and their staff expertise include the National Agricultural Library, the National Archives and Records Administration, the National Park Service, the U.S. Holocaust Memorial Museum, and the Maryland State Archives. In addition, the DCIC collaborates with the European Holocaust Research Infrastructure project to provide digital access to Holocaust-era collections documenting cultural looting by the Nazis and subsequent restitution actions. Industry partnerships have involved NetApp and Archival Analytics Solutions.

Students working on a semester-long project with Dr. Richard Marciano, Director, DCIC.

We offer students the opportunity to participate in interdisciplinary digital curation projects with the goal of developing new digital skills and conducting frontline research at the intersection of archives, digital curation, Big Data, and analytics. Projects span justice, human rights, cultural heritage, and cyber-infrastructure themes. Students explore new research opportunities as they work with cutting-edge technology and receive guidance from faculty and staff at the DCIC.

To further digital archival education, DCIC faculty develop courses at the undergraduate and graduate levels that teach digital curation theory and provide experiential learning through team-based digital curation projects. The DCIC has also collaborated with the iSchool to create a Digital Curation for Information Professionals (DCIP) Certificate program designed for working professionals who need training in next generation cloud computing technologies, tools, resources, and best practices to help with the evaluation, selection, and implementation of digital curation solutions. Along these lines, the DCIC will sponsor, with the Archival Educators Section of the Society of American Archivists (SAA), a workshop at the Center on August 13, 2018, immediately prior to the SAA’s Annual Meeting in Washington, D.C. The theme of the workshop is “Integrating Archival Education with Technology and Research.” Further information on the workshop will be forthcoming.

The DCIC seeks to integrate all its educational and research activities by exploring and developing a potentially new trans-discipline, Computational Archival Science (CAS), focused on the computational treatments of archival content. The emergence of CAS follows advances in Computational Social Science, Computational Biology, and Computational Journalism.

For further information about our programs and projects visit our web site at http://dcic.umd.edu. To learn more about CAS, see http://dcicblog.umd.edu/cas. Information about a student-led Data Challenge, which the DCIC is co-sponsoring, can be accessed at http://datachallenge.ischool.umd.edu.


Sohan Shah is a Master’s student at the University of Maryland studying Information Management. His focus is on using research and data analytical techniques to make better business decisions. He holds a Bachelor’s degree in Computer Science from Ramaiah Institute of Technology, India, and has worked for 4 years at Microsoft as a Consultant and then as a Technical Lead prior to joining the University of Maryland. Sohan is working at the DCIC to find innovative ways of integrating data analytics with archival education. He is the co-author of “Building Open-Source Digital Curation Services and Repositories at Scale” and is working on other DCIC initiatives such as the Legacy of Slavery and Japanese American WWII Camps. Sohan is also the President of the Master of Information Management Student Association and initiated University of Maryland’s annual “Data Challenge,” bringing together hundreds of students from different academic backgrounds and class years to work with industry experts and build innovative solutions from real-world datasets.

Dr. Michael J. Kurtz is Associate Director of the Digital Curation Innovation Center in the College of Information Studies at the University of Maryland. Prior to this he worked at the U.S. National Archives and Records Administration for 37 years as a professional archivist, manager, and senior executive, retiring as Assistant Archivist in 2011. He received his doctoral degree in European History from Georgetown University in Washington, D.C. Dr. Kurtz has published extensively in the fields of American history and archival management. His works include, among others: “The Enhanced ‘International Research Portal for Records Related to Nazi-Era Cultural Property’ Project (IRP2): A Continuing Case Study” (co-author) in Big Data in the Arts and Humanities: Theory and Practice (forthcoming); “Archival Management and Administration,” in Encyclopedia of Library and Information Sciences (Third Edition, 2010); Managing Archival and Manuscript Repositories (2004); and America and the Return of Nazi Contraband: The Recovery of Europe’s Cultural Treasures (2006, Paperback edition 2009).

Dr. Richard Marciano is a professor in the College of Information Studies at the University of Maryland and director of the Digital Curation Innovation Center (DCIC). Prior to that, he conducted research at the San Diego Supercomputer Center (SDSC) at the University of California San Diego (UCSD) for over a decade with an affiliation in the Division of Social Sciences in the Urban Studies and Planning program. His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data. He is also the 2017 recipient of the Emmett Leahy Award for achievements in records and information management. With partners from KCL, UBC, TACC, and NARA, he has launched a Computational Archival Science (CAS) initiative to explore the opportunities and challenges of applying computational treatments to archival and cultural content. He holds degrees in Avionics and Electrical Engineering and a Master’s and Ph.D. in Computer Science from the University of Iowa, and he conducted a Postdoc in Computational Geography.

Building a “Computational Archival Science” Community

By Richard Marciano

———

When the bloggERS! series started at the beginning of 2015, some of the very first posts featured work on “computer generated archival description” and “big data and big challenges for archives,” so it seems appropriate to revisit this theme of automation and management of records at scale and provide an update on a recent symposium and several upcoming events.

Richard Marciano co-hosted a recent symposium, “Archival Records in the Age of Big Data.” For more information, visit: http://dcicblog.umd.edu/cas/. The three-day program is listed online and has links to all the videos and slides. A list of participants can also be found at http://dcicblog.umd.edu/cas/attendees. The objectives of the Symposium were to:

  • address the challenges of big data for digital curation,
  • explore the conjunction of emerging digital methods and technologies,
  • identify and evaluate current trends,
  • determine possible research agendas, and
  • establish a community of practice.

Richard Marciano and Bill Underwood will be further exploring these themes at SAA in Atlanta on Friday, August 5, 9:30am – 10:45am, session 311, for those ERS aficionados interested in contributing to this emerging conversation. See: https://archives2016.sched.org/event/7f9D/311-archival-records-in-the-age-of-big-data

On April 26-28, 2016, the Digital Curation Innovation Center (DCIC) at the University of Maryland’s College of Information Studies (iSchool) convened a Symposium in collaboration with King’s College London. This invitation-only symposium, entitled Finding New Knowledge: Archival Records in the Age of Big Data, featured 52 participants from the UK, Canada, South Africa and the U.S. Among the participants were researchers, students, and representatives from federal agencies, cultural institutions, and consortia.

This group of experts gathered at Maryland’s iSchool to discuss and try to define computational archival science: an interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

This event, co-sponsored by Richard Marciano, Mark Hedges from King’s College London and Michael Kurtz from UMD’s iSchool, brought together thought leaders in this emerging CAS field: Maria Esteva from the Texas Advanced Computing Center (TACC), Victoria Lemieux from the University of British Columbia School of Library, Archival and Information Studies (SLAIS), and Bill Underwood from Georgia Tech Research Institute (GTRI). There is growing interest in large-scale management, automation, and analysis of archival content and the realization of enhanced possibilities for scholarship through the integration of ‘computational thinking’ and ‘archival thinking.’

To capitalize on the April Symposium, a follow-up workshop, entitled Computational Archival Science: Digital Records in the Age of Big Data, will take place in Washington, D.C., during the second week of December 2016 at the 2016 IEEE International Conference on Big Data. For information on the upcoming workshop, please visit: http://dcicblog.umd.edu/cas/ieee_big_data_2016_cas-workshop/. Paper contributions will be accepted until October 3, 2016.

———

Richard is a professor at Maryland’s iSchool and director of the Digital Curation Innovation Center (DCIC). His research interests include digital preservation, archives and records management, computational archival science, and big data. He holds degrees in Avionics and Electrical Engineering and a Master’s and Ph.D. in Computer Science from the University of Iowa, and he conducted a Postdoc in Computational Geography.