Collections as Data

by Elizabeth Russey Roke


Archives put a great deal of effort into preserving the original object.  We document the context around its creation, perform conservation work on the object if necessary, and implement reading room procedures designed to limit damage or loss.  As a result, researchers can read and hold an original letter written by Alice Walker, view a series of tintypes taken before the Civil War, read marginalia written by Ted Hughes in a book from his personal library, or listen to an audio recording of one of Martin Luther King, Jr.’s speeches.  In other words, we enable researchers to encounter materials as they were originally designed to be used as much as is possible.

The nature of digital humanities research challenges these traditional modes of archival access: are these the only ways to interact with archival material?  How do we serve users who want to leverage computational techniques such as text mining, machine learning, network analysis, or computer vision in their research or teaching?  Are machines and algorithms “users”? Archivists also encounter these questions as the content of archives shifts from analog to born digital material. Digital files were created and designed to be processed by algorithms, not just encountered through experiences such as watching, viewing, or reading.  What could access for these types of materials look like if we gave access to their full functionality and not just their appearance? 

I have spent the past two years working on an IMLS grant focused on addressing these types of questions.  Collections As Data: Always Already Computational examined how digital collections are and could be used beyond analog research methodologies.  Collections as data is ordered information, stored digitally, that is inherently amenable to computation.  This could include metadata, digital text, or other digital surrogates.  Whereas a digital repository might enable researchers to read a newspaper on a computer screen, an approach grounded in collections as data would give researchers access to the OCR file the repository generated to enable keyword search. In other words, repositories should provide access beyond the viewers, page turners, and streaming servers that most current digital repositories use to replicate analog experiences.  At its core, collections as data simply asks cultural heritage organizations to make the full digital object available rather than making assumptions about how users will want to interact with it.
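To make the difference concrete, here is a minimal sketch of what a researcher could do once handed the OCR text itself rather than a page viewer. The newspaper snippet is invented for illustration:

```python
# A minimal example of what "collections as data" access enables: given the
# plain-text OCR for a digitized newspaper issue (a made-up snippet here),
# a researcher can compute word frequencies instead of paging through a viewer.
import re
from collections import Counter

ocr_text = """The market opened briskly. The harvest news from the
county was good, and the market closed higher."""

words = re.findall(r"[a-z]+", ocr_text.lower())
freq = Counter(words)
print(freq.most_common(3))   # top terms, e.g. [('the', 4), ('market', 2), ...]
```

Even this toy example goes past keyword search: the same few lines scale to term frequencies across an entire digitized newspaper run.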

Collections as data implementations are not necessarily complex nor do they involve complicated repository development.  Some of the simplest examples can be found on GitHub, where archives such as Vanderbilt and New York University publish their EAD files.  The Rockefeller Archive Center and the Museum of Modern Art go a step further and publish all of their collection data, along with a Creative Commons license.  Emory, my home institution, makes finding aid data available in both EAD and RDF from our finding aids database, which has led to a digital humanities project that harvested correspondence indexes from our Irish poetry collections to build network graphs of the Belfast Group.  More complex implementations often provide access to data through APIs instead of a bulk download.  An example of this can be found at Carnegie Hall Archives, which allows researchers to query their data through a SPARQL endpoint.
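As a rough illustration of the API-based approach, the sketch below builds a SPARQL query request using only the Python standard library. The endpoint URL is a placeholder and the query is generic; consult an institution's documentation (Carnegie Hall's, for example) for its actual endpoint and vocabulary.

```python
# Sketch: querying a SPARQL endpoint for collection data with only the
# standard library. Endpoint URL and predicates are placeholders.
import json
import urllib.parse
import urllib.request

SPARQL_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10
"""

def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request for a SPARQL SELECT query, asking for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )

def run_query(endpoint: str) -> list:
    """Send the query and return the result bindings (requires network access)."""
    with urllib.request.urlopen(build_sparql_request(endpoint, SPARQL_QUERY)) as resp:
        return json.load(resp)["results"]["bindings"]

# run_query("https://example.org/sparql")   # substitute the real endpoint URL
```

The point is less the specific code than the access model: an API lets a researcher ask precise questions of the data without downloading and reprocessing the whole collection.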

Chord diagram created as part of Emory’s Belfast Group Poetry project, showing the networks of people associated with the Belfast Group and their relationships with each other.

The Collections As Data: Always Already Computational final report includes more information and ideas for getting started with collections as data. It includes resources such as a set of user personas, methods profiles of common techniques used by data-driven researchers, and real-world case studies of institutions with collections as data including their motivations, technical details, and how they made the case to their administrators.  I highly recommend “50 Things,” which is a list of activities associated with collections as data work ranging from the simple to the complex. 

There are a few takeaways from this project I’d like to highlight for archivists in particular:

Collections as data approaches are archival.  Data-driven research demands authenticity and context of the data source, established and preserved through archival principles of documentation, transparency, and provenance.  This type of information was one of the most universal requests from digital humanities researchers. It was clear that they were not only interested in the object, but in how it came to be.  They wanted to understand their data as an archival object with information about its creation, provenance, and preservation. Archivists need to advocate for digital collections to be treated not just as digital surrogates, or what I like to think of as expensive photocopying, but as unique resources unto themselves deserving description, preservation, and access that may not necessarily match that of the original object.

Collections as data enhances access to archival material.  What if we could partially open restricted material to researchers?  Emory holds the papers of Salman Rushdie and his email files are largely restricted per the deed of gift.  Computational techniques being developed in ePADD could generate maps of Rushdie’s correspondents and reveal patterns in the timing and frequency of his correspondence, just through email header information and without exposing sensitive data (i.e. the content of the email) that Rushdie wanted to restrict.  Could this methodology be extrapolated to other types of restricted electronic files?  
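To make the idea concrete, here is a small sketch of header-only analysis: it counts correspondents and messages per month from standard email headers without ever touching a body. The messages are invented stand-ins, and ePADD's actual processing is far more sophisticated than this.

```python
# Header-only analysis in the spirit described above: who wrote, and when,
# computed from RFC 5322 headers alone -- the (restricted) bodies are ignored.
# The sample messages are invented for illustration.
from collections import Counter
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

RAW_MESSAGES = [
    "From: Ann Author <ann@example.org>\nDate: Mon, 06 Jan 1997 10:00:00 +0000\n\nRESTRICTED BODY",
    "From: Ann Author <ann@example.org>\nDate: Tue, 04 Feb 1997 09:30:00 +0000\n\nRESTRICTED BODY",
    "From: Ed Editor <ed@example.org>\nDate: Tue, 04 Feb 1997 12:00:00 +0000\n\nRESTRICTED BODY",
]

correspondents = Counter()   # messages per sender address
by_month = Counter()         # messages per (year, month)
for raw in RAW_MESSAGES:
    headers = message_from_string(raw)
    correspondents[parseaddr(headers["From"])[1]] += 1
    when = parsedate_to_datetime(headers["Date"])
    by_month[(when.year, when.month)] += 1

print(correspondents.most_common())   # network of correspondents
print(sorted(by_month.items()))       # timing and frequency patterns
```

Nothing here reads a message body, which is exactly the property that could make such analysis defensible for restricted material.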

Just start.  For digital files, trying something is always the first, and best approach.  There is no one way or best way to do collections as data work. Consider your community and ask them what they need.  Unlike baseball fields, if you build it, they probably won’t come unless you ask first. Collections as data material already exists in your collection, especially if you use ArchivesSpace.  Publish it. Think broadly about what might constitute collections as data and how you might make use of it yourself; collections as data benefits us too. Follow the Computational Archival Science project at the University of Maryland, which is exploring how we think about archival collections as data.  
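For ArchivesSpace users, "publish it" can be as simple as exporting EAD through the backend REST API and committing the files to a public repository. The sketch below follows the endpoint paths in the ArchivesSpace API documentation; the host, credentials, repository ID, and resource ID are all placeholders to replace with your own.

```python
# Sketch: pulling an EAD export from the ArchivesSpace backend REST API so it
# can be published. Host and credentials below are placeholders.
import json
import urllib.parse
import urllib.request

HOST = "http://localhost:8089"   # ArchivesSpace backend API (placeholder)

def login(username: str, password: str) -> str:
    """Log in and return a session token (requires a running instance)."""
    data = urllib.parse.urlencode({"password": password}).encode()
    with urllib.request.urlopen(f"{HOST}/users/{username}/login", data=data) as resp:
        return json.load(resp)["session"]

def ead_export_request(token: str, repo_id: int, resource_id: int) -> urllib.request.Request:
    """Build the authenticated request for one resource's EAD XML export."""
    url = f"{HOST}/repositories/{repo_id}/resource_descriptions/{resource_id}.xml"
    return urllib.request.Request(url, headers={"X-ArchivesSpace-Session": token})

# token = login("admin", "admin")
# with urllib.request.urlopen(ead_export_request(token, 2, 1)) as resp:
#     open("resource_1_ead.xml", "wb").write(resp.read())
```

A short script like this, run on a schedule, keeps a public GitHub repository of finding aids current with no new infrastructure at all.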

If you want to take a deep dive into collections as data (and get funding to do so!) consider applying to be part of the second cohort of the Part to Whole Mellon grant, which aims to foster the development of broadly viable models that support implementation and use of collections as data.  The next call for proposals opens August 1: https://collectionsasdata.github.io/part2whole/. On August 5, the project team will offer a webinar with more information about the grant and opportunities to ask questions: https://collectionsasdata.github.io/part2whole/cfp2webinar/.


Elizabeth Russey Roke is a digital archivist and metadata specialist at the Stuart A. Rose Library of Emory University, Atlanta, Georgia. Primarily focused on preservation, discovery, and access to digitized and born digital assets from special collections, Elizabeth works on a variety of technology projects and initiatives related to digital repositories, metadata standards, and archival descriptive practice. She was a co-investigator on a 2016-2018 IMLS grant investigating collections as data.

IEEE Big Data 2018: 3rd Computational Archival Science (CAS) Workshop Recap

by Richard Marciano, Victoria Lemieux, and Mark Hedges

Introduction

The 3rd workshop on Computational Archival Science (CAS) was held on December 12, 2018, in Seattle, following two earlier CAS workshops in 2016 in Washington DC and in 2017 in Boston. It also built on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland. The current working definition of CAS is:

A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material [1].

The workshop featured five sessions and thirteen papers with international presenters and authors from the US, Canada, Germany, the Netherlands, the UK, Bulgaria, South Africa, and Portugal. All details (photos, abstracts, slides, and papers) are available at: http://dcicblog.umd.edu/cas/ieee-big-data-2018-3rd-cas-workshop/. The keynote focused on using digital archives to preserve the history of WWII Japanese-American incarceration and featured Geoff Froh, Deputy Director at Densho.org in Seattle.

Keynote speaker Geoff Froh, Deputy Director at Densho.org in Seattle presenting on “Reclaiming our Story: Using Digital Archives to Preserve the History of WWII Japanese American Incarceration.”

This workshop explored the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. The aim was to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we addressed the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

Sessions

  1. Computational Thinking and Computational Archival Science
  • #1: Introducing Computational Thinking into Archival Science Education [William Underwood et al.]
  • #2: Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camp Records [Richard Marciano et al.]
  • #3: Computational Archival Practice: Towards a Theory for Archival Engineering [Kenneth Thibodeau]
  • #4: Stirring The Cauldron: Redefining Computational Archival Science (CAS) for The Big Data Domain [Nathaniel Payne]
  2. Machine Learning in Support of Archival Functions
  • #5: Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records [Tim Hutchinson]
  • #6: Computer-Assisted Appraisal and Selection of Archival Materials [Cal Lee]
  3. Metadata and Enterprise Architecture
  • #7: Measuring Completeness as Metadata Quality Metric in Europeana [Péter Király et al.]
  • #8: In-place Synchronisation of Hierarchical Archival Descriptions [Mike Bryant et al.]
  • #9: The Utility of Enterprise Architecture for Records Professionals [Shadrack Katuu]
  4. Data Management
  • #10: Framing the scope of the common data model for machine-actionable Data Management Plans [João Cardoso et al.]
  • #11: The Blockchain Litmus Test [Tyler Smith]
  5. Social and Cultural Institution Archives
  • #12: A Case Study in Creating Transparency in Using Cultural Big Data: The Legacy of Slavery Project [Ryan Cox, Sohan Shah et al.]
  • #13: Jupyter Notebooks for Generous Archive Interfaces [Mari Wigham et al.]

Next Steps

Updates will continue to be provided through the CAS Portal website, see: http://dcicblog.umd.edu/cas and a Google Group you can join at computational-archival-science@googlegroups.com.

Several related events are scheduled in April 2019: (1) a 1 ½ day workshop on “Developing a Computational Framework for Library and Archival Education” will take place on April 3 & 4, 2019, at the iConference 2019 event (See: https://iconference2019.umd.edu/external-events-and-excursions/ for details), and (2) a “Blue Sky” paper session on “Establishing an International Computational Network for Librarians and Archivists” (See: https://www.conftool.com/iConference2019/index.php?page=browseSessions&form_session=356).

Finally, we are planning a 4th CAS Workshop in December 2019 at the 2019 IEEE International Conference on Big Data (IEEE BigData 2019) in Los Angeles, CA. Stay tuned for an upcoming CAS#4 workshop call for proposals, where we would welcome SAA member contributions!

References

[1] Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M., & Conrad, M., “Archival Records and Training in the Age of Big Data.” In J. Percell, L. C. Sarin, P. T. Jaeger, & J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited, May 17, 2018. See: http://dcicblog.umd.edu/cas/wp-content/uploads/sites/13/2017/06/Marciano-et-al-Archival-Records-and-Training-in-the-Age-of-Big-Data-final.pdf


Richard Marciano is a professor at the University of Maryland iSchool where he directs the Digital Curation Innovation Center (DCIC). He previously conducted research at the San Diego Supercomputer Center at the University of California San Diego for over a decade. His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data. He is also the 2017 recipient of Emmett Leahy Award for achievements in records and information management. Marciano holds degrees in Avionics and Electrical Engineering, a Master’s and Ph.D. in Computer Science from the University of Iowa. In addition, he conducted postdoctoral research in Computational Geography.

Victoria Lemieux is an associate professor of archival science at the iSchool and lead of the Blockchain research cluster, Blockchain@UBC at the University of British Columbia – Canada’s largest and most diverse research cluster devoted to blockchain technology. Her current research is focused on risk to the availability of trustworthy records, in particular in blockchain record keeping systems, and how these risks impact upon transparency, financial stability, public accountability and human rights. She has organized two summer institutes for Blockchain@UBC to provide training in blockchain and distributed ledgers, and her next summer institute is scheduled for May 27-June 7, 2019. She has received many awards for her professional work and research, including the 2015 Emmett Leahy Award for outstanding contributions to the field of records management, a 2015 World Bank Big Data Innovation Award, a 2016 Emerald Literati Award and a 2018 Britt Literary Award for her research on blockchain technology. She is also a faculty associate at multiple units within UBC, including the Peter Wall Institute for Advanced Studies, Sauder School of Business, and the Institute for Computers, Information and Cognitive Systems.

Mark Hedges is a Senior Lecturer in the Department of Digital Humanities at King’s College London, where he teaches on the MA in Digital Asset and Media Management, and is also Departmental Research Lead. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London, before starting a 17-year career in the software industry, before joining King’s in 2005. His research is concerned primarily with digital archives, research infrastructures, and computational methods, and he has led a range of projects in these areas over the last decade. Most recently has been working in Rwanda on initiatives relating to digital archives and the transformative impact of digital technologies.

Diving into Computational Archival Science

by Jane Kelly

In December 2017, the IEEE Big Data conference came to Boston, and with it came the second annual computational archival science workshop! Workshop participants were generous enough to come share their work with the local library and archives community during a one-day public unconference held at the Harvard Law School. After some sessions from Harvard librarians that touched on how they use computational methods to explore archival collections, the unconference continued with lightning talks from CAS workshop participants and discussions about what participants need to learn to engage with computational archival science in the future.

So, what is computational archival science? It is defined by CAS scholars as:

“An interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.”

Lightning round (and they really did strike like a dozen 90-second bolts of lightning, I promise!) talks from CAS workshop participants ranged from computational curation of digitized records to blockchain to topic modeling for born-digital collections. Following a voting session, participants broke into two rounds of large group discussions to dig deeper into lightning round topics. These discussions considered natural language processing, computational curation of cultural heritage archives, blockchain, and computational finding aids. Slides from lightning round presenters and community notes can be found on the CAS Unconference website.

Lightning round talks. (Image credit)

 

What did we learn? (What questions do we have now?)

Beyond learning a bit about specific projects that leverage computational methods to explore archival material, we discussed some of the challenges that archivists may bump up against when they want to engage with this work. More questions were raised than answered, but the questions can help us build a solid foundation for future study.

First, and for some of us in attendance perhaps the most important point, is the need to familiarize ourselves with computational methods. Do we have the specific technical knowledge to understand what it really means to say we want to use topic modeling to describe digital records? If not, how can we build our skills with community support? Are our electronic records suitable for computational processes? How might these issues change the way we need to conceptualize or approach appraisal, processing, and access to electronic records?

Many conversations turned to issues of bias, privacy, and ethics. How do our biases shape the tools we build and use? What skills do we need to develop in order to recognize and dismantle biases in technology?

Word cloud from the unconference created by event co-organizer Ceilyn Boyd.

 

What do we need?

The unconference was intended to provide a space to bring more voices into conversations about computational methods in archives and, more specifically, to connect those currently engaged in CAS with other library and archives practitioners. At the end of the day, we worked together to compile a list of things that we felt many of us would need to learn in order to engage with CAS.

These needs include lists of methodologies and existing tools, canonical data and/or open datasets to use in testing such tools, a robust community of practice, postmortem analysis of current/existing projects, and much more. Building a community of practice and skill development for folks without strong programming skills was identified as both particularly important and especially challenging.

Be sure to check out some of the lightning round slides and community notes to learn more about CAS as a field as well as specific projects!

Interested in connecting with the CAS community? Join the CAS Google Group at: computational-archival-science@googlegroups.com!

The Harvard CAS unconference was planned and administered by Ceilyn Boyd, Jane Kelly, and Jessica Farrell of Harvard Library, with help from Richard Marciano and Bill Underwood from the Digital Curation Innovation Center (DCIC) at the University of Maryland’s iSchool. Many thanks to all the organizers, presenters, and participants!


Jane Kelly is the Historical & Special Collections Assistant at the Harvard Law School Library. She will complete her MSLIS from the iSchool at the University of Illinois, Urbana-Champaign in December 2018.

Partnerships in Advancing Digital Archival Education

by Sohan Shah, Michael J. Kurtz, and Richard Marciano

This is the fourth post in the BloggERS series on Collaborating Beyond the Archival Profession.

The mission of the Digital Curation Innovation Center (DCIC) at the University of Maryland’s iSchool is to integrate archival education with research and technology. The Center does this through innovative instructional design, integrated with student-based project experience. A key element in these projects is forming collaborations with academic, public sector, and industry partners. The DCIC fosters these interdisciplinary partnerships through the use of Big Records and Archival Analytics.

DCIC Lab space at the University of Maryland.

The DCIC works with a wide variety of U.S. and foreign academic research partners. These include, among others, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the University of British Columbia, King’s College London, and the Texas Advanced Computing Center at the University of Texas at Austin. Federal and state agencies who partner by providing access to Big Records collections and their staff expertise include the National Agricultural Library, the National Archives and Records Administration, the National Park Service, the U.S. Holocaust Memorial Museum, and the Maryland State Archives. In addition, the DCIC collaborates with the European Holocaust Research Infrastructure project to provide digital access to Holocaust-era collections documenting cultural looting by the Nazis and subsequent restitution actions. Industry partnerships have involved NetApp and Archival Analytics Solutions.

Students working on a semester-long project with Dr. Richard Marciano, Director, DCIC.

We offer students the opportunity to participate in interdisciplinary digital curation projects with the goal of developing new digital skills and conducting front line research at the intersection of archives, digital curation, Big Data, and analytics. Projects span across justice, human rights, cultural heritage, and cyber-infrastructure themes. Students explore new research opportunities as they work with cutting-edge technology and receive guidance from faculty and staff at the DCIC.

To further digital archival education, DCIC faculty develop courses at the undergraduate and graduate levels that teach digital curation theory and provide experiential learning through team-based digital curation projects. The DCIC has also collaborated with the iSchool to create a Digital Curation for Information Professionals (DCIP) Certificate program designed for working professionals who need training in next generation cloud computing technologies, tools, resources, and best practices to help with the evaluation, selection, and implementation of digital curation solutions. Along these lines, the DCIC will sponsor, with the Archival Educators Section of the Society of American Archivists (SAA), a workshop at the Center on August 13, 2018, immediately prior to the SAA’s Annual Meeting in Washington, D.C. The theme of the workshop is “Integrating Archival Education with Technology and Research.” Further information on the workshop will be forthcoming.

The DCIC seeks to integrate all its educational and research activities by exploring and developing a potentially new trans-discipline, Computational Archival Science (CAS), focused on the computational treatments of archival content. The emergence of CAS follows advances in Computational Social Science, Computational Biology, and Computational Journalism.

For further information about our programs and projects visit our web site at http://dcic.umd.edu. To learn more about CAS, see http://dcicblog.umd.edu/cas. Information about a student-led Data Challenge, which the DCIC is co-sponsoring, can be accessed at http://datachallenge.ischool.umd.edu.


Sohan Shah

Sohan Shah is a Master’s student at the University of Maryland studying Information Management. His focus is on using research and data analytical techniques to make better business decisions. He holds a Bachelor’s degree in Computer Science from Ramaiah Institute of Technology, India, and has worked for 4 years at Microsoft as a Consultant and then as a Technical Lead prior to joining the University of Maryland. Sohan is working at the DCIC to find innovative ways of integrating data analytics with archival education. He is the co-author of “Building Open-Source Digital Curation Services and Repositories at Scale” and is working on other DCIC initiatives such as the Legacy of Slavery and Japanese American WWII Camps. Sohan is also the President of the Master of Information Management Student Association and initiated University of Maryland’s annual “Data Challenge,” bringing together hundreds of students from different academic backgrounds and class years to work with industry experts and build innovative solutions from real-world datasets.

Dr. Michael J. Kurtz is Associate Director of the Digital Curation Innovation Center in the College of Information Studies at the University of Maryland. Prior to this he worked at the U.S. National Archives and Records Administration for 37 years as a professional archivist, manager, and senior executive, retiring as Assistant Archivist in 2011. He received his doctoral degree in European History from Georgetown University in Washington, D.C. Dr. Kurtz has published extensively in the fields of American history and archival management. His works, among others, include: “The Enhanced ‘International Research Portal for Records Related to Nazi-Era Cultural Property’ Project (IRP2): A Continuing Case Study” (co-author) in Big Data in the Arts and Humanities: Theory and Practice (forthcoming); “Archival Management and Administration,” in Encyclopedia of Library and Information Sciences (Third Edition, 2010); Managing Archival and Manuscript Repositories (2004); and America and the Return of Nazi Contraband: The Recovery of Europe’s Cultural Treasures (2006, paperback edition 2009).

Dr. Richard Marciano is a professor in the College of Information Studies at the University of Maryland and director of the Digital Curation Innovation Center (DCIC).  Prior to that, he conducted research at the San Diego Supercomputer Center (SDSC) at the University of California San Diego (UCSD) for over a decade with an affiliation in the Division of Social Sciences in the Urban Studies and Planning program.  His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data.  He is also the 2017 recipient of the Emmett Leahy Award for achievements in records and information management. With partners from KCL, UBC, TACC, and NARA, he has launched a Computational Archival Science (CAS) initiative to explore the opportunities and challenges of applying computational treatments to archival and cultural content. He holds degrees in Avionics and Electrical Engineering, a Master’s and Ph.D. in Computer Science from the University of Iowa, and conducted a Postdoc in Computational Geography.

Modeling archival problems in Computational Archival Science (CAS)

By Dr. Maria Esteva

____

It was Richard Marciano who, almost two years ago, convened a small multi-disciplinary group of researchers and professionals with experience using computational methods to solve archival problems, and encouraged us to define the work that we do under the label of Computational Archival Science (CAS). The exercise proved very useful for communicating the concept to others, but also for articulating how we think when we go about using computational methods to conduct our work. We introduced and refined the definition amongst a broader group of colleagues at the Finding New Knowledge: Archival Records in the Age of Big Data Symposium in April 2016.

I would like to bring more archivists into the conversation by explaining how I combine archival and computational thinking.  But first, three notes to frame my approach to CAS: a) I learned to do this progressively over the course of many projects, b) I took graduate data analysis courses, and c) it takes a village. I started using data mining methods out of necessity and curiosity, frustrated with the practical limitations of manual methods for addressing electronic records. I had entered the field of archives because its theories, and the problems they address, are attractive to me, and when I started taking data analysis courses and developing my work, I saw how computational methods could help hypothesize and test archival theories. Coursework in data mining was key to learning methods that I initially understood as “statistics on steroids.” Now I can systematize the process, map it to different problems and inquiries, and suggest the methods that can be used to address them. Finally, my role as a CAS archivist is shaped through my ongoing collaboration with computer scientists and with domain scientists.

In a nutshell, the CAS process goes like this: we first define the problem at hand and identify the key archival issues within it. On this basis we develop a model, which is an abstraction of the system that we are concerned with. The model can be a methodology or a workflow, and it may include policies, benchmarks, and deliverables. Then an algorithm, a set of steps accomplished within a software and hardware environment, is designed to automate the model and solve the problem.

A project in which I collaborate with Dr. Weijia Xu, a computer scientist at the Texas Advanced Computing Center, and Dr. Scott Brandenberg, an engineering professor at UCLA, illustrates a CAS case. To publish and archive large amounts of complex data from natural hazards engineering experiments, researchers would need to manually enter significant amounts of metadata, which has proven impractical and inconsistent. Instead, they need automated methods to organize and describe their data, which may consist of reports, plans and drawings, data files, and images, among other document types. The archival challenge is to design such a method so that the scientific record of the experiments is accurately represented. For this, the model has to convey the dataset’s provenance and capture the right type of metadata. To build the model we asked the domain scientist to draw out the steps of a typical experiment and to provide terms that characterize its conditions, tools, materials, and resultant data. Using this information we created a data model: a network of classes that represent the experiment process, and of metadata terms describing the process. The figures below show the workflow and corresponding data model for centrifuge experiments.

Figure 1. Workflow of a centrifuge experiment by Dr. Scott Brandenberg

 

Figure 2. Networked data model of the centrifuge experiment process by the archivist

Next, Dr. Weijia Xu created an algorithm that combines text mining methods to: a) identify the terms from the model that are present in data belonging to an experiment, b) extend the terms in the model to related ones present in the data, and c) based on the presence of all the terms, predict the classes to which the data belongs. Using this method, a dataset can be organized around classes/processes and steps, and corresponding metadata terms describe those classes.
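A drastically simplified stand-in for that classification step might look like the following: a file is assigned to the class whose model terms appear most often in its text. The classes and terms are invented placeholders, and this sketch omits the term-extension step (b) of the actual algorithm.

```python
# Toy term-matching classifier in the spirit of the algorithm described above.
# MODEL is a miniature data model: experiment classes and characteristic terms
# (invented stand-ins, not the real centrifuge workflow vocabulary).
import re

MODEL = {
    "Spinning": {"centrifuge", "rpm", "acceleration"},
    "SensorRecording": {"accelerometer", "pore", "pressure", "time"},
}

def predict_class(text: str) -> str:
    """Assign the text to the class whose terms appear most often in it."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        cls: sum(tokens.count(term) for term in terms)
        for cls, terms in MODEL.items()
    }
    return max(scores, key=scores.get)

print(predict_class("Pore pressure time series from the downhole accelerometer"))
# -> SensorRecording
```

Real implementations would weight terms, handle multi-word phrases, and expand the vocabulary with related terms found in the data, but the core move is the same: the archival data model supplies the ground truth against which files are classified.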

In a CAS project, the archivist defines the problem and gathers the requirements that will shape the deliverables. He or she collaborates with the domain scientists to model the “problem” system, and with the computer scientist to design the algorithm. An interesting aspect is how the method is evaluated by all team members using both data-driven and qualitative methods. Using the data model as the ground truth, we assess whether data is correctly assigned to classes and whether the metadata terms correctly describe the content of the data files. At the same time, as new terms are found in the dataset and the data model is refined, the domain scientist and the archivist review the accuracy of the resulting representation and the generalizability of the solution.
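The data-driven side of such an evaluation can be as simple as measuring how often the algorithm’s predicted class matches the expert-assigned label. The helper below is a hypothetical sketch of that check; the project’s actual evaluation, as described above, also includes qualitative review of the metadata terms by the archivist and the domain scientist.

```python
def class_accuracy(predictions, ground_truth):
    """Fraction of files whose predicted class matches the expert label.

    Both arguments map file names to class labels; files with no
    prediction count as incorrect. Returns 0.0 for an empty truth set.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for f, cls in ground_truth.items() if predictions.get(f) == cls
    )
    return correct / len(ground_truth)
```

Running this over a labeled sample after each refinement of the data model gives the team a quick signal of whether new terms are improving or degrading the classification.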

I look forward to hearing reactions to this work and about research perspectives and experiences from others in this space.

____
Dr. Maria Esteva is a researcher and data archivist/curator at the Texas Advanced Computing Center at the University of Texas at Austin. She conducts research on, and implements, large-scale archival processing and data curation systems using High Performance Computing infrastructure resources. Her email is maria@tacc.utexas.edu.

 

Building a “Computational Archival Science” Community

By Richard Marciano

———

When the bloggERS! series started at the beginning of 2015, some of the very first posts featured work on “computer generated archival description” and “big data and big challenges for archives,” so it seems appropriate to revisit this theme of automation and management of records at scale and provide an update on a recent symposium and several upcoming events.

Richard Marciano co-hosted the recent “Archival Records in the Age of Big Data” symposium. For more information about the symposium, visit http://dcicblog.umd.edu/cas/. The three-day program is listed online with links to all the videos and slides, and a list of participants can be found at http://dcicblog.umd.edu/cas/attendees. The objectives of the symposium were to:

  • address the challenges of big data for digital curation,
  • explore the conjunction of emerging digital methods and technologies,
  • identify and evaluate current trends,
  • determine possible research agendas, and
  • establish a community of practice.

Richard Marciano and Bill Underwood will be further exploring these themes at SAA in Atlanta on Friday, August 5, 9:30am – 10:45am, session 311, for those ERS aficionados interested in contributing to this emerging conversation. See: https://archives2016.sched.org/event/7f9D/311-archival-records-in-the-age-of-big-data

On April 26-28, 2016, the Digital Curation Innovation Center (DCIC) at the University of Maryland’s College of Information Studies (iSchool) convened a symposium in collaboration with King’s College London. This invitation-only symposium, entitled Finding New Knowledge: Archival Records in the Age of Big Data, featured 52 participants from the UK, Canada, South Africa, and the U.S. Among the participants were researchers, students, and representatives from federal agencies, cultural institutions, and consortia.

This group of experts gathered at Maryland’s iSchool to discuss and try to define computational archival science: an interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

This event, co-sponsored by Richard Marciano, Mark Hedges from King’s College London, and Michael Kurtz from UMD’s iSchool, brought together thought leaders in this emerging CAS field: Maria Esteva from the Texas Advanced Computing Center (TACC), Victoria Lemieux from the University of British Columbia School of Library, Archival and Information Studies (SLAIS), and Bill Underwood from Georgia Tech Research Institute (GTRI). There is growing interest in large-scale management, automation, and analysis of archival content, and in the enhanced possibilities for scholarship that could be realized through the integration of ‘computational thinking’ and ‘archival thinking’.

To build on the April symposium, a follow-up workshop, entitled Computational Archival Science: Digital Records in the Age of Big Data, will take place in Washington, D.C. during the second week of December 2016 at the 2016 IEEE International Conference on Big Data. For information on the upcoming workshop, please visit: http://dcicblog.umd.edu/cas/ieee_big_data_2016_cas-workshop/. Paper contributions will be accepted until October 3, 2016.

———

Richard is a professor at Maryland’s iSchool and director of the Digital Curation Innovation Center (DCIC). His research interests include digital preservation, archives and records management, computational archival science, and big data. He holds degrees in Avionics and Electrical Engineering, a Master’s and Ph.D. in Computer Science from the University of Iowa, and completed a postdoc in Computational Geography.