By Dr. Maria Esteva
It was Richard Marciano who, almost two years ago, convened a small multi-disciplinary group of researchers and professionals with experience using computational methods to solve archival problems, and encouraged us to define the work that we do under the label of Computational Archival Science (CAS). The exercise proved useful not only for communicating the concept to others, but also for articulating how we think when we use computational methods in our work. We introduced and refined the definition with a broader group of colleagues at the Finding New Knowledge: Archival Records in the Age of Big Data Symposium in April 2016.
I would like to bring more archivists into the conversation by explaining how I combine archival and computational thinking. But first, three notes to frame my approach to CAS: a) I learned to do this progressively over the course of many projects, b) I took graduate data analysis courses, and c) it takes a village. I started using data mining methods out of necessity and curiosity, frustrated with the practical limitations of manual methods for addressing electronic records. I had entered the field of archives because its theories, and the problems they address, appealed to me, and when I started taking data analysis courses and developing my work, I saw how computational methods could help hypothesize and test archival theories. Coursework in data mining was key to learning methods that I initially understood as “statistics on steroids.” Now I can systematize the process, map it to different problems and inquiries, and suggest the methods that can be used to address them. Finally, my role as a CAS archivist is shaped by my ongoing collaboration with computer scientists and with domain scientists.
In a nutshell, the CAS process goes like this: we first define the problem at hand and identify key archival issues within. On this basis we develop a model, which is an abstraction of the system that we are concerned with. The model can be a methodology or a workflow, and it may include policies, benchmarks, and deliverables. Then, an algorithm, which is a set of steps that are accomplished within a software and hardware environment, is designed to automate the model and solve the problem.
A project in which I collaborate with Dr. Weijia Xu, a computer scientist at the Texas Advanced Computing Center, and Dr. Scott Brandenberg, an engineering professor at UCLA, illustrates a CAS case. To publish and archive large amounts of complex data from natural hazards engineering experiments, researchers would need to manually enter significant amounts of metadata, which has proven impractical and inconsistent. Instead, they need automated methods to organize and describe their data, which may consist of reports, plans and drawings, data files, and images, among other document types. The archival challenge is to design such a method so that the scientific record of the experiments is accurately represented. For this, the model has to convey the dataset’s provenance and capture the right type of metadata. To build the model we asked the domain scientist to draw out the steps of a typical experiment, and to provide terms that characterize its conditions, tools, materials, and resultant data. Using this information we created a data model: a network of classes that represent the experiment process, with metadata terms describing each class. The figures below show the workflow and corresponding data model for centrifuge experiments.
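As a rough illustration, such a data model can be thought of as a chain of experiment-step classes, each annotated with the terms the domain scientist provided. The sketch below uses invented step and term names; the project’s actual model for centrifuge experiments is far richer.

```python
# Hypothetical sketch of a data model: each class is one step of a
# centrifuge experiment, linked to the next step and annotated with
# metadata terms. All names here are invented for illustration.
data_model = {
    "model_construction": {"next": "instrumentation",
                           "terms": {"soil", "container", "density"}},
    "instrumentation":    {"next": "spin",
                           "terms": {"accelerometer", "sensor"}},
    "spin":               {"next": "analysis",
                           "terms": {"centrifuge", "g-level", "shaking"}},
    "analysis":           {"next": None,
                           "terms": {"time series", "report"}},
}

def workflow_order(model, start="model_construction"):
    """Walk the 'next' links to recover the experiment workflow."""
    steps = []
    step = start
    while step is not None:
        steps.append(step)
        step = model[step]["next"]
    return steps
```

Walking the `next` links recovers the workflow, while the `terms` sets carry the vocabulary used to describe data produced at each step.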
Next, Dr. Weijia Xu created an algorithm that combines text mining methods to: a) identify the terms from the model that are present in the data belonging to an experiment, b) extend the terms in the model to related terms present in the data, and c) based on the presence of all the terms, predict the classes to which the data belongs. Using this method, a dataset can be organized around classes/processes and steps, with the corresponding metadata terms describing those classes.
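To make the class-prediction step (c) concrete, here is a minimal sketch, assuming the data model maps class names to sets of metadata terms. It omits the term-extension step (b) and uses invented class and term names; the actual algorithm combines more sophisticated text mining methods.

```python
# Toy data model: class name -> set of metadata terms.
# All names are illustrative, not the project's actual vocabulary.
DATA_MODEL = {
    "instrumentation": {"accelerometer", "sensor", "transducer"},
    "spin": {"centrifuge", "g-level", "shaking"},
    "analysis": {"time series", "spectrum", "report"},
}

def predict_class(text, model=DATA_MODEL):
    """Score each class by how many of its terms occur in the text
    of a data file, and return the best match (None if no term hits)."""
    text = text.lower()
    scores = {name: sum(term in text for term in terms)
              for name, terms in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else None
```

A file whose text mentions, say, a centrifuge and g-levels would be assigned to the hypothetical `spin` class, while a file matching no model terms is left unclassified for human review.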
In a CAS project, the archivist defines the problem and gathers the requirements that will shape the deliverables. He or she collaborates with the domain scientists to model the “problem” system, and with the computer scientist to design the algorithm. An interesting aspect is how the method is evaluated by all team members using data-driven and qualitative methods. Using the data model as the ground truth, we assess whether data is correctly assigned to classes, and whether the metadata terms correctly describe the content of the data files. At the same time, as new terms are found in the dataset and the data model is refined, the domain scientist and the archivist review the accuracy of the resulting representation and the generalizability of the solution.
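The data-driven part of this evaluation can be sketched as a simple accuracy check: compare the classes the algorithm assigned to files against the classes the team considers correct. The file names and labels below are hypothetical.

```python
def accuracy(predicted, ground_truth):
    """Fraction of files assigned to the correct class.
    Both arguments map file name -> class name."""
    hits = sum(1 for fname, cls in ground_truth.items()
               if predicted.get(fname) == cls)
    return hits / len(ground_truth)

# Hypothetical example: two files, one classified correctly.
ground_truth = {"sensors.csv": "instrumentation", "run1.txt": "spin"}
predicted = {"sensors.csv": "instrumentation", "run1.txt": "analysis"}
print(accuracy(predicted, ground_truth))  # 0.5
```

The qualitative side, reviewing whether the representation is faithful and the solution generalizes, stays with the domain scientist and the archivist.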
I look forward to hearing reactions to this work and about research perspectives and experiences from others in this space.
Dr. Maria Esteva is a researcher and data archivist/curator at the Texas Advanced Computing Center, at the University of Texas at Austin. She conducts research on, and implements, large-scale archival processing and data curation systems using High Performance Computing infrastructure resources. Her email is: firstname.lastname@example.org