Modeling archival problems in Computational Archival Science (CAS)

By Dr. Maria Esteva

____

Almost two years ago, Richard Marciano convened a small multidisciplinary group of researchers and professionals with experience using computational methods to solve archival problems, and encouraged us to define the work that we do under the label of Computational Archival Science (CAS). The exercise proved very useful not only for communicating the concept to others, but also for articulating how we think when we use computational methods in our work. We introduced and refined the definition with a broader group of colleagues at the Finding New Knowledge: Archival Records in the Age of Big Data Symposium in April 2016.

I would like to bring more archivists into the conversation by explaining how I combine archival and computational thinking. But first, three notes to frame my approach to CAS: a) I learned to do this progressively over the course of many projects, b) I took graduate data analysis courses, and c) it takes a village. I started using data mining methods out of necessity and curiosity, frustrated with the practical limitations of manual methods for addressing electronic records. I had entered the field of archives because its theories, and the problems they address, appealed to me, and when I started taking data analysis courses and developing my work, I saw how computational methods could help hypothesize and test archival theories. Coursework in data mining was key to learning methods that I initially understood as “statistics on steroids.” Now I can systematize the process, map it to different problems and inquiries, and suggest the methods that can be used to address them. Finally, my role as a CAS archivist is shaped by my ongoing collaboration with computer scientists and with domain scientists.

In a nutshell, the CAS process goes like this: we first define the problem at hand and identify the key archival issues within it. On this basis we develop a model, which is an abstraction of the system that we are concerned with. The model can be a methodology or a workflow, and it may include policies, benchmarks, and deliverables. Then an algorithm, a set of steps carried out within a software and hardware environment, is designed to automate the model and solve the problem.

A project in which I collaborate with Dr. Weijia Xu, a computer scientist at the Texas Advanced Computing Center, and Dr. Scott Brandenberg, an engineering professor at UCLA, illustrates a CAS case. To publish and archive large amounts of complex data from natural hazards engineering experiments, researchers would need to manually enter significant amounts of metadata, which has proven impractical and inconsistent. Instead, they need automated methods to organize and describe their data, which may consist of reports, plans and drawings, data files, and images, among other document types. The archival challenge is to design such a method so that the scientific record of the experiments is accurately represented. For this, the model has to convey the dataset’s provenance and capture the right type of metadata. To build the model we asked the domain scientist to draw out the steps of a typical experiment and to provide terms that characterize its conditions, tools, materials, and resultant data. Using this information we created a data model: a network of classes that represent the experiment process, annotated with metadata terms describing it. The figures below show the workflow and the corresponding data model for centrifuge experiments.

Figure 1. Workflow of a centrifuge experiment by Dr. Scott Brandenberg


Figure 2. Networked data model of the centrifuge experiment process by the archivist
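
To make the idea of a networked data model concrete, here is a minimal code sketch of such a network of classes and terms. This is an illustration only: the step names and metadata terms are hypothetical stand-ins, not the project’s actual vocabulary, and it assumes the networkx Python library.

```python
import networkx as nx

# A networked data model: nodes are experiment classes/steps, each
# annotated with metadata terms that characterize it. All names here
# are hypothetical stand-ins for illustration.
model = nx.DiGraph()
model.add_node("ModelConstruction", terms=["soil", "container", "density"])
model.add_node("SensorInstallation", terms=["accelerometer", "pore pressure"])
model.add_node("Spin", terms=["centrifuge", "g-level", "acceleration"])
model.add_node("DataAcquisition", terms=["time series", "sampling rate"])

# Edges capture the order of steps in the experiment workflow.
model.add_edge("ModelConstruction", "SensorInstallation")
model.add_edge("SensorInstallation", "Spin")
model.add_edge("Spin", "DataAcquisition")
```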

Next, Dr. Weijia Xu created an algorithm that combines text mining methods to: a) identify the terms from the model that are present in data belonging to an experiment, b) extend the terms in the model to related ones present in the data, and c) based on the presence of all the terms, predict the classes to which the data belongs. Using this method, a dataset can be organized around classes/processes and steps, with the corresponding metadata terms describing those classes.
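
As a rough illustration of steps a) and c), and emphatically not Dr. Xu’s actual implementation, the sketch below counts how many of each class’s terms appear in a document’s text and predicts the best-matching class, reusing the toy `model` graph from the previous sketch. Step b), extending the model’s terms to related ones found in the data, might build on co-occurrence statistics or similar measures and is elided here.

```python
def predict_class(document_text, model):
    """Toy version of step c): score each class in the model by how
    many of its terms appear in the document, then pick the best."""
    text = document_text.lower()
    scores = {
        cls: sum(term in text for term in attrs["terms"])
        for cls, attrs in model.nodes(data=True)
    }
    return max(scores, key=scores.get)

# Hypothetical usage: a fragment from a sensor calibration report.
print(predict_class("Accelerometer and pore pressure channels were zeroed.", model))
# -> SensorInstallation
```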

In a CAS project, the archivist defines the problem and gathers the requirements that will shape the deliverables. He or she collaborates with the domain scientists to model the “problem” system, and with the computer scientist to design the algorithm. An interesting aspect is how the method is evaluated by all team members using data-driven and qualitative methods. Using the data model as the ground truth, we assess whether data is correctly assigned to classes and whether the metadata terms correctly describe the content of the data files. At the same time, as new terms are found in the dataset and the data model is refined, the domain scientist and the archivist review the accuracy of the resulting representation and the generalizability of the solution.
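
On the data-driven side, one slice of that evaluation can be as simple as comparing predicted class assignments against a hand-labeled sample. Again a hypothetical sketch; the file names and labels below are invented:

```python
# Hand-labeled ground truth and algorithm output: file name -> class.
ground_truth = {"sensor_layout.pdf": "SensorInstallation",
                "spin_log_01.txt": "Spin",
                "readings_table.csv": "DataAcquisition"}
predicted    = {"sensor_layout.pdf": "SensorInstallation",
                "spin_log_01.txt": "Spin",
                "readings_table.csv": "Spin"}

correct = sum(predicted[f] == ground_truth[f] for f in ground_truth)
print(f"Class-assignment accuracy: {correct / len(ground_truth):.0%}")  # 67%
```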

I look forward to hearing reactions to this work and about research perspectives and experiences from others in this space.

____
Dr. Maria Esteva is a researcher and data archivist/curator at the Texas Advanced Computing Center at the University of Texas at Austin. She conducts research on, and implements, large-scale archival processing and data curation systems against a backdrop of High Performance Computing infrastructure resources. Her email is: maria@tacc.utexas.edu

 

#snaprt chat Flashback: Archivist and Technologist Collaboration

By Ariadne Rehbein

This is a cross post in coordination with the SAA Students and New Archives Professionals Roundtable.

The spirit of community at the 2016 Code4Lib Conference in Philadelphia (March 7-10) served as inspiration for a recent SAA Students and New Archives Professionals Roundtable #snaprt Twitter chat. The conference was an exciting opportunity for archivists and librarians to learn about digital tools and projects that are free to use and open for further development, discuss needs for different technology solutions, gain a deeper understanding of technology work, and engage with larger cultural and technical issues within libraries and archives. SNAP’s Senior Social Media Coordinator hosted the chat on March 15, focusing the discussion on collaboration between archivists and technologists.

Many of the chat questions were influenced by discussions in the Code4Archives preconference workshop breakout group, “Whose job is that? Sharing how your team breaks down archives ‘tech’ work.” On the last day of the conference, SNAP invited participants through different Code4Lib and Society of American Archivists channels, such as the conference hashtag (#c4l16), the Code4Lib listserv, various SAA listservs, and the SNAP Facebook and Twitter accounts. All were invited to share suggestions or discussion questions for the chat. Participants included archives students and professionals with varying years of experience and areas of focus, such as digital curation, special collections, university archives, and government archives. Our chat questions were:

  • How do the expertise and knowledge of archivists and technologists who work together overlap or differ? How much of one another’s work is it important to understand? What are some ways to increase this knowledge?
  • What are some examples of technologies that archives currently use? What are their goals, and what are they used to do?
  • Who created and maintains these tools? Why might an archive choose one tool over another?
  • What kinds of tools and tech skills have new archivists learned post-LIS? What is this learning process like?
  • What are some examples of tasks or projects in an archival setting where the expertise of technologists is essential or extremely helpful? Please share any tips from these experiences.
  • Do you know of any blogs or posts that are helpful for born-digital preservation, AV preservation, or digitized-content workflows?

Several different themes emerged in the chat:

  • The importance of an environment that supports relationships between those of different backgrounds and skills. Participants suggested developing a shared vocabulary to clearly convey information and providing casual opportunities to meet.
  • The decision to implement a technology solution to serve a need may involve a variety of considerations, such as level of institutional priority, cost, availability of technology professionals to manage or build the system, security, and applicability to other needs.
  • Participants suggested that students gain skills with a variety of different technologies, including relational databases, command line basics, Photoshop, VirtualBox, BitCurator, and programming (through online tutorials). The ability and willingness to learn on the job and teach others is important too! These are useful tools and may also help build a shared vocabulary.
  • Participants had engaged in a number of collaborative tasks or projects, such as performing digital forensics, building DIY History at the University of Iowa, implementing systems such as Preservica, and determining digital preservation storage solutions.
  • Some great resources are available for born-digital, digitized, and audiovisual preservation, including AV Preserve, the Digital Curation Google Group, the Bitcurator Consortium, The Signal blog, Chris Prom’s Practical E-Records, the Code4Lib listserv, Digital Preservation News, and National Digital Stewardship Residency blog posts.

Please visit Storify to read the full chat:

Storify of #snaprt chat about archivists and technologists

Many thanks to Wendy Hagenmaier of the ERS Steering Committee for inviting SNAP to share this post. #snaprt Twitter chats typically take place three times per month, on or around the 5th, 15th, and 25th at 8 PM ET. Participation is open to anyone interested in issues relevant to MLIS students and new archives professionals. To learn more about the chats, please visit our webpage.

Ariadne Rehbein strives to support students and new archives professionals as SNAP Roundtable’s Senior Social Media Coordinator. As Digital Asset Coordinator at the Arizona State University Libraries, she focuses on processing and stewardship of digital special collections and providing expertise on issues related to digital forensics, asset management workflows, and policies in accordance with community standards and best practices. She is a proud graduate of the Department of Information and Library Science at Indiana University Bloomington.

Retention of Technology-Based Interactives

The Cleveland Museum of Art’s Gallery One blends art, technology, and interpretation. It includes real works from the museum’s collection as well as interactive, technology-based activities and games. For example, Global Influences presents visitors with an artwork and asks them to guess which two countries on the map influenced the work in question, and crowd favorite Strike a Pose asks visitors to imitate the pose of a sculpture and invites them to save and share the resulting photograph.

It’s really cool stuff. But as the museum plans a refresh of the space, the archives and IT departments are starting to contemplate how to preserve the history of Gallery One. The interactives will have to go, monitors and other hardware will be repurposed, and new artwork and interactive experiences will be installed. We need to decide what to retain in the archives and figure out how to collect and preserve whatever we decide to keep.

These pending decisions bring up familiar archival questions and ask us to apply them to complex digital materials: What about this gallery installation has enduring value? Is it enough to retain a record of the look and feel of the space, perhaps by creating videos of the interactives? Is it necessary to retain and preserve all of the code?

Records retention schedules call for the permanent retention of gallery labels, exhibition photographs, and other exhibition records but do not specifically address technology-based interactives.  The museum is developing an institutional repository for digital preservation using Fedora, but we are still in the testing phases for relatively simple image collections and we aren’t ready to ingest complex materials like the interactives from Gallery One.

As we work through these issues I would be grateful for input from the archives community. How do we go about this? Does anyone have experience with the retention and preservation of technology-based interactives?

Susan Hernandez is the Digital Archivist and Systems Librarian at the Cleveland Museum of Art. Her responsibilities include accessioning and preserving the museum’s electronic records; overseeing library and archives databases and systems; developing library and archives digitization programs; and serving on the development team for the museum’s institutional repository. Leave a comment or contact her directly at shernandez@clevelandart.org.

It May Work In Theory . . . Getting Down to Earth with Digital Workflows

Recently, Joe Coen, archivist at the Roman Catholic Diocese of Brooklyn, posted this to the ERS listserv:

I’m looking to find out what workflows you have for ingest of electronic records and what tools you are using for doing checksums, creating a wrapper for the files and metadata, etc. We will be taking in electronic and paper records from a closed high school next month and I want to do as much as I can according to best practices.

I’d appreciate any advice and suggestions you can give.
“OK. I’ve connected the Fedora-BagIt-Library-Sleuthkit to the FTK-Bitcurator-Archivematica instance using the Kryoflux-Writeblocker-Labelmaker . . . now what?” (photo by e-magic, https://www.flickr.com/photos/emagic/51069522/).
Joe said a couple of people responded to his question directly, but that means we’ve missed an opportunity to learn, as a community, about the actual practices of other archivists working with digital materials.

There are a lot of different archivists working with electronic records—some are administrators, some are temps, some are lone arrangers, some are programmers, some are born digital archivists and some have digital archivy thrust upon them—and this diversity of interests and viewpoints is, to my mind, an untapped resource.

There are so many helpful articles and white papers out there offering general guidance and warning of common pitfalls, but sometimes, when you’re trying to cobble together an ingest workflow or planning a site visit, you just think, “Yeah, but how do I actually do this?”

Why don’t we do that here?
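
To start the ball rolling on the checksum-and-wrapper piece of Joe’s question, here is one minimal sketch using the Library of Congress’s bagit-python package (pip install bagit), which wraps a directory in a BagIt bag and generates checksum manifests. The path and metadata values are hypothetical placeholders, and this is one possible approach, not a prescribed workflow:

```python
import bagit

# Wrap a transfer directory in place as a BagIt bag, generating
# sha256 and md5 manifests for every payload file. The path and
# metadata below are hypothetical placeholders.
bag = bagit.make_bag(
    "/transfers/closed-high-school",
    {"Source-Organization": "Example Archives",
     "External-Description": "Electronic records from a closed high school"},
    checksums=["sha256", "md5"],
)

# Later, confirm nothing has changed since ingest.
bag = bagit.Bag("/transfers/closed-high-school")
try:
    bag.validate()
    print("Bag is valid.")
except bagit.BagValidationError as err:
    print(f"Validation failed: {err}")
```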

If you’ve got links to ingest workflows, transfer guidelines, in-house best practices, digital materials surveys, or any other formal or informal procedures that just might maybe, kinda, one day be helpful to another archivist, why not post or describe them in the comments?

I know I’ve often scoured the Internet for similar advice only to find it in a comment to a blog post.