PASIG (Preservation and Archiving Special Interest Group) 2019 Recap

by Kelly Bolding

PASIG 2019 met the week of February 11th at El Colegio de México (commonly known as Colmex) in Mexico City. PASIG stands for Preservation and Archiving Special Interest Group, and its meetings bring together an international mix of practitioners, industry experts, vendors, and researchers to discuss practical digital preservation topics and approaches. This meeting was particularly special because it was the first time the group convened in Latin America (past meetings have generally been held in Europe and the United States). Excellent real-time bilingual translation of presentations given in English and Spanish enabled conversations across geographical and linguistic boundaries and made room to center Latin American preservationists’ perspectives and transformative post-custodial archival practice.

Perla Rodriguez of the Universidad Nacional Autónoma de México (UNAM) discusses an audiovisual preservation case study.

The conference began with broad overviews of digital preservation topics and tools to create a common starting ground, followed by more focused deep-dives on subsequent days. I saw two major themes emerge over the course of the week. The first was the importance of people over technology in digital preservation. From David Minor’s introductory session to Isabel Galina Russell’s overview of the digital preservation landscape in Mexico, presenters continuously surfaced examples of the “people side” of digital preservation (think: preservation policies, appraisal strategies, human labor and decision-making, keeping momentum for programs, communicating to stakeholders, ethical partnerships). One point that struck me during the community archives session was Verónica Reyes-Escudero’s discussion of “cultural competency as a tool for front-end digital preservation.” By conceptualizing interpersonal skills as a technology for facilitating digital preservation, we gain a broader and more ethically grounded idea of what it is we are really trying to do by preserving bits in the first place. Software and hardware are part of the picture, but they are certainly not the whole view.

The second major theme was that digital preservation is best done together. Distributed digital preservation platforms, consortial preservation models, and collaborative research networks were well represented by speakers from LOCKSS, Texas Digital Library (TDL), DuraSpace, Open Preservation Foundation, Software Preservation Network, and others. The takeaway from these sessions was that the sheer resource-intensiveness of digital preservation means that institutions, both large and small, will have to collaborate in order to achieve their goals. PASIG seemed to be a place where attendees could foster and strengthen these collective efforts. Throughout the conference, presenters also highlighted failures of collaborative projects and the need for sustainable financial and governance models, particularly in light of recent developments at the Digital Preservation Network (DPN) and Digital Public Library of America (DPLA). I was particularly impressed by Mary Molinaro’s honest and informative discussion of the factors that led to the shuttering of DPN. Molinaro indicated that DPN would soon publish a final report in order to transparently share its model, flaws and all, with the broader community.

Touching on both of these themes, Carlos Martínez Suárez of Video Trópico Sur gave a moving keynote about his collaboration with Natalie M. Baur, Preservation Librarian at Colmex, to digitize and preserve video recordings he made while living with indigenous groups in the Mexican state of Chiapas. The question and answer portion of this session highlighted some of the ethical issues surrounding rights and consent when providing access to intimate documentation of people’s lives. While Colmex is not yet focusing on access to this collection, it was informative to hear Baur and others talk a bit about the ongoing technical, legal, and ethical challenges of a work-in-progress collaboration.

Presenters also provided some awesome practical tools for attendees to take home with them. One of the many great open resources session leaders shared was Frances Harrell (NEDCC) and Alexandra Chassanoff (Educopia)’s DigiPET: A Community Built Guide for Digital Preservation Education + Training Google document, a living resource for compiling educational tools that you can add to using this form. Julian Morley also shared a Preservation Storage Cost Model Google sheet that contains a template with a wealth of information about estimating the cost of different digital preservation storage models, including comparisons for several cloud providers. Amy Rudersdorf (AVP), Ben Fino-Radin (Small Data Industries), and Frances Harrell (NEDCC) also discussed helpful frameworks for conducting self-assessments.

Selina Aragon, Daina Bouquin, Don Brower, and Seth Anderson discuss the challenges of software preservation.

PASIG closed out by spending some time on the challenges involved with preserving emerging and complex formats. On the last afternoon of sessions, Amelia Acker (University of Texas at Austin) spoke about the importance of preserving APIs, terms of service, and other “born-networked” formats when archiving social media. She was followed by a panel of software preservationists who discussed different use cases for preserving binaries, source code, and other software artifacts.

Conference slides are all available online.

Thanks to the wonderful work of the PASIG 2019 steering, program, and local arrangements committees!


Kelly Bolding is the Project Archivist for Americana Manuscript Collections at Princeton University Library, as well as the team leader for bloggERS! She is interested in developing workflows for processing born-digital and audiovisual materials and making archival description more accurate, ethical, and inclusive.


GVSU Scripted IR Curation in the Cloud

by Matt Schultz and Kyle Felker

This is the seventh post in the bloggERS Script It! Series.


In 2016, bepress launched a new service to assist subscribers of Digital Commons with archiving the contents of their institutional repository. Using Amazon Web Services (AWS), the service pushes daily updates of an institution’s repository contents to an Amazon Simple Storage Service (S3) bucket hosted by the institution. Subscribers thus control a real-time copy of all their institutional repository content outside of bepress’s Digital Commons platform. They can download individual files or the entire repository from this S3 bucket, on their own schedules and for whatever purposes they deem appropriate.
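As a rough illustration of what working with that institution-hosted bucket looks like, the snippet below lists and downloads content using the boto3 library. The bucket name and object key are placeholders, and bepress's actual bucket layout may differ; this is a hedged sketch, not part of the bepress service itself.

```python
import boto3  # assumes AWS credentials for the institution-hosted bucket are already configured

s3 = boto3.client("s3")
BUCKET = "example-digitalcommons-archive"  # placeholder bucket name

# List what bepress has pushed so far, then pull down a single file on demand.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])

# Download one repository file (the key is illustrative).
s3.download_file(BUCKET, "etd/dissertation-001.pdf", "dissertation-001.pdf")
```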

Grand Valley State University (GVSU) Libraries makes use of Digital Commons for their ScholarWorks@GVSU institutional repository. Using bepress’s new archiving service has given GVSU an opportunity to perform scripted automation in the Cloud using the same open source tools that they use in their regular curation workflows. These tools include Brunnhilde, which bundles together ClamAV and Siegfried to produce file format, fixity, and virus check reports, as well as BagIt for preservation packaging.

Leveraging the ease of launching Amazon’s EC2 server instances and the AWS command line interface (CLI), GVSU Libraries readily configured an EC2 “curation server” that syncs copies of their Digital Commons data directly from S3. On that server, the tools mentioned above build preservation packages and send them back to S3 and to Glacier for nearline and long-term storage. The entire process now begins and ends in the Cloud, rather than requiring data to be downloaded to a local workstation for processing.

Creating a digital preservation copy of GVSU’s ScholarWorks files involves four steps (a condensed sketch in code follows the list):

  1. Syncing data from S3 to EC2:  No processing can actually be done on files in-place on S3, so they must be copied to the EC2 “curation server”. As part of this process, we wanted to reorganize the files into logical directories (alphabetized), so that it would be easier to locate and process files, and to better organize the virus reports and the “bags” the process generates.
  2. Virus and format reports with Brunnhilde:  The synced files are then run through Brunnhilde’s suite of tools.  Brunnhilde will generate both a command-line summary of problems found and detailed reports that are deposited with the files.  The reports stay with the files as they move through the process.
  3. Preservation Packaging with BagIt: Once the files are checked, they need to be put in “bags” for storage and archiving using the BagIt tool.  This will bundle the files in a data directory and generate metadata that can be used to check their integrity.
  4. Syncing files to S3 and Glacier: Checked and bagged files are then moved to a different S3 bucket for nearline storage.  From that bucket, we have set up automated processes (“lifecycle management” in AWS parlance) to migrate the files on a quarterly schedule into Amazon Glacier, our long-term storage solution.
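The sketch below compresses those four steps into a few lines. The actual gvsulib/curationscripts implementation is a shell script with logging and notifications; this Python version is only illustrative, and the bucket names, paths, and Brunnhilde invocation are assumptions.

```python
import subprocess
import bagit  # pip install bagit

SOURCE_BUCKET = "s3://example-bepress-archive"    # placeholder: bucket bepress pushes to
NEARLINE_BUCKET = "s3://example-nearline-bucket"  # placeholder: destination bucket
WORKDIR = "/curation/scholarworks"                # working directory on the EC2 curation server

# 1. Sync data from S3 to the EC2 "curation server".
subprocess.run(["aws", "s3", "sync", SOURCE_BUCKET, WORKDIR], check=True)

# 2. Virus and format reports with Brunnhilde (ClamAV and Siegfried under the hood).
subprocess.run(["brunnhilde.py", WORKDIR, WORKDIR + "-reports"], check=True)

# 3. Preservation packaging: convert the working directory into a BagIt "bag" in place.
bagit.make_bag(WORKDIR, checksums=["md5", "sha256"])

# 4. Sync the checked, bagged files to the nearline S3 bucket; an S3 lifecycle rule
#    then migrates them to Glacier on a quarterly schedule.
subprocess.run(["aws", "s3", "sync", WORKDIR, NEARLINE_BUCKET], check=True)
```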

After the first complete run, new files are incorporated and re-synced to the BagIt data directories on a quarterly basis and re-checked with Brunnhilde.  The BagIt metadata must then be updated and re-verified using BagIt, and the changes synced to the destination S3 bucket.

Running all these tools in sequence manually using the command line interface is both tedious and time-consuming. We chose to automate the process using a shell script. Shell scripts are relatively easy to write, and are designed to automate command-line tasks that involve a lot of repetitive work (like this one).

These scripts can be found in our GitHub repo: https://github.com/gvsulib/curationscripts

Process_backup is the main script. It handles each of the four processing stages outlined above.  As it does so, it stores the output of those tasks in log files so they can be examined later.  In addition, it emails notifications to our task management system (Asana) so that our curation staff can check on the process.

After the first time the process is run, the metadata that BagIt generates has to be updated to reflect new data.  The version of BagIt we are using (bagit-python) can’t do this from the command line, but its API can update existing “bag” metadata. So we created a small Python script to do this (regen_BagIt_manifest.py).  The shell script invokes this script at the third stage if bags have previously been created.
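For reference, the core of such a script can be very small with bagit-python. The path below is a placeholder; the real regen_BagIt_manifest.py lives in the GVSU repo linked above.

```python
import bagit

# Open an existing bag whose payload has changed since it was created (path is illustrative).
bag = bagit.Bag("/curation/scholarworks")

# Recompute the payload and tag manifests so they reflect the new and changed files.
bag.save(manifests=True)

# Optionally confirm the refreshed bag is internally consistent before re-syncing it.
print(bag.is_valid())
```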

Finally, the update.sh script automatically updates all the tools used in the process and emails curation staff when the process is done. We then schedule the scripts to run automatically using the Unix cron utility.

GVSU Libraries is now hard at work on a final bagit_check.py script that will facilitate spot retrieval of the most recent version of a “bag” from the S3 nearline storage bucket and perform a validation audit using BagIt.
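A hedged sketch of what that spot check could look like, again with placeholder bucket and path names rather than the actual bagit_check.py code:

```python
import subprocess
import bagit

BAG_PREFIX = "s3://example-nearline-bucket/scholarworks"  # placeholder location of the stored bag
LOCAL_COPY = "/tmp/scholarworks-audit"

# Pull the most recent copy of the bag down from nearline storage...
subprocess.run(["aws", "s3", "sync", BAG_PREFIX, LOCAL_COPY], check=True)

# ...then run a full validation audit: checksums, payload completeness, and bag structure.
bag = bagit.Bag(LOCAL_COPY)
bag.validate()  # raises bagit.BagValidationError if any file is missing or altered
print("Bag validated successfully")
```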


Matt Schultz is the Digital Curation Librarian for Grand Valley State University Libraries. He provides digital preservation for the Libraries’ digital collections, and offers support to faculty and students in the areas of digital scholarship and research data management. He has the unique displeasure of working on a daily basis with Kyle Felker.

Kyle Felker was born feral in the swamps of Louisiana, and a career in technology librarianship didn’t do much to domesticate him.  He currently works as an Application Developer at GVSU Libraries.

DLF Forum & Digital Preservation 2017 Recap

By Kelly Bolding


The 2017 DLF Forum and NDSA’s Digital Preservation took place this October in Pittsburgh, Pennsylvania. Each year the DLF Forum brings together a variety of digital library practitioners, including librarians, archivists, museum professionals, metadata wranglers, technologists, digital humanists, and scholars in support of the Digital Library Federation’s mission to “advance research, learning, social justice, and the public good through the creative design and wise application of digital library technologies.” The National Digital Stewardship Alliance follows up the three-day main forum with Digital Preservation (DigiPres), a day-long conference dedicated to the “long-term preservation and stewardship of digital information and cultural heritage.” While there were a plethora of takeaways from this year’s events for the digital archivist community, for the sake of brevity, this recap will focus on a few broad themes, followed by some highlights related to electronic records specifically.

As an early career archivist and a first-time DLF/DigiPres attendee, I was impressed by the DLF community’s focus on inclusion and social justice. While technology was central to all aspects of the conference, the sessions centered the social and ethical aspects of digital tools in a way that I found both refreshing and productive. (The theme for this year’s DigiPres was, in fact, “Preservation is Political.”) Rasheedah Phillips, a Philadelphia-based public interest attorney, activist, artist, and science fiction writer, opened the forum with a powerful keynote about the Community Futures Lab, a space she co-founded and designed around principles of Afrofuturism and Black Quantum Futurism. By presenting an alternate model of archiving deeply grounded in the communities affected, Phillips’s talk and Q&A responses brought to light an important critique of the restrictive nature of archival repositories. I left Phillips’s talk thinking about how we might allow the liberatory “futures” she envisions to shape how we design online spaces for engaging with born-digital archival materials, as opposed to modeling these virtual spaces after the physical reading rooms that have alienated many of our potential users.

Other conference sessions echoed Phillips’s challenge to archivists to better engage and center the communities they document, especially those who have been historically marginalized. Ricky Punzalan noted in his talk on access to dispersed ethnographic photographs that collaboration with documented communities should now be a baseline expectation for all digital projects. Rosalie Lack and T-Kay Sangwand spoke about UCLA’s post-custodial approach to ethically developing digital collections across international borders using a collaborative partnership framework. Martha Tenney discussed concrete steps taken by archivists at Barnard College to respect the digital and emotional labor of students whose materials the archives is collecting to fill in gaps in the historical record.

Eira Tansey, Digital Archivist and Records Manager at the University of Cincinnati and organizer for Project ARCC, gave her DigiPres keynote about how our profession can develop an ethic of environmental justice. Weaving stories about the environmental history of Pittsburgh throughout her talk, Tansey called for archivists to commit firmly to ensuring the preservation and usability of environmental information. Related themes of transparency and accountability in the context of preserving and providing access to government and civic data (which is nowadays largely born-digital) ran throughout the conference sessions. Regarding advocacy and awareness initiatives, Rachel Mattson and Brandon Locke spoke about Endangered Data Week, and several sessions discussed the PEGI Project. Others presented on the challenges of preserving born-digital civic and government information, including how federal institutions and smaller universities are tackling digital preservation given their often limited budgets, as well as how repositories are acquiring and preserving born-digital congressional records.

Collaborative workflow development for born-digital processing was another theme that emerged in a variety of sessions. Annalise Berdini, Charlie Macquarie, Shira Peltzman, and Kate Tasker, all digital archivists representing different University of California campuses, spoke about their process of coming together to create a standardized set of UC-wide guidelines for describing born-digital materials. Representatives from the OSSArcFlow project also presented some initial findings regarding their research into how repositories are integrating open source tools including BitCurator, Archivematica, and ArchivesSpace within their born-digital workflows; they reported on concerns about the scalability of various tools and standards, as well as desires to transition from siloed workflows to a more holistic approach and to reduce the time spent transforming the output of one tool to be compatible with another tool in the workflow. Elena Colón-Marrero of the Computer History Museum’s Center for Software History provided a thorough rundown of building a software preservation workflow from the ground up, from inventorying software and establishing a controlled vocabulary for media formats to building a set of digital processing workstations, developing imaging workflows for different media formats, and eventually testing everything out on a case study collection (and she kindly placed her whole talk online!).

Also during the forum, the DLF Born-Digital Access Group met over lunch for an introduction and discussion. The meeting was well-attended, and the conversation was lively as members shared their current born-digital access solutions, both pretty and not so pretty (but never perfect); their wildest hopes and dreams for future access models; and their ideas for upcoming projects the group could tackle together. While technical challenges certainly figured into the discussion about impediments to providing better born-digital access, many of the problems participants reported had to do with their institutions being unwilling to take on perceived legal risks. The main action item that came out of the meeting is that the group plans to take steps to expand NDSA’s Levels of Preservation framework to include Levels of Access, as well as corresponding tiers of rights issues. The goal would be to help archivists assess the state of existing born-digital access models at their institutions, as well as give them tools to advocate for more robust, user-friendly, and accessible models moving forward.

For additional reports on the conference, reflections from several DLF fellows are available on the DLF blog. In addition to the sessions I mentioned, there are plenty more gems to be found in the openly available community notes (DLF, DigiPres) and OSF Repository of slides (DLF, DigiPres), as well as in the community notes for the Liberal Arts Colleges/HBCU Library Alliance unconference that preceded DLF.


Kelly Bolding is a processing archivist for the Manuscripts Division at Princeton University Library, where she is responsible for the arrangement and description of early American history collections and has been involved in the development of born-digital processing workflows. She holds an MLIS from Rutgers University and a BA in English Literature from Reed College.

OSS4Pres 2.0: Developing functional requirements/features for digital preservation tools

By Heidi Elaine Kelly

____

This is the final post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing functional requirements/features for OSS tools the community would like to see built/developed (e.g. tools that could be used during ‘pre-ingest’ stage).”

The Functional Requirements for New Tools and Features Group of the OSS4Pres workshop aimed to write user stories focused on new features that developers can build out to better support digital preservation and archives work. The group was composed largely of practitioners who work with digital curation tools regularly, and was facilitated by Carl Wilson of the Open Preservation Foundation. While their work largely involved writing user stories for development, the group also came up with requirement lists for specific areas of tool development, outlined below. We hope that these lists help continue to bridge the gap between digital preservation professionals and open source developers by providing a deeper perspective on user needs.

Basic Requirements for Tools:

  • Mostly needed for Mac environment
  • No software installation on donor computer
  • No software dependencies requiring installation (e.g., Java)
  • Must be GUI-based, as most archivists are not skilled with the command line
  • Graceful failure

Descriptive Metadata Extraction Needs (using Apache Tika; see the sketch after this list):

  • Archival date
  • Author
  • Authorship location
  • Subject location
  • Subject
  • Document type
  • Removal of spelling errors to improve extracted text
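As a point of reference, the Python bindings for Apache Tika can already pull some of these fields. The sketch below is illustrative only: field names vary by format, the file path is a placeholder, and the bindings shell out to a Java Tika server, which conflicts with the no-Java requirement above.

```python
from tika import parser  # pip install tika; starts a local Tika server (Java) on first use

parsed = parser.from_file("example-donor-file.docx")  # placeholder file
metadata = parsed.get("metadata", {}) or {}
text = parsed.get("content", "") or ""

# Authors and dates come back when the source format records them; subject and
# authorship locations from the wish list above would still need custom NLP.
print(metadata.get("Author") or metadata.get("dc:creator"))
print(metadata.get("Last-Modified"))
print(text[:500])  # extracted text, e.g. for keyword extraction or spelling cleanup
```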

Technical Metadata Extraction Needs:

  • All datetime information available should be retained (minimum of LastModified Date)
  • Technical manifest report
  • File permissions and file ownership permissions
  • Information about the tool that generated the technical manifest report:
    • tool – name of the tool used to gather the disk image
    • tool version – the version of the tool
    • signature version – if the tool uses ‘signatures’ or other add-ons, e.g. which virus scanner software signature – such as signature release July 2014 or v84
    • datetime process run – the datetime information of when the process ran (usually tools will give you when the process was completed) – for each tool that you use

Data Transfer Tool Requirements:

  • Run from portable external device
  • Bag-It standard compliant (build into a “bag”)
  • Able to select a subset of data – not disk image the whole computer
  • GUI-based tool
  • Original file name (also retained in tech manifest)
  • Original file path (also retained in tech manifest)
  • Directory structure (also retained in tech manifest)
  • Address these issues in filenames (record the actual filename in the tech manifest): diacritics (e.g. naïve), illegal characters ( \ / : * ? “ < > | ), spaces, em dashes, en dashes, missing file extensions, excessively long file and folder names, etc.
  • Possibly able to connect to “your” FTP site or cloud storage and send the data there when ready for transfer

Checksum Verification Requirements (a minimal sketch follows this list):

  • File-by-file checksum hash generation
  • Ability to validate the contents of the transfer
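A minimal example of both requirements, using only the Python standard library (the transfer path and choice of SHA-256 are assumptions):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """File-by-file checksum generation, hashing in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Generate a manifest at transfer time...
transfer = Path("transfer")  # placeholder transfer directory
manifest = {str(p): file_sha256(p) for p in transfer.rglob("*") if p.is_file()}

# ...and validate the contents later by recomputing and comparing.
mismatches = [p for p, digest in manifest.items() if file_sha256(Path(p)) != digest]
print("All files verified" if not mismatches else f"Changed or corrupted: {mismatches}")
```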

Reporting Requirements:

  • Ability to highlight/report on possibly problematic files/folders in a separate file

Testing Requirements:

  • Access to test corpora, with known issues, for testing tools

Smart Selection & Appraisal Tool Requirements (a toy sketch follows this list):

  • DRM/TPMs detection
  • Regular expressions/fuzzy logic for finding certain terms – e.g. phone numbers, Social Security numbers, and other predefined personal data
  • Blacklisting of files – configurable list of blacklist terms
  • Shortlisting a set of “questionable” files based on parameters that could then be flagged for a human to do further QA/QC
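To make the regular-expression and blacklist ideas concrete, here is a toy screening pass. The patterns are deliberately simple and the term list is invented; production tools built for this kind of scanning (bulk_extractor, for example) are far more thorough.

```python
import re
from pathlib import Path

# Illustrative patterns only; real personal-data detection needs far more robust rules.
PATTERNS = {
    "phone number": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
BLACKLIST = {"confidential", "do not distribute"}  # configurable blacklist terms (placeholders)

def flag_file(path: Path) -> list[str]:
    """Shortlist 'questionable' files so a human can do further QA/QC."""
    text = path.read_text(errors="ignore").lower()
    hits = [label for label, rx in PATTERNS.items() if rx.search(text)]
    hits += [term for term in BLACKLIST if term in text]
    return hits

transfer = Path("transfer")  # placeholder directory of files to appraise
questionable = {str(p): flag_file(p) for p in transfer.rglob("*.txt")}
questionable = {p: hits for p, hits in questionable.items() if hits}
print(questionable)
```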

Specific Features Needed by the Community:

  • Gathering/generating quantitative metrics for web harvests
  • Mitigation strategies for FFmpeg obsolescence
  • Tesseract language functionality

____

Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.

OSS4Pres 2.0: Sharing is Caring: Developing an online community space for sharing workflows

By Sam Meister

____

This is the third post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “developing requirements for an online community space for sharing workflows, OSS tool integrations, and implementation experiences.” See our other posts for information on the groups that focused on feature development and design requirements for FOSS tools.

Cultural heritage institutions, from small museums to large academic libraries, have made significant progress developing and implementing workflows to manage local digital curation and preservation activities. Many institutions are at different stages in the maturity of these workflows. Some are just getting started, and others have had established workflows for many years. Documentation assists institutions in representing current practices and functions as a benchmark for future organizational decision-making and improvements. Additionally, sharing documentation assists in creating cross-institutional understanding of digital curation and preservation activities and can facilitate collaborations amongst institutions around shared needs.

One of the most commonly voiced recommendations from iPRES 2015 OSS4PRES workshop attendees was the desire for a centralized location for technical and instructional documentation, end-to-end workflows, case studies, and other resources related to the installation, implementation, and use of OSS tools. This resource could serve as a hub that would enable practitioners to freely and openly exchange information, user requirements, and anecdotal accounts of OSS initiatives and implementations.

At the OSS4Pres 2.0 workshop, the group of folks looking at developing an online space for sharing workflows and implementation experience started by defining a simple goal and deliverable for the two hour session:

Develop a list of minimal levels of content that should be included in an open online community space for sharing workflows and other documentation

The group then began developing this list of minimal levels by thinking about the potential value of user stories in informing them. We spent a bit of time proposing a short list of user stories, just enough to provide some insight into the basic structures that would be needed for sharing workflow documentation.

User stories

  • I am using tool 1 and tool 2 and want to know how others have joined them together into a workflow
  • I have a certain type of data to preserve and want to see what workflows other institutions have in place to preserve this data
  • There is a gap in my workflow — a function that we are not carrying out — and I want to see how others have filled this gap
  • I am starting from scratch and need to see some example workflows for inspiration
  • I would like to document my workflow and want to find out how to do this in a way that is useful for others
  • I would like to know why people are using particular tools – is there evidence that they tried another tool, for example, that wasn’t successful?

The group then proceeded to define a workflow object as a series of workflow steps with its own attributes, a visual representation, and organizational context:

  • Workflow step
    • Title / name
    • Description
    • Tools / resources
    • Position / role
  • Visual workflow diagram / model
  • Organizational context
    • Institution type
    • Content type

Next, we started to draft out the different elements that would be part of an initial minimal level for workflow objects (a rough code sketch of such an object follows the list):

Level 1:

  • Title
  • Description
  • Institution / organization type
  • Contact
  • Content type(s)
  • Status
  • Link to external resources
  • Download workflow diagram objects
  • Workflow concerns / reflections / gaps
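To make the template concrete, here is one possible way the workflow object and its Level 1 elements could be encoded as a data structure. This is purely illustrative: the community space described below was ultimately built as COPTR wiki pages and templates, not code, and every field name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    title: str
    description: str
    tools: list[str] = field(default_factory=list)  # tools / resources used in the step
    role: str = ""                                   # position / role responsible

@dataclass
class Workflow:
    # Level 1 elements (illustrative field names)
    title: str
    description: str
    institution_type: str
    contact: str
    content_types: list[str]
    status: str = "draft"
    diagram_link: str = ""   # link to the visual workflow diagram / model
    concerns: str = ""       # workflow concerns / reflections / gaps
    steps: list[WorkflowStep] = field(default_factory=list)

example = Workflow(
    title="Born-digital accessioning",
    description="Transfer, scan, and bag incoming born-digital media",
    institution_type="Academic library",
    contact="digital.archivist@example.edu",
    content_types=["disk images"],
    steps=[WorkflowStep("Virus check", "Scan the transfer with ClamAV", ["ClamAV"], "Digital archivist")],
)
```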

After this effort the group focused on discussing next steps and how an online community space for sharing workflows could be realized. This discussion led to a plan to expand COPTR to support sharing of workflow documentation. We outlined a roadmap of next steps toward this goal:

  • Propose / approach COPTR steering group on adding workflows space to COPTR
  • Develop home page and workflow template
  • Add examples
  • Group review
  • Promote / launch
  • Evaluation

The group has continued this work post-workshop and has made good progress setting up a Community Owned Workflows section on COPTR and developing an initial workflow template. We are in the midst of creating and evaluating sample workflows to help with revising and tweaking as needed. Based on this process we hope to launch and start promoting this new online space for sharing workflows in the months ahead. So stay tuned!

____

Sam Meister is the Preservation Communities Manager, working with the MetaArchive Cooperative and BitCurator Consortium communities. Previously, he worked as Digital Archivist and Assistant Professor at the University of Montana. Sam holds a Master of Library and Information Science degree from San Jose State University and a B.A. in Visual Arts from the University of California San Diego. Sam is also an Instructor in the Library of Congress Digital Preservation Education and Outreach Program.

 

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.


Digital Preservation, Eh?

by Alexandra Jokinen

This is the third post in our series on international perspectives on digital preservation.

___

Hello / Bonjour!

Welcome to the Canadian edition of International Perspectives on Digital Preservation. My name is Alexandra Jokinen. I am the new(ish) Digital Archives Intern at Dalhousie University in Halifax, Nova Scotia. I work closely with the Digital Archivist, Creighton Barrett, to aid in the development of policies and procedures for some key aspects of the University Libraries’ digital archives program—acquisitions, appraisal, arrangement, description, and preservation.

One of the ways in which we are beginning to tackle this very large, very complex (but exciting!) endeavour is to execute digital preservation on a small scale, focusing on the processing of digital objects within a single collection, and then using those experiences to create documentation and workflows for different aspects of the digital archives program.

The collection chosen to be our guinea pig was a recent donation of work from esteemed Canadian ecologist and environmental scientist, Bill Freedman, who taught and conducted research at Dalhousie from 1979 to 2015. The fonds is a hybrid of analogue and digital materials dating from 1988 to 2015. Digital media carriers include: 1 computer system unit, 5 laptops, 2 external hard drives, 7 USB flash drives, 5 zip disks, 57 CDs, 6 DVDs, 67 5.25 inch floppy disks and 228 3.5 inch floppy disks. This is more digital material than the archives is likely to acquire in future accessions, but the Freedman collection acted as a good test case because it provided us with a comprehensive variety of digital formats to work with.

Our first area of focus was appraisal. For the analogue material in the collection, this process was pretty straightforward: conduct macro-appraisal and functional analysis by physically reviewing material. However, as could be expected, appraisal of the digital material was much more difficult to complete. The archives recently purchased a forensic recovery of evidence device (FRED) but does not yet have all the necessary software and hardware to read the legacy formats in the collection (such as the floppy disks and zip disks), so we started by investigating the external hard drives and USB flash drives. After examining their content, we were able to get an accurate sense of the information they contained, the organizational structure of the files, and the types of formats created by Freedman. Although we were not able to examine files on the legacy media, we felt that we had enough context to perform appraisal, determine selection criteria, and formulate an arrangement structure for the collection.

The next step of the project will be to physically organize the material. This will involve separating, photographing and reboxing the digital media carriers and updating a new registry of digital media that was created during a recent digital archives collection assessment modelled after OCLC’s 2012 “You’ve Got to Walk Before You Can Run” research report. Then, we will need to process the digital media, which will entail creating disk images with our FRED machine and using forensic tools to analyze the data.  Hopefully, this will allow us to apply the selection criteria used on the analogue records to the digital records and weed out what we do not want to retain. During this process, we will be creating procedure documentation on accessioning digital media as well as updating the archives’ accessioning manual.

The project’s final steps will be to take the born-digital content we have collected and ingest it using Archivematica to create Archival Information Packages for storage and preservation, with access provided via the Archives Catalogue and Online Collections.

So there you have it! We have a long way to go in terms of digital preservation here at Dalhousie (and we are just getting started!), but hopefully our work over the next several months will ensure that solid policies and procedures are in place for maintaining a trustworthy digital preservation system in the future.

This internship is funded in part by a grant from the Young Canada Works Building Careers in Heritage Program, a Canadian federal government program for graduates transitioning to the workplace.

___


Alexandra Jokinen has a Master’s Degree in Film and Photography Preservation and Collections Management from Ryerson University in Toronto. Previously, she has worked as an Archivist at the Liaison of Independent Filmmakers of Toronto and completed a professional practice project at TIFF Film Reference Library and Special Collections.

Connect with me on LinkedIn!

Practical Digital Preservation: In-House Solutions to Digital Preservation for Small Institutions

By Tyler McNally

This post is the tenth post in our series on processing digital materials.

Many archives don’t have the resources to install software or subscribe to a service such as Archivematica, but still have a mandate to collect and preserve born-digital records. Below is a digital-preservation workflow created by Tyler McNally at the University of Manitoba. If you have a similar workflow at your institution, include it in the comments. 

———

Recently I completed an internship at the University of Manitoba’s College of Medicine Archives, working with Medical Archivist Jordan Bass. A large part of my work during this internship involved building digital infrastructure that the archive could use for digital preservation. As a small operation, the archive does not have the resources to pursue a paid or difficult-to-use system.

Originally, our plan was to use the open-source, self-install version of Archivematica, but certain issues that cropped up made this impossible, considering the resources we had at hand. We decided that we would simply make our own digital-preservation workflow, using open-source and free software to convert our files for preservation and access, check for viruses, and create checksums—not every service that Archivematica offers, but enough to get our files stored safely. I thought other institutions of similar size and means might find the process I developed useful in thinking about their own needs and capabilities.


Using NLP to Support Dynamic Arrangement, Description, and Discovery of Born Digital Collections: The ArchExtract Experiment

By Mary W. Elings

This post is the eighth in our Spring 2016 series on processing digital materials.

———

Many of us working with archival materials are looking for tools and methods to support arrangement, description, and discovery of electronic records and born digital collections, as well as large bodies of digitized text. Natural Language Processing (NLP), which uses algorithms and mathematical models to process natural language, offers a variety of potential solutions to support this work. Several efforts have investigated using NLP solutions for analyzing archival materials, including TOME (Interactive TOpic Model and MEtadata Visualization), Ed Summers’ Fondz, and Thomas Padilla’s Woese Collection work, among others, though none have resulted in a major tool for broader use.

One of these projects, ArchExtract, was carried out at UC Berkeley’s Bancroft Library in 2014-2015. ArchExtract sought to apply several NLP tools and methods to large digital text collections and build a web application that would package these largely command-line NLP tools into an interface that would make it easy for archivists and researchers to use.

The ArchExtract project focused on facilitating analysis of the content and, via that analysis, discovery by researchers. The development work was done by an intern from the UC Berkeley School of Information, Janine Heiser, who built a web application that implements several NLP tools, including Topic Modelling, Named Entity Recognition, and Keyword Extraction to explore and present large, text-based digital collections.

The ArchExtract application extracts topics, named entities (people, places, subjects, dates, etc.), and keywords from a given collection. The application automates, implements, and extends various natural language processing software tools, such as MALLET and the Stanford Core NLP toolkit, and provides a graphical user interface designed for non-technical users.
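ArchExtract itself wraps MALLET and the Stanford CoreNLP toolkit behind its web interface, so the snippet below is not its code. It is a minimal stand-in using scikit-learn, just to show what topic modelling over a directory of plain-text files looks like; the corpus path and parameter choices are assumptions.

```python
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load a (hypothetical) collection of plain-text documents.
docs = [p.read_text(errors="ignore") for p in Path("collection").glob("*.txt")]

# Build a document-term matrix, dropping very common and very rare terms.
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
dtm = vectorizer.fit_transform(docs)

# Fit a topic model; the number of topics is a tuning choice, not a given.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Print the top terms per topic, the kind of summary an archivist would skim.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-8:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```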

 

ArchExtract Interface Showing Topic Model Results. Elings/Heiser, 2015.

In testing the application, we found the automated text analysis tools in ArchExtract were successful in identifying major topics, as well as names, dates, and places found in the text, and their frequency, thereby giving archivists an understanding of the scope and content of a collection as part of the arrangement and description process. We called this process “dynamic arrangement and description,” as materials can be re-arranged using different text processing settings so that archivists can look critically at the collection without changing the physical or virtual arrangement.

The topic models, in particular, surfaced documents that may have been related to a topic but did not contain a specific keyword or entity. The process was akin to the sort of serendipity a researcher might achieve when shelf reading in the analog world, wherein you might find what you seek without knowing it was there. And while topic modelling has been criticized for being inexact, it can be “immensely powerful for browsing and isolating results in thousands or millions of uncatalogued texts” (Schmidt, 2012). This, combined with the named entity and keyword extraction, can give archivists and researchers important data that could be used in describing and discovering material.

ArchExtract Interface Showing Named Entity Recognition Results. Elings/Heiser, 2015.

As a demonstration project, ArchExtract was successful in achieving our goals. The code developed is documented and freely available on GitHub to anyone interested in how it was done or who might wish to take it further. We are very excited by the potential of these tools in dynamically arranging and describing large, text-based digital collections, but even more so by their application in discovery. We are particularly pleased that broad, open source projects like BitCurator and ePADD are taking this work forward and will be bringing NLP tools into environments that we can all take advantage of in processing and providing access to our born digital materials.

———

Mary W. Elings is the Principal Archivist for Digital Collections and Head of the Digital Collections Unit of The Bancroft Library at the University of California, Berkeley. She is responsible for all aspects of the digital collections, including managing digital curation activities, the born digital archives program, web archiving, digital processing, mass digitization, finding aid publication and maintenance, metadata, archival information management and digital asset management, and digital initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings co-authored the article “Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums,” and wrote a primer on linked data for LAMs. She has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York (2003-2009) and School of Library and Information Science, Catholic University, Washington, DC (2010-2014), and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program (2010-present).