Announcing the Digital Processing Framework

by Erin Faulder

Development of the Digital Processing Framework began after the second annual Born Digital Archiving eXchange unconference at Stanford University in 2016. There, a group of nine archivists saw a need for standardization, best practices, or general guidelines for processing digital archival materials. What came out of this initial conversation was the Digital Processing Framework (https://hdl.handle.net/1813/57659) developed by a team of 10 digital archives practitioners: Erin Faulder, Laura Uglean Jackson, Susanne Annand, Sally DeBauche, Martin Gengenbach, Karla Irwin, Julie Musson, Shira Peltzman, Kate Tasker, and Dorothy Waugh.

An initial draft of the Digital Processing Framework was presented at the Society of American Archivists’ Annual meeting in 2017. The team received feedback from over one hundred participants who assessed whether the draft was understandable and usable. Based on that feedback, the team refined the framework into a series of 23 activities, each composed of a range of assessment, arrangement, description, and preservation tasks involved in processing digital content. For example, the activity Survey the collection includes tasks like Determine total extent of digital material and Determine estimated date range.

The Digital Processing Framework’s target audience is folks who process born-digital content in an archival setting and are looking for guidance in creating processing guidelines and making level-of-effort decisions for collections. The framework does not include recommendations for archivists looking for specific tools to help them process born-digital material. We draw on language from the OAIS reference model, so users are expected to have some familiarity with digital preservation, as well as with the management of digital collections and with processing analog material.

Processing born-digital materials is often non-linear, requires technical tools that are selected based on unique institutional contexts, and blends terminology and theories from archival and digital preservation literature. Because of these characteristics, the team first defined 23 activities involved in digital processing that could be generalized across institutions, tools, and localized terminology. These activities may be strung together in a workflow that makes sense for your particular institution. They are:

  • Survey the collection
  • Create processing plan
  • Establish physical control over removable media
  • Create checksums for transfer, preservation, and access copies
  • Determine level of description
  • Identify restricted material based on copyright/donor agreement
  • Gather metadata for description
  • Add description about electronic material to finding aid
  • Record technical metadata
  • Create SIP
  • Run virus scan
  • Organize electronic files according to intellectual arrangement
  • Address presence of duplicate content
  • Perform file format analysis
  • Identify deleted/temporary/system files
  • Manage personally identifiable information (PII) risk
  • Normalize files
  • Create AIP
  • Create DIP for access
  • Publish finding aid
  • Publish catalog record
  • Delete work copies of files

Within each activity are a number of associated tasks. For example, tasks identified as part of the Establish physical control over removable media activity include, among others, assigning a unique identifier to each piece of digital media and creating suitable housing for digital media. Taking inspiration from More Product, Less Process (MPLP) and extensible processing methods, the framework assigns each associated task to one of three processing tiers: Baseline, which we recommend as the minimum level of processing for born-digital content; Moderate, which includes tasks that may be done on collections, or parts of collections, considered to have higher value, risk, or access needs; and Intensive, which includes tasks that should only be done for collections with exceptional warrant. In assigning tasks to these tiers, practitioners balance the minimum work needed to adequately preserve the content against the additional work that could be done to support more nuanced user access. When reading the framework, note that a task recommended at the Baseline tier should also be done as part of any higher tier’s work.
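
For institutions that want to encode these tiers in their local processing documentation or tooling, the cumulative rule is easy to express in code. Below is a minimal illustrative sketch in Python; the activity name comes from the example above, but the assignment of individual tasks to tiers is invented here purely for demonstration and is not part of the framework itself.

    # Minimal sketch: encoding cumulative processing tiers.
    # Tier assignments below are illustrative, not the framework's actual mapping.
    TIER_ORDER = ["baseline", "moderate", "intensive"]

    TASKS_BY_TIER = {
        "Establish physical control over removable media": {
            "baseline": ["Assign a unique identifier to each piece of digital media"],
            "moderate": ["Create suitable housing for digital media"],
            "intensive": [],
        },
    }

    def tasks_for_tier(activity, tier):
        # A higher tier always includes the tasks recommended at lower tiers,
        # mirroring the rule that Baseline work is part of any Moderate or
        # Intensive processing effort.
        cutoff = TIER_ORDER.index(tier)
        tasks = []
        for t in TIER_ORDER[: cutoff + 1]:
            tasks.extend(TASKS_BY_TIER[activity][t])
        return tasks

    print(tasks_for_tier("Establish physical control over removable media", "moderate"))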

We designed this framework to be a step towards a shared vocabulary of what happens as part of digital processing and a recommendation of practice, not a mandate. We encourage archivists to explore the framework and use it however it fits in their institution. This may mean re-defining what tasks fall into which tier(s), adding or removing activities and tasks, or stringing tasks into a defined workflow based on tier or common practice. Further, we encourage the professional community to build upon it in practical and creative ways.


Erin Faulder is the Digital Archivist at Cornell University Library’s Division of Rare and Manuscript Collections. She provides oversight and management of the division’s digital collections. She develops and documents workflows for accessioning, arranging and describing, and providing access to born-digital archival collections. She oversees the digitization of analog collection material. In collaboration with colleagues, Erin develops and refines the digital preservation and access ecosystem at Cornell University Library.

Call for Contributions: Making Tech Skills a Strategic Priority

As a follow-up to our popular Script It! Series — which attempted to break down barriers and demystify scripting with walkthroughs of simple scripts — we’re interested in learning more about how archival institutions, as institutions, encourage their archivists to develop and promote their technical literacy more generally. As Trevor Owens notes in his forthcoming book, The Theory and Craft of Digital Preservation, “the scale and inherent structures of digital information suggest working more with a shovel than with a tweezers.” Encouraging archivists to develop and promote their technical literacy is one way to pick up that metaphorical shovel!

Maybe you work for an institution that explicitly encourages its employees to learn new technical skills. Maybe your team or institution has made technical literacy a strategic priority. Maybe you’ve formed a collaborative study group with your peers to learn a programming language. Whatever the case, we want to hear about it!

Writing for bloggERS! “Making Tech Skills a Strategic Priority” Series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in late November or December, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

Electronic Records at SAA 2018

With just weeks to go before the 2018 SAA Annual Meeting hits the capital, here’s a round-up of the sessions that might interest ERS members in particular. This year’s schedule offers plenty for archivists who deal with the digital, tackling current gnarly issues around transparency and access, and format-specific challenges like Web archiving and social media. Other sessions anticipate the opportunities and questions posed by new technologies: blockchain, artificial intelligence, and machine learning.

And of course, be sure to mark your calendars for the ERS annual meeting! This year’s agenda includes lightning talks from representatives from the IMLS-funded OSSArcFlow and Collections as Data projects, and the DLF Levels of Access research group. There will also be a mini-unconference session focused on problem-solving current challenges associated with the stewardship of electronic records. If you would like to propose an unconference topic or facilitate a breakout group, sign up here.

Wednesday, August 15

2:30-3:45

Electronic Records Section annual business meeting (https://archives2018.sched.com/event/ESmz/electronic-records-section)

Thursday, August 16

10:30-11:45

105 – Opening the Black Box: Transparency and Complexity in Digital Preservation (https://archives2018.sched.com/event/ESlh)

12:00-1:15

Open Forums: Preservation of Electronic Government Information (PEGI) Project (https://archives2018.sched.com/event/ETNi/open-forums-preservation-of-electronic-government-information-pegi-project)

Open Forums: Safe Search Among Sensitive Content: Investigating Archivist and Donor Conceptions of Privacy, Secrecy, and Access (https://archives2018.sched.com/event/ETNh/open-forums-safe-search-among-sensitive-content-investigating-archivist-and-donor-conceptions-of-privacy-secrecy-and-access)

1:30-2:30

201 – Email Archiving Comes of Age (https://archives2018.sched.com/event/ESlo/201-email-archiving-comes-of-age)

204 – Scheduling the Ephemeral: Creating and Implementing Records Management Policy for Social Media (https://archives2018.sched.com/event/ESls/204-scheduling-the-ephemeral-creating-and-implementing-records-management-policy-for-social-media)

Friday, August 17

2:00 – 3:00

501 – The National Archives Aims for Digital Future: Discuss NARA Strategic Plan and Future of Archives with NARA Leaders (https://archives2018.sched.com/event/ESmP/501-the-national-archives-aims-for-digital-future-discuss-nara-strategic-plan-and-future-of-archives-with-nara-leaders)

502 – This is not Skynet (yet): Why Archivists should care about Artificial Intelligence and Machine Learning (https://archives2018.sched.com/event/ESmQ/502-this-is-not-skynet-yet-why-archivists-should-care-about-artificial-intelligence-and-machine-learning)

504 – Equal Opportunities: Physical and Digital Accessibility of Archival Collections (https://archives2018.sched.com/event/ESmS/504-equal-opportunities-physical-and-digital-accessibility-of-archival-collections)

508 – Computing Against the Grain: Capturing and Appraising Underrepresented Histories of Computing (https://archives2018.sched.com/event/ESmW/508-computing-against-the-grain-capturing-and-appraising-underrepresented-histories-of-computing)

Saturday, August 18

8:30 – 10:15

605 – Taming the Web: Perspectives on the Transparent Management and Appraisal of Web Archives (https://archives2018.sched.com/event/ESme/605-taming-the-web-perspectives-on-the-transparent-management-and-appraisal-of-web-archives)

606 – Let’s Be Clear: Transparency and Access to Complex Digital Objects (https://archives2018.sched.com/event/ESmf/606-lets-be-clear-transparency-and-access-to-complex-digital-objects)

10:30 – 11:30

704 – Blockchain: What Is It and Why Should You Care (https://archives2018.sched.com/event/ESmo/704-blockchain-what-is-it-and-why-should-we-care)

 

Building Community for Archivematica

By Shira Peltzman, Nick Krabbenhoeft and Max Eckard


In March of 2018, the Archivematica User Forum held the first in an ongoing series of bi-monthly calls for active Archivematica users or stakeholders. Archivematica users (40 total!) from the United States, Canada, and the United Kingdom came together to share project updates and ongoing challenges and begin to work with their peers to identify and define community solutions.

Purpose

The Archivematica user community is large (and growing!), but formal communication channels between Archivematica users are limited. While the Archivematica Google Group is extremely valuable, it has some drawbacks. Artefactual prioritizes their paid support and training services there, and posts seem to focus primarily on announcing new releases or resolving errors. This sets an expectation that communication flows there from Artefactual to Archivematica users, rather than between Archivematica users. Likewise, Archivematica Camps are an exciting development, but at the moment these occur relatively infrequently and require participants to travel. As a result, it can be hard for Archivematica users to find partners and share work.

Enter the Archivematica User Forum. We hope these calls will fill this peer-to-peer communication void! Our goal is to create a space for discussion that will enable practitioners to connect with one another and identify common denominators, issues, and roadblocks that affect users across different organizations. In short, we are hoping that these calls will provide a broader and more dynamic forum for user engagement and support, and ultimately foster a more cohesive and robust user community.

Genesis

The User Forum is not the first group created to connect Archivematica users. Several regional groups already exist; the Texas Archivematica Users Group and the UK Archivematica Users Group (blog of their latest meeting) are amazing communities that meet regularly. But sometimes the people trying to adapt, customize, and improve Archivematica in the same ways you are happen to live in a different time zone.

That situation inspired the creation of this group. After realizing how often relationships would form because someone knew someone who knew someone doing something similar, creating a national forum where everyone had the chance to meet everyone else seemed like the natural choice.

Scope

It takes a lot to build a new community, so we have tried to keep the commitment light. To start with, the forum meets every two months. Second, it’s open to anyone using Archivematica who can make the call, which is held at 9 AM on the West Coast and 12 PM on the East Coast. That includes archivists, technologists, developers, and any other experts actively using or experimenting with Archivematica.

Third, we have some in-scope and out-of-scope topics. In scope is anything that helps us continue to improve our usage of Archivematica: project announcements, bug tracking/diagnosis, desired features, recurring problems or concerns, documentation, checking in on Archivematica implementations, and identifying other users that make use of the same features. Out of scope are topics about getting started with digital preservation or Archivematica. Those are incredibly important topics, but more than this group can take on.

Finally, we don’t have any official relationship with Artefactual Systems. We want to develop a user-led community that can identify areas for improvements and contribute to the long-term development of Archivematica. Part of the development is finding our voice as a community.

Current Activity

As of this blog post, the Archivematica User Forum is two calls in. We’ve discussed project announcements, bug tracking/diagnosis, recurring problems or concerns, desired features (including this Features Request spreadsheet), local customizations, and identifying other users who make use of the same features.

We spent a good deal of time during our first meeting on March 1, 2018 gathering and ranking topics that participants wanted to discuss during these calls, and intend to cover them in future calls. These topics, in order of interest, include:

  • Processing large AIPs (size and number of files) (12 up-votes)
  • Discussing reporting features, workflows, and code (10 up-votes)
  • How ingest is being tracked and QA’ed, both within and without Archivematica (9 up-votes)
  • Automation tools – how are people using them, issues folks are running into, etc. (7 up-votes)
  • How to manage multi-user installations and pipelines (7 up-votes)
  • Types of pipelines/workflows (7 up-votes)
  • Having more granularity in turning micro-services on and off (6 up-votes)
  • Troubleshooting the AIC functionality (3 up-votes)
  • What other types of systems people are using with Archivematica – DPN, etc. (3 up-votes)
  • Are people doing development work outside of Artefactual contracts? (2 up-votes)
  • How to add new micro-services (2 up-votes)
  • How to customize the FPR, how to manage and migrate customizations (2 up-votes)
  • How system architectures impact the throughput of Archivematica (large files, large numbers of files, backup schedules) (1 up-vote)

As you can see, there’s no shortage of potential topics! During that meeting, participants shared a number of development announcements:

  • Dataverse integration as a data source (Scholars Portal);
  • DIP creator for software/complex digital objects via Automation Tools (CCA);
  • reporting – development project to report on file format info via API queries (UCLA/NYPL);
  • turning off indexing to increase pipeline speed (Columbia);
  • micro-service added to post identifier to ArchivesSpace (UH); and
  • micro-service added to write README file to AIP (Denver Art Museum).

During our second meeting on May 3, 2018, we discussed types of pipelines/workflows, as well as how folks decided to adopt another pipeline versus having multiple processing configurations or Storage Service locations. We heard from a number of institutions:

  • NYPL: Uses multiple pipelines – one is for disk images exclusively (they save all disk images even if they don’t end up in the finding aid) and the other is for packages of files associated with finding aid components. They are considering a third pipeline for born-digital video material. Their decision point on adopting a new pipeline is whether different workflows might require different format policies, and therefore different FPRs.
  • RAC: Uses multiple pipelines for digitization, AV, and born-digital archival transfers. Their decision point is based on amount of processing power required for different types of material.
  • Bentley: Uses one pipeline where processing archivists arrange and describe. They are considering a new pipeline with a more streamlined approach to packaging, and are curious when multiple configurations in a single pipeline is warranted versus creating multiple pipelines.
  • Kansas State: Uses two pipelines – one for digitization (images and text) and a second pipeline for special collections material (requires processing).
  • University of Houston: Uses two pipelines – one pipeline for digitization and a second pipeline for born-digital special collections.
  • UT San Antonio: Uses multiple configurations instead of multiple pipelines.

During that call, we also began to discuss the topic of how people deal with large transfers (size or number of files).

Next Call and Future Plans!

We hope you will consider joining us during our next call on July 5, 2018 at 12pm EDT / 9am PDT or at future bi-monthly calls, which are held on the first Thursday of every other month. Call in details are below!

Join from PC, Mac, Linux, iOS or Android:
https://ucla.zoom.us/j/854186191

  • iPhone one-tap (US): +16699006833,854186191# or +16465588656,854186191#
  • Telephone (US): +1 669 900 6833 or +1 646 558 8656
  • Meeting ID: 854 186 191

International numbers available: https://ucla.zoom.us/zoomconference?m=EYLpz4l8KdqWrLdoSAbf5AVRwxXt7OHo


Shira Peltzman is the Digital Archivist at the University of California, Los Angeles Library.

Nick Krabbenhoeft is the Head of Digital Preservation at the New York Public Library.

Max Eckard is the Lead Archivist for Digital Initiatives at the Bentley Historical Library.

 

A Day in Review: Personal Digital Archiving Conference Recap

by Valencia Johnson

The Personal Digital Archiving conference, which took place April 23-25, 2018, was hosted by the University of Houston. The conference brought together a mixture of archival professionals, librarians, entrepreneurs, and self-taught memory workers. The recurrent theme this year, at least from the perspective of a newcomer, was personal digitization. Each demographic offered battle-tested advice for digitization and digital preservation. From these personal testimonies, several questions occurred to me and other conference attendees. How is the digital world transforming memory and identity? How can the archival community improve the accessibility of tools and knowledge necessary to create and manage digital cultural heritage? What does it look like when we work with people instead of working for people? If these questions ring a post-modernism bell in your mind, then you are on the right path.

Each presentation touched upon the need within communities to preserve their history for one reason or another. The residents of Houston are in some ways still recovering from Hurricane Harvey; institutions and homes were flooded, and pictures and home videos were lost to the Gulf. Yet, through this disaster the Houston community is finding ways to rebuild and recapture a small piece of what was lost. Lisa Spiro opened the first day of the conference with her presentation “Creating a Community-Driven Digital Archive: The Harvey Memories Project.” This archive aims to document the experience of the survivors of Harvey and offer an additional personal narrative to the official record of the disaster. Expected to launch in August 2018, the first anniversary of Hurricane Harvey, the project is being built by community members and is something to keep an eye out for.

The following session brought together multiple presenters diving into community archives. Presentations covered researchers Ana Roeschley and Jeonghun (Annie) Kim’s project on a memory roadshow in Massachusetts, which is uncovering the complex nature of human memory and attachment; Sandra Wang’s quest to preserve her family history by travelling to China and interviewing her grandfather about topics like shame and self-doubt; and Lisa Cruces’s work with the Houston Archives Collective, an organization that educates and supports the community’s efforts to preserve its own history. Finally, all the way from Alaska, Siri Tuttle and Susan Paskuan discussed the Eliza Jones Collection, a true collaboration between an institution and a community to preserve and use material vital to interior Alaskan Native communities.

This is a slide from Scott Carlson’s presentation “Megaton Goes Boom: Archiving and Preserving My Father’s First Comic Book,” 25 April 2018.

Later that day were lightning talks about tools useful in the digital age. For example, did you know you can save voicemails? I did not, but thanks to Lucy Rosenbloom’s presentation, I know iPhone users are able to save voicemails by tapping the share icon (the square box with the up arrow) and emailing the message as an .mp4. Here is a link to a useful article about saving voicemail. Rosenbloom converts her .mp4s into .mp3s, and she also uses an auto-transcription tool to create transcripts of her messages. The day wound down with personal tales of archiving family history solo and on a budget from Leslie Wagner and Scott Carlson, respectively. For more information about the tools and projects discussed at the conference, please visit the program.
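
For anyone who wants to try a similar conversion, a general-purpose tool such as ffmpeg can extract the audio from a saved .mp4 voicemail and re-encode it as an .mp3. The snippet below is only a rough sketch of that idea (it assumes ffmpeg is installed and on your PATH, and the file names are placeholders); it is not the specific workflow Rosenbloom described.

    # Minimal sketch: convert a saved voicemail from .mp4 to .mp3 using ffmpeg.
    # Assumes ffmpeg is installed; "voicemail.mp4" is a placeholder file name.
    import subprocess

    subprocess.run(
        [
            "ffmpeg",
            "-i", "voicemail.mp4",      # input file exported from the phone
            "-vn",                      # ignore any video stream
            "-codec:a", "libmp3lame",   # encode the audio as MP3
            "-q:a", "2",                # high-quality variable bitrate
            "voicemail.mp3",
        ],
        check=True,
    )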


Valencia L. Johnson is the Project Archivist for Student Life for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.

Diving into Computational Archival Science

by Jane Kelly

In December 2017, the IEEE Big Data conference came to Boston, and with it came the second annual computational archival science workshop! Workshop participants were generous enough to come share their work with the local library and archives community during a one-day public unconference held at the Harvard Law School. After some sessions from Harvard librarians that touched on how they use computational methods to explore archival collections, the unconference continued with lightning talks from CAS workshop participants and discussions about what participants need to learn to engage with computational archival science in the future.

So, what is computational archival science? It is defined by CAS scholars as:

“An interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.”

Lightning round (and they really did strike like a dozen 90-second bolts of lightning, I promise!) talks from CAS workshop participants ranged from computational curation of digitized records to blockchain to topic modeling for born-digital collections. Following a voting session, participants broke into two rounds of large group discussions to dig deeper into lightning round topics. These discussions considered natural language processing, computational curation of cultural heritage archives, blockchain, and computational finding aids. Slides from lightning round presenters and community notes can be found on the CAS Unconference website.

Lightning round talks. (Image credit)

 

What did we learn? (What questions do we have now?)

Beyond learning a bit about specific projects that leverage computational methods to explore archival material, we discussed some of the challenges that archivists may bump up against when they want to engage with this work. More questions were raised than answered, but the questions can help us build a solid foundation for future study.

First, and for some of us in attendance perhaps the most important point, is the need to familiarize ourselves with computational methods. Do we have the specific technical knowledge to understand what it really means to say we want to use topic modeling to describe digital records? If not, how can we build our skills with community support? Are our electronic records suitable for computational processes? How might these issues change the way we need to conceptualize or approach appraisal, processing, and access to electronic records?

Many conversations returned repeatedly to issues of bias, privacy, and ethics. How do our biases shape the tools we build and use? What skills do we need to develop in order to recognize and dismantle biases in technology?

Word cloud from the unconference created by event co-organizer Ceilyn Boyd.

 

What do we need?

The unconference was intended to provide a space to bring more voices into conversations about computational methods in archives and, more specifically, to connect those currently engaged in CAS with other library and archives practitioners. At the end of the day, we worked together to compile a list of things that we felt many of us would need to learn in order to engage with CAS.

These needs include lists of methodologies and existing tools, canonical data and/or open datasets to use in testing such tools, a robust community of practice, postmortem analysis of current/existing projects, and much more. Building a community of practice and skill development for folks without strong programming skills was identified as both particularly important and especially challenging.

Be sure to check out some of the lightning round slides and community notes to learn more about CAS as a field as well as specific projects!

Interested in connecting with the CAS community? Join the CAS Google Group at: computational-archival-science@googlegroups.com!

The Harvard CAS unconference was planned and administered by Ceilyn Boyd, Jane Kelly, and Jessica Farrell of Harvard Library, with help from Richard Marciano and Bill Underwood from the Digital Curation Innovation Center (DCIC) at the University of Maryland’s iSchool. Many thanks to all the organizers, presenters, and participants!


Jane Kelly is the Historical & Special Collections Assistant at the Harvard Law School Library. She will complete her MSLIS from the iSchool at the University of Illinois, Urbana-Champaign in December 2018.

Improving Descriptive Practices for Born-Digital Material in an Archival Context

by Annie Tummino

In 2014/15 I worked at the Metropolitan New York Library Council (METRO) as the Project Manager for the National Digital Stewardship Residency (NDSR) program in New York City, providing administrative support for five fantastic digital stewardship projects. While I gained a great deal of theoretical knowledge during that time, my hands-on work with born-digital materials has been fairly minimal. When I saw that METRO was offering a workshop on “Improving Descriptive Practices for Born-Digital Material in an Archival Context” with Shira Peltzman, former NDSR-NY resident (and currently Digital Archivist for UCLA Library Special Collections), I jumped at the opportunity to sign up.

For the last two years I have served as the archivist at SUNY Maritime College, working as a “lone arranger” on a small library staff. My emphasis has been on modernizing the technical infrastructure for our special collections and archives. We’ve implemented ArchivesSpace for collections management and are in the process of launching a digital collections platform. However, most of my work with born-digital materials has been of a very controlled type; for example, oral histories and student photographs that we’ve collected as part of documentation projects. While we don’t routinely accession born-digital materials, I know the reckoning will occur eventually. A workshop on descriptive practices seemed like a good place to start.

Shira emphasized that a great deal of technical and administrative metadata is captured during processing of born-digital materials, but not all of this information should be recorded in finding aids. Part of the archivist’s job is figuring out which data is meaningful for researchers and important for access. Quoting Bertram Lyons, she also emphasized that possessing a basic understanding of the underlying “chemistry” of digital files will help archivists become better stewards of born-digital materials. To that end, she started the day with a “digital deep dive” outlining some of this underlying digital chemistry, including bits and bytes, character encoding, and plain text versus binary data. This was followed by an activity where we worked in small groups to analyze the interdependencies involved in finding, retrieving, and rendering the contents of files in given scenarios. The activity definitely succeeded in demonstrating the complexities involved in processing digital media, and provided an important foundation for our subsequent discussion of descriptive practice.
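
As a rough illustration of that underlying “chemistry” (my own example, not one from the workshop), the same bytes only become readable text once a character encoding is applied, and the wrong encoding produces the wrong characters:

    # Minimal illustration: bytes are not text until an encoding is applied.
    data = "résumé".encode("utf-8")   # the string stored as 8 bytes
    print(data)                        # b'r\xc3\xa9sum\xc3\xa9'
    print(data.decode("utf-8"))        # "résumé"   -- correct interpretation
    print(data.decode("latin-1"))      # "rÃ©sumÃ©" -- same bytes, wrong encoding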

The following bullets, pulled directly from Shira’s presentation, succinctly summarize the unique issues archivists face when processing born-digital materials:

  • Processing digital material often requires us to (literally) transform the files we’re working with
  • As stewards of this material, we must be prepared to record, account for, and explain these changes to researchers
  • Having guidelines that help us do this consistently and uniformly is essential

Transparency and standardization were themes that came up again and again throughout the day.

To deal with some of the special challenges inherent in describing born-digital materials, a working group under the aegis of the UC Born-Digital Content Common Knowledge Group (CKG) has developed a UC-wide descriptive standard for born-digital archival material. The elements map to existing descriptive standards (DACS, EAD, and MARC) while offering additional guidance for born-digital materials where gaps exist. The most up-to-date version is on GitHub, where users can make pull requests to specific descriptive elements of the guidelines if they’d like to propose revisions. They have also been deposited in eScholarship, the institutional repository for the UC community.

Working in small groups, workshop participants took a closer look at the UC guidelines, examining particular elements, such as Processing Information; Scope and Content; Physical Description; and Extent. Drawing from our experience, we investigated differences and similarities in using these elements for born-digital materials in comparison to traditional materials. We also discussed the potential impacts of skipping these elements on the research process. We agreed that lack of standardization and transparency sows confusion, as researchers often don’t understand how born-digital media can be accessed, how much of it there is, or how it relates to the collection as a whole.

For our final activity, each group reviewed a published finding aid and identified at least five ways that the description of born-digital materials could be improved in the finding aid. The collections represented were all hybrid, containing born-digital materials as well as papers and other analog formats. It was common for the digital materials to be under-described, with unclear access statuses and procedures. The UC guidelines were extremely helpful in terms of generating ideas for improvements. However, the exercise also led to real talk about resource limitations and implementation. How do born-digital materials fit into an MPLP context? What do the guidelines mean for description in terms of tiered or efficient processing? No solid answers here, but great food for thought.

On the whole, the workshop was a great mix of presentation, discussion, and activities. I left with some immediate ideas to apply in my own institution. I hope more archivists will have opportunities to take workshops like this one and will check out the UC Guidelines.


 


Annie Tummino is the Archivist & Scholarly Communications Librarian at SUNY Maritime College, where she immerses herself in maritime special collections and advocates for Open Access while working in a historic fort by the sea. She received her Masters in Library and Information Studies and Archives Certificate from Queens College-CUNY in December, 2010.

Embedded Archives at the Institute for Social Research

by Kelly Chatain

This is the fourth post in the BloggERS Embedded Series.

As any archivist will tell you, the closer you can work with creators of digital content, the better. I work for the Institute for Social Research (ISR) at the University of Michigan. To be more precise, I am a part of the Survey Research Center (SRC), one of five centers that comprise the Institute and the largest academic social science research center in the United States. But really, I was hired by the Survey Research Operations (SRO) group, the operational arm of SRC, that conducts surveys all over the world collecting vast and varied amounts of data. In short, I am very close to the content creators. They move fast, they produce an extraordinary amount of content, and they needed help.

Being an ‘embedded’ archivist in this context is not just about the end of the line; it’s about understanding and supporting the entire lifecycle. It’s archives, records management, knowledge management, and more, all rolled into one big job description. I’m a functional group of one interacting with every other functional group within SRO to help manage research records in an increasingly fragmented and prolific digital world. I help to build good practices, relationships, and infrastructure among ourselves and other institutions working towards common scientific goals.

Lofty. Let’s break it down a bit.

Find it, back it up, secure it

When I arrived in 2012, SRO had a physical archive of master study files that had been tended to by survey research staff over the years. These records provide important reference points for sampling and contacting respondents, designing questionnaires, training interviewers, monitoring data collection activities, coding data, and more. After the advent of the digital age, a few building moves, and some server upgrades, they also had an extensive shared drive network and an untold number of removable media containing the history of more recent SRO work. My first task was to centralize the older network files, locate and back up the removable media, and make sure sensitive data was out of reach. TreeSize Professional is a great tool for this type of work because it creates detailed reports and clear visualizations of disk space usage. This process also produced SRO’s first retention schedule and an updated collection policy for the archive.

Charts produced by TreeSize Professional used for the initial records survey and collection.
A small selection of removable media containing earlier digital content.
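
TreeSize is a commercial product, but the same kind of quick survey can be approximated with a short script. The sketch below is my own illustration rather than part of SRO’s workflow; the share path is a placeholder, and the script simply totals file sizes under each top-level folder so that the largest accumulations stand out.

    # Minimal sketch: total up file sizes per top-level subdirectory of a share.
    # "/path/to/share" is a placeholder; point it at the drive being surveyed.
    import os

    def directory_size(path):
        # Sum the sizes of all files below path, skipping unreadable entries.
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # broken links, permission problems, etc.
        return total

    share = "/path/to/share"
    sizes = {d: directory_size(os.path.join(share, d))
             for d in os.listdir(share)
             if os.path.isdir(os.path.join(share, d))}

    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e9:8.2f} GB  {name}")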

Welcome, GSuite

Despite its academic home, SRO operates more like a business. It serves University of Michigan researchers as well as external researchers (national and international), meeting the unique requirements for increasingly complex studies. It maintains a national field staff of interviewers as well as a centralized telephone call center. The University of Michigan moved to Google Apps for Education (now GSuite) shortly after I arrived, which brought new challenges, particularly in security and organization. GSuite is not the only documentation environment in which SRO operates, but training in the Googleverse coincided nicely with establishing guidance on best practices for email, file management, and organization in general. For instance, we try to label important emails by project (increasingly decisions are documented only in email) which can then be archived with the other documentation at the end of the study (IMAP to Thunderbird and export to pdf; or Google export to .mbox, then into Thunderbird). Google Drive files are downloaded to our main projects file server in .zip format at the end of the study.
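
Once a study’s labeled mail has been exported to .mbox, it can be inspected or summarized with Python’s standard-library mailbox module. A minimal sketch follows; the file name is a placeholder rather than SRO’s actual naming convention.

    # Minimal sketch: list the messages in an exported .mbox file.
    # "study_1234.mbox" is a placeholder; substitute the export for your study.
    import mailbox

    box = mailbox.mbox("study_1234.mbox")
    for msg in box:
        print(msg.get("Date", "?"), "|", msg.get("From", "?"), "|", msg.get("Subject", "?"))
    print(f"{len(box)} messages archived")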

Metadata, metadata, metadata

A marvelous video on YouTube perfectly captures the struggle of data sharing and reuse when documentation isn’t available. The survey data that SRO collects is delivered to the principal investigator, but SRO also collects and requires documentation for data about the survey process to use for our own analysis and design purposes. Think study-level descriptions, methodologies, statistics, and more. I’m still working on finding that delicate balance of collecting enough metadata to facilitate discovery and understanding while not putting undue burden on study staff. The answer (in progress) is a SQL database that will extract targeted structured data from as many of our administrative and survey systems as possible, which can then be augmented with manually entered descriptive metadata as needed. In addition, I’m looking to the Data Documentation Initiative, a robust metadata standard for documenting a wide variety of data types and formats, to promote sharing and reuse in the future.

DDI is an international standard for describing data.
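
Returning to the SQL database mentioned above: as a very rough sketch of the general approach (my illustration, not SRO’s actual schema), a single table of study-level records could combine fields extracted automatically from administrative and survey systems with manually entered description.

    # Minimal sketch of a study-metadata table; the fields are illustrative only.
    import sqlite3

    conn = sqlite3.connect("survey_metadata.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS studies (
            study_id      TEXT PRIMARY KEY,  -- pulled from administrative systems
            title         TEXT,
            mode          TEXT,              -- e.g. phone, web, face-to-face
            field_start   TEXT,
            field_end     TEXT,
            n_completed   INTEGER,           -- extracted from survey systems
            description   TEXT               -- manually entered descriptive metadata
        )
    """)
    conn.execute(
        "INSERT OR REPLACE INTO studies VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("ST-0001", "Example Panel Study", "phone", "2017-01-15", "2017-06-30",
         2500, "Placeholder record for illustration only."),
    )
    conn.commit()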

Preserve it

The original plan for digital preservation was to implement and maintain our own repository using an existing open-source or proprietary system. Then I found my new family in the International Association for Social Science Information Services & Technology (IASSIST) and realized I don’t have to do this alone. In fact, just across the hall from SRO is the Inter-University Consortium for Political and Social Research (ICPSR), which recently launched a new platform called Archonnex for its data archive(s). Out of the box, Archonnex already delivers much of the basic functionality SRO is looking for, including support for the ever-evolving preservation needs of digital content, but it can also be customized to serve the particular needs of a university, journal, research center, or individual department like SRO.

Searching for data in OpenICPSR, built on the new Archonnex platform.

 

The embedded archivist incorporates a big picture perspective with the specific daily challenges of managing records in ways that not many positions allow. And you never know what you might be working on next…


Kelly Chatain is Associate Archivist at the Institute for Social Research, University of Michigan in Ann Arbor. She holds an MS from Pratt Institute in Brooklyn, NY.

Thoughts from a Newcomer: Code4Lib 2018 Recap

by Erica Titkemeyer

After prioritizing audiovisual preservation conferences for so long, this year I chose to attend my first Code4Lib annual conference in Washington, D.C. I looked forward to the talks most directly related to my work, but also knew that there would be many relatable discussions to participate in, as we are all users/creators of library technology tools and we all seek solutions to similar data management challenges.

The conference started with Chris Bourg’s keynote, detailing research on why marginalized individuals are compelled to leave their positions in tech jobs because of undue discrimination. Calling for increased diversity in the workplace, less mansplaining/whitesplaining, more vouching for and amplification of marginalized colleagues, Dr. Bourg set out to “equip the choir”. She also pointed out that Junot Diaz said it best at ALA’s midwinter conference, when he called for a reckoning, explaining that “A profession that is 88% white means 5000% agony for people of color, no matter how liberal and enlightened you think you are.”

I appreciated her decision to use this opportunity to point out our own shortcomings, contradictions and need to do better. I will also say, if you ever needed proof that there is an equity problem in the tech world, you can:

  1. Listen to Dr. Bourg’s talk and
  2. Read the online trolling and harassment that she has since been subjected to because of it.

Since the backlash, Code4Lib has released a Community Statement in support of her remarks.

Following the keynote, the first round of talks further assured me that I had chosen the right conference to attend. In his talk “Systems thinking: a practical field guide,” Andreas Orphanides cleverly pointed out system failures and hacks we all experience in our daily lives, how they are analogous to the software we might build, and where there might be areas for improvement. I also appreciated Julie Swierczek’s talk “For Beginners – No Experience Necessary,” in which she made the case for improving how we teach true beginners in workshops. She also argued that instructors should not assume everyone is on a level playing field just because the title includes “for beginners,” as it is not likely that attendees will know how to self-select workshops, especially if they are truly new to the technology being taught.

As a fan of Arduino (an open source hardware and software platform that supports DIY electronic prototyping), I was curious to hear Monica Maceli’s “Low-cost preservation Environment Monitoring” talk, in which she described her experience developing an environmental datalogger using the Raspberry Pi (a small single-board computer in the same DIY spirit as Arduino) and compared the results and associated costs with a commercial datalogger, the PEM2. While it would require staff with appropriate expertise, it seemed a worthwhile endeavor for anyone hoping to spend a quarter of the price.
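
To give a sense of what such a homemade datalogger involves (a generic sketch, not Maceli’s implementation), the Pi simply polls a sensor on a schedule and appends time-stamped readings to a file; read_sensor() below is a hypothetical stand-in for whatever sensor driver is actually used.

    # Generic sketch of an environmental datalogger loop; not Maceli's code.
    # read_sensor() is a hypothetical placeholder for a real sensor driver
    # (e.g. a DHT22 temperature/humidity library on a Raspberry Pi).
    import csv
    import time
    from datetime import datetime

    def read_sensor():
        # Placeholder: return (temperature_c, relative_humidity) from hardware.
        return 20.5, 45.0

    with open("environment_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            temp_c, rh = read_sensor()
            writer.writerow([datetime.now().isoformat(timespec="seconds"), temp_c, rh])
            f.flush()        # make sure readings survive a power cut
            time.sleep(300)  # one reading every five minutes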

With the sunsetting of Flash, I was eager to hear how Jacob Zaborowski’s talk “Save Homestar Runner!: Preserving Flash on the Web” would address the preservation concerns surrounding Homestar Runner, an online cartoon series that began in 2000 using Flash animation. Knowing that tools such as Webrecorder and Archive-It would capture, but not aid in preserving, the SWF files comprising the animations, Zaborowski sought out free and/or open source tools for transcoding the files to a more accessible and preservation-minded format. As with many audiovisual digital formats, the tools for transcoding SWF files were not entirely reliable or capable of migrating all of the unique attributes to a new container with different encodings. At the time of his talk, the folks at Homestar Runner were in the midst of a site redesign to hopefully resolve some of these issues.

While I don’t have the space to summarize all of the talks I found relatable or enlightening during my time at Code4Lib, I think these few that I’ve mentioned show how varied the topics can be, while still managing to complement the information management work we are all charged with doing.


Erica Titkemeyer is the Audiovisual Conservator for the Southern Folklife Collection (SFC) at the University of North Carolina at Chapel Hill. She is the Project Director on the Andrew W. Mellon Foundation grant Extending the Reach of Southern Audiovisual Sources, and oversees the digitization, preservation, and access of audiovisual recordings for the SFC.