92NY Unterberg Poetry Center Email Archives: A Case Study on Troubleshooting Email File Transfer for Processing

by Marian Clarke and Sally DeBauche

The 92nd Street Y is a community and cultural center located in New York City. Founded in 1874, 92NY is home to the Unterberg Poetry Center, a preeminent place for writers and poets to share their work with the public. Like many organizations, the 92nd Street Y adopted email in the late 1990s, and it has been the primary tool of communication for at least the last 15 years, all but replacing letters, faxes, and interoffice memos. A collaboration between the Poetry Center and the digital archives team at Stanford University Libraries’ Department of Special Collections, this project will apply the pioneering ePADD email curation and access software to the assessment and preservation of the email archive, with the aim of developing a model of processing and accessibility that other cultural centers might learn from and adopt. Our team began transferring email in September of 2022, and we expect to complete the project by December of 2023. This project has been made possible through the generous funding of the Email Archives: Building Capacity and Community regrant program, administered by the University of Illinois at Urbana-Champaign and funded by the Andrew W. Mellon Foundation. This post outlines some of the challenges we encountered in the early stages of the project as we transferred email files for processing with ePADD, and details the tools, methods, and strategies that made this vital first step in our project successful.

The Poetry Center’s email archive records the day-to-day activities of one of the world’s most prestigious literary organizations and contains nearly three million messages from the accounts of its directors dating back to the late 1990s. The nature of these activities—the ongoing curation of readings, conversations, performances, workshops, seminars, the “Discovery” Poetry Contest for emerging writers, and literacy outreach—has resulted in an email archive featuring correspondence with thousands of literary artists across their careers.

Our project team is made up of Marian Clarke, Project Archivist for 92NY; Sally DeBauche, Digital Archivist at Stanford Libraries; Bernard Schwartz, Director of 92NY’s Unterberg Poetry Center; and Glynn Edwards, Assistant Director for Special Collections at Stanford Libraries. With team members located in New York City, the Bay Area, and Panama, we have coordinated the entire project via Zoom. Working as a geographically dispersed team created some unique challenges, namely in transferring the email files from 92NY to Stanford Libraries and finally on to Marian.

The Poetry Center’s email archive comprises 78 .pst files totaling 371 gigabytes of data. The largest email files from the 21 years of email at 92NY easily belonged to Bernard Schwartz, the longest-serving (and current!) director of the Poetry Center. The 92NY IT staff uploaded those files from the Poetry Center’s local server to a cloud storage service so that the Stanford team could download them and convert them to .mbox, ePADD’s target format for ingest, using the Emailchemy email conversion tool. The transfer of the email files from Stanford to Marian proved to be more challenging than anticipated and required some creative solutions. Initially, the Stanford team mailed an encrypted hard drive with the email files to Marian. When the hard drive arrived, the folders were empty: something had evidently gone wrong when the files were copied to the drive, and the error went unnoticed.
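The project used the commercial Emailchemy application for this conversion step. Purely as an illustration of what a scripted .pst-to-.mbox conversion can look like, here is a minimal sketch using the open-source readpst utility from libpst instead of Emailchemy (our substitution, not the team’s tool); the file names are hypothetical.

```python
# Illustrative only: convert a .pst mailbox to mbox with readpst (libpst),
# a different tool than the Emailchemy application the project actually used.
import subprocess
from pathlib import Path

pst_file = Path("director_mail.pst")   # hypothetical source file
out_dir = Path("mbox_output")
out_dir.mkdir(exist_ok=True)

# By default, readpst writes one mbox file per mail folder into the output
# directory, which can then be ingested by ePADD.
subprocess.run(["readpst", "-o", str(out_dir), str(pst_file)], check=True)

print("Converted folders:", sorted(p.name for p in out_dir.iterdir()))
```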

The easiest alternative was to upload the files to Google Drive so that Marian could download them, but before taking this route we needed to make sure that we were protecting the data in the process, given how commonly email collections contain sensitive and legally protected information. We tested several encryption tools, including the macOS Disk Utility, which worked until we switched from Marian’s personal Mac to a PC for the project (a decision detailed further below). We also tried VeraCrypt, an open-source encryption utility, but ran into complications when attempting to decrypt the files. We ultimately used 7-Zip, which is not marketed as an encryption tool but does give the user the option to encrypt a file when they zip it. This tool proved to be the least complicated to use, and compressing the files also made uploading and downloading them from Google Drive faster.
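For readers who want to script this step, here is a minimal sketch of the kind of 7-Zip encryption we describe, run from Python via the 7z command-line tool; the archive name, source folder, and passphrase are hypothetical placeholders.

```python
# A minimal sketch of the 7-Zip step described above, assuming the 7z
# command-line tool is installed and on the PATH. File names are hypothetical.
import subprocess

archive = "poetry_center_email.7z"
source = "pst_files/"                  # folder of .pst files to protect
password = "use-a-strong-passphrase"   # share out of band, never over email

subprocess.run(
    [
        "7z", "a",          # add files to a new archive
        "-mhe=on",          # encrypt the file names as well as the contents
        f"-p{password}",    # enable AES-256 encryption with this passphrase
        archive,
        source,
    ],
    check=True,
)
# The resulting .7z file can be uploaded to Google Drive; the recipient
# extracts it with 7-Zip (or a compatible tool) and the shared passphrase.
```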

Files in hand, Marian turned to importing them into ePADD to begin processing them. Originally, we had planned for Marian to use her personal MacBook Air for the project; however, it quickly became clear that it was not up to the task. The MacBook, purchased in 2017, had 8 GB of RAM and 120 GB of storage. Ingesting these files into ePADD was extremely slow, sometimes taking 5 to 6 hours at a time and often ending with a series of error messages. Marian’s personal MacBook simply did not have enough RAM to process this amount of data, and the hard drive did not have the capacity to store the resulting ePADD collection files alongside the original files. The solution was to buy a new Dell laptop with 16 GB of RAM and 1 TB of storage, dedicated entirely to the project. With the new laptop, Marian successfully imported the email files into ePADD and began processing the collection.

Throughout this process, we have been gently reminded that all projects of this nature present unanticipated technological issues and frustrations, but with patience, curiosity, research, and a willingness to experiment with new tools and methods, they are not only possible but can offer new models for preserving our collections and making them accessible to users.


Marian Clarke is the project archivist working on the email collections of the 92nd Street Y’s Unterberg Poetry Center. She was previously a digital archivist at the Frick Art Reference Library Archives and an audiovisual archivist at the LaGuardia and Wagner Archives, CUNY. She holds an MA in media studies from the University of Texas and an MLIS from Pratt Institute.

Sally DeBauche is a Digital Archivist in the Department of Special Collections at Stanford University Libraries. She is responsible for creating policy and workflows related to born-digital archiving and for processing born-digital collections, with a particular focus on email. She also served as project manager for ePADD software development from 2020 to 2021 and consulted on the most recent cycle of development led by the University of Manchester and Harvard University. Sally received a BA in History from the University of Wisconsin-Madison and an MSIS from the University of Texas at Austin.

Recap: BitCurator Forum, Day 2, March 30, 2023 

The second and final day of the 2023 BitCurator Forum consisted of the popular “Great Question!” session, two final presentations, and a “Birds of a Feather” networking event for BIPOC professionals.  

Great Question! Session 

The ever-popular “Great Question!” session was back again, and as in previous years, participants were able to submit questions related to digital archives and curation. The popularity of this session stems from the fact that no question is off-limits and participants have dedicated time to answer and discuss each one. Information is crowdsourced from colleagues in this timeframe instead of relying on listservs and social media. The questions ranged from practical,

“What disk imaging software would you use to image optical discs on a Mac?”

to philosophical,

“How do you shake imposter syndrome? Especially when colleagues come to you for digital archives advice but you’ve really spent all day googling what to do…”

to everything in between. As in the past, the session used Airtable, and each question was submitted on a card with responses that could be added below. The questions were also ordered so that new professionals’ questions were prioritized by the moderators. Each question was read out loud, and then participants could speak, chat, or submit an answer in Airtable. The questions—and answers—can still be viewed.

Session 3

The third and final session of the Forum featured two presentations that focused on digital preservation standards and appraisal. The first presentation featured Brenna Edwards, Hyeeyoung Kim, and Christy Toms (University of Texas at Austin). They described their experience reviving the “UT Digital Preservation Group,” which had brought together multiple institutions under the University of Texas umbrella with a focus on discussing digital preservation projects and needs. The overarching goal of the group was to establish a set of shared institutional standards aligned with the National Digital Stewardship Alliance’s (NDSA) Levels of Digital Preservation Matrix. While the presentation discussed the group’s aims, process, results, and lessons learned, the highlight for me was a frank discussion about the challenges the group faced when organizing meetings and attempting to make them meaningful for participants.

The second presentation was led by Emmeline Kaser (University of Georgia), who discussed a collaborative approach to appraising born-digital records, structured around a case study at her institution. After sifting through a format report assessing risky formats within their digital preservation system, Kaser found that many of the file formats at risk of obsolescence weren’t files they wanted to preserve in the first place. As the digital archivist on the project, she partnered with a curator who had more extensive collection knowledge to develop an appraisal workflow they could apply to digital records as they were being accessioned. Their rationale was that this more thorough appraisal would eliminate a larger number of unwanted digital files, saving server storage space and staff processing time. Kaser gave a comprehensive overview of her workflow, as well as the tools she used to track collections as they moved through the appraisal process. The case study clearly illustrated how partnering with curators or other subject experts can help archivists more effectively steward complex records.

The 2023 BitCurator Forum slides and recordings will gradually be made available on the Forum website in the coming weeks. Anyone interested in learning more about the BitCurator Consortium can visit its website, which offers resources for using the BitCurator environment and connecting with other digital curation practitioners.


Amanda Garfunkel is the Digital Archivist at the Medical Center Archives at NewYork-Presbyterian/Weill Cornell in New York, New York. Her work currently involves creating and implementing born-digital workflows with a focus on preservation and access. 

Recap: BitCurator Forum, Day 1, March 29, 2023

The 2023 BitCurator Forum was held virtually from March 27-30. The Forum featured two days of workshops followed by two days of participant panels and lightning talks. Other events during the Forum included two “Birds of a Feather” networking gatherings and the popular “Great Question!” session, where participants could ask and answer questions about any topic.

Session 1

The first day of the Forum was held on March 29, following two days of hands-on workshops. The day kicked off with Session 1, featuring two talks related to software emulation and digital forensics workflows. The first presentation was from Ethan Gates (Yale University) and concerned his work creating a GUI for the QEMU emulation application. This was a follow-up to his presentation at last year’s BitCurator Forum, where he first demonstrated the application as part of the Tools and Demos Showcase. He discussed his motivations and challenges in creating a GUI—which he named Qiwi—to make the emulator more widely accessible to archivists and digital preservation practitioners who may not be comfortable with the command-line interface. He ended his talk with his thoughts about how community feedback and engagement would be pivotal in making Qiwi successful as it continues to develop.

The second presentation came jointly from Leo Konstantelos and Emma Yan (University of Glasgow) and focused on creating an end-to-end digital archiving workflow and the integration of digital forensics. Leo primarily discussed the setup and equipment they use in their digital preservation lab and reviewed some of the key decision points in their workflow, which spans from pre-acquisition decisions to ingest. When a decision was made to use digital forensic tools during accessioning, they needed to develop a new workflow to incorporate those tools before proceeding to transfer. Emma expanded on Leo’s explanation of the workflow by reviewing how it affected their written documentation: she discussed decisions they had to make when drafting their Digital Preservation Policy and how they needed to update their donor forms to address the new digital forensics workflow. Of particular concern was the fact that digital forensic tools can sometimes capture previously deleted files that were never intended for transfer to the repository.

Session 2

The afternoon session featured three presentations that discussed the implementation of privacy reviews in digital preservation workflows, working with outdated software, and preservation assessments for born-digital material. The first presentation, from Annie Schweikert and Victor Aguilar III (Stanford University), highlighted their experience using the Bulk Extractor tool to review for sensitive information—what they call “high risk data.” The Bulk Extractor tool scans digital files for information such as Social Security numbers, bank account and credit card numbers, and customized lexicon terms. The presenters gave an overview of their workflow, including some of their challenges, and discussed potential next steps in their handling of sensitive digital records.
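As a hedged illustration of how such a review might be scripted (not the presenters’ actual workflow or commands), the sketch below runs bulk_extractor against a hypothetical disk image and summarizes the resulting feature files for an archivist to review.

```python
# Rough sketch: run bulk_extractor from Python, assuming the tool is
# installed and on the PATH. Paths and file names are hypothetical.
import subprocess
from pathlib import Path

disk_image = Path("accession_001.E01")    # hypothetical disk image
out_dir = Path("bulk_extractor_output")   # must not already exist

subprocess.run(
    ["bulk_extractor", "-o", str(out_dir), str(disk_image)],
    check=True,
)

# bulk_extractor writes "feature files" (ccn.txt, email.txt, and others)
# listing candidate credit card numbers, email addresses, and similar
# potentially sensitive strings, each of which merits human review.
for feature_file in sorted(out_dir.glob("*.txt")):
    hits = sum(
        1
        for line in feature_file.open(errors="ignore")
        if line.strip() and not line.startswith("#")
    )
    if hits:
        print(f"{feature_file.name}: {hits} candidate matches to review")
```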

The second presentation, a lightning talk, introduced a case study focused on one archivist’s attempts to capture information from defunct music notation software. Elizabeth-Anne Johnson (University of Calgary) described her process of working with music scores that were formerly held on 3.5” floppy disks. Her talk explained how she was able to connect with the developer of the defunct software, who assisted her in finding a way to export the files to PDF. She also discussed her challenges with the rendering of the scores and helpful lessons she learned working with niche software.

The final presentation of the Forum’s first day was a second lightning talk from Hafsah Hujaleh (University of Toronto), who described her experience creating a new method for preparing preservation assessments for digital records. Hafsah explained the workflow and described the components of her preservation assessments using a collection of personal papers as a model. The advantage of the preservation assessment she created was clear: it resulted in a report that not only documented actions taken (e.g., disk imaging) but also outlined potential risks and recommendations for preservation, such as normalizing file types.


Amanda Garfunkel is the Digital Archivist at the Medical Center Archives at NewYork-Presbyterian/Weill Cornell in New York, New York. Her work currently involves creating and implementing born-digital workflows with a focus on preservation and access. 

Call for bloggERS: Blog Posts on the code4lib Conference

With just a few weeks to go before the code4lib conference (March 14-17), bloggERS! is seeking attendees who are interested in writing a recap or a blog post covering a particular session, theme, or topic relevant to SAA Electronic Records Section members. The program for the conference is available here.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com! You can also let us know if you’re interested in writing a general recap or if you’d like to cover something more specific.

Writing for bloggERS!:

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

Call for bloggERS: Q&Archivist

We are currently accepting nominations for a new series of blog posts, Q&Archivist. 

Has your path to working in electronic records or digital archives been unique? Are you working on an interesting or challenging project? Do you have a perspective on the future of electronic record stewardship? Or maybe you’ve always wanted to contribute to bloggERS but didn’t have the time to write a whole post! Whatever your story, we want to talk to you! 

Nominate yourself or a colleague for a short Q&A call or email with a member of the bloggERS team. Your interview will be published right here on bloggERS!

Please let us know if you’re interested by filling out this Google Form.

Join the bloggERS team!

bloggERS!, the blog of the Electronic Records Section of the Society of American Archivists, fosters communication and collaboration within the ERS and across the wider archival community.

Apply by February 15, 2023 to join the bloggERS Editorial Team! We’re seeking two volunteers to join us as Team Members. Access the brief application here.

Team Member responsibilities:

  • Term of service: through July 2023 (after July 2023, renewable with one-year term commitment)
  • Term limit: none
  • Duties: Manage (recruit, edit, publish) one post every six weeks

All information professionals with an interest in electronic records and digital archives are welcome to apply, including MLIS students and early-career professionals. Questions? You can always reach us at ers.mailer.blog@gmail.com.

New Year, New Resources!

Happy New Year from the editors at bloggERS! We’ve spent some time reviewing resources from the past year in order to create an easy-to-reference list. It is by no means comprehensive, just a selection of interesting things we discovered in our recent journeys through the digital landscape.

In addition to the list below, we’d love to hear what other new resources our readers are excited about! Feel free to respond in the comments and share widely with fellow electronic records practitioners.

Tools

Congress.gov API Beta

This API, now in its third version, allows anyone with an API key to query machine-readable data available on the Congress.gov site. 
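As a quick, hedged illustration (our own sketch, not an official example), the snippet below queries the API for a handful of recent bills; the API key is a placeholder, and the exact response fields may differ from what is shown.

```python
# Minimal sketch of querying the Congress.gov API (v3). Requires a free API
# key; the key and the response structure assumed here are placeholders.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder; register for a key to use the API
BASE_URL = "https://api.congress.gov/v3"

resp = requests.get(
    f"{BASE_URL}/bill",
    params={"api_key": API_KEY, "format": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# Print a short summary of whatever bill records come back.
for bill in resp.json().get("bills", []):
    print(bill.get("title"), "|", bill.get("latestAction", {}).get("text"))
```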

Browsertrix Cloud Crawling Service

Browsertrix Cloud is an open-source, Chrome-based crawling service developed by the Webrecorder community. 

It Takes a Village In Practice Toolkit

From the site: The ITAV in Practice Toolkit is an adaptable set of tools for practical use in planning and managing sustainability for open source software initiatives.

Aaru Data Preservation Suite

Aaru is an open-source disk imaging suite with some really interesting features.

Discmaster 

The Discmaster site enables visitors to browse and search a large dataset of vintage computer files from the Internet Archive.

IIPC Web Archiving Tools and Software 

A one-stop shop of resources and links for any of your web archiving needs!

Reports

Realities of Academic Data Sharing (RADS) Initiative Report

The RADS Initiative examines the myriad ways in which academic institutions are supporting the dissemination of and public access to federally funded research data.

Software Metadata Recommended Format Guide

The SMRF Guide summarizes and defines metadata elements used to describe software materials, including crosswalk mapping to MARC, Dublin Core, MODS, Wikidata, and Codemeta.

Legal and Ethical Considerations for Born-Digital Access from the DLF Born-Digital Access Working Group

Providing access to born-digital archival records is still a tricky subject, but this publication covers a lot of ground, including HIPAA and FERPA restrictions as well as copyright concerns.

Reference and Roundups

Open Grants Repository

The IMLS-funded Open Grants project seeks to promote transparency in the authorship of funding proposals by providing a searchable repository of grant applications.

Oxford Common File Layout Specification

From the site: This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital objects in a structured, transparent, and predictable manner. It is designed to promote long-term access and management of digital objects within digital repositories.

DPOE-N Digital Preservation Resource Guide

A clearinghouse of educational and informational resources related to digital preservation.

DPC Competency Framework 

“The Competency Framework presents information on the skills, knowledge, and competencies required to undertake or support digital preservation activities.”

BitCurator Tool Inventory

This inventory lists the workflow step, the accepted input (disk image, directory of files, etc.), the type of user interface (GUI/CLI), links to available documentation, and the function of each tool in the BitCurator environment.

SAA Electronic Records Section Community Conversation: Legal and Ethical Considerations for Born-Digital Access

The SAA Electronic Records Section, together with the Records Management Section and the Privacy & Confidentiality Section, invites you to attend a community conversation with the authors of the DLF publication Legal and Ethical Considerations for Born-Digital Access and Archival discretion: a survey on the theory and practice of archival restrictions on Friday, January 13, 2023 at 1 p.m. ET.

Let’s discuss your thoughts and needs about navigating restrictions for born-digital archives. Do these publications reflect methods you’ve used? Are there best practices or resources that should be included in future updates of the DLF publication? Do you have a restrictions-related question that can’t be addressed with these resources and that you’d like to bring to the table and crowdsource advice on? Or maybe you’ve used the recommendations from these documents and can share with others how you used them! 

The conversation will be based on your questions or stories, so please submit questions by January 9th here. Following a brief overview of the project, panelists will answer pre-submitted questions after which attendees will engage in group discussions. 

Register here.

Brought to you by:

SAA Electronic Records Section

SAA Records Management Section

SAA Privacy & Confidentiality Section

DLF Born-Digital Access Working Group

Reinventing COMPEL: Migration as a tool for project renewal or, a too-honest assessment of how projects can go very very wrong and how to get them going right again

We’ve all heard the horror stories: what began as a fairly simple digital humanities curation project (as if those even exist) quickly became a quagmire of confusion, abruptly ended by a security vulnerability deemed too resource-intensive to fix. But what happens when, instead of shuffling the project into the graveyard of “what can you do, these things happen,” an influx of new perspective and effort resurrects interest in the dormant collection? That is a recipe for a migration project! Take one defunct, poorly resourced project, add a dash of interest, a splash of digital elbow grease, and a heaping cup of “let’s just grab the original data and start over with a new infrastructure,” mix, and let rise for two years. Optional additions include: involvement of an international computer music society, project advisors in three different time zones, and a recognition of the privileges of academic study. This, in particular, is the recipe for COMPEL 2.0.

The first iteration of this project involved a data dump from the WordPress site of the Society for Electro-Acoustic Music in the United States (SEAMUS). As initially conceived, the project’s goals were to publish and preserve digital music compositions and performances from scholarly and artistic venues. Data from SEAMUS were strong-armed into a Hyrax repository built on top of Fedora. This initial venture failed (miserably), as the open-source system was too cumbersome for part-time developers to manage, and it sat stagnant for a number of years until a security vulnerability in the operating system for the (now ancient and un-updated) repository necessitated a shutdown before the data could be retrieved (a process deemed too time-intensive without a good enough return on the investment). The data, it turned out, had not been augmented or added to at all; it had merely been rearranged to conform to a standard that Fedora could deal with (don’t ask us what that standard looks like because we weren’t able to get the data out to see it). And so, we went back to basics and the original data from WordPress. It wasn’t pretty, but at least we could see it. 

Armed with the knowledge that we did not (nor would ever, probably) have the development resources needed to use a fully customizable infrastructure like Hyrax/Fedora, we explored more out-of-the-box infrastructures before landing, hesitantly, on a locally-hosted Omeka. While we could no longer dream of a fully customizable repository with a next-level, visually arresting interface (yeah, we know, we probably shouldn’t have dreamed of that in the first place), the Omeka instance has proved much less resource-intensive and much more stable. 

And so, we found ourselves once again migrating the data from the original source (WordPress) into a Dublin Core-esque environment. Getting the original SEAMUS data into the system—while not easy—was at least possible for a competent computer science graduate student (shout out to Javaid!). Naturally the migration import wasn’t clean (because of course complicated objects need complicated, non-standard metadata), so the next approximately 500 years was spent manually fixing the records to conform to a non-standard metadata structure that tends to change as we learn new things about the nature of computer music objects and the social structures that have grown up around them. Our project advisors, all computer musicians working in an international academic context, have been invaluable in helping us confront the complexity of this genre, for which we are (kinda, I guess) grateful. 

Armed with moderate success migrating the records from SEAMUS into a structured non-standard metadata schema, we then decided to further complicate the issue by adding in records from an international computer music conference: New Interfaces for Musical Expression (NIME). Naturally, adding in data from a time-bound, event-based organization (like a conference) has been super fun.  But it has also forced us to confront, through our metadata, both the temporal-spatial aspect of computer music (is a digital performance the same as a physical performance?) and the need to account for both physical and digital instrumentation. 

Ultimately, we’ve determined that this project needs to evolve as constantly and quickly as the technology and sounds that it is trying to capture; recognizing this has led us to a few important lessons that will inform the future of this project. 

  1. Data export is probably more important than data import. Whichever infrastructure you use, make sure you can get the data back out in a uniform structure. Test data export early, and test it often. Check it against import AND the user interface (not just the back end); see the sketch after this list for the kind of check we mean. 
  2. Don’t marry the infrastructure. The pace of technology changes so quickly that if an infrastructure isn’t killed by updates, it will probably be killed by security risks. Plan for this in advance and make sure that you can get the data out (see #1 above). 
  3. Unless you work for Google or Apple or another for-profit tech company, keep your project as low-tech as possible. Depend on as few developers or sysadmins as possible because their time is precious and they’re probably dealing with bigger, more important problems than you (like the administrator who gets hit with a ransomware attack). Keep your dependencies low and document, document, document EVERYTHING you do.
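To make lesson 1 concrete, here is a rough, hedged sketch of an export sanity check against an Omeka instance, assuming the Omeka Classic REST API is enabled; the URL, API key, and expected item count are placeholders, not values from our actual site.

```python
# Hedged sketch: pull every item back out of an Omeka Classic site via its
# REST API and confirm the total matches what the admin interface reports.
# The URL, key, and expected count below are placeholders.
import requests

BASE_URL = "https://compel.example.edu/api"   # hypothetical Omeka install
API_KEY = "YOUR_API_KEY"
EXPECTED_ITEMS = 1234                          # count shown in the admin UI

items, page = [], 1
while True:
    resp = requests.get(
        f"{BASE_URL}/items",
        params={"key": API_KEY, "page": page, "per_page": 50},
        timeout=30,
    )
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    items.extend(batch)
    page += 1

assert len(items) == EXPECTED_ITEMS, (
    f"Export returned {len(items)} items, expected {EXPECTED_ITEMS}"
)
print("Export check passed:", len(items), "items retrieved")
```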

Andi Ogier is the Assistant Dean and Director of Data Services at Virginia Tech in Blacksburg, VA.

Hollis Wittman is the Metadata and Music Libraries Specialist at Virginia Tech, working remotely from Kalamazoo, MI.

Downsizing: Migrating multiple systems to a unified collection management suite

When I started my position at the Arthur H. Aufses, Jr., MD Archives at the Icahn School of Medicine at Mount Sinai in January 2020, one of my first mandates was to find a migration pathway from our locally hosted digital repository system, DSpace, to a software-as-a-service (SaaS) option. As I began this process, my first instinct was to zoom out and look at all of the systems used by the Archives. What I found was an ecosystem of archives and digital object management tools that had grown organically over the past ten-plus years. 

Pre-migration archives ecosystem. The middle lane (in yellow) represents all of the sets of metadata and digital objects managed by the Archives. The top lane (in green) shows all of the systems that managed the metadata and digital objects internally, and the bottom lane (in blue) shows how those assets were served publicly to users.

While this ecosystem had worked well over time, responding to the Archives’ needs as it grew, there were a few reasons I began looking into migrating away from these systems as a whole. The first, and perhaps most pressing, was that our partners in IT were moving towards a SaaS model, and the local servers that hosted our platforms were no longer sustainable. Second, the growth of our collections was outpacing the ability of these systems to keep up, and it wasn’t always clear which system to use, as they often had overlapping roles in managing the collection. And finally, we were able to identify several opportunities a new system would provide, namely better collection management capabilities and additional pathways for researchers to have direct access to collection description and digital objects.

With the overarching goal of migrating our digital objects, the first step was to strengthen our underlying metadata management at the collection level. Then, our hope was to link digital objects directly to their related records within the archival description. Improved digital preservation capabilities were also a must. This led us to Access to Memory (AtoM), an archival description and management environment, and Archivematica, a digital preservation system. We use a SaaS instance hosted by the vendor, Artefactual. We began using these tools in January 2021, and our AtoM site went live to the public in June 2021.

What has ensued is a multi-armed migration project. Our first task was to migrate our legacy collection management system, which ran on Microsoft Access, along with approximately sixty (of 120+) finding aids in Microsoft Word, into AtoM. This involved a lot of manual clean-up and moving unstructured data into CSV files for batch import. I was able to complete this step between January and June 2021. 

With a metadata infrastructure mostly in place, it was time to start moving over our digital objects. I began with our audiovisual material, 900+ files and 10+ TB of material. The metadata for audiovisual files was located in multiple spreadsheets, and all of the files were on local storage. The ingest process was straightforward, but re-linking the audiovisual files to the appropriate collection was a significant remediation step that took much longer than anticipated. While we initially projected this would only take a few weeks, we also underestimated the amount of time it would take for large files (100+ GB) to ingest through Archivematica, often at a rate of only one or two per workday. I ultimately worked on this through August 2022, alongside other migration activities.

I additionally worked on migrating almost 3,000 digital objects out of DSpace during this time. The Archives had hosted DSpace on two local servers: one managed all digital objects and was accessible to Archives staff only (“DSpace internal”); the second held duplicate copies of only the publicly accessible files and was available on the public web (“DSpace external”). DSpace external was a high-priority migration target for us, as the IT team was particularly concerned about the longevity of this server due to security and hardware issues. 

I exported packages from DSpace and created a script that repackaged the files and metadata in a form that Archivematica and AtoM could readily understand. The resulting packages would be ingested through Archivematica, which in turn would send derivative access copies to AtoM. In AtoM, I was able to match the metadata CSVs generated by the script to existing records using the command-line CSV import function. We were able to decommission DSpace external in July 2022. We paused our DSpace internal migration (another 15,800 digital objects) to address InMagic/DB Textworks (described below), returning to the DSpace internal server in December 2022. 
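As a rough illustration of what such a repackaging script might look like (this is a hedged sketch rather than my actual script; it assumes DSpace’s Simple Archive Format export layout, Archivematica’s metadata/metadata.csv transfer convention, and hypothetical paths and Dublin Core fields):

```python
# Hedged sketch: repackage DSpace Simple Archive Format exports into an
# Archivematica-style transfer with a metadata/metadata.csv that can also
# drive AtoM matching. Directory names and DC fields are assumptions.
import csv
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

export_root = Path("dspace_export")        # item_0001/, item_0002/, ...
transfer = Path("archivematica_transfer")
objects_dir = transfer / "objects"
metadata_dir = transfer / "metadata"
objects_dir.mkdir(parents=True, exist_ok=True)
metadata_dir.mkdir(parents=True, exist_ok=True)

rows = []
for item_dir in sorted(p for p in export_root.iterdir() if p.is_dir()):
    # Each SAF item folder contains a dublin_core.xml describing the item.
    dc = ET.parse(item_dir / "dublin_core.xml").getroot()
    title = next((v.text for v in dc.iter("dcvalue")
                  if v.get("element") == "title"), "")
    identifier = next((v.text for v in dc.iter("dcvalue")
                       if v.get("element") == "identifier"), item_dir.name)
    # The "contents" file lists the item's bitstreams, one per line.
    for line in (item_dir / "contents").read_text().splitlines():
        filename = line.split("\t")[0].strip()
        if not filename:
            continue
        dest = objects_dir / item_dir.name / filename
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(item_dir / filename, dest)
        rows.append({
            "filename": f"objects/{item_dir.name}/{filename}",
            "dc.title": title,
            "dc.identifier": identifier,
        })

# Archivematica reads metadata/metadata.csv during transfer and attaches the
# Dublin Core values to each object.
with (metadata_dir / "metadata.csv").open("w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["filename", "dc.title", "dc.identifier"])
    writer.writeheader()
    writer.writerows(rows)
```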

We had used InMagic/DB Textworks, a legacy product now owned by Lucidea, since approximately 2006(!) to host our digitized image collections (almost 2,000 digital files). While this tool was a powerhouse for us for many years, it was increasingly difficult to link the image files to related description. I began the project by exporting a CSV of the metadata from InMagic. I then wrote a Python script that created a directory for each of the unique values and pulled copies of the corresponding TIF files into each of the folders, effectively grouping material from the same physical folders. I had to manually match these to their existing identifiers. The server hosting the platform was sunset in November 2022, and all the remaining image files that did not have associated, structured metadata were moved to a shared drive. These files are now more readily accessible to archivists on staff and eventually will be preserved in Archivematica and made available via AtoM, pending ongoing metadata remediation.
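A simplified sketch of that kind of grouping script follows; the column names (“folder_title”, “filename”) and paths are hypothetical stand-ins for the actual InMagic export fields, not my production code.

```python
# Hedged sketch of the grouping step described above: read the metadata CSV
# exported from InMagic and copy each TIF into a directory named for the
# physical folder it came from. Column names and paths are hypothetical.
import csv
import re
import shutil
from pathlib import Path

metadata_csv = Path("inmagic_export.csv")
image_source = Path("image_masters")       # loose TIF files from the server
grouped_root = Path("grouped_by_folder")

with metadata_csv.open(newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        folder_title = row["folder_title"].strip() or "unidentified"
        # Make the folder title safe to use as a directory name.
        safe_name = re.sub(r"[^\w\- ]", "_", folder_title)
        dest_dir = grouped_root / safe_name
        dest_dir.mkdir(parents=True, exist_ok=True)

        tif = image_source / row["filename"]
        if tif.exists():
            shutil.copy2(tif, dest_dir / tif.name)
        else:
            print(f"Missing file, needs manual matching: {tif.name}")
```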

Post-migration archives ecosystem. The middle lane (in yellow) represents all of the sets of metadata and digital objects managed by the Archives. The top lane (in green) shows that only AtoM and Archivematica manage the digital objects and metadata, and the bottom lane (in blue) shows that the assets are served to the public web by AtoM (which is linked from the Archives website).

We hope to conclude this project sometime in the first quarter of 2023, and you can keep up with our progress by checking out what’s available on our publicly accessible AtoM site (“Archives Catalog”). This is the first time I’ve laid this project out narratively. Admittedly many of the complexities have been glossed over for the sake of brevity, and as we wrap up the project, the work and its complexities will continue. Hopefully this sweeping overview has provided helpful insight into how we are undertaking a large-scale migration project. This project would not have been possible without institutional support, as well as the work of our colleagues in IT.

On a final note, I’m left thinking about Elizabeth McAulay’s great article, “Always Be Migrating.” The title has become something of a professional motto in these past months, and I’m already anticipating ways we can continue to iterate on our current metadata and digital object management practices, not to mention revamping our location module, redesigning our AtoM homepage, improving our documentation, and… I’ll save it for the Gantt chart! 


Stefana Breitwieser is the Digital Archivist at the Arthur H. Aufses, Jr., MD Archives at the Icahn School of Medicine at Mount Sinai in New York, New York. Her professional interests include providing researcher access to digital archival material and wrangling metadata.