What's Your Set-up?: Curation on a Shoestring

by Rachel MacGregor


At the Modern Records Centre, University of Warwick, in the United Kingdom, we have been making steady progress in our digital preservation work. Jessica Venlet from UNC Chapel Hill wrote recently about being in the lucky position of finding “an excellent stock of hardware and two processing computers” when she started in 2016. We’re a little further behind than this—when I began in 2017 I had a lot less!

What we want is FRED. Who’s he? He’s a Forensic Recovery of Evidence Device (a forensic workstation), but at several thousand dollars, it’s beyond the reach of many of us.

What I had in 2017: 

  • A Tableau T8-R2 write blocker. Write blockers are very important when working with rewritable media (USB drives, hard drives, etc.) because they prevent accidental alteration of material by blocking overwriting or deletion.
  • A (fingers crossed) working 3.5 inch external floppy disk drive.
  • A lot of enthusiasm.

What I didn’t have: 

  • A budget.

[Image: Dell monitor and computer, keyboard, mouse, and write blocker on an office desk, with BitCurator open on the screen.]
My digital curation workstation – not fancy but it works for me. Photo taken by MacGregor, under CC-BY license.

Whilst doing background research for tackling our born-digital collections, I got interested in the BitCurator toolkit, which is designed to help with the forensic recovery of digital materials. It interested me particularly because:

  • It’s free.
  • It’s open source.
  • It’s created and managed by archivists for archivists.
  • There’s a great user community.
  • There are loads of training materials online and an online discussion group.

I found this excellent blog post by Porter Olsen to help get started. He suggests starting with a standard workstation with a relatively high specification (e.g. 8 GB of RAM). So, I asked our IT folk for one, which they had in stock (yay!). I specified a Windows operating system and installed a virtual machine running Linux, on which to run BitCurator.

I’m still exploring BitCurator—it’s a powerful suite of tools with lots of features. However, when trialing it on the papers of the eminent historian Eric Hobsbawm, I found that it was a bit like using a hammer to crack a nut. Whilst it was possible to produce all sorts of sophisticated reports identifying email addresses etc., this isn’t much use on drafts of published articles from the late 1990s–early 2000s. I turned to FTK Imager, which is proprietary but free software. It is widely used in the preservation community, but not designed by, with, or for archivists (as BitCurator is). I guess its popularity derives from the fact that it’s easy to use and will allow you to image (i.e. take a complete copy of the whole medium, including deleted files and empty space) or just extract the files, without too much time spent learning to use it. There are standard options for disk image output (e.g. a raw byte-for-byte image, E01 Expert Witness format, SMART, and AFF formats). However, I would like to spend some more time getting to know BitCurator and becoming part of its community. There is always room for new and different tools and I suspect the best approaches are those which embrace diversity.
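
To make that concrete, imaging just means copying every byte off the carrier and recording a checksum as you go. The sketch below is purely illustrative (in practice FTK Imager or a similar tool does this for you, behind a write blocker) and the device path is a placeholder:

```python
import hashlib

# Illustrative only: /dev/sdb is a placeholder for a write-blocked device.
DEVICE = "/dev/sdb"
IMAGE = "disk001.raw"
CHUNK = 1024 * 1024  # read 1 MiB at a time

sha256 = hashlib.sha256()
with open(DEVICE, "rb") as src, open(IMAGE, "wb") as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)      # copy every byte, including deleted files and empty space
        sha256.update(chunk)  # build a fixity value as we go

print(f"{IMAGE} written, SHA-256: {sha256.hexdigest()}")
```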

Another tool that looks useful for disk imaging is diskimgr, created by Johan van der Knijff of the KB, National Library of the Netherlands. It will only run on a Linux operating system (not on a virtual machine), so now I am wondering about getting a separate Linux workstation. BitCurator also works more effectively in a native Linux environment than in a virtual machine; it does stall sometimes with larger collections. I wonder if I should have opted for a Linux machine to start with. . . it’s certainly something to consider when creating a specification for a digital curation workstation.

Once the content is extracted, we need further tools to help us manage and process it. BitCurator does a lot, but there may be extra things that you might need depending on your intended workflow. I never go anywhere without DROID. DROID is useful for loads of stuff like file format identification, creating checksums, deduplication, and lots more. My standard workflow is to create a DROID profile and then use this as part of the appraisal process further down the line. What I don’t yet have is some sort of file viewer—Quick View Plus is the one I have in mind (it’s not free and, as I think I mentioned, my resources are limited!). I would also like to get LibreOffice installed as it deals quite well with old word-processed documents.
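
As an illustration of how a DROID export can feed appraisal, the sketch below flags duplicate files by checksum from a DROID CSV export. It is only a sketch: the file name is made up and the column names are assumptions that depend on how your profile is configured.

```python
import csv
from collections import defaultdict

# Column names are assumptions based on a typical DROID CSV export;
# check your own export, as they vary with the hash algorithm configured.
HASH_COLUMN = "MD5_HASH"
PATH_COLUMN = "FILE_PATH"

duplicates = defaultdict(list)
with open("droid_profile_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get(HASH_COLUMN):
            duplicates[row[HASH_COLUMN]].append(row[PATH_COLUMN])

# Any checksum seen more than once is a candidate for deduplication.
for checksum, paths in duplicates.items():
    if len(paths) > 1:
        print(f"{checksum} appears {len(paths)} times:")
        for path in paths:
            print(f"  {path}")
```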

I guess I’ll keep adding to it as I go along. I now need to work out the most efficient ways of using the tools I have and capturing the relevant metadata that is produced. I would encourage everyone to take some time experimenting with some of the tools out there and I’d love to hear about how people get on.


Rachel MacGregor is Digital Preservation Officer at the Modern Records Centre, University of Warwick, United Kingdom. Rachel is responsible for developing and implementing digital preservation processes at the Modern Records Centre, including developing cataloguing guidelines, policies and workflows. She is particularly interested in workforce digital skills development.

What's Your Set-up?: Managing Electronic Records at University of Alaska Anchorage

by Veronica Denison


For years, archivists at the University of Alaska Anchorage Archives and Special Collections have known that we would have to grapple with how to store our electronic records. We were receiving more and more donations that contained born-digital items, and I also knew I wanted to apply for grants to digitize audio, video, and film in our holdings. The grants would not provide funding for a storage system, which meant we had to come up with alternative means to pay for it.

Prior to 2018, anything we digitized was saved on what we called “scribe drives,” which were shared network drives mapped to the computers in the archives. These drives could store about 4TB of data. In 2017, we digitized ten ¼-inch audio reels as .mp3 and .wav files, which gave us both access and master copies of those recordings. At the time, we had enough storage space for everything, but in 2018, we received a request to digitize two 16mm film reels from the Dorothy and Grenold Collins papers. Unfortunately, we could not save both the access and master copies to the drive since the files were too large (about 350GB per film for the master copy).

Around the same time, we also received the Anne Nevaldine papers. The collection contained 4 boxes of 35mm slides as well as multiple CDs that in total contained 64,932 files within 1,446 folders, totaling 322GB. I had a volunteer run each of the CDs through the Archives’ segregation machine to check for viruses and then transfer the digital files onto an external hard drive. We thought we had time to figure out a more permanent solution than the external hard drive, but two weeks after I made the finding aid available online, a researcher came in wanting to look at the digital photographs in the collection. This created an issue, as my only option was to give her the external hard drive to look at the images. While she was in the Archives Research Room, I watched her closely to make sure nothing was deleted or moved.

We decided that we needed a system where we could save and access all of our digital content, while also having it backed up, and have the option to make read-only copies available to researchers in the Research Room. We knew we would probably end up having at least 5 TB of data right away if we factored in our current digital items and the possibility of future ones. We initially approached the University’s IT Department to learn about our options. Unfortunately, we were quoted a very high cost (over $10,000 a year for 20TB), so we approached the Library’s IT Department for suggestions. After some discussion about what would be appropriate for our needs, Brad, the Library’s PC and Network Administrator, presented us with some options.

We ultimately decided on a Synology DiskStation DS1817+, which cost $848, with WD Gold 10TB HDD drives. We settled on 8 drives for a total of 80TB (to provide growth space), which were $375 each for a cost of $3,000. Then we needed a system to hook it to. For that we used a Windows 10 desktop, which cost $1,065. The total cost for the hardware was $4,913; however, we also needed a cloud service provider to back up the files. We decided to go with Backblaze, which costs $5 per TB per month. This whole system is a network-attached storage system, meaning a file-level data storage server connected to a computer network. We took to calling it “the NAS” for short. Thankfully, when we presented the need for electronic storage to the Library’s Dean, he was willing to provide us with the funding needed to purchase the items.

Once everything was installed, we had to transfer the files and develop a new system for arranging and saving them. We decided on having three separate drives, two of which would be on the NAS (Master and Access), and one a separate network drive (Reference_Access). The Master drive is the only one that is backed up to Backblaze. The Master drive acts as a dark archive, meaning once items are saved to it, they will not be accessed. Therefore, we created the Access drive where archivists can retrieve the digital contents for reference and use purposes. The Access drive is essentially a copy of the Master drive. There is also a Reference_Access drive, which is mapped separately to each computer within the Archives, and not on the NAS. Reference_Access is the drive researchers use to access digital content in the Research Room; it contains the access copies and low-resolution .jpgs of photographs that may be high resolution on the Master and Access drives.
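
As a hedged illustration (not our actual procedure), a small script along these lines could confirm that the Access drive really is a faithful copy of the Master; the mount points are placeholders for the mapped drives.

```python
import hashlib
from pathlib import Path

# Placeholder paths; in reality these are mapped network locations on the NAS.
MASTER = Path("/mnt/master")
ACCESS = Path("/mnt/access")

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# Walk the Master drive and confirm every file exists, unchanged, on Access.
for master_file in MASTER.rglob("*"):
    if master_file.is_file():
        access_file = ACCESS / master_file.relative_to(MASTER)
        if not access_file.exists():
            print(f"MISSING on Access: {access_file}")
        elif sha256(master_file) != sha256(access_file):
            print(f"MISMATCH: {master_file}")
```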

The next step was mapping the Reference_Access drive to the researcher computer in the Research Room and making it read-only, but only for that computer. After working with the University’s IT Department, Brad was able to make it work. Since establishing this system in Spring 2019, the Reference_Access drive has been used by multiple researchers and it works great! They are able to access digital content of collections as easily as looking through a box on a table. We are grateful for all those who helped the Archives acquire a great mechanism for saving our electronic records at a relatively low cost.


Veronica Denison is currently the Assistant University Archivist at Kansas State University where she has been since September 2019. Prior to being hired at K-State, she was an archivist for six years at the University of Alaska Anchorage. She holds an MLIS with a Concentration in Archives Management from Simmons College.

What's Your Set-Up? Born-Digital Processing at UNC Chapel Hill

by Jessica Venlet


At UNC-Chapel Hill Libraries’ Wilson Special Collections Library, our workflow and technology set-up for born-digital processing has evolved over many years and under the direction of a few different archivists. This post provides a look at what technology was here when I started work in 2016 and how we’ve built on those resources in the last three years. Our set-up for processing and appraisal centers on getting collections ready for ingest to the Libraries’ Digital Collections Repository where other file-level preservation actions occur. 

What We Had 

I arrived at UNC in 2016 and was happy to find an excellent stock of hardware and two processing computers. Thank you to my predecessors! 

The computers available for processing were an iMac (10.11.6, 8GB RAM, 500GB storage) and a Lenovo PC (Windows 7, 8 GB RAM, 465 GB storage, 2.4GHz processor). These computers were not used for storing collection material. Collections were temporarily stored and processed on a server before ingest to the repository. While I’m not sure how these machines were selected, I was glad to have dedicated born-digital processing workstations.

In addition to the computers, we had a variety of other devices including:

  • One Device Side Data FC0525 5.25” floppy controller and a 5.25” disk drive
  • One Tableau USB write blocker
  • One Tableau SATA/IDE write blocker
  • Several USB connectable 3.5” floppy drives 
  • Two memory card readers (SanDisk 12 in 1 and Delkin)
  • Several zip disk drives (which turned out to be broken)
  • External CD/DVD player 
  • 3 external hard drives and several USB drives
  • Camera for photographing storage devices
  • A variety of other cords and adapters, most of which are used infrequently. Some examples are extra SATA/IDE adapters and adapter kits, Molex power adapters and power cords, and a USB adapter kit.

The primary programs in use at the time were FTK Imager, Exact Audio Copy, and Bagger. 
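
Bagger is a graphical front end for the BagIt packaging specification. As a hedged aside (this is not our documented workflow, just a sketch with placeholder paths and metadata), the same packaging step can be scripted with the bagit-python library:

```python
import bagit  # pip install bagit

# Placeholder directory standing in for a transfer ready to be packaged.
transfer_dir = "/tmp/accession_2016_001"

# Turn the directory into a BagIt bag with SHA-256 payload manifests.
bag = bagit.make_bag(
    transfer_dir,
    {"Source-Organization": "Wilson Special Collections Library"},  # example bag-info
    checksums=["sha256"],
)

# Re-open and validate the bag to confirm the payload checksums still match.
assert bagit.Bag(transfer_dir).is_valid()
```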

What We Have Now

Since 2016, our workflow has evolved to include more appraisal and technical review before ingest. As a result, our software set-up expanded to support actions like virus scanning and file format identification. While it was great to have two dedicated workstations, our computers definitely needed an upgrade, so we worked on securing replacements.

The iMac was replaced with a Mac Mini (10.14.6, 16 GB RAM, 251 GB flash storage). Our PC was upgraded to a Lenovo P330 tower (Windows 10, 16 GB RAM, 476 GB storage). The Mini was a special request, but the PC request fit into a normal upgrade cycle. We continue to temporarily store collections on a server for processing before ingest.

Our peripheral devices remain largely the same as above, but we have added new (functional) zip drives and another Tableau USB write blocker used for appraisal outside of the processing space (e.g. offsite for a donor visit). We also purchased a KryoFlux, which can be used for imaging floppies. While not strictly required for processing, the KryoFlux may be useful to have if you encounter frequent issues accessing floppies. To learn more about the KryoFlux, check out the excellent Archivist’s Guide to the KryoFlux resource.

The software and tools that we’ve used have changed more often than our hardware set-up. Since about May 2018, we’ve settled on a pretty stable selection of software to get things done. Our commonly used tools are Bagger, Brunnhilde (and the dependencies that go with it, like Siegfried and ClamAV), bulk_extractor, Exact Audio Copy, ffmpeg, IsoBuster, LibreOffice, Quick View Plus, rsync, text editors (TextWrangler or BBEdit), and VLC Media Player.
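
For a sense of what Brunnhilde wraps, here is a rough, hedged sketch of two of those steps run directly: a ClamAV virus scan and Siegfried format identification. The paths are placeholders and this is not our actual script.

```python
import subprocess

# Placeholder path standing in for files copied off a carrier for appraisal.
source = "/tmp/accession_2018_042"

# Recursive virus scan; ClamAV's exit code flags infected or unreadable files.
subprocess.run(["clamscan", "-r", source], check=False)

# Siegfried format identification, written out as JSON for later review.
with open("siegfried_report.json", "w", encoding="utf-8") as report:
    subprocess.run(["sf", "-json", source], stdout=report, check=False)
```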

Recommended Extras

  • USB hub. Having extra USB ports has proven useful. 
  • A basic repair toolkit. This isn’t something we use often, but we have had a few older external hard drives come through that we needed to remove from an enclosure to connect to the write blocker. 
  • Training Collection Materials. One of the things I recommend most for a digital archives set-up is a designated set of storage devices and files that are for training and testing only. This way you have some material ready to go for testing new tools or training colleagues. Our training and testing collection includes a few 3.5” and 5.25” floppies, optical discs, and a USB drive that is loaded with files (including files with information that will get caught by our PII scanning tools). Many of the storage devices were deaccessioned and destined for recycling.

So, that’s how our set-up has changed over the last several years. As we continue to understand our needs for born-digital processing and as born-digital collections grow, we’ll continue to improve our hardware and software set-up.


Jessica Venlet works as the Assistant University Archivist for Digital Records & Records Management at the UNC-Chapel Hill Libraries’ Wilson Special Collections Library. In this role, Jessica is responsible for a variety of things related to both records management and digital preservation. In particular, she leads the processing and management of born-digital special collections. She earned a Master of Science in Information degree from the University of Michigan.

Welcome to the newest series on bloggERS, “What’s Your Set-Up?”

By Emily Higgs


Welcome to the newest series on bloggERS, “What’s Your Set-Up?” In the coming weeks, bloggERS will feature posts from digital archives professionals exploring the question: what equipment do you need to get your job done?

This series was born from personal need: as the first Digital Archivist at my institution, one of my responsibilities has been setting up a workstation to ingest and process our born-digital collections. It’s easy to be overwhelmed by the range of hardware and software needed, the variety of options for different equipment types, and the question of where to obtain everything. In my context, some groundwork had already been done by forward-thinking former employees, who set up a computer with the BitCurator environment and also purchased a WiebeTech USB WriteBlocker. While this was a good first step for a born-digital workstation, we had much farther to go.

The first question I asked was: what do I need to buy?

My initial list of equipment was pretty easy to compile: 3.5” floppy drive, 5.25” floppy drive, optical drive, memory card reader, etc. etc. Then it started to get more complicated: 

  • Do I need to purchase disk controllers now or should I wait until I’m more familiar with the collections and know what I need? 
  • How much will a KryoFlux cost us over time vs. hiring an outside vendor to read our difficult floppies? 
  • Is it feasible to share one workstation among multiple departments? Should some of this equipment be shared consortially, like much of our collections structure? 
  • What brands and models of all this stuff are appropriate for our use case? What is quality and what is not?

The second question was: where do I buy all this stuff? This question contained myriad sub-questions: 

  • How do I balance quality and cost? 
  • Can I buy this equipment from Amazon? Should I buy equipment from Amazon? 
  • Will our budget structure allow for me to use vendors like eBay? 
  • Which sellers on eBay can I trust to send us legacy equipment that’s in working condition?

As with most of my work, I have taken an iterative approach to this process. The majority of our unprocessed born-digital materials were stored on CDs and 3.5” floppy disks, so those were the focus of our first round of purchasing a few weeks ago. In addition to the basic USB write blocker and BitCurator machine we already had, we now have a Dell external USB CD drive, a Tendak USB 3.5” floppy drive, and an Aluratek multimedia card reader to read the most common media in our unprocessed collections. We chose the Tendak drive mainly because of its price point, but it has not been the most reliable hardware and we will likely try something else in the future. As I’ve gone through old boxes from archivists past, I have found additional readers such as an Iomega Jaz drive, which I’m very glad we have; there are a number of Jaz disks in our unprocessed collections as well.

As I went about this process, I started by emailing many of my peers in the field to solicit their opinions and learn more about the equipment at their institutions. The range of responses I got was extremely helpful for my decision-making process. The team at bloggERS wanted to share that knowledge out to the rest of our readership, helping them learn from their peers at a variety of institutions. We hope you glean some useful information from this series, and we look forward to your comments and discussions on this important topic.


Emily Higgs is the Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library. Before moving to Swarthmore, she was a North Carolina State University Libraries Fellow. She is also the Assistant Team Leader for the SAA ERS section blog.

Recap: BitCurator Users Forum, October 24-25, 2019

The fifth annual BitCurator Users Forum was held at Yale University on October 24-25, bringing library, archives, and museum practitioners together to learn and discuss many aspects of digital forensics work. Over two days of workshops, lightning talks, and panels, the Forum covered a range of topics around acquisition, processing, and access for born-digital materials. In addition to traditional panels and conference sessions, attendees also participated in hands-on workshops on digital forensics techniques and tools, including the BitCurator environment.

Throughout the workshops, sessions, and discussions, one of the most dominant themes to emerge was the question of how archivists and institutions should address the environmental unsustainability of digital preservation. Attendees were quick to highlight recent work in this area, including the article Toward Environmentally Sustainable Digital Preservation by Keith L. Pendergrass, Walker Sampson, Tim Walsh, and Laura Alagna, among others. The prevalence of this topic at the Forum, as well as at other conferences and in our professional literature, points to the urgency archivists feel about ensuring that we can continue to preserve our digital holdings while minimizing negative environmental impact as much as possible.

The role of appraisal in relation to the environmental sustainability of digital preservation specifically was a major focus of the Forum. One attendee remarked that the “low cost of storage has outpaced the ability to appraise content,” summing up the situation that many institutions find themselves in, where the ever-decreasing cost of digital storage, anxiety about discarding potentially valuable collection material, and a lack of time and guidance on appraisal of digital materials have resulted in the ballooning of their digital holdings.

Participants challenged the notion that “keeping everything forever” should be our default preservation strategy. One common thread to emerge was the need to be more thoughtful about what we choose to retain and to develop and share appraisal criteria for born digital materials to help us make those decisions.

Also related to concerns about the environmental impact of digital preservation, presenters posed questions about how much data and related metadata for digital collections should be captured in the first place. Kelsey O’Connell, digital archivist at Northwestern University, proposed defining levels of digital forensics rather than applying the same workflow to every collection. Taking this type of approach to acquisition and metadata creation for born digital collection materials could help institutions minimize the storage of unnecessary collection data.

The BitCurator Users Forum provides an excellent opportunity for library and archives practitioners to learn new skills and discuss the many challenges and opportunities in the field of digital archiving. This year’s Forum was no exception and I have no doubt that it will continue to serve as a valuable resource for experienced practitioners as well as those just starting out.


Sally DeBauche is a Digital Archivist at Stanford University and the ePADD Project Manager.

DLFF’d Behind?

This year’s Digital Library Federation Forum (DLFF or #DLF2019 or #DLFforum if you’re social) was held October 14-16 in Tampa, FL. As usual, many of the sessions were directly relevant to the Electronic Records Section membership; also as usual, the Forum was heavily tweeted, giving a lot of us who couldn’t be there a mix of vicarious engagement and serious conference envy.

Thankfully, the DLF(F) ethos of collaboration makes it a little easier for everyone who couldn’t be there: OSF repositories for the DLF Forum and DigiPres meetings host (most of) the presentation slides for the 2019 meetings, organizers set up shared notes documents for the sessions, and each session had its own hashtag to help corral the discussion, annotations, and meta-commentary we’ve come to expect from libraries/archives/allied trades Twitter.

As most anyone who’s attended the DLF Forum will tell you, every time slot has something great in it, and there’s no substitute for being there. For the next best thing, we’re happy to present below a few sessions which caught our interest: the session description and available materials, shared notes, and of course, the Twitter feed. Enjoy, and FOMO no more!

SAA 2019 Recap | Email Archiving: Strategies, Tools, Techniques Workshop

Email Archiving: Strategies, Tools, Techniques was a one-day workshop held on August 1, 2019. Chris Prom (University of Illinois) and Tricia Patterson (Harvard University) taught the workshop, which gave a broad overview of the opportunities and challenges of email archiving and some tools that can be used to make this daunting task easier.

As a processing archivist, email sits squarely within the electronic records processing workflow I’m helping develop: I took this class to build my digital archiving skills and to learn about techniques for managing email archives. Attending this class while my department is developing a digital archiving workflow helped me think ahead about technical limitations, ethical considerations, storage, and access issues related to email.

For me, the class was a good introduction to the opportunities and challenges of preserving this ephemeral and widespread communication. The class was divided into three sections: Assessing Needs and Challenges, Understanding Tools and Techniques, and Implementing Workflows. These sections were based on the Lifecycle of Email Model from The Future of Email Archives CLIR Report.

During the first portion of the class, we discussed the types of communication that occur through email, and the functions which fall under the creation and use as well as appraisal and selection categories of the email lifecycle. This section featured an interesting group activity asking us to list all of the email accounts we had used in our lifetime, the type of correspondence that occurred on each platform, an estimated size of the collection, and the scope and contents. This exercise helped illustrate how large, multifaceted, and varied even a single email collection can be: I found it effective for thinking about the complexities of archiving email.

In the second section, Prom and Patterson walked the class through seven tools for capturing and processing emails. The instructors gave a brief description of each tool’s functions and where they fit in the lifecycle model before giving a demo. Unfortunately, the demo portion was the weakest part of this workshop for me: instead of a live demonstration, the instructors used screenshots and a video recording. It was difficult to read the screenshots, and the slides containing them did not have any explanatory text, so unless you took good notes, it would be difficult to understand how these tools work after the class was over. If SAA offers this class again, I would suggest the instructors do a live demo and provide more notes on how the tools work, so that we can use the class materials as a resource when we are doing this work at our own institutions.

The group activity for this class was to export a small portion of our own email and use one of the tools discussed in class to begin processing it. During this activity, we discovered that Yahoo makes it difficult or impossible to export email. I think this activity would have been more effective if we had been told before the class to download our own emails, and how to do so. Most of the time allotted for this activity was spent figuring out how to download our emails and waiting for them to download, so we never got the chance to use the programs we discussed.
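
For anyone trying the exercise at home, a first processing step on an exported mailbox might look like the hedged sketch below; the mbox file name is a placeholder for whatever your provider’s export produces, and, as we learned, some providers make even the export step hard.

```python
import mailbox
from collections import Counter

# "export.mbox" is a placeholder for an exported mailbox (e.g. from Google Takeout).
msgs = mailbox.mbox("export.mbox")

senders = Counter()
for message in msgs:
    senders[message.get("From", "unknown")] += 1

print(f"{len(msgs)} messages from {len(senders)} senders")
for sender, count in senders.most_common(10):
    print(f"{count:5d}  {sender}")
```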

Overall, I thought the class provided a good introduction to the complexities of preserving email and to the open-source and hosted tools that help with different parts of the email lifecycle. I would recommend this class to people who are exploring how to archive email and what would work for their institution.

Kahlee Leingang is a Processing Archivist at Iowa State University, where she works on creating guidelines and workflows for processing, preservation, and access of born-digital records as well as processing collections in the backlog.

Call for Contributions: What’s Your Set-Up?

bloggERS!, the blog of the Electronic Records Section of SAA, is accepting proposals for blog posts on the theme “What’s Your Set-Up?” These posts will address the question: what equipment do you need to get your job done in digital archives? We’re particularly interested in posts that consist of a detailed description of hardware, software, and other equipment used in your institution’s digital archives workflow (computers, readers, drives, etc.), as well as more general posts about equipment needs in digital archives.

See our call for posts below and email any proposals to ers.mailer.blog@gmail.com.

We look forward to hearing from all of you.

—The bloggERS! editorial subcommittee

Call for Posts

When starting a digital archives program from scratch, archivists can be easily overwhelmed by the range of hardware and software needed to effectively manage and preserve digital media, the variety of options for different equipment types, and where to obtain everything needed. As our practice evolves, so does the required equipment, and archivists are constantly replacing and improving our equipment according to our needs and resources. 

This series hopes to help break down barriers by allowing archivists to learn from their peers at a variety of institutions. We want to hear about the specific equipment you use in your day-to-day workflows, addressing questions such as: what do your workstations consist of? How many do you have? What readers and drives work reliably for your workflows? How did you obtain them? What doesn’t work? What is on your wish list for equipment acquisition?

We welcome posts from staff at institutions with all levels of budgetary resources. 

Other potential topics and themes for posts:

  • Creating a low-cost digital archives workstation
  • Stories of assembling workstations iteratively
  • Strategies for obtaining the necessary equipment, and preferred vendors
  • Working with IT to establish and support digital archives hardware and software
  • Stories of success or failure with advanced equipment such as the FRED Forensic Workstation or the Kryoflux

Writing for bloggERS! “What’s Your Set-Up?” Series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be roughly 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

Data As Collections

By Nathan Gerth

Over the past several years there has been a growing conversation about “collections as data” in the archival field. Elizabeth Russey Roke underscored the growing impact of this movement in her recent post on the Collections As Data: Always Already Computational final report. Much like her, I have seen this computational approach to the data in our collections manifest itself at my home institution, with our decision to start providing researchers with aggregate data harvested from our born-digital collections.

Data as Collections

At the same time, in my role as a digital archivist working with congressional papers, I have seen a growing body of what I call “data as collections.” I am using the term data in this case specifically in reference to exports from relational database systems in collections. More akin to research datasets than standard born-digital acquisitions, these exports amplify the privacy and technical challenges associated with typical digital collections. However, they also embody some of the more appealing possibilities for the computational research highlighted by the “collections as data” initiative, given their structured nature and millions of data points.   
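
To make the idea concrete (this is a generic illustration, not the WVU tool discussed below), a single delimited table from a CMS/CSS export could be loaded into SQLite for basic querying; the file name is hypothetical and the column names are simply taken from the export’s header row.

```python
import csv
import sqlite3

conn = sqlite3.connect("constituent_mail.db")

# Load one hypothetical delimited table from the export into SQLite.
with open("correspondence.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS correspondence ({columns})")
    conn.executemany(
        f"INSERT INTO correspondence VALUES ({placeholders})",
        (row for row in reader if len(row) == len(header)),
    )
conn.commit()

# Once loaded, even a simple count beats staring at an opaque binary export.
total = conn.execute("SELECT COUNT(*) FROM correspondence").fetchone()[0]
print(f"{total} rows loaded; meaningful queries depend on the export's columns")
```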

Curating and supplying access to one particular type of data export has become an acute problem in the field of congressional papers. As documented in a white paper by a Congressional Papers Section Task Force in 2017, members of the U.S. House of Representatives and U.S. Senate have widely adopted proprietary Constituent Management Systems (CMS) or Constituent Services Systems (CSS) to manage constituent communications. The exports from these systems document the core interactions between the American people and their representatives in Congress. Unfortunately, these data exports have remained largely inaccessible to archivists and researchers alike.

The question of curating, preserving, and supplying access to the exports from these systems has galvanized the work of several task forces in the archival community. In recent years, congressional papers archivists have collaborated to document the problem in the white paper referenced above and to support the development of a tool to access these exports. The latter effort, spearheaded by West Virginia University Libraries, earned a Lyrasis Catalyst Fund grant in 2018 to assess the development possibilities for an open-source platform, developed at WVU, to open and search these data exports. You can see a screenshot of the application in action below.

[Screenshot of a data table viewed in the West Virginia University Libraries CSS Data Tool]

The project funded by the grant, America Contacts Congress, has now issued its final report, and the members of the task force that served as its advisory board are transitioning to the next stage of the project. Here is where things stand:

What We Now Know

We now know much more about the key research audiences for this data and the archival needs associated with the tool. Researchers expressed solid enthusiasm for gaining access to the data, especially computationally minded quantitative scholars. For those of us involved in testing data in the tool, the project gave us a moment to become much more familiar with our data. I, for my part, also know a great deal more about the 16 million records in the relational data tables we received from the office of Senator Harry Reid, in addition to the 3 million attachments referenced by those tables. Without the ability to search and view the data in the tool, the tables and attachments from the Reid collection would have existed as little more than binary files.

Unresolved Questions

While members of the grant’s advisory board know much more about how the tool might be used in the sphere of congressional papers, we would like to learn more about other cases of “data as collections” in the archival field. Who beyond congressional papers archivists is grappling with supplying access to and preserving relational databases? We know, for example, that many state and local governments are using the same Constituent Relationship Management systems, such as iConstituent and Intranet Quorum, deployed in congressional offices. Do our needs overlap with those of other archivists, and could this tool serve a broader community? While the amount of CSS data export material in congressional collections is significant, the direction we take tool development, and the partnerships we pursue to supply access to the data, will hinge on finding a broader audience of archivists facing similar challenges.

If any of the questions above apply to you, consider contacting the members of the America Contacts Congress project’s advisory board. We would love to hear from you and discuss how the outcomes of the grant might apply to a broader array of data exports in archival collections. Who knows, we might even help you test the tool on your own data exports! For more information about the project, visit our webpage.


Nathan Gerth is the Head of Digital Services and Digital Archivist at the University of Nevada, Reno Libraries. Splitting his time between application support and digital preservation, he is the primary custodian of the electronic records from the papers of Senator Harry Reid. Outside of the University, he is an active participant in the congressional papers community, serving as the incoming chair of the Congressional Papers Section and as a member of the Association of Centers for the Study of Congress CSS Data Task Force.

SAA 2019 Recap | Session 504: Building Community History Web Archives: Lessons Learned from the Community Webs Program

by Steven Gentry


Introduction

Session 504 focused on the Community Webs program and the experiences of archivists who worked at either the Schomburg Center for Research in Black Culture or the Grand Rapids Public Library. The panelists consisted of Sylvie Rollason-Cass (Web Archivist, Internet Archive), Makiba Foster (Manager, African American Research Library and Cultural Center, formerly the Assistant Chief Librarian, the Schomburg Center for Research in Black Culture), and Julie Tabberer (Head of Grand Rapids History & Special Collections).

Note: The content of this recap has been paraphrased from the panelists’ presentations; all quoted content is drawn directly from those presentations.

Session summary

Sylvie Rollason-Cass began with an overview of web archiving and web archives, including:

  • The definition of web archiving.
  • The major components of web archives, including relevant capture tools (e.g. web crawlers, such as Wget or Heritrix) and playback software (e.g. Webrecorder Player); a short capture sketch follows this list.
  • The ARC and WARC web archive file formats. 
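
As a hedged illustration of a capture tool in action (my own aside, not part of the presentation), wget can crawl a site and write the result directly to a WARC file; the URL below is a placeholder.

```python
import subprocess

# Crawl one placeholder site and write a WARC alongside the mirrored copy.
subprocess.run([
    "wget",
    "--mirror",                  # recursive crawl with timestamping
    "--page-requisites",         # also fetch CSS, images, etc. needed to render pages
    "--warc-file=example-site",  # produces example-site.warc.gz
    "https://example.org/",
], check=False)
```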

Rollason-Cass then noted the necessity of web archiving—especially given the web’s ephemeral nature—and observed that many organizations archiving web content are higher education institutions. The Community Webs program was therefore designed to get more public library institutions involved in web archiving, which is critical given that these institutions often collect unique local and/or regional material.

After a brief description of the issues facing public libraries and public library archives—such as a lack of relevant case studies—Rollason-Cass provided information about the institutions that joined the program, the resources provided by the Internet Archive as part of the program (e.g. a multi-year subscription to Archive-It), and the project’s results, including:

  • The creation of more than 200 diverse web archives (see the Remembering 1 October web archive for one example).
  • Institutions’ creation of collection development policies pertaining specifically to web archives, in addition to other local resources.
  • The production of an online course entitled “Web Archiving for Public Libraries.” 
  • The creation of the Community Webs program website.

Rollason-Cass concluded by noting that although some issues—such as resource limitations—may continue to limit public libraries’ involvement in web archiving, the Community Webs program has greatly increased the ability for other institutions to confidently archive web content. 

Makiba Foster then addressed her experiences as a Community Webs program member. After a brief description of the Schomburg Center, its mission, and its unique place where “collections, community, and current events converge,” Foster highlighted her specific reasons for becoming more engaged with web archiving:

  • Like many other institutions, the Schomburg Center has long collected clippings files—and web archiving would allow this practice to continue.
  • Materials that document the experiences of the Black community are prominent on the World Wide Web.
  • Marginalized community members often publish content on the Web.

Foster then described the #HashtagSyllabusMovement collection, a web archive of educational material “related to publicly produced and crowd-sourced content highlighting race, police violence, and other social justice issues within the Black community.” Foster knew this content could be lost, so—even before participating in the Community Webs program—she began collecting URLs. Upon joining the Community Webs program, Foster used Archive-It to archive various relevant materials (e.g. Google docs, blog posts, etc.) dated from 2014 to the present. Although some content was lost, the #HashtagSyllabusMovement collection both continues to grow—especially if, as Foster hopes, it begins to include international educational content—and shows the value of web archiving.

In her conclusion, Foster addressed various successes, challenges, and future endeavors:

  • Challenges:
    • Learning web archiving technology and having confidence in one’s decisions.
    • Curating content for the Center’s five divisions.
    • “Getting institutional support.”
  • Future Directions:
    • A new digital archivist will work with each division to collect and advocate for web archives.
    • Considering how to both do outreach for and catalog web archives.
    • Ideally, working alongside community groups to help them implement web archiving practices.

The final speaker, Julie Tabberer, addressed the value of public libraries’ involvement in web archives. After a brief overview of the Grand Rapids Public Library, the necessity of archives, and the importance of public libraries’ unique collecting efforts, Tabberer posited the following question: “Does it matter if public libraries are doing web archiving?” 

To test her hypothesis that “public libraries document mostly community web content [unlike academic archives],” Tabberer analyzed the seed URLs of fifty academic and public libraries to answer two specific questions:

  • “Is the institution crawling their own website?”
  • “What type of content [e.g. domain types] is being crawled [by each institution]?”

After acknowledging some caveats with her sampling and analysis—such as the fact that data analysis is still ongoing and that only Archive-It websites were examined—Tabberer showed audience members several graphics revealing that academic libraries (1) typically crawled their own websites more than public libraries did and (2) captured more academic websites than public libraries did.
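
A hedged sketch of the kind of seed-list analysis described above might look like the following; the seed URLs are placeholders rather than Tabberer’s actual data, which came from the institutions’ Archive-It seed lists.

```python
from collections import Counter
from urllib.parse import urlparse

# Placeholder seeds; a real analysis would read each institution's seed list.
seeds = [
    "https://www.grandrapidsmi.gov/",
    "https://example.edu/library/",
    "https://communitynews.org/",
]

domain_types = Counter()
for url in seeds:
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1] if "." in host else "unknown"
    domain_types[suffix] += 1  # tally .gov, .edu, .org, .com, etc.

for suffix, count in domain_types.most_common():
    print(f".{suffix}: {count}")
```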

Tabberer then concluded with several questions and arguments for the audience to consider:

  • In addition to encouraging more public libraries to archive web content—especially given their values of access and social justice—what other information institutions are underrepresented in this community?
  • Are librarians and archivists really collecting content that represents the community?
  • Even though resource limitations are problematic, academic institutions must expand their web archiving efforts.

Steven Gentry currently serves as a Project Archivist at the Bentley Historical Library. His responsibilities include assisting with accessioning efforts, processing complex collections, and building various finding aids. He previously worked at St. Mary’s College of Maryland, Tufts University, and the University of Baltimore.