Supervised versus Unsupervised Machine Learning for Text-based Datasets

By Aarthi Shankar

This is the fifth post in the bloggERS series on Archiving Digital Communication.


I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.

During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited to our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, building a system that uses real-time camera feeds to detect human-vehicle interactions and sound an alarm if a collision is imminent. I used a Supervised Machine Learning algorithm, a Support Vector Machine (SVM), to train a model to identify cars and humans in individual data frames. In that project, Supervised Machine Learning worked well for identifying objects embedded in images. But I do not believe it will be as beneficial for our current project, which works with text-only data. Here is my argument for my position.

In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to train it. Training is based on that input set, and the algorithms that process it produce the required output. In my project, I created a training dataset containing pictures of specific attributes of cars (windshields, wheels) and of humans (faces, hands, and legs). I needed a training set of 900-1,000 images to achieve roughly 92% accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would struggle to distinguish, say, windshield glass from the other glass (e.g., building windows) present in many places on a whole data frame.

For Supervised Machine Learning to work well, the expected output of an algorithm should be known, and the data used to train the algorithm must be properly labeled. This takes a great deal of effort: a huge volume of words, along with their synonyms, would be needed as a training set. It also implies we know what we are looking for in the data. For the purposes of our project, the expected output (all “sensitive” content) is not so clearly known, so a reliable training set and algorithm would be difficult to create.
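
To make the labeling burden concrete, here is a minimal sketch of supervised text classification with scikit-learn (my toolkit choice for illustration only; the labels and messages below are invented, and a real training set would need thousands of reliably labeled emails):

```python
# A minimal supervised-learning sketch: every training example must be
# labeled by a human before the classifier can learn anything.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "Attached are the employee's social security number and home address.",
    "Reminder: the staff meeting has moved to 3pm on Thursday.",
]
train_labels = ["sensitive", "not_sensitive"]  # the costly human effort

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf = LinearSVC().fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["Please review the attached SSN list."])))
```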

In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together by likely relevance. Unsupervised Machine Learning tools can still allow humans to add their own input text or data for the algorithm to understand and train on. Through the use of clustering mechanisms, the input data can first be divided into clusters and the test data then identified using those clusters. I believe this approach is better than Supervised Machine Learning for our purposes.
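
By contrast, a clustering sketch (again scikit-learn, again invented documents) needs no labels at all; the grouping falls out of the text similarity:

```python
# A minimal unsupervised sketch: K-means groups documents by similarity
# without any pre-assigned labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Budget figures for fiscal year 2016",
    "Revised budget spreadsheet attached",
    "Lunch on Friday?",
    "Are we still on for lunch next week?",
]

X = TfidfVectorizer().fit_transform(docs)  # no labels required
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: budget messages vs. lunch messages
```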

In summary, a Supervised Machine Learning tool learns to ascribe the labels supplied in its training data, but a reliable training dataset takes significant effort and is not easy to create from our textual data. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project, particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.


Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.


Stanford Hosts Pivotal Session of Personal Digital Archiving Conference

By Mike Ashenfelder


In March, Stanford University Libraries hosted Personal Digital Archiving 2017, a conference about preservation of and access to digital stuff, both for individuals and for aggregations of individuals. Presenters included librarians, data scientists, academics, data hobbyists, researchers, humanitarian citizens and more. PDA 2017 differed from previous PDA conferences, though, when an honest, intense discussion erupted about race, privilege and bias.

Topics did not fall into neat categories. Some people collected data, some processed it, some managed it, some analyzed it. But often the presenters’ interests overlapped. Here are just some of the presentations, grouped by loosely related themes.

  • Joan Jeffri’s (Research Center for Arts & Culture/The Actors Fund) project archives the legacy of older performing artists. Jessica Moran (National Library of New Zealand) talked about the digital archives of a contemporary New Zealand composer and Shu-Wen Lin (NYU) talked about archiving an artist’s software-based installation.
  • In separate projects, Andrea Prichett (Berkeley Copwatch), Stacy Wood and Robin Margolis (UCLA), and Ina Kelleher (UC Berkeley) talked about citizens tracking the actions of police officers and holding the police accountable.
  • Stace Maples (Stanford) helped digitize 500,000+ consecutive photos of buildings along Sunset Strip. Pete Schreiner (North Carolina State University) archived the debris from a van that three bands shared for ten years. Schreiner said, “(The van) accrued the detritus of low-budget road life.”
  • Adam Lefloic Lebel (University of Montreal) talked about archiving video games and Eric Kaltman (UC Santa Cruz) talked about the game-research tool, GISST.
  • Robert Douglas Ferguson (McGill) examined personal financial information management among young adults. Chelsea Gunn (University of Pittsburgh) talked about the personal data that service providers collect from their users.
  • Rachel Foss (The British Library) talked about users of born-digital archives. Dorothy Waugh and Elizabeth Russey Roke (Emory) talked about how digital archiving has evolved since Emory acquired the Salman Rushdie collection.
  • Jean-Yves Le Meur (CERN) called for a non-profit international collaboration to archive personal data, “…so that individuals would know they have deposited stuff in a place where they know it is safe for the long term.”
  • Sarah Slade (State Library Victoria) talked about Born Digital 2016, an Australasian public-outreach program. Natalie Milbrodt (Queens Public Library) talked about helping her community archive personal artifacts and oral histories. Russell Martin (DC Public Library) talked about helping the DC community digitize their audio, video, photos and documents. And Jasmyn Castro (Smithsonian African American History Museum) talked about digitizing AV stuff for the general public.
  • Melody Condron (University of Houston) reviewed tools for handling and archiving social media and Wendy Hagenmaier (Georgia Tech) introduced several custom-built resources for preservation and emulation.
  • Sudheendra Hangal and Abhilasha Kumar (Ashoka University) talked about using personal email as a tool to research memory. And Stanford University Libraries demonstrated their ePADD software for appraisal, ingest, processing, discovery and delivery of email archives. Stanford also hosted a hackathon.
  • Carly Dearborn (Purdue) talked about data analysis and management for researchers. Leisa Gibbons (Kent State) analyzed interactions between YouTube and its users. And Nancy Van House (UC Berkeley) and Smiljana Antonijevic Ubois (Penn State) talked about digital scholarly workflow. Gary Wolf (Quantified Self) talked about himself.

Some presentations addressed cultural sensitivity and awareness. Barbara Jenkins (University of Oregon) discussed a collaborative digital project in Afghanistan. Kim Christen (Washington State) demonstrated Mukurtu, built with indigenous communities, and Traditional Knowledge Labels, a metadata tool for adding local cultural protocols. Katrina Vandeven (University of Denver) talked about a transformative insight she had during a Women’s March project, where she suddenly became aware of the bias and “privileged understanding” she brought to it.

The conference ended with observations from a group of panelists who have been involved with the PDA conferences since the beginning: Cathy Marshall, Howard Besser (New York University), Jeff Ubois (MacArthur Foundation), Cliff Lynch (Coalition for Networked Information) and me.

Marshall said, “I still think there’s a long-term tendency toward benign neglect, unless you’re a librarian, in which case you tend to take better care of your stuff than other people do.” She added that cloud services have improved to the point where people can trust online backup. Marshall said, “Besides, it’s much more fun to create new stuff than to be the steward of your own mess.”

Lynch agreed about automated backup. “There used to be a view that catastrophic data loss was part of life and you’d just lose your stuff and start over,” Lynch said. “It was cleansing and terrifying at the same time.” He said the possibility of data loss is still real but less urgent.

Marshall spoke of backing the same stuff up again and again, and how it’s “all over the place.”

Besser described a conversation he had with his sister that they carried out over voicemail, WhatsApp, text and email. “All this is one conversation,” Besser said. “And it’s all over the place.” Lynch predicted that the challenge of organizing digital things is “…going to shift as we see more and more…automatic correlation of things.”

Ubois said, “I think emulation has been proven as something that we can trust.” He also noted the “cognitive diversity” around the room. He said, “Many of the best talks at PDA over the years have been by persistent fanatics who had something and they were going to grab it and make it available.”

Besser said, “Things that we were talking about…years ago, that were far out ideas, have entered popular discourse…One example is what happens to your digital stuff when you die…Now we have laws in some states about it…and the social media have stepped up somewhat.”

I noted that the first PDA conference included presentations about personal medical records and about genealogy, but those two areas haven’t been covered since. Lynch made a similar statement about how genealogy “…richly deserves a bit more exploration.” I also noted that the general public still needs expert information about digitizing and digital preservation, and we see more examples of university and public librarians taking the initiative to help their communities with PDA.

In a Q&A session, Charles Ransom (University of Michigan) raised the bias issue again when he said, “I was wondering…how privilege plays a part in all of this. Look at the audience and it’s clear that it does,” referring to the overwhelmingly white audience.

Besser said that NYU reaches out to activist, ethnic and neighborhood communities. “Most of us (at this conference) …work with disenfranchised communities,” said Besser. “It doesn’t bring people here to this meeting…but it does mean that at least some of those voices are being heard through outreach.” Besser said that when NYU hosted PDA 2015, they worked toward diversity. “We still had a pretty white audience,” Besser said. “But…it’s more than just who gets the call for proposal…It’s a big society problem that is not really easy to solve and we all have to really work on it.”

I said it was a challenge for PDA conference organizers to reach a wide, diverse audience just through the host institution’s social media tools and a few newsgroups. When asked what the racial mix was at the PDA 2016 conference (which the University of Michigan hosted), Ransom said it was about the same as at this conference. He said, “I specifically reached out to presenters that I knew and the pushback I got from them was ‘We don’t have a budget to go to Ann Arbor for four days and pay for hotel and travel and registration fees.’”

Audience members suggested having the PDA host invite more local community organizations, so travel and lodging won’t matter, and possibly waiving fees. The University of Houston will host PDA 2018; Melody Condron said UH has a close relationship with Houston community organizations and she will explore ways to involve them in the conference.

Lynch, whose continuous conference travels afford him a global technological perspective, said of the PDA 2017 conference, “I’m really encouraged, particularly by the way we seem to be moving the deeper and harder problems into focus…We’re just now starting to understand the contours of the digital environment.”

The conference videos are available online at https://archive.org/details/pda2017.

Call for Contributions: Collaborating Beyond the Archival Profession

The work of archivists is highly collaborative in nature. While the need for and benefits of collaboration are widely recognized, the practice of collaboration can be, well, complicated.

This year’s ARCHIVES 2017 program featured a number of sessions on collaboration: archivists collaborating with users, Indigenous communities, secondary school teachers, etc. We would like to continue that conversation in a series of posts that cover the practical issues that arise when collaborating with others outside of the archival profession at any stage of the archival enterprise. Give us your stories about working with technologists, videogame enthusiasts, artists, musicians, activists, or anyone else with whom you find yourself collaborating!

A few potential topics and themes for posts:

  • Posts written by non-traditional archivists or others working to preserve heritage materials outside of traditional archival repositories
  • Posts co-written by archivists and collaborators
  • Tips for translating archival jargon, and suggestions for working with others in general
  • Incorporating researcher feedback into archival work
  • The role of empathy in digital archives practice

Writing for bloggERS! Collaborating Beyond the Archival Profession series

  • We encourage visual representations: Posts can include or largely consist of comics, flowcharts, a series of memes, etc!
  • Written content should be 600-800 words in length
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Posts for this series will start in November, so let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

There’s a First Time for Everything

By Valencia Johnson

This is the fourth post in the bloggERS series on Archiving Digital Communication.


This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Due to their longevity with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archives collected an archival Outlook .pst file of 15.8 GB! This was my first time working with emails. It was also the first time some of the tools discussed below were used in the workflow at my institution. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions of the tools I used in this acquisition.

My original workflow:

  1. Convert the .pst file into an .mbox file.
  2. Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
  3. Follow and complete our accessioning procedures.

Things were moving smoothly. I was able to use Emailchemy, a tool that converts email from closed, proprietary file formats, such as the .pst files used by Outlook, to standard, portable formats that any application can use, such as .mbox files, which can be read using Thunderbird, Mozilla’s open-source email client. I used a Windows laptop that had Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy: the instructions in the owner’s manual were clear and the process was easy. Next, I uploaded the Email folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I used to accession is a FRED, a powerful forensic recovery tool used by law enforcement and some archivists. Our FRED runs BitCurator, which is a Linux environment. This is an important fact to remember, because .pst files will not open on a Linux machine.
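
That portability is the whole point of the conversion. As a quick sketch (Python standard library; the file name is hypothetical), any script can walk an .mbox once Emailchemy has produced it:

```python
# A minimal sketch of reading a converted .mbox with Python's standard
# library; no email client required.
import mailbox

mbox = mailbox.mbox("converted.mbox")  # hypothetical output from Emailchemy
for message in mbox:
    print(message["Date"], message["From"], message["Subject"])
```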

At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.

[Image: Bulk Extractor error message. Caption: “Yeah Bulk Extractor, I’ll just pick up more cores during lunch.”]

The program was unable to complete 4 threads within the Email folder and timed out. The picture above is part of the explanation message I received. In my understanding and research (aka Google, because I did not understand the message), the program was unable to completely execute the task with the amount of processing power available. So the message is essentially saying, “I don’t know why this is taking so long. It’s you, not me. You need a better computer.” From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran the scan solely on the Email folder, and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.

I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged, e.g. (/home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Email folder looked like this:

[Image: Bulk Extractor output for the Email folder. Some text has been blacked out for privacy.]

Even though Bulk Extractor Viewer does display the content, it displays it the way a text editor (e.g., Notepad) would, with all the coding alongside the content of the message rather than as an email, because all the results were from the .mbox file. This is simply the format .mbox generates without an email client, and the coding can be difficult to interpret without a client to translate the material into a human-readable format. The output makes it hard to locate an individual message within a .pst because it is difficult, though not impossible, to find the date or title of the email amongst the coding. This was my first time encountering results like this, and it freaked me out a bit.

Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of SSNs or credit card numbers. So in lieu of social security numbers, I found FedEx tracking numbers or mistyped phone numbers (though, to be fair, a mistyped phone number is someone’s SSN). For credit card numbers, the program picked up email coding and non-financial number patterns.
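
As a toy illustration of why such false positives happen (this is not Bulk Extractor’s actual pattern, just a naive SSN-shaped regular expression and invented text):

```python
# Any nine digits in a 3-2-4 shape match, whether or not they are an SSN,
# so a human still has to verify every hit.
import re

ssn_like = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "SSN on file: 219-09-9999",     # an SSN-formatted example number
    "call me back at 555-12-3456",  # a mistyped phone number
]
for s in samples:
    if ssn_like.search(s):
        print("flagged:", s)  # both lines are flagged
```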

The scan found an SSN I had to remove from both the .pst and the .mbox. Remember, .pst files only work with Microsoft Outlook; at this point in processing I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought I could run a keyword search within the .mbox to locate and remove the flagged material, because .mbox files can be opened with a text editor. Remember when I said the .pst was over 15 GB? The .mbox was just as large, which caused the text editor to stall and eventually give up opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a text editor built for large files. The whole process took a couple of days, and in the end the results from Bulk Extractor’s search indicated the email files contained one SSN and no credit card numbers.
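
In hindsight, streaming the file is another way around the editor problem: reading line by line keeps memory use flat no matter how large the .mbox is. A minimal sketch, with a hypothetical file name and search term:

```python
# Search a huge .mbox for a flagged term without loading it into memory.
search_term = "219-09-9999"  # hypothetical flagged string

with open("converted.mbox", "r", encoding="utf-8", errors="replace") as f:
    for line_number, line in enumerate(f, start=1):
        if search_term in line:
            print(f"match at line {line_number}: {line.strip()[:80]}")
```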

While discussing my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor but created with .pst files in mind, to fulfill our due diligence to look for sensitive information, since the Bulk Extractor scan had timed out before finishing. FileLocator Pro runs on Windows, so unfortunately we couldn’t do the scan on the FRED, but it was able to catch real SSNs hidden in attachments that had not appeared in the Bulk Extractor results.

As with Bulk Extractor, I was able to view the email with the flagged content highlighted within FileLocator Pro. There is also the option to open attachments or emails in their respective programs, so a .pdf file opened in Adobe and email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It didn’t perform as well searching for credit card numbers; I had some error messages stating that some attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of the primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.

As email continues to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process ever-larger files, no matter how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails or attachments, or even to complete a scan. There’s no such thing as a perfect program: I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve, and with forums like this blog we can learn from one another.


Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.

Hybrid like Frankenstein, but not helpful like a Spork

By Gabby Redwine

This is the third post in the bloggERS series on Archiving Digital Communication.


The predictions have come true: acquisitions of born-digital materials are on the rise, and for the foreseeable future many of these collections will include a combination of digital and analog materials. Working with donors prior to acquisition can help a collecting body ensure that the digital archives it receives fall within its collecting scope in terms of both format and content. This holds true for institutional repositories, collecting institutions of all sorts, and archives gathered and stored by communities and individuals rather than institutions. Donors sometimes can provide insight into how born-digital items in an acquisition relate to each other and to non-digital materials, which can be particularly helpful with acquisitions containing a hybrid of paper and born-digital correspondence.

I’ve helped transfer a few acquisitions containing different kinds of digital correspondence: word processing documents, email on hard drives and in the cloud, emails saved as PDFs, email mailboxes in archived formats, and others. Often the different formats represent an evolution in a person’s correspondence practices over time and across the adoption of different digital technologies. Just as often, a subset of these different types of digital correspondence are duplicated in analog form.

Examples include:

  • Letters originally written in word processing software that also exist as print-outs with corrections and scribbles, not to mention the paper copy received (and perhaps retained in some other archive) by the recipient.
  • Email that has been downloaded, saved, printed, and stored alongside analog letters.
  • An acquisition that includes email as the only correspondence after a particular date, all of it downloaded and saved as individual PDF files, with only the most important ones printed and stored among the paper records.
  • Email folders received annually from staff with significant duplication in content.
  • Tens of thousands of emails stored in the cloud which have been migrated across different clients/hosts over the last 20 years, some with different foldering and labeling practices.

When the time comes to transfer all or some of this to an archives, the donor and the collecting body must make decisions about what, if anything, is important to include and how to represent the relationship between the different correspondence formats. Involvement with donors early on can be incredibly beneficial, but it can also cause a significant drain on staff resources, particularly in one-person shops.

What is the minimum level of support staff can provide to every donor with digital materials? What are levels of service that could be added in particular circumstances—for example, when a collection is of particular value or a donor requires additional technological support? And how can staff ensure that the minimum level of service provided doesn’t inadvertently place an undue burden on a donor—for example, someone who may not have the resources to hire technological support or might not like to ask for help—that results in important materials being excluded from the historical record?

At the staff end, hybrid correspondence files also raise questions about whether and how to identify both paper and digital duplicates (is it worth the effort?), whether and how to dispose of them (is it worth the time and documentation?), and at what point in the process this work can realistically take place. Many of the individual components of hybrid correspondence archives seem familiar and perhaps even basic to archivists, but once assembled they present challenges that resemble a more complex monster—one that perhaps not even the creator can explain.

I’m writing from the perspective of someone who has been involved with hybrid collections primarily at the acquisition and accessioning end of the spectrum. If any readers have an example of an archival collection in which the hybrid nature of the materials has been helpful (like a Spork!), perhaps during arrangement & description or even to a researcher, please share your experience in the comments.


Gabby Redwine is Digital Archivist at the Beinecke Rare Book & Manuscript Library at Yale.

Archiving Email: Electronic Records and Records Management Sections Joint Meeting Recap

By Alice Sara Prael

This is the second post in the bloggERS series on Archiving Digital Communication.

Email has become a major challenge for archivists working to preserve and provide access to correspondence. There are many technical challenges that differ between platforms, as well as intellectual challenges in describing and appraising massive, disorganized inboxes.


At this year’s Annual Meeting of the Society of American Archivists, the Electronic Records Section and the Records Management Section joined forces to present a panel on the breakthroughs and challenges of managing email in the archives.

Sarah Demb, Senior Records Manager, kicked off the panel by discussing the Harvard University Archives’ approach to collecting email from internal and donated records. Since their records retention schedule is not format-specific, it doesn’t separate email from other types of correspondence. Correspondence in electronic format affects the required metadata and the acquisition tools and methods, not the appraisal decisions, which are driven entirely by content. When a collection is acquired, administrative records are often mixed with faculty archives, which poses a major challenge for appraisal of correspondence. This is true for both paper and email correspondence, but a digital environment lends itself to mixing administrative and faculty records much more easily. Another major challenge in acquiring these internal records is that emails are often attached to business systems in the form of notifications and reporting features. These system-specific emails overlap significantly and cause duplication when system reports exist in one or many inboxes.

Since internal records at the Harvard University Archives are closed for 50 years, and personal information is closed for 80 years, Demb is less concerned with accidental disclosure of private information to a researcher and more concerned with making the right appraisal decisions during acquisition. Email is acquired by the archives at the end of a faculty member’s career rather than through regular smaller acquisitions, which often leaves the archivist with one large, unwieldy inbox. Although donors are encouraged to weed their own inboxes prior to acquisition, this is rare. The main strategy Demb supports is encouraging best practices through training and offering guidance whenever possible.

The next presenter was Chris Prom, Assistant University Archivist at the University of Illinois at Urbana-Champaign. He discussed the work of the Andrew W. Mellon Foundation and Digital Preservation Coalition Task Force on Technical Approaches to Email Archives. The task force includes 12 members representing the U.K. and U.S., as well as numerous “Friends of the Task Force” who provide additional support. It recently published a draft report, which is available online for comment through August 31st. Don’t worry if you don’t have time to comment in the next two days: the report will go out for a second round of comments in September. The task force is taking cues from other industries doing similar work with email, such as the legal and forensic fields, which use email as evidence. Having corporate representation from Google and Microsoft has been valuable because they are already acting on suggestions from the task force to make their systems easier to preserve.

One major aspect of the task force’s work is addressing interoperability. Getting data out of one platform and into a form usable by different tools has been an ongoing challenge for archivists managing email. There are many useful tools available, but chaining them together into a holistic workflow is problematic. Prom suggested that one potential solution to the ‘one big inbox’ problem is to capture email via API at regular intervals, rather than waiting for an entire career’s worth of email to accumulate.
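
As a rough sketch of what interval capture might look like (IMAP is just one possible API; the host, account, folder, and date below are placeholders):

```python
# A hedged sketch of periodic capture: fetch everything since the last
# run from a live account instead of waiting decades for one big inbox.
import imaplib

with imaplib.IMAP4_SSL("imap.example.edu") as conn:
    conn.login("capstone_account", "app_password")  # placeholders
    conn.select("INBOX", readonly=True)             # never alter the account
    status, data = conn.search(None, "SINCE", "01-Jan-2017")
    for num in data[0].split():
        status, msg_data = conn.fetch(num, "(RFC822)")
        raw_message = msg_data[0][1]  # full RFC 822 bytes
        # ...append raw_message to an .mbox or EML store, log the capture...
```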

Camille Tyndall Watson, Digital Services Section Manager at the State Archives of North Carolina, completed the panel by discussing the Transforming Online Mail with Embedded Semantics (TOMES) project. This grant-funded project is focused on appraisal, implementing the capstone approach, which identifies certain email accounts with enduring value rather than identifying individual emails. The project includes partners from Kansas, Utah, and North Carolina, but the hope is that the model can be duplicated in other states.

The first challenge was to choose the public officials whose accounts are considered part of the ‘capstone’ based on their positions in the organizational chart. The project also crosswalked job descriptions to functional retention schedules. By working with the IT department, the team members are automating as much of the workflow as possible. This included assigning position numbers for ‘archival email accounts’ in order to track positions rather than individuals, which is difficult in organizations with significant turnover, like governmental departments. This nearly constant turnover requires constant outreach to answer questions like “what is a record?” and “why does the archive need your email?” The project is also researching natural language processing to allow for an automated and simplified process of arranging and describing email collections.

The main takeaway from this panel is that email matters. There are many challenges, but the work is necessary because email, much like paper correspondence, has cultural and historical value beyond the transactional value it serves in our everyday lives.



Alice Sara Prael is the Digital Accessioning Archivist for Yale Special Collections at Beinecke Rare Book & Manuscript Library.  She works with born digital archival material through a centralized accessioning service.

Adventures in Email Wrangling: TAMU-CC’s ePADD Story

By Alston Cobourn

This is the first post in the bloggERS series on Archiving Digital Communication.

Getting Started

Soon after I arrived at Texas A&M University-Corpus Christi in January 2017 as the university’s first Processing and Digital Assets Archivist, two high-level longtime employees retired or switched positions. Therefore, I fast-tracked an effort to begin collecting selected email records because these employees undoubtedly had some correspondence of long-term significance, which was also governed by the Texas A&M System’s records retention schedules.

I began by testing ePADD, software used to conduct various archival processes on email, on small date ranges of my own work email account. I ultimately decided to begin using it on selected campus email because I found it relatively easy to use, it includes some helpful appraisal tools, and it provides an interface through which patrons can view and select records they want a copy of. Since the emails themselves live as MBOX files in folders outside the software, and are viewable with a text editor, I felt comfortable that using ePADD did not risk the loss of important records. I installed ePADD on my laptop with the thought that traveling to the employees would make the process of transferring their email easier and encourage cooperation.

Transferring the email

In June 2017, I used ePADD Version 3.1 to collect the email of the two employees. My department head shared general information and arranged an appointment with each employee’s current administrative assistant or interim replacement, as applicable. She also asked campus IT to keep the account of the retired employee open; IT granted the interim replacement access to the account.

I then traveled to the employees’ offices where they entered the appropriate credentials for the university email account into ePADD, identified which folders were most likely to contain records of long-term historical value, and verified the date range I needed to capture.  Then we waited.

In one instance, I had to leave my laptop running in the person’s office overnight because I needed to maintain a consistent internet connection during ePADD’s approximately eight hours of harvesting and the office was off-campus.  I had not remembered to bring a power cord, but thankfully my laptop was fully charged.

Successes

Our main success: we were actually able to collect some records! Obvious, yes, but I state it because this was the first time TAMU-CC had ever collected this record format, and the email of the departed employee was almost deactivated before we sent our preservation request to IT. Second, my department head and I have started conversations with important players on campus about the ethical and legal reasons why the archives needs to review email before disposal.

Challenges

In both cases, the employee had deleted a significant number of emails before we were able to capture the account, and had used their work account for personal email. These behaviors confirmed what we all already knew: employees are largely unaware that their email is an official record. Therefore, we plan to increase efforts to educate faculty and staff about this fact, their responsibilities, and best practices for organizing their email. The external conversations we have had so far are an important start.

ePADD enabled me to combat the personal email complication by systematically deleting all emails from specific individual senders in batch. I took this approach for family members, listservs, and notifications from various personal accounts.
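
ePADD handled that weeding inside its own interface. Purely for comparison, the same operation over a raw MBOX file can be sketched with Python’s standard library (the sender list and file names are invented):

```python
# Copy every message except those from excluded senders to a new mbox.
import mailbox

exclude = {"family@example.com", "deals@retailer.example"}  # hypothetical

source = mailbox.mbox("account.mbox")
weeded = mailbox.mbox("account_weeded.mbox")
for message in source:
    sender = message["From"] or ""
    if not any(addr in sender for addr in exclude):
        weeded.add(message)
weeded.flush()
```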

The feature that recognizes sensitive information worked well in identifying messages that contained social security numbers. However, it did not flag messages that contained phone numbers, which we also consider sensitive personal information. Additionally, in-message redaction is not possible in 3.1.

For messages I have marked as restricted, I have chosen to add an annotation as well that specifies the reason for the restriction. This will enable me to manage those emails at a more granular level. This approach was a modification of a suggestion by fellow archivists at Duke University.

Conclusion

Currently, the email is living on a networked drive while we establish an Amazon S3 account and an Archivematica instance. We plan to provide access to email in our reading room via the ePADD delivery module and publicize this access via finding aids. Overall ePADD is a positive step forward for TAMU-CC.

Note from the Author:

Since writing this post, I have learned that it is possible in ePADD to use regular expressions to further aid in identifying potentially sensitive materials. By default the program uses regular expressions to find social security numbers, but it can be configured to find other personal information, such as credit card numbers and phone numbers. Further guidance is provided in the Reviewing Regular Expressions section of the ePADD User Guide.
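
The exact configuration steps are in the User Guide; before adding a pattern, it can help to test it against sample text first. A rough, illustration-only pattern for US-style phone numbers:

```python
# A candidate phone-number regex, worth testing before configuring it
# in ePADD; the pattern and sample text are illustrative assumptions.
import re

phone = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")

text = "Reach me at (361) 825-0000 or 361.825.0000 after Monday."
print(phone.findall(text))  # both notations should match
```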


Alston Cobourn is the Processing and Digital Assets Archivist at Texas A&M University-Corpus Christi, where she leads the library’s digital preservation efforts. Previously she was the Digital Scholarship Librarian at Washington and Lee University. She holds a BA and an MLS with an Archives and Records Management concentration from UNC-Chapel Hill.

Systems Thinking Started Me on My Path

By Jim Havron

___

This is the fourth post in the bloggERS! series Digital Archives Pathways, where archivists discuss the non-traditional, accidental, idiosyncratic, or unique paths they took to become a digital archivist.

I am apparently a systems thinker. When I view an event, problem, or task, I study how it affects the other things it connects with, the processes it involves, potential repercussions and unintended consequences, and how it came to be and why. I learned early in life to question statements and strenuously evaluate the evidence supporting or opposing them. This made me an excellent competition debater: although the topic resolution to be debated was known in advance, a team wasn’t told whether it would argue for or against the resolution until 30 minutes prior to the debate, which required the ability to see different views on the topic and different ways of interpreting evidence. The importance of evidence in the “big picture” and in ongoing processes eventually led me to become an archivist. The big picture of archives today also led me to cybersecurity, or InfoSec, in IT/IS.

Early Skills and Experiences

My path to archives, and particularly to electronic records in archives, was not a straight one. I originally started college to double major in math and physics. I left school before completing my degrees and developed other professional skills and experiences. These skills in management, logistics, legacy technology, and communication all became part of my views on, and approach to, seeing more than an archives-eye view of electronic record production and preservation. They helped shape my course through school when I returned and allowed me to “sell” my skills to gain experience. I feel one should always take a look at one’s full inventory of experience when tackling a task with ever-changing parameters.

Silos, Collaborative Silos, Venn Views

I was already aware that professional people tended to operate in what we now refer to as silos. Many professionals, despite knowledge of peripheral fields, have a core training and set of experiences that strongly define their professional identity. I saw many professions as a series of overlapping fields. Seeing things as parts of systems, I often pursued combinations of tasks that were not usually considered to fit together efficiently.

[Image: diagram of professional silos and overlapping fields.]

Alternative Professional Path

I continued my self-education on technology. My experience had taught me that the vast majority of records were being generated in electronic formats and were not being saved for historical value. Archivists I encountered, working primarily with paper, seemed unconcerned, IT professionals didn’t understand historical value, and the people who generated the records didn’t see value that would offset resources needed for such preservation. When I started my graduate work to become an archivist, I did so with the plan of continuing on into computer information systems (CIS). This field augmented traditional IT with business knowledge and skills. I intended to combine the fields to gain new insights.

[Image: Venn diagram of four overlapping cultural heritage professions.]

I had several experiences and revelations that helped drive my work:

  • A survey I did in school showed that over 70% of surveyed researchers would prefer online access to electronic versions of records (very unusual and controversial at the time) over more electronic finding aids. However, 100% of surveyed archivists believed researchers would prefer the finding aids.
  • I once (within the last decade) made the statement that repositories that were not online were pretty much invisible. I was literally called a “heretic.”
  • My primary field is security and assurance. I found professionals working with digital assets in libraries promoted very insecure products and techniques.
  • Even IT is not a universal profession and has its own silos.
  • I realized that archivists still have not accepted that most records are in databases, which require preservation of much more information than the data fields alone.
  • Many born digital records in archives are just electronic representations of documents that can be turned into hard copies.
  • Archivists I know tend to trust the vendor, the person whose job it is to sell the product, when selecting technology to keep records safe over time, not the security professional.
  • Many archivists turn their electronic records over to departments they do not control, with ever-changing personnel and budgets, unknown security and disaster recovery measures, and frequently unknown storage locations, yet feel that as long as they can access the records, the archivist has control.
  • “The cloud” is a vague, poorly understood term with different, ever-changing meanings, yet it is often the first choice for record preservation.

The biggest problem and motivation in security and preservation is people.


Lessons Learned

Some primary points learned in selecting a path into a career with electronic records preservation and access:

#1 Step back and look at the picture. Is there a special problem or area of need where you have a passion, special skills, or both? Do you have skills that you can use that are different or rare?

#2 Never believe that archives, or for that matter IT, exists in a bubble. The creators of the records will drive the technology.

#3 The easy way to deal with electronic records is often the least secure.

#4 Technology changes faster than most people imagine, so knowledge and skill acquisition never stops.

#5 There is a desperate yet unrecognized need for people who understand the business function that drives the creation of electronic records, what technology is involved, and yet also can judge the historic value of such records.

For me, #5 is what it is all about.

Below is a diagram I made for my mother to help her understand how I had gotten from one place to another, and the reasons for my learning and doing particular things. It may be confusing without narration, but it gives an idea of my ongoing pathway. It starts with experience and “Imported [into this process] Resources” and moves to working in “Cultural Heritage Cyber Preservation.”

[Image: diagram of the author’s path to a professional vocation.]

___

Jim Havron grew up in a family of historians and lawyers, and spent time on a debate team, so he learned how easily evidence can be overlooked or misinterpreted. In a professional life that stretched from technology to business to first responder, he discovered that many professionals do not understand the evidentiary value of information created, used, stored, or needed by other professionals, let alone how best to rescue and preserve it. His education, professional training, and experience in archives and information systems security have given him opportunities to study this situation from different professional views and to apply his skills to archives and heritage issues that involve computer systems and security.

Call for Contributors: Archiving Digital Communications Series

Archives have long collected correspondence, but as communication has shifted to digital platforms, archivists must discover and develop new tools and methods.  From appraising one massive inbox to describing threaded messages, email has introduced many new challenges to the way we work with correspondence. Likewise, instant messaging, text messaging, collaborative online working environments, and other forms of digital communication have introduced new challenges and opportunities.

We want to hear how you and your institution are managing the acquisition, appraisal, processing, and preservation of, and access to, these complex digital collections. Although the main focus of most programs is email, we’re also interested in hearing how you manage other formats of digital communication.

We’re interested in real-life solutions by working archivists: case studies, workflows, and any kind of practical work with these collections that describes the challenges of the archival processes used to acquire, preserve, and make accessible email and other forms of digital communication.

A few potential topics and themes for posts:

  • Evaluating tools to acquire and process email
  • Case studies on archiving email and other forms of digital communication
  • Integrating practices for digital correspondence with physical correspondence
  • Addressing privacy and legal issues in email collections
  • Collaborating with IT departments and donors to collect email

Writing for bloggERS!

  • Posts should be between 200 and 600 words in length
  • Posts can take many forms: instructional guides, in-depth tool exploration, surveys, dialogues, point-counterpoint debates are all welcome!
  • Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA
  • Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Experimenting and Digressing to the Digital Archivist Role

By Walker Sampson

___

This is the third post in the bloggERS! series Digital Archives Pathways, where archivists discuss the non-traditional, accidental, idiosyncratic, or unique paths they took to become a digital archivist.

On the surface, my route to digital preservation work was by the book: I attended UT–Austin’s School of Information from 2008–2010 and graduated with a heavy emphasis on born-digital archiving. Nevertheless, I feel my path to this work has been at least partly non-traditional in that it was a) initially unplanned and b) largely informed by projects outside formal graduate coursework (though my professors were always accommodating in allowing me to tie in such work to their courses where it made sense).

I came to the UT–Austin program as a literature major from the University of Mississippi with an emphasis on Shakespeare and creative writing. I had no intention of pursuing digital archiving work and was instead gunning for coursework in manuscript and codex conservation. It took a few months, but I realized I did not relish this type of work. There’s little point in relating the details here, but I think it’s sufficient to say that at times one simply doesn’t enjoy what one thought one would enjoy.

So, a change of direction in graduate school—not unheard of, right? I began looking into other courses and projects. One that stood out was an IMLS-funded video game preservation project. I’ve always played video games, so why not? I was eventually able to serve as a graduate research assistant on this project while looking for other places at the school and around town that were doing this type of work.

One key find was a computer museum run out of the local Goodwill in Austin, which derived most of its collection from the electronics recycling stream processed at that facility. At that point, I already had an interest in and experience with old hardware and operating systems: I was fortunate enough to have a personal computer in the house growing up, and that machine had a certain mystique I wanted to explore. I read DOS for Dummies and Dan Gookin’s Guide to Underground DOS 6.0 cover to cover. I logged into bulletin board systems, scrounged for shareware games, and swapped my finds on floppy disk with a friend around the block. All of this is to say that I had a certain comfort level with computers before changing directions in graduate school.

I answered a call for volunteers at the computer museum and soon began working there. This gig would become the nexus for a lot of my learning and advancement with born digital materials. For example, the curator wanted a database for cataloging the vintage machines and equipment, so I learned enough PHP and MySQL to put together a relational database with a web frontend. (I expect there were far better solutions to the problem in retrospect, but I was eager to try and make things at the time. That same desire would play out far less well when I tried to make a version two of the database using the Fedora framework – an ill-conceived strategy from the start. C’est la vie.)

Other students and I would also use equipment from the Goodwill museum to read old floppies. At the time BitCurator had not hit 1.0, and it seemed more expedient to simply run dd and other Unix utilities from a machine to which we had attached a floppy drive pulled from the Goodwill recycling stream. I learned a great deal about imaging through this work alone. Many of the interview transcripts for the Presidential Election of 1988 at the Dolph Briscoe Center for American History were acquired in this way under the guidance of Dr. Patricia Galloway. Using vintage and ‘obsolescent’ machines from the Goodwill computer museum was not originally part of a plan to rescue archival material on legacy media, but Dr. Galloway recognized the value of such an exercise and formed the Digital Archeology Lab at UT–Austin. In this way, experimenting can open the door to new formal practices.
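
For readers curious about that imaging step, here is a hedged latter-day sketch (not our exact commands; the device path, file name, and options are assumptions) of driving dd from Python and recording a checksum for the resulting image:

```python
# Image a floppy with dd, then record an MD5 checksum for the image file.
# /dev/fd0 is a typical Linux device path for an attached floppy drive.
import hashlib
import subprocess

subprocess.run(
    ["dd", "if=/dev/fd0", "of=disk042.img", "bs=512", "conv=noerror,sync"],
    check=True,
)

with open("disk042.img", "rb") as f:
    print("md5:", hashlib.md5(f.read()).hexdigest())
```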

This experience, and several others like it, was as instrumental as coursework in establishing my career path. With that in mind, I’ll break out a couple of guiding principles that I gleaned from the process.

1: Experiment and Learn Independently

I say this coming out of one of the top ten graduate programs in the field, but the books I checked out to learn PHP and MySQL were not required reading, and the database project wasn’t for coursework at all. Learning to use a range of Unix utilities and to write scripts for batch-processing files also required self-directed learning outside of formal studies. Volunteer work is not strictly required, but the bulk of what I learned was in service of a non-profit where I had the space to learn and experiment.

Despite my early background playing with computers, I don’t feel that it ultimately matters. Provided you are interested in the field, start your own road of experimentation now and get over any initial discomfort with computers by diving in head first.

[Animated image: “This, over and over.”]

In other words, be comfortable failing. Learning in this way means failing a lot—but failing in a methodical way. Moreover, when it is just you and a computer, you can fail at a fantastic rate that would appall your friends and family—but no one will ever know! You may have to break some stuff and become almost excruciatingly frustrated at one point or another. Take a deep breath and come back to it the next day.

2: Make Stuff That Interests You

All that experimenting and independent learning can be lonely, so design projects and outputs that you are excited by and want to share with others, regardless of how well they turn out. It helps to check out books, play with code, and bump around on the command line in service of an actual project you want to complete. Learning without a destination you want to reach, however earnest, will inevitably run out of steam. While a given course may have its own direction and intention, and a given job may have its own directives and responsibilities, there is typically healthy latitude for a person to develop projects and directions that serve their interests and broadly align with the goals of the course or job.

Again, my own path has been fairly “traditional”: graduate studies in the field and volunteer gigs, along with some part-time work to build experience. Even within this traditional framework, however, experimenting, exploring projects outside the usual assignments, and independently embarking on learning whatever I thought I needed to learn have been huge benefits for me.

___

Walker Sampson is the Digital Archivist at the University of Colorado Boulder Libraries where he is responsible for the acquisition, accessioning and description of born digital objects, along with the continued preservation and stewardship of all digital materials in the Libraries.