By Valencia Johnson
This is the fourth post in the bloggERS series on Archiving Digital Communication.
This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Due to their longevity with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archive collected an archival Outlook .pst file of 15.8 GB! This was my first time working with emails. This was also the first time some of the tools discussed below were used in the workflow at my institution. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions on the tools I used in this acquisition.
My original workflow:
- Convert the .pst file into an .mbox file.
- Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
- Follow and complete our accessioning procedures.
Things were moving smoothly; I was able to use Emailchemy, a tool that converts email from closed, proprietary file formats, such as .pst files used by Outlook, to standard, portable formats that any application can use, such as .mbox files, which can be read using Thunderbird, Mozilla’s open source email client. I used a Windows laptop that had Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy, the instructions in the owner’s manual were clear, and the process was easy. Next, I uploaded the Email folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I used to accession is a FRED, a powerful forensic recovery tool used by law enforcement and some archivists. Our FRED runs BitCurator, which is a Linux environment. This is an important fact to remember because .pst files will not open on a Linux machine.
At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.
The program was unable to complete 4 threads within the Email folder and timed out. The picture above is part of the explanation message I received. In my understanding and research, aka Google because I did not understand the message, the program was unable to completely execute the task with the amount of processing power available. So the message is essentially saying “I don’t know why this is taking so long. It’s you not me. You need a better computer.” From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran the scan solely on the Email folder and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.
I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged, e.g. (/home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Email folder looked like this:
Even though Bulk Extractor Viewer does display the content, it displays it like a text editor, e.g. Notepad, with all the coding alongside the content of the message, not as an email, because all the results were from the .mbox file. This is just the format .mbox generates without an email client. This coding can be difficult to interpret without an email client to translate the material into a human readable format. This output makes it hard to locate an individual message within a .pst because it is hard but not impossible to find the date or title of the email amongst the coding. But this was my first time encountering results like this and it freaked me out a bit.
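For readers who want to see how the raw .mbox text maps onto individual messages without opening an email client, here is a minimal sketch using Python's standard-library `mailbox` module. It is an illustration only, not part of the workflow described in this post; the function name and the file path in the usage note are hypothetical.

```python
import mailbox


def list_message_headers(mbox_path):
    """Return a list of (key, date, subject) tuples, one per message in the mbox."""
    # create=False avoids silently creating an empty file if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        rows = []
        for key, message in box.iteritems():
            # Headers may be missing, so fall back to an empty string.
            date = str(message.get("Date", ""))
            subject = str(message.get("Subject", ""))
            rows.append((key, date, subject))
        return rows
    finally:
        box.close()
```

Calling `list_message_headers("Email/archive.mbox")` (a made-up path) would return one row per message, which makes it easier to match a flagged string back to a date and subject line. On a file as large as the one in this case study, the initial indexing pass would likely take a while.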
Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of SSNs or credit card numbers. So in lieu of Social Security numbers, I found FedEx tracking numbers or mistyped phone numbers, though to be fair a mistyped number could be someone’s SSN. For credit card numbers, the program picked up email coding and non-financial number patterns.
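To show why pattern matching alone produces false positives, here is a small sketch in Python. The two patterns are deliberately simplified and are assumptions made for illustration; they are not Bulk Extractor's actual rules. The Luhn checksum shown is a widely used way to filter out digit strings that cannot be valid card numbers.

```python
import re

# Simplified, illustrative patterns (not Bulk Extractor's own definitions).
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_LIKE = re.compile(r"\b\d{16}\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        # Double every second digit, counting from the right.
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_candidates(text):
    """Return (ssn_like_hits, card_like_hits) found in the text."""
    ssn_hits = SSN_LIKE.findall(text)
    # Keep only 16-digit runs that also pass the checksum.
    card_hits = [hit for hit in CARD_LIKE.findall(text) if luhn_valid(hit)]
    return ssn_hits, card_hits
```

A phone number mistyped as `609-55-1234` has the same shape as an SSN, so the first pattern flags it even though nothing in the text says what the number is. The checksum step removes some arbitrary 16-digit strings from the card results, but a human still has to verify whatever remains.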
The scan found an SSN I had to remove from the .pst and the .mbox. Remember .pst files only work with Microsoft Outlook. At this point in processing, I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought maybe I could use a keyword search within the .mbox to locate and remove the flagged material because you can open .mbox files using a text editor. Remember when I said the .pst was over 15 GB? Well, the .mbox was just as large, and this caused the text editor to stall and eventually give up opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a large text file editor. This whole process took a couple of days and in the end the results from Bulk Extractor’s search indicated the email files contained one SSN and no credit card numbers.
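An alternative to opening a multi-gigabyte file in an editor is to read it one line at a time, so that only a single line is held in memory. The sketch below is a hypothetical illustration of that idea in Python, not the method used in this case study; the function name and parameters are made up for the example.

```python
def find_term_in_large_file(path, term, encoding="utf-8"):
    """Return a list of (line_number, line_text) for every line containing term."""
    needle = term.encode(encoding)
    hits = []
    # Binary mode avoids decoding errors on mixed or unknown encodings.
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            if needle in raw_line:
                text = raw_line.decode(encoding, errors="replace").rstrip("\r\n")
                hits.append((line_number, text))
    return hits
```

The line numbers returned can then be used to jump to the right spot in a large-file editor. One caveat: a plain-text search like this only sees text stored as-is in the file, so a term inside an encoded attachment (for example, base64) would not be found.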
When I discussed my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor that was created with .pst files in mind, to fulfill our due diligence in looking for sensitive information, since the Bulk Extractor scan timed out before finishing. FileLocator Pro operates on Windows, so unfortunately we couldn’t do the scan on the FRED. Even so, it was able to catch real SSNs hidden in attachments that did not appear in the Bulk Extractor results.
As with Bulk Extractor, I was able to view the email with the flagged content highlighted within FileLocator Pro. There is also the option to open the attachments or emails in their respective programs, so a .pdf file opened in Adobe and the email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It didn’t perform as well searching for credit card numbers; I had some error messages stating that some attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of the primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.
As email continues to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process even larger files, regardless of how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails or attachments, or even to complete a scan. There’s no such thing as a perfect program; I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve and with forums like this blog we can learn from one another.
Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.