There’s a First Time for Everything

By Valencia Johnson

This is the fourth post in the bloggERS series on Archiving Digital Communication.


This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Because of their long tenure with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archives collected an Outlook archive (.pst) file of 15.8 GB! This was my first time working with emails. It was also the first time some of the tools discussed below were used in my institution's workflow. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions of the tools I used in this acquisition.

My original workflow:

  1. Convert the .pst file into an .mbox file.
  2. Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
  3. Follow and complete our accessioning procedures.

Things were moving smoothly. I was able to use Emailchemy, a tool that converts email from closed, proprietary formats, such as the .pst files used by Outlook, into standard, portable formats that any application can use, such as .mbox files, which can be read with Thunderbird, Mozilla's open source email client. I used a Windows laptop that had Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy; the instructions in the owner's manual were clear, and the process was easy. Next, I copied the Emails folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I used for accessioning is a FRED, a powerful forensic recovery workstation used by law enforcement and some archivists. Our FRED runs BitCurator, which is a Linux environment. This is an important fact to remember because .pst files will not open on a Linux machine.
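For readers who want to sanity-check a conversion like this without an email client, here is a minimal sketch using Python's standard-library mailbox module. The file path is hypothetical, and this is only an illustration; Emailchemy itself handled the actual conversion described above.

```python
import mailbox

# Hypothetical path to the .mbox produced by the conversion
MBOX_PATH = "Emails/creator_archive.mbox"

def summarize_mbox(path):
    """Print a quick summary of an mbox file: message count plus a few
    dates and subjects, to confirm the conversion looks sane."""
    box = mailbox.mbox(path)
    count = 0
    for message in box:
        count += 1
        if count <= 5:  # show only the first few headers
            print(message.get("Date", "(no date)"), "|",
                  message.get("Subject", "(no subject)"))
    print(f"Total messages: {count}")

if __name__ == "__main__":
    summarize_mbox(MBOX_PATH)
```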

At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.

Yeah Bulk Extractor I’ll just pick up more cores during lunch.

The program was unable to complete 4 threads within the Emails folder and timed out. The picture above is part of the explanation message I received. In my understanding and research (a.k.a. Google, because I did not understand the message), the program was unable to completely execute the task with the amount of processing power available. So the message is essentially saying, "I don't know why this is taking so long. It's you, not me. You need a better computer." From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran it solely on the Emails folder, and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.

I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged, e.g. (/home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Emails folder looked like this:

(Some text has been blacked out for privacy.)

Even though Bulk Extractor Viewer does display the content, it displays it the way a text editor (e.g., Notepad) would: all of the markup appears alongside the content of the message, rather than as a rendered email, because all the results came from the .mbox file. This is simply the format .mbox takes without an email client, and the markup can be difficult to interpret without a client to translate the material into a human-readable format. This output makes it hard to locate an individual message within the .pst because it is difficult, though not impossible, to pick out the date or subject of the email amongst the markup. This was my first time encountering results like this, and it freaked me out a bit.

Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of SSNs or credit card numbers. So in lieu of social security numbers, I found FedEx tracking numbers or mistyped phone numbers, though to be fair, a mistyped phone number may well be someone's SSN. For credit card numbers, the program picked up email markup and non-financial number patterns.
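To illustrate why such false positives occur, and one way some of them can be filtered out, here is a minimal sketch; it is not Bulk Extractor's actual logic. A naive SSN-style pattern matches many nine-digit strings, while candidate credit card numbers can at least be screened with the Luhn checksum.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # 123-45-6789 style
CARD_PATTERN = re.compile(r"(?:\d[ -]?){13,16}\b")    # loose 13-16 digit runs

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum,
    which weeds out many random digit runs that are not card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

sample = "Tracking 123-45-6789 shipped; card on file 4111 1111 1111 1111."
print(SSN_PATTERN.findall(sample))   # matches an SSN-shaped string (real or not)
print([c for c in CARD_PATTERN.findall(sample) if luhn_valid(c)])
```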

The scan found an SSN that I had to remove from both the .pst and the .mbox. Remember, .pst files only work with Microsoft Outlook. At this point in processing, I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought maybe I could run a keyword search within the .mbox to locate and remove the flagged material, because you can open .mbox files with a text editor. Remember when I said the .pst was over 15 GB? Well, the .mbox was just as large, and this caused the text editor to stall and eventually give up opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a large-file text editor. This whole process took a couple of days, and in the end the results from Bulk Extractor's search indicated the email files contained one SSN and no credit card numbers.
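For very large .mbox files, an alternative to opening the whole file in an editor is to stream it and search message by message. This is a minimal sketch of that approach, not part of the workflow described above; the path and flagged string are hypothetical.

```python
import mailbox
import re

MBOX_PATH = "Emails/creator_archive.mbox"   # hypothetical path
FLAGGED = re.compile(r"123-45-6789")        # hypothetical flagged string

def find_flagged_messages(path, pattern):
    """Yield (index, date, subject) for messages whose text matches the pattern,
    reading messages one at a time rather than the whole file at once."""
    box = mailbox.mbox(path)
    for i, msg in enumerate(box):
        if pattern.search(msg.as_string()):
            yield i, msg.get("Date", "(no date)"), msg.get("Subject", "(no subject)")

for idx, date, subject in find_flagged_messages(MBOX_PATH, FLAGGED):
    print(f"Message {idx}: {date} | {subject}")
```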

While discussing my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor that was created with .pst files in mind, to fulfill our due diligence to look for sensitive information, since the Bulk Extractor scan had timed out before finishing. FileLocator Pro runs on Windows, so, unfortunately, we could not do the scan on the FRED; however, it was able to catch real SSNs hidden in attachments that did not appear in the Bulk Extractor results.

As with Bulk Extractor, I was able to view each email in FileLocator Pro with the flagged content highlighted. There is also the option to open attachments or emails in their respective programs, so a .pdf file opened in Adobe and the email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It did not perform as well searching for credit card numbers; I received some error messages stating that certain attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of the primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.

As email continues to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process even larger files, regardless of how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails or attachments, or even to complete a scan. There is no such thing as a perfect program; I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve, and with forums like this blog we can learn from one another.


Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.


Adventures in Email Wrangling: TAMU-CC’s ePADD Story

By Alston Cobourn

This is the first post in the bloggERS series on Archiving Digital Communication.

Getting Started

Soon after I arrived at Texas A&M University-Corpus Christi in January 2017 as the university’s first Processing and Digital Assets Archivist, two high-level longtime employees retired or switched positions. Therefore, I fast-tracked an effort to begin collecting selected email records because these employees undoubtedly had some correspondence of long-term significance, which was also governed by the Texas A&M System’s records retention schedules.

I began by testing ePADD, software used to conduct various archival processes on email, on small date ranges of my work email account.  I ultimately decided to begin using it on selected campus email because I found it relatively easy to use, it includes some helpful appraisal tools, and it provides an interface for patrons to view and select records of which they want a copy. Since the emails themselves live as MBOX files in folders outside of the software, and are viewable with a text editor, I felt comfortable that using ePADD meant not risking the loss of important records. I installed ePADD on my laptop with the thought that traveling to the employees would make the process of transferring their email easier and encourage cooperation.

Transferring the email

In June 2017, I used ePADD Version 3.1 to collect the email of the two employees.  My department head shared general information and arranged an appointment with the employees’ current administrative assistant or interim replacement as applicable. She also made a request to campus IT that they keep the account of the retired employee open.  IT granted the interim replacement access to the account.

I then traveled to the employees’ offices where they entered the appropriate credentials for the university email account into ePADD, identified which folders were most likely to contain records of long-term historical value, and verified the date range I needed to capture.  Then we waited.

In one instance, I had to leave my laptop running in the person’s office overnight because I needed to maintain a consistent internet connection during ePADD’s approximately eight hours of harvesting and the office was off-campus.  I had not remembered to bring a power cord, but thankfully my laptop was fully charged.

Successes

Our main success—we were actually able to collect some records!  Obvious, yes, but I state it because it was the first time TAMU-CC has ever collected this record format and the email of the departed employee was almost deactivated before we sent our preservation request to IT. Second, my department head and I have started conversations with important players on campus about the ethical and legal reasons why the archives needs to review email before disposal.

Challenges

In both cases, the employee had deleted a significant number of emails before we were able to capture their account and had used their work account for personal email.  These behaviors confirmed what we all already knew–employees are largely unaware that their email is an official record. Therefore, we plan to increase efforts to educate faculty and staff about this fact, their responsibilities, and best practices for organizing their email.  The external conversations we have had so far are an important start.

ePADD enabled me to combat the personal email complication by systematically deleting all emails from specific individual senders in batch. I took this approach for family members, listservs, and notifications from various personal accounts.
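For context on what this kind of batch removal involves under the hood, here is a minimal sketch in Python that drops messages from specified senders in an mbox file. This is only an illustration of the general operation, not ePADD's implementation (ePADD handles it through its appraisal interface), and the paths and addresses are hypothetical.

```python
import mailbox

SOURCE = "harvest/account.mbox"           # hypothetical input mbox
FILTERED = "harvest/account_weeded.mbox"  # hypothetical output mbox
EXCLUDE_SENDERS = {"relative@example.com", "newsletter@example.org"}

def weed_by_sender(source, destination, excluded):
    """Copy messages to a new mbox, skipping any whose From header
    contains one of the excluded addresses."""
    src = mailbox.mbox(source)
    dst = mailbox.mbox(destination)
    dst.lock()
    try:
        for msg in src:
            sender = (msg.get("From") or "").lower()
            if any(addr in sender for addr in excluded):
                continue  # skip personal email
            dst.add(msg)
        dst.flush()
    finally:
        dst.unlock()

weed_by_sender(SOURCE, FILTERED, EXCLUDE_SENDERS)
```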

The feature that recognizes sensitive information worked well in identifying messages that contained social security numbers. However, it did not flag messages that contained phone numbers, which we also consider sensitive personal information. Additionally, in-message redaction is not possible in 3.1.

For messages I have marked as restricted, I have chosen to add an annotation as well that specifies the reason for the restriction. This will enable me to manage those emails at a more granular level. This approach was a modification of a suggestion by fellow archivists at Duke University.

Conclusion

Currently, the email is living on a networked drive while we establish an Amazon S3 account and an Archivematica instance. We plan to provide access to email in our reading room via the ePADD delivery module and publicize this access via finding aids. Overall ePADD is a positive step forward for TAMU-CC.

Note from the Author:

Since writing this post, I have learned that it is possible in ePADD to use regular expressions to further aid in identifying potentially sensitive materials.  By default the program uses regular expressions to find social security numbers, but it can be configured to find other personal information such as credit card numbers and phone numbers.  Further guidance is provided in the Reviewing Regular Expressions section of the ePADD User Guide.
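As a rough illustration of the kind of pattern such a configuration might use, here is a US-style phone number regex shown in Python. This is only an assumption-level example; consult the Reviewing Regular Expressions section of the ePADD User Guide for the syntax the software actually expects.

```python
import re

# Hypothetical US-style phone number pattern, for illustration only.
PHONE = re.compile(r"(?:\+1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}\b")

print(bool(PHONE.search("Call me at (361) 825-5555 tomorrow.")))  # True
print(bool(PHONE.search("Invoice total: 1234567")))               # False
```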

 

Alston Cobourn is the Processing and Digital Assets Archivist at Texas A&M University-Corpus Christi where she leads the library’s digital preservation efforts. Previously she was the Digital Scholarship Librarian at Washington and Lee University. She holds a BA and MLS with an Archives and Records Management concentration from UNC-Chapel Hill.

Managing Our Web-Based Content at the University of Minnesota

By Valerie Collins

____

This is the fourth post in the bloggERS series #digitalarchivesfail: A Celebration of Failure.

The University of Minnesota Archives manages the web archiving program for the Twin Cities campus. We use Archive-It to capture the bulk of our online content, but as we have discovered, managing subsets of our web content and bringing it into our collections has its unique challenges and requires creative approaches. We increasingly face requests to provide a permanent, accessible home for files that would otherwise be difficult to locate in a large archived website. Some content, like newsletters, is created in HTML and is not well-suited for  upload into the institutional repository (IR) we use to handle most of our digital content. Our success in managing web content that is created for the web (as opposed to uploaded and linked PDF files, for example) has been mixed.

In 2016, a department informed us that one of their web domains was going to be cleared of its current content and redirected. Since that website contained six years of University Relations press releases, available solely in HTML format, we were pretty keen on retrieving that content before it disappeared from the live web.

The department also wanted these releases saved, so they downloaded the contents of the website for us, converted each release into a PDF, and emailed them to us before that content was removed. Although we did have crawls of the press releases through Archive-It, we intended to use our institutional repository, the University Digital Conservancy (UDC), to preserve and provide access to the PDF files derived from the website.

So, when faced with the 2,920 files included in the transfer, labeled in no particularly helpful way, in non-chronological order, and with extraneous files included, I rolled up my sleeves and got to work. With the application of some helpful programs and a little more spreadsheet data entry than I would like to admit to, I ended up with some 2,000 articles renamed in chronological order. I grouped and combined the files by year, which was in keeping with the way we have previously provided access to press releases available in the UDC.
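For anyone facing a similar pile of arbitrarily named files, here is a minimal sketch of one way to script the renaming step once the dates have been worked out in a spreadsheet. The CSV layout and paths are hypothetical; the post does not specify which programs were actually used.

```python
import csv
import os

MAPPING_CSV = "press_release_dates.csv"  # hypothetical: columns original_name, release_date (YYYY-MM-DD)
SOURCE_DIR = "press_releases_pdf"        # hypothetical folder of converted PDFs

with open(MAPPING_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        old_path = os.path.join(SOURCE_DIR, row["original_name"])
        # Prefix the date so an alphabetical sort is also chronological
        new_name = f"{row['release_date']}_{row['original_name']}"
        new_path = os.path.join(SOURCE_DIR, new_name)
        if os.path.exists(old_path) and not os.path.exists(new_path):
            os.rename(old_path, new_path)
```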

All that was left was to OCR and upload, right?

And everything screeched to a halt. Because of the way the files had been downloaded and converted, every page of every file contained renderable text from the original stylesheet hidden within an additional layer that prevented OCR’ing with our available tools, and we were unable to invest more time to find an acceptable solution.

Attempted extracted text
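One quick diagnostic in a situation like this is to check whether a PDF already reports an extractable text layer before deciding whether OCR is needed or even possible. The sketch below uses the pypdf library and a hypothetical folder path; this is an assumption-level example, not the tooling we actually had available.

```python
from pathlib import Path

from pypdf import PdfReader  # assumes pypdf is installed: pip install pypdf

PDF_DIR = Path("press_releases_pdf")  # hypothetical folder

for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
    reader = PdfReader(pdf_path)
    # Sample the first page; a non-empty result means a text layer is present,
    # though it may be stylesheet junk rather than the article text.
    text = (reader.pages[0].extract_text() or "").strip()
    status = "has text layer" if text else "image only / needs OCR"
    print(f"{pdf_path.name}: {status}")
```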

 

Thus, these news releases now sit in the UDC, six 1,000-page documents that cannot be full-text searched but are, mercifully, in chronological order. The irony of having our born-digital materials return to the same limitations that plagued our analogue press releases, prior to the adoption of the UDC, has not been lost on us.

But this failure shines a light on the sometimes murky boundaries between archiving the web and managing web content in our archive. I have a website sitting on my desk, burned to a CD. The site is gone from the live web, and Archive-It never crawled it. We have a complete download of an intranet site sitting on our network drive–again, Archive-It never crawled that site. We handle increasing amounts of web content that never made it into Archive-It. But, using our IR to handle these documents is imperfect, too, and can require significant hands-on work when the content has to be stripped out of its original context (the website), and manipulated to meet the preservation requirements of that IR (file format, OCR).

Cross-pollination between our IR and our web archive is inevitable when they are both capturing the born-digital content of the University of Minnesota. Assisting departments with archiving their websites and web-based materials usually involves using a combination of the two, but raises questions of scalability. But even in our failure to bring those press releases all the way to the finish line, we were able to get pretty close using the tools we had available to us and were able to make the files available, and frankly, that’s an almost-success I can live with.

And, while we were running around with those press releases, another department posted a web-based multimedia annual report only to ask later whether it could be uploaded to the IR, with their previous annual reports. Onward!

____

Valerie Collins is a Digital Repositories & Records Archivist at the University of Minnesota Archives.

Preservation and Access Can Coexist: Implementing Archivematica with Collaborative Working Groups

By Bethany Scott

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized and born-digital cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box (now Hyku!), Archivematica, and ArchivesSpace. As a part of that broader initiative, the Digital Preservation Task Force began implementing Archivematica in 2016 for preservation processing and storage.

At the same time, the DAMS Implementation Task Force was also starting to create data models, use cases, and workflows with the goal of ultimately providing access to digital collections via a new online repository to replace CONTENTdm. We decided that this would be a great opportunity to create an end-to-end digital access and preservation workflow for digitized collections, in which digital production tasks could be partially or fully automated and workflow tools could integrate directly with repository/management systems like Archivematica, ArchivesSpace, and the new DAMS. To carry out this work, we created a cross-departmental working group consisting of members from Metadata & Digitization Services, Web Services, and Special Collections.


Exploring Digital Preservation, Digital Curation, and Digital Collections in Mexico

By Natalie Baur

This is the fourth post in our series on international perspectives on digital preservation.

___

During the 2015-2016 academic year, I received a Fulbright García-Robles fellowship to pursue research relating to the state of digital preservation initiatives and digital information access in Mexico. The Instituto de Investigaciones Bibliotecológicas y de la Información at the Universidad Nacional Autónoma de México in Mexico City graciously hosted me as a visiting researcher, and I worked with leading Mexican digital preservation expert Dr. Juan Voutssás.

In Mexico, I was able to conduct interviews with nearly thirty organizations working on building, managing, sharing and preserving their digital collections. The organizations I visited were diverse in several respects: geographic location (including institutions outside of heavily centralized Mexico City), size, mission, and sector. They included:

  • Cultural Heritage organizations (galleries, libraries, archives, museums)
  • Government institutions
  • Business/For-profit organizations
  • College and University archives and libraries

Because of the diversity of the institutions I visited, the results and conclusions I drew were also varied, and I noticed distinct trends within each category of institution. For brevity, I have abbreviated my findings in the following bullet points. These are not meant to be definitive or exhaustive, as I am still compiling, codifying and quantifying interview data.

  • The focus on digital collection building and preservation in business and government tends toward records management approaches. Retention schedules are dictated by the federal government and administered and enforced by the National Archives. All federal and state government entities are obligated to follow these guidelines for retention and transfer of records and archives. While the guidelines and processes for paper records are robust, many institutions are only beginning to implement and use electronic records management platforms. Long-term digital preservation of records designated for permanent deposit is an ongoing challenge.
  • In cultural heritage institutions and college and university archives, digital collection work is focused on building digitization and digital collection management programs. The primary focus of the majority of institutions is still on digitization, storage and diffusion of digitized assets, and wrangling issues related to long-term, sustainable maintenance of digital collections platforms and backups on precarious physical media formats like optical disks and (non-redundant) hard drives.
  • While digital preservation issues are still in the nascent stages of being worked through and solved everywhere around the globe, in some areas strong national and regional groups have been formed to help share strategies, create standards and think through local solutions. In Mexico and Latin America, this has mostly been done through participation in the InterPARES project, but a national Mexican digital preservation consortium, similar to the National Digital Stewardship Alliance (NDSA) in the United States, is still yet to be established in Mexico. In the meantime, several Mexican academic and government institutions have taken the lead on digital preservation issues, and through those initiatives, a more cohesive, intentional organization similar to the NDSA may be able to take root in the near future.

My opportunity to live and do research in Mexico was life-changing. It is now more crucial than ever for librarians, archivists, developers, administrators, and program leaders to look outside of the United States for collaborations and opportunities to learn with and from colleagues abroad. The work we have at hand is critical, and we need to share all the resources we have, especially those resources money cannot buy: a different perspective, diversity of language, and the shared desire to make the whole world, not just our little corner of it, a better place for all.  


Natalie Baur is currently the Preservation Librarian at El Colegio de México in Mexico City, an institution of higher learning specializing in the humanities and social sciences. Previously, she served as the Archivist for the Cuban Heritage Collection at the University of Miami Libraries and was a 2015-2016 Fulbright-García Robles fellowship recipient, looking at digital preservation issues in Mexican libraries, archives and museums. She holds an M.A. in History and a certificate in Museum Studies from the University of Delaware and an M.L.S. with a concentration in Archives, Information and Records Management from the University of Maryland. She is also co-founder of the Desmantelando Fronteras webinar series and the Itinerant Archivists project. You can read more about her Fulbright-García Robles fellowship here.

Consortial Certification Processes: the Goportis Digital Archive—a Case Study

By Franziska Schwab, Yvonne Tunnat, and Dr. Thomas Gerdes

This is the second post in our series on international perspectives on digital preservation.

___

The Goportis Consortium consists of the three German National Subject Libraries: the TIB Hannover, ZB MED Cologne/Bonn and the ZBW Kiel/Hamburg.

One key area of collaboration is digital preservation. We have jointly used the Goportis Digital Archive, based on Ex Libris’s Rosetta, since 2010. The certification of our digital archive is part of our quality management, since all workflows are evaluated in the process. Beyond that, a certification seal signals to external parties, such as stakeholders and customers, that the long-term availability of the data is ensured and the digital archive is trustworthy.

So far TIB and ZBW have successfully completed the certification processes for the Data Seal of Approval (DSA) and are currently working on the application for the nestor Seal. Here are some key facts about the seals:

Key facts about the two seals:

  • Data Seal of Approval: awarded since 2010; 16 guidelines; focus on Ingest, Preservation, and Access; 64 certified institutions (as of 01/2017)
  • nestor Seal: awarded since 2014; 34 criteria; focus on Ingest, Preservation, Access, and Organization & Sustainability Aspects; 2 certified institutions (as of 01/2017)

Distribution of Tasks

In general, we are equal partners. For digital preservation, though, TIB is the consortium leader, since it is the software licensee and hosts the computing center.

Due to the terms of the DSA—as well as those of the nestor Seal—a consortium cannot be certified as a whole, but only each partner individually. For that reason each partner drew up its own application. However, for some aspects of the certification ZBW had to refer to the answers of TIB, which functions as its service provider.

Besides these external requirements, we organized the distribution of tasks on the basis of internal goals as well. We interpreted the certification process as an opportunity to gain deeper insight into the workflows, policies and dependencies of our partner institutions. That is why we analyzed the DSA guidelines together. Moreover, we discussed the progress of the application process regularly in telephone conferences and matched our answers to each guideline. As a positive side effect, this way of proceeding not only strengthened our teamwork, it also led to a better understanding of the guidelines and more elaborate answers for the DSA application.

The documentation for the DSA was created in more detail than recommended in order to facilitate further use of the documents for the nestor Seal.

Time Frame

The certification process for the DSA extended over six months (12/2014–08/2015).  In each institution one employee was in charge of the certification process. Other staff members added special information about their respective areas of work. This included technical development, data specialists, legal professionals, team leaders, and system administration (TIB only). The costs of applying for the seal can be measured in person months:

  • TIB: person responsible ~3 person months; other staff ~0.25; total ~3.25
  • ZBW: person responsible ~1.5 person months; other staff ~0.1; total ~1.6

Outlook: nestor Seal

The nestor Seal represents the second level of the European Framework for Audit and Certification of Digital Repositories. With its 34 criteria, it is more complex than the DSA. It also requires more detailed information, which makes it necessary to involve more staff from different departments. The time required is not yet foreseeable.

Map with relationships between the nestor criteria

 

Based on our positive experiences with the DSA certification, we plan to acquire the nestor Seal following the same procedures. The DSA application has prepared the ground for this task, since important documents, such as policies, have already been drafted.

___

Franziska Schwab has worked as a Preservation Analyst in the Digital Preservation team at the German National Library of Science and Technology (TIB) since 2014. She’s responsible for Pre-Ingest data analysis, Ingest, process documentation, policies, and certification.

Yvonne Tunnat has been the Digital Preservation Manager for the Leibniz Information Centre for Economics in Kiel/Hamburg (ZBW) since 2011. Her key working areas are format identification, validation, and preservation planning.

Dr. Thomas Gerdes has been part of the Digital Preservation team of the Leibniz Information Centre for Economics in Kiel/Hamburg (ZBW) since 2015. His interests are in the field of certification methods.

Processing Digital Research Data

By Elise Dunham

This is the sixth post in our Spring 2016 series on processing digital materials.

———

The University of Illinois at Urbana-Champaign’s (Illinois) library-based Research Data Service (RDS) will be launching an institutional data repository, the Illinois Data Bank (IDB), in May 2016. The IDB will provide University of Illinois researchers with a repository for research data that will facilitate data sharing and ensure reliable stewardship of published data. The IDB is a web application that transfers deposited datasets into Medusa, the University Library’s digital preservation service for the long-term retention and accessibility of its digital collections. Content is ingested into Medusa via the IDB’s unmediated self-deposit process.

As we conceived of and developed our dataset curation workflow for digital datasets ingested into the IDB, we turned to archivists in the University Archives to gain an understanding of their approach to processing digital materials. [Note: I am not specifying whether data deposited in the IDB is “born digital” or “digitized” because, from an implementation perspective, both types of material can be deposited via the self-deposit system in the IDB. We are not currently offering research data digitization services in the RDS.] There were a few reasons for consulting with the archivists: 1) archivists have deep, real-world curation expertise, and we anticipate that many of the challenges we face with data will have solutions whose foundations were developed by archivists; 2) if, through discussing processes, we found areas where the RDS and Archives have converging preservation or curation needs, we could communicate these to the Preservation Services Unit, which develops and manages Medusa; and 3) I’m an archivist by training and I jump on any opportunity to talk with archivists about archives!

Even though the RDS and the University Archives share a central goal–to preserve and make accessible the digital objects that we steward–we learned that there are some operational and policy differences between our approaches to digital stewardship that necessitate points of variance in our processing/curation workflow:

Appraisal and Selection

In my view, appraisal and selection are fundamental to the archives practice. The archives field has developed a rich theoretical foundation when it comes to appraisal and selection, and without these functions the archives endeavor would be wholly unsustainable. Appraisal and selection ideally tend to occur in the very early stages of the archival processing workflow. The IDB curation workflow will differ significantly–by and large, appraisal and selection procedures will not take place until at least five years after a dataset is published in the IDB–making our appraisal process more akin to that of an archives that chooses to appraise records after accessioning or even during the processing of materials for long-term storage. Our different approaches to appraisal and selection speak to the different functions the RDS and the University Archives fulfill within the Library and the University.

The University Archives is mandated to preserve University records in perpetuity by the General Rules of the University and the Illinois State Records Act. The RDS’s initiating goal, in contrast, is to provide a mechanism for Illinois researchers to comply with funder and/or journal requirements to make the results of research publicly available. Here, there is no mandate for the IDB to accept only data deemed to have “enduring value”; in fact, the research data curation field is so new that we do not yet have a community-endorsed sense of what “enduring value” means for research data. Standards regarding the enduring value of research data may evolve over the long term in response to discipline-specific circumstances.

To support researchers’ needs and/or desires to share their data in a simple and straightforward way, the IDB ingest process is largely unmediated. Depositing privileges are open to all campus affiliates who have the appropriate University log-in credentials (e.g., faculty, graduate students, and staff), and deposited files are ingested into Medusa immediately upon deposit. RDS curators will do a cursory check of deposits, as doing so remains scalable (see workflow chart below), and the IDB reserves the right to suppress access to deposits for a “compelling reason” (e.g., failure to meet criteria for depositing as outlined in the IDB Accession Policy, violations of publisher policy, etc.). Aside from cases that we assume will be rare, the files as deposited into the IDB, unappraised, are the files that are preserved and made accessible in the IDB.

Preservation Commitment

A striking policy difference between the RDS and the University Archives is that the RDS makes a commitment to preserving and facilitating access to datasets for a minimum of five years after the date of publication in the Illinois Data Bank.

The University Archives, of course, makes a long-term commitment to preserving and making accessible records of the University. I have to say, when I learned that the five-year minimum commitment was the plan for the IDB, I was shocked and a bit dismayed! But after reflecting on the fact that files deposited in the IDB undergo no formal appraisal process at ingest, the concept began to feel more comfortable and reasonable. At a time when terabytes of data are created, oftentimes for single projects, and budgets are a universal concern, there are logistical storage issues to contend with. Now, I fully believe that for us to ensure that we are able to 1) meet current, short-term data sharing needs on our campus and 2) fulfill our commitment to stewarding research data in an effective and scalable manner over time, we have to make a circumspect minimum commitment and establish policies and procedures that enable us to assess the long-term viability of a dataset deposited into the IDB after five years.

The RDS has collaborated with archives and preservation experts at Illinois and, basing our work in archival appraisal theory, has developed guidelines and processes for reviewing published datasets after their five-year commitment ends to determine whether to retain, deaccession, or dedicate more stewardship resources to them. Enacting a systematic approach to appraising the long-term value of research data will enable us to allot resources to datasets in a way that is proportional to their value to research communities and their preservation viability.

Convergences

To show that we’re not all that different after all, I’ll briefly mention a few areas where the University Archives and the RDS are taking similar approaches or facing similar challenges:

  • We are both taking an MPLP-style approach to file conversion. In order to get preservation control of digital content, at minimum, checksums are established for all accessioned files (see the sketch after this list). As a general rule, if the file can be opened using modern technology, file conversion will not be pursued as an immediate preservation action. Establishing strategies and policies for managing a variety of file formats at scale is an area that will be evolving at Illinois through collaboration of the University Archives, the RDS, and the Preservation Services Unit.
  • Accruals present metadata challenges. How do we establish clear accrual relationships in our metadata when a dataset or a records series is updated annually? Are there ways to automate processes to support management of accruals?
  • Both units do as much as they can to get contextual information about the material being accessioned from the creator, and metadata is enhanced as possible throughout curation/processing.
  • The University Archives and the RDS control materials in aggregation, with the University Archives managing at the archival collection level and the RDS managing digital objects at the dataset level.
  • More? Certainly! For both the research data curation community and the archives community, continually adopting pragmatic strategies to manage the information created by humans (and machines!) is paramount, and we will continue to learn from one another.
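As a concrete illustration of the minimum preservation control mentioned in the first bullet, here is a short sketch that establishes fixity for accessioned files by recording a checksum for each one. The directory path and the choice of SHA-256 are assumptions for the example; neither Medusa's nor the Archives' actual tooling is described in this post.

```python
import csv
import hashlib
from pathlib import Path

ACCESSION_DIR = Path("accession_2016_042")              # hypothetical accession folder
MANIFEST = Path("accession_2016_042_manifest.csv")      # hypothetical output manifest

def sha256sum(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum without reading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with open(MANIFEST, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["relative_path", "sha256"])
    for file_path in sorted(p for p in ACCESSION_DIR.rglob("*") if p.is_file()):
        writer.writerow([file_path.relative_to(ACCESSION_DIR), sha256sum(file_path)])
```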

Research Data Alliance Interest Group

If you’re interested in further exploring the areas where the principles and practices in archives and research data curation overlap and where they diverge, join the Research Data Alliance (RDA) Archives and Records Professionals for Research Data Interest Group. You’ll need to register with the RDA (which is free!) and subscribe to the group. If you have any questions, feel free to get in touch!

IDB Curation Workflow

The following represents our planned functional workflow for handling dataset deposits in the Illinois Data Bank:

Workflow graphic created by Elizabeth Wickes.

Learn More

To learn more about the IDB policies and procedures discussed in this post, keep an eye on the Illinois Data Bank website after it launches next month. Of particular interest on the Policies page will be the Accession Policy and the Preservation Review, Retention, Deaccession, Revision, and Withdrawal Procedure document.

Acknowledgements

Bethany Anderson and Chris Prom of the University of Illinois Archives

The rest of the Research Data Preservation Review Policy/Procedures team: Bethany Anderson, Susan Braxton, Heidi Imker, and Kyle Rimkus

The rest of the RDS team: Qian Zhang, Elizabeth Wickes, Colleen Fallaw, and Heidi Imker

———

Elise Dunham is a Data Curation Specialist for the Research Data Service at the University of Illinois at Urbana-Champaign. She holds an MLS from the Simmons College Graduate School of Library and Information Science where she specialized in archives and metadata. She contributes to the development of the Illinois Data Bank in areas of metadata management, repository policy, and workflow development. Currently she co-chairs the Research Data Alliance Archives and Records Professionals for Research Data Interest Group and is leading the DACS workshop revision working group of the Society of American Archivists Technical Subcommittee for Describing Archives: A Content Standard.

Keeping Track of Time with Data Accessioner

By Kevin Dyke

This post is the fourth in our Spring 2016 series on processing digital materials.

———

When it comes to working to process large sets of electronic records, it’s all too easy to get so wrapped up in the task at hand that when you finally come up for air you look at the clock and think to yourself, “Where did the time go? How long was I gone?” Okay, that may sound rather apocalyptic, but tracking time spent is an important yet easily elided step in electronic records processing.

At the University of Minnesota Libraries, the members of the Electronic Records Task Force are charged with developing workflows and making estimates for future capacity and personnel needs. In an era of very tight budgets, making a strong, well-documented case for additional personnel and resources is critical. To that end, we’ve made some efforts to more systematically track our time as we pilot our workflows.

Chief among those efforts has been a customization of the Data Accessioner tool. Originally written for internal use at the David M. Rubenstein Rare Book & Manuscript Library at Duke University, the project has since become open source, with support for recent releases coming from the POWRR Project. Written in Java and utilizing the common logging library log4j, Data Accessioner is structured in a way that made it possible for someone like me (familiar with programming, but without much Java experience) to enhance the time-logging functionality. As we know, some accession tasks take a few minutes, while others can run for many hours (if not days). Enhancing the logging functionality of Data Accessioner allows staff to accurately see how long any data transfer takes, without needing to be physically present. The additional functionality was in itself pretty minor: log the time and folder name before starting accessioning of a folder and again upon completion. The most complex part of this process was not writing the additional code, but rather modifying the log4j configuration. Luckily, with an existing configuration file, solid documentation, and countless examples in the wild, I was able to produce a version of Data Accessioner that outputs a daily log as a plain text file, which makes time tracking accessioning jobs much easier. You can see more description of the changes I made and the log output formatting on GitHub. You can download a ZIP file of the application with this addition from that page as well, or use this download link.
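The actual enhancement was written in Java against log4j, but the underlying pattern is simple: log the folder name and a timestamp at the start and end of each transfer, appended to a daily plain-text log. Here is a minimal sketch of that same idea expressed in Python for illustration only; the function and file names are hypothetical and this is not the Data Accessioner code.

```python
import logging
import shutil
import time
from datetime import date
from pathlib import Path

# Append to a daily plain-text log, similar in spirit to the log4j setup described above.
logging.basicConfig(
    filename=f"accession_{date.today():%Y-%m-%d}.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def accession_folder(source: Path, destination: Path) -> None:
    """Copy one folder and log how long the transfer took."""
    logging.info("START folder=%s", source)
    start = time.monotonic()
    shutil.copytree(source, destination / source.name)  # stand-in for the real migration step
    elapsed = time.monotonic() - start
    logging.info("END   folder=%s elapsed_seconds=%.1f", source, elapsed)

if __name__ == "__main__":
    accession_folder(Path("incoming/accession_001"), Path("/mnt/preservation"))
```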

Screenshots and a sample log file:

Main Data Accessioner folder
Contents of log folder
Sample of the beginning and ending of the log file showing the start and end times for file migration

With this change, we are now able to better estimate the time it takes to use Data Accessioner. Do the tools you use keep track of how long they take to run? If not, how are you doing this? Questions or comments can be sent to lib-ertf [at] umn [dot] edu.

———

Kevin Dyke is the spatial data analyst/curator at the University of Minnesota’s John R. Borchert Map Library. He’s a member of the University of Minnesota Libraries’ Electronic Records Task Force, works as a data curator for the Data Repository for the University of Minnesota (DRUM), and is also part of the Committee on Institutional Cooperation’s (CIC) Geospatial Data Discovery Project. He received a Masters degree in Geography from the University of Minnesota and can be reached at dykex005 [at] umn.edu.

Recent Changes in How Stanford University Libraries is Documenting Born-Digital Processing

By Michael G. Olson

This post is the third in our Spring 2016 series on processing digital materials.

———

Stanford University Libraries is in the process of changing how it documents its digital processing activities and records lab statistics. This is our third iteration of how we track our born-digital work in six years and is a collaborative effort between Digital Library Systems and Services, our Digital Archivist Peter Chan, and Glynn Edwards, who manages our Born-Digital Program and is the Director of the ePADD project.

Initially we documented our statistics using a library-hosted FileMaker Pro database. In this initial iteration we were focused on tracking media counts and media failure rates. After a single year of using the database we decided that we needed to modify the data structure and the data entry templates significantly. Our staff found the database too time consuming and cumbersome to modify.

We decided to simplify and replaced the database with a spreadsheet stored with our collection data. Our digital archivist and hourly lab employees were responsible for updating this spreadsheet when they had finished working with a collection. This was a simple solution that was easy to edit and update, and it worked well for four years until we realized we needed more data for our fiscal year-end reports. As our born-digital program has grown and matured, we discovered we were missing key data points that documented important processing decisions in our workflows. It was time to again improve how we documented our work.

Stanford Statistics Spreadsheet version 2

For our brand-new version of work tracking, we have decided to continue using a spreadsheet but have migrated our data to Google Drive to better facilitate updates and versioning of our documentation. New data points have been included to better track specific types of born-digital content, such as email. This new version also allows us to better document the processing lifecycle of our born-digital collections. To do this, we created the following additional data points:

  • Number of email messages
  • Email in ePADD.stanford.edu
  • File count in media cart
  • File size on media cart (GB)
  • SearchWorks (materials discoverable / available in library catalog)
  • SpotLight Exhibit (a virtual exhibit)

Stanford Statistics Spreadsheet version 3

We anticipate that evolving library administrative needs, the continually changing nature of born-digital data, and new methodologies for processing these materials will make it necessary to again change how we document our work. Our solution is not perfect but is flexible enough to allow us to reimagine our documentation strategy in a few short years. If anyone is interested in learning more about what we are documenting and why, please do let us know, as we would be happy to provide further information and may learn something from our colleagues in the process.

———

Michael G. Olson is the Service Manager for the Born-Digital / Forensics Labs at Stanford University Libraries. In this capacity he is responsible for working with library stakeholders to develop services for acquiring, preserving and accessing born-digital library materials. Michael holds a Masters in Philosophy in History and Computing from the University of Glasgow. He can be reached at mgolson [at] Stanford [dot] edu.