Remember that scene in The Matrix where Neo wakes and says “I know kung fu”? Library Carpentry is like that. Almost. Do you need to search lots of files for pieces of text and tire of using Ctrl-F? In the UNIX shell lesson you’ll learn to automate tasks and rapidly extract data from files. Are you managing datasets with not-quite-standardized data fields and formats? In the OpenRefine lesson you’ll easily wrangle data into standard formats for easier processing and de-duplication. There are also Library Carpentry lessons for Python (a popular scripting programming language), Git (a powerful version control system), SQL (a commonly used relational database interface), and many more.
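For a taste of what that automation looks like, here is a minimal sketch in Python (the lesson itself teaches shell tools for this); the folder name and search term are invented for illustration:

```python
# Search every .txt file under a folder for a term -- the kind of task
# you would otherwise do file-by-file with Ctrl-F.
from pathlib import Path

term = "circulation"  # hypothetical search term
for path in Path("data").rglob("*.txt"):  # hypothetical folder of files
    for number, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if term in line:
            print(f"{path}:{number}: {line.strip()}")
```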
But let me back up a bit.
Library Carpentry is part of the Carpentries, an organization designed to provide training to scientists, researchers, and information professionals on the computational skills necessary for work in this age of big data.
The goals of Library Carpentry align with this series’ initial call for contributions, providing resources for those in data- or information-related fields to work “more with a shovel than with a tweezers.” Library Carpentry workshops are primarily hands-on experiences with tools to make work more efficient and less prone to mistakes when performing repeated tasks.
One of the greatest parts of a Library Carpentry workshop is that it begins at the beginning. That is, the first lesson is an Introduction to Data, a structured discussion and exercise session that breaks down jargon (“What is a version control system?”) and sets down some best practices (naming things is hard).
Not only are the lessons designed for those working in library and information professions, but they’re also designed by “in the trenches” folks who are dealing with these data and information challenges daily. As part of the Mozilla Global Sprint, Library Carpentry ran a two-day hackathon in May 2018 where lessons were developed, revised, remixed, and made pretty darn shiny by contributors at ten different sites. For some, the hackathon itself was an opportunity to learn how to use GitHub as a collaboration tool.
Furthermore, Library Carpentry workshops are led by librarians, like the most recent workshop at the University of Arizona, where lessons were taught by our Digital Scholarship Librarian, our Geospatial Specialist, our Liaison Librarian to Anthropology (among other domains), and our Research Data Management Specialist.
Now, a Library Carpentry workshop won’t make you an expert in Python or the UNIX command line in two days. Even Neo had to practice his kung fu a bit. But workshops are designed to be inclusive and accessible, myth-busting, and – I’ll say it – fun. Don’t take my word for it; here’s a sampling of comments from our most recent workshop:
“Loved the hands-on practice on regular expressions.”
“Really great lesson – I liked the challenging exercises, they were fun! It made SQL feel fun instead of scary.”
“Feels very powerful to be able to navigate files this way, quickly & in bulk.”
So regardless of how you work with data, Library Carpentry has something to offer. If you’d like to host a Library Carpentry workshop, you can use our request a workshop form. You can also connect with Library Carpentry through social media, the web, or good old-fashioned e-mail. And since you’re probably working with data already, you have something to offer Library Carpentry. This whole endeavor runs on the multi-faceted contributions of the community, so join us; we have cookies. And APIs. And a web scraping lesson. The terrible puns are just a bonus.
As cultural icons go, the floppy disk continues to persist in the contemporary information technology landscape. Though digital storage has moved beyond the 80 KB – 1.44 MB storage capacity of the floppy disk, its image is often shorthand for the concept of saving one’s work (to wit: Microsoft Word 2016 still uses an icon of a 3.5″ floppy disk to indicate save in its user interface). Likewise, floppy disks make up a sizable portion of many archival collections, in number of objects if not storage footprint. If a creator of personal papers or institutional records maintained their work in electronic form in the 1980s or 1990s, chances are high that these are stored on floppy disks. But the persistent image of the ubiquitous floppy disk conceals a long list of challenges that come into play as archivists attempt to capture their data.
For starters, we often grossly underestimate the extent to which the technology was in active development during its heyday. One would be forgiven for assuming that there existed only a small number of floppy disk formats: namely 5.25″ and 3.5″, plus their 8″ forebears. But within each of these sizes there existed myriad variations in density and encoding, all of which complicate the archivist’s task now that these disks have entered our stacks. This is to say nothing of the hardware: 8″ and 5.25″ drives and standard controller boards are no longer made, and the only 3.5″ drive currently manufactured is a USB device that can read only disks written with the more recent encoding methods and storing file systems compatible with the host computer. And, of course, none of the above accounts for the stability of these obsolete carriers over time.
Enter KryoFlux, a floppy disk controller board first made available in 2009. KryoFlux is very powerful, allowing users of contemporary Windows, Mac, and Linux machines to interface with legacy floppy drives via a USB port. The KryoFlux does not attempt to mount a floppy disk’s file system to the host computer, granting two chief affordances: users can acquire data (a) independent of their host computer’s file system, and (b) without necessarily knowing the particulars of the disk in question. The latter is particularly useful when attempting to analyze less stable media.
Despite the powerful utility of KryoFlux, uptake among archives and digital preservation programs has been hampered by a lack of accessible documentation and training resources. The official documentation and user forums assume a level of technical knowledge largely absent from traditional archival training. Following several informal conversations at Stanford University’s Born-Digital Archives eXchange events in 2015 and 2016, as well as discussions at various events hosted by the BitCurator Consortium, we formed a working group that included archivists and archival studies students from Emory University, the University of California Los Angeles, Yale University, Duke University, and the University of Texas at Austin to create user-friendly documentation aimed specifically at archivists.
Development of The Archivist’s Guide to KryoFlux began in 2016, with a draft released on Google Docs in spring 2017. The working group invited feedback during a six-month comment period and was gratified to receive a wide range of comments and questions from the community. Informed by this incredible feedback, a revised version of the Guide is now hosted on GitHub and available for anyone to use, though the use cases described are generally those encountered by archivists working with born-digital collections in institutional and manuscript repositories.
The Guide is written in two parts. “Part One: Getting Started” provides practical guidance on how to set up and begin using the KryoFlux and aims to be as inclusive and user-friendly as possible. It includes instructions for running KryoFlux using both Mac and Windows operating systems. Instructions for running KryoFlux using Linux are also provided, allowing repositories that use BitCurator (an Ubuntu-based open-source suite of digital archives tools) to incorporate the KryoFlux into their workflows.
“Part Two: In Depth” examines KryoFlux features and floppy disk technology in more detail. This section introduces the variety of floppy disk encoding formats and provides guidance as to how KryoFlux users can identify them. Readers can also find information about working with 40-track floppy disks. Part Two covers KryoFlux-specific output too, including log files and KryoFlux stream files, and suggests ways in which archivists might make use of these files to support digital preservation best practices. Short case studies documenting the experiences of archivists at other institutions are also included here, providing real-life examples of KryoFlux in action.
As with any technology, the KryoFlux hardware and software will undergo updates and changes in the future which will, if we are not careful, affect the currency of the Guide. To address this possibility, the working group has chosen to host the Guide as a public GitHub repository. This platform supports versioning and allows for easy collaboration between members of the working group. Perhaps most importantly, GitHub supports the integration of community-driven contributions, including revisions, corrections, and updates. We have established a process for soliciting and reviewing additional contributions and corrections (short answer: submit a pull request via GitHub!), and will annually review the membership of an ongoing working group responsible for monitoring this work, to ensure that the Guide remains actively maintained for as long as humanly possible.
[Matthew] Farrell is the Digital Records Archivist in Duke University’s David M. Rubenstein Rare Book & Manuscript Library. Farrell holds an MLS from the University of North Carolina at Chapel Hill.
Shira Peltzman is the Digital Archivist for the UCLA Library where she leads a preservation program for Library Special Collections’ born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts, and was a member of the inaugural cohort of the National Digital Stewardship Residency in New York (NDSR-NY).
I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.
During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited for our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, in which I built a system that uses real-time camera feeds to detect human-vehicle interactions and sound alarms if a collision is imminent. I used a Supervised Machine Learning algorithm, a Support Vector Machine (SVM), to train a model to identify cars and humans in individual data frames. In that project, Supervised Machine Learning worked well when applied to identifying objects embedded in images. But I do not believe it will be as beneficial for our project, which is working with text-only data. Here is my argument for my position.
In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to train it. Training is based on the input set, and the algorithms that process the input set give the required output. In my project, I created a training dataset containing pictures of specific attributes of cars (windshields, wheels) and specific attributes of humans (faces, hands, and legs). I needed a training set of 900-1,000 images to achieve a ~92% level of accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would be challenged to accurately distinguish windshield glass from other glass (e.g. building windows) present in many places on a whole data frame.
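For illustration, here is a minimal supervised-learning sketch, assuming Python and scikit-learn, with synthetic numbers standing in for extracted image features (this is not the actual accident-prevention code):

```python
# Minimal supervised-learning sketch: train an SVM on labeled feature
# vectors (stand-ins for image features such as windshields vs. faces).
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in data: 1,000 samples in 2 classes ("car" vs. "human").
X = np.vstack([rng.normal(0, 1, (500, 20)), rng.normal(1, 1, (500, 20))])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = SVC(kernel="rbf").fit(X_train, y_train)  # learn from labeled examples
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The point of the sketch is the workflow: every training example must carry a correct label before the algorithm can learn anything.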
For Supervised Machine Learning to work well, the expected output of an algorithm should be known and the data that is used to train the algorithm should be properly labeled. This takes a great deal of effort. A huge volume of words along with their synonyms would be needed as a training set. But this implies we know what we are looking for in the data. I believe for the purposes of our project, the expected output is not so clearly known (all “sensitive” content) and therefore a reliable training set and algorithm would be difficult to create.
In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together by likely relevance. Unsupervised Machine Learning tools can still accept human-supplied input text or data from which the algorithm trains itself. I believe this approach is better than Supervised Machine Learning for our purposes: through clustering mechanisms, the input data can first be divided into clusters, and the test data can then be identified using those clusters.
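By way of contrast, a minimal unsupervised sketch (again assuming Python and scikit-learn, with invented snippets standing in for real email text): documents are grouped purely by similarity, with no labels supplied up front.

```python
# Minimal unsupervised-learning sketch: cluster documents by similarity
# using TF-IDF vectors and k-means, with no pre-classified training set.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical email snippets
    "budget meeting rescheduled to friday",
    "quarterly budget figures attached",
    "employee SSN and payroll records enclosed",
    "payroll direct deposit form for new hire",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(docs, labels):
    print(label, doc)
```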
In summary, a Supervised Machine Learning tool learns to ascribe the labels supplied in its training data, but creating a reliable training dataset from our textual data would take significant effort. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project, particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.
Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.
This summer I had the pleasure of accessioning a large digital collection from a retiring staff member. Due to their longevity with the institution, the creator had amassed an extensive digital record. In addition to their desktop files, the archive collected a 15.8 GB Outlook archive (.pst) file! This was my first time working with emails, and it was also the first time some of the tools discussed below were used in my institution’s workflow. As a newcomer to the digital archiving community, I would like to share this case study and my first impressions of the tools I used in this acquisition.
My original workflow:
1. Convert the .pst file into an .mbox file.
2. Place both files in a folder titled Emails and add this folder to the acquisition folder that contains the Desktop files folder. This way the digital records can be accessioned as one unit.
Things were moving smoothly; I was able to use Emailchemy, a tool that converts email from closed, proprietary file formats, such as the .pst files used by Outlook, to standard, portable formats that any application can use, such as .mbox files, which can be read using Thunderbird, Mozilla’s open source email client. I used a Windows laptop with Outlook and Thunderbird installed to complete this task. I had no issues with Emailchemy; the instructions in the owner’s manual were clear, and the process was easy. Next, I uploaded the Email folder, which contained the .pst and .mbox files, to the acquisition external hard drive and began processing with BitCurator. The machine I used to accession is a FRED, a powerful forensic recovery tool used by law enforcement and some archivists. Our FRED runs BitCurator, a Linux environment. This is an important fact to remember because .pst files will not open on a Linux machine.
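One reason .mbox makes a comfortable target format is that it is an open, text-based standard; even Python’s standard library can read it without an email client. A minimal sketch, with a hypothetical file path:

```python
# Read an .mbox mailbox with Python's standard library -- no email
# client required -- and list basic headers for each message.
import mailbox

mbox = mailbox.mbox("acquisition/Emails/creator.mbox")  # hypothetical path
for message in mbox:
    print(message["Date"], message["From"], message["Subject"])
```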
At Princeton, we use Bulk Extractor to check for Personally Identifiable Information (PII) and credit card numbers. This is step 6 in our workflow and this is where I ran into some issues.
The program was unable to complete four threads within the Email folder and timed out. Based on my reading of the explanation message I received (and some research, a.k.a. Googling, because I did not understand it), the program could not fully execute the task with the amount of processing power available. The message was essentially saying, “I don’t know why this is taking so long. It’s you, not me. You need a better computer.” From the initial scan results, I was able to remove PII from the Desktop folder. So instead of running the scan on the entire acquisition folder, I ran it solely on the Email folder, and the scan still timed out. Despite the incomplete scan, I moved on with the results I had.
I tried to make sense of the reports Bulk Extractor created for the email files. The Bulk Extractor output includes a full file path for each file flagged, e.g. (/home/user/Desktop/blogERs/Email.docx). This is how I was able to locate files within the Desktop folder. The output for the Email folder was much harder to use, because every hit pointed into the .mbox rather than at an individual document.
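For anyone who wants to work with these results outside the viewer: Bulk Extractor writes its hits to plain-text, tab-separated “feature files.” The file name and exact layout below are assumptions based on our setup, so treat this as a hedged sketch of pulling out the flagged locations:

```python
# Hedged sketch: walk a Bulk Extractor feature file (assumed to be
# tab-separated: location, flagged text, surrounding context).
def read_feature_file(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip the comment banner at the top of the file
            parts = line.rstrip("\n").split("\t", 2)
            if len(parts) == 3:  # location, feature, context
                yield parts

for location, feature, _context in read_feature_file("be_output/pii.txt"):
    print(location, feature)
```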
Bulk Extractor Viewer does display the content, but it displays it the way a text editor (e.g. Notepad) would: all the coding alongside the content of the message, not as an email, because all the results came from the .mbox file. This is simply what .mbox looks like without an email client to translate the material into a human-readable format. That output makes it hard to locate an individual message within a .pst, because it is difficult (though not impossible) to find the date or title of the email amongst the coding. This was my first time encountering results like this, and it freaked me out a bit.
Because regular expressions, the search method used by Bulk Extractor, look for number patterns, some of the hits were false positives: number strings that matched the pattern of SSNs or credit card numbers. So in lieu of social security numbers, I found FedEx tracking numbers and mistyped phone numbers (though, to be fair, a mistyped phone number is someone’s SSN). For credit card numbers, the program picked up email coding and non-financial number patterns.
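A small Python demonstration of why this happens: any string shaped like ddd-dd-dddd satisfies an SSN-style pattern, whether or not it actually is one (the sample strings are invented):

```python
# Pattern-based searching flags shape, not meaning: both strings below
# match an SSN-style regex, but only one is plausibly an SSN.
import re

ssn_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
samples = [
    "My SSN is 078-05-1120",    # a real-looking hit
    "Call me at 555-86-2404",   # mistyped phone number: false positive
]
for text in samples:
    print(ssn_pattern.findall(text), "<-", text)
```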
The scan found an SSN I had to remove from both the .pst and the .mbox. Remember, .pst files only work with Microsoft Outlook. At this point in processing I was on a Linux machine and could not open the .pst, so I focused on the .mbox. Using the flagged terms, I thought I could run a keyword search within the .mbox to locate and remove the flagged material, because .mbox files can be opened with a text editor. Remember when I said the .pst was over 15 GB? Well, the .mbox was just as large, and this caused the text editor to stall and eventually give up on opening the file. Despite these challenges, I remained steadfast and found UltraEdit, a large-file text editor. This whole process took a couple of days, and in the end the results from Bulk Extractor’s search indicated the email files contained one SSN and no credit card numbers.
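In hindsight, a streaming scan is another way around the file-size wall: reading the .mbox one line at a time never loads all 15 GB into memory. A hedged sketch, with a hypothetical file name:

```python
# Scan a very large .mbox line by line so the whole file never has to
# fit in memory, printing any line containing an SSN-shaped string.
import re

flagged = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # SSN-shaped strings
with open("creator.mbox", encoding="utf-8", errors="replace") as f:  # hypothetical path
    for line_number, line in enumerate(f, start=1):
        if flagged.search(line):
            print(line_number, line.strip())
```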
While discussing my difficulties with my supervisor, she suggested trying FileLocator Pro, a scanner like Bulk Extractor but created with .pst files in mind, to fulfill our due diligence in looking for sensitive information, since the Bulk Extractor scan had timed out before finishing. FileLocator Pro runs on Windows, so, unfortunately, we couldn’t do the scan on the FRED; even so, it was able to catch real SSNs hidden in attachments that did not appear in the Bulk Extractor results.
Like Bulk Extractor, FileLocator Pro let me view the email with the flagged content highlighted. There is also the option to open attachments or emails in their respective programs, so a .pdf file opened in Adobe and the email messages opened in Outlook. Even though I had false positives with FileLocator Pro, verifying the content was easy. It didn’t perform as well searching for credit card numbers: I had some error messages stating that certain attached files contained no readable text or that FileLocator Pro had to use a raw data search instead of its primary method. These errors were limited to attachments with .gif, .doc, .pdf, and .xls extensions. But overall it was a shorter and better experience working with FileLocator Pro, at least when it comes to email files.
As emails continue to dominate how we communicate at work and in our personal lives, archivists and electronic records managers can expect to process even larger files, despite how long an individual stays at an institution. Larger files can make the hunt for PII and other sensitive data feel like searching for a needle in a haystack, especially when our scanners are unable to flag individual emails, attachments, or even complete a scan. There’s no such thing as a perfect program; I like Bulk Extractor for non-email files, and I have concerns with FileLocator Pro. However, technology continues to improve and with forums like this blog we can learn from one another.
Valencia Johnson is the Digital Accessioning Assistant for the Seeley G. Mudd Manuscript Library at Princeton University. She is a certified archivist with an MA in Museum Studies from Baylor University.
Soon after I arrived at Texas A&M University-Corpus Christi in January 2017 as the university’s first Processing and Digital Assets Archivist, two high-level longtime employees retired or switched positions. Therefore, I fast-tracked an effort to begin collecting selected email records because these employees undoubtedly had some correspondence of long-term significance, which was also governed by the Texas A&M System’s records retention schedules.
I began by testing ePADD, software used to conduct various archival processes on email, on small date ranges of my work email account. I ultimately decided to begin using it on selected campus email because I found it relatively easy to use, it includes some helpful appraisal tools, and it provides an interface for patrons to view and select records of which they want a copy. Since the emails themselves live as MBOX files in folders outside of the software, and are viewable with a text editor, I felt comfortable that using ePADD meant not risking the loss of important records. I installed ePADD on my laptop with the thought that traveling to the employees would make the process of transferring their email easier and encourage cooperation.
Transferring the email
In June 2017, I used ePADD Version 3.1 to collect the email of the two employees. My department head shared general information and arranged an appointment with the employees’ current administrative assistant or interim replacement as applicable. She also made a request to campus IT that they keep the account of the retired employee open. IT granted the interim replacement access to the account.
I then traveled to the employees’ offices where they entered the appropriate credentials for the university email account into ePADD, identified which folders were most likely to contain records of long-term historical value, and verified the date range I needed to capture. Then we waited.
In one instance, I had to leave my laptop running in the person’s office overnight because I needed to maintain a consistent internet connection during ePADD’s approximately eight hours of harvesting and the office was off-campus. I had not remembered to bring a power cord, but thankfully my laptop was fully charged.
Our main success—we were actually able to collect some records! Obvious, yes, but I state it because it was the first time TAMU-CC has ever collected this record format and the email of the departed employee was almost deactivated before we sent our preservation request to IT. Second, my department head and I have started conversations with important players on campus about the ethical and legal reasons why the archives needs to review email before disposal.
In both cases, the employee had deleted a significant number of emails before we were able to capture their account and had used their work account for personal email. These behaviors confirmed what we all already knew–employees are largely unaware that their email is an official record. Therefore, we plan to increase efforts to educate faculty and staff about this fact, their responsibilities, and best practices for organizing their email. The external conversations we have had so far are an important start.
ePADD enabled me to combat the personal email complication by systematically deleting all emails from specific individual senders in batch. I took this approach for family members, listservs, and notifications from various personal accounts.
The feature that recognizes sensitive information worked well in identifying messages that contained social security numbers. However, it did not flag messages that contained phone numbers, which we also consider sensitive personal information. Additionally, in-message redaction is not possible in 3.1.
For messages I have marked as restricted, I have chosen to add an annotation as well that specifies the reason for the restriction. This will enable me to manage those emails at a more granular level. This approach was a modification of a suggestion by fellow archivists at Duke University.
Currently, the email is living on a networked drive while we establish an Amazon S3 account and an Archivematica instance. We plan to provide access to email in our reading room via the ePADD delivery module and publicize this access via finding aids. Overall ePADD is a positive step forward for TAMU-CC.
Note from the Author:
Since writing this post, I have learned that it is possible in ePADD to use regular expressions to further aid in identifying potentially sensitive materials. By default the program uses regular expressions to find social security numbers, but it can be configured to find other personal information such as credit card numbers and phone numbers. Further guidance is provided in the Reviewing Regular Expressions section of the ePADD User Guide.
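To give a sense of what such expressions look like, here are illustrative Python patterns for the categories mentioned above. These are not ePADD’s configuration syntax (consult the User Guide for that), and real-world patterns need tuning:

```python
# Illustrative regular expressions for categories of sensitive data one
# might configure a scanning tool to find; deliberately loose patterns.
import re

patterns = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){13,16}\b",   # 13-16 digits, loose
    "us_phone": r"\b\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b",
}
text = "SSN 078-05-1120; call (361) 825-0000; card 4111 1111 1111 1111."
for name, pattern in patterns.items():
    print(name, re.findall(pattern, text))
```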
Alston Cobourn is the Processing and Digital Assets Archivist at Texas A&M University-Corpus Christi where she leads the library’s digital preservation efforts. Previously she was the Digital Scholarship Librarian at Washington and Lee University. She holds a BA and MLS with an Archives and Records Management concentration from UNC-Chapel Hill.
This is the second post in the bloggERS series describing outcomes of the #OSS4Pres 2.0 workshop at iPRES 2016, addressing open source tool and software development for digital preservation. This post outlines the work of the group tasked with “drafting a design guide and requirements for Free and Open Source Software (FOSS) tools, to ensure that they integrate easily with digital preservation institutional systems and processes.”
The FOSS Development Requirements Group set out to create a design guide for FOSS tools to ensure easier adoption of open-source tools by the digital preservation community, including their integration with common end-to-end software and tools supporting digital preservation and access that are now in use by that community.
The group included representatives of large digital preservation and access projects such as Fedora and Archivematica, as well as tool developers and practitioners, ensuring a range of perspectives were represented. The group’s initial discussion led to the creation of a list of minimum necessary requirements for developing open source tools for digital preservation, based on similar examples from the Open Preservation Foundation (OPF) and from other fields. Below is the draft list that the group came up with, followed by some intended future steps. We welcome feedback or additions to the list, as well as suggestions for where such a list might be hosted long term.
Minimum Necessary Requirements for FOSS Digital Preservation Tool Development
Provide publicly accessible documentation and an issue tracker
Have a documented process for how people can contribute to development, report bugs, and suggest new documentation
Every tool should do the smallest possible task really well; if you are developing an end-to-end system, develop it in a modular way in keeping with this principle
Follow established standards and practices for development and use of the tool
Keep documentation up-to-date and versioned
Follow test-driven development philosophy
Don’t develop a tool without use cases and stakeholders willing to validate those use cases
Use an open and permissive software license to allow for integrations and broader use
Have a mailing list, Slack or IRC channel, or other means for community interaction
Establish community guidelines
Provide a well-documented mechanism for integration with other tools/systems in different languages
Provide functionality of tool as a library, separating out the GUI and the actual functions (see the sketch after this list)
Package the tool in an easy-to-use way; the more broadly you want the tool to be used, the more operating systems you should package it for
Use a packaging format that supports any dependencies
Provide examples of functionality for potential users
Consider the organizational home or archive for the tool for long-term sustainability; develop your tool based on potential organizations’ guidelines
Consider providing a mechanism for internationalization of your tool (this is a broader community need as well, to identify the tools that exist and to incentivize this)
Digital preservation is an operating system-agnostic field
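To make the “library first” requirement concrete, here is a hypothetical sketch in Python: the actual function lives in an importable module, and the command-line interface is only a thin wrapper around it. The fixity example and all names are invented for illustration.

```python
# Hypothetical fixity tool sketched "library first": checksum_file() can
# be imported by other tools; the CLI below is only a thin wrapper.
import argparse
import hashlib

def checksum_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Checksum a file.")
    parser.add_argument("path")
    parser.add_argument("--algorithm", default="sha256")
    args = parser.parse_args()
    print(checksum_file(args.path, args.algorithm))
```

Because the function is separate from the interface, a workflow system can call checksum_file() directly while end users still get a simple command-line tool.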
Feedback and Perspectives. Because of the expense of the iPRES conference (and its location in Switzerland), all of the group members were from relatively large and well-resourced institutions. The perspective of under-resourced institutions is very often left out of open-source development communities, as they are unable to support and contribute to such projects; in this case, this design guide would greatly benefit from the perspective of such institutions as to how FOSS tools can be developed to better serve their digital preservation needs. The group was also largely from North America and Europe, so this work would eventually benefit greatly from adding perspectives from the FOSS and digital preservation communities in South America, Asia, and Africa.
Institutional Home and Stewardship. When finalized, the FOSS development requirements list should live somewhere permanently and develop based on the ongoing needs of our community. As this line of communication between practitioners and tool developers is key to the continual development of better and more user-friendly digital preservation tools, we should continue to build on the work of this group.
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.
Organized by Sam Meister (Educopia), Shira Peltzman (UCLA), Carl Wilson (Open Preservation Foundation), and Heidi Kelly (Indiana University), OSS4PRES 2.0 was a half-day workshop that took place during the 13th annual iPRES 2016 conference in Bern, Switzerland. The workshop aimed to bring together digital preservation practitioners, developers, and administrators in order to discuss the role of open source software (OSS) tools in the field.
Although several months have passed since the workshop wrapped up, we are sharing this information now in an effort to raise awareness of the excellent work completed during this event, to continue the important discussion that took place, and to hopefully broaden involvement in some of the projects that developed. First, however, a bit of background: The initial OSS4PRES workshop was held at iPRES 2015. Attended by over 90 digital preservation professionals from all areas of the open source community, it featured individuals reporting on specific issues related to open source tools, followed by small group discussions about the opportunities, challenges, and gaps that they observed. The energy from this initial workshop led both to the proposal of a second workshop and to a report published in the Code4Lib Journal, OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements.
The overarching goal for the 2016 workshop was to build bridges and fill gaps within the open source community at large. In order to facilitate a focused and productive discussion, OSS4PRES 2.0 was organized into three groups, each of which was led by one of the workshop’s organizers. Additionally, Shira Peltzman floated between groups to minimize overlap and ensure that each group remained on task. In addition to maximizing our output, one of the benefits of splitting up into groups was that each group was able to focus on disparate but complementary aspects of the open source community.
Develop user stories for existing tools (group leader: Carl Wilson)
Carl’s group was made up principally of digital preservation practitioners. The group scrutinized existing pain points in the day-to-day management of digital material, identified tools needed by the open source community that had not yet been built, and began to fill this gap by drafting functional requirements for those tools.
Define requirements for online communities to share information about local digital curation and preservation workflows (group leader: Sam Meister)
With an aim to strengthen the overall infrastructure around open source tools in digital preservation, Sam’s group focused on the larger picture by addressing the needs of the open source community at large. The group drafted a list of requirements for an online community space for sharing workflows, tool integrations, and implementation experiences, to facilitate connections between disparate groups, individuals, and organizations that use and rely upon open source tools.
Define requirements for new tools (group leader: Heidi Kelly)
Heidi’s group looked at how the development of open source digital preservation tools could be improved by implementing a set of minimal requirements to make them more user-friendly. Since a list of these requirements specifically for the preservation community had not existed previously, this list both fills a gap and facilitates the building of bridges, by enabling developers to create tools that are easier to use, implement, and contribute to.
Ultimately OSS4PRES 2.0 was an effort to make the open source community more open and diverse, and in the coming weeks we will highlight what each group managed to accomplish towards that end. The blog posts will provide an in-depth summary of the work completed both during and since the event took place, as well as a summary of next steps and potential project outcomes. Stay tuned!
Shira Peltzman is the Digital Archivist for the UCLA Library where she leads the development of a sustainable preservation program for born-digital material. Shira received her M.A. in Moving Image Archiving and Preservation from New York University’s Tisch School of the Arts and was a member of the inaugural class of the National Digital Stewardship Residency in New York (NDSR-NY).
Heidi Elaine Kelly is the Digital Preservation Librarian at Indiana University, where she is responsible for building out the infrastructure to support long-term sustainability of digital content. Previously she was a DiXiT fellow at Huygens ING and an NDSR fellow at the Library of Congress.
Contemplating how to provide access to born-digital materials? Wondering how to meet researcher needs for accessing and analyzing files? We are too! Join us for a Twitter chat on providing access to born digital records.
When? Thursday, October 27 at 2:00pm and 9:00pm EST
How? Follow #bdaccess for the discussion
Who? Researchers, information professionals, and anyone else interested in using born-digital records
Newly conceived #bdaccess chats are organized by an ad-hoc group that formed at the 2015 SAA annual meeting. We are currently developing a bootcamp to share ideas and tools for providing access to born-digital materials and have teamed up with the Digital Library Federation to spread the word about the project.
Understanding how researchers want to access and use digital archives is key to our curriculum’s success, so we’re taking it to the Twitter streets to gather feedback from digital researchers. The following five questions will guide the discussion:
Q1. What research topic(s) of yours and/or content types have required the use of born digital materials?
Q2. What challenges have you faced in accessing and/or using born digital content? Any suggested improvements?
Q3. What discovery methods do you think are most suitable for research with born digital material?
Q4. What information or tools do/could help provide the context needed to evaluate and use born digital material?
Q5. What information about collecting/providing access would you like to see accompanying born digital archives?
Can’t join on the 27th? Follow #bdaccess for ongoing discussion and future chats!
Jess Farrell is the curator of digital collections at Harvard Law School. Along with managing and preserving digital history, she’s currently fixated on inclusive collecting, labor issues in libraries, and decolonizing description.
Sarah Dorpinghaus is the Director of Digital Services at the University of Kentucky Libraries Special Collections Research Center. Although her research interests lie in the realm of born-digital archives, she has a budding pencil collection.