More skills, less pain with Library Carpentry

By Jeffrey C. Oliver, Ph.D.

This is the second post in the bloggERS Making Tech Skills a Strategic Priority series.

Remember that scene in The Matrix where Neo wakes up and says “I know kung fu”? Library Carpentry is like that. Almost. Do you need to search lots of files for pieces of text and tire of using Ctrl-F? In the UNIX shell lesson you’ll learn to automate tasks and rapidly extract data from files. Are you managing datasets with not-quite-standardized data fields and formats? In the OpenRefine lesson you’ll easily wrangle data into standard formats for easier processing and de-duplication. There are also Library Carpentry lessons for Python (a popular scripting language), Git (a powerful version control system), SQL (a commonly used relational database query language), and many more.
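To make that concrete, here’s a minimal sketch, in Python rather than the shell (the shell lesson itself reaches for tools like grep), of the “search lots of files without Ctrl-F” task. The folder name and search string are invented for illustration:

```python
# Toy version of the bulk text search the shell lesson automates.
# The "records" folder and the search string are hypothetical examples.
from pathlib import Path

pattern = "acquisition date"
for path in Path("records").rglob("*.txt"):  # every .txt file, recursively
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for number, line in enumerate(handle, start=1):
            if pattern in line:
                print(f"{path}:{number}: {line.strip()}")
```

Two dozen files or two thousand, the loop doesn’t care – which is rather the point.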

But let me back up a bit.

Library Carpentry is part of the Carpentries, an organization designed to provide scientists, researchers, and information professionals with training in the computational skills necessary for work in this age of big data.

The goals of Library Carpentry align with this series’ initial call for contributions, providing resources for those in data- or information-related fields to work “more with a shovel than with a tweezers.” Library Carpentry workshops are primarily hands-on experiences with tools to make work more efficient and less prone to mistakes when performing repeated tasks.

One of the greatest parts of a Library Carpentry workshop is that it begins at the beginning. That is, the first lesson is an Introduction to Data, a structured discussion and exercise session that breaks down jargon (“What is a version control system?”) and sets down some best practices (naming things is hard).

Not only are the lessons designed for those working in library and information professions, but they’re also designed by “in the trenches” folks who are dealing with these data and information challenges daily. As part of the Mozilla Global Sprint, Library Carpentry ran a two-day hackathon in May 2018 where lessons were developed, revised, remixed, and made pretty darn shiny by contributors at ten different sites. For some, the hackathon itself was an opportunity to learn how to use GitHub as a collaboration tool.

Furthermore, Library Carpentry workshops are led by librarians. At the most recent workshop at the University of Arizona, lessons were taught by our Digital Scholarship Librarian, our Geospatial Specialist, our Liaison Librarian to Anthropology (among other domains), and our Research Data Management Specialist.

Now, a Library Carpentry workshop won’t make you an expert in Python or the UNIX command line in two days. Even Neo had to practice his kung fu a bit. But workshops are designed to be inclusive and accessible, myth-busting, and – I’ll say it – fun. Don’t take my word for it; here’s a sampling of comments from our most recent workshop (a short sketch after the list gives a taste of the regular-expression practice they mention):

  • Loved the hands-on practice on regular expressions
  • Really great lesson – I liked the challenging exercises, they were fun! It made SQL feel fun instead of scary
  • Feels very powerful to be able to navigate files this way, quickly & in bulk.
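To give a flavor of that regular-expression practice, here’s a small sketch in Python. The date fields are invented for this example, but turning messy free-text strings into structured data is exactly the kind of exercise the lessons work through:

```python
# A taste of the regex exercises: pull four-digit years out of
# messy, free-text date fields. The sample values are invented.
import re

fields = ["c. 1923", "1899-1901", "undated", "ca 1950?"]
year = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")  # matches years 1800-2099
for field in fields:
    print(field, "->", year.findall(field))
```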

So regardless of how you work with data, Library Carpentry has something to offer. If you’d like to host a Library Carpentry workshop, you can use our request-a-workshop form. You can also connect to Library Carpentry through social media, the web, or good old-fashioned e-mail. And since you’re probably working with data already, you have something to offer Library Carpentry. This whole endeavor runs on the multi-faceted contributions of the community, so join us – we have cookies. And APIs. And a web scraping lesson. The terrible puns are just a bonus.


IEEE Big Data 2018: 3rd Computational Archival Science (CAS) Workshop Recap

by Richard Marciano, Victoria Lemieux, and Mark Hedges

Introduction

The 3rd workshop on Computational Archival Science (CAS) was held on December 12, 2018, in Seattle, following two earlier CAS workshops, in 2016 in Washington, DC and in 2017 in Boston. It also built on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 IEEE Big Data conferences, and more directly on a symposium held in April 2016 at the University of Maryland. The current working definition of CAS is:

A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material [1].

The workshop featured five sessions and thirteen papers with international presenters and authors from the US, Canada, Germany, the Netherlands, the UK, Bulgaria, South Africa, and Portugal. All details (photos, abstracts, slides, and papers) are available at: http://dcicblog.umd.edu/cas/ieee-big-data-2018-3rd-cas-workshop/. The keynote focused on using digital archives to preserve the history of WWII Japanese-American incarceration and featured Geoff Froh, Deputy Director at Densho.org in Seattle.

Keynote speaker Geoff Froh, Deputy Director at Densho.org in Seattle, presenting on “Reclaiming our Story: Using Digital Archives to Preserve the History of WWII Japanese American Incarceration.”

This workshop explored the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and with new forms of analysis and historical, social, scientific, and cultural research engagement with archives. The aim was to identify and evaluate current trends, requirements, and potential in these areas; to examine the new questions that they can provoke; and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we addressed the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put: appraising the challenges of producing quality – meaning, knowledge and value – from quantity; tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems; and addressing data privacy issues.

Sessions

  1. Computational Thinking and Computational Archival Science
  • #1: Introducing Computational Thinking into Archival Science Education [William Underwood et al.]
  • #2: Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camp Records [Richard Marciano et al.]
  • #3: Computational Archival Practice: Towards a Theory for Archival Engineering [Kenneth Thibodeau]
  • #4: Stirring The Cauldron: Redefining Computational Archival Science (CAS) for The Big Data Domain [Nathaniel Payne]
  2. Machine Learning in Support of Archival Functions
  • #5: Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records [Tim Hutchinson]
  • #6: Computer-Assisted Appraisal and Selection of Archival Materials [Cal Lee]
  3. Metadata and Enterprise Architecture
  • #7: Measuring Completeness as Metadata Quality Metric in Europeana [Péter Király et al.]
  • #8: In-place Synchronisation of Hierarchical Archival Descriptions [Mike Bryant et al.]
  • #9: The Utility Enterprise Architecture for Records Professionals [Shadrack Katuu]
  4. Data Management
  • #10: Framing the scope of the common data model for machine-actionable Data Management Plans [João Cardoso et al.]
  • #11: The Blockchain Litmus Test [Tyler Smith]
  5. Social and Cultural Institution Archives
  • #12: A Case Study in Creating Transparency in Using Cultural Big Data: The Legacy of Slavery Project [Ryan Cox, Sohan Shah et al.]
  • #13: Jupyter Notebooks for Generous Archive Interfaces [Mari Wigham et al.]

Next Steps

Updates will continue to be provided through the CAS Portal website (see: http://dcicblog.umd.edu/cas) and through a Google Group you can join at computational-archival-science@googlegroups.com.

Several related events are scheduled for April 2019: (1) a 1½-day workshop on “Developing a Computational Framework for Library and Archival Education,” taking place April 3 & 4, 2019, at iConference 2019 (see: https://iconference2019.umd.edu/external-events-and-excursions/ for details), and (2) a “Blue Sky” paper session on “Establishing an International Computational Network for Librarians and Archivists” (see: https://www.conftool.com/iConference2019/index.php?page=browseSessions&form_session=356).

Finally, we are planning a 4th CAS Workshop in December 2019 at the 2019 IEEE International Conference on Big Data (IEEE BigData 2019) in Los Angeles, CA. Stay tuned for the upcoming CAS#4 workshop call for proposals; SAA member contributions will be most welcome!

References

[1] Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M., & Conrad, M. “Archival Records and Training in the Age of Big Data.” In J. Percell, L. C. Sarin, P. T. Jaeger, & J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited, May 17, 2018. See: http://dcicblog.umd.edu/cas/wp-content/uploads/sites/13/2017/06/Marciano-et-al-Archival-Records-and-Training-in-the-Age-of-Big-Data-final.pdf


Richard Marciano is a professor at the University of Maryland iSchool, where he directs the Digital Curation Innovation Center (DCIC). He previously conducted research at the San Diego Supercomputer Center at the University of California, San Diego for over a decade. His research interests center on digital preservation, sustainable archives, cyberinfrastructure, and big data. He is also the 2017 recipient of the Emmett Leahy Award for achievements in records and information management. Marciano holds degrees in Avionics and Electrical Engineering, as well as a Master’s and a Ph.D. in Computer Science from the University of Iowa. In addition, he conducted postdoctoral research in Computational Geography.

Victoria Lemieux is an associate professor of archival science at the iSchool and lead of the blockchain research cluster Blockchain@UBC at the University of British Columbia – Canada’s largest and most diverse research cluster devoted to blockchain technology. Her current research focuses on risks to the availability of trustworthy records, in particular in blockchain recordkeeping systems, and on how these risks affect transparency, financial stability, public accountability, and human rights. She has organized two summer institutes for Blockchain@UBC to provide training in blockchain and distributed ledgers, and her next summer institute is scheduled for May 27-June 7, 2019. She has received many awards for her professional work and research, including the 2015 Emmett Leahy Award for outstanding contributions to the field of records management, a 2015 World Bank Big Data Innovation Award, a 2016 Emerald Literati Award, and a 2018 Britt Literary Award for her research on blockchain technology. She is also a faculty associate at multiple units within UBC, including the Peter Wall Institute for Advanced Studies, the Sauder School of Business, and the Institute for Computers, Information and Cognitive Systems.

Mark Hedges is a Senior Lecturer in the Department of Digital Humanities at King’s College London, where he teaches on the MA in Digital Asset and Media Management and serves as Departmental Research Lead. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London before embarking on a 17-year career in the software industry; he joined King’s in 2005. His research is concerned primarily with digital archives, research infrastructures, and computational methods, and he has led a range of projects in these areas over the last decade. Most recently he has been working in Rwanda on initiatives relating to digital archives and the transformative impact of digital technologies.

A Recap of “DAM if you do and DAM if you don’t!”

by Regina Carra

When: December 3, 2018

Where: Metropolitan New York Library Council (METRO), New York, NY

Speakers:

  • Stephen Klein, Digital Services Librarian at the CUNY Graduate Center (CUNY)
  • Ashley Blewer, AV Preservation Specialist at Artefactual
  • Kelly Stewart, Digital Preservation Services Manager at Artefactual

On December 3, 2018, the Metropolitan New York Library Council (METRO)’s Digital Preservation Interest Group hosted an informative (and impeccably titled) presentation about how the CUNY Graduate Center (GC) plans to incorporate Archivematica, a web-based, open-source digital asset management system (DAMS) developed by Artefactual, into its document management strategy for student dissertations. The presentation began with an overview from Stephen about the GC’s needs and why the GC chose Archivematica as its DAMS, followed by an introduction to and demo of Archivematica and DuraCloud, an open-source cloud storage service, led by Ashley and Kelly (who presented via video conference). While this post provides a general summary of the presentation, I recommend reaching out to any of the presenters for more detailed information about their work. They were all great!

Every year the GC Library receives between 400 and 500 dissertations, theses, and capstones. These submissions can include a wide variety of digital materials, from PDF, video, and audio files to websites and software. Preservation of these materials is essential if the GC is to provide access to emerging scholarship and retain a record of students’ work towards their degrees. Prior to implementing a DAMS, however, the GC’s strategy for managing digital files of student work focused primarily on access, not preservation. Access copies of student work were available on CUNY Academic Works, a site that uses Bepress Digital Commons as a CMS. Missing from the workflow, however, was the creation, storage, and management of archival originals. As Stephen explained, if the Open Archival Information System (OAIS) model is a guide for a proper digital preservation workflow, the GC was missing its middle portion, the Archival Information Package (AIP). Some of the qualities the GC liked about Archivematica were that it is open source and highly customizable, comes with strong customer support from Artefactual, and has an API that can integrate with tools already in use at the library. GC Library staff hope that Archivematica can eventually integrate with both the library’s electronic submission system (Vireo) and CUNY Academic Works, making the submission, preservation, and access of digital dissertations a much more streamlined, automated, and OAIS-compliant process.
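To picture that “missing middle,” here is a toy sketch of one of the most basic things an AIP carries beyond access copies: fixity information that lets you later verify nothing has silently changed. The layout and names are invented for illustration; this is not Archivematica’s actual package format, which records far richer preservation metadata.

```python
# Toy illustration of one thing an Archival Information Package (AIP)
# bundles beyond access copies: a checksum manifest for later fixity
# checks. Invented layout -- not Archivematica's actual package format.
import hashlib
import json
from pathlib import Path

def make_toy_manifest(source_dir: str, manifest_path: str) -> None:
    """Record a SHA-256 checksum for every file under source_dir."""
    manifest = {
        str(path): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in Path(source_dir).rglob("*")
        if path.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

# Hypothetical usage:
# make_toy_manifest("dissertation_files", "manifest-sha256.json")
```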

A sample of one of DuraCloud’s data visualization graphs from the presentation slides.

Next, Ashley and Kelly introduced and demoed Archivematica and DuraCloud. I was very pleased to see several features of the Archivematica software that were made intentionally intuitive. The design of the interface is very clean and easily customizable to fit different workflows. Also, each AIP that is processed includes a plain-text, human-readable file that serves as extra documentation explaining what Archivematica did to each file. Artefactual recommends pairing Archivematica with DuraCloud, although users can choose to integrate the software with local storage or with other cloud services like those offered by Google or Amazon. One feature I found really interesting about DuraCloud is that it comes with various data visualization graphs showing the user how much storage is available and what materials are taking up the most space.
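The computation behind that kind of graph is easy to approximate. Here is a sketch that asks the same “what is taking up the most space?” question of a local directory; the directory name is hypothetical, and this does not call DuraCloud’s API:

```python
# Approximates the "what is using the most storage" view from
# DuraCloud's graphs: total bytes per file extension in a directory.
# The "storage" directory is a hypothetical example.
from collections import Counter
from pathlib import Path

usage = Counter()
for path in Path("storage").rglob("*"):
    if path.is_file():
        usage[path.suffix or "(no extension)"] += path.stat().st_size

for extension, size in usage.most_common():
    print(f"{extension:15} {size / 1_048_576:10.1f} MiB")
```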

I close by referencing something Ashley wrote in her recent bloggERS post (conveniently, she also contributed to this event). She makes an excellent point about how different skill-sets are needed to do digital preservation, from the developers who create the tools that automate digital archival processes to the archivists who advocate for and implement said tools at their institutions. I think this talk was successful precisely because it included both the practitioner and vendor perspectives, as well as the unique expertise that comes with each role. Both are needed if we are to meet the challenges and tap into the potential that digital archives present. I hope to see more of these “meetings of the minds” in the future.

(For more info: Stephen, Ashley, and Kelly have generously shared their slides!)


Regina Carra is the Archive Project Metadata and Cataloging Coordinator at Mark Morris Dance Group. She is a recent graduate of the Dual Degree MLS/MA program in Library Science and History at Queens College – CUNY.