Of Python and Pandas: Using Programming to Improve Discovery and Access

by Kate Topham

This is the ninth post in the bloggERS Script It! Series.

Over my spring break this year, I joined a team of librarians at George Washington University who wanted to use their MARC records as data to learn more about their rare book collection. We decided to explore trends in the collection over time, using a combination of Python and OpenRefine. Our research question: in what subjects was the collection strongest? From which decades do we have the most books in a given subject?

This required some programming chops–so the second half of this post is targeted towards those who have a working knowledge of python, as well as the pandas library.

If you don’t know python, have no fear! The next section describes how to use OpenRefine, and is 100% snake-free.

Cleaning your data

A big issue with analyzing cataloging data is that humans are often inconsistent in their description. Two records may be labeled “African-American History” and “History, African American,” respectively. When asked to compare these two strings, python will consider them different. Therefore, when you ask python to count all the books with the subject “African American History,” it counts only those whose subjects match that string exactly.

Enter OpenRefine, an open source application that allows you to clean and transform data in ways Excel doesn’t. Below you’ll see a table generated from pymarc, a python module designed to handle data encoded in MARC21. It contains the bibliographic id of each book and its subject data, pulled from field 650.

Picture1

Facets allow you to view subsets of your data. A “text facet” means that OpenRefine will treat the values as text rather than numbers or dates, for example. If we create a text facet for column a…Picture1.png

it will display all the different values in that column in a box on the left.Picture1.png

If we choose one of these facets, we can view all the rows that have that value in the “a” column.Picture1.png

Wow, we have five fiction books about Dogs! However, if you look on the side, there are six records that have “Dogs.” instead of “Dogs”, and were therefore excluded. How can we look at all of the dog books? This is where we need clustering.

Clustering allows you to group similar strings and merge the whole group into one value. It does this by sorting each string alphabetically and match them that way. Since “History, African American” and “African American History” both evaluate to “aaaaccefhiiimnnorrrsty,” OpenRefine will group them together and allow you to change all of the matching fields to either (or something totally different!) according to your preference.Picture1.png

This way, when you analyze the data, you can ask “How many books on African-American History do we have?” and trust that your answer will be correct. After clustering to my heart’s content, I exported this table into a pandas dataframe for analysis.

Analyzing your data

In order to study the subjects over time, we needed to restructure the table so that we could merge it with the date data.

First, I pivoted the table from short form to long so that we could count  separate pairs of subject tags. The pandas ‘melt’ method set the index by bibliographic id and subject so that books with multiple subjects would be counted in both categories.Picture1.png

Then I merged the dates from the original table of MARC records, bib_data, onto our newly melted table.Picture1.png

I grouped the data by subject using .group(). The .agg() function communicates how we want to count within the subject groups.Picture1.png

Because of the vast number of subjects we chose to focus on the ten most numerous subjects. I used the same grouping and aggregating process on the original cleaned subject data: grouped by 650a, counted by bib_id, and sorted by bib_id count.Picture1.png

Once I had our top ten list, I could select the count by decade for each subject.Picture1.png

Visualizing your data

In order to visualize this data, I used a python library called plotly. Plotly generates graphs from your data. It plays very well with pandas dataframes. Plotly provides many examples of code that you can copy, replacing the example data with your own. I placed the plotly code in for loop create a line on the graph for each subject.  Picture1.pngPicture1.png

Some of the interesting patterns we noticed was the spike in African-American books soon after 1865, the end of the Civil War; and at the end of the 20th century, following the Civil Rights movement.Knowing where the peaks and gaps are in our collections helps us better assist patrons in the use of our collection, and better market it to researchers.

Acknowledgments

I’d like to thank Dolsy Smith, Leah Richardson, and Jenn King for including me in this collaborative project and sharing their expertise with me.

Advertisements

One thought on “Of Python and Pandas: Using Programming to Improve Discovery and Access

  1. ostephens (@ostephens) October 10, 2018 / 9:51 am

    I hope this is helpful : the way clustering works in OpenRefine can be chosen by the user. The different approaches are described on the wiki at https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth. The default method for the example given above would result in “african american history” as the common cluster value.

    There is more information on the different types of clustering at https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth – the different methods work well for different situations and different data types – for example ‘Nearest Neighbour’ methods work better for spelling mistakes where ‘Key Collision’ methods work well in situations as described in this post where you might expect variations in word order or upper/lower case (e.g. names, subject headings)

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s