Big Data and Big Challenges for Archives

The following is a post by Glen McAninch based on a breakout session at the ERS section meeting of last year’s SAA annual meeting.

What is “big data” and how does it relate to what archivists do? Many of us, particularly those outside the Federal Government, private technology companies, and research-based universities, doubt that we will ever have to deal with “big data,” but the topic addresses issues that those of us who manage electronic records collections are facing more and more. No doubt most archivists are beginning to acquire increasingly large collections of electronic records that challenge our abilities to process them, preserve them, and provide access to them.

Image courtesy of Stafano Bertolo.

So, what is big data? According to Gartner Research, it is “high-volume, high-velocity, high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Traditionally, computers excelled at manipulating structured data, but increasingly businesses need to integrate and make sense of data from a wide range of sources, in a variety of formats, at various levels of structure and cleanliness, and they need to do so in a timely fashion.[1]

Most archivists are acquiring collections that are rapidly increasing in volume and complexity and may fit this definition of big data. It is chiefly a matter of scale that separates most of us from the data tsunami faced by some institutions. Can archivists learn processes and acquire tools from those who are using big data sets for non-archival purposes?

The current use of big data is basically for analytics or research rather than to document specific activities. Data analytics tools allow researchers to manipulate and analyze data stored in multiple formats. Issues with big data of greatest concern for archivists are:

  • Appraisal involves selection of records, some of which may be useful for analytic research. When acquiring unstructured and structured records, it is important for archivists to carefully document the context of the original record so that researchers who use big data tools have the proper framework to do analysis.
  • Searching, one of the long suits of “big data” tools, can be leveraged to help archivists improve access to massive amounts of records in multiple formats.
  • Big data sets are often not managed the way archivists and records managers traditionally select and manage records for long-term retention. The emphasis of big data tools, such as Apache’s Hadoop, is on analysis of objects and not on the management of records, as is done in structured databases. This makes many big data tools unsuited to archival management and preservation needs.
  • How will you and your institution acquire big data? From whom? Will this data come directly from those who collect it, for instance, your university’s department of institutional research? If so, do you want to acquire the full set of raw data or are you only interested in the different outputs and analyses performed on that data? Or will you acquire the work of researchers who have obtained copies of electronic records, extracted selected content from those records, mashed that content up with data from other sources, and then performed analyses on that data? Is the goal of acquisition to allow future researchers to reinterpret and reanalyze the data, or is your goal to document the information that informed certain decisions at an institution?
  • Privacy, the fear that big brother is watching us, is a popular issue often associated with big data, and archivists need to address that fear through access restrictions and redaction.
  • Visualization tools are increasingly being used to appraise records and establish links between records, particularly for large e-mail projects. Additionally, users of high-volume data have made advances in crowdsourcing, face recognition, and other techniques that archivists are adapting.
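
As a rough illustration of the redaction mentioned in the privacy item above, the following Python sketch masks two kinds of sensitive strings in plain text before it is released to researchers. It is only a minimal sketch: the patterns, the labels, and the function name redact are assumptions made for this example, not a tool referenced in this post, and a real collection would need patterns chosen for the data it actually holds plus human review.

```python
import re

# Hypothetical patterns for illustration only. Real collections need
# patterns chosen for the sensitive data they actually contain.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text):
    """Return (redacted_text, counts), where counts maps each label
    to the number of matches that were replaced."""
    counts = {}
    for label, pattern in PATTERNS.items():
        # subn returns the new string and the number of substitutions.
        text, n = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = n
    return text, counts


if __name__ == "__main__":
    sample = "Contact jane.doe@example.edu, SSN 123-45-6789."
    clean, counts = redact(sample)
    print(clean)
    print(counts)
```

Returning the match counts alongside the text gives the archivist a simple log of what was withheld from each record, which can be kept with the access copy.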

Projects to watch:

  • Brown Dog is a collaborative big data management project based on the integration of heterogeneous datasets and multi-source historical and digital collections.
  • Tools like the CI-BER treemap GIS interface to NARA records, the visual analysis tools being developed by Maria Esteva at the Texas Advanced Computing Center, and Kenton McHenry’s 1940 Census big data analysis, indexing, and visualization at the National Center for Supercomputing Applications (now part of Brown Dog) provide good examples of adapting big data techniques to the mission and spirit of the archival profession.

[1] http://radar.oreilly.com/2012/01/what-is-big-data.html. Accessed 12/17/2014.

 


3 thoughts on “Big Data and Big Challenges for Archives”

  1. markconrad2014 March 5, 2015 / 10:06 pm

    Great post, Glen.

    In terms of the privacy issue, natural language processing, content summarization, and information extraction tools like those developed by Dr. William Underwood and his colleagues at GTRI can also be useful. (http://perpos.gtri.gatech.edu/)

    Mark


    • Glen McAninch March 6, 2015 / 5:26 pm

      Thank you Mark for your advice on this blog post as well as the additional reference.


  2. Alison White March 9, 2015 / 1:56 pm

    Excellent points, particularly for archivists who might not currently envision a future in which they have to deal with big data issues. Ten years ago how many of us fully realized the challenges we would face in archiving email or even imagined the social media landscape? Another major issue with archiving big data is access. Glen made a great point about considering the purpose of acquiring big data; do we need to do more than acquire it as documentation of decisions? Providing access to archived big data can dramatically change the traditional way we view archives and the role of the archivist.

