The following is a post by Glen McAninch based on a breakout session at the ERS section meeting of last year’s SAA annual meeting.
What is “big data” and how does it relate to what archivists do? Many of us, particularly those outside the Federal Government, private technology companies, and research-based universities, may doubt that we will ever have to deal with “big data,” but the topic addresses issues that those of us who manage electronic records collections are facing more and more. No doubt most archivists are beginning to acquire increasingly large collections of electronic records that challenge our abilities to process them, preserve them, and provide access to them.
So, what is big data? According to Gartner Research, it is “high-volume, high-velocity, high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Traditionally, computers excelled at manipulating structured data, but increasingly businesses need to integrate and make sense of data from a wide range of sources, in a variety of formats, at various levels of structure and cleanliness, and they need to do so in a timely fashion.
Most archivists are acquiring collections that are rapidly increasing in volume and complexity and may fit this definition of big data. It is chiefly a matter of scale that separates most of us from the data tsunami faced by some institutions. Can archivists learn processes and acquire tools from those who are using big data sets for non-archival purposes?
The current use of big data is basically for analytics or research rather than to document specific activities. Data analytics tools allow researchers to manipulate and analyze data stored in multiple formats. Issues with big data of greatest concern for archivists are:
- Appraisal involves selection of records, some of which may be useful for analytic research. When acquiring unstructured and structured records, it is important for archivists to carefully document the context of the original record so that researchers who use big data tools have the proper framework to do analysis.
- Searching, one of the long suits of “big data” tools, can be leveraged to help archivists improve access to massive amounts of records in multiple formats.
- Big data sets are often not managed in the traditional way that archivists and records managers select records for long-term retention. The emphasis of big data tools, such as Apache’s Hadoop, is on the analysis of objects and not on the management of records, as is done in structured databases. This makes many big data tools unsuited for archival management and preservation needs.
- How will you and your institution acquire big data? From whom? Will this data come directly from those who collect it, for instance, your university’s department of institutional research? If so, do you want to acquire the full set of raw data or are you only interested in the different outputs and analyses performed on that data? Or will you acquire the work of researchers who had obtained copies of electronic records, extracted selected content from those records, mashed that content up with data from other sources, and then performed analyses on that data? Is the goal of acquisition to allow future researchers to reinterpret and reanalyze the data or is your goal to document the information that informed certain decisions at an institution?
- Privacy, the fear that Big Brother is watching us, is a popular issue that is often associated with big data, and archivists need to address that fear through access restrictions and redaction. Visualization tools are increasingly being used to appraise records and establish links between records, particularly for large e-mail projects. Additionally, users of high-volume data have made advances in using crowdsourcing, face recognition, and other techniques that archivists are adapting.
Projects to watch:
- Brown Dog is a collaborative big data management project based on the integration of heterogeneous datasets and multi-source historical and digital collections.
- Tools like the CI-BER treemap GIS interface to NARA records, the visual analysis tools being developed by Maria Esteva at the Texas Advanced Computing Center, and Kenton McHenry’s 1940 Census big data analysis, indexing, and visualization at the National Center for Supercomputing Applications (now part of Brown Dog) provide good examples of adapting big data techniques to the mission and spirit of the archival profession.
 http://radar.oreilly.com/2012/01/what-is-big-data.html. Accessed 12/17/2014.