Computer Generated Archival Description


In our archive at Carleton College we have implemented a number of automated and semi-automated tools to assist with processing our digital records. We use several batch processes to create access copies, generate checksums, validate file formats, and extract tagged metadata, and we are working on a data accessioner that can automate many of the repetitive steps we perform on our Submission Information Packages (SIPs). While these improvements have been tremendously helpful in processing collections quickly, one area is consistently backed up in our workflow: the creation of descriptive metadata. Minimal descriptive metadata has improved our processing time for electronic records, but I can already see that this will not be enough in the near future.
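To give a sense of how small some of these batch steps are, here is a minimal sketch of checksum generation in Python. It is an illustration only, not our actual tooling: the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(sip_dir: Path) -> dict[str, str]:
    """Map each file path (relative to the SIP root) to its checksum."""
    return {
        path.relative_to(sip_dir).as_posix(): sha256_of_file(path)
        for path in sorted(sip_dir.rglob("*"))
        if path.is_file()
    }
```

Calling checksum_manifest(Path("some_sip_folder")) on a hypothetical SIP folder returns a dictionary that could be written out as a manifest and compared against a later run to detect changed files.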

In light of the accelerating growth rate of digital accessions in our repositories, how sustainable will human-created descriptive metadata be in the next few years? Perhaps we should be turning to automated, computer-based methods for creating descriptions, just as we have for other processing steps. We are already relying on optical character recognition (OCR) to improve access to scanned print documents, but there are other methods that hold great promise. Voice-to-text software, while not fully baked yet, is being used by some digitization vendors to create transcriptions of video and audio files. Facial recognition could be a powerful tool for photograph identification – I could see these same methods being applied to the recognition of buildings as well. Geospatial data based on known reference points, such as an address, can make images of locations more searchable and usable in dynamically generated maps. Analysis of text could even be used to generate subject categories.
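As a very rough illustration of the last idea, the sketch below proposes candidate subject terms by counting the most frequent non-trivial words in a text. This is a deliberately crude stand-in for real text analysis: the stopword list, the minimum word length and the function name are all assumptions made for the example, and the output would still need to be mapped to a controlled vocabulary by other means.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {
    "about", "also", "been", "from", "have", "into", "that", "their",
    "there", "these", "this", "were", "which", "will", "with",
}


def candidate_terms(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent words as candidate subject terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) > 3 and word not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(top_n)]
```

Run over OCR output or a transcript, something like this would yield a noisy list of terms, which is exactly the trade-off discussed next: far more description, with more errors in it.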

These methods would of course change how we work as professionals and how users access our records.  Our descriptive metadata would be much more extensive, but would probably contain many more errors than we are currently willing to accept.  To use this data, researchers might turn away from the traditional finding aid, detailed biographical descriptions and human-assigned subject headings in favor of term searching, ranked results and faceted displays.  These new tools and changes may be unsettling, but in light of our mounting backlogs of electronic records, we may have no choice but to embrace them.

Do you have any experience with this kind of cataloging?  Does the idea of trusting a machine to do this work cause you to feel dizziness or shortness of breath?  Please let us know in the comments below.

Nat Wilson is a Digital Archivist at Carleton College.

4 thoughts on “Computer Generated Archival Description”

  1. markconrad2014 February 26, 2015 / 2:36 pm

    Great post! The National Archives and Records Administration’s (NARA) Applied Research Branch has worked with a number of our Research Partners over the years to conduct experiments in several of the areas that you identify in your blog.

    Our Research Partner, Dr. Richard Marciano, conducted an experiment that involved capturing and processing collections from the Department of State E-FOIA Reading Room. This experiment involved ~61,000 documents in 30 collections. Part of the experiment involved automatically generating item-level descriptions for each document in the 30 collections. The descriptions were created by mapping the minimum set of data elements for an item-level description from NARA’s Lifecycle Data Requirements Guidelines (LCDRG) to metadata that could be extracted from each document. It took approximately three days to develop the regular expressions and Perl scripts that were used to conduct the experiment. Once these were developed, it took about 9 minutes to generate the ~61,000 item-level descriptions. Were the descriptions up to full NARA standards? No. They didn’t use the controlled vocabularies or some of the other quality standards found in the LCDRG. Did they provide a substantial number of new low-level access points to a collection? Yes.
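    To illustrate the general technique (this is not the experiment’s code, which was written in Perl and is not reproduced here), a minimal Python sketch of mapping description elements to regular expressions might look like the following. The element names, the header layout of the sample document and the patterns themselves are hypothetical and are not taken from the LCDRG.

```python
import re
from typing import Optional

# Hypothetical mapping from description elements to patterns. Neither the
# element names nor the header layout come from the LCDRG or the experiment.
PATTERNS = {
    "title": re.compile(r"^SUBJECT:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^DATE:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
    "creator": re.compile(r"^FROM:\s*(.+)$", re.MULTILINE),
}


def describe_item(text: str) -> dict[str, Optional[str]]:
    """Build a minimal item-level description from one document's text."""
    description = {}
    for element, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Record None when an element cannot be found, rather than guessing.
        description[element] = match.group(1).strip() if match else None
    return description
```

    Once patterns like these exist, running them over tens of thousands of documents is a loop, which fits with the description step taking minutes while the pattern development took days.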

    Over the years our Research Partners at Georgia Tech Research Institute (GTRI) have conducted a number of experiments to: automatically recognize personal names, job titles, organization names, locations, addresses, and dates; automatically recognize record types such as letters, memos, itineraries, and resumes; and automatically generate preliminary folder titles and scope and content notes for file units and records series. See:

    Imagine that you want to find electronic records related to a particular geographic location in a very large collection (40 TB and about 70 million files) of archival electronic records. Wouldn’t it be cool if you could pick up an iPad, have a map pop up on the screen, run your finger over the area on the map you were interested in, and have a list of relevant record collections show up on the screen next to the map? Wouldn’t it be really cool if you could then drill down through that list and see metadata about records in each collection?

    Our Research Partners at the University of North Carolina – Chapel Hill have demonstrated prototype tools that can carry out just such a search, and more. The development of these tools was part of the NARA and National Science Foundation (NSF)-supported Cyberinfrastructure for Billions of Electronic Records (CI-BER) Project. You can read a blogpost I wrote about it here:

    See also:


  2. Nat Wilson March 4, 2015 / 9:57 pm

    Wow, that’s some great work, Mark. Have you had any feedback on your work from researchers using these collections?
    A friend recently sent me this really interesting set of Python programs designed to interpret and categorize natural language.


  3. Glen McAninch March 6, 2015 / 11:10 pm

    Very thoughtful blog and comment on useful tools we will need to handle the records tsunami we face.


  4. Brecht Declercq March 11, 2015 / 7:58 pm

    Very interesting and relevant posting. On the technical side, numerous projects and studies have already been done and the work goes on. For audiovisual archives these kinds of technologies are particularly relevant. A recent survey by FIAT/IFTA, presented at the 2014 Amsterdam World Conference (see and the clip on YouTube:), showed, however, that very few of FIAT/IFTA’s members actually use automatic generation of descriptive metadata (AGM, which I use as an umbrella term for a broad range of technologies that allow the content of sound, moving image or pictures to be automatically described in words, so without human intervention other than applying a specific algorithm to the media essence). As far as I know, only two public broadcasting archives in the world use AGM in daily practice: RSI (Italian-speaking Switzerland) and RAI (Italy). (If someone knows of another one, please let me know.) However, the survey also showed that many archives are at a stage somewhat close to implementation. Virtually all major audiovisual archives are currently carrying out at least some R&D project on this subject.

    The question of the extent to which these technologies will affect the job of the (audiovisual) archivist is equally important. In fact this theme has been tackled numerous times by the FIAT/IFTA Media Management Commission at its biannual ‘Changing Sceneries, Changing Roles’ seminars, of which the next edition was officially announced today. It will take place in Glasgow, Scotland, May 21-22 2015 and more info can be found here:

    At the previous edition of this Seminar in Hilversum in 2013, I presented a model that I’ve drawn up to help audiovisual archives develop a future-proof annotation strategy. It starts from the premise that there are 4 main ways to create descriptive metadata for time-based media (audio or video): manual annotation, user-generated annotation, automatically generated annotation and metadata coming from production. Each of these has strengths and weaknesses, and future archivists should know these to be able to decide which one to use.

    On the other hand, there are the goals: the purposes for which these metadata are created. It will be the archivist’s role to pair the right creation method with the right goal, or even combine methods to get better results. More information about this so-called Hourglass Model can be found in the proceedings of the 2013 Hilversum Seminar (, from p. 157), or those who understand Spanish can find a full clip presenting the Hourglass Model at the Sexto Seminario de Archivos Sonoros y Audiovisuales in Mexico City in 2014 here: (from 03.00.00 onwards).

    In my opinion it will be good practice for every (audiovisual) archive to think about its annotation strategy, and I’d like to encourage everyone to use the Hourglass Model for that. The Dutch Institute for Sound and Vision (the national AV archive of the Netherlands) did so, and declared in October 2014 that it was a corporate goal to eliminate as much manual annotation of every kind as possible by January 2015. This proved to be easier said than done, but nevertheless archivists should be prepared for a changing role, in which they’ll have a full box of well-suited tools to manage the ever-growing heap of stuff coming in, instead of just a slow screwdriver called manual annotation.

