Supervised versus Unsupervised Machine Learning for Text-based Datasets

By Aarthi Shankar

This is the fifth post in the bloggERS series on Archiving Digital Communication.

I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.

During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited for our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, involving building a system that uses real-time camera feeds to detect human-vehicle interactions and sound alarms if a collision is imminent. I used a Supervised Machine Learning algorithm – Support Vector Machine (SVM) to train and identify the car and human on individual data frames. With this project, Supervised Machine Learning worked well when applied to identifying objects embedded in images. But I do not believe it will be as beneficial for our project which is working with text only data. Here is my argument for my position.

In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to be trained. Training is based on the input set and the algorithms used to process the input set gives the required output. In my project, I created a training dataset which contained pictures of specific attributes of cars (windshields, wheels) and specific attributes of humans (faces, hands, and legs). I needed to create a training set of 900-1,000 images to achieve a ~92% level of accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would be challenged to accurately make distinctions between windshield glass and other glass (e.g. building windows) present in many places on a whole data frame.

For Supervised Machine Learning to work well, the expected output of an algorithm should be known and the data that is used to train the algorithm should be properly labeled. This takes a great deal of effort. A huge volume of words along with their synonyms would be needed as a training set. But this implies we know what we are looking for in the data. I believe for the purposes of our project, the expected output is not so clearly known (all “sensitive” content) and therefore a reliable training set and algorithm would be difficult to create.

In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together based on likely relevance. Unsupervised Machine Learning tools can still allow humans to add their own input text or data for the algorithm to understand and train itself. I believe this approach is better than Supervised Machine Learning for our purposes. Through the use of clustering mechanisms in Unsupervised Machine Learning the input data can first be divided into clusters and then the test data identified using those clusters.

In summary, a Supervised Machine Learning tool learns to ascribe the labels that are input from the training data but the effort to create a reliable training dataset is significant and not easy to create from our textual data. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.

Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.


Announcing the First-Ever #bdaccess Twitter Chats: 10/27 @ 2 and 9pm EST

By Jess Farrell and Sarah Dorpinghaus

This post is the fifteenth in a bloggERS series about access to born-digital materials.


Contemplating how to provide access to born-digital materials? Wondering how to meet researcher needs for accessing and analyzing files? We are too! Join us for a Twitter chat on providing access to born digital records.

*When?* Thursday, October 27 at 2:00pm and 9:00pm EST
*How?* Follow #bdaccess for the discussion
*Who?* Researchers, information professionals, and anyone else interested in using born-digital records

Newly-conceived #bdaccess chats are organized by an ad-hoc group that formed at the 2015 SAA annual meeting. We are currently developing a bootcamp to share ideas and tools for providing access to born-digital materials and have teamed up with the Digital Library Federation to spread the word about the project.

Understanding how researchers want to access and use digital archives is key to our curriculum’s success, so we’re taking it to the Twitter streets to gather feedback from digital researchers. The following five questions will guide the discussion:

Q1. _What research topic(s) of yours and/or content types have required the use of born digital materials?_

Q2. _What challenges have you faced in accessing and/or using born digital content? Any suggested improvements?_

Q3. _What discovery methods do you think are most suitable for research with born digital material?_

Q4. _What information or tools do/could help provide the context needed to evaluate and use born digital material?_

Q5. _What information about collecting/providing access would you like to see accompanying born digital archives?_

Can’t join on the 27th? Follow #bdaccess for ongoing discussion and future chats!


Jess Farrell is the curator of digital collections at Harvard Law School. Along with managing and preserving digital history, she’s currently fixated on inclusive collecting, labor issues in libraries, and decolonizing description.

Sarah Dorpinghaus is the Director of Digital Services at the University of Kentucky Libraries Special Collections Research Center. Although her research interests lie in the realm of born-digital archives, she has a budding pencil collection.