Supervised versus Unsupervised Machine Learning for Text-based Datasets

By Aarthi Shankar

This is the fifth post in the bloggERS series on Archiving Digital Communication.


I am a graduate student working as a Research Assistant on an NHPRC-funded project, Processing Capstone Email Using Predictive Coding, that is investigating ways to provide appropriate access to email records of state agencies from the State of Illinois. We have been exploring various text mining technologies with a focus on those used in the e-discovery industry to see how well these tools may improve the efficiency and effectiveness of identifying sensitive content.

During our preliminary investigations we have begun to ask one another whether tools that use Supervised Machine Learning or Unsupervised Machine Learning would be best suited for our objectives. In my undergraduate studies I conducted a project on Digital Image Processing for Accident Prevention, involving building a system that uses real-time camera feeds to detect human-vehicle interactions and sound alarms if a collision is imminent. I used a Supervised Machine Learning algorithm – Support Vector Machine (SVM) to train and identify the car and human on individual data frames. With this project, Supervised Machine Learning worked well when applied to identifying objects embedded in images. But I do not believe it will be as beneficial for our project which is working with text only data. Here is my argument for my position.

In Supervised Machine Learning, a pre-classified input set (training dataset) has to be given to the tool in order to be trained. Training is based on the input set and the algorithms used to process the input set gives the required output. In my project, I created a training dataset which contained pictures of specific attributes of cars (windshields, wheels) and specific attributes of humans (faces, hands, and legs). I needed to create a training set of 900-1,000 images to achieve a ~92% level of accuracy. Supervised Machine Learning works well for this kind of image detection because unsupervised learning algorithms would be challenged to accurately make distinctions between windshield glass and other glass (e.g. building windows) present in many places on a whole data frame.

For Supervised Machine Learning to work well, the expected output of an algorithm should be known and the data that is used to train the algorithm should be properly labeled. This takes a great deal of effort. A huge volume of words along with their synonyms would be needed as a training set. But this implies we know what we are looking for in the data. I believe for the purposes of our project, the expected output is not so clearly known (all “sensitive” content) and therefore a reliable training set and algorithm would be difficult to create.

In Unsupervised Machine Learning, the algorithms allow the machine to learn to identify complex processes and patterns without human intervention. Text can be identified as relevant based on similarity and grouped together based on likely relevance. Unsupervised Machine Learning tools can still allow humans to add their own input text or data for the algorithm to understand and train itself. I believe this approach is better than Supervised Machine Learning for our purposes. Through the use of clustering mechanisms in Unsupervised Machine Learning the input data can first be divided into clusters and then the test data identified using those clusters.

In summary, a Supervised Machine Learning tool learns to ascribe the labels that are input from the training data but the effort to create a reliable training dataset is significant and not easy to create from our textual data. I feel that Unsupervised Machine Learning tools can provide better results (faster, more reliable) for our project particularly with regard to identifying sensitive content. Of course, we are still investigating various tools, so time will tell.


Aarthi Shankar is a graduate student in the Information Management Program specializing in Data Science and Analytics. She is working as a Research Assistant for the Records and Information Management Services at the University of Illinois.