An Unsupervised Training Approach for Clusterization of the Incoming Requests into Support Service

Nov 25, 2020

Preface

Most modern businesses need a support service. From small local stores to big corporations that provide thousands of different products and services, all of them have to deal with customer support.

Many of us have called a support hotline only to hear an autoresponder say that all operators are currently busy and that we have to wait X minutes. Such delays slow down business and hurt usability. That is why the current trend is to automate the support service and give customers correct answers to their questions quickly and automatically, without waiting for an operator. This is where chatbots can help greatly.

Most modern chatbots require defined intents. An intent is the intention of a user interacting with the chatbot, for example: ‘I would like to go to Los Angeles’

Here the intent can be categorized as ‘travel’. Of course, this work can be done by chatbot operators, who can define correct intents for all kinds of customer questions. But this requires thorough knowledge of all possible incoming requests. For a really big business, defining correct intents can take too much time and money, and the result is of limited use if only a few intents end up being defined. To see how complicated this task can be, look at the next figure: the “contact us” form on the site of Payoneer, an international payment system.

Fig. 1 — Defining topic for contact us form — Payoneer company

As you can see, a user must select a topic for their message so that it is forwarded to the correct person. The topic dropdown has 8 items on the first level and an average of 3 topics on each of the 2nd and 3rd levels, so in total all possible incoming messages are sorted into roughly 8 × 3 × 3 = 72 groups.

This article describes a new approach that retrieves intents automatically by applying unsupervised training techniques to an existing database of real customer requests. For the experiments, I used open data from the Verizon support account at https://twitter.com/VerizonSupport

General workflow

The general schema of how the proposed system works is shown in fig. 2.

As input, the system has an archive of incoming messages from users to the support service, together with the responses to them.

Convert phrases to vectors

Clustering algorithms cannot work on raw text directly, which is why I used spaCy, an industrial-strength natural language processing (NLP) library.

spaCy allows us to represent any phrase as a vector with N dimensions. Support messages also contain noise such as stopwords, i.e. the most common words of the language. We remove these stopwords from the training dataset to keep only the words that carry the meaning of the phrase.
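To make the idea concrete, here is a minimal sketch of turning a phrase into a vector after stopword removal. It does not call spaCy: the stopword list and the tiny 3-dimensional word vectors are hypothetical stand-ins for what an NLP model would provide, and averaging word vectors is an assumption about how the phrase vector is built (spaCy's default document vector is, as far as I know, also an average of token vectors).

```python
import re

# Hypothetical stopword list and toy word vectors, for illustration only.
# A real NLP model ships its own stopwords and much larger vectors.
STOPWORDS = {"a", "i", "in", "is", "my", "the", "to"}
TOY_VECTORS = {
    "internet": [0.9, 0.1, 0.0],
    "outage": [0.8, 0.2, 0.1],
    "down": [0.7, 0.3, 0.0],
    "bill": [0.1, 0.9, 0.2],
    "payment": [0.0, 0.8, 0.3],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop the most common words, which carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def phrase_vector(text, dim=3):
    """Average the vectors of the known non-stopword tokens.

    Returns a zero vector when no token has a known vector.
    """
    tokens = remove_stopwords(tokenize(text))
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


print(remove_stopwords(tokenize("My internet is down")))
print(phrase_vector("My internet is down"))
```

Messages about the same topic end up with nearby vectors, which is what the clustering step relies on.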

Clusterization

Once the incoming customer requests have been converted into vectors, we can process them and try to group them into intent candidates. For grouping, I used clustering based on the k-means method.

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The result of clusterization is a cluster model that can be used to predict the cluster ID for any message vector: applying a 1-nearest-neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters.
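The two parts of this description, fitting the centers and then predicting a cluster ID with the nearest center, can be sketched in plain Python. This is an illustrative implementation of Lloyd's algorithm under simple assumptions (random initial centers taken from the data, a fixed seed, 2-dimensional sample points invented for the example), not the code used for the experiments; in practice a library implementation would normally be used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(vector, centers):
    """Return the index of the closest center, i.e. the predicted cluster ID."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(vector, centers[i]))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest_center(v, centers)].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [mean(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return centers


# Invented sample data: two well-separated groups of points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
model = kmeans(points, k=2)
print([nearest_center(p, model) for p in points])
print(nearest_center([9.0, 9.0], model))  # cluster ID for a new vector
```

The list of centers plays the role of the cluster model: predicting the cluster for a new message vector is just a nearest-center lookup.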

In the picture below you can see some messages from cluster #4, which is related to the keywords “Internet Service Outage”.

Fig. 3 — Example messages from cluster #4 Internet service outage

As you can see, all messages share the same meaning: a service outage in some area. This indicates that the clusterization worked as intended and provides useful information.

Visualization

To verify the results of clusterization, it helps to visualize the clusters on a chart and see where each cluster lies and how it differs from the others. As described above, each message after NLP analysis is represented as a vector in N dimensions. To display these vectors on a 2D or 3D chart, we must reduce their dimensionality. For this, we can use principal component analysis.

Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. As a result, we convert our message vectors from N-dimensional space into 3-dimensional space.
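As a rough illustration of what this reduction does, the sketch below projects vectors onto their first three principal components using only the standard library. It finds the components by power iteration with deflation on the covariance matrix, which is a technique chosen here for brevity and not necessarily what the experiments used; a library PCA routine would be the usual choice. The 4-dimensional sample rows are invented.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def center(rows):
    """Subtract the per-column mean from every row."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: multiply and renormalize to find the dominant eigenvector."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    return v


def pca(rows, n_components=3):
    """Project rows onto their first n_components principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v = top_eigenvector(cov)
        eigenvalue = dot(v, mat_vec(cov, v))
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[dot(row, comp) for comp in components] for row in centered]


# Invented 4-dimensional vectors reduced to 3 dimensions.
sample = [[4.0, 0.0, 0.0, 0.0], [-4.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0], [0.0, -2.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, -1.0, 0.0],
          [0.0, 0.0, 0.0, 0.1], [0.0, 0.0, 0.0, -0.1]]
for point in pca(sample):
    print([round(x, 3) for x in point])
```

Each output row is the 3-dimensional position of one input vector, ready to be plotted.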

We also determine the cluster ID for each incoming message by using the generated cluster model. For each cluster, we build a list of its most common keywords.
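Collecting the most common keywords per cluster can be sketched as a simple word count. The messages, cluster IDs, and stopword list below are invented for illustration and are not taken from the Verizon data; the function name `top_keywords` is likewise hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "about", "again", "i", "in", "is", "my", "the", "to"}


def top_keywords(messages, cluster_ids, n=3):
    """Return the n most frequent non-stopword tokens for each cluster ID."""
    counts = defaultdict(Counter)
    for text, cluster_id in zip(messages, cluster_ids):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[cluster_id].update(t for t in tokens if t not in STOPWORDS)
    return {cluster_id: [word for word, _ in counter.most_common(n)]
            for cluster_id, counter in counts.items()}


# Invented messages with the cluster IDs a cluster model might predict.
messages = [
    "Internet outage in my area",
    "Internet is down again",
    "Question about my bill",
    "Bill payment failed",
]
cluster_ids = [0, 0, 1, 1]
print(top_keywords(messages, cluster_ids, n=2))
```

The keyword lists give each cluster a human-readable label, such as the “Internet Service Outage” label mentioned for cluster #4.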

With the vectors in 3-dimensional space and a cluster ID predicted for each vector, we can build the 3D chart shown in fig. 5.

Fig. 5 — Visualization for received messages into clusters

As you can see from this chart, incoming messages are well grouped into clusters by meaning, because the points of each cluster are separated from the points of the other clusters.

Future Todos

As you can see from fig. 2, not all steps in the proposed schema have been developed and tested yet, and some steps can be improved. Below are some planned todos:

Try other clusterization methods; some of them may improve the quality of the clusters.
Run clusterization on the responses from the customer support service; the resulting clusters can serve as responses for the intents.
Use the spaCy NLP engine and neural networks to detect base entities for intents: location, price, duration, etc.

Conclusions

As you can see, the proposed unsupervised training approach returned good results for the sample requests. It allowed us to convert thousands of text messages, representing incoming requests to the support service, into tens of clusters that group these messages by general topic. As next steps, we can experiment with the number of clusters, group typical responses into clusters, and detect the entities that must be requested from customers in order to answer their requests correctly.