How to Automatically Segment Your Data with Clustering
One of the most common analyses we perform is to look for patterns in data. What market segments can we divide our customers into? How do we find clusters of individuals in a network of users?
It’s possible to answer these questions with Machine Learning. Even when you don’t know which specific segments to look for, or have unstructured data, you can use a variety of techniques to algorithmically find emergent patterns in your data and properly segment or classify outcomes.
In this post, we’ll walk through one such algorithm called K-Means Clustering, how to measure its efficacy, and how to choose the sets of segments you generate.
Supervised vs. Unsupervised Learning
When it comes to classifying data, there are two broad types of Machine Learning available.
With Supervised Learning, you can predict classifications of outcomes when you already know which inputs map to which discrete segments. But in many situations, you won’t actually have such labels predefined for you – you’ll only be given a set of unstructured data without any defined segments. In these cases you’ll need to use Unsupervised Learning to infer the segments from unlabeled data.
For clarity, let’s take the example of classifying t-shirt sizes.
If we’re given a dataset like the one in Figure 1A above, we’d have a set of inputs width (X1) and length (X2), as well as their corresponding t-shirt sizes of, say, small (blue) and large (green). In such a scenario we can use Supervised Learning techniques like Logistic Regression to draw a clear decision boundary and separate the respective classes of t-shirts.
But if we are given a dataset like Figure 1B, we’ll have a set of inputs width (X1) and length (X2), but no corresponding label for t-shirt size. In this case, we’ll need to use Unsupervised Learning techniques like K-Means Clustering to find similar sets of t-shirts and cluster them together into the respective classes of small (blue circle) and large (green circle).
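As a minimal sketch of this unlabeled case, scikit-learn can cluster t-shirt measurements without ever seeing size labels. The measurements here are synthetic, generated purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic, unlabeled t-shirt measurements: width (X1) and length (X2).
small = rng.normal(loc=[45.0, 65.0], scale=2.0, size=(50, 2))
large = rng.normal(loc=[60.0, 80.0], scale=2.0, size=(50, 2))
X = np.vstack([small, large])

# Ask for two clusters; note that we never provide size labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centroid lands near each size group
print(kmeans.labels_[:5])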
In many real-world applications you’ll face cases like that in Figure 1B, so it’s helpful to walk through how to actually find structure in unstructured data.
To find structure in unstructured data, K-Means Clustering provides a straightforward application of Unsupervised Machine Learning.
K-Means Clustering works as its name would imply – assigning similar observations in your data to a set of clusters. It operates in 4 simple and repeatable steps, wherein you iteratively evaluate a set of clusters that provide the closest mean (average) distance to each of your observations. It follows that if a set of observations are close in proximity to one another, it’s likely they are part of the same cluster.
Let’s walk through the algorithm step-by-step. The first step is to randomly initialize a set of “centroids” (the Xs in Figure 2A above), or centers to your clusters. You can set these centroids anywhere to start, but it’s recommended to initialize them at a random set of points matching your observations. You will in turn use these centroids to group your observations, assigning a centroid to each observation by those closest in distance (the blue and green circles in Figure 2B).
This will initialize a set of clusters, grouping the observations in your data by their closest centroid. But it’s unlikely that these initial clusters fit perfectly on their first assignment. So as a next step, you’ll move your centroids to a position of closer fit. This is done by finding the mean (average) of the observations in each current cluster, and moving each centroid to that position (Figure 2C). Then you’ll reassign all your observations to new clusters based on the closest distance to the new centroids (Figure 2D).
You can repeat this process – assign clusters, compute means, move centroids – until you reach convergence. Once every observation has found its closest centroid, the centroids will no longer move when you re-evaluate the mean distance. Observations grouped together will be clustered such that they share similarity in their inputs (as indicated by their proximity to the same centroid), and you’ll have found an appropriate fit of clusters for your data.
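The steps above can be sketched from scratch in NumPy. This is a minimal illustration, not the post's own code – the variable names, the convergence check, and the handling of empty clusters (by simply keeping the old centroid) are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: assign each observation to its nearest centroid,
    move each centroid to the mean of its cluster, repeat to convergence."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids at k randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 2: assign each observation to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned observations.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Initializing the centroids at randomly chosen observations (rather than arbitrary points) matches the recommendation above and avoids starting a centroid far from all of the data.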
How Many Clusters Do You Use?
K-Means Clustering is an effective method for finding a good fit of clusters for your data. But there remains the question of how to decide on the number of clusters to start with.
Unsupervised Learning techniques like K-Means Clustering are necessitated when you don’t know the labels or assignments of an unstructured dataset. So it follows that the data itself won’t inherently tell you the correct number (or labels) for your clusters.
So how do you compare different numbers of clusters for your data? The simplest approach is to measure the error of your clusters, as defined by:

$$J = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^2$$

This function assesses the error of your clusters by measuring the squared distance between each observation (x) and its assigned centroid (μ), averaged over all m observations. The set of clusters that presents the lowest distance, or lowest overall error, to each respective centroid affords the best fit for clustering your data.
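This error measure can be sketched directly in NumPy. The function name and the tiny worked example below are mine, chosen for illustration:

```python
import numpy as np

def distortion(X, labels, centroids):
    """Mean squared distance between each observation and its assigned centroid."""
    diffs = X - centroids[labels]          # x^(i) - mu_(c^(i)) for every i
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Tiny worked example: two points, each sitting 1 unit from its centroid,
# so the mean squared distance is 1.0.
X = np.array([[0.0, 0.0], [10.0, 0.0]])
centroids = np.array([[1.0, 0.0], [9.0, 0.0]])
labels = np.array([0, 1])
print(distortion(X, labels, centroids))  # 1.0
```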
Returning to our example of t-shirt sizes, how can we use this error function to determine the right number of clusters? One approach is the “Elbow Method”, as seen in Figure 3 above. By plotting the error of your data against the number of clusters you initialize, you can look for the “elbow” – the point after which the error stops decreasing sharply. In Figure 3 that would appear to be at 2 clusters, indicating that we should perhaps go with assignments of Small vs. Large.
The caveat to this approach is that often there isn’t a clear inflection point in your error curve. As a consequence, it’s not always possible to use the Elbow Method to pick the appropriate number of clusters.
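The Elbow Method can be sketched with scikit-learn, whose KMeans exposes the fitted error as inertia_ (the sum of squared distances to the closest centroid). The t-shirt data below is synthetic, generated purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic t-shirt measurements (width, length) with two natural groups.
X = np.vstack([rng.normal([45.0, 65.0], 2.0, (50, 2)),
               rng.normal([60.0, 80.0], 2.0, (50, 2))])

# Fit K-Means for a range of cluster counts and record the error of each.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The error always decreases as k grows; the "elbow" is the point where
# it stops dropping sharply -- here, going from k=1 to k=2.
for k, err in zip(range(1, 7), inertias):
    print(k, round(err, 1))
```

In practice you would plot inertias against k and look for the bend by eye, as in Figure 3.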
In such scenarios it’s recommended to rely on intuition or the context of the problem you’re trying to solve. For t-shirt sizes for instance, it’s possible that you know you want to group your t-shirts into 5 sizes – extra-small, small, medium, large, and extra-large. This isn’t something the data would necessarily tell you clearly, but based on your intuition you could initialize 5 centroids to find the appropriate clusters.
Overall, with a number of clusters in mind, K-Means Clustering provides an iterative and effective algorithm for discovering structure in your data.
This blog post is based on concepts taught in Stanford’s Machine Learning course notes by Andrew Ng on Coursera.