
Clustering

1. Clustering

In this last lesson, you'll learn about clustering.

2. Supervised vs. unsupervised machine learning

So far, you've seen so-called supervised machine learning techniques: for example, predicting dino mass based on bone length, or forecasting T-shirt sales based on previous observations. In these examples, you described the relationship between two variables and applied that relationship to new, unseen data. Regression and exponential smoothing are examples of supervised machine learning. Clustering, on the other hand, is an example of unsupervised machine learning: for example, differentiating between tissue types or segmenting customers into different groups. You don't know beforehand what tissue types or customer groups are present in the data; instead, you let the unsupervised algorithm figure out which data points are most similar to each other. In the case of clustering, the output is a set of clusters.

3. k-means clustering

So, how does this work? Tableau uses one of the many existing clustering algorithms, called k-means clustering. k-means clustering can be applied to one, two, or more variables, but let's take two variables to explain how the algorithm works. Let's visualize them as a scatter plot, with one variable on each axis.

4. k-means clustering

The k in k-means refers to the number of clusters you want to end up with. If you know that you want to split your data into three categories, say good, medium, and bad, you specify k equals three. You can also leave k unspecified and let Tableau suggest a number of clusters. In this example, we want to end up with two clusters, so two randomly chosen centers are added.

5. k-means clustering

Then, all distances between the random centers and each of the data points are measured. Each data point is assigned to the center it is closest to, colored here for visualization purposes.
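The assignment step can be sketched in a few lines of NumPy. The six data points and the two starting centers below are hypothetical values chosen for illustration; this is not Tableau's internal implementation, just the same idea in code.

```python
import numpy as np

# Toy data: six points in 2D (hypothetical values for illustration)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                   [5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])

# Two starting centers (fixed here instead of random, for reproducibility)
centers = np.array([[1.0, 1.0], [5.0, 5.0]])

# Distance from every point to every center, then assign each point
# to the index of its nearest center
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assignments = distances.argmin(axis=1)
print(assignments)  # [0 0 0 1 1 1]
```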

6. k-means clustering

Each center is then moved to the mean of the points assigned to it, which becomes the new center of that group.

7. k-means clustering

The process is iterative: all distances between each data point and the new centers are measured again, and the data points are reassigned accordingly.

8. k-means clustering

Once the centers stop moving between iterations, the final clusters are set.
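Putting the steps above together gives a minimal k-means loop: assign, update, repeat until the centers stop moving. This is a from-scratch sketch of the general algorithm on the same hypothetical toy data, not Tableau's implementation (Tableau adds refinements such as smarter initialization):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment and update until convergence."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return labels, centers

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                   [5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])
labels, centers = kmeans(points, k=2)
print(labels)  # the first three points end up in one cluster, the rest in the other
```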

9. Assess clustering quality

Since you can't assess the quality of the clustering result by comparing actual and predicted values, you need another approach. Two metrics are used to assess the clustering algorithm: between-group sum of squares and within-group sum of squares. Between-group sum of squares measures the separation between the clusters as the sum of squared distances between each cluster's center and the average value of the data set. The larger the value, the better the separation between clusters. Within-group sum of squares quantifies the cohesion of the clusters as the sum of squared distances between the center of each cluster and the individual data points in the cluster. The smaller the value, the more cohesive the clusters.
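Both metrics are straightforward to compute by hand. The sketch below uses the same hypothetical toy data as before and follows a common convention of weighting each center's squared distance by its cluster size in the between-group sum; the exact weighting may differ from what Tableau reports.

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                   [5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

grand_mean = points.mean(axis=0)  # average value of the whole data set
bss = 0.0  # between-group sum of squares: separation (larger is better)
wss = 0.0  # within-group sum of squares: cohesion (smaller is better)
for k in np.unique(labels):
    cluster = points[labels == k]
    center = cluster.mean(axis=0)
    # Squared distance from cluster center to grand mean, weighted by size
    bss += len(cluster) * np.sum((center - grand_mean) ** 2)
    # Squared distances from each point in the cluster to its center
    wss += np.sum((cluster - center) ** 2)
print(bss, wss)  # well-separated, cohesive clusters give bss >> wss
```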

10. Let's practice!

Ready for the final set of exercises?