
Clustering

1. Clustering

Previously, we learned how to use Supervised Learning to make predictions based on labeled data. In this lesson, we’ll cover a new topic called clustering.

2. What is clustering?

Clustering is a set of Machine Learning algorithms that divide data into groups of similar items, called clusters. Clustering can help us see patterns in messy datasets. Machine Learning Scientists use clustering to divide customers into segments, images into categories, or behaviors into typical and anomalous groups.

3. Supervised vs. Unsupervised Machine Learning

Clustering is part of a broader category within Machine Learning called "Unsupervised Learning". Unsupervised Learning differs from Supervised Learning in the structure of the training data. While Supervised Learning uses data with features and labels, Unsupervised Learning uses data with only features. This makes Unsupervised Learning, and clustering, particularly appealing: you can use it even when you don't know much about your dataset.
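To make the difference in training data concrete, here is a minimal sketch. The rows and the label names are made up for illustration; they are not from the lesson's dataset.

```python
# Made-up example rows: [flights per year, percent business class].
features = [
    [2, 0.0],
    [40, 90.0],
    [5, 10.0],
]

# Supervised Learning: every row of features comes with a known label.
labels = ["family", "business", "family"]  # hypothetical labels
supervised_data = list(zip(features, labels))

# Unsupervised Learning: only the features are available.
unsupervised_data = features

print(supervised_data[1])    # ([40, 90.0], 'business')
print(unsupervised_data[1])  # [40, 90.0]
```

A clustering algorithm would receive only `unsupervised_data` and would have to find the groups on its own.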

4. Case study: customer segmentation

Let's dive into an example of clustering for customer segmentation. Customer segmentation is the process of dividing a pool of customers into different groups with common attributes. We can use these segments to devise targeted advertising campaigns or to explain otherwise confusing results by analyzing the behavior of individual segments, rather than just looking at the customers as a whole.

5. Case study: customer segmentation

First, we need to brainstorm a list of features that will accurately describe our customers. Let's say that we are working for an airline and our customers are travelers. Important features might include the number of flights they've taken in the past year, the percent of those flights that were international, how far in advance they typically buy tickets, and what percent of tickets were business class.
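One way to write down these four features is sketched below. The class name, field names, and values are assumptions chosen for illustration; the lesson does not prescribe a particular representation.

```python
from dataclasses import dataclass, astuple


@dataclass
class Traveler:
    """One customer, described by the four features from the lesson."""
    flights_per_year: int
    pct_international: float       # percent of flights that were international
    days_booked_in_advance: float  # typical lead time when buying tickets
    pct_business_class: float      # percent of tickets in business class


# Made-up customers, purely for illustration.
travelers = [
    Traveler(2, 50.0, 90.0, 0.0),
    Traveler(40, 20.0, 5.0, 85.0),
]

# Clustering algorithms usually expect one numeric row per customer.
feature_matrix = [list(astuple(t)) for t in travelers]
print(feature_matrix[0])  # [2, 50.0, 90.0, 0.0]
```

Each row of `feature_matrix` is one customer and each column is one feature.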

6. Case study: customer segmentation

Some clustering algorithms need us to define how many clusters we want to create. The number of clusters we ask for greatly affects how the algorithm will segment our data. Here's an example of some of our flight data. The x-axis represents the number of flights per year and the y-axis represents the percent of those flights that were business class.

7. Case study: customer segmentation

Here is how the algorithm divides the data if we ask for two clusters. And here is how it divides the same data if we ask for three clusters. Having a strong hypothesis about our data helps us get better results from the clustering algorithm. For our airline example, we might expect to have business travelers, family travelers, and adventurers: three clusters, as in the image on the bottom right.
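The lesson does not say which clustering algorithm produced these plots. As a sketch, we can assume k-means, a common algorithm that needs the number of clusters up front, and run it on made-up points with two and then three clusters. The data, the function names, and the choice to start from the first k points are all assumptions made to keep the example small and repeatable.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Minimal k-means: returns (centroids, cluster index for each point)."""
    # Assumption: start from the first k points so the result is repeatable.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Made-up data: (flights per year, percent business class).
points = [
    (2, 5), (40, 90), (20, 10),
    (3, 0), (45, 85), (22, 5),
    (4, 10), (50, 95), (25, 15),
]

_, labels_two = kmeans(points, k=2)
_, labels_three = kmeans(points, k=3)

print(labels_two)    # [0, 1, 0, 0, 1, 0, 0, 1, 0]
print(labels_three)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

With two clusters, the infrequent and moderate flyers with few business-class tickets end up in the same group. With three clusters, they are separated, which matches the hypothesis of three kinds of travelers.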

8. Clustering review

Let's review. Clustering is an Unsupervised Machine Learning method that divides an unlabeled dataset into different categories. In order to perform clustering, we must first select relevant features of our dataset. Next, we select the number of clusters based on hypotheses about our data. Finally, we use the results of our clustering to solve business problems, such as targeted advertising or price setting. We can even use clustering as a way of breaking up a larger Machine Learning problem! Rather than modeling all of our data at once, we could create a different model for each cluster to get better predictions.
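The idea of one model per cluster can be sketched with the simplest possible "model": predicting the average. The cluster labels and spending figures below are made up for illustration, and a real project would use a proper model and held-out data.

```python
# Made-up data: a cluster label and yearly spend for six customers.
cluster_labels = [0, 0, 0, 1, 1, 1]
yearly_spend = [200, 250, 300, 4000, 4500, 5000]


def mean(values):
    return sum(values) / len(values)


# One model for all the data: always predict the overall average.
global_prediction = mean(yearly_spend)

# One model per cluster: predict the average of that cluster only.
cluster_means = {}
for c in sorted(set(cluster_labels)):
    members = [y for y, label in zip(yearly_spend, cluster_labels) if label == c]
    cluster_means[c] = mean(members)

# Compare the mean absolute error of both approaches.
global_error = mean([abs(y - global_prediction) for y in yearly_spend])
per_cluster_error = mean(
    [abs(y - cluster_means[label]) for y, label in zip(yearly_spend, cluster_labels)]
)

print(cluster_means)                     # {0: 250.0, 1: 4500.0}
print(per_cluster_error < global_error)  # True
```

On this toy data the per-cluster predictions are much closer, because the two clusters behave very differently.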

9. Let's practice!

Now that we understand clustering, let's practice!