Transforming features for better clusterings

1. Transforming features for better clusterings

Let's look now at another dataset,

2. Piedmont wines dataset

the Piedmont wines dataset. We have 178 samples of red wine from the Piedmont region of Italy. The features measure chemical composition (like alcohol content) and visual properties like color intensity. The samples come from 3 distinct varieties of wine.

3. Clustering the wines

Let's take the array of samples and use KMeans to find 3 clusters.

4. Clusters vs. varieties

There are three varieties of wine, so let's use pandas crosstab to check the cluster label - wine variety correspondence. As you can see, this time things haven't worked out so well. The KMeans clusters don't correspond well with the wine varieties.

5. Feature variances

The problem is that the features of the wine dataset have very different variances. The variance of a feature measures the spread of its values. For example, the malic acid feature has a higher variance

6. Feature variances

than the od280 feature, and this can also be seen in their scatter plot. The differences in some of the feature variances is enormous, as seen here, for example, in the scatter plot of the od280 and proline features.

7. StandardScaler

In KMeans clustering, the variance of a feature corresponds to its influence on the clustering algorithm. To give every feature a chance, the data needs to be transformed so that features have equal variance. This can be achieved with the StandardScaler from scikit-learn. It transforms every feature to have mean 0 and variance 1. The resulting "standardized" features can be very informative. Using standardized od280 and proline, for example, the three wine varieties are much more distinct.

8. sklearn StandardScaler

Let's see the StandardScaler in action. First, import StandardScaler from sklearn.preprocessing. Then create a StandardScaler object, and fit it to the samples. The transform method can now be used to standardize any samples, either the same ones, or completely new ones.

9. Similar methods

The APIs of StandardScaler and KMeans are similar, but there is an important difference. StandardScaler transforms data, and so has a transform method. KMeans, in contrast, assigns cluster labels to samples, and this done using the predict method.

10. StandardScaler, then KMeans

Let's return to the problem of clustering the wines. We need to perform two steps. Firstly, to standardize the data using StandardScaler, and secondly to take the standardized data and cluster it using KMeans. This can be conveniently achieved by combining the two steps using a scikit-learn pipeline. Data then flows from one step into the next, automatically.

11. Pipelines combine multiple steps

The first steps are the same: creating a StandardScaler and a KMeans object. After that, import the make_pipeline function from sklearn.pipeline. Apply the make_pipeline function to the steps that you want to compose in this case, the scaler and the kmeans objects. Now use the fit method of the pipeline to fit both the scaler and kmeans, and use its predict method to obtain the cluster labels.

12. Feature standardization improves clustering

Checking the correspondence between the cluster labels and the wine varieties reveals that this new clustering, incorporating standardization, is fantastic. Its three clusters correspond almost exactly to the three wine varieties. This is a huge improvement on the clustering without standardization.

13. sklearn preprocessing steps

StandardScaler is an example of a "preprocessing" step. There are several of these available in scikit-learn, for example MaxAbsScaler and Normalizer.

14. Let's practice!

You've learned a lot this video. Now put it into practice and cluster some datasets!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.

Unsupervised Learning in Python

IntermediateSkill Level

4.8+

759 reviews