1. Tableau: clustering
In this last demo, we will cover the use of clustering in Tableau.
The dataset comes from the Anthropometric Survey of US Army Personnel (ANSUR), and contains body measurements in millimeters. As a T-shirt manufacturer, we need to know which body measurements correspond to small, medium, and large T-shirt sizes. Then, we can retrieve the average body measurement values to determine what size clothing should be made in each of these categories.
There are six measurements, ranging from arm length to waist circumference. To run clustering, the measures you want to cluster on need to be on the canvas. The visualization itself is not that important, but a scatter plot makes the most visual sense. To create a matrix of scatter plots, drag the same measures once to the Rows shelf, and once to the Columns shelf.
There aren't any natural clusters present in the data, but we can ask Tableau to group the most similar observations. To do this, click on the Analytics pane and drag Clusters to the canvas. Tableau suggests a value of two for the k-means algorithm, but since we want to divide the data into small, medium, and large, I specify a value of three. The most important part is that all variables you want to use for clustering are listed in this menu.
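For intuition about what happens when Clusters is dropped on the canvas with a value of three, here is a minimal Python sketch of k-means (Lloyd's algorithm). It is not Tableau's implementation: the measurement values are made up, the names `kmeans` and `squared_distance` are hypothetical, no scaling is applied, and the starting centers are taken at evenly spaced positions in the sorted data purely to keep the run repeatable.

```python
# Hypothetical (waist_mm, arm_mm) pairs; these are NOT values from ANSUR.
points = [
    (820, 700), (830, 710), (840, 705),
    (870, 730), (880, 740), (890, 735),
    (1030, 780), (1040, 790), (1050, 785),
]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm for k >= 2; returns (centers, labels)."""
    # Assumption: deterministic start from evenly spaced sorted points.
    ordered = sorted(points)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels


centers, labels = kmeans(points, 3)
for label, center in enumerate(centers, start=1):
    print(f"cluster {label}: center = {center}")
```

The final `centers` play the same role as the cluster centers Tableau reports, and `labels` is the cluster assignment for each row.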
To check the quality of the clusters, you can right-click on the cluster group, then on Describe clusters. Both between-group and within-group sum of squares are given, together with the cluster summary statistics. The centers you see here are the final centers for each cluster, after the k-means algorithm has finished. Under Models, you can find the p-value for each variable: a p-value lower than 0.05 suggests that the expected values of the corresponding variable differ among clusters. In this case, all variables seem to differ enough across the clusters.
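To make the between-group and within-group sums of squares concrete, the sketch below computes them for a single variable, along with the one-way analysis-of-variance F-statistic that a per-variable p-value is usually based on. The data and the function name `sums_of_squares` are hypothetical, and the p-value itself is left out because it needs an F-distribution that the Python standard library does not provide.

```python
from statistics import mean

# Hypothetical waist circumferences (mm) with a cluster label per row.
waist_mm = [820, 830, 840, 870, 880, 890, 1030, 1040, 1050]
cluster = [1, 1, 1, 2, 2, 2, 3, 3, 3]


def sums_of_squares(values, labels):
    """Return (between_ss, within_ss, number_of_groups) for one variable."""
    grand_mean = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    between_ss = 0.0
    within_ss = 0.0
    for members in groups.values():
        group_mean = mean(members)
        # Between: how far each cluster mean sits from the overall mean.
        between_ss += len(members) * (group_mean - grand_mean) ** 2
        # Within: how far each observation sits from its own cluster mean.
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return between_ss, within_ss, len(groups)


between_ss, within_ss, k = sums_of_squares(waist_mm, cluster)
n = len(waist_mm)

# F-statistic: between-group variance divided by within-group variance.
f_statistic = (between_ss / (k - 1)) / (within_ss / (n - k))

print(f"between-group SS: {between_ss:.1f}")
print(f"within-group SS:  {within_ss:.1f}")
print(f"F-statistic:      {f_statistic:.1f}")
```

A large between-group sum of squares relative to the within-group one gives a large F-statistic, which corresponds to a small p-value for that variable.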
The newly created clusters group can be dragged to the Data pane. We'll name it Sizes. Note that Tableau names these clusters one, two, and three by default, since it doesn't know what these clusters mean. We can try to figure that out by creating a table.
Drag the clusters group and Measure Names to the Rows shelf, and add Measure Values to the Text mark. The sum doesn't make sense here; we're interested in the average value of each measurement per cluster. Now we can see that, for example, the average waist circumference is about 83 cm for cluster one, 88 cm for cluster two, and 104 cm for cluster three. By deduction, cluster one will be the small category, cluster two will be medium, and cluster three the large one. You can rename these clusters if you want, and you can even drill down further by gender.
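The same table logic can be sketched in Python: average each measurement per cluster, then name the clusters by ranking them on average waist circumference. The records, the field names, and the choice to rank on waist alone are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: a cluster label plus two measurements in millimeters.
records = [
    {"cluster": 1, "waist_mm": 820, "arm_mm": 700},
    {"cluster": 1, "waist_mm": 840, "arm_mm": 710},
    {"cluster": 2, "waist_mm": 870, "arm_mm": 730},
    {"cluster": 2, "waist_mm": 890, "arm_mm": 740},
    {"cluster": 3, "waist_mm": 1030, "arm_mm": 780},
    {"cluster": 3, "waist_mm": 1050, "arm_mm": 790},
]
measures = ["waist_mm", "arm_mm"]

# Group the rows by cluster label.
grouped = defaultdict(list)
for record in records:
    grouped[record["cluster"]].append(record)

# Average (not sum) of each measurement within each cluster.
averages = {
    cluster: {m: mean(r[m] for r in members) for m in measures}
    for cluster, members in grouped.items()
}

# Name clusters from smallest to largest average waist circumference.
ranked = sorted(averages, key=lambda c: averages[c]["waist_mm"])
size_names = dict(zip(ranked, ["Small", "Medium", "Large"]))

for cluster in ranked:
    print(size_names[cluster], averages[cluster])
```

Ranking on a single measurement is a simplification; in practice it is worth checking that the other measurements increase in the same order before renaming the clusters.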
With the use of this table, we can start manufacturing T-shirts, with the appropriate values for small, medium, and large sizes.
Time for your last set of exercises!
2. Let's practice!