Revisiting wholesale data: "Best" k

At the end of Chapter 2 you explored the wholesale distributor data, customers_spend, using hierarchical clustering. This time you will analyze the same data using the k-means clustering tools covered in this chapter.

The first step will be to determine the "best" value of k using average silhouette width.

A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery, and Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.

Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
Build a new data frame sil_df containing the values of k and the vector of average silhouette widths.
Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width.
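
The three steps above can be sketched as follows. This is only one possible approach, not the official solution. Because the real customers_spend data is not reproduced here, the sketch builds a randomly generated stand-in with the same shape (45 rows; columns Milk, Grocery, Frozen), so the plotted values are illustrative only. It assumes the purrr, cluster, and ggplot2 packages are installed.

```r
library(purrr)
library(cluster)
library(ggplot2)

# ASSUMPTION: hypothetical stand-in for customers_spend, generated at random.
# Replace this with the real data frame when it is available.
set.seed(42)
customers_spend <- data.frame(
  Milk    = round(runif(45, 100, 20000)),
  Grocery = round(runif(45, 100, 30000)),
  Frozen  = round(runif(45, 50, 10000))
)

# Run pam() for each k from 2 to 10 and extract the average silhouette width
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width
})

# Combine the k values and the silhouette widths into one data frame
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Line plot of average silhouette width against k
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
```

The k with the highest average silhouette width is the candidate for the "best" k; with the random stand-in data there is no real cluster structure, so the peak carries no meaning until the real data is used.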
