Revisiting wholesale data: "Best" k

At the end of Chapter 2 you explored the wholesale distributor data, customers_spend, using hierarchical clustering. This time you will analyze the same data using the k-means clustering tools covered in this chapter.

The first step will be to determine the "best" value of k using average silhouette width.
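
As a reminder of what that metric measures, here is a minimal sketch on simulated data. The `toy` matrix and its two groups are hypothetical and are not part of the course data. The sketch assumes the `cluster` package is installed, and checks that the value `pam()` stores in `silinfo$avg.width` is the mean of the per-observation silhouette widths.

```r
library(cluster)  # provides pam() and silhouette()

set.seed(42)
# Hypothetical toy data: two well-separated groups of 20 points in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

model <- pam(x = toy, k = 2)

# One silhouette width per observation, each between -1 and 1
sil <- silhouette(model)
obs_width <- sil[, "sil_width"]

# Averaging over all observations should reproduce the value stored by pam()
avg_width <- mean(obs_width)
all.equal(avg_width, model$silinfo$avg.width)
```

Values near 1 suggest observations sit well inside their own cluster, while values near 0 or below suggest they lie between clusters or may be misassigned.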

A refresher on the data: it contains records of the amount spent by 45 different clients of a wholesale distributor on the food categories Milk, Grocery, and Frozen. It is stored in the data frame customers_spend. For this exercise, you can assume that because the data is all of the same type (amount spent), you will not need to scale it.

This exercise is part of the course

Cluster Analysis in R

Exercise instructions

  • Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
  • Build a new data frame sil_df containing the values of k and the vector of average silhouette widths.
  • Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Use map_dbl to run many models with varying values of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(x = ___, k = ___)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = ___,
  sil_width = ___
)

# Plot the relationship between k and sil_width
ggplot(___, aes(x = ___, y = ___)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
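
Once the silhouette widths are collected, the "best" k is usually taken to be the one with the highest average silhouette width. The sketch below shows one way to read that value off programmatically instead of from the plot. It runs on `toy_spend`, a hypothetical simulated stand-in for customers_spend with 45 rows drawn from three made-up spending profiles, and it assumes the `cluster` and `purrr` packages are installed. The name `best_k` is also an illustrative choice, not part of the exercise.

```r
library(cluster)  # pam()
library(purrr)    # map_dbl()

set.seed(1)
# Hypothetical stand-in for customers_spend: 45 simulated clients, three spending profiles
toy_spend <- data.frame(
  Milk    = c(rnorm(15, 2000, 200), rnorm(15, 8000, 200), rnorm(15, 3000, 200)),
  Grocery = c(rnorm(15, 3000, 200), rnorm(15, 9000, 200), rnorm(15, 15000, 200)),
  Frozen  = c(rnorm(15, 1000, 200), rnorm(15, 6000, 200), rnorm(15, 12000, 200))
)

# Fit one pam() model per k and keep its average silhouette width
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = toy_spend, k = k)
  model$silinfo$avg.width
})

sil_df <- data.frame(k = 2:10, sil_width = sil_width)

# The k with the largest average silhouette width
best_k <- sil_df$k[which.max(sil_df$sil_width)]
best_k
```

Using which.max() returns the first maximum, so in the event of a tie the smaller k is chosen. It is still worth looking at the line plot, since a nearly flat curve means the choice of k is not strongly supported by this metric alone.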