CommencerCommencer gratuitement

Revisiting wholesale data: "Best" k

At the end of Chapter 2 you explored wholesale distributor data customers_spend using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.

The first step will be to determine the "best" value of k using average silhouette width.

A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that because the data is all of the same type (amount spent) and you will not need to scale it.

Cet exercice fait partie du cours

Cluster Analysis in R

Afficher le cours

Instructions

  • Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
  • Build a new data frame sil_df containing the values of k and the vector of average silhouette widths.
  • Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width.

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(x = ___, k = ___)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = ___,
  sil_width = ___
)

# Plot the relationship between k and sil_width
ggplot(___, aes(x = ___, y = ___)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
Modifier et exécuter le code