Revisiting wholesale data: "Best" k
At the end of Chapter 2 you explored wholesale distributor data customers_spend
using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.
The first step will be to determine the "best" value of k using average silhouette width.
A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.
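To make "average silhouette width" concrete, here is a minimal sketch that fits a single pam() model and checks that the value stored in model$silinfo$avg.width is the mean of the per-observation silhouette widths. The data frame toy_spend is a hypothetical, randomly generated stand-in for customers_spend (which is not reproduced here); its values are made up purely for illustration.

```r
library(cluster)  # provides pam() and silhouette()

set.seed(42)

# Hypothetical stand-in for customers_spend: 45 rows, three spend columns.
# The numbers are synthetic and NOT the course data.
toy_spend <- data.frame(
  Milk    = c(rnorm(15, 2000, 300), rnorm(15, 8000, 500),  rnorm(15, 15000, 800)),
  Grocery = c(rnorm(15, 3000, 400), rnorm(15, 12000, 700), rnorm(15, 6000, 500)),
  Frozen  = c(rnorm(15, 1000, 200), rnorm(15, 2500, 300),  rnorm(15, 9000, 600))
)

# Fit one k-medoids model (k = 3 is an arbitrary choice for this sketch)
model <- pam(x = toy_spend, k = 3)

# Per-observation silhouette widths live in the "sil_width" column
sil <- silhouette(model)
manual_avg <- mean(sil[, "sil_width"])

# pam() stores the same average directly in the fitted object
model$silinfo$avg.width
all.equal(manual_avg, model$silinfo$avg.width)
```

Silhouette widths range from -1 to 1, and a higher average suggests observations sit closer to their own cluster than to the neighboring one.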
This exercise is part of the course
Cluster Analysis in R
Instructions
- Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
- Build a new data frame sil_df containing the values of k and the vector of average silhouette widths.
- Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
  model <- pam(x = ___, k = ___)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = ___,
  sil_width = ___
)

# Plot the relationship between k and sil_width
ggplot(___, aes(x = ___, y = ___)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
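For readers without access to customers_spend, the same workflow can be tried end to end on synthetic data. The sketch below is one possible way to do it, not the course's reference solution: toy_spend, k_values and best_k are names introduced here for illustration, and the generated numbers are not the wholesale data.

```r
library(cluster)  # pam()
library(purrr)    # map_dbl()
library(ggplot2)  # plotting

set.seed(123)

# Hypothetical stand-in for customers_spend (synthetic values, 45 rows)
toy_spend <- data.frame(
  Milk    = c(rnorm(15, 2000, 300), rnorm(15, 8000, 500),  rnorm(15, 15000, 800)),
  Grocery = c(rnorm(15, 3000, 400), rnorm(15, 12000, 700), rnorm(15, 6000, 500)),
  Frozen  = c(rnorm(15, 1000, 200), rnorm(15, 2500, 300),  rnorm(15, 9000, 600))
)

k_values <- 2:10

# Fit one pam() model per k and keep only its average silhouette width
sil_width <- map_dbl(k_values, function(k) {
  model <- pam(x = toy_spend, k = k)
  model$silinfo$avg.width
})

# Pair each k with its average silhouette width
sil_df <- data.frame(k = k_values, sil_width = sil_width)

# Under this criterion, the "best" k has the highest average width
best_k <- sil_df$k[which.max(sil_df$sil_width)]

# Line plot of average silhouette width against k
sil_plot <- ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
sil_plot
```

Reading the peak off the plot and computing which.max() should agree; the numeric version is simply easier to reuse in later steps.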