Revisiting wholesale data: "Best" k
At the end of Chapter 2 you explored the wholesale distributor data customers_spend using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.
The first step will be to determine the "best" value of k using average silhouette width.
A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor on the food categories of Milk, Grocery, and Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.
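To make the quantity concrete, the sketch below fits a single pam() model and reads its average silhouette width. Because customers_spend is not reproduced here, it uses a simulated stand-in data frame (toy_spend, a hypothetical name) with 45 rows and the same three spending columns. The values are made up purely for illustration and say nothing about the real data.

```r
library(cluster)  # provides pam()

set.seed(42)

# Hypothetical stand-in for customers_spend: 45 clients, 3 spending columns.
# The numbers are simulated, not taken from the course data.
toy_spend <- data.frame(
  Milk    = round(runif(45, min = 100, max = 20000)),
  Grocery = round(runif(45, min = 100, max = 30000)),
  Frozen  = round(runif(45, min = 100, max = 15000))
)

# Fit one PAM (partitioning around medoids) model with k = 2
model <- pam(x = toy_spend, k = 2)

# Average silhouette width over all observations. It lies in [-1, 1],
# and higher values suggest better-separated clusters.
model$silinfo$avg.width
```

The same model$silinfo$avg.width lookup is what the exercise repeats for each candidate k.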
This exercise is part of the course
Cluster Analysis in R
Exercise instructions
- Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
- Build a new data frame sil_df containing the values of k and the vector of average silhouette widths.
- Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
  model <- pam(x = ___, k = ___)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = ___,
  sil_width = ___
)

# Plot the relationship between k and sil_width
ggplot(___, aes(x = ___, y = ___)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
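For reference, here is one way the whole workflow might look end to end. This is a sketch under assumptions, not the course's official solution: it runs on a simulated stand-in data frame (toy_spend, a hypothetical name) rather than the real customers_spend, and the final best_k step is an extra convenience that the exercise does not ask for.

```r
library(cluster)   # pam()
library(purrr)     # map_dbl()
library(ggplot2)   # ggplot(), geom_line(), scale_x_continuous()

set.seed(42)

# Hypothetical stand-in for customers_spend (simulated values only)
toy_spend <- data.frame(
  Milk    = round(runif(45, min = 100, max = 20000)),
  Grocery = round(runif(45, min = 100, max = 30000)),
  Frozen  = round(runif(45, min = 100, max = 15000))
)

# Fit one PAM model per k and keep its average silhouette width
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = toy_spend, k = k)
  model$silinfo$avg.width
})

# Pair each k with its average silhouette width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Assumption: treat the k with the highest average width as the "best" k
best_k <- sil_df$k[which.max(sil_df$sil_width)]

# Line plot of average silhouette width against k
p <- ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
print(p)
```

With the real data, replacing toy_spend with customers_spend should be the only change needed, assuming the cluster, purrr, and ggplot2 packages are loaded.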