Exercise

K-means training

Now that the RDD is ready for training, in this 2nd part, you'll test with k's from 13 to 16 (to save computation time) and use the elbow method to chose the correct k. The idea of the elbow method is to run K-means clustering on the dataset for different values of k, calculate Within Set Sum of Squared Error (WSSSE) and select the best k based on the sudden drop in WSSSE. Next, you'll retrain the model with the best k and finally, get the centroids (cluster centers).

Remember, you already have a SparkContext sc and rdd_split_int RDD available in your workspace.

Instructions

100 XP
  • Train the KMeans model with clusters from 13 to 16 and print the WSSSE for each cluster.
  • Train the KMeans model again with the best k.
  • Get the Cluster Centers (centroids) of KMeans model trained with the best k.