K-means training
Now that the RDD is ready for training, in this second part you'll test values of k from 13 to 16 (to save computation time) and use the elbow method to choose the best k. The idea of the elbow method is to run K-means clustering on the dataset for different values of k, calculate the Within Set Sum of Squared Error (WSSSE) for each, and select the k at which the WSSSE drops suddenly and then flattens, i.e. where the elbow occurs. Next, you'll retrain the model with the best k and, finally, get the centroids (cluster centers).
Remember, you already have a SparkContext sc and rdd_split_int RDD available in your workspace.
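The elbow choice described above can be sketched in plain Python. This is only an illustrative heuristic, not part of PySpark or of the exercise: the function name pick_elbow and the WSSSE values are made up, and the sketch assumes consecutive values of k. For each interior k it compares the drop in WSSSE going into k with the drop going out of k, and returns the k where that difference is largest.

```python
def pick_elbow(wssse_by_k):
    """Return the k where the WSSSE curve bends most sharply.

    Hypothetical helper: assumes consecutive k values and at least three of them.
    """
    ks = sorted(wssse_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three k values to locate an elbow")

    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_in = wssse_by_k[prev_k] - wssse_by_k[k]    # improvement gained by reaching k
        drop_out = wssse_by_k[k] - wssse_by_k[next_k]   # improvement gained by going past k
        bend = drop_in - drop_out
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up WSSSE values, for illustration only
example_wssse = {13: 250.0, 14: 240.0, 15: 170.0, 16: 165.0}
print(pick_elbow(example_wssse))  # prints 15
```

In practice the elbow is often read off a plot of WSSSE against k; a numeric rule like this one is just one possible way to make the same judgment, and it cannot select the smallest or largest k tested.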
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Train the KMeans model with the number of clusters ranging from 13 to 16 and print the WSSSE for each number of clusters.
- Train the KMeans model again with the best k.
- Get the cluster centers (centroids) of the KMeans model trained with the best k.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Train the model with clusters from 13 to 16 and compute WSSSE
for clst in range(13, 17):
    model = KMeans.____(rdd_split_int, clst, seed=1)
    WSSSE = rdd_split_int.____(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("The cluster {} has Within Set Sum of Squared Error {}".format(clst, ____))

# Train the model again with the best k
model = KMeans.train(rdd_split_int, k=____, seed=1)

# Get cluster centers
cluster_centers = model.____
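The sample code calls an error helper that is not shown on this page. As a rough sketch of what such a per-point error and the map-then-reduce sum might look like, here is a plain-Python version that runs without Spark. The name squared_error, the toy points, and the toy centers are all assumptions made for illustration. The sketch sums squared Euclidean distances to the nearest center, which matches the name WSSSE; the course's own helper may be defined differently.

```python
from functools import reduce


def squared_error(point, centers):
    """Squared Euclidean distance from a point to its nearest center (hypothetical helper)."""
    return min(
        sum((a - b) ** 2 for a, b in zip(point, center))
        for center in centers
    )


# Toy data, invented for illustration
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
centers = [(0.0, 1.0), (10.0, 10.0)]

# Same map-then-reduce shape as the RDD pipeline, using built-in map and functools.reduce
wssse = reduce(lambda x, y: x + y, map(lambda p: squared_error(p, centers), points))
print(wssse)  # prints 2.0
```

With an RDD, the same lambda pair would be passed to the RDD's own map and reduce methods, so the per-point errors are computed in parallel before being summed.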