K-means training

Now that the RDD is ready for training, in this second part you'll test k values from 13 to 16 (to save computation time) and use the elbow method to choose the correct k. The idea of the elbow method is to run K-means clustering on the dataset for different values of k, calculate the Within Set Sum of Squared Error (WSSSE) for each, and select the k at which the WSSSE stops dropping sharply, i.e. where the elbow occurs. Next, you'll retrain the model with the best k and, finally, get the centroids (cluster centers).
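To make the WSSSE concrete, here is a minimal pure-Python sketch on toy data. The function name wssse, the four sample points, and the hand-picked centers are illustrative assumptions, not part of the exercise; in the exercise itself the centers come from a trained KMeans model.

```python
def wssse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        # Squared Euclidean distance from p to each center; keep the smallest.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Toy data: two tight groups that are far apart (hypothetical values).
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers_k1 = [(5, 5.5)]               # one center in the middle
centers_k2 = [(0, 0.5), (10, 10.5)]   # one center per group

print(wssse(points, centers_k1))  # 201.0
print(wssse(points, centers_k2))  # 1.0
```

Going from one center to two cuts the error from 201.0 to 1.0; adding further centers to this toy data could only shave off the remaining 1.0, so the curve bends at k = 2.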

Remember, you already have a SparkContext sc and an RDD rdd_split_int available in your workspace.

Train the KMeans model with k values from 13 to 16 and print the WSSSE for each k.
Train the KMeans model again with the best k.
Get the cluster centers (centroids) of the KMeans model trained with the best k.
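
The three steps above might look roughly like the following sketch. It assumes rdd_split_int holds lists of numeric features and uses KMeans.train from pyspark.mllib.clustering; the helper names squared_error, wssse_by_k, and retrain_and_get_centers, the best_k parameter, and the seed value are illustrative assumptions, not part of the exercise.

```python
def squared_error(point, center):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((p - c) ** 2 for p, c in zip(point, center))

def wssse_by_k(rdd, k_values, seed=1):
    """Train one KMeans model per k and return {k: WSSSE}."""
    # Imported here so squared_error stays usable without Spark installed.
    from pyspark.mllib.clustering import KMeans

    scores = {}
    for k in k_values:
        model = KMeans.train(rdd, k, seed=seed)
        centers = model.clusterCenters
        # Each point contributes its squared distance to its assigned centroid.
        scores[k] = (
            rdd.map(lambda pt: squared_error(pt, centers[model.predict(pt)]))
               .reduce(lambda a, b: a + b)
        )
    return scores

def retrain_and_get_centers(rdd, best_k, seed=1):
    """Retrain with the chosen k and return the cluster centers."""
    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(rdd, best_k, seed=seed)
    return model.clusterCenters
```

In the workspace, one could call wssse_by_k(rdd_split_int, range(13, 17)), print each value, choose best_k by inspecting where the WSSSE levels off, and then call retrain_and_get_centers(rdd_split_int, best_k).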
