1. Learn
  2. /
  3. Courses
  4. /
  5. Big Data Fundamentals with PySpark

Connected

Exercise

Visualizing clusters

You just trained the k-means model with an optimum k value (k=16) and generated cluster centers (centroids). In this final exercise, you will visualize the clusters and the centroids by overlaying them. This will indicate how well the clustering worked (ideally, the clusters should be distinct from each other and centroids should be at the center of their respective clusters).

To achieve this, you will first convert the rdd_split_int RDD into a Spark DataFrame, and then into Pandas DataFrame which can be used for plotting. Similarly, you will convert cluster_centers into a Pandas DataFrame. Once both the DataFrames are created, you will create scatter plots using Matplotlib.

The SparkContext sc as well as the variables rdd_split_int and cluster_centers, and package matplotlib.pyplot (imported as plt) are available in your workspace.

Instructions

100 XP
  • Convert the rdd_split_int RDD to a Spark DataFrame, then to a pandas DataFrame.
  • Create a pandas DataFrame from the cluster_centers list.
  • Create a scatter plot from the pandas DataFrame of raw data (rdd_split_int_df_pandas) and overlay that with a scatter plot from the Pandas DataFrame of centroids (cluster_centers_pandas).