Get startedGet started for free

Visualizing clusters

You just trained the k-means model with an optimum k value (k=16) and generated cluster centers (centroids). In this final exercise, you will visualize the clusters and the centroids by overlaying them. This will indicate how well the clustering worked (ideally, the clusters should be distinct from each other and centroids should be at the center of their respective clusters).

To achieve this, you will first convert the rdd_split_int RDD into a Spark DataFrame, and then into Pandas DataFrame which can be used for plotting. Similarly, you will convert cluster_centers into a Pandas DataFrame. Once both the DataFrames are created, you will create scatter plots using Matplotlib.

The SparkContext sc as well as the variables rdd_split_int and cluster_centers, and package matplotlib.pyplot (imported as plt) are available in your workspace.

This exercise is part of the course

Big Data Fundamentals with PySpark

View Course

Exercise instructions

  • Convert the rdd_split_int RDD to a Spark DataFrame, then to a pandas DataFrame.
  • Create a pandas DataFrame from the cluster_centers list.
  • Create a scatter plot from the pandas DataFrame of raw data (rdd_split_int_df_pandas) and overlay that with a scatter plot from the Pandas DataFrame of centroids (cluster_centers_pandas).

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Convert rdd_split_int RDD into Spark DataFrame and then to Pandas DataFrame
rdd_split_int_df_pandas = spark.____(rdd_split_int, schema=["col1", "col2"]).toPandas()

# Convert cluster_centers to a pandas DataFrame
cluster_centers_pandas = pd.DataFrame(____, columns=["col1", "col2"])

# Create an overlaid scatter plot of clusters and centroids
plt.scatter(rdd_split_int_df_pandas["col1"], rdd_split_int_df_pandas["col2"])
plt.scatter(____["col1"], ____["col2"], color="red", marker="x")
plt.show()
Edit and Run Code