Visualizing clusters
You just trained the k-means model with an optimum k value (k=16) and generated cluster centers (centroids). In this final exercise, you will visualize the clusters and the centroids by overlaying them. This will indicate how well the clustering worked (ideally, the clusters should be distinct from each other and centroids should be at the center of their respective clusters).
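The "centroid at the center of its cluster" criterion can also be checked numerically. The following is a minimal pure-Python sketch, not part of the exercise: the helper names nearest and cluster_means and the toy data are made up for illustration. It assigns each point to its closest centroid and computes the mean position of each group.

```python
from math import dist
from statistics import mean


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def cluster_means(points, centroids):
    """Return the mean position of the points assigned to each centroid."""
    groups = {i: [] for i in range(len(centroids))}
    for p in points:
        groups[nearest(p, centroids)].append(p)
    return {
        i: tuple(mean(axis) for axis in zip(*members))
        for i, members in groups.items()
        if members  # skip clusters that received no points
    }


# Made-up toy data: two well-separated groups and their centroids
points = [(1, 1), (1, 3), (3, 1), (3, 3), (9, 9), (9, 11), (11, 9), (11, 11)]
centroids = [(2, 2), (10, 10)]
means = cluster_means(points, centroids)
```

If the clustering has converged, each entry in means should lie very close to the corresponding centroid. A large gap suggests the centroid does not sit at the center of the points assigned to it.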
To achieve this, you will first convert the rdd_split_int RDD into a Spark DataFrame, and then into a pandas DataFrame, which can be used for plotting. Similarly, you will convert cluster_centers into a pandas DataFrame. Once both DataFrames are created, you will create scatter plots using Matplotlib.
The SparkContext sc, the variables rdd_split_int and cluster_centers, and the package matplotlib.pyplot (imported as plt) are available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Convert the rdd_split_int RDD to a Spark DataFrame, then to a pandas DataFrame.
- Create a pandas DataFrame from the cluster_centers list.
- Create a scatter plot from the pandas DataFrame of raw data (rdd_split_int_df_pandas) and overlay that with a scatter plot from the pandas DataFrame of centroids (cluster_centers_pandas).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Convert rdd_split_int RDD into Spark DataFrame and then to Pandas DataFrame
rdd_split_int_df_pandas = spark.____(rdd_split_int, schema=["col1", "col2"]).toPandas()
# Convert cluster_centers to a pandas DataFrame
cluster_centers_pandas = pd.DataFrame(____, columns=["col1", "col2"])
# Create an overlaid scatter plot of clusters and centroids
plt.scatter(rdd_split_int_df_pandas["col1"], rdd_split_int_df_pandas["col2"])
plt.scatter(____["col1"], ____["col2"], color="red", marker="x")
plt.show()
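One plausible way to complete the template is sketched below, wrapped in a function so the steps can be reused. It rests on a few assumptions: spark is an active SparkSession and pandas is importable as pd (the workspace description mentions neither), the first blank is SparkSession.createDataFrame, and the wrapper name plot_clusters is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_clusters(spark, rdd_split_int, cluster_centers):
    """Overlay centroids on the raw points and return the Matplotlib Axes.

    Assumes `spark` is an active SparkSession and each RDD element is a
    two-value row (hypothetical wrapper, not part of the course code).
    """
    # RDD -> Spark DataFrame -> pandas DataFrame
    points = spark.createDataFrame(
        rdd_split_int, schema=["col1", "col2"]
    ).toPandas()

    # List of centroid coordinates -> pandas DataFrame
    centers = pd.DataFrame(cluster_centers, columns=["col1", "col2"])

    # Raw points first, then centroids drawn on top as red crosses
    fig, ax = plt.subplots()
    ax.scatter(points["col1"], points["col2"])
    ax.scatter(centers["col1"], centers["col2"], color="red", marker="x")
    return ax
```

In the exercise workspace this would be called as plot_clusters(spark, rdd_split_int, cluster_centers), followed by plt.show(). Note that toPandas() collects the whole dataset onto the driver, so this approach only suits data small enough to fit in local memory.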