Visualizing clusters

You just trained the k-means model with the optimal value of k (k=16) and generated the cluster centers (centroids). In this final exercise, you will visualize the clusters and the centroids by overlaying them. The plot indicates how well the clustering worked: ideally, the clusters are distinct from each other and each centroid sits at the center of its cluster.
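The "centroid at the center of its cluster" criterion can also be checked numerically. The following is a minimal sketch, not part of the exercise: the points, the centroids, and the helper names `nearest` and `cluster_means` are hypothetical, and it uses only the Python standard library rather than Spark.

```python
from math import dist

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def cluster_means(points, centroids):
    """Assign each point to its nearest centroid and return the mean of each group."""
    sums = {}
    for p in points:
        i = nearest(p, centroids)
        sx, sy, n = sums.get(i, (0.0, 0.0, 0))
        sums[i] = (sx + p[0], sy + p[1], n + 1)
    return {i: (sx / n, sy / n) for i, (sx, sy, n) in sums.items()}

# Hypothetical toy data: two well-separated groups of four points each
points = [(1, 1), (1, 3), (3, 1), (3, 3), (10, 10), (10, 12), (12, 10), (12, 12)]
centroids = [(2.0, 2.0), (11.0, 11.0)]

means = cluster_means(points, centroids)
# Distance between each centroid and the mean of the points assigned to it
offsets = {i: dist(means[i], centroids[i]) for i in means}
print(offsets)
```

For a converged k-means model, these offsets should be close to zero. Large offsets suggest the centroids do not sit at the center of the points assigned to them.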

To achieve this, you will first convert the rdd_split_int RDD into a Spark DataFrame and then into a pandas DataFrame, which can be used for plotting. Similarly, you will convert cluster_centers into a pandas DataFrame. Once both DataFrames are created, you will create scatter plots using Matplotlib.

The SparkContext sc, the variables rdd_split_int and cluster_centers, and the matplotlib.pyplot package (imported as plt) are available in your workspace.

This exercise is part of the course

Big Data Fundamentals with PySpark

Exercise instructions

Convert the rdd_split_int RDD to a Spark DataFrame, then to a pandas DataFrame.
Create a pandas DataFrame from the cluster_centers list.
Create a scatter plot from the pandas DataFrame of raw data (rdd_split_int_df_pandas) and overlay it with a scatter plot from the pandas DataFrame of centroids (cluster_centers_pandas).

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Convert rdd_split_int RDD into Spark DataFrame and then to Pandas DataFrame
rdd_split_int_df_pandas = spark.____(rdd_split_int, schema=["col1", "col2"]).toPandas()

# Convert cluster_centers to a pandas DataFrame
cluster_centers_pandas = pd.DataFrame(____, columns=["col1", "col2"])

# Create an overlaid scatter plot of clusters and centroids
plt.scatter(rdd_split_int_df_pandas["col1"], rdd_split_int_df_pandas["col2"])
plt.scatter(____["col1"], ____["col2"], color="red", marker="x")
plt.show()
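The overlay technique itself can be tried without a Spark session. The sketch below is an illustration under stated assumptions, not the exercise solution: `points` and `centers` are hypothetical stand-ins for the workspace data, the Spark conversion step is skipped by building the pandas DataFrames directly from Python lists, and the figure is saved to a hypothetical file name instead of being shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the data points and the centroids
points = [[1, 1], [1, 3], [3, 1], [3, 3], [10, 10], [10, 12], [12, 10], [12, 12]]
centers = [[2.0, 2.0], [11.0, 11.0]]

# Build both pandas DataFrames with the same column names
points_pandas = pd.DataFrame(points, columns=["col1", "col2"])
centers_pandas = pd.DataFrame(centers, columns=["col1", "col2"])

# Draw the raw points first, then overlay the centroids as red crosses
fig, ax = plt.subplots()
ax.scatter(points_pandas["col1"], points_pandas["col2"], label="data points")
ax.scatter(centers_pandas["col1"], centers_pandas["col2"],
           color="red", marker="x", label="centroids")
ax.set_xlabel("col1")
ax.set_ylabel("col2")
ax.legend()
fig.savefig("clusters_sketch.png")
```

Plotting the centroids second keeps them on top of the data points, and the contrasting color and marker make them easy to tell apart from the raw data.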
