Loading and parsing the 5000 points data
Clustering is the unsupervised learning task of grouping objects into clusters of high similarity. Unlike supervised tasks, where data is labeled, clustering can be used to make sense of unlabeled data. PySpark MLlib includes the popular K-means algorithm for clustering. In this three-part exercise, you'll find out how many clusters there are in a dataset containing 5000 rows and 2 columns. To do this, you'll first load the data into an RDD, parse the RDD based on the delimiter, run the K-means model, evaluate the model, and finally visualize the clusters.
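For orientation, here is a minimal sketch of how the later modeling and evaluation steps could look with pyspark.mllib. The variable rdd_split_int (an RDD of [x, y] integer pairs built in this first part) and the choice k=13 are assumptions used purely for illustration, not values given by the exercise.

from pyspark.mllib.clustering import KMeans

# Assumed: rdd_split_int is an RDD of [x, y] integer pairs (built in this part);
# k=13 is an arbitrary candidate number of clusters, explored in the later parts.
model = KMeans.train(rdd_split_int, k=13, maxIterations=10)

# Within Set Sum of Squared Errors (WSSSE): a common way to compare choices of k
wssse = model.computeCost(rdd_split_int)
print("k = 13, WSSSE = {}".format(wssse))

# Cluster centers, which can later be plotted alongside the points
centers = model.clusterCenters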
In this first part, you'll load the data into an RDD, split each line on the delimiter, and convert the string values to integers.
Remember, you have a SparkContext sc available in your workspace. The file_path variable (the path to the 5000_points.txt file) is also already available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Load the 5000_points dataset into an RDD named clusterRDD.
- Transform the clusterRDD by splitting the lines based on the tab ("\t").
- Transform the split RDD to create a list of integers for the two columns.
- Confirm that there are 5000 rows in the dataset.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the dataset into an RDD
clusterRDD = sc.____(file_path)
# Split the RDD based on tab
rdd_split = clusterRDD.____(lambda x: ____.split(____))
# Transform the split RDD by creating a list of integers
rdd_split_int = rdd_split.____(lambda x: [int(____), int(x[1])])
# Count the number of rows in RDD
print("There are {} rows in the rdd_split_int dataset".format(____.____()))