Loading and parsing the 5000 points data
Clustering is the unsupervised learning task of grouping objects into clusters of high similarity. Unlike supervised tasks, where data is labeled, clustering can be used to make sense of unlabeled data. PySpark MLlib includes the popular K-means algorithm for clustering. In this three-part exercise, you'll find out how many clusters there are in a dataset containing 5000 rows and 2 columns. To do this, you'll first load the data into an RDD, parse the RDD based on the delimiter, run the K-means model, evaluate the model, and finally visualize the clusters.
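For orientation, here is a minimal sketch of how the later modeling and evaluation steps could look with pyspark.mllib. The variable rdd_split_int (an RDD of [x, y] integer pairs built in this first part) and the choice k=13 are assumptions used purely for illustration, not values given by the exercise.

from pyspark.mllib.clustering import KMeans

# Assumed: rdd_split_int is an RDD of [x, y] integer pairs (built in this part);
# k=13 is an arbitrary candidate number of clusters, explored in the later parts.
model = KMeans.train(rdd_split_int, k=13, maxIterations=10)

# Within Set Sum of Squared Errors (WSSSE): a common way to compare choices of k
wssse = model.computeCost(rdd_split_int)
print("k = 13, WSSSE = {}".format(wssse))

# Cluster centers, which can later be plotted alongside the points
centers = model.clusterCenters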
In this first part, you'll load the data into an RDD, split each line on the delimiter, and convert the string values to integers.
Remember, you have a SparkContext sc available in your workspace. The file_path variable (the path to the 5000_points.txt file) is also already available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Load the 5000_points dataset into an RDD named clusterRDD.
- Transform the clusterRDD by splitting the lines based on the tab ("\t").
- Transform the split RDD to create a list of integers for the two columns.
- Confirm that there are 5000 rows in the dataset.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the dataset into an RDD
clusterRDD = sc.____(file_path)
# Split the RDD based on tab
rdd_split = clusterRDD.____(lambda x: ____.split(____))
# Transform the split RDD by creating a list of integers
rdd_split_int = rdd_split.____(lambda x: [int(____), int(x[1])])
# Count the number of rows in RDD
print("There are {} rows in the rdd_split_int dataset".format(____.____()))