Create a base RDD and transform it
The volume of unstructured data (log lines, images, binary files) in existence is growing dramatically, and PySpark is an excellent framework for analyzing this type of data through RDDs. In this three-part exercise, you will write code that calculates the most common words in the Complete Works of William Shakespeare.
Here are the brief steps for writing the word-counting program (a sketch of the full pipeline follows this list):
- Create a base RDD from the Complete_Shakespeare.txt file.
- Use RDD transformations to create a long list of words from each element of the base RDD.
- Remove stop words from your data.
- Create a pair RDD where each element is a pair tuple of ('w', 1).
- Group the elements of the pair RDD by key (word) and add up their values.
- Swap the keys (words) and values (counts) so that each key is a count and each value is a word.
- Finally, sort the RDD in descending order and print the 10 most frequent words and their frequencies.
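For orientation, here is a minimal sketch of that pipeline using standard PySpark RDD operations. The stop_words list is a hypothetical placeholder for illustration only (the course exercises supply their own):

# Minimal sketch of the full pipeline; stop_words is a hypothetical
# placeholder for illustration, not the list used in the course
stop_words = ['the', 'and', 'of', 'to', 'a', 'in', 'i']

baseRDD = sc.textFile(file_path)                       # one element per line
splitRDD = baseRDD.flatMap(lambda line: line.split())  # one element per word
filteredRDD = splitRDD.filter(lambda w: w.lower() not in stop_words)
pairRDD = filteredRDD.map(lambda w: (w, 1))            # ('w', 1) tuples
countsRDD = pairRDD.reduceByKey(lambda a, b: a + b)    # sum counts per word
swappedRDD = countsRDD.map(lambda wc: (wc[1], wc[0]))  # (count, word)
for count, word in swappedRDD.sortByKey(ascending=False).take(10):
    print(word, count)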
In this first exercise, you'll create a base RDD from the Complete_Shakespeare.txt file and transform it to create a long list of words.
Remember, a SparkContext sc is already available in your workspace. A file_path variable (which is the path to the Complete_Shakespeare.txt file) is also loaded for you.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Create an RDD called baseRDD that reads lines from file_path.
- Transform baseRDD into a long list of words and create a new splitRDD.
- Count the total number of words in splitRDD.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a baseRDD from the file path
baseRDD = sc.____(file_path)
# Split the lines of baseRDD into words
splitRDD = baseRDD.____(lambda x: x.split())
# Count the total number of words
print("Total number of words in splitRDD:", splitRDD.____())