Remove stop words and reduce the dataset
In this exercise, you'll remove stop words from your data. Stop words are common words that are often uninteresting, such as "I", "the", and "a". You could remove many obvious stop words with a list of your own, but for this exercise you will remove the stop words from a curated list, stop_words, provided to you in your environment.
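To see what stop-word filtering means before applying it to an RDD, here is a minimal plain-Python sketch; the word list and the stop_words values below are made up for illustration, not the ones in your workspace:

# Hypothetical data for illustration only
stop_words = ["i", "the", "a", "and", "of"]
words = ["I", "saw", "the", "quick", "fox", "and", "a", "dog"]

# Keep only words that are not stop words (lowercasing for a fair match)
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['saw', 'quick', 'fox', 'dog']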
After removing the stop words, you'll create a pair RDD in which each element is a tuple (k, v), where k is the key and v is the value. In this example, the pair RDD is composed of tuples (w, 1), where w is each word in the RDD and 1 is its initial count. Finally, you'll combine the values with the same key from the pair RDD to count the number of occurrences of each word.
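The sketch below shows this map-then-combine pattern end to end on a made-up word list. It is self-contained, so it creates its own SparkContext; in the exercise itself you'll use the sc and splitRDD already provided instead:

# Standalone sketch of the (w, 1) pair-RDD word count pattern
from pyspark import SparkContext

sc = SparkContext("local", "wordcount_sketch")
words = sc.parallelize(["spark", "data", "spark", "big", "data", "spark"])

# Map each word to a (word, 1) pair, then sum the 1s per key
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())  # e.g. [('spark', 3), ('data', 2), ('big', 1)]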
Remember, you already have a SparkContext sc and splitRDD available in your workspace, along with the stop_words list variable.
Exercise instructions
- Filter splitRDD, removing stop words listed in the stop_words variable.
- Create a pair RDD tuple containing the word (using the w iterator) and the number 1 from each word element in splitRDD.
- Get the count of the number of occurrences of each word (word frequency) in the pair RDD. Use a transformation which operates on key-value (k, v) pairs. Think carefully about which function to use here.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Filter splitRDD to remove stop words from the stop_words curated list
splitRDD_no_stop = splitRDD.____(lambda x: x.lower() not in ____)
# Create a tuple of the word (w) and 1
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (____, ____))
# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.____(lambda x, y: x + y)
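One possible completion of the sample code is sketched below; it assumes splitRDD, stop_words, and sc exist in your workspace as described above:

# Filter splitRDD to remove stop words from the stop_words curated list
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# Create a tuple of the word (w) and 1
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the occurrences of each word by summing the values per key
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)

reduceByKey fits here because it merges the values of matching keys with the function you supply, so every (w, 1) pair for the same word collapses into a single (w, total) pair.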