Remove stop words and reduce the dataset

In this exercise, you'll remove stop words from your data. Stop words are common words that carry little meaning on their own, such as "I", "the", and "a". You could remove many obvious stop words with a list of your own, but for this exercise you will remove the stop words from the curated stop_words list provided in your environment.
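As a quick illustration of the filtering pattern (a toy sketch, not the exercise solution; the three-word stop list here is made up, and sc is the SparkContext available in your workspace):

# Toy sketch: dropping stop words from an RDD of tokens
tiny_stop_words = {"i", "the", "a"}  # made-up stand-in for the curated stop_words list

tokens = sc.parallelize(["I", "love", "the", "Spark", "API"])
kept = tokens.filter(lambda x: x.lower() not in tiny_stop_words)
print(kept.collect())  # ['love', 'Spark', 'API']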

After removing stop words, you'll create a pair RDD, where each element is a tuple (k, v) with k the key and v the value. Here the pair RDD is composed of tuples (w, 1), where w is each word in the RDD and 1 is its initial count. Finally, you'll combine the values that share the same key to count the number of occurrences of each word.
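To see this pattern on a toy dataset (a minimal sketch; reduceByKey is the standard PySpark transformation for combining values per key, and sc is the SparkContext from your workspace):

# Toy sketch: build (word, 1) pairs, then sum the values per key
pairs = sc.parallelize(["cat", "dog", "cat"]).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())  # e.g. [('cat', 2), ('dog', 1)] (order may vary)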

Remember that you already have a SparkContext sc and an RDD splitRDD available in your workspace, along with the stop_words list variable.

This exercise is part of the course Big Data Fundamentals with PySpark.

Exercise instructions

  • Filter splitRDD, removing stop words listed in the stop_words variable.
  • Create a pair RDD tuple containing the word (using the w iterator) and the number 1 from each word element in splitRDD_no_stop.
  • Count the number of occurrences of each word (word frequency) in the pair RDD. Use a transformation that operates on key, value (k, v) pairs, and think carefully about which function to use here.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Filter splitRDD to remove stop words from the stop_words curated list
splitRDD_no_stop = splitRDD.____(lambda x: x.lower() not in ____)

# Create a tuple of the word (w) and 1 
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (____, ____))

# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.____(lambda x, y: x + y)
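For reference, here is one way the blanks might be filled in (a sketch of a possible solution, using the RDDs and the stop_words list from the workspace):

# Filter splitRDD to remove stop words from the stop_words curated list
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# Create a tuple of the word (w) and 1
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the number of occurrences of each word by summing the values per key
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)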