Exercise

Remove stop words and reduce the dataset

After splitting the lines of the file into a long list of words in the previous exercise, the next step is to remove stop words from your data. Stop words are common words that are often uninteresting, such as "I", "the", and "a". You could remove many obvious stop words with a list of your own, but for this exercise you will remove the stop words from a curated list stop_words provided to you in your environment.

After removing stop words, you'll create a pair RDD, where each element is a pair tuple (k, v) in which k is the key and v is the value. In this exercise, the pair RDD is composed of tuples (w, 1), where w is each word in the RDD and 1 is its count. Finally, you'll combine the values with the same key from the pair RDD to get the frequency of each word.

Remember, you already have a SparkContext sc and splitRDD available in your workspace.

Instructions

100 XP
  • Convert the words in splitRDD to lower case and then remove stop words contained in the curated list stop_words.
  • Create a pair RDD containing a tuple of the word and the number 1 for each word element in splitRDD.
  • Count the number of occurrences of each word (word frequency) in the pair RDD.