Exercise

Filter and Count

The RDD transformation filter() returns a new RDD containing only the elements for which a given function returns True. It is useful for narrowing a large dataset down to the records that match a keyword. In this exercise, you'll select the lines containing the keyword Spark from fileRDD, an RDD made up of the lines of text in the README.md file. Next, you'll count the lines that contain the keyword Spark, and finally you'll print the first four lines of the filtered RDD.

Remember, you already have a SparkContext sc, file_path and fileRDD available in your workspace.

Instructions

  • Create a filter() transformation that selects the lines containing the keyword Spark, and assign the result to fileRDD_filter.
  • Count the lines in fileRDD_filter to find out how many lines contain the keyword Spark.
  • Print the first four lines of the resulting RDD.