Filter and Count
The RDD transformation filter() returns a new RDD containing only the elements that satisfy a given function. It is useful for filtering large datasets based on a keyword. For this exercise, you'll filter the fileRDD RDD, which consists of lines of text from the README.md file, to select only the lines containing the keyword Spark. Next, you'll count the total number of lines containing the keyword Spark and finally print the first four lines of the filtered RDD.
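To see how these operations behave before working on the file, here is a minimal sketch that assumes only a SparkContext named sc is available; the sample_lines list and the variable names are hypothetical stand-ins for the README.md contents:

# Hypothetical sample data standing in for lines of a text file
sample_lines = ["Apache Spark is fast", "Hello world", "Spark runs on clusters"]
sample_rdd = sc.parallelize(sample_lines)

# filter() keeps only the elements for which the lambda returns True
spark_lines = sample_rdd.filter(lambda line: 'Spark' in line)

# count() returns the number of elements; take(n) returns the first n elements
print(spark_lines.count())   # 2
print(spark_lines.take(2))   # ['Apache Spark is fast', 'Spark runs on clusters']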
Remember, you already have a SparkContext sc, file_path, and fileRDD available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Create a filter() transformation to select the lines containing the keyword Spark.
- How many lines in fileRDD_filter contain the keyword Spark?
- Print the first four lines of the resulting RDD.
Interactive hands-on exercise
Try to solve this exercise by completing the sample code.
# Filter the fileRDD to select lines containing the Spark keyword
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

# Count how many lines there are in fileRDD_filter
print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

# Print the first four lines of fileRDD_filter
for line in fileRDD_filter.take(4):
    print(line)
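If you want to reproduce fileRDD outside the exercise workspace, it would typically be built from the file path with sc.textFile(); this is a sketch under that assumption, where file_path is presumed to point at README.md:

# Build the RDD of lines from the text file (assumes file_path is already defined)
fileRDD = sc.textFile(file_path)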