Print word frequencies

After combining the values (counts) that share a key (word), you'll return the first 10 word frequencies in this exercise. You could retrieve all the elements at once with collect(), but that is bad practice and not recommended: RDDs can be huge, and you may run out of memory and crash your computer.
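As a minimal sketch of the safer approach, the helper below brings back only a handful of elements with take() instead of calling collect(). The function name preview is hypothetical and not part of the exercise; it assumes it is handed a PySpark pair RDD such as resultRDD.

```python
def preview(rdd, n=10):
    """Return at most n elements of the RDD to the driver.

    `preview` is a hypothetical helper name. It relies only on RDD.take(),
    which stops once n elements have been gathered, whereas collect()
    would return every element of the RDD.
    """
    return rdd.take(n)


# Usage in the exercise workspace (resultRDD is assumed to exist there):
# for word, count in preview(resultRDD):
#     print(word, count)
```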

What if we want to return the top 10 words? First, you'll need to swap the key (word) and value (count) so that the key is the count and the value is the word. Right now, each tuple in resultRDD has the key (word) as element 0 and the value (count) as element 1. After you swap the key and value in each tuple, you'll sort the pair RDD by key (count). With the count as the key, the RDD can be sorted by frequency using PySpark's sortByKey operation. Finally, you'll return the top 10 words based on their frequencies from the sorted RDD.

You already have a SparkContext sc and resultRDD available in your workspace.

Print the first 10 words and their frequencies from the resultRDD RDD.
Swap the keys and values in the resultRDD.
Sort the keys in descending order.
Print the top 10 most frequent words and their frequencies from the sorted RDD.
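The swap, sort, and take steps above can be sketched as follows. This is one possible way to do it, not necessarily the reference solution; the function name top_n_by_count is hypothetical, and the input is assumed to be a pair RDD of (word, count) tuples like resultRDD.

```python
def top_n_by_count(result_rdd, n=10):
    """Return the n most frequent (count, word) pairs from a (word, count) pair RDD.

    `top_n_by_count` is a hypothetical helper name, not part of the exercise.
    """
    # Swap each (word, count) tuple to (count, word) so the count becomes the key.
    swapped = result_rdd.map(lambda pair: (pair[1], pair[0]))
    # Sort by key (the count), largest first.
    sorted_rdd = swapped.sortByKey(ascending=False)
    # Bring only the first n elements back to the driver.
    return sorted_rdd.take(n)


# Usage in the exercise workspace (resultRDD is assumed to exist there):
# for count, word in top_n_by_count(resultRDD):
#     print("{} has {} counts".format(word, count))
```

Note that the returned tuples are in (count, word) order, so the loop unpacks the count first.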
