Print word frequencies
After combining the values (counts) with the same key (word), in this exercise you'll return the first 10 word frequencies. You could retrieve all of the elements at once with collect(), but that is bad practice and not recommended: RDDs can be huge, so you may run out of memory and crash your computer.
What if we want to return the top 10 words? First, you'll need to swap the key (word) and value (count) so that the key is the count and the value is the word. Right now, resultRDD has the word as element 0 and the count as element 1 of each tuple. After you swap the key and value, the count becomes the key, so you can sort the pair RDD by count using the sortByKey operation in PySpark. Finally, you'll return the top 10 words based on their frequencies from the sorted RDD.
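The swap-then-sort idea can be sketched in plain Python, using a small hypothetical list of (word, count) pairs in place of resultRDD (the words and counts here are made up for illustration):

```python
# Hypothetical (word, count) pairs standing in for resultRDD
pairs = [("the", 12), ("spark", 4), ("data", 8)]

# Swap each (word, count) into (count, word) so the count becomes the key
swapped = [(count, word) for word, count in pairs]

# sortByKey sorts a pair RDD by its key; sorted() plays that role here
top = sorted(swapped, reverse=True)
print(top)  # [(12, 'the'), (8, 'data'), (4, 'spark')]
```

Once the count is the key, no custom comparator is needed: ordinary key-based sorting puts the most frequent words first.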
You already have a SparkContext sc and resultRDD available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Print the first 10 words and their frequencies from the resultRDD RDD.
- Swap the keys and values in resultRDD.
- Sort the keys in descending order.
- Print the top 10 most frequent words and their frequencies from the sorted RDD.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Display the first 10 words and their frequencies from the input RDD
for word in resultRDD.____(10):
    print(word)
# Swap the keys and values from the input RDD
resultRDD_swap = resultRDD.____(lambda x: (x[1], x[____]))
# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.____(ascending=False)
# Show the top 10 most frequent words and their frequencies from the sorted RDD
for word in resultRDD_swap_sort.____(____):
    print("{},{}".format(____, word[0]))