Print word frequencies
After combining the values (counts) with the same key (word), in this exercise you'll return the first 10 word frequencies. You could retrieve all of the elements at once with collect(), but that is bad practice and not recommended: RDDs can be huge, so you may run out of memory and crash your computer.
What if we want to return the top 10 words? First, you'll need to swap the key (word) and value (count) so that the key is the count and the value is the word. Right now, resultRDD has the word as element 0 and the count as element 1 of each tuple. After you swap the key and value, the count becomes the key, so you can sort the pair RDD by count using the sortByKey operation in PySpark. Finally, you'll return the top 10 words based on their frequencies from the sorted RDD.
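The swap-then-sort idea can be sketched in plain Python, using a small hypothetical list of (word, count) pairs in place of resultRDD (the words and counts here are made up for illustration):

```python
# Hypothetical (word, count) pairs standing in for resultRDD
pairs = [("the", 12), ("spark", 4), ("data", 8)]

# Swap each (word, count) into (count, word) so the count becomes the key
swapped = [(count, word) for word, count in pairs]

# sortByKey sorts a pair RDD by its key; sorted() plays that role here
top = sorted(swapped, reverse=True)
print(top)  # [(12, 'the'), (8, 'data'), (4, 'spark')]
```

Once the count is the key, no custom comparator is needed: ordinary key-based sorting puts the most frequent words first.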
You already have a SparkContext sc and resultRDD available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Print the first 10 words and their frequencies from the resultRDD RDD.
- Swap the keys and values in resultRDD.
- Sort the keys in descending order.
- Print the top 10 most frequent words and their frequencies from the sorted RDD.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Display the first 10 words and their frequencies from the input RDD
for word in resultRDD.____(10):
    print(word)
# Swap the keys and values from the input RDD
resultRDD_swap = resultRDD.____(lambda x: (x[1], x[____]))
# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.____(ascending=False)
# Show the top 10 most frequent words and their frequencies from the sorted RDD
for word in resultRDD_swap_sort.____(____):
    print("{},{}".format(____, word[0]))