Partitions in your data
SparkContext's textFile() method takes an optional second argument called minPartitions for specifying the minimum number of partitions. In this exercise, you'll create an RDD named fileRDD_part with 5 partitions and then compare it with the fileRDD that you created in the previous exercise. Refer to the "Understanding Partition" slide in video 2.1 for the methods used to create an RDD and to get its number of partitions. Remember, you already have a SparkContext sc, file_path, and fileRDD available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Find the number of partitions in the fileRDD RDD.
- Create an RDD named fileRDD_part from the file path, but with 5 partitions.
- Confirm the number of partitions in the new fileRDD_part RDD.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.____)
# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(____, minPartitions = ____)
# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.____)