Partitions in your data

SparkContext's textFile() method takes an optional second argument called minPartitions for specifying the minimum number of partitions. In this exercise, you'll create a RDD named fileRDD_part with 5 partitions and then compare that with fileRDD that you created in the previous exercise. Refer to the "Understanding Partition" slide in video 2.1 to know the methods for creating and getting the number of partitions in a RDD.

Remember, you already have a SparkContext sc, file_path and fileRDD available in your workspace.

This exercise is part of the course

Big Data Fundamentals with PySpark

View Course

Exercise instructions

  • Find the number of partitions that support fileRDD RDD.
  • Create an RDD named fileRDD_part from the file path but create 5 partitions.
  • Confirm the number of partitions in the new fileRDD_part RDD.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.____)

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(____, minPartitions = ____)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.____)