Practicing caching: part 1
In the next few exercises, you'll experiment with different ways of caching two DataFrames.
A DataFrame df1 is loaded from a CSV file and several processing steps are performed on it. As df1 is to be used more than once, it is a candidate for caching.
A second DataFrame df2 is created by performing additional compute-intensive steps on df1. It is also a candidate for caching.
Because df2 depends on df1, the question arises: is it better to cache df1, or to cache df2?
In this exercise, we'll try caching df1. Note the amount of time that each action takes; we'll compare these timings in the next exercise.
This exercise is part of the course Introduction to Spark SQL in Python.
Exercise instructions
- Cache df1 only.
- Run a first action on df1 and repeat it, then run an action on df2 and repeat it. This has been done for you.
- Confirm whether or not df1 is cached.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Unpersists df1 and df2 and initializes a timer
prep(df1, df2)
# Cache df1
____
# Run actions on both dataframes
run(df1, "df1_1st")
run(df1, "df1_2nd")
run(df2, "df2_1st")
run(df2, "df2_2nd", elapsed=True)
# Prove df1 is cached
print(____)
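Once the blanks are filled in and the code has run, the standard PySpark DataFrame API offers a couple of ways to confirm the cache state. The lines below are a minimal sketch assuming plain PySpark; df1 here stands for any cached DataFrame, and the printed values are illustrative.
print(df1.is_cached)     # True once cache() or persist() has been called on df1
print(df1.storageLevel)  # shows the storage level in use (memory/disk flags)

df1.unpersist()          # release the cache when it is no longer needed
print(df1.is_cached)     # False after unpersisting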