Caching

1. Caching

We've learned how to perform complex operations using simple syntax. In this lesson, you'll learn how to cache dataframes and tables.

2. What is caching?

Caching means keeping data in memory so that it does not have to be refetched or recalculated each time it is used. Using the cache properly is an important best practice when working with Spark, because Spark is aggressive about freeing up memory: it errs on the side of unloading data that is not currently in use, even if that data will be needed again later.

3. Eviction Policy

The eviction policy determines when and which data is removed from the cache. Spark's policy is least recently used (LRU): when memory fills up, the data that has gone longest without being used is evicted first. Each worker manages its own cache, and eviction depends on the memory available to that worker.

4. Caching a dataframe

To cache a dataframe, use df.cache(). To uncache it, use df.unpersist().
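
A minimal sketch, assuming a local SparkSession; the dataframe name df and its contents are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)  # placeholder dataframe

df.cache()       # mark the dataframe for caching
df.unpersist()   # remove it from the cache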

5. Determining whether a dataframe is cached

To determine whether a dataframe is cached, use the dataframe property is_cached. Here, df.is_cached first confirmed that the dataframe was not cached. Then we cached it, and using df.is_cached again confirmed that it was cached.
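
For example, continuing with the placeholder dataframe df from above:

print(df.is_cached)   # False: not cached yet
df.cache()
print(df.is_cached)   # True: now marked as cached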

6. Uncaching a dataframe

You uncache a dataframe using unpersist().
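
A short sketch; note that unpersist() also accepts an optional blocking argument, which, when True, makes the call wait until the cached blocks are actually freed:

df.unpersist(blocking=True)   # wait for the blocks to be freed
print(df.is_cached)           # False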

7. Storage level

A dataframe's storage level specifies five details about how it is cached: useDisk, useMemory, useOffHeap, deserialized, and replication. We can set any or all of these; however, the cache() operation sets them to defaults. useDisk specifies whether to move some or all of the dataframe to disk if that is needed to free up memory. useMemory specifies whether to keep the data in memory. useOffHeap tells Spark to use off-heap storage instead of on-heap memory. The on-heap store refers to objects held in an in-memory data structure that is fast to access; the off-heap store is also in memory, but slightly slower than the on-heap store, though still faster than disk. The best performance comes from operating solely in on-heap memory, but Spark also allows off-heap storage for certain operations; the downside is that the user has to manage the allocated memory manually. Setting deserialized to True is faster but uses more memory, while serialized data is more space-efficient but slower to read; this option only applies to in-memory storage, because the disk cache is always serialized. replication tells Spark to replicate the data on multiple nodes, which allows faster fault recovery when a node fails.
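
These settings can be inspected through the dataframe's storageLevel property, or bundled explicitly with pyspark.StorageLevel. A minimal sketch using the placeholder df:

from pyspark import StorageLevel

df.cache()
print(df.storageLevel)   # displays the five settings for the cached dataframe

# Building a custom storage level (serialized in memory, spill to disk,
# two replicas) looks like this:
level = StorageLevel(useDisk=True, useMemory=True, useOffHeap=False,
                     deserialized=False, replication=2)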

8. Persisting a dataframe

You may wonder why you use df.cache() to cache but df.unpersist() to uncache. That's because df.cache() is shorthand for df.persist() with its first argument set to the default value; in other words, cache() is equivalent to persist() with the default storageLevel. The persist() command lets you specify the desired storage level through that first argument, and uses a default setting when the argument is omitted. When memory is scarce, the MEMORY_AND_DISK caching strategy is recommended: it spills the dataframe to disk if memory runs low. Reading the dataframe from the disk cache is slower than reading it from memory, but can still be faster than recreating it from scratch.
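
A sketch of both forms, again with the placeholder df (the exact default storage level varies across Spark versions):

from pyspark import StorageLevel

df.persist()   # same defaults as df.cache()

df.unpersist()                             # storage level can't be changed in place
df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk when memory runs low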

9. Caching a table

We just learned about caching a dataframe. Tables can also be cached. spark.catalog.isCached() tells you whether a table has been cached. You cache a table using the operation spark.catalog.cacheTable(), giving the table name as the first argument.
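
A minimal sketch; 'table1' is a hypothetical name for a temporary view registered from the placeholder dataframe:

df.createOrReplaceTempView('table1')

print(spark.catalog.isCached('table1'))   # False
spark.catalog.cacheTable('table1')
print(spark.catalog.isCached('table1'))   # True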

10. Uncaching a table

To uncache the table, use spark.catalog.uncacheTable(), giving the table name as the first argument. spark.catalog.clearCache() removes all cached tables.
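
Continuing the sketch with the hypothetical 'table1':

spark.catalog.uncacheTable('table1')      # uncache a single table
spark.catalog.clearCache()                # remove every cached table
print(spark.catalog.isCached('table1'))   # False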

11. Tips

Caching is a lazy operation: a dataframe won't appear in the cache until an action is performed on it. Don't overdo it; only cache a dataframe if more than one operation is to be performed on it and it takes substantial time to create. Unpersist objects you no longer need. Caching incurs a cost, and caching everything generally slows things down.
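
For example, with the placeholder df, the cache is only populated once an action runs:

df.cache()       # lazy: nothing is materialized yet
df.count()       # the action triggers computation and fills the cache
df.unpersist()   # release the memory once the dataframe is no longer needed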

12. Let's practice

Now it's your turn to practice caching with Spark!