Storing intermediate results
As you saw in Chapter 1, copying data between R and Spark is fundamentally slow. That means that collecting the data, as you did in the previous exercise, should only be done when you really need to.
The pipe operator is great for chaining together data manipulation commands, but in general you can't do a whole analysis with everything chained together. For example, the following is terrible practice, since you will never be able to debug your code.
final_results <- starting_data %>%
# 743 steps piped together
# ... %>%
collect()
This presents a dilemma: you need to store the results of intermediate calculations, but you don't want to collect them because collecting is slow. The solution is to use compute(), which runs the calculation but stores the results in a temporary data frame on Spark. compute() takes two arguments: a tibble, and a name for the Spark data frame that will store the results.
a_tibble %>%
# some calculations %>%
compute("intermediate_results")
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Filter the rows of track_metadata_tbl where artist_familiarity is greater than 0.8.
- Compute the results using compute().
  - Store the results in a Spark data frame named "familiar_artists".
  - Assign the result to an R tibble named computed.
- See the available Spark datasets using src_tbls().
- Print the class() of computed. Notice that unlike collect(), compute() returns a remote tibble. The data is still stored in the Spark cluster.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
computed <- track_metadata_tbl %>%
# Filter where artist familiarity is greater than 0.8
___ %>%
# Compute the results
___
# See the available datasets
___
# Examine the class of the computed results
___
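One possible completion of the scaffold, following the instructions above (spark_conn and track_metadata_tbl are provided by the exercise environment):

# track_metadata_tbl has been pre-defined
track_metadata_tbl

computed <- track_metadata_tbl %>%
  # Filter where artist familiarity is greater than 0.8
  filter(artist_familiarity > 0.8) %>%
  # Compute the results, stored on Spark as "familiar_artists"
  compute("familiar_artists")

# See the available datasets
src_tbls(spark_conn)

# Examine the class of the computed results
class(computed)

Printing class(computed) should show a remote Spark tibble class rather than a plain data.frame, confirming that the filtered data is still stored on the cluster rather than copied into R.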