
Collecting data back from Spark

In the exercise 'Exploring the structure of tibbles', back in Chapter 1, you saw that tibbles don't store a copy of the data. Instead, the data stays in Spark, and the tibble simply stores the details of what it would like to retrieve from Spark.

There are lots of reasons that you might want to move your data from Spark to R. You've already seen how some data is moved from Spark to R when you print it. You also need to collect your dataset if you want to plot it, or if you want to use a modeling technique that is not available in Spark. (After all, R has the widest selection of available models of any programming language.)

To collect your data (that is, to move it from Spark to R), you call collect().
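As a minimal sketch of this idea (assuming a local Spark connection and the sparklyr and dplyr packages, rather than the exercise's pre-defined objects), filtering a Spark tibble only builds a lazy query, while collect() actually runs it and returns an ordinary local tibble:

```r
library(sparklyr)
library(dplyr)

# Assumed setup: a local Spark connection (not part of the exercise)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

# Still lazy: no data has been moved from Spark to R yet
heavy <- mtcars_tbl %>%
  filter(wt > 3)

# collect() executes the query in Spark and brings the result into R
heavy_local <- heavy %>%
  collect()

class(heavy_local)  # a local tibble: "tbl_df" "tbl" "data.frame"
```

Until collect() is called, heavy is just a recipe for a query; only heavy_local holds the data in R's memory.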

This exercise is part of the course

Introduction to Spark with sparklyr in R


Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Filter the rows of track_metadata_tbl where artist_familiarity is greater than 0.9, assigning the results to results.
  • Print the class of results, noting that it is a tbl_lazy (used for remote data).
  • Collect your results, assigning them to collected.
  • Print the class of collected, noting that it is a tbl_df (used for local data).

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

results <- track_metadata_tbl %>%
  # Filter where artist familiarity is greater than 0.9
  ___

# Examine the class of the results
___

# Collect your results
collected <- results %>%
  ___

# Examine the class of the collected results
___
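One way to fill in the blanks (a sketch assuming the standard dplyr verbs that sparklyr exposes; the course's own solution may differ in minor details):

```r
# track_metadata_tbl has been pre-defined
track_metadata_tbl

results <- track_metadata_tbl %>%
  # Filter where artist familiarity is greater than 0.9
  filter(artist_familiarity > 0.9)

# Examine the class of the results: a tbl_lazy (remote data)
class(results)

# Collect your results, moving them from Spark to R
collected <- results %>%
  collect()

# Examine the class of the collected results: a tbl_df (local data)
class(collected)
```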