
Big data, tiny tibble

In the last exercise, when you copied the data to Spark, copy_to() returned a value. This return value is a special kind of tibble that doesn't contain any data of its own. To explain this, you need to know a bit about how tidyverse packages store data. Tibbles are usually just a variant of data.frames with a nicer print method. However, dplyr also allows them to store data from a remote data source, such as a database or – as is the case here – Spark. For remote datasets, the tibble object simply stores a connection to the remote data. This will be discussed in more detail later; the important point for now is that even though you have a big dataset, the tibble object itself is small.
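To make this concrete, here is a minimal sketch. It assumes a local Spark installation and uses base R's mtcars dataset; neither is part of this exercise, where the connection and data are provided for you.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumption: Spark is installed locally)
sc <- spark_connect(master = "local")

# copy_to() pushes the rows into Spark and returns a remote tibble
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# The return value is a tbl_spark: it holds a connection, not the rows
class(mtcars_tbl)
```

Printing mtcars_tbl retrieves rows from Spark on demand, which is why the local object can stay tiny.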

On the Spark side, the data is stored in a variable called a DataFrame. This is a more or less direct equivalent of R's data.frame variable type. (The column types are named slightly differently, though – for example, numeric columns are called DoubleType columns.) Throughout the course, the term data frame will be used, unless a distinction between data.frame and DataFrame is needed. Since these types are also analogous to database tables, the term table will sometimes be used to describe this sort of rectangular data as well.

Calling tbl() with a Spark connection and a string naming the Spark data frame returns the same kind of tibble object that copy_to() returned.
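As a sketch, assuming a table named "mtcars" was copied to a connection sc in an earlier step:

```r
# tbl() looks up an existing Spark table by name; no data is copied to R
mtcars_tbl <- tbl(sc, "mtcars")
```

This is how you reconnect to data that is already in the cluster, rather than copying it again.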

A useful tool that you will see in this exercise is the object_size() function from the pryr package, which shows you how much memory an object takes up.
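For example, assuming pryr is installed and mtcars_tbl is a remote tibble created in a previous step:

```r
# Memory used by the tibble object itself – typically a few kB,
# no matter how large the remote dataset in Spark is
pryr::object_size(mtcars_tbl)
```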

This exercise is part of the course

Introduction to Spark with sparklyr in R


Exercise instructions

A Spark connection has been created for you as spark_conn. The track metadata for 1,000 tracks is stored in the Spark cluster in the table "track_metadata".

  • Link to the "track_metadata" table using tbl(). Assign the result to track_metadata_tbl.
  • See how big the dataset is, using dim() on track_metadata_tbl.
  • See how small the tibble is, using object_size() on track_metadata_tbl.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Link to the track_metadata table in Spark
track_metadata_tbl <- ___(___, "___")

# See how big the dataset is
___(___)

# See how small the tibble is
___(___)