Big data, tiny tibble
In the last exercise, when you copied the data to Spark, copy_to() returned a value. This return value is a special kind of tibble that doesn't contain any data of its own. To explain this, you need to know a bit about the way that tidyverse packages store data. Tibbles are usually just a variant of data.frames that have a nicer print method. However, dplyr also allows them to store data from a remote data source, such as a database or, as is the case here, Spark. For remote datasets, the tibble object simply stores a connection to the remote data. This will be discussed in more detail later; the important point for now is that even though you have a big dataset, the tibble object itself is small.
On the Spark side, the data is stored in a variable type called a DataFrame. This is a more or less direct equivalent of R's data.frame variable type. (The column variable types are named slightly differently; for example, numeric columns are called DoubleType columns.) Throughout the course, the term data frame will be used unless a distinction between data.frame and DataFrame is needed. Since these types are also analogous to database tables, the term table will sometimes be used to describe this sort of rectangular data.
Calling tbl() with a Spark connection and a string naming the Spark data frame returns the same kind of tibble object that copy_to() returned.
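In general terms (the connection and table names below are placeholders, not the exercise solution):

```r
library(dplyr)
library(sparklyr)

# Assuming sc is an existing Spark connection and "some_table"
# is a table already present in the cluster
some_table_tbl <- tbl(sc, "some_table")

# The result is a remote tibble backed by the Spark table,
# just like the return value of copy_to()
```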
A useful tool that you will see in this exercise is the object_size() function from the pryr package, which shows you how much memory an object takes up.
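For instance, object_size() works on any R object, so you can compare a large in-memory vector with a small remote tibble (a plain-R sketch, no Spark needed):

```r
library(pryr)

# A million doubles stored locally in R take real memory
x <- rnorm(1e6)
object_size(x)  # on the order of 8 MB: 8 bytes per double

# A small object, by contrast, reports only a few hundred bytes
object_size(1:10)
```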
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. The track metadata for 1,000 tracks is stored in the Spark cluster in the table "track_metadata".
- Link to the "track_metadata" table using tbl(). Assign the result to track_metadata_tbl.
- See how big the dataset is, using dim() on track_metadata_tbl.
- See how small the tibble is, using object_size() on track_metadata_tbl.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Link to the track_metadata table in Spark
track_metadata_tbl <- ___(___, "___")
# See how big the dataset is
___(___)
# See how small the tibble is
___(___)