
Shrinking the data by sampling

When you are working with a big dataset, you typically don't need all of it all the time. Particularly at the start of your project, while you are experimenting wildly with what you want to do, you can often iterate more quickly by working on a smaller subset of the data. sdf_sample() provides a convenient way to do this: it takes a tibble and the fraction of rows to return. In this case, you want to sample without replacement, so that each row can appear in the sample at most once. To get a random sample of one tenth of your dataset, you would use the following code.

a_tibble %>%
  sdf_sample(fraction = 0.1, replacement = FALSE)

Since the results of the sampling are random, and you will likely want to reuse the shrunken dataset, it is common to use compute() to store the results as another Spark data frame.

a_tibble %>%
  sdf_sample(fraction = 0.1, replacement = FALSE) %>%
  compute("sample_dataset")

To make the results reproducible, you can also set a random number seed via the seed argument. Doing this means that you get the same random sample every time you run your code. It doesn't matter which number you use for the seed; just choose your favorite positive integer.
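For example, here is a minimal sketch tying these pieces together; the seed value 42 and the table name "seeded_sample" are arbitrary illustrative choices:

a_tibble %>%
  # Sample 10% of rows without replacement; the seed makes it reproducible
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%
  # Store the sample as a Spark data frame for reuse
  compute("seeded_sample")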

This exercise is part of the course Introduction to Spark with sparklyr in R.

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Use sdf_sample() to sample 1% of the track metadata without replacement.
    • Pass 20000229 to the seed argument to set a random seed.
  • Compute the result, and store it in a table named "sample_track_metadata".

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Sample the data without replacement
  ___ %>%
  # Compute the result
  ___
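If you get stuck, here is one way to fill in the blanks, following the exercise instructions above:

# track_metadata_tbl has been pre-defined
track_metadata_tbl %>%
  # Sample 1% of the track metadata without replacement, with the given seed
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 20000229) %>%
  # Compute the result and store it as "sample_track_metadata"
  compute("sample_track_metadata")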