Sorting vs. arranging
So far in this chapter, you've explored some feature transformation functions from Spark's MLlib. sparklyr also provides access to some functions that make use of the Spark DataFrame API.

The dplyr way of sorting a tibble is to use arrange(). You can also sort tibbles using Spark's DataFrame API via sdf_sort(). This function takes a character vector of columns to sort on; currently, only sorting in ascending order is supported.

For example, to sort by column x, then (in the event of ties) by column y, then by column z, the following code compares the dplyr and Spark DataFrame approaches.
a_tibble %>%
  arrange(x, y, z)

a_tibble %>%
  sdf_sort(c("x", "y", "z"))
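Since sdf_sort() only supports ascending order, one workaround is to stay in the dplyr interface and wrap any descending column in desc(). A minimal sketch (a_tibble is a placeholder name):

a_tibble %>%
  # Sort x in descending order, breaking ties by y in ascending order
  arrange(desc(x), y)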
To see which method is faster, try using both arrange() and sdf_sort(). You can see how long your code takes to run by wrapping it in microbenchmark(), from the package of the same name.
microbenchmark({
  # your code
})
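As a self-contained sketch of the pattern (the expression being timed here is arbitrary; only the microbenchmark() usage matters):

library(microbenchmark)

# Time an expression several times and print summary statistics
# (min, median, max, etc.) across the runs
microbenchmark(
  base_sort = sort(runif(1e5)),
  times = 10
)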
You can learn more about profiling the speed of your code in the Writing Efficient R Code course.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
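For reference, a setup along these lines could produce those objects. This is only a sketch: the local master and the table name "track_metadata" are assumptions, not part of the exercise environment.

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumed; the exercise
# environment provides spark_conn for you)
spark_conn <- spark_connect(master = "local")

# Attach a tibble to a table already stored in Spark
# (the table name here is hypothetical)
track_metadata_tbl <- tbl(spark_conn, "track_metadata")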
- Use microbenchmark() to compare how long it takes to perform the following actions.
  - Use arrange() to order the rows of track_metadata_tbl by year, then artist_name, then release, then title.
  - Collect the result.
  - Do the same thing again, this time using sdf_sort() rather than arrange(). Remember to quote the column names.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Compare timings of arrange() and sdf_sort()
microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  times = 5
)
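One possible completion of the blanks, as a sketch (assuming the microbenchmark, sparklyr, and dplyr packages are loaded, and that track_metadata_tbl is defined as described above):

microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    arrange(year, artist_name, release, title) %>%
    # Collect the result back into R
    collect(),
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    sdf_sort(c("year", "artist_name", "release", "title")) %>%
    # Collect the result back into R
    collect(),
  times = 5
)

Note that collect() brings the sorted data back from Spark into a local tibble, so each timing covers both the sort and the transfer.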