Sorting vs. arranging
So far in this chapter, you've explored some feature transformation functions from Spark's MLlib. sparklyr also provides access to some functions that make use of the Spark DataFrame API.

The dplyr way of sorting a tibble is to use arrange(). You can also sort tibbles using Spark's DataFrame API via sdf_sort(). This function takes a character vector of columns to sort on; currently, only sorting in ascending order is supported.

For example, to sort by column x, then (in the event of ties) by column y, then by column z, the following code compares the dplyr and Spark DataFrame approaches.
a_tibble %>%
  arrange(x, y, z)

a_tibble %>%
  sdf_sort(c("x", "y", "z"))
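Since sdf_sort() only supports ascending order, one workaround is to stay in the dplyr interface and wrap any descending column in desc(). A minimal sketch (a_tibble is a placeholder name):

a_tibble %>%
  # Sort x in descending order, breaking ties by y in ascending order
  arrange(desc(x), y)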
To see which method is faster, try using both arrange() and sdf_sort(). You can see how long your code takes to run by wrapping it in microbenchmark(), from the package of the same name.
microbenchmark({
  # your code
})
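As a self-contained sketch of the pattern (the expression being timed here is arbitrary; only the microbenchmark() usage matters):

library(microbenchmark)

# Time an expression several times and print summary statistics
# (min, median, max, etc.) across the runs
microbenchmark(
  base_sort = sort(runif(1e5)),
  times = 10
)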
You can learn more about profiling the speed of your code in the Writing Efficient R Code course.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
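For reference, a setup along these lines could produce those objects. This is only a sketch: the local master and the table name "track_metadata" are assumptions, not part of the exercise environment.

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumed; the exercise
# environment provides spark_conn for you)
spark_conn <- spark_connect(master = "local")

# Attach a tibble to a table already stored in Spark
# (the table name here is hypothetical)
track_metadata_tbl <- tbl(spark_conn, "track_metadata")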
- Use microbenchmark() to compare how long it takes to perform the following actions.
  - Use arrange() to order the rows of track_metadata_tbl by year, then artist_name, then release, then title.
  - Collect the result.
  - Do the same thing again, this time using sdf_sort() rather than arrange(). Remember to quote the column names.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Compare timings of arrange() and sdf_sort()
microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  times = 5
)
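One possible completion of the blanks, as a sketch (assuming the microbenchmark, sparklyr, and dplyr packages are loaded, and that track_metadata_tbl is defined as described above):

microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    arrange(year, artist_name, release, title) %>%
    # Collect the result back into R
    collect(),
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    sdf_sort(c("year", "artist_name", "release", "title")) %>%
    # Collect the result back into R
    collect(),
  times = 5
)

Note that collect() brings the sorted data back from Spark into a local tibble, so each timing covers both the sort and the transfer.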