Sorting vs. arranging
So far in this chapter, you've explored some feature transformation functions from Spark's MLlib. sparklyr also provides access to some functions that make use of the Spark DataFrame API.
The dplyr way of sorting a tibble is to use arrange(). You can also sort tibbles with Spark's DataFrame API using sdf_sort(). This function takes a character vector of columns to sort on; currently, only sorting in ascending order is supported.
For example, to sort by column x, then (in the event of ties) by column y, then by column z, the following code compares the dplyr and Spark DataFrame approaches.
# The dplyr approach
a_tibble %>%
  arrange(x, y, z)

# The Spark DataFrame approach
a_tibble %>%
  sdf_sort(c("x", "y", "z"))
To see which method is faster, try using both arrange() and sdf_sort(). You can see how long your code takes to run by wrapping it in microbenchmark(), from the package of the same name.
microbenchmark({
  # your code
})
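For instance, sticking with the toy a_tibble from above (assumed to be a Spark tibble with columns x, y, and z), a sketch of such a comparison might look like this. Each named argument is an expression to time, times sets the number of runs, and collecting the results forces Spark to actually execute each sort.

library(microbenchmark)

# Hypothetical comparison; "arranged" and "sorted" are just labels for the output
microbenchmark(
  arranged = a_tibble %>% arrange(x, y, z) %>% collect(),
  sorted = a_tibble %>% sdf_sort(c("x", "y", "z")) %>% collect(),
  times = 5
)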
You can learn more about profiling the speed of your code in the Writing Efficient R Code course.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Use microbenchmark() to compare how long it takes to perform the following actions.
  - Use arrange() to order the rows of track_metadata_tbl by year, then artist_name, then release, then title.
  - Collect the result.
  - Do the same thing again, this time using sdf_sort() rather than arrange(). Remember to quote the column names.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Compare timings of arrange() and sdf_sort()
microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  times = 5
)
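For reference, one way of completing the template might look like this, assuming the dplyr, sparklyr, and microbenchmark packages are already loaded, as in the exercise setup:

microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    arrange(year, artist_name, release, title) %>%
    # Collect the result
    collect(),
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    sdf_sort(c("year", "artist_name", "release", "title")) %>%
    # Collect the result
    collect(),
  times = 5
)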