
Sorting vs. arranging

So far in this chapter, you've explored some feature transformation functions from Spark's MLlib. sparklyr also provides functions that use the Spark DataFrame API.

The dplyr way of sorting a tibble is to use arrange(). You can also sort tibbles with sdf_sort(), which uses Spark's DataFrame API. This function takes a character vector of the columns to sort on; currently, it only supports sorting in ascending order.

For example, the following code sorts by column x, then (in the event of ties) by column y, then by column z, first with the dplyr approach and then with the Spark DataFrame approach.

a_tibble %>%
  arrange(x, y, z)
a_tibble %>%
  sdf_sort(c("x", "y", "z"))
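To make the tie-breaking rule concrete, here is a minimal sketch that runs arrange() on a small local tibble, so no Spark connection is needed. The data values are hypothetical and chosen only so that x and y contain ties; per the description above, sdf_sort() would apply the same ascending, column-by-column ordering to a Spark tibble.

```r
library(dplyr)

# Hypothetical local data: x has ties, and y has ties within x == 1.
a_tibble <- tibble(
  x = c(2, 1, 1, 1),
  y = c("a", "b", "a", "a"),
  z = c(10, 30, 20, 10)
)

# Sort by x first; rows with equal x are ordered by y; rows with equal
# x and y are ordered by z.
sorted_tbl <- a_tibble %>%
  arrange(x, y, z)

print(sorted_tbl)
```

The three rows with x equal to 1 come first. Among them, the two rows with y equal to "a" precede the row with y equal to "b", and those two rows are ordered by z.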

To see which method is faster, try using both arrange() and sdf_sort(). You can see how long your code takes to run by wrapping it in microbenchmark(), from the package of the same name.

microbenchmark({
  # your code
})
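As a rough sketch of the calling pattern, microbenchmark() also accepts several named expressions plus a times argument that sets how often each one is evaluated. The example below deliberately uses base R sorting on a hypothetical local vector rather than Spark, so it runs anywhere the microbenchmark package is installed; the names sort_call and order_call are illustrative only.

```r
library(microbenchmark)

# Hypothetical local data to sort.
set.seed(1)
values <- runif(1e4)

# Each named argument is an expression to time; times = 10 evaluates
# each expression ten times.
timings <- microbenchmark(
  sort_call = sort(values),
  order_call = values[order(values)],
  times = 10
)

# Printing shows summary statistics of the run times for each expression.
print(timings)
```

Naming the expressions makes the printed summary easier to read, because each row is labelled with the name rather than the full code.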

You can learn more about profiling the speed of your code in the Writing Efficient R Code course.

This is a part of the course

“Introduction to Spark with sparklyr in R”


Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Use microbenchmark() to compare how long it takes to perform the following actions.
    • Use arrange() to order the rows of track_metadata_tbl by year, then artist_name, then release, then title.
    • Collect the result.
    • Do the same thing again, this time using sdf_sort() rather than arrange(). Remember to quote the column names.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Compare timings of arrange() and sdf_sort()
microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    ___ %>%
    # Collect the result
    ___,
  times = 5
)
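One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. To keep the sketch self-contained it assumes a local Spark installation (for example, one set up with spark_install()) and replaces the real track metadata with a tiny hypothetical data frame; in the exercise itself, spark_conn and track_metadata_tbl already exist and the setup and teardown lines are not needed.

```r
library(sparklyr)
library(dplyr)
library(microbenchmark)

# Assumption: Spark is installed locally. In the exercise, spark_conn is
# already provided.
spark_conn <- spark_connect(master = "local")

# Hypothetical stand-in for the track metadata; the real table is much larger.
tracks <- data.frame(
  year = c(1999L, 1987L, 1999L, 1987L),
  artist_name = c("B", "A", "A", "A"),
  release = c("r2", "r1", "r1", "r1"),
  title = c("t1", "t2", "t3", "t1"),
  stringsAsFactors = FALSE
)
track_metadata_tbl <- copy_to(spark_conn, tracks, "track_metadata", overwrite = TRUE)

# Compare timings of arrange() and sdf_sort()
timings <- microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    arrange(year, artist_name, release, title) %>%
    # Collect the result
    collect(),
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    sdf_sort(c("year", "artist_name", "release", "title")) %>%
    # Collect the result
    collect(),
  times = 5
)
print(timings)

spark_disconnect(spark_conn)
```

Both pipelines end in collect() so that the timing covers running the sort in Spark and pulling the rows back into R. Timings on a four-row table say little about performance on the full dataset, so treat the printed numbers from this sketch as a check that the code runs rather than as a benchmark result.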
