Common people

The distinct() function showed you the unique values. It can also be useful to know how many of each value you have. The base-R function for this is table(), but it isn't supported in sparklyr because it doesn't conform to the tidyverse philosophy of keeping everything in tibbles. Instead, use count(). Pass it the unquoted names of the columns. For example, to find the counts of distinct combinations of columns x, y, and z, you would type the following.

a_tibble %>%
  count(x, y, z)

The result is the same as

a_tibble %>%
  distinct(x, y, z)

… except that you get an extra column, n, that contains the counts.
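The relationship between the two verbs can be checked on a small local tibble. This is a minimal sketch, assuming only that dplyr is installed; the data is made up for illustration and a_tibble here is a local stand-in rather than a Spark table.

```r
library(dplyr)

# Hypothetical toy data standing in for a_tibble
a_tibble <- tibble(
  x = c("a", "a", "b", "b", "b"),
  y = c(1, 1, 2, 2, 3)
)

# One row per unique combination of x and y
unique_rows <- a_tibble %>%
  distinct(x, y)

# The same combinations, plus a column n holding how often each occurs
counted_rows <- a_tibble %>%
  count(x, y)

print(unique_rows)
print(counted_rows)
```

Both results have one row per combination; counted_rows simply carries the extra n column.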

A really nice use of count() is to get the most common values of something. To do this, call count() with the argument sort = TRUE, which sorts the rows by descending values of the n column, then use slice_max() to restrict the results to the top however-many values. slice_max() needs two things: the column to rank by (here, n) and the number of rows to keep, passed as the n argument. (slice_max() is similar to base-R's head(), but it works with remote datasets such as those in Spark.) For example, to get the top 20 most common combinations of the x, y, and z columns, use the following.

a_tibble %>%
  count(x, y, z, sort = TRUE) %>%
  slice_max(n, n = 20)
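The same pattern can be tried locally before running it against Spark. This is a minimal sketch on made-up data, assuming dplyr 1.0.0 or later; the tracks tibble and its contents are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per track
tracks <- tibble(
  artist_name = c("A", "B", "A", "C", "A", "B", "D")
)

top_artists <- tracks %>%
  # Count rows per artist, most frequent first
  count(artist_name, sort = TRUE) %>%
  # Rank by the count column n and keep the 2 largest
  slice_max(n, n = 2)

print(top_artists)
```

Note that slice_max() keeps ties by default (with_ties = TRUE), so the result can contain more rows than requested when several values share the cutoff count.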

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

- Count the values in the artist_name column from track_metadata_tbl.
  - Pass sort = TRUE to sort the rows by descending popularity.
- Restrict the results to the top 20 using slice_max().
