Common people

The distinct() function showed you the unique values. It can also be useful to know how many of each value you have. The base-R function for this is table(), but it isn't supported in sparklyr because it doesn't conform to the tidyverse philosophy of keeping everything in tibbles. Instead, you must use count(). To use it, pass the unquoted names of the columns. For example, to find the counts of distinct combinations of columns x, y, and z, you would type the following.

a_tibble %>%
  count(x, y, z)

The result is the same as

a_tibble %>%
  distinct(x, y, z)

… except that you get an extra column, n, that contains the counts.
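To make the comparison concrete, here is a minimal sketch that runs count() and distinct() side by side. It assumes a small local tibble with made-up values standing in for a Spark tibble; the column names x and y and the data are purely illustrative.

```r
library(dplyr)

# Hypothetical local data standing in for a Spark tibble
a_tibble <- tibble(
  x = c("a", "a", "b", "b", "b"),
  y = c(1, 1, 2, 2, 3)
)

# One row per distinct (x, y) combination, plus a count column n
counts <- a_tibble %>%
  count(x, y)

# The same combinations, without the counts
uniques <- a_tibble %>%
  distinct(x, y)

print(counts)
print(uniques)
```

Both results have one row per combination; only the count() result carries the n column.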

A really nice use of count() is to get the most common values of something. To do this, you call count() with the argument sort = TRUE, which sorts the rows by descending values of the n column, then use slice_max() to restrict the results to the top however-many values. slice_max() needs to know which column to rank by, so you pass it the n column first, followed by the number of rows to keep as the n argument. (slice_max() plays a role similar to base-R's head() on sorted data, and it also works with remote datasets such as those in Spark.) For example, to get the top 20 most common combinations of the x, y, and z columns, use the following.

a_tibble %>%
  count(x, y, z, sort = TRUE) %>%
  slice_max(n, n = 20)
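The following is a small sketch of the same pattern on local data, so it can be run without a Spark connection. The artists tibble and its values are hypothetical, and the behavior shown is that of dplyr on a local tibble; a Spark tibble is assumed to behave the same way but is not tested here.

```r
library(dplyr)

# Hypothetical local data: one row per track, labelled by artist
artists <- tibble(
  artist = c("A", "B", "A", "C", "A", "B", "D")
)

top2 <- artists %>%
  # Count tracks per artist, most frequent first
  count(artist, sort = TRUE) %>%
  # Rank by the n column and keep the 2 highest counts
  slice_max(n, n = 2)

print(top2)
```

Note that slice_max() keeps ties by default, so it can return more rows than requested when several values share the cutoff count. Passing with_ties = FALSE limits the result to exactly the requested number of rows.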

This exercise is part of the course

Introduction to Spark with sparklyr in R

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Count the values in the artist_name column from track_metadata_tbl.
    • Pass sort = TRUE to sort the rows by descending popularity.
  • Restrict the results to the top 20 using slice_max().

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Count the artist_name values
  ___ %>%
  # Restrict to top 20
  ___