Common people
The distinct() function showed you the unique values. It can also be useful to know how many of each value you have. The base-R function for this is table(), but sparklyr does not support it, since it doesn't conform to the tidyverse philosophy of keeping everything in tibbles. Instead, you must use count(). To use it, pass the unquoted names of the columns. For example, to find the counts of distinct combinations of columns x, y, and z, you would type the following.
a_tibble %>%
  count(x, y, z)
The result is the same as
a_tibble %>%
  distinct(x, y, z)
… except that you get an extra column, n, that contains the counts.
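To make the comparison concrete, here is a minimal sketch that runs locally with dplyr instead of against Spark. The toy data and the two-column shape of a_tibble are assumptions made purely for illustration; they are not part of the course material.

```r
library(dplyr)

# Hypothetical toy data standing in for a_tibble (an assumption, not course data).
a_tibble <- tibble(
  x = c("a", "a", "b", "b", "b"),
  y = c(1, 1, 2, 2, 3)
)

# One row per distinct (x, y) combination, plus the count column n.
counts <- a_tibble %>%
  count(x, y)

# The same combinations, without the counts.
combos <- a_tibble %>%
  distinct(x, y)

print(counts)
print(combos)
```

Both results should contain the same combinations of x and y; only counts carries the extra n column. On a remote Spark table the row order is not guaranteed unless you sort explicitly.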
A really nice use of count() is to get the most common values of something. To do this, you call count() with the argument sort = TRUE, which sorts the rows by descending values of the n column, then use slice_max() to restrict the results to the top however-many values. slice_max() takes the column to rank by (here, n) and the number of rows to keep. (slice_max() is similar to base-R's head(), but it works with remote datasets such as those in Spark.) For example, to get the top 20 most common combinations of the x, y, and z columns, use the following.
a_tibble %>%
  count(x, y, z, sort = TRUE) %>%
  slice_max(n, n = 20)
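One detail worth seeing in action is how slice_max() treats ties. The sketch below uses local dplyr and hypothetical toy data (an assumption for illustration only). By default slice_max() keeps every row tied with the last qualifying value, so it can return more rows than requested; with_ties = FALSE caps the result at exactly that many rows.

```r
library(dplyr)

# Hypothetical toy data; only the column name x mirrors the text above.
a_tibble <- tibble(x = c("a", "a", "a", "b", "b", "c", "d", "d"))

# Default behaviour: ties on n are kept, so asking for 2 rows may return more.
top2 <- a_tibble %>%
  count(x, sort = TRUE) %>%
  slice_max(n, n = 2)

# with_ties = FALSE returns exactly 2 rows; which tied row is kept is arbitrary.
top2_strict <- a_tibble %>%
  count(x, sort = TRUE) %>%
  slice_max(n, n = 2, with_ties = FALSE)

print(top2)
print(top2_strict)
```

Here "b" and "d" both occur twice, so the default call returns three rows while the strict call returns two. Whether this matters depends on whether you want a fixed-length result or every value that qualifies.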
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Count the values in the artist_name column from track_metadata_tbl.
  - Pass sort = TRUE to sort the rows by descending popularity.
- Restrict the results to the top 20 using slice_max().
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Count the artist_name values
  ___ %>%
  # Restrict to top 20
  ___
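If you want to try the pattern without a Spark connection, the following is one possible sketch that runs on a small local tibble. The name track_metadata_local and the artist names in it are hypothetical stand-ins, not the real track metadata; the same pipeline shape should apply to track_metadata_tbl, assuming your sparklyr and dbplyr versions support slice_max().

```r
library(dplyr)

# Hypothetical local stand-in for track_metadata_tbl (invented artist names).
track_metadata_local <- tibble(
  artist_name = c("Artist A", "Artist B", "Artist A",
                  "Artist C", "Artist A", "Artist B")
)

top_artists <- track_metadata_local %>%
  # Count the artist_name values, most common first
  count(artist_name, sort = TRUE) %>%
  # Keep the 20 highest counts (fewer rows here, since there are only 3 artists)
  slice_max(n, n = 20)

print(top_artists)
```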