Transforming continuous variables into categorical (1)
A generalization of the previous idea is to have multiple thresholds; that is, you split a continuous variable into "buckets" (or "bins"), just as a histogram does. In base R, you would use cut() for this task. For example, in a study on smoking habits, you could take the typical number of cigarettes smoked per day and transform it into a factor.
smoking_status <- cut(
  cigarettes_per_day,
  breaks = c(0, 1, 10, 20, Inf),
  labels = c("non", "light", "moderate", "heavy"),
  right = FALSE
)
The sparklyr equivalent of this is to use ft_bucketizer(). The code takes a similar format to ft_binarizer(), but this time you must pass a vector of cut points to the splits argument. Here is the same example rewritten in sparklyr style.
smoking_data %>%
  ft_bucketizer(
    "cigarettes_per_day", "smoking_status",
    splits = c(0, 1, 10, 20, Inf)
  )
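This snippet assumes smoking_data already exists as a Spark DataFrame. A minimal sketch of creating it, assuming an open connection named spark_conn (as in the exercise below) and the same made-up values:

# Hypothetical local data frame with invented values
smoking_df <- data.frame(cigarettes_per_day = c(0, 3, 15, 25))
# Copy it into Spark; spark_conn is an assumed open connection
# created earlier with spark_connect()
smoking_data <- sdf_copy_to(spark_conn, smoking_df, overwrite = TRUE)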
There are several important things to note. You may have spotted that the breaks argument of cut() plays the same role as the splits argument of ft_bucketizer(). There is a slight difference in how values on the boundaries are handled, though. By default, cut() includes the upper (right-hand) boundary of each bucket but not the lower, whereas ft_bucketizer() includes the lower (left-hand) boundary of each bucket but not the upper. This means that ft_bucketizer() is equivalent to calling cut() with the argument right = FALSE.
The one exception is the upper-most bucket, where ft_bucketizer() includes values on both boundaries. To match this exactly, you must also set include.lowest = TRUE when using cut().
The final thing to note is that whereas cut() returns a factor, ft_bucketizer() returns a numeric vector, with values in the first bucket returned as zero, values in the second bucket returned as one, values in the third bucket returned as two, and so on. If you want to work on the results in R, you need to explicitly convert to a factor. This is a common code pattern:
a_tibble %>%
  ft_bucketizer("x", "x_buckets", splits = splits) %>%
  collect() %>%
  mutate(x_buckets = factor(x_buckets, labels = labels))
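Applied to the smoking example (reusing the hypothetical smoking_data from above), the pattern might look like this:

smoking_data %>%
  ft_bucketizer(
    "cigarettes_per_day", "smoking_status",
    splits = c(0, 1, 10, 20, Inf)
  ) %>%
  # Pull the result from Spark into R as a tibble
  collect() %>%
  # Turn the 0, 1, 2, 3 bucket indices into labeled factor levels
  mutate(smoking_status = factor(
    smoking_status,
    labels = c("non", "light", "moderate", "heavy")
  ))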
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. decades is a numeric sequence of 1920, 1930, …, 2020, and decade_labels is a text description of those decades.
- Create a variable named hotttnesss_over_time from track_metadata_tbl.
  - Select the artist_hotttnesss and year fields.
  - Convert the year column to numeric.
  - Use ft_bucketizer() to create a new field, decade, which splits the years using decades.
  - Collect the result.
  - Convert the decade field to a factor with labels decade_labels.
- Draw a ggplot() box plot of artist_hotttnesss by decade.
  - The first argument to ggplot() is the data argument, hotttnesss_over_time.
  - The second argument to ggplot() is the aesthetic, which takes decade and artist_hotttnesss wrapped in aes().
  - Add geom_boxplot() to draw the boxes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl, decades, decade_labels have been pre-defined
track_metadata_tbl
decades
decade_labels
hotttnesss_over_time <- track_metadata_tbl %>%
# Select artist_hotttnesss and year
___ %>%
# Convert year to numeric
___ %>%
# Bucketize year to decade using decades vector
___ %>%
# Collect the result
___ %>%
# Convert decade to factor using decade_labels
___
# Draw a boxplot of artist_hotttnesss by decade
ggplot(___, aes(___, ___)) +
___()
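For reference, here is one way to fill in the blanks, following the instructions above (a sketch, not the single canonical solution; it assumes the pre-defined objects listed in the comment):

hotttnesss_over_time <- track_metadata_tbl %>%
  # Select artist_hotttnesss and year
  select(artist_hotttnesss, year) %>%
  # Convert year to numeric (translated to a Spark SQL cast)
  mutate(year = as.numeric(year)) %>%
  # Bucketize year to decade using decades vector
  ft_bucketizer("year", "decade", splits = decades) %>%
  # Collect the result
  collect() %>%
  # Convert decade to factor using decade_labels
  mutate(decade = factor(decade, labels = decade_labels))

# Draw a boxplot of artist_hotttnesss by decade
ggplot(hotttnesss_over_time, aes(decade, artist_hotttnesss)) +
  geom_boxplot()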