Transforming continuous variables into categorical (2)

A special case of the previous transformation is to cut a continuous variable into buckets where the buckets are defined by quantiles of the variable. A common use of this transformation is to analyze survey responses or review scores. If you ask people to rate something from one to five stars, often the median response won't be three stars. In this case, it can be useful to split their scores up by quantile. For example, you can make five quintile groups by splitting at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles.
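As a minimal sketch of the quintile split described above, the following uses base R on a small made-up vector of continuous scores. The name review_scores and its values are hypothetical and are not part of the course data.

```r
# Hypothetical continuous review scores (illustration only)
review_scores <- c(1.2, 2.5, 3.1, 3.6, 3.9, 4.1, 4.3, 4.6, 4.8, 5.0)

# Break points at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles
quintile_breaks <- quantile(review_scores, probs = seq(0, 1, by = 0.2))

# Assign each score to one of five quintile groups;
# intervals are closed on the left, and the top value is kept in the last group
quintile_group <- cut(
  review_scores,
  breaks = quintile_breaks,
  labels = paste0("Q", 1:5),
  right = FALSE,
  include.lowest = TRUE
)

# Count how many scores fall in each group
table(quintile_group)
```

With heavily tied data, such as whole-star ratings, quantile() can return duplicate break points, which cut() rejects, so this pattern works best on scores with many distinct values.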

The base-R way of doing this is cut() + quantile(). The sparklyr equivalent uses the ft_quantile_discretizer() transformation. This takes a num_buckets argument, which determines the number of buckets. The base-R and sparklyr ways of calculating this are shown together. As before, right = FALSE and include.lowest = TRUE are set.

# Base R: cut survey_score at its quartiles
survey_response_group <- cut(
  survey_score,
  breaks = quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("hate it", "dislike it", "like it", "love it"),
  right  = FALSE,
  include.lowest = TRUE
)

# sparklyr: discretize survey_score into 4 quantile buckets
survey_data %>%
  ft_quantile_discretizer("survey_score", "survey_response_group", num_buckets = 4)

As with ft_bucketizer(), the resulting bins are numbers, counting from zero. If you want to work with them in R, explicitly convert to a factor.
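The factor conversion mentioned above might look like the following sketch. It assumes the bin indices have already been collected into R. The vector bin_index and its values are made up for illustration, and the labels are reused from the survey example.

```r
# Hypothetical zero-based bin indices, as they might look after collecting
bin_index <- c(0, 3, 1, 2, 0, 3)

# Labels for the four buckets, in bucket order
bin_labels <- c("hate it", "dislike it", "like it", "love it")

# Map bucket 0 to the first label, bucket 1 to the second, and so on
survey_response_group <- factor(bin_index, levels = 0:3, labels = bin_labels)

survey_response_group
```

Passing levels explicitly keeps the label order tied to the bucket order, even if some bucket happens to be empty in the collected data.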

This is a part of the course

“Introduction to Spark with sparklyr in R”


Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. duration_labels is a character vector describing lengths of time.

Create a variable named familiarity_by_duration from track_metadata_tbl.
- Select the duration and artist_familiarity fields.
- Use ft_quantile_discretizer() to create a new field, duration_bin, made from 5 quantile bins of duration.
- Collect the result.
- Convert the duration_bin field to a factor with labels duration_labels.
Draw a ggplot() box plot of artist_familiarity by duration_bin.
- The first argument to ggplot() is the data argument, familiarity_by_duration.
- The second argument to ggplot() is the aesthetic, which takes duration_bin and artist_familiarity wrapped in aes().
- Add geom_boxplot() to draw the boxes.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl, duration_labels have been pre-defined
track_metadata_tbl
duration_labels

familiarity_by_duration <- track_metadata_tbl %>%
  # Select duration and artist_familiarity
  ___ %>%
  # Bucketize duration
  ___ %>%
  # Collect the result
  ___ %>%
  # Convert duration bin to factor
  ___

# Draw a boxplot of artist_familiarity by duration_bin
ggplot(___, aes(___, ___)) +
  ___()
