Transforming continuous variables into categorical (2)
A special case of the previous transformation is to cut a continuous variable into buckets where the buckets are defined by quantiles of the variable. A common use of this transformation is to analyze survey responses or review scores. If you ask people to rate something from one to five stars, often the median response won't be three stars. In this case, it can be useful to split their scores up by quantile. For example, you can make five quintile groups by splitting at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles.
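As a minimal sketch of the idea, the quintile cut points can be computed with quantile(). The scores vector here is made up for illustration and is not part of the course data.

```r
# Hypothetical review scores, skewed toward the high end (illustrative only)
scores <- c(1.2, 2.5, 3.1, 3.8, 4.0, 4.2, 4.4, 4.6, 4.8, 5.0)

# The median sits well above the midpoint of the scale
median(scores)

# Cut points at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles
quintile_breaks <- quantile(scores, probs = seq(0, 1, by = 0.2))
quintile_breaks

# Assign each score to a quintile (1 to 5); the top cut point is included
bucket <- findInterval(scores, unname(quintile_breaks), rightmost.closed = TRUE)

# Each quintile holds the same number of scores
table(bucket)
```

Because the cut points come from the data rather than from the rating scale, the groups end up equally populated even though the scores are skewed.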
The base-R way of doing this is to combine cut() with quantile(). The sparklyr equivalent is the ft_quantile_discretizer() transformation, which takes a num_buckets argument that determines the number of buckets. Both approaches are shown together below. As before, right = FALSE and include.lowest = TRUE are set in the call to cut().
survey_response_group <- cut(
  survey_score,
  breaks = quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("hate it", "dislike it", "like it", "love it"),
  right = FALSE,
  include.lowest = TRUE
)

survey_data %>%
  ft_quantile_discretizer("survey_score", "survey_response_group", num_buckets = 4)
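To see why include.lowest = TRUE matters when right = FALSE, here is a small base-R sketch. The survey_score values are invented for illustration. With right = FALSE each interval is closed on the left and open on the right, so the maximum value would fall outside the last interval unless include.lowest = TRUE closes it.

```r
# Hypothetical scores (illustrative only)
survey_score <- c(1, 2, 3, 4, 5, 6, 7, 8)
breaks <- quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1))
labels <- c("hate it", "dislike it", "like it", "love it")

# Without include.lowest, the top interval is open on the right,
# so the maximum score is not assigned to any bucket
open_top <- cut(survey_score, breaks = breaks, labels = labels, right = FALSE)
sum(is.na(open_top))

# With include.lowest = TRUE, the top interval also includes the maximum
closed_top <- cut(
  survey_score,
  breaks = breaks,
  labels = labels,
  right = FALSE,
  include.lowest = TRUE
)
sum(is.na(closed_top))
```

Despite its name, include.lowest applies to the highest break when right = FALSE, which is the case used here.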
As with ft_bucketizer(), the resulting bins are numbers, counting from zero. If you want to work with them in R, explicitly convert to a factor.
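The conversion to a factor might look like the following sketch. The bin_index vector is a made-up stand-in for a collected column of zero-based bin numbers; it is not produced by Spark here.

```r
# Stand-in for a collected result column: zero-based bin numbers (illustrative only)
bin_index <- c(0, 3, 1, 2, 2, 0)
labels <- c("hate it", "dislike it", "like it", "love it")

# Giving levels explicitly keeps the label mapping stable
# even if one of the bins happens to be empty
survey_response_group <- factor(bin_index, levels = 0:3, labels = labels)

survey_response_group
table(survey_response_group)
```

Passing only labels also works when every bin is present, as in the exercise; setting levels as well is simply a more defensive choice.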
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. duration_labels is a character vector describing lengths of time.
- Create a variable named familiarity_by_duration from track_metadata_tbl.
  - Select the duration and artist_familiarity fields.
  - Use ft_quantile_discretizer() to create a new field, duration_bin, made from 5 quantile bins of duration.
  - Collect the result.
  - Convert the duration_bin field to a factor with labels duration_labels.
- Draw a ggplot() box plot of artist_familiarity by duration_bin.
  - The first argument to ggplot() is the data argument, familiarity_by_duration.
  - The second argument to ggplot() is the aesthetic, which takes duration_bin and artist_familiarity wrapped in aes().
  - Add geom_boxplot() to draw the bars.
Interactive exercise
Complete the sample code to successfully finish this exercise.
# track_metadata_tbl, duration_labels have been pre-defined
track_metadata_tbl
duration_labels

familiarity_by_duration <- track_metadata_tbl %>%
  # Select duration and artist_familiarity
  ___ %>%
  # Bucketize duration
  ___ %>%
  # Collect the result
  ___ %>%
  # Convert duration bin to factor
  ___

# Draw a boxplot of artist_familiarity by duration_bin
ggplot(___, aes(___, ___)) +
  ___()