Transforming continuous variables into categorical (2)
A special case of the previous transformation is to cut a continuous variable into buckets where the buckets are defined by quantiles of the variable. A common use of this transformation is to analyze survey responses or review scores. If you ask people to rate something from one to five stars, often the median response won't be three stars. In this case, it can be useful to split their scores up by quantile. For example, you can make five quintile groups by splitting at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles.
The base-R way of doing this is cut() + quantile(). The sparklyr equivalent uses the ft_quantile_discretizer() transformation. This takes a num_buckets argument, which determines the number of buckets. The base-R and sparklyr ways of calculating this are shown together. As before, right = FALSE and include.lowest = TRUE are set.
survey_response_group <- cut(
  survey_score,
  breaks = quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("hate it", "dislike it", "like it", "love it"),
  right = FALSE,
  include.lowest = TRUE
)

survey_data %>%
  ft_quantile_discretizer("survey_score", "survey_response_group", num_buckets = 4)
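The base-R half of the comparison can be tried without a Spark connection. The following is a minimal sketch, assuming a made-up survey_score vector (the whole numbers 0 to 100, chosen only so the quartile breaks are easy to read); it is not data from the course.

```r
# Hypothetical scores, invented for illustration: the integers 0..100
survey_score <- 0:100

# Quartile breakpoints: the 0th, 25th, 50th, 75th and 100th percentiles
score_breaks <- quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1))

# Five breakpoints define four buckets, so four labels are needed
survey_response_group <- cut(
  survey_score,
  breaks = score_breaks,
  labels = c("hate it", "dislike it", "like it", "love it"),
  right = FALSE,          # intervals are closed on the left: [a, b)
  include.lowest = TRUE   # with right = FALSE, this closes the last interval
)

# Count how many scores landed in each bucket
table(survey_response_group)
```

With right = FALSE, include.lowest = TRUE is what keeps the maximum score from becoming NA, so in this toy data the last bucket ends up with one more value than the others.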
As with ft_bucketizer(), the resulting bins are numbers, counting from zero. If you want to work with them in R, explicitly convert to a factor.
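The conversion to a factor can be sketched in plain R. The bin numbers below are hypothetical stand-ins for what a collected Spark column might contain; bins and bin_labels are names introduced only for this illustration.

```r
# Hypothetical zero-based bin numbers, standing in for a collected Spark column
bins <- c(0, 3, 1, 2, 0, 3)

# One label per bucket, in bucket order
bin_labels <- c("hate it", "dislike it", "like it", "love it")

# Map bin 0 to the first label, bin 1 to the second, and so on
survey_response_group <- factor(bins, levels = 0:3, labels = bin_labels)

survey_response_group
```

Spelling out levels = 0:3 keeps the label order tied to the bin numbers even if some bin happens to be empty in the collected data.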
This is a part of the course “Introduction to Spark with sparklyr in R”
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. duration_labels is a character vector describing lengths of time.
- Create a variable named familiarity_by_duration from track_metadata_tbl.
  - Select the duration and artist_familiarity fields.
  - Use ft_quantile_discretizer() to create a new field, duration_bin, made from 5 quantile bins of duration.
  - Collect the result.
  - Convert the duration_bin field to a factor with labels duration_labels.
- Draw a ggplot() box plot of artist_familiarity by duration_bin.
  - The first argument to ggplot() is the data argument, familiarity_by_duration.
  - The second argument to ggplot() is the aesthetic, which takes duration_bin and artist_familiarity wrapped in aes().
  - Add geom_boxplot() to draw the box plot.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl, duration_labels have been pre-defined
track_metadata_tbl
duration_labels

familiarity_by_duration <- track_metadata_tbl %>%
  # Select duration and artist_familiarity
  ___ %>%
  # Bucketize duration
  ___ %>%
  # Collect the result
  ___ %>%
  # Convert duration bin to factor
  ___

# Draw a boxplot of artist_familiarity by duration_bin
ggplot(___, aes(___, ___)) +
  ___()
The course teaches how to run big data analysis using Spark and the sparklyr package in R, and how to explore Spark MLlib, in just 4 hours.
This chapter covers Spark's machine learning data transformation features and functionality for manipulating native DataFrames.