Transforming continuous variables into categorical (1)
A generalization of the previous idea is to have multiple thresholds; that is, you split a continuous variable into "buckets" (or "bins"), just as a histogram does. In base R, you would use cut() for this task. For example, in a study on smoking habits, you could take the typical number of cigarettes smoked per day and transform it into a factor.
smoking_status <- cut(
  cigarettes_per_day,
  breaks = c(0, 1, 10, 20, Inf),
  labels = c("non", "light", "moderate", "heavy"),
  right = FALSE
)
The sparklyr equivalent of this is to use ft_bucketizer(). The code takes a similar format to ft_binarizer(), but this time you must pass a vector of cut points to the splits argument. Here is the same example rewritten in sparklyr style.
smoking_data %>%
  ft_bucketizer(
    "cigarettes_per_day", "smoking_status",
    splits = c(0, 1, 10, 20, Inf)
  )
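This snippet assumes smoking_data already exists as a Spark DataFrame. A minimal sketch of creating it, assuming an open connection named spark_conn (as in the exercise below) and the same made-up values:

# Hypothetical local data frame with invented values
smoking_df <- data.frame(cigarettes_per_day = c(0, 3, 15, 25))
# Copy it into Spark; spark_conn is an assumed open connection
# created earlier with spark_connect()
smoking_data <- sdf_copy_to(spark_conn, smoking_df, overwrite = TRUE)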
There are several important things to note. You may have spotted that the breaks argument of cut() plays the same role as the splits argument of ft_bucketizer(). There is a slight difference in how values on the boundaries are handled, though. By default, cut() includes the upper (right-hand) boundary of each bucket but not the lower, whereas ft_bucketizer() includes the lower (left-hand) boundary of each bucket but not the upper. This means that ft_bucketizer() is equivalent to calling cut() with the argument right = FALSE.
The one exception is the upper-most bucket, where ft_bucketizer() includes values on both boundaries. To match this exactly, you must also set include.lowest = TRUE when using cut().
The final thing to note is that whereas cut() returns a factor, ft_bucketizer() returns a numeric vector, with values in the first bucket returned as zero, values in the second bucket returned as one, values in the third bucket returned as two, and so on. If you want to work on the results in R, you need to explicitly convert to a factor. This is a common code pattern:
a_tibble %>%
  ft_bucketizer("x", "x_buckets", splits = splits) %>%
  collect() %>%
  mutate(x_buckets = factor(x_buckets, labels = labels))
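Applied to the smoking example (reusing the hypothetical smoking_data from above), the pattern might look like this:

smoking_data %>%
  ft_bucketizer(
    "cigarettes_per_day", "smoking_status",
    splits = c(0, 1, 10, 20, Inf)
  ) %>%
  # Pull the result from Spark into R as a tibble
  collect() %>%
  # Turn the 0, 1, 2, 3 bucket indices into labeled factor levels
  mutate(smoking_status = factor(
    smoking_status,
    labels = c("non", "light", "moderate", "heavy")
  ))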
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. decades is a numeric sequence of 1920, 1930, …, 2020, and decade_labels is a text description of those decades.
- Create a variable named hotttnesss_over_time from track_metadata_tbl.
  - Select the artist_hotttnesss and year fields.
  - Convert the year column to numeric.
  - Use ft_bucketizer() to create a new field, decade, which splits the years using decades.
  - Collect the result.
  - Convert the decade field to a factor with labels decade_labels.
- Draw a ggplot() box plot of artist_hotttnesss by decade.
  - The first argument to ggplot() is the data argument, hotttnesss_over_time.
  - The second argument to ggplot() is the aesthetic, which takes decade and artist_hotttnesss wrapped in aes().
  - Add geom_boxplot() to draw the boxes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl, decades, decade_labels have been pre-defined
track_metadata_tbl
decades
decade_labels
hotttnesss_over_time <- track_metadata_tbl %>%
# Select artist_hotttnesss and year
___ %>%
# Convert year to numeric
___ %>%
# Bucketize year to decade using decades vector
___ %>%
# Collect the result
___ %>%
# Convert decade to factor using decade_labels
___
# Draw a boxplot of artist_hotttnesss by decade
ggplot(___, aes(___, ___)) +
___()
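For reference, here is one way to fill in the blanks, following the instructions above (a sketch, not the single canonical solution; it assumes the pre-defined objects listed in the comment):

hotttnesss_over_time <- track_metadata_tbl %>%
  # Select artist_hotttnesss and year
  select(artist_hotttnesss, year) %>%
  # Convert year to numeric (translated to a Spark SQL cast)
  mutate(year = as.numeric(year)) %>%
  # Bucketize year to decade using decades vector
  ft_bucketizer("year", "decade", splits = decades) %>%
  # Collect the result
  collect() %>%
  # Convert decade to factor using decade_labels
  mutate(decade = factor(decade, labels = decade_labels))

# Draw a boxplot of artist_hotttnesss by decade
ggplot(hotttnesss_over_time, aes(decade, artist_hotttnesss)) +
  geom_boxplot()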