Transforming continuous variables into categorical (2)

A special case of the previous transformation is to cut a continuous variable into buckets where the buckets are defined by quantiles of the variable. A common use of this transformation is to analyze survey responses or review scores. If you ask people to rate something from one to five stars, often the median response won't be three stars. In this case, it can be useful to split their scores up by quantile. For example, you can make five quintile groups by splitting at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles.
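To see why the midpoint of the scale can be a poor split point, here is a minimal base-R sketch. The star_ratings vector is made-up illustration data, not taken from the course. quantile() with its default settings returns the six break points that bound five quintile groups.

```r
# Hypothetical star ratings (illustration only), skewed toward high scores
star_ratings <- c(1, 2, 3, 3, 4, 4, 4, 5, 5, 5)

# The median is 4 stars, not the scale midpoint of 3
median_rating <- median(star_ratings)

# Six break points (0th to 100th percentile in steps of 20) bound five groups
quintile_breaks <- quantile(star_ratings, probs = seq(0, 1, by = 0.2))
print(median_rating)
print(quintile_breaks)
```

With heavily tied data such as whole-star ratings, neighboring break points can coincide, and cut() rejects duplicated breaks, so it is worth inspecting the break points before binning.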

The base-R way of doing this is to combine cut() with quantile(). The sparklyr equivalent is the ft_quantile_discretizer() transformation. This takes a num_buckets argument, which determines the number of buckets. The base-R and sparklyr ways of calculating this are shown together. As before, right = FALSE and include.lowest = TRUE are set in the base-R call.

# Base R: cut survey_score at its quartiles
survey_response_group <- cut(
  survey_score,
  breaks = quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("hate it", "dislike it", "like it", "love it"),
  right  = FALSE,
  include.lowest = TRUE
)

# sparklyr: let Spark split survey_score into four quantile buckets
survey_data %>%
  ft_quantile_discretizer("survey_score", "survey_response_group", num_buckets = 4)

As with ft_bucketizer(), the resulting bins are numbers, counting from zero. If you want to work with them in R, explicitly convert to a factor.
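The conversion can be sketched in base R as follows. The bin_index vector stands in for a collected zero-based bin column, and the labels reuse the ones from the cut() example above. Both are assumptions for illustration rather than output from a real Spark job.

```r
# Stand-in for a collected bin column: zero-based bucket numbers (assumed data)
bin_index <- c(0, 3, 1, 2, 0)
response_labels <- c("hate it", "dislike it", "like it", "love it")

# Map bin 0 to the first label, bin 1 to the second, and so on
survey_response_group <- factor(bin_index, levels = 0:3, labels = response_labels)
print(survey_response_group)
```

Passing levels explicitly keeps the labels aligned with the bin numbers even if one of the bins happens to be empty in the collected data.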

This is a part of the course

“Introduction to Spark with sparklyr in R”

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. duration_labels is a character vector describing lengths of time.

  • Create a variable named familiarity_by_duration from track_metadata_tbl.
    • Select the duration and artist_familiarity fields.
    • Use ft_quantile_discretizer() to create a new field, duration_bin, made from 5 quantile bins of duration.
    • Collect the result.
    • Convert the duration_bin field to a factor with labels duration_labels.
  • Draw a ggplot() box plot of artist_familiarity by duration_bin.
    • The first argument to ggplot() is the data argument, familiarity_by_duration.
    • The second argument to ggplot() is the aesthetic, which takes duration_bin and artist_familiarity wrapped in aes().
    • Add geom_boxplot() to draw the boxes.
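
The plotting step can be tried without a Spark connection. The sketch below assumes ggplot2 is installed and uses a small made-up data frame, local_tracks, with hypothetical bin labels in place of the collected familiarity_by_duration. It is not the exercise solution.

```r
library(ggplot2)

# Made-up local data standing in for a collected result (illustration only)
local_tracks <- data.frame(
  duration_bin = factor(
    c(0, 0, 1, 1, 2, 2),
    levels = 0:2,
    labels = c("short", "medium", "long")  # hypothetical labels
  ),
  artist_familiarity = c(0.42, 0.55, 0.61, 0.48, 0.70, 0.66)
)

# Data first, then the aesthetic mapping: x is the bin, y is the familiarity
familiarity_plot <- ggplot(local_tracks, aes(duration_bin, artist_familiarity)) +
  geom_boxplot()
```

Printing familiarity_plot renders one box per bin.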

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl, duration_labels have been pre-defined
track_metadata_tbl
duration_labels

familiarity_by_duration <- track_metadata_tbl %>%
  # Select duration and artist_familiarity
  ___ %>%
  # Bucketize duration
  ___ %>%
  # Collect the result
  ___ %>%
  # Convert duration bin to factor
  ___

# Draw a boxplot of artist_familiarity by duration_bin
ggplot(___, aes(___, ___)) +
  ___()  

Learn how to run big data analysis using Spark and the sparklyr package in R, and explore Spark MLlib in just 4 hours.

In which you learn about Spark's machine learning data transformation features, and functionality for manipulating native DataFrames.
