Transforming continuous variables to logical

Logical variables are nice because it is often easier to think about things in "yes or no" terms rather than in numeric terms. For example, if someone asks you "Would you like a cup of tea?", a yes or no response is preferable to "There is a 0.73 chance of me wanting a cup of tea". This has real data science applications too. For example, a test for diabetes may return the glucose concentration in a patient's blood plasma as a number. What you really care about is "Does the patient have diabetes?", so you need to convert the number into a logical value, based upon some threshold.

In base R, a single vectorized comparison does this:

threshold_mmol_per_l <- 7
has_diabetes <- plasma_glucose_concentration > threshold_mmol_per_l
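To see the comparison at work, here is a minimal, self-contained sketch. The glucose readings are made-up values for illustration only; they do not come from any real dataset.

```r
# Hypothetical plasma glucose readings in mmol/l (illustrative values only).
plasma_glucose_concentration <- c(4.8, 7.0, 7.1, 11.3)

threshold_mmol_per_l <- 7

# The comparison is vectorized: it returns one logical value per reading.
has_diabetes <- plasma_glucose_concentration > threshold_mmol_per_l

print(has_diabetes)
```

Because the comparison is strict, a reading exactly equal to the threshold gives FALSE.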

All the sparklyr feature transformation functions have a similar user interface. The first three arguments are always a Spark tibble, a string naming the input column, and a string naming the output column. That is, they follow this pattern:

a_tibble %>%
  ft_some_transformation("x", "y", some_other_args)

The sparklyr way of converting a continuous variable into logical uses ft_binarizer(). The previous diabetes example can be rewritten as follows. Note that the threshold value should be a number, not a string referring to a column in the dataset.

diabetes_data %>%
  ft_binarizer("plasma_glucose_concentration", "has_diabetes", threshold = threshold_mmol_per_l)

In keeping with the Spark philosophy of using DoubleType everywhere, the output from ft_binarizer() isn't actually logical; it is numeric, with 1 for values greater than the threshold and 0 otherwise. This is the correct approach for letting you continue to work in Spark and perform other transformations, but if you want to process your data in R, you have to remember to explicitly convert the data to logical. The following is a common code pattern.

a_tibble %>%
  ft_binarizer("x", "is_x_big", threshold = threshold) %>%
  collect() %>%
  mutate(is_x_big = as.logical(is_x_big))
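The last step of that pattern can be tried without a Spark connection. The sketch below uses a plain numeric vector as a stand-in for a collected column; the values are hypothetical, and the name is_x_big_lgl is introduced here only for illustration.

```r
# Stand-in for a column collected from Spark after ft_binarizer():
# doubles that are either 0 or 1 (hypothetical values).
is_x_big <- c(0, 1, 1, 0)

# The column arrives in R as numeric, not logical.
print(is.numeric(is_x_big))

# as.logical() maps 0 to FALSE and any non-zero number to TRUE.
is_x_big_lgl <- as.logical(is_x_big)

print(is_x_big_lgl)
```

Inside a dplyr pipeline, the same conversion is what mutate(is_x_big = as.logical(is_x_big)) applies to the collected tibble.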

This exercise considers the appallingly named artist_hotttnesss field, which provides a measure of how much media buzz the artist had at the time the dataset was created. If you would like to learn more about drawing plots using the ggplot2 package, please take the Data Visualization with ggplot2 (Part 1) course.

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

Create a variable named hotttnesss from track_metadata_tbl.
- Select the artist_hotttnesss field.
- Use ft_binarizer() to create a new field, is_hottt_or_nottt, which is true when artist_hotttnesss is greater than 0.5.
- Collect the result.
- Convert the is_hottt_or_nottt field to be logical.
Draw a ggplot() bar plot of is_hottt_or_nottt.
- The first argument to ggplot() is the data argument, hotttnesss.
- The second argument to ggplot() is the aesthetic, is_hottt_or_nottt wrapped in aes().
- Add geom_bar() to draw the bars.
