
Popcorn double feature

The dplyr methods that you saw in the previous two chapters use Spark's SQL interface. That is, they convert your R code into SQL code before passing it to Spark. This is an excellent solution for basic data manipulation, but it runs into problems when you want to do more complicated processing. For example, you can calculate the mean of a column, but not the median. Here is the example from the 'Summarizing columns' exercise that you completed in Chapter 1.

track_metadata_tbl %>%
  summarize(mean_duration = mean(duration)) # OK
track_metadata_tbl %>%
  summarize(median_duration = median(duration)) # not supported through the SQL interface
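
To see the conversion to SQL for yourself, you can ask dplyr to print the query it generates instead of running it. The following is a minimal sketch, not part of the original exercise. It assumes a local Spark installation is available, and it uses a tiny made-up data frame as a stand-in for the course's track_metadata_tbl.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
spark_conn <- spark_connect(master = "local")

# Hypothetical stand-in for the course's track_metadata_tbl.
track_metadata_tbl <- copy_to(
  spark_conn,
  data.frame(duration = c(180.5, 240.0, 215.25)),
  name = "track_metadata",
  overwrite = TRUE
)

# Build the query lazily; nothing is computed yet.
mean_query <- track_metadata_tbl %>%
  summarize(mean_duration = mean(duration, na.rm = TRUE))

# Print the SQL that dplyr generated from the R code.
show_query(mean_query)

# Run the query on Spark and pull the one-row result back into R.
mean_result <- collect(mean_query)

spark_disconnect(spark_conn)
```

The printed query is ordinary SQL, which is why only R functions that have a SQL translation can be used inside summarize() here.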

sparklyr also has two "native" interfaces that will be discussed in the next two chapters. Native means that they call Java or Scala code to access Spark libraries directly, without any conversion to SQL. sparklyr supports the Spark DataFrame Application Programming Interface (API), with functions that have an sdf_ prefix. It also supports access to Spark's machine learning library, MLlib, with "feature transformation" functions that begin with ft_, and "machine learning" functions that begin with ml_.

One important philosophical difference between working with R and working with Spark is that Spark is much stricter about variable types than R. Most of the native functions want DoubleType inputs and return DoubleType outputs. DoubleType is Spark's equivalent of R's numeric vector type. sparklyr will handle converting numeric to DoubleType, but it is up to the user (that's you!) to convert logical or integer data into numeric data and back again.
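
As a rough illustration of that conversion, the sketch below casts an integer column and a logical column to numeric with as.numeric() inside mutate(), then casts one of them back. It assumes a local Spark installation. The table name tracks_example and its columns are made up for this example and do not come from the course data.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
spark_conn <- spark_connect(master = "local")

# Hypothetical data with one integer and one logical column.
tracks_tbl <- copy_to(
  spark_conn,
  data.frame(year = c(1999L, 2005L), is_live = c(TRUE, FALSE)),
  name = "tracks_example",
  overwrite = TRUE
)

converted_tbl <- tracks_tbl %>%
  mutate(
    year_dbl = as.numeric(year),        # integer -> numeric
    is_live_dbl = as.numeric(is_live),  # logical -> numeric
    year_int = as.integer(year_dbl)     # and back to integer
  )

# sdf_schema() lists the Spark-side name and type of every column.
schema <- sdf_schema(converted_tbl)
column_types <- sapply(schema, function(col) col$type)
print(column_types)

spark_disconnect(spark_conn)
```

Checking the schema before calling a native function is a cheap way to confirm that every input column has the type the function expects.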

Which of these statements is true?

  1. sparklyr's dplyr methods convert code into Scala code before running it on Spark.
  2. Converting R code into SQL code limits the number of supported computations.
  3. Most Spark MLlib modeling functions require DoubleType inputs and return DoubleType outputs.
  4. Most Spark MLlib modeling functions require IntegerType inputs and return BooleanType outputs.