The dplyr methods that you saw in the previous two chapters use Spark's SQL interface. That is, they convert your R code into SQL code before passing it to Spark. This is an excellent solution for basic data manipulation, but it runs into problems when you want to do more complicated processing. For example, you can calculate the mean of a column, but not the median. Here is the example from the 'Summarizing columns' exercise that you completed in Chapter 1.
track_metadata_tbl %>%
  summarize(mean_duration = mean(duration))        # OK
track_metadata_tbl %>%
  summarize(median_duration = median(duration))    # fails: no median via the SQL interface
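If you want to see what the SQL interface is doing, you can pipe a query into show_query() to print the SQL that would be sent to Spark instead of running it (a quick sketch, reusing the same table):

track_metadata_tbl %>%
  summarize(mean_duration = mean(duration)) %>%
  show_query()   # prints the translated SQL rather than executing the query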
sparklyr also has two "native" interfaces that will be discussed in the next two chapters. Native means that they call Java or Scala code to access Spark libraries directly, without any conversion to SQL.
sparklyr supports the Spark DataFrame Application Programming Interface (API), with functions that have an sdf_ prefix. It also supports access to Spark's machine learning library, MLlib, with "feature transformation" functions that begin with ft_, and "machine learning" functions that begin with ml_.
One important philosophical difference between working with R and working with Spark is that Spark is much stricter about variable types than R. Most of the native functions want DoubleType inputs and return DoubleType outputs. DoubleType is Spark's equivalent of R's numeric vector type. sparklyr will handle converting numeric data to DoubleType, but it is up to the user (that's you!) to convert integer data into numeric data and back again.
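As a sketch of that conversion pattern (the year column and the is_recent output column are assumed for illustration), you can cast to numeric before a feature transformation and cast the result back afterwards:

track_metadata_tbl %>%
  mutate(year = as.numeric(year)) %>%                       # integer -> DoubleType
  ft_binarizer("year", "is_recent", threshold = 2000) %>%   # expects a double input column
  mutate(is_recent = as.logical(is_recent))                 # double 0/1 back to logical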
Which of these statements is true?
dplyr methods convert code into Scala code before running it on Spark.
Most of the native functions want DoubleType inputs and return DoubleType outputs.
Most of the native functions want IntegerType inputs and return IntegerType outputs.