Exercise

Exploring Spark data types

You've already seen (back in Chapter 1) src_tbls() for listing the DataFrames on Spark that sparklyr can see. You've also seen glimpse() for exploring the columns of a tibble on the R side.

sparklyr has a function named sdf_schema() for exploring the columns of a Spark DataFrame through the tibble that refers to it. It's easy to call, but the return value is a little painful to deal with.

sdf_schema(a_tibble)

The return value is a list with one element per column. Each element is itself a list with two elements, containing the name and data type of that column. The exercise shows a data transformation that makes the data types easier to view.
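
To make the shape of the return value concrete, here is a minimal sketch in base R. It builds a hand-written stand-in for the list that sdf_schema() returns and flattens it to one row per column. The column names title, year and duration are hypothetical, not taken from the real track metadata. The sketch runs without a Spark connection, and the exercise's own transformation code may differ.

```r
# Hand-written stand-in for the return value of sdf_schema().
# The column names are hypothetical; on a live connection you would use
# schema <- sdf_schema(a_tibble) instead.
schema <- list(
  title    = list(name = "title",    type = "StringType"),
  year     = list(name = "year",     type = "IntegerType"),
  duration = list(name = "duration", type = "DoubleType")
)

# Pull the name and the type out of each element, one row per column.
schema_df <- data.frame(
  name = unname(vapply(schema, function(col) col$name, character(1))),
  type = unname(vapply(schema, function(col) col$type, character(1))),
  stringsAsFactors = FALSE
)

print(schema_df)
```

Using vapply() with character(1) makes the sketch fail early if any element is missing its name or type, instead of silently producing a ragged result.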

Here is a comparison of how R data types map to Spark data types. Other data types are not currently supported by sparklyr.

R type       Spark type
logical      BooleanType
numeric      DoubleType
integer      IntegerType
character    StringType
list         ArrayType
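
The mapping can be used to predict the Spark type of local data before it is copied to Spark. The following is a sketch in base R: the lookup vector restates the comparison above, and local_cols with its column names is a made-up example. The result is only what the mapping predicts, not something read back from Spark.

```r
# Lookup that restates the comparison: names are R types, values are Spark types.
r_to_spark <- c(
  logical   = "BooleanType",
  numeric   = "DoubleType",
  integer   = "IntegerType",
  character = "StringType",
  list      = "ArrayType"
)

# Made-up local columns, one per R type in the lookup.
local_cols <- list(
  is_live = c(TRUE, FALSE),
  tempo   = c(120.5, 98.0),
  year    = c(1999L, 2004L),
  title   = c("a", "b"),
  tags    = list(c("rock", "pop"), "jazz")
)

# Look up the R class of each column, then the Spark type it maps to.
r_types <- vapply(local_cols, function(col) class(col)[1], character(1))
expected <- r_to_spark[r_types]
names(expected) <- names(local_cols)

print(expected)
```

Note that the literal 120.5 is numeric while 1999L is integer; without the L suffix, a whole number in R is numeric and would map to DoubleType.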

Instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Call sdf_schema() to get the schema of the track metadata.
  • Run the transformation code on schema to see it in a more readable tibble format.