Exploring Spark data types
You've already seen (back in Chapter 1) src_tbls() for listing the DataFrames on Spark that sparklyr can see, and glimpse() for exploring the columns of a tibble on the R side. sparklyr also has a function named sdf_schema() for exploring the columns of a tibble on the Spark side. It's easy to call, though the return value is a little painful to work with.
sdf_schema(a_tibble)
The return value is a list with one element per column; each element is itself a list of two elements, containing the name and the data type of that column. The exercise shows a data transformation that makes the data types easier to view.
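To make that concrete, for a hypothetical tibble with a character column artist_name and a numeric column duration, the return value has roughly this shape (a sketch of the structure, not actual output):

# Approximate shape of sdf_schema()'s return value (hypothetical columns)
list(
  artist_name = list(name = "artist_name", type = "StringType"),
  duration = list(name = "duration", type = "DoubleType")
)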
Here is how R data types map to Spark data types. Other data types are not currently supported by sparklyr.
R type | Spark type
---|---
logical | BooleanType
numeric | DoubleType
integer | IntegerType
character | StringType
list | ArrayType
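As a quick sanity check of this mapping, you could copy a small R data frame to Spark and read back its schema. This is a sketch, not part of the exercise; it assumes the spark_conn connection introduced in the instructions below, and types_demo is a hypothetical name.

# Copy a tiny R data frame to Spark, then inspect how the types mapped
types_demo <- data.frame(
  flag = TRUE,     # logical   -> BooleanType
  value = 1.5,     # numeric   -> DoubleType
  count = 1L,      # integer   -> IntegerType
  label = "a",     # character -> StringType
  stringsAsFactors = FALSE
)
types_demo_tbl <- copy_to(spark_conn, types_demo, overwrite = TRUE)
sdf_schema(types_demo_tbl)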
Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

- Call sdf_schema() to get the schema of the track metadata.
- Run the transformation code on schema to see it in a more readable tibble format.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
# Get the schema
(schema <- ___(___))
# Transform the schema
schema %>%
lapply(function(x) do.call(data_frame, x)) %>%
bind_rows()
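If you want to check your answer, a completed version might look like the sketch below, using sdf_schema() on track_metadata_tbl as described above. Note that newer versions of dplyr deprecate data_frame() in favour of tibble::tibble(), so you may need to substitute accordingly.

# One possible solution: get the schema, then flatten it into a tibble
(schema <- sdf_schema(track_metadata_tbl))

schema %>%
  lapply(function(x) do.call(data_frame, x)) %>%
  bind_rows()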
This exercise is part of the course “Introduction to Spark with sparklyr in R”.