Working with parquet files

CSV files are great for saving the contents of rectangular data objects (like R data.frames and Spark DataFrames) to disk. The problem is that they are slow to read and write, which makes them impractical for large datasets. Parquet files provide a higher-performance alternative. As well as being used for Spark data, parquet files can be read by other tools in the Hadoop ecosystem, like Shark, Impala, Hive, and Pig.

Technically speaking, "parquet file" is a misnomer: when you store data in parquet format, you actually get a whole directory's worth of files. The data is split across multiple .parquet files, allowing it to be easily stored on multiple machines, and there are also some metadata files describing the contents of each column.
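
As a minimal sketch of what this looks like (assuming a local Spark connection; the table name and path below are illustrative):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark, then save it in parquet format
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")
spark_write_parquet(mtcars_tbl, "path/to/parquet/dir")

# The "parquet file" is really a directory: several part-*.parquet files,
# plus Spark's metadata files
dir("path/to/parquet/dir")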

sparklyr can import parquet files using spark_read_parquet(). This function takes a Spark connection, a string naming the Spark DataFrame that should be created, and a path to the parquet directory. Note that this function imports the data directly into Spark, which is typically faster than importing the data into R and then using copy_to() to copy it from R to Spark.

spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")
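
As a minimal sketch (assuming sc is an open Spark connection and the parquet directory exists), the returned object is a remote table that works directly with dplyr verbs:

library(sparklyr)
library(dplyr)

# Import the parquet directory straight into Spark
a_dataset <- spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")

# The result is a tbl_spark, so you can query it without pulling it into R
a_dataset %>%
  count()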

Exercise instructions

A Spark connection has been created for you as spark_conn. A string pointing to the parquet directory (on the file system where R is running) has been created for you as parquet_dir.

  • Use dir() to list the absolute file paths of the files in the parquet directory, assigning the result to filenames.
    • The first argument should be the directory whose files you are listing, parquet_dir.
    • To retrieve the absolute (rather than relative) file paths, you should also pass full.names = TRUE.
  • Create a data_frame with two columns.
    • filename should contain the filenames you just retrieved, without the directory part. Create this by passing the filenames to basename().
    • size_bytes should contain the file sizes of those files. Create this by passing the filenames to file.size().
  • Use spark_read_parquet() to import the timbre data into Spark, assigning the result to timbre_tbl.
    • The first argument should be the Spark connection.
    • The second argument should be "timbre".
    • The third argument should be parquet_dir.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# parquet_dir has been pre-defined
parquet_dir

# List the files in the parquet dir
filenames <- ___

# Show the filenames and their sizes
data_frame(
  filename = ___,
  size_bytes = ___
)

# Import the data into Spark
timbre_tbl <- ___
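
One possible completion of the blanks above (assuming spark_conn and parquet_dir are defined as described in the exercise instructions) is:

# List the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)

# Show the filenames and their sizes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)

# Import the data into Spark
timbre_tbl <- spark_read_parquet(spark_conn, "timbre", parquet_dir)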