
Working with parquet files

CSV files are great for saving the contents of rectangular data objects (like R data.frames and Spark DataFrames) to disk. The problem is that they are slow to read and write, which makes them impractical for large datasets. Parquet files provide a higher-performance alternative. As well as being used for Spark data, parquet files can be read by other tools in the Hadoop ecosystem, such as Shark, Impala, Hive, and Pig.

Technically speaking, "parquet file" is a misnomer: when you store data in parquet format, you actually get a whole directory's worth of files. The data is split across multiple .parquet files, allowing it to be easily stored on multiple machines, and there are also some metadata files describing the contents of each column.
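You can see this layout for yourself by listing the contents of any parquet directory. A minimal sketch, assuming a hypothetical directory at "path/to/parquet/dir" that was written by Spark:

# List the contents of a hypothetical parquet directory written by Spark
dir("path/to/parquet/dir")
# Typically shows a _SUCCESS marker plus one part-*.parquet file per partition,
# e.g. "_SUCCESS" "part-00000-....snappy.parquet" "part-00001-....snappy.parquet"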

sparklyr can import parquet files using spark_read_parquet(). This function takes a Spark connection, a string naming the Spark DataFrame that should be created, and a path to the parquet directory. Note that this function imports the data directly into Spark, which is typically faster than importing the data into R and then using copy_to() to copy it from R to Spark.

spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")
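For context, here is a minimal end-to-end sketch; the local connection, the table name "a_dataset", and the path are illustrative assumptions rather than part of the exercise:

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (illustrative; the exercise provides spark_conn)
sc <- spark_connect(master = "local")

# Import the parquet directory straight into Spark as the DataFrame "a_dataset"
a_dataset_tbl <- spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")

# The result is a remote table that you can manipulate with dplyr verbs
a_dataset_tbl %>% count()

spark_disconnect(sc)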


Exercise instructions

A Spark connection has been created for you as spark_conn. A string pointing to the parquet directory (on the file system where R is running) has been created for you as parquet_dir.

  • Use dir() to list the absolute file paths of the files in the parquet directory, assigning the result to filenames.
    • The first argument should be the directory whose files you are listing, parquet_dir.
    • To retrieve the absolute (rather than relative) file paths, you should also pass full.names = TRUE.
  • Create a data_frame with two columns.
    • filename should contain the filenames you just retrieved, without the directory part. Create this by passing the filenames to basename().
    • size_bytes should contain the file sizes of those files. Create this by passing the filenames to file.size().
  • Use spark_read_parquet() to import the timbre data into Spark, assigning the result to timbre_tbl.
    • The first argument should be the Spark connection.
    • The second argument should be "timbre".
    • The third argument should be parquet_dir.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code. A possible solution sketch follows the scaffold.

# parquet_dir has been pre-defined
parquet_dir

# List the files in the parquet dir
filenames <- ___

# Show the filenames and their sizes
data_frame(
  filename = ___,
  size_bytes = ___
)

# Import the data into Spark
timbre_tbl <- ___
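If you get stuck, here is one possible way to fill in the blanks, assuming spark_conn and parquet_dir are defined as described in the instructions:

# List the absolute paths of the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)

# Show the filenames (without the directory part) and their sizes in bytes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)

# Import the timbre data directly into Spark
timbre_tbl <- spark_read_parquet(spark_conn, "timbre", parquet_dir)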
