Working with parquet files
CSV files are great for saving the contents of rectangular data objects (like R data.frames and Spark DataFrames) to disk. The problem is that they are slow to read and write, which makes them impractical for large datasets. Parquet files provide a higher-performance alternative. As well as being used for Spark data, parquet files can be used with other tools in the Hadoop ecosystem, such as Shark, Impala, Hive, and Pig.
Technically speaking, "parquet file" is a misnomer. When you store data in parquet format, you actually get a whole directory's worth of files. The data is split across multiple .parquet files, allowing it to be easily stored on multiple machines, and there are some metadata files too, describing the contents of each column.
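To see where such a directory comes from, here is a minimal sketch, assuming an existing Spark DataFrame reference named a_table_tbl (a hypothetical name) and the same placeholder path used below. Writing the data in parquet format creates the directory, which you can then inspect from R.
# Minimal sketch: a_table_tbl is a hypothetical Spark DataFrame reference
# Writing in parquet format creates a directory, not a single file
spark_write_parquet(a_table_tbl, "path/to/parquet/dir")
# List the .parquet files and metadata files that were created
dir("path/to/parquet/dir")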
sparklyr can import parquet files using spark_read_parquet(). This function takes a Spark connection, a string naming the Spark DataFrame that should be created, and a path to the parquet directory. Note that this function will import the data directly into Spark, which is typically faster than importing the data into R, then using copy_to() to copy the data from R to Spark.
spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")
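For comparison, here is a minimal sketch of the slower route mentioned above: reading the data into R first, then copying it to Spark with copy_to(). The CSV path and the local data frame name are hypothetical, for illustration only.
# Hypothetical slower alternative: load into R first, then copy to Spark
local_df <- read.csv("path/to/local/file.csv")
a_dataset_tbl <- copy_to(sc, local_df, "a_dataset")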
This exercise is part of the course “Introduction to Spark with sparklyr in R”.
Exercise instructions
A Spark connection has been created for you as spark_conn. A string pointing to the parquet directory (on the file system where R is running) has been created for you as parquet_dir.
- Use dir() to list the absolute file paths of the files in the parquet directory, assigning the result to filenames.
  - The first argument should be the directory whose files you are listing, parquet_dir.
  - To retrieve the absolute (rather than relative) file paths, you should also pass full.names = TRUE.
- Create a data_frame with two columns.
  - filename should contain the filenames you just retrieved, without the directory part. Create this by passing the filenames to basename().
  - size_bytes should contain the file sizes of those files. Create this by passing the filenames to file.size().
- Use spark_read_parquet() to import the timbre data into Spark, assigning the result to timbre_tbl.
  - The first argument should be the Spark connection.
  - The second argument should be "timbre".
  - The third argument should be parquet_dir.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# parquet_dir has been pre-defined
parquet_dir
# List the files in the parquet dir
filenames <- ___
# Show the filenames and their sizes
data_frame(
  filename = ___,
  size_bytes = ___
)
# Import the data into Spark
timbre_tbl <- ___
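One possible completed version of this sample code, assuming spark_conn and parquet_dir are defined as described above (data_frame() is the dplyr constructor used here; tibble() is its modern replacement):
# parquet_dir has been pre-defined
parquet_dir

# List the absolute paths of the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)

# Show the filenames and their sizes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)

# Import the data into Spark
timbre_tbl <- spark_read_parquet(spark_conn, "timbre", parquet_dir)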