Made for each other
R lets you write data analysis code quickly. With a bit of care, you can also make your code easy to read, which means it is easy to maintain too. In many cases, R also runs your code fast enough.
Unfortunately, R requires that all the data you analyze fit in memory (RAM) on a single machine. That limits how much data you can work with in R. There are a few solutions to this problem, one of which is Spark.
Spark is an open source cluster computing platform. It lets you spread your data and your computations across multiple machines, so you can analyze datasets far larger than any single machine could hold. The two technologies complement each other well: by using R and Spark together, you can write code fast and run code fast!
sparklyr is an R package that lets you write R code to work with data in a Spark cluster. It has a dplyr interface, which means that you can write (more or less) the same dplyr-style R code, whether you are working with data on your machine or on a Spark cluster.
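As a minimal sketch of that workflow (assuming a local Spark installation and the built-in mtcars data frame, neither of which is specified in this section):

library(sparklyr)
library(dplyr)

# Connect to a local Spark cluster
# (assumes Spark is installed locally, e.g. via spark_install())
sc <- spark_connect(master = "local")

# Copy a local data frame to Spark; the result is a remote tbl,
# not a local copy of the data
mtcars_tbl <- copy_to(sc, mtcars)

# The same dplyr verbs work on the remote table;
# sparklyr translates them to Spark SQL behind the scenes
mtcars_tbl %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Close the connection when you are done
spark_disconnect(sc)

The only change needed to run the same summary on a local data frame is to drop the copy_to() step; that is the appeal of the shared dplyr interface.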