
Getting started

1. Getting started

Hi, I'm Richie. Welcome to the course! In this chapter you are going to learn how to work with Spark using sparklyr's dplyr interface. Before we get to that, let's take a moment to explore what Spark is.

2. R logo

R is a wonderful tool for data analysis, but by default the amount of data that you can process with it is limited to what fits in RAM on a single computer. For many datasets, that isn't a problem, but when you have really big data, you can run into trouble.

3. Apache Spark logo

Spark is a cluster computing platform. That means that your datasets and your computations can be spread across several machines, effectively removing the limit to the size of your datasets. All this happens automatically, so you don't need to worry about how your data is split up.

4. Sparklyr logo

sparklyr is an R package that lets you access Spark from R. That means you get the power of R's easy-to-write syntax, and the power of Spark's virtually unlimited data handling. The icing on the cake is that sparklyr uses dplyr syntax, so once you know dplyr, you are halfway to knowing sparklyr.

5. Connect-work-disconnect

The most important thing you will learn in this chapter is the workflow pattern. First you connect to Spark, then you do your work, then you disconnect. Since connecting to Spark takes several seconds, it is sensible to connect once at the start of the day, and disconnect again at the end.
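The workflow pattern can be sketched like this. This is a minimal illustration, assuming sparklyr is installed and a local Spark installation is available (the dataset name `mtcars` is just an example):

```r
library(sparklyr)
library(dplyr)

# Connect once: "local" runs Spark on your own machine,
# which is handy for learning and testing.
sc <- spark_connect(master = "local")

# Work: copy a local data frame to Spark, then query it.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  filter(cyl == 6) %>%
  summarize(mean_mpg = mean(mpg))

# Disconnect when you are done for the day.
spark_disconnect(sc)
```

On a real cluster you would pass the cluster's master URL to `spark_connect()` instead of `"local"`, but the connect-work-disconnect shape stays the same.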

6. dplyr verbs

dplyr provides a grammar of data transformation. There are five main transformations that you can apply to a dataset. You can select columns, filter rows, arrange the order of rows, change columns or add new columns, and calculate summary statistics. These transformations work on local data frames and on Spark data frames.
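Here is a sketch of those five verbs applied to a Spark data frame, again assuming a local connection and using `mtcars` purely as an example dataset (the unit conversion in `mutate()` is illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  select(mpg, cyl, wt) %>%         # select columns
  filter(wt < 3.5) %>%             # filter rows
  arrange(desc(mpg)) %>%           # arrange the order of rows
  mutate(wt_kg = wt * 453.6) %>%   # change columns or add new ones
  summarize(mean_mpg = mean(mpg))  # calculate summary statistics

spark_disconnect(sc)
```

The same pipeline runs unchanged on a local data frame — that's the point: one grammar, two backends.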

7. Let's practice!

Don't be afraid of the problems; I'll see you in the course.