In this chapter, you'll learn how Spark manages data and how you can read and write tables from Python.

What is Spark, anyway?

Using Spark in Python

Examining the SparkContext

Using DataFrames

Creating a SparkSession

Viewing tables

Are you query-ious?

Pandafy a Spark DataFrame

Put some Spark in your data

Dropping the middle man

Getting to know PySpark

In this chapter, you'll learn about the pyspark.sql module, which provides optimized data queries to your Spark session.

Creating columns

SQL in a nutshell

SQL in a nutshell (2)

Filtering data

Selecting

Selecting II

Aggregating

Aggregating II

Grouping and Aggregating I

Grouping and Aggregating II

Joining

Joining II

Manipulating data

PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter.

Machine Learning Pipelines

Join the DataFrames

Data types

String to integer

Create a new column

Making a Boolean

Strings and factors

Carrier

Destination

Assemble a vector

Create the pipeline

Test vs. Train

Transform the data

Split the data

Getting started with machine learning pipelines

In this last chapter, you'll apply what you've learned to create a model that predicts which flights will be delayed.

What is logistic regression?

Create the modeler

Cross validation

Create the evaluator

Make a grid

Make the validator

Fit the model(s)

Evaluating binary classifiers

Evaluate the model

Model tuning and selection

Airports

Flights

Planes

In this course, you'll learn how to use Spark from Python! Spark is a tool for doing parallel computation with large datasets, and it integrates well with Python. PySpark is the Python package that makes the magic happen. You'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle this data and build a whole machine learning pipeline to predict whether or not flights will be delayed. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!

Introduction to Python

Learn to wrangle data and build a machine learning pipeline to make predictions with the PySpark Python package. Practice your skills with real-world data.

Introduction to PySpark

Learn to implement distributed data management and machine learning in Spark using the PySpark package.

Foundations of PySpark
