SQL and Parquet
Parquet files work well as a backing data store for SQL queries in Spark. While the same queries can be run directly through Spark's DataFrame methods in Python, it is sometimes easier to express them as SQL alongside your Python code.
For this example, we're going to read in the Parquet file we created in the last exercise and register it as a SQL table. Once registered, we'll run a quick query against the table (aka, the Parquet file).
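As a quick illustration of the SQL-versus-DataFrame point above, here is a minimal sketch of the two equivalent approaches. It assumes the provided spark session and the AA_DFW_ALL.parquet file from the exercise; the df name is just illustrative.

# A minimal sketch: the same aggregation two ways
from pyspark.sql import functions as F

df = spark.read.parquet('AA_DFW_ALL.parquet')

# DataFrame API: aggregate with Spark's Python functions
df.agg(F.avg('flight_duration')).show()

# SQL: register a temp view, then query it by name
df.createOrReplaceTempView('flights')
spark.sql('SELECT avg(flight_duration) FROM flights').show()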
The spark object and the AA_DFW_ALL.parquet file are available for you automatically.
Exercise instructions
- Import the AA_DFW_ALL.parquet file into flights_df.
- Use the createOrReplaceTempView method to alias the flights table.
- Run a Spark SQL query against the flights table.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Read the Parquet file into flights_df
flights_df = spark.read.____(____)
# Register the temp table
flights_df.____('flights')
# Run a SQL query of the average flight duration
avg_duration = spark.____('SELECT avg(flight_duration) from flights').collect()[0]
print('The average flight time is: %d' % avg_duration)
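For reference, here is one way to fill in the blanks, following the instructions above:

# Read the Parquet file into flights_df
flights_df = spark.read.parquet('AA_DFW_ALL.parquet')

# Register the temp table
flights_df.createOrReplaceTempView('flights')

# Run a SQL query of the average flight duration
avg_duration = spark.sql('SELECT avg(flight_duration) from flights').collect()[0]
print('The average flight time is: %d' % avg_duration)

Note that collect()[0] returns a single Row; because Row subclasses tuple, the % formatting unpacks it into the %d slot. Using collect()[0][0] to pull out the numeric value explicitly works just as well.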