A general introduction to PySpark and distributed computing. This section introduces PySpark, PySpark DataFrames, and RDDs.

Introduction to PySpark

Creating a SparkSession

Loading census data

Introduction to PySpark DataFrames

Scalability and performance

Reading a CSV and performing aggregations

Filtering by company

More on Spark DataFrames

Infer and filter

Schema writeout

Introduction to Apache Spark and PySpark

A continuation of DataFrames and complex datatypes. This section expands on what DataFrames offer in PySpark and introduces some Spark SQL concepts.

Data manipulation with DataFrames

Handling missing data with fill and drop

Column operations - creating and renaming columns

Advanced DataFrame operations

DataFrame combinations

Joining flights with their destination airports

U define it? U use it!

UDF defined

Integers in PySpark UDFs

Pandas UDFs

PySpark in Python

Use Spark SQL and PySpark together for scalable data processing, combining SQL's simplicity with PySpark's distributed computing power to handle large datasets efficiently.

Resilient distributed datasets in PySpark

Creating RDDs

Collecting RDDs

Intro to Spark SQL

Querying on a temp view

Running SQL on DataFrames

Analytics with SQL on DataFrames

PySpark aggregations

Aggregating in PySpark

Aggregating in RDDs

Complex aggregations

PySpark at scale

Broadcasting

Bringing it all together I

Bringing it all together II

What have we learned?

Introduction to PySpark SQL

Transportation

Salaries

Adults

This course is designed for data engineers, data scientists, and machine learning practitioners looking to work with large datasets using PySpark. You'll explore Apache Spark's speed and scalability, learn to create Spark sessions, work with RDDs, and manipulate DataFrames through hands-on exercises. The course also covers PySpark SQL, teaching you how to query data with SQL, handle schemas and complex data types, and optimize performance in distributed environments. By the end, you'll have the foundational skills to process and analyze big data, setting the stage for advanced applications like machine learning and big data analytics.


This course is perfect for data engineers, data scientists, and machine learning practitioners looking to work with large datasets efficiently. Whether you're transitioning from tools like pandas or diving into big data technologies for the first time, this course offers a solid introduction to PySpark and distributed data processing.<br><br>
<h2>Why Spark? Why Now?</h2>
Discover the speed and scalability of Apache Spark, a framework designed for handling big data. Through interactive lessons and hands-on exercises, you'll see how Spark's in-memory processing gives it an edge over disk-based frameworks like Hadoop MapReduce. You'll start by setting up Spark sessions, then dive into core components like Resilient Distributed Datasets (RDDs) and DataFrames. Learn to filter, group, and join datasets while working on real-world examples.<br><br>
<h2>Boost Your Python and SQL Skills for Big Data</h2>
Learn how to harness PySpark SQL for querying and managing data using familiar SQL syntax. Tackle schemas, complex data types, and user-defined functions (UDFs), all while building skills in caching and optimizing performance for distributed systems.<br><br>
<h2>Build Your Big Data Foundations</h2>
By the end of this course, you'll have the confidence to handle, query, and process big data using PySpark. With these foundational skills, you'll be ready to explore advanced topics like machine learning and big data analytics.

Introduction to SQL

Data Manipulation with pandas

Master PySpark to handle big data with ease—learn to process, query, and optimize massive datasets for powerful analytics!

Big Data with PySpark

Machine Learning Scientist in Python

Professional Data Engineer in Python
