In this chapter, you'll learn how Spark manages data and how you can read and write tables from Python.

What is Spark, anyway?

Using Spark in Python

Examining The SparkContext

Using DataFrames

Creating a SparkSession

Viewing tables

Are you query-ious?

Pandafy a Spark DataFrame

Put some Spark in your data

Dropping the middle man

Getting to know PySpark

In this chapter, you'll learn about the pyspark.sql module, which provides optimized data queries to your Spark session.

Creating columns

SQL in a nutshell

SQL in a nutshell (2)

Filtering Data

Selecting

Selecting II

Aggregating

Aggregating II

Grouping and Aggregating I

Grouping and Aggregating II

Joining

Joining II

Manipulating data

PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter.

Machine Learning Pipelines

Join the DataFrames

Data types

String to integer

Create a new column

Making a Boolean

Strings and factors

Carrier

Destination

Assemble a vector

Create the pipeline

Test vs. Train

Transform the data

Split the data

Getting started with machine learning pipelines

In this last chapter, you'll apply what you've learned to create a model that predicts which flights will be delayed.

What is logistic regression?

Create the modeler

Cross validation

Create the evaluator

Make a grid

Make the validator

Fit the model(s)

Evaluating binary classifiers

Evaluate the model

Model tuning and selection

Airports

Flights

Planes

In this course, you'll learn how to use Spark from Python! Spark is a tool for running parallel computations on large datasets, and it integrates well with Python. PySpark is the Python package that makes this possible. You'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle this data and build a complete machine learning pipeline to predict whether flights will be delayed. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!

Introduction to Python

Learn to manage data and build a machine learning pipeline with PySpark. Practice with real-world data.

PySpark Fundamentals

Learn to use distributed data management and machine learning in Spark with the PySpark package.

Machine Learning Pipelines