A review of DataFrame fundamentals and the importance of data cleaning.

Intro to data cleaning with Apache Spark

Data cleaning review

Defining a schema

Immutability and lazy processing

Immutability review

Using lazy processing

Understanding Parquet

Saving a DataFrame in Parquet format

SQL and Parquet

DataFrame details

A look at various techniques to modify the contents of DataFrames in Spark.

DataFrame column operations

Filtering column content with Python

Filtering Question #1

Filtering Question #2

Modifying DataFrame columns

Conditional DataFrame column operations

when() example

When / Otherwise

User defined functions

Understanding user defined functions

Using user defined functions in Spark

Partitioning and lazy processing

Adding an ID Field

IDs with different partitions

More ID tricks

Manipulating DataFrames in the real world

Improve data cleaning tasks by increasing performance or reducing resource requirements.

Caching

Caching a DataFrame

Removing a DataFrame from cache

Improve import performance

File size optimization

File import performance

Cluster configurations

Reading Spark configurations

Writing Spark configurations

Performance improvements

Normal joins

Using broadcasting on Spark joins

Comparing broadcast vs normal joins

Improving Performance

Learn how to process complex real-world data using Spark and the basics of pipelines.

Introduction to data pipelines

Quick pipeline

Pipeline data issue

Data handling techniques

Removing commented lines

Removing invalid rows

Splitting into columns

Further parsing

Data validation

Validate rows via join

Examining invalid rows

Final analysis and delivery

Dog parsing

Per image count

Percentage dog pixels

Congratulations and next steps

Complex processing and data pipelines

Dallas Council Votes

Dallas Council Voters

Flights - 2014

Flights - 2015

Flights - 2016

Flights - 2017

Working with data is tricky, and working with millions or even billions of rows is trickier still.
Have you been handed processing code that was written on a laptop against fairly clean data?
Chances are you are now responsible for moving a basic process from prototype to production.
You may have worked with real-world datasets, with missing fields, bizarre formatting, and orders of magnitude more data. Even if all of this is new to you, this course helps you learn what you need to prepare data processes with Python and Apache Spark.
You will learn terminology, methods, and a number of best practices for building a performant, maintainable, and understandable data processing environment.

Intermediate Python

Introduction to PySpark

Learn how to use PySpark to clean data in Python with DataFrames and data pipelines.

Cleaning Data with PySpark

Big Data with PySpark
