Get to know your problem before you dive in! Then learn how to inspect your dataset statistically and visually.

Where to Begin

Where to begin?

Check Version

Load in the data

Defining A Problem

What are we predicting?

Verifying Data Load

Verifying DataTypes

Visually Inspecting Data / EDA

Using Corr()

Using Visualizations: distplot

Using Visualizations: lmplot

Exploratory Data Analysis

Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.

Dropping data

Dropping a list of columns

Using text filters to remove records

Filtering numeric fields conditionally

Adjusting Data

Custom Percentage Scaling

Scaling your scalers

Correcting Right Skew Data

Working with Missing Data

Visualizing Missing Data

Imputing Missing Data

Calculate Missing Percents

Getting More Data

A Dangerous Join

Spark SQL Join

Checking for Bad Joins

Wrangling with Spark Functions

In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.

Feature Generation

Differences

Ratios

Deeper Features

Time Features

Time Components

Joining On Time Components

Date Math

Extracting Features

Extracting Text to New Features

Splitting & Exploding

Pivot & Join

Binarizing, Bucketing & Encoding

Binarizing Day of Week

Bucketing

One Hot Encoding

Feature Engineering

In this chapter, we'll learn how to choose which type of model we want. Then we'll learn how to apply our data to the model and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!

Choosing the Algorithm

Which MLlib Module?

Creating Time Splits

Adjusting Time Features

Feature Engineering Assumptions for RFR

Feature Engineering For Random Forests

Dropping Columns with Low Observations

Naively Handling Missing and Categorical Values

Building a Model

Building a Regression Model

Evaluating & Comparing Algorithms

Understanding Metrics

Interpreting, Saving & Loading

Interpreting Results

Saving & Loading Models

Final Thoughts

2017 St Paul MN Real Estate Dataset

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, their data must be transformed before powerful machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

Supervised Learning with scikit-learn

Introduction to PySpark

Learn to use PySpark to cut Big Data problems down to size with data wrangling and feature engineering, so you can extract meaning, forecast, classify, or cluster.

Feature Engineering with PySpark

Learn the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering.

Big Data with PySpark
