Differences

Let's explore generating new features from existing ones. In the Midwest of the U.S., many single-family homes sit on lots with extra land around them for green space. In this example you will create a new feature called 'YARD_SIZE' and then check whether it is correlated with the outcome variable, SALESCLOSEPRICE.

Create a new column called LOT_SIZE_SQFT using withColumn(), converting ACRES to square feet by multiplying it by the conversion factor acres_to_sqfeet.
Create another new column called YARD_SIZE by subtracting FOUNDATIONSIZE from LOT_SIZE_SQFT.
Run corr() on each of the independent variables YARD_SIZE, FOUNDATIONSIZE, and LOT_SIZE_SQFT against the dependent variable SALESCLOSEPRICE. Does the new feature show a stronger correlation than either of its components?
