Extracting Text to New Features
Garages are an important consideration for houses in Minnesota, where most people own a car and snow is a nuisance to clear off a car parked outside. The type of garage matters too: can you get to your car without braving the cold? Let's create a feature, has_attached_garage, that captures whether the garage is attached to the house.
This exercise is part of the course Feature Engineering with PySpark.
Exercise instructions
- Import the needed function when() from pyspark.sql.functions.
- Create a string matching condition using like() to look for the string pattern Attached Garage in df['GARAGEDESCRIPTION'], and use % wildcards so it will match anywhere in the field.
- Similarly, create another condition using like() to find the string pattern Detached Garage in df['GARAGEDESCRIPTION'], and use % wildcards so it will match anywhere in the field.
- Create a new column has_attached_garage using when() to assign the value 1 if the house has an attached garage and 0 if it has a detached garage, and use otherwise() to assign null with None if it is neither.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import needed functions
____ ____ ____ ____
# Create boolean conditions for string matches
has_attached_garage = df[____].____(____)
has_detached_garage = df[____].____(____)
# Conditional value assignment
df = df.withColumn(____, (____(____, 1)
.____(____, 0)
.____(____)))
# Inspect results
df[['GARAGEDESCRIPTION', 'has_attached_garage']].show(truncate=100)