
Extracting Text to New Features

Garages are an important consideration for houses in Minnesota, where most people own a car and snow is tedious to clear off a car parked outside. The type of garage matters too: can you get to your car without braving the cold? Let's create a feature, has_attached_garage, that captures whether the garage is attached to the house.

This exercise is part of the course

Feature Engineering with PySpark


Exercise instructions

  • Import the needed function when() from pyspark.sql.functions.
  • Create a string matching condition using like() to look for the string pattern Attached Garage in df['GARAGEDESCRIPTION'], with % wildcards on both sides so it matches anywhere in the field.
  • Similarly, create another condition using like() to find the string pattern Detached Garage in df['GARAGEDESCRIPTION'] and use wildcards % so it will match anywhere in the field.
  • Create a new column has_attached_garage using when() to assign the value 1 if the house has an attached garage and 0 if it has a detached garage, and use otherwise() with None to assign null if it has neither.
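To see what the % wildcards and the three-way assignment are meant to do before touching Spark, here is a minimal plain-Python sketch of the same logic. The helpers sql_like() and garage_flag() are hypothetical names invented for this illustration and are not part of PySpark; the sketch only emulates the usual SQL LIKE rules (% matches any run of characters, _ matches exactly one, matching is case-sensitive).

```python
import re


def sql_like(value, pattern):
    """Hypothetical helper: emulate SQL LIKE on a single Python string."""
    if value is None:
        # In SQL, comparing NULL with LIKE yields NULL rather than False.
        return None
    # Translate the LIKE pattern into an anchored regular expression.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def garage_flag(description):
    """Hypothetical helper: 1 if attached, 0 if detached, None otherwise."""
    if sql_like(description, "%Attached Garage%"):
        return 1
    if sql_like(description, "%Detached Garage%"):
        return 0
    return None


# Without wildcards the pattern must match the whole field.
print(sql_like("Heated Garage, Attached Garage", "Attached Garage"))
# With wildcards on both sides it matches anywhere in the field.
print(sql_like("Heated Garage, Attached Garage", "%Attached Garage%"))
print(garage_flag("Detached Garage"), garage_flag("Carport"), garage_flag(None))
```

The attached check runs first, so a description that mentions both garage types is flagged as 1, which mirrors how chained when() calls use the first matching condition.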

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import needed functions
____ ____ ____ ____

# Create boolean conditions for string matches
has_attached_garage = df[____].____(____)
has_detached_garage = df[____].____(____)

# Conditional value assignment 
df = df.withColumn(____, (____(____, 1)
                                          .____(____, 0)
                                          .____(____)))

# Inspect results
df[['GARAGEDESCRIPTION', 'has_attached_garage']].show(truncate=100)