
Extracting Features

1. Extracting Features

Datasets frequently have rich features trapped in messy combination fields, lists, or even free-form text. In this video, we'll go over how to wrangle those columns into information that's useful for machine learning.

2. Extracting Age with Text Match

We can see that this ROOF column holds several useful features. For instance, an old roof is very expensive to replace, and knowing that might impact the price of the house. In this dataset, the age is recorded as either over 8 years or 8 years or less. This would be better as a boolean variable, 1 or 0, something we can calculate on.

3. Extracting Age with Text Match

To create the boolean column, we'll use the when function to build an if-then. The when function evaluates a boolean condition and, if it's true, returns a given value. In this case, our boolean conditions are find_over_8 and find_under_8, which use the like function to return true or false depending on whether a string is found. You might notice we use percent signs before and after the string we are looking for. These are wildcards that allow any number of characters before or after the string. Now that we have these conditions, we can put them in the when function: when find_over_8 is true, assign the value 1; when find_under_8 is true, assign 0. If neither is true, the otherwise function lets us assign None so those values are null. We can see that the roof age has now been turned into a new boolean variable!
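
As a minimal sketch of this pattern, assuming a DataFrame df whose ROOF values contain phrases like 'Age Over 8 Years' and 'Age 8 Years or Less' (the exact strings in your data may differ):

```python
from pyspark.sql.functions import when

# like() returns a boolean column; the % signs are SQL wildcards that
# match any number of characters before or after the search string.
# The search strings here are assumptions about the data's wording.
find_over_8 = df['ROOF'].like('%Age Over 8 Years%')
find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')

# when/otherwise builds the if-then: 1, 0, or null.
df = df.withColumn('old_roof',
                   when(find_over_8, 1)
                   .when(find_under_8, 0)
                   .otherwise(None))

df.select('ROOF', 'old_roof').show(5, truncate=False)
```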

4. Splitting Columns

Let's look at the ROOF column again. You'll notice that when ROOF has a value, it appears to be a comma-separated list starting with the type of material the roof is made of. If we know the pattern, we can split this out into its own column called Roof_Material.

5. Splitting Columns

To split a column, we need to introduce a new function from the pyspark.sql.functions module: split, which takes a column to split and a character to split on. In our example, we will split df['ROOF'] using the comma as the delimiter between values. Once we have that, we can use our familiar withColumn to create a new column from the first value with split_col.getItem(0). getItem(0) takes the 0 index position of the split column and returns its value. Here we can verify that our code performed as expected, splitting the ROOF column and putting the first value into a new column, Roof_Material.
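
A sketch of that step, continuing with the same df and splitting on the comma as described:

```python
from pyspark.sql.functions import split

# Split ROOF on the comma into an array column.
split_col = split(df['ROOF'], ',')

# getItem(0) returns the element at index 0 of the array.
df = df.withColumn('Roof_Material', split_col.getItem(0))

df.select('ROOF', 'Roof_Material').show(5, truncate=False)
```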

6. Explode!

What if the order of the listed values in a column is not guaranteed, and we want to extract the values into their own columns? Doing this is a two-step process. The first step is called 'exploding': changing a compound field so that each value gets its own record, with everything else repeated.
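
Here is a toy illustration of exploding, using made-up roof values rather than the actual dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# Made-up example rows; real values will differ.
toy = spark.createDataFrame(
    [(1, 'Asphalt, Age 8 Years or Less'),
     (2, 'Metal, Age Over 8 Years')],
    ['id', 'ROOF'])

# explode() gives each list element its own record; the id repeats.
toy.withColumn('ex_roof_list',
               explode(split(toy['ROOF'], ', '))).show(truncate=False)
```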

7. Pivot!

The second step is to pivot those repeated fields into columns. You'll notice we now have a column for each POSSIBLE value in the compound field we started with.

8. Explode & Pivot!

To do this in PySpark, we have to import several functions: split, explode, lit, coalesce, and first. First, we split our ROOF column into an array column, roof_list. Now we can explode roof_list to create a new record for each value. Next, we create a constant column, constant_val, to help our pivot. We then groupBy our record id number so that only one row per record is returned when pivoting on ex_roof_list. Since pivot requires an aggregate function, we use coalesce and first on constant_val to take the first value, leaving null wherever a given roof value was not present.
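
Putting the steps together, a sketch might look like this; the record id column name 'NO' is an assumption, so substitute whatever uniquely identifies a row in your data:

```python
from pyspark.sql.functions import split, explode, lit, coalesce, first

# Split ROOF into an array column (splitting on comma-plus-space here,
# an assumption about how the values are formatted).
df = df.withColumn('roof_list', split(df['ROOF'], ', '))

# Explode: one new record per list value, everything else repeated.
ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))

# Constant column to aggregate during the pivot.
ex_df = ex_df.withColumn('constant_val', lit(1))

# Group on the record id ('NO' is assumed) so each record becomes one
# row again; each distinct ex_roof_list value becomes its own column,
# holding 1 where that value was present and null where it was not.
piv_df = (ex_df.groupBy('NO')
                .pivot('ex_roof_list')
                .agg(coalesce(first('constant_val'))))

piv_df.show(5)
```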

9. Let's wrangle some features!

In this video, we learned how to turn some messy fields into machine-learnable features. Now it's your turn to wrangle some features on your own!