Using text filters to remove records
It pays to ask your clients lots of questions and to take time to understand your variables. You find out that assumable mortgages are an unusual occurrence in the real estate industry, and your client suggests you exclude them. In this exercise we will use isin(), which filters on text as like() does, but matches against a list of exact values rather than a single pattern.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Use select() and show() to inspect the distinct values in the column 'ASSUMABLEMORTGAGE' and create the list yes_values for all the values containing the string 'Yes'.
- Use ~df['ASSUMABLEMORTGAGE'], isin(), and .isNull() to create a NOT filter to remove records containing corresponding values in the list yes_values and to keep records with null values. Store this filter in the variable text_filter.
- Use where() to apply the text_filter to df.
- Print out the number of records remaining in df.
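The second instruction asks for an explicit .isNull() check because Spark follows SQL's three-valued logic: a null compared against a list yields null rather than False, negating it still yields null, and where() keeps only rows whose predicate is exactly True. The following is a minimal pure-Python sketch of that behaviour, not Spark code. None stands in for null, the helper functions (sql_not, sql_or, sql_isin, sql_where) are hypothetical names invented for this illustration, and the column values are made up rather than taken from the real dataset.

```python
# Pure-Python model of SQL three-valued logic (None stands in for NULL).
# Illustration only: none of these helpers are part of PySpark.

def sql_not(value):
    """NOT in SQL: NULL stays NULL, otherwise the boolean is flipped."""
    return None if value is None else not value

def sql_or(left, right):
    """OR in SQL: True wins; otherwise a NULL operand makes the result NULL."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False

def sql_isin(value, candidates):
    """IN against a list without NULLs: a NULL input gives NULL."""
    return None if value is None else value in candidates

def sql_where(rows, predicate):
    """Keep a row only when the predicate evaluates to exactly True."""
    return [row for row in rows if predicate(row) is True]

# Hypothetical column values, not taken from the real dataset.
rows = ["Yes (example A)", "Not Assumable", None, "Yes (example B)"]
yes_values = ["Yes (example A)", "Yes (example B)"]

# NOT IN alone: the NULL row is silently dropped along with the 'Yes' rows.
not_in_only = sql_where(rows, lambda v: sql_not(sql_isin(v, yes_values)))

# NOT IN combined with an explicit null check: the NULL row survives.
with_null_check = sql_where(
    rows, lambda v: sql_or(sql_not(sql_isin(v, yes_values)), v is None)
)

print(not_in_only)
print(with_null_check)
```

Under these assumptions, the first filter keeps only the non-null, non-'Yes' value, while the second also keeps the null row. This mirrors why the exercise combines the negated isin() with .isNull() using |.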
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Inspect unique values in the column 'ASSUMABLEMORTGAGE'
df.____([____]).distinct().____()
# List of possible values containing 'Yes'
yes_values = [____, ____]
# Filter the text values out of df but keep null values
text_filter = ~df['ASSUMABLEMORTGAGE'].isin(____) | df['ASSUMABLEMORTGAGE'].isNull()
df = df.____(text_filter)
# Print count of remaining records
print(____.____())