Bucketing
If you are a homeowner, it matters a great deal whether a house has 1, 2, 3, or 4 bedrooms. But, as with bathrooms, once you hit a certain point you don't really care whether the house has 7 or 8. In this exercise we'll look at how to figure out where some good cut points for bucketing values are.
This exercise is part of the course Feature Engineering with PySpark.
Exercise instructions
- Plot a distribution plot of the pandas DataFrame sample_df using Seaborn's distplot().
- Given that there looks to be a long tail of infrequent values after 5, create the bucket splits of 1, 2, 3, 4, 5+ (see the splits sketch after this list).
- Create the transformer buck by instantiating Bucketizer() with the splits for setting the buckets, then set the input column as BEDROOMS and the output column as bedrooms.
- Apply the Bucketizer transformation on df using transform() and assign the result to df_bucket. Then verify the results with show().
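Before completing the scaffold, it can help to see how Bucketizer interprets a splits list. The following is a minimal standalone sketch, not part of the exercise: the local SparkSession and the toy DataFrame are assumptions made purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

# Toy bedroom counts, including a long-tail value of 7 (illustrative data only)
toy_df = spark.createDataFrame([(1.0,), (4.0,), (7.0,)], ['BEDROOMS'])

# The splits are bucket edges: [0,1), [1,2), [2,3), [3,4), [4,5), [5, Inf)
buck = Bucketizer(splits=[0, 1, 2, 3, 4, 5, float('Inf')],
                  inputCol='BEDROOMS', outputCol='bucket')

# 7.0 lands in the last bucket (index 5.0), so 5, 6, 7, 8, ... all collapse into one '5+' bucket
buck.transform(toy_df).show()

Because the last edge is float('Inf'), every count of 5 or more maps to the same bucket index, which is exactly the 5+ grouping described in the instructions.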
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.ml.feature import Bucketizer
# Plot distribution of sample_df
sns.____(____, axlabel='BEDROOMS')
plt.show()
# Create the bucket splits and bucketizer
splits = [____, ____, ____, ____, ____, ____, float('Inf')]
buck = ____(splits=____, inputCol=____, outputCol=____)
# Apply the transformation to df: df_bucket
df_bucket = ____.____(____)
# Display results
df_bucket[['BEDROOMS', 'bedrooms']].____()
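For reference, one way the completed scaffold might look is sketched below. It assumes the objects provided by the exercise environment (the pandas DataFrame sample_df, the Spark DataFrame df, and seaborn/matplotlib imported as sns and plt), and fills the splits with 0 through 5 so the final float('Inf') edge forms the 5+ bucket.

from pyspark.ml.feature import Bucketizer

# Plot distribution of sample_df (sample_df, sns and plt come from the exercise environment)
sns.distplot(sample_df, axlabel='BEDROOMS')
plt.show()

# Create the bucket splits and bucketizer: buckets are [0,1), [1,2), ..., [5, Inf)
splits = [0, 1, 2, 3, 4, 5, float('Inf')]
buck = Bucketizer(splits=splits, inputCol='BEDROOMS', outputCol='bedrooms')

# Apply the transformation to df: df_bucket
df_bucket = buck.transform(df)

# Display results: each bedroom count next to its bucket index
df_bucket[['BEDROOMS', 'bedrooms']].show()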