Bucketing
If you are a homeowner, it matters a great deal whether a house has 1, 2, 3, or 4 bedrooms. But, as with bathrooms, beyond a certain point you don't really care whether the house has 7 or 8. In this example, we'll look at how to find good cut points for bucketing.
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Plot a distribution plot of the pandas DataFrame sample_df using Seaborn distplot().
- Given that there appears to be a long tail of infrequent values after 5, create the bucket splits of 1, 2, 3, 4, 5+.
- Create the transformer buck by instantiating Bucketizer() with the splits for setting the buckets, then set the input column as BEDROOMS and the output column as bedrooms.
- Apply the Bucketizer transformation on df using transform() and assign the result to df_bucket. Then verify the results with show().
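To make the effect of the splits concrete, here is a minimal pure-Python sketch of the interval rule that Bucketizer is documented to follow: n+1 splits define n buckets, each bucket covers [lower, upper), and the last bucket also includes its upper bound. This is not Spark code. The helper name bucket_index is hypothetical, and the split values 0 through 5 plus infinity are an assumption about one way to obtain the buckets 1, 2, 3, 4, 5+; the exercise itself leaves them blank.

```python
from bisect import bisect_right

# Assumed split points (not given in the exercise): 0..5 plus infinity,
# which yields one bucket per bedroom count up to 4 and a single 5+ bucket.
splits = [0, 1, 2, 3, 4, 5, float("inf")]


def bucket_index(value, splits):
    """Hypothetical helper: index of the interval [splits[i], splits[i+1]) holding value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} falls outside the split range")
    # The last bucket is closed on the right, so the top split maps into it.
    if value == splits[-1]:
        return len(splits) - 2
    # bisect_right counts the splits <= value; subtract 1 for the bucket index.
    return bisect_right(splits, value) - 1


bedroom_counts = [1, 2, 3, 4, 5, 7, 8]
buckets = [bucket_index(b, splits) for b in bedroom_counts]
print(list(zip(bedroom_counts, buckets)))
```

With these assumed splits, houses with 5, 7, or 8 bedrooms all land in the same final bucket, which is the point of the 5+ group. Bucketizer itself writes the bucket index to the output column as a floating-point value rather than an integer.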
Hands-on interactive exercise
Try this exercise by completing this sample code.
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.ml.feature import Bucketizer
# Plot distribution of sample_df
sns.____(____, axlabel='BEDROOMS')
plt.show()
# Create the bucket splits and bucketizer
splits = [____, ____, ____, ____, ____, ____, float('Inf')]
buck = ____(splits=____, inputCol=____, outputCol=____)
# Apply the transformation to df: df_bucket
df_bucket = ____.____(____)
# Display results
df_bucket[['BEDROOMS', 'bedrooms']].____()