Bucketing

If you are a homeowner its very important if a house has 1, 2, 3 or 4 bedrooms. But like bathrooms, once you hit a certain point you don't really care whether the house has 7 or 8. This example we'll look at how to figure out where are some good value points to bucket.

Bu egzersiz, kursun bir parçasıdır

Feature Engineering with PySpark

Kursa Göz Atın

Egzersiz talimatları

Plot a distribution plot of the pandas dataframe sample_df using Seaborn distplot().
Given it looks like there is a long tail of infrequent values after 5, create the bucket splits of 1, 2, 3, 4, 5+
Create the transformer buck by instantiating Bucketizer() with the splits for setting the buckets, then set the input column as BEDROOMS and output column as bedrooms.
Apply the Bucketizer transformation on df using transform() and assign the result to df_bucket. Then verify the results with show()

Uygulamalı etkileşimli egzersiz

Bu egzersizi bu örnek kodu tamamlayarak deneyin.

from pyspark.ml.feature import Bucketizer

# Plot distribution of sample_df
sns.____(____, axlabel='BEDROOMS')
plt.show()

# Create the bucket splits and bucketizer
splits = [____, ____, ____, ____, ____, ____, float('Inf')]
buck = ____(splits=____, inputCol=____, outputCol=____)

# Apply the transformation to df: df_bucket
df_bucket = ____.____(____)

# Display results
df_bucket[['BEDROOMS', 'bedrooms']].____()

Kodu Düzenle ve Çalıştır