BaşlayınÜcretsiz Başlayın

Bucketing

If you are a homeowner its very important if a house has 1, 2, 3 or 4 bedrooms. But like bathrooms, once you hit a certain point you don't really care whether the house has 7 or 8. This example we'll look at how to figure out where are some good value points to bucket.

Bu egzersiz

Feature Engineering with PySpark

kursunun bir parçasıdır
Kursu Görüntüle

Egzersiz talimatları

  • Plot a distribution plot of the pandas dataframe sample_df using Seaborn distplot().
  • Given it looks like there is a long tail of infrequent values after 5, create the bucket splits of 1, 2, 3, 4, 5+
  • Create the transformer buck by instantiating Bucketizer() with the splits for setting the buckets, then set the input column as BEDROOMS and output column as bedrooms.
  • Apply the Bucketizer transformation on df using transform() and assign the result to df_bucket. Then verify the results with show()

Uygulamalı interaktif egzersiz

Bu örnek kodu tamamlayarak bu egzersizi bitirin.

from pyspark.ml.feature import Bucketizer

# Plot distribution of sample_df
sns.____(____, axlabel='BEDROOMS')
plt.show()

# Create the bucket splits and bucketizer
splits = [____, ____, ____, ____, ____, ____, float('Inf')]
buck = ____(splits=____, inputCol=____, outputCol=____)

# Apply the transformation to df: df_bucket
df_bucket = ____.____(____)

# Display results
df_bucket[['BEDROOMS', 'bedrooms']].____()
Kodu Düzenle ve Çalıştır