
Imputing Missing Data

Missing data happens. If we assume that our data is missing completely at random, we are assuming that the data we do have is a good representation of the population. If only a few values are missing, we could remove those rows, or we could replace the missing values with the mean or median of the column. In this exercise, we will look at 'PDOM': Days on Market at Current Price.
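To make the choice between mean and median concrete, here is a minimal plain-Python sketch. The `pdom` values are hypothetical and are not taken from the course dataset; they only illustrate how a single outlier pulls the mean away from the median.

```python
from statistics import mean, median

# Hypothetical days-on-market values; None marks a missing entry
pdom = [5, 12, None, 7, 200, None, 9]

# Keep only the observed values when computing the replacement
observed = [v for v in pdom if v is not None]
fill_mean = mean(observed)      # pulled upward by the outlier 200
fill_median = median(observed)  # much less sensitive to the outlier

# Replace each missing entry with the chosen statistic
imputed_mean = [fill_mean if v is None else v for v in pdom]
imputed_median = [fill_median if v is None else v for v in pdom]

print(fill_mean, fill_median)
```

When a column is heavily skewed, the median is often the safer replacement; the exercise itself uses the mean.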

This exercise is part of the course Feature Engineering with PySpark.

Exercise instructions

  • Get a count of the missing values in the column 'PDOM' using where(), isNull() and count().
  • Calculate the mean value of 'PDOM' using the aggregate function mean().
  • Use fillna() with the value set to the 'PDOM' mean value and only apply it to the column 'PDOM' using the subset parameter.

Hands-on interactive exercise

Complete this sample code to finish the exercise.

# Count missing rows
missing = df.____(df[____].____()).____()

# Calculate the mean value
col_mean = df.____({____: ____}).____()[0][0]

# Replacing with the mean value for that column
df.____(____, ____=[____])