
Calculate Missing Percents

Automation is the future of data science, and learning to automate parts of your data preparation pays dividends. In this exercise, we will automate dropping columns whose share of missing values exceeds a given threshold.

This exercise is part of the course

Feature Engineering with PySpark


Exercise instructions

  • Define a function column_dropper() that takes two parameters: df, a dataframe, and threshold, a float between 0 and 1.
  • For each column, calculate the percentage of values that are missing using where(), isNull() and count().
  • Check whether the percentage of missing values is higher than the threshold; if so, drop the column using drop().
  • Run column_dropper() on df with the threshold set to .6.
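
The threshold logic in the steps above can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the PySpark solution to this exercise: the table is a dict of lists with None marking a missing value, and the names toy_table, missing_fraction and drop_sparse_columns, as well as the data, are made up for illustration. It only shows the arithmetic: count the missing entries, divide by the total number of records, and drop the column when that share is strictly greater than the threshold.

```python
# Plain-Python stand-in for a dataframe: column name -> list of values.
# None marks a missing value. The data is hypothetical.
toy_table = {
    "price": [100, 250, None, 300, 175],   # 1 of 5 missing -> 0.2
    "fireplaces": [None, None, None, 1, 0],  # 3 of 5 missing -> 0.6
    "garage": [None, None, None, 2, None],   # 4 of 5 missing -> 0.8
}


def missing_fraction(values):
    """Return the share of entries that are None, between 0 and 1."""
    if not values:
        # Assumption: an empty column counts as having nothing missing.
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_sparse_columns(table, threshold):
    """Keep only the columns whose missing share does not exceed threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= threshold
    }


# Drop columns that are more than 60% missing
cleaned = drop_sparse_columns(toy_table, 0.6)
print(sorted(cleaned))  # prints ['fireplaces', 'price']
```

Note that "fireplaces" sits exactly at the threshold and is kept, because the check is "more than", not "at least". The same comparison applies in column_dropper().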

Interactive exercise

Complete the sample code to successfully finish this exercise.

def column_dropper(df, threshold):
  # Takes a dataframe and threshold for missing values. Returns a dataframe.
  total_records = df.____()
  for col in df.columns:
    # Calculate the percentage of missing values
    missing = df.____(df[col].____()).____()
    missing_percent = ____ / ____
    # Drop column if percent of missing is more than threshold
    if ____ > ____:
      df = df.____(col)
  return df

# Drop columns that are more than 60% missing
df = ____(____, ____)