Calculate Missing Percents
Automation is the future of data science. Learning to automate some of your data preparation pays dividends. In this exercise, we will automate dropping columns if they are missing data beyond a specific threshold.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Define a function column_dropper() that takes the parameters df, a dataframe, and threshold, a float between 0 and 1.
- Calculate the percentage of values that are missing using where(), isNull() and count().
- Check whether the percentage of missing values is higher than the threshold; if so, drop the column using drop().
- Run column_dropper() on df with the threshold set to .6.
Interactive hands-on exercise
Try solving this exercise by completing the sample code.
def column_dropper(df, threshold):
    # Takes a dataframe and threshold for missing values. Returns a dataframe.
    total_records = df.____()
    for col in df.columns:
        # Calculate the percentage of missing values
        missing = df.____(df[col].____()).____()
        missing_percent = ____ / ____
        # Drop column if percent of missing is more than threshold
        if ____ > ____:
            df = df.____(col)
    return df

# Drop columns that are more than 60% missing
df = ____(____, ____)
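
For reference, here is one way the template could be completed, following the instructions above. It is a sketch rather than the official solution: the blanks are filled with the methods the instructions name (count(), where(), isNull(), drop()), and the guard for an empty dataframe is an added assumption that the exercise itself does not ask for.

```python
def column_dropper(df, threshold):
    """Drop every column whose share of null values exceeds `threshold`.

    df is expected to be a PySpark DataFrame and threshold a float
    between 0 and 1. Returns a DataFrame (the input is not modified).
    """
    total_records = df.count()
    if total_records == 0:
        # Assumption (not part of the exercise): return an empty
        # dataframe unchanged to avoid dividing by zero.
        return df
    # df.columns is read once here, so reassigning df inside the loop is safe.
    for col in df.columns:
        # Count the rows where this column is null
        missing = df.where(df[col].isNull()).count()
        # Fraction of rows missing a value in this column (0.0 to 1.0)
        missing_percent = missing / total_records
        # Drop the column if its missing share is above the threshold
        if missing_percent > threshold:
            df = df.drop(col)
    return df
```

With a dataframe df already loaded, the last step of the exercise would then read df = column_dropper(df, 0.6). Note that this version runs one count() job per column; on wide dataframes a single aggregation over all columns may be cheaper, but the per-column loop matches the structure of the template.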