Visualizing Missing Data
Plotting missing values is a quick way to see how much of your data is missing. It can also reveal when variables are missing in a pattern, something that needs to be handled with care lest your model be biased.
Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, fill in the value and hit "Submit Answer".
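As a minimal sketch of the idea, the example below builds a small pandas DataFrame, converts it to a True/False missingness mask with isnull(), and plots that mask with seaborn's heatmap(). The data, the column names (price, garage, basement), and the variable toy_df are hypothetical and exist only for illustration. The cast to 0/1 and the non-interactive backend are precautions, not requirements of the exercise.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: column names and values are made up for illustration
toy_df = pd.DataFrame({
    "price": [100.0, 250.0, 175.0, 300.0, 220.0, 140.0],
    "garage": [1.0, np.nan, np.nan, 2.0, np.nan, 1.0],
    "basement": [np.nan, np.nan, 1.0, 1.0, np.nan, 0.0],
})

# True where a value is missing, False otherwise
tf_df = toy_df.isnull()

# Cast the mask to 0/1 as a precaution so the heatmap receives numeric input
ax = sns.heatmap(data=tf_df.astype(int), cbar=False)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.close("all")  # in an interactive session, call plt.show() before closing
```

In the resulting plot each column of the heatmap is a variable and each row is an observation. Rows where garage and basement are missing together appear as aligned bands, which is the kind of pattern worth investigating before modeling.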
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Use select() to subset the dataframe df with the list of columns in columns, sample it with the provided sample() function, and assign the resulting dataframe to the variable sample_df.
- Convert the subset dataframe to a pandas DataFrame, pandas_df, and use the pandas isnull() method to convert the DataFrame into True/False values. Store this result in tf_df.
- Use seaborn's heatmap() to plot tf_df.
- Hit "Run Code" to view the plot. Then assign the name of the variable with the most missing values to answer.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Sample the dataframe and convert to Pandas
____ = df.select(____).sample(False, 0.1, 42)
____ = ____.toPandas()
# Convert all values to T/F
tf_df = ____.____()
# Plot it
sns.____(data=____)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()
# Set the answer to the column with the most missing data
answer = '____'
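A reading taken from a heatmap can be cross-checked numerically. The sketch below is one possible way to do that, assuming a pandas DataFrame shaped like pandas_df. The data and the column names (col_a, col_b, col_c) are hypothetical stand-ins, not the columns used in this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pandas_df; values and column names are made up
pandas_df = pd.DataFrame({
    "col_a": [1.0, np.nan, 3.0, 4.0],
    "col_b": [np.nan, np.nan, np.nan, 1.0],
    "col_c": [1.0, 2.0, np.nan, 4.0],
})

# True/False mask of missing values, as in the exercise
tf_df = pandas_df.isnull()

# Summing a boolean column counts its True values, i.e. its missing entries
missing_counts = tf_df.sum()

# Label of the column with the largest missing count
most_missing = missing_counts.idxmax()

print(missing_counts.sort_values(ascending=False))
print(most_missing)
```

Because the exercise works on a sample of the data, counts computed this way are estimates for the full dataframe rather than exact totals.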