Visualizing Missing Data
Plotting missing values is a quick way to see how much of your data is missing. It can also reveal when variables are missing in a pattern, something that must be handled with care lest your model be biased.
Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, fill in the value and hit "Submit Answer".
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Use select() to subset the dataframe df with the list of columns columns, sample it with the provided sample() function, and assign this dataframe to the variable sample_df.
- Convert the subset dataframe to a pandas dataframe pandas_df, and use pandas isnull() to convert it into a DataFrame of True/False values. Store this result in tf_df.
- Use seaborn's heatmap() to plot tf_df.
- Hit "Run Code" to view the plot. Then assign the name of the variable with the most missing values to answer.
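As a complement to reading the heatmap by eye, the True/False conversion described above can also be summarized numerically. The following is a minimal sketch, not the exercise solution: it assumes a small hypothetical pandas DataFrame (the column names price, rooms, and garage are made up for illustration) standing in for the sampled data after toPandas(), and it uses only pandas, not Spark or seaborn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the sampled dataframe after toPandas().
# The column names and values are illustrative assumptions, not the course data.
pandas_df = pd.DataFrame({
    "price": [100.0, 250.0, np.nan, 175.0, 90.0],
    "rooms": [3.0, np.nan, np.nan, 2.0, 4.0],
    "garage": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# True where a value is missing, False otherwise
tf_df = pandas_df.isnull()

# Summing booleans counts the True entries in each column
missing_counts = tf_df.sum()

# Label of the column with the largest number of missing values
most_missing = missing_counts.idxmax()

print(missing_counts.to_dict())
print(most_missing)
```

The boolean frame tf_df built here has the same shape as the original data, which is why it can be passed directly to a heatmap: each missing cell shows up as a contrasting block, and a column with many missing values appears as a long streak.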
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Sample the dataframe and convert to Pandas
____ = df.select(____).sample(False, 0.1, 42)
____ = ____.toPandas()
# Convert all values to T/F
tf_df = ____.____()
# Plot it
sns.____(data=____)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()
# Set the answer to the column with the most missing data
answer = '____'