
Visualizing Missing Data

Plotting missing values is a quick way to see how much of your data is missing. It can also reveal when variables are missing in a pattern, which needs to be handled with care so that your model is not biased.

Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, fill in the value and hit "Submit Answer".

This exercise is part of the course

Feature Engineering with PySpark

Exercise instructions

  • Use select() to subset the dataframe df with the list of columns columns, sample the result with the provided sample() function, and assign this dataframe to the variable sample_df.
  • Convert the sampled dataframe to a pandas dataframe pandas_df, and use pandas isnull() to convert the DataFrame into True/False values. Store this result in tf_df.
  • Use seaborn's heatmap() to plot tf_df.
  • Hit "Run Code" to view the plot. Then assign the name of the variable with the most missing values to answer.
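
The steps above can be sketched on a small stand-in dataset. This is a minimal illustration, not the exercise solution: the column names and values are made up, and the toy pandas DataFrame takes the place of the result of sampling the Spark dataframe and calling toPandas(). The Agg backend is set only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the sampled Spark dataframe after toPandas().
# Column names and values are invented for illustration only.
pandas_df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 300.0, np.nan, 120.0],
    "sqft": [900.0, 1100.0, np.nan, 1500.0, 1300.0, 950.0],
    "garage": [np.nan, np.nan, np.nan, 1.0, np.nan, 2.0],
})

# True where a value is missing, False where it is present
tf_df = pandas_df.isnull()

# Missing cells are drawn in a contrasting colour, one column per variable
ax = sns.heatmap(data=tf_df, cbar=False)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()
```

In the resulting plot, a column that is mostly one colour from top to bottom is mostly missing, and bands that line up across several columns suggest the values are missing together rather than at random.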

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Sample the dataframe and convert to Pandas
____ = df.select(____).sample(False, 0.1, 42)
____ = ____.toPandas()

# Convert all values to T/F
tf_df = ____.____()

# Plot it
sns.____(data=____)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()

# Set the answer to the column with the most missing data
answer = '____'
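
A visual reading of the heatmap can be cross-checked by counting the missing values per column. The sketch below assumes a toy pandas DataFrame with invented column names in place of the sampled pandas_df. Because the exercise works on a 10% sample, counts obtained this way are estimates for the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the sampled pandas_df
pandas_df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 300.0, np.nan, 120.0],
    "sqft": [900.0, 1100.0, np.nan, 1500.0, 1300.0, 950.0],
    "garage": [np.nan, np.nan, np.nan, 1.0, np.nan, 2.0],
})

# Summing the True/False frame counts the missing values in each column
missing_counts = pandas_df.isnull().sum()

# The mean of the True/False frame gives the fraction missing per column
missing_share = pandas_df.isnull().mean()

# Label of the column with the largest number of missing values
most_missing = missing_counts.idxmax()

print(missing_counts.sort_values(ascending=False))
print(most_missing)
```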