
Visualizing Missing Data

Plotting missing values is a quick way to see how much of your data is missing. It can also reveal when variables are missing in a pattern, which needs to be handled with care lest your model be biased.
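As a rough illustration of what "missing in a pattern" can look like, the sketch below builds a small made-up pandas DataFrame (the column names and values are hypothetical, not from the course dataset) and checks whether two columns tend to be missing on the same rows. Correlating the True/False mask is only one possible check, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy_df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, np.nan, 180.0, 90.0],
    "sqft": [800.0, np.nan, 1200.0, np.nan, 950.0, 700.0],
    "year_built": [1990.0, 2001.0, np.nan, 1985.0, 2010.0, 1975.0],
})

# True where a value is missing, False otherwise
mask = toy_df.isnull()

# Fraction of rows missing in each column
missing_share = mask.mean()

# Correlation between the missingness indicators: values near 1 mean
# two columns tend to be missing on the same rows
co_missing = mask.astype(int).corr()

print(missing_share)
print(co_missing)
```

Here `price` and `sqft` are missing on exactly the same rows, so their indicators are perfectly correlated, which is the kind of structure a heatmap of the mask makes visible as aligned bands.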

Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, fill in the value and hit "Submit Answer".

This exercise is part of the course

Feature Engineering with PySpark


Instructions

  • Use select() to subset the dataframe df with the list of columns columns, sample it with the provided sample() call, and assign the result to the variable sample_df.
  • Convert the sampled dataframe to a pandas dataframe pandas_df, then use pandas isnull() to convert its values to True/False. Store this result in tf_df.
  • Use seaborn's heatmap() to plot tf_df.
  • Hit "Run Code" to view the plot. Then assign the name of the variable with the most missing values to answer.
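A numeric cross-check can complement the heatmap when reading it by eye is ambiguous. The sketch below is not part of the exercise solution: it assumes a small made-up `pandas_df` standing in for the result of `toPandas()`, and the names `missing_counts` and `most_missing` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pandas dataframe produced by toPandas()
pandas_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", None, "z"],
})

# Same step as in the exercise: True where a value is missing
tf_df = pandas_df.isnull()

# Summing a boolean column counts its True values,
# giving the number of missing entries per column
missing_counts = tf_df.sum().sort_values(ascending=False)

# Label of the column with the largest missing count
most_missing = missing_counts.idxmax()

print(missing_counts)
print(most_missing)
```

On real data the counts come from a 10% sample, so they indicate which column is most affected rather than giving exact totals.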

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Sample the dataframe and convert to Pandas
____ = df.select(____).sample(False, 0.1, 42)
____ = ____.toPandas()

# Convert all values to T/F
tf_df = ____.____()

# Plot it
sns.____(data=____)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()

# Set the answer to the column with the most missing data
answer = '____'