ComeçarComece de graça

Visualizing Missing Data

Being able to plot missing values is a great way to quickly understand how much of your data is missing. It can also help highlight when variables are missing in a pattern something that will need to be handled with care lest your model be biased.

Which variable has the most missing values? Run all lines of code except the last one to determine the answer. Once you're confident, and fill out the value and hit "Submit Answer".

Este exercício faz parte do curso

Feature Engineering with PySpark

Ver curso

Instruções do exercício

  • Use select() to subset the dataframe df with the list of columns columns and Sample with the provided sample() function, and assign this dataframe to the variable sample_df.
  • Convert the Subset dataframe to a pandas dataframe pandas_df, and use pandas isnull() to convert it DataFrame into True/False. Store this result in tf_df.
  • Use seaborn's heatmap() to plot tf_df.
  • Hit "Run Code" to view the plot. Then assign the name of the variable with most missing values to answer.

Exercício interativo prático

Experimente este exercício completando este código de exemplo.

# Sample the dataframe and convert to Pandas
____ = df.select(____).sample(False, 0.1, 42)
____ = ____.toPandas()

# Convert all values to T/F
tf_df = ____.____()

# Plot it
sns.____(data=____)
plt.xticks(rotation=30, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
plt.show()

# Set the answer to the column with the most missing data
answer = '____'
Editar e executar o código