Visualizing many variables
As you begin to consider more variables, plotting them all at the same time becomes increasingly difficult. In addition to using x and y scales for two numeric variables, you can use color for a third numeric variable, and you can use faceting for categorical variables. And that's about your limit before the plots become too difficult to interpret. There are some specialist plot types like correlation heatmaps and parallel coordinates plots that will handle more variables, but they give you much less information about each variable, and they aren't great for visualizing model predictions.
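To make the trade-off with correlation heatmaps concrete, here is a minimal sketch on made-up data. The DataFrame `toy` and its columns `dist`, `stores`, and `price` are hypothetical stand-ins, not the course dataset. Each pair of variables is reduced to a single correlation coefficient, which is why the heatmap scales to many variables but says little about any one of them.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: three numeric variables with a known relationship
rng = np.random.default_rng(1)
n = 100
toy = pd.DataFrame({
    "dist": rng.uniform(5, 80, n),
    "stores": rng.integers(0, 11, n).astype(float),
})
toy["price"] = 15 - 0.1 * toy["dist"] + 0.3 * toy["stores"] + rng.normal(0, 1, n)

# One Pearson correlation per pair of variables
corr = toy.corr()

# Draw the matrix as a heatmap; annot=True prints each coefficient in its cell
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

The scatter plot approach used in this exercise keeps every observation visible, at the cost of handling fewer variables.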
Here you'll push the limits of the scatter plot by showing the house price, the distance to the MRT station, the number of nearby convenience stores, and the house age, all together in one plot.
taiwan_real_estate is available.
This exercise is part of the course Intermediate Regression with statsmodels in Python.
Exercise instructions
- Create a facet grid for each house_age_years in taiwan_real_estate.
- Using the taiwan_real_estate dataset, draw a scatter plot of n_convenience versus sqrt_dist_to_mrt_m, colored by price_twd_msq.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Prepare the grid using taiwan_real_estate, for each house age category, colored by price_twd_msq
grid = ____(data=____,
            col=____,
            hue=____,
            palette="plasma")

# Plot the scatterplots with sqrt_dist_to_mrt_m on the x-axis and n_convenience on the y-axis
grid.map(____,
         ____,
         ____)

# Show the plot (brighter colors mean higher prices)
plt.show()