Dealing with too many categories
Sometimes you may be short on figure space and need to show a lot of data at once. Here you want to show the year-long trajectory of every pollutant for every city in the pollution dataset. Each pollutant trajectory will be plotted as a line, with the y-value corresponding to standard deviations from that year's average. This means you will have a lot of lines on your plot at once, far more than you could clearly separate with color.
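As a rough illustration of what "standard deviations from the year's average" means, the sketch below standardizes a single made-up trajectory. The readings are invented for illustration, and the use of the population standard deviation (pstdev) is an assumption; the course's dataset may have been standardized differently.

```python
from statistics import mean, pstdev

# Hypothetical monthly readings for one city-pollutant pair (made-up numbers)
monthly_values = [10, 12, 11, 15, 18, 22, 25, 24, 19, 14, 11, 9]

year_mean = mean(monthly_values)
year_std = pstdev(monthly_values)  # assumption: population standard deviation

# Each point becomes "how many standard deviations from the year's average"
z_scores = [(v - year_mean) / year_std for v in monthly_values]

for month, z in enumerate(z_scores, start=1):
    print(f"month {month:2d}: {z:+.2f}")
```

Because every trajectory is rescaled this way, pollutants measured in different units can share a single y-axis.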
To deal with this, you have decided to highlight a small subset of city-pollutant combinations (wanted_combos). This subset is the most important to you, and the other trajectories will provide valuable context for comparison. To focus attention, you will set all the non-highlighted trajectory lines to the same 'other' color.
This exercise is part of the course Improving Your Data Visualizations in Python.
Exercise instructions
- Modify the list comprehension to isolate the desired combinations of city and pollutant (wanted_combos).
- Tell the line plot to color the lines by the newly created color_cats column in your DataFrame.
- Use the units argument to determine how, i.e., from which column, the data points should be connected to form each line.
- Disable the binning of points with the estimator argument.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Choose the combos that get distinct colors
wanted_combos = ['Vandenberg Air Force Base NO2', 'Long Beach CO', 'Cincinnati SO2']
# Assign a new column to DataFrame for isolating the desired combos
city_pol_month['color_cats'] = [x if x in ____ else 'other' for x in city_pol_month['city_pol']]
# Plot lines with color driven by new column and lines driven by original categories
sns.lineplot(x="month",
             y="value",
             hue='____',
             units='____',
             estimator=____,
             palette='Set2',
             data=city_pol_month)
plt.show()