1. Binned scatterplots
So far in this course, you have been working with small to moderately sized datasets, but often you will want to explore large datasets.
When you have large datasets, some of the charts we have seen will perform poorly. In this lesson, you will see that scatterplots scale poorly, and learn how to create binned scatterplots to represent large datasets.
2. Overplotting
To understand the limitations of scatterplots, consider this scatterplot of simulated data. What do you see?
The dataset contains sixty thousand observations. If we naively create a standard scatterplot, the points overlap so heavily that it is impossible to judge how densely they are packed in any region. We can still see the overarching structure, but the detail within it is lost.
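To make the overplotting problem concrete, here is a minimal sketch in R. The simulated data frame `sim_data`, its columns `x` and `y`, and the three cluster centers are assumptions chosen for illustration; they are not the dataset shown on the slide.

```r
library(plotly)

# Hypothetical stand-in for the simulated data: 60,000 points in three clusters
set.seed(42)
n <- 60000
cx <- c(-2, 0, 2)   # assumed cluster centers (x)
cy <- c(-1, 2, -1)  # assumed cluster centers (y)
cluster <- sample(1:3, n, replace = TRUE)
sim_data <- data.frame(
  x = rnorm(n, mean = cx[cluster], sd = 0.5),
  y = rnorm(n, mean = cy[cluster], sd = 0.5)
)

# A standard scatterplot: every observation is drawn as its own marker
p <- sim_data %>%
  plot_ly(x = ~x, y = ~y) %>%
  add_markers()
p
```

With this many markers, the dense regions should render as solid blobs, which is the overplotting described above.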
3. Opacity
In previous examples, we saw that decreasing the opacity of the points can improve readability. Here, the opacity is set to 0.1. This has certainly improved the readability of the chart; however, we are still plotting sixty thousand points, which bogs down the interactive features.
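One way to lower the opacity, sketched under the same assumptions as before (a hypothetical `sim_data` with columns `x` and `y`), is to set the `opacity` attribute of the markers. Other ways of setting transparency exist; this is only one option.

```r
library(plotly)

# Hypothetical simulated data: 60,000 points in three clusters
set.seed(42)
n <- 60000
cx <- c(-2, 0, 2)
cy <- c(-1, 2, -1)
cluster <- sample(1:3, n, replace = TRUE)
sim_data <- data.frame(
  x = rnorm(n, mean = cx[cluster], sd = 0.5),
  y = rnorm(n, mean = cy[cluster], sd = 0.5)
)

# Same scatterplot, but each marker is drawn at 10% opacity
p <- sim_data %>%
  plot_ly(x = ~x, y = ~y) %>%
  add_markers(marker = list(opacity = 0.1))
p
```

Note that the chart still contains one marker per observation, so the rendering cost is unchanged.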
4. Binning a scatterplot
Binning is an intuitive way to deal with too many overlapping points on a scatterplot. The general idea is to tile the x-y plane with uniform bins and count the number of observations falling into each one. Each tile is then colored according to its count.
In this example, rectangular bins are used with our simulated data, making it easy to see three clusters.
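The tile-and-count idea can be sketched by hand in base R. This is only an illustration of the concept, not how plotly computes its bins; the data frame `sim_data` and the choice of 30 bins per axis are assumptions.

```r
# Hypothetical simulated data: 60,000 points in three clusters
set.seed(42)
n <- 60000
cx <- c(-2, 0, 2)
cy <- c(-1, 2, -1)
cluster <- sample(1:3, n, replace = TRUE)
sim_data <- data.frame(
  x = rnorm(n, mean = cx[cluster], sd = 0.5),
  y = rnorm(n, mean = cy[cluster], sd = 0.5)
)

# Split each axis into 30 equal-width intervals (assumed bin count)
x_bin <- cut(sim_data$x, breaks = 30)
y_bin <- cut(sim_data$y, breaks = 30)

# Count the observations that fall into each rectangular tile
counts <- table(x_bin, y_bin)

# Draw the counts as colors: one cell per tile
image(unclass(counts))
```

Every observation lands in exactly one tile, so the counts sum to the number of rows, and the plot needs only 30 x 30 cells rather than sixty thousand markers.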
5. add_histogram2d()
To create the binned scatterplot, we begin with the typical data-plot pipeline, mapping columns to the x- and y-axes.
Then, we pipe this base layer into add_histogram2d() to create the binned scatterplot.
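The two steps just described might look like the following sketch. The data frame `sim_data` and its columns `x` and `y` are hypothetical stand-ins for the simulated data.

```r
library(plotly)

# Hypothetical simulated data: 60,000 points in three clusters
set.seed(42)
n <- 60000
cx <- c(-2, 0, 2)
cy <- c(-1, 2, -1)
cluster <- sample(1:3, n, replace = TRUE)
sim_data <- data.frame(
  x = rnorm(n, mean = cx[cluster], sd = 0.5),
  y = rnorm(n, mean = cy[cluster], sd = 0.5)
)

# Base layer: map columns to the x- and y-axes,
# then add the 2D histogram trace with default binning
p <- sim_data %>%
  plot_ly(x = ~x, y = ~y) %>%
  add_histogram2d()
p
```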
6. Changing the bins
As with one-dimensional histograms, the default number of bins is not always ideal. You can adjust the number of bins created for the x- and y-axes independently by adding the nbinsx and nbinsy arguments to add_histogram2d().
In this example, we specify nbinsx = 200 and nbinsy = 100, resulting in 200 bins on the x-axis and 100 bins on the y-axis.
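A sketch of the same plot with the bin counts set explicitly, again assuming a hypothetical `sim_data` with columns `x` and `y`:

```r
library(plotly)

# Hypothetical simulated data: 60,000 points in three clusters
set.seed(42)
n <- 60000
cx <- c(-2, 0, 2)
cy <- c(-1, 2, -1)
cluster <- sample(1:3, n, replace = TRUE)
sim_data <- data.frame(
  x = rnorm(n, mean = cx[cluster], sd = 0.5),
  y = rnorm(n, mean = cy[cluster], sd = 0.5)
)

# Request 200 bins along x and 100 bins along y
p <- sim_data %>%
  plot_ly(x = ~x, y = ~y) %>%
  add_histogram2d(nbinsx = 200, nbinsy = 100)
p
```

More bins give finer detail but noisier counts per tile, so it is worth trying a few values to see which reveals the structure best.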
7. Let's practice!
Now that you know the basics behind creating binned scatterplots in plotly, it's time to practice.