
Exploring the patterns

1. Exploring the patterns

Once you've familiarized yourself with your dataset, you can start applying more complex visualization techniques to explore patterns.

2. Digging in deeper

The basic scatter matrix approach to looking at your data does a great job of illuminating interesting relationships. The next step is to start investigating these relationships further. Usually, visualizations in this stage of the process involve looking at the relationships of two or three variables. Do they correlate strongly? Are the correlations driven by some other observed or unobserved confounding variable? Do these patterns fit with your expectations - or do they surprise you?

3. Target audiences

While the visualizations we saw in the last lesson are almost exclusively viewed by the person making them, visualizations made in this second stage may be shared with peers. For instance, if you're on a data science team, you may want to run your results by another team member to see if they have any insights into the stories the data are telling. If you need to share a visualization, you'll want to be a bit more careful with the aesthetics of your plots. Make sure you're using appropriate color palettes and axes arrangement, because your viewer will not be as intimately familiar with the inner workings of the plot and data as you are.

4. Using regplot() to investigate correlations

Often, if you have a lot of data, or data that overlaps due to axis boundaries, you can benefit from a simple linear regression line that shows a relationship the data may be concealing. It also often helps to reduce the point opacity of the scatter plot using the scatter_kws argument, so that overlapping points become visible. You should never take the results of these regressions as valid statistical inference, because they often violate many assumptions of regression; for example, the underlying relationship may not be linear at all. However, these plots can be invaluable in helping you see patterns that were not immediately apparent in your scatter matrix. They may also do the opposite, by showing you that a pattern you thought you saw was simply spurious.
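As a rough sketch of what this can look like, the snippet below draws a seaborn regplot() with semi-transparent points. The DataFrame name pollution, the column names CO and NO2, and the randomly generated values are all assumptions made up for illustration; they are not the lesson's actual data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical): two loosely related pollutant columns.
rng = np.random.default_rng(0)
co = rng.normal(1.0, 0.3, 500)
no2 = 2.0 * co + rng.normal(0.0, 0.5, 500)
pollution = pd.DataFrame({"CO": co, "NO2": no2})

# Scatter plot with a fitted linear regression line.
# scatter_kws is forwarded to the underlying scatter call, so a low alpha
# makes heavily overlapping points show up as darker regions.
ax = sns.regplot(x="CO", y="NO2", data=pollution,
                 scatter_kws={"alpha": 0.2})
ax.set_title("NO2 vs. CO (synthetic data)")
```

In a script you would typically follow this with matplotlib's plt.show() or a savefig() call on the figure to view the result.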

5. Profiling patterns

Say you've dug into an interesting relationship between two variables in your dataset. You see a clear trend in most of the data, but there are a few points that do not fit with the rest. How can you quickly dig into this pattern to explore the data that are causing the pattern and potentially explain what's causing it? In the first chapter, we learned how carefully considered text annotations can greatly improve a visualization. Another way text can come in handy is when it's used in excess.

6. Using text scatters to id outliers

In this plot of monthly pollutant values in Denver, we have a clear outlier in the upper right corner of the plot. How do we quickly find out what month the outlier belongs to? One way to do this is by adding the datapoint's id in text to every point. This will help you immediately see what the outliers are, but can also help you see patterns in the data such as similar points tending to fall in the same region. Here we can see that the outlier is the month of January.

7. Looping through a DataFrame for annotation

Luckily, it's not too hard to build a text scatter with a for loop and the iterrows() method in pandas. When you loop over a DataFrame using iterrows(), each iteration gives you the row's index and a Series holding that row's values, keyed by column name. All you then need to do is pass the desired values to the annotate() function.
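A minimal sketch of that loop is shown here. The DataFrame name monthly, its column names, and every value in it are invented for illustration, so treat this as one possible way to build a text scatter rather than the lesson's exact code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly averages (hypothetical values, for illustration only).
monthly = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "CO":    [0.90, 0.55, 0.45, 0.40, 0.35, 0.30],
    "NO2":   [38.0, 27.0, 24.0, 21.0, 19.0, 18.0],
})

fig, ax = plt.subplots()

# Annotations do not update the axis limits, so draw invisible points
# first to make the axes span the data.
ax.scatter(monthly["CO"], monthly["NO2"], alpha=0)

# iterrows() yields (index, row) pairs; row is a Series keyed by column name.
for _, row in monthly.iterrows():
    ax.annotate(row["month"], xy=(row["CO"], row["NO2"]),
                ha="center", va="center")

ax.set_xlabel("CO")
ax.set_ylabel("NO2")
```

Each point is replaced by its month label, so an outlier can be read off the plot directly.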

8. Looping through a DataFrame for annotation (plot)

Using text like this allows you to easily dig into patterns and new avenues of exploration. Here, we see Long Beach's January, while still in the upper right, is much less of an outlier than Denver's was.

9. Let's dig in

Let's put these techniques to use exploring our market data.