First explorations

1. Looking at the farmers market data

Welcome to the final chapter of this course, where we will explore the process of using visualization to take a data project from exploration to presentation.

2. First explorations of a dataset

In this first lesson, we start with the first step of every data science project: exploring the dataset. When working with a brand-new dataset, there are a few principles to keep in mind for your visualizations. First, take as broad a view of the dataset as you can. If you start your analysis by searching for a specific story, you may miss more interesting stories in the data and, potentially worse, mold the analysis to fit a narrative. To help keep your exploration broad, your visualizations should show as much information as possible. Since you are the only viewer of these visualizations, you don't need to worry about making them look great; you simply want to iterate rapidly through informative views of the data. This means you shouldn't worry about overlapping axis labels or a well-thought-out legend. It's a liberating situation.

3. Using your head()

The first explorations of the dataset aren't visualizations, but they are critical to knowing how to visualize the data later. Two pandas commands, head() and describe(), can help you get acquainted with your data. By using head() to print out the first few rows of the dataset, you can get an idea of what columns you have, and which of them are numeric and which are categorical.
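As a rough sketch of this step, the snippet below calls head() on a small made-up table. The DataFrame name markets and its columns are hypothetical stand-ins, not the actual course data. It also uses select_dtypes(), which the lesson does not mention, as one possible way to list the numeric and categorical columns.

```python
import pandas as pd

# Hypothetical stand-in data; names and values are illustrative only.
markets = pd.DataFrame({
    "name": ["Market A", "Market B", "Market C", "Market D", "Market E", "Market F"],
    "state": ["Ohio", "Texas", "Ohio", "Maine", "Texas", "Ohio"],
    "months_open": [6, 12, 4, 5, 12, 7],
    "num_items_sold": [11, 20, 7, 9, 18, 12],
})

# head() returns the first 5 rows by default.
first_rows = markets.head()
print(first_rows)

# Split the column names by type: numeric versus everything else.
numeric_cols = markets.select_dtypes(include="number").columns.tolist()
categorical_cols = markets.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```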

4. describe() before visualizing

describe() helps you explore the basic statistics of these columns. Does a column only ever contain one value? Do you have weird extreme values in your numeric columns? One important thing to remember is to pass include='all', so pandas shows you all your columns instead of just the numeric ones.
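To illustrate the difference, here is a minimal sketch comparing describe() with and without include='all'. The table is invented for the example: the country column is deliberately constant and months_open contains one implausible value, so both questions above have something to find.

```python
import pandas as pd

# Hypothetical stand-in data; names and values are illustrative only.
markets = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio", "Maine"],
    "country": ["US", "US", "US", "US"],  # only ever one value
    "months_open": [6, 12, 4, 120],       # 120 is a suspicious extreme
})

# By default, describe() summarizes only the numeric columns.
numeric_only = markets.describe()

# include="all" adds the non-numeric columns (count, unique, top, freq).
full_summary = markets.describe(include="all")
print(full_summary)
```

In the full summary, the unique row reveals the single-valued column, and the max row exposes the extreme value.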

5. The scatter matrix

Once you have familiarized yourself with the form of your data, one of the very best visualizations to run is known as a scatter matrix. This plot lets you compare correlations between all pairs of continuous variables in your dataset by arranging a series of scatterplots in a grid (or matrix) with each row and column corresponding to a column in your dataset.

6. The scatter matrix (b)

For instance, in this scatter matrix of our pollution data, the plot in the second row and first column shows the relationship between the CO and NO2 values. This is equivalent to calling Seaborn's scatterplot() with x as CO and y as NO2. One important tweak you will likely want to make is to lower the opacity with the alpha argument, so that overlapping points remain visible in the small scatter plots.

7. The scatter matrix (c)

pandas helpfully places histograms showing the distribution of each column on the diagonal. Overall, scatter matrices show a huge amount of information in a single display.
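A scatter matrix like the one described can be sketched with pandas.plotting.scatter_matrix(). The pollution values below are randomly generated placeholders, not the course's pollution dataset, and the alpha and figsize values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Randomly generated placeholder data standing in for the pollution dataset.
rng = np.random.default_rng(0)
co = rng.normal(1.0, 0.3, 500)
pollution = pd.DataFrame({
    "CO": co,
    "NO2": 20 + 10 * co + rng.normal(0, 3, 500),  # loosely tied to CO
    "O3": rng.normal(30, 8, 500),
})

# One scatterplot per pair of numeric columns, histograms on the diagonal.
# A low alpha makes overlapping points visible in the small panels.
axes = pd.plotting.scatter_matrix(
    pollution, alpha=0.2, figsize=(8, 8), diagonal="hist"
)
```

The call returns a grid of matplotlib axes with one row and one column per numeric column. In a notebook the figure renders on its own; in a script you would typically follow it with matplotlib's plt.show().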

8. Farmers market data

In this chapter, we will be working with a new dataset. You will be investigating data provided by the US Department of Agriculture on registered farmers markets across the US. This data contains info on the name of each market, its location, how many months it runs, the goods sold there, and the population of the state it is in. There are lots of interesting stories that can be told with this data, which you will illustrate with your visualizations in the exercises.

9. Let's explore our data

Now jump into this dataset by exploring it using the techniques we just discussed.