Perform EDA

1. Perform EDA

Welcome back. After looking at the data with head(), info() and describe() in previous lessons, we will now focus on graphical inspection of the data.

2. Plot DataFrame

Pandas makes this easy by providing the method DataFrame.plot(), which draws a line graph of the available data. As you can see, pandas automatically provides a legend. We set the title of the plot to Environment by passing it as the keyword argument title. The x-axis label is set automatically to the index name, timestamp in this case. Since the columns are on very different scales, the result is a very messy plot that hides a lot of information.
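As a minimal sketch of this call, the snippet below builds a small synthetic DataFrame in place of the lesson's dataset. The values and the date range are made up for illustration; only the column names, the timestamp index name and the title come from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the lesson's real measurements)
rng = np.random.default_rng(0)
n = 200
index = pd.date_range("2020-01-01", periods=n, freq="D", name="timestamp")
df = pd.DataFrame(
    {
        "temperature": rng.uniform(5, 30, n),
        "humidity": rng.uniform(40, 95, n),
        "pressure": rng.uniform(900, 1000, n),
    },
    index=index,
)

# One line per column; the legend is added automatically
ax = df.plot(title="Environment")
plt.savefig("environment.png")
```

With pressure near 1000 and temperature below 30 on the same axis, the temperature and humidity lines are squeezed towards the bottom of the plot, which is the messiness described above.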

3. Line plot

We can clean this up by selecting columns within the same value range, like temperature and humidity, which both have a maximum below 100. We also change the label on the x-axis to Time by passing it to plt.xlabel(). This does not cover all cases, though, since we might want to compare temperature with pressure.
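A sketch of the column selection and the x-axis label could look like this, again assuming synthetic values rather than the lesson's actual data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the lesson's real measurements)
rng = np.random.default_rng(1)
n = 200
index = pd.date_range("2020-01-01", periods=n, freq="D", name="timestamp")
df = pd.DataFrame(
    {
        "temperature": rng.uniform(5, 30, n),
        "humidity": rng.uniform(40, 95, n),
        "pressure": rng.uniform(900, 1000, n),
    },
    index=index,
)

# Keep only the columns that share a similar value range
cols = ["temperature", "humidity"]
ax = df[cols].plot(title="Environment")

# Override the automatic x-axis label (the index name) with "Time"
plt.xlabel("Time")
```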

4. Secondary y

We can do just that by using a second y-axis for the plot. By setting the keyword argument secondary_y to the column name, we can plot both series, even though the values for temperature range between 5 and 30 while pressure is between 900 and 1000. The data labels for temperature are on the left y-axis, while the labels for pressure have moved to the right y-axis. To label both the left and right y-axis, we first need to set the left axis label, then generate the plot, before setting the right axis label by using plt.ylabel() a second time. Notice that towards the end of the month, the weather seems to have changed.
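The sketch below shows one way to plot with a secondary y-axis. Note the assumptions: the data is synthetic, and instead of the plt.ylabel() sequence described above it labels the two axes through the axes objects, using the right_ax attribute that pandas attaches to the returned axes when secondary_y is used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the lesson's real measurements)
rng = np.random.default_rng(2)
n = 200
index = pd.date_range("2020-01-01", periods=n, freq="D", name="timestamp")
df = pd.DataFrame(
    {
        "temperature": rng.uniform(5, 30, n),
        "pressure": rng.uniform(900, 1000, n),
    },
    index=index,
)

# Draw pressure against a second y-axis on the right-hand side
ax = df.plot(title="Environment", secondary_y=["pressure"])

# Label the left axis (temperature) and the right axis (pressure) explicitly
ax.set_ylabel("Temperature")
ax.right_ax.set_ylabel("Pressure")
```

Labelling through ax and ax.right_ax makes it explicit which axis each label belongs to, without depending on the order of the plt calls.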

5. Histogram basics

Let's now look at a different type of commonly used plot, the histogram. The histogram shows the frequency of values on the Y-axis, and the bins on the X-axis. It's a great way to see the distribution of a series. In the picture, you can see an example of a normally distributed series, also called a bell-curve, because of the shape of the histogram.
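To make the idea of bins and frequencies concrete, here is a small sketch that draws a histogram of normally distributed random numbers. The sample size, seed and bin count are arbitrary choices for illustration and are not taken from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Draw samples from a standard normal distribution (synthetic data)
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# plt.hist returns the count per bin and the bin edges
counts, bin_edges, _ = plt.hist(samples, bins=20)
plt.xlabel("Value (bins)")
plt.ylabel("Frequency")
plt.title("Normally distributed series")
```

The tallest bars sit in the middle bins and the counts fall off on both sides, which gives the bell shape.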

6. Histogram

To have a look at the distribution of the different columns, we can use df.hist(), which plots one histogram per numeric column in our DataFrame. By passing the keyword argument bins=20, we specify that we would like to split the data into 20 bins. If we look at the dataset, we can observe that the temperature histogram on the bottom right is close to normally distributed. We can also see that radiation has a high number of 0 values, which we also saw when looking at the summary statistics in previous lessons. Pressure and humidity both have their highest frequency towards the right of the plot, so most of their values lie close to their maximum, with a tail extending to the left.
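A sketch of the df.hist() call follows, assuming a synthetic DataFrame with the four columns named in the lesson. The distributions are invented to loosely mimic the description above and do not reproduce the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the lesson's real measurements)
rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame(
    {
        "temperature": rng.normal(18, 4, n),
        "humidity": 100 - rng.exponential(10, n),
        "pressure": 1000 - rng.exponential(15, n),
        # Roughly half of the radiation values are exactly 0
        "radiation": np.where(rng.random(n) < 0.5, 0.0, rng.uniform(0, 800, n)),
    }
)

# One histogram per numeric column, each split into 20 bins
axes = df.hist(bins=20)
plt.tight_layout()
```

df.hist() returns an array of axes, one per subplot, so individual histograms can still be adjusted after the call.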

7. Let's practice!

Now it's your turn to try this out yourself.
