Summarizing datetime data in Pandas

1. Summarizing datetime data in Pandas

In this lesson, we will discuss how to summarize Pandas tables, especially when we have datetime columns. One note: Pandas continues to evolve quickly. Many of the techniques in this chapter don't work on versions of Pandas more than a few years old. If anything breaks on your personal computer, make sure you're using at least Pandas version 0-point-23.

2. Summarizing data in Pandas

First things first, let's review some general principles for summarizing data in Pandas. You can call dot-mean(), dot-median(), dot-sum() and so on, on any column where it makes sense. For example, rides['Duration']-dot-mean() shows that the average time the bike was out of the dock was 19 minutes and 38 seconds. We can also ask: how much is this column in total? By using the dot-sum() method, we can see that the bike was out of the dock for a total of 3 days, 22 hours, 58 minutes and 10 seconds during this time period.
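As a minimal sketch of these calls, the example below builds a tiny stand-in for the rides table. The three durations are made up for illustration and are not the lesson's data; only the 'Duration' column name comes from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's rides table (three made-up rides).
rides = pd.DataFrame({
    "Duration": pd.to_timedelta(["00:03:01", "02:07:02", "00:05:43"]),
})

# Summary methods work directly on a timedelta column.
mean_duration = rides["Duration"].mean()
median_duration = rides["Duration"].median()
total_duration = rides["Duration"].sum()

print(mean_duration)
print(median_duration)
print(total_duration)
```

Each result is itself a Timedelta, so it can be compared with, or divided by, other time spans.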

3. Summarizing data in Pandas

The output of Pandas operations mixes perfectly well with the rest of Python. For example, if we divide this sum by 91 days (the number of days from October 1 to December 31), we see that the bike was out about 4.3% of the time, meaning about 96% of the time the bike was in the dock.
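The arithmetic can be sketched as follows. The total is entered by hand from the figure quoted in the lesson rather than computed from the real table, and the variable names are placeholders.

```python
from datetime import timedelta

import pandas as pd

# Total time out of the dock, as quoted in the lesson.
total = pd.Timedelta(days=3, hours=22, minutes=58, seconds=10)

# Dividing a Pandas Timedelta by a standard-library timedelta gives a plain float.
fraction_out = total / timedelta(days=91)

print(round(fraction_out * 100, 1))  # 4.3
```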

4. Summarizing data in Pandas

For non-numeric columns, we have other ways of making summaries. The dot-value_counts() method tells us how many times a given value appears. In this case, we want to know how often the Member type is Member or Casual. 236 rides were from Members, and 54 were from Casual riders, who bought a ride at the bike kiosk without a membership. We can also divide by the total number of rides, using len(rides), and Pandas handles the division for us across our result. 81-point-4% of rides were from members, whereas 18-point-6% of rides were from casual riders.
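A short sketch of the same calculation, using a stand-in table that contains only a 'Member type' column. It is built to match the counts quoted in the lesson; it is not the real data.

```python
import pandas as pd

# Stand-in table: 236 Member rides and 54 Casual rides, as quoted in the lesson.
rides = pd.DataFrame({"Member type": ["Member"] * 236 + ["Casual"] * 54})

# How many times does each value appear?
counts = rides["Member type"].value_counts()

# Dividing the whole result by a number is applied to every entry.
shares = counts / len(rides)

print(counts)
print(shares.round(3))
```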

5. Summarizing datetime in Pandas

To make this next section easy, let's make a column called 'Duration seconds', which will be the original column 'Duration' converted to seconds. Pandas has powerful ways to group rows together. First, we can group by values in any column, using the dot-groupby() method. dot-groupby() takes a column name and does all subsequent operations on each group. For example, we can group by Member type, and ask for the mean duration in seconds for each member type. Rides from casual riders last nearly twice as long on average.
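Here is one way this could look, on a four-row stand-in table with made-up durations. Using dot-dt-dot-total_seconds() for the conversion is an assumption about how the column is built; the lesson does not show that step.

```python
import pandas as pd

# Hypothetical stand-in for the rides table.
rides = pd.DataFrame({
    "Member type": ["Member", "Casual", "Member", "Casual"],
    "Duration": pd.to_timedelta(["00:10:00", "00:30:00", "00:20:00", "00:40:00"]),
})

# Convert the timedelta column to a plain number of seconds.
rides["Duration seconds"] = rides["Duration"].dt.total_seconds()

# Group rows by member type, then average the seconds within each group.
mean_by_type = rides.groupby("Member type")["Duration seconds"].mean()

print(mean_by_type)
```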

6. Summarizing datetime in Pandas

Second, we can also group by time, using the dot-resample() method. dot-resample() takes a unit of time (for example, 'M' for month), and a datetime column to group on, in this case 'Start date'. From this we can see that, in the month ending on October 31st, the average ride lasted 1886 seconds, or about 30 minutes, whereas for the month ending December 31, the average ride lasted 635 seconds, or closer to ten minutes.
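A minimal sketch of monthly resampling on made-up rides. The lesson passes the string 'M'; newer Pandas releases favor 'ME' for month end instead, so this sketch passes the equivalent offset object, pd.offsets.MonthEnd(), which should behave the same way across versions.

```python
import pandas as pd

# Hypothetical stand-in for the rides table.
rides = pd.DataFrame({
    "Start date": pd.to_datetime(["2017-10-05", "2017-10-20", "2017-12-10"]),
    "Duration seconds": [1000.0, 3000.0, 600.0],
})

# Group rides into calendar months based on 'Start date', then average.
# Each group is labeled with the last day of its month.
monthly_mean = rides.resample(pd.offsets.MonthEnd(), on="Start date")["Duration seconds"].mean()

print(monthly_mean)
```

Months with no rides still appear in the result, with a missing value for the mean.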

7. Summarizing datetime in Pandas

There are also other methods which operate on groups. For example, we can call dot-size() to get the size of each group. Or we can call dot-first() to get the first row of each group.
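Both methods can be sketched on a small made-up table grouped by member type. One detail worth knowing: dot-first() takes the first non-missing value in each column, which matches the first row whenever that row has no missing values.

```python
import pandas as pd

# Hypothetical stand-in for the rides table.
rides = pd.DataFrame({
    "Member type": ["Member", "Casual", "Member", "Casual"],
    "Duration seconds": [600.0, 1800.0, 1200.0, 2400.0],
})

by_type = rides.groupby("Member type")

# Number of rows in each group.
sizes = by_type.size()

# First entry of each group, one row per member type.
first_rows = by_type.first()

print(sizes)
print(first_rows)
```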

8. Summarizing datetime in Pandas

Pandas also makes it easy to plot results. Just add the dot-plot() method at the end of your call and it will pass the results to the Python plotting library Matplotlib. It will usually have sensible defaults, though if you want to change things further you can.
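As a rough sketch, plotting a resampled result might look like this. The data is made up, and the 'Agg' backend and the output file name are assumptions chosen so the example runs without a display; in a notebook you would normally leave those lines out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the rides table.
rides = pd.DataFrame({
    "Start date": pd.to_datetime(["2017-10-05", "2017-11-20", "2017-12-10"]),
    "Duration seconds": [1800.0, 900.0, 600.0],
})

# Adding .plot() at the end hands the result to Matplotlib and returns the axes.
ax = (
    rides.resample(pd.offsets.MonthEnd(), on="Start date")["Duration seconds"]
    .mean()
    .plot()
)
ax.set_ylabel("Mean duration (seconds)")

plt.savefig("monthly_mean.png")  # hypothetical file name
```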

9. Summarizing datetime in Pandas

We can also change the resampling rate from 'M' for months to 'D' for days, and plot again. Now we can see that there is at least one big outlier skewing our data: one ride in the middle of October was 25000 seconds long, or nearly 7 hours. We identified this ride in an earlier chapter as possibly a bike repair. Now we can see that it happened after many days with zero rides, which lends strength to that idea. If the bike was broken and sitting in the dock for a while, eventually it would have been removed for repairs, then returned.
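The same pattern can be checked without a plot. In this sketch, with made-up dates and one long ride standing in for the outlier, the daily mean is missing on days without rides, and dot-size() counts those days as zero.

```python
import pandas as pd

# Hypothetical stand-in: one ordinary ride, a gap, then one very long ride.
rides = pd.DataFrame({
    "Start date": pd.to_datetime(["2017-10-01 08:00", "2017-10-04 09:30"]),
    "Duration seconds": [600.0, 25000.0],
})

daily = rides.resample("D", on="Start date")

# Mean duration per day; days with no rides have no mean.
daily_mean = daily["Duration seconds"].mean()

# Number of rides per day; days with no rides show up as 0.
daily_counts = daily.size()

print(daily_mean)
print(daily_counts)
```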

10. Summarizing datetime data in Pandas

In this lesson, we discussed how to use basic Pandas operations, such as dot-mean(), dot-median() and dot-sum(), and also dot-groupby() and dot-resample() to combine our rows into different groups. Time to practice what you've learned!