
Summarizing data

1. Summarizing data

When we first look at a dataset, it's common to ask big-picture questions about what the data holds.

2. Aggregating methods

NumPy has several great ways to summarize array information. We'll look at five aggregating methods: dot-sum, dot-min, dot-max, dot-mean, and dot-cumsum.

3. Our data

Let's imagine we run a data security firm with three large clients. We have a count of data breaches for each client by year in a 2D NumPy array. Each row holds the security breaches that happened in a given year, and each column holds the breaches for one client. Let's understand our data better by looking at some summary values.
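The transcript doesn't show the array itself, so here is a minimal sketch of what it might look like. The values and the name security_breaches are assumptions, chosen only so that they agree with the figures quoted in this lesson (five years, three clients).

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients.
# The values are assumed, picked to match the totals quoted in the lesson.
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# The shape tells us how many years (rows) and clients (columns) we have.
n_years, n_clients = security_breaches.shape
print(n_years, n_clients)
```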

4. Summing data

Our first aggregating method, dot-sum, adds up all elements in the entire array. Throughout the five years our firm has been tracking breaches, our three clients have seen 17 total security breaches.
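As a rough sketch of that call, assuming the same hypothetical security_breaches values used for illustration (they are not shown in the transcript):

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# With no axis argument, .sum() adds up every element in the array.
total_breaches = security_breaches.sum()
print(total_breaches)  # 17
```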

5. Aggregating rows

We can control which axis to sum across with the axis argument. Setting the axis argument equal to zero sums the values of all rows in each column, creating column totals. In this case, each total represents the number of security breaches a client has ever experienced. Client two has had a lot of breaches!
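A short sketch of the column totals, again assuming illustrative array values that are not part of the transcript:

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# axis=0 collapses the rows, leaving one total per column (per client).
client_totals = security_breaches.sum(axis=0)
print(client_totals)  # [ 3 10  4]
```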

6. Aggregating columns

Setting axis equal to one will sum the values of all columns in each row, creating row totals. Here, the totals represent the security breaches experienced by all clients in a given year.
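The row totals could be sketched like this, with the same assumed array values:

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# axis=1 collapses the columns, leaving one total per row (per year).
yearly_totals = security_breaches.sum(axis=1)
print(yearly_totals)  # [6 2 4 5 0]
```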

7. Making sense of the axis argument

If we are "summing the values of all columns in each row," it can be confusing at first to know whether the axis argument should refer to columns or rows. The key is to focus on the axis being collapsed. If we want one total per row, the columns are collapsed into a single column of totals, so the axis is set to one. If we want one total per column, the rows are collapsed into a single row of totals, so the axis is set to zero.
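One way to check which axis is being collapsed is to compare shapes before and after aggregating. This is only a sketch with assumed array values; the point is which dimension disappears from the shape.

```python
import numpy as np

# Hypothetical data: 5 years (rows) by 3 clients (columns), assumed values.
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

original_shape = security_breaches.shape            # (5, 3)
shape_axis0 = security_breaches.sum(axis=0).shape   # axis 0 (length 5) is collapsed
shape_axis1 = security_breaches.sum(axis=1).shape   # axis 1 (length 3) is collapsed

print(original_shape, shape_axis0, shape_axis1)  # (5, 3) (3,) (5,)
```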

8. Minimum and maximum values

Many aggregating methods use the same syntax as dot-sum. Dot-min and dot-max, for example, find the minimum or maximum of an entire array if no axis argument is set. The smallest number of breaches any client has experienced in a given year is 0. The largest is 5. And we can return the min or max of each column or each row if the axis argument is set to 0 or 1.
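Here is a small sketch of dot-min and dot-max with and without the axis argument, using assumed array values rather than the lesson's actual data:

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# No axis: a single value for the whole array.
overall_min = security_breaches.min()
overall_max = security_breaches.max()

# axis=0: one value per client; axis=1: one value per year.
max_per_client = security_breaches.max(axis=0)
max_per_year = security_breaches.max(axis=1)

print(overall_min, overall_max)  # 0 5
print(max_per_client)            # [2 5 2]
print(max_per_year)              # [5 2 2 2 0]
```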

9. Finding the mean

dot-mean operates the same way. The average number of security breaches that a client can expect in a year is about one-point-one-three. Setting axis equal to one finds the mean breaches by year across all clients; for example, the average number of breaches in the first year was two.
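A sketch of both means, assuming illustrative array values chosen to agree with the numbers quoted above:

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# Mean over every element: breaches per client per year.
overall_mean = security_breaches.mean()

# axis=1: mean across clients for each year.
mean_per_year = security_breaches.mean(axis=1)

print(round(overall_mean, 2))  # 1.13
print(mean_per_year[0])        # 2.0
```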

10. The keepdims argument

dot-sum, dot-min, dot-max, and dot-mean all have an optional keepdims keyword argument. If keepdims is set to True, the dimensions that are collapsed when aggregating are kept in the output array with a length of one. As we saw in the last chapter, this can be helpful to achieve dimension compatibility! Here, the dot-sum array output is already 2D, ready for concatenation with another 2D array if that suits our purpose.
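As a sketch of why keepdims helps, the example below appends yearly totals to the data as an extra column. The array values and the idea of appending a totals column are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# keepdims=True keeps the collapsed axis with length one: shape (5, 1), not (5,).
yearly_totals = security_breaches.sum(axis=1, keepdims=True)
print(yearly_totals.shape)  # (5, 1)

# Because the totals are already 2D, they can be concatenated as a new column.
with_totals = np.concatenate((security_breaches, yearly_totals), axis=1)
print(with_totals.shape)  # (5, 4)
```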

11. Cumulative sums

np-dot-cumsum returns the cumulative sum of elements along a given axis. For example, when the axis keyword argument is set to zero, np-dot-cumsum returns the number of security breaches a client has ever had up to that year.
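A minimal sketch of the cumulative sum down the rows, with assumed array values:

```python
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# axis=0 accumulates down each column: running total per client, year by year.
cum_sums = np.cumsum(security_breaches, axis=0)
print(cum_sums)
```

The output keeps the original shape, and the last row equals the column totals from dot-sum with axis set to zero.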

12. Graphing summary values

Summary information is often best communicated in graphs! For example, we can graph the cumulative sum of breaches for one client along with the average of the cumulative sums to see how a single client's security breaches over time differ from the mean.
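One possible way to draw that graph is sketched below with Matplotlib. The array values, the choice of the first client, the axis labels, and the output filename are all assumptions; the lesson does not specify how its plot was produced.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: rows are years, columns are clients (assumed values).
security_breaches = np.array([
    [0, 5, 1],
    [0, 2, 0],
    [1, 1, 2],
    [2, 2, 1],
    [0, 0, 0],
])

# Running total per client, then the average running total across clients.
cum_sums = security_breaches.cumsum(axis=0)
mean_cum_sums = cum_sums.mean(axis=1)

years = np.arange(1, cum_sums.shape[0] + 1)

fig, ax = plt.subplots()
ax.plot(years, cum_sums[:, 0], label="Client 1")
ax.plot(years, mean_cum_sums, label="Average across clients")
ax.set_xlabel("Year")
ax.set_ylabel("Cumulative breaches")
ax.legend()

# Saving to a file (hypothetical filename); plt.show() would also work interactively.
fig.savefig("cumulative_breaches.png")
```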

13. Let's practice!

Let's get aggregating!