1. Aggregating and summarizing
Once you have your data in a DataFrame, you will want to understand its characteristics. Let's look at some of the methods that Pandas DataFrames provide for aggregating and summarizing data.
2. DataFrame methods
DataFrames have methods for getting the count of items, for getting the item that is the minimum, maximum, first or last, as well as methods to calculate the sum, product, mean, median, standard deviation, and variance.
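As a quick, hedged sketch of two of these methods, here are min and max called on a small made-up DataFrame. The column names and values are assumptions for illustration, not the data shown on the slides.

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"a": [3, 7, 5], "b": [8, 2, 6]})

smallest = df.min()  # smallest value in each column: a=3, b=2
largest = df.max()   # largest value in each column: a=7, b=8

print(smallest)
print(largest)
```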
3. Axis
All of these methods can be run across columns or rows. To specify which you would like, use the 'axis' parameter. Methods will work down the rows, giving one result per column, if the parameter is set to zero, the string 'index' or 'rows', or if it is omitted. To work across the columns, giving one result per row, set axis to one or the string 'columns'.
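To make the two directions concrete, here is a minimal sketch using sum on made-up data. The DataFrame and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default) works down the rows: one result per column.
per_column = df.sum(axis=0)

# axis="columns" (same as axis=1) works across the columns: one result per row.
per_row = df.sum(axis="columns")

print(per_column)  # a=6, b=60
print(per_row)     # 11, 22, 33
```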
Let's look at examples of some of these methods.
4. Count
The count method returns the number of non-missing items. On the left you can see the original DataFrame. In this example we use the default axis setting, which counts down the rows. The result is the number of non-missing values in each column.
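Here is a small sketch of count with the default axis. The data is made up, and one value is deliberately missing to show that count skips missing values.

```python
import pandas as pd

# Hypothetical data with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [4.0, 5.0, 6.0]})

counts = df.count()  # non-missing values per column

print(counts)  # a=2, b=3
```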
5. Sum
The sum method returns the result of summing items. Here, we've specified an axis value of one, to sum across the columns.
So the values in all of the columns of the first row are added together, giving a result of four hundred fifteen and forty-four cents. The other rows are calculated in the same fashion.
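A row-wise sum might look like the sketch below. The values are invented for illustration and do not reproduce the totals on the slide.

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({
    "a": [100.25, 200.5],
    "b": [50.25, 25.0],
    "c": [10.5, 4.5],
})

row_totals = df.sum(axis=1)  # add the columns together for each row

print(row_totals)  # 161.0, 230.0
```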
6. Product
The 'prod' method returns the product of items. Here we set the axis with the string 'columns', which returns the product across columns.
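As a sketch under the same assumptions (made-up data and column names), prod with the string 'columns' multiplies the values in each row:

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

row_products = df.prod(axis="columns")  # 1*3*5 and 2*4*6

print(row_products)  # 15, 48
```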
7. Mean
You may be familiar with the concept of a mean from statistics. It is one way to get a sense of where the center of your data lies. Pandas DataFrames offer a mean method to calculate means.
Here we can see that the data in the first column averages at a higher point than the others.
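A minimal sketch of the mean method, again on invented data where the first column is chosen to average higher than the second:

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]})

means = df.mean()  # mean of each column

print(means)  # a=20.0, b=2.0
```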
8. Median
The median is calculated by ordering items by magnitude and taking the middle one. If the number of items is even, then the average of the two middle items is used. It is a useful way to see the middle of your data without distortion caused by outliers.
In this example, the two middle items of the column ADD are three hundred point twenty two and three hundred one point forty nine. If we add these together and divide by two, we get three hundred point eight five five, which is the median of the column.
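The even-count case can be sketched as follows. The two middle values match the ones just mentioned, but the smallest and largest values and the column name "x" are assumptions added so the example is complete.

```python
import pandas as pd

# Two middle values from the narration; the outer two are hypothetical.
df = pd.DataFrame({"x": [305.0, 300.22, 299.0, 301.49]})

median_x = df["x"].median()          # pandas sorts the values internally
by_hand = (300.22 + 301.49) / 2      # average of the two middle items

print(median_x, by_hand)
```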
9. Standard deviation
The std method of a DataFrame calculates the standard deviation. This represents the amount of variation in your data. If it is small then the data is grouped together, and if large there is a greater spread of values.
We can see in this example that the first column's values are more tightly grouped than the other two.
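Here is a small sketch comparing a tightly grouped column with a widely spread one. The data and the column names "tight" and "wide" are made up. Note that pandas computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical data: same mean, very different spread.
df = pd.DataFrame({"tight": [9, 10, 11], "wide": [0, 10, 20]})

spread = df.std()  # sample standard deviation of each column

print(spread)  # tight is much smaller than wide
```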
10. Variance
Another way to understand the distribution of your data is to calculate the variance. This is done with the var method. Once again in this example, the distribution for the first column is tighter, so the variance is lower.
The variance gives us similar information to the standard deviation, but the standard deviation is easier to interpret on its own, while the variance is more useful in other formulas.
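The relationship between the two can be sketched on made-up data: the variance is the square of the standard deviation, which is why the standard deviation stays in the same units as the data.

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"tight": [9, 10, 11], "wide": [0, 10, 20]})

variances = df.var()        # sample variance of each column
std_squared = df.std() ** 2 # squaring the standard deviation gives the same numbers

print(variances)
print(std_squared)
```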
11. Columns and rows
All of these methods can be run across a single row or column using the same row and column selection methods we've used before.
For example, we can select the maximum value from the column AAD using the loc indexer.
Or select the minimum from the first row using the iloc indexer.
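Both selections can be sketched like this. The DataFrame and the column name "a" are assumptions for illustration and stand in for the column shown on the slide.

```python
import pandas as pd

# Hypothetical data, not the DataFrame from the slides.
df = pd.DataFrame({"a": [3, 7, 5], "b": [8, 2, 6]})

col_max = df.loc[:, "a"].max()  # maximum of a single column, selected by label
row_min = df.iloc[0].min()      # minimum of the first row, selected by position

print(col_max, row_min)  # 7 3
```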
12. Let's practice!
You've been introduced to lots of ways to summarize and aggregate your data. Let's practice using them!