
Intro to Aggregation

1. Intro to Aggregation: From Query Components to Aggregation Stages

There are cases where you may want to avoid having to fetch and iterate over lots of data client-side. In this chapter, we'll learn how MongoDB can do a good chunk of your data analysis and aggregation for you. In this first lesson, we'll reproduce what we already know how to do with the "find" method of a collection. By doing so, we'll see how the implicit stages of a query can map to explicit stages of an aggregation pipeline.

2. Queries have implicit stages

Here, we iterate over a cursor to yield prize-year information for a few USA-born laureates. I used indentation in this code to demarcate implicit stages. Also, I passed the arguments to "find" as keyword arguments to name these stages. The first stage filters for documents that match an expression. The second stage projects out fields I need downstream for output. Finally, the last stage limits the number of documents retrieved. With an aggregation pipeline, I make these stages explicit. An aggregation pipeline is a list, a sequence of stages, and it looks like this. Each stage is a document keyed by a stage operator. Here's an aggregation that produces the same result as our call to "find" on the left. To filter for documents matching an expression, I use the "$match" operator. To project fields, I use "$project". And to limit results, I use "$limit". This pipeline, in particular, has three stages. It matches documents for USA-born laureates. It strips the documents of all but prize years. And it yields only the first three.
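The slide code is not part of this transcript, so here is a minimal sketch of the mapping just described. The field names "bornCountry" and "prizes.year", and the two helper functions, are assumptions made for illustration. The helpers expect a pymongo collection to be passed in.

```python
# Assumed field names: "bornCountry" and "prizes.year" are illustrative,
# not confirmed by the transcript.

# Implicit stages of a query, named via keyword arguments to find().
find_kwargs = {
    "filter": {"bornCountry": "USA"},   # stage 1: filter
    "projection": {"prizes.year": 1},   # stage 2: project
    "limit": 3,                         # stage 3: limit
}

# The same three stages made explicit as an aggregation pipeline.
pipeline = [
    {"$match": {"bornCountry": "USA"}},
    {"$project": {"prizes.year": 1}},
    {"$limit": 3},
]


def prize_years_via_find(collection):
    """Hypothetical helper: run the query form on a pymongo collection."""
    return list(collection.find(**find_kwargs))


def prize_years_via_aggregate(collection):
    """Hypothetical helper: run the pipeline form on a pymongo collection."""
    return list(collection.aggregate(pipeline))
```

With a live connection, you would call either helper with something like db.laureates, where that collection name is again an assumption.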

3. Adding sort and skip stages

Sorting and skipping are also available as pipeline stages. Here, we project prize years for USA-born laureates. We yield them in chronological order. Furthermore, we skip the first year and collect only the second, third, and fourth. One quirk of the sort stage in pymongo is that it requires a dictionary-like object. We can use the OrderedDict class from the collections module in Python's standard library. This class yields field-direction pairs in the order they are input. In the case of sorting by only one key, we can of course use a plain dictionary. I use the more general form here so that you know how to guarantee key order when sorting on more than one field.
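As a rough sketch of the pipeline described here, under the same assumed field names ("bornCountry", "prizes.year"), the sort and skip stages could be written like this. Dropping "_id" in the projection and the compound sort specification at the end are illustrative choices, not taken from the slides.

```python
from collections import OrderedDict

# Assumed field names; the transcript does not show the actual code.
sorted_pipeline = [
    {"$match": {"bornCountry": "USA"}},
    # Keep only prize years; dropping "_id" is an illustrative choice.
    {"$project": {"prizes.year": 1, "_id": 0}},
    # 1 means ascending, so results come back in chronological order.
    {"$sort": OrderedDict([("prizes.year", 1)])},
    # Skip the first result, then keep the second, third, and fourth.
    {"$skip": 1},
    {"$limit": 3},
]

# A compound sort spec (hypothetical second key): the field listed first
# is the primary sort key, so the order of the pairs matters.
compound_sort = OrderedDict([("prizes.year", 1), ("bornCountry", -1)])
```

The order of the stages matters too: skipping before sorting would drop an arbitrary document rather than the earliest one.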

4. But can I count?

Finally, we can use a "$count" stage to count the number of documents passed in from the previous stage. This count gets assigned to a field of your choosing. Here, I count the number of USA-born laureates. This aggregation, of course, gives the same number as calling the "count_documents" method of a collection with the same filter. The other convenience method we know about for aggregation is "distinct". This method has a counterpart aggregation stage as well, which we'll cover in the next lesson.
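A small sketch of the two ways to count, again assuming the "bornCountry" field. The output field name "n_usa_born" and both helper functions are made up for illustration.

```python
# Assumed field name "bornCountry"; "n_usa_born" is a made-up output field.
count_pipeline = [
    {"$match": {"bornCountry": "USA"}},
    {"$count": "n_usa_born"},
]


def count_via_aggregate(collection):
    """Hypothetical helper: count with a pipeline on a pymongo collection."""
    docs = list(collection.aggregate(count_pipeline))
    # $count emits no document at all when nothing reaches it.
    return docs[0]["n_usa_born"] if docs else 0


def count_via_method(collection):
    """Hypothetical helper: the equivalent convenience method."""
    return collection.count_documents({"bornCountry": "USA"})
```

Note the difference in shape: the pipeline returns a list holding one document with the count inside, while count_documents returns a bare integer.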

5. Let's practice!

You can now translate collection and cursor methods to aggregation pipeline stages. You've seen how to do this for all but the "distinct" method, which we'll cover later. Let's practice doing these translations before we learn about more-advanced aggregation capabilities.