1. Another iterator method: .itertuples()
In the previous lesson, we covered how to iterate over a pandas DataFrame row by row using the dot-iterrows method. pandas also comes with a similar iteration method called dot-itertuples that is often more efficient that dot-iterrows. Let's continue using our baseball dataset to compare these two methods.
2. Team wins data
Suppose we have a pandas DataFrame called team_wins_df that contains each team's total wins in a season.
3. Iterating with .iterrows()
If we use dot-iterrows to loop over our team_wins_df DataFrame and print each row's tuple, we see that each row's values are stored as a pandas Series.
Remember, dot-iterrows returns each DataFrame row as a tuple of (index, pandas Series) pairs, so we have to access the row's values with square bracket indexing.
4. Iterating with .itertuples()
But, we could use dot-itertuples to loop over our DataFrame rows instead.
The dot-itertuples method returns each DataFrame row as a special data type called a namedtuple. A namedtuple is one of the specialized data types that exist within the collections module we've discussed previously. These data types behave just like a Python tuple but have fields accessible using attribute lookup. What does this mean? Notice in the output that each printed row_namedtuple has an Index attribute and each column in our team_wins_df as an attribute.
That means we can access each of these attributes with a lookup using a dot method. Here, we can print the last row_namedtuple's Index using row_namedtuple-dot-Index. We can print this row_namedtuple's Team with row_namedtuple-dot-Team, Year with row_namedtuple-dot-Year and so on.
5. Comparing methods
When we compare dot-iterrows to dot-itertuples, we see that there is quite a bit of improvement! The reason dot-itertuples is more efficient than dot-iterrows is due to the way each method stores its output. Since dot-iterrows returns each row's values as a pandas Series, there is a bit more overhead.
6. Attribute lookup caveat
One more quick note about the differences between these methods. When using dot-iterrows, we can use square brackets to reference a column within our team_wins_df DataFrame. Here, we are printing the Team column for each row in our DataFrame.
If we use the same syntax with dot-itertuples, we get a TypeError. This is due to the fact that namedtuples don't support square brackets like a pandas Series does.
When looking up an attribute within a namedtuple, we must use a dot to reference the attribute. So anytime we use dot-itertuples we have to use a dot when referring to a column within our DataFrame. If we replace our square bracket notation with a dot, we see that the Teams are correctly printed out.
7. Let's keep iterating!
Now, let's put our new skill to the test and practice efficiently looping over rows of a DataFrame using dot-itertuples.