Get startedGet started for free

Optimal pandas iterating

1. Optimal pandas iterating

We've come a long way from our first dot-iloc approach for iterating over a DataFrame. Each approach we've discussed has really improved performance. But, these approaches focus on performing calculations for each row of our DataFrame individually. In this lesson, we'll explore some pandas internals that allow us to perform calculations even more efficiently.

2. pandas internals

As you know, we should try to stay away from loops when writing Python code - and working with pandas is no exception. In the previous lessons, we were iterating over a DataFrame row by row in order to perform a calculation. pandas is a library that is built on NumPy. This means that each pandas DataFrame we use can take advantage of the efficient characteristics of NumPy arrays that we learned in Chapter 1. Do you remember an array's broadcasting functionality? Broadcasting allows NumPy arrays to vectorize operations, so they are performed on all elements of an object at once. This allows us to efficiently perform calculations over entire arrays. Just like NumPy, pandas is designed to vectorize calculations so that they operate on entire datasets at once (not just on a row by row basis). Let's explore this concept with some examples.

3. DataFrame columns as arrays

We'll continue to use the baseball_df DataFrame we have been using throughout the chapter. Since pandas is built on top of NumPy, we can grab any of these DataFrame column's values as a NumPy array using the dot-values method. Here, we are collecting the W column's values into a NumPy array called wins_np. When we print the type of wins_np, we see that it is, in fact, a NumPy array. We can see the contents of the array by printing it and verifying that it is the same as the W column from our DataFrame.

4. Power of vectorization

The beauty of knowing that pandas is built on NumPy can be seen when taking advantage of a NumPy array's broadcasting abilities. Remember, this means we can vectorize our calculations, and perform them on entire arrays all at once! Instead of looping over a DataFrame, and treating each row independently, like we've done with dot-iterrows, dot-itertuples, and dot-apply, we can perform calculations on the underlying NumPy arrays of our baseball_df DataFrame. Here, we gather the RS and RA columns in our DataFrame as NumPy arrays, and use broadcasting to calculate run differentials all at once!

5. Run differentials with arrays

When we use NumPy arrays to perform our run differential calculations, we can see that our code becomes much more readable. Here, we can explicitly see how our run differentials are being calculated. After creating our new column and printing the DataFrame, we see that our results are identical to all other approaches. But, just how much more efficient is using NumPy arrays?

6. Comparing approaches

When we time our NumPy arrays approach, we see that our run differential calculations take microseconds! All other approaches were reported in milliseconds. Our array approach is orders of magnitude faster than all previous approaches!

7. Let's put our skills into practice!

It's clear that using a DataFrame's underlying NumPy arrays to perform calculations can help us gain some massive efficiencies. Let's practice with a few coding exercises.