1. pandas alternative to looping
We've been looping over DataFrames row-by-row with ease in the past two lessons. But remember, in order to write efficient code, we want to avoid looping when possible. In this lesson, we'll explore an alternative to using dot-iterrows and dot-itertuples to perform calculations on a DataFrame.
2. Revisit run differentials
We'll continue using our baseball dataset and revisit the calc_run_diff function we've used in the past.
This function calculates a team's run differential for a given year by subtracting the team's total number of runs allowed from their total number of runs scored in a season.
3. Run differentials with a loop
We'd like to create a new column in our baseball_df DataFrame called RD that stores each team's run differentials over the years.
In previous lessons, we did this with a for loop using either dot-iterrows or dot-itertuples. Here, we'll use dot-iterrows as an example. Notice that we are iterating over baseball_df with a for loop, passing each row's RS and RA columns into our calc_run_diff function, and then appending each row's result to our run_diffs_iterrows list.
This gives us our desired output, but it is not our most efficient option.
4. pandas .apply() method
One alternative to using a loop to iterate over a DataFrame is to use pandas' dot-apply method.
This function acts like the map function we've used in the past. It takes a function as an input and applies this function to an entire DataFrame.
Since we are working with tabular data, we must specify an axis that we'd like our function to act on. Using a zero for the axis argument will apply our function on columns while using a one for the axis will apply our function on all rows.
Just like the map function, pandas' dot-apply method can be used with anonymous functions or lambdas.
Let's walk through how'd we'd use the dot-apply method to calculate our run differentials. First, we call dot-apply on the baseball_df DataFrame.
Then, we use a lambda function to iterate over the rows of the DataFrame. Notice that our argument for lambda is row (since we are applying to each row of the DataFrame). For every row, we grab the RS and RA columns and pass them to our calc_run_diff function.
Lastly, we specify our axis to tell dot-apply that we want to iterate over rows instead of columns.
5. Run differentials with .apply()
When we use the dot-apply method to calculate our run differentials, we don't need to use a for loop. We can collect our run differentials directly into an object called run_diffs_apply.
After creating our new column and printing the DataFrame, we see that our results are identical to the dot-iterrows approach. But, was using dot-apply more efficient?
6. Comparing approaches
When timing the dot-iterrows approach, we see that it took about 87 milliseconds to complete.
7. Comparing approaches
But, using the dot-apply method took only 30 milliseconds. A definite improvement!
8. Let's practice using pandas .apply() method!
Now, let's practice using dot-apply with some coding examples!