Intro to pandas DataFrame iteration
1. Intro to pandas DataFrame iteration
So far, we've focused on Python's built-in data structures. Now, we'll shift gears and focus on one of the most popular data analytics tools: the pandas DataFrame.
2. pandas recap
You should be familiar with pandas before continuing with this course. If not, consider refreshing your pandas knowledge with the overview provided in the DataCamp course listed here. Just to recap, pandas is a library used for data analysis. Its main construct is the DataFrame, a tabular data structure with labeled rows and columns. This chapter will focus on the best approaches for iterating over a DataFrame. Let's begin by analyzing a Major League Baseball dataset.
3. Baseball stats
We've collected stats for each Major League Baseball team from 1962 to 2012 and stored them in a pandas DataFrame named baseball_df.
4. Baseball stats
The Team column contains each baseball team's abbreviated name. The first team, ARI, represents the Arizona Diamondbacks.
5. Baseball stats
All other columns in this DataFrame represent specific statistics for each team in a given Year, or season. We'll cover what the RS, RA, and Playoffs columns mean in later exercises. For now, we'll focus on the W column, which contains the number of wins a team had in a season, and the G column, which contains the number of games a team played in a season.
6. Calculating win percentage
One popular statistic used to evaluate a team's performance for a given season is the team's win percentage. This metric is calculated by dividing a team's total wins by the number of games it played. We've written a simple function, calc_win_perc, to perform this calculation. If a team wins 50 out of 100 games, we see that our function returns the correct result.
7. Adding win percentage to DataFrame
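The calc_win_perc function itself is not reproduced in this transcript. A minimal sketch, assuming it simply divides wins by games played (the parameter names and the absence of any rounding are assumptions), might look like this:

```python
def calc_win_perc(wins, games_played):
    """Return wins divided by games played (assumed behavior, no rounding)."""
    return wins / games_played


# The 50-wins-out-of-100-games check mentioned in the transcript
win_perc = calc_win_perc(50, 100)
print(win_perc)  # 0.5
```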
We'd like to create a new column in our baseball_df DataFrame that stores each team's win percentage for a season. To do this, we'll need to iterate over the DataFrame's rows and apply our calc_win_perc function. First, we create an empty win_perc_list to store all the win percentages we'll calculate. Then, we write a loop that will iterate over each row of the DataFrame. Notice that we are using an index variable (i) that ranges from zero up to, but not including, the number of rows in the DataFrame. We then use the .iloc indexer to look up each individual row by its position, using the index variable. Now, we grab each team's wins and games played by referencing the W and G columns. Next, we pass the team's wins and games played to calc_win_perc to calculate the win percentage. Finally, we append win_perc to win_perc_list and continue the loop. After the loop, we create our desired column in the DataFrame, called WP, by setting it equal to win_perc_list.
8. Adding win percentage to DataFrame
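The loop walked through above can be sketched as follows. The real baseball_df is not available here, so the small DataFrame below is a stand-in with made-up team codes and numbers, and calc_win_perc is assumed to be a plain division:

```python
import pandas as pd

# Stand-in for baseball_df: made-up values, NOT the real dataset
baseball_df = pd.DataFrame({
    "Team": ["AAA", "BBB", "CCC"],
    "Year": [2012, 2012, 2012],
    "W": [81, 94, 60],
    "G": [162, 162, 160],
})


def calc_win_perc(wins, games_played):
    # Assumed implementation: wins divided by games played
    return wins / games_played


win_perc_list = []

# i runs over the row positions 0, 1, ..., len(baseball_df) - 1
for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]      # look up one row by position
    wins = row["W"]
    games_played = row["G"]
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)

# Store the results as a new WP column
baseball_df["WP"] = win_perc_list
print(baseball_df.head())
```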
Printing the first five rows of our DataFrame, we see that the win percentage column now appears.
9. Iterating with .iloc
Looping over the DataFrame with .iloc gave us our desired output, but is it efficient? When we estimated the runtime, the .iloc approach took 183 milliseconds, which is pretty inefficient for such a simple calculation.
10. Iterating with .iterrows()
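The transcript does not say how the runtime was estimated. One way to get a comparable estimate is the standard-library timeit module, sketched below on a made-up stand-in DataFrame; the helper name win_perc_with_iloc is hypothetical, and the timings will differ from the 183 milliseconds quoted for the real dataset:

```python
import timeit

import pandas as pd

# Stand-in for baseball_df: made-up values repeated to get 300 rows
baseball_df = pd.DataFrame({
    "W": [81, 94, 60] * 100,
    "G": [162, 162, 160] * 100,
})


def calc_win_perc(wins, games_played):
    # Assumed implementation: wins divided by games played
    return wins / games_played


def win_perc_with_iloc(df):
    """Hypothetical helper wrapping the .iloc loop so it can be timed."""
    win_perc_list = []
    for i in range(len(df)):
        row = df.iloc[i]
        win_perc_list.append(calc_win_perc(row["W"], row["G"]))
    return win_perc_list


n_runs = 5
# timeit.timeit returns the total time in seconds for n_runs calls
total_seconds = timeit.timeit(lambda: win_perc_with_iloc(baseball_df), number=n_runs)
print(f"average per run: {total_seconds / n_runs * 1000:.1f} ms")
```

In an IPython or Jupyter session, the %timeit magic command gives a similar estimate with less setup.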
pandas comes with a few more efficient methods for looping over a DataFrame. The first method we'll cover is .iterrows(). This is similar to the .iloc approach, but .iterrows() yields each DataFrame row as an (index, pandas Series) tuple. This means each object returned from .iterrows() contains the row's index as the first element and the row's data, as a pandas Series, as the second element. Notice that we still create the empty win_perc_list, but now we don't have to create an index variable to look up each row within the DataFrame. .iterrows() handles the indexing for us! The remainder of the for loop stays the same to create a new win percentage column within our baseball_df DataFrame.
11. Iterating with .iterrows()
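A sketch of the same calculation using .iterrows() is shown here. As before, the DataFrame is a made-up stand-in for baseball_df, and calc_win_perc is assumed to be a plain division:

```python
import pandas as pd

# Stand-in for baseball_df: made-up values, NOT the real dataset
baseball_df = pd.DataFrame({
    "Team": ["AAA", "BBB", "CCC"],
    "Year": [2012, 2012, 2012],
    "W": [81, 94, 60],
    "G": [162, 162, 160],
})


def calc_win_perc(wins, games_played):
    # Assumed implementation: wins divided by games played
    return wins / games_played


win_perc_list = []

# Each iteration yields a tuple: (row index, row data as a pandas Series)
for i, row in baseball_df.iterrows():
    wins = row["W"]
    games_played = row["G"]
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)

baseball_df["WP"] = win_perc_list
print(baseball_df.head())
```

Unpacking the tuple directly in the for statement (i, row) avoids both the explicit range() and the .iloc lookup.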
Using .iterrows() takes roughly half the time .iloc takes to iterate over our DataFrame. We'll explore more efficient ways to loop over a DataFrame later in the chapter. But for now, we know that looping with .iloc is not efficient and shouldn't be used to iterate over a DataFrame.
12. Practice DataFrame iterating with .iterrows()
Now, let's practice iterating over a DataFrame using .iterrows()!