Advanced Pandas

1. More Pandas

Real-world datasets will almost always have missing data.

2. Missing data

Missing data are represented as NaN, short for "not a number". NaN is a special floating-point value that is available in the NumPy library as np.nan. To check if there are missing values in your DataFrame, you can use the pandas isnull() function. Conversely, to check for non-missing values, you can use the notnull() function.
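As a minimal sketch of these checks, assume a small DataFrame modeled on the treatment data discussed in this lesson. The column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed values); one entry is missing.
df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [np.nan, 16, 3],
    "treatment_b": [2, 11, 1],
})

missing_mask = df.isnull()    # True wherever a value is missing
present_mask = df.notnull()   # True wherever a value is present

# Summing the boolean mask counts the missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Summing the mask is a common way to get a quick overview, because True counts as 1 and False as 0.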

3. Working with missing data

We recreated the sample dataset from Hadley Wickham's Tidy Data paper here. What if you wanted to fill the missing value in the first row with the mean of that column? You can call the mean() method on the treatment_a column to find its mean. Note that, unlike in R, missing values are ignored by default when calculating the mean.
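The default behavior can be sketched as follows. The DataFrame below is an assumed stand-in for the sample dataset, and skipna is the pandas parameter that controls whether missing values are ignored.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the sample dataset.
df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [np.nan, 16, 3],
    "treatment_b": [2, 11, 1],
})

# NaN is skipped by default, so this is (16 + 3) / 2 = 9.5.
a_mean = df["treatment_a"].mean()

# Passing skipna=False propagates the missing value instead.
a_mean_strict = df["treatment_a"].mean(skipna=False)

print(a_mean, a_mean_strict)
```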

4. Fillna

You can now use the calculated mean as an argument to the fillna() method to replace missing values in the column.
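A minimal sketch of this step, again assuming an illustrative version of the treatment data:

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the sample dataset.
df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [np.nan, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Compute the column mean (NaN is ignored), then use it as the fill value.
a_mean = df["treatment_a"].mean()
df["treatment_a"] = df["treatment_a"].fillna(a_mean)

print(df)
```

Assigning the result back to the column keeps the change explicit, since fillna() returns a new object by default.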

5. More Pandas

We will now talk about applying custom functions to DataFrames, calculating statistics by group, and tidying data.

6. Apply your own functions

You saw earlier that you can calculate the mean using the mean() method. But what if you wanted to calculate some other statistic or apply a custom function to the DataFrame? This is where the apply() method comes in handy.

7. Apply functions on DataFrames

When you call the apply() method on a DataFrame, you specify the function you want to apply as the first argument and can optionally specify the axis argument. axis=0, the default, applies the function column-wise, and axis=1 applies it row-wise.
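To illustrate the two axis settings, here is a small sketch. The helper value_range is a hypothetical custom function, and the numeric data is assumed for the example.

```python
import pandas as pd

# Assumed numeric data for illustration.
df = pd.DataFrame({
    "treatment_a": [9.5, 16.0, 3.0],
    "treatment_b": [2, 11, 1],
})

def value_range(series):
    """Hypothetical custom statistic: largest value minus smallest value."""
    return series.max() - series.min()

# axis=0 (default): the function receives each column as a Series.
col_ranges = df.apply(value_range)

# axis=1: the function receives each row as a Series.
row_ranges = df.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```

In both cases the function is handed a Series, so any function that accepts a Series and returns a single value yields one result per column or per row.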

8. Tidy

Reshaping or tidying your data has multiple applications. There are 3 key features of tidy data:

- Each row is an observation.
- Each column is a variable.
- Each type of observational unit forms a table.

9. Tidy melt

We will now discuss two functions that will help you reshape your data. The melt() function can be used to convert your data to a tidy format. It takes a DataFrame as the first argument and the "id_vars" argument specifies the columns you wish to retain as the identifier.
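A sketch of melt() on an assumed version of the treatment data. With no other arguments, pandas names the two new columns "variable" and "value".

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the sample dataset, in wide format.
df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [np.nan, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Keep "name" as the identifier; every other column is unpivoted
# into (variable, value) pairs, one row per observation.
tidy = pd.melt(df, id_vars="name")

print(tidy)
```

The three rows and two treatment columns become six rows, one for each combination of name and treatment.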

10. Tidy pivot_table

In certain situations, you may want to transform your data back to the original format. You can use the pivot_table() function for this. You pass in the tidy DataFrame as the first argument and specify three additional arguments. The index argument specifies the column that will become the index of the new DataFrame. The columns argument specifies the column whose values will become the column names of the new DataFrame. Finally, the values argument specifies the column used to fill in the cells of the new DataFrame.
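The three arguments can be sketched as follows. The tidy DataFrame is built by hand with assumed values (the missing entry already filled), and the column names "variable" and "value" are assumptions that match the melt() defaults.

```python
import pandas as pd

# Assumed tidy data: one row per (name, treatment) observation.
tidy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "variable": ["treatment_a"] * 3 + ["treatment_b"] * 3,
    "value": [9.5, 16.0, 3.0, 2.0, 11.0, 1.0],
})

# index   -> row labels of the result
# columns -> column whose values become the new column names
# values  -> column used to fill the cells
wide = pd.pivot_table(tidy, index="name", columns="variable", values="value")

print(wide)
```

Note that pivot_table() aggregates with the mean by default when several rows share the same index and column pair; here each pair occurs once, so the values pass through unchanged.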

11. Reset index

The result returned by pivot_table() stores the identifier in the index rather than in a regular column, a layout referred to here as a hierarchical index. If you want a regular flat DataFrame instead, you can call the reset_index() method on the new DataFrame.
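A short sketch of flattening a pivoted result, using assumed tidy data with the default melt() column names:

```python
import pandas as pd

# Assumed tidy data, pivoted back to wide format.
tidy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "variable": ["treatment_a"] * 3 + ["treatment_b"] * 3,
    "value": [9.5, 16.0, 3.0, 2.0, 11.0, 1.0],
})
wide = pd.pivot_table(tidy, index="name", columns="variable", values="value")

# Move "name" out of the index and back into a regular column.
flat = wide.reset_index()

# Optional: drop the leftover "variable" label on the column axis.
flat.columns.name = None

print(flat)
```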

12. Groupby

Groupby operations are an incredibly powerful and fast way to perform calculations on your data. They follow the split-apply-combine mantra: the dataset is split into multiple partitions based on the unique values in one or more columns, a function is applied to each partition separately, and the results are combined at the end.

13. Performing a groupby

Here's the tidy form of the treatment data you saw earlier. To calculate the mean value for each name, you can call the groupby() method on the DataFrame and pass the name of the column as a string, 'name' in this case. You can then use the square-bracket notation to refer to one or more columns and calculate the relevant statistic. As you can see here, we are calling the mean() method on the value column.
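The call described above can be sketched like this. The tidy data is an assumed reconstruction with illustrative values, including one missing entry.

```python
import numpy as np
import pandas as pd

# Assumed tidy form of the treatment data.
tidy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "variable": ["treatment_a"] * 3 + ["treatment_b"] * 3,
    "value": [np.nan, 16.0, 3.0, 2.0, 11.0, 1.0],
})

# Split by name, select the value column, apply mean(), combine the results.
mean_by_name = tidy.groupby("name")["value"].mean()

print(mean_by_name)
```

As with the plain mean() method, missing values are skipped within each group, so a group with one missing entry is averaged over its remaining values.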

14. Let's practice!

Now it's your turn to practice these concepts.
