Advanced data transformation with pandas

1. Advanced data transformation with pandas

Learning to leverage pandas' more advanced functionality can help when transforming data in a pipeline. Before we dive further in, let's check in with our architecture diagram.

2. Advanced data transformation with pandas

In the past two lessons, we've extracted and transformed non-tabular data into a pandas DataFrame. Now, we'll explore more advanced transformation techniques,such as handling missing data, grouping data, and applying custom transformations to DataFrames.

3. Filling missing values with pandas

When transforming data, you'll often come across missing values in a DataFrame, which are typically designated by NaN. In the DataFrame above, rows two and three are both missing values in the "open" and "close" columns. To remedy this, pandas offers the dot-fillna method. In its most basic form, this method takes a value that is used to fill all NaN values in a DataFrame. In our example, missing values in the "open" and "close" columns are filled with the value zero, and the result is shown below.

4. Filling missing values with pandas

A dictionary can also be passed to the "value" parameter in the fillna method. When axis is set to one, the key-value pairs represent column names, and the associated values are used to fill missing values in that column. This expedites filling missing values across multiple columns. In our example, all missing values in the "open" column are replaced with zero, and all missing values in the "close" column are replaced with point-five.

5. Filling missing values with pandas

A column can also be passed to the fillna method. When that occurs, missing values are replaced with the corresponding values from the column that was passed. Here, the "close" column is used to fill all missing values for the "open" column. When the parameter in_place is set to True, the DataFrame is altered in-place, and the output does not need to be stored to a new variable.

6. Grouping data

In SQL, one of the most common transformations applied to data is done using "GROUP BY" functionality. In this SQL statement, data is grouped by the "ticker" column, and the average of the remaining columns is taken. Lucky for us, pandas offers this same functionality through their dot-groupby method.

7. Grouping data with pandas

In a single line of code, the dot-groupby method groups the raw_stock_data DataFrame by the "ticker" column, and finds the mean of the other columns. By passing zero to the axis parameter, we are grouping the DataFrame by row labels, which is standard practice. If one is passed to axis, data is grouped by column labels. The grouped DataFrame is stored to the grouped_stock_data DataFrame, and shown below. In addition to the mean, pandas allows methods such as dot-min, dot-max and dot-sum to be used to aggregate the remaining columns.

8. Applying advanced transformations to DataFrames

At times, transformation logic will be more complex than what pandas' built-in functionality can handle. Luckily, pandas offers the dot-apply method, which takes a function containing the custom transformation logic, and applies it to the DataFrame. To illustrate this, let's use an example. We'd like to classify the price changes that takes place for an asset by creating a "change" column. First, we'll define a function, called classify_change. This function takes a row, and returns "Increase" or "Decrease" based on the difference between the "open" and "close" values. Once this function is defined, it's "applied" to each row, using the dot-apply method. Setting axis equal to one ensures that the classify_change function is applied to each row. The result is written to the "change" column, and shown in the DataFrame below.

9. Let's practice!

Great work! Time to reinforce all that you've learned with some practice. Good luck!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.