Data Transformation Patterns

1. Data transformation patterns

Welcome to the last chapter in this course. In this first video, we'll explore some more advanced methods to reshape and modify your data.

2. Map function

We saw .map() in the previous chapter, so let's quickly recap. The .map() function transforms each element in a column using a specified function, which makes it well suited to applying a calculation to an entire column. Let's say we have a column of temperatures in Celsius, and we want a new column that holds the equivalent temperatures in Fahrenheit. To do this, we call .map() on the Celsius column and apply the conversion formula in a lambda expression. Notice that .map() leaves the original column unchanged and returns the new values.
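The transcript does not name the table library behind the slides, so here is a minimal sketch of the same idea using only the Java standard library. The class name MapSketch and the helper toFahrenheit are hypothetical, and a plain List<Double> stands in for the Celsius column.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapSketch {
    // Applies the Celsius-to-Fahrenheit formula to every element and
    // returns a new list; the input list is not modified.
    static List<Double> toFahrenheit(List<Double> celsius) {
        return celsius.stream()
                .map(c -> c * 9.0 / 5.0 + 32.0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: a List<Double> plays the role of the "Celsius" column.
        List<Double> celsius = List.of(0.0, 25.0, 100.0);
        List<Double> fahrenheit = toFahrenheit(celsius);
        System.out.println(celsius);    // [0.0, 25.0, 100.0]
        System.out.println(fahrenheit); // [32.0, 77.0, 212.0]
    }
}
```

The original list still prints its Celsius values, which is the "preserves the original data" behavior described above.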

3. Reduce function

The .reduce() function summarizes data by combining all the values in a column into a single result. It follows an accumulator pattern: it starts with an initial value and then processes each element in sequence with a specified operation. This makes .reduce() useful for statistical and analytical operations, such as calculating totals and averages or finding the maximum or minimum value. Here, we use .reduce() to add up all the values in the "Sales" column using the sum() method, starting from zero.

Where .reduce() shines is in operations that are a bit more complex. For the total above, we could have called the sum method directly, without .reduce(). In this next example, we count the number of sales greater than five thousand. The function goes through every value in the Sales column and adds 1 each time the value is greater than 5000; otherwise, it adds 0. This is what the question mark operator does: it adds 1 if x > 5000 is true, and 0 otherwise. acc is our accumulator, the running total that grows as we scan through the column.
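Both reductions can be sketched with Java streams. This is an illustration under assumptions rather than the course's own code: the class ReduceSketch and its methods are hypothetical, a List<Double> stands in for the Sales column, and the stream version of reduce needs a third argument (a combiner) when the result type differs from the element type.

```java
import java.util.List;

public class ReduceSketch {
    // Total: start from 0.0 and fold each value in with Double::sum.
    static double total(List<Double> sales) {
        return sales.stream().reduce(0.0, Double::sum);
    }

    // Count: acc is the running count; the ternary adds 1 when x is
    // strictly greater than the threshold, and 0 otherwise.
    static int countAbove(List<Double> sales, double threshold) {
        return sales.stream()
                .reduce(0, (acc, x) -> acc + (x > threshold ? 1 : 0), Integer::sum);
    }

    public static void main(String[] args) {
        // Assumption: made-up values standing in for the "Sales" column.
        List<Double> sales = List.of(1200.0, 7500.0, 5000.0, 9100.0);
        System.out.println(total(sales));              // 22800.0
        System.out.println(countAbove(sales, 5000.0)); // 2
    }
}
```

Note that 5000.0 itself is not counted, because the comparison is strictly greater than.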

4. Row iteration with forEach

The forEach() method lets us iterate through each row in a table and access multiple columns at once. Unlike .map(), which works on a single column, forEach() can read every column value in a row to perform a calculation. In this example, we calculate the difference between the two temperatures from our previous table. Using forEach(), we extract the Fahrenheit and Celsius values from each row, compute the difference, and append the result to a new column. This makes forEach() a good fit for cross-column transformations: we can write logic that examines a full record and derives a new value, such as this difference, from several data points.
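As a rough sketch of row-wise iteration, the example below uses a list of maps as a stand-in for table rows, since the table type used on the slides is not shown here. The class ForEachSketch and the method differences are hypothetical names; only the column names "Celsius" and "Fahrenheit" come from the narration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ForEachSketch {
    // Visits each row, reads two columns from it, and appends their
    // difference to a new list that plays the role of the new column.
    static List<Double> differences(List<Map<String, Double>> rows) {
        List<Double> diff = new ArrayList<>();
        rows.forEach(row -> diff.add(row.get("Fahrenheit") - row.get("Celsius")));
        return diff;
    }

    public static void main(String[] args) {
        // Assumption: each Map is one row keyed by column name.
        List<Map<String, Double>> rows = List.of(
                Map.of("Celsius", 0.0, "Fahrenheit", 32.0),
                Map.of("Celsius", 25.0, "Fahrenheit", 77.0),
                Map.of("Celsius", 100.0, "Fahrenheit", 212.0));
        System.out.println(differences(rows)); // [32.0, 52.0, 112.0]
    }
}
```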

5. Transformation pipelines

Transformation pipelines chain multiple operations together for complex data processing workflows. By chaining methods such as .map(), .forEach(), and .reduce(), we make our code more readable and maintainable, and we build powerful and efficient data processing sequences. This example demonstrates a complete pipeline. First, we filter our data using .where(), keeping only rows where the age is greater than 18. Then, we transform values using .map(), increasing everyone's income by multiplying it by 1.1. Finally, we aggregate the results using .reduce(), which gives us the sum of all incomes; dividing that sum by the number of rows gives the average. Pipelines are essential for real-world data processing tasks that require several transformation steps in sequence.
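The filter, transform, and aggregate steps can be sketched with Java streams, where filter() stands in for the .where() call mentioned above. Everything named here (PipelineSketch, Person, averageRaisedAdultIncome) is hypothetical, and the sample data is made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Hypothetical row type holding only the two columns the example needs.
    static final class Person {
        final int age;
        final double income;

        Person(int age, double income) {
            this.age = age;
            this.income = income;
        }
    }

    static double averageRaisedAdultIncome(List<Person> people) {
        List<Double> raised = people.stream()
                .filter(p -> p.age > 18)   // stand-in for .where(): keep rows with age > 18
                .map(p -> p.income * 1.1)  // increase each remaining income by 10%
                .collect(Collectors.toList());
        if (raised.isEmpty()) {
            return Double.NaN; // no rows passed the filter, so there is no average
        }
        double total = raised.stream().reduce(0.0, Double::sum); // sum of raised incomes
        return total / raised.size();                            // sum divided by row count
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person(17, 10000.0),
                new Person(30, 40000.0),
                new Person(45, 60000.0));
        // Roughly 55000, subject to floating-point rounding.
        System.out.println(averageRaisedAdultIncome(people));
    }
}
```

The average is taken over the rows that survive the filter, not over the original table, which is one reading of "divided by the number of rows" in the narration.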

6. Recap

In this video, we explored three essential tools for transforming data: map(), forEach(), and reduce(). They allow us to transform column values, iterate through rows for cross-column calculations, and summarize data. They form a foundation of data transformation workflows.

7. Let's practice!

Now that you've seen how to use these transformation functions, let's practice with some examples.
