Transform
1. Transform
Great job on those exercises. In this video, we'll focus on the second stage of the ETL pipeline: transform.
2. Kinds of transformations
We've already talked a bit about transformations in the previous chapter, when we discussed parallel computing. We saw how transformations are typically done using parallel computing. However, we didn't talk about what kinds of transformations a data engineer has to do. To illustrate this, have a look at the following sample record of a customer database from a DVD store. It contains a customer id, an email, the state the customer lives in, and the date they created their account. Here's a non-exhaustive list of transformations the data engineer might have to perform on this data during the transform phase. First, there's the selection of specific attributes; for example, we could select the 'email' column only. Second, there's the translation of code values. For instance, 'New York' could be translated into 'NY'. Third, the transform phase could validate the data. For example, if 'created_at' does not contain a date value, we could drop the record. For the last two transformations, splitting and joining, we'll go into some more detailed examples.
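To make the first three transformation types concrete, here's a minimal pandas sketch. The `customer_df` DataFrame and its values are made up for illustration; only the column names follow the sample record described above.

```python
import pandas as pd

# Hypothetical sample of the customer table described above
customer_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["jane@example.com", "joe@example.com", "ann@example.com"],
    "state": ["New York", "California", "New York"],
    "created_at": ["2019-01-01", None, "2019-03-15"],
})

# Selection of a specific attribute: keep only the 'email' column
emails = customer_df[["email"]]

# Translation of code values: map full state names to abbreviations
state_codes = {"New York": "NY", "California": "CA"}
customer_df["state"] = customer_df["state"].map(state_codes)

# Validation: drop records where 'created_at' is not a valid date
customer_df["created_at"] = pd.to_datetime(customer_df["created_at"], errors="coerce")
customer_df = customer_df.dropna(subset=["created_at"])
```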
3. An example: split (Pandas)
In the first detailed example, the goal is to split a single column into multiple columns. You might want to have the email address attribute split into a username and a domain name. For this example, we'll use Pandas for the transformation. To achieve this, you can use the `str.split()` method on the `customer_df.email` Pandas Series. After the split, you can add two new columns: `username` and `domain`.
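Here's a sketch of what that could look like, assuming the same hypothetical `customer_df` with an `email` column:

```python
# Split the email column on "@" into two parts per row
split_email = customer_df.email.str.split("@", expand=True)

# Add the two resulting pieces back as new columns
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)
```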
4. Transforming in PySpark
Before moving on to the final illustrative example of joining, we need to take a quick detour. The last transformation example will use PySpark. We could just as well have used pandas if the data were small. However, since we're using PySpark, the extract phase needs to load the table into Spark. We can do this with `spark.read.jdbc`, where `spark` is a `SparkSession` object. JDBC is a Java API that lets Spark connect to many relational databases. There are some differences between this connection URI and the one you saw in the previous video. First of all, it's prepended with 'jdbc:' to tell Spark to use JDBC. Second, we pass authorization information in the `properties` argument instead of the URL. Finally, we pass the name of the table as a second argument. If you're interested in learning more about Spark, DataCamp offers other courses that delve deeper into it.
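As a rough sketch, the extract step could look something like the following. The host, database name, credentials, and table name are placeholders, and the appropriate JDBC driver must be available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection URI prepended with 'jdbc:'; host, port, and database are placeholders
jdbc_url = "jdbc:postgresql://localhost:5432/dvd_store"

customer = spark.read.jdbc(
    jdbc_url,
    "customer",  # table name passed as the second argument
    properties={"user": "repl", "password": "password"},  # credentials go in properties, not the URL
)
```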
5. An example: join
Now, let's move on to the final example. Let's say that, as a marketing effort, we allow users to use their mobile phones to rate the films they watched. An external firm has created the service, and you end up with a new database containing ratings from customers for specific films. For simplicity's sake, we'll assume that it uses the same film and customer ids as the store's database. A transformation phase could combine a table from the store's database with one from the ratings database. For example, we could add the mean rating for each customer as an attribute to the customer table.
6. An example: join (PySpark)
Now, how do we do this in code with PySpark? Let's say you have two DataFrames, customer and ratings. We want to figure out the mean rating for each customer and add it to the customer DataFrame. First, we aggregate the ratings by grouping on customer id using the `.groupBy()` method. To get the mean rating per customer, we chain `.groupBy()` with the `.mean()` method. Afterward, we can join the aggregated table with the customer table. That gives us the customer table, extended with the mean rating for each customer. Note how we set the matching keys of the two DataFrames when joining them.
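A minimal sketch of that aggregation and join, assuming `customer` and `ratings` DataFrames that share a `customer_id` column, with the ratings stored in a `rating` column:

```python
# Mean rating per customer: group by the customer id, then take the mean
ratings_per_customer = ratings.groupBy("customer_id").mean("rating")

# Join the aggregate back onto the customer table on the matching key
customer_with_ratings = customer.join(
    ratings_per_customer,
    on="customer_id",
)
```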
7. Let's practice!
For now, let's practice these concepts in the exercises.