
Scheduling daily jobs

1. Scheduling daily jobs

Hi, and welcome to the last lesson of this chapter. In this lesson, we'll see how to do the third and final step of the ETL process.

2. What you've done so far

So far, we've seen how to extract data from the `courses` and `rating` tables using `extract_course_data` and `extract_rating_data`. We've created a function to clean up the missing values in the courses table, and an aggregation function to get the average rating per course. You might remember from the previous exercises that, to get to the recommendations, we need a DataFrame with average course ratings and a DataFrame with eligible user and course id pairs. To get the eligible user and course id pairs, we need to look at the rating table and, for each user, generate pairs with the courses they haven't rated yet. Additionally, we need to look at the courses table and make sure we only recommend courses in a technology the user has already shown interest in. The exact implementation would drive us too far, but let's call the function `transform_courses_to_recommend`; it takes the two tables we extracted as input. Finally, we already know how to calculate the recommendations from these tables, using `transform_recommendations()`.
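The lesson keeps the implementation out of scope, but purely for illustration, a minimal pandas sketch of `transform_courses_to_recommend` could look like the following; the column names (`user_id`, `course_id`, `programming_language`) are assumptions, not taken from the lesson.

```python
import pandas as pd


def transform_courses_to_recommend(rating_df, course_df):
    """Sketch: eligible (user_id, course_id) pairs -- courses the user
    hasn't rated yet, in a technology they've already shown interest in."""
    # Technologies each user has already rated (column names are assumed)
    rated = rating_df.merge(
        course_df[["course_id", "programming_language"]], on="course_id"
    )
    user_tech = rated[["user_id", "programming_language"]].drop_duplicates()
    # All courses in those technologies are candidates for that user
    candidates = user_tech.merge(
        course_df[["course_id", "programming_language"]],
        on="programming_language",
    )[["user_id", "course_id"]]
    # Drop the courses the user has already rated
    merged = candidates.merge(
        rating_df[["user_id", "course_id"]],
        on=["user_id", "course_id"],
        how="left",
        indicator=True,
    )
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```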

3. Loading to Postgres

Now, it's time to load the data into a Postgres table. That table could then power data products like a recommendation engine. Finally, we'll orchestrate everything in an Airflow job to make sure we keep the table up to date daily. That could be essential if we want to send out daily e-mails to specific customers with recommended courses, for example.

4. The loading phase

Ok, let's look at the loading phase first. To get the data into the table, we can take the recommendations DataFrame that we built in a previous exercise and use the `pandas` `.to_sql()` method to write it to a SQL table. We saw this method in the third chapter, but here's a little refresher on the syntax. It takes the table name as its first argument, then a database engine, and finally a strategy for when the table already exists. We could use `"append"` as a value, for example, if we'd like to add records to the table instead of replacing them. This should be enough to get the data into a target database.
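As a refresher, a minimal sketch of that loading step might look like this; the `load_to_dwh` function name, the connection string, and the `recommendations` table name are placeholders chosen for illustration.

```python
import sqlalchemy

# Engine pointing at the target Postgres database (credentials are placeholders)
db_engine = sqlalchemy.create_engine(
    "postgresql://repl:password@localhost:5432/dwh"
)


def load_to_dwh(recommendations, db_engine):
    # Write the DataFrame to the "recommendations" table,
    # replacing the table if it already exists
    recommendations.to_sql(
        "recommendations",
        db_engine,
        if_exists="replace",  # or "append" to add records instead
    )
```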

5. The etl() function

We now have all the pieces to put together into the final ETL function that creates the recommendations table. We'll start by extracting both the ratings and the courses tables from the application's database. We then clean the courses table by filling in the missing values, as we did in previous exercises. Afterward, we'll calculate the average course ratings and the eligible user and course id pairs. We need these to get to the recommendations using `transform_recommendations()`. We now have everything we need to load the recommendations into Postgres. Here's the full ETL function.
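Putting those pieces together, the function could look roughly like this; the names of the cleaning and aggregation helpers (`transform_fill_programming_language`, `transform_avg_rating`) and the `db_engines` dictionary are assumptions, while the other helpers were introduced earlier.

```python
def etl(db_engines):
    # Extract the two tables from the application database
    courses = extract_course_data(db_engines)
    rating = extract_rating_data(db_engines)
    # Clean up missing values in the courses table (helper name assumed)
    courses = transform_fill_programming_language(courses)
    # Average rating per course and eligible user/course pairs
    avg_course_ratings = transform_avg_rating(rating)
    courses_to_recommend = transform_courses_to_recommend(rating, courses)
    # Calculate the recommendations
    recommendations = transform_recommendations(
        avg_course_ratings, courses_to_recommend
    )
    # Load the recommendations into the data warehouse
    load_to_dwh(recommendations, db_engines["dwh"])
```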

6. Creating the DAG

It's time to wrap things up by creating the final DAG of this course. This time, we have to execute the ETL function we just created on a daily schedule. As we've seen before, we can use a simple `PythonOperator` for this. We start by creating the DAG object itself and give it a `schedule_interval` value using cron notation. We then create a `PythonOperator` and pass the ETL function as the callable. Finally, we set the operator's `dag` argument to the DAG we just created.
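A sketch of that DAG, assuming the classic Airflow 1.x-style API (`schedule_interval`, `PythonOperator` with a `dag` argument) and that `etl()` and `db_engines` are defined as above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Run the DAG daily at midnight, expressed in cron notation
dag = DAG(
    dag_id="recommendations",
    description="Calculate course recommendations daily",
    schedule_interval="0 0 * * *",
    start_date=datetime(2024, 1, 1),  # required for the scheduler to pick it up
)

# Wrap the etl() function in a PythonOperator and attach it to the DAG
task_recommendations = PythonOperator(
    task_id="recommendations_task",
    python_callable=etl,
    op_kwargs={"db_engines": db_engines},  # arguments passed to etl()
    dag=dag,
)
```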

7. Let's practice!

That wraps up the last lesson of this chapter. Have fun doing the exercises.