Get startedGet started for free

Course ratings

1. Course ratings

Hello again! And welcome to the final chapter of this course. In this chapter, we're going to use everything you've learned thus far and discover how it fits into real-world case studies, by looking at DataCamp's course ratings!

2. Ratings at DataCamp

The case study will be about course ratings at DataCamp. As you might have noticed, as a student on DataCamp, you can rate a chapter after you completed it. We can aggregate these chapter ratings to get an estimate of how people rate specific courses. This kind of rating data lends itself to use in recommendation systems.

3. Recommend using ratings

You could imagine that top-rated courses are suitable to recommend, for example. Alternatively, you could recommend top-rated Python courses to people that previously rated Python courses highly. There are several ways to go about it, but in essence, we need to get this rating data, clean it where possible, and calculate the top-recommended courses for each user. We could re-calculate this daily, for example, and show the courses in the user's dashboard.

4. As an ETL process

In other words, we need to extract rating data, transform it to get useful recommendations, and load it into an application database, ready to be used by several recommendation products. Sounds like a job for the Data Engineer, right? Well... kind of. Usually, this would be a collaboration between a Data Engineer and a Data Scientist. The Data Scientist is responsible for the way recommendations are made, and the Data Engineer fits everything together to get to a stable system that updates recommendations on a schedule. For this case study, we'll describe the full process as a single ETL job. Before we dive into the specifics of the SQL tables we're going to use, it's always nice to look at what we're trying to achieve through a diagram. The extraction happens from two SQL tables in the `datacamp_application` database. It's a PostgreSQL database. The transformation phase consists partly of cleaning up the data, and an algorithm to calculate recommendations from a rating table. Finally, the data needs to be loaded into the `datawarehouse` database, to be used by data products.

5. The database

As we've mentioned before, we'll be using two SQL tables from the `datacamp_application` database. The first table is called `courses`. The records are courses in our database. There are four columns we'll look at: `course_id`, which is the internal id we use for courses; `title`, the course title; `description`, a description of the course in English; and `programming_language`: the programming language used in the course. Second, there is a table called `rating` which contains course ratings for all of the courses in the `courses` table. There are three columns in this table: `course_id`, which is a foreign key to the `courses` table; `user_id`, which is the internal identifier of the user that gave the rating; and `rating` which is the one-to-five star rating the user provided for this course.

6. The database relationship

The following diagrams could thus clarify the relationship between these tables.

7. Let's practice!

Now before we build the recommendation pipeline, it's time to start experimenting with the data in the exercises.