Get startedGet started for free

Joining data

1. Joining data

In real life, data comes in a variety of shapes, sizes, formats, and tables. While talking about different formats of data is out of the scope of this course, we will learn how to join data from two sources together.

2. Why we join

So why do we need to know how to join data? Well, let's look at the chocolates dataset example. We have a lot of information about the chocolates, but maybe we are interested in how the age of the company reflects on the ratings. Do older and more established companies create tastier chocolates? Or are the newer, modern companies better at getting the best out of the bean? As the year when the company was founded is not included in the dataset, we'll need to join the chocolates dataset with the chocolate-underscore-companies dataset, which contains more information about the companies.

3. Left join

There are many different joins we can use. In this course, we'll cover only the left join. The left join works like this. We have two datasets, left and right. They both contain a column called C with some of the same values - this could be the company column. We want to use this column to perform a left join. The result of the left join will contain all the data from the left and only those rows from the right where the values in column C match.

4. Left join in practice

Let's see how to perform such a join in Julia. The DataFrames package contains all the join functions. You can find more about them here. All the join functions follow the same pattern so if you know other types of joins, the syntax is pretty much the same. To use a left join on chocolates and chocolate-companies, we call leftjoin, passing the DataFrame we want to keep - chocolates in this case - followed by the DataFrame we want to join - chocolate-companies - and set on equal to the name of the column that contains the same values, which is company. The result is a larger DataFrame, that contains all the columns from both chocolates and chocolate-companies, and all the rows from chocolates. If a company wasn't present in the chocolate-companies dataset, the corresponding columns in the new DataFrame are filled with missing.

5. Joining on columns with different names

If the columns containing the values we want to join on don't have the same name, we need to use the following syntax. We proceed the same, except we type on-equals the name of the column in the left dataset, company, equals-greater-than signs, the column name in the right dataset, company-name. The column name in the left dataset is the one that is kept.

6. Joining on multiple columns

Sometimes, the values in just one column are not enough to specify the result. Maybe there are two companies that have the same name but are located in different countries. Luckily, we can join using multiple columns. To do that, we pass a vector to the on keyword. We set the company columns from both DataFrames equal to each other using equals-greater-than, and do the same for company-location.

7. Cheat sheet

Joins can seem complicated but once you understand them, they become easy! Here is a summary to help you.

8. Let's practice!

Ready to join? Let's head to the exercises!