Joining data: a real-world necessity

1. Joining data: a real-world necessity

Welcome! Collecting and organizing data is an important part of preparing data for analysis. In this course, you’ll learn how to use the Python package Pandas to join data from two or more sources.

2. Pandas for spreadsheet users

Joining data with pandas is similar in many ways to joining data in spreadsheets. Pandas stores data in data frames with rows and columns, just as spreadsheets do. Now, you will need to learn new formulas, or expressions as they're called in pandas since pandas uses the python language. Learning a new language can be challenging. But you'll see how the power and flexibility of pandas makes it worth the effort. You'll be working with data from the National Football League collected to explore factors associated with player concussions. Who knows, you might find some useful insights!

3. Common situations

To analyze data and gain insight, we often need to combine data from various sources in a helpful way. There are two common situations that pretty much guarantee the need for joining. The first is when similar data is split by time, location or other factor and stored separately. The second situation is when two or more datasets may have different but related factors. Let's talk about these two situations in more detail.

4. Split data

Split data is quite common in a spreadsheet environment. People will often enter data, produce a report, and save the result. Then they repeat the cycle for the next time period, starting with a blank template. Time-based splits are commonly used for accounting data. Data from each month might be saved in a separate worksheet with each file holding a year's worth of data. There are other ways of splitting data to mirror reporting practices. Splitting by geography or business unit are both common options.

5. Split data example

Split data is best combined row-wise, or vertically. Here's a real example with American football data. We have game data from two seasons, 2016 and 2017. The seasons appear in separate tables, on separate spreadsheets.

6. Split data example

You can see the same columns appear in both tables.

7. Split data example

You can also see that the game keys listed in the two tables are for different games. It's a perfect case for combining the datasets row-wise. On a spreadsheet, this type of join can be as simple as copying and pasting tables together.

8. Complementary data

Let's look next at the second common situation, joining complementary datasets. We collect data for different purposes, so it follows that datasets have different elements, even when describing the same type of thing. For instance, one customer database might hold purchase history, while another holds customer call data. It's common to find different departments in a company focused on different data for the same customers and products. Also, each department often stores data in their own file set or in separate tables within a common database.

9. Complementary data example

Complementary datasets are best combined column-wise, or horizontally. For example, here are two tables with more football data. The left table has season data for each game. The table on the right has stadium data.

10. Complementary data example

The columns are different except for the GameKey column, which specifies the unique game for each row of data.

11. Complementary data example

You can see from the GameKey that both tables cover many of the same games. It makes sense to combine the data by joining the columns together. You could combine the data in a spreadsheet environment by copying and pasting if the rows match exactly. Otherwise you might use the VLOOKUP function.

12. Let's practice!

Good so far? Then it's time to apply your knowledge with some practical exercises.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.