Get startedGet started for free

Loading multiple tables with joins

1. Loading multiple tables with joins

One feature of relational databases is that tables can be linked to one another via unique record identifiers, or keys. This lets users combine tables into custom datasets with SQL joins, which we'll explore in this lesson.

2. Keys

As mentioned, records typically have keys that uniquely identify them. At their simplest, keys can be automatically assigned row numbers, like in the 311 call data,

3. Keys

but they can also carry meaning, like university course numbers. When records include other tables' keys, you can bring in, or join, data from the referenced table.

4. Keys

For example, given a column of instructor ID numbers

5. Keys

and a professor table,

6. Keys

we can then join in professors' names. When building data pipelines, this means you're not limited to working with a single table's columns.

7. Joining Tables

Let's join weather data to 311 call records to study if certain problems are exacerbated by weather conditions. Both tables contain date columns as text, with unique dates in the weather table, so they'll be the join key.

8. Joining Tables

First, we select star from hpd311calls to get all columns there. Star will also get all columns from weather once we join it.

9. Joining Tables

Then we join weather on hpd311calls dot created date equals weather dot date, which are the key columns. We use dot notation to specify the table and column when querying multiple tables. Two things to note here: join, by default, only returns records with key values that appear in both tables. And key columns must be the same data type or nothing will match!

10. Joining and Filtering

We can incorporate a where clause after the join to refine the dataset. Here, we filter the data to focus on heat and hot water calls, which probably spike in cold weather.

11. Joining and Aggregating

It's even possible to summarize data and then join additional columns. Imagine we wanted to compare call counts by borough against census data about population and housing in another table, boro census. Let's build the query in parts. We first get call counts by borough by selecting hpd311calls dot borough and count star from hpd311calls, grouping by hpd311calls dot borough. Because both tables have borough columns, we have to specify the table here.

12. Joining and Aggregating

The boro census table has more columns than we need, so we also select its total population and housing units columns. Notice that we don't have to list boro census in the from clause.

13. Joining and Aggregating

Finally, we add the join clause between the from and group by clauses, joining on the borough columns.

14. Joining and Aggregating

Let's pass this query and the database engine to read SQL and check out the results. Looks like the Bronx is overrepresented in our call data!

15. Review

As you can tell, SQL queries can get complicated. Let's review the order of keywords. First comes the select statement, including aggregate functions, and the table you're selecting from. Then a join clause if there is one. Then the where clause, if any. Last comes group by.

16. Let's practice!

This is just the beginning when it comes to using SQL plus pandas to build data pipelines. But enough of me talking about SQL, it's your turn to practice!