1. Loading multiple tables with joins
One feature of relational databases is that tables can be linked to one another via unique record identifiers, or keys. This lets users combine tables into custom datasets with SQL joins, which we'll explore in this lesson.
2. Keys
As mentioned, records typically have keys that uniquely identify them.
At their simplest, keys can be automatically assigned row numbers, like in the 311 call data,
3. Keys
but they can also carry meaning, like university course numbers.
When records include other tables' keys, you can bring in, or join, data from the referenced table.
4. Keys
For example, given a column of instructor ID numbers
5. Keys
and a professor table,
6. Keys
we can then join in professors' names.
When building data pipelines, this means you're not limited to working with a single table's columns.
7. Joining Tables
Let's join weather data to
311 call records to study if certain problems are exacerbated by weather conditions. Both tables contain date columns as text, with unique dates in the weather table, so they'll be the join key.
8. Joining Tables
First, we select star from hpd311calls to get all columns there. Star will also get all columns from weather once we join it.
9. Joining Tables
Then we join weather on hpd311calls dot created date equals weather dot date, which are the key columns.
We use dot notation to specify the table and column when querying multiple tables.
Two things to note here: join, by default, only returns records with key values that appear in both tables.
And key columns must be the same data type or nothing will match!
10. Joining and Filtering
We can incorporate a where clause after the join to refine the dataset. Here, we filter the data to focus on heat and hot water calls, which probably spike in cold weather.
11. Joining and Aggregating
It's even possible to summarize data and then join additional columns. Imagine we wanted to compare call counts by borough against census data about population and housing in another table, boro census. Let's build the query in parts.
We first get call counts by borough by selecting hpd311calls dot borough and count star from hpd311calls, grouping by hpd311calls dot borough. Because both tables have borough columns, we have to specify the table here.
12. Joining and Aggregating
The boro census table has more columns than we need, so we also select its total population and housing units columns. Notice that we don't have to list boro census in the from clause.
13. Joining and Aggregating
Finally, we add the join clause between the from and group by clauses, joining on the borough columns.
14. Joining and Aggregating
Let's pass this query and the database engine to read SQL and check out the results.
Looks like the Bronx is overrepresented in our call data!
15. Review
As you can tell, SQL queries can get complicated. Let's review the order of keywords.
First comes the select statement, including aggregate functions, and the table you're selecting from.
Then a join clause if there is one.
Then the where clause, if any.
Last comes group by.
16. Let's practice!
This is just the beginning when it comes to using SQL plus pandas to build data pipelines. But enough of me talking about SQL, it's your turn to practice!