Get startedGet started for free

More complex SQL queries

1. More complex SQL queries

The SQL we've used so far mimics the functionality of pandas keyword arguments like usecols and skiprows. In this lesson, we'll take advantage of SQL's features to wrangle data in ways that cannot be done at the import stage using pandas alone.

2. Getting DISTINCT Values

In an analysis, you might need unique values in a column, or unique combinations of values across several columns. Examples include getting unique values to check data quality and creating crosswalks between values to combine datasets. Alternatively, data might have duplicate records that should be excluded. Pandas has tools to do this wrangling in dataframes, but it can be done with SQL during import using select distinct. The syntax is "select distinct column names from table". To remove duplicate records, select distinct on all columns with the query "select distinct star from table". For another example, if we wanted to map buildings with housing complaints, we could use the query "select distinct incident address, borough from hpd311calls" to get unique street addresses and boroughs.

3. Aggregate Functions

Other times, you might not be interested in the details of individual records, particularly when visualizing data. You can query the database directly for descriptive statistics with aggregate functions like sum, average, max, min, and count.

4. Aggregate Functions

The first four functions all take a single column name in parentheses. For example, the query "select average t max from weather", with tmax in parentheses, returns the average daily high temperature. Count is a little different. While it can accept a single column name, you can do things like get the number of rows that fit a query with count star, or even get the number of unique values in a column with count distinct and the column name.

5. GROUP BY

Aggregate functions return a single number on their own. More likely, you want data summarized by categories, such as average high temperatures by month or counts of plumbing complaints by borough. In that case, add a group by clause after the select statement and where clauses. Remember to select the column you're grouping by as well as the aggregate function -- otherwise you'll end up with unlabeled summary figures for each group. The query to get counts of plumbing complaints by borough, for example, would be "select borough, count star from hpd311calls where complaint type equals plumbing, group by borough"

6. Counting by Groups

Let's see what the results of that query look like in a dataframe. With the necessary libraries imported, we create the engine, write out the query, then pass the query and engine to pandas' read SQL function.

7. Counting by Groups

When we check the results, we see we have a neat summary dataframe ready for plotting.

8. Let's practice!

This has been a SQL-heavy lesson, with Python as the glue connecting SQL queries to data visualization and analysis libraries. Working in two languages at once is tricky at first, but it's a useful skill. And it gets easier with practice!