1. What's in the database?
Hi, my name is Christina Maimone. I'm a data scientist at Northwestern University, where I help researchers with their data challenges.
In this course, you'll learn how to explore a SQL database, summarize different types of data, and deal with messy data.
You've already learned a number of skills in the previous SQL courses that will help you with these tasks. In this chapter, we'll review some of those skills, pick up some new ones, and see how they apply to exploratory data analysis.
2. PostgreSQL
One note before we start. This course uses PostgreSQL. Many of the functions we'll use are also available in other SQL database systems, but their names or syntax may be different. If you're using another database system, you should refer to the system's documentation to learn the correct syntax. With that,
3. Database client
let's get started. You've finally been granted access to your company's database. Yay!
But where do you begin? What are the tables? How are they related? What columns exist in the tables?
A database client is a program used to connect to, and work with, a database. There are many different database clients.
Each one has a different way to retrieve information on the table names, the columns in each table, and the formal relationships between the tables.
Refer to your client program's documentation to find the commands to extract this information.
4. Entity relationship diagram
You may also be given information about the structure of the database from the database owner or creator. One type of documentation is an entity-relationship diagram that shows the tables, their columns, and the relationships between the tables. Here is the entity-relationship diagram for the database for this course. There are six tables.
5. ER diagram: evanston311
The evanston311 table contains help requests sent to the city of Evanston, Illinois.
6. ER diagram: fortune500
fortune500 contains information on the 500 largest US companies by revenue from 2017.
7. ER diagram: stackoverflow
stackoverflow contains data from the popular programming question and answer site. It includes daily counts of the number of questions that were tagged as being related to select technology companies.
8. ER diagram: supporting
company, tag_company, and tag_type are supporting tables with additional information related to the stackoverflow data.
9. Select a few rows
Once you know the names of the tables in the database, one way to get a sense of what's in a table is to simply select a few rows from it. Here we use the star to select all columns from the company table and use limit to return only five rows. Remember that the rows returned from a table are in no particular order by default.
10. A few reminders
As you start to explore the contents of a table, keep a few additional things in mind. NULL indicates missing data in a database.
11. A few reminders
To check which values are NULL, use "is NULL" or "is not NULL", not an equals sign.
12. A few reminders
The count function with a star counts the number of rows.
13. A few reminders
If you instead supply a column name to the count function, it counts the number of non-NULL observations in the column. This is equal to the total number of rows, minus the number of NULL values.
14. A few reminders
If you count the distinct values of a column, you'll get the number of different non-NULL values in the column.
15. A few reminders
But if you select those distinct values directly, NULL will be included as a value if it exists in the column, even though it isn't counted by the count function.
16. Let's start exploring
With those tips in mind, it's time to start exploring the course database through exercises.