Get startedGet started for free

Introduction to databases

1. Introduction to databases

In this chapter, you'll learn to build pipelines to relational databases, which underpin the information systems of many organizations.

2. Relational Databases

Relational databases organize data about entities in tables, with rows representing instances of entities and columns of attributes. This probably sounds familiar -- dataframes, flat files, and many Excel sheets arrange data similarly. Relational databases differ in that tables can be linked, or related, via unique record identifiers, or keys. Databases handle more data and support more simultaneous users than spreadsheets or flat files. They also offer more data quality controls, like enforcing column data types. And we interface with databases via a specific language: Structured Query Language, or SQL.

3. Common Relational Databases

Common relational databases include Microsoft SQL Server, Oracle, PostgreSQL, and SQLite, which this course uses. Unlike the others, SQLite databases are stored as regular, self-contained computer files, just as CSVs and Excel files are, making them great for sharing data.

4. Connecting to Databases

Reading data from a database is a two-step process. We first make a way to connect to a database then query it with SQL and pandas.

5. Creating a Database Engine

To do this, we'll use the SQLAlchemy library, which has tools to work with many major relational databases. Specifically, we'll use SQLAlchemy's create engine function. Create engine takes a string URL to a database and makes an engine object that manages database connections. URLs follow a pattern that varies slightly depending on the database. For SQLite, the pattern is "sqlite", colon, three forward slashes, then the database file name.

6. Querying Databases

Once we create the database engine, we can pull data with pandas' read SQL function. Read SQL needs two arguments. The first is a string of either a SQL query, or, to load a whole table, just the table name. The second argument is a way to connect to the database. We'll supply the engine we made here.

7. SQL Review: SELECT

Let's take a minute to review SQL select statements, which are used to query databases. The basic syntax is "select column names from table name". This will get all rows for the specified columns. To get all rows and all columns, use "select star from table name." A note about code style: keywords like "select" and "from" are not case sensitive, but it's conventional to type them in all capital letters. It's also best practice to mark the end of a SQL statement with a semicolon.

8. Getting Data from a Database

Now let's put all this together to fetch weather data from a SQLite database containing information about New York City. We import pandas as pd, plus the create engine function from SQLAlchemy. Then we make the engine object, passing the database URL string, sqlite colon slash slash slash data dot db, to the create engine function. Since we want everything in the weather table, we can make the first argument to read SQL the table name as a string,

9. Getting Data from a Database

or we can use the SQL statement "select star from weather" as the first argument. We also supply the engine object as the second argument. Then we can check out our new dataframe.

10. Let's practice!

And that's the basic workflow to get data from a relational database with pandas. Now it's your turn to practice. Good luck!