Get startedGet started for free

Databases

1. Databases

Hi and welcome back! In the last chapter, we learned what a data engineer is. In this chapter, you'll learn everything about the tools of a data engineer. Let's start with databases.

2. What are databases?

Databases are an essential tool for the Data Engineer. They can be used to store information. Before zooming in to the kinds of databases, let's get some definitions out of the way. According to Merriam-Webster, a database is "a usually large collection of data organized especially for rapid search and retrieval." There are few pieces of vital information in this definition. First, the database holds data. Second, databases organize data. We'll see later that there are differences in the level of organization between database types. Lastly, databases help us quickly retrieve or search for data. The database management system or DBMS is usually in charge of this.

3. Databases and file storage

The main difference between databases and simple storage systems like file systems is the level of organization and the fact that the database management systems abstract away a lot of complicated data operations like search, replication and much more. File systems host less such functionality.

4. Structured and unstructured data

In the universe of databases, there's a big difference in the level of organization. To understand these differences, we have to make a distinction between structured and unstructured data. On one hand, structured data is coherent to a well-defined structure. Database schemas usually define such structure. An example of structured data is tabular data in a relational database. Unstructured data, on the other hand, is schemaless. It looks a lot more like files. Unstructured data could be something like photographs or videos. Structured and unstructured data define outer boundaries, and there is a whole lot of semi-structured data in between. An example of semi-structured data is JSON data.

5. SQL and NoSQL

Another distinction we can make is the one between SQL and NoSQL. Generally speaking, in SQL databases, tables form the data. The database schema defines relations between these tables. We call SQL databases relational. For example, we could create one table for customers and another for orders. The database schema defines the relationships and properties. Typical SQL databases are MySQL and PostgreSQL. On the other hand, NoSQL databases are called non-relational. NoSQL is often associated with unstructured, schemaless data. That's a misconception, as there are several types of NoSQL databases and they are not all unstructured. Two highly used NoSQL database types are key-value stores like Redis or document databases like MongoDB. In key-value stores, the values are simple. Typical use cases are caching or distributed configuration. Values in a document database are structured or semi-structured objects, for example, a JSON object.

6. SQL: The database schema

For the remainder of this video, let's focus on database schemas. A schema describes the structure and relations of a database. In this slide, you can see a database schema on the left-hand side. It represents the relations shown in the diagram to the right. We see a table called Customer and one called Order. The column called customer_id connects orders with customers. We call this kind of column a foreign key, as it refers to another table. The SQL statements on the left create the tables of the schema. As you've seen in courses on SQL, you can leverage these foreign keys by joining tables using the JOIN statement.

7. SQL: Star schema

In data warehousing, a schema you'll see often is the star schema. A lot of analytical databases like Redshift have optimizations for these kinds of schemas. According to Wikipedia, "the star schema consists of one or more fact tables referencing any number of dimension tables." Fact tables contain records that represent things that happened in the world, like orders. Dimension tables hold information on the world itself, like customer names or product prices.

8. Let's practice!

Now that you know about databases, let's practice in the exercises!