1. Table partitioning
Welcome back! Now that we've covered database access control, it's time for something completely different: table partitioning.
2. Why partition?
Before we talk about what table partitioning is, let's talk a bit about the why. When tables grow — we're talking hundreds of gigabytes or even terabytes here — queries tend to become slow. Even when we've set indices correctly, these indices can become so large they don't fit into memory. At a certain point, it can make sense to split a table up into multiple smaller parts. We call this process 'partitioning'.
3. Data modeling refresher
Note that, looking at the data modeling layers we saw in the first chapter, partitioning fits into the physical data model. Indeed, logically, the data you'll access or update is still the same. The difference is we distribute the data over several physical entities.
4. Vertical partitioning
When we talk about partitioning, there are two types. The first one we'll introduce is vertical partitioning. You saw in the second chapter that to normalize a table, you can use foreign keys and create new tables. As you saw, this can help reduce redundant data. Vertical partitioning goes one step further and splits up a table vertically by its columns, even when it's already fully normalized.
5. Vertical partitioning: an example
In this example, you start with a table containing product data. It has four columns. After vertical partitioning, you could end up with two tables: one for the first three columns, and another for the last column. We can link them through a shared key. Let's say the fourth column, containing a long description, is retrieved very rarely. We could store the second table on a slower medium. Doing this would improve query time for the first table, as we need to scan less data for search queries.
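As a minimal sketch of that split, assuming PostgreSQL and hypothetical table and column names (products, products_core, products_long_description, product_id) that stand in for the ones on the slide, the vertical partitioning could look like this:

-- Original table: four columns, including a rarely-read long description.
CREATE TABLE products (
    product_id       SERIAL PRIMARY KEY,
    name             TEXT NOT NULL,
    price            NUMERIC(10, 2),
    long_description TEXT
);

-- After vertical partitioning: the frequently-read columns stay together...
CREATE TABLE products_core (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      NUMERIC(10, 2)
);

-- ...and the rarely-read column moves to its own table, linked back
-- through the shared key. This table could live on a slower medium.
CREATE TABLE products_long_description (
    product_id       INTEGER PRIMARY KEY
                     REFERENCES products_core (product_id),
    long_description TEXT
);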
6. Horizontal partitioning
For the remainder of this video, let's focus on the second type of partitioning: horizontal partitioning. Instead of splitting tables up over the columns, you can also split up tables over the rows. For example, you could split up data according to a timestamp. Specifically, all records related to 2019 could be in a separate partition from those of 2018.
7. Horizontal partitioning: an example
Let's look at an example. Let's say you have the following table where every row is a book sale. We could decide to partition the table according to the timestamp.
8. Horizontal partitioning: an example
You could create partitions according to the timestamp, with one partition per quarter. Different SQL dialects have different ways of creating partitioned tables. In this example, we'll look at PostgreSQL, which supports something called declarative partitioning since version 10. First, you add the PARTITION BY clause to your table creation statement. You pass it the column you want to partition by, 'timestamp' in our case. Next, you have to create the partitions. To do this, use the PARTITION OF clause to create tables for the specific partitions. You can specify the rules to partition by in the same statement. For a timestamp, you could use particular ranges of values, as in the sketch below. Finally, it's advised to add an index to the column you used for partitioning.
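Here is a sketch of those steps, assuming PostgreSQL 10 or later; the table name (sales), column names, and quarterly partition names are illustrative and not necessarily the exact ones used on the slide:

-- Step 1: create the partitioned table with a PARTITION BY clause.
CREATE TABLE sales (
    sale_id     BIGINT,
    book_id     INTEGER,
    amount      NUMERIC(10, 2),
    "timestamp" TIMESTAMP NOT NULL
) PARTITION BY RANGE ("timestamp");

-- Step 2: create the partitions with PARTITION OF, one per quarter.
-- The upper bound of each range is exclusive.
CREATE TABLE sales_2019_q1 PARTITION OF sales
    FOR VALUES FROM ('2019-01-01') TO ('2019-04-01');
CREATE TABLE sales_2019_q2 PARTITION OF sales
    FOR VALUES FROM ('2019-04-01') TO ('2019-07-01');
CREATE TABLE sales_2019_q3 PARTITION OF sales
    FOR VALUES FROM ('2019-07-01') TO ('2019-10-01');
CREATE TABLE sales_2019_q4 PARTITION OF sales
    FOR VALUES FROM ('2019-10-01') TO ('2020-01-01');

-- Step 3: index the column used for partitioning. On PostgreSQL 11 or
-- later this cascades to every partition; on version 10 you would
-- create the index on each partition separately.
CREATE INDEX ON sales ("timestamp");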
9. Pros/cons of horizontal partitioning
Horizontal partitioning can help by optimizing indices, increasing the chance that heavily-used parts of the index fit in memory. You could also move rarely accessed partitions to a slower medium. Both OLAP and OLTP can benefit from partitioning. There are some downsides though, as partitioning an existing table can be a hassle: you have to create a new table and copy over the data. Additionally, we cannot always set the same types of constraints on a partitioned table, for example, the PRIMARY KEY constraint.
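To illustrate that constraint limitation, assuming PostgreSQL 11 or later and the hypothetical sales table from the earlier sketch: a primary key on a partitioned table must include the partition column, so a key on sale_id alone would be rejected.

-- Rejected: the primary key does not include the partition column.
-- CREATE TABLE sales (
--     sale_id BIGINT PRIMARY KEY,
--     ...
-- ) PARTITION BY RANGE ("timestamp");

-- Accepted: the partition column is part of the primary key.
CREATE TABLE sales_with_pk (
    sale_id     BIGINT,
    book_id     INTEGER,
    amount      NUMERIC(10, 2),
    "timestamp" TIMESTAMP NOT NULL,
    PRIMARY KEY (sale_id, "timestamp")
) PARTITION BY RANGE ("timestamp");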
10. Relation to sharding
We can take partitioning one step further and distribute the partitions over several machines. When horizontal partitioning is applied to spread a table over several machines, it's called sharding. You can see how this relates to massively parallel processing databases, where each node, or machine, can do calculations on specific shards.
11. Let's practice!
Let's practice!