Row vs. column data store

1. Row vs. column data store

Welcome back. Let's now turn our focus to row and column data stores. This subtle difference in how data is stored allows databases to be tuned for transactional or analytical workloads.

2. Why is it important?

We want to optimize our queries for speed, and with data warehouses, we often work with analytical workloads. Recall our discussion of OLAP versus OLTP systems in a previous video. In this video, we will learn that the column store is best for analytical queries. Let's look at some basics to understand this better.

3. Basics of computer storage

In an oversimplification of the process, we can say that computers typically store data in blocks on their hard drive for long-term storage. A piece of data can be spread over multiple blocks. When that data is needed later, the hard drive looks up which blocks hold the required data and reads those blocks. Data stored over many blocks takes longer to retrieve than the same data stored over only a few blocks. For analytical workloads, we can take advantage of this by storing the data a query is likely to need in as few blocks as possible, which increases query speed.
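To make the block idea concrete, here is a minimal sketch of how many blocks a piece of data would span. The 4096-byte block size and the function name `blocks_needed` are assumptions chosen for illustration; real storage systems use a range of block sizes.

```python
import math

# Assumed block size, for illustration only.
BLOCK_SIZE_BYTES = 4096

def blocks_needed(num_bytes, block_size=BLOCK_SIZE_BYTES):
    """Number of blocks required to hold num_bytes of data."""
    return math.ceil(num_bytes / block_size)

# More data spans more blocks, so more blocks must be read to retrieve it.
print(blocks_needed(10_000))   # 3
print(blocks_needed(100_000))  # 25
```

In this simplified model, the fewer blocks a query has to touch, the less reading the drive has to do.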

4. Example of health table

Let's take a look at an example. The data for this table is downloaded from the US CDC and contains flu infection data. Each row includes the hospitalization percentage by age group during a particular season. Now let's look at how this data is stored with row store.

5. Row store example

If the table is written in a row store format, each row of data is stored together. To illustrate, all the data for the second row in our table is stored together. In our diagram, we placed the data for the 2019 season's 18-to-49-year-old group in block 2. This format is excellent for transactional workloads because the computer system can add new data to the table by writing it in a new block. However, if we were interested in the question, "For the 2019 season, what is the average hospitalization percentage across all age groups?" the system would need to read each block containing data for 2019. In our simplified example, three blocks of data need to be read. This storage method can be relatively slow for analytical workloads, where we often want summaries of many rows of data.
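The row store layout described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: one block per row, percentage values that are made up rather than taken from the CDC data, and hypothetical names such as `row_blocks` and `average_pct_row_store`.

```python
# Hypothetical rows: (season, age_group, hospitalization_pct).
# The values are made up for illustration; they are not CDC figures.
rows = [
    (2019, "0-17 yr", 10.0),
    (2019, "18-49 yr", 20.0),
    (2019, "50+ yr", 60.0),
]

# Row store: each block holds one complete row (block numbers start at 1).
row_blocks = {i + 1: row for i, row in enumerate(rows)}

def average_pct_row_store(blocks, season):
    """Average hospitalization_pct for a season, counting the blocks read."""
    blocks_read = 0
    matching = []
    for row_season, _age_group, pct in blocks.values():
        blocks_read += 1  # the whole row is read just to see season and pct
        if row_season == season:
            matching.append(pct)
    return sum(matching) / len(matching), blocks_read

def append_row(blocks, row):
    """Add one row by writing a single new block; returns blocks written."""
    blocks[len(blocks) + 1] = row
    return 1

average, blocks_read = average_pct_row_store(row_blocks, 2019)
print(average, blocks_read)                               # 30.0 3
print(append_row(row_blocks, (2020, "0-17 yr", 12.0)))    # 1
```

The query reads three blocks, including the age group values it never uses, while adding a row only writes one new block.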

6. Column store example

In comparison, the column store format stores the data for a column together. For instance, the data for the age group column is now shown in block 2 of the diagram. To answer the question, "For the 2019 season, what is the average hospitalization percentage across all age groups?" the computer system only needs to read the blocks with the season and percentage data, therefore using one fewer block of data to answer the same question! We can now see why this storage format is well suited to analytical workloads, returning responses faster than the row store format. The tradeoff is that it takes relatively longer to add new rows because the blocks for each column need to be read and edited. Finally, a column store has the added benefit of better data compression because all of the data in a block is the same type. Therefore, we can store the same data with less space.
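The column store layout can be sketched the same way. This is a toy model under stated assumptions: one block per column, made-up percentage values rather than CDC figures, and hypothetical names such as `column_blocks`. Run-length encoding is used only as one example of a compression scheme; the video does not name a specific one.

```python
from itertools import groupby

# Hypothetical rows: (season, age_group, hospitalization_pct).
# The values are made up for illustration; they are not CDC figures.
rows = [
    (2019, "0-17 yr", 10.0),
    (2019, "18-49 yr", 20.0),
    (2019, "50+ yr", 60.0),
]

# Column store: each block holds all the values of one column.
seasons, age_groups, pcts = (list(col) for col in zip(*rows))
column_blocks = {
    1: ("season", seasons),
    2: ("age_group", age_groups),
    3: ("hospitalization_pct", pcts),
}

def average_pct_column_store(blocks, season):
    """Average hospitalization_pct for a season, reading only needed columns."""
    needed = {"season", "hospitalization_pct"}
    read = {name: values for name, values in blocks.values() if name in needed}
    matching = [
        pct
        for row_season, pct in zip(read["season"], read["hospitalization_pct"])
        if row_season == season
    ]
    return sum(matching) / len(matching), len(read)

def append_row(blocks, row):
    """Add one row; every column block has to be edited. Returns blocks touched."""
    touched = 0
    for (_name, values), value in zip(blocks.values(), row):
        values.append(value)
        touched += 1
    return touched

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

average, blocks_read = average_pct_column_store(column_blocks, 2019)
print(average, blocks_read)                                 # 30.0 2
print(run_length_encode(seasons))                           # [(2019, 3)]
print(append_row(column_blocks, (2020, "0-17 yr", 12.0)))   # 3
```

The query skips the age group block entirely, the repeated season values compress to a single pair, and adding one row touches all three column blocks.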

7. Summary

In summary, the row store format keeps the data for each row together in blocks, which makes it ideal for transactional workloads. With column store, entire columns are held together in blocks. Thus, it best suits analytical workloads and has better data compression.

8. It's practice time!

It's practice time!
