1. Using column-oriented storage
Let's now explore column oriented storage, a common structure for analytics.
2. Column-oriented
In column oriented storage, the relationship between rows remains. All the zoo animal names are stored together. All the animal species are stored together. All the ages are stored together, et cetera. However, the relationship between the name Bob, species Panda, and age two is lost in column-oriented storage.
3. Analytics focus - a good fit
Column-oriented storage is ideal for analytics. Because one column is stored in the same location, queries to return all rows are quick. Calculations using all rows from a single column are also quick.
For instance, counting the number of zebras in the zoo table references the single species column.
Typically organizations perform analytics in order to find counts, averages, or some other calculation over all records, or rows. These are the type of metrics usually needed for reporting. These column aggregation queries quickly answer questions like, what is the average age of all the animals.
4. Transactional focus - a poor fit
Column-oriented storage is not well suited for transactional database needs. This storage retains the relationship between rows for a single column. Because a whole row is not stored in the same location, queries to return all columns are slow. Loading data is also slow since data is usually loaded on a row basis.
For instance, finding all the information about Bob the panda references multiple columns.
Typically organizations store data from source systems in transactional focused databases. These databases focus on inserting and deleting records quickly. Retaining the relationship between columns is better suited for row-oriented storage databases.
5. Database examples
This table shows some common examples of column-oriented storage databases. For instance, Citus Data, Greenplum, and Amazon Redshift are all examples of column-oriented storage built off Postgres. Databases exist to transform MySQL and Oracle into analytics-focused, or column-oriented storage structures. Additionally, some applications such as Clickhouse, Apache Druid, and CrateDB work with multiple database management systems.
6. Information schema
Particularly with large data, limiting the number of columns returned will improve performance. You can limit columns by using SELECT * sparingly.
7. Information schema
Instead, use the information schema to explore data. This columns view in the information schema shows each column name and data type for the zoo_animals table.
8. Writing your queries
Finally, work through calculations for the columns of interest in individual queries.
Suppose you are writing a query to find the range of ages for all the zebras. Restrict your query to operate on as few of columns as possible. Here it uses the age and species columns. This returns values from only one of the columns.
This second query answers the same question by sorting the results by age. However, it uses all the table columns. This structure is better suited to a row-oriented storage system.
9. Let's practice!
Now it's your turn to test your aptitude at column-oriented storage.