Get startedGet started for free

Inside the Redshift warehouse

1. Inside the Redshift warehouse

Now, let's dig into how Redshift works.

2. Agenda

This video will explore Redshift cluster internals, including node types, leader functions and storage, SQL predicates, and their uses. It will conclude with Redshift Spectrum, focusing on database components and external tables.

3. Redshift cluster architecture

Redshift scales by distributing the workload and storage across multiple nodes. There are two types of nodes: leader and compute. Every cluster has a leader node for database connections, query plan distribution, and exclusive tasks like query sorting and grouping. Clusters with a size greater than one also have compute nodes that store data and execute parts of queries assigned by the leader node.

4. Leader specific functions

Leader nodes have special functions that only they can run due to how they handle data; a complete list is available in the documentation. One that commonly trips up users new to Redshift is the SUBSTR and SUBSTRING functions, which do the same thing with the same arguments. SUBSTR only works on the leader node and will extract from a string easily. However, it will return an error when extracting from a table column, as shown in this example. It would be best to always default to the SUBSTRING function.

5. Looking at data across nodes

Previously, we mentioned that data is distributed across the compute nodes. We can see that data by looking at the STV-PARTITIONS table. Any STV table we encounter will be a virtual system table about running system information. For example, we can see that we have two hosts or nodes with roughly the same percentage of used disk space across them. Sometimes, this can get out of balance; we'll discuss that later in the course.

6. Predicates

Now, let's look at one way in which queries can take advantage of multiple nodes. Many times, we write queries that have clauses in them that evaluate to a boolean value. For example, in the ON clause shown here, we are checking if two keys are equal, and we will return a true or false when evaluating records and the matching records, and the same applies to the value check in the WHERE clause. These types of clauses are called predicates.

7. Predicate push-down

The leader node can ask the compute node to perform this evaluation in distributed data warehouses. Sending this to the compute nodes is called predicate push-down because it pushes the work down to where the data is stored. Doing this means there is less filtering and comparisons to do on the overall query on the leader before we get our results back. We'll learn how to write queries designed for this later on.

8. Typical internal database components

Now, let's learn about Redshift Spectrum. First, we need to know the three main components of database systems: the metadata catalog, the query engine, and storage. The metadata catalog holds our data's schema information like columns, keys, etc., and references where that data is stored. The query engine turns the SQL we write into executable plans and provides connectivity. Finally, storage holds our data and defines the file and table formats used. In traditional RDBMS systems, we don't think of these as separate things, but modern data warehouses support separating and using parts of these things outside the central database system.

9. Redshift spectrum architecture

Redshift Spectrum allows us to use an external metadata catalog and storage system while using the powerful and familiar Redshift query engine. These tables are called external tables. AWS Glue Data Catalog is a metadata catalog that provides these external tables when used with Redshift and AWS S3 storage. External tables look just like internal tables; however, external tables allow us to query files like CSVs, JSON, Text, and other formats without transferring all the data into tables.

10. Let's practice!

Let's reinforce what you learned here with a few exercises and start exploring queries.