Get startedGet started for free

Tables in Redshift

1. Tables in Redshift

Next, we'll dive into the world of tables in Amazon Redshift, exploring how to create them efficiently and manage their distribution and sorting.

2. Creating Tables

To start, let's create a basic table in Redshift. The example here should look familiar, as it is similar to most other databases. However, because Redshift is a columnar distributed database, we can add some additional directives to take advantage of that.

3. Distributing data

If we look at the structure of a Redshift cluster again, we might remember that data is stored on the compute nodes for parallel processing. We can distribute data using the internal row id, DISTKEY, or PRIMARY KEY. We'll return to the DISTKEY in just a second, but first, let's look at all our options for distributing data.

4. Distribution styles

Redshift offers several distribution styles tailored to different scenarios. The ALL style puts a full table copy on each node, which is excellent for small lookup tables. Next, the KEY style distributes the table's rows across the nodes by the DISTKEY or primary key, which is perfect when we filter, aggregate, or join by the key columns most of the time. We have the EVEN distribution style for tables that don't have keys, which just spreads the rows across the nodes. Finally, we have auto, which is the default. AUTO uses the ALL style for a small table, and as the table grows, it switches to the KEY distribution style if there is a key; if not, it falls back to EVEN distribution.

5. DISTKEY

We just learned that DISTKEYs are used to distribute our data across nodes for the KEY and AUTO distribution styles. Typically, a DISTKEY is the table's primary key; however, if we regularly aggregate, join, or group on other columns, it can be helpful to designate them as the DISTKEY. In this example, we are setting the distkey to the organization_id. There are two ways to do it: on the left, it comes right after the table definition, and on the right, it's a column attribute. I default to the left most of the time to make it more visible.

6. Setting the distribution style

If we want to set the distribution style manually, we can use the diststyle directive. In this example, we put the DISTSTYLE to KEY, which will distribute the data according to the DISTKEY we set.

7. SORTKEY

The SORTKEY parameter controls the storage order on disk, optimizing query performance due to predicate pushdown. In this example, we're setting the SORTKEY to the fk-monitoring-location column. So, on each node, the data is distributed by organization-id and ordered by the fk-monitoring-location. Redshift also supports multiple types of multicolumn sort keys: compound and interleaved; however, mastery of interleaved keys is beyond the scope of this course.

8. Defining multiple SORTKEYs

In this example, we'll have our data sorted on each node by monitoringlocation and organization id by creating a compound sortkey, which just means multiple columns.

9. Viewing column DISTKEY and SORTKEY status

Here we are querying the SVV-REDSHIFT-COLUMNS view we used previously to see the distkey and sortkey status for the ecommerce_sales table. We can see year_qtr is the distkey and year_qtr and total_sales are the sortkeys.

10. Viewing distribution style

We can query the SVV-TABLE-INFO system view to check the distribution style of tables. This view is full of valuable table details that can affect query performance. In addition to the distribution style, it shows any skew or imbalance in the data distribution across the nodes, the overall size of the table, and any sortkeys and their skew.

11. Viewing distribution style continued

For example, we can look at each table and distribution style in our spectrumdb schema, and we can confirm that year_qtr is a distkey and being used in the KEY dist style for the ecommerce_sales table.

12. Let's practice!

Now that we've covered some of the unique Redshift table features, let's put our knowledge into practice.