Get startedGet started for free

External Schemas, File, and Table Formats

1. External Schemas, File, and Table Formats

Let's delve into external schemas, file formats, and table formats in the context of data management and explore how these components fit into the larger landscape of database architecture.

2. External schemas

Previously, we dissected the components that make up a database, and although typically integrated, it's possible to use separate tools for each component. External tables and schemas in Redshift use a storage system and metadata catalog outside of the Redshift cluster. Redshift Spectrum is the Redshift extension that supports external schemas and tables. It uses AWS Glue Data Catalog as the metadata catalog, AWS S3 for storage, and the regular Redshift engine.

3. S3 data file formats

We can use various file formats when using AWS S3 to store the data. The file format we use will affect the performance of our queries. Standard data warehouse formats like Parquet and ORC are the best of both worlds: columnar and supporting parallel reads. Nonetheless, sometimes we want to quickly query a CSV, Delimited Textfile, or just a bunch of JSON files, and Redshift Spectrum can support that. Don't worry if all these formats aren't familiar today; know that it's possible to query a wide array of formats.

4. Create CSV External Table

Creating external tables looks like regular table creation but with some new suffixes. For example, we have a collection of CSV files in the spectrum-id S3 bucket, and the idaho_site_id data is under the prefix of the same name. So we add that we're using a row format delimited. Next, a comma terminates those row fields. Then we tell Redshift the files are ordinary text files in the proper s3 path. Finally, these files have headers, and we don't want the headers in our results, so we tell it to skip the header line using a table property.

5. Querying spectrum tables

There are no significant differences in querying internal versus external tables. We'll notice extra lines if we do an explain on a query that references reading the data files and the metadata catalog. We don't need to consider our query's DISTKEY or SORTKEY usage since external tables don't support them. External tables have a few pseudocolumns that we can use. For example, path will tell us where our data lives, and size will show how big it is. These aren't actual columns, but the Redshift engine will generate them when we query the table.

6. Using pseudocolumns

Here, we can see that we're selecting the path and size along with the site ID from an external table, and it returns the S3 path to the file, the size, and the primary key.

7. Table formats

Some data file types also support special table formats. We commonly encounter Hive, Iceberg, Hudi, or Deltalake table formats. Redshift Spectrum can provide read-only access to tables in those formats. Some table formats might require us to use an external catalog other than AWS Glue. Again, don't feel like we should already know all these table formats. Just remember that we can query all of those in Redshift via Spectrum.

8. Viewing external schemas

To determine if a schema is external, we can use the SVV-ALL-SCHEMAS view, which we previously used to see which schema held which tables. The schema type column will tell us if the table is internal or external. As we can see here, the spectrumdb schema is showing as external.

9. Viewing external tables

Just like we can tell if the schema is external, we can also tell whether a table is external. We use the SVV-ALL-TABLES view and select the table type field to do this. As shown here, it will return table if it's internal or external table if it is external.

10. Let's practice!

Alright, it is time for you to explore the world of Redshift Spectrum and external tables and schema.