1. Meet Redshift, a columnar database
Welcome, everyone! Let's dive into Amazon Redshift, a robust data warehouse offered by AWS. I'm Jason, your instructor, and I'll guide you through the critical aspects of Redshift.
2. Redshift overview
Let's begin with an overview of Redshift. Redshift is a distributed database that spreads queries and large volumes of data across multiple nodes. It's also a columnar database, which differs from traditional row-based databases by optimizing for using individual columns of data instead of rows. Redshift uses PostgreSQL 9 syntax, making it familiar to PostgreSQL users while offering additional capabilities. We can choose between serverless and provisioned clusters based on our needs.
3. Columnar vs row based databases
Now, let's discuss the concept of columnar databases compared to row-based databases. In a row-based database, data is stored in rows inside the files, making it efficient for transactional operations or when we need most of the row's columns at once. Columnar databases like Redshift store data in columns, improving performance for analytical queries where we are aggregating columns of data for things like sums, medians, classification, etc.
4. Redshift's position in the AWS ecosystem
Redshift plays a crucial role in the AWS ecosystem. It's the primary SQL data warehouse technology, making it a central component for data analytics. Redshift can serve as the core of our AWS environment, enabling data lakehouse capabilities. It seamlessly integrates with various AWS services, including RDS for PostgreSQL, Aurora PostgreSQL-compatible edition, EMR for Hive support, and Kinesis for streaming data.
5. Redshift's position in the AWS ecosystem
Redshift also supports querying data housed in other AWS RDS databases directly via federated queries. It can even leverage Redshift ML for data science tasks with Amazon SageMaker.
6. Competitors
Redshift faces competition from other data warehousing solutions, including Snowflake, BigQuery, Databricks Lakehouse, and Azure Synapse Analytics. Advantages of Redshift include deep integration with other AWS products. A share-nothing architecture for scalability, which means that nodes work independently and don't share disk space. Different data is processed in parallel on different nodes. Enhanced caching provides optimized query performance, and cost savings with reserved instances (RI).
On the downside, Redshift is limited to AWS, may have unoptimized query performance issues in some instances, and additional flexibility for external file support may come at an extra cost.
7. Viewing databases and schemas
Redshift organizes its data into databases, schemas, and then tables. To navigate Redshift effectively, we'll need to view databases and schemas. Redshift provides great views containing this information: SVV-REDSHIFT-SCHEMAS for local or internal schemas and SVV-ALL-SCHEMAS for internal and external schemas.
For example, in this query, we select the database and schema names along with the schema type from the SVV-REDSHIFT-SCHEMAS view to get a list of internal schemas. While we're on the subject, any time we see SVV it's a view to look at internal database metadata.
8. Viewing tables in a schema
Now, let's explore how to list tables within a schema. For this, we can utilize SVV-REDSHIFT-TABLES for internal tables, or for both internal and external tables, use SVV-ALL-TABLES.
Here, we're selecting the table name from our spectrumdb schema to get a list of all the tables within this schema.
9. Let's practice!
We've covered the essential aspects of Redshift, and now it's time for some hands-on practice. As you do the exercises in this course, please explore Redshift further, run queries, and work with its features to gain a deeper understanding.