1. Tabular NoSQL data stores
Welcome back! Now that we've covered the basics of NoSQL, let's learn more about tabular NoSQL databases.
2. Tabular data stores
Using traditional, relational data stores, such as Postgres or MySQL, data is stored in tables by row. This means that each row needs to be processed when a query is executed, which can be quite slow when only a few columns are needed.
Snowflake stores data in columns rather than rows. This allows for selective column reading and retrieval, dramatically improving query performance, especially against "wide" datasets used for analytics.
Columnar databases also lend themselves to easier schema changes than row-oriented databases, as each column can be individually altered without having to manipulate each record in a table.
Storing data by column also allows for more effective data compression, also improving query performance.
3. Querying a column-oriented database
To practice querying a column-oriented database, let's take a look at a sample dataset in Snowflake. This table is called "books", and stores information about popular data science books.
We'd like to retrieve the title and price of all books under fifty dollars, returning the rows in green below. To do this, we can use Snowflake's SQL-like syntax to build a query that looks like this. Upon initial glance, this looks like a normal SQL query. However, Snowflake executes queries to take advantage of its column-oriented architecture. Behind the scenes, Snowflake leverages column elimination using selective column reads and retrievals, reducing the amount of data processed. Snowflake also uses automatic data clustering to reduce the amount of data read from disk, further improving query performance.
Let's take a look deeper into how this query is executed!
4. Query execution in column-oriented data stores
When executing the query above, Snowflake first processes the WHERE clause to identify books with a price of less than fifty dollars. Snowflake uses selective read and retrieval to access only the "price" column during this step of the query. This "column elimination" reduces data being read from storage, which can dramatically improve query performance, especially with tables that have a lot of columns.
Once records with a price value of less than fifty dollars are identified, the corresponding values from the "title" column are retrieved and returned. In our example, using a column-oriented database reduces the amount of data that's processed by a little more than 30%.
Later in the course, we'll look closer at Snowflake's more advanced functionality, including optimizing data loads and deletes, creating performant JOINs, and working with semi-structured data using the VARIANT, OBJECT, and ARRAY types.
5. Connecting to a Snowflake database
To run Snowflake queries in this course, we'll use Python's snowflake library. A connection object will be created before each exercise, and stored in the variable conn. The snowflake-dot-connector-dot-connect function takes a username, password, account, database, schema, and warehouse, and returns the resulting connection object.
6. Writing and executing Snowflake queries
Once a connection object is built, we'll write a query and store it in a string. Then, this query will be passed to the conn-dot-cursor-dot-execute function, with an additional call to the fetch-pandas-all method. The result is a pandas DataFrame that contains the results of the query that was written above.
7. Let's practice!
Great work! Time to practice with a few more hands-on exercises. Have fun!