1. Writing queries in BigQuery
Now that we know how BigQuery works under the hood, let's apply that knowledge to writing queries!
2. Writing simple queries
Writing queries in BigQuery works similarly, though not identically, to any other database or data warehouse, and it follows the same rules as other SQL queries. The core SQL commands you have learned previously will all apply in BigQuery.
3. Running queries in BigQuery
We can write queries in the Google BigQuery Studio within the Google Cloud Workspace, through one of the many client libraries, using the command line with Google Cloud command line tools, or even using functions in Pandas to turn our results into DataFrames.
4. Using correct table names
If we recall, for all table names in BigQuery, we must use the project, dataset, and table formatting as seen in the first query on the left. However, if we are not querying data from another project and running our query in the same project we have set as our default, we can use just the dataset and table name formatting as seen in the second query on the right.
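As a quick sketch of the two naming styles, here is how the same table might be referenced both ways. The project, dataset, and table names below are hypothetical placeholders:

```sql
-- Fully qualified: project.dataset.table (project name is hypothetical)
SELECT order_id
FROM `my-project.olist.orders`;

-- When querying within our default project, the project can be omitted
SELECT order_id
FROM olist.orders;
```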
5. GoogleSQL
We should also note that BigQuery uses a specific SQL syntax called GoogleSQL. All the core elements of SQL still apply, but some specific functions and data type names differ from standard SQL. For example, integer and float data types in GoogleSQL are always 64-bit, named INT64 and FLOAT64, as seen here.
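For instance, casting values to GoogleSQL's 64-bit numeric types looks like this (the string literals are just illustrative):

```sql
-- GoogleSQL's numeric types are 64-bit: INT64 and FLOAT64
SELECT
  CAST('42' AS INT64)     AS an_integer,
  CAST('3.14' AS FLOAT64) AS a_float;
```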
6. Our datasets: Olist E-Commerce
We will now review some of the fundamental SQL operators while introducing some of the data we will use during the course. We will be using several tables from the Olist E-Commerce dataset on Kaggle. Here, we can see the four tables we will use throughout the course. These are modified slightly from the original source data.
7. Products
Our products dataset contains information about the products being purchased, such as the number of photos in the product listing, weight, and category name.
8. Orders
The orders dataset contains two columns, the order-id and order-items, the latter of which is a nested data type. The nested order-items column has four fields: the "order-item-id", or the item number within the order (e.g., 1 and 2 if the order has two items), the product and seller IDs, and the item's price.
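As a preview of working with this nested column, a query like the following flattens each item onto its own row. The exact column names are assumptions based on the description above:

```sql
-- Flatten the nested order_items column so each item gets its own row
-- (column names are illustrative and may differ slightly in the dataset)
SELECT
  o.order_id,
  item.order_item_id,
  item.product_id,
  item.price
FROM orders AS o,
  UNNEST(o.order_items) AS item;
```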
9. Order details
The order-details dataset has an order and customer ID that can be used to join our data, the order status, and several timestamp fields representing different stages of the delivery process. If those columns are null, then that stage has not been completed.
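For example, we could use one of those null timestamps to find orders that have not yet reached a given stage. The column names below are assumptions based on the description above:

```sql
-- Orders whose delivery stage is incomplete (timestamp still NULL);
-- column names are illustrative
SELECT order_id, order_status
FROM order_details
WHERE order_delivered_customer_date IS NULL;
```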
10. Payments
Our payments dataset has data about the payment type, if the order is paid in installments, which installment it is, and the value of the payment.
11. Review of aggregations and joins
Before we proceed, we will also quickly review joins and aggregations, as they are a core part of analytical queries. As we can see in this query, we are joining our orders and order-details datasets. There are five components to a join with an aggregation. First is our aggregation function, in this case, the "count" of the orders. Next, we state the left dataset in the join, in this case, order-details, followed by the right dataset, orders. Next is our join condition. We typically use the "on" keyword when the join column names differ, but since our column names are the same, we can use "using" followed by the shared column name in parentheses. Finally, we have our group by condition, in this case our "customer-id".
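Putting those five components together, the query described above might look like this. The exact table and column names are assumptions based on the description and may differ slightly:

```sql
-- Count of orders per customer, joining on the shared order_id column
SELECT
  od.customer_id,
  COUNT(o.order_id) AS order_count
FROM order_details AS od
JOIN orders AS o
  USING (order_id)
GROUP BY od.customer_id;
```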
12. Let's practice!
Now that we have a basic understanding of the data we will be working with and a quick SQL refresher, let's jump in and start writing some queries.