BigQuery Architecture

1. BigQuery Architecture

In this video, we will learn about the unique architecture of BigQuery and the elements that let it run queries at massive scale!

2. Columnar data

BigQuery stores data in a column-oriented format. This means the values of each column are stored together, separately from the other columns. This layout makes read-intensive workflows much faster, because a query reads only the columns it references rather than every full row, as in the row-oriented layout in the image on the left. If you think of BigQuery as a car factory, think of the columnar data as the individual parts of the car.
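To make the difference concrete, here is a minimal Python sketch, not BigQuery's actual storage code. The toy table, its column names, and both helper functions are hypothetical. It counts how many values each layout has to touch to average a single column.

```python
# Hypothetical toy table: three cars, three columns each.
rows = [
    {"model": "sedan", "color": "red", "price": 20000},
    {"model": "coupe", "color": "blue", "price": 25000},
    {"model": "truck", "color": "red", "price": 30000},
]

# Column-oriented layout of the same data: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def average_price_row_store(rows):
    """Row layout: every full record is read, even though only price is needed."""
    values_read = sum(len(row) for row in rows)
    total = sum(row["price"] for row in rows)
    return total / len(rows), values_read


def average_price_column_store(columns):
    """Column layout: only the price column is read."""
    prices = columns["price"]
    return sum(prices) / len(prices), len(prices)


row_avg, row_values_read = average_price_row_store(rows)
col_avg, col_values_read = average_price_column_store(columns)
print(row_avg, row_values_read)  # 25000.0 9
print(col_avg, col_values_read)  # 25000.0 3
```

Both layouts give the same answer, but the columnar one touches a third of the values here. The gap grows with every extra column the query does not need.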

3. Capacitor

This brings us to Capacitor, a columnar storage format introduced in 2016 to efficiently store and query semi-structured data with nested and repeated fields. Each BigQuery table column (highlighted as a single file in this diagram) is stored in a separate Capacitor file, which allows the data to be highly compressed and therefore read faster. In our car factory metaphor, Capacitor is an organized warehouse of parts, helping things move much faster.
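One reason a column stored on its own compresses well is that neighboring values are often identical. The sketch below uses run-length encoding, a common columnar compression technique, purely as an illustration. The transcript does not describe Capacitor's actual encodings, so the function names and the sample column are assumptions.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical column of car colors, stored contiguously as in a columnar file.
color_column = ["red", "red", "red", "blue", "blue", "red"]

encoded = run_length_encode(color_column)
print(encoded)  # [('red', 3), ('blue', 2), ('red', 1)]

# The encoding is lossless: decoding restores the column exactly.
print(run_length_decode(encoded) == color_column)  # True
```

In a row-oriented layout the same colors would be interleaved with models and prices, so runs like these would rarely appear.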

4. Colossus

Colossus is Google's distributed file system. It handles replication, recovery (for example, when a disk crashes), and distributed management. Each of Google's data centers has its own Colossus cluster, and each cluster contains enough disks to give each user many disks at a time. Colossus would be the various assembly line workers, each with a specific task, distributing work across the factory.

5. Jupiter

Jupiter is the network that connects storage and compute in BigQuery, allowing terabytes of data to be transferred from storage to compute. It provides one petabit per second of bisection bandwidth. That is similar to 100,000 servers each communicating at 10 gigabits per second. Jupiter would be the robotics in the factory, quickly moving parts to where they need to go.
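Here is a quick arithmetic check of that comparison, assuming each server communicates at 10 gigabits per second. That is the unit under which the per-server figure adds up to one petabit per second. The constant names are just for illustration.

```python
SERVERS = 100_000
GIGABITS_PER_SERVER = 10          # assumed per-server rate, in gigabits per second
GIGABITS_PER_PETABIT = 1_000_000  # 10**15 bits / 10**9 bits

total_gigabits = SERVERS * GIGABITS_PER_SERVER
total_petabits = total_gigabits / GIGABITS_PER_PETABIT
print(total_petabits)  # 1.0 petabit per second

# If the per-server rate were 10 gigabytes per second instead,
# the total would be eight times larger (8 bits per byte).
total_petabits_if_bytes = total_gigabits * 8 / GIGABITS_PER_PETABIT
print(total_petabits_if_bytes)  # 8.0 petabits per second
```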

6. Dremel

Dremel is the engine that organizes and runs your query so that it completes quickly. When Dremel receives your query, it splits it into logical levels so it can run as efficiently as possible. Dremel would be the order of the assembly line, making sure each car is assembled as effectively as possible.

7. Borg

Borg is a vast cluster management system with thousands of central processing unit (CPU) cores running on many machines. Borg allocates resources to jobs like the Dremel cluster. It can even route around hardware failures, keeping operation seamless when machines fail. Borg would be the foreperson in our factory, making sure everything is running quickly and efficiently.

8. Mixers, leaves, execution trees, and slots

As we saw in the architecture diagram, Dremel is the engine that actually executes and manages our query. As shown here, query execution uses a hierarchical structure with a root node, intermediate nodes (mixers), and leaf nodes. When a user submits a query, it starts at the root node, which reads metadata and communicates with the mixers. Mixers optimize the query and send it to leaf nodes for execution. Leaf nodes read data from Colossus, apply filters, and perform aggregation. BigQuery also calculates slots, or the computing power required, for each query based on its complexity. This process allows for parallelization and efficient execution, with results returned up the hierarchy to the user.
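The flow described above can be sketched as a small tree of functions. This is a toy model under stated assumptions, not Dremel's implementation. The shards, the `leaf`, `mixer`, `merge`, and `root` names, and the fan-out are all hypothetical, and a thread pool stands in for leaves working in parallel. The query it mimics is roughly "count cars per color where price is at least 20000".

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data shards, standing in for column data read from storage.
SHARDS = [
    [("red", 20000), ("blue", 25000)],
    [("red", 30000), ("green", 18000)],
    [("blue", 27000), ("red", 22000)],
    [("green", 19000), ("blue", 26000)],
]


def leaf(shard, min_price):
    """Leaf node: scan one shard, apply the filter, and partially aggregate."""
    counts = {}
    for color, price in shard:
        if price >= min_price:
            counts[color] = counts.get(color, 0) + 1
    return counts


def merge(partials):
    """Combine partial counts coming from the level below."""
    merged = {}
    for partial in partials:
        for key, count in partial.items():
            merged[key] = merged.get(key, 0) + count
    return merged


def mixer(shards, min_price, pool):
    """Mixer: fan the work out to leaves in parallel, then merge their results."""
    partials = list(pool.map(lambda shard: leaf(shard, min_price), shards))
    return merge(partials)


def root(min_price, fan_out=2):
    """Root: split the shards among mixers and merge what they return."""
    groups = [SHARDS[i:i + fan_out] for i in range(0, len(SHARDS), fan_out)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return merge(mixer(group, min_price, pool) for group in groups)


result = root(min_price=20000)
print(result)
```

Because each leaf returns only a small partial count instead of raw rows, the data moving up the tree shrinks at every level. This is what lets the aggregation be parallelized.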

9. Categorized architecture

Each of the elements we have discussed thus far can be categorized into one of three groups. Capacitor and Colossus manage the storage of data. Jupiter and Borg manage the compute resources in BigQuery. Dremel is the query execution engine that includes mixers, leaf nodes, slots, and the execution tree.

10. Let's practice!

Now, let's spend some time reviewing BigQuery's architecture in detail.
