The Open Lakehouse

1. The Open Lakehouse

As we discussed in the last lesson, Apache Iceberg is a description of how SQL interacts with files, but it's just that: a description. To actually work with Apache Iceberg, you need other components, which together make up an Open Lakehouse. An Open Lakehouse is simply a collection of components sitting on top of an open format, allowing a plethora of tools to access the underlying data without ETL processes. There are many component options covering a wide array of use cases, but not every user needs every possible function. Most Open Lakehouse users will need a catalog, storage, and a query engine. In this lesson, we'll break down what each of those components is and does, starting with the catalog.

Apache Iceberg requires a catalog to associate table identifiers with their table metadata and to ensure transactional correctness. Think of your catalog like an old address book. Whereas an address book tells you things like a business's location and phone number, the Iceberg catalog stores where your tables are located and what their current state is. Originally, Iceberg had a collection of client-side plugins that could use existing catalog technologies, but most modern Iceberg deployments use catalogs that are compatible with the Apache Iceberg REST specification. While there are a lot of technical reasons why the REST specification, or spec for short, is important, the main value to users is that it makes it vastly easier to write applications that can talk to any compatible catalog without changing the configuration.

When choosing a catalog, it is important to make sure that its persistence layer, features, and governance match the requirements of your organization. At a high level, the persistence layer describes how the catalog's information is actually stored, and it's usually key to throughput and durability.
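To make the address book idea a bit more concrete, here is a minimal sketch of what a catalog does at its core: it maps a table identifier to the location of that table's current metadata file, and it only accepts a change if the table has not moved since the writer last looked. The `ToyCatalog` class, the table name, and the `s3://warehouse/...` paths are all hypothetical and for illustration only. This is not how Apache Polaris or any real catalog is implemented.

```python
import threading


class CommitConflict(Exception):
    """Raised when a table changed since the writer last read it."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table identifier -> current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def load(self, identifier):
        """Look up where the table's current metadata lives."""
        return self._pointers[identifier]

    def commit(self, identifier, expected, new_location):
        """Swap the pointer only if nobody else committed in the meantime."""
        with self._lock:
            current = self._pointers.get(identifier)
            if current != expected:
                raise CommitConflict(
                    f"{identifier}: expected {expected}, found {current}"
                )
            self._pointers[identifier] = new_location


catalog = ToyCatalog()

# Creating a table: there is no previous pointer, so `expected` is None.
catalog.commit(
    "sales.orders", None, "s3://warehouse/sales/orders/metadata/v1.metadata.json"
)

# Two writers read the same starting state, and the first one commits.
seen = catalog.load("sales.orders")
catalog.commit(
    "sales.orders", seen, "s3://warehouse/sales/orders/metadata/v2.metadata.json"
)

# The second writer still holds the stale pointer, so its commit is rejected
# instead of silently overwriting the first writer's change.
try:
    catalog.commit(
        "sales.orders",
        seen,
        "s3://warehouse/sales/orders/metadata/v2-other.metadata.json",
    )
    conflict = False
except CommitConflict:
    conflict = True
```

The compare-and-swap in `commit` is the part that stands in for transactional correctness here: the catalog is the single place where the current state of a table is decided.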
The features of a catalog include things like the ability to store other table types or perform audit logging. Lastly, governance describes how user permissions, policies, and fine-grained access controls are managed on a table. As long as you pick a REST-compatible catalog, switching catalogs in the future is a relatively easy task, which helps with portability and interoperability.

While the catalog is responsible for listing and organizing table identifiers and performing table state transactions, the storage component is responsible for durability. All of the metadata and data files for our Apache Iceberg tables live in storage. Iceberg tables can be rebuilt in their entirety, including their history, from just the files in storage. This means that when we change catalogs, we can point the new catalog to our existing table in storage and move the table without any loss or data movement. When choosing storage, the easiest solution is to use whatever you already have connected to other deployments. If you're deploying to a cloud environment, then cloud storage is the right answer. But if you're using local deployments, then a local object store, not a block store, is probably the right decision.

With storage and catalog sorted out, the next thing we need is something to actually perform queries on the tables in our storage component. The component that handles this is the query engine. The query engine is responsible for taking our queries and transforming them into actions that read and write our Apache Iceberg tables. There are a lot of options in this space, and two of the most popular open-source options are Apache Spark and Trino. In both cases, the main goal of the engine is to turn user queries into code, even if their implementations are quite different.
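The earlier point about changing catalogs without data movement can be sketched with plain dictionaries. In this toy model, storage holds every metadata and data file, each catalog holds only a pointer to the current metadata file, and "moving" a table means registering that existing pointer in the new catalog. The `register_table` helper, the table name, and the paths are hypothetical. Real catalogs typically offer a comparable register operation, but check your catalog's documentation for the details.

```python
# Storage: object path -> contents. Paths and contents are made up.
storage = {
    "s3://warehouse/sales/orders/metadata/v2.metadata.json":
        "table metadata (schema, snapshots, history)",
    "s3://warehouse/sales/orders/data/part-0001.parquet":
        "table rows",
}

# Each catalog only knows where a table's current metadata file is.
old_catalog = {
    "sales.orders": "s3://warehouse/sales/orders/metadata/v2.metadata.json",
}
new_catalog = {}


def register_table(catalog, identifier, metadata_location, storage):
    """Point a catalog at a metadata file that already exists in storage."""
    if metadata_location not in storage:
        raise FileNotFoundError(metadata_location)
    catalog[identifier] = metadata_location


# Take a copy of storage so we can confirm nothing in it changes.
storage_before = dict(storage)

# "Move" the table: the new catalog gets the pointer, the files stay put.
register_table(
    new_catalog, "sales.orders", old_catalog["sales.orders"], storage
)

files_untouched = storage == storage_before
```

Nothing is copied or rewritten in `storage`, which is why the table's data and history survive the switch.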
When choosing an engine, it's important to research the engine's specific capabilities beyond Iceberg compatibility to determine the right tool for your specific use case. To add our engine to the Lakehouse, we add configuration that describes how to connect to the catalog and our storage layer. Once completed, that configuration file will look like this, with sections for the catalog, storage, and engine components.

In this lesson, we built a small, self-contained local Lakehouse for exercises and experimentation with these components. For this specific example, the catalog we used is Apache Polaris, an open-source catalog that implements the Iceberg REST specification. For the storage component, we've included MinIO, an Amazon S3-compatible object store. For the query engine, we've incorporated two different options to show different mechanisms for working with Iceberg: Trino, and Jupyter Notebooks running Apache Spark. It's important to note that, even within this local context, we could replace any of these components with any number of open or closed, self-managed or hosted pieces without losing access to our data, assuming we migrated the data to the new component first. Because of this, we maintain the flexibility to choose which layers we want to own based on our business needs, and we can trust that we can change them in the future if business or technical requirements change.

With that said, let's check out the environment we just built by doing your first exercise. To do this, you'll need to install Docker Desktop and clone the project. Once you've cloned the project, follow the instructions in the README or the reading to get started.
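As a rough illustration of the engine configuration mentioned in this lesson, the sketch below builds the kind of Spark properties that connect an engine to a REST catalog and an S3-compatible store. The property keys follow Iceberg's Spark catalog conventions, but the catalog name `polaris`, the warehouse name `demo_catalog`, and both `localhost` URLs are assumptions made for this example. The project's README has the actual values, and credentials are deliberately left out.

```python
def iceberg_spark_conf(catalog_name, catalog_uri, warehouse, s3_endpoint):
    """Build Spark properties for an Iceberg REST catalog backed by S3-style storage."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Engine: enable Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Catalog: register a Spark catalog that talks to a REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": catalog_uri,
        f"{prefix}.warehouse": warehouse,
        # Storage: read and write files through an S3-compatible endpoint.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


# All four argument values below are assumed, not taken from the project.
conf = iceberg_spark_conf(
    catalog_name="polaris",
    catalog_uri="http://localhost:8181/api/catalog",
    warehouse="demo_catalog",
    s3_endpoint="http://localhost:9000",
)

# Print the properties in a spark-defaults style for inspection.
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

In a notebook, each pair could be passed to `SparkSession.builder.config(key, value)` before creating the session. Swapping the catalog or the storage then comes down to changing the URI, warehouse, and endpoint values rather than the queries themselves.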

2. Let's practice!
