Moving existing data to Iceberg
1. Moving existing data to Iceberg
You've learned how Iceberg works. You understand the metadata architecture, the file skipping, and the ACID guarantees. But here's a reality check. You rarely get to start fresh with a Greenfield data lake. You've likely got years of data sitting in Hive tables, Parquet files, scattered across Maybe even Delta or Apache hoodie tables that the teams have already built pipelines around. And your business can't afford downtime while you figure out migration. So how do you actually move to Iceberg without breaking everything? How do you test changes safely when downstream teams depend on your tables? And how do you evolve your schemas when requirements change every quarter? That's what Module 2 is all about. We're taking Iceberg from concept to production reality. Let's get started. In the first module, we worked to understand how Iceberg works under the hood. Now let's talk about getting your external data into Iceberg tables. In the real world, data comes from many different sources, like CSV files, relational databases, existing Parquet files, Hive tables, or even other table formats like Delta Lake. The good news is that Iceberg provides several migration strategies, and you're free to choose the right one based on your source format and requirements. The easiest and most efficient migration path is available when your data is already stored in an Iceberg-supported data file format, like Parquet, ORC, or Avro. In these cases, you don't need to rewrite or copy the data files at all. Instead, Iceberg can read the existing file metadata and create its own metadata layer on top of them. This process consolidates all the column metrics from the existing files into Iceberg's manifest structure, and the resulting table behaves almost as if it had been written as Iceberg from day one. You get all of the Iceberg features, such as ACID transactions, time travel, and schema evolution, without the cost and time of rewriting terabytes or petabytes of data. One of the key virtues of a good engineer is laziness, so whenever we have the chance to do less work, we should take it. The Iceberg Spark integration includes built-in tools for this metadata-based migration, with implementations in Spark for migrating Hive tables within a Hive metastore, and a general approach for any file-based tables. Let's look at one of the primary commands, snapshot, and go through an example to show how it works. The snapshot procedure is minimally invasive because it creates a new Iceberg table that references existing data files without modifying the source. This is perfect when you want to experiment with Iceberg while keeping your original table intact. Once a snapshot has been created, you must be careful not to delete the underlying data files from another process, since Iceberg is now relying on their existence. Changes to the original table will also not be propagated in the Iceberg snapshot automatically, so manual intervention is required to add new files. Let's see the snapshot command in action with our New York City Taxi dataset from Module 1. We have parquet files sitting in object storage, and we want to make them queryable as an Iceberg table. To do this, we use the snapshot command like this. What just happened? Iceberg scanned the parquet file footers, extracted the schema and column statistics, and built manifest files that reference these existing data files. No data was copied or rewritten. Instead, we just wrapped metadata around it. Powerful stuff. But it gets even better. Let's say you have additional parquet files you want to add to this table. Maybe a new month of taxi data just arrived. You can use the add files procedure to incorporate new files into your Iceberg table by running this code. Again, no data rewriting, just updating the metadata with some Iceberg operations. Iceberg reads the new file statistics and adds the entries to the manifest. This is incredibly powerful for incrementally building tables from landing zones or incorporating backfill data. It also helps with the earlier problem of keeping your data in sync between your source and Iceberg tables if you so choose. Before we talk in depth about another migration strategy, I want to take a moment and note that for other table formats, like Delta Lake and Apache Hoodie, where the data isn't already packaged in Iceberg supported formats, there are tools that will convert from the other format's metadata layer directly to Iceberg. These tools are beyond the scope of this course and we will not go into them here. With that being said, let's take a moment and discuss the worst case scenario for migration. Data that isn't already in a compatible format. This requires our other migration strategy, reserializing the data into a new Iceberg table. This is the most compute-intensive approach because it uses your query engine as an intermediate processing layer. The engine reads from the source, transforms the data as necessary, and writes it out as new Iceberg data files. When do you need this approach? The first major time is when your source data is in a format that Iceberg doesn't support natively, like CSV, JSON, or many proprietary formats. These need to be parsed and converted into an open standard format like Parquet, ORC, or Avro as part of the Iceberg write. The second most likely place you'll encounter this is when you're extracting a subset of data from a non-file-based source system. For example, Postgres has no underlying files we could snapshot, so we need an intermediary to actually extract the data and write new data files for us. In Spark, this is typically done with a create table as select statement, like those you've seen in Module 1. The beauty of CTAS is that Spark can read from any data source it supports, including JDBC connections to databases, CSV files, JSON, existing Hive tables, even other Iceberg tables, and write the results as a new Iceberg table. Let's take a look at a simple example of using CTAS with a CSV file. In this example, we use a few lines of code to make a small CSV and then run a CTAS statement on it to convert it into an Iceberg table. While this method is more expensive in terms of compute and storage I.O., it gives you complete flexibility. You can transform data during the migration, apply filters, change partitioning schemes, or combine data from multiple sources, which is often a requirement in the real world. This is one of the main reasons we use this method in the data modeling exercise in Module 1. It's also the most straightforward approach when you're dealing with smaller datasets or when your data transformation is part of your migration requirements anyway. So to summarize, use Snapshot for existing Parquet, ORC, or Avro files. It's fast and efficient. Use format-specific converters when migrating from other compatible table formats and you want to preserve history. And use CTAS or similar reserialization approaches when dealing with unsupported formats or when you're extracting and transforming data as part of the migration. In the exercises, you'll practice Snapshot and reserialization approaches and get a feel for when each one makes sense.2. Let's practice!
Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.