Delta Lake table optimization

1. Delta Lake table optimization

In this video, we'll talk about identifying and resolving issues with Delta tables in Fabric.

2. What is Delta Lake?

Delta Lake is an open source storage layer that enables building a Lakehouse architecture on top of data lakes by adding support for ACID transactions, metadata handling and versioning. Fabric uses Delta Lake tables as the standard across the entire platform. Data ingested into Fabric is stored as Delta tables by default using the Parquet file format. All Fabric experiences generate and consume Delta Lake tables. For example, Delta Lake tables produced by an engine like Fabric Data Warehouse can be consumed by another engine like Power BI.
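
To make this concrete, here is a minimal sketch of writing a DataFrame as a Delta table from a Fabric notebook. The table name sales and the sample rows are illustrative assumptions, and spark is the Spark session that Fabric notebooks provide.

    # A minimal sketch, assuming a Fabric notebook with an active Spark session.
    # saveAsTable writes a managed Delta table (Parquet data files plus a
    # transaction log) into the lakehouse; the table name "sales" is hypothetical.
    df = spark.createDataFrame(
        [(1, "Contoso", 250.0), (2, "Fabrikam", 120.0)],
        ["order_id", "customer", "amount"],
    )
    df.write.format("delta").mode("overwrite").saveAsTable("sales")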

3. Table maintenance

Fabric Lakehouse provides the Table maintenance feature to manage Delta tables and keep them in an optimal state for analytics. The feature provides three core operations: Optimize, V-Order, and Vacuum. We'll drill down into each of these operations.

4. Optimize

The Optimize operation consolidates multiple small Parquet files into fewer, larger files. Fabric engines benefit from larger file sizes, ideally between 128 megabytes and 1 gigabyte. This improves compression and data distribution, making read operations more efficient. It's a general best practice to run optimization after loading large tables.
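
As a quick way to see whether a table might benefit from Optimize, you can inspect its file count and total size with Delta's DESCRIBE DETAIL command. This is a hedged sketch; the table name sales is an assumption.

    # Assumes a lakehouse table named "sales" (hypothetical).
    # DESCRIBE DETAIL reports, among other columns, the number of data files
    # and the total size in bytes, which hints at how fragmented the table is.
    spark.sql("DESCRIBE DETAIL sales").select("numFiles", "sizeInBytes").show()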

5. Optimize

In this example, a lakehouse table is composed of a dozen small Parquet files. After running Optimize, all data is condensed into two large Parquet files. The original files remain in storage, but they are no longer active.

6. Running the Optimize command from Lakehouse explorer

You can run an ad-hoc optimize operation on a lakehouse table by selecting the Maintenance option in the Lakehouse explorer, and then selecting the option Run OPTIMIZE command.

7. Running the Optimize command in Spark SQL

You can also invoke the OPTIMIZE statement in a Spark SQL notebook cell.
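
For example, issuing the statement through spark.sql from a notebook cell might look like the sketch below; the table name sales is an assumption.

    # A minimal sketch, assuming a lakehouse table named "sales".
    # OPTIMIZE compacts the table's small Parquet files into larger ones.
    spark.sql("OPTIMIZE sales")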

8. Running the Optimize command in PySpark

You can also use the DeltaTable optimize function in PySpark.
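
A minimal sketch using the Delta Lake Python API, assuming a table named sales:

    from delta.tables import DeltaTable

    # Assumes a lakehouse table named "sales" (hypothetical).
    # executeCompaction() rewrites small files into larger ones,
    # equivalent to running the OPTIMIZE statement.
    delta_table = DeltaTable.forName(spark, "sales")
    delta_table.optimize().executeCompaction()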

9. V-Order

V-Order is a write-time optimization that applies special sorting and compression to Parquet files so they can be read faster by Fabric compute engines like Power BI, SQL, and Spark. V-Order is enabled by default.

10. Applying V-Order from Lakehouse explorer

V-Order is an option that can be selected when running the Optimize command from Lakehouse explorer.

11. Controlling V-Order writing in Apache Spark session

In Spark, V-Order can be enabled or disabled at the session level. When it's enabled, all Parquet writes performed by commands in that session apply V-Order.
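
A hedged sketch of toggling this setting for the current session is shown below; the exact configuration property name may vary across Fabric Spark runtime versions.

    # Session-level control of V-Order writing in Fabric Spark.
    # The property name below is the one commonly documented for Fabric
    # runtimes; treat it as an assumption and check your runtime's docs.
    spark.conf.set("spark.sql.parquet.vorder.enabled", "true")    # enable
    # spark.conf.set("spark.sql.parquet.vorder.enabled", "false") # disable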

12. Applying V-Order when optimizing a table

The VORDER option can also be specified when invoking the OPTIMIZE statement in a Spark SQL notebook.
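
Issued through spark.sql, this might look like the sketch below; the table name sales is an assumption.

    # Assumes a lakehouse table named "sales".
    # The VORDER option asks Optimize to rewrite the compacted files
    # with V-Order sorting and compression applied.
    spark.sql("OPTIMIZE sales VORDER")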

13. Vacuum

VACUUM removes old files that are no longer referenced by a Delta table log. Files must be older than the retention threshold, which is seven days by default. Deleting files that are no longer needed keeps the table's storage in good shape and reduces storage costs.
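
A hedged sketch of running VACUUM from a notebook, assuming a table named sales:

    # Assumes a lakehouse table named "sales".
    # With no argument, VACUUM uses the default retention threshold (7 days);
    # RETAIN states it explicitly (168 hours = 7 days). Removing files that
    # are still within retention would break time travel to older versions.
    spark.sql("VACUUM sales")                    # default retention
    spark.sql("VACUUM sales RETAIN 168 HOURS")   # explicit retention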

14. Vacuum

In this example, a lakehouse table's data resides in two large Parquet files, while a set of old files is still taking up storage. After running Vacuum, the old files are removed.

15. Running Vacuum from Lakehouse explorer

You can run an ad-hoc vacuum operation on a lakehouse table by selecting the Maintenance option in the Lakehouse explorer, then selecting the option Run VACUUM command and specifying the retention threshold in days or weeks.

16. Let's practice!

Now, let's do a couple of exercises to practice running these Delta Lake table optimizations.