1. Ingesting Data
Welcome back! In this video, we will discuss how to ingest data into the lakehouse using Databricks SQL.
2. Motivation
Imagine a lake where we plan to take our boat. The water in the lake is essential, but it doesn't exist there on its own. It originates from a source, such as snowmelt on a distant mountain. From there, the water flows down a river, eventually reaching the lake as its final destination. In a sense, the river is ingesting water from the mountain and delivering it to the lake.
3. Creating the lakehouse
Building a lakehouse involves several key steps to ingest, transform, and analyze datasets effectively.
Organizations typically manage a mix of datasets, including both structured data from databases and unstructured data in file formats. To establish a solid foundation for the lakehouse, it's essential to first ingest this data into the Bronze layer, which serves as the raw data repository. With this layer in place, organizations can begin transforming their data to uncover valuable insights.
4. GUI-based options
Let’s begin by exploring the two main GUI-based options for data upload.
Lakeflow Connect offers a fully managed ingestion solution, with built-in integrations for popular databases and SaaS applications. It automatically creates production-quality pipelines to ingest both existing and new data, making it ideal for large-scale, production-level datasets.
Alternatively, users can manually upload data files, such as CSV or Parquet, directly in Databricks and create a Delta table. This approach is better suited for smaller-scale, ad hoc analysis, where a quick, manual upload is all that's needed for exploratory work.
5. Bringing data into the lakehouse
Databricks SQL also provides two programmatic options for more flexible and automated data ingestion.
The `COPY INTO` command enables you to load large collections of data files directly into a Delta table from cloud storage. This method is well-suited for relatively static datasets that don't change frequently. In the example below, we ingest a folder of Parquet files and create a Delta table named *my_table*. This approach ensures a quick and efficient ingestion process for bulk data.
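A minimal sketch of what such a command could look like is shown below; the table name *my_table* comes from this example, while the storage path and the schema-merge option are hypothetical placeholders.

```sql
-- Create an empty Delta table; its schema will be inferred from the incoming files
CREATE TABLE IF NOT EXISTS my_table;

-- Bulk-load every Parquet file from a (placeholder) cloud storage folder into the Delta table
COPY INTO my_table
FROM '/mnt/raw/sales_parquet/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```

Because `COPY INTO` keeps track of which files it has already loaded, rerunning the command only picks up files that haven't been ingested yet.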
For more dynamic and fast-changing datasets, Databricks offers the Auto Loader feature. Auto Loader continuously monitors a specified cloud storage location and automatically ingests new files as they arrive. In the provided example, we use the `cloud_files` keyword to set up Auto Loader to ingest any new CSV files that land in the designated file path. This makes it an excellent solution for scenarios requiring real-time or near-real-time data updates.
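As a rough sketch, Auto Loader can be set up from SQL by defining a streaming table over `cloud_files`, assuming the streaming table syntax available in Databricks SQL and Delta Live Tables; the table name, file path, and header option below are hypothetical placeholders.

```sql
-- Continuously ingest any new CSV files that land in the (placeholder) folder
CREATE OR REFRESH STREAMING TABLE my_streaming_table
AS SELECT *
FROM cloud_files(
  '/mnt/raw/csv_landing/',   -- cloud storage location to monitor
  'csv',                     -- expected file format
  map('header', 'true')      -- treat the first row of each file as column names
);
```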
6. Let's practice!
Now, let's go ingest some data into our Databricks environment!