1. Updating coffee sales data
In this example, I will continue my role as a data engineer at a global coffee retailing company. I am responsible for maintaining all of our sales-related datasets, and I constantly receive new data from our various data sources.
I will begin by setting my notebook to run from my centralized data schema. Doing so makes it easier to construct my queries, as all of my data resides in the same schema.
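As a minimal sketch of what that setup could look like, the statements below set a default catalog and schema for the session; the names main and coffee_sales are hypothetical placeholders, not taken from the example.

```sql
-- Point the notebook session at the centralized data schema
-- (catalog and schema names are hypothetical examples).
USE CATALOG main;
USE SCHEMA coffee_sales;
```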
I will next take a look at some new sales data that I have received. I have previously created a table called sales_new from the original CSV file. Looking at the data, I can see that these are all net-new records, so they can simply be appended to my sales table. In Databricks SQL, I can accomplish this with the INSERT INTO statement. Since both datasets are stored in Unity Catalog tables, this is an easy process: I just reference my target and source tables in the query. When I query my final dataset, I can see that my sales table now contains the most recent sales records.
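A hedged sketch of that append, assuming sales_new has the same column layout as sales (the sale_date column in the check query is a hypothetical name):

```sql
-- Append all net-new records from the source table to the target table.
-- Assumes both tables share the same schema.
INSERT INTO sales
SELECT * FROM sales_new;

-- Confirm the latest sales records now appear in the target table
-- (sale_date is a hypothetical column name).
SELECT MAX(sale_date) AS latest_sale_date
FROM sales;
```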
Next, I can start to work with some of my product data. My product dataset is fairly static, as our company has had the same products for quite some time. We do, however, periodically update how we refer to these products internally. In this products_updated table, we can see that every product has new values for the hierarchy columns, as we have re-categorized every product we sell. Since these records are updated versions of existing rows, I want to use the MERGE INTO syntax in Databricks SQL. I will again specify my target and source tables, and indicate that the product_id column is the key to match rows on. When a match is found, I will UPDATE all of the columns in the matched row. Now, when I query the products table, I can see that it reflects the latest categories.
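A sketch of that merge, assuming products and products_updated share the same columns and product_id uniquely identifies each row:

```sql
-- Upsert the re-categorized rows into the products table,
-- matching on product_id and updating every column on a match.
MERGE INTO products AS target
USING products_updated AS source
  ON target.product_id = source.product_id
WHEN MATCHED THEN
  UPDATE SET *;
```

Because every incoming row is an update to an existing product, no WHEN NOT MATCHED clause is needed here; if new products could arrive in the same file, a WHEN NOT MATCHED THEN INSERT * clause could be added to the same statement.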
By using these two strategies, I am confident that I can handle the most common data updates I run into. Both strategies work when the table structure stays the same, and they can even be extended to handle schema changes.
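For the schema-change case, one option with Delta Lake tables is to enable automatic schema evolution before the merge so that new source columns are added to the target; this configuration-based sketch is an assumption about how the scenario might be handled, not a step shown in the example (newer Databricks runtimes also offer a MERGE WITH SCHEMA EVOLUTION syntax).

```sql
-- Allow MERGE to add columns that exist in the source
-- but not yet in the target (Delta Lake automatic schema evolution).
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO products AS target
USING products_updated AS source
  ON target.product_id = source.product_id
WHEN MATCHED THEN
  UPDATE SET *;
```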
Now, let's go practice these techniques with our insurance datasets.
2. Let's practice!