End-to-end data pipeline example in Databricks

1. End-to-end data pipeline example in Databricks

Hey there! In this video, I will be playing the role of a data engineer for Wikipedia. I have been tasked with creating a data pipeline that can update some reports and dashboards about our clickstream data in an automated way. Based on my learnings so far, I am ready to deploy this pipeline in Databricks. To start, let’s take a look at a notebook that dives a bit into the data I will be working with. You will notice right away that this notebook, and the overall UI, will look familiar if you are accustomed to other popular IDEs. Databricks notebooks are compatible with the open-source Jupyter format, and Databricks has added several upgrades to the notebook and developer experience.

We have a collection of JSON files from our website host and want to have some high-level metrics about our webpages. We can see a quick view into our data, which contains information about what pages a user has clicked on and in what order. I want to get this pipeline up and running as quickly as I can, but I also want to make sure that it is efficient and a good implementation in Databricks. This is a great example of when to use Delta Live Tables. Here we can see that I am able to declare what the table looks like using standard SQL syntax with the LIVE TABLE keyword for Delta Live Tables. I can also define some data quality constraints, which will be enforced by Databricks for all incoming data. This is all the code that I need to write, because under the hood Databricks is creating a more sophisticated and optimized pipeline that will process the data to reach my declared end state.

We can look at the Delta Live Tables pipeline and see all of the steps that will create the different layers in our Medallion architecture. First we have a step that ingests the raw data from our data source, which creates the Bronze layer. That step leads into a data cleansing step, creating our Silver table. Finally, we aggregate and further filter a dataset for a specific purpose, resulting in a Gold table. In this UI, I could trigger the pipeline to run manually, or even refresh specific tables without having to run the whole pipeline.

Since I want to automate a dashboard refresh as well, I want to create a workflow that will make sure every person looking at the data has the most up-to-date version. If we jump over to the Dashboards section here, I can see the Wikipedia Dashboard, with some high-level information about this dataset. I don’t want to dive into the details of this report, but I do want to make sure that I can refresh this dashboard whenever new data comes in.

To do this, I will jump over to the Workflows section and create a new end-to-end pipeline. Let’s call this workflow the Wikipedia Dashboard Refresh Pipeline. First I want the data itself to be updated, so I will create a new task, called Data_Refresh, that calls our Delta Live Tables pipeline. I could also set up alerts on this pipeline so that if something fails I can go triage it in the environment. Next, I want to add another task by clicking Add Task; this one will refresh the dashboard. Let’s call this task Dashboard_Refresh and select our Wikipedia Dashboard as its target. This task is dependent on the completion of our Delta Live Tables pipeline, and we can specify that here. With that, we have set up an end-to-end pipeline that will refresh data for our BI users. Next, let’s practice implementing one of these on our own!
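
As a quick aside on the data exploration step, a first look at the raw JSON files in a Databricks notebook might resemble the sketch below. The path and columns are hypothetical placeholders, not the exact data shown in the video.

```python
# Hypothetical quick look at the raw clickstream JSON in a Databricks notebook.
# `spark` and `display` are predefined in the Databricks notebook environment.
clickstream_raw = spark.read.json("/Volumes/wikipedia/clickstream/raw/")  # placeholder path
clickstream_raw.printSchema()        # inspect the inferred schema
display(clickstream_raw.limit(10))   # render a small sample as a table
```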
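
The declarative pipeline described above could be sketched with the Delta Live Tables Python API as shown below; the video itself uses the equivalent SQL syntax with the LIVE TABLE keyword and constraint clauses. The source path, column names, and expectation rule are assumptions for illustration only.

```python
# A minimal Delta Live Tables sketch of the Bronze -> Silver -> Gold (Medallion) flow.
# Paths, columns, and the expectation are hypothetical placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw clickstream JSON ingested with Auto Loader")
def clickstream_bronze():
    return (
        spark.readStream.format("cloudFiles")         # `spark` is predefined in Databricks
        .option("cloudFiles.format", "json")
        .load("/Volumes/wikipedia/clickstream/raw/")  # placeholder source path
    )

@dlt.table(comment="Silver: cleansed clickstream events")
@dlt.expect_or_drop("valid_page", "curr_title IS NOT NULL")  # data quality constraint
def clickstream_silver():
    return (
        dlt.read_stream("clickstream_bronze")
        .select("prev_title", "curr_title", "click_count")
    )

@dlt.table(comment="Gold: page-level metrics that feed the dashboard")
def clickstream_gold():
    return (
        dlt.read("clickstream_silver")
        .groupBy("curr_title")
        .agg(F.sum("click_count").alias("total_clicks"))
    )
```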
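
The two-task workflow is built through the Workflows UI in the video; as a rough equivalent, it could also be created programmatically, for example with the Databricks SDK for Python as sketched below. The pipeline, dashboard, and SQL warehouse IDs are hypothetical placeholders.

```python
# Sketch of the "Wikipedia Dashboard Refresh Pipeline" job: a Delta Live Tables
# refresh task followed by a dependent dashboard refresh task.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace credentials from the environment

job = w.jobs.create(
    name="Wikipedia Dashboard Refresh Pipeline",
    tasks=[
        # Task 1: refresh the data by running the Delta Live Tables pipeline.
        jobs.Task(
            task_key="Data_Refresh",
            pipeline_task=jobs.PipelineTask(pipeline_id="<dlt-pipeline-id>"),
        ),
        # Task 2: refresh the Wikipedia Dashboard once Data_Refresh completes.
        jobs.Task(
            task_key="Dashboard_Refresh",
            depends_on=[jobs.TaskDependency(task_key="Data_Refresh")],
            sql_task=jobs.SqlTask(
                dashboard=jobs.SqlTaskDashboard(dashboard_id="<dashboard-id>"),
                warehouse_id="<sql-warehouse-id>",
            ),
        ),
    ],
)
print(f"Created workflow with job_id {job.job_id}")
```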

2. Let's practice!
