
Data transformations in Databricks

1. Data transformations in Databricks

Hello, and welcome back! In this video, we will discuss the basics of data transformations in Databricks.

2. SQL for data engineering

With all that Databricks can do and support, a common question about data engineering pops up: which language is best to use? While there are many different ways to create the same data pipeline, there are some best practices to follow. SQL is very common in data engineering, and much of its popularity comes from the fact that most database administrators are already familiar with it. Despite being created mainly as a querying language, SQL is a great option for more standard transformations, and you can also call built-in methods and functions. Here is an example code snippet that uses the CREATE TABLE AS method, also known as CTAS.
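A minimal sketch of what such a CTAS statement might look like, submitted from a Python notebook cell with spark.sql; the table and column names (sales_summary, raw_sales, region, amount) are hypothetical.

```python
# `spark` is the SparkSession that Databricks provides in every notebook.
# Table and column names here are made up for illustration.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_summary AS
    SELECT region,
           SUM(amount) AS total_sales
    FROM raw_sales
    GROUP BY region
""")
```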

3. Other languages for data engineering

The other languages in Databricks, Python, R, and Scala, will be more familiar to those with a software engineering background. These languages are broader in their capabilities and can therefore support both standard and more complex transformations. You will also have more flexibility to use or create custom functions when working in these languages. Here is a code snippet that creates the same table as our SQL query but uses the PySpark API instead. Luckily, in Databricks we do not have to choose and can use these languages interchangeably!
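A sketch of how the same summary table might be built with the PySpark DataFrame API, assuming the same hypothetical raw_sales source and sales_summary target.

```python
from pyspark.sql import functions as F

# Build the same summary with the DataFrame API instead of SQL.
# `spark` is the notebook's SparkSession; the table names are hypothetical.
sales_summary = (
    spark.table("raw_sales")
         .groupBy("region")
         .agg(F.sum("amount").alias("total_sales"))
)

# Persist the result as a managed table, mirroring the CTAS statement above.
sales_summary.write.mode("overwrite").saveAsTable("sales_summary")
```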

4. Common transformations

Let us dive into some of the common kinds of transformations that one can do in Spark. Note that while I will be showing code examples in Python, there are corresponding functions in the other languages for each. First, we have schema manipulation, meaning we can add, remove, or change the columns of our DataFrame. In PySpark, this primarily uses the withColumn and drop methods. Next, we have filtering, which means we can remove rows that do not meet particular criteria. The filter method lets us pass in those criteria to produce the filtered DataFrame.
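A minimal sketch of these two transformations on a hypothetical raw_sales table; the column names and the tax rate are assumptions for illustration.

```python
from pyspark.sql import functions as F

# Hypothetical sales data.
df = spark.table("raw_sales")

# Schema manipulation: add a derived column, then drop one we no longer need.
df = (
    df.withColumn("amount_with_tax", F.col("amount") * 1.08)
      .drop("internal_notes")
)

# Filtering: keep only the rows that meet our criteria.
df = df.filter(F.col("amount_with_tax") > 0)
```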

5. Common transformations (continued)

Next, we have ways to handle or create nested data, which can exist in arrays or other nested data structures. The explode method takes an array and creates a new row for each entry, while the flatten method collapses an array of arrays into a single array. Finally, we have aggregation, where we put our data into groups and calculate totals or other summary statistics. This uses the groupBy method together with the agg method to define our summarization.
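A sketch of these nested-data and aggregation methods on a hypothetical orders table; the column names are assumptions.

```python
from pyspark.sql import functions as F

# Hypothetical orders table with an "items" array column and an
# "item_batches" column holding an array of arrays.
orders = spark.table("orders")

# explode: one new row per element of the "items" array.
exploded = orders.withColumn("item", F.explode("items"))

# flatten: collapse the array of arrays in "item_batches" into a single array.
flattened = orders.withColumn("all_items", F.flatten("item_batches"))

# Aggregation: group the exploded rows and compute a summary statistic.
item_counts = (
    exploded.groupBy("customer_id")
            .agg(F.count("item").alias("item_count"))
)
```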

6. Auto Loader

When creating pipelines, we know we must read in the data. If data arrives in our data lake quickly, it can become difficult to reconcile which data is new and which has already been processed. Luckily, Databricks has a tool called Auto Loader, which provides incremental and efficient processing of only the new data that has landed in a data lake. Here is a diagram of the pipeline structure that Auto Loader enables. In code, we simply pass in cloudFiles as our data format, which tells Databricks to use Auto Loader to ingest the data.
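A minimal sketch of an Auto Loader read and write, assuming hypothetical landing, schema, and checkpoint paths, a JSON file format, and a raw_sales_bronze target table.

```python
# Incrementally ingest only newly arrived files with Auto Loader.
raw_stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/checkpoints/raw_sales/_schema")
         .load("/mnt/landing/raw_sales")
)

# Write the new data into a Delta table; the checkpoint tracks what has
# already been processed between runs.
(
    raw_stream.writeStream
              .option("checkpointLocation", "/mnt/checkpoints/raw_sales")
              .trigger(availableNow=True)
              .toTable("raw_sales_bronze")
)
```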

7. Structured Streaming

Another way to read in data from high-velocity data sources is with Spark Structured Streaming, which comes directly from the Spark framework. Here is a diagram of how Structured Streaming works in a data pipeline. In this case, a Kafka stream writes data into a data lake. With Structured Streaming in Databricks, we can read directly from that stream without waiting for the data files to be created in the first data lake location. In real time, we can also perform various data transformations and write that data into a Delta table in a different streaming sink. In our code snippet, we define Kafka as our source, which allows us to read directly from the streaming data. Notice how we can also join other datasets and write them out in real time, all within the context of our streaming data.
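A sketch of such a streaming pipeline; the Kafka broker address, topic, lookup table, checkpoint path, and target table names are all hypothetical.

```python
from pyspark.sql import functions as F

# Read directly from a Kafka topic; key and value arrive as binary.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sales_events")
         .load()
         .select(
             F.col("key").cast("string").alias("customer_id"),
             F.col("value").cast("string").alias("event_payload"),
         )
)

# Enrich the stream by joining a static customers table, then continuously
# write the result to a Delta table.
customers = spark.table("customers")
enriched = events.join(customers, on="customer_id", how="left")

(
    enriched.writeStream
            .option("checkpointLocation", "/mnt/checkpoints/sales_events")
            .toTable("sales_events_enriched")
)
```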

8. Let's practice!

With this understanding, let's review some data transformations in Databricks!
