Dataflow and Beam SQL

1. Dataflow and Beam SQL

Hi again, my name is David Sabater Dinter and I work as Outbound Product Manager for data analytics at Google Cloud. This module is part of the Dataflow Developing Pipelines course, and we are going to introduce two new APIs to represent your business logic in Beam, SQL and Dataframes. Let’s start with SQL, available in our Dataflow runner and also within the Beam SDK. You might remember this slide from the Schemas section, where one of the key takeaways was that by understanding the structure of a pipeline’s records, we can provide much more concise APIs for data processing. Do you remember what API is used by Database folks since the ‘70s? SQL, that’s right. Have you ever wondered why? Let’s discuss briefly. SQL reduces boilerplate code, and is easier to understand by implementing simple SQL statements describing your transformations. It can also automatically optimize the pipeline execution: SQL planners can actually optimize on every execution, probably more so than our handwritten code. And workers are also able to perform further optimizations. SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system. Or, more relevant here, for stream processing in a relational data stream management system. The data is accessed through relational algebra, where, for example, projections are used to pick a subset of the columns to query, filter to apply certain conditions to the rows being returned, and finally apply aggregation via the group by clause. It includes also syntax to operate on nested structures. Don’t worry about reading the code! This is just to show you how verbose it can be to write a join in the Java SDK. You might remember the example that was used to introduce schemas in the other section. The important part is to understand what a join is. Fundamentally, this is joining two input datasets to obtain one output dataset. Let’s now show why Dataflow SQL can help you to implement your data processing pipelines. See the amount of lines of code required to implement a join in the Java SDK, without using schemas and SQL. Most of the extra code is required to annotate types, mapping key/values, and so on. If we use the Scala SDK with Scio, developed by Spotify and available as open source Apache license, we can reduce the code verbosity significantly, thanks to lambdas and type inference features from Scala. With SQL, we are providing the ability to translate the business logic written in SQL back to Apache Beam primitives, to be executed in our Dataflow serverless service in a scalable and overall concise way. What are we trying to do here? We can see in this slide how these primitives are derived and chained together in a Directed Acyclic Graph to execute our logic. Before we go into detail about Dataflow SQL, let’s first distinguish some important components associated with the personas and their user journeys or how they interact with the service. A data analyst will typically start interacting with historical data in the BigQuery UI, running SQL statements on historical data to test their hypothesis about what happened, typically referenced as batch data. After testing on historical data, they will ideally want to test the same business logic in the form of SQL statements over real-time data this time. Switching to the Dataflow SQL UI to test the same logic over real-time data involves very few changes—nice! Lastly, once the data analyst is happy with the logic, they would be able to pass those SQL statements to the data engineer, who will be able to implement them with little change in the form of SQLTransforms inside the Beam Java SDK. Please note that Beam SQL and Dataflow SQL are effectively identical, but while Beam SQL offers a programmatic interface, Dataflow SQL also offers a UI interface. Let’s walk through this journey together. First we begin with the common denominator, which is Beam SQL. It allows a Beam user to query bounded and unbounded PCollections with SQL statements, also referred to as querying data in batch and streaming mode. Your SQL query is embedded using SQLTransforms, an encapsulated segment of a Beam pipeline similar to PTransforms. You can freely mix SQLTransforms and other PTransforms in your pipeline if needed. It also supports User-Defined Functions. Beam SQL includes the following dialects to interpret SQL statements, which we will cover later: Apache Calcite SQL, and Google ZetaSQL. Finally it integrates Schema PCollections and supports windowing when aggregating unbounded data. As mentioned earlier, Beam SQL supports two dialects to understand the relational algebra: The Beam Calcite SQL is a variant of Apache Calcite, a dialect widespread in big data processing, compatible with Apache Flink SQL for example. Beam Calcite SQL is the default Beam SQL dialect and supports Java UDFs among other mature features. Beam ZetaSQL is more compatible with BigQuery, so it’s especially useful in pipelines that write to or read from BigQuery tables, for example when using the Dataflow SQL UI writing to BigQuery. Now let's talk about Dataflow SQL. Dataflow SQL integrates with Apache Beam SQL and supports a variant of the ZetaSQL query syntax, using SQLTransforms in a Dataflow Flex template (but all transparent to you, the user!). You can actually write your SQL logic through the UI or gcloud client command line. ZetaSQL provides the same dialect as BigQuery Standard SQL. And lastly, it can optionally be used as a long-running batch engine. As you may remember from the personas we described, one of the core use cases for Dataflow SQL is to help data analysts query streaming data using a common language, SQL. A typical use case will: Select from Pub/Sub, Join with batch data, Aggregate some metrics over a particular Window, And finally publish to BigQuery or Pub/Sub topic. It’s also worth mentioning that the Dataflow SQL is not only restricted to GCP-native services like BigQuery or Pub/Sub. We are also planning to integrate with many others like Kafka and Bigtable. As we described earlier, data analysts can use their existing SQL skills to develop and run streaming pipelines from the BigQuery web UI. You don't need to set up an SDK environment or know how to program in Java or Python. By using the familiar BigQuery UI, one could easily join streams such as Pub/Sub with snapshotted data sets. BigQuery tables are an example, but Kafka and Bigtable are also coming soon as already mentioned. You can query your streams at static datasets with SQL by associating schemas with objects such as tables, files, and Pub/Sub topics. You create a job and specify the output location, for example writing your results into BigQuery tables for analysis and dashboarding. This is just as simple as selecting Dataflow as the execution engine for SQL statements, using the BigQuery web UI and your destination dataset and table. Remember always to ensure the regional endpoint, sources and destination are within the same region if possible. You can directly access the Dataflow UI to monitor the underlying Dataflow job running your query. In case you want to launch Dataflow SQL jobs programmatically instead of relying on the UI, there is also an option to use the gcloud command tool for authoring jobs through the Dataflow SQL CLI. The Dataflow SQL interface is integrated with gcloud to give you that capability through your command-line interface. Finally, as described in the user journeys, the data engineer would be able to apply the SQL logic implemented by the data analyst, within existing pipelines. In this case, note the use of PCOLLECTION as a table name. Named Tuples can also be accessed by name. And before closing this section, you can see here a simplified version of the Dataflow Template being submitted when using the Dataflow SQL UI. Data engineers are free to actually implement a similar template to be able to encapsulate all that logic from the data analysts, in a more programmatic and scalable way.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.