Course Introduction

Hi, and welcome to the second installment of the Serverless Data Processing with Dataflow series, Developing Pipelines on Dataflow. My name is Mehran Nazir, and I am a product manager with Dataflow.

If you’ve been following the data engineering progression thus far, you’ve learned about different Google Cloud services you can use for your data processing needs. You might have chosen to go deeper on Dataflow, Google Cloud’s unified batch and stream processing engine that’s serverless, fast, and cost-effective. If you’ve taken the Dataflow Foundations course, the first part of this course series, you likely understand Dataflow’s IAM, quotas, and security model, and also have a conceptual grasp of the Beam Portability Framework and how Dataflow separates compute and storage with Shuffle and Streaming Engine.

If you remember, there are three ways to launch a Dataflow pipeline:

1. Launching a template using the Create Job Wizard in Cloud Console. You don’t have to write code with this option. All you have to do is select your desired template from a drop-down menu, fill out a few fields, and your job can be deployed. We covered this workflow briefly in the Building Batch Pipelines course in the data engineering curriculum.

2. Authoring a pipeline using the Apache Beam SDK and launching from your development environment. This can mean writing a pipeline using the Java SDK in an integrated development environment (IDE) like IntelliJ, or using a read-eval-print-loop workflow with the Python SDK in a Jupyter notebook. We introduced the building blocks of the Apache Beam SDK in the data engineering course.

3. Writing a SQL statement and launching it in the Dataflow SQL UI. Dataflow SQL lets you launch Dataflow jobs using the familiar semantics of SQL, and includes streaming extensions that allow you to express logic for handling data in real time.

In this second installment of the Dataflow course series, we are going to dive deeper into number 2 (developing pipelines using the Beam SDK) and will dedicate one module to number 3 (Dataflow SQL). Developing your pipelines using the SDK allows you to tap into the full suite of possibilities afforded by the Beam model, and is often the choice of our most advanced users.

Let’s take a look at what we’ll be covering in the Developing Pipelines on Dataflow course. We will first spend some time refreshing the concepts covered in earlier courses. More specifically, we will be reviewing the building blocks of the Beam programming model. We will then review watermarks and triggers, which were introduced in our Building Resilient Streaming Analytics Systems course and are expanded upon in this course. Next, we will review sources and sinks, which represent the “Extract” and the “Load” of your Extract-Transform-Load (or ETL) pattern. From there, we will introduce schemas, which give developers a way to express structured data in their Beam pipelines. In the next module, we will cover state and timers. These powerful primitives unlock new use cases by giving developers fine-grained control over in-flight data.

After we have laid the foundations of the Beam SDK, we will discuss best practices and review common patterns that maximize performance for your Dataflow pipelines. We will dive into two domain-specific languages, SQL and DataFrames. We’ll explore how SQL is implemented with Beam and Dataflow, then examine Beam DataFrames, an API that gives developers an interface similar to that of the popular open-source pandas library. Our last module will cover Beam notebooks, an interface for Python developers to onboard onto the Beam SDK and develop their pipelines iteratively in a Jupyter notebook environment. We’ll wrap up the course with a summary of all of the concepts covered.
