
Beam and Dataflow Refresher

1. Beam and Dataflow Refresher

Federico: Hi, I'm Federico Patota, a cloud consultant here at Google. In this section, we will revisit some concepts on the relationship between Apache Beam and Dataflow, and we will see why customers value Dataflow so much.

Apache Beam is an open-source, unified programming model for defining both batch and streaming data-processing pipelines. To create a pipeline, you use the Beam SDK for the language of your choice to build a program that defines your data-processing pipeline. Beam SDKs use the same classes to represent both batch and streaming data sources, and the same transforms to operate on that data. We will talk more about this in the next course on developing pipelines.

A pipeline can be run locally on your computer, remotely on a virtual machine in a data center, or by using the services of a cloud provider. To decide which engine will power your pipeline, you need to specify a runner. Each runner has its own configuration and is associated with a backend service.

As you might know already, Dataflow is one of the runners available in Apache Beam. It is a fully managed data-processing service with automated provisioning and management of processing resources. Dataflow includes resource autoscaling and dynamic work rebalancing to maximize resource usage and automatically optimize your pipeline execution. It is part of the Google Cloud ecosystem and uses horizontal services such as logging and monitoring. Dataflow also allows you to separate compute and storage resources. We will cover this in more detail in another module of this course.
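To make the idea of choosing a runner concrete, here is a minimal sketch using the Beam Python SDK. It is not taken from the course: the helper names `build_pipeline_args` and `run`, the sample words, and the project, region, and bucket values in the comments are all hypothetical. It assumes the `apache-beam` package is installed (with the `gcp` extra for Dataflow). The same pipeline code runs locally or on Dataflow; only the flags change.

```python
from typing import List, Optional


def build_pipeline_args(
    runner: str = "DirectRunner",
    project: Optional[str] = None,
    region: Optional[str] = None,
    temp_location: Optional[str] = None,
) -> List[str]:
    """Build Beam command-line flags; the --runner flag selects the engine."""
    args = [f"--runner={runner}"]
    if project:
        args.append(f"--project={project}")
    if region:
        args.append(f"--region={region}")
    if temp_location:
        args.append(f"--temp_location={temp_location}")
    return args


def run(pipeline_args: List[str]) -> None:
    """Run a tiny word count on whichever runner the flags select."""
    # Imported here so the flag helper above works without the SDK installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "CreateWords" >> beam.Create(["beam", "dataflow", "beam"])
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerWord" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    # Local execution with the DirectRunner.
    run(build_pipeline_args())
    # To target Dataflow instead (placeholder values, not real resources):
    # run(build_pipeline_args("DataflowRunner", project="my-project",
    #                         region="us-central1",
    #                         temp_location="gs://my-bucket/tmp"))
```

Keeping the runner in the options rather than in the pipeline body is what lets one program move between a laptop and a managed service without code changes.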

2. Let's practice!
