Sources & Sinks

1. Sources & Sinks

Welcome to the Sources and Sinks for Dataflow module. My name is Wei Hsia and I’m a Customer Engineer for Google Cloud. In this module, you will learn about what makes sources and sinks in Dataflow. The module will go over some examples of TextIO, FileIO, BigQueryIO, PubsubIO, KafKaIO, BigtableIO, AvroIO, and Splittable DoFn. The module will also point out some useful features associated with each I/O. In this video, we talk about sources and sinks. In a data pipeline, there’s generally an input and an output. In Beam, these are called sources and sinks. A source is when you read input data into a Beam pipeline. Sources generally appear at the beginning of a pipeline—but that doesn’t necessarily need to be the case, as you will see later in this module. A sink is where you would write output data from your Beam pipeline. A sink is a PTransform that performs a write to the specified destination. A PTransform is an operation that takes an input and provides an output. A common output for a sink is PDone, which signals that the branch of the pipe is done. A bounded source is a source that reads a finite amount of input. This is commonly associated with batch processing. A bounded source will be responsible for splitting up the work of reading an input into bundles. Bundles are groupings of elements in the pipeline for a unit of work. The bounded source will also provide estimates to the service and number of bytes to be processed. Because the input is finite, there is a known start and a known end. If the bundles can be broken down into smaller chunks, Dataflow will dynamically rebalance work to achieve better performance. An unbounded source is a source that reads from an unbounded amount of input. An unbounded source is commonly associated with streaming. Checkpoints allow for the ability to bookmark where the data has been read in the source, which means that data that has been processed in the stream doesn’t need to be re-read. Watermarks from sources can provide the point in time estimates for a piece of data. Finally, some unbounded sources, for example PubsubIO, have the ability to pass a record ID to allow deduplication of messages. Dataflow will keep track of the message IDs for 10 minutes and automatically discard the record if duplicated. Sinks are often PTranformations that write data to an end system. You can check the code out for the various sinks to see this. In general, sinks will emit a PDone value to signify the completion of the transform. There are some, such as BigtableIO, that also allow you to continue processing data after the success, as you will see later. If you have a need for continued processing, you can also write your own PTransform sink to write out the data and have an output. Apache Beam is open-source, so there are many developments contributed by the community and by Google. There are a lot of various open source I/Os, and the list continues to grow. For an updated list of various I/O connectors, refer to the official Apache Beam documentation page.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.