Utility Transforms

1. Utility Transforms

Apache Beam comes with a set of transforms that you can use as the building blocks of your pipeline. Let's learn about those transforms. By combining these blocks, you can build complex processing logic that Dataflow applies at scale.

ParDo lets you apply a function to each element of a PCollection.

GroupByKey and Combine are similar. With GroupByKey, you put all the elements with the same key together in the same worker. If a group is very large or the data is very skewed, you have a so-called hot key. If the operation you're going to apply is commutative and associative, you can use Combine instead. Combine performs the transformation as a hierarchy of several steps. For large groups, this performs much better than GroupByKey.

CoGroupByKey lets you join two PCollections by a common key. You can create left or right outer joins, inner joins, and so on using CoGroupByKey.

Flatten also receives two or more input PCollections and fuses them together. But please do not confuse Flatten with joins or with CoGroupByKey. If two PCollections contain exactly the same type, they can be fused into a single PCollection using the Flatten transform. With joins using CoGroupByKey, however, you have two PCollections that share a common key but typically have different value types.

Partition is in a way the opposite of Flatten. It divides your PCollection into several output PCollections by applying a function that assigns a group ID to each element in the input PCollection.
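To make the differences between these transforms concrete, here is a minimal plain-Python sketch of their semantics. It is not Beam code: the function names (`group_by_key`, `combine_per_key`, `co_group_by_key`, `flatten`, `partition`), the `num_shards` parameter, and the sample data are all hypothetical and only model, on in-memory lists, what the Beam transforms of similar names do on distributed PCollections.

```python
import operator
from collections import defaultdict


def group_by_key(pairs):
    """Model of GroupByKey: every value for a key ends up in one place."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(pairs, op, num_shards=3):
    """Model of Combine: pre-aggregate on each shard, then merge partials.

    Only correct when `op` is commutative and associative, because the
    order in which values and partial results meet is not fixed.
    """
    # Assumption: shards stand in for workers; real runners decide this.
    shards = [pairs[i::num_shards] for i in range(num_shards)]
    partials = []
    for shard in shards:
        acc = {}
        for key, value in shard:
            acc[key] = op(acc[key], value) if key in acc else value
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = op(merged[key], value) if key in merged else value
    return merged


def co_group_by_key(left, right):
    """Model of CoGroupByKey: per key, the values found in each input."""
    result = defaultdict(lambda: ([], []))
    for key, value in left:
        result[key][0].append(value)
    for key, value in right:
        result[key][1].append(value)
    return dict(result)


def flatten(*collections):
    """Model of Flatten: fuse same-typed collections into one."""
    merged = []
    for collection in collections:
        merged.extend(collection)
    return merged


def partition(elements, assign_group, num_partitions):
    """Model of Partition: a function assigns each element a group ID."""
    outputs = [[] for _ in range(num_partitions)]
    for element in elements:
        outputs[assign_group(element)].append(element)
    return outputs


# Hypothetical sample data: key "a" plays the role of a (tiny) hot key.
sales = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
names = [("a", "Ann"), ("c", "Cy")]

grouped = group_by_key(sales)
totals = combine_per_key(sales, operator.add)
joined = co_group_by_key(sales, names)
# An inner join keeps only keys present in both inputs.
inner = {k: v for k, v in joined.items() if v[0] and v[1]}
# A left outer join keeps every key present in the left input.
left_outer = {k: v for k, v in joined.items() if v[0]}
parts = partition(flatten([1, 2], [3, 4]), lambda x: x % 2, 2)
```

Note how the joins are just filters over the CoGroupByKey-style result, and how `partition` undoes what `flatten` did, which is the sense in which the two are opposites. In the Beam Python SDK the corresponding transforms are `beam.GroupByKey`, `beam.CombinePerKey`, `beam.CoGroupByKey`, `beam.Flatten`, and `beam.Partition`.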

2. Let's practice!
