Beam Basics

1. Beam Basics

Israel: Hello, my name is Israel Herraiz, and I work as a strategic cloud engineer at Google. In this video, you will learn the main concepts of Apache Beam and how to apply them to write your own data processing pipelines.

Let's start with the main concepts of Apache Beam. The genius of Beam is that it provides abstractions that unify traditional batch programming concepts and stream processing concepts. Unifying batch programming and stream processing is a big innovation in data engineering. The four main concepts are PTransforms, PCollections, pipelines, and pipeline runners.

A pipeline identifies the data to be processed and the actions to be taken on the data. The data is held in a distributed data abstraction called a PCollection. A PCollection is immutable. Any change that happens in a pipeline takes one PCollection as input and creates a new PCollection as output; it doesn't change the incoming PCollection. The actions are contained in an abstraction called a PTransform. A PTransform handles input, transformation, and output of the data. The data in a PCollection is passed along the graph from one PTransform to another.

Pipeline runners are analogous to container hosts, such as Kubernetes Engine. The identical pipeline can be run on a local computer, in a virtual machine, in a data center, or in a service in the cloud, such as Dataflow. The only differences are scale and access to platform-specific services, for instance, Google Cloud Storage.

Immutable data is one of the key differences between batch programming and stream processing. The assumption in the von Neumann architecture was that data would be operated on and changed in place. This was very memory efficient, and it made sense when memory was expensive and scarce, so making a copy of data was expensive. Nowadays, in distributed systems, immutable data, where each transform results in a new copy, means that there is no need to coordinate access control or sharing of the original ingested data. So it enables, or at least simplifies, distributed processing.
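
The immutability idea described in the video can be illustrated with a small plain-Python toy model. This is only a sketch and does not use the Apache Beam SDK: `ToyCollection`, `apply`, `to_upper`, and `keep_long` are hypothetical names invented for this example. It shows each transform reading one collection and producing a new one while the input stays unchanged.

```python
from typing import Callable, Iterable, List, Tuple


class ToyCollection:
    """Hypothetical stand-in for a PCollection (not the Beam API)."""

    def __init__(self, elements: Iterable[str]) -> None:
        # Store the elements in a tuple so they cannot be modified in place.
        self._elements: Tuple[str, ...] = tuple(elements)

    def apply(
        self, transform: Callable[[Tuple[str, ...]], Iterable[str]]
    ) -> "ToyCollection":
        # A transform reads this collection and yields a brand-new one;
        # self._elements is never changed.
        return ToyCollection(transform(self._elements))

    def to_list(self) -> List[str]:
        return list(self._elements)


def to_upper(elements: Tuple[str, ...]) -> Iterable[str]:
    """Toy transform: upper-case every element."""
    return (e.upper() for e in elements)


def keep_long(elements: Tuple[str, ...]) -> Iterable[str]:
    """Toy transform: keep elements longer than four characters."""
    return (e for e in elements if len(e) > 4)


# Chain two transforms; every step creates a new collection.
words = ToyCollection(["beam", "pipeline", "runner"])
upper = words.apply(to_upper)
long_words = upper.apply(keep_long)

print(words.to_list())       # ['beam', 'pipeline', 'runner'] (input untouched)
print(upper.to_list())       # ['BEAM', 'PIPELINE', 'RUNNER']
print(long_words.to_list())  # ['PIPELINE', 'RUNNER']
```

Because no step mutates its input, any intermediate collection can be read by several downstream transforms without coordination, which is the property the video credits with simplifying distributed processing.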
