Data Locality

1. Data Locality

Omar: Hey there. My name is Omar Ismail, and I am a solutions developer at Google Cloud. In this module, we talk about the different ways you can enhance security while running Dataflow. We discuss four security features in this module. Let us get started with data locality.

Data locality ensures that all data and metadata stay in one region. When you launch a Dataflow job, a backend that exists in a Google-managed project deploys and controls your pipeline. As we discussed in the IAM module, the Dataflow service account communicates between your project and the Dataflow backend. The Dataflow backend, which is the regional endpoint, exists in a few regions across the globe, and its region can be different from the region in which your workers run.

What metadata is transferred between your project and the regional endpoint? There are regular health checks from the workers, workers requesting a work item and the regional endpoint responding with a work item, the work item status, and autoscaling events. Unexpected events are also transferred to the regional endpoint. For example, unhandled exceptions in user code, jobs that fail to launch because of permissions, work item failures, and errors from another related system, such as Compute Engine. These items are stored at the regional endpoint and are visible to you on the Dataflow UI, along with any other info you see in the UI, such as pipeline parameter values, job name, job ID, and start time.

There are a couple of reasons for specifying a regional endpoint. The first is to support your project's security and compliance needs. For example, if you work for a bank in certain countries, regulatory rules mandate that data does not leave the country of operation. These rules can be met by specifying the regional endpoint. You can also specify a regional endpoint to minimize network latency and network transport costs.
If your pipeline sources, sinks, and staging locations are all in the same region, you will not be charged for network egress because all the info remains in the same region. If you have a pipeline with workers in northamerica-northeast1 and its regional endpoint is set to us-central1, your network egress charge will increase because of the metadata that is transferred between your project and the regional endpoint.

In the next couple of slides, we will show you how to specify the regional endpoint you want the Dataflow service to run in. If you want to use a supported regional endpoint and have no zone preference within the region, specify the region flag only. In this case, the regional endpoint automatically selects the best zone based on available capacity. In Apache Beam 2.15 and greater, specifying this flag is mandatory.

If you need worker processing to occur in a specific zone of a region that has a regional endpoint, specify both the region and worker zone flags. Use the region flag to specify the regional endpoint, and use the worker zone flag to specify the specific zone within that region.

If you need worker processing to occur in a specific region that does not have a regional endpoint, specify both the region and worker region flags. Use the region flag to specify the supported regional endpoint that is closest to the region where the worker processing must occur. Use the worker region flag to specify the region where worker processing must occur. Compared to the scenarios on the two previous slides, specifying a different region for the regional endpoint and the workers has the potential to create greater latency. It is important to note that even if no regional endpoint exists in the region you want your data to be kept in, only metadata is transferred. Your application data stays in that region.
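The three flag combinations described above can be sketched as a small helper that builds the command-line arguments for a job. This is a minimal illustration, not the Dataflow service's own logic: the function name dataflow_location_flags is hypothetical, the flag spellings (--region, --worker_zone, --worker_region) are the ones used by the Apache Beam Python SDK, and the region and zone values in the usage lines are placeholders only. The check that worker zone and worker region are not combined reflects the talk's treatment of them as separate scenarios and should be read as an assumption.

```python
from typing import List, Optional


def dataflow_location_flags(
    region: str,
    worker_zone: Optional[str] = None,
    worker_region: Optional[str] = None,
) -> List[str]:
    """Build location flags for one of the three scenarios in the talk.

    Hypothetical helper: it only assembles strings and does not contact
    Dataflow or verify that the region or zone actually exists.
    """
    # Assumption: a job pins workers either to a zone or to a region,
    # never both, since the talk presents them as separate scenarios.
    if worker_zone and worker_region:
        raise ValueError("Specify worker_zone or worker_region, not both.")

    # Scenario 1: the region flag alone names the regional endpoint, and
    # the service picks a zone based on available capacity.
    flags = [f"--region={region}"]

    if worker_zone:
        # Scenario 2: workers run in a specific zone of the endpoint's region.
        flags.append(f"--worker_zone={worker_zone}")

    if worker_region:
        # Scenario 3: workers run in a region without a regional endpoint;
        # `region` is then the closest supported regional endpoint.
        flags.append(f"--worker_region={worker_region}")

    return flags


# Placeholder values, chosen only to mirror the examples in the talk.
region_only = dataflow_location_flags("us-central1")
with_zone = dataflow_location_flags("us-central1", worker_zone="us-central1-a")
with_worker_region = dataflow_location_flags(
    "us-central1", worker_region="northamerica-northeast1"
)
```

In the Beam Python SDK, a list like this can be passed along with the other pipeline arguments when constructing PipelineOptions. In the third case, only metadata travels to the regional endpoint, so the latency and egress trade-off mentioned above applies.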
