Quotas

1. Quotas

In the second section of this module, we look at the quotas to consider when running Dataflow. Let’s get started!

One of the quotas that Dataflow consumes is CPU. CPU quota is the total number of virtual CPUs across all of your VM instances in a region or zone. Any Google Cloud product that creates a Compute Engine VM, such as Dataproc, GKE, or AI Notebooks, consumes this quota. CPU quota can be viewed in the UI on the Quotas page under IAM & Admin. For example, right now, I am consuming 219 CPUs in the northamerica-northeast1 region.

Say you want to start a Dataflow job with 100 workers. If the VM size selected is n1-standard-1, meaning 1 CPU core per VM, the job needs 100 CPUs. If the VM size selected is n1-standard-8, meaning 8 CPU cores per VM, the job needs 800 CPUs. If the limit is 600, the job will display an error because the CPU limit has been exceeded.

Another quota to consider is the number of in-use IP addresses in each region. The in-use IP address quota limits the number of VMs that can be launched with an external IP address for each region in your project. Like the CPU quota, this quota is shared across all Google Cloud products that create VMs with an external IP address.

When you launch a Dataflow job, the default setting is for each VM to launch with an external IP address. Jobs that access APIs and services outside Google Cloud require internet access. However, if your job does not need to access any external APIs or services, you can launch the Dataflow job using internal IPs only, which saves money and conserves the in-use IP address quota. In our next module, we will show you how to launch VMs with internal IPs only.

Unlike the CPU quota, the in-use IP address quota is independent of the machine type; there is no difference between launching 150 n1-standard-1s and 150 n1-standard-8s. In the slide image here, the in-use IP address limit for a few regions is 575. In the previous slide for CPU quota, the maximum number of CPUs per region was 600.
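
The CPU and in-use IP address arithmetic described so far can be sketched in a few lines of Python. This is only an illustration, not part of Dataflow: the function names are hypothetical, the vCPU count is read from the machine type name (which works for names like n1-standard-8 but not for shared-core or custom types), and the available quota values are passed in by hand rather than read from your project.

```python
def vcpus_per_worker(machine_type: str) -> int:
    """Read the vCPU count from a name like 'n1-standard-8' (illustrative only)."""
    return int(machine_type.rsplit("-", 1)[1])


def exceeded_quotas(num_workers, machine_type, cpu_available, ip_available,
                    use_public_ips=True):
    """Return the names of the quotas a job of this size would exceed."""
    # CPU usage scales with both worker count and machine size.
    cpus_needed = num_workers * vcpus_per_worker(machine_type)
    # One external IP per worker VM, regardless of machine type.
    ips_needed = num_workers if use_public_ips else 0

    exceeded = []
    if cpus_needed > cpu_available:
        exceeded.append("CPUs")
    if ips_needed > ip_available:
        exceeded.append("in-use IP addresses")
    return exceeded


# The limits of 600 CPUs and 575 in-use IPs are the ones quoted from the slides.
print(exceeded_quotas(100, "n1-standard-1", cpu_available=600, ip_available=575))
print(exceeded_quotas(100, "n1-standard-8", cpu_available=600, ip_available=575))
print(exceeded_quotas(590, "n1-standard-1", cpu_available=600, ip_available=575))
```

The first call fits within both limits, the second exceeds the CPU limit (800 CPUs needed), and the third exceeds only the in-use IP address limit, since 590 single-core workers need 590 addresses but only 590 CPUs.
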
When you launch a Dataflow job, whichever of these two quotas is more restrictive takes precedence. For example, assuming nothing else in the project consumes either quota, n1-standard-1 workers with external IP addresses reach the in-use IP limit of 575 before the CPU limit of 600, whereas n1-standard-8 workers reach the CPU limit first, at 75 workers.

Let us look at quotas for persistent disks. You can choose between two different types of persistent disks when running Dataflow jobs: legacy hard disk drives (HDDs) or modern solid-state drives (SSDs). Each disk type has its own limit per region. For example, in the image shown here, Google Cloud products in my project that use HDDs in northamerica-northeast1 are consuming 23.5 TB of disk space out of the available 102.4 TB.

To specify the disk type, set the --worker_disk_type flag to the prefix shown in the image, and end it with either pd-ssd or pd-standard. Use pd-standard for hard disk drives and pd-ssd for solid-state drives. In the slide example, we set the disk type to SSD using both Python and Java.

When you launch a batch pipeline, the ratio of VMs to persistent disks is 1:1: each VM has exactly one persistent disk attached. For jobs running shuffle on worker VMs, the default size of each persistent disk is 250 GB. If the batch job uses the Shuffle service, the default persistent disk size is 25 GB. Recall that Dataflow Shuffle moves the shuffle operation out of the worker VMs and into the Dataflow service backend, which is why the default persistent disk attached to each VM is smaller. Note that you can use the --disk_size_gb flag to override the default persistent disk size for batch pipelines using either shuffle on VM or Dataflow Shuffle.

Streaming pipelines, however, are deployed with a fixed pool of persistent disks. Each worker must have at least 1 persistent disk attached to it, and the maximum is 15 persistent disks per worker instance. As with batch jobs, streaming jobs can be run either on the worker VMs or on the Dataflow backend. When you run a job using the Dataflow backend, the feature that is used is Dataflow's Streaming Engine. Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.
For jobs launched to execute in the worker VMs, the default persistent disk size is 400 GB. Jobs launched using Streaming Engine have a default persistent disk size of 30 GB. Just like with batch pipelines, these default sizes can be overridden using the --disk_size_gb flag.

It is important to note that the number of persistent disks allocated to a streaming pipeline is equal to the value of the --max_num_workers flag. For example, if you launch a job with 3 workers initially and set the maximum number of workers to 25, 25 disks will count against your quota, not 3.

To set the maximum number of workers that a pipeline can use, use the --max_num_workers flag. The value cannot be above 1000. When you launch a streaming job that does not use Streaming Engine, the --max_num_workers flag is required. For streaming jobs that do use Streaming Engine, the --max_num_workers flag is optional and defaults to 100.
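
To make the persistent disk accounting concrete, here is a minimal Python sketch that estimates how much disk capacity a job counts against the quota. It is an illustration only: the function and table names are hypothetical, the default sizes are the ones quoted in this video, and it assumes the quota consumed is simply the number of disks multiplied by the size of each disk, with one disk per running worker for batch jobs.

```python
# Default persistent disk sizes in GB, as quoted in this video.
# Keys are (mode, runs_on_service_backend).
DEFAULT_DISK_GB = {
    ("batch", False): 250,      # shuffle on worker VMs
    ("batch", True): 25,        # Dataflow Shuffle
    ("streaming", False): 400,  # execution on worker VMs
    ("streaming", True): 30,    # Streaming Engine
}

MAX_WORKERS_LIMIT = 1000  # upper bound for --max_num_workers


def disk_quota_gb(mode, on_service_backend, num_workers, max_num_workers,
                  disk_size_gb=None):
    """Estimate the persistent disk capacity (GB) a job counts against quota."""
    if max_num_workers > MAX_WORKERS_LIMIT:
        raise ValueError("max_num_workers cannot be above 1000")
    # An explicit disk size (as with --disk_size_gb) overrides the default.
    size = disk_size_gb if disk_size_gb is not None else DEFAULT_DISK_GB[(mode, on_service_backend)]
    # Streaming jobs allocate disks for the maximum worker count up front;
    # batch jobs are assumed here to use one disk per running worker.
    disks = max_num_workers if mode == "streaming" else num_workers
    return disks * size


# Streaming job started with 3 workers and a maximum of 25 workers.
print(disk_quota_gb("streaming", False, num_workers=3, max_num_workers=25))
print(disk_quota_gb("streaming", True, num_workers=3, max_num_workers=25))
```

With the defaults above, the first call counts 25 disks of 400 GB each (10,000 GB) and the second counts 25 disks of 30 GB each (750 GB), even though only 3 workers are running at launch.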

2. Let's practice!
