IAM

1. IAM

Omar: Hello, there. My name is Omar Ismail, a solutions developer at Google Cloud. In this module, we talk about the different IAM roles, quotas, and permissions required to run Dataflow, Google's batch and streaming analytics service based on Apache Beam. In this video, we learn how IAM provides access to the different Dataflow resources.

You have your Beam code, and now you want to run it on Dataflow. Let us look at what happens when your Beam code is submitted. When the pipeline is submitted, it is sent to two places: the SDK uploads your code to Cloud Storage and sends it to the Dataflow service. The Dataflow service does a few things. It validates and optimizes the pipeline, it creates the Compute Engine virtual machines in your project to run your code, it deploys the code to the VMs, and it starts to gather monitoring information for display. When all that is done, the VMs start running your code.

At each of the stages we mentioned (user submission of code, Dataflow validating the pipeline, and the VMs running the code), IAM plays a role in determining whether to continue the process. We will briefly explain how IAM comes into play at each stage.

Three credentials determine whether a Dataflow job can be launched. The first credential that is checked is the user role. When you submit your code, whether you are allowed to submit it is determined by the IAM role granted to your account. On Google Cloud, your account is represented by your email address. For example, when I submit a Dataflow job, it is done via [email protected].

Three user roles can be assigned to each user or group. Each role is made up of a set of permissions that determine how much access each user or group has to the different Dataflow resources.

The first role you can assign to a user or group is the Dataflow Viewer role. If you want a user or group to be able to only view Dataflow jobs, assign them the Dataflow Viewer role. This role prevents submitting, updating, and cancelling jobs.
Users who have the Dataflow Viewer role can only view Dataflow jobs, either in the UI or by using the command-line interface.

The next role you can assign to a user or group is the Dataflow Developer role. This role is ideal for a person who is responsible for managing pipelines that are running. For a job to run on Dataflow, the user must be able to submit the job to Dataflow, stage files to Cloud Storage, and view the available Compute Engine quota. If a user only has the Dataflow Developer role, they can view and cancel jobs that are currently running, but they cannot create jobs because the role does not have the permissions to stage files and view the Compute Engine quota. You can use the Dataflow Developer role as a building block to compose custom roles. For example, if you also want to be able to create pipelines, you can create a role that has the permissions from the Dataflow Developer role plus the permissions required to stage files to a bucket and to view the Compute Engine quota.

The last role you can assign to a user or group is the Dataflow Admin role. Use this role to provide a user or group with the minimum set of permissions that allow both creating and managing Dataflow jobs. The Dataflow Admin role allows a user or group to interact with Dataflow, stage files in an existing Cloud Storage bucket, and view the Compute Engine quota.

The second credential Dataflow uses is the Dataflow service account. Dataflow uses this account for interactions between your project and Dataflow: for example, to check project quota, to create worker instances on your behalf, and to manage the job during job execution. When you run your pipeline on Dataflow, it uses the service account service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com. This account is automatically created when the Dataflow API is enabled. It is assigned the Dataflow Service Agent role and has the necessary permissions to run a Dataflow job in your project.
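The three user roles described above, and the idea of composing a custom role from the Dataflow Developer role, can be sketched as plain sets. This is only a simplified model of the transcript's description: the capability labels (`view_jobs`, `stage_files`, and so on) are made up for illustration and are not real IAM permission strings, and the function names are hypothetical.

```python
# Illustrative capability labels only; these are NOT real IAM permission strings.
VIEW_JOBS = "view_jobs"
CANCEL_JOBS = "cancel_jobs"
SUBMIT_JOBS = "submit_jobs"
STAGE_FILES = "stage_files"  # write staging files to a Cloud Storage bucket
VIEW_QUOTA = "view_quota"    # read the available Compute Engine quota

# Simplified model of the three user roles as the transcript describes them.
USER_ROLES = {
    "Dataflow Viewer": frozenset({VIEW_JOBS}),
    "Dataflow Developer": frozenset({VIEW_JOBS, CANCEL_JOBS, SUBMIT_JOBS}),
    "Dataflow Admin": frozenset(
        {VIEW_JOBS, CANCEL_JOBS, SUBMIT_JOBS, STAGE_FILES, VIEW_QUOTA}
    ),
}

# Per the transcript, launching a job needs all three of these.
REQUIRED_TO_LAUNCH = frozenset({SUBMIT_JOBS, STAGE_FILES, VIEW_QUOTA})


def can_launch_job(capabilities):
    """Return True if the capability set covers everything needed to launch."""
    return REQUIRED_TO_LAUNCH <= frozenset(capabilities)


def missing_to_launch(capabilities):
    """Return the sorted capabilities still missing before a job can launch."""
    return sorted(REQUIRED_TO_LAUNCH - frozenset(capabilities))


# Custom role: Developer as the building block, plus staging and quota access.
custom_role = USER_ROLES["Dataflow Developer"] | {STAGE_FILES, VIEW_QUOTA}

if __name__ == "__main__":
    for name, caps in USER_ROLES.items():
        print(name, can_launch_job(caps), missing_to_launch(caps))
    print("Custom role", can_launch_job(custom_role))
```

In this model the Developer role alone falls short by exactly the two capabilities the transcript names, and adding them yields a role that can launch jobs.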
In our job overview diagram, the Dataflow service account is responsible for the interaction happening here between your project and Dataflow.

The last credential used to run Dataflow jobs is the controller service account. The controller service account is assigned to the Compute Engine VMs that run your Dataflow pipeline. By default, workers use your project's Compute Engine default service account as the controller service account. This service account, <project-number>-compute@developer.gserviceaccount.com, is automatically created when you enable the Compute Engine API for your project from the APIs page in the Google Cloud console. The Compute Engine default service account has broad access to your project's resources, which makes it easy to get started with Dataflow.

However, for production workloads, we recommend that you create a new service account with only the roles and permissions that you need. At a minimum, your service account must have the Dataflow Worker role. You use it by adding the service account email flag when you launch a Dataflow pipeline. When using your own service account, you might also need to add roles to access other Google Cloud resources. For example, if your job reads from BigQuery, your service account must also have a role like the BigQuery Data Viewer role.

Let us review. In our job overview diagram, where would the controller service account be? Here.
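As a rough sketch of the service account email flag mentioned above, the following builds the command-line arguments for a Beam Python pipeline. It assumes the Python SDK, where the flag is spelled `--service_account_email`. The helper names, project ID, bucket, and service account shown are hypothetical placeholders, not values from the video.

```python
def default_controller_account(project_number):
    """Email of the Compute Engine default service account for a project."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def dataflow_launch_args(project, region, temp_location, controller_account=None):
    """Build Beam Python pipeline flags for a Dataflow launch.

    If controller_account is None, the flag is omitted and the workers fall
    back to the Compute Engine default service account.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    if controller_account is not None:
        args.append(f"--service_account_email={controller_account}")
    return args


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    args = dataflow_launch_args(
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        controller_account="dataflow-worker@my-project.iam.gserviceaccount.com",
    )
    print(args)
    # With Apache Beam installed, this list could be passed to
    # apache_beam.options.pipeline_options.PipelineOptions(args).
```

The custom service account still needs the Dataflow Worker role, plus any roles for the resources the job reads or writes, before a launch with these flags would succeed.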

2. Let's practice!
