Course Summary

1. Course Summary

Mehran: Hi, it's Mehran again. Congratulations, you've made it to the end of the Foundations course of Serverless Data Processing with Dataflow. Let's recap what you've learned.

In this course, we started with a refresher of what Apache Beam is and its relationship with Dataflow. Apache Beam is an open source programming model with a unified approach for batch and streaming pipelines, so that you don't have to manage multiple data processing architectures: no more Lambda architectures. Dataflow is a fully managed, distributed data processing engine for Apache Beam pipelines, integrated within the Google Cloud ecosystem. Dataflow automates the provisioning and orchestration of worker machines and uses techniques like horizontal autoscaling and dynamic work rebalancing to achieve the lowest total cost of ownership.

Next, we talked about the Apache Beam vision and the benefits of the Beam Portability Framework. Thanks to the interoperability layer introduced by the Portability API, Dataflow pipelines can leverage new features such as custom containers and multi-language pipelines. The Beam Portability Framework achieves the vision that a developer can use their favorite programming language with their preferred execution back end. Dataflow's Runner v2 implementation offers this portable architecture on Dataflow. It can be enabled via a pipeline option without rewriting a single line of code.

Another aspect that we looked at is how Dataflow allows you to separate compute and storage while saving money. The Shuffle service helps batch pipelines scale seamlessly to hundreds of terabytes without any tuning required by offloading the shuffle operation from worker VMs onto the Dataflow service back end. Because these operations are carried out on the service back end, pipelines consume less CPU, memory, and persistent storage.
The same principle applies for streaming pipelines with Streaming Engine, which offloads window state storage from the persistent disks attached to workers onto a back end service. This significantly improves autoscaling and data latency, and also reduces the resource footprint of your streaming pipeline. We finished this module with Flexible Resource Scheduling, or FlexRS, which can help you save on costs for your batch pipelines by using a combination of preemptible VMs and the Shuffle back end. And the best thing about all these features is that you can enable them for your pipeline without rewriting a line of code. All you need to do is pass a new parameter when you deploy the pipeline.

We reviewed how Identity and Access Management (IAM) tools interact with your Dataflow pipelines. We learned about the different predefined roles for Dataflow users and the different service accounts used to run Dataflow pipelines. Then we looked at which quotas apply to Dataflow jobs, specifically vCPUs, IPs, and persistent disks. Every capacity planning exercise that involves Dataflow workloads should include estimating consumption needs for these resources.

Lastly, we looked at how to implement the right security model for your use case on Dataflow. We learned how to comply with data locality requirements by specifying the region and zone parameters in your job, and we learned how to run Dataflow jobs with various VPC configurations. We learned how to prevent data exfiltration by disabling public IPs on your Dataflow workers. We ended our discussion on security by covering data encryption on Dataflow. By default, all data is encrypted with Google-managed keys, but Dataflow's integration with Cloud Key Management Service allows you to bring your own encryption keys to ensure the maximum level of security.

I hope you enjoyed this course and that you'll be able to put into practice what you've learned.
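
As a recap of the point that these features are enabled by passing parameters at deploy time, here is a minimal sketch in Python. The helper build_dataflow_args and its argument names are hypothetical and were not shown in the course. The flag strings are the ones the Beam Python SDK is understood to accept for Dataflow at the time of writing, so check them against the current Dataflow pipeline options reference before relying on them.

```python
from typing import List, Optional


def build_dataflow_args(
    project: str,
    region: str,
    *,
    streaming: bool = False,
    worker_zone: Optional[str] = None,
    use_runner_v2: bool = False,
    use_flexrs: bool = False,
    disable_public_ips: bool = False,
    kms_key: Optional[str] = None,
    service_account_email: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble deploy-time flags for a Dataflow job.

    Flag names are assumptions based on the Beam Python SDK; verify them
    against the official documentation for your SDK version.
    """
    if streaming and use_flexrs:
        # FlexRS relies on delayed scheduling, so it targets batch jobs.
        raise ValueError("FlexRS applies to batch pipelines only")

    # Region (and optionally zone) control where the job and workers run.
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
    ]
    if worker_zone:
        args.append(f"--worker_zone={worker_zone}")

    # Portable architecture (Runner v2), enabled through an experiment flag.
    if use_runner_v2:
        args.append("--experiments=use_runner_v2")

    if streaming:
        # Streaming Engine moves window state off worker persistent disks.
        args += ["--streaming", "--enable_streaming_engine"]
    else:
        # Service-based shuffle moves the shuffle step off worker VMs.
        args.append("--experiments=shuffle_mode=service")
        if use_flexrs:
            args.append("--flexrs_goal=COST_OPTIMIZED")

    # Security-related settings covered in the recap.
    if disable_public_ips:
        args.append("--no_use_public_ips")
    if kms_key:
        args.append(f"--dataflow_kms_key={kms_key}")
    if service_account_email:
        args.append(f"--service_account_email={service_account_email}")
    return args


if __name__ == "__main__":
    # Placeholder project and region values, for illustration only.
    for flag in build_dataflow_args(
        "my-project", "us-central1", use_flexrs=True, disable_public_ips=True
    ):
        print(flag)
```

In a real pipeline, a list like this would typically be handed to apache_beam.options.pipeline_options.PipelineOptions, or the same flags would be passed on the command line, leaving the pipeline code itself unchanged.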
