Schemas
1. Schemas
Ajay: Hi, my name is Ajay. I'm a strategic cloud engineer at Google. By now you have covered the building blocks of a Dataflow pipeline: basic Beam concepts; windows, watermarks, and triggers and their use in data processing pipelines; the sources and sinks supported in Dataflow; Beam schemas for processing structured data; and pipeline state and timers. In this chapter, we'll take a deep dive into some of the best practices for Dataflow.

We will begin with an introduction to Beam schemas and explore how schemas let us process structured data more efficiently. Then we'll explore best practices for handling unprocessable or erroneous records in a pipeline. We will also cover some best practices around error handling and the generation of POJOs, also known as plain old Java objects. We will wrap up this section with an overview of DAG optimization and ways to exploit the life cycle of a DoFn to do batch processing.

Let's start by looking into Beam schemas. As we discussed in previous chapters, a schema describes a type in terms of fields and values. Each field is named and has a type. Schemas can be nested arbitrarily and can contain repeated or complex fields as well. When you use schemas in Dataflow jobs, you make your code more readable and easier to manage. Schemas also allow the Dataflow service to make optimizations behind the scenes, because it is aware of the type and structure of the data being processed. For example, the Dataflow service optimizes the encoder and decoder required for serialization and deserialization of data as it moves from one phase to another.

Here is an example of using schemas in the Java and Python SDKs. Each code snippet shows a class named Purchase, in Java and Python respectively. It has five fields: user ID, item ID, shipping address, cost cents, and transactions.

2. Let's practice!
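The slides with the code snippets are not part of this transcript, so here is a minimal Python sketch of what a Purchase schema with those five fields might look like. The Purchase field names follow the talk; the field types, and the contents of the nested ShippingAddress and Transaction types, are illustrative assumptions. The Beam registration step is attempted only if apache_beam is installed.

```python
import typing


class ShippingAddress(typing.NamedTuple):
    # Hypothetical nested type; these fields are assumptions for illustration.
    street_address: str
    city: str
    country: str


class Transaction(typing.NamedTuple):
    # Hypothetical repeated type; these fields are assumptions for illustration.
    bank: str
    purchase_amount: float


class Purchase(typing.NamedTuple):
    # The five fields named in the talk; the types are assumed.
    user_id: str
    item_id: int
    shipping_address: ShippingAddress           # nested field
    cost_cents: int
    transactions: typing.Sequence[Transaction]  # repeated field


try:
    import apache_beam as beam

    # Associating a NamedTuple with RowCoder is how the Python SDK
    # is told to treat the type as a schema.
    for schema_type in (ShippingAddress, Transaction, Purchase):
        beam.coders.registry.register_coder(schema_type, beam.coders.RowCoder)
except ImportError:
    # Without Beam installed, the classes still work as plain NamedTuples.
    pass


example = Purchase(
    user_id="user-1",
    item_id=42,
    shipping_address=ShippingAddress("1 Main St", "Springfield", "US"),
    cost_cents=1999,
    transactions=[Transaction("example-bank", 19.99)],
)
```

Because every field is named and typed, downstream steps can refer to fields such as cost_cents by name instead of unpacking positional tuples, which is the readability benefit described in the talk.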