Beam schemas

1. Beam schemas

David: Hi there, my name is David Sabater, and I work as outbound product manager for data analytics at Google Cloud. In this module, I'm going to introduce Beam schemas and also provide some code examples. This module about schemas is part of the Dataflow developing pipelines course. Let us start by introducing schemas before we look at some examples.

A PCollection must consist of elements of the same type. For example, it could consist of many JSON objects, or of other available object types such as byte strings (also known as plain text), Avro, or protocol buffers. To Beam, these collections of types are blobs that are passed between transforms. However, to support distributed processing, Beam needs to be able to encode each individual element, for example as a byte string, so elements can be read and passed around to distributed workers.

Common Beam sources can produce JSON, Avro, Protocol Buffer, or database row objects. All of these types have well-defined structures, structures that can often be determined by examining the type. Even within an SDK pipeline, simple Java POJOs or equivalent structures in other languages are often used as intermediate types. These also have a clear structure that you can infer by applying a custom coder and inspecting the class.

As we have seen in our previous slide, the types of records being processed typically have an obvious structure. By understanding the structure of a pipeline's records, we can provide much more concise APIs for data processing. And actually, database folks have known this since the '70s, using schemas.

Schemas to the rescue. Most structured records share some common characteristics which can be represented as schemas. They can be subdivided into separate named fields and values. Fields usually have string names, but sometimes, as in the case of indexed tuples, have numerical indices instead. There is a finite list of primitive types that a field can have.
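To make the idea of named fields with primitive types concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK. The `Purchase` record and the `infer_schema` helper are hypothetical names invented for this illustration, and the sketch only shows how a schema-like description can be read off a class by inspecting it.

```python
from typing import NamedTuple, get_type_hints


class Purchase(NamedTuple):
    """Hypothetical record type; field names are illustrative only."""
    user_id: str
    item_id: int
    cost: float


def infer_schema(record_type):
    """Return (field name, primitive type name) pairs by inspecting the class.

    This is an illustrative helper, not a Beam API.
    """
    hints = get_type_hints(record_type)
    # _fields preserves the declaration order of the NamedTuple fields.
    return [(name, hints[name].__name__) for name in record_type._fields]


schema = infer_schema(Purchase)
print(schema)
```

Each entry pairs a string field name with one of a small set of primitive types, which is the kind of structure a schema captures.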
These primitive field types often match the primitive types in most programming languages: int, long, string, and so on. Often a field type can be marked as optional (sometimes referred to as nullable) or required. Often records have a nested structure. A nested structure occurs when a field itself has subfields, so the type of the field itself has a schema. Structured records also commonly feature array or map type fields.

Now, in order to take advantage of schemas, your PCollection must have a schema attached to it. Often the source itself will attach a schema to the PCollection. For example, when using AvroIO to read Avro files, the source can automatically infer a Beam schema from the Avro schema and attach that to the Beam PCollection. However, not all sources produce schemas. In addition, Beam pipelines often have intermediate stages and types, and those also can benefit from the expressiveness of schemas.
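The field characteristics described above (optional, nested, array, and map fields) can be sketched in plain Python as well. This is an illustration only, not Beam's schema inference: the `Address` and `Customer` records, the `describe` helper, and the `optional<...>`, `array<...>`, and `map<...>` notation are all assumptions made up for this example.

```python
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)


class Address(NamedTuple):
    """Hypothetical nested record."""
    city: str
    postcode: Optional[str]  # optional (nullable) field


class Customer(NamedTuple):
    """Hypothetical top-level record."""
    name: str
    address: Address        # nested field: its type has its own schema
    tags: List[str]         # array field
    scores: Dict[str, int]  # map field


def describe(tp):
    """Recursively describe a type using an ad hoc, illustrative notation."""
    origin = get_origin(tp)
    args = get_args(tp)
    # Optional[X] is a union of X and None.
    if origin is Union and len(args) == 2 and type(None) in args:
        inner = args[0] if args[1] is type(None) else args[1]
        return "optional<" + describe(inner) + ">"
    if origin is list:
        return "array<" + describe(args[0]) + ">"
    if origin is dict:
        return "map<" + describe(args[0]) + "," + describe(args[1]) + ">"
    # A NamedTuple class is treated as a nested record with its own fields.
    if hasattr(tp, "_fields"):
        hints = get_type_hints(tp)
        return {name: describe(hints[name]) for name in tp._fields}
    return tp.__name__


customer_schema = describe(Customer)
print(customer_schema)
```

The nested `address` entry is itself a dictionary of fields, mirroring the point that the type of a nested field has its own schema.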
