Beam Notebooks

1. Beam Notebooks

>> Hi, my name is Rosa Orkney and I'm a Google Cloud developer advocate for Google Cloud Dataflow. Today, as you can see on the agenda, we're going to be looking at Beam notebooks and running them on the Dataflow service. Let's think about the way that we would normally develop an Apache Beam pipeline. The SDK allows us to declaratively describe the pipeline that we would like the service to run. We can see this in the example here with the store sales and the online sales, which describes a directed acyclic graph of computation. Once we've done this, we submit it to the service, and the service goes ahead and runs that pipeline for us. This is fantastic for production use cases because it allows the runner to do fancy optimizations like Dataflow fusion, where it can collapse stages together and make the processing very efficient.
However, it's not the best experience when you're first developing your pipeline, because we often get into a cycle of write, submit job, wait for logs, wait for print statements. This is especially true when we're still exploring our data, where we would like access not only to the data as we first read it, but also to the data as our transforms are transforming it. Now, with Apache Beam, we do have the interactive runner available. Specifically, it gives us access to the intermediate results of our transformations, allowing us to do exploration and move on to the next stages of development. Importantly, as you would expect with Apache Beam, the interactive runner works with both batch and streaming sources. So we no longer have to mock out our sources as we're doing our development. We can, when needed, run directly against the real data, even if that source is unbounded and streaming.
So, going on to how we get to run this, we'll look at the Beam notebooks. On this slide you can see the steps we need to take in the Dataflow service UI to set up a notebook environment. We start with the Dataflow notebooks page, which gives us access to creating new instances. The nice thing there is that once a new instance has been created, which is the host for our notebooks, all the libraries and things that we need are already in place, so we can immediately start doing our development. The notebook environment also comes with some ready-made examples, which are great for exploration and learning, but also very useful as a source of code snippets.
We'll talk a little bit more about one of the examples, the word count example. We won't reproduce it completely in this session; that's for you to hopefully do once you start exploring the notebooks directly. We'll just take a few snippets and look at some of the options that will be available when you start making use of the notebooks. In this slide, we can see some of the transformations in that word count example. First, we need to read from our unbounded source, which we do with ReadFromPubSub. This puts the data into the words collection. We then apply a fixed window of 10 seconds to those elements. This puts them into the windowed words collection. And finally we do a count. Now, if we were doing this without the interactive runner, at this point we'd have a fourth transform that would do some logging or write to a sink where we can then look at the data. Because we are using the interactive runner, we can access intermediate results. However, we do need a way to tell the system when to stop reading from this unbounded source of information. The way that we do that is with a couple of options from the interactive runner, under ib.options: either recording_duration or recording_size_limit.
The recording duration gives a fixed amount of time for the interactive runner to record data, and the recording size limit gives a fixed number of bytes to read. The latter is very useful when you're working with a real stream of information where you might have a very large volume of data, and you don't want all of that to be put into your notebook as you're doing the experimentation. The other important factor is that we have the option to either reuse the stream of information that we've gathered or to get fresh data. Reuse is useful because, while we are exploring and manipulating the data, we are working with the same data set each time.
In terms of looking at those collections and at that intermediate result, ib.show allows us to see the information visually, as in the example on this slide. The include_window_info option also gives us some extra metadata about each element, for example the event time and the window that the data belongs to. Visualization is obviously very useful, but we also want to be able to use this data and manipulate it directly within our code. For that we can use ib.collect, which outputs the data into a pandas DataFrame that we can then do all our manipulation against. If we want to do further visualization of the data with graphs and so on, the notebook comes with a built-in feature in ib.show, which is to set visualize_data to True. This gives us access to the UI that you're seeing here, where you can do various visual exploration of the information. With these core primitives, we can now start getting out of that write, submit-to-service cycle when we are dealing with our data. But obviously, once we complete our development, we then want to be able to submit the job to the service, and we want to do that with as little code as possible.
So finally, going from development to production, there's very little we need to do to the code, because at the end of the day we're just writing Beam SDK code throughout this whole process. We just need to enable running on the service by importing the DataflowRunner and providing the pipeline options, for example the project and the staging directory. Then we call run_pipeline on the runner, which submits the job to the service. So hopefully this has given you a nice overview of what will be available when you start working with the notebooks. Thank you.

2. Let's practice!
