
Demo: Dataproc

1. Demo: Dataproc

Let's look at how to create a Dataproc cluster, modify the number of workers in the cluster, and submit a simple Apache Spark job. So here I am in the GCP console. And the first thing I want to do is navigate to Cloud Dataproc. It's pretty far down, so let's navigate down to Big Data. We have Dataproc. And it's going to check if there is already a cluster, which we don't have, so we can go ahead and now create a cluster. We can start off by defining the name. Let's just call it our example cluster. I'm not going to change any of the other settings; I'll just highlight them. We can define where this is stored, what regions and zones, and what kind of mode, which defines the relationship between master and worker nodes. We want to have one master plus workers. You can also have a high-availability setting where you have three masters. Then we define the names of the workers. We have the machine type for the master node, so four virtual CPUs. And then we also have the workers: they also have four virtual CPUs each, and there are going to be two of them. So in total, this is going to create 12 virtual CPUs. If we go to the advanced options, we could make some of these nodes preemptible. We can define the network and network tags in terms of firewall rules, make this internal IP only, set the Cloud Storage bucket for staging, and pick the image. You can see there are lots of other options, all the way down to the specific encryption. So let me go ahead and just create this with the default configuration and click Create. And again, this is going to create a bunch of different machines for us now. If I open another tab and actually navigate to Compute Engine, we'll see all those instances being generated for us. So even though this is a managed service, we can see all the instances. They're all ready. So we have the master and we have our two worker nodes, and they just take the name that I specified.
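The same cluster creation can also be sketched with the gcloud CLI instead of the console. This is a minimal sketch, not the exact demo: the cluster name, region, and `n1-standard-4` machine types (four virtual CPUs each) are assumptions chosen to mirror the defaults shown above.

```shell
# Create a standard-mode Dataproc cluster: one master and two workers.
# Region and machine types are assumptions mirroring the console defaults.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2
```

As in the console, this provisions Compute Engine instances behind the scenes, which you can list with `gcloud compute instances list`.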
That attaches M for master, W for worker, and starts with a zero index. So if I come back here, I can refresh. The cluster itself is still being initialized. That's the software being installed and all the setup happening in the back end. And once the cluster is ready, we could resize it. We see that we currently have two worker nodes; we could change that to, say, three worker nodes. And then after that, we're actually going to go ahead and submit a job. So here we are. It just took another minute or two, and we have the cluster up and running. I can click on the cluster itself and get more information about it. So here we have all sorts of monitoring set up. If I go to the VM instances, I'll see those, including the master, and any jobs I have, which currently we don't have yet. And if I click on the configuration, we'll see that we currently have two worker nodes. And if I click on edit, I can change that. So let's say we want three worker nodes. We can change that to three and hit Save, and it's now going to go ahead and request that update for us. So it's going to create another worker, and it's also going to update the master and let the master know that there is another worker out there, so when we submit jobs, all the workers are being leveraged. So if I change back to Compute Engine, here we see the new worker is already up and running. And if I come back here and refresh, you can see that the cluster itself is still being updated. This again should just take a minute or two. Pretty fast. Again, this is a managed service, but we can see the actual back-end instances that are being leveraged. So here we can see the cluster update is complete. I can click on it again and go to the configuration, and we can see that we now have three worker nodes. So time to submit a job. Let's go to the jobs section and click on Submit a job. I can leave the job ID and leave the region.
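The resize step above has a CLI equivalent as well. A sketch, assuming the same hypothetical cluster name and region as before:

```shell
# Resize the cluster from two to three workers. Dataproc adds the new
# worker and updates the master so subsequent jobs use all three.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=3
```

Scaling down works the same way with a smaller `--num-workers` value.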
Obviously you want to select the cluster, especially if you have multiple clusters. The job type in this case is going to be Spark. I'm going to define a main class; this is just the Spark example class. And what we're going to do is provide an example that calculates the value of pi. So for arguments, I'm just going to give it 1,000. And the jar file, I'm going to provide that as well. And then I can review that. There are lots of other things: properties, labels. So I'm all set, and I'm going to click Submit on this job. So it's going to go ahead and submit that. And that job is now running; that's the status symbol that's on here right now. I can click on that job itself. And here I can see the job actually running. I can also review the configuration one more time, so here you see all the different settings that I just specified. And we can go back to the output. Again, this is now going to do a rough calculation for us to estimate the value of pi. So we'll just wait for that. And here we go. It says pi is roughly this. So the job is now complete. And if this is all that we wanted to do, we could go ahead and delete the cluster. Otherwise, we could submit more jobs. In our case, we're done. So let's go back to the cluster, select it, and click Delete. It's also going to delete all the data, and this can't be undone. Okay. Click that. If you go to Compute Engine and refresh here, we can already see that all of these are now being stopped and will then be deleted. And that way you can easily spin up clusters and delete them, so that you're only being charged for the use of the cluster while you need it. So we can wait around for this to be deleted. That just took another minute or two, and we can see that the cluster itself is deleted. And if I go to the instances, we can also see that all the instances are gone. That's how easy it is to create a Dataproc cluster and submit a job to that cluster.
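The job submission and cleanup steps can likewise be done from the CLI. This is a sketch under the same assumptions as before (cluster name and region are hypothetical); the main class is Spark's bundled `SparkPi` example, and the jar path is where Dataproc images typically ship the Spark examples jar. Everything after `--` is passed to the job as its arguments, here the 1,000 sample partitions used in the demo.

```shell
# Submit the SparkPi example job with 1,000 partitions. The printed
# output ends with an estimate like "Pi is roughly 3.14...".
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Delete the cluster when done, so you are only charged while using it.
# Like the console delete, this cannot be undone.
gcloud dataproc clusters delete example-cluster --region=us-central1
```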

2. Let's practice!
