
Getting started with Databricks

1. Getting started with Databricks

Hello, and welcome back! In this video, we will cover a few steps to get us started with Databricks.

2. Compute cluster refresh

As a reminder, a cluster is a collection of compute resources in the cloud that work together to process data. Because these resources are networked together, a cluster can handle much larger amounts of data than your local laptop or desktop.

3. Create your first cluster

Once you have your Databricks workspace set up, you are ready to do some analytics! Before you can do any kind of data processing in Databricks, you must create a Databricks cluster. Many configuration options are available, and each one can affect how well your cluster performs for your intended workload. While this can feel a bit daunting, there are only a few fundamental configurations we need to focus on initially.

4. Create your first cluster

For each cluster, we can define parameters around how it is created and who can access it. Cluster policies define a set of parameter "guardrails" that the configuration must stay within. Think of these as the size and shape of a box that an object must fit into. This is a really useful technique for minimizing the risk of runaway costs.
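
To make the guardrails idea concrete, here is a minimal sketch of a cluster policy defined through the Databricks SDK for Python; the policy name, limits, and node types are illustrative placeholders, not recommendations.

```python
import json
from databricks.sdk import WorkspaceClient  # pip install databricks-sdk

# Hypothetical guardrails: cap the cluster's size, force auto-termination,
# and restrict which node types may be chosen. The rule format follows the
# cluster policy definition schema; the values are placeholders.
policy_definition = {
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}

w = WorkspaceClient()  # reads credentials from your environment or config file
w.cluster_policies.create(
    name="team-guardrails",
    definition=json.dumps(policy_definition),
)
```

Any cluster created under this policy must then fit inside that "box", which is exactly how runaway costs get contained.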

5. Cluster Access

We can also set who can use the cluster. There are a variety of ways to restrict access to a particular cluster. Single-user clusters are assigned to one user and work well for exploring your own datasets or for giving an automated process access to data. Shared clusters allow several users or groups to work on the same pool of resources, which can be a great design for teams working on the same project.
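
For a sense of how the access mode shows up in a cluster's configuration, here is a small sketch using the Databricks SDK for Python; the cluster name, runtime key, node type, and user email are all hypothetical.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import DataSecurityMode

w = WorkspaceClient()

# A single-user cluster: tied to one principal, handy for personal
# exploration or for an automated job identity.
w.clusters.create(
    cluster_name="solo-analysis",                      # illustrative name
    spark_version="13.3.x-scala2.12",                  # placeholder runtime key
    node_type_id="i3.xlarge",                          # placeholder node type
    num_workers=1,
    data_security_mode=DataSecurityMode.SINGLE_USER,
    single_user_name="ana@example.com",                # hypothetical user
)
# For a team, DataSecurityMode.USER_ISOLATION instead gives a shared
# cluster that several users can attach to safely.
```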

6. Create your first cluster

Next, we can define what comes installed on the cluster. We can select a Databricks Runtime, which bundles all the languages, libraries, and system configurations the cluster needs. New runtime versions are released regularly, bringing improvements over time as well as specialized variants for machine learning workloads.
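
If you'd rather see the available runtimes programmatically than scroll the UI dropdown, a sketch like this should list them, assuming the Databricks SDK for Python is installed and authenticated.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List the Databricks Runtime versions this workspace offers; the
# machine learning variants show up here too (their names mention "ML").
for v in w.clusters.spark_versions().versions:
    print(v.key, "-", v.name)
```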

7. Create your first cluster

Finally, we can decide the physical properties of our cluster. In this UI, we can select the specific types of compute nodes that will make up the cluster and the maximum number of nodes it can scale up to. Databricks clusters can also be configured to auto-scale and auto-terminate depending on the level of activity on the cluster.
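
Putting these physical properties together, a cluster with auto-scaling and auto-termination could be created from code roughly like this, again via the Databricks SDK for Python; the runtime key and node type are placeholders that vary by workspace and cloud.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="getting-started",
    spark_version="13.3.x-scala2.12",          # placeholder; list yours first
    node_type_id="i3.xlarge",                  # placeholder node type
    autoscale=AutoScale(min_workers=1, max_workers=4),
    autotermination_minutes=30,                # shut down after 30 idle minutes
).result()                                     # wait for the cluster to come up
print(cluster.cluster_id)
```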

8. Data Explorer

Next, we will want to start working with data. The Data Explorer is a one-stop location to view all the data assets in your Unity Catalog. You can browse which data is available and preview a sample of each table directly in the UI. With Unity Catalog, you can also see the history and lineage of your data assets and share them with others through Delta Sharing.
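
As a quick illustration of browsing those same assets from code rather than the UI, a notebook cell like the following would work; it assumes the built-in samples catalog is available in your workspace (swap in your own catalog, schema, and table names otherwise).

```python
# Inside a Databricks notebook, `spark` and `display` are predefined.
display(spark.sql("SHOW CATALOGS"))
display(spark.sql("SHOW SCHEMAS IN samples"))
display(spark.sql("SELECT * FROM samples.nyctaxi.trips LIMIT 5"))
```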

9. Create a notebook

Finally, we can create a Databricks notebook and start exploring our data. The notebook is the standard interface for working with data on the platform. Databricks notebooks provide various improvements over the open-source Jupyter notebooks they are based on and support any of the languages Databricks can handle; you can even mix languages within the same notebook using magic commands. Other key improvements include built-in visualizations for better exploratory data analysis and the ability to collaborate and comment in real time with your colleagues.
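
To make the mixing of languages concrete, here is a small sketch of two notebook cells; it assumes you are inside a Databricks notebook, where spark and display are predefined.

```python
# Cell 1: this notebook's default language is Python.
df = spark.range(3)
display(df)  # Databricks' built-in rich display, with point-and-click charting

# A second cell can switch languages by starting with a magic command:
# %sql
# SELECT current_catalog()
#
# Other magics include %md for markdown, plus %scala, %r, and %sh.
```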

10. Let's practice!

Now let's go ahead and get started in the Databricks platform!
