1. Using Databricks for machine learning
Hey there! In this video, we will be talking about how to get started with a machine learning use case on the Databricks Lakehouse Platform.
2. Machine Learning Lifecycle
Let's take a step back and consider the overall machine-learning lifecycle. Machine learning workloads are usually not linear processes but rather cyclical as you continue to develop and fine-tune the models you have.
This image is from a blog here on DataCamp that does a deeper dive into each of these stages, and I encourage you to check it out if you are interested.
3. Planning and preparation
In this video, we will focus on the stages that generally come first in a machine learning process: planning and data preparation.
4. Planning for machine learning
So, how exactly do you plan for machine learning workloads? There are a lot of factors that go into a machine learning workload, but we can simplify it down into two questions.
First, we must ask ourselves, "What do we have to work with?" This covers any datasets we have at our disposal, as well as the business and data practitioner resources needed to deliver a use case. These are important, as they define what can be implemented and how long it might take.
After that, we need to ask ourselves, "What do we want out of this?" This is where our particular use case or business question comes into play. Of course, we need to plan for security and compliance, but at the end of the day, we are looking for a business outcome.
5. ML Runtime
Databricks can help address many resource constraints for machine learning with the Databricks ML Runtime. This Runtime is an extension of the Databricks compute engine, specifically optimized to run machine learning applications. It also conveniently comes with the most common libraries and frameworks that data scientists need for their work.
For example, the ML Runtime has libraries like scikit-learn, SparkML, and TensorFlow for different applications. The Runtime also comes with MLflow for MLOps needs, which will be covered in a separate video.
The ML Runtime works well with the overall cluster library management, which allows you to supplement the Runtime with any additional libraries you need.
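As a quick sketch of what this looks like in practice, a notebook attached to an ML Runtime cluster can import the preinstalled libraries directly and layer extra packages on top. The lightgbm package below is only an illustrative add-on, not something the Runtime requires.

```python
# Notebook-scoped install of an extra library on top of the ML Runtime
# (lightgbm is just an example package).
%pip install lightgbm

# Libraries such as scikit-learn and MLflow ship with the ML Runtime,
# so they can be imported directly on an ML cluster.
import sklearn
import mlflow
import lightgbm

print(sklearn.__version__, mlflow.__version__, lightgbm.__version__)
```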
6. Exploratory Data Analysis
After you have compute ready, you can start your data science work, which usually starts with exploratory data analysis.
If you prefer a programmatic approach to data science, you can continue to use your existing processes, such as those from the pandas and Spark frameworks. Databricks also has a utility that can provide the same kinds of information about your data without having to use a particular framework, although it uses Spark under the hood.
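As an illustrative sketch of that programmatic exploration (the table name book_reviews is hypothetical, and spark and dbutils are only available inside a Databricks notebook):

```python
# Load a table into a Spark DataFrame (the table name is hypothetical).
df = spark.table("book_reviews")

# Familiar programmatic EDA with the Spark and pandas frameworks.
df.printSchema()
df.describe().show()
df.limit(100).toPandas().head()

# Databricks' built-in data summarization utility, which profiles the
# DataFrame without a particular framework (it runs on Spark under the hood).
dbutils.data.summarize(df)
```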
If you prefer to explore your data through a UI, Databricks has incorporated Bamboolib into the platform. Bamboolib is a Python library that, as you can see in the visual here, provides an easy, non-programmatic way to explore data through an in-notebook UI with various filters and search capabilities. Users can slice, dice, and explore datasets with minimal code, all within the Databricks platform.
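To give a rough idea of how that UI is launched, here is a minimal sketch, assuming Bamboolib is installed on the cluster; the CSV path is purely hypothetical.

```python
import bamboolib as bam
import pandas as pd

# Once bamboolib is imported, displaying a pandas DataFrame in a notebook
# cell opens the interactive exploration UI instead of a static table.
# The file path below is hypothetical.
reviews_df = pd.read_csv("/dbfs/tmp/book_reviews.csv")
reviews_df
```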
7. Feature tables and feature stores
Before we are ready to develop and train a model, we need to make sure our data is in the right format. In machine learning use cases, we need to put our data into a feature table. Here we can see a raw data table with a variety of data about book reviews. We cannot feed this raw data into a machine learning model as-is, because the category and shelf_loc columns contain text data, which most models cannot process. Thus, we need to featurize our data into the form on the left, where the different values in each of these columns have been assigned a corresponding number. We won't cover the specifics of featurizing data here; for that, we recommend another DataCamp course.
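As a rough sketch of one way to do this kind of featurization, Spark's StringIndexer assigns each distinct text value a numeric index; the tiny DataFrame below is a hypothetical stand-in for the raw book review table on the slide.

```python
from pyspark.ml.feature import StringIndexer

# A tiny stand-in for the raw book review table shown on the slide.
raw_df = spark.createDataFrame(
    [(1, "Fiction", "A1", 4.5), (2, "History", "B2", 3.0), (3, "Fiction", "B2", 5.0)],
    ["book_id", "category", "shelf_loc", "rating"],
)

# Map each distinct text value in the two columns to a numeric index,
# producing a feature-ready version of the table.
indexer = StringIndexer(
    inputCols=["category", "shelf_loc"],
    outputCols=["category_idx", "shelf_loc_idx"],
)
features_df = indexer.fit(raw_df).transform(raw_df)
features_df.show()
```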
A feature store, then, is a collection of these feature tables, so they can be easily referenced and reused across different models.
8. Databricks Feature Store
Databricks helps streamline the featurization process with the Feature Store, a centralized repository specifically for feature tables. With the Feature Store, you can discover and reuse feature tables across your different models and see the upstream and downstream lineage of each feature table.
Here, we can quickly create a feature table based on an existing DataFrame we created for our features. Now, any model in the future can reference our feature table, which expedites the development process.
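As a minimal sketch of that step, assuming the databricks.feature_store client available in the ML Runtime and the hypothetical features_df and column names from the earlier example, registering a feature table might look like this:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register the featurized DataFrame as a feature table so future models
# can discover and reuse it (database, table, and key names are hypothetical).
fs.create_table(
    name="book_reviews.review_features",
    primary_keys=["book_id"],
    df=features_df,
    description="Numeric features derived from raw book review data",
)
```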
9. Let's practice!
Now let's start preparing for training our models in Databricks!