

1. Example end-to-end machine learning pipeline

In this scenario, I am a data scientist for a large wine producer with access to a variety of wine-quality datasets. I have been tasked with using these datasets to predict the quality of a particular wine based on the chemical indicators we have available. Note that this notebook uses several different processes in Python, as that is my language of choice for machine learning. If I had teammates who preferred other languages, such as R or SQL, they could work in this same notebook or data pipeline, and that would be fully supported by the Databricks platform.

To start off, I load my data into separate pandas DataFrames, since these are different datasets, and then combine them into a single dataset for model development. We can see that we have various chemical indicators, such as acidity and the amount of sugars in the wine. We will use these indicators as features to predict the quality score.

Next, I go through some exploratory data analysis, in this case using the popular seaborn library in Python. Here I create a distribution plot, otherwise known as a histogram, of our quality scores. I can see that most quality scores fall between 4 and 8, so I should expect my model's predictions to largely fall within this range as well. For the purposes of my modeling, any quality score higher than 7 will be categorized as high-quality.

Now that I have explored my data, I need to ensure it is ready for building a model. Normally, this would require some featurization. Luckily, all of my indicators are numerical and have somewhat consistent values, so I do not need to do much to the data apart from removing null values, which I can quickly do with a few pandas methods. Once I have cleaned and featurized the data, I am ready to prepare my datasets for the actual model training. In the sklearn framework, we can use the train_test_split function, which allows us to randomly sample data for training at a specified ratio. In our case, we use 60% of the data for training and 20% each for the validation and testing steps.

Now I can start to train my model. I am going to start with a tried-and-true sklearn model while logging all of my important metrics and parameters with MLflow. I first wrap all of my code under the mlflow.start_run() call, which provides the context of an MLflow run for all the contained code. After fitting my model to the training data, I can evaluate it against different accuracy metrics and log both my parameters and metrics to the MLflow run. Once the training is complete, I can log specifics about my training environment and log the model to the run as well.

After training this model and an additional XGBoost model, I can jump into my MLflow Experiment to decide which model is best for me. In the Experiment UI, I can view details from the runs in various charts or a tabular format to compare the success of each training run. These charts can be configured to compare whichever metrics I would like, such as the AUC value vs. the max_depth parameter. For the purposes of this experiment, I will evaluate the model runs based on the AUC, or Area Under the Curve, metric. Based on my analysis in the run table, this run produces my best model. I can register that model to the Model Registry and promote it to Production, either programmatically or within the UI. We can see both versions listed on the model itself, and we can decide which version will be the Production one. The code sketches below illustrate what each of these steps might look like.
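As a rough sketch of the data-loading step, the snippet below combines two hypothetical wine-quality CSV files into a single pandas DataFrame. The file names, separator, and columns are assumptions for illustration, not the exact datasets used in the demo.

```python
import pandas as pd

# Hypothetical file names and columns; the demo's exact datasets are not specified here.
red_df = pd.read_csv("winequality-red.csv", sep=";")
white_df = pd.read_csv("winequality-white.csv", sep=";")

# Combine the separate datasets into a single DataFrame for model development.
wine_df = pd.concat([red_df, white_df], ignore_index=True)
print(wine_df.head())
```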
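The exploratory plot and the high-quality cutoff could look roughly like the following, continuing from the DataFrame above. seaborn's histplot stands in for whichever distribution plot the demo uses, and the quality and high_quality column names are assumptions.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution (histogram) of quality scores; most values fall between 4 and 8.
sns.histplot(wine_df["quality"], bins=10)
plt.xlabel("quality score")
plt.show()

# Label anything scoring higher than 7 as high-quality for classification.
wine_df["high_quality"] = (wine_df["quality"] > 7).astype(int)
```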
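Dropping nulls and carving out the 60/20/20 split might look like this. Because train_test_split produces only two subsets per call, the 20% validation and 20% test sets come from splitting the remaining 40% in half.

```python
from sklearn.model_selection import train_test_split

# Remove null values, then separate features from the binary target.
wine_df = wine_df.dropna()
X = wine_df.drop(columns=["quality", "high_quality"])
y = wine_df["high_quality"]

# 60% for training, then split the remaining 40% evenly into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```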
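A minimal sketch of the MLflow-tracked training run follows. The RandomForestClassifier and its hyperparameters are illustrative stand-ins for the "tried-and-true sklearn model" mentioned above, and AUC on the validation set is the metric being logged.

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

params = {"n_estimators": 100, "max_depth": 5}  # illustrative hyperparameters

with mlflow.start_run(run_name="sklearn_baseline"):
    # Fit the model on the training data.
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on the validation set and log parameters, metrics, and the model itself.
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_params(params)
    mlflow.log_metric("val_auc", val_auc)
    mlflow.sklearn.log_model(model, "model")
```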
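Registering the best run's model and promoting it to Production programmatically could look like this sketch; the run ID and the model name wine_quality_model are placeholders, and the stage transition follows the stage-based Model Registry workflow described above.

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<best_run_id>"  # placeholder for the run chosen in the Experiment UI
model_uri = f"runs:/{run_id}/model"

# Register the logged model and promote the new version to Production.
registered = mlflow.register_model(model_uri, "wine_quality_model")
MlflowClient().transition_model_version_stage(
    name="wine_quality_model",
    version=registered.version,
    stage="Production",
)
```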
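Once a version is in Production, it can be loaded back by its registry URI for batch scoring, which anticipates the deployment options mentioned next; the model name is the same placeholder used above.

```python
import mlflow.pyfunc

# Load the Production version of the registered model and score a batch of new data.
prod_model = mlflow.pyfunc.load_model("models:/wine_quality_model/Production")
predictions = prod_model.predict(X_test)
```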
From here, we could deploy our model either directly in the Databricks environment or push it into a different application for consumption. This concludes our example of an end-to-end machine learning pipeline in Databricks. Now let's go practice creating one ourselves.

2. Let's practice!