ML - Snowpark ML Modeling - Part III

1. ML - Snowpark ML Modeling - Part III

Now we’re ready to split our data into a training set and a test set and move on to our final objective: training our model and seeing how well it performs. For those of you who don’t have a background in machine learning, splitting our data into a training set and a test set just means that we’re holding back some of the data so it’s not used to train the model. That held-back data is what we call the test set. This is important because a clean test set lets us check how good our model is later, by having it make predictions on the test set and seeing how many of those predictions are correct. We don’t want to train the model on the same data we’re going to test it on, because that’s like letting it prepare for the test by giving it the answers in advance.

So we’re getting this randomSplit from Snowpark ML, even though the underlying functionality was adopted from scikit-learn. We’ll set aside 10% of the data, so two of our 20 years’ worth of truck location data, for testing. Great, now we’ll use the write.mode.save_as_table method on the Snowpark dataframe to save each of these as tables to our Snowflake instance. And if we go to Snowsight, we’ll see the tables pop up! I don’t know about you, but somehow doing something locally and affecting something in the cloud is very satisfying to me.

Okay, and here’s the moment we’ve been waiting for. Now we’re going to hand over our days and months data, with the neighborhood target variable, and let the XGBClassifier try to mix and match that days and months info to predict the neighborhood the truck is in. The feature columns are the columns the XGBClassifier will play around with as it’s learning to guess what the label column is. So we create our model, and you can see that we’ve handed over the input column labels and the output column label as inputs:

    # Train an XGBoost model on Snowflake.
    xgboost_model = XGBClassifier(
        input_cols=FEATURE_COLS,
        label_cols=LABEL_COLS
    )

And then we call the fit method on that model, with our training dataframe as the argument, and this fit method call is what starts the training process:

    xgboost_model.fit(train_snowpark_df)

So now for the moment of truth! Let’s check the accuracy of our model against our test data by using the score method and feeding in the test dataframe as the argument. The truck driver’s schedule is kind of odd, but I intentionally made it very regular, to see how good XGBoost is at picking up strange but very consistent signals in the data.

    accuracy = xgboost_model.score(test_snowpark_df)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

And it gets a perfect score! A 100 out of 100! The XGBoost classifier we trained, when applied to the test set, made 730 predictions, and got 730 of those correct. Pretty cool.

We’re not going to do this here, but if I were to keep going, I’d also register our model in the Snowflake Model Registry so it’s easy to access later and I can keep track of all of my models.

Okay, we just covered a lot! Now we know exactly where that Freezing Point truck will be. All the time. We can buy an ice cream sandwich (or even better, multiple ice cream sandwiches) every day.

Before I recap what we just covered, I want to note that we could have done this inside our SQL workflow using the Snowflake Cortex ML function “Classification.” I wanted to do it with Snowpark ML instead for two reasons: we’d already covered Snowflake Cortex LLM functions and got a taste for how Snowflake Cortex’s functions work, and it’s good to know Snowpark ML because there will be moments when you need more customization than you can get with the Snowflake Cortex ML functions.

Whew! That was a lot of information!
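If the hold-out idea or the perfect score still feels abstract, here is a minimal plain-Python sketch of the same reasoning. It is not the course notebook and it does not use XGBoost or Snowflake: the schedule rule, the neighborhood names, the start date, and the seed are all made up for illustration, and the "model" is just a lookup table that memorizes what it saw in training. The point is only to show why a perfectly regular schedule can be predicted perfectly on rows the model never trained on.

```python
import random
from datetime import date, timedelta

# Made-up neighborhood names (assumption, not from the course data).
NEIGHBORHOODS = ["Downtown", "Harbor", "Old Town", "Riverside"]


def neighborhood_for(day):
    """Hypothetical, perfectly regular rule standing in for the truck's schedule."""
    return NEIGHBORHOODS[(day.weekday() + day.month) % len(NEIGHBORHOODS)]


# Build 7,300 daily rows (20 x 365) of (weekday, month, neighborhood).
start = date(2004, 1, 1)  # arbitrary start date (assumption)
rows = []
for offset in range(20 * 365):
    day = start + timedelta(days=offset)
    rows.append((day.weekday(), day.month, neighborhood_for(day)))

# Hold back 10% of the rows as the test set; the model never sees these.
rng = random.Random(0)
rng.shuffle(rows)
test_size = len(rows) // 10
test_rows, train_rows = rows[:test_size], rows[test_size:]

# "Train": memorize the neighborhood seen for each (weekday, month) pair.
lookup = {}
for weekday, month, label in train_rows:
    lookup[(weekday, month)] = label

# "Score": fraction of held-back rows whose neighborhood is predicted correctly.
correct = sum(
    1 for weekday, month, label in test_rows
    if lookup.get((weekday, month)) == label
)
accuracy = correct / len(test_rows)
print("Accuracy: %.2f%% (%d of %d)" % (accuracy * 100.0, correct, len(test_rows)))
```

Holding back 10% of 7,300 daily rows gives 730 test rows, which lines up with the 730 predictions mentioned above. Real schedules are rarely this tidy, which is where a learner like XGBoost earns its keep over a lookup table.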
We learned how to:

One, connect our local development environment (in this case, a Jupyter notebook) to Snowflake by creating a session using session.builder.configs.create
Two, create a dataframe of a table in Snowflake using session.table
Three, use additional Snowpark dataframe methods, like count, describe, group_by, and withColumn
Four, pre-process our data with Snowpark ML’s LabelEncoder
Five, split our Snowpark dataframe into a training set and a test set using randomSplit
Six, save a table to Snowflake using the dataframe method write.mode.save_as_table
Seven, create an XGBClassifier and fit it to the training data
And eight, calculate the accuracy of our model using the XGBClassifier’s score method

That’s a lot! If you don’t know a lot about machine learning and a bunch of this was mysterious, don’t worry too much about it. But I hope you have a better sense of how to use Snowpark ML to run your ML workloads from within Snowflake.
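For reference, the steps in this recap can be strung together roughly as follows. Treat this as a sketch under assumptions rather than the notebook from the video: the table names, column names, seed, and the shape of connection_parameters are hypothetical placeholders, and it assumes the label column has already been encoded (step four) in the source table. The Snowflake packages are imported inside the function so the sketch can be read and loaded without them installed.

```python
# Hypothetical column names; substitute the ones from your own table.
FEATURE_COLS = ["DAY_OF_WEEK", "MONTH"]      # assumed feature columns
LABEL_COLS = ["NEIGHBORHOOD_ENCODED"]        # assumed, already label-encoded
OUTPUT_COLS = ["PREDICTED_NEIGHBORHOOD"]     # assumed name for predictions


def train_and_score(connection_parameters, source_table="TRUCK_LOCATIONS"):
    """Split, save, train, and score; returns the accuracy on the test split.

    connection_parameters is a dict of Snowflake connection settings.
    source_table and the TRUCK_TRAIN / TRUCK_TEST names are placeholders.
    """
    # Imported lazily so this file loads even where the packages are missing.
    from snowflake.snowpark import Session
    from snowflake.ml.modeling.xgboost import XGBClassifier

    # One: create a session from the connection settings.
    session = Session.builder.configs(connection_parameters).create()
    try:
        # Two: a dataframe that refers to an existing Snowflake table.
        df = session.table(source_table)

        # Five: split with 90/10 weights (the seed is an arbitrary choice).
        train_df, test_df = df.random_split([0.9, 0.1], seed=42)

        # Six: save each split as its own table.
        train_df.write.mode("overwrite").save_as_table("TRUCK_TRAIN")
        test_df.write.mode("overwrite").save_as_table("TRUCK_TEST")

        # Seven: create the classifier and fit it on the training split.
        model = XGBClassifier(
            input_cols=FEATURE_COLS,
            label_cols=LABEL_COLS,
            output_cols=OUTPUT_COLS,
        )
        model.fit(train_df)

        # Eight: accuracy on the held-back test split.
        return model.score(test_df)
    finally:
        session.close()
```

You would call train_and_score with a dict holding your own account, user, authentication, role, warehouse, database, and schema settings, then print the returned accuracy the same way the video does.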

2. Let's practice!
