ML - Snowpark ML Modeling - Part II

1. ML - Snowpark ML Modeling - Part II

If you’ve been following along, our data is all set up. Now we’re ready to use Snowpark ML to predict which neighborhood that sneaky food truck will go to on any given day in the future! Let’s get right back into it.

Snowpark ML has a few parts to it, but we’re going to focus here on Snowpark ML modeling. If you look at the docs for Snowpark ML, you’ll see that it wraps many of the most important functions and methods from popular open-source Python ML libraries (like scikit-learn, xgboost, and lightgbm) so that they’re really easy to use in Snowflake. One way this is really helpful: in many cases, when you call one of these models in Snowflake, training happens in a distributed way automatically. That means it can cut your data up and train on different parts of it at the same time, and this parallelization can really speed up the training process.

So here we’re going to use (and we can see this if we scroll up to the imports again) the “XGBClassifier” from “snowflake.ml.modeling.xgboost”.

This is definitely, definitely not a course on ML theory, but at a high level, here’s what this model is trying to do: it tries to figure out the relationship between a set of labels (in our case, neighborhoods 1 through 8) and some input variables (in our case, the day and the month). It wants to get to the point where, if you give it some input data, it can say: “Ahh, you know what, I bet I know what the output is going to be!” You can think of it as finding these patterns by coming up with a series of clever questions to ask about your data that, when answered, will eventually sort your data into buckets of similar items. If you’ve ever played the game “20 questions,” where one person thinks of something and someone else gets to ask at most 20 yes-or-no questions to narrow down to the answer, it’s kind of like that.
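To make the “clever questions” idea a bit more concrete, here is a small sketch in plain Python. It is not how XGBoost is implemented: XGBoost scores splits with a gradient-based gain and builds many trees, while this sketch swaps in Gini impurity, a simpler score used by classic decision trees. The data is made up (neighborhood 8 appears only in December), and the names rows, gini, and split_score are hypothetical, chosen just for this illustration.

```python
from collections import Counter

# Toy, made-up data: (month, neighborhood) pairs. Neighborhood 8 shows up
# only in December. This is an assumption for illustration, not course data.
rows = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7),
        (8, 1), (9, 2), (10, 3), (11, 4), (12, 8), (12, 8), (12, 8)]


def gini(labels):
    """Gini impurity: 0.0 means the bucket holds a single label."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_score(rows, cut):
    """Weighted impurity after asking 'is month >= cut?'. Lower is better."""
    left = [label for month, label in rows if month < cut]
    right = [label for month, label in rows if month >= cut]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Try every possible month cut point and keep the one with the purest buckets.
scores = {cut: split_score(rows, cut) for cut in range(2, 13)}
best_cut = min(scores, key=scores.get)
print(best_cut)  # 12
```

On this toy data the lowest score lands on the December cut, because that question leaves one bucket containing nothing but neighborhood 8. A real tree would then repeat the same search inside each bucket.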
So in our case, it will do something like ask: “Hmm, do some neighborhoods come up in the data more in the second half of the year than the first?” The answer here would probably be yes, because neighborhood 8 only appears in the second half of the year. But then it might try different cut points: “Do some neighborhoods come up more in the data in December than in the rest of the year?” And that will turn out to be a jackpot cut point, because it will isolate neighborhood 8 really cleanly. It keeps asking these questions, and after it decides on a really strong initial question, it goes through this process again, splitting the data into smaller and smaller buckets that it hopes are becoming more and more concentrated with just one type of label. Models like this are called tree-based models, because if you draw out the logic, it looks like branches hanging down. That’s a very hand-wavy explanation for the model we’re about to use, and there’s a ton of outside reading you can do on XGBoost and tree-based classifiers if you’re interested.

So in this case, we’re dealing with the XGBClassifier that I mentioned above. The classifier won’t let us feed it labels that go from 1 to 8 (which is how our neighborhoods are numbered); instead, it wants them to go from 0 to 7. (It’s just one of the criteria it insists on.) I’ve shown two ways to do that transformation here.

The first way is to just use Snowpark DataFrame functionality. For this, we use a Snowpark DataFrame method called “withColumn,” which lets us create a column (we’re calling it “NEIGHBORHOOD2”) and give it a value, which we want to be the original Neighborhood value minus 1. Then we drop the original Neighborhood column, since we don’t need it anymore. We can test that out:

    test = snowpark_df.withColumn('NEIGHBORHOOD2', snowpark_df.neighborhood - 1).drop("Neighborhood")
    test.show()

And sure enough, that looks pretty good.
But since this is a pretty common operation in the world of machine learning, there’s actually a tool we can use out of the box to do this for us. I wanted to show it here to emphasize that Snowpark ML makes available more than just ML models; it also provides many important preprocessing functions. So here we’ll use something called LabelEncoder, which we’re pulling from (and we can see this if we scroll up) the preprocessing part of snowflake.ml.modeling.

    le = LabelEncoder(input_cols=['NEIGHBORHOOD'], output_cols=['NEIGHBORHOOD2'], drop_input_cols=True)

We’ve set up LabelEncoder so it takes an input column and gives us an output column. LabelEncoder assigns each distinct label an integer from 0 up to the number of classes minus 1; since our neighborhoods are numbered 1 through 8, that works out to subtracting 1 from each value. Then we use the fit method on this label encoder:

    fitted = le.fit(snowpark_df.select("NEIGHBORHOOD"))

And then finally we run the transform method of the fitted encoder on our Snowpark DataFrame. This is what actually executes the operations and gives us back our prepared DataFrame:

    snowpark_df_prepared = fitted.transform(snowpark_df)

And if we take a look at that:

    snowpark_df_prepared.show()

and compare it with the previous DataFrame, we’ll see that they have the same values.

In the next video, we’ll split our data into a training set and a test set, and we’ll actually get to train our Snowpark ML XGBClassifier and make predictions!
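A footnote on the LabelEncoder step above: why do the two approaches give the same values? The sketch below is plain Python, not Snowpark ML code. It assumes the encoder follows scikit-learn’s convention of numbering the sorted distinct labels starting from 0. The helper fit_label_mapping and the sample lists are hypothetical, made up for this illustration.

```python
def fit_label_mapping(values):
    """Map each distinct value to its rank in sorted order (0, 1, 2, ...)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


# Made-up sample of neighborhood labels covering all of 1 through 8.
neighborhoods = [1, 2, 3, 4, 5, 6, 7, 8, 8, 3, 1]

mapping = fit_label_mapping(neighborhoods)
encoded = [mapping[v] for v in neighborhoods]   # label-encoder style
shifted = [v - 1 for v in neighborhoods]        # "minus 1" style
print(encoded == shifted)  # True: the two approaches agree here

# With gaps in the labels, the two approaches would no longer agree.
gapped = [1, 2, 4, 8]
gapped_mapping = fit_label_mapping(gapped)
print([gapped_mapping[v] for v in gapped])  # [0, 1, 2, 3]
```

So the match with “minus 1” depends on the labels being exactly 1 through 8 with no gaps. If a neighborhood number were missing from the data, an encoder of this kind would still produce contiguous labels starting at 0, while plain subtraction would not.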

2. Let's practice!
