ML - Snowpark ML Modeling - Part I

1. ML - Snowpark ML Modeling - Part I

I’m really excited about these Snowpark ML Modeling videos for two reasons: First, we get to try out the Snowpark ML library. Second, writing them gave me an excuse to do something I’ve wanted to do for a long time – create a weird dataset that I could use to test how clever XGBoost is! If you’ve never heard of XGBoost, don’t worry – I’ll tell you more about it in a moment – but, spoiler alert, we’re about to see that XGBoost is very good at what it does.
So here’s the setup for this next section. Imagine you love, love, love the ice cream sandwich that the Freezing Point food truck offers. You’re obsessed, enthralled. You can’t escape its clutches. But you’re never sure where it’s going to be on a given day. So you call the food truck company and beg for info, and they’re like: “Well, we don’t know where it’s going to go either – the driver makes her own choices – but we do know it only visits one neighborhood per day, and there are only eight possible neighborhoods it might go to. And, luckily for you, we have 20 years of data you can look through to see if you can detect some patterns.”
So you load that truck location history into Snowflake, and you want to guess where your favorite food truck is going to be on any given day. You’ve heard that tree-based ML models are really good for making predictions, so you decide you’re going to use one of those – a model from XGBoost, to be specific. And you’ll be able to do all the XGBoost stuff you need *through* the Snowpark ML library – meaning, you’ll never have to install or import XGBoost directly – which you’re happy about because you’re almost as obsessed with Snowflake as you are with Freezing Point ice cream sandwiches.
Okay, so here’s how this will work – I generated the data tracking this food truck’s history of visits to different neighborhoods, and I made the truck never deviate from a somewhat weird pattern.
So first I’ll show you how the food truck decides which neighborhood to visit, and then we’ll see if we can use Snowpark ML to build a classification model that can predict the truck’s next move. Along the way, we’ll talk more generally about what you can do with Snowpark ML.
So now we’re in a Jupyter Notebook. We could have instead done all of the work we’re about to do from VS Code, which would have made connecting to Snowflake easier because we’d have access to the Snowflake VS Code extension. We also could have done it from within Snowsight. But I wanted to use a Jupyter notebook because I wanted to show you how to work with Snowflake in an external IDE. It’s a clean experience, and worth showing. You can ignore the pip installing and the importing I did at the top – we’ll talk a little bit more about which packages we’re using later.
For now, let’s talk about how I generated the Freezing Point data. The driver of this truck really, really likes routines, and for the past 20 years, she’s followed her calendar very strictly in deciding which neighborhood to visit on any given day. Here’s her formula:
In January, she drives to neighborhood 1 on the 1st, the 8th, the 15th, the 22nd, and the 29th because her mom lives in neighborhood 1 in January, and she likes to see her weekly. She goes to neighborhood 2 on all other days in January.
From February through November, she visits neighborhood 1 on the 1st, neighborhood 2 on the 2nd, neighborhood 3 on the 3rd, and so on. Then she loops back again after visiting neighborhood 7, so on the 8th she visits neighborhood 1, on the 9th she visits neighborhood 2, and so on, until the next month starts, when she restarts the pattern.
And December is easy – in December, she visits neighborhood 8 every day because she finds that neighborhood’s holiday decorations enchanting.
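The driver's formula can be sketched in a few lines of plain Python. The original generation code is not shown in the video, so treat this as one possible reading of the rule rather than the code that was actually used: the function names, the (month, day, neighborhood) row layout, and the choice of 2023 as a non-leap calendar year are all assumptions made for illustration.

```python
import calendar
from collections import Counter


def neighborhood_for(month: int, day: int) -> int:
    """Return the neighborhood (1-8) visited on a given month and day."""
    if month == 1:
        # January: neighborhood 1 on the 1st, 8th, 15th, 22nd and 29th, else 2.
        return 1 if day % 7 == 1 else 2
    if month == 12:
        # December: always neighborhood 8.
        return 8
    # February through November: cycle through 1..7, restarting every month.
    return (day - 1) % 7 + 1


def one_year(year: int = 2023) -> list:
    """Build (month, day, neighborhood) rows for one year (2023 is an assumed non-leap year)."""
    rows = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            rows.append((month, day, neighborhood_for(month, day)))
    return rows


# Repeat the same year 20 times to mimic 20 years of history.
history = one_year() * 20
visits_per_year = Counter(neighborhood for _, _, neighborhood in one_year())

print(len(history))                   # 7300
print(visits_per_year.most_common())  # neighborhood 2 first, neighborhood 8 last
```

Because the rule depends only on the month and the day of the month, every year in the history is identical, which is what makes the pattern learnable from two features.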
So I used that description of her neighborhood-selection algorithm to generate one year’s worth of data, and then I just concatenated the DataFrame 20 times and uploaded that to Snowflake using the Snowsight UI. Cool, so now we’ve got our data, and we have our work cut out for us! Let’s see if we can use Snowpark ML to create a model that accurately predicts where this food truck is going to be on any given day.
The first thing we need to do is connect to Snowflake to build a bridge between the cloud and our local machine where our notebook is running. Here’s how we do this – if we scroll back up, you’ll see that we imported Session from snowflake.snowpark. Notice that the import statement doesn’t match the name of the package we pip installed at the top – here, we’re importing snowflake.snowpark, but we’re only able to do this because we already pip installed the snowflake-snowpark-python package.
Anyway, you’ll notice that I also have this statement in there: “from credential import params.” To create a Snowflake session in our account, we can hand over a dictionary of credentials – our account name, our username, our password, and so on. To make this easier on myself, and because I don’t like showing people my credentials in a demo, I created a separate “credential.py” file so I could import credential as a library, and pull that dictionary directly from there.
We talked about the idea of a session briefly when we were learning about Snowpark DataFrames, but again, the session object is really, really important because it includes all of your connection details. You need to create your table from a session so the table is associated with all of the right permissions. So we create our session, and then we can immediately use the session’s table method to get our table as a dataframe.
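As a rough sketch of the credentials-plus-session step: the contents of credential.py are never shown, so the version below is an assumption. It builds the dictionary from environment variables (the SNOWFLAKE_* names and the load_params and create_session helpers are hypothetical), then passes it to Session.builder.configs(...).create(), which is the usual Snowpark way to open a session from a dictionary.

```python
import os


def load_params(env=os.environ) -> dict:
    """Collect Snowflake connection settings from environment variables.

    The SNOWFLAKE_* variable names are hypothetical; any missing ones are skipped.
    """
    keys = ("account", "user", "password", "role", "warehouse", "database", "schema")
    params = {}
    for key in keys:
        variable = f"SNOWFLAKE_{key.upper()}"
        if variable in env:
            params[key] = env[variable]
    return params


def create_session(params: dict):
    """Open a Snowpark session from a dictionary of connection settings."""
    # Imported here so the module loads even where snowflake-snowpark-python
    # is not installed; the import name differs from the pip package name.
    from snowflake.snowpark import Session

    return Session.builder.configs(params).create()


# In the notebook (needs a reachable Snowflake account):
# session = create_session(load_params())
# snowpark_df = session.table("test_database.test_schema.df_clean")
```

Keeping the dictionary out of the notebook, whether in a separate file or in environment variables, means the credentials never appear on screen or in version control.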
It’s important to note that we’re not actually pulling the table into our local memory here – instead, it’s more like we’re creating a reference to it, which is great because this would still work even if the table were huge. So let’s run that:
snowpark_df = session.table("test_database.test_schema.df_clean")
Let’s take a look at this Snowpark dataframe using the show method:
snowpark_df.show(n=40)
Cool, this looks good! It has a month and a day, and the Neighborhood, which is the target variable and the thing we want to make a model to predict. And just to double check that the historical data looks right, let’s look at Month 1 Day 7 – that’s a 2, just like it should be! And Day 8 should be a 1, because, remember, the truck driver goes and visits her mother in Neighborhood 1 every week in January.
Okay, let’s use another Snowpark DataFrame method, “count,” to check that we do in fact have 7300 rows (which is 365 times 20):
snowpark_df.count()
Yep, we’re good. Let’s use the describe method to see that the ranges of values are what we’d expect:
snowpark_df.describe().show()
And sure enough, they are – 1 to 12 for months, 1 to 31 for days, and 1 to 8 for Neighborhoods. If we group by Neighborhood, we can see which neighborhoods get visited more or less by the food truck:
snowpark_df.group_by("Neighborhood").count().show()
Looks like Neighborhood 2 is the most popular because it gets visited so much each January. Makes sense. And neighborhood 8 gets all of December to itself, but is never visited any other time during the year, so it has the lowest count.
We’ve finally got our data all set up as Snowpark dataframes, so now comes the fun part: actually doing some Snowpark ML!
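For reference, the exploration steps above can be gathered into one small helper. This is only a convenience sketch: check_history is a hypothetical name, and it assumes it is handed a Snowpark DataFrame such as the one returned by session.table. The method calls inside it are the same ones used in the walkthrough.

```python
def check_history(df, expected_rows: int = 7300) -> bool:
    """Run the sanity checks from the walkthrough on a Snowpark DataFrame.

    Returns True when the row count matches expected_rows (365 * 20 by default).
    """
    # Preview the first 40 rows; only those rows are brought back to the client.
    df.show(n=40)

    # Count the rows; the work is done in Snowflake, not in local memory.
    row_count = df.count()

    # Summary statistics: expect 1-12 for months, 1-31 for days, 1-8 for neighborhoods.
    df.describe().show()

    # Visits per neighborhood.
    df.group_by("Neighborhood").count().show()

    return row_count == expected_rows


# In the notebook:
# check_history(snowpark_df)
```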

2. Let's practice!
