Snowpark Dataframes - Part II

1. Snowpark Dataframes - Part II

I once helped a friend move apartments. We spent two hours carefully wrapping furniture in blankets and loading it into a truck, very methodically, very organized, and then realized we hadn't actually figured out where we were going to put any of it when we arrived. The point of Part II is to make sure we don't do that with Snowpark. In Part I, we learned how to get data into DataFrames and transform it. Now, we figure out what to do with it when we're done. How do we aggregate it, write it back to Snowflake, and do all of this from our own laptops when we're not working inside a notebook?

Let's start with aggregations. The methods are `.group_by()` and `.agg()`, and they work closely together. The code on screen shows these functions in action. Notice we imported `sum` as `sum_` to avoid shadowing Python's built-in `sum` function. It's a small distinction, but a very important one. Everything here mirrors SQL. Group by a column, aggregate with functions, sort the results, all lazily evaluated until the `.show()` action fires.

A useful method for quick inspection is `.describe()`, which gives you summary statistics for numeric columns: count, mean, standard deviation, min, and max. It looks like this. It's a quick sense check on your numeric distributions without writing any aggregation logic yourself.

Now, let's move data in two directions. First, converting a Snowpark DataFrame to a pandas DataFrame using `.to_pandas()`, which you're likely familiar with if you've used pandas before. For those who are not familiar, that looks like this. This pulls the results of your Snowpark computation into local Python memory as a standard pandas DataFrame. From here, you can use any pandas-based tooling, including `matplotlib`, `seaborn`, and `scikit-learn`. The key thing to understand is that all the heavy lifting (the grouping, the aggregation, the sorting, and so on) happens in Snowflake before anything comes to your local machine.
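
The aggregation, `.describe()`, and `.to_pandas()` steps described so far can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the `ORDERS` table and its `REGION` and `AMOUNT` columns are hypothetical, and the function names are made up for illustration. It assumes you already have a Snowpark session, for example from `get_active_session()` in a Snowflake Notebook.

```python
def revenue_by_region(session, table_name="ORDERS"):
    """Build a lazy Snowpark DataFrame of revenue per region.

    ORDERS, REGION and AMOUNT are hypothetical names used for illustration.
    """
    # Alias Snowpark's sum as sum_ so Python's built-in sum() is not shadowed.
    # (Imported inside the function so this file loads without Snowpark installed;
    # in a notebook you would normally put the import at the top.)
    from snowflake.snowpark.functions import avg, col, sum as sum_

    orders = session.table(table_name)
    return (
        orders.group_by("REGION")
        .agg(
            sum_(col("AMOUNT")).alias("TOTAL_REVENUE"),
            avg(col("AMOUNT")).alias("AVG_ORDER_VALUE"),
        )
        .sort(col("TOTAL_REVENUE").desc())
    )  # nothing has run in Snowflake yet


def inspect_and_export(session, table_name="ORDERS"):
    """Show the summary, print quick statistics, and return a pandas DataFrame."""
    summary = revenue_by_region(session, table_name)
    summary.show()  # action: the query executes in Snowflake here

    # Quick sense check: count, mean, stddev, min, max.
    session.table(table_name).describe().show()

    # Only the aggregated rows are pulled into local memory.
    return summary.to_pandas()
```

In a Snowflake Notebook, calling `inspect_and_export(get_active_session())` would be one way to try this against a table of your own.
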
You're only pulling the final result set into pandas, not the raw underlying data.

Now, let's go the other direction and write a DataFrame back to Snowflake as a table. We use `.write.save_as_table(...)`. The `mode` parameter controls what happens if the table already exists. In our case, we're using `"overwrite"` to replace it. We could also use `"append"` to add to it, or `"ignore"` to skip the write if it already exists. Let's confirm our table landed correctly by showing the table like this. That's the core write pattern. Your transformation lives in Python, the execution happens in Snowflake, and the result lands in a Snowflake table, all without your data ever touching your local machine.

Now, let's talk about working with Snowpark outside of Snowflake Notebooks, from your own local development environment. Snowflake Notebooks are the easiest way to get started with Snowpark, and for a lot of use cases they're all you need. But sometimes you want to work locally in an IDE, a Jupyter Notebook on your machine, or in a CI/CD pipeline. For that, you need to create a session manually using `Session.builder`. First, install the package in a terminal environment using `pip install snowflake-snowpark-python`. Then, in your Python file, set up your connection parameters like this. We're pulling credentials from environment variables rather than hard-coding them into the file. That's the right practice. You should never put a password directly in source code. You can set environment variables in your terminal, in a `.env` file, or through whatever secrets management your team uses.

Once you have a session object, everything else is identical to what you write in a Notebook. `session.table()`, `.filter()`, `.select()`, `.group_by()`, and `.write.save_as_table()` are all the same API, working exactly the same way. We can see this if we execute this code in a local environment. The session is the only thing that changes between environments.
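
The write pattern and the manual session setup described above might look something like this. It is a sketch under assumptions rather than the code from the video: the environment variable names (`SNOWFLAKE_ACCOUNT` and so on), the `REGION_SUMMARY` table name, and the helper function names are all hypothetical.

```python
import os

# Hypothetical environment variable names; use whatever your team has agreed on.
REQUIRED_VARS = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
}
OPTIONAL_VARS = {
    "role": "SNOWFLAKE_ROLE",
    "warehouse": "SNOWFLAKE_WAREHOUSE",
    "database": "SNOWFLAKE_DATABASE",
    "schema": "SNOWFLAKE_SCHEMA",
}


def connection_parameters_from_env(env=None):
    """Build the dict passed to Session.builder.configs() from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    params = {key: env[var] for key, var in REQUIRED_VARS.items()}
    # Only pass optional settings that are actually set.
    params.update({key: env[var] for key, var in OPTIONAL_VARS.items() if env.get(var)})
    return params


def create_local_session():
    """Create a Snowpark session outside of Snowflake Notebooks."""
    # Imported here so the helper above can be used without Snowpark installed.
    from snowflake.snowpark import Session

    return Session.builder.configs(connection_parameters_from_env()).create()


def write_and_confirm(session, df, table_name="REGION_SUMMARY", mode="overwrite"):
    """Write df to a Snowflake table, then read it back for a quick check."""
    # mode: "overwrite" replaces the table, "append" adds rows,
    # "ignore" skips the write if the table already exists.
    df.write.save_as_table(table_name, mode=mode)
    confirmed = session.table(table_name)
    confirmed.show()
    return confirmed
```

Failing early when a required variable is missing is a design choice in this sketch; it keeps a misconfigured laptop or CI job from getting as far as a confusing connection error.
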
Everything downstream of it is portable, as you can see.

To recap across both videos:

1. Snowpark DataFrames are lazily evaluated, meaning nothing runs until you call an action like `.show()` or `.collect()`.
2. In Snowflake Notebooks, you get a session with a single `get_active_session()` call.
3. The core transformation methods are `.filter()`, `.select()`, `.group_by()`, `.agg()`, and `.sort()`, and they are all chainable.
4. `.to_pandas()` brings results to your local machine, after Snowflake has done the heavy lifting.
5. `.write.save_as_table()` with a `mode` argument writes your results back to Snowflake.
6. For local development, use `Session.builder.configs(...).create()` with credentials stored safely in environment variables.

Snowpark DataFrames are one of those tools that start feeling natural very quickly. Once you're thinking in terms of chained transformations rather than SQL strings, you'll find it opens up a much richer Python ecosystem alongside everything Snowflake already gives you.

2. Let's practice!
