Transforming Data with Expressions

1. Cleaning text data

Hi, I'm Liam. I'm an experienced data scientist

2. Meet your instructor

and Polars contributor. I'll be your guide to transforming data with Polars.

3. Transformation Engine

Polars is a powerful engine that takes your tabular data and transforms it in parallel.

4. Is this course for you?

This course requires familiarity with creating a Polars DataFrame, using a Polars expression, and doing group-by aggregations. If you are not familiar with these, I recommend taking the Introduction to Polars course first.

5. Chapter 1

In this course, we'll learn how to transform data with Polars. We'll start by working with text data and creating conditional expressions.

6. Chapter 2

Ever struggled with timestamps? Chapter two dives into time series data. We'll learn window expressions - powerful tools for running totals and moving averages.

7. Chapter 3

Real-world analysis rarely lives in a single table. Chapter three shows you how to combine DataFrames - joining and merging data from multiple sources.

8. Chapter 4

Finally, we put it all together, building custom transformation pipelines and exploring advanced analytics, such as correlation. Let's dive in.

9. Meet our dataset

We start by importing polars as pl and reading our CSV with restaurant hygiene inspections in London. The dataset includes the name, location, restaurant type, hygiene rating, and capacity.

10. Restaurant recommendation app

Our goal is to build a restaurant recommendation app for London. But to recommend clean restaurants, we need clean data first. Notice these issues: some business names have leading whitespace - this causes duplicates when filtering. The rating and capacity are floats, but should be integers. And see how Costa Coffee appears twice? Without a unique identifier combining name and location, we can't tell them apart. Let's fix these one by one.

11. Casting dtype with an expression

Let's start with those float columns. To replace a column with a transformed version, we use .with_columns(), which returns a new DataFrame.

12. Casting dtype with an expression

We create an expression on the rating column using pl.col, then chain .cast() with our target dtype of pl.Int64. Now the rating is stored as an integer.

13. Casting multiple columns

We need to cast multiple columns to an integer. While we can work with multiple columns using .with_columns(), a simpler approach is to use the .cast() method on a DataFrame.

14. Casting multiple columns

Inside .cast(), we pass a Python dictionary.

15. Casting multiple columns

We then specify that we want to transform all Float64 columns to Int64 columns and confirm this has worked.

16. Cleaning text data

With the dtypes fixed, let's tackle that whitespace issue. Some of the business names have whitespace at the start - we need to remove this so we can identify similar properties.

17. Cleaning text data

Polars has many expressions for working with text data in the .str namespace.

18. Cleaning text data

You can see the full set in the official Polars documentation. For our purposes, we need

19. Cleaning text data

the strip_chars_start expression to remove leading whitespace.

20. Cleaning text data

We call .with_columns() to transform an existing column,

21. Cleaning text data

create an expression on the business column

22. Cleaning text data

and apply the strip_chars_start expression to remove leading whitespace. Now we see that the names are consistently formatted.

23. Combining text data

Now for that identifier column. The dataset has businesses with the same name in different places, like Costa Coffee here.

24. Combining text data

We'd like to add a column that combines name and location to identify individual premises.

25. Combining text data

We use pl.concat_str to combine strings from different columns.

26. Combining text data

We pass the column names to combine - business and location in this case - separated by commas.

27. Combining text data

Then we provide a separator to place between the combined strings.

28. Combining text data

And we name the output column as id with the .alias() expression. This gives us our new identifier column.

29. Let's practice!

That was our introduction to transformations in Polars. Now, let's practice!
