1. Handling missing and duplicate values
We'll now learn
how to handle missing and duplicate values - two common challenges in real-world data analysis.
2. Missing values
In Polars, missing values appear as null, like in the first two rows of the "singles" column.
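To make the examples below concrete, here is a minimal sketch of a toy DataFrame shaped like the lesson's property data; the values themselves are made up:

```python
import polars as pl

# Hypothetical stand-in for the lesson's rental property dataset
df = pl.DataFrame({
    "name": ["Sea View", "City Loft", "Sea View", "Garden Flat", "City Loft"],
    "singles": [None, None, 2, 1, None],  # nulls mark missing bed counts
    "review": [9.2, 8.5, None, 8.0, 8.5],
})
```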
3. Counting null values
We can count nulls in each column
using the .null_count() method.
This gives a one-row DataFrame with the null count for each column.
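For example, on the toy DataFrame above:

```python
# One-row DataFrame: each column holds that column's null count
print(df.null_count())  # toy data: name 0, singles 3, review 1
```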
4. Finding rows with null values
We can filter for rows with missing values in a specific column
using the .is_null() expression.
The result shows four properties that have null values in the "singles" column.
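A sketch with the toy data:

```python
# Keep only rows where "singles" is missing
print(df.filter(pl.col("singles").is_null()))
```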
5. Dropping rows with null values
If we think rows with missing values indicate low-quality data, we can
remove them with the .drop_nulls() method, which removes any row with at least one null value.
Dropping the nulls here leaves us with 43 properties.
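For example:

```python
# Remove every row that contains at least one null, in any column
print(df.drop_nulls())  # only fully populated rows remain
```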
6. Dropping rows with nulls in specific columns
For more selective cleaning, we can drop rows with nulls only in specific columns
using the subset parameter of .drop_nulls().
Here, we drop any properties where the "singles" column is null.
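A sketch:

```python
# Drop rows only when "singles" is null; nulls in other columns are kept
print(df.drop_nulls(subset="singles"))
```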
7. Filling nulls with a value
Sometimes we replace missing values rather than removing them, because we know why the values are missing.
Here, we replace the missing "singles" values with 0, meaning no single beds.
We then see the filled values on the first two rows of the DataFrame.
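One way to do this, sketched on the toy data (the lesson's exact code may differ):

```python
# Replace missing "singles" values with 0 (no single beds)
filled = df.with_columns(pl.col("singles").fill_null(0))
```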
8. Filling nulls with an expression
We can also fill nulls
with expressions.
9. Filling nulls with an expression
Here, we fill nulls in the "review" column with the average review score of 8.99, since this is our best guess at what the review score would be.
We see this filled value in the third row of the "review" column.
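A sketch; note that the 8.99 average comes from the lesson's real dataset, so the toy data will produce a different mean:

```python
# Fill missing reviews with the column's mean score
filled = df.with_columns(pl.col("review").fill_null(pl.col("review").mean()))
```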
10. Finding duplicate rows
We might discover some properties appear multiple times. Let's identify these duplicate entries
by testing whether each row is an exact copy of any other row with the .is_duplicated() method.
This gives a Boolean Series that is True for any row that is duplicated.
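For example:

```python
# Boolean Series: True where a row is an exact copy of another row
print(df.is_duplicated())
```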
11. Finding duplicate rows
We can keep only the duplicated rows by
using the .is_duplicated() DataFrame method in a filter predicate.
Two properties appear twice with identical data. This is likely a data entry error rather than separate properties.
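A sketch with the toy data, where the two "City Loft" rows are identical:

```python
# Keep only the rows that appear more than once in full
print(df.filter(df.is_duplicated()))
```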
12. Finding duplicate rows using a specific column
We can also look for duplicates in a single column
with the .is_duplicated() expression. Here we find
that 18 rows have duplicate names. However, the other columns differ, suggesting these are likely distinct rentals on the same property.
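For example:

```python
# Rows whose "name" repeats, even if the other columns differ
print(df.filter(pl.col("name").is_duplicated()))
```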
13. Dropping duplicate rows
To clean our dataset, we can
use the .unique() method to keep only one copy of any duplicated rows.
When we run this, our DataFrame goes from 49 rows to 47 rows, confirming that we had two complete duplicate rows. Be aware that calling .unique() may
change the row order of the output DataFrame.
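For example:

```python
# Keep one copy of each fully duplicated row; output order is not guaranteed
print(df.unique())

# Pass maintain_order=True if the original row order matters
print(df.unique(maintain_order=True))
```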
14. Dropping duplicate rows using specific columns
To treat properties with the same name as duplicates,
we use the subset parameter of .unique().
This leaves us with 40 properties.
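A sketch:

```python
# Treat rows sharing a "name" as duplicates, keeping one of each
print(df.unique(subset="name"))
```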
15. Let's practice!
Now it's time to practice!