1. Handling missing and duplicate values
We'll now learn
how to handle missing and duplicate values - two common challenges in real-world data analysis.
2. Missing values
In Polars, missing values appear as null, like in the first two rows of the "singles" column.
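To make the examples below concrete, here is a minimal sketch of a toy DataFrame shaped like the lesson's property data; the values themselves are made up:

```python
import polars as pl

# Hypothetical stand-in for the lesson's rental property dataset
df = pl.DataFrame({
    "name": ["Sea View", "City Loft", "Sea View", "Garden Flat", "City Loft"],
    "singles": [None, None, 2, 1, None],  # nulls mark missing bed counts
    "review": [9.2, 8.5, None, 8.0, 8.5],
})
```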
3. Counting null values
We can count nulls in each column
using the .null_count() method.
This gives a one-row DataFrame with the null count for each column.
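For example, on the toy DataFrame above:

```python
# One-row DataFrame: each column holds that column's null count
print(df.null_count())  # toy data: name 0, singles 3, review 1
```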
4. Finding rows with null values
We can filter for rows with missing values in a specific column
using the .is_null() expression.
The result shows four properties that have null values in the "singles" column.
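A sketch with the toy data:

```python
# Keep only rows where "singles" is missing
print(df.filter(pl.col("singles").is_null()))
```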
5. Dropping rows with null values
If we think rows with missing values indicate low-quality data, we can
remove them with the .drop_nulls() method, which removes any row with at least one null value.
Dropping the nulls here leaves us with 43 properties.
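For example:

```python
# Remove every row that contains at least one null, in any column
print(df.drop_nulls())  # only fully populated rows remain
```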
6. Dropping rows with nulls in specific columns
For more selective cleaning, we can drop rows with nulls only in specific columns
using the subset parameter of .drop_nulls().
Here, we drop any properties where the "singles" column is null.
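A sketch:

```python
# Drop rows only when "singles" is null; nulls in other columns are kept
print(df.drop_nulls(subset="singles"))
```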
7. Filling nulls with a value
Sometimes we replace missing values rather than removing them, because we know why the values are missing.
Here, we replace the missing "singles" values with 0, meaning no single beds.
We then see the filled values on the first two rows of the DataFrame.
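One way to do this, sketched on the toy data (the lesson's exact code may differ):

```python
# Replace missing "singles" values with 0 (no single beds)
filled = df.with_columns(pl.col("singles").fill_null(0))
```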
8. Filling nulls with an expression
We can also fill nulls
with expressions.
9. Filling nulls with an expression
Here, we fill nulls in the "review" column with the average review score of 8.99, since this is our best guess at what the review score would be.
We see this filled value in the third row of the "review" column.
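A sketch; note that the 8.99 average comes from the lesson's real dataset, so the toy data will produce a different mean:

```python
# Fill missing reviews with the column's mean score
filled = df.with_columns(pl.col("review").fill_null(pl.col("review").mean()))
```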
10. Finding duplicate rows
We might discover some properties appear multiple times. Let's identify these duplicate entries
by testing whether each row is an exact copy of any other row with the .is_duplicated() method.
This gives a Boolean Series that is True for any row that is duplicated.
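For example:

```python
# Boolean Series: True where a row is an exact copy of another row
print(df.is_duplicated())
```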
11. Finding duplicate rows
We can keep only the duplicated rows by
using the .is_duplicated() DataFrame method in a filter predicate.
Two properties appear twice with identical data. This is likely a data entry error rather than separate properties.
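A sketch with the toy data, where the two "City Loft" rows are identical:

```python
# Keep only the rows that appear more than once in full
print(df.filter(df.is_duplicated()))
```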
12. Finding duplicate rows using a specific column
We can also look for duplicates in a single column
with the .is_duplicated() expression. Here we find
that 18 rows have duplicate names. However, the other columns differ, suggesting these are likely distinct rentals on the same property.
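For example:

```python
# Rows whose "name" repeats, even if the other columns differ
print(df.filter(pl.col("name").is_duplicated()))
```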
13. Dropping duplicate rows
To clean our dataset, we can
use the .unique() method to keep only one copy of any duplicated rows.
When we run this, our DataFrame goes from 49 rows to 47 rows, confirming that we had two complete duplicate rows. Be aware that calling .unique() may
change the row order of the output DataFrame.
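For example:

```python
# Keep one copy of each fully duplicated row; output order is not guaranteed
print(df.unique())

# Pass maintain_order=True if the original row order matters
print(df.unique(maintain_order=True))
```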
14. Dropping duplicate rows using specific columns
To treat properties with the same name as duplicates,
we use the subset parameter of .unique().
This leaves us with 40 properties.
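A sketch:

```python
# Treat rows sharing a "name" as duplicates, keeping one of each
print(df.unique(subset="name"))
```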
15. Let's practice!
Now it's time to practice!