1. Data range constraints
Hi and welcome back! In this lesson, we're going to discuss data that should fall within a range.
2. Motivation
Let's first start off with some motivation.
Imagine we have a dataset of movies with their respective average rating from a streaming service.
The rating can be any integer between 1 and 5.
3. Motivation
After creating a histogram with matplotlib, we see that there are a few movies with an average rating of 6, which is well above the allowable range.
This is most likely an error in data collection or parsing, where a variable ends up well beyond its range. Treating it is essential for accurate analysis.
4. Motivation
Here's another example, where we see subscription dates in the future for a service.
Inherently this doesn't make any sense, as we cannot sign up for a service in the future, but these errors exist either due to technical or human error.
We use the datetime package's dot-date-dot-today() function to get today's date, and we filter the dataset for any subscription date later than today's date.
We need to pay attention to the range of our data.
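As a rough sketch of the filter just described, assuming a hypothetical user_signups DataFrame whose subscription_date column already holds date objects (the sample rows are made up for illustration):

```python
import datetime as dt

import pandas as pd

# Today's date, from the datetime package.
today = dt.date.today()

# Hypothetical sample data; the values are invented for this sketch.
user_signups = pd.DataFrame({
    "subscription_date": [
        today - dt.timedelta(days=30),
        today + dt.timedelta(days=365),  # impossible: a sign-up in the future
        today - dt.timedelta(days=1),
    ]
})

# Keep only the rows whose subscription date is later than today.
future_signups = user_signups[user_signups["subscription_date"] > today]
print(future_signups)
```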
5. How to deal with out of range data?
There's a variety of options to deal with out of range data.
The simplest option is to drop the data. However, depending on the size of your out of range data, you could be losing out on essential information.
As a rule of thumb, only drop data when a small proportion of your dataset is affected by out of range values. Even then, you really need to understand your dataset before deciding to drop values.
Another option would be setting custom minimums or maximums to your columns.
We could also set the data to missing, and impute it, but we'll take a look at how to deal with missing data in Chapter 3.
We could also, depending on the business assumptions behind our data, assign a custom value to any values that go beyond a certain range.
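Dropping and capping are shown in the examples that follow. For the "set to missing" option, here is one possible sketch, assuming a hypothetical movies DataFrame with an avg_rating column and using the pandas mask method (one choice among several):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
movies = pd.DataFrame({
    "movie_name": ["A", "B", "C", "D"],
    "avg_rating": [4, 6, 3, 5],
})

# Replace any rating above 5 with a missing value (NaN).
# mask() returns a new Series, so we assign it back to the column.
movies["avg_rating"] = movies["avg_rating"].mask(movies["avg_rating"] > 5)

print(movies)
```

The missing values can then be imputed or handled with whatever missing-data strategy fits the dataset.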
6. Movie example
Let's take a look at the movies example mentioned earlier.
We first isolate the movies with ratings higher than 5.
Now if these values affect only a small part of our data, we can drop them. We can drop them in two ways. We can create a new filtered movies DataFrame where we only keep values of avg_rating lower than or equal to 5.
Or drop the values by using the drop method. The drop method takes in as argument the row indices of movies for which the avg_rating is higher than 5.
We set the inplace argument to True so that values are dropped in place and we don't have to create a new DataFrame.
We can make sure the drop worked using an assert statement that checks if the maximum of avg_rating is lower than or equal to 5.
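The two ways of dropping described above might look like this. The movies DataFrame and its rows are hypothetical; only the avg_rating column name comes from the lesson:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
movies = pd.DataFrame({
    "movie_name": ["A", "B", "C", "D"],
    "avg_rating": [4, 6, 3, 5],
})

# Isolate the movies with ratings higher than 5.
out_of_range = movies[movies["avg_rating"] > 5]
print(out_of_range)

# Option 1: build a new filtered DataFrame that keeps only valid ratings.
movies_filtered = movies[movies["avg_rating"] <= 5]

# Option 2: drop the offending rows by index, modifying movies in place.
movies.drop(movies[movies["avg_rating"] > 5].index, inplace=True)

# Verify that no rating above 5 remains.
assert movies["avg_rating"].max() <= 5
```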
7. Movie example
Depending on the assumptions behind our data, we can also change the out of range values to a hard limit.
For example, here we're setting any value of the avg_rating column to 5 if it goes beyond it.
We can do this using dot-loc, which returns all cells that fit a custom row and column index. It takes as first argument the row index, here all instances of avg_rating above 5, and as second argument the column index, here the avg_rating column.
Again, we can make sure that this change was done using an assert statement.
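A minimal sketch of capping with dot-loc, again on a hypothetical movies DataFrame:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
movies = pd.DataFrame({
    "movie_name": ["A", "B", "C", "D"],
    "avg_rating": [4, 6, 3, 5],
})

# Row index: every row where avg_rating is above 5.
# Column index: the avg_rating column. Matching cells are set to 5.
movies.loc[movies["avg_rating"] > 5, "avg_rating"] = 5

# Verify that the hard limit now holds.
assert movies["avg_rating"].max() <= 5
```

Unlike dropping, this keeps every row, so the number of movies is unchanged.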
8. Date range example
Let's take another look at the date range example mentioned earlier, where we had subscriptions happening in the future.
We first look at the data types of the column with the dot-dtypes attribute. We can confirm that the subscription_date column is an object and not a date or datetime object.
To compare a pandas object column to a date, the first step is to convert the column to dates as well. We do so by first converting it into a pandas datetime object with the to_datetime function from pandas, which takes in as an argument the column we want to convert.
We then need to convert the datetime object into a date. This conversion is done by appending dt-dot-date to the code.
Could we have converted from an object directly to a date, without the pandas datetime conversion in the middle? Yes! But we'd have had to provide information about the date's format as a string, so it's just as easy to do it this way.
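The two-step conversion can be sketched as follows. The user_signups name and the date strings are assumptions made for this example:

```python
import datetime as dt

import pandas as pd

# Hypothetical sample data: dates stored as plain strings.
user_signups = pd.DataFrame({
    "subscription_date": ["2019-01-05", "2018-11-20", "2035-03-01"],
})

# Inspect the column types before converting.
print(user_signups.dtypes)

# Step 1: parse the strings into pandas datetimes with to_datetime.
# Step 2: append .dt.date to turn each datetime into a plain date.
user_signups["subscription_date"] = pd.to_datetime(
    user_signups["subscription_date"]
).dt.date

# Each value is now a datetime.date and can be compared to other dates.
first_value = user_signups["subscription_date"].iloc[0]
print(type(first_value), first_value)
```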
9. Date range example
Now that the column is a date, we can treat it in a variety of ways.
We first create a today_date variable using the datetime function date-dot-today, which allows us to store today's date.
We can then either drop the rows with exceeding dates similar to how we did in the average rating example, or replace exceeding values with today's date.
In both cases we can use the assert statement to verify our treatment went well, by comparing the maximum value in the subscription_date column. However, make sure to chain it with the dot-date method to return a date instead of a timestamp.
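Both treatments might look like the sketch below, on hypothetical data. Note one assumption: here the column holds plain date objects, so max() already returns a date. Chaining dot-date is what you would do when the maximum comes back as a pandas timestamp, which the code checks for explicitly.

```python
import datetime as dt

import pandas as pd

# Store today's date.
today_date = dt.date.today()

# Hypothetical sample data: one valid sign-up and one in the future.
user_signups = pd.DataFrame({
    "subscription_date": [
        today_date - dt.timedelta(days=10),
        today_date + dt.timedelta(days=400),
    ]
})

# Option 1: drop rows with dates after today by filtering.
signups_dropped = user_signups[user_signups["subscription_date"] <= today_date]

# Option 2: replace dates after today with today's date.
future = user_signups["subscription_date"] > today_date
user_signups.loc[future, "subscription_date"] = today_date

# Verify the treatment; convert to a date first if max() is a timestamp.
latest = user_signups["subscription_date"].max()
if isinstance(latest, pd.Timestamp):
    latest = latest.date()
assert latest <= today_date
```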
10. Let's practice!
Now that you know all about ranges, let's practice!