Range constraints

1. Range constraints

Now that we've discussed data type constraints, let's talk about another type of constraint: range constraints.

2. What's an out of range value?

Many variables have a range that all of the data points can reasonably be expected to fall within. For example, the SAT, a college entrance test in the US, is scored between 400 and 1600, so there shouldn't be any scores below 400 or above 1600. Other examples are the weight of a package, which can't be negative, or resting heart rate, which is expected to be between 60 and 100 beats per minute in most adults. Since we know what these reasonable ranges are, we know that something is off if we see an SAT score of 2000 or a weight of -5 pounds in our dataset.

3. Finding out of range values

Let's say we're given some data containing movie ratings. Movies are rated using a five-star system, so all ratings should fall between 0 and 5.

4. Finding out of range values

To see if there's any data that's clearly out of range, we can create a histogram. We'll create a vector called breaks, which contains the minimum rating, then 0, the bottom of the expected range, then 5, the top of the expected range, then the maximum rating. We'll create a histogram of avg_rating using geom_histogram, setting the breaks argument to the breaks vector we just created. We end up with a histogram broken up into 3 groups: too low, in range, and too high. We can easily see now that there's one value below 0 and two values above 5 in our dataset.
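As a rough sketch of this step, the code below builds the breaks vector and passes it to geom_histogram. The movies data frame and its values are made up for illustration (one rating below 0, two above 5); only the avg_rating column name comes from the lesson.

```r
library(ggplot2)

# Hypothetical sample data: one rating below 0 and two above 5
movies <- data.frame(
  avg_rating = c(-1.2, 0.8, 2.5, 3.3, 4.1, 5.0, 5.6, 6.3)
)

# Bin edges: minimum rating, bottom of range, top of range, maximum rating
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))

# Histogram with three bins: too low, in range, too high
p <- ggplot(movies, aes(x = avg_rating)) +
  geom_histogram(breaks = breaks)

# The same binning as a table of counts, handy when no plot device is open
bin_counts <- table(cut(movies$avg_rating, breaks = breaks, include.lowest = TRUE))
print(bin_counts)
```

Calling print(p) draws the plot. The table of counts is an extra added here to show the bin sizes as numbers; it is not part of the lesson's approach.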

5. Finding out of range values

We can also use the assert_all_are_in_closed_range function, which takes in a lower value and an upper value. It will give an error if anything falls outside of the specified range.
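A minimal sketch of this check follows, assuming a made-up vector of ratings. Because the assertive package may not be installed everywhere, the call is guarded with requireNamespace and wrapped in tryCatch so the script keeps running. The base R comparison above it is a stand-in added for illustration.

```r
# Hypothetical sample ratings; 5.6 is above the expected 0-5 range
avg_rating <- c(2.5, 4.1, 5.0, 5.6)

# Base R check of the same closed interval [0, 5]
in_range <- avg_rating >= 0 & avg_rating <= 5
print(avg_rating[!in_range])

# With assertive installed, the assertion stops with an error on failure;
# tryCatch turns that error into a message so the script can continue
if (requireNamespace("assertive", quietly = TRUE)) {
  result <- tryCatch(
    assertive::assert_all_are_in_closed_range(avg_rating, lower = 0, upper = 5),
    error = function(e) "out-of-range value found"
  )
  print(result)
}
```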

6. Handling out of range values

Once we identify that there are values out of range, how do we deal with them? We could remove those data points completely, but this should only be done when a small proportion of the values are out of range. Otherwise, we would significantly increase the amount of bias in our dataset. We could also treat each out of range value as missing by replacing it with `NA`. This allows us to use different imputation techniques for missing data, which we'll discuss in more detail later in the course. We can also replace out of range values with the range limit. For example, if we know our ratings should fall between 0 and 5 and there's a value of 6, we can replace the 6 with 5 so that it's in range. Finally, we can replace the values with some other number based on our knowledge of the dataset. For example, we could replace them with the average rating of all movies.
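The last option, replacing with an average, could look something like the sketch below. The data values and the out_of_range and rating_mean column names are hypothetical, and computing the average from in-range ratings only is an assumption made here so that bad values don't distort the replacement.

```r
library(dplyr)

# Hypothetical sample data with one rating below 0 and one above 5
movies <- data.frame(avg_rating = c(-1.2, 2.5, 4.0, 5.0, 6.3))

movies <- movies %>%
  mutate(
    # Flag ratings outside the expected 0-5 range
    out_of_range = avg_rating < 0 | avg_rating > 5,
    # Swap flagged ratings for the mean of the in-range ratings (an assumption)
    rating_mean = replace(avg_rating, out_of_range, mean(avg_rating[!out_of_range]))
  )

print(movies)
```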

7. Removing rows

To remove out of range rows, we can use the filter function to get all the rows with values that fall into the range. In this case, we want all the ratings greater than or equal to 0 and less than or equal to 5, which will eliminate the rows that have an out of range rating. If we create a histogram of avg_rating now, we can see that there are no more out of range values left in the dataset.
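Here is one way the filtering step might be written with dplyr. The titles and ratings are invented for illustration, and movies_in_range is a hypothetical name for the result.

```r
library(dplyr)

# Hypothetical sample data: three ratings fall outside 0-5
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"),
  avg_rating = c(-1.2, 2.5, 4.1, 5.0, 5.6, 6.3)
)

# Keep only rows whose rating is between 0 and 5, inclusive
movies_in_range <- movies %>%
  filter(avg_rating >= 0, avg_rating <= 5)

print(movies_in_range)
```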

8. Treat as missing

To treat values as missing, we'll need to replace all the out of range values with `NA`. We can do this using the replace function, which takes in the column you want to replace values in, the condition that should be met for a replacement to happen, and what the replacement should be. Here, we create a new column called rating_miss, replacing values of the avg_rating column that are greater than 5 with `NA`.
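A short sketch of this replacement, using made-up ratings, might look like the following. The rating_miss column name comes from the lesson; the data values are assumptions.

```r
library(dplyr)

# Hypothetical sample data: two ratings are above 5
movies <- data.frame(avg_rating = c(2.5, 4.1, 5.0, 5.6, 6.3))

# replace(x, condition, value): ratings above 5 become NA in a new column
movies <- movies %>%
  mutate(rating_miss = replace(avg_rating, avg_rating > 5, NA))

print(movies)
```

Writing to a new column rather than overwriting avg_rating keeps the original values available for comparison.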

9. Replacing out of range values

We can also use the replace function to replace out of range values with the range limit. Here, we replace all the values of avg_rating that are greater than 5 with 5.
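Capping at the range limit can be sketched the same way. The data values and the rating_const column name are hypothetical choices for this example.

```r
library(dplyr)

# Hypothetical sample data: two ratings are above 5
movies <- data.frame(avg_rating = c(2.5, 4.1, 5.0, 5.6, 6.3))

# Ratings above 5 are set to the upper limit, 5
movies <- movies %>%
  mutate(rating_const = replace(avg_rating, avg_rating > 5, 5))

print(movies)
```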

10. Date range constraints

Dates can also be out of range. A common scenario is when data contains dates in the future. In our movies data, all of the dates should be in the past, since it's not possible for us to have a movie rating for a movie that no one has seen yet. We can use the assert_all_are_in_past function from the assertive package to check for future dates, and it looks like we have one. We can take a closer look at this row by filtering for date_recorded greater than today's date. Just like numbers, dates can be compared using the greater than (>), less than (<), and equality (==) operators. The today function from lubridate will get the current date.
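The date check could be sketched as below. The sample rows are invented, with dates built relative to today() so that exactly one is always in the future. As before, the assertive call is guarded in case the package isn't installed.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data: the last date is one year in the future
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  date_recorded = c(today() - 400, today() - 30, today() + 365)
)

# With assertive installed, the assertion errors because of the future date
if (requireNamespace("assertive", quietly = TRUE)) {
  result <- tryCatch(
    assertive::assert_all_are_in_past(movies$date_recorded),
    error = function(e) "future date found"
  )
  print(result)
}

# Take a closer look at the rows recorded after today's date
future_rows <- movies %>%
  filter(date_recorded > today())

print(future_rows)
```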

11. Removing out-of-range dates

We can remove the rows with future dates using filter as well, but this time, we filter for date_recorded less than or equal to today. When we use assert_all_are_in_past now, nothing is returned, so we know that our range constraints have been met.
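A minimal sketch of removing the future dates, again on made-up rows, might be written like this. The movies_past name is hypothetical, and the base R all() check is a stand-in for rerunning assert_all_are_in_past.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data: the last date is one year in the future
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  date_recorded = c(today() - 400, today() - 30, today() + 365)
)

# Keep only rows recorded on or before today's date
movies_past <- movies %>%
  filter(date_recorded <= today())

# Confirm that no future dates remain
print(all(movies_past$date_recorded <= today()))
```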

12. Let's practice!

Now it's time to practice wrangling your data with ranges.