Out of range values and inaccurate data
1. Out of range values and inaccurate data
Welcome back. In this lesson, we will learn how to select and exclude out of range values and inaccurate data.2. Dataset - series and episodes
Throughout this chapter, we will be working on a new dataset. It contains information about series3. Dataset - series and episodes
and episodes.4. Out of range values
Let's start defining out of range values. Out of range values are values that are outside the expected range of valid data. For example, a person who is 400 inches tall. If we don't detect this kind of dirty data before analyzing, it can disrupt our results. If we have the chance, we can investigate why these data are outside the expected values and correct it. We will illustrate out of range values with an example.5. Out of range values - example
In our series table, there is a column named rating. It stores the ratings given by the audience of the series on a scale of 0 to 10. Unfortunately, in this table, there aren't any constraints that checks this range. Because of that,6. Out of range values - example
some values are out of this range, due to input errors.7. Out of range values - detecting the values
If we want to detect which values are out of this range, we can use the BETWEEN logical operator in its negation. In this example, we are selecting those rows which ratings are not between the expected range. As we can see in the output, the first series has a rating of 11,4, and the second one has a rating of 11. Similarly, we can select these rows by applying the "less than" and "greater than" operators; it is, those rows which ratings are strictly less than 0 or strictly greater than 10.8. Out of range values - excluding the values
If we choose to exclude from our results the rows that have out range values, we can do it by using the BETWEEN logical operator in its positive way. In this example, we are selecting those rows which ratings are between the expected range.9. Out of range values - excluding the values
We can also select these rows by applying the "greater than", "equals", and "less than" operators; it is, those rows whose ratings are greater than or equal to 0 and less than or equal to 10.10. Inaccurate data
Another kind of messy data is inaccurate data. Inaccurate data happens when two or more values are contradictory. As with out of range values, inaccurate data can lead us to wrong results. If we have the chance, we can also investigate why these data are inaccurate and correct them.11. Inaccurate data - example
Let's illustrate inaccurate data with an example. If we look at the series table, we can see that there is a column named premiered, that stores the date on which each series was premiered. On the other hand, in the episodes table, there is a column named airdate, that stores the date when each episode was aired. It should be logical that the airdates of the episodes have the same date or later than the premiered date of the series.12. Inaccurate data - example
However, as we can see, there are some episodes with airdates before the premiered date of the series. These dates are not accurate. Let's learn how to detect them.13. Inaccurate data - detecting the values
To detect these inaccurate values, we can join both tables, series and episodes, by the series identifier, applying the condition that the airdate of the episodes is strictly before the premiered date of the series. This comparison can be done with the logical operator "less than".14. Inaccurate data - excluding the values
If we choose to exclude from our results the rows that have inaccurate data, we can do it applying the condition that the airdate of the episodes is later or equals to the premiered date of the series.15. Let's practice!
Let's practice what we have learned!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.