Get startedGet started for free

Detecting invalid values

1. Detecting invalid values

In this lesson, we will focus on the detection of invalid values that we might encounter in our datasets.

2. Invalid data values

One way in which data may need cleaning occurs when table columns contain values that are not valid for the attribute that the column represents. For example, our restaurant inspection data includes a score for certain types of inspections. Consider a scenario in which the score should be an integer with a value between 0 and 100. Encountering a score of "A" in the last record of the score column would indicate an invalid value.

3. Handling invalid data with pattern matching

One option for detecting this invalid value is to use pattern matching. The query displayed here identifies records that include scores with non-digit values. Recall that "\d" is the special character sequence that represents a digit. The "+" matches the preceding character 1 or more times. This query returns any records containing non-digit characters in the score column.

4. Handling invalid data with pattern matching

This query allows scores that are greater than 100 because the pattern only matches scores containing digits. A better query for detecting invalid scores restricts the score field to digits while also restricting the number of digits. The query displayed here identifies scores with more than 3 digits. A pattern with curly braces surrounding an integer matches the preceding character exactly once, twice, or three times in the respective conditions. This query would return a record containing a score with 4 digits. However, a score of 175 would not be included in the query results even though it is an invalid score.

5. Using type constraints

By now, you may have noticed the real problem with this column. We are attempting to restrict scores to be integer values using a string column. The most appropriate approach to avoid invalid scores is to disallow non-integer values. This preventative approach was mentioned previously. The score column can be changed from a text-based column to an integer column with an ALTER TABLE statement. The ALTER TABLE command is followed by the table (restaurant_inspection) to be modified. The ALTER COLUMN statement specifies that the score column's TYPE is to be changed to SMALLINT. This command adds a type constraint to the score column which restricts the score to a smaller range of integer values. The USING clause specifies how to convert existing column values to the SMALLINT type. You might notice that this range of values still allows invalid scores.

6. Review: Basics of Regular Expressions

The table shown here provides a review of some common regular expression metacharacters that you may encounter. This list will prove useful in the exercises that follow.

7. Type constraints enable range constraints

A more appropriate type constraint (such as SMALLINT) enables the usage of range constraints to detect invalid scores. The displayed query can detect records with negative scores.

8. Type constraints enable range constraints

Alternatively, the same condition using the less-than-or-equal-to operator can be used. This query identifies records with negative score values as well.

9. Type constraints enable range constraints

Another restriction is that scores should not be greater than 100. The condition in this query will return records that are less than 0 or greater than 100.

10. Type constraints enable range constraints

We can alternatively use the greater-than-or-equal-to operator to get the same set of results.

11. The BETWEEN operator

This bounded range is so commonly used that an operator exists for expressing it more naturally. The condition "NOT BETWEEN 0 AND 100" reflects the condition's exact purpose which is to return records for score values outside of the 0 to 100 range. Pay close attention to the usage of the NOT operator. We use this operator because we want the results to contain values outside of the range. Recall that the BETWEEN operator range is inclusive. A subset of the records returned by this query is displayed here.

12. Let's practice!

Now that you have seen how to combine type and range constraints to detect invalid values, let's put these tools to work.

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.