Get startedGet started for free

Pattern matching

1. Pattern matching

Welcome back! In the next section of this chapter, we will utilize pattern matching to clean string data.

2. Identifying patterns: an example

The example data here represent New York City restaurant inspection results. Observe the NTA field consisting of two letters representing a New York City borough followed by two digits. Neighborhood Tabulation Areas (or NTAs) are represented in this format. NTAs were created to project populations in small areas in New York City.

3. Identifying patterns: an example

Valid two-letter codes for the prefix of the NTA code are displayed here. Imagine a manual data entry process which may lead to invalid values entering the table.

4. Identifying patterns: an example

In this data sample, the second record contains a duplicated 4 in the nta column. We would like to be able to identify such erroneous values.

5. The LIKE operator

Before confronting this particular data cleaning task, let's consider matching simple patterns in PostgreSQL. One option for matching patterns in PostgreSQL statements utilizes the LIKE operator. This example SELECT matches the problematic text value exactly. However, this SELECT statement provides no benefit beyond what can be accomplished using a WHERE clause with an equality condition.

6. The LIKE Operator

The usefulness of the LIKE operator comes from combining LIKE with pattern matching metacharacters. The percent metacharacter matches any sequence of zero or more characters. The underscore metacharacter matches a single character. This SELECT statement returns all records containing an NTA value beginning with "QN". If we wanted to find all records with an NTA value that does not end with two digits, we can match a second pattern using two underscores and add NOT in front of the LIKE operator. This produces all records with NTA values that violate the 2 character specification after "QN".

7. Regular expressions (REs)

However, the power of pattern matching in PostgreSQL extends beyond what we have just covered. It might be useful to ensure that only digits appeared in the last two positions of the NTA code or only uppercase letters are allowed in the first two positions. PostgreSQL provides the ability to use regular expressions that provide an expanded set of pattern matching functionality.

8. The SIMILAR TO operator

The SIMILAR TO operator is required when using regular expressions in PostgreSQL. SIMILAR TO also allows for the use of the metacharacters available to the LIKE operator. Therefore, the SELECT statement that we saw before can alternatively be written using SIMILAR TO.

9. Basics of REs

However, using SIMILAR TO, we gain new pattern-matching abilities beyond what is available using LIKE. Matching a digit is accomplished by placing a backslash character in front of the letter "d". This example matches 3 consecutive digits. The question mark character matches the preceding character 1 time or not at all. Therefore, the character preceding a question mark is optional in the pattern. The example displayed here matches when the pattern contains a single "x" character followed by any digit. It also matches a single "x" character on its own. The plus sign matches one or more occurrences of the preceding character. This example matches one or more digits. The asterisk matches any character 0 or more times. This example pattern matches 0 or more digits. A group of characters can be matched using square brackets. Within these brackets, a dash can be used to specify a range. This example matches any single lowercase letter.

10. Using REs with SIMILAR TO

Recall the previously composed SELECT statement. This statement can be re-written using more specific pattern matching. The new pattern is interpreted as matching an NTA code with an uppercase letter, followed by a second uppercase letter, followed by 2 digits. The use of NOT SIMILAR TO ensures that records are returned that do not follow this pattern. The NTA codes shown here would be matched by this query.

11. Let's practice!

It may seem that allowing any capital letter to occupy the first two positions in the NTA code is a bit too permissive as NTA codes only use 5 2-letter combinations. This intuition is correct. We will address that concern in a subsequent exercise.

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.