1. Matching similar strings
In the last lesson of this chapter, we saw how pattern matching can be used to identify messy data using flexible patterns. Now, we will look to add string similarity approaches to our data cleaning toolbox.
2. Similar strings (example)
We often gather information by listening to someone speak and writing down what we heard.
Consider the case where someone places a food order over the phone and provides a delivery address. Imagine the person ordering indicates that the delivery address is 121 Fontainebleau Drive.
Often times, mistakes will be made when recording the spoken information. Consider these spellings of the word "Fontainebleau". While we may not record such information with 100% accuracy, the record is often sufficiently accurate to make such data useful.
3. The Soundex algorithm
This is the idea behind the Soundex algorithm. This algorithm was first developed in the early 1900s for use in analyzing U.S. Census data.
While we won't get into all of the details of the algorithm, think of the algorithm as a way to encode (using 4 characters) the sound of a group of words for comparison purposes.
The same Soundex encoding value, F535, is returned for each of the 3 misspelled versions of Fontainebleau. If we compare the Soundex values of these strings, they are equal because they sound similar despite spelling differences.
4. SOUNDEX() in PostgreSQL
While the Soundex algorithm is not available by default in PostgreSQL, it is part of a special module named fuzzystrmatch which enables fuzzy string matching in PostgreSQL.
This module can be made available by issuing the CREATE EXTENSION command displayed here. Issuing this command enables SOUNDEX() and other fuzzy string matching functions.
Using the SOUNDEX() function with an input string as an argument returns a 4-character Soundex code.
As demonstrated here, all of the spellings of Fountainebleau produce the same 4-character Soundex code.
5. The DIFFERENCE() function
The SOUNDEX() function produces Soundex codes that can be compared. The DIFFERENCE() function enables this common use-case.
The DIFFERENCE() function accepts two string arguments and outputs the number of positions in which their Soundex codes match. Given that a Soundex code is always 4 characters in length, the result of a call to the DIFFERENCE() function outputs an integer between 0 and 4.
The strings "pair" and "pear" have the same sound and, therefore, the same Soundex code.
As a result, the DIFFERENCE() value for these two strings is 4.
6. The DIFFERENCE() function
In the case of "bout" and "bought", the Soundex codes differ in the second and third positions.
The DIFFERENCE() value for these two strings is 2.
7. Using DIFFERENCE()
Imagine a situation where you would like to group restaurant inspection results by street location.
The results here show several inspections for restaurants on Frederick Douglass Boulevard in Manhattan. These results show an inconsistent naming of the street. Think for a moment about the pattern matching gymnastics that might need to be employed to match the street name for Mess Hall to recognize that it represents the same street as the other restaurants on the list.
8. Using DIFFERENCE()
However, the Soundex code for these records is all the same.
In this case, a query filtering on a DIFFERENCE() value of 4 would return all records representing the location "Frederick Douglass Boulevard."
9. Updating the recordings
From here, we unify these disparate street values on a single "clean" value (such as "Frederick Douglass Boulevard").
This can be done with an UPDATE statement. The UPDATE statement requires specifying a table to update, the column and value to update, and a WHERE condition specifying the records to update.
The UPDATE statement displayed here unifies these street values. The result from executing the UPDATE statement indicates that 10 records in the table were updated for this dataset.
10. Let's practice!
Now that you have seen how string similarity can be used to compare strings in PostgreSQL, let's practice using this approach to clean some messy data.