Comparing names with SOUNDEX()
Messy strings like 'Ilynois'
instead of 'Illinois'
can cause problems when analyzing data. That is why it is important to detect them.
When analyzing the flight_statistics
table, you realize that some statistician_name
and statistician_surname
are written in a different way, such as Miriam Smith
and Myriam Smyth
. You are afraid there are more differences like this, so you want to check all these names.
You think about comparing with SOUNDEX()
the names of the statisticians. If the result of SOUNDEX()
is the same, but the texts you are comparing are different, you will find the data you need to clean.
This exercise is part of the course
Cleaning Data in SQL Server Databases
Exercise instructions
- Select the distinct values of
statistician_name
andstatistician_surname
columns fromS1
. - Inner join the
flight_statistics
table asS2
on similar-sounding first names and surnames usingSOUNDEX()
. - Filter out values where the
statistician_name
andstatistician_surname
columns are different from each other inS1
andS2
, respectively.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
SELECT
-- First name and surname of the statisticians
DISTINCT S1.___, S1.___
-- Join flight_statistics with itself
FROM ___ S1 INNER JOIN ___ S2
-- The SOUNDEX result of the first name and surname have to be the same
ON ___(S1.___) = ___(S2.___)
AND ___(S1.___) = ___(S2.___)
-- The texts of the first name or the texts of the surname have to be different
WHERE S1.___ <> S2.___
OR S1.___ <> S2.___