1. Introduction to Cleaning Data
Hello, and welcome to the course!
This is Miriam Antona. I am a software engineer, and I will be teaching you how to clean data in a SQL Server database.
2. Topics covered
In chapter 1, we will discover why cleaning data is important. We will start solving simple problems such as filling numbers with leading zeros, removing trailing or leading spaces in strings, or unifying strings.
In chapter 2, we will learn how to deal with null values, avoid duplicate data in our results, and clean messy dates when they aren't in the right format.
In chapter 3, we will work with out of range values, different data types, and pattern matching.
In the last chapter, chapter 4, we will combine data of some columns into one column, split data of one column into more columns, and transform rows into columns and vice versa.
3. Dataset: Monthly airline flights by USA airports 2014-2015
Throughout chapter 1 and chapter 2, we will be working with a dataset based on monthly airline flights in USA airports, registered in the years 2014 and 2015.
It contains information about airports,
4. Dataset: Monthly airline flights by USA airports 2014-2015
carriers,
5. Dataset: Monthly airline flights by USA airports 2014-2015
flight statistics,
6. Dataset: Monthly airline flights by USA airports 2014-2015
and pilots.
7. Why is cleaning data important?
So, why is cleaning data important?
It is pretty common to acquire data that is not ready for analysis, also known as messy or dirty data. This data must be cleaned before performing any analysis because it is not easy to understand and analyze.
Data scientists and data analysts devote the majority of their time cleaning data compared with the time they spend analyzing it.
Once the cleaning process is done, the information can clearly be read. Let's see an example where we can find messy data.
8. Filling numbers with leading zeros
If we explore the content of the flight_statistics table, we can see that we have different formats for the registrations codes. Some of the registration codes have leading zeros, but some of them don't.
9. Filling numbers with leading zeros
Let's suppose that a valid registration code should have leading zeros until complete nine digits.
In those registration codes without leading zeros, we will have to add as many zeros as we need. Let's solve this problem in two different ways.
10. Filling numbers with leading zeros - Using REPLICATE and LEN
The first way is using the REPLICATE function.
REPLICATE repeats a given string a specified number of times.
11. Filling numbers with leading zeros - Using REPLICATE and LEN
In our example, the string we need to replicate is the 0 value, and the number of times can vary depending on the length of every registration_code. As valid codes must have nine characters,
12. Filling numbers with leading zeros - Using REPLICATE and LEN
we can subtract nine to the length of every registration_code. We can get the length using the LEN function.
For example, for the registration_code 120, the length is 3. So if we subtract 9 to this length, we are ordering to replicate the zero value 6 times. Once we know how many zeros we need, we have to concatenate them with the registration_code.
13. Filling numbers with leading zeros - Using REPLICATE, LEN
We can use the addition arithmetic operator to do it, or if we are working with SQL Server 2012 or more recent versions,
use the CONCAT function that returns the same result.
14. Filling numbers with leading zeros - Using REPLICATE, LEN, and CONCAT
In both cases, this will be the output. Notice that now every registration_code has the proper format.
15. Filling numbers with leading zeros - Using FORMAT
We can also fill with leading zeros using the FORMAT function. FORMAT returns a value formatted with the specified format and culture. The culture argument is optional and represents different language codes and formatting based on country or culture.
FORMAT is available since SQL Server 2012.
The value can be numeric or date and time. We will not be using the culture parameter for the moment.
In this example, the value for the FORMAT function is the registration_code cast as an integer. That is because the column registration_code is varchar, and as we said, we need it to be a numeric value. The format pattern we want is to complete with zeros until having nine digits.
16. Filling numbers with leading zeros - Using FORMAT
Again, this will be the output.
17. Let's practice!
Now it is time to practice!