Introduction to data cleaning
1. Introduction to data cleaning
Welcome to the course! My name is Darryl and I will be accompanying you on the journey into cleaning data in PostgreSQL databases.2. Why is cleaning data important?
While our world is flooded with data, very little of it arrives ready for analysis. It is often messy due to errors, inconsistent formatting, incompleteness, and for other reasons. To analyze data, cleaning the data is an extremely important first step. One of the benefits of managing data in a database is that constraints can be placed on the type of data stored in a column. When an attempt to store data that does not meet the restrictions on the allowed values (such as attempting to store character values in an integer column), the database software will complain. This is a great defensive mechanism for forcing data to be cleaned before storage. However, there are cases where such defensive approaches are not available. This could be the case because appropriate restrictions were never applied when originally importing data into the database. Alternatively, the data values may be valid but the usage of the data requires it to be in a different format. An example might be a program using the database as a client which requires a year to be specified at the beginning of a date column value while records in the database contain years at the end of the value.3. Cleaning string data
This course will provide you with a toolbox for cleaning, converting, massaging, transforming, and re-formatting data to prepare it to be used in the way that it is intended. Let's start with a few motivating examples. String data is abundant due to the amount of text that we use. It is an extremely flexible format for representing data. However, this flexibility can often lead this data to be especially messy. Take these records for example. They come from restaurant inspection records from New York City. There are at least 3 aspects of this data to clean.4. Cleaning string data
We want to display the name with only the first letter of each word capitalized. To conserve space, we'd like to remove a space on each side of the forward slash in the inspection_type column. And we'd like each census_tract value to have a uniform length for display purposes so we will pad the values with leading zeros.5. Cleaning string data
These changes would result in the subset of records appearing as displayed here. How can we accomplish these desired changes?6. Using the INITCAP() function
The INITCAP() function accepts one text value (typically in the form of a column name). This function outputs the input text value with the initial letter of each word capitalized. All other characters in the word are lowercase. The example shows a fully uppercase string modified to only include a capital letter for the initial letter of each word while all others are lowercase.7. Using the REPLACE() function
The REPLACE() function accepts an input string, a string to replace, and a replacement string. In this example, the word street in the address is replaced by the abbreviation "St".8. Using the LPAD() function
The LPAD() function accepts an input string, the total length of the output string, and an optional fill value used to achieve that length in the case that the input_string is shorter than the length value. In this example, 4 Xs are pre-pended to the input string to output a string of the desired length of 7.9. Building the string cleaning query
The SELECT statement displayed here uses the functions we just covered to clean our original data. The resulting records are output as desired.10. Let's practice!
Now, its your turn to clean data using the functions that you just learned in the exercises that follow.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.