Get startedGet started for free

Combining columns

1. Combining columns

In this chapter, we will focus on various transformations useful in data cleaning efforts. We will start with approaches to combine values stored in different columns.

2. Combining columns (an example)

Concatenation involves joining individual values end-to-end to create a single combined value. For example, the restaurant inspection dataset includes columns that are part of a mailing address as can be seen in these example records. To mail information to all restaurants where inspections have taken place, having the full mailing address would be useful. Given that this data is available in the restaurant inspection table, we can form mailing addresses with the format shown here. PostgreSQL's CONCAT() function can join mailing address components into a combined mailing address.

3. Joining values with CONCAT()

The CONCAT() function joins the string values of one or more input arguments returning a single string value. Calling the CONCAT() function with the arguments 'data', 'cleaning', 'is', and 'fun' returns a single string with the arguments joined end-to-end. Including a space between arguments requires adding the space as an argument to the CONCAT() function call as seen here. Let's now use CONCAT() to create mailing addresses from columns in the restaurant_inspection table.

4. Joining values with CONCAT()

Here we supply the name, building, street, boro, and zip_code columns as arguments to CONCAT(). We use the "E-backslash n" (E'\n') argument to indicate that the remaining arguments should be displayed on the next line. The CONCAT() argument list includes spaces where necessary for the resulting mailing address. Example results for the query include plus signs. These plus signs signify that the entry continues on the next line. This output achieves the goal of creating mailing addresses to send information to the restaurants. But not so fast. There is a problem with this approach.

5. Joining values with CONCAT()

In the previous example, the CONCAT() function call produces full addresses because the records displayed contain all of the necessary address component values. However, notice that the records displayed here are missing information (such as name and building) required for a complete mailing address.

6. Joining values with CONCAT()

Using CONCAT() with NULL arguments results in those NULL values being ignored when constructing the concatenated string value. Using the same query as before results in incomplete mailing addresses for records with NULL address component values. Ideally, we would like to identify and populate the table with the missing information. However, if that process is not feasible, we can ignore records with missing data for an initial mailing effort.

7. Joining values with ||

This can be accomplished with the double pipe (||) operator. The double pipe operator, like CONCAT(), joins the values that surround the double pipe. The behavior of the double pipe operator differs from CONCAT() in that NULL valued arguments cause the entire concatenation to result in a NULL value. We can re-write the mailing address query using the double pipe operator. Notice that the values in the concatenation are still present in the same order. This query will only produce complete mailing addresses.

8. Joining values with ||

Records with missing address components will have a NULL value in the mailing_address column when using the double pipe operator. An example can be seen in the results displayed here. The blank mailing address value indicates a record containing at least one NULL-valued address component column.

9. Let's practice!

Now that we have learned two approaches for concatenating strings in related columns, let's practice what we have learned.