Get Started

Splitting data with delimiters

1. Splitting data with delimiters

Let's now dive into splitting data containing a well-defined delimiter character or string.

2. Splitting data into columns

The inspection_type column of the restaurant inspection dataset has two components: the main inspection type and the subtype that provides more detail on the inspection. A forward slash with a space on each side (" / ") separates the main type from the subtype. This sequence of characters represents a delimiter because it determines the boundaries of (or delimits) one part of the string from another. If we would like to perform a summary analysis on the subtype (such as the number of inspections of each subtype), it is not straightforward. The inspection_type column combines these distinct values into a single value. Therefore, performing such an analysis requires separating the subtype from the main type.

3. Splitting data into columns

The goal here is to transform inspection_type from a single column into two separate columns with one being the main type and the other being the subtype. Fortunately, PostgreSQL provides a convenient function to achieve this outcome.

4. Splitting strings using SPLIT_PART()

The SPLIT_PART() function requires 3 arguments: a source string, the delimiter on which to split, and the index of the component to return from the split components. The delimiter to split inspection_type is a forward slash surrounded by spaces (' / '). Supplying 1 as the 3rd argument to the SPLIT_PART() function call returns the string "Cycle Inspection". If we instead want to return "Re-inspection", we supply 2 as the 3rd argument to the function call.

5. Splitting strings using SPLIT_PART()

SPLIT_PART() as used here produces columns for the main and sub inspection types. However, a column can also be split to create multiple rows.

6. Splitting data into rows

We will now cover an approach for performing this task. One attribute of the restaurant inspection dataset that we have yet to consider is the cuisine_description column. We might be interested in identifying an association between inspection outcome and cuisine type. However, the granularity of cuisine descriptions is inconsistent in this dataset. In this set of records, we see that while the "Chinese" and "Korean" cuisine descriptions provide the granularity that we desire, "Pizza/Italian" and "Bagels/Pretzels" present a composition of cuisines that we may be interested in separating for our analysis. Ideally, for aggregating historical inspection results by cuisine type, the cuisine_description would contain more granular descriptions as seen here.

7. Splitting data with REGEXP_SPLIT_TO_TABLE()

Fortunately, PostgreSQL provides a solution to this problem. The REGEXP_SPLIT_TO_TABLE() function splits string values in a source column using a regular expression pattern. As an example, using the forward-slash ('/') as the pattern, "Pizza/Italian" is split into two rows with the values "Pizza" and "Italian" returned by REGEXP_SPLIT_TO_TABLE(). Here the pattern argument represents a delimiter. When multiple columns are given in the query's SELECT list, a row will be generated with repeating values for the columns not transformed by REGEXP_SPLIT_TO_TABLE().

8. Splitting data with REGEXP_SPLIT_TO_TABLE()

As an example, the query displayed here splits the cuisine_description column values at the forward-slash character ("/"). This query produces the desired granular cuisine descriptions. Notice that the camis and name column values are repeated in multiple rows when cuisine_description is split into multiple values by the REGEXP_SPLIT_TO_TABLE() function call. From this representation of the data, the granular cuisine descriptions could be used in aggregation queries to answer questions such as how many pizza restaurants have received an A inspection grade.

9. Enumerating the resulting rows

We may have an interest in enumerating the rows resulting from the split values. Here we see that a column labeled cuisine_num has been added to the results we saw previously. This enumeration of rows with shared values might look familiar. In chapter 2, we used the ROW_NUMBER() function to enumerate groups of records. Recall that grouping in ROW_NUMBER() is controlled by the OVER() clause using PARTITION BY and ORDER BY clauses.

10. Enumerating the resulting rows

The SQL query generates the displayed records. The PARTITION BY clause defines how the records are grouped for enumeration. Each pairing of camis and name values restarts the numbering. The ORDER BY clause controls the display order of the column values. The asterisk ("*") ensures the inclusion of the subquery columns in the main query.

11. Let's practice!

Now that you can split column values using a delimiter, let's put that knowledge to work in some practice exercises.