Get startedGet started for free

Strategies for multiple transformations

1. Strategies for Multiple Transformations

You've learned several ways to transform character data. But what do you do when you need to use different transformations on different observations?

2. Multiple transformations

Here's an example of data where different delimiter characters were used to separate the major industry categories of Agriculture and Education, from the industry subcategories. Sometimes there's a colon, other times there's a pipe character or a dash. We can use the split_part function to separate the category into its two parts, but how can we apply different delimiters to different rows?

3. CASE WHEN

One option when you need to apply multiple transformations to subsets of the data that don't overlap is to use a CASE WHEN statement. We want to extract just the initial major category from the category column - the part before a delimiter. To do that, we have a case for each different delimiter: a colon followed by space, a dash surrounded by spaces, and a pipe surrounded by spaces. The last case here goes in the else clause. We use LIKE statements to select the rows with each type of delimiter, then apply the split_part function with the delimiter for those rows. We alias the result of the CASE WHEN statement as major_category. We can then use the major_category we extracted to group and aggregate the data. This allows us to get the number of businesses in each of the two major categories.

4. Recoding table

Another strategy for situations when you need to apply multiple transformations is to create a temporary table that maps messy values,like those in this fruit table, to recoded, clean, standardized values, like those on the right. You can then use the new recoded values in the standardized column by joining the fruit table to the temporary table. The values in the original column match those in the fav_fruit column.

5. Step 1: CREATE TEMP TABLE

The first step is to create a temporary table with two columns: original, containing the distinct values of fav_fruit and standardized, which will eventually contain the recoded values. We initially populate the standardized column with the original values.

6. Initial table

Here's what our recode table looks like initially.

7. Step 2: UPDATE values

The next step is to update the recode table using functions to clean up the values in the standardized column. The format of an update statement is UPDATE, then the name of the table, then a SET statement. You SET a column to have new values. Update statements can also include a WHERE clause to choose what rows to update. Without a WHERE clause, all of the rows will be updated.

8. Step 2: UPDATE values

Here, we need three update statements. In the first, we set the standardized value to be the lower case version of the original value, with spaces trimmed from both ends. In the second, we set the standardized value to banana only for rows that contained a double n. The third statement updates the standardized value by removing s's from the end with the trim function.

9. Resulting recode table

Now the recode table contains clean, standardized values.

10. Step 3: JOIN original and recode tables

The final step is to use the recode table by joining it to the original data. On the left, without the recoded data, we have many different values of apple and banana even when we group by fav_fruit. On the right, we join the messy fruit table to the recode table by matching the original values. We select and group by the new standardized values to get a much more useful summary of the data.

11. Recap

To recap, we first created a temporary table to store our standardized values, then updated that table to fix errors and clean up the data. Finally, we used joined the original data to the temporary table to use the clean, standardized values in the query instead of the original values.

12. Clean up the Evanston 311 data!

Temporary tables and CASE WHEN statements can also be useful in other situations. Time to practice using them to clean up the Evanston 311 data.

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.