Get startedGet started for free

Converting data types

1. Converting data types

In this chapter, we're going to switch gears a bit and focus on data cleaning & validation.

2. Chapter goal

Here is the visualization you will be building in this chapter. As you build out the base report, you may notice the data is quite messy. Data is rarely stored in an ideal way, so it is important to understand how to deal with this messy data. Data can be messy in several ways.

3. "Messy" data types

One type of data messiness comes from incorrect or inconsistent datatypes. Remember that a field can be stored as one of several data types. These include string types (such as varchar), numerical types (such as float or int), or other types such as date or boolean. Why are data types so important? In general, there are two issues with having inaccurate data types.

4. Issue 1: Type-specific functions

First, some functions only act on specific data types.

5. Issue 1: Type-specific functions

For example, if you try to run a numerical function, such as AVERAGE(), on a date field, it will result in an error.

6. Issue 1: Type-specific functions

Average can only be run on numerical data types. There are type-specific functions for other data types as well, such as UPPER() for string or DATE_DIFF() for dates.

7. Issue 2: Combining tables (JOIN)

Data types also must be consistent when combining tables.

8. Issue 2: Combining tables (JOIN)

You cannot run a join on columns of different data types.

9. Issue 2: Combining tables (JOIN)

Instead, data types of the columns being joined must be the same.

10. Issue 2: Combining tables (UNION)

Same applies to UNIONs.

11. Issue 2: Combining tables (UNION)

When UNIONing two tables, each field in the upper query must have a consistent data type with the corresponding field in the lower query.

12. Interpreting errors

If you try to run your query with these issues present, you will get an error. It’s important to understand how to interpret and troubleshoot these error messages. Both queries shown here result in an error. If you notice an error explicitly calling out a data type, such as integer or character varying, there is a good chance you have a data type error. Both errors shown here indicate an issue with data types.

13. Solution: Wrap it in a CAST()

Luckily, there is a simple solution here: the CAST() function, which converts the data type of a field. Simply cast the field as a data type. Here are a few examples of changing a field's data type.

14. CASTing for functions

In this example, birthdate is stored as a string. We want to calculate the month of each birthdate, which we can do using DATE_PART(). Using this function on the default field errors out, but after first casting the field as a date, the functions outputs correct results.

15. CASTing for JOINs

Here, we want to join tables A and B on the id field. However, the datatypes of each id field are different, which will result in an error. To join, we will need to CAST() one of these fields. In this case, we are converting the datatype of id in the second table from integer to varchar, which is a subtype of the string data type.

16. Planning for data type issues

To plan for these issues, I suggest fixing these errors as they come up. Assuming you can interpret the errors, it should be straightforward which fields need a cast.

17. Planning for data type issues

Alternatively, there is a schema, or group of related tables, within any PostgreSQL database that stores information about the database, called the INFORMATION_SCHEMA. By pulling the “columns” table within this schema, you can view all column types. To pull a table from this schema, you must explicitly state the schema name by using dot notation, as shown in the from statement here.

18. Data type documentation

For more information on data types in postgreSQL, check out this link.

19. Practice time!

Let's get some hands-on experience with casting in the upcoming exercises.