Get startedGet started for free

Character data types and common issues

1. Character data types and common issues

The next type of data we’ll be exploring is character or text data.

2. PostgreSQL character types

There are three types of character columns to store strings of text: character (which can be shortened to char), character varying (which can be shortened to varchar), and text. They differ in the length of the string of text they store. The length of a string is defined as the number of characters in it. Character columns store a fixed length string; spaces are added to the end of shorter strings to make up any difference in length. Spaces at the end of char fields are ignored when comparing values. Varchar columns can optionally specify a maximum string length; they allow strings of any size up to the specified maximum. Text, or varchar columns without a maximum length specified, can store strings of unlimited length.

3. Types of text data

Regardless of the formal column type, for analysis, we want to distinguish between two types of text data: categorical variables and unstructured text. Categorical variables are short strings of text with values that are repeated across multiple rows. They take on a finite and manageable set of distinct values. Days of the week, product categories, and multiple choice survey question responses are all examples of categorical variables. Unstructured text consists of longer strings of unique values, such as answers to open-ended survey questions or product reviews. To analyze unstructured text, we can create new variables that extract features from the text or indicate whether the text has particular characteristics. For example, we could create binary indicator variables that denote whether the text contains keywords of particular interest.

4. Grouping and counting

For now, we'll focus on categorical variables. The first things to check with categorical variables are the set of distinct categories and the number of observations, or rows, for each category. We do this with GROUP BY and count. Without ordering the results, it's hard to tell which categories are commonly used and whether any categories should be grouped together.

5. Order: most frequent values

Ordering by the count of each value helps us see the most, and least, frequent categories. It's good to check whether categories with only a few observations have errors - such as spelling, capitalization, or spacing mistakes.

6. Order: category value

It's also a good idea to try ordering the results by the category. Doing so can help us identify possible duplicates and other errors in the data. Does the order of the categories in the results match what you were expecting?

7. Alphabetical order

Character types are sorted in alphabetical order. Spaces come before letters, and uppercase letters come before lowercase letters. Looking at the first character of each category shows that the results are in alphabetical order.

8. Common issues

So what are you looking for when grouping and counting values? Common inconsistencies and issues with character data include: Differences in case: for example, when there are both lower and upper case versions of the same value. White space differences, such as when values only differ in the number or placement of spaces. One exception here is that when comparing values of type char, trailing spaces are ignored. An empty string, which is a string of length zero, is not the same as a string of all spaces. An empty string is also not the same as null. These are distinct values. And finally punctuation differences. Punctuation differences can sometimes be subtle. For example, there are multiple types of hyphens and dashes that look similar but are different characters.

9. Time to examine some text data

We'll learn how to address these inconsistencies later in this chapter, but the first step is to practice identifying them. Time to get to work on the exercises.