Advanced tokenization with NLTK and regex

1. Advanced tokenization with regex

In this video, we'll take a look at doing more advanced tokenization with regex.

2. Regex groups using or "|"

One new regex pattern you will find useful for advanced tokenization is the ability to use the or method. In regex, OR is represented by the pipe character. To use the or, you can define a group using parenthesis. Groups can be either a pattern or a set of characters you want to match. You can also define explicit character classes using square brackets. We'll go a bit more into depth on groups and ranges soon. Let's take an example that we want to tokenize using regular expressions and we want to find all digits and words. We define our pattern using a group with the OR symbol and make them greedy so they catch the full word or digits. Then, we can call findall using Python's re library and return our tokens. Notice that our pattern does not match punctuation but properly matches the words and digits.

3. Regex ranges and groups

Let's take a look at another more advanced topic, defining groups and character ranges. Here we have another chart of patterns, and this time we are using ranges or character classes marked by the square brackets and groups marked by the parentheses. We can see in this chart that we can use square brackets to define a new character class. For example, we can match all upper and lowercase english letters using Uppercase A hyphen Uppercase Z which will match all uppercase and then lowercase a hyphen lowercase z which will match all lowercase letters. We can also make ranges to match all digits 0 hyphen 9, or perhaps a more complex range like uppercase and lowercase English with the hyphen and period. Because the hyphen and period are special characters in regex, we must tell regex we mean an ACTUAL period or hyphen. To do so, we use what is called an escape character and in regex that means to place a backwards slash in front of our character so it knows then to look for a hyphen or period. On the other hand, with groups which are designated by the parentheses, we can only match what we explicitly define in the group. So a-z matched only a, a hyphen and z. Groups are useful when you want to define an explicit group, such as the final example; where we are taking spaces or commas.

4. Character range with `re.match()`

In this code example, we can use match with a character range to match all lowercase ascii, any digits and spaces. It is greedy marked by the + after the range definition, but once it hits the comma, it can't match anymore. This short example demonstrates that thinking about what regex method you use (such as search versus match) and whether you define a group or a range can have a large impact on the usefulness and readability of your patterns.

5. Let's practice!

Now it's your turn to practice advanced regex techniques to help with tokenization!