1. Introduction to tokenization
In this video, we'll learn more about string tokenization!
2. What is tokenization?
Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing.
There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expressions, but normally tokenization will do things like breaking a text into words or sentences, separating out punctuation, or even tokenizing just parts of a string, like pulling all the hashtags out of a tweet.
3. nltk library
One library that is commonly used for simple tokenization is nltk, the Natural Language Toolkit. Here is a short example of using the word_tokenize function to break down a string into tokens. We can see from the result that words are separated and punctuation marks become individual tokens as well.
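A minimal sketch of that kind of call (the example string here is mine, not necessarily the one from the slide):

```python
from nltk.tokenize import word_tokenize

# You may need to download the tokenizer models first:
# import nltk; nltk.download('punkt')
tokens = word_tokenize("Hi there! Tokenization is fun, isn't it?")
print(tokens)
# ['Hi', 'there', '!', 'Tokenization', 'is', 'fun', ',', 'is', "n't", 'it', '?']
```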
4. Why tokenize?
Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping parts of speech, matching common words, and perhaps removing unwanted tokens such as common or repeated words.
Here, we have a good example. The sentence is: I don't like Sam's shoes. When we tokenize it, we can clearly see the negation in the n't token and the possession in the 's token. These indicators can help us determine meaning from simple text.
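Here's a quick sketch of tokenizing that exact sentence:

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("I don't like Sam's shoes."))
# ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
```

The n't and 's tokens are exactly the negation and possession markers mentioned above.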
5. Other nltk tokenizers
Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including the ones you'll be working with in this chapter.
The sent_tokenize function will split a document into individual sentences.
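For example, a minimal sketch (the text here is my own):

```python
from nltk.tokenize import sent_tokenize

text = "I like NLP. It is fun! Shall we begin?"
print(sent_tokenize(text))
# ['I like NLP.', 'It is fun!', 'Shall we begin?']
```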
The regexp_tokenize function uses regular expressions to tokenize the string, giving you more granular control over the process.
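For instance, a sketch that keeps only hashtags (the pattern and tweet are illustrative):

```python
from nltk.tokenize import regexp_tokenize

tweet = "Loving this #NLP lesson from @datacamp! #tokenization"
# The pattern matches a '#' followed by word characters
print(regexp_tokenize(tweet, r"#\w+"))
# ['#NLP', '#tokenization']
```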
And the TweetTokenizer does neat things like recognizing hashtags, mentions, and even runs of repeated punctuation at the end of a sentence. How convenient!!!
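A sketch of what that looks like (the example tweet is mine):

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
print(tknzr.tokenize("How convenient!!! #NLP @friend :)"))
# ['How', 'convenient', '!', '!', '!', '#NLP', '@friend', ':)']
```

Note how the hashtag, the mention, and the emoticon each survive as a single token.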
6. More regex practice
You'll be using more regex in this section as well, not only when tokenizing, but also when figuring out how to parse tokens and text. The re module's re.match and re.search functions are essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they differ.
When we use search and match with the same pattern and string, and the pattern sits at the beginning of the string, we find identical matches. That is the case when matching and searching abcde with the pattern abc.
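A sketch of that first case:

```python
import re

# Pattern at the beginning of the string: both succeed
print(re.match("abc", "abcde"))   # <re.Match object; span=(0, 3), match='abc'>
print(re.search("abc", "abcde"))  # <re.Match object; span=(0, 3), match='abc'>
```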
When we use search for a pattern that appears later in the string we get a result, but we don't get the same result using match.
This is because match only tries to match from the beginning of the string, giving up as soon as the pattern fails there.
Search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.
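And a sketch of the second case:

```python
import re

# Pattern appears later in the string
print(re.search("cd", "abcde"))  # <re.Match object; span=(2, 4), match='cd'>
print(re.match("cd", "abcde"))   # None -- match only anchors at the start
```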
7. Let's practice!
Now it's your turn to try some tokenization!