Tokenization
1. Tokenization
Now that we have looked at a basic way to search text, let's move on to a fundamental component of text preprocessing: tokenization.

2. What are tokens?
Tokenization is the act of splitting text into individual tokens. Tokens can be as small as individual characters or as large as an entire document. The most common types of tokens are characters, words, sentences, and documents. You can even split text into tokens based on a regular expression, for example, splitting the text every time you see a number with three or more digits, as sketched below.
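As a quick illustration of that last example, here is a minimal base R sketch (the sample sentence is invented):

```r
# Split a string every time a number with three or more digits appears.
text <- "Call 555 before noon, or 1234 after; extension 42 is unused."
strsplit(text, "[0-9]{3,}")
# Returns: "Call "  " before noon, or "  " after; extension 42 is unused."
```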
3. tidytext package

R has an abundance of ways to tokenize text, but we will use the tidytext package, which describes itself as "Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools." The tidytext package follows the tidy data format. Taking the Introduction to the Tidyverse course may be helpful if you are new to tidy concepts.

4. The Animal Farm dataset
Throughout this course, we are going to use a couple of different datasets. The first is the 10 chapters from the book Animal Farm. This is a great dataset for our course: although the data is limited to just the text and the chapter number, the book has a rich character list, recurring themes, and simple vocabulary for us to explore.

5. Tokenization practice
The tidytext function for tokenization is called unnest_tokens. This function takes our input tibble, called animal_farm, and extracts tokens from the column specified by the input argument. We also specify what kind of tokens we want and what the output column should be labeled. Our tokenization options include sentences, lines, regex (for a user-specified regular expression), and many others.
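As a minimal sketch of what such a call might look like (assuming the text lives in a column named text_column; the column names here are assumptions, not shown on the slide):

```r
library(dplyr)
library(tidytext)

# One row per word token, stored in a new column called "word".
# text_column is an assumed name for the column holding the raw text.
word_tokens <- animal_farm %>%
  unnest_tokens(output = "word", input = text_column, token = "words")

# Swapping token = "words" for token = "sentences" (or "lines", or
# "regex" with a pattern argument) changes the kind of token produced.
```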
6. Counting tokens

We can take this a step further and quickly count the top tokens by simply adding the count function to the end of our code. Not the most interesting output yet, but we will clean this up later. The most common words are just common English words such as the, and, of, and to.
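Continuing the sketch above, counting the top tokens is one extra step:

```r
# Tally each word and sort by frequency; the top rows are common
# English words such as "the", "and", "of", and "to".
word_tokens %>%
  count(word, sort = TRUE)
```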
7. Tokenization with regular expressions

Another use of unnest_tokens is to find all mentions of a particular word and see what follows it. In Animal Farm, Boxer is one of the main characters. Let's see what chapter one says about him. Here we have filtered animal_farm to chapter 1 and looked for any mention of Boxer, whether or not it is capitalized. Since the first token starts at the beginning of the text, I am using the slice function to skip it. The output is the text that follows every mention of Boxer, who, apparently, was an enormous beast, nearly eighteen hands high.
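Here is a sketch of that approach, under the same assumed column names (the chapter label is also an assumption):

```r
# Split the chapter 1 text at every mention of "boxer"; (?i) makes the
# regular expression case-insensitive. Each resulting token is the text
# that follows one mention of Boxer.
animal_farm %>%
  filter(chapter == "Chapter 1") %>%
  unnest_tokens(output = "boxer_text", input = text_column,
                token = "regex", pattern = "(?i)boxer") %>%
  slice(-1)  # drop the text that comes before the first mention
```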
8. Let's tokenize some text

Tokenizing text is a vital component of several text analysis tasks. Let's practice with a few examples.