Capturing a token pattern

1. Capturing a token pattern

You may have noticed while working with the airline sentiment data from Twitter that the text contains many digits and other characters. Sometimes we may want to exclude them from our numeric representation.

2. String operators and comparisons

If we work with a string, how can we make sure we extract only certain characters? There are a few useful string methods we will review here. .isalpha() returns True if a string is composed only of letters and False otherwise; .isdigit() returns True if a string is composed only of digits; and finally .isalnum() returns True if a string is composed only of alphanumeric characters, i.e. letters and digits. All three return False for an empty string.
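To make the three checks concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the airline data.

```python
# Hypothetical sample strings, chosen to cover each case.
samples = ["flight", "2024", "gate22", "delayed!"]

# For each string, record (isalpha, isdigit, isalnum).
checks = {s: (s.isalpha(), s.isdigit(), s.isalnum()) for s in samples}

for s, result in checks.items():
    print(s, result)
# flight (True, False, True)
# 2024 (False, True, True)
# gate22 (False, False, True)
# delayed! (False, False, False)
```

Note that "delayed!" fails all three checks because of the exclamation mark: each method tests the whole string, not individual characters.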

3. String operators with list comprehension

String methods can improve some of the features we created earlier. As a reminder, in a previous video we used a list comprehension to iterate over each review of the product reviews dataset and create word tokens from each review. We can adjust our original code. If we want to retain only tokens consisting of letters, for example, we can use the .isalpha() method in a second list comprehension. Since the result of the first list comprehension is a list of lists, we first need to iterate over the items in each inner list, keeping only those tokens that consist entirely of letters. This is what happens in the first part of the list comprehension, enclosed in the inner brackets. In the second part, we iterate over the lists, basically saying that we want to perform this filtering across all lists in the word_tokens list. When we compare the length of the first item of the word_tokens and cleaned_tokens lists, we see that the filtering decreased the number of tokens, as we might expect.
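The nested comprehension described above could look roughly like this. The two reviews are invented, and str.split() is used as a simple stand-in for the word tokenizer from the earlier video, so the tokens will differ from what a real tokenizer produces.

```python
# Hypothetical reviews; the course uses the product reviews dataset instead.
reviews = ["I bought 2 of these in 2019 !", "Great value , 10/10"]

# Stand-in for the earlier tokenization step: one list of tokens per review.
word_tokens = [review.split() for review in reviews]

# Inner brackets: keep only the letter-only tokens of a single review.
# Outer part: repeat that filtering for every review in word_tokens.
cleaned_tokens = [[token for token in tokens if token.isalpha()]
                  for tokens in word_tokens]

print(len(word_tokens[0]))     # 8
print(len(cleaned_tokens[0]))  # 5
print(cleaned_tokens[0])       # ['I', 'bought', 'of', 'these', 'in']
```

The result is still a list of lists with one entry per review, which is why the lengths of the first items can be compared directly.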

4. Regular expressions

Regular expressions are a standard way to extract certain characters from a string. Python has a built-in package, called re, which allows you to work with regular expressions. We will not cover regular expressions in depth here, but here is a quick reminder of the syntax. We import the re package. Then imagine we have a string #Wonderfulday and we want to extract a hash (#) followed by any letter, capital or small. One standard way to do this is by calling the search function with the regular expression and our string. In our case, the pattern starts with a #, followed by either an upper or lower case letter. When we print the result, we see that it is a match object, showing where the match is located (in our case the span runs from index 0 to 2, so two characters were matched) and also the exact characters that were matched.
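As a short sketch of that search, assuming the pattern is written as a # followed by the character class [A-Za-z] (the variable names are just for illustration):

```python
import re

my_string = "#Wonderfulday"

# '#' matches a literal hash; [A-Za-z] matches one upper or lower case letter.
match = re.search(r"#[A-Za-z]", my_string)

print(match)          # <re.Match object; span=(0, 2), match='#W'>
print(match.span())   # (0, 2)
print(match.group())  # #W
```

If the pattern is not found anywhere in the string, re.search returns None instead of a match object, so it is worth checking the result before calling .span() or .group().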

5. Token pattern with a BOW

Our familiar CountVectorizer takes a regular expression as an argument. The default pattern matches words that consist of at least two letters or numbers (\w) and that are delimited by word boundaries (\b). It will ignore single-letter words, and will split words such as 'don't' and 'haven't' at the apostrophe. If we are fine with this default pattern, we don't need to change any arguments in the CountVectorizer. If we want to change it, we can specify the token_pattern argument. If we want the vectorizer to ignore digits and other characters and only consider words of two or more letters, we can use the specified token pattern. There are multiple ways to write such a pattern, and the one specified here is not necessarily the only correct or best one. Feel free to experiment with this. Note, however, that we add an 'r' before the regular expression itself. This makes it a raw string, so that backslashes such as the one in \b are passed to the regular expression engine unchanged.
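To see what a token pattern picks up without building a full vectorizer, we can try the patterns with re.findall on lowercased text, which is roughly how CountVectorizer applies its token_pattern. This is only a sketch: the sentence is invented, and the letters-only pattern is one possible choice rather than the definitive one. Either string could then be passed to CountVectorizer through its token_pattern argument.

```python
import re

# Hypothetical, already lowercased text (CountVectorizer lowercases by default).
text = "flight ua1234 delayed 3 hours, i don't care"

# The default token pattern in scikit-learn's CountVectorizer.
default_pattern = r"(?u)\b\w\w+\b"

# One possible letters-only alternative: two or more ASCII letters.
letters_pattern = r"\b[A-Za-z][A-Za-z]+\b"

default_tokens = re.findall(default_pattern, text)
letters_tokens = re.findall(letters_pattern, text)

print(default_tokens)
# ['flight', 'ua1234', 'delayed', 'hours', 'don', 'care']
print(letters_tokens)
# ['flight', 'delayed', 'hours', 'don', 'care']
```

Both patterns drop the single characters '3', 'i' and 't'. The letters-only pattern also drops 'ua1234' entirely, because no run of two or more letters inside it is bounded by word boundaries on both sides.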

6. Let's practice!

Let's go to the exercises where you can apply the things you learned here!
