
Build new features from text

1. Build new features from text

When we solve a sentiment analysis task with machine learning, having extra features usually results in a better model.

2. Goal of the video

Our goal in this video is to enrich our dataset with extra features derived from the text that carries the sentiment.

3. Product reviews data

We continue to work with the Amazon product reviews dataset. Remember that the first column contains the numeric score and the second column contains the review itself.

4. Features from the review column

In my experience, some highly predictive features capture the complexity of the text. For example, we could measure how long each review is, how many sentences it contains, or describe the parts of speech involved, the punctuation marks, and so on.

5. Tokenizing a string

Remember we employed a BOW approach to transform each review into numeric features, counting how many times a word occurred in the respective review. Here, we stop one step earlier and only split the reviews into individual words (usually called tokens, though a token can be a whole sentence as well). We will work with the nltk package, specifically the word_tokenize function. Let's apply word_tokenize to our familiar anna_k string. The returned result is a list, where each item is a token from the string. Note that not only words but also punctuation marks are treated as tokens. The same would have been the case with digits, if we had any in our string.
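Here is a minimal sketch of this step. The exact contents of anna_k come from an earlier video, so the string below is only a stand-in, and word_tokenize may require downloading the 'punkt' tokenizer models the first time it is used.

```python
from nltk import word_tokenize
# If needed the first time: import nltk; nltk.download('punkt')

# Stand-in for the anna_k string from an earlier video (assumption)
anna_k = ('Happy families are all alike; '
          'every unhappy family is unhappy in its own way.')

word_tokens = word_tokenize(anna_k)
print(word_tokens)
# ['Happy', 'families', 'are', 'all', 'alike', ';', 'every', ...]
```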

6. Tokens from a column

Now we want to apply the same logic, but to our column of reviews. One fast way to iterate over strings is with a list comprehension. A quick reminder on list comprehensions: they are like flattened-out for loops. The syntax is an operation we perform on each item in an iterable object (such as a list). In our case, a list comprehension allows us to iterate over the review column, tokenizing every review. The result is a list; if we explore the type of the first item, for example, we see it is also of type list. This means that our word_tokens object is a list of lists, where each item stores the tokens from a single review.
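As a sketch of this step, assuming the DataFrame is called reviews and the text column is called review (the tiny DataFrame below is a toy stand-in for the Amazon data):

```python
import pandas as pd
from nltk import word_tokenize

# Toy stand-in for the Amazon product reviews data;
# the column names 'score' and 'review' are assumptions
reviews = pd.DataFrame({
    'score': [1, 0],
    'review': ['Great product, works exactly as described!',
               'Terrible. It broke after two days. Do not buy.']
})

# Tokenize every review with a list comprehension
word_tokens = [word_tokenize(review) for review in reviews.review]

print(type(word_tokens))     # <class 'list'>
print(type(word_tokens[0]))  # <class 'list'>: tokens of the first review
```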

7. Tokens from a column

Now that we have our word_tokens list, we only need to count how many tokens there are in each of its items. We start by creating an empty list, len_tokens, to which we will append the length of each review as we iterate over word_tokens. In the first line of the for loop, we find the number of items in the word_tokens list using the len() function. Since we want to iterate over this number, we wrap the len() call in the range() function. In the second line, we find the length of each item and append that number to our len_tokens list. Lastly, we create a new feature for the length of each review.
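Putting those steps together, continuing with the word_tokens list and the assumed reviews DataFrame from the previous sketch (the column name n_tokens matches the one shown on the later slide):

```python
# Empty list that will hold the number of tokens per review
len_tokens = []

# Iterate over the indices of word_tokens
for i in range(len(word_tokens)):
    # Append the length of the i-th review's token list
    len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens
```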

8. Dealing with punctuation

Note that we did not address punctuation, but you can exclude it if it suits your context better. You can even create a new feature that measures the number of punctuation marks. In our context, a review with more punctuation marks could signal a very emotionally charged opinion. It's also good to know that we can follow the same logic and create a feature that counts the number of sentences, where one token equals a sentence rather than a single word.
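Both ideas can be sketched as follows, continuing with the same assumed reviews DataFrame. The column names n_punct and n_sentences and the use of string.punctuation are my own illustrative choices, not prescribed in the video:

```python
from string import punctuation
from nltk import sent_tokenize

# Count punctuation characters in each review (illustrative feature)
reviews['n_punct'] = [sum(ch in punctuation for ch in review)
                      for review in reviews.review]

# Count sentences: here each token is a sentence, not a single word
reviews['n_sentences'] = [len(sent_tokenize(review))
                          for review in reviews.review]
```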

9. Reviews with a feature for the length

If we check what the product reviews dataset looks like, we see the 'n_tokens' column we created. It shows the number of words in each review.
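Continuing with the toy DataFrame from the sketches above:

```python
# Inspect the enriched dataset; the new n_tokens column appears
# alongside the score and review columns
print(reviews.head())
```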

10. Let's practice!

Let's solve some exercises to practice what we've learned.