
Preparing text for modeling

1. Preparing text for modeling

Hello again! We just used the techniques we learned in Chapter 1 to create representations of text and explore text similarity. In Chapter 3, we move past representations and explore two commonly used text analysis techniques: text classification and topic modeling.

2. Supervised learning in R: classification

Classification modeling is one of the many tools data scientists use. DataCamp offers an entire course devoted to classification modeling in R; give it a look if you want a deeper dive into the subject.

3. Classification modeling

Classification modeling is a type of supervised learning that tries to accurately classify observations into distinct categories. This could be predicting wins or losses, or whether an approaching animal is dangerous, friendly, or indifferent. Luckily, we can use all of the big-name algorithms to approach this task.

4. Basic modeling steps

For classification models, we will complete four steps. We collect, clean, and prepare the data, which is the main focus of this lesson. We then split the dataset into training and testing sets, train a model on the training data, and report the model's accuracy on the testing data.

5. Character recognition

In the book Animal Farm, two of the main characters are drastically different. Napoleon, a pig, is a ruthless ruler with very little patience, while Boxer, a horse, is a loyal but ignorant worker who simply follows commands. Let's use classification modeling to determine which sentences from Animal Farm are discussing each character.

6. Animal sentences

To do this, let's create sentences from the Animal Farm dataset. We can label each sentence as Boxer or Napoleon by using the grepl function to see whether the name appears. We use the gsub function to replace the animals' names so that our classification algorithm doesn't rely on them when training. Finally, we filter to only the sentences that contain Boxer or Napoleon, but not both, as sketched below.
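Here is a minimal sketch of those steps; it assumes a tibble called animal_farm with the book's text in a column called text_column (both names are hypothetical).

    library(dplyr)
    library(tidytext)

    sentences <- animal_farm %>%
      # Split the raw text into one row per sentence
      unnest_tokens(sentence, text_column,
                    token = "sentences", to_lower = FALSE) %>%
      mutate(
        boxer    = grepl("Boxer", sentence),
        napoleon = grepl("Napoleon", sentence),
        # Mask the names so the model can't key on them directly
        sentence = gsub("Boxer|Napoleon", "animal X", sentence)
      ) %>%
      # Keep sentences mentioning exactly one of the two characters
      filter(xor(boxer, napoleon))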

7. Sentences continued

We add a label to our dataset with ifelse and then select 75 sentences for each animal, as in the sketch below. Next, we will try to predict which sentences originally included each animal.
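Continuing the hypothetical sketch above, the labeling and sampling might look like this:

    labeled <- sentences %>%
      mutate(animal = ifelse(boxer, "boxer", "napoleon")) %>%
      group_by(animal) %>%
      slice_sample(n = 75) %>%  # 75 sentences per character
      ungroup()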

8. Prepare the data

Recall what we've learned in the past two chapters. To process the data for classification, we first create tokens using unnest_tokens, remove stop words using anti_join, and perform stemming using wordStem. A sketch of these three steps follows.
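In this sketch, stop_words ships with tidytext and wordStem comes from the SnowballC package; sentence_id is a hypothetical document identifier added for the next step.

    library(tidytext)
    library(SnowballC)

    tokens <- labeled %>%
      mutate(sentence_id = row_number()) %>%   # one id per sentence
      unnest_tokens(word, sentence) %>%        # one row per word
      anti_join(stop_words, by = "word") %>%   # drop common stop words
      mutate(word = wordStem(word))            # reduce words to stems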

9. Preparation continued

In Chapter 2, we would stop here and create a tibble with tf-idf weights. For classification models, we instead create a document-term matrix with tf-idf weights using the cast_dtm function from the tidytext package, as sketched below. We first count the words by sentence and then cast this to a document-term matrix: a matrix with one row per document (a sentence, in this case) and one column for each word. Here we have 150 sentences and 694 unique words. However, most words do not appear in any given sentence. The sparsity of the document-term matrix tells us how many entries in the matrix are equal to 0. 99% in this case!
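A sketch of that cast, building on the tokens above; cast_dtm's weighting argument accepts tm's weightTfIdf function.

    library(tm)  # provides the weightTfIdf weighting function

    animal_dtm <- tokens %>%
      count(sentence_id, word) %>%             # word counts per sentence
      cast_dtm(document = sentence_id, term = word,
               value = n, weighting = tm::weightTfIdf)

    animal_dtm  # printing shows the dimensions and sparsity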

10. Remove sparse terms

Using large, sparse matrices will make modeling difficult. Of the 104,100 entries (150 sentences times 694 words), only 1,235 are non-zero, which equates to 99% sparsity and might cause computation issues if we do a lot of complex modeling. Luckily, we can remove sparse terms using the removeSparseTerms function, as sketched below.
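For example, with removeSparseTerms from the tm package (the 0.99 threshold mirrors the example on the next slide):

    library(tm)

    # Drop any term that is zero in more than 99% of sentences
    less_sparse_dtm <- removeSparseTerms(animal_dtm, sparse = 0.99)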

11. How sparse is too sparse?

So how sparse is too sparse? Let's look at a couple of examples. If we set the maximum sparsity to 90%, we would remove all but 4 words. This is useless: in most cases, you could not classify sentences using only 4 words. With a maximum of 99% sparsity, we would use 172 terms. Remember, though, we started with 694. Deciding on matrix sparsity depends on how many terms are in your matrix and how fast your computer is. Currently, it won't take much computational power to handle this matrix.
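The two thresholds from this slide, side by side (the term counts are the ones quoted above):

    removeSparseTerms(animal_dtm, sparse = 0.90)  # keeps only 4 terms: too aggressive
    removeSparseTerms(animal_dtm, sparse = 0.99)  # keeps 172 of the 694 terms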

12. Let's practice!

Before we train any models, let's practice preparing our data.
