Getting past single words

1. Getting past single words

So far we have only worked with single words, also called unigrams. You now get a chance to change how your terms are tokenized.

2. Unigrams, bigrams, trigrams, oh my!

This is important because "not" and "good" as separate words mean something very different from the bigram "not good". A word of caution: moving to longer n-grams will increase the size of the DTM and TDM. In this example, the first two coffee tweets are shown. Each individual word is tokenized as its own term, and the resulting DTM has 18 terms.
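To make the difference concrete, here is a minimal base R sketch of unigram versus bigram tokenization. The sentence is a made-up example, not one of the coffee tweets, and splitting on single spaces is a simplifying assumption rather than how tm tokenizes.

```r
# Hypothetical example text, not taken from the coffee tweets
text <- "the coffee was not good"

# Unigrams: split the sentence on single spaces
unigrams <- strsplit(text, " ", fixed = TRUE)[[1]]

# Bigrams: pair each word with the word that follows it
bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))

print(unigrams)
print(bigrams)
```

As unigrams, "not" and "good" are unrelated terms; as a bigram, "not good" survives as a single term that keeps the negation.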

3. Unigrams, bigrams, trigrams, oh my!

Next we change from unigram to bigram tokenization. To do so, we use the RWeka package's NGramTokenizer inside a custom function called tokenizer. To get bigrams, you set both min and max to 2 in the control passed to NGramTokenizer. This tokenizer function is then passed to tm's TermDocumentMatrix function through its control parameter. Here, the tokens are shown in alternating colors so you can see the increased number of circles. When you print the bigram_tdm object, you can plainly see the number of terms has increased from 18 to 21, and this is with only 2 tweets!
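The steps described here might look roughly like the following sketch. The two tweets are hypothetical stand-ins, not the actual coffee tweets, so the term counts will not match the 18 and 21 quoted above. It also assumes the tm and RWeka packages are installed (RWeka needs a working Java setup) and that the corpus is a VCorpus.

```r
library(tm)
library(RWeka)

# Hypothetical stand-ins for the two coffee tweets
tweets <- c("coffee is not good today",
            "this coffee shop is very good")
corp <- VCorpus(VectorSource(tweets))

# Custom tokenizer: min = max = 2 asks for bigrams only
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Default tokenization yields single-word terms
unigram_tdm <- TermDocumentMatrix(corp)

# Pass the custom tokenizer through the control list
bigram_tdm <- TermDocumentMatrix(corp, control = list(tokenize = tokenizer))

# Compare the number of terms in each matrix
nTerms(unigram_tdm)
nTerms(bigram_tdm)
```

Changing min and max to 3 would give trigrams instead, and setting min = 1 with max = 2 would keep unigrams alongside bigrams, at the cost of an even larger matrix.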

4. Let's practice!
