
Word embeddings

1. Word embeddings

In this lesson, I want to introduce you to a more advanced technique that is a great addition for any text analysis project - word embeddings.

2. The flaw in word counts

Let's start by looking at two statements: "Bob is the smartest person I know" and "Bob is the most brilliant person I know." The two statements mean the same thing. But look at them with stop words removed: only "Bob smartest person" and "Bob brilliant person" remain. And because "smartest" and "brilliant" are not the same word, traditional count-based similarity metrics would not do well here.

3. Word meanings

But what if we used a system that had access to hundreds of other mentions of "smartest" and "brilliant"? And what if, instead of just counting how many times each word was used, we also had information on which words were used in conjunction with those words?

4. word2vec

word2vec is one of the most popular word embedding methods around. It was developed in 2013 by a team at Google. It uses a large vector space to represent words and is built such that words of similar meaning are closer together. This also means that words appearing together often will be closer together in a given vector space. For example, pork, beef, and chicken are all grouped together in this visual.

5. Preparing data

Let's prepare some data to use with one of R's word2vec implementations. This implementation comes from the h2o package. The package requires us to start an h2o instance by using h2o.init. We can convert our tibble into an h2o object by using the as.h2o function. And finally, we complete a few steps that we've done before, only this time using h2o's methods for tokenizing, lowercasing, and removing stop words: h2o.tokenize will split the text into words and place an NA after the last word of each chapter, h2o.tolower will lowercase the words, and then we can filter to only non-stop words.
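The steps above might look something like this. This is a minimal sketch, assuming a tibble called `chapters` with a character column `text` (both names are illustrative, not from the lesson):

```r
library(h2o)

h2o.init()  # start a local h2o instance

# Convert the tibble to an h2o frame (assumes a tibble `chapters`
# with a `text` column; names are illustrative)
text_h2o <- as.h2o(chapters)

# Tokenize: split on non-word characters; h2o appends an NA
# after the last word of each row, marking chapter boundaries
words <- h2o.tokenize(text_h2o$text, "\\\\W+")

# Lowercase every token
words <- h2o.tolower(words)

# Keep the NA boundary markers but drop stop words
# (here using tidytext's stop word list as one option)
stop_words <- tidytext::stop_words$word
words <- words[is.na(words) || !(words %in% stop_words), ]
```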

6. word2vec modeling

h2o's implementation, h2o.word2vec, has several parameters. Let's look at just two for now. min_word_freq removes all words that appear fewer times than the set value, while epochs, a common parameter in many machine learning models, is the number of training iterations to run. You might use more epochs with a larger amount of text.
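Putting those two parameters together, a hedged sketch of the training call, using the tokenized `words` frame from the preparation step (the specific values here are illustrative, not prescribed by the lesson):

```r
# Train a word2vec model on the tokenized h2o frame `words`
w2v_model <- h2o.word2vec(
  words,
  min_word_freq = 5,  # ignore words appearing fewer than 5 times
  epochs = 10         # number of training iterations
)
```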

7. Word synonyms

There are several uses for this model, but let's use it to find similar terms. The function h2o.findSynonyms helps us see that the word "animal" is most related to words like "drink", "act", and "hero", while the word "Jones", the enemy of the animals in the book, is most related to words like "battle" and "enemies". Animal Farm only has about 10,000 non-stop words; we'd likely see even better results with more text.
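The synonym lookup is a single call on the trained model. A sketch, assuming the `w2v_model` object from the previous step (the count of 5 is an illustrative choice):

```r
# Find the five terms closest to "animal" in the trained model
h2o.findSynonyms(w2v_model, "animal", count = 5)
```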

8. Additional uses

word2vec can be used for all kinds of analysis. It has been used to improve classification modeling, sentiment analysis, and even topic modeling. All topics we have covered in this course!

9. Apply word2vec

Let's explore word2vec by creating a couple of example models.
