1. Word vectors
In this video, we'll take our first steps in applying machine learning to building chatbots.
2. Machine learning
Machine learning is a field of computer science in which we write programs that get better at a task as they are exposed to more data.
The first problem we'll tackle with machine learning is identifying which intent a user message belongs to.
3. Vector representations
How do we apply machine learning to text? If you've studied any machine learning,
you'll know that we usually represent our data as a set of vectors. A vector is simply an ordered list of numbers. But that raises
the question: how can we effectively describe text with vectors? There are a number
of different approaches. We can use vector representations of individual characters,
of word fragments, of whole words, and even whole sentences. You can learn all about these in other DataCamp courses.
4. Word vectors
We will use an approach called `word vectors`.
The idea is to assign each word a vector which describes its _meaning_. Words which often appear in similar contexts will have similar word vectors, and words which rarely appear in the same context will have less similar word vectors.
If you create these vectors using text containing billions of words, they capture a lot of this implicit meaning.
5. Word vectors are computationally intensive
Training word vectors can take quite a bit of computing power and lots of data.
Fortunately, there are high-quality word vectors available for anyone to use.
For these exercises we will use vectors trained using the GloVe algorithm, which is a
cousin of word2vec. The excellent Python NLP library spaCy makes these especially easy
to work with.
6. Word vectors in spaCy
First we import spacy, and create a spaCy object using spacy.load() with the argument 'en'. This loads the default English language model, and we'll assign it the name nlp.
Word vectors tend to have a length of a few hundred elements. We can check this for the GloVe
vectors in spaCy by looking at `nlp.vocab.vectors_length`. This tells us that we are working with 300-dimensional word vectors.
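As a rough sketch of what this looks like in code (following the spaCy v2-era `'en'` shortcut used here; newer spaCy versions would instead load a vectors-bearing model by its full name, such as `'en_core_web_md'`):

```python
import spacy

# Load the default English language model and assign it the name nlp.
nlp = spacy.load('en')

# Each word vector in this model has a few hundred elements.
print(nlp.vocab.vectors_length)  # e.g. 300
```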
To see these vectors, we first pass a string to the nlp object to create a document. We assign this the name doc.
The document produces an iterator over the tokens in the string.
A token is either a word, a partial word, or a punctuation mark.
We can iterate over the tokens, and access their word vectors using the token's vector attribute. Here we've iterated over the tokens, and printed the first three elements of each.
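Here's a minimal sketch of that loop, with an illustrative example sentence (the string and the printed values will vary with the model):

```python
# Pass a string to the nlp object to create a document.
doc = nlp('hello, can you help me?')

# Iterate over the tokens and print the first three
# elements of each token's word vector.
for token in doc:
    print("{} : {}".format(token.text, token.vector[:3]))
```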
The actual values inside the vectors aren't that meaningful; what matters is how similar the vectors of different words are.
7. Similarity
One technical detail we need to address is that in word vector space, it's the *direction* of the vectors
which matters most. So the 'distance' we want to measure between words is actually related to the angle
between the vectors. The metric that's typically used is the cosine similarity, which is equal to 1 if the vectors point in the same direction, 0 if they're perpendicular, and -1 if they point in
opposite directions.
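To make the metric concrete, here is a small NumPy sketch of the cosine similarity (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

print(cosine_similarity(u, u))   #  1.0 (same direction)
print(cosine_similarity(u, v))   #  0.0 (perpendicular)
print(cosine_similarity(u, -u))  # -1.0 (opposite directions)
```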
spaCy has a convenient method for calculating this similarity.
8. .similarity()
First we import spacy and load the English model.
Then we create a document by passing a string to the nlp object.
To calculate the cosine similarity, we use the document's similarity method, which takes another doc as an argument.
If we compare the similarity of the words "can" and "dog" to the word "cat", we see that "cat" and "dog" are much more similar, even though the strings "can" and "cat" are obviously more alike. That's because similarity measures how close the "meaning" of words is, rather than the spelling, by relying on word vectors.
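A minimal sketch of that comparison (the exact scores depend on the model's vectors, so only the relative ordering is noted in the comments):

```python
import spacy

nlp = spacy.load('en')

doc = nlp('cat')

# "can" is close to "cat" in spelling but not in meaning,
# so its similarity score comes out much lower than "dog"'s.
print(doc.similarity(nlp('can')))
print(doc.similarity(nlp('dog')))
```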
9. Let's practice!