1. Word vectors
In this video, we'll take our first steps in applying machine learning to building chatbots.
2. Machine learning
Machine learning is a field of computer science in which we write programs that get better at a task as they are exposed to more data.
The first problem we'll tackle with machine learning is identifying which intent a user message belongs to.
3. Vector representations
How do we apply machine learning to text? If you've studied any machine learning,
you'll know that we usually represent our data as a set of vectors. A vector is simply an ordered list of numbers. But that raises
the question: how can we effectively describe text with vectors? There are a number
of different approaches. We can use vector representations of individual characters,
of word fragments, of whole words, and even whole sentences. You can learn all about these in other DataCamp courses.
4. Word vectors
We will use an approach called `word vectors`.
The idea is to assign each word a vector which describes its _meaning_. Words which often appear in similar contexts will have similar word vectors, and words which rarely appear in the same context will have less similar word vectors.
If you create these vectors using text containing billions of words, they capture a lot of this implicit meaning.
5. Word vectors are computationally intensive
Training word vectors can take quite a bit of computing power and lots of data.
Fortunately, there are high-quality word vectors available for anyone to use.
For these exercises we will use vectors trained using the GloVe algorithm, which is a
cousin of word2vec. The excellent Python NLP library spaCy makes these especially easy
to work with.
6. Word vectors in spaCy
First we import spacy, and create a spaCy object using spacy.load() with the argument 'en'. This loads the default English language model, and we'll assign it the name nlp.
Word vectors tend to have a length of a few hundred elements. We can check this for the GloVe
vectors in spaCy by looking at `nlp.vocab.vectors_length`. This tells us that we are working with 300-dimensional word vectors.
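As a rough sketch of what this looks like in code (following the spaCy v2-era `'en'` shortcut used here; newer spaCy versions would instead load a vectors-bearing model by its full name, such as `'en_core_web_md'`):

```python
import spacy

# Load the default English language model and assign it the name nlp.
nlp = spacy.load('en')

# Each word vector in this model has a few hundred elements.
print(nlp.vocab.vectors_length)  # e.g. 300
```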
To see these vectors, we first pass a string to the nlp object to create a document. We assign this the name doc.
The document produces an iterator over the tokens in the string.
A token is either a word, a partial word, or a punctuation mark.
We can iterate over the tokens, and access their word vectors using the token's vector attribute. Here we've iterated over the tokens, and printed the first three elements of each.
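Here's a minimal sketch of that loop, with an illustrative example sentence (the string and the printed values will vary with the model):

```python
# Pass a string to the nlp object to create a document.
doc = nlp('hello, can you help me?')

# Iterate over the tokens and print the first three
# elements of each token's word vector.
for token in doc:
    print("{} : {}".format(token.text, token.vector[:3]))
```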
The actual values inside the vectors aren't that meaningful; what matters is how similar the vectors of different words are.
7. Similarity
One technical detail we need to address is that in word vector space, it's the *direction* of the vectors
which matters most. So the 'distance' we want to measure between words is actually related to the angle
between the vectors. The metric that's typically used is the cosine similarity, which is equal to 1 if the vectors point in the same direction, 0 if they're perpendicular, and -1 if they point in
opposite directions.
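To make the metric concrete, here is a small NumPy sketch of the cosine similarity (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

print(cosine_similarity(u, u))   #  1.0 (same direction)
print(cosine_similarity(u, v))   #  0.0 (perpendicular)
print(cosine_similarity(u, -u))  # -1.0 (opposite directions)
```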
spaCy has a convenient method for calculating this similarity.
8. .similarity()
First we import spacy and load the English model.
Then we create a document by passing a string to the nlp object.
To calculate the cosine similarity, we use the document's similarity method, which takes another doc as an argument.
If we compare the similarity of the words "can" and "dog" to the word "cat", we see that "cat" and "dog" are much more similar, even though the strings "can" and "cat" are obviously more alike. That's because similarity measures how close the "meaning" of words is, rather than the spelling, by relying on word vectors.
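A minimal sketch of that comparison (the exact scores depend on the model's vectors, so only the relative ordering is noted in the comments):

```python
import spacy

nlp = spacy.load('en')

doc = nlp('cat')

# "can" is close to "cat" in spelling but not in meaning,
# so its similarity score comes out much lower than "dog"'s.
print(doc.similarity(nlp('can')))
print(doc.similarity(nlp('dog')))
```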
9. Let's practice!