Entity extraction

1. Entity extraction

You have already worked on the problem of entity recognition, but so far only used a basic approach of looking for keywords. It's much trickier to recognize entities that you _haven't_ seen before.

2. Beyond keywords: context

Say you're building a voice-controlled music speaker. Imagine having to train your NLU model on every single song, artist, and album that someone might want to play: that list would never be complete. To generalize to names we haven't seen, we can look at how a word is spelled, whether it's capitalized, and which words occur before and after it. This is also a pattern-recognition problem, and one we can use machine learning to solve.
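To make those clues concrete, here is a minimal sketch of how they might be turned into features for a classifier. The function name `word_features` and the particular features chosen are illustrative assumptions, not something defined in this course.

```python
def word_features(tokens, i):
    """Return simple, hand-picked features for tokens[i] and its neighbors."""
    word = tokens[i]
    return {
        # How the word is spelled
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Whether it is capitalized
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        # Which words occur before and after it
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = "play Yesterday by the Beatles".split()
features = word_features(tokens, 1)  # features for the word "Yesterday"
print(features)
```

A model trained on features like these can learn, for example, that a capitalized word right after "play" is likely a song title, even if it never saw that title during training.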

3. Pre-built Named Entity Recognition

Through the spaCy library we have access to existing models built using large amounts of training data. When you have the option, it's a great idea to use these. To identify generic entities like places, dates, organizations, and so on, we can use spaCy's built-in NER. As before, we `import spacy`, then we load the model using `nlp = spacy.load('en')`, and pass our string to create a `Doc` object: `doc = nlp("my friend Mary has worked at Google since 2009")`. The named entities in the document are then accessible through the doc's `ents` attribute, which we can iterate over. The entity type is given by the `ent.label_` attribute, and the entity's text can be accessed through the `ent.text` attribute.
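Put together, the steps might look like the sketch below. It assumes spaCy and the small English model are installed; the model name `en_core_web_sm` is used here because recent spaCy versions no longer accept the `'en'` shortcut. The helper names `list_entities` and `demo` are made up for this example.

```python
def list_entities(doc):
    """Collect (text, label) pairs from a parsed spaCy document."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo():
    # Imported here so the helper above can be read without spaCy installed.
    import spacy

    # Assumption: the small English model has been downloaded beforehand.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("my friend Mary has worked at Google since 2009")
    for text, label in list_entities(doc):
        print(label, text)
```

Calling `demo()` prints one line per entity the model finds. Which entities are found, and with which labels, depends on the model version.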

4. Roles

Something we haven't discussed before is that entities in text can have different _roles_. For example, when we say "please book me a flight from Tel Aviv to Bucharest", both 'Tel Aviv' and 'Bucharest' are 'LOCATION' entities, but one is an origin and the other of course a destination. One very simple approach is to match the patterns "from X to Y" and "to Y from X" and assign roles that way. Here we've defined the two patterns separately, so if pattern 1 matches, we know the origin was given first and the destination second, and vice versa. But roles aren't always so simple, so we'll use a slightly more general approach.
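As a rough sketch, the two patterns could be written as regular expressions like the ones below. The pattern names and the `find_roles` helper are assumptions made for this example, and the exact expressions used in the course may differ.

```python
import re

# Pattern 1: origin first, e.g. "... from X to Y"
PATTERN_ORIGIN_FIRST = re.compile(
    r"\bfrom (?P<origin>.+?) to (?P<destination>.+)$"
)
# Pattern 2: destination first, e.g. "... to Y from X"
PATTERN_DESTINATION_FIRST = re.compile(
    r"\bto (?P<destination>.+?) from (?P<origin>.+)$"
)


def find_roles(message):
    """Return {'origin': ..., 'destination': ...} or None if nothing matches."""
    for pattern in (PATTERN_ORIGIN_FIRST, PATTERN_DESTINATION_FIRST):
        match = pattern.search(message)
        if match:
            return match.groupdict()
    return None


print(find_roles("please book me a flight from Tel Aviv to Bucharest"))
print(find_roles("please book me a flight to Bucharest from Tel Aviv"))
```

Because each pattern names its groups, the role of each location is known as soon as we know which pattern matched.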

5. Dependency parsing

Dependency parsing is a topic that's too big to cover in this course, but we will show how to use a parse tree to assign roles. A parse tree is a hierarchical structure that specifies parent-child relationships between the words in a phrase and that is *independent* of word order. In both the phrases "a flight to shanghai from singapore" and "a flight from singapore to shanghai", the word 'to' is the parent of the word 'shanghai', and 'from' is the parent of the word 'singapore'. We can refer to tokens in the spaCy document by their indices. Taking the first phrase, we assign the tokens with index 3 and 5 to the names `shanghai` and `singapore`. We can then access the parents of each token through its `ancestors` attribute, which returns an iterator over the token's ancestors in the parse tree.
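One way to turn the ancestors into roles is sketched below. The `ROLE_BY_PREPOSITION` table and the `assign_role` helper are hypothetical names introduced for this example; the sketch also assumes spaCy and the `en_core_web_sm` model are installed.

```python
# Hypothetical mapping from a parent preposition to the role it implies.
ROLE_BY_PREPOSITION = {"from": "origin", "to": "destination"}


def assign_role(token):
    """Return the role implied by the closest 'from'/'to' ancestor, or None."""
    # token.ancestors walks up the parse tree, starting with the direct parent.
    for ancestor in token.ancestors:
        role = ROLE_BY_PREPOSITION.get(ancestor.text.lower())
        if role is not None:
            return role
    return None


def demo():
    # Imported here so the helper above can be read without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: model is installed
    doc = nlp("a flight to shanghai from singapore")
    shanghai, singapore = doc[3], doc[5]
    for token in (shanghai, singapore):
        print(token.text, list(token.ancestors), assign_role(token))
```

Since the helper only looks at the tree and not at token positions, it should give the same roles for "a flight from singapore to shanghai", provided the model parses both phrases as described.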

6. Shopping example

Similarly, in a shopping scenario we might get a message like "let's see that jacket in red and some blue jeans". It's important not just to extract the colors, but also to know which items they belong to. We can assign these colors as follows. For each color we iterate over its ancestors and check whether we've encountered an item with the statement `if token in items`. The first parent item we find is the item to which this color belongs.
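The loop described above might be sketched as follows. The `assign_colors` helper is a name chosen for this example, and the token indices assume that spaCy splits "let's" into two tokens; spaCy and the `en_core_web_sm` model are assumed to be installed.

```python
def assign_colors(colors, items):
    """Pair each color token with the first item found among its ancestors."""
    pairs = []
    for color in colors:
        for token in color.ancestors:
            if token in items:
                pairs.append((color.text, token.text))
                break  # stop at the first parent item
    return pairs


def demo():
    # Imported here so the helper above can be read without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: model is installed
    doc = nlp("let's see that jacket in red and some blue jeans")
    items = [doc[4], doc[10]]  # jacket, jeans
    colors = [doc[6], doc[9]]  # red, blue
    for color, item in assign_colors(colors, items):
        print(f"color {color} belongs to item {item}")
```

The `break` matters: a color can have more than one item among its ancestors, and only the first one encountered on the way up the tree is taken as its owner.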

7. Let's practice!

Now it's your turn to extract entities from user messages.