1. Data Structures: Doc, Span and Token
Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the Doc, and its views Token and Span.
2. The Doc object
The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.
After creating the nlp object, we can import the Doc class from spacy.tokens.
Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!
The Doc class takes three arguments: the shared vocab, the words and the spaces.
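As a minimal sketch of the steps just described, the following builds a Doc by hand. The blank English pipeline from `spacy.blank("en")` and the three example words are assumptions chosen for illustration, not taken from the original slides.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline is enough here, since we only need its vocab
nlp = spacy.blank("en")

# The words and, for each word, whether it is followed by a space
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create the Doc manually: shared vocab, words, spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # Hello world!
```

Note that `spaces` has one entry per word, including the last one.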
3. The Span object (1)
A Span is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!
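To illustrate the exclusive end index, here is a small sketch that takes a Span by slicing a Doc. The example words and the blank English pipeline are assumptions made for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: blank English pipeline and a hand-built three-token Doc
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Slice tokens 1 and 2; the token at index 3 (if any) would not be included
span = doc[1:3]
print(span.text)             # world!
print(span.start, span.end)  # 1 3
```

The span covers the tokens at index 1 and 2 only, so its length is `end - start`.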
4. The Span object (2)
To create a Span manually, we can also import the class from spacy.tokens. We can then instantiate it with the doc and the span's start and end index.
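A minimal sketch of creating a Span by hand could look like this. As before, the blank English pipeline and the example words are assumptions for illustration.

```python
import spacy
from spacy.tokens import Doc, Span

# Assumption: blank English pipeline and a hand-built three-token Doc
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Create a Span manually: the doc, the start index and the (exclusive) end index
span = Span(doc, 0, 2)
print(span.text)  # Hello world
```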
To add an entity label to the span, we first need to look up the string in the string store. We can then provide it to the span as the label argument.
The doc.ents attribute is writable, so we can add entities manually by overwriting it with a list of spans.
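The labelling step can be sketched as follows. The label name "GREETING" and the example words are assumptions for illustration. The sketch uses `nlp.vocab.strings.add` rather than a plain lookup so that the hash is guaranteed to be present in the string store; recent spaCy versions also accept the label as a plain string.

```python
import spacy
from spacy.tokens import Doc, Span

# Assumption: blank English pipeline and a hand-built three-token Doc
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Add the label string to the string store and get its hash back
label = nlp.vocab.strings.add("GREETING")  # "GREETING" is a made-up label

# Create a Span with that label
span_with_label = Span(doc, 0, 2, label=label)

# Overwrite the document's entities with a list of spans
doc.ents = [span_with_label]

print([(ent.text, ent.label_) for ent in doc.ents])  # [('Hello world', 'GREETING')]
```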
5. Best practices
A few tips and tricks before we get started:
The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.
If your application needs to output strings, make sure to convert the doc to strings as late as possible. If you do it too early, you'll lose all relationships between the tokens.
To keep things consistent, try to use built-in token attributes wherever possible. For example, token.i for the token index.
Also, don't forget to always pass in the shared vocab!
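The tips above can be sketched in one small example. It looks for number tokens followed by a percent sign, works on Token and Span objects throughout, and only converts to strings at the very end. The sentence and the task itself are assumptions made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: blank English pipeline; always pass its shared vocab to the Doc
nlp = spacy.blank("en")
words = ["Sales", "rose", "12", "%", "in", "2023", "."]
spaces = [True, True, False, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

percentages = []
for token in doc:
    # Use built-in attributes: like_num for number-like tokens, i for the index
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            # Keep a Span, not a string, so token relationships stay available
            percentages.append(doc[token.i : token.i + 2])

# Convert to strings as late as possible
print([span.text for span in percentages])  # ['12%']
```

Because the results are kept as Span objects until the last line, attributes such as `span.start` or `span.doc` remain available to any later processing step.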
6. Let's practice!
Now let's try this out and create some Docs and Spans from scratch.