1. Data Structures: Vocab, Lexemes and StringStore
Welcome back! Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.
In this video, we'll take a look at the shared vocabulary and how spaCy deals with strings.
2. Shared vocab and string store (1)
spaCy stores all shared data in a vocabulary, the Vocab.
This includes words, but also the label schemes for tags and entities.
To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.
Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp dot vocab dot strings.
It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.
Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string from the hash alone. That's why we always need to pass around the shared vocab.
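To make the one-way nature of the hashes concrete, here's a minimal sketch using two standalone StringStore objects. The variable names are just for illustration, and the sketch assumes that looking up an unknown hash raises a KeyError, which is how recent spaCy versions behave.

```python
from spacy.strings import StringStore

# A store that has seen the string "coffee"
store = StringStore(["coffee"])
coffee_hash = store["coffee"]          # string -> hash
coffee_string = store[coffee_hash]     # hash -> string

# A second, empty store has never seen "coffee", so the same hash
# can't be turned back into a string there
empty_store = StringStore()
try:
    empty_store[coffee_hash]
    resolved = True
except KeyError:
    resolved = False

print("known store:", coffee_string)
print("empty store could resolve the hash:", resolved)
```

The hash itself is the same everywhere, but only a store that actually holds the string can map it back.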
3. Shared vocab and string store (2)
To get the hash for a string, we can look it up in nlp dot vocab dot strings.
To get the string representation of a hash, we can look up the hash in the same string store.
A Doc object also exposes its vocab and strings.
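Here's a short sketch of those lookups. It assumes a blank English pipeline created with spacy dot blank, so no trained model needs to be installed, and it uses the example text "I love coffee".

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("I love coffee")  # processing the text adds its words to the vocab

# Look up the string to get its hash
coffee_hash = nlp.vocab.strings["coffee"]
# Look up the hash to get the string back
coffee_string = nlp.vocab.strings[coffee_hash]
print("hash value:", coffee_hash)
print("string value:", coffee_string)

# The Doc exposes the same vocab and string store
doc_hash = doc.vocab.strings["coffee"]
print("hash via doc:", doc_hash)
```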
4. Lexemes: entries in the vocabulary
Lexemes are context-independent entries in the vocabulary.
You can get a lexeme by looking up a string or a hash ID in the vocab.
Lexemes expose attributes, just like tokens.
They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.
Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.
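As a quick sketch of what that looks like, the snippet below fetches a lexeme from the vocab, once by string and once by hash. It again assumes a blank English pipeline and the example word "coffee".

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Get a lexeme by looking up a string in the vocab
lexeme = nlp.vocab["coffee"]

# lexeme.orth is the hash ID, lexeme.text is the string,
# and lexeme.is_alpha checks for alphabetic characters
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

# Looking up the hash ID in the vocab gives the same entry
lexeme_from_hash = nlp.vocab[lexeme.orth]
print(lexeme_from_hash.text)
```

Attributes like these are the same wherever the word appears, which is why they can live on the lexeme rather than on each token.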
5. Vocab, hashes and lexemes
Here's an example.
The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.
Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.
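The chain from token to lexeme to hash to string can be sketched like this. The example assumes a blank English pipeline, so the tokens won't carry part-of-speech tags or dependencies here, and it uses token dot orth as the hash ID of each word.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

rows = []
for token in doc:
    # token.orth is the hash ID of the token's text
    lexeme = doc.vocab[token.orth]
    # Resolve the hash back to a string via the string store
    string = doc.vocab.strings[lexeme.orth]
    rows.append((token.text, lexeme.orth, string))

for text, hash_id, string in rows:
    print(text, hash_id, string)
```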
6. Let's practice!
This all sounds a bit abstract – so let's take a look at the vocabulary and string store in practice.