Scaling and performance

1. Scaling and performance

In this video, I'll show you a few tips and tricks to make your spaCy pipelines run as fast as possible, and process large volumes of text efficiently.

2. Processing large volumes of text

If you need to process a lot of texts and create a lot of Doc objects in a row, the nlp dot pipe method can speed this up significantly. It processes the texts as a stream and yields Doc objects. It is much faster than just calling nlp on each text, because it batches up the texts. Keep in mind that nlp dot pipe is a generator that yields Doc objects, so in order to get a list of Docs, remember to wrap it in a call to the list function.
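A minimal sketch of the difference, using a blank English pipeline so no trained model needs to be downloaded (the example texts are made up for illustration):

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components needed
nlp = spacy.blank("en")

TEXTS = ["This is a text", "And another one", "A third example"]

# Slow pattern: one call to nlp per text
docs = [nlp(text) for text in TEXTS]

# Faster: nlp.pipe batches the texts as a stream.
# It is a generator, so wrap it in list() to get a list of Docs.
docs = list(nlp.pipe(TEXTS))
print([doc.text for doc in docs])
```

With a blank pipeline the speed difference is small; the batching pays off once trained components like the tagger or parser are in the pipeline.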

3. Passing in context (1)

nlp dot pipe also supports passing in tuples of (text, context) if you set "as tuples" to True. The method will then yield (doc, context) tuples. This is useful for passing in additional metadata, like an ID associated with the text, or a page number.
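Here is what that looks like in practice. The texts and the "id" and "page_number" keys are made-up metadata for illustration; the context can be any object:

```python
import spacy

nlp = spacy.blank("en")

# (text, context) pairs: the context here is a dict of metadata
data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

# With as_tuples=True, nlp.pipe yields (doc, context) tuples
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"])
```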

4. Passing in context (2)

You can even add the context metadata to custom attributes. In this example, we're registering two extensions, "id" and "page number", which default to None. After processing the text and passing through the context, we can overwrite the doc extensions with our context metadata.
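A sketch of that pattern, again with made-up metadata. The force=True flag is an assumption added here so the snippet can be re-run without raising an error if the extensions are already registered:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Register the extensions on the Doc, defaulting to None
Doc.set_extension("id", default=None, force=True)
Doc.set_extension("page_number", default=None, force=True)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

# Overwrite the doc extensions with the context metadata
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    print(doc.text, doc._.id, doc._.page_number)
```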

5. Using only the tokenizer

Another common scenario: Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text. Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.

6. Using only the tokenizer (2)

If you only need a tokenized Doc object, you can use the nlp dot make doc method instead, which takes a text and returns a Doc. This is also how spaCy does it behind the scenes: nlp dot make doc turns the text into a Doc before the pipeline components are called.
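A minimal sketch, using a blank pipeline and an example sentence:

```python
import spacy

nlp = spacy.blank("en")

text = "Hello world!"

# Only tokenize: nlp.make_doc turns the text into a Doc
# without running any pipeline components
doc = nlp.make_doc(text)
print([token.text for token in doc])
```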

7. Disabling pipeline components

spaCy also allows you to temporarily disable pipeline components using the nlp dot disable pipes context manager. It takes a variable number of arguments: the string names of the pipeline components to disable. For example, if you only want to use the entity recognizer to process a document, you can temporarily disable the tagger and parser. Within the with block, spaCy will only run the remaining components; after the with block, the disabled pipeline components are automatically restored.
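A sketch of the before/inside/after behavior. For illustration it adds untrained components to a blank pipeline just to show the pipeline structure; in practice you would load a trained pipeline such as "en_core_web_sm". Note that newer spaCy versions (v3) also offer nlp dot select pipes with a "disable" argument for the same purpose:

```python
import spacy

nlp = spacy.blank("en")
# Add components so there is something to disable
# (untrained here, purely to illustrate the pipeline structure)
for name in ["tagger", "parser", "ner"]:
    nlp.add_pipe(name)

print(nlp.pipe_names)  # all three components

# Temporarily disable the tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Only the remaining components are active in here
    print(nlp.pipe_names)

# After the with block, the disabled components are restored
print(nlp.pipe_names)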

8. Let's practice!

Now it's your turn. Let's try out the new methods and optimize some code to be faster and more efficient.