
Custom pipeline components

1. Custom pipeline components

Now that you know how spaCy's pipeline works, let's take a look at another very powerful feature: custom pipeline components. They let you add your own function to the spaCy pipeline that is executed when you call the nlp object on a text – for example, to modify the Doc and add more data to it.

2. Why custom components?

After the text is tokenized and a Doc object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own. Custom components are executed automatically when you call the nlp object on a text. They're especially useful for adding your own custom metadata to documents and tokens. You can also use them to update built-in attributes, like the named entity spans.

3. Anatomy of a component (1)

Fundamentally, a pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline. Components can be added to the pipeline using the nlp.add_pipe method. The method takes at least one argument: the component function.
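Here's a minimal sketch of that shape. One version caveat: the description above matches spaCy v2, where the function itself is passed to nlp.add_pipe; in spaCy v3 and later, the function must first be registered under a name with the @Language.component decorator and is then added by that name, as shown here. The component name is made up for illustration.

```python
import spacy
from spacy.language import Language

# Register the component under a name (required in spaCy v3+;
# in v2 you would pass the function itself to nlp.add_pipe)
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here, then hand it on
    return doc

nlp = spacy.load("en_core_web_sm")

# Add the component to the pipeline (appended at the end by default)
nlp.add_pipe("custom_component")
```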

4. Anatomy of a component (2)

To specify *where* to add the component in the pipeline, you can use the following keyword arguments: Setting last=True will add the component last in the pipeline. This is the default behavior. Setting first=True will add the component first in the pipeline, right after the tokenizer. The before and after arguments let you name an existing component to add the new component before or after. For example, before="ner" will add it before the named entity recognizer. The component you're adding relative to needs to exist, though – otherwise, spaCy will raise an error.
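The sketch below shows all four placement options side by side, again assuming the spaCy v3 API. The name argument is only there so the four demo copies can coexist in one pipeline; the component and copy names are assumptions chosen for illustration.

```python
import spacy
from spacy.language import Language

# A do-nothing component, just to demonstrate placement
@Language.component("noop_component")
def noop_component(doc):
    return doc

nlp = spacy.load("en_core_web_sm")

# One call per placement option; "name" gives each copy a unique name
nlp.add_pipe("noop_component", name="added_last", last=True)         # default: end of the pipeline
nlp.add_pipe("noop_component", name="added_first", first=True)       # right after the tokenizer
nlp.add_pipe("noop_component", name="before_ner", before="ner")      # before the named entity recognizer
nlp.add_pipe("noop_component", name="after_tagger", after="tagger")  # after the part-of-speech tagger

print(nlp.pipe_names)
```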

5. Example: a simple component (1)

Here's an example of a simple pipeline component. We start off with the small English model. We then define the component – a function that takes a Doc object and returns it. Let's do something simple and print the length of the Doc that passes through the pipeline. Don't forget to return the Doc so it can be processed by the next component in the pipeline! The Doc created by the tokenizer is passed through all components, so it's important that they all return the modified doc. We can now add the component to the pipeline. Let's add it to the very beginning right after the tokenizer by setting first=True. When we print the pipeline component names, the custom component now shows up at the start. This means it will be applied first when we process a Doc.
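Here's a runnable reconstruction of that example, again written against the spaCy v3 API; the name length_component is an assumption, chosen for illustration.

```python
import spacy
from spacy.language import Language

# Define and register the custom component
@Language.component("length_component")
def length_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc so the next component can process it
    return doc

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline, right after the tokenizer
nlp.add_pipe("length_component", first=True)

# The custom component now shows up at the start of the component names
print("Pipeline:", nlp.pipe_names)
```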

6. Example: a simple component (2)

Now when we process a text using the nlp object, the custom component will be applied to the Doc and the length of the document will be printed.
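Continuing the sketch above, processing a text runs every pipeline component in order, so the custom component fires:

```python
# Process a text – the custom component prints the Doc's length
doc = nlp("Hello world!")
# Expected output: Doc length: 3
```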

7. Let's practice!

Time to put this into practice and write your first pipeline component!
