1. spaCy pipelines
Welcome! We previously learned about spaCy pipelines. Let's explore them further.
2. spaCy pipelines
Recall that when we call nlp on a text, spaCy first tokenizes the text to produce a Doc container.
The Doc object is then processed in several different steps, known as the processing pipeline.
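As a minimal sketch of this first step, even a blank spaCy pipeline (with no trained components) includes the tokenizer, so calling nlp on a text always produces a tokenized Doc container:

```python
import spacy

# A blank English pipeline still includes the tokenizer,
# which is the first step of every spaCy pipeline
nlp = spacy.blank("en")

# Calling nlp on a text produces a Doc container of tokens
doc = nlp("spaCy processes text in a pipeline.")
tokens = [token.text for token in doc]
print(tokens)  # note the final period becomes its own token
```

With a trained model loaded via spacy-dot-load, the remaining pipeline components would then run on this Doc in sequence.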
3. spaCy pipelines
Continuing our work with spaCy pipelines, in this video we will explore how to create pipeline components and add them to an existing or blank spaCy pipeline.
A pipeline is a sequence of pipes (pipeline components), or actors on data, that make alterations to the data or extract information from it. In some cases, later pipes require the output from earlier components, while in other cases, a pipe can exist entirely on its own.
As an example, a named entity recognition pipeline can use three pipes: a Tokenizer pipe, which is the first processing step in spaCy pipelines; a rule-based named entity recognizer known as the EntityRuler, which finds entities; and an EntityLinker pipe that resolves each entity to a unique identifier in a knowledge base.
Through this processing pipeline, an input text is converted to a Doc container with its corresponding annotated entities. We can use the doc-dot-ents attribute to find the entities in the input text.
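A small sketch of the rule-based part of such a pipeline: we add an EntityRuler to a blank model and inspect doc-dot-ents. The pattern used here is a hypothetical example for illustration:

```python
import spacy

# Blank pipeline: tokenizer only, then a rule-based entity recognizer
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical pattern: label the string "DataCamp" as an ORG entity
ruler.add_patterns([{"label": "ORG", "pattern": "DataCamp"}])

doc = nlp("I am learning spaCy on DataCamp.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

An EntityLinker would normally come after this step, but it requires a knowledge base, so it is omitted from this sketch.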
4. Adding pipes
We often use an existing spaCy model. However, in some cases, an off-the-shelf model will not satisfy our requirements.
An example of this is sentence segmentation for a long document with 10,000 sentences. To recall, sentence segmentation means splitting a text into its individual sentences. The Sentencizer is the spaCy pipeline component that performs sentence segmentation.
Given a document with 10,000 sentences, even the smallest and most efficient English model, en_core_web_sm, can take a long time to process and separate them. The reason is that calling an existing spaCy model on a text activates the whole NLP pipeline, meaning every pipe, from named entity recognition to dependency parsing, runs on the text. This can increase computation time as much as 100-fold.
5. Adding pipes
In this instance, we can create a blank spaCy English model using spacy-dot-blank("en") and add the sentencizer component to the pipeline using the -dot-add_pipe method of the nlp model.
By creating a blank model and simply adding a sentencizer pipe, we can considerably reduce computational time. The reason is that in this version of the spaCy model, only the intended pipeline component (sentence segmentation) will run on the given documents.
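The approach described above can be sketched as follows, using a short two-sentence text in place of a long document:

```python
import spacy

# Blank English pipeline with only the sentencizer added;
# no tagging, parsing, or NER runs, so processing stays fast
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because the pipeline contains a single lightweight component, the same code scales to documents with thousands of sentences far faster than a full trained model would.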
6. Analyzing pipeline components
spaCy allows us to analyze a spaCy pipeline to check whether any required attributes are not set.
The nlp-dot-analyze_pipes method analyzes the components in a pipeline and outputs structured information about them, such as the attributes they set on the Doc and Token, whether they retokenize the Doc, and which scores they produce during training. It also shows warnings if components require values that are not set by previous components; for example, when the EntityLinker is used but no earlier component sets named entities. When calling the nlp-dot-analyze_pipes() method, we can also set the pretty argument to True, which prints a nicely organized table of the analysis results.
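A minimal sketch of this analysis, using the blank sentencizer pipeline from earlier as an example:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes returns a structured report; pretty=True also
# prints the results as a formatted table
analysis = nlp.analyze_pipes(pretty=True)

# The report includes a per-component summary and any problems
# (attributes a component requires but no earlier component sets)
print(analysis["problems"])
```

Here the sentencizer has no unmet requirements, so its list of problems is empty.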
7. Analyzing pipeline components
The snapshot shows the results of the analyze_pipes method. While we won't go into the technical details of all the fields, we are already familiar with some of the components and attributes shown here. In this case, the result of the analysis is "No problems found".
8. Let's practice!
Let's practice what we've learned.