
Customizing spaCy models

1. Customizing spaCy models

Welcome! We have learned to use spaCy model functionality such as POS tagging and NER. We'll now learn about situations where we might want to customize spaCy models.

2. Why train spaCy models?

spaCy models go a long way for general NLP use cases such as splitting a document into sentences, understanding sentence syntax, and extracting named entities. However, sometimes we work on text data from specific domains that spaCy models haven't seen during their training. For example, Twitter data can contain hashtags or emoticons, which may not have any specific meaning outside the Twitter platform. Additionally, Twitter posts are often phrases rather than full sentences. As a result, we might observe low-quality sentence segmentation on this data if we use off-the-shelf spaCy models. Similarly, text data from the medical domain typically contains named entities such as drugs and diseases. We don't expect these entities to be classified accurately by existing spaCy NER models, because those models generally don't include disease or drug entity labels and will perform poorly on such domain data. In such scenarios, it is worthwhile to train a spaCy model using our own domain-specific text data. The snapshot shows example results from an NER model that was trained on medical domain data and hence performs well.
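As a quick illustration (the sentence below is our own hypothetical example, not one from the slides), we can run an off-the-shelf model on a medical sentence and inspect what it finds:

import spacy

# Load a general-purpose English pipeline (assumes en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("The patient was prescribed 500 mg of metformin for type 2 diabetes.")

# Print every predicted entity with its label.
for ent in doc.ents:
    print(ent.text, ent.label_)

# The model may tag spans like "500 mg" with general labels such as QUANTITY,
# but it has no DISEASE, DOSAGE, or CHEMICAL labels to assign to
# "metformin" or "type 2 diabetes".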

3. Why train spaCy models?

We can usually make the model more accurate by showing it examples from our domain, and we often also want to predict categories specific to our problem. Before starting to train, we need to ask two questions: do spaCy models perform well enough on our data, and does our domain include many labels that are absent from the spaCy models?

4. Models performance on our data

To determine if training is needed, let's start with the question of whether existing spaCy models perform well enough on our data. If they do, we can use the existing models in our NLP pipeline. However, there are multiple scenarios where the existing models do not perform as expected. For example, the en_core_web_sm spaCy model will not correctly classify Oxford Street in "The car was navigating to the Oxford Street." as a location with a GPE label; instead, it identifies this location as an organization with an ORG label. This is because the model did not observe similar location examples during its training phase, but it might have observed Oxford in the names of organizations, so it confuses this GPE entity with one of ORG type. If we observe such behavior from a spaCy model, we should train the model further to improve its performance.
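To reproduce this check, we can load the small English model and print the entities it predicts for the sentence; a minimal sketch follows (the exact label assigned may vary between model versions):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The car was navigating to the Oxford Street.")

# Print each predicted entity and its label; per the example above,
# "Oxford Street" can come back as ORG rather than the expected GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)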

5. Output labels in spaCy models

Before rushing to train our own models, we also need to confirm whether the existing spaCy models are missing output labels that we need. The snapshot shows an example of NER entities from the common English domain on the top and from the medical domain on the bottom. The common-domain entities (LOC, ORG, DATE) used to train existing spaCy models are considerably different from medical-domain entities (DISEASE, DOSAGE, CHEMICAL).
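One way to check which output labels a pipeline already supports is to inspect its NER component directly; here is a minimal sketch assuming the en_core_web_sm model:

import spacy

nlp = spacy.load("en_core_web_sm")

# The NER component exposes the entity labels it was trained to predict,
# e.g. GPE, ORG, DATE, but no DISEASE, DOSAGE, or CHEMICAL.
print(nlp.get_pipe("ner").labels)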

6. Output labels in spaCy models

It is clear that the existing spaCy models lack many of the output labels needed for an NER task on medical domain data and do not perform well on our data. In such cases, we'll need to first collect our domain-specific data, annotate it, and then update an existing model or train a model from scratch with our data.
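As a preview of the annotation step, spaCy v3 represents training data as Example objects built from a text and its entity character offsets. The sentence, offsets, and labels below are hypothetical placeholders for real annotated domain data:

import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

# Hypothetical annotation: character offsets marking a CHEMICAL and a DISEASE span.
text = "Metformin is used to treat type 2 diabetes."
annotations = {"entities": [(0, 9, "CHEMICAL"), (27, 42, "DISEASE")]}

# New labels must be registered with the NER component before training.
ner = nlp.get_pipe("ner")
ner.add_label("CHEMICAL")
ner.add_label("DISEASE")

# Wrap the raw text and its gold annotations into a training Example.
example = Example.from_dict(nlp.make_doc(text), annotations)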

7. Let's practice!

Great! Let's practice and then begin our journey of training spaCy models.