spaCy training data format

1. Training data preparation

Welcome! Now that we have learned how to identify whether training a spaCy model is necessary, let's learn how to prepare training data.

2. Training steps

spaCy allows us to update existing models using examples from our own annotated data. To do so, we initialize a spaCy model, either with weights from an existing model or with random values. Next, we predict a batch of examples with the current weights. The model then compares its predictions with the correct answers we provide and adjusts the weights to achieve better results. Optimizer objects are used at this stage; we will learn more about them later on. We then move on to the next batch of examples, and spaCy keeps calling the model to predict another batch and refine the weights until all the data has been processed.
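The steps above can be sketched as a short loop. This is only a minimal illustration under our own assumptions: a blank English pipeline (random initial weights), a tiny made-up dataset, a batch size of 2, and 5 passes over the data. None of these choices come from the lesson itself.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Hypothetical toy dataset: (text, annotations) pairs with character offsets.
training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("Bill Gates visited the SFO Airport.",
     {"entities": [(0, 10, "PERSON"), (23, 34, "LOC")]}),
    ("I will go.", {"entities": []}),
]

# Step 1: initialize a model. A blank pipeline starts from random weights.
nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Wrap each record in an Example object, the format nlp.update() expects.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in training_data
]

# initialize() sets up the weights and labels and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    # Steps 2-4: predict a batch, compare with the gold answers,
    # update the weights, then move on to the next batch.
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)
    print(epoch, losses)
```

With so little data the loss values are not meaningful; the point is only the order of the steps.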

3. Annotating and preparing data

It is clear now that the first step of training a model is always preparing training data. spaCy's model training code works with dictionaries. After collecting data, we annotate it in the format a spaCy model requires. Annotation means labeling the intent, entities, POS tags, and so on. We can see an example of an annotated data record for a NER task in the medical domain. The record has two key-value pairs: a "sentence" key that holds the input text, and an "entities" key that captures all the labeled entities in that text. In this instance, there is only one labeled entity, with the entity type "Medicine".
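A record of this shape might look like the following. The sentence and the inner "label" and "value" field names are our own illustrative assumptions; only the "sentence" and "entities" keys and the "Medicine" type come from the description above.

```python
# Hypothetical annotated record for a medical NER task.
annotated_data = {
    "sentence": "The patient was prescribed ibuprofen for joint pain.",
    "entities": {
        "label": "Medicine",   # entity type
        "value": "ibuprofen",  # the annotated span of text
    },
}

# Sanity check: the annotated value must actually occur in the sentence.
entity = annotated_data["entities"]
found = entity["value"] in annotated_data["sentence"]
print(entity["label"], entity["value"], found)
```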

4. Annotating and preparing data

Let's check another example of an annotated data record, this time for a NER task on general English text. In this instance, the annotated data has two entities for the given text. In such scenarios, the "entities" key holds a list of dictionaries, one per entity. For example, the first element captures the Bill Gates entity with the type PERSON, and the second element captures the SFO Airport entity with the type LOC (location).
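Here is a sketch of such a record, together with a small helper that looks up where each entity sits in the text. The sentence, the inner field names, and the helper `to_offsets` are hypothetical and only serve the illustration.

```python
# Hypothetical record with two labeled entities.
annotated_data = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [
        {"label": "PERSON", "value": "Bill Gates"},
        {"label": "LOC", "value": "SFO Airport"},
    ],
}


def to_offsets(record):
    """Return (start, end, label) tuples for each entity in the record.

    Assumes every value occurs in the sentence; only the first
    occurrence is used.
    """
    text = record["sentence"]
    offsets = []
    for entity in record["entities"]:
        start = text.find(entity["value"])
        if start == -1:
            raise ValueError(f"{entity['value']!r} not found in sentence")
        offsets.append((start, start + len(entity["value"]), entity["label"]))
    return offsets


offsets = to_offsets(annotated_data)
print(offsets)
```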

5. spaCy training data format

The goal of data annotation is to prepare training data and show the spaCy model what we want it to learn. The annotated data has to be stored in a dictionary format, and for each labeled span we also need to provide its start and end character offsets in the text. Let's see an example of a training dataset. This dataset consists of three example pairs for a named entity recognition task. The first element of each pair is a sentence. The second element is a list of annotated entities, each given by its start character, end character, and label.
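A dataset in this format could look like the sketch below. The sentences are made up for illustration, and the loop at the end is just a quick check we add to confirm that each pair of offsets selects the intended text.

```python
# Three (sentence, annotations) pairs; each entity is (start, end, label).
# The end offset is exclusive, as in Python slicing.
training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("Bill Gates visited the SFO Airport.",
     {"entities": [(0, 10, "PERSON"), (23, 34, "LOC")]}),
    ("I will go.", {"entities": []}),  # a sentence with no entities
]

# Collect the text selected by each span so we can inspect it.
spans = []
for text, annotations in training_data:
    for start, end, label in annotations["entities"]:
        spans.append((text[start:end], label))

print(spans)
```

Checking the slices this way catches off-by-one errors in the offsets before any training starts.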

6. Example object data for training

We cannot feed the raw text and annotations directly to spaCy; we need to create an Example object for each training example. Let's check an example for a NER model. Let's assume we have a training data point that we want to feed to our NER component so that the model learns to predict Austin as GPE (geopolitical entity). First, we convert the associated text to a Doc container. Then we use the Example class from spaCy to convert the Doc container and the relevant annotation to an Example object, which is compatible with spaCy training. For this purpose, we use the Example.from_dict() method and pass two arguments: the Doc container and the annotations dictionary. We can view the attributes that are processed and stored in the Example object by using the example.to_dict() method.
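In code, these steps might look as follows. We assume a blank English pipeline here so the snippet runs without downloading a trained model, and the sentence is our own; any loaded spaCy pipeline could take its place.

```python
import spacy
from spacy.training import Example

# Assumption: a blank pipeline is enough, since we only need tokenization.
nlp = spacy.blank("en")

text = "I will visit you in Austin."
annotations = {"entities": [(20, 26, "GPE")]}  # (start, end, label)

# Convert the raw text to a Doc container.
doc = nlp.make_doc(text)

# Combine the Doc and the annotations into an Example object.
example = Example.from_dict(doc, annotations)

# The reference Doc carries the gold-standard annotations.
gold_entities = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_entities)

# Inspect everything stored in the Example object.
example_dict = example.to_dict()
print(example_dict)
```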

7. Let's practice!

Great! We learned about the training data format and the Example object, which converts training data to a format compatible with training a spaCy model. Let's practice what we have learned.
