
spaCy basics

1. spaCy basics

Let's learn more about spaCy and some of its core functionalities.

2. spaCy NLP pipeline

We previously learned that a spaCy NLP pipeline is created when we load a spaCy model. We started by importing spaCy, then calling spacy-dot-load() to return an nlp object, an instance of the spaCy Language class. The Language class is the text processing pipeline and applies all necessary preprocessing steps to our input text behind the scenes. After that, we can apply nlp() to any given text to return a Doc container.
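
A minimal sketch of these steps, assuming the small English model en_core_web_sm has been downloaded:

import spacy

# Load a trained pipeline; spacy.load() returns an nlp object (a Language instance)
nlp = spacy.load("en_core_web_sm")

# Calling nlp() on text runs the full pipeline and returns a Doc container
doc = nlp("We are learning spaCy.")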

3. spaCy NLP pipeline

Let's learn more about the spaCy NLP pipeline. Every NLP application consists of several steps of text processing. spaCy applies a series of preprocessing steps when we call nlp(), an instance of the spaCy Language class, on the text. Some of these processing steps are tokenization, tagging, parsing, named entity recognition and many others, all of which result in a Doc container.

4. Container objects in spaCy

The Doc object is only one of the container classes that spaCy supports. spaCy uses multiple data structures to represent text data. Container classes such as Doc hold information about sentences, words and the text. Another container class is the Span object, which represents a slice of a Doc object; spaCy also has a Token class, which represents an individual token, such as a word or a punctuation symbol.
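
A short sketch of how the three container types relate (the example text is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are learning NLP.")  # Doc: the whole processed text

span = doc[0:3]  # Span: a slice of the Doc ("We are learning")
token = doc[2]   # Token: a single token ("learning")

print(type(doc), type(span), type(token))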

5. Pipeline components

All the container classes are generated during the spaCy NLP processing steps. Each of the processing steps we saw in the spaCy pipeline has a well-defined task. In this course, we mostly focus on the tokenizer, tagger, lemmatizer, and ner components. As shown, the tokenizer creates the Doc object and segments the text into tokens. Then the tagger and other components add more attributes, such as part-of-speech tags, and label named entities.
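
To see which components a loaded pipeline applies after the tokenizer, we can inspect its pipe_names attribute; the exact list depends on the model version, so treat the output below as illustrative:

import spacy

nlp = spacy.load("en_core_web_sm")

# Components applied in order after tokenization
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']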

6. Pipeline components

There are many more text processing components available in spaCy, and it is worth highlighting some of the other important components of an nlp instance and their duties, such as Language, DependencyParser, and Sentencizer. Each component has unique features that help us process our text better. We will see more examples of each component throughout the course; a small sketch follows below.
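
As one illustration of working with such components, here is a sketch that adds the rule-based Sentencizer to a blank English pipeline (the example text is of our choosing):

import spacy

# A blank pipeline contains only a tokenizer; add a Sentencizer to set sentence boundaries
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("We are learning NLP. This course introduces spaCy.")
print([sent.text for sent in doc.sents])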

7. Tokenization

We introduced tokenization earlier, but let's explore it further. Tokenization is always the first processing step in a spaCy NLP pipeline, as all other processing steps require the tokens of a given text. Recall that tokenization splits a sentence into its tokens: the smallest meaningful pieces of text. Tokens can be words, numbers and punctuation. The code segment below shows the tokenization process we've seen before, using a small English spaCy model. Once we apply the nlp object to the input sentence and create a Doc object, we can access each Token using a list comprehension and print a token's text with the -dot-text attribute.
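
A reconstruction of that code segment might look like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are learning NLP.")

# Access each Token via a list comprehension and print its text
print([token.text for token in doc])
# ['We', 'are', 'learning', 'NLP', '.']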

8. Sentence segmentation

Sentence segmentation, or breaking a text into its sentences, is a more complex task than tokenization due to the difficulties of handling punctuation and abbreviations. Sentence segmentation happens as part of the DependencyParser pipeline component. We use a for loop to iterate over the sentences of "We are learning NLP. This course introduces spaCy." via the dot-sents property of a Doc container. Then, we can use the dot-text attribute to access each sentence's text.
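
A sketch of that loop:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We are learning NLP. This course introduces spaCy.")

# Iterate over sentences via the .sents property and print each one
for sent in doc.sents:
    print(sent.text)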

9. Lemmatization

Lemmatization, one of the spaCy processing steps, reduces word forms to their lemmas. A lemma is the base form of a token, the form in which the token appears in a dictionary. For instance, the lemma of the words "eats" and "ate" is "eat". Lemmatization improves the accuracy of many language modeling tasks. We iterate over tokens to get their text and lemmas using token-dot-text and token-dot-lemma_.
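
A brief sketch, with an example sentence of our choosing:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza and eats an apple.")

# Print each token's text alongside its lemma
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. "ate -> eat" and "eats -> eat"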

10. Let's practice!

Let's practice!
