Natural Language Processing (NLP) basics

1. Natural Language Processing (NLP) basics

Welcome to this course! I'm Azadeh, a principal data scientist. In this course, we'll explore Natural Language Processing (NLP) using spaCy.

2. Natural Language Processing (NLP)

NLP is a subfield of artificial intelligence that combines computer science and linguistics to help computers understand, analyze, and generate human language. NLP helps extract insights from unstructured data. Unstructured data, such as textual data, is information that is not organized in a pre-defined manner. NLP incorporates statistics, machine learning, and deep learning models to understand human language, intent, and sentiment.

3. NLP use cases

NLP has many applications. We will introduce three well-known use cases: sentiment analysis, named-entity recognition, and chatbots. Sentiment analysis is the use of computers to interpret the underlying subjective tone of a piece of text, and categorize it into positive, neutral, or negative classes. For example, a review about great service and affordable price is classified with a positive sentiment, while a review of a horrible experience is categorized with a negative sentiment.
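To make the three sentiment classes concrete, here is a deliberately simplistic sketch. It uses a hand-written word list (a lexicon-based approach), which is an assumption made purely for illustration; real sentiment models are trained on data, and the word lists and the function name classify_sentiment are hypothetical.

```python
# Toy word lists, invented for this illustration only.
POSITIVE_WORDS = {"great", "affordable", "excellent", "good"}
NEGATIVE_WORDS = {"horrible", "terrible", "bad", "poor"}


def classify_sentiment(review):
    """Label a review as positive, negative, or neutral by counting cue words."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = [word.strip(".,!?").lower() for word in review.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("Great service and an affordable price."))  # positive
print(classify_sentiment("It was a horrible experience."))           # negative
print(classify_sentiment("The store opens at nine."))                # neutral
```

A word-counting rule like this misses negation and context ("not great"), which is one reason statistical and deep learning models are used in practice.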

4. NLP use cases

The next NLP use case is named-entity recognition (NER). NER is used in information extraction to locate named entities in unstructured text and classify them into predefined categories. Entities are real-world objects such as a person or a location. For example, given the sentence "John McCarthy was born on September 4, 1927.", NER would classify John McCarthy as a person's name, highlighted in blue here, and September 4, 1927 as a date, highlighted in red.
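As a rough sketch of what this looks like with spaCy, the library used in this course, the snippet below reads the entities a pipeline attaches to a processed text. The helper name extract_entities is hypothetical, and the snippet assumes the "en_core_web_sm" model is installed; if it is not, it prints a hint instead of failing.

```python
import spacy


def extract_entities(nlp, text):
    """Return (entity text, entity label) pairs found by the given pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model package is not installed.
    nlp = None
    print("Model missing; run: python -m spacy download en_core_web_sm")

if nlp is not None:
    print(extract_entities(nlp, "John McCarthy was born on September 4, 1927."))
```

With a trained English model, the result would typically include a person entity for John McCarthy and a date entity for September 4, 1927, though the exact labels and spans depend on the model version.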

5. NLP use cases

Another NLP use case is text generation in chatbots. ChatGPT is one example: it is built on a transformer-based language model trained on a vast amount of unstructured text data.

6. Introduction to spaCy

Now that we have learned about NLP, let's learn more about spaCy and how we can use it in our NLP projects. spaCy is a free and open-source library for NLP in Python, designed to simplify building systems for information extraction. spaCy provides production-ready code that is widely used for NLP use cases. It supports 64+ languages. It is robust and fast, and it has built-in visualizers for various NLP functionalities.

7. Install and import spaCy

As the first step, we install spaCy using pip, the Python package installer. We can then download any spaCy model by running Python with the -m spacy option, followed by the download command and the model name. Here we choose "en_core_web_sm", the smallest English model. After downloading the model, we import spacy and create an nlp object by passing the model name in quotation marks to the spacy.load function. spaCy has multiple trained models for the English language, which are available for download from the spacy.io website.
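The installation and loading steps might look like the following sketch. The two shell commands appear as comments because they are run in a terminal, not in Python. The fallback to a blank English pipeline when the model is missing is an assumption added here so the snippet still runs; it is not part of the course material.

```python
# Run once in a terminal (shell commands, not Python):
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy

MODEL_NAME = "en_core_web_sm"

try:
    # Load the trained pipeline by passing its name as a string.
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # Assumption for this sketch: spacy.load raises OSError when the model
    # is not installed, so fall back to a tokenizer-only English pipeline.
    print(f"{MODEL_NAME} is not installed; using a blank English pipeline.")
    nlp = spacy.blank("en")

# List the processing components in the pipeline (empty for a blank one).
print(nlp.pipe_names)
```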

8. Read and process text with spaCy

Now that our nlp object is ready, we can move on to reading and processing text. The loaded spaCy model (the nlp object) can process text and convert it into a Doc object, which is a container that stores the processed text. The Doc object contains information such as tokens, linguistic annotations, and relationships within the text. We'll learn about each of these later in the chapter.
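A minimal sketch of the Doc container is shown below. The sample sentence is made up for illustration, and the fallback to a blank pipeline is an assumption so the snippet runs even without the downloaded model.

```python
import spacy
from spacy.tokens import Doc

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: use a tokenizer-only pipeline if the model is missing.
    nlp = spacy.blank("en")

text = "spaCy stores processed text in a Doc object."  # illustrative sentence
doc = nlp(text)

print(isinstance(doc, Doc))  # calling the pipeline returns a Doc
print(doc.text == text)      # the original text is preserved
print(len(doc))              # number of tokens in the Doc
print(doc[0].text)           # tokens can be accessed by index

# Part-of-speech tags are filled in only by a trained pipeline;
# with the blank fallback they are empty strings.
print([(token.text, token.pos_) for token in doc])
```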

9. spaCy in action

Let's look at an example of processing text with spaCy. This example uses a preprocessing step known as tokenization. The first step is to read the text, in this case the string "A spaCy pipeline object is created.". We convert this text into a Doc object by running a loaded spaCy model, nlp, on the text. Now we can use a list comprehension to print all tokens of the input text: token.text for token in doc. A token is the smallest meaningful part of a text. The process of dividing a text into a list of meaningful tokens is called tokenization.
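The tokenization example can be sketched as follows. Because only tokenization is needed here, this sketch assumes a blank English pipeline (tokenizer only) so that it runs without a model download; replacing that line with nlp = spacy.load("en_core_web_sm") should give the same tokens for this sentence.

```python
import spacy

# Assumption: a blank English pipeline stands in for the loaded model.
nlp = spacy.blank("en")

text = "A spaCy pipeline object is created."
doc = nlp(text)  # convert the text into a Doc object

# Collect the text of every token in the Doc.
tokens = [token.text for token in doc]
print(tokens)
# ['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
```

Note that the final period becomes its own token, which is why the list has seven items rather than six.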

10. Let's practice!

Let's practice what we've learned!