1. Introduction to preprocessing for text
Hi, my name is Shubham, your instructor for this course.
2. What we will learn
We will explore deep learning using PyTorch for text classification and generation.
We'll cover text encoding, deep learning models for text, and advanced topics, including the transformer architecture and protecting our models from attacks. These skills apply to real-world tasks such as sentiment analysis, text summarization, and machine translation.
3. What you should know
Before we begin, you should already be familiar with developing deep learning models with PyTorch, including training and evaluation loops, and have familiarity with convolutional and recurrent neural networks.
4. Text processing pipeline
Welcome to the text processing pipeline! Our text analysis approach in PyTorch involves three stages: preprocessing, encoding, and loading data with Dataset and DataLoader. This video focuses on preprocessing; we will explore encoding and recap Dataset and DataLoader later in the chapter. Let's begin.
5. Text processing pipeline
In preprocessing, we clean and prepare the text data for encoding.
6. PyTorch and NLTK
Preprocessing raw text data utilizes natural language processing techniques. We'll use PyTorch and NLTK, the Natural Language Toolkit, which provides a range of techniques to transform raw text into processed text.
7. Preprocessing techniques
We will discuss tokenization, stop word removal, stemming, and rare word removal.
8. Tokenization
The first step in text preprocessing is tokenization. This is where we extract tokens from text. A token can be a full word, part of a word, or a punctuation mark.
We'll use the get_tokenizer function from PyTorch's torchtext library, imported from torchtext.data.utils. The basic_english tokenizer supports English text. We input the sentence: "I am reading a book now. I love to read books!". Applying tokenization, our output becomes a list of tokens.
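As a minimal sketch, assuming torchtext is installed, this step looks like the following:

from torchtext.data.utils import get_tokenizer

# Create a tokenizer for basic English text; it lowercases
# the input and splits out punctuation as separate tokens
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")
print(tokens)
# ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']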
9. Stop word removal
Next is stop word removal, for which NLTK is better suited. Here, we eliminate stop words: commonly occurring words such as "a," "the," "and," and "or" that don't contribute to the meaning of a text, allowing the model to focus on the words that carry meaning.
We download the stopwords collection of words, also known as a corpus, from NLTK using nltk.download and import the stopwords module. We create a set of stop words with no duplicates using stopwords.words. We pass "english" to process English text, but other languages are available.
With a list comprehension, we iterate through the tokens we created earlier and filter out any stop words. Note the use of the lower method; this captures all instances of stop words regardless of capitalization. Finally, we print the filtered tokens, as shown in the sketch below.
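A minimal sketch of this step with NLTK, reusing the tokens from the previous example:

import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (only needed once)
nltk.download("stopwords")

# Build a set of English stop words for fast lookup
stop_words = set(stopwords.words("english"))

# Keep only tokens that are not stop words, regardless of capitalization
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)
# ['reading', 'book', '.', 'love', 'read', 'books', '!']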
10. Stemming
Stemming reduces words or tokens to their base or root form for simplified analysis. For example, "running" and "runs" would both be converted to "run" (irregular forms such as "ran" are not handled by suffix-stripping stemmers). We use the PorterStemmer class from NLTK's nltk.stem module to perform stemming on a set of words or tokens. We initialize the PorterStemmer; its input will be the list of tokenized words with stop words removed. We iterate through this list, calling stemmer.stem on each token. In the output, "reading" becomes "read," and "books" becomes "book."
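A minimal sketch of stemming, applied to the filtered tokens from the previous step:

from nltk.stem import PorterStemmer

# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Reduce each token to its stem
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)
# ['read', 'book', '.', 'love', 'read', 'book', '!']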
11. Rare word removal
Lastly, we can remove rare words: words that occur infrequently and may not add value to our text analysis. We calculate word frequencies using the FreqDist class from the nltk.probability module, passing in our tokens. We then define a threshold value of two to identify the rare words, and we filter them out by keeping only tokens whose frequency exceeds the threshold. Then, we print the result.
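As a sketch, assuming the stemmed tokens from the previous step:

from nltk.probability import FreqDist

# Count how often each token occurs
freq_dist = FreqDist(stemmed_tokens)

# Keep only tokens whose frequency exceeds the threshold
threshold = 2
common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)
# On this tiny sample no token appears more than twice, so the list is empty;
# on a real corpus, tune the threshold to fit your data.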
12. Preprocessing techniques
The techniques we have covered help refine our text data by reducing the number of features and creating cleaner, more representative datasets. We have only covered a few techniques here. Many more exist but are out of scope for this course. We encourage you to explore these further.
13. Let's practice!
For now, let's practice!