1. Introduction to building a text processing pipeline
In this video, we'll combine what we've learned so far into a complete text processing pipeline.
2. Recap: preprocessing
The first pipeline component is preprocessing.
Recall the techniques we covered: tokenization,
stopword removal,
stemming, and rare word removal. These steps help reduce the complexity of our models.
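As a minimal sketch of these steps (assuming NLTK is installed and its punkt and stopwords resources have been downloaded; the sample text and the min_count threshold are illustrative assumptions):

```python
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The cats are chasing the mice and the mice are running"

# Tokenization: split the raw text into lowercase word tokens
tokens = word_tokenize(text.lower())

# Stopword removal: drop common, low-information words
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce each token to its stem ("running" -> "run")
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]

# Rare word removal: keep only tokens that appear at least min_count times
min_count = 2
counts = Counter(tokens)
tokens = [t for t in tokens if counts[t] >= min_count]

print(tokens)
```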
3. Text processing pipeline
The second component is encoding. Here, we convert our preprocessed text into numerical vectors using methods like
One-Hot Encoding,
Bag-of-Words, or
TF-IDF.
This enables our models to understand and process textual data. Another technique is embeddings, which will be discussed in the next chapter.
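For example, Bag-of-Words and TF-IDF encodings can be produced with scikit-learn (a sketch; the sample sentences are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["the cat sat on the mat", "the dog sat on the log"]

# Bag-of-Words: each sentence becomes a vector of word counts
bow = CountVectorizer()
bow_vectors = bow.fit_transform(sentences).toarray()

# TF-IDF: counts are reweighted by how informative each word is across sentences
tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(sentences).toarray()

print(bow.get_feature_names_out())  # vocabulary position behind each vector component
print(bow_vectors)
```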
4. Text processing pipeline
We complete our pipeline by using PyTorch's Dataset and DataLoader. In our text processing pipeline,
we will use Dataset as a container for our processed and encoded text data.
DataLoader then allows us to iterate over this dataset in
batches, shuffle the data, and apply multiprocessing for efficient loading.
5. Recap: implementing Dataset and DataLoader
Let's review applying Dataset and DataLoader to text data in PyTorch.
We create a custom class, TextDataset, serving as our data container.
The init method initializes the dataset with the input text data.
The len method returns the total number of samples in the dataset,
and the getitem method allows us to access a specific sample at a given index. This class, extending PyTorch's Dataset, allows us to organize and access our text data efficiently.
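A minimal version of this class might look as follows (the attribute name text_data is an assumption):

```python
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_data):
        # Store the processed and encoded text data
        self.text_data = text_data

    def __len__(self):
        # Total number of samples in the dataset
        return len(self.text_data)

    def __getitem__(self, idx):
        # Return the sample at the given index
        return self.text_data[idx]
```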
6. Recap: integrating Dataset and DataLoader
After encoding our text data, we instantiate our TextDataset with the encoded text.
We then create a DataLoader, making the dataset iterable.
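Continuing the sketch above (the encoded_text array here is a stand-in for the output of our encoding step):

```python
import numpy as np
from torch.utils.data import DataLoader

# Encoded sentences (stand-in values; in practice, the vectorizer's output)
encoded_text = np.array([[1, 0, 2], [0, 1, 1]], dtype=np.float32)

# Wrap the encoded text in our Dataset container and make it iterable in batches
dataset = TextDataset(encoded_text)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in dataloader:
    print(batch.shape)  # e.g. torch.Size([2, 3]) for one batch of two samples
```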
7. Using helper functions
For convenience, we'll use helper functions for preprocessing and encoding. preprocess_sentences combines the techniques we've covered; we can also customize it to include only specific techniques, depending on the problem.
We've chosen CountVectorizer in encode_sentences to convert the cleaned sentences into arrays.
We've included an extract_sentences function that uses regular expressions (regex) to split raw text into English sentences. While regex is beyond the scope of this course, we've included this function here for potential use in the pre-exercise code.
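A rough sketch of what such helpers might look like is shown below; the exact implementations used in the course may differ, so treat the regex and the preprocessing choices here as assumptions:

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

def extract_sentences(text):
    # Split raw text into sentences on punctuation followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text)

def preprocess_sentences(sentences):
    # Tokenize, remove stopwords, and stem each sentence, returning cleaned strings
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    cleaned = []
    for sentence in sentences:
        tokens = [t for t in word_tokenize(sentence.lower())
                  if t.isalpha() and t not in stop_words]
        cleaned.append(" ".join(stemmer.stem(t) for t in tokens))
    return cleaned

def encode_sentences(sentences):
    # Convert the cleaned sentences into count vectors; return the vectorizer too
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(sentences).toarray()
    return vectors, vectorizer
```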
8. Constructing the text processing pipeline
Now, let's construct our text processing pipeline. We define a function text_processing_pipeline that takes raw text as input.
Within this function, we preprocess the text using the preprocess_sentences function. This returns a list of tokens.
Next, we convert these tokens into numerical vectors using the encode_sentences function. After encoding,
we instantiate our PyTorch TextDataset with the numerical vectors,
then initialize a DataLoader with this dataset. The DataLoader will allow us to iterate over the dataset in manageable batches of size two and in a shuffled manner, ensuring a diverse mix of examples in each batch.
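Putting it together, the pipeline function might look like this (a sketch built on the helper and Dataset sketches above; the exact details are assumptions):

```python
from torch.utils.data import DataLoader

def text_processing_pipeline(text):
    # Preprocess: clean the raw text with the techniques covered earlier
    tokens = preprocess_sentences([text])
    # Encode: convert the cleaned tokens into numerical vectors
    vectors, vectorizer = encode_sentences(tokens)
    # Load: wrap the vectors in a Dataset and serve them in shuffled batches of two
    dataset = TextDataset(vectors)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer
```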
9. Applying the text processing pipeline
With our text processing pipeline function ready, we can apply it to any text data. Let's say we have two sentences: "This is the first text data" and "And here is another one".
We call the extract_sentences function to split the text into sentences, then feed each sentence into our text_processing_pipeline function. This preprocesses and encodes each sentence and loads it into its own DataLoader; the results are collected in the dataloaders list using a list comprehension. We also store the vectorizer instance created during encoding so we can access the feature names for each vector.
Finally, the print statement uses the next and iter combination to access a batch of data from a dataloader. The output is the first ten components of the first batch: the encoded representation of a sentence, where each component gives the frequency of one of the first words in the vocabulary for that sentence.
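Sketched in code (building on the functions above; the variable names and exact printed values are assumptions), this might look like:

```python
text = "This is the first text data. And here is another one."

# Split the raw text into sentences and run each one through the pipeline
sentences = extract_sentences(text)
dataloaders = [text_processing_pipeline(sentence) for sentence in sentences]

# Inspect the first batch of the first sentence's DataLoader
dataloader, vectorizer = dataloaders[0]
print(next(iter(dataloader))[0, :10])        # first components of the encoded sentence
print(vectorizer.get_feature_names_out())    # vocabulary behind those components
```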
10. Text processing pipeline: it's a wrap!
Excellent work! Our text processing pipeline efficiently converts raw text data into a machine-learning-ready format. After processing the text through this pipeline, we can use the resulting structured data to train, validate, and test models. We'll apply this pipeline to large datasets in upcoming chapters.
11. Let's practice!
Time to practice.