1. Encoding text data
Welcome back. Let's encode our text.
2. Text encoding
Encoding happens after processing the data.
Using PyTorch, we can convert text into machine-readable numbers for analysis and modeling.
As seen in the image, each value in the red table is encoded in the blue table.
3. Encoding techniques
We will discuss three encoding methods:
One-hot encoding transforms words into unique numerical representations,
Bag-of-Words captures word frequency disregarding order,
and TF-IDF balances the uniqueness and importance of words in a document.
Additionally, embeddings convert words into vectors that capture semantic meaning. We will cover embeddings in the next chapter.
4. One-hot encoding
With one-hot encoding, each word maps to a distinct binary vector within the encoding space, where one represents the presence of the word and zero its absence.
For instance, in a vocabulary consisting of cat, dog, and rabbit, the one-hot vector for 'cat' could be [1, 0, 0],
[0, 1, 0] for 'dog'
and [0, 0, 1] for 'rabbit'.
5. One-hot encoding with PyTorch
We have a vocab list that contains our input tokens. For a sentence input, we first tokenize it to create this list of tokens.
We first determine the vocab list length.
Using torch, we call the torch-dot-eye function to generate an identity matrix whose rows serve as one-hot vectors, one for each word in our vocab list.
We create a dictionary called one_hot_dict where each word is mapped to its corresponding vector from one_hot_vectors.
This allows us to easily access the vector representation of any word in our vocabulary.
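Putting these steps together, a minimal sketch might look like the following; the cat/dog/rabbit vocabulary is the toy example from earlier, and the variable names mirror the narration but are otherwise illustrative.

```python
import torch

vocab = ['cat', 'dog', 'rabbit']           # toy vocabulary
vocab_size = len(vocab)                    # determine the vocab list length

# torch.eye builds an identity matrix; each row is a one-hot vector
one_hot_vectors = torch.eye(vocab_size)

# Map each word to its corresponding one-hot vector
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}

print(one_hot_dict['dog'])                 # tensor([0., 1., 0.])
```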
6. Bag-of-words
Alternatively, we can give our models more signal with bag-of-words, which treats a document as an unordered collection of words, emphasizing word frequency over order.
For instance, the sentence 'The cat sat on the mat' is converted into a dictionary of word counts. In our case, "the" is the only word that appears twice.
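To make the idea concrete, here is a small sketch of counting word frequencies for that sentence; it uses a simple lowercase split as a stand-in for whatever tokenizer is applied upstream.

```python
from collections import Counter

sentence = 'The cat sat on the mat'
tokens = sentence.lower().split()          # naive tokenization for illustration

bag_of_words = Counter(tokens)             # word -> frequency, order discarded
print(bag_of_words)
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```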
7. CountVectorizer
In cases like this one, sklearn streamlines the Bag-of-Words implementation.
We import CountVectorizer from sklearn-dot-feature_extraction-dot-text.
We instantiate a CountVectorizer object.
We define our corpus, a collection of text documents represented here as a list of sentences. This can also be a tokenized list.
We fit our vectorizer to the corpus and transform it into a numerical format using fit_transform.
This produces our Bag-of-Words representation, which we store in X and convert to a dense array with the toarray function before printing. We can see the words themselves by extracting the feature names from the vectorizer with dot-get_feature_names_out.
The output is a term frequency matrix, where each row corresponds to a document and each column corresponds to a word. For example, the presence of "and" in the first column is indicated by a one in the third row.
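A sketch of this workflow is shown below; the corpus is a placeholder chosen so that "and" appears only in the third sentence, consistent with the output described above, but it is not necessarily the exact corpus from the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus: a small list of sentences
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # fit the vocabulary and count terms

print(vectorizer.get_feature_names_out())   # one column label per word
print(X.toarray())                          # term frequency matrix, one row per document
```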
8. TF-IDF
The last technique we will cover is TF-IDF or Term Frequency-Inverse Document Frequency.
It assesses word importance by considering word frequency across all documents, assigning higher scores to rare words and lower scores to common ones.
TF-IDF emphasizes informative words in our text data, unlike bag-of-words, which treats all words equally.
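As a toy illustration of why rare words score higher, here is a sketch using the classic tf times log(N/df) formula; note that sklearn's TfidfVectorizer applies smoothing and normalization, so its exact scores differ.

```python
import math

# Three tiny 'documents', already tokenized
docs = [['the', 'cat', 'sat'], ['the', 'dog', 'ran'], ['the', 'cat', 'slept']]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)              # term frequency within the document
    df = sum(1 for d in docs if word in d)       # number of documents containing the word
    return tf * math.log(N / df)                 # rare words get higher weight

print(tf_idf('the', docs[0]))   # 0.0   -> 'the' appears in every document
print(tf_idf('sat', docs[0]))   # ~0.37 -> 'sat' appears in only one document
```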
9. TfidfVectorizer
To use TF-IDF, we import TfidfVectorizer from sklearn-dot-feature_extraction-dot-text.
We instantiate a TfidfVectorizer object and fit it to the same corpus as before, just as we did with CountVectorizer. This transforms the data into TF-IDF vectors. TfidfVectorizer can also accept a tokenized list.
The toarray function yields a matrix of TF-IDF scores.
We print the feature names. Every row in the matrix represents a document from the corpus. The feature names list displays the most significant words across all documents, and each word represents a column of the matrix.
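A sketch of this workflow, reusing the same illustrative corpus as in the CountVectorizer example, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # fit and transform into TF-IDF vectors

print(vectorizer.get_feature_names_out())   # one column per word
print(X.toarray())                          # one row of TF-IDF scores per document
```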
10. TfidfVectorizer
For instance, the importance of the word "first" is highest in the first sentence, with a score of zero-point-six-eight.
11. Encoding techniques
Encoding allows models to understand and process text.
Ideally, we choose one technique for encoding to avoid redundant computations.
As with processing, other encoding techniques exist but are beyond this course's scope. We will cover embeddings in the next chapter.
12. Let's practice!
Let's practice!