1. Transformers for text processing
Now, let's explore how pre-trained models function.
2. Why use transformers for text processing?
Transformers, like those from Hugging Face, are the foundation of many pre-trained models. They are known for their speed and for capturing deep relationships between words, even when those words sit far apart in a sentence, unlike RNNs, which trudge through text word by word.
Transformers can also generate remarkably human-like text. Let's take a peek inside.
3. Components of a transformer
There are several components to a Transformer. Encoder layers process input, such as analyzing a movie review's tone.
Decoder layers generate output, as in English-to-French translation. For sentiment analysis, however, we only need to interpret the input, so we'll use the encoder, not the decoder.
Feed-forward networks refine understanding, identifying nuances like sarcasm.
Positional encoding ensures order matters, because in reviews, where a word like 'don't' sits can change everything.
Multi-head attention enables the model to capture multiple sentiments and complex patterns in lengthy reviews.
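To see how some of these pieces fit together in PyTorch, here is a minimal, illustrative sketch: a single built-in encoder layer already bundles multi-head self-attention and a feed-forward network. The sizes used are arbitrary examples, not requirements.

```python
import torch.nn as nn

# One encoder layer bundles multi-head self-attention and a feed-forward network.
# d_model=512 and nhead=8 are illustrative values, not requirements.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

print(layer.self_attn)  # the multi-head attention sub-module
print(layer.linear1)    # first half of the feed-forward network
print(layer.linear2)    # second half of the feed-forward network
```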
4. Preparing our data: train-test split
Let's use a transformer on text data. We'll create training and test datasets.
We have four sentences with sentiment labels: one for positive and zero for negative. The first three sentences are for training, and the last one is for testing. Real-world datasets are, of course, much larger than this sample.
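As a rough picture, such a toy dataset might look like the sketch below; the sentences here are hypothetical placeholders, not the course's exact examples.

```python
# Hypothetical toy data: 1 = positive, 0 = negative.
train_sentences = [
    "I loved this movie",
    "The plot was dull and boring",
    "A wonderful, heartwarming story",
]
train_labels = [1, 0, 1]

test_sentences = ["I would not watch it again"]
test_labels = [0]
```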
5. Building the transformer model
Let’s create a TransformerEncoder class with nn-dot-Module. This custom class wraps PyTorch’s nn-dot-TransformerEncoder, specializing it for sentiment analysis since the built-in version is generally too broad for our needs.
The parameters for the init method of our TransformerEncoder class include embed_size for the embedding dimension, heads for the number of attention heads, num_layers for the number of encoder layers, and a dropout rate. The class inherits the properties of nn-dot-Module through super.
It employs nn-dot-TransformerEncoder with nn-dot-TransformerEncoderLayer, with parameters d_model and nhead, equating to embed_size and heads, respectively. d_model influences the model's representational depth, and nhead determines how many word contexts the model can focus on simultaneously, impacting its contextual understanding.
The class includes a linear layer, self-dot-fc, transforming input features to two classes for binary classification.
During the forward method, data passes through the encoder, is averaged across tokens, and is fed to self-dot-fc.
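A minimal sketch of such a wrapper class is shown below. The names mirror the narration, but the details (for example, the exact pooling step) are assumptions and may differ from the course's code.

```python
import torch
import torch.nn as nn

class TransformerEncoder(nn.Module):
    def __init__(self, embed_size, heads, num_layers, dropout):
        super().__init__()
        # d_model and nhead map to embed_size and heads, as described above
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=heads, dropout=dropout
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Linear head mapping the pooled representation to two sentiment classes
        self.fc = nn.Linear(embed_size, 2)

    def forward(self, x):
        # x has shape (1, num_tokens, embed_size), matching the stacking along
        # dimension one described in the training step
        x = self.encoder(x)
        x = x.mean(dim=1)  # average over dimension one to get one vector per input
        return self.fc(x)
```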
Initializing the class, we set embed_size to 512 for a balance of power and efficiency, with 8 heads letting the model attend to 8 different aspects of the sentence at once. Adjusting these affects complexity and overfitting risk. We set num_layers to three and dropout to zero-dot-five to combat overfitting.
Finally, we use the Adam optimizer with a learning rate of 0-point-001 and CrossEntropyLoss, the standard loss for classification tasks.
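Continuing the sketch, instantiating the model and the training utilities with the stated hyperparameters might look like this:

```python
import torch.optim as optim

# Hyperparameters from the narration: 512-dim embeddings, 8 heads, 3 layers, dropout 0.5
model = TransformerEncoder(embed_size=512, heads=8, num_layers=3, dropout=0.5)

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
```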
6. Training the transformer
We train for five epochs. In each epoch, sentences are tokenized into words and converted to embeddings using a pre-made token_embeddings dictionary. These embeddings are stacked along a new dimension, dimension one, using torch-dot-stack to form a data tensor.
The model processes this tensor to produce an output.
The disparity between this output and the actual label constitutes the loss.
At each iteration, the optimizer clears the gradients to prevent accumulation; the loss is then backpropagated and the optimizer updates the weights.
The decreasing loss through the epochs indicates improving accuracy of the model.
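A sketch of this training loop is below. It assumes token_embeddings is a pre-made dictionary mapping each word to a tensor of shape 1 by 512; how that dictionary is built is not shown here.

```python
for epoch in range(5):
    for sentence, label in zip(train_sentences, train_labels):
        # Tokenize into words and look up a (1, 512) embedding per token
        tokens = sentence.lower().split()
        embeddings = [token_embeddings[token] for token in tokens]
        # Stack along a new dimension 1 to form the data tensor: (1, num_tokens, 512)
        data = torch.stack(embeddings, dim=1)

        output = model(data)
        loss = criterion(output, torch.tensor([label]))

        optimizer.zero_grad()  # clear gradients so they don't accumulate
        loss.backward()        # backpropagate the loss
        optimizer.step()       # update the weights
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")
```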
7. Predicting with the transformer
To predict sentiments using our trained Transformer, we define a predict function and set the model to evaluation mode.
We utilize torch-dot-no_grad to skip gradient calculations, saving memory.
Within it, we tokenize the input sentence and build its embeddings. We loop through each token in the sentence and retrieve its embedding from the token_embeddings dictionary; if a token is not in the dictionary, we generate a random tensor of shape 1 by 512 as a placeholder using torch-dot-rand. All token embeddings are then stacked along dimension one with torch-dot-stack, creating a 3D tensor.
The tensor is passed to the model to get an output and fetch predictions.
The torch-dot-argmax function identifies the predicted class,
which we then translate to 'Positive' or 'Negative' based on its value.
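A possible sketch of this predict function, under the same assumptions about token_embeddings, is shown below.

```python
def predict(sentence):
    model.eval()
    with torch.no_grad():  # skip gradient tracking to save memory
        tokens = sentence.lower().split()
        # Retrieve each token's embedding, falling back to a random (1, 512)
        # placeholder for out-of-vocabulary tokens
        embeddings = [
            token_embeddings.get(token, torch.rand(1, 512)) for token in tokens
        ]
        data = torch.stack(embeddings, dim=1)  # (1, num_tokens, 512)
        output = model(data)
        predicted_class = torch.argmax(output, dim=1).item()
    return "Positive" if predicted_class == 1 else "Negative"
```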
8. Predicting on new text
Using our predict function and the sentence 'This product can be better', we determine its sentiment, which the model interprets as 'Negative'.
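Calling the function would look like this; the actual output depends on training, and the narration reports 'Negative' for this sentence.

```python
print(predict("This product can be better"))  # e.g., 'Negative'
```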
9. Let's practice!
Let's practice!