1. Pre-trained models for text generation
Pre-trained models can also be used for text generation.
2. Why pre-trained models?
Building models from scratch can be effective for particular tasks, but pre-trained models offer clear advantages: they have been trained on extensive datasets and deliver high performance on tasks like sentiment analysis, text completion, and language translation.
However, they also pose challenges, such as high computational cost,
significant storage requirements,
and limited customization options.
3. Pre-trained models in PyTorch
Using PyTorch with Hugging Face Transformers gives us access to
a library of pre-trained models.
We'll try out GPT-2 and T5.
4. Understanding GPT-2 Tokenizer and Model
GPT2LMHeadModel is Hugging Face's implementation of GPT-2 with a language modeling head on top,
tailored for text generation tasks.
GPT2Tokenizer converts text into numerical tokens.
It uses subword tokenization, where words can be split into smaller units, or 'subwords', to capture more nuanced meanings.
For example, 'larger' might be tokenized into 'large' and 'r'.
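As a rough illustration (the exact split depends on GPT-2's learned byte-pair vocabulary, so the split you see may differ from the example above), we can load the tokenizer and inspect how it breaks a word into tokens:

    from transformers import GPT2Tokenizer

    # Load the pre-trained GPT-2 tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Inspect how a word is split into subword tokens;
    # the exact split depends on the learned vocabulary
    print(tokenizer.tokenize("larger"))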
5. GPT-2: text generation implementation
We start by importing the GPT2 modules from the transformers library.
We initialize the GPT-2 model and tokenizer using the from_pretrained method with the argument 'gpt2'. The tokenizer converts our input text into a format the model understands.
Next, we set a seed text, "Once upon a time," to serve as our story's opening line.
This seed text is encoded into input tensors using the tokenizer. The flag return_tensors equals 'pt' specifies that we want these tensors in PyTorch format.
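A minimal sketch of these setup steps, assuming the standard Hugging Face Transformers API, might look like this:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Initialize the pre-trained GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Seed text that serves as the story's opening line
    seed_text = "Once upon a time"

    # Encode the seed text into PyTorch tensors ("pt")
    input_ids = tokenizer.encode(seed_text, return_tensors="pt")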
6. GPT-2: text generation implementation II
Now, we generate text using our model.
7. GPT-2: text generation implementation II
We set a maximum length for our generated text to 40 tokens using the max_length argument.
8. GPT-2: text generation implementation II
The temperature parameter, set to 0-point-7, controls the randomness of the output, with lower values reducing randomness.
9. GPT-2: text generation implementation II
The no_repeat_ngram_size parameter, set to two, prevents any two-word sequence (bigram) from repeating in the generated text, reducing repetitive output.
10. GPT-2: text generation implementation II
Lastly, because GPT-2 has no dedicated padding token, the pad_token_id is set to the ID of the end-of-sentence (EOS) token, so the model pads the output with this token if it's shorter than the maximum length of 40 tokens.
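Putting these arguments together, the generation call might look like the sketch below. Note that do_sample=True is an assumption added here, since temperature only influences the output when sampling is enabled; the narration does not mention it.

    # Generate up to 40 tokens of text from the seed
    output = model.generate(
        input_ids,
        max_length=40,                        # cap the generated sequence at 40 tokens
        temperature=0.7,                      # lower values reduce randomness
        do_sample=True,                       # assumed: sampling so temperature takes effect
        no_repeat_ngram_size=2,               # no two-word sequence may repeat
        pad_token_id=tokenizer.eos_token_id,  # pad with the EOS token if output is shorter
    )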
11. GPT-2: text generation output
Finally, we use the decode method to convert the token IDs in our output tensor back into text. skip_special_tokens equals True ensures that any special tokens used by the model for internal purposes, such as beginning- or end-of-sentence markers, are not included in the final output.
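A minimal sketch of this decoding step:

    # Convert the generated token IDs back into readable text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(generated_text)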
The printed text demonstrates our GPT-2 model's successful story generation from the provided seed text.
12. T5: Language translation implementation
Another text generation task is language translation, for which T5 (Text-to-Text Transfer Transformer) is a specialized model; here we use its compact variant, t5-small.
We import the necessary modules similar to GPT-2, only this time it's T5Tokenizer and T5ForConditionalGeneration.
We initialize the T5 model and tokenizer using the 't5-small' model name.
For language translation, we prepare an input prompt that spells out the task, "translate English to French:", followed by the sentence we wish to translate: "Hello, how are you?"
This prompt is encoded using the tokenizer, resulting in input IDs.
These IDs are fed into the model for translation, with a max_length of 100 to accommodate longer translations if needed.
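A sketch of the T5 steps described above, following the same from_pretrained pattern as GPT-2:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    # Initialize the pre-trained t5-small model and tokenizer
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The prompt spells out the task, followed by the sentence to translate
    input_prompt = "translate English to French: Hello, how are you?"

    # Encode the prompt and generate the translation
    input_ids = tokenizer.encode(input_prompt, return_tensors="pt")
    output = model.generate(input_ids, max_length=100)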
13. T5: Language translation output
We convert the generated output back to text using the decode function.
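For example:

    # Decode the generated token IDs back into text and print the translation
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(generated_text)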
Printing the generated text reveals a successful translation of our input into French. Occasionally, the generated text might not be fully accurate, since t5-small is the smallest T5 variant and would need further fine-tuning for better translations.
14. Choosing the right pre-trained model
We've explored two pre-trained models, but many more exist, and knowing which one to choose is key.
While GPT-2 excels in text generation,
its distilled sibling, DistilGPT-2, specializes in similar tasks with a smaller footprint.
BERT is well suited for text classification and question-answering tasks.
T5 and t5-small are well-suited for language translation and summarization.
The Hugging Face Hub is the primary source for these models, but they can also be found in other repositories and frameworks, so it's up to you to explore!
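As an illustration of how interchangeable these checkpoints are, the AutoTokenizer and AutoModel classes (part of the same transformers library, though not covered above) can load any of them by name; this is only a sketch of the pattern, and each checkpoint still pairs best with its task-specific model class:

    from transformers import AutoTokenizer, AutoModel

    # The same from_pretrained pattern works for any checkpoint on the Hugging Face Hub
    for checkpoint in ["gpt2", "distilgpt2", "bert-base-uncased", "t5-small"]:
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModel.from_pretrained(checkpoint)
        print(checkpoint, "->", model.config.model_type)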
15. Let's practice!
Let's practice!