Tokenize a text dataset
You are working on market research for a travel website and would like to use an existing dataset to fine-tune a model that will help you classify hotel reviews. You decide to use the datasets library.
The AutoTokenizer class has been pre-imported from transformers, and load_dataset() has been pre-imported from datasets.
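If you want to reproduce the steps outside the exercise environment, where nothing is pre-imported, the setup is just the two standard imports (shown here for reference only, not part of the exercise):

from transformers import AutoTokenizer
from datasets import load_dataset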
This exercise is part of the course
Reinforcement Learning from Human Feedback (RLHF)
Exercise instructions
- Add padding to the tokenizer to process text as equal-sized batches.
- Tokenize the text data using the pre-trained GPT tokenizer and the tokenize_function() defined in the sample code.
Hands-on interactive exercise
Try this exercise by completing this sample code.
dataset = load_dataset("argilla/tripadvisor-hotel-reviews")
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
# openai-gpt ships without a pad token, so reuse an existing
# special token (here, the unknown token) for padding
tokenizer.pad_token = tokenizer.unk_token
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
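Once the map call finishes, it is worth sanity-checking the result. A minimal sketch, assuming the dataset exposes a "train" split (the usual layout for datasets on the Hugging Face Hub):

# The tokenized columns (input_ids, attention_mask) sit alongside the original text
sample = tokenized_datasets["train"][0]
print(sample["input_ids"][:10])                    # first ten token IDs
print(tokenizer.decode(sample["input_ids"][:10]))  # map the IDs back to text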