
Tokenize a text dataset

You are doing market research for a travel website and would like to use an existing dataset to fine-tune a model that will help you classify hotel reviews. You decide to use the datasets library.

The AutoTokenizer class has been pre-imported from transformers, and load_dataset() has been pre-imported from datasets.

This exercise is part of the course

Reinforcement Learning from Human Feedback (RLHF)


Exercise instructions

  • Add padding to the tokenizer to process text as equal-sized batches.
  • Tokenize the text data using the pre-trained GPT tokenizer and the defined function.
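
To see why a pad token matters for equal-sized batches, here is a minimal sketch in plain Python. It is a toy illustration, not the transformers implementation: the helper name `pad_batch`, the token ids, and the pad id of 0 are all assumptions made for the example. It truncates or right-pads each sequence to a fixed length and builds an attention mask that marks which positions hold real tokens.

```python
from typing import Dict, List


def pad_batch(
    batch: List[List[int]], pad_id: int, max_length: int
) -> Dict[str, List[List[int]]]:
    """Toy illustration: truncate or right-pad every sequence to max_length."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]                 # truncation: drop tokens past max_length
        n_pad = max_length - len(ids)          # number of pad positions still needed
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three reviews of different lengths
padded = pad_batch([[5, 6, 7], [8], [1, 2, 3, 4, 9]], pad_id=0, max_length=4)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Every row ends up with the same length, which is what lets a batch be stacked into a single rectangular tensor. A tokenizer without a designated pad token has nothing to fill the empty positions with, so it needs one assigned before padding is requested.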

Hands-on interactive exercise

Try this exercise by completing the sample code below.

dataset = load_dataset("argilla/tripadvisor-hotel-reviews")

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")

# Add padding with the pad token
tokenizer.____

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the dataset
tokenized_datasets = dataset.map(____, batched=True)
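
With batched=True, the mapped function receives a dictionary of columns whose values are lists covering several rows at once, and returns new columns of the same length. The sketch below imitates that calling convention in plain Python; `toy_map` and `count_words` are hypothetical stand-ins written for this illustration and are not part of the datasets API.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def toy_map(rows: Batch, fn: Callable[[Batch], Batch], batch_size: int = 2) -> Batch:
    """Hypothetical stand-in for a batched map over column-oriented data."""
    n_rows = len(next(iter(rows.values())))
    new_columns: Batch = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column so fn sees a dict of lists, one entry per row in the batch
        batch = {name: col[start:start + batch_size] for name, col in rows.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Keep the original columns and add the ones fn produced
    return {**rows, **new_columns}


def count_words(examples: Batch) -> Batch:
    # examples["text"] is a list of strings, not a single string
    return {"n_words": [len(text.split()) for text in examples["text"]]}


reviews = {"text": ["great hotel", "noisy room near the lift", "ok"]}
result = toy_map(reviews, count_words)
print(result["n_words"])
```

This is why tokenize_function indexes examples["text"] directly: under batched mapping that value is already a list of review texts that can be tokenized together.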