
Preprocess text with AutoTokenizer

You're building a precision agriculture application that lets farmers ask questions about issues they encounter in the field. You'll leverage a dataset of common questions and answers about issues faced by farmers; the fields in this dataset are

  • question: common agricultural questions
  • answers: answers to the agricultural questions

As a first step in distributed training, you'll begin by preprocessing this text dataset.

Some data has been preloaded:

  • dataset contains a sample dataset of agricultural questions and answers
  • AutoTokenizer has been imported from transformers

This exercise is part of the course

Efficient AI Model Training with PyTorch


Exercise instructions

  • Load a pre-trained tokenizer.
  • Tokenize example["question"] using the tokenizer.
  • Apply the encode() function to the dataset.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load a pre-trained tokenizer
tokenizer = ____.____("distilbert-base-uncased")

def encode(example):
    # Tokenize the "question" field of the training example
    return ____(____["____"], padding="max_length", truncation=True, return_tensors="pt")

# Map the function to the dataset
dataset = ____.____(____, batched=True)

# Copy the answers into a "labels" column
dataset = dataset.map(lambda example: {"labels": example["answers"]}, batched=True)

print(dataset)