
Preprocess text with AutoTokenizer

You're building a precision agriculture application that lets farmers ask questions about issues they encounter in the field. You'll leverage a dataset of common questions and answers about issues faced by farmers; the fields in this dataset are

  • question: common agricultural questions
  • answers: answers to the agricultural questions

As a first step in distributed training, you'll begin by preprocessing this text dataset.

Some data has been preloaded:

  • dataset contains a sample dataset of agricultural questions and answers
  • AutoTokenizer has been imported from transformers

This exercise is part of the course

Efficient AI Model Training with PyTorch


Exercise instructions

  • Load a pre-trained tokenizer.
  • Tokenize example["question"] using the tokenizer.
  • Apply the encode() function to the dataset.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load a pre-trained tokenizer
tokenizer = ____.____("distilbert-base-uncased")

def encode(example):
    # Tokenize the "question" field of the training example
    return ____(____["____"], padding="max_length", truncation=True, return_tensors="pt")

# Map the function to the dataset
dataset = ____.____(____, batched=True)

# Copy the answers into a "labels" column
dataset = dataset.map(lambda example: {"labels": example["answers"]}, batched=True)

print(dataset)