Preprocess text with AutoTokenizer
You're building a precision agriculture application that lets farmers ask questions about issues they encounter in the field. You'll leverage a dataset of common questions and answers about issues faced by farmers; the fields in this dataset are:
- question: common agricultural questions
- answers: answers to the agricultural questions
As a first step in distributed training, you'll begin by preprocessing this text dataset.
Some data has been preloaded:
- dataset contains a sample dataset of agricultural questions and answers
- AutoTokenizer has been imported from transformers
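For reference, a record in dataset might look like the sketch below. The rows are hypothetical stand-ins, not the actual course data; only the question and answers field names come from the description above.

from datasets import Dataset

# Hypothetical stand-in for the preloaded dataset; the rows below are
# invented examples, not the real course data.
dataset = Dataset.from_dict({
    "question": ["Why are my tomato leaves turning yellow?"],
    "answers": ["Yellowing often points to a nitrogen deficiency or overwatering."],
})
print(dataset[0])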
This exercise is part of the course Efficient AI Model Training with PyTorch.
Exercise instructions
- Load a pre-trained tokenizer.
- Tokenize example["question"] using the tokenizer.
- Apply the encode() function to the dataset.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load a pre-trained tokenizer
tokenizer = ____.____("distilbert-base-uncased")
def encode(example):
    # Tokenize the "question" field of the training example
    return ____(____["____"], padding="max_length", truncation=True, return_tensors="pt")
# Map the function to the dataset
dataset = ____.____(____, batched=True)
dataset = dataset.map(lambda example: {"labels": example["answers"]}, batched=True)
print(dataset)
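For comparison, here is one way to fill in the blanks, following the instructions above. It assumes dataset is a Hugging Face Dataset (so it exposes map()); treat it as a sketch of a solution rather than the only accepted answer.

from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(example):
    # Tokenize the "question" field of the training example
    return tokenizer(example["question"], padding="max_length", truncation=True, return_tensors="pt")

# Map the function to the dataset
dataset = dataset.map(encode, batched=True)
dataset = dataset.map(lambda example: {"labels": example["answers"]}, batched=True)
print(dataset)

With batched=True, encode receives a batch of examples at a time, so example["question"] is a list of strings that the tokenizer pads and truncates in a single call.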