Mapping tokenization
You now want more control over tokenization, so you'll try tokenizing the data in rows or in batches. This will also give you a result that is a Dataset object, which you'll need for training.
The tokenizer has been loaded for you, along with the data as train_data and test_data.
This exercise is part of the course Introduction to LLMs in Python.
Have a go at this exercise by completing this sample code.
# Complete the function
def tokenize_function(data):
    return tokenizer(data["interaction"],
                     ____,
                     padding=True,
                     ____,
                     max_length=64)

tokenized_in_batches = ____
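To see what "mapping a tokenizer over batches" means mechanically, here is a minimal, library-free sketch. It uses a toy whitespace tokenizer as a stand-in for the real Hugging Face tokenizer, and a hand-rolled batched map as a stand-in for Dataset.map(..., batched=True); all names here (toy_tokenizer, map_batched, the sample data) are illustrative assumptions, not part of the exercise.

```python
def toy_tokenizer(texts, padding=True, max_length=8):
    """Stand-in for a real tokenizer: split on whitespace, truncate, and pad."""
    token_lists = [t.split()[:max_length] for t in texts]
    if padding:
        # Pad every sequence in the batch to the length of the longest one
        width = max(len(toks) for toks in token_lists)
        token_lists = [toks + ["<pad>"] * (width - len(toks)) for toks in token_lists]
    return {"input_tokens": token_lists}

def tokenize_function(batch):
    # In a batched map, `batch` is a dict of column name -> list of values
    return toy_tokenizer(batch["interaction"], padding=True, max_length=8)

def map_batched(columns, fn, batch_size=2):
    """Apply fn to successive batches and merge the results back into columns."""
    out = {}
    n_rows = len(columns["interaction"])
    for start in range(0, n_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for key, values in fn(batch).items():
            out.setdefault(key, []).extend(values)
    return out

data = {"interaction": ["hello there", "how are you today", "fine"]}
tokenized = map_batched(data, tokenize_function)
print(tokenized["input_tokens"][0])  # -> ['hello', 'there', '<pad>', '<pad>']
```

Note that padding here happens per batch, which is why batch size can affect the padded width. With the real objects from the exercise, the analogous call would be along the lines of train_data.map(tokenize_function, batched=True), with the tokenizer blanks typically filled by arguments such as truncation=True, though the exact expected answers depend on the course solution.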