1. Learn
  2. /
  3. Courses
  4. /
  5. Introduction to LLMs in Python

Connected

Exercise

Mapping tokenization

You now want to test out having more control over the tokenization and want to try tokenizing the data in rows or batches. This will also give you a result that is a DataSet object, which you'll need for training.

The tokenizer has been loaded for you along with the data as train_data and test_data.

Instructions 1/2

undefined XP
  • 1
    • Complete tokenize_function returning tokenized tensors with sequence truncation and tokenize the train_data in batches.
  • 2
    • Apply tokenize_function to train_data and tokenize row by row.