
Quiz 2 - Question 1

Imagine you train a byte pair encoding (BPE) tokenizer on a mix of English and Amharic texts, so that the tokenizer learns a single vocabulary consisting of both English and Amharic subword tokens. You apply this tokenizer to the following Amharic sentence:

ስለተዋወቅን ደስ ብሎኛል

The tokenizer splits this sentence into 14 tokens. When you tokenize its English translation, “Nice to meet you”, it splits it into 7 tokens.

Which explanation is most plausible, given how BPE learns merges and builds its subword token vocabulary?
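To make the mechanism behind the question concrete, here is a minimal character-level sketch of BPE training (real BPE operates on bytes; the toy corpus and its word frequencies are hypothetical, not from the course). Words that appear often in the training data accumulate merges and compress into few tokens, while rare words stay split into many small pieces:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start each word as a sequence of characters (a stand-in for bytes).
    words = Counter(corpus.split())
    splits = {w: list(w) for w in words}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            syms = splits[w]
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the new merge to every word's current split.
        for w in words:
            syms, out, i = splits[w], [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            splits[w] = out
    return merges

def tokenize(word, merges):
    # Replay the learned merges, in order, on a new word.
    syms = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

# Hypothetical corpus: "nice", "meet", "you" are frequent; "rare" occurs once.
corpus = "nice nice nice nice meet meet meet you you you rare"
merges = train_bpe(corpus, num_merges=6)
```

With this budget of merges, `tokenize("nice", merges)` collapses to the single token `["nice"]`, while `tokenize("rare", merges)` stays as four character tokens: frequency in the training data, not the language itself, determines how aggressively a word is compressed.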

This exercise is part of the course

Google DeepMind: Represent Your Language Data

