Quiz 3 - Question 2

Assume you are training a model on a dataset that was tokenized using a subword tokenizer. The tokenizer has a vocabulary of 7,012 subword tokens and additionally includes four special tokens: a padding token, an unknown token, a beginning-of-sequence token, and an end-of-sequence token. You set the model’s embedding size to 300, meaning that each token is represented by a 300-dimensional vector.

What is the shape (the dimension) of the matrix that stores the embeddings of your model?
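The arithmetic behind this question can be sketched in a few lines of Python. This is a minimal sketch, assuming the four special tokens are counted in addition to the 7,012 subword tokens (as the question states) and assuming the common layout of one row per token and one column per embedding dimension. The special-token names are hypothetical placeholders.

```python
# Sizes taken from the question text.
num_subword_tokens = 7012
special_tokens = ["<pad>", "<unk>", "<bos>", "<eos>"]  # hypothetical names
embedding_dim = 300

# Every token, special or not, needs its own embedding vector.
vocab_size = num_subword_tokens + len(special_tokens)

# Assumed layout: one row per token, one column per embedding dimension.
embedding_shape = (vocab_size, embedding_dim)
num_parameters = vocab_size * embedding_dim

print(embedding_shape)   # (7016, 300)
print(num_parameters)    # 2104800
```

Some frameworks may store the matrix transposed, in which case the two numbers swap places while the parameter count stays the same.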

This exercise is part of the course

<Kurs>Google DeepMind: Represent Your Language Data</Kurs>


In this module, you will learn about the challenges that come with preparing text data so that it is in a format that machines can process. You will consider the course learning objectives and how to most effectively study them. Furthermore, you will learn how the meaning of text depends on social and cultural contexts and why this makes issues like ownership, consent, privacy, and exclusion central to building responsible datasets for LLMs.

Exercise 1: Teaching a machine the soul of your language
Exercise 2: A world of text: Types and sources
Exercise 3: Exploring raw data
Exercise 4: Learning objectives
Exercise 5: How to get the most out of this course

In this module, you will practice common automatic techniques for cleaning texts and think about where text data comes from. You will hear from Professor David Adelani about community efforts to create datasets that work well for African languages. Next, you will explore why reflecting on data sourcing, consent and ownership in the African context is crucial in preventing digital data from becoming another form of extraction. You will investigate how issues of transparency, benefit-sharing, and community control shape ethical questions about who owns data, who profits from it, and how it can be used responsibly.

Exercise 1: Lab: Preprocess Data
Exercise 2: Harnessing the potential of low-resource languages
Exercise 3: Data resources
Exercise 4: Who owns the data?
Exercise 5: Quiz 1 - Question 1
Exercise 6: Quiz 1 - Question 2

In this module, you will learn about different levels of granularity when splitting texts into tokens. You will first experiment with character-level and word-level tokenizers to understand their different approaches. Then, you will learn about byte pair encoding (BPE), which is a subword tokenizer. This advanced method combines the benefits of both character and word-level approaches, offering a more balanced solution. You will then move on to consider how gaps and biases in LLM training datasets can marginalize African languages and cultures, reinforcing digital exclusion. By reflecting on these disparities, you will see how inclusive data practices and community-driven initiatives are essential for building fairer, more responsible AI systems.

Exercise 1: What is tokenization?
Exercise 2: Lab: Tokenize Texts into Characters and Words
Exercise 3: Lab: Tokenize Texts into Subword Tokens
Exercise 4: Subword tokenization
Exercise 5: Lab: Implement a BPE Tokenizer
Exercise 6: Whose voice is missing?
Exercise 7: Quiz 2 - Question 1
Exercise 8: Quiz 2 - Question 2
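To make the three levels of granularity in this module concrete, the following is a minimal sketch and not the course's lab code. It splits a toy string into characters and words, then performs a single BPE-style merge step. The helper names `most_frequent_pair` and `merge_pair` are hypothetical, and the sketch ignores word frequencies and repeated merging, which a full BPE tokenizer would include.

```python
from collections import Counter

text = "low lower lowest"

# Character-level: every character (including spaces) is a token.
char_tokens = list(text)

# Word-level: split on whitespace.
word_tokens = text.split()


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# One BPE-style merge step: start from characters, merge the most frequent pair.
words = [list(w) for w in word_tokens]
best = most_frequent_pair(words)
words = [merge_pair(w, best) for w in words]

print(len(char_tokens), len(word_tokens))
print(best, words)
```

Repeating the merge step grows the vocabulary with progressively longer subwords, which is how a subword tokenizer ends up between the character-level and word-level extremes.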

In this module, you will investigate how language models represent the meaning of tokens in the form of embeddings. You will design your own “map of meaning”, experiment with Gemma’s embeddings, and learn how to visualize the token meaning representations. Finally, you will use the BPE tokenizer that you implemented in the previous module to prepare a dataset for training a small language model.

Exercise 1: What are embeddings?
Exercise 2: Design your own embeddings
Exercise 3: Desired properties of embeddings
Exercise 4: Lab: Experiment with Embeddings
Exercise 5: Lab: Train an SLM with Your BPE Tokenizer
Exercise 6: Quiz 3 - Question 1
Exercise 7: Quiz 3 - Question 2


In this module, you will build on your values-led problem statement from 01 Build Your Own Small Language Model by learning how to design an ethical dataset that supports your solution. You will see how dataset choices shape fairness, representation, and accountability in AI, and why responsible innovation in Africa means creating systems that respect privacy, community ownership, and cultural heritage.

Exercise 1: Why document data?
Exercise 2: Build a dataset ethically with a Data Card
Exercise 3: Quiz 4 - Question 1
Exercise 4: Quiz 4 - Question 2

In this module, you will have the opportunity to consult additional resources and further reading to investigate the topics you have covered in more detail. Finally, you will consider your next steps and how you can build on what you have learned in the course.

Exercise 1: Summary
Exercise 2: Looking forward
Exercise 3: Additional resources and further reading
Exercise 4: Glossary
Exercise 5: Feedback