
Introduction to language models

1. Introduction to language models

In this lesson, you will learn in more detail how to create a language model from raw text data.

2. Sentence probability

Language models represent the probability of a sentence. For example, what is the probability of the sentence "I love this movie"? In other words, what is the probability that each word in this sentence appears in this particular order? The way this probability is computed changes from one model to another. Unigram models use the probability of each word inside the document and assume these probabilities are independent. N-gram models use the probability of each word conditioned on the previous N minus one words. When N equals 2 the model is called a bigram model, and when N equals 3 a trigram model.
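To make this concrete, here is a minimal sketch of a bigram model estimated by counting, in Python. The toy corpus and all variable names are illustrative, not from the slides, and start/end-of-sentence tokens are ignored for simplicity.

from collections import Counter

# Toy corpus (hypothetical): each sentence is a list of tokens
corpus = [["i", "love", "this", "movie"],
          ["i", "love", "this", "film"],
          ["i", "hate", "this", "movie"]]

# Count single words and pairs of consecutive words
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

# P(w2 | w1) estimated as count(w1, w2) / count(w1)
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

# Probability of a sentence as the product of its bigram probabilities
sentence = ["i", "love", "this", "movie"]
prob = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    prob *= bigram_prob(w1, w2)
print(prob)  # (2/3) * 1 * (2/3) = 4/9, about 0.44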

3. Sentence probability (cont.)

The skip-gram model does the opposite: it computes the probability of the context words, or neighboring words, given the center word. Neural network models with a softmax function in the last layer, with as many units as there are words in the vocabulary, are also language models.
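In the standard skip-gram formulation (not taken from the slides), the probability of a context word w_o given a center word w_c is a softmax over the vocabulary V:

\[
P(w_o \mid w_c) = \frac{\exp(u_{w_o}^{\top} v_{w_c})}{\sum_{w \in V} \exp(u_w^{\top} v_{w_c})}
\]

where v_{w_c} is the vector of the center word and u_w are the vectors of the context words.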

4. Link to RNNs

We are focusing on Recurrent Neural Networks. So how exactly are language models related to them? Well, everywhere! Recurrent Neural Network models are themselves language models when trained on text data, because they give the probability of the next token given the previous k tokens.

5. Link to RNNs (cont.)

Also, an embedding layer can be used as the first layer of the model to create vector representations of the tokens.
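Putting the last two points together, here is a minimal sketch of such an RNN language model, assuming TensorFlow's Keras API. The layer sizes are hypothetical, not the course's exact model: an embedding layer comes first, a recurrent layer processes the sequence, and a softmax output with one unit per vocabulary word gives the probability of the next token.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000  # hypothetical vocabulary size

model = Sequential()
# First layer: vector representations of the tokens
model.add(Embedding(input_dim=vocab_size, output_dim=64))
# Recurrent layer processes the sequence of embeddings
model.add(LSTM(128))
# Softmax layer with units equal to the vocabulary size:
# outputs P(next token | previous tokens)
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')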

6. Building vocabulary dictionaries

When creating RNN models, we need to transform the text data into a sequence of numbers: the indexes of the tokens in the array of unique tokens, the vocabulary. To do that, we first need to create an array containing each unique word of the corpus. We can use the list-set combination to create a list of unique words, and we can get all words in a text by splitting the text using space as the separator. Other languages such as Chinese need additional steps to obtain words, since there is no space between the characters.

We can now create dictionaries that map words to their index in the vocabulary and vice versa, using dictionary comprehensions. By enumerating a list, we obtain the numeric indexes and the items as tuples, and we can use them to create key-value dictionaries. The first dictionary uses the words as keys and the indexes as values; it can transform the text into numerical values. The latter one is used to go back from numbers to words, since it has indexes as keys and words as values.
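A minimal sketch of these steps (the text and variable names are illustrative, not necessarily the slide's):

# Hypothetical raw text; real corpora need more cleaning
text = "i love this movie i love this film"

# Unique words via the list-set combination, splitting on spaces
unique_words = list(set(text.split(' ')))

# Word -> index: used to transform text into numerical values
word_to_index = {word: i for i, word in enumerate(unique_words)}

# Index -> word: used to go back from numbers to words
index_to_word = {i: word for i, word in enumerate(unique_words)}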

7. Preprocessing input

With the created dictionaries, we can prepare pairs of X and y to be used in a supervised machine learning model. For that, we loop over the sequence of numerical indexes in blocks of fixed length. We use the initial words of each block as X and the final word as y, then shift the window forward by step words. If we use a step equal to 2, the X sentences will be shifted by two words at a time.
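A minimal sketch of this windowing, reusing the variables above; sentence_size and step are hypothetical names:

# Sequence of token indexes for the whole text
indexes = [word_to_index[w] for w in text.split(' ')]
sentence_size = 4   # block length: 3 words as X, 1 word as y
step = 2            # shift the window by 2 words at a time

X, y = [], []
for i in range(0, len(indexes) - sentence_size + 1, step):
    block = indexes[i:i + sentence_size]
    X.append(block[:-1])  # initial words
    y.append(block[-1])   # final word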

8. Transforming new texts

When preparing new data, we can use the word-to-index dictionary to get the correct index for each word. Using the example on the slide, we create a list that will contain the transformed text. Then, for every sentence of the new text, we create a temporary list that will hold the current sentence, iterate over all the words of the sentence by splitting the sentence on its white spaces, get each word's index using the dictionary, and append the index to the sentence list. Finally, we append the sentence of indexes to the first list we created, new_text_split.
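A sketch of that loop (new_text is hypothetical sample data; new_text_split collects the result):

new_text = ["i love this film", "i hate this movie"]

new_text_split = []                  # will contain the transformed text
for sentence in new_text:
    sent_split = []                  # temporary list for the current sentence
    for word in sentence.split(' '):     # split on white spaces
        index = word_to_index[word]      # get the index from the dictionary
        sent_split.append(index)         # append the index to the sentence list
    new_text_split.append(sent_split)
# Note: words not in the vocabulary would raise a KeyError here and
# would need special handling, e.g. a dedicated unknown-word index.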

9. Let's practice!

You saw that language models give the probability of a sentence. To train the model, you first need to prepare the raw text. Let's practice!
