1. Multi-class classification models
Now you are going to build models in Keras and see how to perform multi-class classification. Let's dive in!
2. Review of the Sentiment classification model
In the first chapter, you learned to build a sentiment classification model using Keras. The model was created with the Sequential class: an Embedding layer, one LSTM layer, and an output layer with sigmoid as the activation function.
We compiled the model using the binary cross-entropy loss function, the adam optimizer, and accuracy as the metric.
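As a reminder, that model can be sketched like this (the vocabulary size and layer sizes here are illustrative, not the exact values from chapter 1):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

# Illustrative sizes: a 10,000-word vocabulary and small layers
model = Sequential([
    Embedding(input_dim=10000, output_dim=128),  # word index -> dense vector
    LSTM(64),                                    # one recurrent layer
    Dense(1, activation='sigmoid'),              # binary output: P(positive)
])

# Binary cross-entropy loss, adam optimizer, accuracy as metric
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```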
3. Model architecture
The same architecture used for sentiment classification can be used for multi-class classification. The only difference, as mentioned before, is that the last layer now has as many units as there are classes and uses softmax as the activation function.
Also, when compiling, we use the categorical cross-entropy loss function.
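A minimal sketch of those two changes, assuming 20 classes and illustrative vocabulary and layer sizes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

num_classes = 20  # one unit per class in the output layer

model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    LSTM(64),
    Dense(num_classes, activation='softmax'),  # softmax instead of sigmoid
])

# Categorical instead of binary cross-entropy
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

The softmax output is a probability distribution over the classes, so each prediction row sums to one.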
4. 20 News Group dataset
To implement multi-class classification, we are going to use the 20 news groups dataset.
The dataset is available in sklearn through the function fetch_20newsgroups, contained in the sklearn.datasets module.
We can download the train and test data separately by using the parameter subset, as in
news_train = fetch_20newsgroups(subset='train').
The same goes for downloading the test set.
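In code, the two downloads look like this (the first call fetches the data over the network and caches it locally):

```python
from sklearn.datasets import fetch_20newsgroups

# Train and test splits are fetched separately via the subset parameter
news_train = fetch_20newsgroups(subset='train')
news_test = fetch_20newsgroups(subset='test')
```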
5. 20 News Group dataset
The fetched data has the following attributes.
.DESCR contains the documentation of the dataset, including usage examples.
.data is an array containing the text of each news article.
.filenames is an array with the filenames on disk.
.target is an array with the numerical index of the true class of each news article.
.target_names is an array with the unique names of the classes.
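A quick look at those attributes might go as follows (the exact class names printed assume the dataset's standard ordering):

```python
from sklearn.datasets import fetch_20newsgroups

news_train = fetch_20newsgroups(subset='train')

print(len(news_train.target_names))  # number of unique classes
print(news_train.target_names[:3])   # first few class names
print(news_train.target[:3])         # class index of the first articles
print(news_train.data[0][:60])       # start of the first article's text
```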
6. Pre-process text data
We will use the raw texts of the 20 news groups dataset, so you can apply the same steps to any other dataset you are interested in.
To pre-process the texts, we will use the Tokenizer class from the keras.preprocessing.text module and the pad_sequences function from the keras.preprocessing.sequence module. To pre-process the targets, we will use the function to_categorical as before.
The Tokenizer class creates numerical indexes for the vocabulary present in the training data so we can use them in our RNN models. We first instantiate the class and store it in the variable tokenizer.
Then we use the method tokenizer.fit_on_texts and pass news_train.data as the array to fit on. This updates the tokenizer instance with the vocabulary and its indexes.
Next, we transform the text data into sequences of numerical indexes using the tokenizer.texts_to_sequences method applied to news_train.data, and save the result in the x_train variable.
Now we pad the sequences so they all have the same length, using the pad_sequences function. We used maxlen=400 as an example. This value should be big enough not to cut off too much of the texts and small enough to limit the size of the data. If the texts have similar lengths (for example, tweets), you can use the maximum length in your sample as the value (for example, 200).
Finally, we turn the targets into a one-hot encoded matrix using the function to_categorical, passing news_train.target.
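The whole pipeline can be sketched on a toy corpus standing in for news_train.data and news_train.target (three made-up articles and their class indexes):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Toy stand-ins for news_train.data and news_train.target
texts = ["the game went to overtime",
         "the shuttle reached orbit today",
         "graphics cards render images fast"]
targets = [0, 1, 2]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                    # build vocabulary indexes
sequences = tokenizer.texts_to_sequences(texts)  # texts -> index sequences
x_train = pad_sequences(sequences, maxlen=400)   # pad/truncate to length 400
y_train = to_categorical(targets)                # one-hot encode the classes

print(x_train.shape)  # (3, 400)
print(y_train.shape)  # (3, 3)
```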
7. Training on data
Having pre-processed the data, we can use it to train the Keras model. We can also evaluate the model's performance on the test set.
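A minimal end-to-end sketch, using small random arrays standing in for the pre-processed train and test data, with illustrative layer sizes:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

# Toy stand-ins for the pre-processed 20 news groups data
num_classes, vocab_size, maxlen = 20, 1000, 400
x_train = np.random.randint(1, vocab_size, size=(32, maxlen))
y_train = np.eye(num_classes)[np.random.randint(0, num_classes, size=32)]
x_test = np.random.randint(1, vocab_size, size=(8, maxlen))
y_test = np.eye(num_classes)[np.random.randint(0, num_classes, size=8)]

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32),
    LSTM(16),
    Dense(num_classes, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Train on the training data, then evaluate on the held-out test set
model.fit(x_train, y_train, epochs=1, batch_size=8, verbose=0)
loss, acc = model.evaluate(x_test, y_test, verbose=0)
```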
8. Let's practice!
Now you are ready to build your own multi-class classification model using Keras. Let's put it into practice!