Data pre-processing

1. Data pre-processing

You learned before how to perform sentiment analysis on the IMDB dataset. Sentiment analysis was framed as a classification problem with two classes. Let's now learn how to handle problems with more than two classes.

2. Text classification

Text classification can be applied to many different problems. Historically, it has been studied as the classification of news articles into a pre-determined set of classes: based on the summary, the title, or even the whole body of the article, machine learning models determine whether the article is about the economy, sports, real estate, and so on. It can also be used to classify a company's documents into categories so that each one is analyzed only by the corresponding department. Finally, it can direct an online customer service representative to the specific problem by classifying the query the customer wrote when contacting the service, allowing the problem to be solved faster and increasing customer satisfaction.

3. Changes from binary classification

A few parameters change when going from binary to multi-class classification. The most notable ones are: the shape of the variable y containing the classes, the number of units in the output layer, the activation function to use in the output layer, and the loss function.

4. Changes from binary classification

The shape of the variable y changes with the application of one-hot encoding. Consequently, the output layer also needs as many units as there are classes.
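As a minimal sketch of that shape change (the label values and the three-class setup here are just hypothetical):

import numpy as np

# Binary classification: y is a single column of 0/1 labels
y_binary = np.array([0, 1, 1, 0])
print(y_binary.shape)         # (4,)

# Multi-class: y is one-hot encoded, with one column per class
y_multiclass = np.array([[1, 0, 0],
                         [0, 0, 1],
                         [0, 1, 0],
                         [1, 0, 0]])
print(y_multiclass.shape)     # (4, 3) -> the output layer needs 3 units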

5. Changes from binary classification

One-hot encoding makes all the classes equidistant. If we used plain numbers to represent three different classes, we would imply that class one is closer to class two than it is to class three, which in many applications is not the case. Also, during training, the loss function would report a larger error for misclassifying class one as class three than for misclassifying class one as class two, and this is wrong. With one-hot encoding, all the classes are at the same distance from one another. Furthermore, the softmax function will return the probability of each class, so we can easily assign the document to the class with the highest probability, and the loss function will behave as expected.
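A small numpy sketch of this point (the class codes here are hypothetical):

import numpy as np

# With integer codes, class one looks twice as far from class three
# as it does from class two
codes = np.array([1, 2, 3])
print(abs(codes[0] - codes[1]))    # 1
print(abs(codes[0] - codes[2]))    # 2

# With one-hot vectors, every pair of classes is at the same distance
one_hot = np.eye(3)
print(np.linalg.norm(one_hot[0] - one_hot[1]))    # 1.414...
print(np.linalg.norm(one_hot[0] - one_hot[2]))    # 1.414...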

6. Changes from binary classification

The same goes for the activation function. The sigmoid function is very useful and fast for separating two classes, but it is not recommended when we have more than two. Instead, we use the softmax function, which gives the probability of every class given the inputs, and we can choose the one with the highest probability. Finally, we were using the binary cross-entropy loss function, but now that we have more than two classes it is more appropriate to use the corresponding version, called categorical cross-entropy.
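Putting these changes together, here is a minimal keras sketch (num_classes and input_dim are assumptions for illustration, not values from the course):

from keras.models import Sequential
from keras.layers import Dense

num_classes = 3    # assumption: three document categories
input_dim = 100    # hypothetical number of input features

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(input_dim,)))
# One unit per class, with softmax returning a probability for each class
model.add(Dense(num_classes, activation='softmax'))
# Categorical cross-entropy replaces binary cross-entropy
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])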

7. Preparing text categories for keras

Sometimes, our data uses text to represent the classes. In that case, we can use the pandas Series category data type to transform the text into numbers by accessing its cat dot codes attribute. This is the first step in preparing the data.
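For example, with hypothetical text labels:

import pandas as pd

df = pd.DataFrame({'label': ['sports', 'economy', 'sports', 'politics']})

# Cast the column to the category dtype, then read the integer codes
df['label'] = df['label'].astype('category')
print(df['label'].cat.codes)
# 0    2
# 1    0
# 2    2
# 3    1
# dtype: int8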

8. Pre-processing y

The second step is to transform y into one-hot encoded values, using the function to_categorical from keras dot utils. We simply apply it to the numeric vector of class codes to obtain the one-hot encoded version of the vector.
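Continuing the example above (the class codes are the hypothetical ones from the previous step):

import numpy as np
from keras.utils import to_categorical

# Numeric class codes, e.g. the output of cat dot codes above
y = np.array([2, 0, 2, 1])
y_onehot = to_categorical(y)
print(y_onehot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]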

9. Let's practice!

Well, let's first put those pre-processing changes into practice before going straight to building the multi-class classification models.