1. Introduction to deep learning
In this chapter, we will go over the basic concepts behind deep learning and get a glimpse into how it is used in real-life advertising systems.
2. Perceptrons
Deep learning is a field of machine learning that tries to emulate the human brain when teaching computers to learn. The basic building blocks are called perceptrons, which look like the following: they start with input values (x), the numerical features we have, which are first standardized. These values are then combined through a set of weights (w) into a value z. In other words, z is a linear combination of the x's: each x is multiplied by its corresponding weight w, and the products are summed. This value then passes through a function called the activation function, whose goal is to add non-linearity to the process. The resulting value, a, is converted to a predicted class label by applying a chosen threshold, which can be expressed as a unit step function: the label is 1 if a is above the threshold, and 0 otherwise. Lastly, on the top left you can see the bias unit - an additional input fixed at 1 with its own weight, which allows you to shift the activation function, analogous to the intercept in linear regression.
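To make this concrete, here is a minimal sketch of one perceptron step in Python with NumPy; the feature values, weights, and threshold below are made up for illustration, and we assume a sigmoid for the activation function.

import numpy as np

# Hypothetical standardized input features x and learned weights w
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.8, 0.4, -0.6])
b = 0.1  # weight on the bias unit, whose input is always 1

# z is a linear combination of the x's, plus the bias
z = np.dot(w, x) + b

# The activation function adds non-linearity; a sigmoid is assumed here
a = 1 / (1 + np.exp(-z))

# Unit step function: the label is 1 if a is above the threshold, else 0
label = 1 if a > 0.5 else 0
print(z, a, label)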
3. Hidden layers and activation functions
In the basic perceptron setup we just saw, there is only one layer that is neither input nor output - a single hidden layer. In contrast, deep learning networks have many hidden layers, as seen in red. This allows an arbitrary number and variety of transformations to be applied to the input features. As just discussed, the outputs at each hidden layer are linear combinations of the outputs from the previous layer, transformed by a non-linear activation function. Here are some examples of activation functions; the most common ones, shown on the left, are sigmoid, tanh, and ReLU (the rectified linear unit).
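As a quick sketch, all three of these activation functions can be written in a few lines of NumPy:

import numpy as np

def sigmoid(z):
    # Squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes any input into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z))
print(tanh(z))
print(relu(z))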
4. Implementation
In sklearn, we can create a multi-layer perceptron model, or MLP, which represents a neural network composed of perceptrons. The main parameters are activation (the type of activation function), alpha (the regularization constant), hidden_layer_sizes (describing the hidden layers), learning_rate (which determines how quickly the network tunes its weights based on feedback from training), and max_iter (the number of training iterations). All of these parameters will be discussed in more depth in the next chapter. hidden_layer_sizes is a tuple whose ith element gives the number of neurons in the ith hidden layer. So in this example, the model has 100 hidden units within 1 hidden layer.
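As a sketch, the model could be created like this. The parameter values written out below are sklearn's defaults for MLPClassifier, which give the one hidden layer of 100 units mentioned above; the toy data is made up purely for illustration.

import numpy as np
from sklearn.neural_network import MLPClassifier

# One hidden layer with 100 neurons; the other values are the defaults
model = MLPClassifier(hidden_layer_sizes=(100,),
                      activation='relu',
                      alpha=0.0001,
                      learning_rate='constant',
                      max_iter=200)

# Hypothetical toy data: 20 samples, 4 features, binary labels
X = np.random.rand(20, 4)
y = np.random.randint(0, 2, size=20)
model.fit(X, y)
print(model.predict(X[:3]))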
5. Other considerations
Before usage, input features need to be standardized, which helps the network converge during training. This can be done with sklearn's StandardScaler() we discussed in chapter 2, as shown in the sketch at the end of this section. In real life, the deep learning systems behind large-scale ad CTR prediction often involve extremely deep networks with many millions of parameters. The features, recording events such as user X clicking or viewing item Y, form "sparse" matrices: matrices that are mostly 0 with occasional nonzero entries. In general, deep learning is powerful because models can achieve much better performance with more data. However, there are some downsides: the model becomes more "black-box" and less transparent about how it is analyzing features, and it also takes a long time to compute given the size of the networks.
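Here is a minimal sketch of that standardization step, using a hypothetical raw feature matrix X:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features: 3 samples, 2 features on different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1
print(X_scaled)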
6. Let's practice!
Now that we've done a high-level overview of the inner workings of neural networks, let's practice!