Theory: The Multi-Layer Perceptron

There are different models of machine learning, and an important one is supervised learning 1. This model requires that we have input data as well as the corresponding output data. The output data acts as a “supervisor”, comparing the output of the algorithm (i.e. a prediction, or \hat{Y}) to the actual value from the data (i.e. Y) in order to calculate the difference between them. This difference (or error) is used to tune the algorithm, with the hope that the error will be smaller the next time. Not too different from the first time I learned not to touch a hot surface!

A perceptron is the basic unit of a neural network. It is simply a mathematical function that takes in one or more inputs, performs an operation, and produces an output. The following tutorial goes over the basic functioning of a perceptron.

We can arrange several perceptrons in layers to create a multilayer feedforward neural network. When trained with the error back-propagation algorithm, this type of arrangement is called a back-propagation network. We call it feedforward because the input propagates sequentially through the layers of the network all the way forward to create an output (i.e. a prediction, or \hat{Y}). The prediction is compared to the actual output to calculate an error, which then propagates backwards through the network, tuning the weights along the way (hence the back-propagation terminology).

There are a few key equations that give one all the mathematics necessary to create a back-propagation multilayer perceptron network (hereafter referred to as MLP in this post). We will describe these in terms of the forward and backward passes through the network.

The Forward Pass

When a signal propagates forward through an MLP, it creates (or induces) a local field \upsilon_j(n) at neuron j for the nth example of input data. This field is computed as:


(1)   \begin{equation*}  \upsilon_j(n)=\sum_{i=1}^{M}{w_{jiL}(n)y_i(n)} \end{equation*}

where M is the total number of incoming connections into neuron j, y_i(n) is the ith input to neuron j, and L is the layer of the network (e.g. for a network with one hidden layer and an output layer, L will be 1 or 2 respectively). The weight connecting y_i(n) to neuron j in layer L is denoted by w_{jiL}. Note that for the first hidden layer, y_i(n) is the same as the ith data input. For other layers, it represents the output from the respective neurons of the previous layer.
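In code, equation (1) is just a dot product between a neuron's incoming weights and its inputs. Here is a minimal sketch in Python with NumPy; the weights and inputs are made-up values for illustration:

```python
import numpy as np

# Induced field at neuron j (equation 1): the weighted sum of the
# incoming signals y_i(n) using the weights w_ji of the current layer.
def induced_field(w_j, y):
    """w_j: weights into neuron j (length M); y: the M inputs to neuron j."""
    return np.dot(w_j, y)

w_j = np.array([0.5, -0.3, 0.8])   # illustrative weights into neuron j
y = np.array([1.0, 2.0, 0.5])      # illustrative inputs
v_j = induced_field(w_j, y)        # 0.5*1.0 - 0.3*2.0 + 0.8*0.5 = 0.3
```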

Once the induced field is calculated, the output of neuron j can be calculated according to the selected activation function. We will use the hyperbolic tangent function for this purpose 2. The output at neuron j is then calculated as:


(2)   \begin{equation*}  \begin{align} \varphi(\upsilon_j(n)) &= a\tanh(b\upsilon_j(n)),   (a,b)>0\\ &= a\frac{\sinh(b\upsilon_j(n))}{\cosh(b\upsilon_j(n))}\\ &= a\frac{e^{b\upsilon_j(n)}-e^{-b\upsilon_j(n)}}{e^{b\upsilon_j(n)}+e^{-b\upsilon_j(n)}}\\ &= a\frac{e^{2b\upsilon_j(n)}-1}{e^{2b\upsilon_j(n)}+1} \end{align} \end{equation*}

where a and b are constants that are greater than zero. Practically useful values for these constants are a=1.7159 and b=2/3. \upsilon_j(n) for neuron j at the nth data example is calculated as in equation (1).

And that’s it! Equations 1 & 2 completely specify the forward computational pass through the MLP. Now, let’s look at the more complicated backwards pass.
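To make the forward pass concrete, here is a small sketch that chains equations (1) and (2) through one hidden layer and a single output neuron. The network size, weights, and input are my own illustrative choices, not values from the post:

```python
import numpy as np

A, B = 1.7159, 2.0 / 3.0   # activation constants from equation (2)

def phi(v):
    # Hyperbolic tangent activation, equation (2)
    return A * np.tanh(B * v)

def forward(x, W1, W2):
    # Hidden layer: induced fields (equation 1) for all neurons at once,
    # then their activations (equation 2)
    y_hidden = phi(W1 @ x)
    # Output layer: a single neuron producing the prediction, y-hat(n)
    y_hat = phi(W2 @ y_hidden)
    return y_hidden, y_hat

W1 = np.array([[0.2, -0.4],
               [0.7,  0.1]])      # 2 hidden neurons, 2 inputs
W2 = np.array([0.5, -0.6])        # weights into the output neuron
x = np.array([1.0, 0.5])
_, y_hat = forward(x, W1, W2)
```

Note that because tanh is bounded, the prediction can never leave the interval (-a, a); this is one reason the choice of a matters.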

The Backwards Pass

The first thing we have to do is compute the error term, \varepsilon. The output at the final neuron from the forward pass is the neural network’s prediction for the nth example of input data; let’s call it \hat{Y}(n). We compute the error as the difference between the actual value, Y(n), and this prediction.


(3)   \begin{equation*}  \begin{align} \varepsilon_j(n) = Y_j(n) - \hat{Y_j}(n), \text{where $j$ is the output neuron} \end{align} \end{equation*}


Now, we need the equations to update the weights of the network (the equations are not derived here). We use the following equation to update the connecting weights throughout the MLP:


(4)   \begin{equation*}  \begin{align} \Delta w_{jiL}(n) = \eta \delta_j(n) y_i(n) \end{align} \end{equation*}

where \Delta w_{jiL}(n) is the change in the weight connecting neuron i in layer L-1 to neuron j in layer L, and \eta is the learning rate constant (heuristic values for it range from 0.01 to 5). y_i(n) is the output of neuron i or, in the case of the input layer, simply the ith input. There is one new term in equation (4) above: the local gradient \delta_j(n). For the output neuron, it is defined as:


(5)   \begin{equation*}  \begin{align} \delta_j(n) &= \varepsilon_j(n) \varphi'(\upsilon_j(n)), \text{where $j$ is the output neuron}  \\ &= \frac{b}{a}[Y_j(n) - \hat{Y_j}(n)][a-\hat{Y_j}(n)][a+\hat{Y_j}(n)] \end{align} \end{equation*}

where a and b are the same constants as in equation (2), and \varphi' is the first derivative of the activation function in equation (2).
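A convenient property of the tanh activation is that its derivative can be written purely in terms of the neuron's output y = \varphi(\upsilon), namely \varphi'(\upsilon) = (b/a)(a - y)(a + y), which is what makes the bracketed form above possible. Here is a quick numerical sanity check of that identity (the check is my own, comparing the closed form against a central finite difference):

```python
import numpy as np

a, b = 1.7159, 2.0 / 3.0

def phi(v):
    # The activation from equation (2)
    return a * np.tanh(b * v)

def phi_prime_from_output(y):
    # The derivative of phi, expressed using only the output y = phi(v)
    return (b / a) * (a - y) * (a + y)

v = 0.37                                            # arbitrary test point
y = phi(v)
analytic = phi_prime_from_output(y)
numeric = (phi(v + 1e-6) - phi(v - 1e-6)) / 2e-6    # central difference
# analytic and numeric agree to high precision
```

This trick matters in practice: during the backward pass you already have each neuron's output from the forward pass, so no extra tanh evaluations are needed to compute the gradients.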

Now that you have the local gradient for the output neuron from (5), you can use equation (4) to update the weights of the incoming connections to the output neuron. In order to update the connecting weights to any hidden neurons, you need to first calculate the local gradient at each of the hidden neurons using the following equation:


(6)   \begin{equation*}  \begin{align} \delta_j(n) &= \varphi'(\upsilon_j(n))\sum_k \delta_k(n) w_{kj}(n), \text{where $j$ is a hidden neuron}  \\ &= \frac{b}{a}[a-y_j(n)][a+y_j(n)]\sum_k \delta_k(n) w_{kj}(n) \end{align} \end{equation*}

where \sum_k is over all of the outgoing connections from neuron j in layer L to the neurons k=1,\ldots in layer L+1.

Using equation (6) and equation (4), you can now update the remaining weights in the neural network.
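Putting all six equations together, one complete training step looks like the sketch below: a forward pass, the error, the local gradients, and the weight updates. The tiny 2-input, 2-hidden-neuron, 1-output network, its initial weights, the data point, and the learning rate are all my own illustrative choices:

```python
import numpy as np

a, b, eta = 1.7159, 2.0 / 3.0, 0.1

phi = lambda v: a * np.tanh(b * v)                # equation (2)
dphi = lambda y: (b / a) * (a - y) * (a + y)      # phi' in terms of the output

def train_step(x, Y, W1, W2):
    # --- forward pass: equations (1) and (2) ---
    y1 = phi(W1 @ x)               # hidden-layer outputs
    Y_hat = phi(W2 @ y1)           # the network's prediction
    # --- backward pass ---
    eps = Y - Y_hat                            # equation (3)
    delta_out = eps * dphi(Y_hat)              # equation (5), output neuron
    delta_hidden = dphi(y1) * delta_out * W2   # equation (6), hidden neurons
    # --- weight updates: equation (4) ---
    W2 = W2 + eta * delta_out * y1
    W1 = W1 + eta * np.outer(delta_hidden, x)
    return W1, W2, (Y - Y_hat) ** 2            # squared error before update

W1 = np.array([[0.2, -0.4], [0.7, 0.1]])   # initial hidden weights
W2 = np.array([0.5, -0.6])                 # initial output weights
x, Y = np.array([1.0, 0.5]), 0.8           # one training example

err = None
for _ in range(50):                        # repeated steps shrink the error
    W1, W2, err = train_step(x, Y, W1, W2)
```

Note that the hidden-layer gradients in equation (6) are computed with the output weights as they were before the update, so the code evaluates delta_hidden before modifying W2.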

I know that was a lot of algebra! Fear not – in the next post I will provide a simple working model of this mathematics in Microsoft Excel. You will be able to see the workings of the forward and backwards pass in live action!

Go back to Volume 2: Practice.


  1. There are other learning paradigms – such as Hebbian, memory-based, etc. – that we will explore in a later post.
  2. A sigmoidal function is also commonly used; however, the hyperbolic tangent function is antisymmetric and usually performs better.