Neural networks

Artificial neural networks are powerful and versatile machine learning systems that are perhaps the most commonly used learning method for many modern machine learning tasks.

Neural networks are often used for machine learning tasks where not much meaning or knowledge can be derived from the individual features of the training data. Instead, neural networks rely on a system of nodes called neurons, which are arranged in collections called layers, where the neurons in adjacent layers are connected. These connections have varied strengths, determined by a scalar referred to as a weight.

The purpose of the neurons on different layers is to learn the underlying (mostly non-linear) relationships between the original features and the output. Nevertheless, neural networks are often still treated as black boxes when compared to other machine learning models, since it isn't always straightforward to understand what the individual layers and neurons represent.

Layers and neurons

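A neural network may consist of many layers. There are three distinct types of layers: the input layer, one or more hidden layers, and the output layer.

Note: When we say a neural network consists of a certain number of layers, that number is the sum of the hidden layers and the output layer; the input layer is not counted.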

Activation functions

As explained previously, the purpose of an activation function is to introduce non-linearity to a neural network—it transforms the value of the weighted sum and is chosen specifically for the type of task being performed. Below is a list of some commonly used activation functions:

Linear function
Threshold (step) function
  • Can be used to make a decision between two binary classes, given a real-valued argument. For example, since the logistic function yields a probability (to which we would still have to apply a threshold to determine the class), we can instead determine the class directly by using the step function in the output layer.
Logistic (sigmoid) function
  • Typically used in the output layer of a neural network performing binary classification.
  • Suffers from the issue of vanishing gradients as the magnitude of its input increases.
  • Differentiable across its entire domain.
Hyperbolic tangent function
  • Can mathematically be represented as a shifted and scaled version of the logistic function, and is similarly used for binary classification tasks.
  • However, its output cannot be used directly in negative log-likelihood cost functions, because the function's range is $(-1, 1)$ rather than $(0, 1)$.
  • Suffers from the same vanishing gradient issue as the logistic function.
Rectified linear unit (ReLU)
ReLU is generally preferred to sigmoidal activation functions such as the logistic and hyperbolic tangent functions in deeper neural networks, mainly because:
  • its simpler gradients (during back-propagation) allow each layer to be trained more quickly,
  • its gradient does not shrink for positive inputs, so it largely avoids the vanishing gradient problem.
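The activation functions listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the 0/1 outputs and zero threshold of `step`, and all function names, are assumptions made for the example. The loop at the end compares the gradients of the logistic function and ReLU, which is one way to see the vanishing gradient issue.

```python
import math

def linear(z):
    # Identity: passes the weighted sum through unchanged.
    return z

def step(z, threshold=0.0):
    # Assumed convention: class 1 at or above the threshold, class 0 below it.
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    # Logistic function, written to avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    # Derivative of the logistic function: s * (1 - s).
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    # Equal to 2 * sigmoid(2z) - 1, i.e. a shifted and scaled logistic function.
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return 1.0 if z > 0 else 0.0

# The logistic gradient shrinks towards 0 as |z| grows; ReLU's stays at 1 for z > 0.
for z in (0.5, 2.0, 10.0):
    print(z, sigmoid_grad(z), relu_grad(z))
```

For positive inputs the ReLU gradient stays at 1 however large the input becomes, while the logistic gradient is at most 0.25 and decays quickly.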

Softmax function

The softmax function is a special type of activation function which takes a vector of $K$ real numbers as input and normalizes it into a discrete probability distribution consisting of $K$ probabilities.

The standard softmax function is defined by the formula:

$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K$$
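A minimal Python sketch of the softmax function follows. Subtracting the largest input before exponentiating is a common numerical-stability trick and an addition of this example; it does not change the result, because the common factor cancels in the ratio. The input values are arbitrary.

```python
import math

def softmax(z):
    # Shift by the maximum so that exp() never overflows.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are all positive, sum to 1, and preserve the ordering of the inputs, so the largest input receives the largest probability.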

Multi-layer neural networks

A multi-layer neural network is a neural network that contains at least one hidden layer.

Figure 3: A multi-layer neural network consisting of one hidden layer with two neurons.

As seen in Figure 3, the neurons in each additional layer are connected to the neurons in the previous layer as well as to the neurons in the next layer. The input to each neuron depends on the outputs of the previous layer and on one weight for each neuron in that previous layer.

As a result, we require more structures and indexing to formally represent a neural network. Using Figure 3 as an example:

As an example of these forms of representation, the multi-layer neural network in Figure 3 can be represented by the following matrices and vectors:
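As a rough sketch of how a matrix and vector representation drives a forward pass, the following uses a hypothetical network with two inputs, two hidden neurons, and one output. The weight values, the bias vectors, the logistic activation, and the one-row-per-neuron layout of each weight matrix are all assumptions made for illustration; they are not the values from Figure 3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev, activation):
    # Assumed layout: W[j][i] is the weight from neuron i in the previous
    # layer to neuron j in this layer, and b[j] is the bias of neuron j.
    return [
        activation(sum(w * a for w, a in zip(row, a_prev)) + bias)
        for row, bias in zip(W, b)
    ]

# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
W1 = [[0.5, -0.2], [0.3, 0.8]]
b1 = [0.1, -0.1]
W2 = [[1.0, -1.0]]
b2 = [0.0]

x = [1.0, 2.0]
hidden = layer_forward(W1, b1, x, sigmoid)       # outputs of the hidden layer
output = layer_forward(W2, b2, hidden, sigmoid)  # output of the network
print(hidden, output)
```

Each layer only needs the outputs of the layer before it, which is why the network can be evaluated one layer at a time.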

Training a multi-layer neural network

As with other machine learning algorithms such as linear and logistic regression, in order to train a neural network we need to find the weights that minimize some error function. However, this is more of a challenge in neural networks for three reasons:

  1. There are far more weights to optimize, with one entire weight matrix between each pair of adjacent layers.

  2. Changing a single weight that connects a neuron in one layer to a neuron in the next layer affects all of the neurons in subsequent layers due to propagation, as seen in Figure 4 below.

    Figure 4: The propagating effect of modifying one weight, on subsequent layers.

  3. The error functions used in neural networks are generally non-convex in the weights, meaning that optimization methods are not guaranteed to converge to a global minimum.

    Sometimes, however, converging to a local minimum is good enough and produces weights with which the neural network performs well.

Error functions

The purpose of an error function is to evaluate the performance of the neural network when given some training examples. As neural networks are most commonly used in the supervised learning setting, we have access to the actual outputs of the training examples.

The error function is a measure of the inconsistency between the predicted outputs and the actual outputs for a set of training examples. Therefore, the lower the error function, the better the model fits those examples. The optimal weights are thus the ones that minimize the error function $E$:

$$\mathbf{W}^{*} = \underset{\mathbf{W}}{\arg\min}\; E(\mathbf{W})$$

Where $\mathbf{W}$ represents all of the weights in the neural network, and can therefore be represented as a vector of the weight matrices for each of the $L$ layers:

$$\mathbf{W} = \left(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \ldots, \mathbf{W}^{(L)}\right)$$

Neural networks may be used to predict a real-valued output (like linear regression), a binary output (binary classification), or a categorical output with more than two classes (multi-class/multinomial classification). As a result, the error function used in a neural network must be chosen depending on the nature of the output variable.

Real-valued output

Similarly to linear regression, the residual sum of squares (RSS) error function can be used for neural networks.

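However, for the back-propagation training algorithm, it is normally preferable for the error function to be expressed as a mean over the samples. For this reason, mean squared error (MSE) is used more often. Additionally, this quantity is scaled by $\frac{1}{2}$ to make derivatives cleaner:

$$E(\mathbf{W}) = \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - \hat{y}_n\right)^2$$

Where $N$ is the number of training examples, $y_n$ is the actual output of the $n$-th example, and $\hat{y}_n$ is its predicted output.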
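The relationship between RSS and the halved MSE can be sketched as follows. The function names and the sample values are made up for this example.

```python
def rss(y_true, y_pred):
    # Residual sum of squares over all examples.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def half_mse(y_true, y_pred):
    # Mean squared error scaled by 1/2; the factor cancels the 2 that
    # appears when the squared term is differentiated.
    return rss(y_true, y_pred) / (2 * len(y_true))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(rss(y_true, y_pred), half_mse(y_true, y_pred))
```

Because the two quantities differ only by a positive constant factor for a fixed number of examples, they are minimized by the same weights.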

Binary output

As we have seen in logistic regression for binary classification, the output variable can be modeled by the Bernoulli distribution. As a result, the most appropriate cost function to use in that case was the negative log-likelihood. Minimizing it yields the maximum likelihood estimate of the weights for the training data.

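For neural networks, we typically use cross entropy. Cross entropy is a measure of dissimilarity between two discrete probability distributions. As this is a binary classification problem, we have a Bernoulli-distributed output variable for each sample, so the cross entropy between the actual output $y_n \in \{0, 1\}$ and the predicted probability $\hat{y}_n$ of the $n$-th sample is:

$$H(y_n, \hat{y}_n) = -\left[y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n)\right]$$

In binary classification problems, we use an average of the per-example binary cross entropies (BCE) over the $N$ training examples as the cost function:

$$E(\mathbf{W}) = -\frac{1}{N} \sum_{n=1}^{N} \left[y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n)\right]$$

Observe that this is equivalent to taking the mean of the negative log likelihood function for the Bernoulli distribution:

$$-\frac{1}{N} \log \prod_{n=1}^{N} \hat{y}_n^{\,y_n} \left(1 - \hat{y}_n\right)^{1 - y_n}$$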
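The equivalence between mean binary cross entropy and the mean negative log likelihood of a Bernoulli model can be checked numerically. The sketch below assumes labels in {0, 1} and predicted probabilities strictly between 0 and 1; the clipping constant `eps` and the sample values are additions for this example.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Mean of the per-example binary cross entropies.
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def mean_bernoulli_nll(y_true, y_pred):
    # Negative log of the Bernoulli likelihood of the data, divided by N.
    likelihood = 1.0
    for y, p in zip(y_true, y_pred):
        likelihood *= p ** y * (1.0 - p) ** (1 - y)
    return -math.log(likelihood) / len(y_true)

y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.6]
print(binary_cross_entropy(y_true, y_pred), mean_bernoulli_nll(y_true, y_pred))
```

Both functions print the same value up to floating-point rounding. In practice the sum-of-logs form is preferred, since a product of many probabilities underflows quickly.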

Multinomial output

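For multinomial classification problems, multinomial cross entropy is used as the error function for neural networks. This is a generalization of the previously seen binary cross entropy function, which accommodates multiple classes through one-hot encoded targets and a softmax output layer:

$$E(\mathbf{W}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log \hat{y}_{n,k}$$

Where $K$ is the number of classes, $y_{n,k}$ is $1$ if the $n$-th example belongs to class $k$ and $0$ otherwise, and $\hat{y}_{n,k}$ is the predicted probability of class $k$ for that example.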
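A minimal sketch of multinomial cross entropy with one-hot encoded targets follows. The helper names, the zero-based class labels, the clipping constant `eps`, and the example probabilities (standing in for softmax outputs) are assumptions made for illustration.

```python
import math

def one_hot(label, num_classes):
    # Vector with a 1 at the index of the true class and 0 elsewhere.
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def categorical_cross_entropy(targets, probs, eps=1e-12):
    # Mean over examples of -sum_k y_k * log(p_k).
    total = 0.0
    for t_row, p_row in zip(targets, probs):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(targets)

targets = [one_hot(0, 3), one_hot(2, 3)]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # stand-ins for softmax outputs
loss = categorical_cross_entropy(targets, probs)
print(loss)
```

Because the targets are one-hot, only the predicted probability of the true class contributes to each example's term.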



Back-propagation

As shown earlier in Figure 4, the modification of a single weight in any layer will have an effect on the output of the final layer. As a result, updating weights through gradient descent exactly as we did in the Logistic Regression notes will not work directly in a neural network, since we have multiple weight matrices.

Instead, the back-propagation algorithm is used to extend gradient descent to the context of hidden layers and neurons. The idea behind back-propagation is to choose an appropriate cost function and then systematically modify the weights of the various neurons in order to minimize this cost function. The weights are still updated as in gradient descent, but the gradient vector is calculated with the chain rule, working backwards from the output layer.

The purpose of back-propagation is to calculate all of the error derivatives of the neural network.
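To make the role of the chain rule concrete, here is a small sketch on a hypothetical network with a single input, one hidden neuron, and one output neuron, both using the logistic function, with a halved squared error. The network, the variable names, and the numeric values are assumptions for this example. The analytic derivatives are compared against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    z1 = w1 * x          # weighted sum of the hidden neuron
    a1 = sigmoid(z1)     # output of the hidden neuron
    z2 = w2 * a1         # weighted sum of the output neuron
    y_hat = sigmoid(z2)  # predicted output
    return z1, a1, z2, y_hat

def error(w1, w2, x, y):
    y_hat = forward(w1, w2, x)[3]
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    # Apply the chain rule from the output back towards the input.
    z1, a1, z2, y_hat = forward(w1, w2, x)
    dE_dyhat = y_hat - y
    dE_dz2 = dE_dyhat * y_hat * (1.0 - y_hat)  # logistic derivative
    dE_dw2 = dE_dz2 * a1
    dE_da1 = dE_dz2 * w2
    dE_dz1 = dE_da1 * a1 * (1.0 - a1)
    dE_dw1 = dE_dz1 * x
    return dE_dw1, dE_dw2

def numerical_grad(w1, w2, x, y, h=1e-6):
    # Central finite differences, used only to check the analytic result.
    g1 = (error(w1 + h, w2, x, y) - error(w1 - h, w2, x, y)) / (2 * h)
    g2 = (error(w1, w2 + h, x, y) - error(w1, w2 - h, x, y)) / (2 * h)
    return g1, g2

w1, w2, x, y = 0.4, -0.7, 1.5, 1.0
analytic = backprop(w1, w2, x, y)
numeric = numerical_grad(w1, w2, x, y)
print(analytic, numeric)
```

The derivative for the first weight reuses the intermediate derivative already computed for the output neuron. That reuse is what makes back-propagation cheaper than differentiating with respect to each weight independently.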

Error derivatives

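The back-propagation algorithm decides how much to update each weight of the network after comparing the predicted output with the desired output for a particular example. For this, we need to compute how the error changes with respect to each weight $w$, that is, $\frac{\partial E}{\partial w}$.

Once we have these error derivatives, the weights can be updated using a simple update rule for some learning rate $\eta > 0$:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}$$

Or as a per-layer weight matrix update:

$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial E}{\partial \mathbf{W}^{(l)}}$$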
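The update rule itself is independent of how the error derivatives were obtained. The sketch below applies it repeatedly to a toy convex error function with a known minimum at (3, -1). That function, the learning rate of 0.1, and the helper names are assumptions chosen so that the result is easy to verify; a real network would supply the derivatives from back-propagation instead.

```python
def gradient_descent_step(weights, grads, learning_rate):
    # Move every weight a small step against its error derivative.
    return [w - learning_rate * g for w, g in zip(weights, grads)]

def error(w):
    # Toy error function with its minimum at w = (3, -1).
    return 0.5 * ((w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2)

def error_grad(w):
    # Partial derivatives of the toy error with respect to each weight.
    return [w[0] - 3.0, w[1] + 1.0]

w = [0.0, 0.0]
for _ in range(100):
    w = gradient_descent_step(w, error_grad(w), learning_rate=0.1)
print(w, error(w))
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor. Neural network error surfaces are not this well behaved, so the same rule gives no such guarantee there.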


Additional derivatives

To help compute these error derivatives, we store two additional derivatives for each neuron:


Example

Consider a single training example whose predicted output $\hat{y}$ is produced by the following neural network:

Figure 5: Modified version of the neural network shown in Figure 3, with an additional layer consisting of one output neuron.

The neural network has the following properties:

The neural network can be defined by the following matrices and vectors (as seen before):

Back-propagation procedure

To begin back-propagating to find the cost derivatives, we start at the end of the network, with our predicted output $\hat{y}$.

  1. Given our cost function , we have:

    Note: Recall that , so we can also say that .

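  2. Now that we have the derivative of the cost with respect to the predicted output, we can find the derivative of the cost with respect to the output neuron's weighted sum through the use of the chain rule: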

    Observe that may be expressed as:

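    The logistic function $\sigma$ has the nice property that its derivative can be written in terms of the function itself: $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$. This allows us to further simplify the expression above: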

    However, to maintain some generality over the various activation functions, we will continue to write this as .

    In Step 1 we found that and can therefore write as: