Contents: Neural networks; Layers and neurons; Activation functions; Softmax function; Multilayer neural networks; Training a multilayer neural network; Error functions (Real-valued output, Binary output, Multinomial output); Backpropagation; Error derivatives; Additional derivatives; Example; Backpropagation procedure; Backpropagating with all training examples; Epochs and batches; Types of gradient descent (Mini-batch (stochastic) gradient descent, Stochastic gradient descent, Batch gradient descent); Resources
Artificial neural networks are powerful and versatile machine learning systems that are perhaps the most commonly used learning method for many modern machine learning tasks.
Neural networks are often used for machine learning tasks where not much meaning or knowledge can be derived from the individual features of the training data. Instead, neural networks rely on a system of nodes called neurons, which are arranged in collections called layers, where the neurons in adjacent layers are connected. These connections have varied strengths, determined by a scalar referred to as a weight.
The purpose of the neurons on different layers is to learn the underlying (mostly nonlinear) relationships between the original features and the output. Nevertheless, neural networks are often still treated as black boxes when compared to other machine learning models, since it isn't always straightforward to understand what the individual layers and neurons represent.
A neural network may consist of many layers. Namely, there are three distinct types of layers:
Input layer:
Hidden layer:
Consists of neurons, each made up of two functions: a weighted sum and an activation function.
Figure 1: An artificial neuron (depicted in blue).
The weighted sum is computed from the inputs, which are the outputs from the previous layer, along with their corresponding weights. It is calculated as the dot product between these two vectors.
Observe that this dot product may be an arbitrary real-valued number, depending on the weights; therefore, this result is passed through the activation function to transform it.
The purpose of an activation function is to introduce nonlinearity to a neural network: it transforms the value of the weighted sum and is chosen specifically for the type of task being performed.
Example: In the case of logistic regression, we saw that the logistic function can be used to squash a real-valued input into the range $(0, 1)$. This is an example of a commonly used activation function.
Figure 2: The logistic sigmoid activation function. (source)
There are many other activation functions which will be described later.
The outputs of the neurons in the final hidden layer are passed to the neurons in the output layer.
Output layer:
Note: When we say a neural network consists of $L$ layers, $L$ is the number of hidden layers plus the output layer.
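The two functions that make up a neuron, as described above, can be sketched in a few lines of plain Python. This is only an illustration: the names `logistic` and `neuron_output`, and the input and weight values, are hypothetical and do not come from the original notes.

```python
import math

def logistic(z):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, activation=logistic):
    """Compute the weighted sum (a dot product), then apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical values: a bias input of 1.0 followed by two features.
inputs = [1.0, 0.5, -2.0]
weights = [0.5, 1.0, 0.5]
output = neuron_output(inputs, weights)  # weighted sum is 0.5 + 0.5 - 1.0 = 0.0
```

Because the weighted sum here is zero, the logistic activation returns its midpoint value of 0.5.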
As explained previously, the purpose of an activation function is to introduce nonlinearity to a neural network: it transforms the value of the weighted sum and is chosen specifically for the type of task being performed. Below is a list of some commonly used activation functions:
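| Name | Function | Comments |
|---|---|---|
| Linear function | $f(z) = z$ | |
| Threshold (step) function | $f(z) = 1$ if $z \geq 0$, otherwise $0$ | |
| Logistic (sigmoid) function | $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ | |
| Hyperbolic tangent function | $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | |
| Rectified linear unit (ReLU) | $f(z) = \max(0, z)$ | ReLU is generally preferred to sigmoidal activation functions such as the logistic function and $\tanh$ in deeper neural networks. |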

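The activation functions listed above could be written as follows. This is a minimal sketch using standard textbook definitions; the function names are hypothetical, and the convention that the threshold function returns 1 at exactly zero is an assumption rather than something stated in the notes.

```python
import math

def linear(z):
    """Identity: passes the weighted sum through unchanged."""
    return z

def threshold(z):
    """Step function (assumed convention: 1 for z >= 0, otherwise 0)."""
    return 1.0 if z >= 0 else 0.0

def logistic(z):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hyperbolic_tangent(z):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)
```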
The softmax function is a special type of activation function which takes a vector of $K$ real numbers as input, and normalizes it into a discrete probability distribution consisting of $K$ probabilities.
The standard softmax function is defined by the formula:

$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K$$
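As an illustration, softmax can be sketched in plain Python as below. The function name is hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import math

def softmax(z):
    """Normalize a list of real numbers into a discrete probability distribution."""
    if not z:
        raise ValueError("softmax requires at least one value")
    largest = max(z)
    # Shifting by the maximum avoids overflow in exp() without changing the ratios.
    exps = [math.exp(value - largest) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
```

The returned values are all positive, sum to one, and preserve the ordering of the inputs, so the largest input receives the largest probability.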
A multilayer neural network is a neural network that contains at least one hidden layer.
Figure 3: A multilayer neural network consisting of one hidden layer with two neurons.
As seen in Figure 3, each additional layer requires connections to the neurons in both the previous layer and the next layer. Each connection carries the output of a neuron in the previous layer, scaled by a weight specific to that connection.
As a result, we require more structures and indexing to formally represent a neural network. Using Figure 3 as an example:
represents an arbitrary input feature vector with a bias term .
Note: The input vector is also treated as the first output, that is, the output of the input layer.
is the weight matrix representing the weights of the connections between the neurons in layer and the neurons in layer .
Where:
 is the number of neurons in layer .
 represents the weight from neuron in layer , to neuron in layer .
 represents the weights from each neuron in layer , to neuron in layer .
 represents the weights from neuron in layer , to each neuron in layer .
Observe that there is no connection to the bias unit of the next layer; this bias unit is introduced independently and is not affected by the activations of the neurons from the previous layer. As a result, this weight matrix has dimensions:
represents a vector consisting of the outputs from the neurons in layer .
represents the vector of neurons in layer .
Where: Each neuron comprises an activation function such that the output of this neuron is given by:
That is, the dot product between the weights from each neuron in layer to neuron in layer , and the outputs of the neurons in the previous layer, applied to the activation function .
However, it is usually the case that all of the activation functions in one layer are the same, so we can refer to them with a single per-layer activation function.
It is also possible to compute these activations all at once, using matrix multiplication of the weight matrix with the outputs of the previous layer, and a vector-valued activation function:
As an example of these forms of representation, the multilayer neural network in Figure 3 can be represented by the following matrices and vectors:
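The layer-by-layer computation described above can be sketched as follows. This is a hedged illustration rather than the notes' own formulation: the layout (one row of weights per neuron, with column 0 multiplying a bias unit fixed at 1), the function names, and all weight values are assumptions chosen for the example.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weight_matrix, prev_outputs, activation=logistic):
    """Compute one layer's outputs from the previous layer's outputs.

    Assumed layout: one row per neuron in this layer, and column 0 of each
    row multiplies the bias unit (fixed at 1.0) of the previous layer.
    """
    with_bias = [1.0] + list(prev_outputs)
    outputs = []
    for row in weight_matrix:
        if len(row) != len(with_bias):
            raise ValueError("row length must equal previous layer size + 1")
        weighted_sum = sum(w * a for w, a in zip(row, with_bias))
        outputs.append(activation(weighted_sum))
    return outputs

def forward(weight_matrices, features):
    """Propagate a feature vector through every layer in turn."""
    outputs = list(features)
    for weight_matrix in weight_matrices:
        outputs = layer_forward(weight_matrix, outputs)
    return outputs

# Hypothetical network: 2 features -> 2 hidden neurons -> 1 output neuron.
W1 = [[0.1, 0.4, -0.3],
      [-0.2, 0.25, 0.5]]
W2 = [[0.05, 0.7, -0.6]]
prediction = forward([W1, W2], [1.0, 2.0])
```

Note that each weight matrix has one more column than the previous layer has neurons, which reflects the bias unit being added independently at every layer.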
As with other machine learning algorithms such as linear and logistic regression, in order to train a neural network we need to find the weights that minimize some error function. However, this is more of a challenge in neural networks for three reasons:
There are far more weights to optimize, with one entire weight matrix between each pair of adjacent layers.
A change to a single weight connecting a neuron in one layer to a neuron in the next layer will affect all of the neurons in subsequent layers through propagation, as seen in Figure 4 below.
Figure 4: The propagating effect of modifying one weight, on subsequent layers.
The error functions used in neural networks are nonconvex, meaning that optimization methods are not guaranteed to converge to a global minimum.
Sometimes, however, converging to a local minimum may be good enough and produce weights that the neural network can perform well with.
The purpose of an error function is to evaluate the performance of the neural network when given some training examples. As neural networks are most commonly used in the supervised learning setting, we have access to the actual outputs of the training examples.
The error function is a measure of the inconsistency between the predicted outputs and the actual outputs for a set of training examples. Therefore, the model's fit to the training data improves as the error function decreases. The optimal weights are thus the ones that minimize the error function:
Where: represents all of the weights in the neural network, and can therefore be represented as a vector of the weight matrices for each layer:
Neural networks may be used to predict a real-valued output (as in linear regression), a binary output (binary classification), or one of several classes (multiclass/multinomial classification). As a result, the error function used in a neural network must be chosen depending on the nature of the output variable.
Similarly to linear regression, the residual sum of squares (RSS) error function can be used for neural networks.
However, for the backpropagation training algorithm, it is normally preferable for the error function to be represented as a mean over the samples. For this reason, mean squared error (MSE) is used more often. Additionally, this quantity is scaled by $\frac{1}{2}$ to make derivatives cleaner:
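A minimal sketch of a halved mean squared error is shown below. The function name and argument names are hypothetical, and the exact notation used in the original formula is not recoverable here, so this follows the standard definition: the mean of the squared differences, multiplied by one half.

```python
def half_mean_squared_error(predicted, actual):
    """Mean of squared differences, scaled by 1/2 so the derivative has no factor of 2."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / (2 * n)

# Hypothetical predictions and targets: squared errors are 0 and 4, so the result is 4 / 4.
error = half_mean_squared_error([1.0, 3.0], [1.0, 1.0])
```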
As we have seen in logistic regression for binary classification, the output variable can be modeled by the Bernoulli distribution. As a result, the most appropriate cost function to use in this case was the negative log-likelihood. Minimizing this yields the maximum likelihood estimate of the weights for the training data.
For neural networks, we typically use cross entropy. Cross entropy is a measure of dissimilarity between two discrete probability distributions. As this is a binary classification problem, we have a Bernoulli-distributed output variable for each sample:
Where:
represents the probability of the output variable taking the value .
Observe that for binary classification, the final layer has only one neuron (with output ) which represents . As a result, the weight matrix from layer to layer is a vector:
Note: For binary classification, the probabilities of the two classes sum to one. Due to this property, we don't require a second neuron in the final layer, as we can just use the complement of the single output.
is a random variable representing the predicted output for the ^{th} training sample, such that .
In binary classification problems, we use an average of the per-example binary cross entropies (BCE) as the cost function:
Where:
 is the cross entropy function which measures the dissimilarity between random variables and .
 is a random variable representing the true output for the ^{th} training sample, .
Note: Unlike , the variable takes the value of the actual binary label since we don't have probabilities for the true outputs.
Observe that this is equivalent to taking the mean of the negative log likelihood function for the Bernoulli distribution:
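The averaged binary cross entropy can be sketched as below, written as the mean negative Bernoulli log-likelihood. The function and parameter names are hypothetical, and clipping probabilities with a small `eps` to avoid `log(0)` is an added assumption, not part of the definition.

```python
import math

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical predictions for three samples with true labels 1, 0, 1.
cost = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
```

Only one of the two log terms is active for each sample, because the true label is either 0 or 1.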
For multinomial classification problems, multinomial cross entropy is used as the error function for neural networks. This is a generalization of the previously seen binary cross entropy function, allowing for the use of the softmax function and multiple classes through one-hot encoding:
Where:
There are neurons in the output layer, with each output : representing the probability for the feature vector being in class .
Therefore, the weight matrix from layer to the final layer , has dimensions .
Recall that in multinomial classification (as seen in the Logistic Regression notes), for each class , we classify a feature vector as or not through the use of the softmax function which generates a probability of being in class :
We assign to the class which yields the highest probability (as generated by the softmax function).
represents the probability of feature vector being assigned to class .
is a one-hot encoded vector representing the actual class of the ^{th} training sample.
For example, if a training sample belongs to class $k$, then the $k$-th element of its one-hot vector is set to $1$, and the rest of the elements are set to $0$. One-hot encoded vectors are frequently used in machine learning as they allow for selective or conditional calculations, similar to indicator variables.
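One-hot encoding and the averaged multinomial cross entropy can be sketched as follows. The names are hypothetical, class indices are assumed to be zero-based (the notes may number classes differently), and the `eps` floor inside the logarithm is an added safeguard.

```python
import math

def one_hot(class_index, num_classes):
    """Return a vector with 1.0 at class_index (zero-based) and 0.0 elsewhere."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1.0 if k == class_index else 0.0 for k in range(num_classes)]

def multinomial_cross_entropy(predicted, targets, eps=1e-12):
    """Average cross entropy between one-hot targets and predicted distributions."""
    if len(predicted) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for probs, target in zip(predicted, targets):
        # The one-hot target selects the log-probability of the true class only.
        total += -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))
    return total / len(targets)

# Hypothetical softmax outputs for two samples over three classes.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
targets = [one_hot(0, 3), one_hot(2, 3)]
cost = multinomial_cross_entropy(predicted, targets)
```

This shows the selective behaviour mentioned above: multiplying by the one-hot vector zeroes out every term except the one for the true class.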
As shown earlier in Figure 4, the modification of a single weight in any layer will have an effect on the output of the final layer. As a result, the gradient descent update used in the Logistic Regression notes cannot be applied directly to a neural network, since the error depends on multiple weight matrices through the intermediate layers.
Instead, the backpropagation algorithm is used to extend gradient descent to the context of hidden layers and neurons. The idea behind backpropagation is to choose an appropriate cost function and then systematically modify the weights of various neurons in order to minimize this cost function. This method is similar to gradient descent, but it uses the chain rule in order to calculate the gradient vector.
The purpose of backpropagation is to calculate all of the error derivatives of the neural network.
The backpropagation algorithm decides how much to update each weight of the network after comparing the predicted output with the desired output for a particular example. For this, we need to compute how the error changes with respect to each weight, that is, the partial derivative of the error with respect to that weight.
Once we have these error derivatives, the weights can be updated using a simple update rule for :
Or as a per-layer weight matrix update:
Where:
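A per-layer update of this kind can be sketched as below, assuming the usual gradient descent form in which each weight moves against its error derivative, scaled by a learning rate. The names `update_weights` and `learning_rate`, and the numbers used, are hypothetical.

```python
def update_weights(weight_matrix, gradient_matrix, learning_rate):
    """Return a new matrix with each weight moved against its error derivative."""
    if len(weight_matrix) != len(gradient_matrix):
        raise ValueError("weights and gradients must have the same shape")
    updated = []
    for weight_row, gradient_row in zip(weight_matrix, gradient_matrix):
        if len(weight_row) != len(gradient_row):
            raise ValueError("weights and gradients must have the same shape")
        updated.append([w - learning_rate * g
                        for w, g in zip(weight_row, gradient_row)])
    return updated

# Hypothetical weights and error derivatives for a single layer.
weights = [[1.0, 2.0]]
gradients = [[0.5, -1.0]]
new_weights = update_weights(weights, gradients, learning_rate=0.1)
```

A positive derivative decreases the weight and a negative derivative increases it, so each step moves the weights in the direction that reduces the error.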
To help compute , we store two additional derivatives for each neuron:
How the error changes with the total (weighted) input of the neuron: .
Where: , the input for neuron .
How the error changes with the total output of the neuron: .
Where:
Consider a single training example with a predicted output from the following neural network:
Figure 5: Modified version of the neural network shown in Figure 3, with an additional layer consisting of one output neuron.
The neural network has the following properties:
The output is given by the output of the final neuron.
All of the activation functions are logistic (sigmoid) functions.
and .
The cost for one training example is given by:
The neural network can be defined by the following matrices and vectors (as seen before):
To begin backpropagating to find the cost derivatives, we start at the end of the network, with our predicted output .
Given our cost function , we have:
Note: Recall that , so we can also say that .
Now that we have , we can find through the use of the chain rule:
Observe that may be expressed as:
The logistic function has the nice property that its derivative is given by $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. This allows us to further simplify the expression:
However, to maintain some generality over the various activation functions, we will continue to write this as .
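The chain-rule step for the output neuron can be checked numerically with a short sketch. This assumes a logistic output neuron and a squared-error cost of one half times the squared difference between prediction and target; the notes' actual cost function is not recoverable here, so treat that choice, and all names below, as assumptions.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_derivative(z):
    """Derivative of the logistic function, expressed through its own output."""
    s = logistic(z)
    return s * (1.0 - s)

def output_neuron_derivatives(z, target):
    """Error derivatives for one output neuron.

    Assumes cost = 0.5 * (prediction - target) ** 2 with prediction = logistic(z).
    Returns (d cost / d output, d cost / d weighted input).
    """
    prediction = logistic(z)
    d_cost_d_output = prediction - target
    # Chain rule: multiply by the derivative of the activation at z.
    d_cost_d_input = d_cost_d_output * logistic_derivative(z)
    return d_cost_d_output, d_cost_d_input

def cost(z, target):
    return 0.5 * (logistic(z) - target) ** 2

# Compare the analytic derivative with a central finite difference.
z, target, h = 0.3, 1.0, 1e-6
_, analytic = output_neuron_derivatives(z, target)
numeric = (cost(z + h, target) - cost(z - h, target)) / (2 * h)
```

If the chain rule has been applied correctly, `analytic` and `numeric` agree to within the finite-difference error.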
In Step 1 we found that and can therefore write as: