# Neural networks

Artificial neural networks are powerful and versatile machine learning models, and they are perhaps the most widely used learning method for modern machine learning tasks.

Neural networks are often used for machine learning tasks where not much meaning or knowledge can be derived from the individual features of the training data. Instead, neural networks rely on a system of nodes called neurons, which are arranged in collections called layers, where the neurons in adjacent layers are connected. These connections have varied strengths, determined by a scalar referred to as a weight.

The purpose of the neurons on different layers is to learn the underlying (mostly non-linear) relationships between the original features and the output. Nevertheless, neural networks are often still treated as black boxes when compared to other machine learning models, since it isn't always straightforward to understand what the individual layers and neurons represent.

## Layers and neurons

A neural network may consist of many layers, of which there are three distinct types:

• Input layer:

• Consists of the features of the input vector $\b{x}^{(i)}$, or transformed feature vector $\bs{\phi}\l(\b{x}^{(i)}\r)$. This vector is also sometimes denoted as $\b{a}^{(1)}$ for consistency, as the notation $\b{a}^{(i)}$ in neural networks represents the output vector from the neurons in layer $i$, where indexing starts from $1$, with the input layer.
• This layer does not pass the features through a weighted sum or activation function. The inputs are simply passed on to the next layer, with the corresponding weights $\bs{\theta}^{(1)}$ for this layer.
• There may only be one input layer in a neural network.
• Hidden layer:

• Consists of neurons, each made up of two functions: a weighted sum and an activation function.

Figure 1: An artificial neuron (depicted in blue).

1. The weighted sum is computed from the inputs $\b{a}^{(i)}$, which are the outputs from the previous layer (along with their corresponding weights $\bs{\theta}^{(i)}$). This is calculated as the dot product between these two vectors.

Observe that this dot product may be an arbitrary real-valued number, depending on the weights—therefore, this result is passed through the activation function to transform it.

2. The purpose of an activation function is to introduce non-linearity to a neural network—it transforms the value of the weighted sum and is chosen specifically for the type of task being performed.

Example: In the case of logistic regression, we saw that the logistic function $\sigma(x)=\frac{1}{1+e^{-x}}$ can be used to map $\R$ into the range $(0,1)$. This is an example of a commonly used activation function.

Figure 2: The logistic sigmoid activation function.

There are many other activation functions which will be described later.

• The output of one neuron is a single value, which is sent to each of the neurons on the next layer with new weights $\bs{\theta}^{(i+1)}$.
• The outputs of the neurons in the final hidden layer are passed to the neurons in the output layer.

• Output layer:

• Functions the same way as a hidden layer—each output neuron takes a weighted sum of its inputs and passes it through an activation function.
• The number of output neurons is often the number of classes (in the case of a classification problem). However, in the case of binary classification, one output neuron is enough since we can represent the probability of the other class as the complement of the probability that is output by the neuron.

Note: When we say a neural network consists of $L$ layers, $L$ counts the hidden layers plus the output layer (the input layer is not counted).
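The neuron described above (a weighted sum passed through an activation function) can be sketched in plain Python. This is a minimal illustration, not an implementation from the notes; the function names and example values are invented:

```python
import math

def logistic(z):
    """Logistic sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs):
    """One artificial neuron: the dot product of the weights and the
    previous layer's outputs, passed through the activation function."""
    z = sum(w * a for w, a in zip(weights, inputs))  # weighted sum
    return logistic(z)                               # activation

# Example: three inputs (a bias input of 1 first) and their weights.
a_prev = [1.0, 0.5, -0.2]   # outputs of the previous layer
theta = [0.1, 0.4, 0.3]     # weights into this neuron
print(neuron_output(theta, a_prev))
```

Because the logistic activation is applied last, the neuron's output always lies in $(0,1)$, regardless of how large the weighted sum is.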

## Activation functions

As explained previously, the purpose of an activation function is to introduce non-linearity to a neural network—it transforms the value of the weighted sum and is chosen specifically for the type of task being performed. Below is a list of some commonly used activation functions:

### Softmax function

The softmax function is a special type of activation function which takes a vector of $K$ real numbers as input, and normalizes it into a discrete probability distribution consisting of $K$ probabilities.

The standard softmax function $\sigma : \R^K \to \R^K$ is defined by the formula:

$$\sigma(\b{z})_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}} \quad \text{for } j\in\set{1,\ldots,K}$$
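A small sketch of the softmax function in plain Python. Subtracting the maximum before exponentiating is a standard numerical-stability trick, an implementation detail not mentioned above; it does not change the result:

```python
import math

def softmax(z):
    """Normalize a vector of K real numbers into K probabilities."""
    m = max(z)                                   # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # K probabilities that sum to 1
```

Larger inputs receive larger probabilities, and the outputs always sum to exactly one, which is what makes softmax suitable for the output layer of a classifier.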

## Multi-layer neural networks

A multi-layer neural network is a neural network that contains at least one hidden layer.

Figure 3: A multi-layer neural network consisting of one hidden layer with two neurons.

As seen in Figure 3, each additional layer requires connections both from the neurons in the previous layer and to the neurons in the next layer. The outputs of each layer therefore depend on the outputs of the previous layer, together with a weight for each of these connections.

As a result, we require more structures and indexing to formally represent a neural network. Using Figure 3 as an example:

• $\b{x}^{(i)}$ represents an arbitrary input feature vector with a bias term $x_0^{(i)}$.

Note: $\b{x}^{(i)}$ is also treated as the first output—the output of the input layer, $\b{a}^{(1)}$.

• $\bs{\Theta}^{(i)}$ is the weight matrix representing the weights of the connections between the neurons in layer $i$ and the neurons in layer $i+1$.

Where:

• $N_i$ is the number of neurons in layer $i$.
• $\theta_{j,k}^{(i)}$ represents the weight from neuron $j$ in layer $i$, to neuron $k$ in layer $i+1$.
• $\bs{\theta}_{\cdot,k}^{(i)}=\l(\theta_{0,k}^{(i)},\ldots,\theta_{N_i,k}^{(i)}\r)^\T$ represents the weights from each neuron in layer $i$, to neuron $k$ in layer $i+1$.
• $\bs{\theta}_{j,\cdot}^{(i)}=\colv{\theta^{(i)}}{j,1}{j,N_{i+1}}$ represents the weights from neuron $j$ in layer $i$, to each neuron in layer $i+1$.

Observe that there is no connection to the bias unit of the next layer—this bias unit is introduced independently and is not affected by the activations of the neurons from the previous layer. As a result, this weight matrix has dimensions $\l(N_{i+1}\times(N_i+1)\r)$, where the extra column accounts for the weights from the bias unit of layer $i$.

• $\b{a}^{(i)}$ represents a vector consisting of the outputs from the neurons in layer $i$.

• $\b{n}^{(i)}$ represents the vector of neurons in layer $i$.

Where: Each neuron $n_j^{(i)}$ comprises an activation function $\sigma_j^{(i)}$ such that the output $a_j^{(i)}$ of this neuron is given by:

$$a_j^{(i)}=\sigma_j^{(i)}\l(\bs{\theta}_{\cdot,j}^{(i-1)\T}\b{a}^{(i-1)}\r)$$

That is, the dot product between the weights from each neuron in layer $i-1$ to neuron $j$ in layer $i$, and the outputs of the neurons in the previous layer, applied to the activation function $\sigma_j^{(i)}$.

However, it is usually the case that all of the activation functions in one layer are the same—so we can simply refer to the layer's activation function as $\sigma^{(i)}$.

It is also possible to compute these activations all at once, using matrix multiplication of the weight matrix $\bs{\Theta}^{(i-1)}$ with the outputs of the previous layer, and a vector-valued activation function $\bs{\sigma}^{(i)}$:

$$\b{a}^{(i)}=\bs{\sigma}^{(i)}\l(\bs{\Theta}^{(i-1)}\b{a}^{(i-1)}\r)$$
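The layer-by-layer matrix computation described above can be sketched with NumPy. This is a minimal illustration only: bias units are omitted for brevity, the layer sizes are arbitrary, and the weights are random rather than trained:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Propagate input x through the network: thetas[l] is the weight
    matrix from layer l+1 to layer l+2 (0-indexed list), with shape
    (size of next layer, size of current layer)."""
    a = x
    for theta in thetas:
        z = theta @ a          # all weighted sums for the layer at once
        a = sigmoid(z)         # elementwise activation
    return a

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(2, 4)),   # hidden layer: 2 neurons, 4 inputs
          rng.normal(size=(1, 2))]   # output layer: 1 neuron
x = np.array([1.0, 0.5, -0.3, 2.0])
print(forward(thetas, x))            # one output value in (0, 1)
```

Each matrix-vector product replaces a loop over the neurons of a layer, which is exactly why the matrix notation is preferred.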

As an example of these forms of representation, the multi-layer neural network in Figure 3 can be represented by the following matrices and vectors:

### Training a multi-layer neural network

As with other machine learning algorithms such as linear and logistic regression, in order to train a neural network we need to find the weights that minimize some error function. However, this is more of a challenge in neural networks for three reasons:

1. There are far more weights to optimize—with one entire weight matrix $\bs{\Theta}^{(i)}$ between layers $i$ and $i+1$.

2. The change of a single weight connecting a neuron $n_j^{(i)}$ in layer $i$ to $n_k^{(i+1)}$ in layer $i+1$ will lead to all of the neurons in subsequent layers being affected due to propagation, as seen in Figure 4 below.

Figure 4: The propagating effect of modifying one weight, on subsequent layers.

3. The error functions used in neural networks are non-convex, meaning that optimization methods are not guaranteed to converge to a global minimum.

However, converging to a local minimum is sometimes good enough, producing weights with which the neural network performs well.

#### Error functions

The purpose of an error function is to evaluate the performance of the neural network when given some training examples. As neural networks are most commonly used in the supervised learning setting, we have access to the actual outputs of the training examples.

The error function $C(\bs{\Theta})$ is a measure of the inconsistency between the predicted outputs $\hat{y}^{(i)}$ and the actual outputs $y^{(i)}$ for a set of training examples. Therefore, a machine learning model's robustness and performance increase as the error function decreases. The optimal weights are thus the ones that minimize the error function $C(\bs{\Theta})$:

$$\hat{\bs{\Theta}}=\arg\min_{\bs{\Theta}}\,C(\bs{\Theta})$$

Where: $\bs{\Theta}$ represents all of the weights in the neural network, and can therefore be represented as a vector of the weight matrices for each layer:

$$\bs{\Theta}=\l(\bs{\Theta}^{(1)},\ldots,\bs{\Theta}^{(L)}\r)$$

Neural networks may be used to predict the value of $y^{(i)}\in\R$ (like linear regression), $y^{(i)}\in\set{0,1}$ (binary classification), or $y^{(i)}\in\set{1,\ldots,K}$ (multi-class/multinomial classification). As a result, the error function used in a neural network must be chosen depending on the nature of the output variable.

##### Real-valued output $y^{(i)}\in\R$

Similarly to linear regression, the residual sum of squares (RSS) error function can be used for neural networks.

However, for the back-propagation training algorithm, it is normally preferable for the error function to be represented as a mean over the samples. For this reason, mean squared error (MSE) is used more often. Additionally, this quantity is scaled by $\frac{1}{2}$ to make derivatives cleaner:

$$C(\bs{\Theta})=\frac{1}{2n}\sum_{i=1}^{n}\l(y^{(i)}-\hat{y}^{(i)}\r)^2$$

where $n$ is the number of training examples.
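The half-scaled MSE just described can be written directly from its definition. A minimal sketch, with an invented function name:

```python
def half_mse(y_true, y_pred):
    """Mean squared error scaled by 1/2, as used for real-valued outputs."""
    n = len(y_true)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / (2 * n)

# Residuals are 0.5, 0.0, and -1.0, so the result is (0.25 + 0 + 1) / 6.
print(half_mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```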

##### Binary output $y^{(i)}\in\set{0,1}$

As we have seen in logistic regression for binary classification, the output variable can be modeled by the Bernoulli distribution. As a result, the most appropriate cost function to use in this case was the negative log-likelihood. Minimizing this would yield the maximum likelihood estimate for the training data, $\hat{\bs{\theta}}$.

For neural networks, we typically use cross entropy. Cross entropy is a measure of dissimilarity between two discrete probability distributions. As this is a binary classification problem, we have a Bernoulli-distributed output variable for each sample:

$$\cp{y^{(i)}}{\b{x}^{(i)};\bs{\Theta}}=\l(p_1^{(i)}\r)^{y^{(i)}}\l(1-p_1^{(i)}\r)^{1-y^{(i)}}$$

Where:

• $p_1^{(i)}=\cp{\hat{y}^{(i)}=1}{\b{x}^{(i)};\bs{\Theta}}=a_1^{(L+1)}=\sigma\l(\bs{\Theta}^{(L)\T}\b{a}^{(L)}\r)$ represents the probability of the output variable $\hat{y}^{(i)}$ taking the value $1$.

Observe that for binary classification, the final layer $L+1$ has only one neuron $n^{(L+1)}_1$ (with output $a_1^{(L+1)}$) which represents $p_1^{(i)}$. As a result, the weight matrix from layer $L$ to layer $L+1$ is a $(1\times N_L)$ vector:

$$\bs{\Theta}^{(L)}=\l(\theta_{1,1}^{(L)},\ldots,\theta_{N_L,1}^{(L)}\r)$$

Note: For binary classification, $p_0^{(i)}+p_1^{(i)}=1$ and therefore $p_0^{(i)}=1-p_1^{(i)}$. Due to this property, we don't require a second neuron in the final layer as we can just use the complement of $p_1^{(i)}$.

• $\hat{y}^{(i)}$ is a random variable representing the predicted output for the $i$th training sample, $\b{x}^{(i)}$ such that $\hat{y}^{(i)}\in\set{p_0^{(i)},p_1^{(i)}}$.

In binary classification problems, we use an average of the per-example binary cross entropies (BCE) as the cost function:

$$C(\bs{\Theta})=\frac{1}{n}\sum_{i=1}^{n}H\l(y^{(i)},\hat{y}^{(i)}\r)=-\frac{1}{n}\sum_{i=1}^{n}\l[y^{(i)}\log p_1^{(i)}+\l(1-y^{(i)}\r)\log\l(1-p_1^{(i)}\r)\r]$$

Where:

• $H(X,Y)$ is the cross entropy function which measures the dissimilarity between random variables $X$ and $Y$.
• $y^{(i)}$ is a random variable representing the true output for the $i$th training sample, $\b{x}^{(i)}$.
Note: Unlike $\hat{y}^{(i)}$, the $y^{(i)}$ variable takes the value of the actual binary label since we don't have probabilities for the true outputs.

Observe that this is equivalent to taking the mean of the negative log likelihood function for the Bernoulli distribution:

$$-\frac{1}{n}\sum_{i=1}^{n}\log\l[\l(p_1^{(i)}\r)^{y^{(i)}}\l(1-p_1^{(i)}\r)^{1-y^{(i)}}\r]=-\frac{1}{n}\sum_{i=1}^{n}\l[y^{(i)}\log p_1^{(i)}+\l(1-y^{(i)}\r)\log\l(1-p_1^{(i)}\r)\r]$$
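The averaged binary cross entropy can be sketched directly from its definition. The small epsilon clamp guarding against $\log(0)$ is an implementation detail added here, not part of the text:

```python
import math

def binary_cross_entropy(y_true, p1, eps=1e-12):
    """Mean binary cross entropy between labels y_true in {0, 1} and
    predicted probabilities p1[i] = P(y_hat = 1 | x)."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p1):
        p = min(max(p, eps), 1 - eps)   # clamp away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / n

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))
```

Confident correct predictions (probability near the true label) contribute almost nothing, while confident wrong predictions are penalized heavily.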

##### Multinomial output $y^{(i)}\in\set{1,\ldots,K}$

For multinomial classification problems, multinomial cross entropy is used as the error function for neural networks. This is a generalization of the previously seen binary cross entropy function, allowing for the use of the softmax function and multiple classes through one-hot encoding:

$$C(\bs{\Theta})=-\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{K}y_c^{(i)}\log p_c^{(i)}$$

Where:

• There are $K$ neurons in the output layer, with each output $a_c^{(L+1)}$, $c\in\set{1,\ldots,K}$, representing the probability $p_c^{(i)}$ of the feature vector $\b{x}^{(i)}$ being in class $c$.

Therefore, the weight matrix $\bs{\Theta}^{(L)}$ from layer $L$ to the final layer $L+1$, has dimensions $(K\times N_L)$.

Recall that in multinomial classification (as seen in the Logistic Regression notes), for each class $c$, we classify a feature vector $\b{x}^{(i)}$ as $c$ or not-$c$ through the use of the softmax function, which generates the probability of $\b{x}^{(i)}$ being in class $c$:

$$\sigma_\text{Soft}\l(c,\b{x}^{(i)};\bs{\Theta}\r)=\frac{\exp\l(z_c^{(L+1)}\r)}{\sum_{k=1}^{K}\exp\l(z_k^{(L+1)}\r)}$$

where $z_c^{(L+1)}$ is the weighted input to output neuron $c$.

We assign $\b{x}^{(i)}$ to the class which yields the highest probability (as generated by the softmax function).

• $p_c^{(i)}=\cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}=\sigma_\text{Soft}\l(c,\b{x}^{(i)};\bs{\Theta}\r)$ represents the probability of feature vector $\b{x}^{(i)}$ being assigned to class $c$.

• $\b{y}^{(i)}=\colv{y^{(i)}}{1}{K}$ is a one-hot encoded vector representing the actual class of the $i$th training sample.
For example, if we have data with $K=6$ classes, and training sample $\b{x}^{(i)}$ belongs to class $4$, then $y_4^{(i)}$ is set to $1$ and the rest of the elements of $\b{y}^{(i)}$ are set to $0$, giving $\b{y}^{(i)}=\l(0,0,0,1,0,0\r)^\T$. One-hot encoded vectors are frequently used in machine learning as they allow for selective or conditional calculations, similar to indicator variables.
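One-hot encoding, and the multinomial cross entropy it enables, can be sketched as follows. The function names are illustrative, and class indices are 1-based to match the notation in the text:

```python
import math

def one_hot(c, K):
    """One-hot encode class c (1-based) into a length-K vector."""
    return [1 if k == c else 0 for k in range(1, K + 1)]

def multinomial_cross_entropy(y_classes, probs, K):
    """Mean cross entropy: y_classes holds true class labels (1-based),
    probs[i] is the length-K vector of predicted class probabilities."""
    n = len(y_classes)
    total = 0.0
    for c, p in zip(y_classes, probs):
        y = one_hot(c, K)
        # Only the true class's term survives, as with an indicator variable.
        total += -sum(yc * math.log(pc) for yc, pc in zip(y, p))
    return total / n

print(one_hot(4, 6))                                            # [0, 0, 0, 1, 0, 0]
print(multinomial_cross_entropy([2], [[0.1, 0.7, 0.2]], K=3))   # -ln(0.7)
```

Note how the one-hot vector acts as a selector: the inner sum reduces to $-\log p_c^{(i)}$ for the true class $c$, which is exactly the binary case generalized to $K$ classes.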

## Back-propagation

As shown earlier in Figure 4, modifying a single weight in any layer has an effect on the output of the final layer. As a result, the direct gradient descent updates we used in the Logistic Regression notes cannot be applied immediately to a neural network, since we now have multiple weight matrices whose gradients all depend on the layers after them.

Instead, the back-propagation algorithm is used to extend gradient descent to the context of hidden layers and neurons. The idea behind back-propagation is to choose an appropriate cost function and then systematically modify the weights of various neurons in order to minimize this cost function. This method is similar to gradient descent, but it uses the chain rule in order to calculate the gradient vector.

The purpose of back-propagation is to calculate all of the error derivatives of the neural network.

### Error derivatives

The back-propagation algorithm decides how much to update each weight of the network after comparing the predicted output with the desired output for a particular example. For this, we need to compute how the error changes with respect to each weight—that is $\pderiv{}{\theta_{j,k}^{(l)}}C(\bs{\Theta})$.

Once we have these error derivatives, the weights can be updated using a simple update rule for $l\in\set{1,\ldots,L}$:

$$\theta_{j,k}^{(l)}:=\theta_{j,k}^{(l)}-\eta\,\pderiv{C}{\theta_{j,k}^{(l)}}$$

Or as a per-layer weight matrix update:

$$\bs{\Theta}^{(l)}:=\bs{\Theta}^{(l)}-\eta\,\pderiv{C}{\bs{\Theta}^{(l)}}$$

Where: $\eta>0$ is the learning rate, which controls the size of each update step.

To help compute $\pderiv{C}{\theta_{j,k}^{(l)}}$, we store two additional derivatives for each neuron:

• How the error changes with the total (weighted) input of the neuron: $\pderiv{C}{z_{j}^{(l)}}$.

Where: $z_j^{(l)}=\bs{\theta}_{\cdot,j}^{(l-1)\T}\b{a}^{(l-1)}$, the input for neuron $n_j^{(l)}$.

• How the error changes with the total output of the neuron: $\pderiv{C}{a_{j}^{(l)}}$.

Where: $a_j^{(l)}=\sigma^{(l)} \! \l(\bs{\theta}_{\cdot,j}^{(l-1)\T}\b{a}^{(l-1)}\r)=\sigma^{(l)} \! \l(z_j^{(l)}\r)$
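The per-layer weight update described above can be sketched with NumPy. This is a minimal illustration with hypothetical, already-computed gradient matrices; it shows only the update step, not the computation of the derivatives themselves:

```python
import numpy as np

def gradient_step(thetas, grads, eta=0.1):
    """Apply one gradient-descent step: subtract eta times the
    error-derivative matrix from every weight matrix Theta^(l)."""
    return [theta - eta * grad for theta, grad in zip(thetas, grads)]

# Hypothetical weights and gradients for a two-matrix network.
thetas = [np.ones((2, 3)), np.ones((1, 2))]
grads = [np.full((2, 3), 0.5), np.full((1, 2), -1.0)]

new = gradient_step(thetas, grads, eta=0.1)
print(new[0])   # every entry is 1 - 0.1 * 0.5 = 0.95
print(new[1])   # every entry is 1 - 0.1 * (-1.0) = 1.1
```

Note that a negative derivative increases the weight: the step always moves each weight in the direction that decreases the cost.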

### Example

Consider a single training example $\l(\b{x}^{(i)},y^{(i)}\r)$ with a predicted output $\hat{y}^{(i)}$ from the following neural network:

Figure 5: Modified version of the neural network shown in Figure 3, with an additional layer consisting of one output neuron.

The neural network has the following properties:

• The predicted output $\hat{y}^{(i)}$ is given by the output of the final neuron.

• All of the activation functions are $\sigma_\text{Logistic}$.

• $\b{x}^{(i)}\in\R^4$ and $y^{(i)}\in(0,1)$.

• The cost for one training example is given by:

$$C(\bs{\Theta})=\frac{1}{2}\l(y^{(i)}-\hat{y}^{(i)}\r)^2$$

The neural network can be defined by the following matrices and vectors (as seen before):

#### Back-propagation procedure

To begin back-propagating to find the cost derivatives, we start at the end of the network, with our predicted output $\hat{y}^{(i)}$.

1. Given our cost function $C(\bs{\Theta})=\frac{1}{2}\l(y^{(i)}-\hat{y}^{(i)}\r)^2$, we have:

$$\pderiv{C}{\hat{y}^{(i)}}=-\l(y^{(i)}-\hat{y}^{(i)}\r)=\hat{y}^{(i)}-y^{(i)}$$

Note: Recall that $\hat{y}^{(i)}=a_1^{(4)}$, so we can also say that $\pderiv{C}{\hat{y}^{(i)}}=\pderiv{C}{a_{1}^{(4)}}$.

2. Now that we have $\pderiv{C}{a_{1}^{(4)}}$, we can find $\pderiv{C}{z_{1}^{(4)}}$ through the use of the chain rule:

$$\pderiv{C}{z_{1}^{(4)}}=\pderiv{C}{a_{1}^{(4)}}\pderiv{a_1^{(4)}}{z_1^{(4)}}$$

Observe that $\pderiv{a_1^{(4)}}{z_1^{(4)}}$ may be expressed as:

$$\pderiv{a_1^{(4)}}{z_1^{(4)}}=\pderiv{}{z_1^{(4)}}\sigma\l(z_1^{(4)}\r)=\sigma'\l(z_1^{(4)}\r)$$

The logistic function has the nice property that its derivative is defined as $\pderiv{}{z}\sigma(z)=\sigma(z)\big(1-\sigma(z)\big)$. This allows us to further simplify $\pderiv{a_1^{(4)}}{z_1^{(4)}}$:

$$\pderiv{a_1^{(4)}}{z_1^{(4)}}=\sigma\l(z_1^{(4)}\r)\l(1-\sigma\l(z_1^{(4)}\r)\r)$$

However, to maintain some generality over the various activation functions, we will continue to write this as $\sigma'\l(z_1^{(4)}\r)$.

In Step 1 we found that $\pderiv{C}{a_{1}^{(4)}}=\hat{y}^{(i)}-y^{(i)}$, and can therefore write $\pderiv{C}{z_{1}^{(4)}}$ as:

$$\pderiv{C}{z_{1}^{(4)}}=\l(\hat{y}^{(i)}-y^{(i)}\r)\sigma'\l(z_1^{(4)}\r)$$
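This chain-rule computation can be checked numerically with a finite-difference approximation. A small sketch, assuming the half-squared-error cost and logistic activation used in the example (the derivative of $\frac{1}{2}(y-\hat{y})^2$ with respect to $\hat{y}$ is $\hat{y}-y$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(z, y):
    """C = (1/2) * (y - sigmoid(z))^2 for a single example."""
    return 0.5 * (y - sigmoid(z)) ** 2

z, y = 0.7, 1.0
y_hat = sigmoid(z)

# Analytic derivative via the chain rule:
# dC/dz = (y_hat - y) * sigma'(z), with sigma'(z) = sigma(z) * (1 - sigma(z)).
analytic = (y_hat - y) * y_hat * (1 - y_hat)

# Central finite-difference approximation of dC/dz.
h = 1e-6
numeric = (cost(z + h, y) - cost(z - h, y)) / (2 * h)

print(analytic, numeric)   # the two values agree closely
```

Agreement between the analytic and numerical derivatives is a standard sanity check (a "gradient check") when implementing back-propagation by hand.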