Logistic regression

Despite sharing a few similarities with linear regression, logistic regression is actually a classification method: we train a model to predict a discrete-valued output variable, rather than the real-valued output that a linear model is trained to predict.

There are two types of logistic regression: binomial and multinomial. These terms refer to the nature of the output variable.

Binomial logistic regression

In binomial logistic regression, the output can be modeled as a binomial variable. Our training data in this case looks like:

$$\mathcal{D}_\text{train}=\set{\l(\b{x}^{(i)},y^{(i)}\r)}_{i=1}^N$$

Where:

• $\b{x}^{(i)}\in\R^{D+1}$, a $(D+1)$-dimensional vector of real numbers, $\b{x}^{(i)}=\colv{1,x^{(i)}}{1}{D}$.
Note: In the Linear Regression notes, $\b{x}^{(i)}$ represented the original unextended feature vector without the additional $1$. In logistic regression, this vector includes the $1$ by convention. The same applies for the weight vector $\bs{\theta}$ (additional bias term $\theta_0$) that we will see later.
However, it is still possible to apply a vector of basis functions $\bs{\phi}$ to this vector if you wish.
• $y^{(i)}\in\set{0,1}$

As explained in the Linear Regression notes, this can be represented as an $N\times(D+1)$ matrix of feature vectors $\b{X}=\l(\b{x}^{(1)^\T},\ldots,\b{x}^{(N)^\T}\r)^\T$ and a corresponding output vector $\b{y}=\l(y^{(1)},\ldots,y^{(N)}\r)^\T$.

Each of the columns $\b{x}_j=\l(x_j^{(1)},\ldots,x_j^{(N)}\r)^\T$ of $\b{X}$ represents a feature, which is simply a random variable (sometimes called an explanatory variable).

The accompanying output variable $\b{y}$ contains the output for each feature vector.

In linear regression, the output $y^{(i)}$ is assumed to be a linear combination of its feature values $\b{x}^{(i)}$—that is:

$$y^{(i)}=\bs{\theta}^\T\b{x}^{(i)}=\sum_{j=0}^D\theta_j x_j^{(i)}$$

Where:

• $\theta_j\in\R$ is an arbitrary coefficient, referred to as a weight.
• $\bs{\theta}=\colv{\theta}{0}{D}$ is a weight vector.

However, modeling the output as a linear combination of the feature values won't work for logistic regression, as the output needs to be discrete ($0$ or $1$).

One way to ensure an output fits these constraints is to transform the real-valued $\bs{\theta}^\T\b{x}^{(i)}$ with some function $\sigma$, referred to as a sigmoid function.

Sigmoid functions

A sigmoid function is a monotonically increasing function defined for all real numbers, with values ranging from $0$ to $1$—that is:

$$\sigma:\R\to(0,1),\qquad \lim_{a\to-\infty}\sigma(a)=0,\qquad \lim_{a\to\infty}\sigma(a)=1$$

There are many functions that satisfy these conditions:

Figure 1: A collection of sigmoid functions.
Note: Despite some of these sigmoids having a range of $(-1,1)$, they can be transformed to have a range of $(0,1)$. (source)

However, the most commonly used sigmoid function for logistic regression is the appropriately named logistic function.

Logistic function

The logistic function is a sigmoid function that is used in logistic regression to transform the real-valued $\bs{\theta}^\T\b{x}^{(i)}$ into a value in the $(0,1)$ interval. This function is denoted $\sigma$, and is defined as:

$$\sigma(a)=\frac{1}{1+e^{-a}}$$

Figure 2: The logistic function. (source)
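
As a concrete sketch (in Python with NumPy, not part of the original notes; the name `sigmoid` is illustrative), the logistic function can be implemented in a numerically stable way by splitting on the sign of the argument, so that `exp` is never called on a large positive number:

```python
import numpy as np

def sigmoid(a):
    """Numerically stable logistic function sigma(a) = 1 / (1 + exp(-a))."""
    # exp(-|a|) never overflows; the two branches are algebraically
    # equivalent to 1 / (1 + exp(-a)) for a >= 0 and a < 0 respectively.
    e = np.exp(-np.abs(a))
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

At $a=0$ this returns $\frac{1}{2}$, and it saturates towards $0$ and $1$ at the extremes.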

Probabilistic decision-making

Although the logistic function maps the real-valued $\bs{\theta}^\T\b{x}^{(i)}$ into the interval $(0,1)$, we cannot simply model the output as the direct application of the logistic function:

$$y^{(i)}=\sigma\l(\bs{\theta}^\T\b{x}^{(i)}\r)$$

This is problematic because such a $y^{(i)}$ may still be any real number in $(0,1)$—we still need to "decide" whether to assign this number to the $0$ or $1$ class. Intuitively, this task is well-suited to probabilistic decision-making, since we are already dealing with numbers between $0$ and $1$.

We can model the decision of $y^{(i)}$ with:

$$y^{(i)}=\begin{cases}1\quad & \text{if}\ \ \cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}\geq\frac{1}{2}\\0 &\text{otherwise}\end{cases}$$

Where the conditional probability $\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$ is the result of applying the logistic function to $\bs{\theta}^\T\b{x}^{(i)}$ as seen before:

$$\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}=\sigma\l(\bs{\theta}^\T\b{x}^{(i)}\r)=\frac{1}{1+\exp\l(-\bs{\theta}^\T\b{x}^{(i)}\r)}$$

Note: Observe that $\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$ and $\cp{y^{(i)}=0}{\b{x}^{(i)};\bs{\theta}}$ must form a probability distribution, meaning that:

$$\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}+\cp{y^{(i)}=0}{\b{x}^{(i)};\bs{\theta}}=1$$
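
To illustrate the decision rule, here is a minimal sketch (Python/NumPy; the names `predict_proba` and `predict` are illustrative, not from the notes) that computes $\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$ for every row of the design matrix and thresholds it at $\frac{1}{2}$:

```python
import numpy as np

def predict_proba(theta, X):
    # X: N x (D+1) design matrix with a leading column of ones;
    # theta: (D+1)-dimensional weight vector.
    return 1.0 / (1.0 + np.exp(-X @ theta))  # P(y = 1 | x; theta)

def predict(theta, X, threshold=0.5):
    # Assign class 1 when P(y = 1 | x; theta) >= 1/2, else class 0.
    return (predict_proba(theta, X) >= threshold).astype(int)
```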

Decision boundary

In a classification problem with two classes, a decision boundary is a hypersurface that partitions the underlying feature space into two sets, one for each class.

In the case of binomial logistic regression (without a feature transformation with basis functions), this decision boundary is linear—a hyperplane. The decision boundary would occur when $\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}=\frac{1}{2}$, as we are completely uncertain which class to assign when the probability is $\frac{1}{2}$:

$$\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}=\sigma\l(\bs{\theta}^\T\b{x}^{(i)}\r)=\frac{1}{2}\iff\bs{\theta}^\T\b{x}^{(i)}=0$$

Note: Observe that the decision boundary is linear in $\b{x}^{(i)}$. However, feature transformations with basis functions $\bs{\phi}$ can result in a non-linear decision boundary.

Because logistic regression actually predicts probabilities rather than just classes, we can estimate the weights using maximum likelihood estimation. This requires us to define the distribution of the output variable $y^{(i)}$, and the likelihood function for this distribution.

Bernoulli distributed output variable

Recall that the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value $1$ with probability $p$, and the value $0$ with probability $q=1-p$. In other words, the Bernoulli distribution models a single experiment that flips a (potentially biased) coin, with probability $p$ of the coin landing on heads, and probability $q=1-p$ of the coin landing on tails.

As we have a binary output variable with probability $p=\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$ of taking value $1$, and probability $q=\cp{y^{(i)}=0}{\b{x}^{(i)};\bs{\theta}}$ (or $1-p=1-\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$) of taking value $0$, we can say that the output variable is distributed as a Bernoulli random variable:

$$y^{(i)}\sim\mathrm{Bernoulli}\l(p^{(i)}\r)$$

Where: $p^{(i)}=\cp{y^{(i)}=1}{\b{x}^{(i)};\bs{\theta}}$ is the parameter for this Bernoulli distribution.

Error function and maximum likelihood estimation

In linear regression, we solve ordinary least squares (OLS) problems which use the residual sum of squares (RSS) as an error function. However, we typically don't use this for logistic regression. As we have a discrete probability distribution with parameters, we can instead use maximum likelihood estimation to determine the weights for the model.

The likelihood function $L$ for a Bernoulli distributed output variable $y^{(i)} \sim \mathrm{Bernoulli} \l(p^{(i)}\r)$ is given by:

$$L\l(\bs{\theta};\b{y}\r)=\prod_{i=1}^N\l(p^{(i)}\r)^{y^{(i)}}\l(1-p^{(i)}\r)^{1-y^{(i)}}$$

A maximum likelihood estimate $\hat{\bs{\theta}}$ for the weights $\bs{\theta}$ can be found by taking the derivative of $L$ with respect to $\bs{\theta}$ and setting it equal to $0$.

To simplify the process of taking derivatives, we can apply the natural logarithm to obtain the log-likelihood $\cal{L}$. We would still be finding the maximum, since the natural logarithm is monotonically increasing:

$$\cal{L}\l(\bs{\theta};\b{y}\r)=\ln L\l(\bs{\theta};\b{y}\r)$$

We can now write the log-likelihood as:

$$\cal{L}\l(\bs{\theta};\b{y}\r)=\sum_{i=1}^N\l[y^{(i)}\ln p^{(i)}+\l(1-y^{(i)}\r)\ln\l(1-p^{(i)}\r)\r]$$

This allows us to represent the maximization problem for the optimal weights as:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\max}\ \cal{L}\l(\bs{\theta};\b{y}\r)$$

However, in machine learning it is conventional to express optimization problems as minimization problems. A maximization problem can be converted into one by simply negating the function being optimized:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\min}\l(-\cal{L}\l(\bs{\theta};\b{y}\r)\r)$$

Where: The negative log-likelihood $-\mathcal{L}(\bs{\theta};\b{y})$ is the error function $C(\bs{\theta})$ for our model.
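
The error function $C(\bs{\theta})$ can be sketched directly in code (Python/NumPy; `nll` is a hypothetical helper name, not from the notes). Probabilities are clipped away from $0$ and $1$ to avoid taking $\ln 0$:

```python
import numpy as np

def nll(theta, X, y, eps=1e-12):
    """Negative log-likelihood C(theta) for binomial logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # p_i = P(y_i = 1 | x_i; theta)
    p = np.clip(p, eps, 1.0 - eps)         # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

With $\bs{\theta}=\b{0}$ every $p^{(i)}$ is $\frac{1}{2}$, so the error is $N\ln 2$.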

In the Linear Regression notes, we saw that the minimization problem for OLS:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\min}\ \l(\b{y}-\bs{\Phi}\bs{\theta}\r)^\T\l(\b{y}-\bs{\Phi}\bs{\theta}\r)$$

had a closed-form (analytical) solution—$\hat{\bs{\theta}}=\l(\bs{\Phi}^\T\bs{\Phi}\r)^{-1}\bs{\Phi}^\T\b{y}$.

This is not the case for the MLE in logistic regression. We must instead take the derivative of $C(\bs{\theta})$ with respect to $\bs{\theta}$. This can be done by taking partial derivatives, since:

$$\nabla C(\bs{\theta})=\l(\pderiv{}{\theta_0}C(\bs{\theta}),\pderiv{}{\theta_1}C(\bs{\theta}),\cdots,\pderiv{}{\theta_D}C(\bs{\theta})\r)^\T$$

Where: $\nabla C(\bs{\theta})$ is called the gradient (in vector calculus).

An individual partial derivative of $C(\bs{\theta})$ with respect to $\theta_j$ would be given by:

$$\pderiv{}{\theta_j}C(\bs{\theta})=\sum_{i=1}^N\l(\sigma\l(\bs{\theta}^\T\b{x}^{(i)}\r)-y^{(i)}\r)x_j^{(i)}$$

To obtain the weights that minimize $C(\bs{\theta})$, we would set these partial derivatives equal to zero and solve. However, this is not possible analytically: the resulting equation is transcendental, and there is no closed-form solution.

Instead, we can utilize numerical optimization methods to approximate a solution—namely, gradient descent. There are also many other optimization methods, such as L-BFGS, coordinate descent and stochastic gradient descent.

In order to minimize our error function, we will use gradient descent.

Analogy: A common analogy for gradient descent imagines a hiker left stranded and blindfolded near the top of a mountain. The objective of the hiker is to reach the bottom of the valley below.

The most intuitive way to approach this would be to feel the slope in different directions around their feet, take one step in the direction with the steepest downward slope, and repeat.

This action is analogous to the way in which gradient descent works.

The gradient $\nabla C(\bs{\theta})$ points in the direction of steepest ascent (at the current position $\bs{\theta}$). We are interested in "taking a step" in the direction of steepest descent, $-\nabla C(\bs{\theta})$.

By "taking a step", we mean moving some distance $\eta$ in a particular direction in $\bs{\theta}$-space.

In the hiker analogy, suppose we have the vertical $y$-axis, whose value we are trying to minimize, along with the two axes in the horizontal plane of the hiker, $x_1$ and $x_2$. In this case, we "take a step" in the two-dimensional $\begin{pmatrix}x_1\\ x_2\end{pmatrix}$-space, which results in a change in the vertical $y$ position of the hiker.

A step of size $\eta$ in the direction of steepest descent $-\nabla C(\bs{\theta})$ (in $\bs{\theta}$-space) would be represented by the vector $-\eta\nabla C(\bs{\theta})$. If we take our current position $\bs{\theta}$ and apply this step by adding the transformation vector $-\eta\nabla C(\bs{\theta})$, we arrive at our updated position:

$$\bs{\theta}^\bluet{new}=\bs{\theta}-\eta\nabla C(\bs{\theta})$$

Where: $\eta$ is referred to as the learning rate.

This would be repeated until convergence.
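
The procedure can be sketched as follows (Python/NumPy; a minimal batch implementation, not from the notes, assuming the gradient of the negative log-likelihood $\nabla C(\bs{\theta})=\sum_{i=1}^N\l(\sigma\l(\bs{\theta}^\T\b{x}^{(i)}\r)-y^{(i)}\r)\b{x}^{(i)}$ and a simple step-size-based convergence check):

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_iters=1000, tol=1e-6):
    """Batch gradient descent on the negative log-likelihood.

    X is the N x (D+1) design matrix (leading column of ones),
    y the vector of 0/1 labels, eta the learning rate.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # P(y = 1 | x; theta)
        grad = X.T @ (p - y)                  # gradient of the error function
        theta_new = theta - eta * grad        # step against the gradient
        if np.linalg.norm(theta_new - theta) < tol:  # convergence check
            return theta_new
        theta = theta_new
    return theta
```

In practice one would also monitor the error function itself, and possibly decay the learning rate.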

It is also possible to update individual components separately for this step—updating one component once then moving on to the next, until we have a completely updated weight vector $\bs{\theta}^\bluet{new}$. We would repeat this until convergence—when the components of $\bs{\theta}$ no longer change:

$$\theta_j^\bluet{new}=\theta_j-\eta\pderiv{}{\theta_j}C(\bs{\theta})$$

Coordinate descent

It is also possible to consider an individual direction or component $\theta_j$ at a time when taking steps—focusing completely on one component and updating it multiple times (until convergence) to minimize along this component before returning to minimize the others.

Once this component is minimized, this would then be repeated for all remaining components. This variation of gradient descent is referred to as coordinate descent.
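
A generic sketch of this scheme (Python/NumPy; `grad_j` and the quadratic used in the usage note below are hypothetical illustrations, not from the notes):

```python
import numpy as np

def coordinate_descent(grad_j, theta, eta=0.1, inner_tol=1e-8, sweeps=100):
    """Minimize one component at a time via repeated gradient steps.

    grad_j(theta, j) returns the partial derivative dC/dtheta_j.
    """
    theta = theta.astype(float).copy()
    for _ in range(sweeps):
        for j in range(theta.size):
            # step along coordinate j alone until it stabilizes
            while True:
                step = eta * grad_j(theta, j)
                theta[j] -= step
                if abs(step) < inner_tol:
                    break
    return theta
```

For example, on $C(\bs{\theta})=\theta_0^2+\theta_1^2$, whose partial derivatives are $2\theta_j$, the iterates are driven to the minimum at the origin.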

Multinomial logistic regression

Although there may be many tasks involving binary output variables, many classification tasks naturally involve multiple output labels, such as:

• Hand-written digit recognition (labels are the digits $0$-$9$)
• Fruit recognition (labels are digits representing the type of fruit, e.g. $0=\text{Apple}$, $1=\text{Banana}$, ...)
• Assigning blog posts to a category (labels are digits representing the categories)
• Blood type classification (labels are digits representing the blood types)

For these kinds of tasks, the binomial logistic regression that was previously introduced won't work. However, it is possible to generalize binomial logistic regression problems to multi-class problems. This generalized classification method is referred to as multinomial logistic regression, but is also known under other names such as softmax regression or maximum entropy classifier.

Figure 3: Division of the feature-space of a dataset into three decision regions by a classifier such as multinomial logistic regression that can generate multiple decision boundaries (each being linear in this case). (source)

Given training data $\mathcal{D}_\text{train}=\set{\l(\b{x}^{(i)},y^{(i)}\r)}_{i=1}^N$ where $y^{(i)}\in\set{1,\ldots,K}$, we can create a separate weight vector $\bs{\theta}^{(k)}=\colv{\theta^{(k)}}{0}{D}$ for each class $k\in\set{1,\ldots,K}$. We can then model the probability that a feature vector $\b{x}^{(i)}$ belongs to class $c$ through the use of the softmax function, which is simply a normalized exponential function:

$$\cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}=\frac{\exp\l({\bs{\theta}^{(c)\T}\b{x}^{(i)}}\r)}{\sum_{k=1}^K \exp\l({\bs{\theta}^{(k)\T}\b{x}^{(i)}}\r)}$$

Where:

• $\bs{\Theta}=\l(\bs{\theta}^{(1)},\bs{\theta}^{(2)},\ldots,\bs{\theta}^{(K)}\r)$, a matrix where each column represents a vector of weights $\bs{\theta}^{(c)}$ for class $c$.
• $\frac{1}{\sum_{k=1}^K \exp\l({\bs{\theta}^{(k)\T}\b{x}^{(i)}}\r)}$ is a normalization term, used to ensure that the distribution sums up to one.

To decide on a class for $\b{x}^{(i)}$, we must calculate $\cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}$ for every class $c$. The chosen class for $\b{x}^{(i)}$ is then the one that results in the maximum conditional probability—that is:

$$\hat{y}^{(i)}=\underset{c\in\set{1,\ldots,K}}{\arg\max}\ \cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}$$
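
As a sketch in code (Python/NumPy; function names are illustrative, not from the notes), the softmax probabilities and the resulting class decision can be computed as follows. Subtracting the maximum activation before exponentiating leaves the probabilities unchanged (the softmax is shift-invariant) but avoids overflow:

```python
import numpy as np

def softmax_proba(Theta, x):
    """P(y = c | x; Theta) for each class c, where the columns of Theta
    are the per-class weight vectors theta^(c) and x is the extended
    feature vector."""
    a = Theta.T @ x        # one activation theta^(c).T x per class
    a = a - a.max()        # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()     # normalized exponential

def predict_class(Theta, x):
    # choose the class with the maximum conditional probability
    return int(np.argmax(softmax_proba(Theta, x)))
```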

Learning the weights

Due to having a separate weight vector for each class, our negative log-likelihood error function will have to be modified to account for the softmax function—this can be done by summing over all classes in addition to summing over the examples. The modified negative log-likelihood function looks like:

$$C(\bs{\Theta})=-\sum_{i=1}^N\sum_{c=1}^K \bb{I}_c^{(i)}\ln p_c^{(i)},\qquad \bb{I}_c^{(i)}=\begin{cases}1\quad & \text{if}\ \ y^{(i)}=c\\0 &\text{otherwise}\end{cases}$$

Where:

• $\displaystyle{p_c^{(i)}=\cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}=\frac{\exp\l({\bs{\theta}^{(c)\T}\b{x}^{(i)}}\r)}{\sum_{k=1}^K \exp\l({\bs{\theta}^{(k)\T}\b{x}^{(i)}}\r)}}$

Once again, this cannot be minimized with an analytic solution—we must resort to gradient descent (or another numerical optimization method).

Gradient descent with multiple weight vectors

As a result of having to use an individual vector of weights for each class, we will introduce a vector $\bs{\nabla}C(\bs{\Theta})$ of the gradients for each class, $\nabla_{\bs{\theta}^{(c)}} C(\bs{\Theta})$:

$$\bs{\nabla}C(\bs{\Theta})=\l(\nabla_{\bs{\theta}^{(1)}} C(\bs{\Theta}),\nabla_{\bs{\theta}^{(2)}} C(\bs{\Theta}),\ldots,\nabla_{\bs{\theta}^{(K)}} C(\bs{\Theta})\r)$$

Where: $\nabla_{\bs{\theta}^{(c)}} C(\bs{\Theta})=\l(\pderiv{}{\theta_0^{(c)}}C(\bs{\Theta}), \pderiv{}{\theta_1^{(c)}}C(\bs{\Theta}), \cdots, \pderiv{}{\theta_D^{(c)}}C(\bs{\Theta})\r)^\T$, the gradient of the error function $C(\bs{\Theta})$ with respect to the weight vector $\bs{\theta}^{(c)}$ for class $c$. In other words, this represents the gradient of the error function for class $c$.

Taking derivatives of $C(\bs{\Theta})$, one can show that the gradient for class $c$ is given by:

$$\nabla_{\bs{\theta}^{(c)}} C(\bs{\Theta})=-\sum_{i=1}^N\l(\bb{I}_c^{(i)}-p_c^{(i)}\r)\b{x}^{(i)}$$

Where:

• $\bb{I}_c^{(i)}$ is an indicator function such that $\bb{I}_c^{(i)}=\begin{cases}1\quad & \text{if}\ \ y^{(i)}=c\\0 &\text{otherwise}\end{cases}$
• $\displaystyle{p_c^{(i)}=\cp{y^{(i)}=c}{\b{x}^{(i)};\bs{\Theta}}}$

The weight vector $\bs{\theta}^{(c)}$ can then be optimized using gradient descent:

$$\bs{\theta}^{(c)\bluet{new}}=\bs{\theta}^{(c)}-\eta\nabla_{\bs{\theta}^{(c)}}C(\bs{\Theta})$$

We iteratively update this weight vector until it converges, then we proceed to the next vector of the weight matrix. Alternatively, it is possible to update the entire weight matrix at once until it converges:

$$\bs{\Theta}^\bluet{new}=\bs{\Theta}-\eta\bs{\nabla}C(\bs{\Theta})$$
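
A minimal sketch of the whole-matrix variant (Python/NumPy; not from the notes; assumes labels encoded as $0,\ldots,K-1$, one-hot rows standing in for the indicators $\bb{I}_c^{(i)}$, and a fixed iteration count in place of a proper convergence check):

```python
import numpy as np

def fit_softmax(X, y, K, eta=0.1, n_iters=2000):
    """Gradient descent on the multinomial negative log-likelihood.

    X: N x (D+1) design matrix, y: labels in {0, ..., K-1},
    Theta: (D+1) x K matrix of per-class weight vectors.
    """
    N, D1 = X.shape
    Theta = np.zeros((D1, K))
    Y = np.eye(K)[y]                         # one-hot indicators I_c^(i)
    for _ in range(n_iters):
        A = X @ Theta                        # activations theta^(c).T x^(i)
        A -= A.max(axis=1, keepdims=True)    # stabilize the softmax
        P = np.exp(A)
        P /= P.sum(axis=1, keepdims=True)    # probabilities p_c^(i)
        grad = X.T @ (P - Y)                 # column c is -sum_i (I_c - p_c) x^(i)
        Theta -= eta * grad                  # update the whole weight matrix
    return Theta
```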

Regularization

Just as we can add a regularization term to the RSS used as an error function in linear regression (in order to simplify the model and prevent overfitting to the training data), it is also possible to add a regularization term to the negative log-likelihood, and use this sum as a new, regularized error function. For binomial logistic regression:

$$C(\bs{\theta})=-\cal{L}\l(\bs{\theta};\b{y}\r)+\lambda R(\bs{\theta})$$

Where:

• $\lambda$ is a tuning parameter that controls the importance of the regularization term—higher $\lambda$ leads to more penalization. This parameter is selected through cross-validation.
• $R(\bs{\theta})$ is a regularization term chosen to penalize coefficients by a specific quantity—shrinking them towards zero.
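
As an illustrative sketch (Python/NumPy; not from the notes; this uses the squared $L_2$ norm as $R(\bs{\theta})$, with the bias $\theta_0$ left unpenalized by common convention, and `regularized_nll` is a hypothetical name):

```python
import numpy as np

def regularized_nll(theta, X, y, lam, eps=1e-12):
    """Negative log-likelihood plus an L2 penalty lam * ||theta_1..D||^2."""
    p = np.clip(1.0 / (1.0 + np.exp(-X @ theta)), eps, 1.0 - eps)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(theta[1:] ** 2)  # R(theta): squared L2 norm
```

Swapping the penalty for $\lambda\sum_j\lvert\theta_j\rvert$ would give the $L_1$ (lasso-style) variant instead.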

For more general information about regularization and how the $L_1$ and $L_2$ norms (along with ElasticNet) can be used as the regularization term, please read the Linear Regression notes—these regularization methods apply in the same way to logistic regression.