Logistic regression

Contents:

- Binomial logistic regression
- Sigmoid functions
- Logistic function
- Probabilistic decision-making
- Decision boundary
- Learning task
- Bernoulli distributed output variable
- Error function and maximum likelihood estimation
- Gradient descent
- Coordinate descent
- Multinomial logistic regression
- Learning the weights
- Gradient descent with multiple weight vectors
- Regularization
- Resources

Despite sharing a few similarities with linear regression, **logistic regression** is actually a **classification** method, where we train a model that will predict the value for a **discrete-valued** output variable, unlike the real-valued output that a linear model is trained to predict.

There are two types of logistic regression: **binomial** and **multinomial**. These terms refer to the nature of the output variable.

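In **binomial logistic regression**, the output can be modeled as a binary (two-valued) variable. Our training data in this case looks like:

$$\mathcal{D} = \left\{ \left(\mathbf{x}^{(1)}, y^{(1)}\right), \ldots, \left(\mathbf{x}^{(N)}, y^{(N)}\right) \right\}$$

Where:

- $\mathbf{x}^{(i)} \in \mathbb{R}^{D+1}$, a $(D+1)$-dimensional vector of real numbers, $\mathbf{x}^{(i)} = \left(1, x_1^{(i)}, \ldots, x_D^{(i)}\right)^\top$.
- $y^{(i)} \in \{0, 1\}$, the class label for $\mathbf{x}^{(i)}$.

Note: In the *Linear Regression* notes, $\mathbf{x}$ represented the original unextended feature vector without the additional $1$. In logistic regression, this vector includes the $1$ by convention. The same applies for the weight vector $\mathbf{w}$ (additional bias term $w_0$) that we will see later.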

However, it is still possible to apply a vector of basis functions to this vector if you wish.

As explained in the *Linear Regression* notes, this can be represented as an $N \times (D+1)$ matrix of **feature vectors** $\mathbf{X}$ and a corresponding **output vector** $\mathbf{y}$.

Each of the columns of $\mathbf{X}$ represents a **feature**, which is simply a **random variable** (sometimes called an **explanatory variable**).

The accompanying output variable $\mathbf{y}$ contains the output $y^{(i)}$ for each feature vector $\mathbf{x}^{(i)}$.
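As a small illustration of the extended feature vector convention, the sketch below prepends the constant $1$ to each raw feature vector. The helper name `extend` and the toy values are assumptions made for this example only.

```python
def extend(features):
    """Prepend the constant 1 so the bias weight is treated like any other weight."""
    return [1.0, *features]


# Hypothetical raw data: N = 3 examples with D = 2 features each.
raw_features = [[0.5, 2.0], [1.5, 0.1], [3.0, 1.2]]

X = [extend(row) for row in raw_features]  # N x (D + 1) matrix, stored as a list of rows
y = [0, 0, 1]                              # one binary label per row of X

for x_i, y_i in zip(X, y):
    print(x_i, y_i)
```

Storing the $1$ inside the feature vector means the bias never needs special handling in later dot products.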

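In linear regression, the output is assumed to be a **linear combination** of its feature values, that is:

$$y = w_0 + w_1 x_1 + \cdots + w_D x_D = \mathbf{w}^\top \mathbf{x}$$

Where:

- $w_j$ is an arbitrary coefficient, referred to as a **weight**.
- $\mathbf{w} = (w_0, w_1, \ldots, w_D)^\top$ is a **weight vector**.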

However, modeling the output as a linear combination of the feature values won't work for logistic regression, as the output needs to be discrete ($0$ or $1$).

One way to ensure an output fits these constraints is to transform the real-valued $\mathbf{w}^\top \mathbf{x}$ with some function $f$, referred to as a **sigmoid function**.

A **sigmoid function** is a function with the domain of all real numbers, that returns a **monotonically increasing** value ranging from $0$ to $1$, that is:

$$f : \mathbb{R} \to (0, 1)$$

There are many functions that satisfy these conditions:

**Figure 1**: A collection of sigmoid functions. *Note: Despite these sigmoids having a range of (-1,1), they can be transformed to (0,1).* (source)

However, the most commonly used sigmoid function for logistic regression is the appropriately named **logistic function**.

The **logistic function** is a sigmoid function that is used in logistic regression to transform the real-valued $\mathbf{w}^\top \mathbf{x}$ into a value in the $(0, 1)$ interval. This function is denoted $\sigma$, and is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Figure 2**: The logistic function. (source)
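A minimal sketch of the logistic function in plain Python follows. Splitting on the sign of the input is an implementation choice made here to avoid floating-point overflow; it is not something the notes prescribe.

```python
import math


def logistic(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # exp(z) / (1 + exp(z)) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, logistic(z))
```

In exact arithmetic the output never reaches $0$ or $1$, but in floating point very large inputs round to those endpoints.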

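Despite the logistic function mapping the real-valued output to the range $(0, 1)$, we **cannot** simply model this output as the application of the logistic function to $\mathbf{w}^\top \mathbf{x}$:

$$y = \sigma(\mathbf{w}^\top \mathbf{x})$$

This is problematic because $\sigma(\mathbf{w}^\top \mathbf{x})$ may still be any real number in $(0, 1)$, so we still need to "decide" on whether to assign this number to the $0$ or $1$ class. Intuitively, this task is well-suited to probabilistic decision-making since we are already dealing with numbers between $0$ and $1$.

We can model the decision of $y$ with:

$$\hat{y} = \begin{cases} 1 & \text{if } p(y = 1 \mid \mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Where the conditional probability $p(y = 1 \mid \mathbf{x})$ is the result of applying the logistic function to $\mathbf{w}^\top \mathbf{x}$ as seen before:

$$p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$$

Note: Observe that $p(y = 1 \mid \mathbf{x})$ and $p(y = 0 \mid \mathbf{x})$ must form a probability distribution, meaning that:

$$p(y = 0 \mid \mathbf{x}) = 1 - p(y = 1 \mid \mathbf{x})$$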

In a classification problem with two classes, a **decision boundary** is a **hypersurface** that partitions the underlying feature space into two sets, one for each class.

In the case of binomial logistic regression (**without** a feature transformation with basis functions), this decision boundary is linear: a **hyperplane**. The decision boundary would occur when $p(y = 1 \mid \mathbf{x}) = p(y = 0 \mid \mathbf{x}) = 0.5$, as we are completely uncertain which class to assign when the probability is $0.5$:

$$\sigma(\mathbf{w}^\top \mathbf{x}) = 0.5 \iff \mathbf{w}^\top \mathbf{x} = 0$$

Note: Observe that the decision boundary is linear in $\mathbf{x}$. However, feature transformations with basis functions can result in a non-linear decision boundary.
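The decision rule and the linear boundary can be sketched as follows. The weights are hypothetical values picked so that the boundary is the line $x_1 + x_2 = 3$, and assigning a probability of exactly $0.5$ to class $1$ is a tie-breaking assumption of this example.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """p(y = 1 | x) for weights w and an extended feature vector x (x[0] == 1)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return logistic(z)


def predict(w, x, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(w, x) >= threshold else 0


# Hypothetical weights: bias -3 and two unit weights, so w.x = 0 when x1 + x2 = 3.
w = [-3.0, 1.0, 1.0]

print(predict_proba(w, [1.0, 1.0, 2.0]))  # a point lying on the boundary
print(predict(w, [1.0, 3.0, 2.0]))        # a point with x1 + x2 > 3
print(predict(w, [1.0, 0.5, 0.5]))        # a point with x1 + x2 < 3
```

Points on either side of the hyperplane receive probabilities above or below $0.5$, which is what makes the boundary linear in the features.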

Because logistic regression actually predicts probabilities rather than just classes, we can estimate the weights using **maximum likelihood estimation**. This requires us to define the distribution of the output variable $y$, and the likelihood function for this distribution.

Recall that the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value $1$ with probability $p$, and the value $0$ with probability $1 - p$. In other words, the Bernoulli distribution models a single experiment that flips a (potentially biased) coin, with probability $p$ of the coin landing on heads, and probability $1 - p$ of the coin landing on tails.

As we have a binary output variable $y$ with probability $p(y = 1 \mid \mathbf{x})$ of taking value $1$, and probability $1 - p(y = 1 \mid \mathbf{x})$ (or $p(y = 0 \mid \mathbf{x})$) of taking value $0$, we can say that the output variable is distributed as a Bernoulli random variable:

$$y \mid \mathbf{x} \sim \mathrm{Bernoulli}\left(\sigma(\mathbf{w}^\top \mathbf{x})\right)$$

Where: $\sigma(\mathbf{w}^\top \mathbf{x})$ is the parameter for this Bernoulli distribution.

In linear regression, we solve ordinary least squares (OLS) problems which use the residual sum of squares (RSS) as an error function. However, we typically don't use this for logistic regression. As we have a discrete probability distribution with parameters, we can instead use **maximum likelihood estimation** to determine the weights for the model.

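The **likelihood function** for a Bernoulli distributed output variable is given by:

$$L(\mathbf{w}) = \prod_{i=1}^{N} \sigma\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}}$$

A maximum likelihood estimate for the weights can be found by taking the derivative of $L(\mathbf{w})$ with respect to $\mathbf{w}$ and setting this equal to $0$.

To simplify the process of taking derivatives, we can apply natural logarithms to give us the **log-likelihood** $\ell(\mathbf{w})$, which has its maximum at the same weights, since the natural logarithm function is monotonically increasing:

$$\arg\max_{\mathbf{w}} L(\mathbf{w}) = \arg\max_{\mathbf{w}} \ln L(\mathbf{w})$$

We can now write the log-likelihood as:

$$\ell(\mathbf{w}) = \ln L(\mathbf{w}) = \sum_{i=1}^{N} \left[ y^{(i)} \ln \sigma\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right) + \left(1 - y^{(i)}\right) \ln\left(1 - \sigma\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right)\right) \right]$$

This allows us to represent the maximization problem for the optimal weights as:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \ell(\mathbf{w})$$

However, in machine learning it is convention that optimization problems are minimization problems. A maximization problem can be converted by simply negating the function being optimized:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} E(\mathbf{w}), \qquad E(\mathbf{w}) = -\ell(\mathbf{w})$$

Where: The **negative log-likelihood** $E(\mathbf{w})$ is the **error function** for our model.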
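To make the error function concrete, here is a hedged sketch that evaluates the negative log-likelihood on a made-up dataset. The helper `log_sigmoid` and the toy values are assumptions of this example; computing the logarithm of the logistic function directly is a numerical-stability choice rather than part of the derivation.

```python
import math


def log_sigmoid(z):
    """Return ln(sigma(z)) without overflowing for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def negative_log_likelihood(w, X, y):
    """E(w) = -sum_i [ y_i ln(sigma_i) + (1 - y_i) ln(1 - sigma_i) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        # 1 - sigma(z) equals sigma(-z), so ln(1 - sigma(z)) = log_sigmoid(-z).
        total -= y_i * log_sigmoid(z) + (1 - y_i) * log_sigmoid(-z)
    return total


# Toy data (made up): extended feature vectors [1, x] with binary labels.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(negative_log_likelihood([0.0, 0.0], X, y))   # every prediction is 0.5
print(negative_log_likelihood([-4.5, 2.0], X, y))  # weights that fit the labels better
```

With all-zero weights each example contributes $\ln 2$ to the error, so any weights that separate the labels reasonably well should score lower.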

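In the *Linear Regression* notes, we saw that the following minimization problem for OLS:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$$

had a closed-form (analytical) solution, $\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$.

This is not the case for the MLE in logistic regression. We must take the derivative of $E(\mathbf{w})$ with respect to $\mathbf{w}$. This can be done by taking partial derivatives, since:

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_D} \right)^\top$$

Where: $\nabla_{\mathbf{w}} E(\mathbf{w})$ is called the **gradient** (in vector calculus).

An individual partial derivative of $E(\mathbf{w})$ with respect to $w_j$ would be given by:

$$\frac{\partial E}{\partial w_j} = \sum_{i=1}^{N} \left( \sigma\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

To obtain the $\mathbf{w}$ that minimizes $E(\mathbf{w})$, we would set each partial derivative equal to zero and solve. However, this is not possible analytically, as the result is a transcendental equation with no closed-form solution.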
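The partial-derivative formula can be checked numerically. The sketch below implements the gradient as a sum of residual-times-feature terms and compares it with central finite differences of the negative log-likelihood; the function names, toy data and test weights are assumptions of this example.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(a, b):
    return sum(a_j * b_j for a_j, b_j in zip(a, b))


def error(w, X, y):
    """Negative log-likelihood E(w)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = logistic(dot(w, x_i))
        total -= math.log(p) if y_i == 1 else math.log(1.0 - p)
    return total


def gradient(w, X, y):
    """Partial derivatives dE/dw_j = sum_i (sigma_i - y_i) * x_ij."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        residual = logistic(dot(w, x_i)) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += residual * x_ij
    return grad


def numerical_gradient(w, X, y, h=1e-6):
    """Central finite differences, used only to sanity-check the formula."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += h
        down = list(w)
        down[j] -= h
        grad.append((error(up, X, y) - error(down, X, y)) / (2.0 * h))
    return grad


# Toy data (made up): extended feature vectors [1, x] with binary labels.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = [0.1, -0.2]

print(gradient(w, X, y))
print(numerical_gradient(w, X, y))
```

The two printed vectors should agree to several decimal places, which is a quick way to catch sign or indexing mistakes in a hand-derived gradient.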

Instead, we can utilize numerical optimization methods to approximate a solution—namely, **gradient descent**. However, there are many other optimization methods such as **L-BFGS**, **coordinate descent** or **stochastic gradient descent**.

In order to minimize our error function, we will use **gradient descent**.

Analogy: A common analogy for gradient descent imagines a hiker left stranded and blindfolded at the top of a mountain, whose objective is to reach the bottom. The most intuitive way to approach this would be to feel the slope in different directions around their feet, take one step in the direction with the steepest downward slope, and repeat.

This action is analogous to the way in which gradient descent works.

The gradient $\nabla_{\mathbf{w}} E(\mathbf{w})$ points in the direction of steepest **ascent** (at the current position $\mathbf{w}$). We are interested in "taking a step" in the direction of steepest descent, $-\nabla_{\mathbf{w}} E(\mathbf{w})$.

By "taking a step", we mean moving some distance in a particular direction in $\mathbf{w}$-space.

In the hiker analogy, suppose we have the vertical $E$-axis, along which we are trying to minimize, together with the two axes in the horizontal plane of the hiker, $w_0$ and $w_1$. In this case, we "take a step" in the two-dimensional $\mathbf{w}$-space, which results in a change in the vertical position of the hiker.

A step of size $\eta$ in the direction of steepest descent (in $\mathbf{w}$-space) would be represented by the vector $-\eta \nabla_{\mathbf{w}} E(\mathbf{w})$. If we take our current position $\mathbf{w}$ and apply this step by adding the vector $-\eta \nabla_{\mathbf{w}} E(\mathbf{w})$, we arrive at our updated position:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} E(\mathbf{w})$$

Where: $\eta$ is referred to as the **learning rate**.

This would be repeated until convergence.
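The update rule can be sketched as a short training loop. The learning rate, iteration cap, stopping tolerance and toy data below are assumptions chosen so that this small example converges; they are not recommended defaults.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient(w, X, y):
    """Gradient of the negative log-likelihood: sum_i (sigma_i - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        residual = logistic(z) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += residual * x_ij
    return grad


def gradient_descent(X, y, eta=0.05, max_iters=100_000, tol=1e-6):
    """Repeat w <- w - eta * gradient until every partial derivative is near zero."""
    w = [0.0] * len(X[0])
    for _ in range(max_iters):
        grad = gradient(w, X, y)
        if max(abs(g_j) for g_j in grad) < tol:
            break  # converged: the error surface is numerically flat here
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


# Toy data (made up). The two classes overlap, so a finite minimizer exists.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

w_hat = gradient_descent(X, y)
print("weights:", w_hat)
print("decision boundary at x =", -w_hat[0] / w_hat[1])
```

The overlapping classes matter: on perfectly separable data the negative log-likelihood has no finite minimizer, so the weights would keep growing until the iteration cap is reached.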

It is also possible to update individual components separately for this step: updating one component **once**, then moving on to the next, until we have a completely updated weight vector $\mathbf{w}$. We would repeat this until convergence, when the components of $\mathbf{w}$ no longer change:

$$w_j \leftarrow w_j - \eta \frac{\partial E}{\partial w_j}$$

It is also possible to consider an individual direction or component at a time when taking steps: focusing completely on one component, but updating it **multiple** times (until convergence) to minimize the error along this component before returning to minimize the others.

Once the error is minimized along this component, this would then be repeated for all remaining components. This variation of gradient descent is referred to as **coordinate descent**.
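A rough sketch of this coordinate-wise variant is shown below: each component is stepped repeatedly along its own partial derivative while the others are held fixed, and full sweeps are repeated until nothing changes. The step size, sweep limits, tolerance and toy data are assumptions of this example, and practical coordinate descent solvers usually minimize each coordinate more cleverly than with fixed-size gradient steps.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def partial_derivative(w, X, y, j):
    """dE/dw_j = sum_i (sigma(w.x_i) - y_i) * x_ij."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_k * x_ik for w_k, x_ik in zip(w, x_i))
        total += (logistic(z) - y_i) * x_i[j]
    return total


def coordinate_descent(X, y, eta=0.1, max_sweeps=500, inner_iters=100, tol=1e-8):
    w = [0.0] * len(X[0])
    for _sweep in range(max_sweeps):
        largest_change = 0.0
        for j in range(len(w)):
            start = w[j]
            # Step along coordinate j only, holding every other weight fixed.
            for _step in range(inner_iters):
                g_j = partial_derivative(w, X, y, j)
                if abs(g_j) < tol:
                    break
                w[j] -= eta * g_j
            largest_change = max(largest_change, abs(w[j] - start))
        if largest_change < tol:
            break  # a full sweep left every component essentially unchanged
    return w


# Toy data (made up), with overlapping classes so that a finite minimizer exists.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

w_hat = coordinate_descent(X, y)
print("weights:", w_hat)
```

Because the bias and slope are strongly coupled in this uncentered data, many sweeps are needed; centering the features would typically reduce that number.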

Although there may be many tasks involving binary output variables, many classification tasks naturally involve multiple output labels, such as:

- Hand-written digit recognition (labels are the digits $0$-$9$)
- Fruit recognition (labels are digits representing the type of fruit)
- Assigning blog posts to a category (labels are digits representing the categories)
- Blood type classification (labels are digits representing the blood types)

For these kinds of tasks, the binomial logistic regression that was previously introduced won't work. However, it is possible to generalize binomial logistic regression problems to multi-class problems. This generalized classification method is referred to as **multinomial logistic regression**, but is also known under other names such as **softmax regression** or **maximum entropy classifier**.

**Figure 3**: Division of the feature-space of a dataset into three decision regions by a classifier such as multinomial logistic regression that can generate multiple decision boundaries (each being linear in this case). (source)

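Given training data $\mathcal{D}$ where $y^{(i)} \in \{1, \ldots, K\}$, we can create a separate weight vector $\mathbf{w}_k$ for each class $k$. We can then classify a feature vector as $k$ or not-$k$, through the use of the **softmax** function, which is simply a normalized exponential function:

$$p(y = k \mid \mathbf{x}; \mathbf{W}) = \frac{\exp\left(\mathbf{w}_k^\top \mathbf{x}\right)}{\sum_{j=1}^{K} \exp\left(\mathbf{w}_j^\top \mathbf{x}\right)}$$

Where:

- $\mathbf{W} = \left[\mathbf{w}_1, \ldots, \mathbf{w}_K\right]$, a $(D+1) \times K$ matrix where each column represents a vector of weights $\mathbf{w}_k$ for class $k$.
- $\sum_{j=1}^{K} \exp\left(\mathbf{w}_j^\top \mathbf{x}\right)$ is a normalization term, used to ensure that the distribution sums up to one.

To decide on a class for $\mathbf{x}$, we must calculate $p(y = k \mid \mathbf{x}; \mathbf{W})$ for every class $k$. The chosen class for $\mathbf{x}$ is then the one that results in the maximum conditional probability, that is:

$$\hat{y} = \arg\max_{k} \; p(y = k \mid \mathbf{x}; \mathbf{W})$$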
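The softmax probabilities and the arg-max decision can be sketched as follows. Two assumptions of this example: classes are indexed from $0$ as is usual in Python (rather than from $1$), and the weight vectors are hypothetical values. Subtracting the largest score before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def softmax(scores):
    """Normalized exponentials of the scores; shifting by the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(W, x):
    """p(y = k | x) for each class k; W is a list of K weight vectors."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w_k, x)) for w_k in W]
    return softmax(scores)


def predict(W, x):
    """Index of the class with the highest conditional probability."""
    probs = class_probabilities(W, x)
    return max(range(len(probs)), key=lambda k: probs[k])


# Hypothetical weights for K = 3 classes on extended inputs [1, x].
W = [[2.0, -1.0], [0.5, 0.0], [-2.0, 1.0]]

for x_value in (0.0, 2.0, 4.0):
    x = [1.0, x_value]
    print(x_value, class_probabilities(W, x), predict(W, x))
```

Each weight vector here plays the role of one column of the weight matrix, stored as a row purely for convenience.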

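Due to having a separate weight vector for each class, our negative log-likelihood error function will have to be modified to account for the softmax function. This can be done by summing over all classes in addition to summing over the examples. The modified negative log-likelihood function looks like:

$$E(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}\left\{y^{(i)} = k\right\} \ln p\left(y^{(i)} = k \mid \mathbf{x}^{(i)}; \mathbf{W}\right)$$

Where: $\mathbb{1}\left\{y^{(i)} = k\right\}$ is $1$ if example $i$ belongs to class $k$ and $0$ otherwise, and $p\left(y^{(i)} = k \mid \mathbf{x}^{(i)}; \mathbf{W}\right)$ is the probability given by the softmax function.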

Once again, this cannot be minimized with an analytic solution—we must resort to gradient descent (or another numerical optimization method).

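As a result of having to use an individual vector of weights for each class, we will introduce a vector of the gradients for each class, $\nabla_{\mathbf{W}} E(\mathbf{W})$:

$$\nabla_{\mathbf{W}} E(\mathbf{W}) = \left( \nabla_{\mathbf{w}_1} E(\mathbf{W}), \ldots, \nabla_{\mathbf{w}_K} E(\mathbf{W}) \right)$$

Where: $\nabla_{\mathbf{w}_k} E(\mathbf{W})$, the derivative of the error function with respect to the weight vector for class $k$. In other words, this represents the gradient of the error function for class $k$.

Taking derivatives of $E(\mathbf{W})$, one can show that the gradient is given by:

$$\nabla_{\mathbf{w}_k} E(\mathbf{W}) = \sum_{i=1}^{N} \left( p\left(y^{(i)} = k \mid \mathbf{x}^{(i)}; \mathbf{W}\right) - \mathbb{1}\left\{y^{(i)} = k\right\} \right) \mathbf{x}^{(i)}$$

Where:

- $\mathbb{1}\left\{y^{(i)} = k\right\}$ is an **indicator function** such that

$$\mathbb{1}\left\{y^{(i)} = k\right\} = \begin{cases} 1 & \text{if } y^{(i)} = k \\ 0 & \text{otherwise} \end{cases}$$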

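The weight vector $\mathbf{w}_k$ can then be optimized using gradient descent:

$$\mathbf{w}_k \leftarrow \mathbf{w}_k - \eta \nabla_{\mathbf{w}_k} E(\mathbf{W})$$

We iteratively update this weight vector until it converges, then we proceed to the next vector of the weight matrix. Alternatively, it is possible to update the entire weight matrix until it converges:

$$\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} E(\mathbf{W})$$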
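The whole-matrix update can be sketched as below, using the softmax error and its per-class gradients. This is a minimal illustration under stated assumptions: classes are indexed from $0$, the toy data are made up, and the learning rate and fixed iteration count were chosen for this example rather than tuned.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probabilities(W, x):
    return softmax([sum(w_j * x_j for w_j, x_j in zip(w_k, x)) for w_k in W])


def error(W, X, y):
    """Negative log-likelihood: minus the log-probability of each true class."""
    return -sum(math.log(probabilities(W, x_i)[y_i]) for x_i, y_i in zip(X, y))


def gradients(W, X, y):
    """One gradient vector per class: sum_i (p_ik - [y_i == k]) * x_i."""
    grads = [[0.0] * len(w_k) for w_k in W]
    for x_i, y_i in zip(X, y):
        p = probabilities(W, x_i)
        for k in range(len(W)):
            residual = p[k] - (1.0 if y_i == k else 0.0)
            for j, x_ij in enumerate(x_i):
                grads[k][j] += residual * x_ij
    return grads


def fit(X, y, num_classes, eta=0.01, iters=2000):
    """Update every class weight vector together on each iteration."""
    W = [[0.0] * len(X[0]) for _ in range(num_classes)]
    for _ in range(iters):
        G = gradients(W, X, y)
        W = [[w - eta * g for w, g in zip(w_k, g_k)] for w_k, g_k in zip(W, G)]
    return W


# Toy data (made up): extended inputs [1, x] drawn from three groups.
X = [[1.0, x] for x in (0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

W_zero = [[0.0, 0.0] for _ in range(3)]
W_hat = fit(X, y, num_classes=3)
print("error at zero weights:", error(W_zero, X, y))
print("error after training: ", error(W_hat, X, y))
```

A fixed iteration count is used instead of a convergence test because these toy classes are separable, so the error keeps decreasing slowly without ever reaching a finite minimizer.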

Just as we can add a regularization term to the RSS used as an error function in linear regression (in order to simplify the model and prevent overfitting to the training data), it is also possible to add a regularization term to the negative log-likelihood, and use this sum as a new, regularized error function. For binomial logistic regression:

$$E_{\text{reg}}(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w})$$

Where:

- $\lambda$ is a **tuning parameter** that controls the importance of the regularization term: higher $\lambda$ leads to more penalization. This parameter is selected through cross-validation.
- $R(\mathbf{w})$ is a regularization term chosen to penalize coefficients by a specific quantity, shrinking them towards zero.

For more general information about regularization and how the $\ell_1$ and $\ell_2$ norms (along with ElasticNet) can be used as the regularization term, please read the *Linear Regression* notes. These regularization methods apply in the same way to logistic regression.
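As an illustration of the shrinking effect, the sketch below adds a squared $\ell_2$ penalty to the gradient and fits the same toy data with several penalty strengths. Leaving the bias weight unpenalized is a common convention assumed here, and the penalty values, learning rate and iteration count are arbitrary choices for this example.

```python
import math


def logistic(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def regularized_gradient(w, X, y, lam):
    """Gradient of E(w) + lam * sum_{j >= 1} w_j^2 (the bias w_0 is not penalized)."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        residual = logistic(z) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += residual * x_ij
    for j in range(1, len(w)):
        grad[j] += 2.0 * lam * w[j]  # derivative of lam * w_j^2
    return grad


def fit(X, y, lam, eta=0.05, iters=5000):
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grad = regularized_gradient(w, X, y, lam)
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


# Toy data (made up), with overlapping classes.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

fits = {lam: fit(X, y, lam) for lam in (0.0, 0.1, 1.0)}
for lam, w in fits.items():
    print("lambda =", lam, "weights =", w)
```

Larger penalty strengths should produce a slope weight closer to zero, which corresponds to a flatter and less confident probability curve.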

*Nigel Goddard (School of Informatics, University of Edinburgh)*

- Introductory Applied Machine Learning: Logistic Regression - Two-Class Linear Classifier
- Introductory Applied Machine Learning: Logistic Regression - Logistic Regression
- Introductory Applied Machine Learning: Logistic Regression - Learning the Parameters
- Introductory Applied Machine Learning: Logistic Regression - Multiclass Classification

*Iain Murray (School of Informatics, University of Edinburgh)*

- Logistic Regression

*Michael F. Brannick (University of South Florida)*

- Logistic Regression

*Cosma Shalizi (Department of Statistics, Carnegie Mellon University)*

- Logistic Regression

*Saishruthi Swaminathan (Towards Data Science)*

- Logistic Regression - Detailed Overview

*Wei Xu (Department of Computer Science and Engineering, Ohio State University)*

- Multi-Class Logistic Regression and Perceptron

*Wikipedia*

- Decision boundary
- Logistic regression
- Monotonic function
- Sigmoid function
- Bernoulli distribution
- Multinomial logistic regression
- Maximum likelihood estimation

*Trevor Hastie (Department of Statistics, Stanford University)*

- Fast Regularization Paths via Coordinate Descent

*Chandler Watson (Stack Overflow)*

- Proof that conditional probabilities sum up to one

*Dan Nettleton (Iowa State University)*

- A Generalized Linear Model for Bernoulli Response Data

*Mark (Stack Overflow)*

- Logistic Regression: Bernoulli vs. Binomial Response Variables

*Neos Guide (University of Wisconsin-Madison)*

- Maximum Likelihood Estimation with Logit Model

*Dan Nuttle (RPubs)*

- Partial Derivative of Cost Function for Logistic Regression

*Ayush Pant (Towards Data Science)*

- Introduction to Logistic Regression

*Ashutosh Singh, Yanxin Li*

- To Study Implementation of Gradient Descent for Multi-class Classification Using a SoftMax Regression and Neural Networks

*Arthur Juliani (Medium)*

- Simple Softmax Regression in Python - Tutorial

*Aanish Singla (Analytics Vidhya)*

- An Introductory Guide to Maximum Likelihood Estimation (with a case study in R)

*Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen, Adam Coates, Andrew Maas, Awni Hannun, Brody Huval, Tao Wang, Sameep Tandon (Stanford University)*

- Supervised Learning and Optimization: Softmax Regression

*Sebastian Raschka*

- What is Softmax regression and how is it related to Logistic regression?