
**Linear regression** addresses the supervised learning problem of approximating the relationship between the **input variables** and **output variables** of some data.

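Training data for linear regression problems comes in the form:

$$\mathcal{D} = \left\{ \left(\mathbf{x}^{(i)}, y^{(i)}\right) \right\}_{i=1}^{N}$$

Where:

- $\mathbf{x}^{(i)} \in \mathbb{R}^D$ is a feature vector, a $D$-dimensional vector of real numbers.
- $y^{(i)} \in \mathbb{R}$ is the corresponding output value.

This is typically represented as an $N \times D$ matrix $\mathbf{X}$ of **feature vectors** and a corresponding **output vector** $\mathbf{y} \in \mathbb{R}^N$.

Each of the $D$ columns of $\mathbf{X}$ represents a **feature**, which is simply a **random variable** (sometimes called an **explanatory variable**). In practice, these features are often specific attributes of the system or object being modeled, e.g. *height*, *rent*, *temperature*.

The accompanying output vector $\mathbf{y}$ contains the output $y^{(i)}$ for each feature vector $\mathbf{x}^{(i)}$, where the output is assumed to be a **linear combination** of its feature values, that is:

$$y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_D x_D^{(i)} = \sum_{j=1}^{D} w_j x_j^{(i)}$$

Where: $w_j$ is an arbitrary coefficient, referred to as a **weight**.

If we let $\mathbf{w} = (w_1, \ldots, w_D)^\top$, then we can see that the expression for $y^{(i)}$, as a sum of products, can be represented as a dot product:

$$y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$$

Where: $\mathbf{w}$ is referred to as the **weight vector**.

Which further allows us to represent $\mathbf{y}$ more concisely, as:

$$\mathbf{y} = \mathbf{X}\mathbf{w}$$

**Note**: Remember that $\mathbf{w}$ is the same for each feature vector, since it's trying to model a linear relation between the features, **not** the feature vectors!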
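The equivalence between the sum of products, the dot product and the matrix form can be checked numerically. The following is a minimal sketch using NumPy; the toy values of `X` and `w` are made up for illustration and do not come from the original text.

```python
import numpy as np

# Hypothetical toy data: N = 3 feature vectors with D = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # one weight per feature, shared by every row

# Output for the first feature vector as an explicit sum of products...
y_0_sum = sum(w[j] * X[0, j] for j in range(X.shape[1]))
# ...and as a dot product between the weight vector and the feature vector.
y_0_dot = w @ X[0]

# All outputs at once, as the matrix-vector product y = Xw.
y = X @ w
print(y_0_sum, y_0_dot, y)
```

Note that the same `w` is applied to every row of `X`, which is what the matrix form expresses.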

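Suppose we have training data $\mathcal{D} = \left\{ \left(\mathbf{x}^{(i)}, y^{(i)}\right) \right\}_{i=1}^{N}$.

Linear regression aims to learn a weight vector $\mathbf{w}$ from this data in order to create a **regression line**, $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$. This line can then be used to estimate the value of the output variable, $\hat{y} = f(\mathbf{x}')$, for some new unlabeled feature vector $\mathbf{x}'$.

**Note**: The regression line is simply a $D$-dimensional hyperplane in the $(D+1)$-dimensional space formed by the $D$ features and the output variable. This is often called a **line-of-best-fit**.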

**Figure 1**: Search for an optimal regression line in one-dimensional feature space. (source)

In order to determine which of the infinitely many regression lines has the least *error*, we need to introduce **error functions**.

An **error function** (also called a **cost function**) for linear regression should act as an aggregate measure of how far the output values of the training samples are from the values that would be predicted by the regression line. That is, for some regression line with fixed $\mathbf{w}$, how far are the predicted outputs $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)}$ (lying on the hyperplane) from the actual output values, $y^{(i)}$?

The optimum regression line should have the **minimal** error, which in turn makes this a minimization problem.

A **residual** is the error in a single result: how far an individual predicted $\hat{y}^{(i)}$ is from the actual output $y^{(i)}$.

The most commonly used residuals are **vertical-offset**, only considering the distance along the axis of the output variable.

**Figure 2**: Vertical and horizontal offset residuals. (source)

**Figure 3**: Vertical-offset residuals in a two-dimensional feature space. (source)

**Ordinary least squares (OLS)** is a form of regression analysis that uses the **sum-squared error** (or **residual sum of squares (RSS)**) function as a cost function.

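Sum-squared error is simply the sum of the squared (vertical-offset) residual lengths:

$$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} \right)^2$$

Note: We square the lengths because the subtraction may result in a negative number.

Which can be represented as $(\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$ in matrix notation, giving us:

$$\mathrm{RSS}(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) = \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$$

Then the optimum weight vector $\mathbf{w}^*$ is given by minimising this cost function:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$$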
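As a small illustration of the sum-squared error, the sketch below evaluates it for two candidate weight vectors. The function name `rss` and the toy data are assumptions made for this example only.

```python
import numpy as np

def rss(X, y, w):
    """Sum of squared vertical-offset residuals for the weight vector w."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# Hypothetical one-dimensional data lying exactly on the line y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

rss_perfect = rss(X, y, np.array([2.0]))  # residuals are all zero
rss_worse = rss(X, y, np.array([1.0]))    # residuals are 1, 2 and 3
print(rss_perfect, rss_worse)
```

The weight vector with the smaller value is the better fit, so minimising this quantity over all possible weight vectors selects the regression line.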

Despite there being infinitely many possible values for $\mathbf{w}$, the hyperplane $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ is restricted to passing through the origin, as there is no intercept term in $\mathbf{w}^\top \mathbf{x}$.

This can often be a very limiting restriction, as it essentially means that the hyperplane cannot be translated away from the origin. This may make it difficult to create a *good* regression line from the training data.

Example: Due to the origin restriction, it is not possible to create an optimal regression line for the following collection of one-dimensional feature vectors.

Figure 4: Sub-optimal origin-restricted regression lines for one-dimensional feature vectors.

If an intercept term $w_0$ (called a **bias**) is incorporated into the regression line, giving $f(\mathbf{x}) = w_0 + \mathbf{w}^\top \mathbf{x}$, it becomes possible for the regression line to move around in space more freely.

In practice, we will usually want a bias for our linear model anyway.

Example: If we are predicting the price of a house, $y$, from explanatory variables $x_1, \ldots, x_D$, we don't want the predicted price to be $0$ when all of the explanatory variables are valued at $0$. Instead, when the explanatory variables are all zero, we want the house price to take some default (base) value. The bias term allows us to capture this information by defining the value of the regression line when the explanatory variables are all zero; in other words, the point of intercept with the $y$ axis.

**Figure 5**: A better-fitting regression line as a result of the bias term.
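The effect of the bias term can be sketched for a single feature, where both fits have simple formulas. The data below is hypothetical (generated from $y = 2x + 5$), and the centred-data formulas for the slope and intercept are the standard simple linear regression results rather than anything specific to the original text.

```python
import numpy as np

# Hypothetical 1-D data generated from y = 2x + 5 (so the true intercept is non-zero).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 5.0

# Origin-restricted least squares: minimising sum (y - w*x)^2 gives w = (x.y) / (x.x).
w_origin = (x @ y) / (x @ x)
rss_origin = float(np.sum((y - w_origin * x) ** 2))

# With a bias: the slope comes from the centred data, the intercept from the means.
x_c, y_c = x - x.mean(), y - y.mean()
w1 = (x_c @ y_c) / (x_c @ x_c)
w0 = y.mean() - w1 * x.mean()
rss_bias = float(np.sum((y - (w0 + w1 * x)) ** 2))

print(w_origin, rss_origin)  # a compromise slope with a large error
print(w0, w1, rss_bias)      # recovers the generating line
```

The origin-restricted line has to tilt to compensate for the missing intercept, which is why its error stays large on this data.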

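In order for the bias term introduced above to be consistent with the previous linear algebra, a few changes need to be made to the representations of matrices and vectors, to ensure that dot products remain well-defined.

With the introduction of the bias term $w_0$, an individual training example $y^{(i)}$ can be represented linearly as:

$$y^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_D x_D^{(i)}$$

Prior to the introduction of the bias term, we could express this as a dot product. However, the term $w_0$ in the expression above is **not** attached to the value of any explanatory variable $x_j^{(i)}$.

We can address this issue by extending the feature vector $\mathbf{x}^{(i)}$ to have $1$ as its first element, $x_0^{(i)} = 1$. Let this new vector be $\tilde{\mathbf{x}}^{(i)}$:

$$\tilde{\mathbf{x}}^{(i)} = \left(1, x_1^{(i)}, x_2^{(i)}, \ldots, x_D^{(i)}\right)^\top$$

If we express $y^{(i)}$ as a linear combination of the explanatory variables as defined in $\tilde{\mathbf{x}}^{(i)}$, we can now say:

$$y^{(i)} = w_0 x_0^{(i)} + w_1 x_1^{(i)} + \cdots + w_D x_D^{(i)} = \sum_{j=0}^{D} w_j x_j^{(i)}$$

Further, if we treat the bias term $w_0$ as an extension of the weight vector $\mathbf{w}$, so that $\mathbf{w} = (w_0, w_1, \ldots, w_D)^\top$, we can express this as a dot product:

$$y^{(i)} = \mathbf{w}^\top \tilde{\mathbf{x}}^{(i)}$$

The modified feature vectors can be represented by a new $N \times (D+1)$ matrix, known as the **design matrix**, $\boldsymbol{\Phi}$:

$$\boldsymbol{\Phi} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_D^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$

Using the previous result along with matrix algebra, we can represent the output vector $\mathbf{y}$ as:

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}$$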
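Constructing the design matrix amounts to prepending a column of ones to the feature matrix. A minimal sketch, where the helper name `design_matrix` and the toy values are assumptions for illustration:

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones to an (N, D) feature matrix, giving shape (N, D + 1)."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Hypothetical data: N = 2 feature vectors with D = 2 features each.
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
w = np.array([1.0, 0.5, -1.0])  # [w_0 (bias), w_1, w_2]

Phi = design_matrix(X)
y = Phi @ w                      # same as w_0 + X @ [w_1, w_2]
print(Phi)
print(y)
```

The column of ones is what lets the bias be treated as just another weight in the dot product.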

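We previously saw that the optimal $\mathbf{w}$ with no bias term is given by solving the following minimization problem for OLS:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$$

By extending $\mathbf{w}$ with the bias term and using the design matrix $\boldsymbol{\Phi}$ instead of $\mathbf{X}$, this becomes:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert_2^2$$

Minimization for OLS has a **closed-form (analytical) solution** which can be derived by taking partial derivatives with respect to $\mathbf{w}$ and setting them to $0$. Doing so gives $-2\boldsymbol{\Phi}^\top(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}) = \mathbf{0}$, which leads to the following analytical solution for the optimal weight vector $\mathbf{w}^*$ (assuming $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible):

$$\mathbf{w}^* = \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$$

Where: $\left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\top$ is referred to as the **pseudo-inverse** of $\boldsymbol{\Phi}$. This is not the actual inverse matrix, since $\boldsymbol{\Phi}$ is generally not invertible: it is not square, as it has shape $N \times (D+1)$.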
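The closed-form solution can be sketched in a few lines of NumPy. The synthetic, noise-free data and the names `w_true`, `w_normal` and `w_pinv` are assumptions for this example. Solving the linear system with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

# Hypothetical noise-free data so that the true weights can be recovered exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([3.0, 1.5, -2.0])          # [bias, w_1, w_2]
Phi = np.column_stack([np.ones(len(X)), X])  # design matrix with a leading column of ones
y = Phi @ w_true

# Normal equations: solve (Phi^T Phi) w = Phi^T y for w.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Equivalent route through the pseudo-inverse of the design matrix.
w_pinv = np.linalg.pinv(Phi) @ y

print(w_normal)
print(w_pinv)
```

Both routes should agree up to floating-point error whenever the columns of the design matrix are linearly independent.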

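**Note**: Although OLS has an analytical solution, it is also possible to use iterative optimization methods such as **gradient descent**, **stochastic gradient descent**, **BFGS**, etc. to minimize the cost function. However, unlike the closed-form solution, these methods only approximate the minimizer and are **not** guaranteed to converge unless hyperparameters such as the step size are chosen suitably. Since the OLS cost function is convex, any local minimum they do find is also a global minimum.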
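As a rough sketch of the iterative alternative, the following runs plain batch gradient descent on the mean of the squared residuals. The function name, the learning rate `lr` and the step count are arbitrary choices for this toy problem, not recommended defaults.

```python
import numpy as np

def ols_gradient_descent(Phi, y, lr=0.1, n_steps=2000):
    """Minimise RSS(w) / N by batch gradient descent, starting from w = 0."""
    n = len(y)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        # Gradient of (1/N) * ||y - Phi w||^2 with respect to w.
        grad = (-2.0 / n) * (Phi.T @ (y - Phi @ w))
        w -= lr * grad
    return w

# Hypothetical data lying exactly on y = 1 + 2x.
x = np.linspace(-1.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w_gd = ols_gradient_descent(Phi, y)
print(w_gd)  # approaches [1, 2]
```

With a learning rate that is too large for the data, the same loop would diverge, which is the convergence caveat in practice.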

**Regularization** refers to various methods used to penalize specific terms in a cost function in order to prevent overfitting to the training data. This is done by adding a **regularization term** (also called a **regularizer**).

Regularization discourages or decreases the complexity of a linear model.

**Figure 6**: Simplification of a polynomial regression model's *complexity* as a result of regularization. *Blue represents the unregularized and overfitted model, and green represents a regularized model which generalizes better.* (source)

For least squares problems, the regularized cost function looks like:

$$E(\mathbf{w}) = \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert_2^2 + \lambda R(\mathbf{w})$$

Where:

- $\lambda \geq 0$ is a **tuning parameter** that controls the importance of the regularization term: higher $\lambda$ leads to more penalization. This parameter is typically selected through cross-validation.
- $R(\mathbf{w})$ is a regularization term chosen to penalize coefficients by a specific quantity, shrinking them towards zero.

Regularization essentially forces the minimization of a cost function within the constraints of the provided regularization term.

**Figure 7**: Graphical depiction of the constraint placed on the minimization of the RSS cost function as a result of L2 regularization in a two-dimensional feature space. *This constraint is in place due to the fact that we now have to minimize a combined sum of the RSS and regularization term. To solve this minimization problem we must get as close as possible to the minimum of the RSS contours, whilst still remaining in the constrained region imposed by the regularization term (the circle in this case, but an n-sphere in general).* (source)

Regularization terms in the form of the $L_1$ (Manhattan) and $L_2$ (Euclidean) **norms** are commonly used for linear regression. These norms form the basis for **$L_1$ and $L_2$ regularization**.

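The table below provides information about both of these regularization methods when used to solve least squares problems:

| | $L_1$ regularization | $L_2$ regularization |
|---|---|---|
| Name | Lasso | Ridge |
| Regularization term $R(\mathbf{w})$ (note: we don't penalize the bias term, $w_0$) | $\sum_{j=1}^{D} \lvert w_j \rvert$ (or $\lVert \mathbf{w} \rVert_1$, the $L_1$ norm, taken over the non-bias weights) | $\sum_{j=1}^{D} w_j^2$ (or $\lVert \mathbf{w} \rVert_2^2$, the squared $L_2$ norm, taken over the non-bias weights) |
| Regularized least squares | $\arg\min_{\mathbf{w}} \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert_2^2 + \lambda \sum_{j=1}^{D} \lvert w_j \rvert$ | $\arg\min_{\mathbf{w}} \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert_2^2 + \lambda \sum_{j=1}^{D} w_j^2$ |
| Analytic solution | None: use iterative optimization methods. | $\mathbf{w}^* = \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I}_0\right)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$, where $\mathbf{I}_0$ is the identity matrix with its first diagonal entry set to $0$ so that the bias is not penalized. |
| Affected weights | All weights, uniformly. | All weights, but low-valued weights will be penalized less since we are squaring. Conversely, large weights will face more penalty. |
| When to use | When there are many features which are irrelevant to the output variable, since it can shrink their weights to zero, completely disregarding them. Also works well when $N \gg D$ (number of instances is far greater than the number of features). | When all (or most) features are relevant to the output variable, since it cannot shrink weights to zero, meaning that all features will have some impact. Also works better when there is high collinearity between features. |
| Constraint region visualization (regularization constraint region depicted in red, RSS contours depicted in blue) | Diamond-shaped region. | Circular region. |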
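The contrast between the two penalties can be sketched on a tiny problem. The ridge fit below uses the closed-form expression with an unpenalized bias. The lasso fit uses proximal gradient descent (ISTA) as one possible iterative method; this is a choice made for the example, not a method named in the text. The data, function names and value of `lam` are all hypothetical.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution; the bias (column 0) is left unpenalized."""
    I0 = np.eye(Phi.shape[1])
    I0[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + lam * I0, Phi.T @ y)

def soft_threshold(z, t):
    """Shrink each entry of z towards zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(Phi, y, lam, n_steps=500):
    """Lasso via proximal gradient descent (ISTA); the bias is not thresholded."""
    # Step size 1 / L, where L = 2 * sigma_max(Phi)^2 bounds the curvature of the RSS.
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        z = w + step * 2.0 * (Phi.T @ (y - Phi @ w))  # gradient step on the RSS
        w = np.concatenate([z[:1], soft_threshold(z[1:], step * lam)])
    return w

# Hypothetical data: the second feature has only a weak effect on the output.
x1 = np.array([-1.0, 1.0, -1.0, 1.0])
x2 = np.array([-1.0, -1.0, 1.0, 1.0])
Phi = np.column_stack([np.ones(4), x1, x2])
y = 1.0 + 3.0 * x1 + 0.2 * x2

w_ridge = ridge_fit(Phi, y, lam=4.0)
w_lasso = lasso_fit(Phi, y, lam=4.0)
print(w_ridge)  # both non-bias weights shrunk, neither exactly zero
print(w_lasso)  # weight of the weak feature set exactly to zero
```

On this data ridge shrinks both non-bias weights proportionally, while lasso subtracts a fixed amount from each and removes the weak feature entirely.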

**Elastic net** is a compromise regularization method that uses a regularization term which linearly combines the $L_1$ and $L_2$ norms of the weights, using two tuning parameters, $\lambda_1$ and $\lambda_2$.

The elastic net regularized cost function for least squares problems is given as:

$$E(\mathbf{w}) = \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert_2^2 + \lambda_1 \sum_{j=1}^{D} \lvert w_j \rvert + \lambda_2 \sum_{j=1}^{D} w_j^2$$

This form of regularization is often used to counteract the limitations of the $L_1$ and $L_2$ penalties.

With highly-correlated features:

- $L_1$ regularization generally picks one and effectively discards the others by setting their weights to zero. However, it is often difficult to predict which feature will be chosen.
- $L_2$ regularization shrinks the weights of highly-correlated features towards one another.

Elastic net is a compromise between the two that attempts to shrink and do a **sparse selection** simultaneously.

In regards to penalty:

- $L_1$ regularization penalizes weights more uniformly.
- $L_2$ regularization penalizes higher-valued weights more than the smaller ones.

Once again, elastic net acts as a compromise between this property of the two regularization methods.
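The elastic net cost is straightforward to evaluate directly. The sketch below assumes the bias (the first weight) is excluded from both penalties, in line with the lasso and ridge conventions; the function name and toy values are made up for illustration.

```python
import numpy as np

def elastic_net_cost(Phi, y, w, lam1, lam2):
    """RSS plus a weighted sum of the L1 and squared L2 penalties (bias w[0] excluded)."""
    rss = np.sum((y - Phi @ w) ** 2)
    l1_penalty = np.sum(np.abs(w[1:]))
    l2_penalty = np.sum(w[1:] ** 2)
    return float(rss + lam1 * l1_penalty + lam2 * l2_penalty)

# Hypothetical data fitted exactly by w = [0, 1], so only the penalties contribute.
Phi = np.array([[1.0, 1.0],
                [1.0, 2.0]])
y = np.array([1.0, 2.0])
w = np.array([0.0, 1.0])

cost = elastic_net_cost(Phi, y, w, lam1=0.5, lam2=0.25)
lasso_only = elastic_net_cost(Phi, y, w, lam1=0.5, lam2=0.0)   # reduces to the lasso cost
ridge_only = elastic_net_cost(Phi, y, w, lam1=0.0, lam2=0.25)  # reduces to the ridge cost
print(cost, lasso_only, ridge_only)
```

Setting either tuning parameter to zero recovers one of the two standalone penalties, which is the sense in which elastic net sits between them.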

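The main requirement for a linear regression model is that it must be linear in the weights. However, it is not necessary for the model to be linear in the explanatory variables: it can be defined in terms of any non-linear functions of the explanatory variables too.

This allows us to more generally define a linear model as:

$$f(\mathbf{x}) = \sum_{j=0}^{M} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$

Where:

- Each $\phi_j$ is called a **basis function**, which is a function of the current input $\mathbf{x}$.
- $\boldsymbol{\phi}$ is a vector-valued function such that $\boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x})\right)^\top$.
- $\phi_0(\mathbf{x}) = 1$ by convention, so that the bias term is not affected by the basis function.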

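The simplest case of a linear regression model, which we have seen before, is where the output variable may be modeled as:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D$$

This regression model can be defined by the identity basis function:

$$\boldsymbol{\phi}(\mathbf{x}) = \left(1, x_1, x_2, \ldots, x_D\right)^\top$$

Where: The individual basis functions would be $\phi_j(\mathbf{x}) = x_j$, a function only of the feature with the same index as the basis function.

This is more clearly seen by looking at the design matrix of this basis function when applied to the training set $\mathcal{D}$:

$$\boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(\mathbf{x}^{(1)})^\top \\ \vdots \\ \boldsymbol{\phi}(\mathbf{x}^{(N)})^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$

Standard linear regression with the identity basis function is powerful for modelling an output variable which is assumed to be linearly dependent upon the explanatory variables $x_1, \ldots, x_D$.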

However, it is not always the case that the explanatory variables have a linear relationship with the output variable.

**Figure 8**: Example of a relationship that cannot accurately be modeled with a hyperplane. *This data would be more accurately represented with a polynomial regression model.* (source)

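In this case, it may be more appropriate to assume a different relationship, such as a polynomial one. For a single feature $x$, the output variable can be modeled as a degree-$P$ polynomial, a linear combination of the **monomials** of that feature:

$$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_P x^P$$

Which can be defined with the following basis function:

$$\boldsymbol{\phi}(x) = \left(1, x, x^2, \ldots, x^P\right)^\top$$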
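Polynomial regression can be sketched by building the design matrix from the monomials of the input and then solving the same least squares problem as before. This example assumes a single input feature and a degree of 2; the helper name `polynomial_basis` and the data are hypothetical.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Map a 1-D input array to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical noise-free data from the quadratic y = 1 - 2x + 0.5x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x ** 2

Phi = polynomial_basis(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # approximately [1, -2, 0.5]
```

The model is non-linear in `x` but still linear in the weights, so ordinary least squares applies unchanged.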

Although each of the identity and polynomial basis functions above operates on a single feature, $x_j$, this isn't a strict requirement of basis functions. Remember that each basis function is a function of the entire feature vector $\mathbf{x}$, and can therefore depend upon the values of other features. This gives rise to what are known as **multivariate basis functions**.

Example: A basis function with multivariate inputs.
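One possible multivariate basis function is an interaction term, $\phi_3(\mathbf{x}) = x_1 x_2$, which depends on two features at once. This is a hypothetical illustration and not necessarily the example from the original text. The sketch below fits data whose output depends on that product.

```python
import numpy as np

def interaction_basis(X):
    """Rows [1, x_1, x_2, x_1 * x_2]: the last basis function uses two features at once."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2])

# Hypothetical data where the output depends on the product of the two features.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 3.0]])
y = 1.0 + X[:, 0] * X[:, 1]

Phi = interaction_basis(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # approximately [1, 0, 0, 1]
```

A hyperplane in the raw features could not fit this data exactly, but adding the interaction column makes it linear in the weights again.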
