Contents: Linear regression · Training data · Learning task · Error functions · Residuals · Ordinary least squares (OLS) · Bias term · Design matrix · Solving the OLS problem (with the design matrix) · Regularization · L1 and L2 regularized least squares problems · Elastic net regularization · Comparison with standalone L1 and L2 regularization · Basis functions · The identity basis function · Polynomial regression · Multivariate basis functions · Resources

Linear regression

Linear regression addresses the supervised learning problem of approximating the relationship between the input variables and output variables of some data.

Training data

Training data for linear regression problems comes in the form:

$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$$

Where:

- $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector of $d$ input values, and
- $y_i \in \mathbb{R}$ is its corresponding output value.

This is typically represented as an $n \times d$ matrix $X$ of feature vectors and a corresponding output vector $\mathbf{y} \in \mathbb{R}^n$.

Each of the $d$ columns of $X$ represents a feature, which is simply a random variable (sometimes called an explanatory variable). In practice, these features are often specific attributes of the system or object being modeled, e.g. height, rent, temperature.

The accompanying output variable $y_i$ contains the output for each feature vector $\mathbf{x}_i$, where the output is assumed to be a linear combination of its feature values $x_{i1}, \dots, x_{id}$—that is:

$$y_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_d x_{id}$$

Where: $w_j$ is an arbitrary coefficient, referred to as a weight.

If we let $\mathbf{w} = (w_1, w_2, \dots, w_d)^\top$, then we can see that the expression for $y_i$ as a sum of products can be represented as a dot product:

$$y_i = \mathbf{w}^\top \mathbf{x}_i$$

Where: $\mathbf{w}$ is referred to as the weight vector.

Which further allows us to represent the entire output vector $\mathbf{y}$ more concisely, as:

$$\mathbf{y} = X\mathbf{w}$$

Note: Remember that $\mathbf{w}$ is the same for each feature vector, since it's trying to model a linear relation between the features and the output—not a separate relation per feature vector!
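To make the matrix form concrete, here is a minimal NumPy sketch (the data values are made up purely for illustration) showing that computing every $y_i = \mathbf{w}^\top \mathbf{x}_i$ at once is just the matrix-vector product $\mathbf{y} = X\mathbf{w}$:

```python
import numpy as np

# Toy training inputs: n = 3 feature vectors, each with d = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# A single weight vector, shared by every feature vector.
w = np.array([0.5, -1.0])

# Each output y_i is the dot product w . x_i; stacking them gives y = X w.
y = X @ w
print(y)  # [-1.5 -2.5 -3.5]
```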

Learning task

Suppose we have training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.

Linear regression aims to learn a weight vector $\mathbf{w}$ from this data in order to create a regression line, $\hat{y} = \mathbf{w}^\top \mathbf{x}$. This line can then be used to estimate the value of the output variable $\hat{y}$ for some new unlabeled feature vector $\mathbf{x}$.

Note: The regression line is simply a $d$-dimensional hyperplane in the $(d+1)$-dimensional space formed by the features and the output. This is often called a line-of-best-fit.


Figure 1: Search for an optimal regression line in one-dimensional feature space. (source)

Error functions

In order to determine which of the infinitely many regression lines has the least error, we need to introduce error functions.

An error function (also called a cost function) for linear regression should act as an aggregate measure of how far the output values of the training samples are from the values that would be predicted by the regression line—that is, for some regression line with fixed $\mathbf{w}$, how far the predicted outputs $\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i$ (lying on the hyperplane) are from the actual output values $y_i$.

The optimum regression line should have the minimal error—which in turn makes this a minimization problem.

Residuals

A residual is the error in a single result—how far an individual predicted output $\hat{y}_i$ is from the actual output $y_i$:

$$r_i = y_i - \hat{y}_i$$

The most commonly used residuals are vertical-offset residuals, which only consider the distance along the axis of the output variable.


Figure 2: Vertical and horizontal offset residuals. (source)


Figure 3: Vertical-offset residuals in a two-dimensional feature space. (source)

Ordinary least squares (OLS)

Ordinary least squares (OLS) is a form of regression analysis that uses the sum-squared error (or residual sum of squares (RSS)) function as a cost function.

Sum-squared error is simply the sum of the squared (vertical-offset) residual lengths:

$$E(\mathbf{w}) = \sum_{i=1}^{n}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$

Note: We square the lengths because the subtraction may result in a negative number.

Which can be represented in matrix notation as the squared norm of the residual vector $\mathbf{y} - X\mathbf{w}$, giving us:

$$E(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^\top(\mathbf{y} - X\mathbf{w}) = \lVert \mathbf{y} - X\mathbf{w} \rVert_2^2$$

Then the optimum weight vector $\mathbf{w}^*$ is given by minimising this cost function:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert_2^2$$
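As a minimal sketch of this cost function (assuming made-up toy data and, for now, no bias term), the sum-squared error of a candidate weight vector can be computed directly from the residuals:

```python
import numpy as np

def sum_squared_error(X, y, w):
    """Sum of squared vertical-offset residuals, i.e. ||y - Xw||^2."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# Toy one-feature data (values chosen only for illustration).
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 2.3, 2.9])

print(sum_squared_error(X, y, np.array([1.0])))  # about 0.11, a good candidate
print(sum_squared_error(X, y, np.array([1.5])))  # about 3.21, a much worse candidate
```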

Bias term

Despite there being infinitely many possible values for $\mathbf{w}$, the hyperplane $\hat{y} = \mathbf{w}^\top \mathbf{x}$ is restricted to passing through the origin ($\hat{y} = 0$ when $\mathbf{x} = \mathbf{0}$)—as there is no intercept term in the model.

This can often be a very limiting restriction, as it essentially means that the hyperplane cannot be translated away from the origin. This may make it difficult to create a good regression line from the training data.

Example: Due to the origin restriction, it is not possible to create an optimal regression line for the following collection of one-dimensional feature vectors.

Figure 4: Sub-optimal origin-restricted regression lines (y = 1.1x, y = 1.5x, y = 2x) for one-dimensional feature vectors.

If an intercept term $b$ (called a bias) is incorporated into the regression line, giving $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$, it becomes possible for the regression line to move around in space more freely.

In practice, we will usually want a bias for our linear model anyway.

Example: If we are modelling the price of a house, $\hat{y}$, with explanatory variables $x_1, \dots, x_d$, we don't want the predicted price to be $0$ whenever all of the explanatory variables are valued at $0$.

Instead, when the explanatory variables are all zero, we want the default house price to be some sensible base value. The bias term allows us to capture this information by defining the value of the regression line when the explanatory variables are all zero—in other words, the point of intercept with the $\hat{y}$ axis.

Figure 5: A better-fitting regression line, y = 0.6x + 4, as a result of the bias term.

Design matrix

In order for the bias term introduced above to be consistent with the previous linear algebra, a few changes need to be made to the representations of the matrices and vectors, to ensure that the dot products remain well-defined.

With the introduction of the bias term $b$, an individual training example can be represented linearly as:

$$y_i = b + w_1 x_{i1} + w_2 x_{i2} + \dots + w_d x_{id}$$

Prior to the introduction of the bias term, we could express this as a dot product. However, the term $b$ in the expression above is no longer attached to the value of any explanatory variable $x_{ij}$.

We can address this issue by extending the feature vector to have $1$ as its first element, $x_{i0} = 1$. Let this new vector be $\tilde{\mathbf{x}}_i$:

$$\tilde{\mathbf{x}}_i = (1, x_{i1}, x_{i2}, \dots, x_{id})^\top$$

If we express $y_i$ as a linear combination of the explanatory variables as defined in $\tilde{\mathbf{x}}_i$, we can now say:

$$y_i = b \cdot 1 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_d x_{id}$$

Further, if we treat the bias term as an extension of the weight vector, $\tilde{\mathbf{w}} = (b, w_1, \dots, w_d)^\top$, we can express this as a dot product:

$$y_i = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}_i$$

The modified feature vectors can be represented by a new $n \times (d+1)$ matrix—known as the design matrix, $\tilde{X}$:

$$\tilde{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix}$$

Using the previous result along with matrix algebra, we can represent the output vector $\mathbf{y}$ as:

$$\mathbf{y} = \tilde{X}\tilde{\mathbf{w}}$$
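A small sketch of this bookkeeping in NumPy (with illustrative values): the design matrix is just the original feature matrix with a column of ones prepended, so that the bias becomes an ordinary weight:

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones to X so the bias term pairs with a constant feature."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

# Toy feature matrix with n = 3 samples and d = 2 features.
X = np.array([[2.0, 3.0],
              [4.0, 5.0],
              [6.0, 7.0]])

print(design_matrix(X))
# [[1. 2. 3.]
#  [1. 4. 5.]
#  [1. 6. 7.]]
```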

Solving the OLS problem (with the design matrix)

We previously saw that the optimal $\mathbf{w}^*$ with no bias term is given by solving the following minimization problem for OLS:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert_2^2$$

By extending $\mathbf{w}$ with the bias term to form $\tilde{\mathbf{w}}$, and using the design matrix $\tilde{X}$ instead of $X$, this becomes:

$$\tilde{\mathbf{w}}^* = \arg\min_{\tilde{\mathbf{w}}} \lVert \mathbf{y} - \tilde{X}\tilde{\mathbf{w}} \rVert_2^2$$

Minimization for OLS has a closed-form (analytical) solution, which can be derived by taking partial derivatives with respect to $\tilde{\mathbf{w}}$ and setting them to $0$. This leads to the following analytical solution for the optimal weight vector $\tilde{\mathbf{w}}^*$:

$$\tilde{\mathbf{w}}^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \mathbf{y}$$

Where: $(\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top$ is referred to as the pseudo-inverse of $\tilde{X}$.

This is not an actual inverse matrix, since $\tilde{X}$ is not invertible: it is not square, having shape $n \times (d+1)$.
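A minimal sketch of the closed-form solution, assuming NumPy and made-up data (the design matrix's column of ones is hard-coded here). np.linalg.pinv computes the pseudo-inverse, which coincides with $(\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top$ when the columns of $\tilde{X}$ are linearly independent:

```python
import numpy as np

def ols_fit(X_tilde, y):
    """Closed-form OLS: w* = pinv(X~) y, where pinv(X~) = (X~^T X~)^(-1) X~^T."""
    return np.linalg.pinv(X_tilde) @ y

# Toy design matrix (bias column of ones + one feature) and outputs.
X_tilde = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0],
                    [1.0, 4.0]])
y = np.array([4.7, 5.1, 5.9, 6.3])

print(ols_fit(X_tilde, y))  # approximately [4.1, 0.56]: bias first, then the slope
```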

Note: Although OLS has an analytical solution, it is also possible to use iterative optimization methods such as gradient descent, stochastic gradient descent, BFGS, etc. to minimize the cost function. However, these methods only approximate the minimum and are not guaranteed to converge without careful tuning (e.g. of the learning rate).
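For completeness, here is a rough sketch of one such iterative approach: plain batch gradient descent on the (mean) squared error. The learning rate and iteration count are arbitrary choices and would need tuning for real data:

```python
import numpy as np

def ols_gradient_descent(X_tilde, y, lr=0.05, n_iters=10_000):
    """Minimise ||y - Xw||^2 by batch gradient descent (a sketch, not tuned)."""
    n, d = X_tilde.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = -2.0 / n * X_tilde.T @ (y - X_tilde @ w)  # gradient of the mean squared error
        w -= lr * grad
    return w

# Same toy data as the closed-form example above.
X_tilde = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([4.7, 5.1, 5.9, 6.3])

print(ols_gradient_descent(X_tilde, y))  # converges towards [4.1, 0.56]
```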

Regularization

Regularization refers to various methods used to penalize specific terms in a cost function in order to prevent overfitting to the training data. This is done by adding a regularization term (also called a regularizer).

Regularization discourages or decreases the complexity of a linear model.


Figure 6: Simplification of a polynomial regression model's complexity as a result of regularization.
Blue represents the unregularized and overfitted model, and green represents a regularized model which generalizes better. (source)

For least squares problems, the regularized cost function looks like:

$$E_{\text{reg}}(\tilde{\mathbf{w}}) = \lVert \mathbf{y} - \tilde{X}\tilde{\mathbf{w}} \rVert_2^2 + \lambda R(\tilde{\mathbf{w}})$$

Where:

- $R$ is the regularization term (regularizer), a function that penalizes large or complex weight vectors, and
- $\lambda \geq 0$ is a tuning parameter that controls the strength of the regularization.

Regularization essentially forces the minimization of a cost function within the constraints of the provided regularization term.

Figure 7: Graphical depiction of the constraint placed on the minimization of the RSS cost function as a result of L2 regularization in a two-dimensional feature space (panels: unregularized OLS cost function minimization; regularized OLS cost function minimization).
This constraint is in place because we now have to minimize a combined sum of the RSS and the regularization term. To solve this minimization problem we must get as close as possible to the minimum of the RSS contour, whilst still remaining in the constrained region imposed by the regularization term (a circle in this case, but an n-sphere in general). (source)

L1 and L2 regularized least squares problems

Regularization terms in the form of the L1 (Manhattan) and L2 (Euclidean) norms are commonly used for linear regression—these norms form the basis for L1 and L2 regularization.

The table below provides information about both of these regularization methods when used to solve least squares problems:

 
| | Lasso (L1) | Ridge (L2) |
| --- | --- | --- |
| Regularization term | $\lVert \mathbf{w} \rVert_1 = \sum_{j=1}^{d} \lvert w_j \rvert$ (the L1 norm) | $\lVert \mathbf{w} \rVert_2^2 = \sum_{j=1}^{d} w_j^2$ (the squared L2 norm) |
| Regularized least squares | $\tilde{\mathbf{w}}^* = \arg\min_{\tilde{\mathbf{w}}} \lVert \mathbf{y} - \tilde{X}\tilde{\mathbf{w}} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1$ | $\tilde{\mathbf{w}}^* = \arg\min_{\tilde{\mathbf{w}}} \lVert \mathbf{y} - \tilde{X}\tilde{\mathbf{w}} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$ |
| Analytic solution | None—use iterative optimization methods. | $\tilde{\mathbf{w}}^* = (\tilde{X}^\top \tilde{X} + \lambda I)^{-1} \tilde{X}^\top \mathbf{y}$ |
| Affected weights | All weights—uniformly. | All weights—but low-valued weights are penalized less, since we are squaring. Conversely, large weights face more penalty. |
| When to use | When there are many features which are irrelevant to the output variable—since it can shrink them to zero, completely disregarding them. Also works well when the number of instances is far greater than the number of features ($n \gg d$). | When all (or most) features are relevant to the output variable—since it cannot shrink them to zero, meaning that all features will have some impact. Also works better when there is high collinearity between features. |
| Constraint region visualization | Diamond-shaped constraint region (see note below). | Circular constraint region (see note below). |

Note: Observe that we don't penalize the bias term $b$, so the sums in the regularization terms run over the non-bias weights $w_1, \dots, w_d$ only.
Note: In the constraint region visualizations, the regularization constraint region is depicted in red, and the RSS contour is depicted in blue.
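A minimal sketch of the ridge closed form, assuming NumPy and illustrative data; the bias weight is left unpenalized (as noted above) by zeroing its entry in the penalty matrix:

```python
import numpy as np

def ridge_fit(X_tilde, y, lam=1.0):
    """Closed-form ridge regression: w* = (X~^T X~ + lam*I)^(-1) X~^T y."""
    d = X_tilde.shape[1]
    penalty = lam * np.eye(d)
    penalty[0, 0] = 0.0  # do not penalize the bias term
    return np.linalg.solve(X_tilde.T @ X_tilde + penalty, X_tilde.T @ y)

# Toy design matrix (bias column + one feature) and outputs.
X_tilde = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([4.7, 5.1, 5.9, 6.3])

print(ridge_fit(X_tilde, y, lam=0.0))   # lam = 0 recovers the OLS solution
print(ridge_fit(X_tilde, y, lam=10.0))  # larger lam shrinks the slope towards zero
```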

Elastic net regularization

Elastic net is a compromise regularization method that uses a regularization term which linearly combines the L1 and L2 norms of the weights, using two tuning parameters, $\lambda_1$ and $\lambda_2$.

The elastic net regularized cost function for least squares problems is given as:

$$E_{\text{EN}}(\tilde{\mathbf{w}}) = \lVert \mathbf{y} - \tilde{X}\tilde{\mathbf{w}} \rVert_2^2 + \lambda_1 \lVert \mathbf{w} \rVert_1 + \lambda_2 \lVert \mathbf{w} \rVert_2^2$$
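As a small sketch (assuming NumPy, a design matrix with a leading bias column, and an unpenalized bias weight, as with lasso and ridge), the elastic net cost can be evaluated directly:

```python
import numpy as np

def elastic_net_cost(X_tilde, y, w, lam1=1.0, lam2=1.0):
    """RSS + lam1 * ||w||_1 + lam2 * ||w||_2^2, excluding the bias from both penalties."""
    residuals = y - X_tilde @ w
    rss = float(residuals @ residuals)
    l1 = float(np.sum(np.abs(w[1:])))
    l2 = float(np.sum(w[1:] ** 2))
    return rss + lam1 * l1 + lam2 * l2

# Toy design matrix, outputs, and a candidate weight vector (illustrative only).
X_tilde = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([4.7, 5.1, 5.9])
print(elastic_net_cost(X_tilde, y, np.array([4.1, 0.6]), lam1=0.5, lam2=0.5))
```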

Comparison with standalone L1 and L2 regularization

This form of regularization is often used to counteract the limitations of the standalone L1 and L2 penalties: lasso can behave erratically when features are highly correlated (tending to pick one arbitrarily), while ridge cannot shrink weights to exactly zero, and elastic net combines the sparsity of the former with the stability of the latter.

Basis functions

The main requirement for a linear regression model is that it must be linear in the weights. However, it is not necessary for the model to be linear in the explanatory variables—the inputs can be defined by non-linear functions of the explanatory variables too.

This allows us to more generally define a linear model as:

$$\hat{y} = \sum_{j=0}^{m} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$

Where:

- $\phi_j$ is a basis function, i.e. any (possibly non-linear) function of the feature vector $\mathbf{x}$, and
- $\phi_0(\mathbf{x}) = 1$ by convention, so that $w_0$ acts as the bias term.

The identity basis function

The simplest case of a linear regression model, which we have seen before, is where the output variable may be modeled as:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$

This regression model can be defined by the identity basis function:

$$\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, x_2, \dots, x_d)^\top$$

Where: the individual basis functions would be $\phi_j(\mathbf{x}) = x_j$—a function only of the feature with the same index as the basis function.

This is more clearly seen by looking at the design matrix of this basis function when applied to the training set $\mathcal{D}$:

$$\Phi = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix} = \tilde{X}$$

Polynomial regression

Standard linear regression with the identity basis function is powerful for modelling an output variable which is assumed to be linearly dependent upon the explanatory variables $x_1, \dots, x_d$.

However, it is not always the case that the explanatory variables have a linear relationship with the output variable.


Figure 8: Example of a relationship that cannot accurately be modeled with a hyperplane.
This data would be more accurately represented with a polynomial regression model. (source)

In this case, it may be more appropriate to assume a different relationship, such as a polynomial one. The output variable can then be modeled as a $k$-degree polynomial—a linear combination of the monomials of each feature. For a single feature $x$, this is:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_k x^k$$

Which can be defined with the following basis function:

$$\boldsymbol{\phi}(x) = (1, x, x^2, \dots, x^k)^\top, \qquad \phi_j(x) = x^j$$
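A brief sketch of polynomial regression for a single feature, assuming NumPy and made-up data that roughly follows $y = 1 + 2x - x^2$; the basis expansion turns the problem back into ordinary linear least squares:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a single feature x to the monomial basis (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

# Toy data roughly following y = 1 + 2x - x^2 (values are illustrative only).
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-2.1, 1.0, 2.1, 0.9, -1.9])

Phi = polynomial_design_matrix(x, degree=2)
w = np.linalg.pinv(Phi) @ y  # ordinary least squares on the transformed features
print(w)  # approximately [1, 2, -1]
```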

Multivariate basis functions

Despite the identity and polynomial basis functions only operating on the feature $x_j$ with the same index, this isn't a strict requirement of basis functions—remember that each basis function is a function of the entire feature vector $\mathbf{x}$, and can therefore be dependent upon the values of other features. This gives rise to what are known as multivariate basis functions.

Example: A basis function with multivariate inputs, such as $\phi_j(\mathbf{x}) = x_1 x_2$ (a product of two different features).
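A short sketch of such a basis, assuming NumPy and two-feature inputs; the interaction term $x_1 x_2$ depends on more than one feature, which is what makes the basis multivariate (the particular choice of functions here is illustrative only):

```python
import numpy as np

def multivariate_basis(X):
    """Map two-feature inputs to the basis (1, x1, x2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2])

# Toy two-feature data (illustrative values only).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])

print(multivariate_basis(X))
# rows: [1, 1, 2, 2], [1, 2, 0.5, 1], [1, 3, 1, 3]
```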

Resources