# Linear regression

Linear regression addresses the supervised learning problem of approximating the relationship between the input variables and output variables of some data.

## Training data

Training data for linear regression problems comes in the form:

$$\mathcal{D}_\text{train}=\set{\l(\b{x}^{(i)},y^{(i)}\r)}_{i=1}^N$$

Where:

• $\b{x}^{(i)}\in\R^D$, a $D$-dimensional vector of real numbers, $\b{x}^{(i)}=\colv{x^{(i)}}{1}{D}$.
• $y^{(i)}\in\R$, the real-valued output for $\b{x}^{(i)}$.

This is typically represented as an $N\times D$ matrix of feature vectors $\b{X}=\l(\b{x}^{(1)},\ldots,\b{x}^{(N)}\r)^\T$, whose $i^\text{th}$ row is $\b{x}^{(i)^\T}$, and a corresponding output vector $\b{y}=\l(y^{(1)},\ldots,y^{(N)}\r)^\T$.

Each of the columns $\b{x}_j=\l(x_j^{(1)},\ldots,x_j^{(N)}\r)^\T$ of $\b{X}$ represents a feature, which is simply a random variable (sometimes called an explanatory variable). In practice, these features are often specific attributes of the system or object being modeled, e.g. height, rent, temperature.

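The accompanying output vector $\b{y}$ contains the output for each feature vector, where each output $y^{(i)}$ is assumed to be a linear combination of the feature values in $\b{x}^{(i)}$, that is:

$$y^{(i)}=\theta_1x_1^{(i)}+\theta_2x_2^{(i)}+\cdots+\theta_Dx_D^{(i)}=\sum_{j=1}^D\theta_jx_j^{(i)}$$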

Where: $\theta_j\in\R$ is an arbitrary coefficient, referred to as a weight.

If we let $\bs{\theta}=\colv{\theta}{1}{D}$, then the expression for $y^{(i)}$ as a sum of products can be represented as a dot product:

$$y^{(i)}=\bs{\theta}^\T\b{x}^{(i)}$$

Where: $\bs{\theta}$ is referred to as the weight vector.

Which further allows us to represent $\b{y}$ more concisely, as:

$$\b{y}=\b{X}\bs{\theta}$$

Note: Remember that $\bs{\theta}$ is the same for every feature vector, since each weight $\theta_j$ belongs to a feature (a column of $\b{X}$), not to an individual feature vector (a row of $\b{X}$).
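
To make the matrix form concrete, the sketch below computes $\b{X}\bs{\theta}$ one row at a time, using the same weight vector for every feature vector. The data values and the helper names `dot` and `predict` are made up for illustration.

```python
# Toy data (assumed values): N = 3 feature vectors with D = 2 features each.
X = [
    [1.0, 2.0],
    [0.0, 1.0],
    [3.0, -1.0],
]
theta = [2.0, 0.5]  # one weight per feature, shared by every row of X


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predict(X, theta):
    """Compute X @ theta row by row: one dot product per feature vector."""
    return [dot(row, theta) for row in X]


y = predict(X, theta)
print(y)
```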

Suppose we have training data $\mathcal{D}_\text{train}=\set{\l(\b{x}^{(i)},y^{(i)}\r)}_{i=1}^N$.

Linear regression aims to learn from this data in order to create a regression line, $\hat{\b{y}}$. This line can then be used to estimate the value of the output variable $y^{(k)}$ for some new unlabeled feature vector $\b{x}^{(k)}$.

Note: The regression line is simply a $D$-dimensional hyperplane in the $(D+1)$-dimensional space formed by the $D$ features and the output variable. This is often called a line-of-best-fit.

Figure 1: Search for an optimal regression line in one-dimensional feature space. (source)

### Error functions

In order to determine which of the infinitely many regression lines has the least error, we need to introduce error functions.

An error function (also called a cost function) for linear regression should act as an aggregate measure of how far the output values of the training samples are from the values predicted by the regression line. That is, for some regression line with fixed $\bs{\theta}$, it measures how far the predicted outputs $\hat{\b{y}}=\b{X}\bs{\theta}$ (lying on the hyperplane) are from the actual output values $\b{y}$.

The optimum regression line should have the minimal error—which in turn makes this a minimization problem.

#### Residuals

A residual is the error in a single result—how far an individual predicted $\hat{y}^{(i)}$ is from the actual output $y^{(i)}$.

The most commonly used residuals are vertical offsets, which only consider the distance along the axis of the output variable.

Figure 2: Vertical and horizontal offset residuals. (source)

Figure 3: Vertical-offset residuals in a two-dimensional feature space. (source)

#### Ordinary least squares (OLS)

Ordinary least squares (OLS) is a form of regression analysis that uses the sum-squared error (or residual sum of squares (RSS)) function as a cost function.

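Sum-squared error is simply the sum of the squared (vertical-offset) residual lengths:

$$\text{RSS}(\bs{\theta})=\sum_{i=1}^N\l(y^{(i)}-\hat{y}^{(i)}\r)^2$$

Note: We square the residuals because the subtraction may result in a negative number; squaring stops positive and negative residuals from cancelling each other out.

Which can be represented as $\l(\b{y}-\hat{\b{y}}\r)^\T\l(\b{y}-\hat{\b{y}}\r)$ in matrix notation, giving us:

$$\text{RSS}(\bs{\theta})=\l(\b{y}-\b{X}\bs{\theta}\r)^\T\l(\b{y}-\b{X}\bs{\theta}\r)$$

Then the optimum weight vector $\hat{\bs{\theta}}$ is given by minimizing this cost function:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\min}\;\l(\b{y}-\b{X}\bs{\theta}\r)^\T\l(\b{y}-\b{X}\bs{\theta}\r)$$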
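
As a rough illustration of how the sum-squared error ranks candidate regression lines, the sketch below evaluates it for two weight vectors on a tiny made-up dataset. The function name `rss` and the data are assumptions for this example only.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def rss(X, y, theta):
    """Residual sum of squares: sum of squared vertical-offset residuals."""
    return sum((y_i - dot(x_i, theta)) ** 2 for x_i, y_i in zip(X, y))


# Toy one-dimensional data (assumed values), no bias term.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 7.0]

rss_a = rss(X, y, [2.0])  # predictions 2, 4, 6
rss_b = rss(X, y, [3.0])  # predictions 3, 6, 9

print(rss_a, rss_b)
```

The candidate with the smaller value is the better fit under this cost function; minimizing over all $\bs{\theta}$ picks the best one.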

### Bias term

Despite there being infinitely many possible values for $\bs{\theta}$, the hyperplane is restricted to passing through the origin $\b{0}$—as there is no intercept term in $\hat{\b{y}}=\b{X}\bs{\theta}$.

This can often be a very limiting restriction, as it essentially means that the hyperplane cannot be translated along the output axis. This may make it difficult to fit a good regression line to the training data.

Example: Due to the origin restriction, it is not possible to create an optimal regression line for the following collection of one-dimensional feature vectors.

Figure 4: Sub-optimal origin-restricted regression lines for one-dimensional feature vectors.

If an intercept term $\theta_0$ (called a bias) is incorporated into the regression line $\hat{\b{y}}=\b{X}\bs{\theta}+\theta_0$, it becomes possible for the regression line to move around in space more freely.

In practice, we will usually want a bias for our linear model anyway.

Example: If we are modelling the price of a house, $y^{(i)}$, with explanatory variables valued at $x_1^{(i)},\ldots,x_D^{(i)}$, we don't want the house price to be $0$ when all of the explanatory variables are valued at $0$.

For example, when the explanatory variables are all zero, we may want the default house price to be \$1000. The bias term allows us to capture this information by defining the value of the regression line when the explanatory variables are all zero; in other words, the point of intercept with the output axis.

Figure 5: A better-fitting regression line as a result of the bias term.
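
The effect of the bias term can be seen numerically on a small example. The sketch below fits one-dimensional data with and without an intercept, using the standard closed-form least squares expressions for a single feature. The data and function names are made up for illustration.

```python
def fit_through_origin(xs, ys):
    """OLS slope for the origin-restricted model y = theta_1 * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_bias(xs, ys):
    """OLS (intercept, slope) for the model y = theta_0 + theta_1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


def rss(xs, ys, intercept, slope):
    """Residual sum of squares of the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Toy data (assumed values) lying on the line y = 10 + x, far from the origin.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [10.0, 11.0, 12.0, 13.0]

slope_origin = fit_through_origin(xs, ys)
intercept, slope = fit_with_bias(xs, ys)

rss_origin = rss(xs, ys, 0.0, slope_origin)
rss_bias = rss(xs, ys, intercept, slope)
print(rss_origin, rss_bias)
```

On this data the origin-restricted line is forced to tilt steeply to get near the points, while the line with a bias recovers the intercept and fits far better.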

### Design matrix

In order for the bias term introduced above to be consistent with the previous linear algebra, a few changes need to be made to the representations of matrices and vectors, to ensure that dot products remain well-defined.

With the introduction of the bias term $\theta_0$, an individual training example can be represented linearly as:

$$y^{(i)}=\theta_0+\theta_1x_1^{(i)}+\theta_2x_2^{(i)}+\cdots+\theta_Dx_D^{(i)}$$

Prior to the introduction of the bias term, we could express this as a dot product. However, the $\theta_0$ term in the expression above is no longer attached to the value of any explanatory variable $x_{j}^{(i)}$.

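We can address this issue by extending the feature vector $\b{x}^{(i)}$ to have $1$ as its first element, $x_0^{(i)}$. Let this new vector be $\bs{\phi} \! \l(\b{x}^{(i)}\r)$:

$$\bs{\phi} \! \l(\b{x}^{(i)}\r)=\l(1,x_1^{(i)},\ldots,x_D^{(i)}\r)^\T$$

If we express $y^{(i)}$ as a linear combination of the explanatory variables as defined in $\bs{\phi} \! \l(\b{x}^{(i)}\r)$, we can now say:

$$y^{(i)}=\theta_0x_0^{(i)}+\theta_1x_1^{(i)}+\cdots+\theta_Dx_D^{(i)}=\sum_{j=0}^D\theta_jx_j^{(i)},\qquad x_0^{(i)}=1$$

Further, if we treat the bias term $\theta_0$ as an extension of the weight vector $\bs{\theta}$, so that $\bs{\theta}=\underbrace{\colv{\theta}{0}{D}}_{D+1}$, we can express this as a dot product:

$$y^{(i)}=\bs{\theta}^\T\bs{\phi} \! \l(\b{x}^{(i)}\r)$$

The modified feature vectors $\bs{\phi} \! \l(\b{x}^{(i)}\r)$ can be represented by a new matrix, known as the design matrix, $\bs{\Phi}$:

$$\bs{\Phi}=\begin{pmatrix}\bs{\phi} \! \l(\b{x}^{(1)}\r)^\T\\\vdots\\\bs{\phi} \! \l(\b{x}^{(N)}\r)^\T\end{pmatrix}=\begin{pmatrix}1&x_1^{(1)}&\cdots&x_D^{(1)}\\\vdots&\vdots&\ddots&\vdots\\1&x_1^{(N)}&\cdots&x_D^{(N)}\end{pmatrix}$$

Using the previous result along with matrix algebra, we can represent the output vector $\b{y}$ as:

$$\b{y}=\bs{\Phi}\bs{\theta}$$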
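
The construction above can be sketched in a few lines: prepend a constant $1$ to every feature vector, then take dot products with a weight vector whose first entry is the bias. The function name `design_matrix` and the numbers are assumptions for illustration.

```python
def design_matrix(X):
    """Prepend a constant 1 to every feature vector so theta_0 acts as the bias."""
    return [[1.0] + list(row) for row in X]


# Toy data (assumed values): N = 2 feature vectors with D = 2 features each.
X = [[2.0, 3.0], [4.0, 5.0]]
theta = [10.0, 1.0, 2.0]  # theta_0 (bias) first, then theta_1 and theta_2

Phi = design_matrix(X)

# Each prediction is the dot product of a row of Phi with theta.
y_hat = [sum(p * t for p, t in zip(row, theta)) for row in Phi]
print(Phi)
print(y_hat)
```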

#### Solving the OLS problem (with the design matrix)

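We previously saw that the optimal $\hat{\bs{\theta}}$ with no bias term is given by solving the following minimization problem for OLS:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\min}\;\l(\b{y}-\b{X}\bs{\theta}\r)^\T\l(\b{y}-\b{X}\bs{\theta}\r)$$

By extending $\bs{\theta}$ with the bias term $\theta_0$ and using the design matrix $\bs{\Phi}$ instead of $\b{X}$, this becomes:

$$\hat{\bs{\theta}}=\underset{\bs{\theta}}{\arg\min}\;\l(\b{y}-\bs{\Phi}\bs{\theta}\r)^\T\l(\b{y}-\bs{\Phi}\bs{\theta}\r)$$

Minimization for OLS has a closed-form (analytical) solution which can be derived by taking partial derivatives with respect to $\bs{\theta}$ and setting them to $0$. This leads to the following analytical solution for the optimal weight vector $\hat{\bs{\theta}}$:

$$\hat{\bs{\theta}}=\l(\bs{\Phi}^\T\bs{\Phi}\r)^{-1}\bs{\Phi}^\T\b{y}$$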

Where: $\l(\bs{\Phi}^\T\bs{\Phi}\r)^{-1}\bs{\Phi}^\T$ is referred to as the pseudo-inverse of $\bs{\Phi}$.

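This is not an actual inverse: $\bs{\Phi}$ has shape $N\times (D+1)$, so it is generally not square and therefore not invertible. The closed-form solution also requires $\bs{\Phi}^\T\bs{\Phi}$ to be invertible, which holds when the columns of $\bs{\Phi}$ are linearly independent.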
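
A minimal sketch of the closed-form solution follows. Rather than forming the inverse explicitly, it solves the equivalent linear system $\bs{\Phi}^\T\bs{\Phi}\bs{\theta}=\bs{\Phi}^\T\b{y}$ by Gaussian elimination. The helper names `solve` and `ols_fit` and the data are assumptions for illustration; a numerical library's least squares routine would normally be preferred in practice.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or numerically close to it)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols_fit(Phi, y):
    """Return theta solving (Phi^T Phi) theta = Phi^T y."""
    m = len(Phi[0])
    gram = [[sum(row[j] * row[k] for row in Phi) for k in range(m)] for j in range(m)]
    rhs = [sum(row[j] * y_i for row, y_i in zip(Phi, y)) for j in range(m)]
    return solve(gram, rhs)


# Toy design matrix (assumed values): a column of ones plus one feature.
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x with no noise

theta_hat = ols_fit(Phi, y)
print(theta_hat)
```

Because the toy outputs lie exactly on a line, the recovered weights should match the intercept and slope used to generate them, up to floating-point error.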

Note: Although OLS has an analytical solution, it is also possible to use iterative optimization methods such as gradient descent, stochastic gradient descent, BFGS, etc. to minimize the cost function. In general, these methods are not guaranteed to converge or to find a global minimum, although the OLS cost function is convex, so any local minimum they reach is also a global one.
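
As a sketch of the iterative alternative, the code below runs plain batch gradient descent on the sum-squared error, whose gradient is $2\bs{\Phi}^\T\l(\bs{\Phi}\bs{\theta}-\b{y}\r)$. The learning rate and step count are hand-picked assumptions that happen to suit this toy data; other data would generally need different values.

```python
def gradient_descent_ols(Phi, y, learning_rate=0.02, steps=2000):
    """Minimize the RSS by batch gradient descent, starting from theta = 0."""
    n_weights = len(Phi[0])
    theta = [0.0] * n_weights
    for _ in range(steps):
        # Signed errors: Phi theta - y, one entry per training sample.
        errors = [
            sum(p * t for p, t in zip(row, theta)) - y_i
            for row, y_i in zip(Phi, y)
        ]
        # Gradient of the RSS with respect to theta: 2 * Phi^T (Phi theta - y).
        grad = [
            2.0 * sum(e * row[j] for e, row in zip(errors, Phi))
            for j in range(n_weights)
        ]
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Toy design matrix (assumed values): a column of ones plus one feature.
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x with no noise

theta_gd = gradient_descent_ols(Phi, y)
print(theta_gd)
```

With too large a learning rate the iterates diverge, which is one practical reason the closed-form solution is attractive when it is affordable.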

## Regularization

Regularization refers to various methods used to penalize specific terms in a cost function in order to prevent overfitting to the training data. This is done by adding a regularization term (also called a regularizer).

Regularization discourages or decreases the complexity of a linear model.

Figure 6: Simplification of a polynomial regression model's complexity as a result of regularization.
Blue represents the unregularized and overfitted model, and green represents a regularized model which generalizes better. (source)

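For least squares problems, the regularized cost function looks like:

$$\underbrace{\l(\b{y}-\bs{\Phi}\bs{\theta}\r)^\T\l(\b{y}-\bs{\Phi}\bs{\theta}\r)}_{\text{RSS}}+\lambda R(\bs{\theta})$$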

Where:

• $\lambda$ is a tuning parameter that controls the importance of the regularization term: a higher $\lambda$ leads to more penalization. This parameter is typically selected through cross-validation.
• $R(\bs{\theta})$ is a regularization term chosen to penalize coefficients by a specific quantity—shrinking them towards zero.

Regularization essentially forces the minimization of a cost function within the constraints of the provided regularization term.
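
The structure of a regularized cost function can be sketched as the RSS plus $\lambda$ times a pluggable regularization term. The names `regularized_cost` and `squared_l2` are hypothetical, and penalizing every weight (including the bias) is a simplifying assumption; in practice the bias is often left unpenalized.

```python
def rss(Phi, y, theta):
    """Residual sum of squares for predictions Phi @ theta."""
    return sum(
        (y_i - sum(p * t for p, t in zip(row, theta))) ** 2
        for row, y_i in zip(Phi, y)
    )


def regularized_cost(Phi, y, theta, lam, regularizer):
    """RSS plus lam times a regularization term R(theta)."""
    return rss(Phi, y, theta) + lam * regularizer(theta)


def squared_l2(theta):
    """One possible regularization term: the sum of squared weights."""
    return sum(t * t for t in theta)


# Toy data (assumed values) that theta fits exactly, so the RSS is zero.
Phi = [[1.0, 1.0], [1.0, 2.0]]
y = [3.0, 5.0]
theta = [1.0, 2.0]

cost_unregularized = regularized_cost(Phi, y, theta, 0.0, squared_l2)
cost_regularized = regularized_cost(Phi, y, theta, 0.5, squared_l2)
print(cost_unregularized, cost_regularized)
```

Even a perfect fit now carries a cost proportional to the size of its weights, which is what pushes the minimizer towards smaller coefficients.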

Figure 7: Graphical depiction of the constraint placed on the minimization of the RSS cost function as a result of L2 regularization in a two-dimensional feature space.
This constraint is in place due to the fact that we now have to minimize a combined sum of the RSS and regularization term. To solve this minimization problem we must get as close to the minimum of the RSS contour, whilst still remaining in the constrained region imposed by the regularization term (the circle in this case, but n-sphere in general). (source)

### $L_1$ and $L_2$ regularized least squares problems

Regularization terms in the form of the $L_1$ (Manhattan) and $L_2$ (Euclidean) norms are commonly used for linear regression—these norms form the basis for $L_1$ and $L_2$ regularization.
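
For reference, the two norms can be computed as follows. This is only a sketch of the norms themselves; the function names are made up for illustration.

```python
import math


def l1_norm(theta):
    """Manhattan norm: sum of absolute values of the weights."""
    return sum(abs(t) for t in theta)


def l2_norm(theta):
    """Euclidean norm: square root of the sum of squared weights."""
    return math.sqrt(sum(t * t for t in theta))


theta = [3.0, -4.0]  # assumed example weights
l1 = l1_norm(theta)
l2 = l2_norm(theta)
print(l1, l2)
```

When used as regularization terms, the $L_1$ norm is usually applied as is, while the $L_2$ norm is usually squared, which removes the square root and keeps the term differentiable everywhere.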

The table below provides information about both of these regularization methods when used to solve least squares problems:

### Elastic net regularization

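Elastic net is a compromise regularization method whose regularization term linearly combines the $L_1$ and $L_2$ norms of the weights, using two tuning parameters, $\lambda_1$ and $\lambda_2$.

The elastic net regularized cost function for least squares problems is given as (with the $L_2$ norm squared, as is conventional):

$$\l(\b{y}-\bs{\Phi}\bs{\theta}\r)^\T\l(\b{y}-\bs{\Phi}\bs{\theta}\r)+\lambda_1\lVert\bs{\theta}\rVert_1+\lambda_2\lVert\bs{\theta}\rVert_2^2$$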
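
A minimal sketch of this cost function is given below, assuming the common convention of an unsquared $L_1$ norm and a squared $L_2$ norm, and penalizing all weights for simplicity. The function name and data are made up for illustration.

```python
def elastic_net_cost(Phi, y, theta, lam1, lam2):
    """RSS + lam1 * ||theta||_1 + lam2 * ||theta||_2^2 (assumed convention)."""
    residual_ss = sum(
        (y_i - sum(p * t for p, t in zip(row, theta))) ** 2
        for row, y_i in zip(Phi, y)
    )
    l1 = sum(abs(t) for t in theta)
    l2_squared = sum(t * t for t in theta)
    return residual_ss + lam1 * l1 + lam2 * l2_squared


# Toy data (assumed values) that theta fits exactly, so the RSS is zero.
Phi = [[1.0, 1.0], [1.0, 2.0]]
y = [3.0, 5.0]
theta = [1.0, 2.0]

cost_elastic = elastic_net_cost(Phi, y, theta, 1.0, 0.5)
cost_l1_only = elastic_net_cost(Phi, y, theta, 1.0, 0.0)
cost_l2_only = elastic_net_cost(Phi, y, theta, 0.0, 0.5)
print(cost_elastic, cost_l1_only, cost_l2_only)
```

Setting either tuning parameter to zero recovers the purely $L_2$ or purely $L_1$ regularized cost, which is why elastic net can be seen as interpolating between the two.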

#### Comparison with standalone $L_1$ and $L_2$ regularization

This form of regularization is often used to counteract the limitations of the $L_1$ and $L_2$ penalties.

• With highly-correlated features:

  • $L_1$ regularization generally picks one and effectively discards the others by setting their weights to zero. However, it is often difficult to predict which feature will be chosen.
  • $L_2$ regularization shrinks the weights of highly-correlated features towards one another.

  Elastic net is a compromise between the two that attempts to shrink and do a sparse selection simultaneously.

• In regards to penalty:

  • $L_1$ regularization penalizes weights more uniformly.
  • $L_2$ regularization penalizes higher-valued weights more than the smaller ones.

  Once again, elastic net acts as a compromise between these properties of the two regularization methods.

## Basis functions

The main requirement for a linear regression model is that it must be linear in the weights. However, it does not need to be linear in the explanatory variables: each term may be any non-linear function of the explanatory variables.

This allows us to more generally define a linear model as:

$$y^{(i)}=\sum_{j=0}^D\theta_j\phi_j \! \l(\b{x}^{(i)}\r)=\bs{\theta}^\T\bs{\phi} \! \l(\b{x}^{(i)}\r)$$

Where:

• Each $\phi_j$ is called a basis function, which is a function of the current input $\b{x}^{(i)}$.
• $\bs{\phi}$ is a vector-valued function such that $\bs{\phi} \! \l(\b{x}^{(i)}\r)=\Big(\phi_0 \! \l(\b{x}^{(i)}\r),\ldots,\phi_D \! \l(\b{x}^{(i)}\r) \Big)^\T$.
• $\phi_0 \! \l(\b{x}^{(i)}\r)=1$ by convention—so that the bias term is not affected by the basis function.
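
The general definition can be sketched by building a design matrix from an arbitrary list of basis functions, with $\phi_0=1$ prepended by convention. The two basis functions below are hypothetical examples chosen only to show that they may be non-linear in the inputs.

```python
def design_matrix(X, basis_functions):
    """Row i is (1, phi_1(x_i), ..., phi_D(x_i)); the leading 1 is phi_0."""
    return [[1.0] + [phi(x) for phi in basis_functions] for x in X]


# Hypothetical non-linear basis functions of the feature vector x.
basis = [
    lambda x: x[0] ** 2,  # phi_1: square of the first feature
    lambda x: abs(x[1]),  # phi_2: absolute value of the second feature
]

X = [[2.0, -3.0], [1.0, 4.0]]  # assumed toy feature vectors
theta = [0.5, 1.0, 2.0]        # bias first, then one weight per basis function

Phi = design_matrix(X, basis)
y_hat = [sum(p * t for p, t in zip(row, theta)) for row in Phi]
print(Phi)
print(y_hat)
```

The model stays linear in `theta` even though each column of `Phi` is a non-linear function of the inputs, so the OLS machinery applies unchanged.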

### The identity basis function

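The simplest case of a linear regression model, which we have seen before, is where the output variable may be modeled as:

$$y^{(i)}=\theta_0+\theta_1x_1^{(i)}+\theta_2x_2^{(i)}+\cdots+\theta_Dx_D^{(i)}$$

This regression model can be defined by the identity basis function:

$$\bs{\phi} \! \l(\b{x}^{(i)}\r)=\l(1,x_1^{(i)},\ldots,x_D^{(i)}\r)^\T$$

Where: The individual basis functions would be $\phi_j \! \l(\b{x}^{(i)}\r)=x_j^{(i)}$ for $j=1,\ldots,D$, a function only of the feature with the same index as the basis function.

This is more clearly seen by looking at the design matrix $\bs{\Phi}$ of this basis function when applied to the training set $\mathcal{D}_\text{train}$:

$$\bs{\Phi}=\begin{pmatrix}1&x_1^{(1)}&\cdots&x_D^{(1)}\\\vdots&\vdots&\ddots&\vdots\\1&x_1^{(N)}&\cdots&x_D^{(N)}\end{pmatrix}$$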

### Polynomial regression

Standard linear regression with the identity basis function is powerful for modelling an output variable $y^{(i)}$ which is assumed to be linearly dependent upon the explanatory variables $x_j^{(i)}$.

However, it is not always the case that the explanatory variables have a linear relationship with the output variable.

Figure 8: Example of a relationship that cannot accurately be modeled with a hyperplane.
This data would be more accurately represented with a polynomial regression model. (source)

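In this case, it may be more appropriate to assume a different relationship, such as a polynomial one. The output variable can be modeled as a $D$-degree polynomial, a linear combination of the monomials of each feature:

$$y^{(i)}=\theta_0+\theta_1x_1^{(i)}+\theta_2\l(x_2^{(i)}\r)^2+\cdots+\theta_D\l(x_D^{(i)}\r)^D$$

Which can be defined with the following basis function:

$$\bs{\phi} \! \l(\b{x}^{(i)}\r)=\l(1,x_1^{(i)},\l(x_2^{(i)}\r)^2,\ldots,\l(x_D^{(i)}\r)^D\r)^\T$$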
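
As a sketch, and assuming the polynomial basis raises the $j^\text{th}$ feature to the $j^\text{th}$ power (so that each $\phi_j$ depends only on $x_j^{(i)}$), the feature mapping could look like the following. The function name and data are made up for illustration.

```python
def polynomial_features(x):
    """Map (x_1, ..., x_D) to (1, x_1, x_2**2, ..., x_D**D) under the assumed form."""
    return [1.0] + [x_j ** j for j, x_j in enumerate(x, start=1)]


# Toy feature vectors (assumed values) with D = 3 features.
X = [[2.0, 3.0, 2.0], [1.0, -1.0, 0.5]]

Phi = [polynomial_features(x) for x in X]
print(Phi)
```

Other polynomial bases are common too, for example powers $1, x, x^2, \ldots$ of a single feature; only the feature mapping changes, not the fitting procedure.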

### Multivariate basis functions

Although the identity and polynomial basis functions $\phi_j \! \l(\b{x}^{(i)}\r)$ only operate on the $j^\text{th}$ feature, $x_j^{(i)}$, this isn't a strict requirement of basis functions. Remember that each basis function $\phi_j$ is a function of the entire feature vector $\b{x}^{(i)}$, and can therefore depend upon the values of other features. This gives rise to what are known as multivariate basis functions.

Example: A basis function with multivariate inputs.
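
One hypothetical illustration of a multivariate basis function is an interaction term, the product of two features. The sketch below is an assumed example, not necessarily the one originally intended here.

```python
def interaction_features(x):
    """Map (x_1, x_2) to (1, x_1, x_2, x_1 * x_2).

    The last basis function depends on both features at once,
    which makes it a multivariate basis function.
    """
    x1, x2 = x
    return [1.0, x1, x2, x1 * x2]


# Toy feature vectors (assumed values) with D = 2 features.
X = [[2.0, 3.0], [0.5, 4.0]]

Phi = [interaction_features(x) for x in X]
print(Phi)
```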