# Learning

## Clustering

Clustering is the partitioning of a data set into meaningful or useful groups, based on distance between data points.

Example: Given a set of oranges and lemons, cluster into different orange and lemon varieties.

### K-means clustering

K-means clustering is a simple algorithm to find clusters:

1. Pick $K$ random points as cluster centre positions
2. Assign each point to its nearest centre (in the unlikely event of a tie, break it arbitrarily, e.g. by choosing the lowest-numbered centre).
3. Move each centre to the mean of its assigned points
4. If the centres moved, go back to step 2.
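The steps above can be sketched in plain Python (a minimal illustration; the function name and structure are my own, and real implementations usually restart from several random initialisations):

```python
import random

def k_means(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centre updates."""
    centres = random.sample(points, k)  # step 1: random initial centres
    for _ in range(iters):
        # Step 2: assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        # Step 3: move each centre to the mean of its assigned points.
        new_centres = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop once the centres no longer move.
        if new_centres == centres:
            break
        centres = new_centres
    return centres
```

With two well-separated groups of points, the centres converge to the two group means regardless of the initial choice.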

#### Evaluation of K-means clustering

To measure the quality of a K-means clustering solution, we can use a sum-squared error function — that is, the sum of squared distances of each point from its cluster centre.

Let $z_{kn}$ be defined as:

$$z_{kn}=\begin{cases}1&\text{if point $\b{x}_n$ is assigned to cluster $k$}\\0&\text{otherwise}\end{cases}$$

Then:

$$E=\sum_{n=1}^N\sum_{k=1}^K z_{kn}\|\b{x}_n-\b{m}_k\|^2$$

where $\b{m}_k$ is the centre of cluster $k$.

The sum squared error decreases as we increase $K$. ($E\to0$ as $K\to N$)
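As a check on the error function, a short sketch (the names are my own) that computes the sum-squared error for a given assignment:

```python
def sum_squared_error(points, centres, assignment):
    """E = sum of squared distances of each point from its cluster centre.

    assignment[n] is the index of the centre that point n belongs to.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centres[assignment[n]]))
        for n, p in enumerate(points)
    )
```

With $K=N$ every point can be its own centre, giving $E=0$, which is why a small $E$ alone does not mean a good clustering.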

## Dimensionality reduction

High-dimensional data is often difficult to understand and visualise. It is often a good idea to consider dimensionality reduction of data to improve visualisation.

Dimensionality reduction is the process of reducing the number of random variables under consideration. For example, 3D data can be reduced to 2D by projecting each sample onto a 2D plane.

Recall the orthogonal projection of a data point $\b{x}$ onto a unit axis $\b{u}$: $y=\b{u}^\T\b{x}$, the scalar coordinate of $\b{x}$ along $\b{u}$.

### Optimal projection of 2D data onto 1D

For the optimal projection of 2D data to 1D, we need a mapping $y_n=\b{u}^\T\b{x}_n=u_1x_{n1}+u_2x_{n2}$, where the optimal $\b{u}$ is the one that maximises $\var{y}$.

This optimal $\b{u}$ is called the principal component axis.

### General case

Mapping $D$-dimensional data to a principal component axis $\b{u}=(u_1,\ldots,u_D)^\T$ that maximises $\var{y}$:

$\b{u}$ is given as the eigenvector with the largest eigenvalue of the covariance matrix, $\bs{\Sigma}$.
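A small numerical sketch with NumPy (the data matrix is invented for illustration): the principal component axis is the eigenvector of the covariance matrix with the largest eigenvalue, and projecting onto it maximises the variance of $y$.

```python
import numpy as np

# Example data: N x D matrix, one observation per row (made-up values).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

cov = np.cov(X, rowvar=False)            # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices
u = eigvecs[:, np.argmax(eigvals)]       # principal component axis
y = X @ u                                # 1-D projection of each sample
```

The variance of the projected data equals the largest eigenvalue, and is at least as large as the variance along any single original axis.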

#### Covariance matrix

Each entry of the covariance matrix is given by:

$$\sigma_{ij}=\e{(x_i-\mu_i)(x_j-\mu_j)}$$

Giving us the following matrix equality:

$$\bs{\Sigma}=\e{(\b{x}-\bs{\mu})(\b{x}-\bs{\mu})^\T}$$

## Classification and nearest neighbours

### Types of learning problems

A learning problem is described in terms of a feature vector $\b{x}=(x_1,\ldots,x_D)^\T$ and a target $y$ (a vector or a scalar). In supervised problems (e.g. classification) the targets of the training data are known; in unsupervised problems (e.g. clustering) they are not.

### Classification

• The data has a feature vector $\b{x}=(x_1,\ldots,x_D)^\T$ and a label $c\in\set{1,\ldots,C}$.

• Given a training set (a set of $N$ feature vectors and their labels, $(\b{x}_1,c_1),\ldots,(\b{x}_N,c_N)$), we use a learning algorithm to train a classifier.

• The test set is a set of feature vectors to which the classifier must assign labels. It is used for evaluation.

• An error function tells us how accurate the classifier is. One option is to count the number of misclassifications:

$$E=\sum_{n=1}^N\mathbb{I}\left(\hat{c}_n\neq c_n\right)$$

where $\hat{c}_n$ is the label the classifier assigns to $\b{x}_n$, and $\mathbb{I}(\cdot)$ is $1$ when its argument is true and $0$ otherwise.

### Nearest-neighbour classifier

Nearest-neighbour classification assigns a test example the label of the closest training example.

### K-nearest neighbour classifier

K-nearest neighbour classification is finding the K closest points in the training set to the test example, then classifying the test example by using a majority vote of the K class labels.

• Training a K-nearest neighbour classifier is simple — just store the training set.
• Classifying a test example requires finding the K nearest training examples, which can be computationally demanding if the training set is large (since we need to compute the Euclidean distance between the test example and every training example).

#### K-NN classification algorithm

For each test example $\b{z}\in Z$ (the test set):

• Compute the distance $r(\b{z},\b{x})$ between $\b{z}$ and each training example $(\b{x},c)\in X$
• Select $U_k(\b{z})\subseteq X$, the set of the $k$ nearest training examples to $\b{z}$
• Decide the class of $\b{z}$ by majority vote over the labels in $U_k(\b{z})$
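A direct transcription of the algorithm (a sketch; the function name is my own, and distances are Euclidean via `math.dist`):

```python
import math
from collections import Counter

def knn_classify(z, training_set, k=3):
    """Classify z by majority vote among its k nearest training examples.

    training_set is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training_set,
                         key=lambda xc: math.dist(z, xc[0]))  # Euclidean
    nearest = by_distance[:k]                                 # U_k(z)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that classification requires computing a distance to every training example, which is what makes K-NN expensive for large training sets.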

### Geometry of nearest-neighbour classifier

• A decision boundary is a boundary that partitions the vector space into subsets of different classes.

Example: For a training set containing one point from each of two classes, the decision boundary is the perpendicular bisector of the line segment joining the two points.

• Decision regions are the regions that are separated by the decision boundaries.

In the two-class example above, there are two decision regions, one on each side of the decision boundary.

Example: In a vector space with more training examples and classes, the decision boundaries and decision regions can become more complicated.

### Generalisation and over-fitting

• Overfitting is the tuning of a classifier too closely to the training set. This can often reduce the accuracy of classification on the test set.

Example: A classifier whose decision boundary fits the training set too closely is overfitted and generalises poorly.

## Statistical pattern recognition and optimisation

### Rules of probability

For random variables $X\in\set{x_i}_{i=1}^L$ and $Y\in\set{y_j}_{j=1}^M$:

• Product rule: $\jp{X}{Y}=\cp{Y}{X}\p{X}=\cp{X}{Y}\p{Y}$

$X$ and $Y$ are independent iff:

• $\cp{X}{Y}=\p{X}$ and $\cp{Y}{X}=\p{Y}$
• $\jp{X}{Y}=\p{X}\p{Y}$, which follows from the previous statement.
• Sum rule: $\p{X}=\sum_Y\jp{X}{Y}$

• $\p{X}$ is referred to as the marginal probability of $X$.
• $\sum_Y\jp{X}{Y}$ is the marginalisation of the joint probability over $Y$.

### Bayes' decision rule

Suppose we have a random variable $C\in\set{1,\ldots,K}$, where each element $k$ represents a class label. Let $C_k$ denote $C=k$, and suppose we have an input feature vector $\b{x}$ with features $x_i$.

Then the most probable class (MAP - Maximum A Posteriori) for vector $\b{x}$ is given by the class label which generates the maximum posterior probability — that is:

$$k^*=\arg\max_k\cp{C_k}{\b{x}}$$

Where:

$$\cp{C_k}{\b{x}}=\frac{\cp{\b{x}}{C_k}\p{C_k}}{\p{\b{x}}}$$

Often, the evidence $\p{\b{x}}$ will be the same for all classes, and therefore it is not necessary to compute it when finding the most probable class, since:

$$\cp{C_k}{\b{x}}\propto\cp{\b{x}}{C_k}\p{C_k}$$
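As a sketch (the helper function and its arguments are my own), the MAP rule needs only the likelihoods and priors, not the evidence:

```python
def map_class(x, classes, likelihood, prior):
    """Return the class k maximising p(x|C_k) * P(C_k).

    The evidence p(x) is identical for every class, so dividing by it
    would not change which class wins and is therefore skipped.
    """
    return max(classes, key=lambda k: likelihood(x, k) * prior[k])
```

For example, with a biased coin (class 0) and a fair coin (class 1), observing "heads" is most probably the biased coin when the likelihoods favour it.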

### Least square error line fitting

This is an optimisation problem, also called finding the line of best fit.

If the least square error line is given by $\hat{y}_n=ax_n+b$ (also called the regression line of $y$ on $x$), then we need to find:

$$(a,b)=\arg\min_{a,b}\sum_{n=1}^N\left(y_n-(ax_n+b)\right)^2$$

That is, find $a$ and $b$ for the line $\hat{y}_n=ax_n+b$ such that the total squared error of all points (measured along the $y$ axis) is minimised.
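The minimising $a$ and $b$ have a standard closed form ($a=\operatorname{cov}(x,y)/\operatorname{var}(x)$, $b=\bar{y}-a\bar{x}$); a sketch, with names of my own choosing:

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y_hat = a*x + b (y on x)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # a = cov(x, y) / var(x); b places the line through the mean point.
    a = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - a * xbar
    return a, b
```

Points that already lie on a line are recovered exactly.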

#### Regression on $x$ with $y$

Sometimes it may be better to have a least square error line of the form $\hat{x}_n=cy_n+d$. This is called the regression line of $x$ on $y$, where $c$ and $d$ are given by:

$$c=\frac{\sum_{n=1}^N(x_n-\bar{x})(y_n-\bar{y})}{\sum_{n=1}^N(y_n-\bar{y})^2},\qquad d=\bar{x}-c\bar{y}$$

That is, find $c$ and $d$ for the line $\hat{x}_n=cy_n+d$ such that the total squared error of all points (measured along the $x$ axis) is minimised.

### Iterative optimisation

Iterative optimisation is used for optimisation problems that don't have a closed-form solution. Such methods may converge only to a local optimum, as with $K$-means clustering.

#### Iterative optimisation method

1. Choose an initial point $\b{x}_0$, and set $t=0$
2. Choose $\b{x}_{t+1}$ based on an update formula for $\b{x}_t$
3. Increment $t$, so $t\leftarrow t+1$ and go to step 2 unless a stopping criterion is met.

Gradient descent is an example of an iterative optimisation method.
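Gradient descent instantiates the update formula as $\b{x}_{t+1}=\b{x}_t-\eta\nabla f(\b{x}_t)$; a 1-D sketch (the learning rate and stopping tolerance are arbitrary choices of mine):

```python
def gradient_descent(grad, x0, lr=0.1, steps=1000, tol=1e-10):
    """Iterate x_{t+1} = x_t - lr * grad(x_t) until the step is tiny."""
    x = x0
    for _ in range(steps):
        x_next = x - lr * grad(x)
        if abs(x_next - x) < tol:  # stopping criterion
            return x_next
        x = x_next
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each step shrinks the distance to the minimiser $x=3$ by a constant factor, so the iteration converges quickly here.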

## Naïve Bayes classification

For a feature vector $\b{x}=(x_1,\ldots,x_D)^\T$, we wish to know $\cp{\b{x}}{C_k}=\cp{x_1,x_2,\ldots,x_D}{C_k}$.

With the chain rule of probabilities:

$$\cp{x_1,x_2,\ldots,x_D}{C_k}=\cp{x_1}{C_k}\,\cp{x_2}{x_1,C_k}\cdots\cp{x_D}{x_1,\ldots,x_{D-1},C_k}$$

Naïve Bayes classification naïvely assumes that every feature $x_i$ is conditionally independent of every other feature $x_j$ ($j\neq i$) given $C_k$, i.e.:

$$\cp{x_i}{x_j,C_k}=\cp{x_i}{C_k}$$

Therefore it follows that after this naïve assumption:

$$\cp{\b{x}}{C_k}=\prod_{i=1}^D\cp{x_i}{C_k}$$

Or:

$$\cp{C_k}{\b{x}}\propto\p{C_k}\prod_{i=1}^D\cp{x_i}{C_k}$$

Therefore, when applying maximum likelihood estimation (MLE) of priors and likelihoods (using the Naïve Bayes assumption for the likelihoods), we get:

$$\cp{C_k}{\b{x}}\propto\frac{N_k}{N}\prod_{i=1}^D\cp{x_i}{C_k}$$

where $N_k$ is the number of training examples of class $k$, $N$ is the total number of training examples, and each likelihood $\cp{x_i}{C_k}$ is estimated from the relative frequency of the value $x_i$ within class $k$.
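A sketch of the MLE estimates for categorical features (dataset and names invented; no smoothing, so unseen feature values get probability zero):

```python
from collections import Counter

def train_naive_bayes(data):
    """MLE: priors P(C_k) = N_k / N; likelihoods p(x_i|C_k) from
    per-class relative frequencies of each feature value.

    data is a list of (feature_tuple, class_label) pairs.
    """
    n = len(data)
    class_counts = Counter(c for _, c in data)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    value_counts = Counter()
    for x, c in data:
        for i, v in enumerate(x):
            value_counts[(c, i, v)] += 1

    def score(x, c):
        """Proportional to the posterior: P(C_k) * prod_i p(x_i|C_k)."""
        s = priors[c]
        for i, v in enumerate(x):
            s *= value_counts[(c, i, v)] / class_counts[c]
        return s

    return priors, score
```

Classification then picks the class with the largest score, exactly as in the MAP rule.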

## Text classification with Naïve Bayes

Given a document $\mathcal{D}$ and a fixed set of classes $C=\set{1,\ldots,K}$, we classify $\mathcal{D}$ as the MAP class:

$$k^*=\arg\max_k\cp{C_k}{\mathcal{D}}=\arg\max_k\cp{\mathcal{D}}{C_k}\p{C_k}$$

But how do we represent $\mathcal{D}$, and how do we estimate $\cp{\mathcal{D}}{C_k}$ and $\p{C_k}$?

### Representation of $\mathcal{D}$

• Sequence of words model: $\mathcal{D}=(X_1,X_2,\ldots,X_n)$

• Set of words (bag-of-words) model: $\mathcal{D}$ is represented only by which vocabulary words it contains, ignoring word order.

In the bag-of-words model, we only consider the words in a document that appear in a predefined vocabulary $V$.

Example: If classifying scam emails, we would have a vocabulary consisting of words like:

casino, congratulations, draws, lotto, first, winner, true, lottery, million

#### Multinomial document model

• Document represented by an integer feature vector $\b{x}=(x_1,\ldots,x_D)^\T$, whose elements $x_i$ indicate the frequency of the corresponding word in the document.

Where $D$ is either:

• The number of unique words in the document (sequence-of-words model)
• The number of words in the vocabulary $V$ (bag-of-words model), $|V|$
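A sketch of the bag-of-words count vector (the vocabulary here is a subset of the scam-email example above; tokenisation is naïve whitespace splitting):

```python
VOCAB = ['casino', 'congratulations', 'lotto', 'winner', 'lottery']

def count_vector(document, vocab=VOCAB):
    """Multinomial representation: x_i = how often vocabulary word w_i
    occurs in the document; words outside the vocabulary are ignored."""
    words = document.lower().split()
    return [words.count(w) for w in vocab]
```

Every document, whatever its length, maps to a vector of $|V|$ counts.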

Note: Remember that when classifying an unlabelled document $\mathcal{D}$, we actually want the posterior — $\cp{C_k}{\mathcal{D}}$.

#### Bernoulli document model

• Document represented by a binary feature vector $\b{x}=(x_1,\ldots,x_D)^\T$, whose elements $x_t$ indicate absence ($x_t=0$) or presence ($x_t=1$) of the corresponding word $w_t$ in the document, where $D$ is the same as previously described for the multinomial document model.

The likelihood is then:

$$\cp{\mathcal{D}}{C_k}=\prod_{t=1}^D\left[x_t\cp{w_t}{C_k}+(1-x_t)\left(1-\cp{w_t}{C_k}\right)\right]$$

Where $\cp{w_t}{C_k}=\frac{n_k(w_t)}{N_k}$.

• $n_k(w_t)$ - number of documents of class $k$ in which $w_t$ is observed

• $N_k$ - total documents in class $k$
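The estimate $\cp{w_t}{C_k}=n_k(w_t)/N_k$ can be sketched as follows (documents are given as sets of words; the function name and data layout are my own):

```python
def bernoulli_estimates(docs_by_class, vocab):
    """P(w_t | C_k) = n_k(w_t) / N_k: the fraction of class-k documents
    in which word w_t appears at least once."""
    est = {}
    for k, docs in docs_by_class.items():
        n_docs = len(docs)  # N_k
        est[k] = {w: sum(1 for d in docs if w in d) / n_docs
                  for w in vocab}
    return est
```

Note that word frequency within a document is irrelevant here; only presence or absence counts.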

Note: Remember that when classifying an unlabelled document $\mathcal{D}$, we actually want the posterior — $\cp{C_k}{\mathcal{D}}$.

## Multivariate Gaussians and classification

### Gaussian distribution

The Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$ has the PDF:

$$\p{x}=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

• Symmetric and centred about $\mu$
• $\sigma$ increases $\to$ gets flatter and wider
• $\sigma$ decreases $\to$ gets taller and narrower

### Parameter estimation from data

• Sample mean and sample variance estimates:

$$\mu=\frac{1}{N}\sum_{n=1}^Nx_n,\qquad\sigma^2=\frac{1}{N-1}\sum_{n=1}^N(x_n-\mu)^2$$

• Maximum likelihood estimates:

$$\mu_\text{ML}=\frac{1}{N}\sum_{n=1}^Nx_n,\qquad\sigma^2_\text{ML}=\frac{1}{N}\sum_{n=1}^N(x_n-\mu_\text{ML})^2$$

### Multivariate form

For a $D$-dimensional vector $\b{x}=(x_1,\ldots,x_D)^\T$, the multivariate Gaussian PDF is:

$$\p{\b{x}}=\frac{1}{(2\pi)^{D/2}|\bs{\Sigma}|^{1/2}}\exp\left(-\frac{1}{2}(\b{x}-\bs{\mu})^\T\bs{\Sigma}^{-1}(\b{x}-\bs{\mu})\right)$$

• $\bs{\mu}=(\mu_1,\ldots,\mu_D)^\T$ is the mean vector
• $\bs{\Sigma}=(\sigma_{ij})$ is the covariance matrix

Note that the 1-dimensional Gaussian is a special case of this PDF.

#### Parameter estimation

• $\bs{\Sigma}=\e{(\b{x}-\bs{\mu})(\b{x}-\bs{\mu})^\T}=\frac{1}{N-1}\sum_{n=1}^N(\b{x}_n-\bs{\mu})(\b{x}_n-\bs{\mu})^\T$
• $\bs{\Sigma}_\text{ML}=\frac{1}{N}\sum_{n=1}^N(\b{x}_n-\bs{\mu}_\text{ML})(\b{x}_n-\bs{\mu}_\text{ML})^\T$
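Both estimators in NumPy (the data matrix is invented; `ddof` switches between the unbiased and ML normalisations):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # N x D

mu = X.mean(axis=0)                          # mean vector
sigma = np.cov(X, rowvar=False)              # unbiased: divides by N - 1
sigma_ml = np.cov(X, rowvar=False, ddof=0)   # ML estimate: divides by N
```

The two estimates differ only by the factor $\frac{N-1}{N}$.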

### Covariance

The covariance of variables $X$ and $Y$ is:

$$\text{cov}(X,Y)=\e{(X-\e{X})(Y-\e{Y})}$$

### Covariance matrix

Where $\b{x}_n=(x_{n1},\ldots,x_{nD})^\T$ is a feature vector (an observation of $D$ variables) in:

$$\bs{\Sigma}=\frac{1}{N-1}\sum_{n=1}^N(\b{x}_n-\bs{\mu})(\b{x}_n-\bs{\mu})^\T=\begin{pmatrix}\sigma_{11}&\cdots&\sigma_{1D}\\\vdots&\ddots&\vdots\\\sigma_{D1}&\cdots&\sigma_{DD}\end{pmatrix}$$

Note: Diagonals are the variances.

Example: Consider a set of measurements of the length, width and height of a certain object. Suppose the sample mean is $\bs{\mu}=(4.10,2.08,0.604)^\T$; the corresponding $3\times3$ covariance matrix then pairs up the variables, so that, for instance, the entry relating length and height is their covariance, $0.00175$.

#### Geometric interpretation

The covariance matrix defines the shape of the data.

• Diagonal spread of the data is captured by the covariance, while axis-aligned spread is captured by the variance.
• The centre of the spread is given by the mean vector.

Example: If $\bs{\Sigma}$ is diagonal (such that the covariances are $0$), then the variances are equal to the eigenvalues of $\bs{\Sigma}$, and the eigenvectors are axis-aligned.

But if $\bs{\Sigma}$ isn't diagonal, the eigenvalues still give the variance in the directions of greatest spread of the data, and the diagonal entries of the covariance matrix still give the variances along the $x$ and $y$ axes. Since the data isn't axis-aligned, these values are no longer the same.

### Problems with estimation of covariance matrix

• $\bs{\Sigma}^{-1}$ becomes unstable when $|\bs\Sigma|$ is small.

• One solution is to reduce the dimensionality with PCA

• Another solution is regularisation — adding a small positive number to the diagonal elements:

$$\hat{\bs{\Sigma}}=\bs{\Sigma}+\epsilon\b{I}$$
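A sketch of the regularisation fix (the choice of $\epsilon$ here is an arbitrary illustration): adding a small positive number to the diagonal makes even a singular covariance matrix invertible.

```python
import numpy as np

def regularise(sigma, eps=0.01):
    """Add a small positive number to the diagonal of the covariance
    matrix, so that its inverse is stable even when |Sigma| is tiny."""
    return sigma + eps * np.eye(sigma.shape[0])
```

For example, a rank-deficient matrix with zero determinant becomes invertible after regularisation.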

### Diagonal covariance matrix

If $\bs\Sigma$ is diagonal, then the multivariate Gaussian factorises into a product of univariate Gaussians:

$$\p{\b{x}}=\prod_{i=1}^D\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$

Remember that the diagonal entries $\sigma_i^2$ of $\bs\Sigma$ are the variances of the individual features.

### Performance measures

• Accuracy (correct classification rate)

$\text{accuracy}=1-\text{error rate}$

• Confusion matrix
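Both measures can be sketched as follows (the names are my own; rows of the confusion matrix are true classes, columns are predicted classes):

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified examples (1 - error rate)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_matrix(true_labels, predicted_labels, classes):
    """Entry [i][j] counts examples of true class classes[i]
    that were predicted as classes[j]."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]
```

The diagonal of the confusion matrix holds the correct classifications, so accuracy is the diagonal sum divided by the total.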

## Discriminant functions

### Decision regions

• Given an unseen point $\b{x}$, we assign the class $k=\arg\max_k\cp{C_k}{\b{x}}$.

• The $\b{x}$-space can be regarded as being divided into decision regions $\mathcal{R}_k$ such that a point falling in $\mathcal{R}_k$ is assigned to class $C_k$.

These decision regions $\mathcal{R}_k$ can consist of several disjoint regions, each associated with class $C_k$.

• The boundaries between these regions are called decision boundaries.

### Misclassification

To estimate the probability of misclassification, we need to consider the two ways that a point can be classified wrongly:

1. Assigning $x$ to $C_1$ when it belongs to $C_2$
2. Assigning $x$ to $C_2$ when it belongs to $C_1$

The probability of the total error is then:

$$\p{\text{error}}=\jp{x\in\mathcal{R}_1}{C_2}+\jp{x\in\mathcal{R}_2}{C_1}$$

With conditional probabilities, we get:

$$\p{\text{error}}=\cp{x\in\mathcal{R}_1}{C_2}\p{C_2}+\cp{x\in\mathcal{R}_2}{C_1}\p{C_1}$$

### Overlapping Gaussians

Consider the overlapping Gaussian PDFs shown below.

Two possible decision boundaries are shown by the dashed lines. The decision boundary in the left-hand plot is optimal, assuming equal priors.

Note: The overall probability of error is given by the area of the shaded regions under the PDFs — that is:

$$\p{\text{error}}=\int_{\mathcal{R}_1}\cp{x}{C_2}\p{C_2}\,dx+\int_{\mathcal{R}_2}\cp{x}{C_1}\p{C_1}\,dx$$

### Discriminant functions

For a set of $K$ classes, we have a set of $K$ discriminant functions $y_k(\b{x})$, one for each class. Data point $\b{x}$ is assigned to class $C_k$ if:

$$y_k(\b{x})>y_j(\b{x})\quad\text{for all }j\neq k$$

In other words, assign $\b{x}$ to the class $C_k$ whose discriminant function $y_k(\b{x})$ is biggest.

One example of a discriminant function we've seen is simply the log posterior probability of class $C_k$ given a data point $\b{x}$.

#### Optimal discriminant functions

• Classifying a point as the class with the largest log posterior probability corresponds to the decision rule which minimises the probability of misclassification.

In this sense, MAP classification forms an optimal discriminant function.

• A decision boundary occurs at points in the input space where discriminant functions for each class are equal.

If the region of input space classified as class $C_k$ and the region classified as $C_l$ are contiguous, then the decision boundary separating them is given by:

$$y_k(\b{x})=y_l(\b{x})$$

Example: The discriminant function for class $C_k$ when the class-conditional density $\cp{\b{x}}{C_k}$ is a Gaussian with mean $\bs{\mu}_k$ and covariance $\bs{\Sigma}_k$ is given as:

$$y_k(\b{x})=\ln\p{C_k}-\frac{1}{2}\ln|\bs{\Sigma}_k|-\frac{1}{2}(\b{x}-\bs{\mu}_k)^\T\bs{\Sigma}_k^{-1}(\b{x}-\bs{\mu}_k)$$

(dropping the constant term $-\frac{D}{2}\ln2\pi$, which is the same for every class).
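A sketch of such a Gaussian log-posterior discriminant (the function name is my own; the common constant $-\frac{D}{2}\ln 2\pi$ is dropped since it cancels when comparing classes):

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """y_k(x) = ln P(C_k) - 0.5*ln|Sigma_k|
               - 0.5*(x - mu_k)^T Sigma_k^{-1} (x - mu_k)"""
    d = x - mu
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(sigma))
            - 0.5 * d @ np.linalg.inv(sigma) @ d)
```

With equal priors and identity covariances, a point is assigned to whichever class mean it lies closer to, as expected.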