LearningClusteringK-means clusteringEvaluation of K-means clusteringDimensionality reductionOptimal projection of 2D data onto 1DGeneral caseCovariance matrixClassification and nearest neighboursTypes of learning problemsClassificationNearest-neighbour classifierK-nearest neighbour classifierK-NN classification algorithmGeometry of nearest-neighbour classifierGeneralisation and over-fittingStatistical pattern recognition and optimisationRules of probabilityBayes' theoremBayes' decision ruleLeast square error line fittingRegression on with Iterative optimisationIterative optimisation methodGradient descentNaïve Bayes classificationAdvantages and disadvantagesText classification with Naïve BayesRepresentation of Multinomial document modelBernoulli document modelMultivariate Gaussians and classificationGaussian distributionParameter estimation from dataMultivariate formParameter estimationCovarianceCovariance matrixGeometric interpretationProblems with estimation of covariance matrixDiagonal covariance matrixBayes' theorem and univariate GaussiansLog likelihoodLog posterior probabilityMultivariate Gaussian classifierLog likelihoodLog posterior probabilityPerformance measuresDiscriminant functionsDecision regionsMisclassificationOverlapping GaussiansDiscriminant functionsOptimal discriminant functionsLog odds and two-class problems

**Clustering** is the partitioning of a data set into meaningful or useful groups, based on distance between data points.

Example: Given a set of oranges and lemons, cluster into different orange and lemon varieties.

**K-means clustering** is a simple algorithm to find clusters:

- Pick random points as cluster centre positions
- Assign each point to its nearest centre (In the unlikely event of a tie, break the tie in some way).
- Move each centre to the mean of its assigned points
- If the centres moved, go back to step 2.

To measure the quality of a K-means clustering solution, we can use a **sum-squared error function** — that is, the sum of squared distances of each point from its cluster centre.

Let be defined as:

Then:

The sum squared error decreases as we increase . ( as )

High-dimensional data is often difficult to understand and visualise. It is often a good idea to consider dimensionality reduction of data to improve visualisation.

As shown in the example below, **dimensionality reduction** is the process of reducing the number of random variables under consideration. This is done by projecting each sample in 3D onto a 2D plane.

Recall orthogonal projection of data onto an axis:

For the optimal projection of 2D data to 1D, we need a mapping where the is optimal is when is maximised.

This optimal is called the **principal component axis**.

Mapping -dimensional data to a **principal component axis** that maximises :

is given as the eigenvector with the largest eigenvalue of the **covariance matrix**, .

Each entry of the covariance matrix is given by:

Giving us the following matrix equality:

Data | Input | Output | Type of problem | Type of learning |
---|---|---|---|---|

groups (subsets) | clustering | unsupervised | ||

: discrete category | classification | supervised |

Where is a feature vectors and is a target vector or scalar.

The data has a feature vector and a label .

Given a

**training set**- a set of feature vectors and their labels , we have to use a learning algorithm to train a classifier from a training set.The

**test set**is a set of feature vectors which the classifier must assign labels. It is used for evaluation.An

**error function**tells us how accurate the classifier is. One option is to count the number of misclassifications:

**Nearest-neighbour classification** is labeling a test example to have the label of the closest training example.

**K-nearest neighbour classification** is finding the K closest points in the training set to the test example, then classifying the test example by using a majority vote of the K class labels.

- Training a K-nearest neighbour classifier is simple — just store the training set.
- Classifying a test example requires finding the K nearest training examples, which can be computationally demanding if the training set is large (since we need to compute the Euclidean distance between the test example and every training example).

For each test example :

- Compute the distance between and each training example
- Select , the set of the nearest training examples to
- Decide the class of by the majority voting

A

**decision boundary**is a boundary that partitions the vector space into subsets of different classes.**Example**: The decision boundary for a vector space with two feature vectors in two different classes would look like:**Decision regions**are the regions that are separated by the decision boundaries.In the diagram above, the decision regions are the blue and red regions.

**Example**: In a vector space with more feature vectors and classes, the decision boundaries and decision regions can become more complicated:

**Overfitting**is the tuning of a classifier too closely to the training set. This can often reduce the accuracy of classification on the test set.**Example**: The decision boundary of the below classifier is too overfitting and poorly generalised.

For random variables and :

**Product rule**:Abbreviation:

and are

**independent**iff:- and
- , which follows from the previous statement.

**Sum rule**:Abbreviation:

- is referred to as the
**marginal probability**of . - is the
**marginalisation**of the joint probability over .

- is referred to as the

Suppose we have a random variable , where each element represents a class label. Let denote and suppose we have an input feature vector matrix , with features .

Then the most probable class (**MAP** - Maximum A Posteriori) for vector is given by the class label which generates the maximum posterior probability — that is:

Where:

Often, the evidence will be the same for all classes and therefore it might not be necessary to consider it when finding the most probable class since:

This is an optimisation problem, also called finding the **line of best fit**.

If the least square error line is given by (also called the **regression line** on with ), then we need to find:

That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

Sometimes it may be better to have a least square error line in the form . This is called a regression line on with , where and are given by:

That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

**Iterative optimisation** is used on optimisation problems that don't have a closed-form solution (might not converge, such as -means clustering)

- Choose an initial point , and make
- Choose based on an update formula for
- Increment , so and go to step 2 unless a stopping criterion is met.

**Gradient descent** is an example of an iterative optimisation method.

For a feature vector , we wish to know .

With the chain rule of probabilities:

Naïve Bayes classification **naïvely** assumes that every feature is conditionally independent of every other given , i.e:

Therefore it follows that after this *naïve* assumption:

Or:

Therefore, when applying **maximum likelihood estimation** (**MLE**) of priors and likelihoods (using Naïve Bayes assumption for the likelihoods), we get:

Advantages | Disadvantages |
---|---|

Simplifies calculation of posterior probabilities | If there are no occurences of a class label and a certain feature together, then the entire posterior probability will be , since it is a product of all posteriors. |

Given a document with a fixed set of classes , we classify as the MAP class:

But how do we represent , and how do we estimate and ?

**Sequence of words**model:Advantages Disadvantages Maintains order and position of words Computationally expensive Difficult to train **Set of words**(**Bag-of-words**) model:In the bag-of-words model, we only consider the words in a document that appear in a predefined vocabulary .

**Example**: If classifying scam emails, we would have a vocabulary consisting of words like:*casino, congratulations, draws, lotto, first, winner, true, lottery, million***Advantages****Disadvantages**Easier to train Ignores the context of words Not computationally expensive Loses the ordering and position of words

Document represented by an integer feature vector, whose elements indicate the frequency of the corresponding word in the document:

Where is either:

- The number of unique words in the document (
**sequence-of-words**model) - The number of words in the vocabulary (
**bag-of-words**model),

- The number of unique words in the document (

**Note**: Remember that when classifying an unlabelled document , we actually want the posterior — .

Document represented by a binary feature vector, whose elements indicate absence or presence of the corresponding word in the document

Where is the same as previously described for the multinomial document model.

Where .

- number of documents of class in which is observed

- total documents in class

**Note**: Remember that when classifying an unlabelled document , we actually want the posterior — .

- Symmetric and centered about
- increase Gets flatter and wider
- decrease Gets taller and narrower

Sample mean and sample variance estimates:

Maximum likelihood estimates:

For a -dimensional vector :

- is the
**mean vector** - is the
**covariance matrix**

Note that the 1-dimensional Gaussian is a special case of this PDF.

The covariance of variables and is:

Where is a feature vector (an observation of variables) in:

**Note**: Diagonals are the variances.

Example: Consider the following test data with variables length, width and height of a certain object:Then we have and

Which tells us that the covariance between the length and height is , etc.

The covariance matrix defines the shape of the data.

- Diagonal spread of the data is captured by the covariance, while axis-aligned spread is captured by the variance.
- The center of the spread is given by the mean vector.

Example: If is diagonal (such that the covariances are ), then the variances must be equal to the eigenvalues:

But if isn't diagonal, the eigenvalues still represent the variance magnitude in the direction of the largest spread of the data, and the variance components of the covariance matrix still represent the variance magnitude in the direction of the and axis.

But since the data isn't axis-aligned, these values aren't the same anymore:

becomes unstable when is small.

One solution is the reduce the dimensionality with PCA

Another solution is regularisation — adding a small positive number to the diagonal elements:

If is diagonal, then:

Remember that:

Accuracy (correct classification rate)

Confusion matrix

Given an unseen point , we assign the class .

The -space can be regarded as being divided into decision regions such that a point falling in is assigned to class .

These decision regions can consist of several disjoint regions, each associated with class .

The boundaries between these regions are called decision boundaries.

To estimate the probability of misclassification, we need to consider the two ways that a points can be classified wrongly:

- Assigning to when it belongs to
- Assigning to when it belongs in

The probability of the total error is then:

With conditional probabilities, we get:

Connsider the overlapping Gaussian PDFs shown below.

Two possible decision boundaries are shown by the dashed line. The decision boundary on the left hand plot is optimal, assuming equal priors.

**Note**: The overall probability of error is given by the area of the shaded regions under the PDFs — that is:

For a set of classes, we have a set of discriminant functions , one for each class. Data point is assigned to class if:

In other words, assign to the class whose discriminant function is biggest.

One example of a discriminant function we've seen is simply the log posterior probability of class given a data point .

Classifying a point as the class with the largest log posterior probability corresponds to the decision rule which minimises the probability of misclassification.

In this sense, MAP classification forms an optimal discriminant function.

A

**decision boundary**occurs at points in the input space where discriminant functions for each class are equal.If the region of input space classified as class and the region classified as are contiguous, then the decision boundary separating them is given by:

Example: The form of the discriminant function when using a Gaussian PDF is given as: