LearningClusteringK-means clusteringEvaluation of K-means clusteringDimensionality reductionOptimal projection of 2D data onto 1DGeneral caseCovariance matrixClassification and nearest neighboursTypes of learning problemsClassificationNearest-neighbour classifierK-nearest neighbour classifierK-NN classification algorithmGeometry of nearest-neighbour classifierGeneralisation and over-fittingStatistical pattern recognition and optimisationRules of probabilityBayes' theoremBayes' decision ruleLeast square error line fittingRegression on with Iterative optimisationIterative optimisation methodGradient descentNaïve Bayes classificationAdvantages and disadvantagesText classification with Naïve BayesRepresentation of Multinomial document modelBernoulli document modelMultivariate Gaussians and classificationGaussian distributionParameter estimation from dataMultivariate formParameter estimationCovarianceCovariance matrixGeometric interpretationProblems with estimation of covariance matrixDiagonal covariance matrixBayes' theorem and univariate GaussiansLog likelihoodLog posterior probabilityMultivariate Gaussian classifierLog likelihoodLog posterior probabilityPerformance measuresDiscriminant functionsDecision regionsMisclassificationOverlapping GaussiansDiscriminant functionsOptimal discriminant functionsLog odds and two-class problems
Clustering is the partitioning of a data set into meaningful or useful groups, based on distance between data points.
Example: Given a set of oranges and lemons, cluster into different orange and lemon varieties.
K-means clustering is a simple algorithm to find clusters:
To measure the quality of a K-means clustering solution, we can use a sum-squared error function — that is, the sum of squared distances of each point from its cluster centre.
Let be defined as:
Then:
The sum squared error decreases as we increase . ( as )
High-dimensional data is often difficult to understand and visualise. It is often a good idea to consider dimensionality reduction of data to improve visualisation.
As shown in the example below, dimensionality reduction is the process of reducing the number of random variables under consideration. This is done by projecting each sample in 3D onto a 2D plane.

Recall orthogonal projection of data onto an axis:

For the optimal projection of 2D data to 1D, we need a mapping where the is optimal is when is maximised.
This optimal is called the principal component axis.
Mapping -dimensional data to a principal component axis that maximises :
is given as the eigenvector with the largest eigenvalue of the covariance matrix, .
Each entry of the covariance matrix is given by:
Giving us the following matrix equality:
| Data | Input | Output | Type of problem | Type of learning |
|---|---|---|---|---|
| groups (subsets) | clustering | unsupervised | ||
| : discrete category | classification | supervised |
Where is a feature vectors and is a target vector or scalar.
The data has a feature vector and a label .
Given a training set - a set of feature vectors and their labels , we have to use a learning algorithm to train a classifier from a training set.
The test set is a set of feature vectors which the classifier must assign labels. It is used for evaluation.
An error function tells us how accurate the classifier is. One option is to count the number of misclassifications:
Nearest-neighbour classification is labeling a test example to have the label of the closest training example.
K-nearest neighbour classification is finding the K closest points in the training set to the test example, then classifying the test example by using a majority vote of the K class labels.
For each test example :
A decision boundary is a boundary that partitions the vector space into subsets of different classes.
Example: The decision boundary for a vector space with two feature vectors in two different classes would look like:
Decision regions are the regions that are separated by the decision boundaries.
In the diagram above, the decision regions are the blue and red regions.
Example: In a vector space with more feature vectors and classes, the decision boundaries and decision regions can become more complicated:
Overfitting is the tuning of a classifier too closely to the training set. This can often reduce the accuracy of classification on the test set.
Example: The decision boundary of the below classifier is too overfitting and poorly generalised.
For random variables and :
Product rule:
Abbreviation:
and are independent iff:
Sum rule:
Abbreviation:
Suppose we have a random variable , where each element represents a class label. Let denote and suppose we have an input feature vector matrix , with features .
Then the most probable class (MAP - Maximum A Posteriori) for vector is given by the class label which generates the maximum posterior probability — that is:
Where:
Often, the evidence will be the same for all classes and therefore it might not be necessary to consider it when finding the most probable class since:
This is an optimisation problem, also called finding the line of best fit.
If the least square error line is given by (also called the regression line on with ), then we need to find:
That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

Sometimes it may be better to have a least square error line in the form . This is called a regression line on with , where and are given by:
That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

Iterative optimisation is used on optimisation problems that don't have a closed-form solution (might not converge, such as -means clustering)
Gradient descent is an example of an iterative optimisation method.
For a feature vector , we wish to know .
With the chain rule of probabilities:
Naïve Bayes classification naïvely assumes that every feature is conditionally independent of every other given , i.e:
Therefore it follows that after this naïve assumption:
Or:
Therefore, when applying maximum likelihood estimation (MLE) of priors and likelihoods (using Naïve Bayes assumption for the likelihoods), we get:
| Advantages | Disadvantages |
|---|---|
| Simplifies calculation of posterior probabilities | If there are no occurences of a class label and a certain feature together, then the entire posterior probability will be , since it is a product of all posteriors. |
Given a document with a fixed set of classes , we classify as the MAP class:
But how do we represent , and how do we estimate and ?
Sequence of words model:
| Advantages | Disadvantages |
|---|---|
| Maintains order and position of words | Computationally expensive |
| Difficult to train |
Set of words (Bag-of-words) model:
In the bag-of-words model, we only consider the words in a document that appear in a predefined vocabulary .
Example: If classifying scam emails, we would have a vocabulary consisting of words like:
casino, congratulations, draws, lotto, first, winner, true, lottery, million
| Advantages | Disadvantages |
|---|---|
| Easier to train | Ignores the context of words |
| Not computationally expensive | Loses the ordering and position of words |
Document represented by an integer feature vector, whose elements indicate the frequency of the corresponding word in the document:
Where is either:
Note: Remember that when classifying an unlabelled document , we actually want the posterior — .
Document represented by a binary feature vector, whose elements indicate absence or presence of the corresponding word in the document
Where is the same as previously described for the multinomial document model.
Where .
- number of documents of class in which is observed
- total documents in class
Note: Remember that when classifying an unlabelled document , we actually want the posterior — .
Sample mean and sample variance estimates:
Maximum likelihood estimates:
For a -dimensional vector :
Note that the 1-dimensional Gaussian is a special case of this PDF.
The covariance of variables and is:
Where is a feature vector (an observation of variables) in:
Note: Diagonals are the variances.
Example: Consider the following test data with variables length, width and height of a certain object:
Then we have and
Which tells us that the covariance between the length and height is , etc.
The covariance matrix defines the shape of the data.
Example: If is diagonal (such that the covariances are ), then the variances must be equal to the eigenvalues:
But if isn't diagonal, the eigenvalues still represent the variance magnitude in the direction of the largest spread of the data, and the variance components of the covariance matrix still represent the variance magnitude in the direction of the and axis.
But since the data isn't axis-aligned, these values aren't the same anymore:
becomes unstable when is small.
One solution is the reduce the dimensionality with PCA
Another solution is regularisation — adding a small positive number to the diagonal elements:
If is diagonal, then:
Remember that:
Accuracy (correct classification rate)
Confusion matrix
Given an unseen point , we assign the class .
The -space can be regarded as being divided into decision regions such that a point falling in is assigned to class .
These decision regions can consist of several disjoint regions, each associated with class .
The boundaries between these regions are called decision boundaries.
To estimate the probability of misclassification, we need to consider the two ways that a points can be classified wrongly:
The probability of the total error is then:
With conditional probabilities, we get:
Connsider the overlapping Gaussian PDFs shown below.

Two possible decision boundaries are shown by the dashed line. The decision boundary on the left hand plot is optimal, assuming equal priors.
Note: The overall probability of error is given by the area of the shaded regions under the PDFs — that is:
For a set of classes, we have a set of discriminant functions , one for each class. Data point is assigned to class if:
In other words, assign to the class whose discriminant function is biggest.
One example of a discriminant function we've seen is simply the log posterior probability of class given a data point .
Classifying a point as the class with the largest log posterior probability corresponds to the decision rule which minimises the probability of misclassification.
In this sense, MAP classification forms an optimal discriminant function.
A decision boundary occurs at points in the input space where discriminant functions for each class are equal.
If the region of input space classified as class and the region classified as are contiguous, then the decision boundary separating them is given by:
Example: The form of the discriminant function when using a Gaussian PDF is given as:
Sometimes all of the classes may share the same covariance matrix — that is, .
If this is the case, the term can be ignored since it is constant for all classes:
For two-class problems, we want to find a single decision boundary between the class regions and hence a single discriminant function such that:
A suitable discriminant function in this case is the log ratio of posterior probabilities (log odds):