LearningClusteringK-means clusteringEvaluation of K-means clusteringDimensionality reductionOptimal projection of 2D data onto 1DGeneral caseCovariance matrixClassification and nearest neighboursTypes of learning problemsClassificationNearest-neighbour classifierK-nearest neighbour classifierK-NN classification algorithmGeometry of nearest-neighbour classifierGeneralisation and over-fittingStatistical pattern recognition and optimisationRules of probabilityBayes' theoremBayes' decision ruleLeast square error line fittingRegression on with Iterative optimisationIterative optimisation methodGradient descentNaïve Bayes classificationAdvantages and disadvantagesText classification with Naïve BayesRepresentation of Multinomial document modelBernoulli document modelMultivariate Gaussians and classificationGaussian distributionParameter estimation from dataMultivariate formParameter estimationCovarianceCovariance matrixGeometric interpretationProblems with estimation of covariance matrixDiagonal covariance matrixBayes' theorem and univariate GaussiansLog likelihoodLog posterior probabilityMultivariate Gaussian classifierLog likelihoodLog posterior probabilityPerformance measuresDiscriminant functionsDecision regionsMisclassificationOverlapping GaussiansDiscriminant functionsOptimal discriminant functionsLog odds and two-class problems



Clustering is the partitioning of a data set into meaningful or useful groups, based on distance between data points.

Example: Given a set of oranges and lemons, cluster into different orange and lemon varieties.

K-means clustering

K-means clustering is a simple algorithm to find clusters:

  1. Pick random points as cluster centre positions
  2. Assign each point to its nearest centre (In the unlikely event of a tie, break the tie in some way).
  3. Move each centre to the mean of its assigned points
  4. If the centres moved, go back to step 2.

Evaluation of K-means clustering

To measure the quality of a K-means clustering solution, we can use a sum-squared error function — that is, the sum of squared distances of each point from its cluster centre.

Let be defined as:


The sum squared error decreases as we increase . ( as )

Dimensionality reduction

High-dimensional data is often difficult to understand and visualise. It is often a good idea to consider dimensionality reduction of data to improve visualisation.

As shown in the example below, dimensionality reduction is the process of reducing the number of random variables under consideration. This is done by projecting each sample in 3D onto a 2D plane.

Recall orthogonal projection of data onto an axis:

Optimal projection of 2D data onto 1D

For the optimal projection of 2D data to 1D, we need a mapping where the is optimal is when is maximised.

This optimal is called the principal component axis.

General case

Mapping -dimensional data to a principal component axis that maximises :

is given as the eigenvector with the largest eigenvalue of the covariance matrix, .

Covariance matrix

Each entry of the covariance matrix is given by:

Giving us the following matrix equality:

Classification and nearest neighbours

Types of learning problems

DataInputOutputType of problemType of learning
groups (subsets)clusteringunsupervised
: discrete categoryclassificationsupervised

Where is a feature vectors and is a target vector or scalar.


Nearest-neighbour classifier

Nearest-neighbour classification is labeling a test example to have the label of the closest training example.

K-nearest neighbour classifier

K-nearest neighbour classification is finding the K closest points in the training set to the test example, then classifying the test example by using a majority vote of the K class labels.

K-NN classification algorithm

For each test example :

Geometry of nearest-neighbour classifier

Generalisation and over-fitting

Statistical pattern recognition and optimisation

Rules of probability

For random variables and :

Bayes' theorem

Bayes' decision rule

Suppose we have a random variable , where each element represents a class label. Let denote and suppose we have an input feature vector matrix , with features .

Then the most probable class (MAP - Maximum A Posteriori) for vector is given by the class label which generates the maximum posterior probability — that is:


Often, the evidence will be the same for all classes and therefore it might not be necessary to consider it when finding the most probable class since:

Least square error line fitting

This is an optimisation problem, also called finding the line of best fit.

If the least square error line is given by (also called the regression line on with ), then we need to find:

That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

Regression on with

Sometimes it may be better to have a least square error line in the form . This is called a regression line on with , where and are given by:

That is, find and for the line such that the total variance of all points (along the axis) is minimized, as depicted here:

Iterative optimisation

Iterative optimisation is used on optimisation problems that don't have a closed-form solution (might not converge, such as -means clustering)

Iterative optimisation method

  1. Choose an initial point , and make
  2. Choose based on an update formula for
  3. Increment , so and go to step 2 unless a stopping criterion is met.

Gradient descent is an example of an iterative optimisation method.

Gradient descent

Naïve Bayes classification

For a feature vector , we wish to know .

With the chain rule of probabilities:

Naïve Bayes classification naïvely assumes that every feature is conditionally independent of every other given , i.e:

Therefore it follows that after this naïve assumption:


Therefore, when applying maximum likelihood estimation (MLE) of priors and likelihoods (using Naïve Bayes assumption for the likelihoods), we get:

Advantages and disadvantages

Simplifies calculation of posterior probabilitiesIf there are no occurences of a class label and a certain feature together, then the entire posterior probability will be , since it is a product of all posteriors.

Text classification with Naïve Bayes

Given a document with a fixed set of classes , we classify as the MAP class:

But how do we represent , and how do we estimate and ?

Representation of

Multinomial document model

Note: Remember that when classifying an unlabelled document , we actually want the posterior — .

Bernoulli document model

Note: Remember that when classifying an unlabelled document , we actually want the posterior — .

Multivariate Gaussians and classification

Gaussian distribution

Parameter estimation from data


Multivariate form

For a -dimensional vector :

Note that the 1-dimensional Gaussian is a special case of this PDF.

Parameter estimation


The covariance of variables and is:

Covariance matrix

Where is a feature vector (an observation of variables) in:

Note: Diagonals are the variances.

Example: Consider the following test data with variables length, width and height of a certain object:

Then we have and

Which tells us that the covariance between the length and height is , etc.

Geometric interpretation

The covariance matrix defines the shape of the data.

Example: If is diagonal (such that the covariances are ), then the variances must be equal to the eigenvalues:

But if isn't diagonal, the eigenvalues still represent the variance magnitude in the direction of the largest spread of the data, and the variance components of the covariance matrix still represent the variance magnitude in the direction of the and axis.

But since the data isn't axis-aligned, these values aren't the same anymore:

Problems with estimation of covariance matrix

Diagonal covariance matrix

If is diagonal, then:

Bayes' theorem and univariate Gaussians

Remember that:

Log likelihood

Log posterior probability

Multivariate Gaussian classifier

Log likelihood

Log posterior probability

Performance measures

Discriminant functions

Decision regions


To estimate the probability of misclassification, we need to consider the two ways that a points can be classified wrongly:

  1. Assigning to when it belongs to
  2. Assigning to when it belongs in

The probability of the total error is then:

With conditional probabilities, we get:

Overlapping Gaussians

Connsider the overlapping Gaussian PDFs shown below.

Two possible decision boundaries are shown by the dashed line. The decision boundary on the left hand plot is optimal, assuming equal priors.

Note: The overall probability of error is given by the area of the shaded regions under the PDFs — that is:

Discriminant functions

For a set of classes, we have a set of discriminant functions , one for each class. Data point is assigned to class if:

In other words, assign to the class whose discriminant function is biggest.

One example of a discriminant function we've seen is simply the log posterior probability of class given a data point .

Optimal discriminant functions

Example: The form of the discriminant function when using a Gaussian PDF is given as: