# Continuous probability

## Continuous random variables

Random variables were previously defined in the discrete probability notes as:

A random variable is a function that maps each outcome of the sample space to some numerical value.

Given a sample space $\Omega$, a random variable $X$ with values in some set $\cal{R}$ is a function:

$$X:\Omega\mapsto\cal{R}$$

Where $\cal{R}$ was typically $\N$ or $\N_0$ for discrete RVs.

However, in continuous probability, the codomain $\cal{R}$ is always $\R$.

Therefore, a continuous random variable is a random variable which can take on uncountably many values (it has an uncountably infinite range).

Given a sample space $\Omega$, a continuous random variable $X$ is a function:

$$X:\Omega\mapsto\R$$

### Examples

• The continuous random variable $X_1$ could be the length of a randomly selected telephone call in seconds.
• The continuous random variable $X_2$ could be the volume of water in a bucket.

Note: Random variables can be partly continuous and partly discrete!

## Probability density function

### Why can't we use the PMF anymore?

A continuous random variable $X$ has what could be thought of as infinite precision.

More specifically, a continuous random variable can realise uncountably many real number values within its range, as there are uncountably many points in a line segment.

So we have infinitely many values whose probabilities must sum to one. This means that these probabilities must each be infinitesimal, and therefore:

$$\p{X=x}=0\quad\forall x\in\R$$

It is clear from this result that the probability mass function which we previously used in discrete probability will no longer provide any useful information.

### Definition

A probability density function is a function whose integral over an interval gives the probability that the value of a random variable falls within the interval.

$X:\Omega\mapsto\R$ is a continuous random variable if there is a function $f_X(x)$ such that:

$$\p{a\leq X\leq b}=\i{a}{b}{f_X(x)}{x}$$

The function $f_X(x)$ is called the probability density function (PDF).

For better reasoning as to why $\p{X=x}=0\quad\forall x\in\R$, we can now use the definition above: $\p{X=x}=\i{x}{x}{f_X(t)}{t}=0$, since an integral over an interval of zero width is zero.

### Properties

The following properties follow from the axioms:

• $\i{-\infty}{\infty}{f_X(x)}{x}=1$
• $f_X(x)\geq0$
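These two properties can be checked numerically for a concrete density. The sketch below (a Python example of my own, using only the standard library) uses the exponential PDF $f(x)=\lambda e^{-\lambda x}$ with $\lambda=2$:

```python
import math

# Example density: exponential with rate lam = 2
# (an assumption for illustration; any valid PDF would do).
lam = 2.0

def f(x):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# Midpoint-rule approximation of the integral of f over [0, 20];
# the tail beyond 20 is negligible for lam = 2.
n = 100_000
a, b = 0.0, 20.0
dx = (b - a) / n
total = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(round(total, 4))                                # approximately 1.0
print(all(f(x) >= 0 for x in (-1.0, 0.0, 0.5, 3.0)))  # True
```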

## Cumulative distribution function

Sometimes also called the cumulative density function (to differentiate it from the cumulative distribution of a discrete random variable), the cumulative distribution function of a continuous random variable $X$ evaluated at $x$ is the probability that $X$ will take a value less than or equal to $x$.

The cumulative distribution function is denoted $F_X(x)$, and defined as:

$$F_X(x)=\p{X\leq x}=\i{-\infty}{x}{f_X(t)}{t}$$

Additionally, if $f_X(x)$ is continuous at $x$:

$$\frac{\mathrm{d}}{\mathrm{d}x}F_X(x)=f_X(x)$$

The definition of the probability density function given earlier can be expressed in terms of the cumulative distribution function, by the fundamental theorem of calculus:

$$\p{a\leq X\leq b}=\i{a}{b}{f_X(x)}{x}=F_X(b)-F_X(a)$$

### Properties

• The cumulative distribution function is an increasing function.
• $F_X(\infty):=\ds\lim_{x\to\infty}\p{X\leq x}=1$
• $F_X(-\infty):=\ds\lim_{x\to-\infty}\p{X\leq x}=0$

### Example

Suppose the lifetime $X$ of a car battery has a probability $\p{X>x}=2^{-x}$ of lasting more than $x$ days. Find the probability density function of $X$.

We are given the complementary cumulative distribution function:

$$\p{X>x}=2^{-x}\quad\text{for }x\geq0$$

And we can determine the cumulative distribution function:

$$F_X(x)=\p{X\leq x}=1-\p{X>x}=1-2^{-x}$$

Differentiating then gives the probability density function:

$$f_X(x)=\frac{\mathrm{d}}{\mathrm{d}x}F_X(x)=2^{-x}\ln2\quad\text{for }x\geq0$$
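Differentiating $F_X(x)=1-2^{-x}$ gives $f_X(x)=2^{-x}\ln 2$ for $x\geq0$. A Python sketch checking this density numerically: it should integrate to one, and its tail integral from $3$ should reproduce the given $\p{X>3}=2^{-3}=0.125$:

```python
import math

def pdf(x):
    # Derived density of the battery lifetime: f(x) = 2^(-x) * ln(2), x >= 0
    return (2.0 ** -x) * math.log(2.0) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    # Simple midpoint-rule numerical integration
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

total = integrate(pdf, 0.0, 60.0)  # tail beyond 60 days is negligible
tail3 = integrate(pdf, 3.0, 60.0)  # P(X > 3) should be 2^(-3) = 0.125

print(round(total, 4))  # approximately 1.0
print(round(tail3, 4))  # approximately 0.125
```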

## Expectation

If a continuous random variable $X$ is given, and its distribution is given by a probability density function $f_X$, then the expected value of $X$ (if the expected value exists) can be calculated as:

$$\e{X}=\i{-\infty}{\infty}{x\,f_X(x)}{x}$$

### Moments

The $n$-th moment of a continuous random variable $X\in\R$ is given by:

$$\e{X^n}=\i{-\infty}{\infty}{x^n f_X(x)}{x}$$

### Properties

In general, the properties of expectation for continuous random variables are the same as that of discrete random variables, but switching sums with integrals:

• Linearity — for a set of tuples $\set{(X_i,c_i)}_{i=1}^n$, each consisting of a continuous random variable $X_i:\Omega\mapsto\R$ and a corresponding constant $c_i\in\R$:

$$\e{\sum_{i=1}^n c_iX_i}=\sum_{i=1}^n c_i\e{X_i}$$

• In general, if $g(X)$ is a function of $X$ (e.g. $X^2$, $\ln(X)$), then $g(X)$ is also a random variable.

If $g(X)\in\R$, its expectation is given by:

$$\e{g(X)}=\i{-\infty}{\infty}{g(x)f_X(x)}{x}$$

• All remaining properties of expectation for discrete random variables also carry over.
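These rules can be checked numerically. The sketch below (Python, standard library only, my own example) computes $\e{X}$ and $\e{X^2}$ for $X$ uniform on $[0,1]$, where the exact values are $1/2$ and $1/3$:

```python
def integrate(f, a, b, n=100_000):
    # Midpoint-rule numerical integration
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# PDF of X ~ Uniform(0, 1): f(x) = 1 on [0, 1]
pdf = lambda x: 1.0

e_x = integrate(lambda x: x * pdf(x), 0.0, 1.0)        # E[X] = 1/2
e_x2 = integrate(lambda x: x ** 2 * pdf(x), 0.0, 1.0)  # E[g(X)] with g(x) = x^2, = 1/3

print(round(e_x, 4), round(e_x2, 4))
```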

## Variance

If the random variable $X$ represents samples generated by a continuous distribution with probability density function $f_X$, then the population variance is given by:

$$\var{X}=\e{(X-\mu)^2}=\i{-\infty}{\infty}{(x-\mu)^2f_X(x)}{x}\quad\text{where }\mu=\e{X}$$

All properties from the variance of discrete random variables still hold for continuous random variables.
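As an illustration (a Python sketch under the assumption $X\sim\text{Uniform}(2,8)$), the variance integral can be evaluated numerically; the exact value for a uniform distribution on $[a,b]$ is $(b-a)^2/12=3$:

```python
def integrate(f, a, b, n=100_000):
    # Midpoint-rule numerical integration
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

a, b = 2.0, 8.0
pdf = lambda x: 1.0 / (b - a)  # Uniform(a, b) density on [a, b]

mu = integrate(lambda x: x * pdf(x), a, b)               # E[X] = 5
var = integrate(lambda x: (x - mu) ** 2 * pdf(x), a, b)  # Var[X] = (b-a)^2/12 = 3

print(round(mu, 4), round(var, 4))
```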

## Distributions

### Uniform distribution

The uniform distribution with parameters $a,b\in\R:-\infty<a<b<\infty$ is a distribution where all intervals of the same length on the distribution's support $[a,b]$, for a random variable $X:\Omega\mapsto[a,b]\subset\R$, are equally probable.

The support is defined by the two parameters $a$ and $b$.

The probability density function for a uniformly distributed random variable $X:\Omega\mapsto[a,b]\subset\R$ would be:

$$f_X(x)=\begin{cases}\dfrac{1}{b-a}&a\leq x\leq b\\0&\text{otherwise}\end{cases}$$

Additionally, the cumulative distribution function is given by:

$$F_X(x)=\begin{cases}0&x<a\\\dfrac{x-a}{b-a}&a\leq x\leq b\\1&x>b\end{cases}$$
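The piecewise uniform PDF and CDF translate directly into code. A minimal Python sketch (function names are my own):

```python
def uniform_pdf(x, a, b):
    # Density of Uniform(a, b): 1/(b - a) on [a, b], zero elsewhere
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # CDF of Uniform(a, b): 0 below a, linear ramp on [a, b], 1 above b
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(uniform_pdf(5.0, 0.0, 10.0))   # 0.1
print(uniform_cdf(5.0, 0.0, 10.0))   # 0.5
print(uniform_cdf(-1.0, 0.0, 10.0))  # 0.0
```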

### Exponential distribution

The exponential distribution is the probability distribution that describes the time between events in a process in which events occur continuously and independently at a constant average rate.

An exponentially distributed random variable $X:\Omega\mapsto\R$ with rate parameter $\lambda\in\R:\lambda>0$ has the probability density function:

$$f_X(x)=\begin{cases}\lambda e^{-\lambda x}&x\geq0\\0&x<0\end{cases}$$

Additionally, the cumulative distribution function is given by:

$$F_X(x)=\begin{cases}1-e^{-\lambda x}&x\geq0\\0&x<0\end{cases}$$
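One practical consequence of this closed-form CDF is that it can be inverted, turning uniform random numbers into exponential samples (inverse-transform sampling). A Python sketch of my own, assuming rate $\lambda=0.5$, whose sample mean should approach $1/\lambda=2$:

```python
import math
import random

def exp_cdf(x, lam):
    # F(x) = 1 - e^(-lam * x) for x >= 0
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_sample(lam, rng):
    # Inverse-transform sampling: solve F(x) = u for x
    u = rng.random()
    return -math.log(1.0 - u) / lam

rng = random.Random(0)  # fixed seed so the sketch is reproducible
lam = 0.5
samples = [exp_sample(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)

print(round(exp_cdf(2.0, lam), 4))  # F(2) = 1 - e^(-1), about 0.6321
print(round(mean, 1))               # close to 1/lam = 2
```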

### Gaussian distribution

To denote a random variable $X:\Omega\mapsto\R$ which is distributed according to the Gaussian distribution, we write $X\sim\cal{N}(\mu,\sigma^2)$, with standard deviation $\sigma$, variance $\sigma^2$ and mean/expectation $\mu$.

The probability density function for a Gaussian distributed random variable $X:\Omega\mapsto\R$ would be:

$$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Additionally, the cumulative distribution function is given by the integral:

$$F_X(x)=\i{-\infty}{x}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(t-\mu)^2}{2\sigma^2}}}{t}$$

Note: We must use an evaluation table to determine the CDF evaluated at $x$, since $\mathrm{erf}$ is not an elementary function.

#### Standard normal distribution

The standard normal distribution (sometimes normal distribution, though this is ambiguous naming) is a special case of the Gaussian distribution, when $\mu=0$ and $\sigma^2=1$.

To denote a random variable $X:\Omega\mapsto\R$ which is (standard) normally distributed, we write $X\sim\mathcal{N}(0,1)$.

Additionally, the cumulative distribution function is given by the integral:

$$\Phi(x)=\frac{1}{\sqrt{2\pi}}\i{-\infty}{x}{e^{-t^2/2}}{t}$$

Note: This integral cannot be expressed in terms of elementary functions; it relies on the special $\mathrm{erf}$ function. To evaluate it we must instead use an evaluation table, specifically Table 5.1 in Section 5.4.
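In code, however, most maths libraries expose $\mathrm{erf}$ directly, so the table values can be reproduced. A Python sketch using the standard identity $\Phi(x)=\frac{1}{2}\left(1+\mathrm{erf}\left(x/\sqrt{2}\right)\right)$:

```python
import math

def phi(x):
    # Standard normal CDF via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Reproduce a few familiar table values
print(round(phi(0.0), 4))    # 0.5
print(round(phi(1.96), 4))   # 0.975
print(round(phi(-1.96), 4))  # 0.025
```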

### Approximations of the binomial distribution

Recall that the binomial distribution is a discrete probability distribution representing the number of successes in a sequence of $n$ independent experiments, with each experiment being a Bernoulli trial (success/failure experiment) with probability of success $p$.

For a binomially distributed random variable $X_{n,p}$, the probability mass function is given by:

$$f_{X_{n,p}}(x)=\p{X_{n,p}=x}=\binom{n}{x}p^x(1-p)^{n-x}$$

Where $X_{n,p}$ is the number of successes in $n$ trials.

#### Poisson approximation

Recall that for a Poisson distributed random variable $X_\lambda$, the probability mass function is given by:

$$f_{X_\lambda}(x)=\p{X_\lambda=x}=\frac{\lambda^xe^{-\lambda}}{x!}$$

Where $X_\lambda$ is the number of successes if they occur at rate $\lambda$.

We can approximate the binomial distribution with the Poisson distribution reasonably well when $n\to\infty$ and $p$ is small (with $np<10$). This is true because $\lim_{n\to\infty}f_{X_{n,p}}(x)=f_{X_\lambda}(x)$ when $\lambda=np$ — that is:

$$\lim_{n\to\infty}\binom{n}{x}p^x(1-p)^{n-x}=\frac{(np)^xe^{-np}}{x!}$$
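This limit can be seen numerically. A Python sketch (standard library only; the parameters $n=1000$, $p=0.003$ are my own illustrative choice) comparing the binomial PMF against the Poisson PMF with $\lambda=np=3$:

```python
import math

def binom_pmf(k, n, p):
    # Binomial PMF: C(n, k) * p^k * (1-p)^(n-k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # Poisson PMF: lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

n, p = 1000, 0.003
lam = n * p  # 3.0

# The two columns of probabilities should agree closely
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```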

#### Gaussian/normal approximation

Note that a binomially distributed random variable such as $X_{n,p}$ can be expressed as a sum of $n$ Bernoulli random variables — that is:

$$X_{n,p}=\sum_{i=1}^nY_i\quad\text{where }Y_i\sim\text{Bernoulli}(p)$$

• $\e{Y_i}=p$ and $\var{Y_i}=p(1-p)$
• $\e{X_{n,p}}=np$ and $\var{X_{n,p}}=np(1-p)$

We then have $\sd{X_{n,p}}=\sqrt{np(1-p)}$.

Note: This section may not be examinable, but it is useful for deriving the Gaussian approximation.

A standard score (denoted $Z$) is the number of standard deviations by which a data point is above or below the mean value of what is being observed or measured.

To standardise a data point $x$, we can use the normal standardisation formula:

$$z=\frac{x-\mu}{\sigma}$$

If we use the normal standardisation formula for $X_{n,p}$, we get:

$$Z=\frac{X_{n,p}-\e{X_{n,p}}}{\sd{X_{n,p}}}=\frac{X_{n,p}-np}{\sqrt{np(1-p)}}$$

By using the fact that $X_{n,p}$ can be expressed as a sum of Bernoulli random variables $\sum_{i=1}^nY_i$ (as discussed earlier), and the central limit theorem (which will be discussed a bit later), we can see that:

• $Z\sim\mathcal{N}(0,1)$
• $X_{n,p}\sim\mathcal{N}\left[\mu=np,\sigma^2=np(1-p)\right]$

Note: The normal approximation of the binomial is reasonable when $np(1-p)$ is large, or more specifically when $p$ and $1-p$ are not too small relative to $n$ — that is:

• $np\geq10$
• $n(1-p)\geq10$

## De Moivre-Laplace theorem

For a sequence $\set{X_j}_{j\in\N}$ of independent Bernoulli random variables with success probability $p$, we have (for $a\leq b$):

$$\lim_{n\to\infty}\p{a\leq\frac{\sum_{j=1}^nX_j-np}{\sqrt{np(1-p)}}\leq b}=\frac{1}{\sqrt{2\pi}}\i{a}{b}{e^{-x^2/2}}{x}$$

Or alternatively, with $E=\e{X_j}=p$ and $V=\var{X_j}=p(1-p)$:

$$\lim_{n\to\infty}\p{a\leq\frac{\sum_{j=1}^nX_j-nE}{\sqrt{nV}}\leq b}=\frac{1}{\sqrt{2\pi}}\i{a}{b}{e^{-x^2/2}}{x}$$

This theorem essentially states that the probability mass function of the centred and normalised binomial random variable converges (for $n\to\infty$ and $p=\text{const}$) to the probability density function of the normal random variable.

### Continuity correction

Sometimes when using the De Moivre-Laplace theorem, or approximating a discrete probability distribution with a continuous probability distribution, we must use continuity correction. For a discrete random variable $X\in\Z$, we can write:

$$\p{X=x}=\p{x-\tfrac{1}{2}\leq X\leq x+\tfrac{1}{2}}$$

#### Example

Consider a fair coin being tossed $40$ times.

Let the random variable $X_{40}$ represent the number of heads.

Then $\e{X_{40}}=20$.

Approximate $\p{X_{40}=20}$ using the Gaussian random variable.

First, we can start by correcting the discrete random variable for continuity, then standardising (with $\mu=np=20$ and $\sigma=\sqrt{np(1-p)}=\sqrt{10}$):

$$\p{X_{40}=20}=\p{19.5\leq X_{40}\leq20.5}\approx\Phi\left(\frac{20.5-20}{\sqrt{10}}\right)-\Phi\left(\frac{19.5-20}{\sqrt{10}}\right)=\Phi(0.16)-\Phi(-0.16)\approx0.1272$$

We can compare this to the result of letting $X_{40}$ be a binomially distributed random variable.

Recall that $\p{X=k}=\binom{n}{k}p^k(1-p)^{n-k}$. Therefore:

$$\p{X_{40}=20}=\binom{40}{20}\left(\frac{1}{2}\right)^{40}\approx0.1254$$

As you can see, approximating with a Gaussian random variable led to a reasonably accurate probability, but remember that we get a better estimate when $np(1-p)$ is large.
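This comparison can be reproduced in a few lines. A Python sketch (standard library only), using `math.erf` for $\Phi$ rather than a table, so the standard score is not rounded to two decimal places:

```python
import math

def phi(x):
    # Standard normal CDF via erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 40, 0.5
mu = n * p                           # 20
sigma = math.sqrt(n * p * (1 - p))   # sqrt(10)

# Normal approximation with continuity correction:
# P(X = 20) is approximated by P(19.5 <= X <= 20.5)
approx = phi((20.5 - mu) / sigma) - phi((19.5 - mu) / sigma)

# Exact binomial probability: C(40, 20) * (1/2)^40
exact = math.comb(40, 20) * 0.5 ** 40

print(round(approx, 4))  # about 0.1256
print(round(exact, 4))   # about 0.1254
```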

## Relating probability density functions

Suppose we have a continuous random variable $X:\Omega\mapsto\R$ and some continuous function $g:\R\mapsto\R$. Note that $g(X)$ is also a random variable.

We will look at relating the two probability density functions $f_X$ and $f_{g(X)}$ by considering two different cases for $g$ — when $g$ is an increasing function and when it is a decreasing function.

### $g$ is an increasing function

By the definition of (strictly) increasing functions, we must have:

$$x_1<x_2\implies g(x_1)<g(x_2)$$

If we look at the cumulative distribution function for $g(X)$, we can determine a relationship between $f_X$ and $f_{g(X)}$:

$$F_{g(X)}(y)=\p{g(X)\leq y}=\p{X\leq g^{-1}(y)}=F_X(g^{-1}(y))$$

Differentiating with respect to $y$ then gives:

$$f_{g(X)}(y)=f_X(g^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)$$

### $g$ is a decreasing function

By the definition of (strictly) decreasing functions, we must have:

$$x_1<x_2\implies g(x_1)>g(x_2)$$

Once again, if we consider the cumulative distribution function for $g(X)$, we can determine a relationship between $f_X$ and $f_{g(X)}$:

$$F_{g(X)}(y)=\p{g(X)\leq y}=\p{X\geq g^{-1}(y)}=1-F_X(g^{-1}(y))$$

Differentiating with respect to $y$ then gives:

$$f_{g(X)}(y)=-f_X(g^{-1}(y))\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)$$
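Both cases can be sanity-checked numerically. The sketch below (Python, an example of my own) takes $X\sim\text{Uniform}(0,1)$ and the increasing function $g(x)=x^2$ on $(0,1)$, where the change-of-variables formula gives $f_{g(X)}(y)=f_X(\sqrt{y})\cdot\frac{1}{2\sqrt{y}}=\frac{1}{2\sqrt{y}}$, and compares it with a numerical derivative of $F_{g(X)}(y)=\p{X^2\leq y}=\sqrt{y}$:

```python
import math

# X ~ Uniform(0, 1), g(x) = x^2 (increasing on (0, 1))
f_X = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
g_inv = math.sqrt                               # g^{-1}(y) = sqrt(y)
dg_inv = lambda y: 1.0 / (2.0 * math.sqrt(y))   # d/dy g^{-1}(y)

def f_Y(y):
    # Change-of-variables formula for an increasing g
    return f_X(g_inv(y)) * dg_inv(y)

def F_Y(y):
    # CDF of Y = X^2 directly: P(X^2 <= y) = sqrt(y) on (0, 1)
    return math.sqrt(y)

# Compare the formula against a central-difference derivative of F_Y
y, h = 0.25, 1e-6
numeric = (F_Y(y + h) - F_Y(y - h)) / (2.0 * h)

print(round(f_Y(y), 4))   # 1/(2*sqrt(0.25)) = 1.0
print(round(numeric, 4))  # close to 1.0
```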

## Hazard rate function

The hazard rate function is the frequency with which a component fails, expressed in failures per unit of time.

Although the hazard rate function $\lambda(t)$ is often thought of as the probability that a failure occurs in a specified interval given no failure before time $t$, it is not actually a probability because it can exceed one.

The hazard rate function for a continuous random variable $X:\Omega\mapsto\R$ is given by:

$$\lambda(t)=\frac{f_X(t)}{1-F_X(t)}=\frac{f_X(t)}{R_X(t)}$$

Where:

• $f_X(t)$ is called the failure density function, and is the probability that the failure will fall in a specified interval.
• $F_X(t)$ is called the failure distribution function, and is the probability of the failure of a component, up to and including a certain time $t$.
• $R_X(t)=1-F_X(t)$ is called the survival function, and is the complementary cumulative distribution function — the probability of survival of a component past a certain time $t$.

### Example 1

Consider an exponentially distributed random variable $X:\Omega\mapsto\R$.

Recall that for $t\geq0,\lambda>0$:

$$f_X(t)=\lambda e^{-\lambda t}\qquad F_X(t)=1-e^{-\lambda t}$$

Therefore the hazard rate function is:

$$\lambda(t)=\frac{f_X(t)}{1-F_X(t)}=\frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}=\lambda$$

That is, an exponentially distributed lifetime has a constant hazard rate.
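Numerically, the exponential distribution's hazard rate works out to a constant. A short Python sketch (assuming $\lambda=0.5$ for illustration) evaluating $\lambda(t)=f_X(t)/(1-F_X(t))$ at several times:

```python
import math

lam = 0.5  # assumed rate parameter for illustration

def pdf(t):
    # Exponential density: lam * e^(-lam * t) for t >= 0
    return lam * math.exp(-lam * t)

def cdf(t):
    # Exponential CDF: 1 - e^(-lam * t) for t >= 0
    return 1.0 - math.exp(-lam * t)

def hazard(t):
    # lambda(t) = f(t) / R(t), with survival function R(t) = 1 - F(t)
    return pdf(t) / (1.0 - cdf(t))

# Every entry should equal lam = 0.5, regardless of t
print([round(hazard(t), 6) for t in (0.1, 1.0, 5.0, 20.0)])
```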