# Discrete probability

For some statistical experiment being performed:

- The set of all possible outcomes is called the **sample space**, denoted $\Omega$.
- A subset $E \subseteq \Omega$ is an **event**.

Let $E$ and $F$ be events from some sample space $\Omega$.

- $E \cup F$ is the event that either (or both) $E$ or $F$ happens
- $E \cap F$ (or $EF$) is the event that both $E$ and $F$ happen
- $E^c$ (or $\bar{E}$) is the event that $E$ does **not** happen ($E^c = \Omega \setminus E$)

For events $E_1, E_2, \ldots, E_n$, **De Morgan's laws** state:

$$\left(\bigcup_{i=1}^{n} E_i\right)^c = \bigcap_{i=1}^{n} E_i^c \qquad \left(\bigcap_{i=1}^{n} E_i\right)^c = \bigcup_{i=1}^{n} E_i^c$$

For each event $E$, we assign a probability $P(E)$ satisfying:

$$0 \le P(E) \le 1 \qquad P(\Omega) = 1$$

For any sequence of **mutually exclusive events** $E_1, E_2, \ldots$:

$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$$

Where **mutual exclusivity** means $E_i \cap E_j = \emptyset$ for all $i \ne j$.

For a finite sequence of arbitrary events $E_1, E_2, \ldots, E_n$ where $n \ge 2$:

$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i} P(E_i) - \sum_{i < j} P(E_i \cap E_j) + \cdots + (-1)^{n+1} P(E_1 \cap \cdots \cap E_n)$$

- For $n = 2$: $P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$

- For $n = 3$: $P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3) - P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3) + P(E_1 \cap E_2 \cap E_3)$
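The $n = 2$ case can be checked by brute force on a small finite sample space. A minimal sketch, where the die and the two events are chosen purely for illustration:

```python
from fractions import Fraction

# Inclusion-exclusion (n = 2) on a fair six-sided die.
omega = set(range(1, 7))          # sample space: faces 1..6
E = {2, 4, 6}                     # "roll is even"
F = {4, 5, 6}                     # "roll is at least 4"

def P(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

lhs = P(E | F)                    # P(E ∪ F)
rhs = P(E) + P(F) - P(E & F)      # P(E) + P(F) - P(E ∩ F)
assert lhs == rhs                 # both equal 2/3 here
```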

A **random variable** is a function that maps each outcome of the sample space to some numerical value.

Given a sample space $\Omega$, a random variable $X$ with values in some set $S$ is a function:

$$X : \Omega \to S$$

Where $S$ is typically $\mathbb{Z}$ or $\mathbb{N}$ in discrete probability and $\mathbb{R}$ in continuous probability.

- The random variable $X$ is a **discrete random variable** when its range is finite (or countably infinite).

- The random variable $X$ is a **continuous random variable** when its range is uncountably infinite.

Random variables often make it easier to ask questions such as:

How likely is it that the value of $X$ is equal to $a$?

This is the same as the probability of the event $\{s \in \Omega : X(s) = a\}$, which is often denoted as $P(X = a)$ and read "*the probability of the random variable $X$ taking on the value $a$*".

Let our statistical experiment be the toss of a fair coin. We will perform this experiment $n$ times, giving us a sample space of $2^n$ equally likely sequences of heads and tails.

Let $X$ be the random variable denoting the number of heads after $n$ coin flips, so that $P(X = k) = \binom{n}{k} \left(\frac{1}{2}\right)^n$.

Note that for the collection $\{P(X = k) : 0 \le k \le n\}$, we have:

$$P(X = k) \ge 0 \qquad \sum_{k=0}^{n} P(X = k) = 1$$

As we will see later, this represents a probability distribution, and these are properties that all probability distributions must have.
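These two properties can be verified by enumerating every outcome. A small sketch (the choice $n = 3$ is an assumption for illustration):

```python
from fractions import Fraction
from itertools import product
from math import comb

# Enumerate all 2^n outcomes of n fair coin flips and count heads.
n = 3
outcomes = list(product("HT", repeat=n))

def P_heads(k):
    """P(X = k): fraction of outcomes with exactly k heads."""
    favourable = sum(1 for o in outcomes if o.count("H") == k)
    return Fraction(favourable, len(outcomes))

# Each probability matches C(n, k) / 2^n, and they sum to 1.
for k in range(n + 1):
    assert P_heads(k) == Fraction(comb(n, k), 2 ** n)
assert sum(P_heads(k) for k in range(n + 1)) == 1
```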

**Stirling's approximation** is an approximation for the factorial operation. It is an accurate estimation, even for smaller values of $n$.

The approximation is:

$$n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$$

Where the $\sim$ sign means that the two quantities are asymptotic. This means that their ratio tends to $1$ as $n$ tends to $\infty$.

Alternatively, there is a version of Stirling's formula with bounds valid for all positive integers $n$, rather than asymptotics:

$$\sqrt{2\pi}\; n^{n + \frac{1}{2}} e^{-n} \le n! \le e\; n^{n + \frac{1}{2}} e^{-n}$$
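The claim that the ratio tends to $1$ is easy to observe numerically. A minimal sketch:

```python
from math import factorial, sqrt, pi, e

def stirling(n):
    """Stirling's approximation: n! ~ sqrt(2*pi*n) * (n/e)**n."""
    return sqrt(2 * pi * n) * (n / e) ** n

# The ratio n! / stirling(n) approaches 1 and is already close for small n.
for n in [1, 5, 10, 20]:
    print(n, factorial(n) / stirling(n))

# By n = 20 the relative error is under half a percent.
assert abs(factorial(20) / stirling(20) - 1) < 0.005
```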

A **probability distribution** is a mathematical function that maps each outcome of a statistical experiment to its probability of occurrence.

A **probability mass function** is a function that gives the probability that a discrete random variable is exactly equal to some value. It defines a discrete probability distribution.

Suppose that $X$ is a discrete random variable. Then the probability mass function $p_X$ for $X$ is defined as:

$$p_X(a) = P(X = a)$$

This is the probability mass function of a discrete probability distribution.

For instance, let $X$ be the value shown by a fair six-sided die. In this case, we have a random variable $X$ and a probability mass function $p_X$ with $p_X(a) = \frac{1}{6}$ for $a \in \{1, 2, \ldots, 6\}$.

Consider the following probabilities as examples:

$$p_X(3) = P(X = 3) = \frac{1}{6} \qquad p_X(7) = P(X = 7) = 0$$

For any probability distribution (with some random variable $X$), its probability mass function $p_X$ must satisfy both of the following conditions:

$$p_X(a) \ge 0 \text{ for all } a \qquad \sum_{a} p_X(a) = 1$$

The **cumulative distribution function** $F_X$ of a random variable $X$ evaluated at $a$ is the probability that $X$ will take a value less than or equal to $a$:

$$F_X(a) = P(X \le a)$$

If $X$ is a discrete random variable that maps to values $x_1 \le x_2 \le x_3 \le \cdots$, then the cumulative distribution function is defined as:

$$F_X(a) = \sum_{x_i \le a} P(X = x_i)$$

Sometimes, it is useful to study the opposite question: how often the random variable is **above** a particular value. This is called the **complementary cumulative distribution function** or simply the **tail distribution**, denoted $\bar{F}_X$, and is defined as:

$$\bar{F}_X(a) = P(X > a) = 1 - F_X(a)$$

A random variable is **uniformly distributed** if every possible outcome is equally likely to be observed. In other words, for some statistical experiment, suppose there are $n$ different outcomes. Then the probability of each outcome is $\frac{1}{n}$.

Therefore, the probability mass function for a uniformly distributed discrete random variable $X$ with $n$ possible outcomes would be:

$$P(X = a) = \frac{1}{n}$$

Parameter | Meaning |
---|---|
$n$ | Number of possible outcomes |

The **binomial distribution** with parameters $n$ and $p$ is the discrete probability distribution of the number of successes ($k$) in a sequence of $n$ **Bernoulli trials**.

The probability mass function for a binomially distributed discrete random variable $X$ for $n$ Bernoulli trials (each with probability of success $p$) would be:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Parameter | Meaning |
---|---|
$n$ | Number of trials |
$p$ | Probability of success in each trial |

Quantity (or function) | Formula |
---|---|
Mean (expected value) | $np$ |
Variance | $np(1 - p)$ |
Moment-generating function | $(1 - p + pe^t)^n$ |
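The PMF and the mean/variance formulas above can be cross-checked directly. A minimal sketch (the parameters $n = 10$, $p = 0.3$ are illustrative):

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

assert isclose(sum(pmf), 1.0)                    # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))
assert isclose(mean, n * p)                      # E[X] = np
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
assert isclose(var, n * p * (1 - p))             # Var(X) = np(1 - p)
```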

The **Poisson distribution** is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant rate and independently of the time since the last event.

The probability mass function for a Poisson distributed discrete random variable $X$ with some constant rate $\lambda$ would be:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Parameter | Meaning |
---|---|
$\lambda$ | Rate |

Quantity (or function) | Formula |
---|---|
Mean (expected value) | $\lambda$ |
Variance | $\lambda$ |
Moment-generating function | $e^{\lambda(e^t - 1)}$ |
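The fact that both the mean and variance equal $\lambda$ can be checked numerically by truncating the infinite support. A minimal sketch (the rate $\lambda = 4$ is illustrative):

```python
from math import exp, factorial, isclose

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): lam^k * e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0
# Truncate the infinite support; the tail beyond k = 100 is negligible here.
pmf = [poisson_pmf(k, lam) for k in range(101)]

assert isclose(sum(pmf), 1.0)
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
assert isclose(mean, lam)                         # mean and variance
assert isclose(var, lam)                          # are both lambda
```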

The **negative binomial distribution** is a discrete probability distribution of the number of trials in a sequence of independent and identically distributed Bernoulli trials before a specified number of successes occurs.

The probability mass function for a negative binomially distributed discrete random variable $X$ with $n$ trials given $r$ successes, would be:

$$P(X = n) = \binom{n - 1}{r - 1} p^r (1 - p)^{n - r}$$

Parameter | Meaning |
---|---|
$r \in \mathbb{N}_{> 0}$ (but can be extended to $r \in \mathbb{R}_{> 0}$) | Number of successes until the experiment is stopped |
$p \in [0, 1]$ | Success probability in each experiment |

Quantity (or function) | Formula |
---|---|
Mean (expected value) | $\frac{r}{p}$ |
Variance | $\frac{r(1 - p)}{p^2}$ |
Moment-generating function | $\left(\frac{pe^t}{1 - (1 - p)e^t}\right)^r$ |

X counts | PMF | Formula | Support |
---|---|---|---|
$n$ trials, given $r$ successes | $P(X = n)$ | $\binom{n - 1}{r - 1} p^r (1 - p)^{n - r}$ | $n = r, r + 1, r + 2, \ldots$ |
$k$ failures, given $r$ successes | $P(X = k)$ | $\binom{k + r - 1}{k} p^r (1 - p)^k$ | $k = 0, 1, 2, \ldots$ |
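The two forms describe the same experiment, shifted by $r$ (trials $=$ failures $+ r$), so their PMFs agree term by term. A minimal sketch, with illustrative parameters $r = 3$, $p = 0.4$:

```python
from math import comb, isclose

def nbinom_trials(n, r, p):
    """P(X = n) where X counts trials until the r-th success (n >= r)."""
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

def nbinom_failures(k, r, p):
    """P(Y = k) where Y counts failures before the r-th success (k >= 0)."""
    return comb(k + r - 1, k) * p**r * (1 - p) ** k

r, p = 3, 0.4
# Same experiment shifted by r: P(X = k + r) = P(Y = k) for every k.
for k in range(50):
    assert isclose(nbinom_trials(k + r, r, p), nbinom_failures(k, r, p))

# The truncated sum over the support is (numerically) 1.
assert isclose(sum(nbinom_trials(n, r, p) for n in range(r, 200)), 1.0)
```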

The **geometric distribution** is a special case of the negative binomial distribution, with the parameter $r = 1$.

The geometric distribution gives the probability that the first occurrence of success requires $n$ independent Bernoulli trials, each with success probability $p$.

The probability mass function for a geometrically distributed discrete random variable $X$ with the first success being the $n^{\text{th}}$ trial, would be:

$$P(X = n) = (1 - p)^{n - 1} p$$

Parameter | Meaning |
---|---|
$p \in (0, 1]$ | Success probability in each experiment |

Quantity (or function) | Formula |
---|---|
Mean (expected value) | $\frac{1}{p}$ |
Variance | $\frac{1 - p}{p^2}$ |
Moment-generating function | $\frac{pe^t}{1 - (1 - p)e^t}$ |

The geometric distribution applies when:

- The phenomenon being modelled is a sequence of independent trials
- There are only two possible outcomes for each trial (success/failure)
- The probability of success, $p$, is the same for every trial
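The mean $\frac{1}{p}$ and variance $\frac{1-p}{p^2}$ can be checked against the PMF by summing a long truncated support. A minimal sketch (the value $p = 0.25$ is illustrative):

```python
from math import isclose

def geom_pmf(n, p):
    """P(X = n): first success on trial n, i.e. (1-p)^(n-1) * p."""
    return (1 - p) ** (n - 1) * p

p = 0.25
support = range(1, 2000)                          # truncation; tail is negligible

assert isclose(sum(geom_pmf(n, p) for n in support), 1.0)
mean = sum(n * geom_pmf(n, p) for n in support)
assert isclose(mean, 1 / p)                       # E[X] = 1/p = 4
var = sum((n - mean) ** 2 * geom_pmf(n, p) for n in support)
assert isclose(var, (1 - p) / p**2)               # Var(X) = (1-p)/p^2 = 12
```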

The **hypergeometric distribution** is a discrete probability distribution that describes the probability of $k$ successes (random draws for which the object drawn has a specified feature) in $n$ draws, **without replacement**, from a finite population of size $N$ that contains exactly $K$ objects with that feature, where each draw is either a success or a failure.

The probability mass function for a hypergeometrically distributed discrete random variable $X$ with $k$ **successes**, would be:

$$P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$$

Parameter | Meaning |
---|---|
$N$ | Population size |
$K$ | Number of objects with a specific feature |
$n$ | Number of draws |

Quantity (or function) | Formula |
---|---|
Mean (expected value) | $\frac{nK}{N}$ |
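Because the PMF is a ratio of binomial coefficients, it can be evaluated exactly with rational arithmetic. A minimal sketch (the parameters $N = 50$, $K = 5$, $n = 10$ are illustrative):

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws without replacement from a
    population of size N containing K objects with the feature."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, K, n = 50, 5, 10
pmf = {k: hypergeom_pmf(k, N, K, n) for k in range(min(n, K) + 1)}

assert sum(pmf.values()) == 1                     # exact with Fractions
mean = sum(k * pk for k, pk in pmf.items())
assert mean == Fraction(n * K, N)                 # E[X] = nK/N
```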

Previously, we introduced $P(E \cap F)$ as the probability of the intersection of the events $E$ and $F$.

If instead, we let these events be described by the random variables:

- $E = \{X = x\}$
- $F = \{Y = y\}$

Then we can write:

$$P(E \cap F) = P(\{X = x\} \cap \{Y = y\})$$

Typically we write this as $P(X = x, Y = y)$, and this is referred to as the **joint probability** of $X = x$ and $Y = y$.

If $X$ and $Y$ are discrete random variables, the function given by $f(x, y) = P(X = x, Y = y)$ for each pair of values $(x, y)$, is called the **joint probability distribution** of $X$ and $Y$.

If $X$ and $Y$ are discrete random variables, the definition of the **joint cumulative distribution function** of $X$ and $Y$ is given by:

$$F(x, y) = P(X \le x, Y \le y) = \sum_{s \le x} \sum_{t \le y} f(s, t)$$

where $f(s, t)$ is the joint probability distribution of $X$ and $Y$ at $(s, t)$.

Consider two discrete random variables $X$ and $Y$. We say that $X$ and $Y$ are independent if:

$$P(X = x, Y = y) = P(X = x) \, P(Y = y) \quad \text{for all } x, y$$

The definition of independence can be extended to $n$ random variables:

Consider discrete random variables $X_1, X_2, \ldots, X_n$. We say that $X_1, X_2, \ldots, X_n$ are **mutually independent** if:

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) \quad \text{for all } x_1, \ldots, x_n$$
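The definition can be illustrated on a small joint table. A minimal sketch, using an illustrative pair of independent variables (a fair coin coded 0/1 and a fair die) where the joint distribution is built as the product of the marginals, so the independence check holds by construction:

```python
from fractions import Fraction

# Marginal distributions of X (fair coin, coded 0/1) and Y (fair die).
px = {0: Fraction(1, 2), 1: Fraction(1, 2)}
py = {d: Fraction(1, 6) for d in range(1, 7)}

# Joint distribution of two independent variables: product of marginals.
joint = {(x, y): px[x] * py[y] for x in px for y in py}

# Independence: P(X = x, Y = y) = P(X = x) * P(Y = y) for ALL pairs.
assert all(joint[(x, y)] == px[x] * py[y] for x in px for y in py)
assert sum(joint.values()) == 1                   # still a valid distribution
```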

**Conditional probability** is a measure of the probability of an event, given that some other event has occurred.

If the event of interest is $E$ and the event $F$ is known to have occurred, the conditional probability of $E$ given $F$ is written as $P(E \mid F)$.

Given two events $E$ and $F$ with $P(F) > 0$, the conditional probability of $E$ given $F$ is defined as:

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$

This may be visualised as restricting the sample space to $F$.

Sometimes the definition of conditional probability is treated as an **axiom of probability**:

$$P(E \cap F) = P(E \mid F) \, P(F)$$

This is simply a rearrangement of the equation previously shown.

Events $E$ and $F$ are said to be **statistically independent** if their joint probability equals the product of the probability of each event:

$$P(E \cap F) = P(E) \, P(F)$$

By substituting this into the definition of conditional probability, we get:

$$P(E \mid F) = \frac{P(E) \, P(F)}{P(F)} = P(E)$$

Intuitively this makes sense: if $E$ and $F$ are independent, then the fact that event $F$ has already occurred should not influence the probability of event $E$ occurring.

A finite set of events $\{E_1, \ldots, E_n\}$ is **pairwise independent** if every pair of events is independent, that is, **iff** for all distinct $i, j$:

$$P(E_i \cap E_j) = P(E_i) \, P(E_j)$$

A finite set of events $\{E_1, \ldots, E_n\}$ is **mutually independent** if every event is independent of any intersection of the other events, that is, **iff** for every $k$-element subset $\{E_{i_1}, \ldots, E_{i_k}\}$ of $\{E_1, \ldots, E_n\}$:

$$P\left(\bigcap_{j=1}^{k} E_{i_j}\right) = \prod_{j=1}^{k} P(E_{i_j})$$

The **law of total probability** is the proposition that if $\{B_1, B_2, \ldots, B_n\}$ is a finite **partition** of a sample space (in other words, a set of pairwise disjoint events whose union is the entire sample space), then for any event $E$ of the same **probability space**:

$$P(E) = \sum_{i=1}^{n} P(E \cap B_i) = \sum_{i=1}^{n} P(E \mid B_i) \, P(B_i)$$
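The law can be verified on a small finite example. A minimal sketch, where the die, the partition (residues mod 3), and the event are chosen purely for illustration:

```python
from fractions import Fraction

# Six-sided die; partition the sample space by residue mod 3.
omega = set(range(1, 7))
partition = [{3, 6}, {1, 4}, {2, 5}]              # pairwise disjoint, union = omega
E = {2, 4, 6}                                     # "roll is even"

def P(event):
    return Fraction(len(event), len(omega))

def P_cond(A, B):
    """Conditional probability P(A | B) = P(A ∩ B) / P(B)."""
    return P(A & B) / P(B)

# Law of total probability: sum of P(E | B_i) P(B_i) recovers P(E).
total = sum(P_cond(E, B) * P(B) for B in partition)
assert total == P(E)                              # = 1/2
```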

**Bayes' theorem** describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

Bayes' theorem shows that:

$$P(E \mid F) \propto P(F \mid E) \, P(E)$$

In other words, there exists some constant $c$ such that:

$$P(E \mid F) = c \, P(F \mid E) \, P(E) \qquad P(E^c \mid F) = c \, P(F \mid E^c) \, P(E^c)$$

If we add these two formulas, we deduce that:

$$1 = c \left( P(F \mid E) \, P(E) + P(F \mid E^c) \, P(E^c) \right)$$

Therefore, the constant $c$ can be expressed as:

$$c = \frac{1}{P(F \mid E) \, P(E) + P(F \mid E^c) \, P(E^c)} = \frac{1}{P(F)}$$

**Bayes' theorem** is then mathematically defined as:

$$P(E \mid F) = \frac{P(F \mid E) \, P(E)}{P(F)}$$

Or alternatively:

$$P(E \mid F) = \frac{P(F \mid E) \, P(E)}{P(F \mid E) \, P(E) + P(F \mid E^c) \, P(E^c)}$$
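A classic numeric illustration of Bayes' theorem is updating on a diagnostic test result; all of the numbers below are made-up illustrative values, not data:

```python
from fractions import Fraction

# E = "has the condition", F = "test is positive".
p_E = Fraction(1, 100)             # prior P(E)
p_F_given_E = Fraction(95, 100)    # sensitivity P(F | E)
p_F_given_notE = Fraction(5, 100)  # false-positive rate P(F | E^c)

# Denominator via the law of total probability: P(F).
p_F = p_F_given_E * p_E + p_F_given_notE * (1 - p_E)

# Bayes' theorem: P(E | F) = P(F | E) P(E) / P(F).
p_E_given_F = p_F_given_E * p_E / p_F
print(p_E_given_F)                 # 19/118, roughly 0.16
```

Even with a fairly accurate test, the low prior keeps the posterior well below certainty, which is the point the derivation above formalises.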

The **chain rule** (or **multiplication rule**) permits the calculation of any member of the **joint distribution** of a set of random variables using only conditional probabilities.

Consider an indexed collection of events $A_1, A_2, \ldots, A_n$. Then we can apply the definition of conditional probability to calculate the joint probability:

$$P(A_n \cap \cdots \cap A_1) = P(A_n \mid A_{n-1} \cap \cdots \cap A_1) \cdot P(A_{n-1} \cap \cdots \cap A_1)$$

Repeating this process with each final term creates the product:

$$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P\left(A_i \,\middle|\, \bigcap_{j=1}^{i-1} A_j\right)$$

With four variables, the chain rule produces this product of conditional probabilities:

$$P(A_4 \cap A_3 \cap A_2 \cap A_1) = P(A_4 \mid A_3 \cap A_2 \cap A_1) \cdot P(A_3 \mid A_2 \cap A_1) \cdot P(A_2 \mid A_1) \cdot P(A_1)$$

Two events are **mutually exclusive** (or **disjoint**) if they cannot both occur. In other words, events $E$ and $F$ are mutually exclusive **iff** $E \cap F = \emptyset$.

This has a consequence for the inclusion-exclusion principle. If $E$ and $F$ are mutually exclusive, then:

$$P(E \cup F) = P(E) + P(F)$$

If our statistical experiment is the toss of a fair coin and:

- $E$ is the event that a heads was tossed
- $F$ is the event that a tails was tossed

Then $P(E) = P(F) = \frac{1}{2}$, but $P(E \cap F) = 0$ since a coin cannot show heads and tails simultaneously (unless it is some kind of coin that exists in quantum superposition).

Therefore $P(E \cup F) = P(E) + P(F) = \frac{1}{2} + \frac{1}{2} = 1$.

The **expectation** of a random variable is the probability-weighted average of all possible values.

The expectation of a discrete random variable $X$ is:

$$E[X] = \sum_{x} x \, P(X = x)$$

Where the notation