# Discrete probability

## Definitions

For some statistical experiment being performed:

• The set of all possible outcomes is called the sample space, denoted $\Omega$.
• A subset $E\subseteq\Omega$ is an event.

## Elementary principles

### Events

Let $E$ and $F$ be events from some sample space $\Omega$.

• $E\cup F$ is the event that either (or both) $E$ or $F$ happens
• $EF$ (or $E\cap F$) is the event that both $E$ and $F$ happen
• $E^c$ (or $\bar{E}$) is the event that $E$ does not happen ($\Omega-E$)

### De Morgan's Law

For events $\{E_i\}_{i=1}^n$:

### Axioms

For each event $E\subseteq\Omega$, we assign a probability $\p{E}$ satisfying:

• $0\leq\p{E}\leq1$

• $\p{\Omega}=1$

• For any sequence $\set{E_i}_{i=1}^\infty$ of mutually exclusive events:

Where mutual exclusivity means ($E_j\cap E_k=\emptyset\quad\forall j\neq k$)

• $\p{E^c}=1-\p{E}$

### Inclusion-exclusion principle

For a finite sequence of arbitrary events $\set{E_i}_{i=1}^n$ where $E_i\subseteq\Omega\quad\forall i$:

#### Example

• For $n=2$
• For $n=3$

## Random variables

A random variable is a function that maps each outcome of the sample space to some numerical value.

Given a sample space $\Omega$, a random variable $X$ with values in some set $\cal{R}$ is a function:

Where $\cal{R}$ is typically $\N_0$ or $\N$ in discrete probability and $\R$ in continuous probability.

### Discrete random variables

• The random variable $X$ is a discrete random variable when its range is finite (or countably infinite).

### Continuous random variables

• The random variable $X$ is a continuous random variable when its range is uncountably infinite.

### Notation

Random variables often make it easier to ask questions such as:

How likely is it that the value of $X$ is equal to $2$?

This is the same as the probability of the event $\setb{x\in\Omega}{X(x)=2}$, which is often denoted as $\p{X=2}$ and read "the probability of the random variable $X$ taking on the value $2$".

### Example

Let our statistical experiment be the toss of a fair coin. We will perform this experiment $3$ times, giving us:

Let $X$ be the random variable denoting the number of heads after $3$ coin flips.

• $\p{X=0}=\p{\{(T,T,T)\}}=\frac{1}{8}$
• $\p{X=1}=\p{\{(T,T,H),(T,H,T),(H,T,T)\}}=\frac{3}{8}$
• $\p{X=2}=\p{\{(T,H,H),(H,T,H),(H,H,T)\}}=\frac{3}{8}$
• $\p{X=3}=\p{\{(H,H,H)\}}=\frac{1}{8}$

Note that for the collection $\set{\p{X=k}}_{k=0}^3$, we have:

As we will see later, this represents a probablity distribution, and these are properties that all probability distributions must have.

## Stirling's approximation

Stirling's approximation is an approximation for the factorial operation. It is an accurate estimation, even for smaller values of $n$.

The approximation is:

Where the $\sim$ sign means that the two quantities are asymptotic. This means that their ratio tends to $1$ as $n$ tends to $\infty$.

Alternatively, there is a version of Stirling's formula with bounds valid for all positive integers $n$, rather than asymptotics:

## Distributions

A probability distribution is a mathematical function that maps each outcome of a statistical experiment to its probability of occurrence.

### Probability mass function

A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value. It defines a discrete probability distribution.

Suppose that $X : \Omega\mapsto\mathcal{R}$ is a discrete random variable. Then the probability mass function $f_X:\mathcal{R}\mapsto[0,1]$ for $X$ is defined as:

#### Example This is the probability mass function of a discrete probability distribution.

In this case, we have a random variable $X:\Omega\mapsto\bb{N}$ and a probability mass function $f_X:\bb{N}\mapsto[0,1]$.

Consider the following probabilities as examples:

• $\p{X=1}=\p{\setb{s\in\Omega}{X(s)=1}}=\p{\set{1}}=0.2$
• $\p{X=3}=\p{\setb{s\in\Omega}{X(s)=3}}=\p{\set{3}}=0.5$
• $\p{X=7}=\p{\setb{s\in\Omega}{X(s)=7}}=\p{\set{7}}=0.3$
• \begin{align}\p{X\geq1}&=\p{\bigcup_{s\in\Omega}\set{s}}=\sum_{s\in\Omega}\p{\set{s}}\\&=\p{\set{1,3,7}}=\p{\set{1}}+\p{\set{3}}+\p{\set{7}}\\&=0.2+0.5+0.7\\&=1\end{align}

### Conditions

For any probability distribution (with some random variable $X : \Omega\mapsto\cal{R}$), its probability mass function must satisfy both of the following conditions:

• $0\leq\p{X=k}\leq1\qquad\forall k\in\mathcal{R}$
• $\sum_{k\in\mathcal{R}}^{}\p{X=k}=1$

### Cumulative distribution function

The cumulative distribution function of a random variable $X$ evaluated at $x$ is the probability that $X$ will take a value less than or equal to $x$.

If $X$ is a discrete random variable that maps to values $\set{x_i}_{i=1}^n$, then the cumulative distribution function $F_X$ is defined as:

### Complementary cumulative distribution function

Sometimes, it is useful to study the opposite question — how often the random variable is above a particular value. This is called the complementary cumulative distribution function or simply the tail distribution, and is denoted $\overline{F_X}(x)$, and is defined as:

### Uniform distribution

A random variable is uniformly distributed if every possible outcome is equally likely to be observed. In other words, for some statistical experiment, suppose there are $n$ different outcomes. Then the probability of each outcome is $\frac{1}{n}$.

Therefore, the probability mass function for a uniformly distributed discrete random variable $X:\Omega\mapsto\N_0$ for $n$ possible outcomes would be:

### Binomial distribution

The binomial distribution with parameters $n$ and $p$ is the discrete probability distribution of the number of successes ($k$) in a sequence of $n$ Bernoulli trials.

The probability mass function for a binomially distributed discrete random variable $X:\Omega\mapsto\N_0$ for $n$ Bernoulli trials (each with probability of success $p$) would be:

### Poisson distribution

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occuring in a fixed interval of time or space if these events occur with a known constant rate and independently of time since the last event.

The probability mass function for a Poisson distributed discrete random variable $X:\Omega\mapsto\mathcal{R}$ with some constant rate $\lambda$ would be:

### Negative binomial distribution

The negative binomial distribution is a discrete probability distribution of the number of trials in a sequence of independent and identically distributed Bernoulli trials before a specified number of successes occurs.

The probability mass function for a negative binomially distributed discrete random variable $X:\Omega\mapsto\N_0$ with $n\in\N_0 : n\geq k$ trials given $k$ successes, would be:

#### Geometric distribution

The geometric distribution is a special case of the negative binomial distribution, with the parameter $r=1$.

The geometric distribution gives the probability that the first occurence of success requires $k$ independent Bernoulli trials, each with success probability $p$.

The probability mass function for a geometrically distributed discrete random variable $X:\Omega\mapsto\N_0$ with the first success being the $k^\text{th}$ trial, would be:

##### When to use?
• The phenomenon being modelled is a sequence of independent trials
• There are only two possible outcomes for each trial (success/failure)
• The probability of success, $p$, is the same for every trial

### Hypergeometric distribution

The hypergeometric distribution is a discrete probability distribution that describes the probability of $k$ successes (random draws for which the object drawn has a specified feature) in $n$ draws, without replacement, from a finite population of size $N$ that contains exactly $K$ objects with that feature, where each draw is either a success or failure.

The probability mass function for a hypergeometrically distributed discrete random variable $X:\Omega\mapsto\N_0$ with $k$ successes, would be:

### Joint probability

Previously, we introduced $\p{A\cap B}$ as the probability of the intersection of the events $A$ and $B$.

If instead, we let these events be described by the random variables:

• $A$ = $X \text{ at value } x$
• $B=Y \text{ at value y}$

Then we can write:

Typically we write $\p{X=x,Y=y}$, and this is referred to as the joint probability of $X=x$ and $Y=y$.

#### Joint probability distribution

If $X$ and $Y$ are discrete random variables, the function given by $f(x,y)=\p{X=x,Y=y}$ for each pair of values $(x,y)\in\img{X}\times\img{Y}$, is called the joint probability distribution of $X$ and $Y$.

#### Joint cumulative distribution function

If $X$ and $Y$ are discrete random variables, the definition of the joint cumulative distribution function of $X$ and $Y$ is given by:

where $f(s,t)$ is the joint probability distribution of $X$ and $Y$ at $(s,t)$.

#### Independence of random variables

Consider two discrete random variables $X$ and $Y$. We say that $X$ and $Y$ are independent if:

The definition of independence can be extended to $n$ random variables:

Consider $n$ discrete random variables $\set{X_i}_{i=1}^n$. We say that $\set{X_i}_{i=1}^n$ are mutually independent if:

## Conditional probability

Conditional probability is a measure of the probability of an event, given that some other event has occurred.

If the event of interest is $A$ and the event $B$ is known to have occurred, the conditional probability of $A$ given $B$ is written as:

### Conditioning of an event

Given two events $A$ and $B$, the conditional probability of $A$ given $B$ is defined as:

This may be visualised as restricting the sample space to $B$.

#### Axiomatic definition

Sometimes the definition of conditional probability is treated as an axiom of probability:

This is simply a rearrangement of the equation previously shown.

### Independent events

Events $A$ and $B$ are said to be statistically independent if their joint probability equals the product of the probability of each event:

#### Consequences

• By substituting this into the definition of conditional probability, we get:

Intuitively this makes sense, as if $A$ and $B$ are independent, then the fact that event $B$ has already occured should not influence the probability of event $A$ occuring.

#### General case

##### Pairwise independence

A finite set of events $\set{E_i}_{i=1}^n$ is pairwise independent if every pair of events is independent — that is, iff:

##### Mutual independence

A finite set of events is mutually independent if every event is independent of any intersection of the other events — that is, iff for every $k$-element subset of $\set{E_i}_{i=1}^n$:

#### Law of total probability

The law of total probability is the proposition that if $\{B_i\}_{i=1}^n$ is a finite partition of a sample space (in other words, a set of pairwise disjoint events whose union is the entire sample space), then for any event $A$ of the same probability space:

#### Bayes' theorem

Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

##### Derivation

Bayes' theorem shows that:

In other words, there exists some constant $c\in\R$ such that:

If we add these two formulas, we deduce that:

Therefore, the constant $c$ can be expressed as:

##### Definition

Bayes' theorem is then mathematically defined as:

Or alternatively:

#### Chain rule

The chain rule (or multiplication rule) permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities.

Consider an indexed collection of events $\{E_i\}_{i=1}^n$, then we can apply the definition of conditional probability to calculate the joint probability:

Repeating this process with each final term creates the product:

##### Example

With four variables, the chain rule produces this product of conditional probabilities:

#### Mutual independence

Two events are mutually independent (or disjoint) if they cannot both occur. In other words, events $A$ and $B$ are mutually independent iff $\p{A\cap B}=0$.

This has a consequence to the inclusion-exclusion principle. If $A$ and $B$ are mutually independent, then:

##### Example

If our statistical experiment is the toss of a fair coin and:

• $A$ is the event that a heads was tossed
• $B$ is the event that a tails was tossed

Then $\p{A}=\p{B}=\frac{1}{2}$, but $A\cap B=\emptyset$ since a coin cannot show heads and tails simultaneously (unless it is some kind of coin that exists in quantum superposition).

Therefore $\p{A\cap B}=0$.

### Other properties

• $\cp{A}{\bar{B}}=1-\cp{\bar{A}}{\bar{B}}$
• $\cp{\bar{A}}{B}=1-\cp{A}{B}$

## Expectation

The expectation of a random variable is the probability-weighted average of all possible values.

The expectation of a random variable $X\in\set{x_i}_{i=1}^k$ is:

Where the notation