Introduction

Working with probability theory requires understanding its two worlds: the discrete and the continuous. While the discrete case is fairly straightforward, the continuous one can be trickier, yet it is crucial for fully utilizing the tools of probability theory. Here I share my notes from going through the Introduction to Probability book by D. Bertsekas and J. Tsitsiklis, and its accompanying lectures.

The main purpose of this post is to visually illustrate how conditional densities are formed. Having a visual model of these concepts can aid our intuition about composite concepts such as Bayes’ rule and conditional expectation viewed as a random variable or an estimator.

Sample Space, Random Variables, and Events

Consider an experiment where we choose a number at random from the real line. The set of all possible outcomes of an experiment is called the sample space \({ \Omega }\), and for this particular experiment the sample space is the entire set of real numbers.

We define a random variable as a function on the sample space, i.e., it maps each outcome of the experiment to a real number. In the previous example, the function \({X}\) that doubles each outcome is a random variable. So if the outcome of the experiment is 2.1, the random variable \({X}\) takes the value 2.1 × 2 = 4.2.
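
As a small sketch of this mapping (the experiment’s density is left unspecified above, so a standard normal over the sample space is assumed here purely for concreteness):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the experiment: draw outcomes from the sample space (the real
# line). A standard normal density is assumed purely for concreteness.
outcomes = rng.standard_normal(5)

# The random variable X is a function on the sample space: it doubles each outcome.
def X(omega):
    return 2 * omega

print(outcomes)     # points in the sample space
print(X(outcomes))  # the values X maps them to, e.g. 2.1 -> 4.2
```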


The density of the random variable is derived from the density over the sample space: the density at \({X=x}\) is computed by accumulating the probabilities of the outcomes in the sample space that make \({X}\) take the value \({x}\). In our example, each value of \({X}\) is caused by exactly one outcome in the sample space; for instance, the only outcome that makes \({X}\) take the value 4.2 is 2.1. Therefore, the density at \({X=4.2}\) is determined by the density at 2.1 in the sample space, scaled by 1/2, since doubling stretches every small interval of outcomes to twice its length while preserving its probability.
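
This scaling can be checked numerically. The sketch below (again assuming a standard normal density over the sample space) compares an empirical estimate of the density of \({X}\) at 4.2 with the sample-space density at 2.1 scaled by 1/2:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Outcomes from the sample space (standard normal assumed), mapped through X.
omega = rng.standard_normal(1_000_000)
x_samples = 2 * omega

# Empirical density of X near x = 4.2, estimated from a narrow bin.
x, half_width = 4.2, 0.05
empirical = np.mean(np.abs(x_samples - x) < half_width) / (2 * half_width)

# The sample-space density at x/2 = 2.1, scaled by 1/2 because doubling
# stretches every small interval to twice its length.
theoretical = norm.pdf(x / 2) / 2

print(empirical, theoretical)  # the two estimates should roughly agree
```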

An event is defined as a set of one or more outcomes from the sample space. For example, in the previous experiment, let \({A}\) be the event that the outcome falls between \({a}\) and \({b}\), where \({a}\) and \({b}\) are real numbers.

Conditioning a Random Variable on an Event

If we know that \({A}\) has occurred, where \({A}\) is the event of the outcome falling within \({[a, b]}\), then it is only reasonable to rethink our probabilities, since all outcomes outside the range \({[a, b]}\) are now impossible, with zero probability and hence zero density.

We defined the sample space earlier as the set of all possible outcomes. When conditioning on \({A}\), a subset of these outcomes is no longer possible; therefore, conditioning on \({A}\) defines a new set of possible outcomes, i.e., a new sample space.

Since random variables are just functions on the sample space, restricting or changing that space potentially restricts or changes the output of these functions; i.e., restricting the sample space restricts the values that the random variable can take. As a consequence, when \({A}\) happens, the random variable \({X}\) as defined above can only take values within \({[2a, 2b]}\), and all other values are now impossible, with zero probability and hence zero density.
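
A quick simulation makes this restriction visible; the sketch below assumes a standard normal density over the sample space and the hypothetical endpoints \({a = 0.5}\) and \({b = 1.0}\):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.5, 1.0  # hypothetical endpoints of A, chosen for the demo

# Simulate many runs of the experiment (standard normal assumed) and keep
# only the runs in which A occurred, i.e. the outcome fell within [a, b].
omega = rng.standard_normal(100_000)
omega_given_a = omega[(omega >= a) & (omega <= b)]

# X doubles each outcome, so conditioned on A it can only land in [2a, 2b].
x_given_a = 2 * omega_given_a
print(x_given_a.min(), x_given_a.max())  # both within [1.0, 2.0]
```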


According to the second Kolmogorov probability axiom, the probability of the entire sample space is 1, i.e., its density integrates to 1. Suppose that after conditioning on \({A}\) the density over the new sample space integrates to only 0.65; what about the remaining 0.35 of our beliefs? Here arises the need for a systematic way to redistribute our 100% of belief over the new sample space, and that is exactly what conditional probability provides. The conditional density law can thus be seen as an application of the second probability axiom (normalization).

To further illustrate the idea of normalization, consider the following discrete example. If we have three numbers, say 6, 4, and 10, and we want them to add up to one while preserving their ratios, we simply divide each by the total sum, so the new values become 6/20 = 0.3, 4/20 = 0.2, and 10/20 = 0.5. The numbers now add up to 1 while the ratios between them are preserved (6/4 = 0.3/0.2 = 1.5, and the same holds for the other pairs).
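
In code, the same normalization is a single division by the total:

```python
# Normalize three numbers so they sum to 1 while preserving their ratios.
values = [6, 4, 10]
total = sum(values)  # 20
normalized = [v / total for v in values]

print(normalized)                            # [0.3, 0.2, 0.5]
print(6 / 4, normalized[0] / normalized[1])  # 1.5 and 1.5: ratios preserved
```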

In the continuous world, instead of dividing by the sum we divide by the integral (area under the curve) as shown in the conditional density formula:

$$ f_{X|A}(x) = \frac{f_X(x)}{P(A)} = \frac{f_X(x)}{\int_A f_X(t)\, dt} $$

for values of \({x}\) consistent with \({A}\) (here, \({x \in [2a, 2b]}\)); the conditional density is zero elsewhere.
[Figure: for values of x inside [2a, 2b] we already have their densities; however, they do not integrate to 1, so we need to rescale them.]
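
Here is a minimal numerical sketch of the formula for the running example. It assumes, as before, a standard normal density over the sample space (so \({f_X}\) is a normal density with standard deviation 2) and the hypothetical endpoints \({a = 0.5}\) and \({b = 1.0}\):

```python
import numpy as np
from scipy.stats import norm

# f_X: the density of X = 2 * omega when omega is standard normal, i.e. a
# normal density with standard deviation 2 (assumed for the demo).
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
f_x = norm.pdf(x, scale=2)

# A confines X to [2a, 2b] = [1, 2] with the assumed a = 0.5, b = 1.0.
lo, hi = 1.0, 2.0
inside = (x >= lo) & (x <= hi)

# P(A): the area under f_X over [2a, 2b], computed as a Riemann sum.
p_a = f_x[inside].sum() * dx

# Conditional density: zero outside [2a, 2b], rescaled by 1/P(A) inside.
f_cond = np.where(inside, f_x / p_a, 0.0)

print(f_cond.sum() * dx)  # ~1.0: the conditional density integrates to 1
```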

Conditioning Two Random Variables on an Event

Let \({X, Y}\) be two random variables with joint density \({f_{X, Y}}\). Let \({C}\) be an event that makes these two random variables take values only within \({[a, b]}\) and \({[c, d]}\), respectively. If we were told that \({C}\) happened, then the conditional joint density of \({X, Y}\) is equal to zero for values outside \({C}\) and is given by the following formula for values within \({C}\):

$$ {f_{X, Y| C}(x, y) = \frac{f_{X, Y}(x, y)}{P(C)} = \frac{f_{X, Y}(x, y)}{\iint_C f_{X, Y}(u, v)\, du\, dv}} $$

This computation can be illustrated numerically.

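
The sketch below assumes a standard bivariate normal joint density and the rectangle \({[a, b] \times [c, d] = [0, 1] \times [0, 1]}\); both are arbitrary choices made for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical setup: f_{X,Y} is a standard bivariate normal and C confines
# (X, Y) to the rectangle [a, b] x [c, d] = [0, 1] x [0, 1].
x = np.linspace(-4, 4, 401)
y = np.linspace(-4, 4, 401)
X, Y = np.meshgrid(x, y)
dx, dy = x[1] - x[0], y[1] - y[0]
f_xy = multivariate_normal(mean=[0, 0]).pdf(np.dstack([X, Y]))

# P(C): the volume under f_{X,Y} over the rectangle, as a Riemann sum.
inside = (X >= 0) & (X <= 1) & (Y >= 0) & (Y <= 1)
p_c = f_xy[inside].sum() * dx * dy

# Conditional joint density: zero outside C, rescaled by 1/P(C) inside.
f_cond = np.where(inside, f_xy / p_c, 0.0)

print(f_cond.sum() * dx * dy)  # ~1.0: it integrates to 1 over the rectangle
```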

Conditioning a Random Variable on Another

Another important case is when two random variables are related, in which case restricting one random variable’s range of values might affect the range of the other. Let \({X, Y}\) be two random variables whose joint density is given by \({f_{X, Y}}\). If we were told that \({X}\) took the value \({x}\), then that should affect our beliefs about both \({X}\) and \({Y}\). For \({X}\), all values other than \({x}\) now have zero probability/density, since we know that \({X = x}\). For \({Y}\), we update our beliefs about its different values using the conditional density formula:

$$ {f_{Y|X} (y \mid x) = \frac{f_{X, Y}(x, y)}{f_X(x)} = \frac{f_{X, Y}(x, y)}{\int_{-\infty}^{\infty} f_{X, Y} (x, t)\, dt}} $$

The denominator is the marginal density \({f_X(x)}\), obtained by integrating the joint density over all values of \({y}\); it plays the same normalizing role as \({P(A)}\) and \({P(C)}\) above.

This computation can be illustrated as follows:

[Figure: a reproduction of Figure 1 from Chapter 6 of Probability by Jim Pitman, illustrating how the slice of the joint density at \({X = x}\) is rescaled into the conditional density of \({Y}\).]
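
A numerical version of the same slice-and-rescale operation, assuming a bivariate normal joint density with correlation 0.5 (an arbitrary choice) and conditioning on \({X = 1}\):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical setup: f_{X,Y} is bivariate normal with correlation 0.5, and
# we condition on X = 1; both choices are assumptions for the demo.
joint = multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])
x0 = 1.0

y = np.linspace(-5, 5, 2001)
dy = y[1] - y[0]

# Slice the joint density along the line X = x0 ...
slice_at_x0 = joint.pdf(np.column_stack([np.full_like(y, x0), y]))

# ... and rescale by its integral over y, which is the marginal f_X(x0).
f_y_given_x = slice_at_x0 / (slice_at_x0.sum() * dy)

print(f_y_given_x.sum() * dy)     # ~1.0: a proper density over y
print(y[np.argmax(f_y_given_x)])  # ~0.5: Y given X=1 peaks at rho * x0
```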