Probability theory is the branch of mathematics concerned with probability, the measure of uncertainty. It aims to represent uncertain phenomena in terms of a set of axioms. In short, when we cannot be exact about the possible outcomes of a system, we describe the situation through the likelihood of the different outcomes and scenarios.

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.

James Clerk Maxwell

In this post, you will learn:

  • What probability theory is
  • Why it is important in Artificial Intelligence and Machine Learning
  • The fundamental definitions in probability theory
  • Some mathematical background

Probability theory in Machine Learning

Probability theory is of great importance in many branches of science. Let’s focus on Artificial Intelligence empowered by Machine Learning. The question is, “how does knowing probability help us in Artificial Intelligence?” In AI applications, we aim to design an intelligent machine that performs a task. First, the machine should get a sense of its environment via modeling.

As there is ambiguity about the possible outcomes, the model works by estimation and approximation, both of which are expressed through probability. Second, as the machine tries to learn from the data (its environment), it must reason about the process of learning and decision making. Such reasoning is not possible without considering all possible states and scenarios and their likelihoods. Third, to measure and assess the machine’s capabilities, we must use probability theory as well.


Probability Axioms

Let’s roll a die and ask the following informal question: What is the chance of getting a six? It is equivalent to another, more formal question: What is the probability of getting a six when rolling a die? Informal answer: most probably, the same as getting any other number. Formal answer: 1/6. How do we interpret the value 1/6? Well, it is clear that when you roll a die, you get a number in the set {1,2,3,4,5,6}, and you do NOT get any other number. We can call {1,2,3,4,5,6} the outcome space: nothing outside of it may happen. To define those chances mathematically, some universal definitions and rules must apply, so that we all agree on them.
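As a quick sanity check, we can estimate this probability empirically. The following is a minimal Python sketch (the variable names are illustrative, not from any particular library) that simulates many die rolls and counts how often a six appears:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

trials = 100_000
# simulate rolling a fair six-sided die `trials` times
sixes = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)

print(sixes / trials)  # close to 1/6 ≈ 0.1667
```

The estimate approaches 1/6 as the number of trials grows, which is exactly what the formal answer predicts.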


To this end, it is crucial to know what governs probability theory. We start with axioms. The definition of an axiom is as follows: “a statement or proposition which is regarded as being established, accepted, or self-evidently true.” Before stepping into the axioms, we need some preliminary definitions.

Sample and Event Space

Probability theory is mainly associated with random experiments. For a random experiment, we cannot predict with certainty which outcome will occur. However, the set of all possible outcomes might be known.

Sample Space

Definition: We call the set of all possible outcomes the sample space, and we denote it by \Omega.

After defining the sample space, we should define an event.


Definition: An event E is a set comprising some possible outcomes. Any event E is a subset of the sample space \Omega. The empty set \varnothing is called the impossible event, as it contains no outcomes.

Now, let’s discuss some operations on events.

  • Union: For any set of events \{E_{1},E_{2},\ldots,E_{n}\}, the union event \bigcup_{i=1}^{n}E_i consists of all outcomes that occur in at least one of the E_{i}. Ex: A \cup B occurs if A or B (or both) occurs.
  • Intersection: For any set of events \{E_{1},E_{2},\ldots,E_{n}\}, the intersection event \bigcap_{i=1}^{n}E_i consists of all outcomes that occur in every one of the E_{i}. Ex: A \cap B occurs if both A and B occur.
  • Mutually Exclusive: Two events A and B are mutually exclusive if they cannot occur concurrently. In other words, A \cap B = \varnothing. Ex: in a single toss of a fair coin, the events (A) getting heads and (B) getting tails are clearly mutually exclusive.
  • Complement Set: For any event E, we denote by E^c the complement of E: the set of all outcomes in the sample space \Omega that are not in E. Hence E \cap E^c = \varnothing and E \cup E^c = \Omega.
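Python’s built-in set type mirrors these operations directly. The sketch below (illustrative only, using one die roll as the sample space) demonstrates union, intersection, complement, and mutual exclusivity:

```python
omega = {1, 2, 3, 4, 5, 6}  # sample space of one die roll

A = {2, 4, 6}  # event: the roll is even
B = {1, 2, 3}  # event: the roll is at most 3

print(A | B)             # union: {1, 2, 3, 4, 6}
print(A & B)             # intersection: {2}
print(omega - A)         # complement of A: {1, 3, 5}

odd = {1, 3, 5}          # event: the roll is odd
print(A & odd == set())  # True: A and odd are mutually exclusive
```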


Andrey Kolmogorov, in 1933, proposed the Kolmogorov axioms that form the foundation of probability theory. The Kolmogorov axioms can be expressed as follows: assume we have the probability space (\Omega, \mathcal{A}, \mathbb{P}). Then the probability measure \mathbb{P} is a real-valued function \mathbb{P}: \mathcal{A} \rightarrow \mathbb{R} that satisfies all of the following axioms:

  1. For any event E \in \mathcal{A}, P(E) \geq 0 (the probability of occurrence is non-negative).
  2. P(\Omega) = 1.
  3. P(\bigcup_{i=1}^{n}E_i) = \sum_{i=1}^{n} P(E_i) for any set of mutually exclusive events \{E_{1},E_{2},\ldots,E_{n}\}.


Using the axioms, we can derive some fundamental properties:

  • If event A is a subset of event B (A \subseteq B), then P(A) \leq P(B).  
  • If A is an event and A^{c} is its complement (all outcomes in the sample space \Omega that are not in A), then P(A^{c}) = 1 - P(A).
  • The probability of the empty set is zero (P(\varnothing) = 0) as the empty set is the complementary set of the sample space \Omega.
  • For any event E, we have the probability bound of 0 \leq P(E) \leq 1.
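These properties can be checked mechanically on a finite sample space. Below is a small Python sketch (assuming a fair die, so each outcome has probability 1/6; the helper P is ours) that verifies each bullet:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    # uniform probability measure on a fair die: |E| / |Omega|
    return Fraction(len(event), len(omega))

A = {6}
B = {2, 4, 6}

assert A <= B and P(A) <= P(B)   # monotonicity: A is a subset of B
assert P(omega - A) == 1 - P(A)  # complement rule
assert P(set()) == 0             # the impossible event
assert 0 <= P(B) <= 1            # probability bound

print(P(B))  # 1/2
```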

Math Background

To tackle and solve probability problems, we often need to count how many elements are in the event and sample spaces. Here, we discuss some important counting principles and techniques.

Counting all possible outcomes

Let’s consider the special case of two experiments, \mathcal{P} and \mathcal{E}. The basic principle states that if one experiment (\mathcal{P}) results in N possible outcomes and another experiment (\mathcal{E}) leads to M possible outcomes, then conducting the two experiments has M \times N possible outcomes in total. Assume experiment \mathcal{E} has M possible outcomes \{\mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_M\} and \mathcal{P} has N possible outcomes \{\mathcal{P}_1,\mathcal{P}_2,\ldots,\mathcal{P}_N\}.

It is easy to prove the principle for this special case. All you need is to count all possible outcomes of the two experiments:

    \[\begin{Bmatrix}(\mathcal{E}_1,\mathcal{P}_1),(\mathcal{E}_1,\mathcal{P}_2),\ldots,(\mathcal{E}_1,\mathcal{P}_N) \\ (\mathcal{E}_2,\mathcal{P}_1),(\mathcal{E}_2,\mathcal{P}_2),\ldots,(\mathcal{E}_2,\mathcal{P}_N)\\ \vdots \\ (\mathcal{E}_M,\mathcal{P}_1),(\mathcal{E}_M,\mathcal{P}_2),\ldots,(\mathcal{E}_M,\mathcal{P}_N)\end{Bmatrix}\]

The generalized principle of counting can be expressed as below:

Generalized Basic Principle of Counting

Assume we have q different experiments with the corresponding number of possible outcomes as \{N_1,N_2,\ldots,N_q\}. Then we can conclude that there is a total of outcomes N_1 \times N_2 \times \ldots \times N_q for conducting all q experiments.
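The principle is easy to verify in Python with itertools.product, which enumerates the combined outcomes directly. In this illustrative sketch, the three experiments are a coin toss, a die roll, and a suit draw:

```python
from itertools import product
from math import prod

coin = ["H", "T"]                                 # N1 = 2 outcomes
die = [1, 2, 3, 4, 5, 6]                          # N2 = 6 outcomes
suit = ["spades", "hearts", "diamonds", "clubs"]  # N3 = 4 outcomes

# every (coin, die, suit) triple is one combined outcome
outcomes = list(product(coin, die, suit))

print(len(outcomes))                     # 48
print(len(outcomes) == prod([2, 6, 4]))  # True: N1 * N2 * N3
```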


What is a permutation? Suppose we have three people called Michael, Bob, and Alice, and the three of them stand in a queue. How many possible arrangements do we have? Take a look at the arrangements:

    \[\left\{\begin{matrix}Michael, Bob, Alice\\ Michael, Alice, Bob\\ Alice, Bob, Michael\\ Alice, Michael, Bob\\ Bob, Michael, Alice\\ Bob, Alice, Michael\end{matrix}\right.\]

As above, you can see six permutations. Right? But we cannot always write out all possible arrangements! We need some math. The intuition behind this problem is that we have three places to fill in the queue and three people. For the first place, we have three choices. For the second place, there are two remaining choices. Finally, there is only one choice left for the last place! We can extend this reasoning to an experiment with n choices. Hence, we get the following number of permutations:

    \[n \times (n-1) \times (n-2) \times \ldots \times 2 \times 1 = n!\]

NOTE: The descending product from n down to 1 above (the product of all positive integers less than or equal to n) is denoted n! and called n factorial.
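We can confirm the count in Python: itertools.permutations enumerates the arrangements, and math.factorial computes n! directly (the names below are from the queue example above):

```python
from itertools import permutations
from math import factorial

people = ["Michael", "Bob", "Alice"]

arrangements = list(permutations(people))
print(len(arrangements))  # 6 arrangements, as listed above
print(factorial(3))       # 6 = 3 * 2 * 1
```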


A combination is a selection of objects from a larger set of objects. For example, assume we have n objects in total. In how many ways can we select r of those n objects? Let’s get back to the example above. Assume we have three candidates named Michael, Bob, and Alice, and we only want to select two of them. How many different combinations of candidates exist?

    \[\left\{\begin{matrix}Michael, Bob\\ Michael, Alice\\ Alice, Bob\end{matrix}\right.\]

Let’s get back to the general question: how many selections can we make if we pick r objects from n objects?


The number of unordered selections of r objects from n objects is denoted and calculated as:

    \[\binom{n}{r}=\frac{n!}{r!(n-r)!}\]

NOTE: In combinations, the selection is unordered. It means the combination \{A,B,C\} is the same as \{C,B,A\}, i.e., the order does NOT matter.
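In Python, itertools.combinations yields the unordered selections and math.comb evaluates the binomial coefficient; here is a short sketch using the same three candidates:

```python
from itertools import combinations
from math import comb

people = ["Michael", "Bob", "Alice"]

pairs = list(combinations(people, 2))
print(pairs)       # the 3 unordered pairs listed above
print(comb(3, 2))  # 3 = 3! / (2! * 1!)
```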

The above definition can be generalized.


Assume we have n objects and r distinct groups, with n_i objects in group i and n_1 + n_2 + \ldots + n_r = n. The number of possible divisions of the n objects into these r distinct groups can be calculated as below:

    \[\binom{n}{n_1, n_2, \ldots, n_r}=\frac{n!}{n_1!\,n_2! \ldots n_r!}\]
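This multinomial coefficient has no single standard-library call in Python, but it is a short function over factorials. A minimal sketch (the helper name multinomial is ours):

```python
from math import factorial

def multinomial(n, group_sizes):
    # n! / (n1! * n2! * ... * nr!), assuming the group sizes sum to n
    assert sum(group_sizes) == n
    result = factorial(n)
    for k in group_sizes:
        result //= factorial(k)
    return result

# divide 10 objects into distinct groups of sizes 5, 3, and 2
print(multinomial(10, [5, 3, 2]))  # 2520
```

Note that with a single group of size r and one of size n - r, this reduces to the binomial coefficient above.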


In this article, you learned what probability theory is, why it is important in Machine Learning, and what its fundamental concepts are. Probability theory is of great importance in Machine Learning because Machine Learning deals with uncertainty and predictions. The basics above will help you understand probability concepts and put them to use. Have any questions? Feel free to ask by commenting below.
