Probability theory is the branch of mathematics concerned with probability. The notion of probability is used to measure the level of uncertainty. Probability theory aims to describe uncertain phenomena using a set of axioms. Long story short, when we cannot be exact about the outcome of a system, we represent the situation using the likelihood of different outcomes and scenarios.
The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.
James Clerk Maxwell
In this post, you will learn:
- What probability theory is
- Why it is important in Artificial Intelligence and Machine Learning
- The fundamental definitions in probability theory
- Some mathematical background
Probability theory in Machine Learning
Probability theory is of great importance in many different branches of science. Let’s focus on Artificial Intelligence empowered by Machine Learning. The question is, “How is knowing probability going to help us in Artificial Intelligence?” In AI applications, we aim to design an intelligent machine to perform a task. First, the model should get a sense of the environment via modeling.
As there is ambiguity regarding the possible outcomes, the model works based on estimation and approximation, which are done via probability. Second, as the machine tries to learn from the data (environment), it must reason about the process of learning and decision making. Such reasoning is not possible without considering all possible states and scenarios and their likelihood. Third, to measure and assess the machine’s capabilities, we must utilize probability theory as well.

Probability Axioms
Let’s roll a die and ask the following informal question: What is the chance of getting six as the outcome? It is equivalent to another more formal question: What is the probability of getting a six when rolling a die? Informal answer: Most probably the same as getting any other number. Formal response: 1/6, assuming a fair die. How do we interpret the calculation of 1/6? Well, it is clear that when you roll a die, you get a number in the set {1,2,3,4,5,6}, and you do NOT get any other number. We can call {1,2,3,4,5,6} the outcome space: nothing outside of it can happen. Since the six outcomes of a fair die are equally likely, each one has a 1 in 6 chance. To mathematically define those chances, some universal definitions and rules must be applied, so that we all agree on them.
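To make the 1/6 figure concrete, here is a minimal simulation sketch. It assumes a fair six-sided die, and the function name `estimate_probability` and the fixed seed are illustrative choices rather than anything from the text above. The observed fraction of sixes should land close to 1/6 as the number of rolls grows.

```python
import random


def estimate_probability(target: int, n_rolls: int, seed: int = 0) -> float:
    """Estimate P(roll == target) for a fair six-sided die by simulation."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls


estimate = estimate_probability(target=6, n_rolls=100_000)
print(f"estimated: {estimate:.4f}   exact: {1 / 6:.4f}")
```

The simulation only approximates the answer; the exact value 1/6 comes from the reasoning about equally likely outcomes.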

To this end, it is crucial to know what governs probability theory. We start with axioms. The definition of an axiom is as follows: “a statement or proposition which is regarded as being established, accepted, or self-evidently true.” Before stepping into the axioms, we need some preliminary definitions.
Sample and Event Space
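Probability theory is mainly associated with random experiments. For a random experiment, we cannot predict with certainty which outcome will occur. However, the set of all possible outcomes might be known. This set is called the sample space, denoted here by $\Omega$. For rolling a die, $\Omega = \{1,2,3,4,5,6\}$.
After defining the sample space, we should define an event: an event $E$ is any subset of the sample space ($E \subseteq \Omega$), and the collection of events under consideration is called the event space, denoted here by $\mathcal{F}$.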
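Now, let’s discuss some operations on events.
- Union: For any set of events $\{E_1, E_2, \ldots, E_n\}$, the union event $\bigcup_{i=1}^{n} E_i$ consists of all outcomes that belong to at least one of the events $E_i$. Ex: The event $E_1 \cup E_2$ occurs if $E_1$ or $E_2$ (or both) occurred.
- Intersection: For any set of events $\{E_1, E_2, \ldots, E_n\}$, the intersection event $\bigcap_{i=1}^{n} E_i$ consists of all outcomes that belong to every one of the events $E_i$. Ex: The event $E_1 \cap E_2$ occurs if $E_1$ and $E_2$ both occurred.
- Mutually Exclusive: Two events $A$ and $B$ are mutually exclusive if they cannot occur concurrently. In other words, $A \cap B = \emptyset$. Ex: In a single roll of a die, let (A) be getting an even number and (B) be getting an odd number. A and B are clearly mutually exclusive. Note that events from two separate experiments, such as tossing a coin and rolling a die, are not mutually exclusive, since both can occur together.
- Complement Set: For any event $E$, we denote $E^c$ as the complement of $E$. It stands for all outcomes in the sample space $\Omega$ that are not in $E$. Basically $E \cup E^c = \Omega$ and $E \cap E^c = \emptyset$.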
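The event operations above map directly onto set operations. The sketch below uses Python's built-in `set` type with a single die roll as the experiment; the event names (`even`, `odd`, `at_least_five`) are illustrative assumptions.

```python
# Sample space of one die roll and a few events (subsets of the sample space).
sample_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
odd = {1, 3, 5}
at_least_five = {5, 6}

union = even | at_least_five            # outcomes in at least one of the events
intersection = even & at_least_five     # outcomes in both events
complement_even = sample_space - even   # outcomes of the sample space not in `even`
mutually_exclusive = even.isdisjoint(odd)  # True when the intersection is empty

print(union, intersection, complement_even, mutually_exclusive)
```

Here `even` and `odd` are mutually exclusive, while `even` and `at_least_five` are not, because they share the outcome 6.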

Axioms
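Andrey Kolmogorov, in 1933, proposed the Kolmogorov Axioms, which form the foundations of Probability Theory. The Kolmogorov Axioms can be expressed as follows: Assume we have the probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space and $\mathcal{F}$ is the event space. Then, the probability measure $P$ is a real-valued function mapping $P: \mathcal{F} \rightarrow \mathbb{R}$ that satisfies all the following axioms:
- For any event $E \in \mathcal{F}$, $P(E) \geq 0$ (the probability of occurrence is non-negative).
- $P(\Omega) = 1$ (some outcome in the sample space always occurs).
- $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$ for any countable set of mutually exclusive events $E_1, E_2, \ldots$.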
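As a sanity check, the axioms can be verified by brute force on a small finite example. The sketch below assumes a fair die with probability 1/6 per outcome and takes every subset of the sample space as an event. The helper names `prob` and `all_events` are hypothetical, and the third axiom is only checked for pairs of mutually exclusive events, which is a weaker statement than the countable version.

```python
from fractions import Fraction
from itertools import combinations

sample_space = frozenset({1, 2, 3, 4, 5, 6})
# Assumed measure: a fair die, each outcome has probability 1/6.
pmf = {outcome: Fraction(1, 6) for outcome in sample_space}


def prob(event):
    """P(event): sum of the probabilities of the outcomes in the event."""
    return sum((pmf[o] for o in event), Fraction(0))


def all_events(space):
    """Every subset of the sample space (the event space of a finite experiment)."""
    items = sorted(space)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


events = all_events(sample_space)

non_negative = all(prob(e) >= 0 for e in events)       # axiom 1
normalized = prob(sample_space) == 1                    # axiom 2
additive = all(                                         # axiom 3, pairs only
    prob(a | b) == prob(a) + prob(b)
    for a in events
    for b in events
    if a.isdisjoint(b)
)
print(non_negative, normalized, additive)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.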
Consequences of the Axioms
Using the axioms, we can derive some fundamental properties as below:
- If event $A$ is a subset of event $B$ ($A \subseteq B$), then $P(A) \leq P(B)$.
- If $E$ is an event and $E^c$ is the complementary set (all outcomes in the sample space $\Omega$ that are not in $E$), then $P(E^c) = 1 - P(E)$.
- The probability of the empty set is zero ($P(\emptyset) = 0$) as the empty set is the complementary set of the sample space $\Omega$.
- For any event $E$, we have the probability bound of $0 \leq P(E) \leq 1$.
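For readers who want to see where these properties come from, here is a short derivation sketch from the three axioms. It writes $E^c$ for the complement of $E$ and $\Omega$ for the sample space, and it uses the third axiom only for two mutually exclusive events.

```latex
% Complement rule: E and E^c are mutually exclusive and their union is \Omega.
1 = P(\Omega) = P(E \cup E^c) = P(E) + P(E^c)
\quad\Longrightarrow\quad P(E^c) = 1 - P(E)

% Empty set: \emptyset is the complement of \Omega.
P(\emptyset) = 1 - P(\Omega) = 1 - 1 = 0

% Monotonicity: if A \subseteq B, then B splits into two mutually exclusive parts.
B = A \cup (B \cap A^c)
\quad\Longrightarrow\quad P(B) = P(A) + P(B \cap A^c) \geq P(A)

% Bound: every event satisfies E \subseteq \Omega, so by monotonicity and axiom 1
0 \leq P(E) \leq P(\Omega) = 1
```

Each step relies only on non-negativity, $P(\Omega) = 1$, and additivity over mutually exclusive events.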
Math Background
To tackle and solve a probability problem, we often need to count how many elements are in the event and in the sample space. Here, we discuss some important counting principles and techniques.
Counting all possible outcomes
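Let’s consider the special case of having two experiments, $X_1$ and $X_2$. The basic principle states that if one experiment ($X_1$) results in $N$ possible outcomes and if another experiment ($X_2$) leads to $M$ possible outcomes, then conducting the two experiments will have $N \times M$ possible outcomes in total. Assume experiment $X_1$ has $N$ possible outcomes $\{a_1, a_2, \ldots, a_N\}$ and $X_2$ has $M$ possible outcomes $\{b_1, b_2, \ldots, b_M\}$.
It is easy to prove such a principle for this special case. All you need is to count all possible outcomes of the two experiments, written as pairs:
$$\begin{array}{cccc} (a_1, b_1) & (a_1, b_2) & \cdots & (a_1, b_M) \\ (a_2, b_1) & (a_2, b_2) & \cdots & (a_2, b_M) \\ \vdots & \vdots & & \vdots \\ (a_N, b_1) & (a_N, b_2) & \cdots & (a_N, b_M) \end{array}$$
The array has $N$ rows and $M$ columns, so it contains $N \times M$ pairs.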
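The generalized principle of counting can be expressed as below: if $r$ experiments are performed and the $i$-th experiment has $n_i$ possible outcomes, then the $r$ experiments together have $n_1 \times n_2 \times \cdots \times n_r$ possible outcomes in total.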
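The counting principle can be checked by enumerating the joint outcomes. The sketch below uses `itertools.product` with a coin toss and a die roll as the two experiments; the third experiment (`spinner`) and the two helper functions are illustrative assumptions added to show the generalized case.

```python
from itertools import product
from math import prod

coin = ["H", "T"]            # first experiment: 2 possible outcomes
die = [1, 2, 3, 4, 5, 6]     # second experiment: 6 possible outcomes

# Every joint outcome is a pair (coin result, die result).
joint_outcomes = list(product(coin, die))
print(len(joint_outcomes), len(coin) * len(die))


def count_by_enumeration(*experiments):
    """Count joint outcomes by listing every combination."""
    return sum(1 for _ in product(*experiments))


def count_by_principle(*experiments):
    """Count joint outcomes by multiplying the sizes of the experiments."""
    return prod(len(e) for e in experiments)


spinner = ["red", "green", "blue"]  # hypothetical third experiment
print(count_by_enumeration(coin, die, spinner), count_by_principle(coin, die, spinner))
```

Enumeration becomes impractical quickly as the number of experiments grows, which is exactly why the multiplication rule is useful.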
Permutation
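What is a permutation? Suppose we have three persons called Michael, Bob, and Alice. Assume the three of them stand in a queue. How many possible arrangements do we have? Take a look at the arrangements as follows:
- Michael, Bob, Alice
- Michael, Alice, Bob
- Bob, Michael, Alice
- Bob, Alice, Michael
- Alice, Michael, Bob
- Alice, Bob, Michael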
As above, you will see six permutations. Right? But, we cannot always write out all possible arrangements! We need some math. The intuition behind this problem is that we have three places to fill in a queue when we have three persons. For the first place, we have three choices. For the second place, there are two remaining choices. Finally, there is only one choice left for the last place! So we can extend this conclusion to the experiment in which we have $n$ choices. Hence, we get the following number of permutations:
$$n \times (n-1) \times \cdots \times 2 \times 1 = n!$$
NOTE: The descending product from $n$ to $1$ shown above (the product of all positive integers less than or equal to $n$) is denoted as $n!$ and called $n$ factorial.
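The queue example can be reproduced with the standard library. This sketch lists every arrangement of the three people with `itertools.permutations` and compares the count against `math.factorial`; the variable names are illustrative.

```python
from itertools import permutations
from math import factorial

people = ["Michael", "Bob", "Alice"]

# Each permutation is one possible order of the queue.
queues = list(permutations(people))
for queue in queues:
    print(" -> ".join(queue))

# The number of arrangements matches 3! = 3 * 2 * 1.
print(len(queues), factorial(len(people)))
```

Listing works for three people, but for ten people there are already 3,628,800 arrangements, so the factorial formula is the practical tool.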
Combination
A combination is a selection of objects from a larger set of objects in which the order of selection does not matter. For example, assume we have a total number of $n$ objects. In how many ways can we select $r$ objects from those $n$ objects? Let’s get back to the above example. Assume we have three candidates named Michael, Bob, and Alice, and we only desire to select two candidates. How many different combinations of candidates exist? There are three: {Michael, Bob}, {Michael, Alice}, and {Bob, Alice}. Note that {Bob, Michael} is the same selection as {Michael, Bob}.
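Let’s get back to the general question: How many selections can we have if we desire to pick $r$ objects from $n$ objects? There are $n \times (n-1) \times \cdots \times (n-r+1)$ ordered ways to pick them, and each selection is counted $r!$ times (once for every ordering of the same $r$ objects), so the number of selections is:
$$\binom{n}{r} = \frac{n!}{(n-r)! \, r!}$$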
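The above definition can be generalized. For example, the multinomial coefficient $\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!}$ counts the ways to divide $n$ objects into $k$ groups of sizes $n_1, n_2, \ldots, n_k$, where $n_1 + n_2 + \cdots + n_k = n$.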
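The candidate-selection example can be checked in a few lines. The sketch below enumerates the selections with `itertools.combinations` and compares the count with the factorial formula and with `math.comb` (available in Python 3.8 and later); `n_choose_r` is a hypothetical helper written only for illustration.

```python
from itertools import combinations
from math import comb, factorial

candidates = ["Michael", "Bob", "Alice"]

# Order does not matter: ("Michael", "Bob") and ("Bob", "Michael") are one selection.
pairs = list(combinations(candidates, 2))
for pair in pairs:
    print(pair)


def n_choose_r(n: int, r: int) -> int:
    """Number of ways to select r objects from n, via n! / ((n - r)! * r!)."""
    return factorial(n) // (factorial(n - r) * factorial(r))


print(len(pairs), n_choose_r(3, 2), comb(3, 2))
```

Compared with the six permutations of the same three people, only three selections of two remain once ordering is ignored.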
Conclusion
In this article, you learned about probability theory, why it is important in Machine Learning, and what its fundamental concepts are. Probability theory is of great importance in Machine Learning since Machine Learning deals with uncertainty and predictions. The basics above should help you understand probability concepts and put them to use. Have any questions? Feel free to ask by commenting below.