*Probability theory* is the branch of mathematics concerned with probability. The notion of probability is used to measure the level of uncertainty. Probability theory aims to represent uncertain phenomena in terms of a set of axioms. In short, when we cannot be exact about the possible outcomes of a system, we represent the situation using the likelihood of different outcomes and scenarios.

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.

James Clerk Maxwell

In this post, you will learn:

- What is probability theory?
- Why is it important in Artificial Intelligence and Machine Learning?
- The fundamental definitions in probability theory
- Some mathematical background

## Probability theory in Machine Learning

Probability theory is of great importance in many different branches of science. Let’s focus on Artificial Intelligence empowered by Machine Learning. The question is, **“How is knowing probability going to help us in Artificial Intelligence?”** In AI applications, we aim to design an intelligent machine to perform a task.

**First**, the model should *get a sense of the environment* **via modeling.** As there is ambiguity regarding the possible outcomes, the model works based on estimation and approximation, which are done via probability. **Second**, as the machine tries to learn from the data (environment), it must reason about the process of learning and decision making. Such reasoning is not possible without considering all possible states, scenarios, and their likelihood. **Third**, to measure and assess the machine’s capabilities, we must utilize probability theory as well.

## Probability Axioms

Let’s roll a die and ask the following **informal** question: *What is the chance of getting six as the outcome?* It is equivalent to another more **formal** question: *What is the probability of getting a six when rolling a die?* **Informal answer**: Most probably the same as getting any other number. **Formal response**: 1/6, assuming a fair die. **How do we interpret the calculation of 1/6?** Well, it is clear that when you roll a die, you get a number in the set {1,2,3,4,5,6}, and you do NOT get any other number. We can call {1,2,3,4,5,6} the *outcome space*, meaning **that nothing outside of it may happen. To mathematically define those chances, some universal definitions and rules must be applied, so we all agree on them.**

To this aim, it is crucial to know what governs probability theory. We start with **axioms**. The definition of an axiom is as follows: *“a statement or proposition which is regarded as being established, accepted, or self-evidently true.”* Before stepping into the axioms, we need some preliminary definitions.

### Sample and Event Space

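Probability theory is mainly associated with random experiments. For a random experiment, we cannot predict with certainty which outcome will occur. However, the set of all possible outcomes may be known. This set is called the **sample space**, written here as $\Omega$.

After defining the sample space, we should define an **event**. An event is a subset of the sample space, that is, a set of outcomes. The collection of events under consideration is called the **event space**.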

Now, let’s discuss some operations on events.

- **Union:** For any set of events $\{E_1, E_2, \dots, E_n\}$, the union event $\bigcup_{i=1}^{n} E_i$ consists of all outcomes that occur in **any** of the $E_i$ events. **Ex**: $E_1 \cup E_2$ indicates that $E_1$ **or** $E_2$ occurred.
- **Intersection:** For any set of events $\{E_1, E_2, \dots, E_n\}$, the intersection event $\bigcap_{i=1}^{n} E_i$ consists of all outcomes that occur in **all** of the $E_i$ events. **Ex**: $E_1 \cap E_2$ indicates that $E_1$ **and** $E_2$ both occurred.
- **Mutually Exclusive:** Two events $E_1$ and $E_2$ are mutually exclusive if they cannot occur concurrently. In other words, $E_1 \cap E_2 = \emptyset$. **Ex**: when rolling a die once, **(A)** getting an even number and **(B)** getting an odd number. **A** and **B** are clearly mutually exclusive.
- **Complement Set:** For any event $E$, we denote $E^c$ as the complement of $E$; it stands for all outcomes in the sample space $\Omega$ that are not in $E$. Basically, $E \cup E^c = \Omega$ and $E \cap E^c = \emptyset$.
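As a minimal sketch, the operations above can be tried with Python's built-in `set` type. The sample space of a single die roll and the event names `even` and `at_least_four` are illustrative assumptions, not part of the definitions.

```python
# Sample space for one roll of a six-sided die (illustrative assumption).
sample_space = {1, 2, 3, 4, 5, 6}

# Two hypothetical events, each a subset of the sample space.
even = {2, 4, 6}
at_least_four = {4, 5, 6}

union = even | at_least_four          # outcomes in either event
intersection = even & at_least_four   # outcomes in both events
complement = sample_space - even      # outcomes not in `even`

# An event and its complement never overlap and together cover the sample space.
mutually_exclusive = even.isdisjoint(complement)
covers_everything = (even | complement) == sample_space

print(union, intersection, complement)
print(mutually_exclusive, covers_everything)
```

Representing events as sets keeps the code close to the notation: `|` plays the role of union, `&` of intersection, and set difference from the sample space gives the complement.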

### Axioms

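Andrey Kolmogorov, in 1933, proposed the **Kolmogorov Axioms** that form the foundations of probability theory. The Kolmogorov Axioms can be expressed as follows: Assume we have the probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space and $\mathcal{F}$ is the event space. Then, the **probability measure** $P$ is a real-valued function $P: \mathcal{F} \to \mathbb{R}$ that satisfies all the following **axioms**:

- For any event $E \in \mathcal{F}$, $P(E) \geq 0$ (the probability of occurrence is non-negative).
- $P(\Omega) = 1$.
- $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$ for any set of mutually exclusive events $\{E_1, E_2, \dots\}$.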
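To make the axioms concrete, here is a small sketch that checks them by brute force on a fair six-sided die. The fair-die model (every outcome equally likely) and the helper name `prob` are assumptions chosen for illustration, and only the finite, pairwise form of the third axiom is checked.

```python
from fractions import Fraction
from itertools import combinations

sample_space = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability under the fair-die assumption: |event| / |sample space|."""
    return Fraction(len(event), len(sample_space))

# Build every event, i.e. every subset of the sample space.
outcomes = sorted(sample_space)
events = [frozenset(c) for r in range(len(outcomes) + 1)
          for c in combinations(outcomes, r)]

# Axiom 1: probabilities are non-negative.
non_negative = all(prob(e) >= 0 for e in events)

# Axiom 2: the whole sample space has probability one.
normalized = prob(sample_space) == 1

# Axiom 3 (finite case): additivity for mutually exclusive events.
additive = all(prob(a | b) == prob(a) + prob(b)
               for a in events for b in events if a.isdisjoint(b))

print(non_negative, normalized, additive)
print(prob({6}))
```

Using `Fraction` instead of floating point keeps the equality checks exact.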

### Outcomes

Using the axioms, we can derive some fundamental properties:

- If event $A$ is a subset of event $B$ ($A \subseteq B$), then $P(A) \leq P(B)$.
- If $E$ is an event and $E^c$ is its complementary set (all outcomes of the sample space $\Omega$ that are not in $E$), then $P(E^c) = 1 - P(E)$.
- The probability of the empty set is zero ($P(\emptyset) = 0$), as the empty set is the complementary set of the sample space $\Omega$.
- For any event $E$, we have the probability bound $0 \leq P(E) \leq 1$.
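As a short worked derivation, the complement rule follows from the second and third axioms. The notation here is an assumption for illustration: $E^c$ is the complement of $E$ and $\Omega$ is the sample space.

$$
\begin{aligned}
E \cup E^c &= \Omega, \qquad E \cap E^c = \emptyset \\
P(E) + P(E^c) &= P(E \cup E^c) = P(\Omega) = 1 \\
P(E^c) &= 1 - P(E)
\end{aligned}
$$

Setting $E = \Omega$ gives $P(\emptyset) = 1 - P(\Omega) = 0$. Since $P(E^c) \geq 0$ by the first axiom, the same identity also gives $P(E) \leq 1$.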

## Math Background

To tackle and solve probability problems, there is always a need to *count how many elements are in the event and the sample space*. Here, we discuss some important counting principles and techniques.

### Counting all possible outcomes

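Let’s consider the special case of having two experiments, $E_1$ and $E_2$. The basic principle states that if one experiment ($E_1$) results in $N$ possible outcomes and another experiment ($E_2$) leads to $M$ possible outcomes, then conducting the two experiments together has $N \times M$ possible outcomes in total. Assume experiment $E_1$ has $N$ possible outcomes $\{a_1, a_2, \dots, a_N\}$ and $E_2$ has $M$ possible outcomes $\{b_1, b_2, \dots, b_M\}$.

It is easy to **prove** such a principle for this special case. All you need is to **count all possible outcomes** of the two experiments:

$$
\begin{matrix}
(a_1, b_1), & (a_1, b_2), & \dots, & (a_1, b_M) \\
(a_2, b_1), & (a_2, b_2), & \dots, & (a_2, b_M) \\
\vdots & \vdots & & \vdots \\
(a_N, b_1), & (a_N, b_2), & \dots, & (a_N, b_M)
\end{matrix}
$$

There are $N$ rows with $M$ pairs in each row, which gives $N \times M$ outcomes.

The generalized principle of counting can be expressed as below: if $r$ experiments are performed and the $i$-th experiment has $n_i$ possible outcomes, then the $r$ experiments together have $n_1 \times n_2 \times \dots \times n_r$ possible outcomes.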
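The counting principle can be checked by enumerating outcomes directly. The sketch below uses `itertools.product` from the Python standard library. The two experiments, a coin toss and a die roll, are hypothetical examples rather than anything prescribed by the principle.

```python
from itertools import product

# Hypothetical experiments: tossing a coin and rolling a six-sided die.
coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every outcome of the combined experiment is a (coin, die) pair.
pairs = list(product(coin, die))
print(len(pairs))  # equals len(coin) * len(die)

# Generalized principle: add a third experiment, here a second coin toss.
triples = list(product(coin, die, coin))
print(len(triples))  # equals len(coin) * len(die) * len(coin)
```

Enumeration like this is only practical for small experiments, which is why the multiplication rule matters.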

### Permutation

*What is a permutation?* Suppose we have three persons called Michael, Bob, and Alice. Assume the *three of them stand in a queue*. How many possible arrangements do we have? Take a look at the arrangements as follows:

- Michael, Bob, Alice
- Michael, Alice, Bob
- Bob, Michael, Alice
- Bob, Alice, Michael
- Alice, Michael, Bob
- Alice, Bob, Michael

As above, you will see *six permutations*. Right? But we cannot always write out all possible arrangements! We need some math. The intuition behind this problem is that we have *three places* to fill in a queue when we have three persons. **For the first place**, we have three choices. **For the second place**, there are two remaining choices. **Finally**, there is only one choice left for the last place! So we can extend this conclusion to an experiment in which we have $n$ choices. Hence, we get the following number of permutations:

$$
n \times (n-1) \times (n-2) \times \dots \times 2 \times 1 = n!
$$

**NOTE:** The descending multiplication from $n$ to $1$ as above (the product of all positive integers less than or equal to $n$) is denoted as $n!$ and called the factorial of $n$.
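As a quick sketch, the queue example can be enumerated with `itertools.permutations` and compared against `math.factorial`, both from the Python standard library. The variable names are illustrative.

```python
from itertools import permutations
from math import factorial

people = ["Michael", "Bob", "Alice"]

# Enumerate every possible ordering of the queue.
queues = list(permutations(people))
for queue in queues:
    print(" -> ".join(queue))

# The number of orderings matches n! for n people.
print(len(queues), factorial(len(people)))
```

For larger groups the enumeration grows very quickly (10 people already give 3,628,800 orderings), so in practice only the factorial is computed.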

### Combination

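A **combination** is a selection of objects from a larger set of objects in which the order of selection does not matter. For example, assume we have a total of $n$ objects. **In how many ways can we select $r$ objects from those $n$ objects?** Let’s get back to the above example. Assume we have **three candidates** named Michael, Bob, and Alice, and we *only* desire to select **two candidates**. How many different combinations of candidates exist? There are three: {Michael, Bob}, {Michael, Alice}, and {Bob, Alice}.

Let’s get back to the general question: How many selections can we have if we desire to pick $r$ objects from $n$ objects? There are $n \times (n-1) \times \dots \times (n-r+1) = \frac{n!}{(n-r)!}$ ordered ways to pick $r$ objects, and each unordered selection is counted $r!$ times among them. Hence:

$$
\binom{n}{r} = \frac{n!}{r!\,(n-r)!}
$$

For the three candidates, this gives $\frac{3!}{2!\,1!} = 3$, matching the list above.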
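The candidate example can be sketched in Python by enumerating the selections with `itertools.combinations` and comparing the count with the standard closed form $n! / (r!\,(n-r)!)$. The symbol `r` for the number of selected objects is a naming assumption, and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb, factorial

candidates = ["Michael", "Bob", "Alice"]
r = 2  # number of candidates to select

# Enumerate every unordered selection of r candidates.
selections = list(combinations(candidates, r))
print(selections)

# Compare the enumeration with the closed-form count n! / (r! (n - r)!).
n = len(candidates)
by_formula = factorial(n) // (factorial(r) * factorial(n - r))
print(len(selections), by_formula, comb(n, r))
```

Unlike permutations, `("Michael", "Bob")` and `("Bob", "Michael")` count as the same selection here, so only one of them appears.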

The above definition can be generalized.

## Conclusion

In this article, you learned about probability theory, why it is important in Machine Learning, and what its fundamental concepts are. Probability theory is of great importance in Machine Learning since Machine Learning deals with uncertainty and predictions. The basics above help you understand probability concepts and put them to use. Have any questions? Feel free to ask by commenting below.