In this article, we focus on the notion of supervised machine learning and its associated categories. In addition, we describe the components of supervised learning. Finally, we briefly discuss the popular categories and algorithms without digging too much into details. The goal is to have an idea of what is supervised learning.
In this article, you will learn the following items:
I believe your time is very valuable! So instead of mere bragging and to save your time, let me be honest with you for a second and tell you what you will NOT find here:
Let’s get started!
There are numerous applications for Machine Learning (ML), the most notable of which is data mining. Human is usually prone to making mistakes during examinations or, perhaps, when attempting to set connections between various data features. This practical setback makes it difficult for human experts to discover answers to specific problems. Machine learning employment can often tackle these obstacles, increasing the performance of systems.
Specific features should represent every sample in a dataset employed by machine learning. These features may be continuous, categorical, or binary. If we represent the samples with known labels (the associated correct outputs), then the supervised learning is practical, in contrast to unsupervised learning, where we are dealing with unlabeled samples.
What is Supervised Learning?
In machine learning, the algorithm teaches the machine how to learning from data. In supervised learning, in particular, the algorithm teaches the machine how to achieve a specific goal by showing the machine what the goal is. So the teacher (algorithm) knows the correct answers, and the student (machine) should cultivate a piece of knowledge to find the answers by try and error.
Assume you have a data which includes a bunch of photos with information associated with each of them such as the date in which somebody took the photo. What if someone gives you a new photo with no information about it and ask to determine whether this photo is new or old! Can you use the data you have to train an intelligent machine for that specific task? This is what called supervised machine learning.
In supervised learning, the algorithm digests the information of training examples to construct the function that maps an input to the desired output. In other words, supervised learning consists of input-output pairs for training. For testing, the ultimate goal is that the machine predicts the output based on an unseen input. This phenomenon is the desired generalization.
Supervised Learning Steps and Algorithm Selection
Supervised machine learning consists of the following steps:
Determine the nature of training data and performing data acquisition. Before any other step, the developer must decide what kind of data is to be employed and how to obtain it. For example, for the face classification task, it might be a gallery of face images belonging to known individuals. For object detection, it might be a gallery of labeled images and the bounding object boxes of each object in the image.
This step is one of the essential elements of any machine learning modeling, and it is also notorious for being very time-consuming. As it is clear from its name, this step aims to provide data for further processing and utilization. Data processing refers to the acquisition and manipulation of data samples to extract and generate meaningful information. It includes different elements such as validation (assuring that the provided data is accurate and consistent), sorting (systematizing parts in some sequence), abstracting (only extracting and organizing necessary data elements), aggregation (combining various pieces of data), analysis (the arrangement, coordination, study, understanding and exhibition of data) and classification (division of data into multiple categories).
Perhaps one may classify the “Data Preparation” step as the subset of “Data Pre-processing”. However, I decided to define a separate category for that to describe the specific uniqueness in supervised learning. Here, we MUST associate data instances to their corresponding labels as required by supervised training. Next, the data should be split into the training and testing sets. The training set supposed to be a representative of the real-world application. Therefore, we should gather a collection of input objects and their corresponding labels (categories), usually from human evaluations, to assure the accuracy of the labeling process.
It is critical to determine the data feature representation of the machine input. Despite amazing deep learning performance regarding the on-the-fly data feature learning, it is still imperative to put special attention to how the data is represented to the machine. As for now, the machine learning algorithms’ precision strongly depends on how the input representation. Typically, we transform the input data into feature vectors. Debates continue regarding the feature space size. As a rule of thumb, we should avoid an input feature space which is too large due to the curse of dimensionality; however, it should contain enough information necessary for predictive analysis.
Algorithm selection refers to identify the appropriate algorithm, which leads to the highest accuracy. It is neither a trivial task nor any special process, or secret sauce is currently available to find and pick the best algorithm. It needs extensive research and experience.
After algorithm selection and satisfactory preliminary testing, the classifier is available for regular employment. The classifier’s evaluation is usually based on prediction accuracy. There are different procedures to calculate a classifier’s efficiency. The most popular approach is to select a large portion of the data (two-thirds) for training and the remaining for measuring performance. In another method, identified as cross-validation, we split the training set into mutually exclusive subsets of equal size. For each subgroup, we train the classifier on the combination of all the other subgroups and perform the evaluation on that specific subgroup. Finally, we report the average classification accuracy of all subsets.
We can depict the process of applying supervised ML to a problem as below figure:
Supervised Learning Categories
Supervised learning divides into two general categories: regression and classification.
In statistics, regression consists of statistical analysis for determining the correlations between variables. It includes different methods for modeling and examining various variables with the primary goal of exploring the relationship between dependent and independent variables. More particularly, regression analysis clarifies how the typical value of the dependent variable changes when its associated independent variables are varied. Experts utilize regression analysis for prediction in which there is a substantial overlap between statistics and machine learning.
Regression analysis includes parametric and nonparametric methods. In the parametric approach, the regression function defined in terms of a measurable number of unknown parameters that will be estimated based on the data. Linear Regression is the most popular parametric regression method. Nonparametric regression points to techniques that form the regression function as a set of functions, which might be infinite-dimensional. There are different other types of regression models such as Polynomial Regression, Ridge Regression, and Lasso Regression which are beyond the scope of this article.
Linear Regression is a statistical approach to predict the value of a quantitative variable. Based upon an assemblage of independent variables, we attempt to determine the quantity of a dependent variable, which is the outcome of functional operation on the independent variables.
A linear regression points to a particular regression model that is entirely based on linear operations. Starting with the easy case, single variable linear regression is a technique for modeling the connection between a single input independent variable and an output dependent variable using a linear function, i.e., just a line.
$$ Y = f(X) = aX+b $$
We show a visual representation of the linear regression below:
Multi-variable linear regression is a more general case, where a model represents the relationship among multiple independent input variables and the associated output variable (dependent). In this case, the output is a linear combination of the independent input variables. We represent the multi-variable linear regression model as below:
$$ Y = f(X_1,X_2,…,X_N) = a_1*X_1+ a_2*X_2+…+a_N*X_N+b$$
in which \(b\) is the bias.
There are crucial points about the linear regression:
What is Classification?
Classification refers to the situation when the desired action is to predict the output category such as “elephant”, “flower” or merely a “yes/no” in binary classification, based on some known categories. A classification model tries to describe or categorize some observed data. Classification is the task of building a mapping function \(F(.)\) from the input \(X\) to output variables \(Y\) which is categorical.
There are different types of Classification algorithms. We do NOT dig deep into them. We briefly describe them and address their notion.
Logistic regression is a method for binary classification. It works to divide points in a dataset into two distinct classes, or categories. For simplicity, let’s call them class A and class B. The model will give us the probability that a given point belongs in category B. If it is low (lower than 50%), then we classify it in category A. Otherwise, it falls in class B.
It’s also important to note that logistic regression is better for this purpose than linear regression with a threshold because we should set the threshold manually, which is not very practical. Logistic regression will instead create a sort of S-curve (using the sigmoid function) which will also help show certainty since the output from logistic regression is not just a one or zero. Here is the standard logistic function, note that the output is always between 0 and 1, but never reaches either of those values.
Logistic regression is great for situations where you need to classify between two categories such as identification of “spam” or “not-spam” emails.
A recurring problem in machine learning is the need to classify the input into some preexisting class. Consider the following example. Say we want to classify a random piece of fruit we found lying around. In this example, we have three existing fruit categories: apple, blueberry, and coconut. Each of these fruits has three features we care about: size, weight, and color. We show information in below figure:
We observe the piece of fruit we found and determine it has a moderate size, it is heavy, and it is red. One can compare these features against the features of our known classes to guess what type of fruit it is. The unknown fruit is heavy like coconut, but it shares more features with the apple class. The unknown fruit shares 2 of 3 characteristics with the apple class so we guess that it’s an apple. We used the fact that the random fruit is moderately sized and red as an apple to make our guess.
This example is a bit silly, but it highlights some fundamental points about classification problems. In these types of problems, we are comparing features of an unknown input to features of known classes in our data set. Naive Bayes classification is one way to do this.
Naive Bayes is a classification technique that uses probabilities we already know to determine how to classify input. These probabilities are related to existing classes and what features they have. In the example above, we choose the class that most resembles our input as its classification. This technique is based around using Bayes’ Theorem. If you’re unfamiliar with what Bayes’ Theorem is, don’t worry! We will explain it in the next section.
Decision trees are a classifier in machine learning that allows us to make predictions based on previous data. They are like a series of sequential “if … then” statements you feed new data into to get a result.
To demonstrate decision trees, let’s take a look at an example. Imagine we want to predict whether Mike is going to go grocery shopping on any given day. We can look at previous factors that led Mike to go to the store:
Here we can see the amount of grocery supplies Mike had, the weather, and whether Mike worked each day. Green rows are days he went to the store, and red days are those he didn’t. The goal of a decision tree is to try to understand why Mike goes to the store, and apply that to new data later on.
Let’s divide the first attribute up into a tree. Mike can either have a low, medium, or high amount of supplies:
Here we can see that Mike never goes to the store if he has a high amount of supplies. We call this a pure subset, a subset with only positive or only negative examples. With decision trees, there is no need to break a pure subset down further.
Let’s break the Med Supplies category into whether Mike worked that day:
Here we can see we have two more pure subsets, so this tree is complete. We can replace any pure subsets with their respective answer – in this case, yes or no.
Finally, let’s split the Low Supplies category by the Weather attribute:
Now that we have all pure subsets, we can create our final decision tree:
The presentiment behind the K Nearest Neighbor Classifier algorithm is very simple: The algorithm classifies the new data point based on its proximity to different classes. The algorithm calculates the distance between the query data point (the unlabeled data point that supposed to be classified) and its K nearest labeled data points. Ultimately, it assigns the query data point to the class label that the majority of the K data points have.
Fig.1 demonstrates the algorithm in practice. We classify the test sample (green) as one of the blue or red classes. Considering K = 3, the three closest points determine the classification outcome. As the majority vote on the red category, then the algorithm assigns red as the test sample class.
Suport Vector Machine
A Support Vector Machine (SVM for short) is a powerful machine learning algorithm for data classification. The point of SVM’s are to try and find a line or hyperplane to divide a dimensional space which best classifies the data points. If we were trying to split two classes A and B, we would try to separate the two classes with a line, at best. On one side of the line/hyperplane would be data from class A and on the other side would be from class B. This algorithm is very useful in classifying because we must to calculate the best line or hyperplane once, and any new data points can easily be sorted just by seeing which side of the line they fall on. This contrasts with the k-nearest neighbor’s algorithm, where we would have to calculate each data points nearest neighbors.
A hyperplane depends on the space it is in, but it divides the space into two disconnected parts. For example, 1-dimensional space would just be a point, 2-d space a line, 3-d space a plane, and so on. You might be wondering that there could be multiple lines that split the data well. In fact, there is an infinite amount of lines that can divide two classes. As you can see in the graph below, every line splits the squares and the circles, so which one do we choose?
So how does SVM find the ideal line to separate the two classes? It doesn’t just pick a random one. The algorithm chooses the line/hyperplane with the maximum margin. Maximizing the margin will give us the optimal line to classify the data. Take a look at the below figure for visualization:
In this article, we addressed the notion of supervised learning as a category of machine learning. We explained what supervised learning is and why experts call it supervised! We described the steps to develop a machine learning model aimed to perform supervised learning as well as what is the purpose of supervised learning. Then, we divided supervised learning into two general categories of regression and classification. Finally, we discussed popular machine learning algorithms associated with each of the categories mentioned above. I hope you enjoyed this article! The description of the supervised learning algorithms in this article was initially written as part of our machine learning course. Please refer to the associated project page to see the details.