In this article, we outline the steps to define, frame, organize, deploy, and evaluate a successful Machine Learning (ML) project. The intended audience includes business stakeholders, supervisors, Machine Learning experts, and software development engineers.
“I was wondering what the big picture and the general steps are.” Perhaps you have asked yourself the same question when starting a project. Here, we are interested in AI and Machine Learning. You certainly need to know what you need to know! But what exactly do you need to know to create a successful Machine Learning project?

Introduction
Artificial Intelligence (AI), and Machine Learning (ML) as an approach to AI, are all over the place. Everyone wants to know what they are and how to use them. It is becoming a race between stakeholders to comprehend and make use of AI. Some people think Machine Learning is the panacea (the cure for everything)! BUT NOT SO FAST!
What is the problem in adopting Machine Learning?

Today, many businesses are eager for the knowledge needed to develop applications that apply Machine Learning. The majority of business owners believe Machine Learning will influence their commerce drastically and positively. However, many companies have reasonable doubts about the business impact of Machine Learning projects. Due to the black-box nature of Machine Learning, stakeholders usually question the advances presented in Machine Learning projects and discard projects when the actual deployment does not meet expectations. So, the problem is how to define, implement, deploy, and measure the impact of a Machine Learning project.
What makes the problem a problem?
The problem mentioned above requires proper planning and assessment. The proper planning, deployment, and evaluation are themselves complicated meandering processes. As we design the plan for our Machine Learning project, it’s essential to examine some emerging best practices and use-cases. This step puts our Machine Learning projects in an industrial context so that we can recognize, quantify, and maintain the business influence of the project. It is also vital to understand, manage, and alleviate the associated risks of a deployed complex Machine Learning system. We must frame and take the steps mentioned above in an organized manner. This effort demands a lot of attention and clarification.
What is our approach?

It is advantageous to showcase the bigger picture: how the numerous steps of a successful Machine Learning project are connected, and in what order they come. In this article, we present the roadmap to a flourishing Machine Learning project. This roadmap also aims to improve the interaction between the Machine Learning experts and the business stakeholders, who are financially accountable for the expenses and profits of the system. Here, you will learn the following:
- The steps to plan, implement and deploy a Machine Learning project
- The approach to justify a Machine Learning project in a business setting by recognizing and evaluating its expected benefits
- Addressing the common issues faced in Machine Learning projects
Who cares and benefits from this article?
It is essential to know who the audience of this article is. If you are in one of the categories below, you may find this article useful:
- A developer looking for a roadmap to start.
- A Machine Learning expert who wants a refresher.
- A manager trying to gain a broad knowledge of what needs to be done in a Machine Learning project and what the missing elements are.
- A business stakeholder trying to assess a project and understand the big picture.
- An educator trying to spread the knowledge.
Project Planning in Machine Learning
Why project planning?
First, let’s explain what we are speaking about. Many people believe the project plan is just the project schedule: the list of duties and dates that need to be met. This schedule is a portion of the project plan, but not the whole thing. Project planning includes anything that is associated with project success. It is the process of establishing the actions needed for project development, such as defining objectives, clarifying the scope of the work, and identifying the critical activities. Any Machine Learning project requires project planning as well. You wonder why? It is simple: it is a project!
What are the task, the feasibility, and the requirements?
A critical step is to define precisely what needs to be done. At the very least, we need to know the inputs and outputs from the beginning.

Then we need to ask the following questions to clarify the feasibility and requirements:
- Data acquisition feasibility
  - How complicated is it to obtain data?
  - Should we perform a data labeling operation?
  - How much data do we need?
  - Are there privacy issues or any other risks involved?
- Performance
  - How accurate should the model be?
  - How fast should the system be?
  - What concerns should be addressed? Privacy, perhaps?
- Background research
  - What kind of similar projects or practices are available?
  - Is the topic justified?
  - Who are the competitors?
- Technical aspects
  - Do we have the necessary manpower?
  - Do we have the computational resources?
  - What are the setbacks in algorithm development?
  - Are there any technical aspects that can be done in parallel?
What needs to be delivered in terms of expectations?
Define what needs to be explicitly delivered. This should be defined based on what is expected from the project. Examples are as follows:
- An accuracy equal to or greater than 90%
- A privacy risk no greater than 0.01
- A retrieval rate of 95%
It is important to know that we cannot always attach a precise value to the deliverables.
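When precise values are available, one way to keep them honest is to write them down as explicit thresholds and compare measured values against them. The following is only a sketch under stated assumptions: the `Deliverable` class, the metric names, and the measured numbers are hypothetical, and treating the 95% retrieval rate as a minimum is our reading of the example. Only the three threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deliverable:
    """A single measurable target agreed with the stakeholders."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, measured: float) -> bool:
        # Compare in the direction that makes sense for the metric.
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Threshold values follow the examples in the text; the names are hypothetical.
DELIVERABLES = [
    Deliverable("accuracy", 0.90),
    Deliverable("privacy_risk", 0.01, higher_is_better=False),
    Deliverable("retrieval_rate", 0.95),
]


def check_deliverables(measured: dict) -> dict:
    """Return, for each deliverable, whether the measured value meets it."""
    return {d.name: d.is_met(measured[d.name]) for d in DELIVERABLES}


# Made-up measurements, for illustration only.
report = check_deliverables(
    {"accuracy": 0.93, "privacy_risk": 0.02, "retrieval_rate": 0.95}
)
print(report)
```

With these made-up measurements, the accuracy and retrieval targets are met while the privacy target is not, which is the kind of result worth surfacing to stakeholders early.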
Data Matters the Most

Why do we collect data and what are the methods?
Data collection is a significant bottleneck in Machine Learning and an ongoing research problem in various scientific communities. There are mainly two reasons why data collection is necessary. First, as Machine Learning grows rapidly, researchers encounter numerous new applications without enough labeled data. Second, unlike traditional Machine Learning based on feature engineering, powerful data-driven techniques (deep learning) demand vast amounts of data. Interestingly, many recent advances in data collection originate from the data management community rather than the Machine Learning community, because of the challenges in handling large amounts of data. For these reasons, data collection is a crucial element of any effort built upon Machine Learning.

There are mainly three approaches to data collection. First, if the purpose is to explore new datasets, data acquisition methods can be used to identify, expand, or produce datasets. Second, once datasets are available, data labeling methods can be applied to label individual samples. Finally, instead of labeling new datasets, it may be preferable to improve the existing data or the way it is used. These three approaches can be combined. For example, a developer might research new data labeling methods while also searching for new datasets.
Data preparation
Nowadays, stakeholders usually use Machine Learning for supervised learning, for which labeled training data is required. The growth of deep-learning-based models demonstrates this hunger for data, as they often require massive amounts of training data. This trend raises issues of data security and privacy. Furthermore, the advent of regulations such as the General Data Protection Regulation (GDPR) and the importance of managing risk add to the importance of data management.
In data preparation, we essentially prepare the data to be ingested by the machine. In the case of supervised learning, we MUST associate data instances with their corresponding labels, as supervised training requires. Next, the data should be split into training and testing sets. The training set is supposed to be representative of the real-world application.
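The labeling and splitting steps can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed procedure: the function name `stratified_split`, the 20% test fraction, and the toy file names and labels are all assumptions made for the example. Splitting class by class is one simple way to keep the label proportions similar in both sets.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split (sample, label) pairs into train and test sets, class by class."""
    if len(samples) != len(labels):
        raise ValueError("every sample needs exactly one label")

    # Group sample indices by label so each class is split separately.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        # At least one test sample per class (an assumption of this sketch).
        n_test = max(1, round(len(indices) * test_fraction))
        test += [(samples[i], label) for i in indices[:n_test]]
        train += [(samples[i], label) for i in indices[n_test:]]
    return train, test


# Hypothetical toy data: ten image file names with two labels.
samples = [f"img_{i}.png" for i in range(10)]
labels = ["apple"] * 5 + ["banana"] * 5
train_set, test_set = stratified_split(samples, labels)
print(len(train_set), len(test_set))
```

No sample appears in both sets, which is the basic condition for the test set to say anything about unseen data.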
Companies are usually interested in augmenting the available data sets with new data to potentially improve existing Machine Learning models. There is a developing interest in exploring alternative ways to prepare data as well (such as synthetic data generation). There are important questions about data that business associates are usually asking themselves:
- How exactly can we augment the existing data?
- Do we need external data sources for data augmentation?
- Can new decentralization technologies (such as blockchain) come to our rescue by providing a secure data exchange pipeline?
Data integration
It is critical to determine the feature representation of the machine input. Despite the impressive ability of deep learning to learn features on the fly, it is still imperative to pay special attention to how the data is represented to the machine. For now, the precision of Machine Learning algorithms strongly depends on the input representation. Typically, we transform the input data into feature vectors. Debates continue regarding the size of the feature space. As a rule of thumb, we should avoid an input feature space that is too large, due to the curse of dimensionality; however, it should contain enough information for predictive analysis.
Data usually comes from a variety of sources before being used in Machine Learning models. As data matters the most in Machine Learning, determining how to design, develop, and manage robust data pipelines is vital. These requirements come in addition to having clean labeled data. Even in unsupervised tasks (for which there is no need for labeled data), data preparation and processing are crucial.
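As a small illustration of turning raw records into fixed-length feature vectors, the sketch below one-hot encodes text fields and passes numeric fields through unchanged. The field names (`color`, `weight_g`) and the records are hypothetical, and one-hot encoding is just one common choice of representation. Note that every distinct category adds a dimension, which is one way a feature space grows larger than intended.

```python
def build_vocabulary(records):
    """Collect one feature name per numeric field and per (field, category) pair."""
    names = set()
    for record in records:
        for field, value in record.items():
            if isinstance(value, str):
                names.add(f"{field}={value}")  # one-hot slot for this category
            else:
                names.add(field)               # numeric value is used as-is
    return sorted(names)


def vectorize(record, vocabulary):
    """Map one record to a fixed-length list of floats following the vocabulary."""
    vector = []
    for name in vocabulary:
        if "=" in name:
            field, category = name.split("=", 1)
            vector.append(1.0 if record.get(field) == category else 0.0)
        else:
            vector.append(float(record.get(name, 0.0)))
    return vector


# Hypothetical raw records coming from some upstream data source.
records = [
    {"color": "red", "weight_g": 150},
    {"color": "yellow", "weight_g": 120},
]
vocabulary = build_vocabulary(records)
vectors = [vectorize(r, vocabulary) for r in records]
print(vocabulary)
print(vectors)
```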
Over the last decade, many business stakeholders have started introducing (especially for public sale) data platforms for business intelligence and business analytics. More recently, companies have begun to move toward platforms with support for open source Machine Learning libraries, project management, and model evaluation and visualization.
Let’s Search, Find, and Pick a Model Before Creating One!
Now it’s time to choose our model. Remember, re-inventing the wheel is usually the last step, not the first one!
Start with something simple!
We may have heard this expression before. It is truly a golden rule for starting any project. Assume we want to develop a sophisticated Machine Learning project. It is daunting to even think about the numerous elements of the project that we have no clue about. However, a critical point is this: complicated things are made of many simple parts! That’s all! Knowing this, we should start by breaking the project into its simplest elements. Then, we begin with a simple task as a part of the whole.

The first model provides the most significant aid to project development and to the final product, so it does not need to be complicated. In fact, in the initial stages, we just learn by doing, and a lot of things may change along the way. So, do NOT be picky! There are things that need to be determined first to help the future integration of the developed parts:
After the aforementioned steps, we have a clear high-level understanding of the different elements of the project, and we can dig deeper into the practical aspects.
Set and define the baseline models
Baselines are beneficial for determining a lower bound on the required performance as well as the desired performance level. There are several approaches to setting the baseline model:
(1) Examine the literature to find a baseline based on the available models for similar tasks/datasets.
(2) Use off-the-shelf libraries and their implemented models. Scikit-Learn is one of the well-known Machine Learning libraries.
(3) Research the real-world application to find an acceptable performance level. It is worth understanding how well humans can perform the task at hand. There are areas in which AI already outperforms humans, such as image and object recognition, video games, and speech and lipreading technology.
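To make the idea of a lower bound concrete, here is a sketch of perhaps the simplest baseline: always predict the most frequent training label. It is written in plain Python for illustration and is not taken from any particular library. The function names and the toy spam/ham labels are assumptions. Any candidate model would be expected to beat this number before it is worth further effort.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _sample: most_common_label


def accuracy(predictions, targets):
    """Fraction of predictions that match the known targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical toy data.
train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
test_samples = ["mail_a", "mail_b", "mail_c", "mail_d"]
test_labels = ["ham", "spam", "ham", "ham"]

predict = majority_class_baseline(train_labels)
baseline_accuracy = accuracy([predict(s) for s in test_samples], test_labels)
print(baseline_accuracy)  # 0.75 on this toy data
```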
Gain knowledge by reproducing the ready-to-use models
If we are utilizing a known model, we must ensure that the model’s performance on a benchmark dataset [Understanding the Purpose and Use of Benchmarking, Benchmarking] makes sense by comparing it with the reported results in the literature. Thanks to the open-source platforms such as GitHub, nowadays, we can find the implementation of the majority of known models available and ready-to-use.
The term benchmarking is used in Machine Learning to refer to the evaluation and comparison of methods and algorithms regarding their capability to learn patterns in ‘benchmark’ datasets that are considered ‘standards’ for their specific characteristics. Benchmarking can be considered a sanity check to validate whether a new method is able to reliably find simple patterns that existing methods are known to identify. In other words, benchmarking is an approach to identifying the respective strengths and weaknesses of a given methodology in contrast with others.
Benchmark datasets typically take one of three forms:
- The first is accessible, well-studied real-world data, taken from different real-world problem domains of interest.
- The second is simulated data or data that has been artificially generated, often to ‘look’ like real-world data, but with known, underlying patterns.
- The third form is the toy data, which we will define here as data that is also artificially generated with a known embedded pattern but without an emphasis on representing real-world data.
Evaluation

The goal of model evaluation is to determine whether the model will do a reasonable job of predicting the target on new, unseen data. As real-world samples have unknown target values, we should evaluate the accuracy of the Machine Learning model using data with already known target values (the test set). Assuming we used a proper test set (the very first condition being that the model never saw any of the test samples during training), this predictive analysis can showcase the model’s efficiency.
After training is finished, we send the model the test observations, for which the target values (ground truth) are known. The model then makes predictions, and we compare them against the known target values. Ultimately, we compute a metric that indicates how well the predicted values match the ground truth.
Generally speaking, an effective evaluation usually contains the following elements:
- A prediction accuracy metric to show the overall effectiveness of the model
- Visualizations of the model efficiency
- Investigating the impact of different model parameters
- Confirming the validity of the evaluation using various methods such as benchmarking, etc.
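The comparison of predictions against ground truth can be sketched as follows for a two-class problem. The helper names and the toy labels are hypothetical. Precision and recall are shown next to accuracy as examples of additional metrics one might report, not as a statement of what a given project requires.

```python
from collections import Counter


def confusion_counts(predictions, targets, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter()
    for predicted, actual in zip(predictions, targets):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the confusion counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test-set ground truth and model predictions.
ground_truth = ["apple", "apple", "banana", "banana", "apple", "banana"]
predicted = ["apple", "banana", "banana", "apple", "apple", "banana"]
counts = confusion_counts(predicted, ground_truth, positive="apple")
metrics = summarize(counts)
print(dict(counts))
print(metrics)
```

The raw counts are worth keeping alongside the summary numbers, since they are what most evaluation visualizations (such as a confusion matrix plot) are built from.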
Rethink the Model
Assume that, after performing all the previous steps (running the model, debugging, and evaluation), we have a model up and operating at a decent or even sub-optimal level. Now it is time to rethink the model and try to make it better. There are many approaches to model refinement, which in general refers to the work specialists do to enhance the performance and robustness of Machine Learning models. Model refinement involves many different considerations. We address some of the most important ones below.
How is data affecting model development?
Do we need to investigate the effect of data on the model? The short answer is yes, always! It is very important to investigate the data once again. This step helps us understand whether we need (1) more data, (2) different kinds of data, (3) to reformat the data, and so on. In more detail, the following aspects need further attention:
- Did we use sufficient data for the evaluation of the model? If the number of records in the evaluation data is much smaller than the ones used in training data, then we are using too limited data for evaluation. As a rule of thumb, usually, the portion of the test/train samples is
, respectively.
- Do we have the same class distribution in the training and evaluation sets? It is crucial to compare the distributions of the training and evaluation data to validate their similarity. As an example, assume a binary classification task with “apple” and “banana” images as the two classes. The machine should classify each test image as one of the two categories. If we train the model with a class distribution different from that of the evaluation data, the machine tends to make decisions based on training statistics that differ substantially from those of the evaluation set. Getting back to our example, assume 50% of the training data consists of “apple” images, but only 10% of the evaluation data does. In this setup, the evaluation quality suffers.
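A quick check of this kind can be sketched in a few lines. The function names are hypothetical, and the data simply mirrors the apple/banana example (50% apples in training, 10% in evaluation). What counts as too large a gap is a project-specific judgment that this sketch does not try to settle.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def max_distribution_gap(train_labels, eval_labels):
    """Largest absolute difference in class share between the two sets."""
    train = class_distribution(train_labels)
    evaluation = class_distribution(eval_labels)
    classes = set(train) | set(evaluation)
    return max(abs(train.get(c, 0.0) - evaluation.get(c, 0.0)) for c in classes)


# Mirrors the example in the text: 50% apples in training, 10% in evaluation.
train_labels = ["apple"] * 50 + ["banana"] * 50
eval_labels = ["apple"] * 10 + ["banana"] * 90
gap = max_distribution_gap(train_labels, eval_labels)
print(round(gap, 2))
```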
Are we really using the best model?
This is the million-dollar question. Usually, we can never find the best model, for many reasons. To name a few: (1) there are numerous models out there, and we simply don’t have time to investigate them all; (2) a model by itself may have a lot of hyperparameters, which are hard to tune and optimize; (3) for the task at hand, we may not have the skilled people needed to research and suggest the top picks. However, there are some avenues to investigate that may bring more insight.
Change the model capacity
A suboptimal model capacity can cause overfitting or underfitting. Both are associated with the capacity of a Machine Learning model to build relevant knowledge from a set of training examples. Underfitting refers to the inability of a Machine Learning algorithm to extract enough solid knowledge from the training data to generalize to the test data [lack of capacity]. Overfitting, on the other hand, happens when a Machine Learning model relies too much on the specifics of the training data and fails to generalize to new, unseen test queries [excess capacity]. The question is how to tell whether we are encountering overfitting or underfitting.
We usually evaluate the performance of a Machine Learning model based on two factors: (1) the training error and (2) the difference between the training and test errors. In general, underfitting has happened when factor (1) cannot converge to an acceptable level. Overfitting has occurred when the training loss converges (factor (1) is satisfied) but the test error gets stuck and stops decreasing (factor (2) is not satisfied). In the case of underfitting, the model does not have enough capacity to learn. If we are encountering overfitting, the model is absorbing too much non-informative information.
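The two-factor reasoning above can be written down as a simple rule of thumb. This is a sketch only: the function name `diagnose_fit` and both threshold values are illustrative assumptions, not recommended settings, and in practice one would look at the whole training curve rather than a single pair of numbers.

```python
def diagnose_fit(train_error, test_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a training run using the two factors described in the text.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if train_error > acceptable_error:
        # Factor (1) fails: the model cannot even fit the training data.
        return "underfitting"
    if test_error - train_error > max_gap:
        # Factor (1) holds but factor (2) fails: the model does not generalize.
        return "overfitting"
    return "reasonable fit"


# Made-up error values for three hypothetical training runs.
print(diagnose_fit(train_error=0.30, test_error=0.32))  # underfitting
print(diagnose_fit(train_error=0.02, test_error=0.25))  # overfitting
print(diagnose_fit(train_error=0.05, test_error=0.08))  # reasonable fit
```

A verdict of underfitting points toward increasing model capacity, while a verdict of overfitting points toward reducing it or adding data.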

How about digging deeper to find a more advanced model?
It is becoming tough to stay up to date with the recent progress in machine learning. Day by day, we see fantastic innovations with new applications. However, most of these advances are buried in a large number of research articles and technical documents. It is essential to dig deep into new research to see whether we can find a better model for what we want to do.
Model Management and Maintenance
Sometimes we think it’s done! But not so fast! Deploying a Machine Learning model involves more than building and evaluating it. The entire lifecycle, including model deployment and operations, should be continuously managed. Without sufficient maintenance, Machine Learning models are prone to decay. This deterioration in predictive ability emerges when environmental conditions change. The chance of model decay rises when a Machine Learning model’s performance has not been monitored for a while. Specific matters that can affect a Machine Learning model’s performance need special attention.
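One lightweight way to notice decay is to track accuracy over the most recent predictions whose true outcomes have become known. The sketch below assumes such delayed ground truth is available. The class name `AccuracyMonitor`, the window size, and the alert threshold are hypothetical choices for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window_size=100, alert_below=0.85):
        # Each entry is True where the prediction matched the actual outcome.
        self.outcomes = deque(maxlen=window_size)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        accuracy = self.rolling_accuracy()
        return accuracy is not None and accuracy < self.alert_below


# Made-up stream of (prediction, actual outcome) pairs.
monitor = AccuracyMonitor(window_size=4, alert_below=0.75)
stream = [("spam", "spam"), ("ham", "ham"), ("ham", "spam"), ("ham", "spam")]
for predicted, actual in stream:
    monitor.record(predicted, actual)
print(monitor.rolling_accuracy(), monitor.needs_attention())
```

Because the window only keeps recent outcomes, older successes do not mask a recent drop in performance.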
Data dependency issue in Machine Learning
Data dependency is a serious issue because even “enhancements” to the input data may harm the Machine Learning system: it was trained on the previous data, which may differ from the new data in many characteristics. For example, assume a model is trained on image data with ten different classes. Augmenting the data with new classes results in the trained machine failing to classify samples from the new classes, about which it has no knowledge.
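For the ten-class example, a simple guard is to compare the labels arriving in new data with the classes the model was trained on. The sketch below is illustrative only: the function name and the class names are hypothetical, and it assumes the new data comes with labels to inspect.

```python
def find_unknown_classes(known_classes, incoming_labels):
    """Return labels in new data that the deployed model was never trained on."""
    return sorted(set(incoming_labels) - set(known_classes))


# The model in the example knows ten classes; names here are placeholders.
known_classes = [f"class_{i}" for i in range(10)]
incoming_labels = ["class_3", "class_10", "class_7", "class_11", "class_10"]
unknown = find_unknown_classes(known_classes, incoming_labels)
print(unknown)  # ['class_10', 'class_11']
```

A non-empty result signals that the model needs retraining or that these samples need separate handling before they reach the classifier.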
Behavior changes
One of the essential features of deployed Machine Learning systems is that they usually change their behavior as they are updated. This is hard to anticipate before actually using the machine in the real world. Consider a system whose performance is bound to another system’s output. For example, a scam detector aims to find the scam emails generated by another intelligent bot. What if the scam generator becomes stronger over time and we do not monitor the behavior of the detector?
Conclusion
This study set out to determine and address the different elements of a successful Machine Learning project. We described how to frame, implement, evaluate, and manage a Machine Learning project. The evidence from this study suggests that the steps to conduct a Machine Learning project are never final: we always have to go back to previous stages and make refinements and modifications.
Finally, a number of important limitations need to be considered. First, the current study has only examined the big picture and some general advice regarding each of the Machine Learning project elements, and it ignores the technical details, which are beyond the scope of this article. Second, it was not explicitly designed to describe the evaluation metrics related to the assessment of a Machine Learning project. The evaluation details by themselves need a lot of elaboration, and we will address them in future articles. Finally, further research should concentrate on how to manage a deployed Machine Learning project.