MLOps can be used to improve time to market and ensure that ML models meet regulatory, compliance, and end-user requirements.
MLOps stands for Machine Learning Operations. It is an essential part of the machine learning lifecycle, covering the deployment, maintenance, and monitoring of machine learning models in production. MLOps is often a collaborative effort carried out by data scientists, machine learning engineers, DevOps engineers, and IT professionals.
MLOps can improve the quality of machine learning solutions. It allows data scientists and machine learning engineers to collaborate more effectively by implementing Continuous Integration and Continuous Deployment (CI/CD) practices, along with monitoring, validation, and governance of ML models. The end result is improved time to market and assurance that ML models meet regulatory, compliance, and end-user requirements.
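To make the CI/CD idea concrete, here is a minimal sketch of a quality gate a pipeline might run before promoting a candidate model to production. The baseline accuracy, tolerance, and toy model below are hypothetical, illustrative values, not part of any specific MLOps product.

```python
# Hypothetical CI gate: block deployment if the candidate model's held-out
# accuracy regresses below the currently deployed baseline.

BASELINE_ACCURACY = 0.90   # accuracy of the model currently in production (assumed)
TOLERANCE = 0.01           # allowed regression before the build fails (assumed)

def evaluate(model, examples):
    """Fraction of (features, label) examples the candidate labels correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)

def ci_gate(model, holdout):
    """Return True if the candidate may be deployed, False to fail the pipeline."""
    return evaluate(model, holdout) >= BASELINE_ACCURACY - TOLERANCE

# Toy candidate: labels a number 1 if positive, else 0.
candidate = lambda x: 1 if x > 0 else 0
holdout = [(2, 1), (-1, 0), (3, 1), (-4, 0), (5, 1)]

print(ci_gate(candidate, holdout))  # the toy candidate passes the gate here
```

In a real pipeline, this check would run automatically on every model build, with the holdout set and baseline metric pulled from a model registry.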
The following are the critical elements needed to successfully deploy MLOps:
- Sufficient infrastructure resources—Machine learning models require resources throughout their lifecycle, and those requirements change as the model progresses from concept through development to production.
- Support for various ML model formats—The MLOps solution should be independent of details such as the programming languages used by the ML model and its development strategy. After all, most organizations use multiple languages and frameworks to develop their models.
- Support for software dependencies—The ML model will have multiple dependencies, more so if it is built on open source technologies. Your MLOps solution will need to support and version control these dependencies.
- Model monitoring—ML models are trained on historical data, and when their environment changes, they must be retrained. Hence, the MLOps solution must monitor models in production to ensure that they do not deviate from expected behavior.
- The ability to deploy anywhere—The ML model may need to be deployed in the cloud, on premises, or at the edge. Hence, an MLOps solution must allow for multiple deployment patterns so that the production environment remains flexible.
- Adequate data and governance—The ML model needs enough data to reach an appropriate level of performance. Synthetic data helps make larger data sets available without privacy concerns. In addition, the MLOps solution needs to provide sufficient data governance capabilities so that model operations can gain the trust of companies and regulators.
- Model retraining—The machine learning model must adapt to new data, and teams must ensure it is not broken by that data. Hence, an MLOps solution should allow models to be retrained on newer data while retaining the original algorithms, data pipelines, and code bases.
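The monitoring and retraining requirements above can be sketched in a few lines. The mean-shift check and threshold below are illustrative stand-ins; real drift detectors use richer statistics such as KS tests or population stability index.

```python
# Minimal drift monitor sketch: compare per-feature means of recent production
# traffic against the training baseline, and flag the model for retraining when
# any shift exceeds a threshold. Threshold and data are illustrative.
import statistics

def feature_means(rows):
    """Per-feature means for a list of equal-length feature vectors."""
    return [statistics.fmean(col) for col in zip(*rows)]

def needs_retraining(train_rows, live_rows, threshold=0.5):
    """True if any feature mean drifts by more than `threshold` in absolute value."""
    drift = [abs(a - b)
             for a, b in zip(feature_means(train_rows), feature_means(live_rows))]
    return any(d > threshold for d in drift)

train = [[1.0, 10.0], [1.2, 9.8], [0.8, 10.2]]   # training baseline
stable = [[1.1, 10.1], [0.9, 9.9]]                # live data, no drift
shifted = [[3.0, 10.0], [2.8, 10.2]]              # first feature has drifted

print(needs_retraining(train, stable))   # False
print(needs_retraining(train, shifted))  # True
```

A real system would run this check on a schedule and trigger the retraining pipeline automatically, keeping the original algorithms and code base as described above.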
See also: Git-based CI/CD for machine learning and MLOps
Hardware requirements for AI projects
Data is essential for machine and deep learning algorithms. After all, the accuracy of their predictions depends on how well the data is selected, collected, and pre-processed through methods such as classification, filtering, and feature extraction. Therefore, how data from different sources are collected and stored for AI applications greatly affects hardware design.
The data storage resources and computational power of an AI application usually don't scale together, so many systems handle the two aspects separately. One example is systems that allocate large, fast local storage to each AI compute node to feed the algorithm. This ensures that there is ample storage for algorithm execution and sustains AI performance.
Machine and deep learning algorithms involve a large number of matrix multiplications and floating-point operations. Moreover, these algorithms perform their computations in parallel, much like computer graphics workloads such as ray tracing and pixel shading.
While machine and deep learning computations require high parallelism, they do not require the same level of precision as graphics computations. This makes it possible to reduce the number of floating-point bits in their calculations to improve performance. Early deep learning research used standard GPU cards originally designed for graphics applications, but GPU manufacturer NVIDIA has since developed data center GPUs specifically for AI applications.
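To see the accuracy-for-performance trade-off of fewer floating-point bits, the sketch below round-trips inputs through IEEE 754 half precision (Python's `struct` format `'e'`) and measures the error of a dot product against the full-precision result. This illustrates the numeric effect only; it is not how GPUs actually execute reduced-precision arithmetic.

```python
# Quantize inputs to IEEE 754 half precision and compare a dot product
# against the double-precision reference.
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [0.1, 0.2, 0.3, 0.4]
b = [1.5, 2.5, 3.5, 4.5]

exact = dot(a, b)                                        # full-precision reference
half = dot([to_half(x) for x in a], [to_half(x) for x in b])

print(exact)
print(abs(exact - half) / exact)  # small relative error from the 10-bit mantissa
```

For many ML workloads this order of error is tolerable, which is why reduced-precision modes can roughly double arithmetic throughput on hardware that supports them.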
The following are the system elements most critical to AI performance:
| Component | Role in AI performance |
| --- | --- |
| CPU | Runs virtual machines or containers, dispatches code to GPUs, and handles I/O operations. Modern CPUs can also speed up ML and DL inference, which is useful for production AI workloads serving models that were pre-trained on GPUs. |
| GPU | Responsible for training ML and DL algorithms, and often handles inference as well. Modern GPUs have high-bandwidth integrated memory modules, which are much faster than regular DDR4 or GDDR5 DRAM; a system with 8 GPUs may have 256-320GB of high-bandwidth memory. |
| Memory | Since AI processing runs primarily on the GPU, system memory is not usually a bottleneck. Servers typically have around 128-256GB of DRAM. |
| Network | AI systems are usually clustered for better performance and use 10Gbps or faster Ethernet interfaces. Some systems also have dedicated GPU interconnects that support communication within the cluster. |
| Storage speed | The speed of data transfer between storage and compute resources affects the performance of AI workloads, so NVMe drives are generally preferred over SATA SSDs. |
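To make the storage point concrete, here is a back-of-the-envelope comparison of how long one full pass over a training dataset takes on different media. The throughput figures are rough assumed values for illustration, not benchmarks of any particular drive.

```python
# Estimated time to read a training dataset once, per storage medium.
# Throughput figures are assumed sustained sequential read rates.

DATASET_GB = 500  # hypothetical dataset size

throughput_gb_per_s = {
    "SATA SSD": 0.55,  # assumed
    "NVMe SSD": 3.5,   # assumed
}

for media, rate in throughput_gb_per_s.items():
    seconds = DATASET_GB / rate
    print(f"{media}: {seconds / 60:.1f} minutes per full pass")
```

Even with generous caching, a several-fold gap per epoch like this is why data-hungry training jobs favor NVMe-backed storage.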
MLOps in the cloud
MLOps can be hosted on premises and in the cloud:
Cloud-based MLOps provides access to a variety of managed services and features. Leading cloud providers allow you to run MLOps in the cloud, supplying the tools and computing power you need without having to purchase and set up hardware and build an on-premises ML environment. The following are examples of services offered by leading cloud providers:
- Amazon SageMaker is a machine learning platform that helps you build, train, manage, and deploy machine learning models in production-ready environments. SageMaker accelerates experimentation with specialized tools for labeling, data preparation, training, tuning, and monitoring.
- AzureML is a cloud-based platform for training, deploying, automating, managing, and monitoring machine learning experiments. Like SageMaker, it supports both supervised and unsupervised learning.
- Google Cloud offers a comprehensive, fully managed machine learning and data science platform, with features that help developers, data scientists, and data engineers manage ML services and build effective machine learning workflows. The platform enables fully automated machine learning lifecycle management.
On-premises MLOps requires deploying resources such as multi-GPU AI workstations in the local data center. For large-scale AI initiatives, it may also require software such as Kubernetes to orchestrate groups of compute nodes.
In this article, I explained the basics of MLOps and how they affect organizations and their data centers. I also described the basic elements of an AI hardware infrastructure:
- CPU – modern CPUs can be used to speed up certain types of ML models.
- GPU – necessary to run deep learning and some machine learning algorithms at scale.
- Memory – becoming a non-critical resource due to reliance on integrated GPU memory.
- Network – fast network connections between GPU clusters are needed.
- Storage – data throughput affects AI workload performance, making NVMe drives preferable.
Finally, I explained how AI infrastructure can be set up in the cloud versus on premises. I hope this helps as you plan the data center requirements for your organization’s AI initiatives.