
Cost-cutting techniques for Machine Learning in the cloud

Major cloud service providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure provide a wide array of efficient, scalable managed services, covering storage, computing, databases, and more. These platforms do not demand deep expertise in infrastructure management, but used carelessly, they can significantly increase your spending.

Here are some guidelines to keep your Machine Learning (ML) workloads from inflating your cloud bill.

Preparation: Establishing the bedrock for financial assessment

As the old adage goes, “You cannot optimize what you do not measure.” The first step in cost optimization is therefore to understand your current spending: what you pay for, and what drives each cost.

Most cloud platforms provide basic cost-tracking features that let you break down expenses by service or geographical region. Ask your cloud administrator for access to these reports and review them carefully.

To understand your spending in more detail, it is advisable to implement database-level tracking that attributes costs to individual machine-learning models, teams, and datasets.

Utilize SQL queries

Begin this journey by leveraging SQL queries on your metadata databases. This approach allows you to unearth invaluable insights, such as identifying which training jobs place the most substantial demand on resources, assessing the duration of each job, and determining the frequency of job failures.
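As a rough illustration of the kind of query described above, here is a minimal sketch using Python's built-in sqlite3 module. The `training_jobs` table, its columns, and the sample rows are hypothetical; your metadata database will have its own schema and engine.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_jobs ("
    "job_id TEXT, model TEXT, status TEXT, gpu_hours REAL, duration_min REAL)"
)
conn.executemany(
    "INSERT INTO training_jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("j1", "ranker", "succeeded", 12.0, 180.0),
        ("j2", "ranker", "failed", 8.0, 120.0),
        ("j3", "tagger", "succeeded", 2.0, 30.0),
        ("j4", "tagger", "succeeded", 3.0, 45.0),
    ],
)


def usage_by_model(connection):
    """Return (model, total GPU hours, average duration, failure rate) rows,
    ordered from the heaviest consumer to the lightest."""
    query = """
        SELECT model,
               SUM(gpu_hours) AS total_gpu_hours,
               AVG(duration_min) AS avg_duration_min,
               AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END) AS failure_rate
        FROM training_jobs
        GROUP BY model
        ORDER BY total_gpu_hours DESC
    """
    return connection.execute(query).fetchall()


report = usage_by_model(conn)
for row in report:
    print(row)
```

The same query answers all three questions from the paragraph above: which models consume the most resources, how long their jobs run, and how often they fail.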

Strengthen tracking mechanisms

Establishing robust tracking mechanisms within your Machine Learning platform beforehand ensures you have the necessary infrastructure to capture and store pertinent financial data.

Automation and efficiency

To streamline the data analysis and reporting process, consider automating these queries. This can be achieved by harnessing the capabilities of widely recognized analytics tools such as Tableau or Looker, making the financial assessment process more efficient and manageable.

In cases where you rely on a Machine Learning platform overseen by a dedicated team, it is expedient to collaborate with them to institute application-level tracking. This initiative facilitates the automatic generation of cost leaderboards across various dimensions, encompassing users, teams, projects, models, and designated time intervals. Furthermore, you can impose well-defined resource quotas to govern expenses more granularly, instilling fiscal prudence throughout your organizational framework.
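A cost leaderboard with quotas can be sketched in a few lines of Python. The record fields (`team`, `user`, `cost_usd`) and the quota values are assumptions made for this example, not the format of any particular platform.

```python
from collections import defaultdict


def cost_leaderboard(records, dimension):
    """Sum cost per value of `dimension`, sorted from most to least expensive."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def over_quota(leaderboard, quotas):
    """Return names whose spend exceeds their quota; names without a quota are ignored."""
    return [name for name, spend in leaderboard if name in quotas and spend > quotas[name]]


# Hypothetical cost records, for illustration only.
records = [
    {"team": "search", "user": "ana", "cost_usd": 120.0},
    {"team": "search", "user": "bo", "cost_usd": 40.0},
    {"team": "ads", "user": "cy", "cost_usd": 90.0},
]
by_team = cost_leaderboard(records, "team")
flagged = over_quota(by_team, {"search": 150.0, "ads": 100.0})
```

Passing `"user"` instead of `"team"` produces the per-user leaderboard from the same records.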



Preemptive halt

Waiting until a training job finishes to evaluate its results, only to find that the model performs poorly, wastes every resource the job consumed. Ideally, you should be able to assess your model’s performance while training is still running, so you can decide whether to stop early.

You can do this by periodically evaluating the model from its checkpoints. If your Machine Learning platform lets you monitor in-progress metrics (such as loss or accuracy on a validation sample), or extract them from checkpoints, you can stop resource-intensive jobs before they burn through your budget to no effect.
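One common way to automate this decision is a patience rule on the validation loss. The sketch below is one possible rule, not a prescribed one; the function name, `patience`, and `min_delta` are illustrative choices.

```python
def should_stop_early(val_losses, patience=3, min_delta=0.0):
    """Return True when the best validation loss of the last `patience`
    evaluations is no better (within `min_delta`) than the best loss
    seen before them."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent >= best_before - min_delta


# Validation loss recorded after each checkpoint evaluation.
stalled = should_stop_early([1.0, 0.8, 0.7, 0.71, 0.72, 0.73])
improving = should_stop_early([1.0, 0.9, 0.8, 0.7])
```

Calling the function after every evaluation lets the training loop exit as soon as the metric stalls.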

Reboots with a warm touch

In long-running training jobs, there is a real risk that the job fails before it finishes. Failures happen for many reasons: bugs in the code, transient issues (like network blips), memory limits, etc. Who among us hasn’t experienced the heartache of a training job failing just short of completion after hours or even days of work?

To navigate this precarious terrain, the “warm restarts” solution beckons. The crux of this strategy lies in not restarting training from square one but instead picking up where the last job left off. This approach hinges on two fundamental pillars:

Checkpoints on standby: Save your model checkpoints to durable storage, such as a cloud object store, at regular intervals (perhaps after each epoch). For example, PyTorch provides APIs for creating and storing model checkpoints. Make sure the checkpoints do not live on ephemeral storage, such as the short-lived local disk of a Kubernetes pod, which disappears along with the job.

Ready to reboot: Modify your training code so that it can load a previous checkpoint at startup. This can be done with an optional input argument that points the code to an existing checkpoint.

The fusion of checkpointing and automatic retries bestows upon your workloads the power to rebound from adversity and resume operations from the point of interruption.
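The two pillars and the retry loop can be sketched together as follows. This is a toy, framework-free illustration: the JSON file stands in for a real model checkpoint (in a PyTorch job you would typically use `torch.save` and `torch.load` instead), the temporary directory stands in for durable storage, and `fail_at_epoch` exists only to simulate a transient failure.

```python
import json
import os
import tempfile

epochs_run = []  # records every epoch actually executed, for the demo


def save_checkpoint(path, state):
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists yet."""
    if path and os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"epoch": 0, "weights": 0.0}


def train(total_epochs, checkpoint_path, fail_at_epoch=None):
    """Toy training loop: resumes from the checkpoint and saves after every epoch."""
    state = load_checkpoint(checkpoint_path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise RuntimeError("simulated transient failure")
        epochs_run.append(epoch)
        state["weights"] += 1.0  # placeholder for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(checkpoint_path, state)
    return state


def train_with_retries(total_epochs, checkpoint_path, max_retries=3, fail_at_epoch=None):
    """Retry after failures; each attempt picks up from the last saved checkpoint."""
    for attempt in range(max_retries + 1):
        try:
            # Inject the simulated failure on the first attempt only.
            injected = fail_at_epoch if attempt == 0 else None
            return train(total_epochs, checkpoint_path, injected)
        except RuntimeError:
            if attempt == max_retries:
                raise


checkpoint_dir = tempfile.mkdtemp()  # stand-in for durable storage
checkpoint_path = os.path.join(checkpoint_dir, "ckpt.json")
final_state = train_with_retries(5, checkpoint_path, fail_at_epoch=3)
```

After the simulated failure at epoch 3, the second attempt resumes from the saved checkpoint, so no epoch is executed twice.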


Compute caching

In Machine Learning model development, running the same workload repeatedly with various inputs and configurations is expected as you fine-tune your model. You experiment with different hyperparameters, model architectures, and more. However, it’s essential to recognize that some segments of your training code remain relatively static or entirely unaltered across successive executions.

Consider, for instance, a training pipeline that includes data preprocessing tasks that prepare the input data for training. Suppose you’re currently focused on tweaking your model architecture. In such cases, it’s prudent to cache the preprocessed training dataset instead of regenerating it on every run. This saves both data transfer costs and time.

To implement caching effectively, follow these steps:

Modular transformation: Ensure that successive transformation stages in your code are modular and well-defined, encapsulated as distinct units, such as functions, each with clear data contracts.

Persistent output: Save the work of each transformation stage using a storage key, which can either be explicitly designated by the user (a user-specified string) or implicitly generated from input arguments.

Efficient retrieval: Set up your code to recognize when the same key is provided. Rather than re-executing the transformation workload, it should retrieve and utilize the cached value, boosting efficiency.
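The three steps above can be sketched as a small decorator. This is a minimal illustration under several assumptions: the cache lives in a local temporary directory (use durable storage in practice), the implicit key requires JSON-serializable arguments, and the key does not change when the stage's code changes, so you would need to clear the cache or pass a new explicit key after editing a stage. The names `cached_stage`, `cache_key`, and `preprocess` are made up for the example.

```python
import functools
import hashlib
import json
import os
import pickle
import tempfile

# Assumption: a local temporary directory; use durable storage in practice.
CACHE_DIR = tempfile.mkdtemp(prefix="stage_cache_")


def cached_stage(func):
    """Cache a stage's output on disk, keyed by an explicit or derived key."""

    @functools.wraps(func)
    def wrapper(*args, cache_key=None, **kwargs):
        if cache_key is None:
            # Implicit key: hash of the stage name and its JSON-serializable arguments.
            payload = json.dumps([func.__name__, args, kwargs], sort_keys=True)
            cache_key = hashlib.sha256(payload.encode()).hexdigest()
        path = os.path.join(CACHE_DIR, cache_key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as handle:
                return pickle.load(handle)  # cache hit: skip the computation
        result = func(*args, **kwargs)
        with open(path, "wb") as handle:
            pickle.dump(result, handle)
        return result

    return wrapper


calls = {"preprocess": 0}  # counts real executions, for the demo


@cached_stage
def preprocess(values, scale):
    """Stand-in for an expensive, deterministic preprocessing stage."""
    calls["preprocess"] += 1
    return [v * scale for v in values]
```

Calling `preprocess([1, 2, 3], 2)` twice executes the stage once; the second call reads the stored result. Passing `cache_key="my-dataset-v1"` uses a user-specified key instead of the derived one.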

Data cache for swift access

Even if you’re altering input configurations with each execution of your training code and unable to cache tasks, you’ll repeatedly access the same data. Depending on where your workloads are hosted, caching this data on or near the computing resources may be feasible.

For instance, if your workload runs on a cloud Virtual Machine (VM), like an EC2 instance in AWS, you might be able to store some of your training data directly on the VM for significantly faster and cheaper access. Of course, the entire dataset is unlikely to fit on the disk, so you will need an eviction policy such as Least Recently Used (LRU) to make room for new data. Commercial solutions enable mounting a local S3 cache directly on your VMs, offering efficient data accessibility.
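To make the LRU idea concrete, here is a small sketch of a size-bounded cache. It keeps objects in memory for brevity; a real local cache would store files on the VM's disk and track their sizes the same way. The class name, the `fetch` callback, and the tiny capacity are illustrative assumptions.

```python
from collections import OrderedDict


class LRUDataCache:
    """Size-bounded cache that evicts the least recently used objects when full."""

    def __init__(self, capacity_bytes, fetch):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch  # callable that downloads an object by key
        self.items = OrderedDict()  # key -> bytes, least recently used first
        self.used_bytes = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key]
        data = self.fetch(key)  # cache miss: go to remote storage
        self.items[key] = data
        self.used_bytes += len(data)
        # Evict the oldest entries until within capacity, never the one just added.
        while self.used_bytes > self.capacity_bytes and len(self.items) > 1:
            _, evicted = self.items.popitem(last=False)
            self.used_bytes -= len(evicted)
        return data


downloads = []  # records every remote fetch, for the demo


def fake_fetch(key):
    """Stand-in for a download from object storage; returns 4 bytes per object."""
    downloads.append(key)
    return b"x" * 4


cache = LRUDataCache(capacity_bytes=8, fetch=fake_fetch)
for key in ["a", "b", "a", "c", "a"]:
    cache.get(key)
```

With room for two objects, requesting `c` evicts `b` (the least recently used), while `a` stays cached and is downloaded only once.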

Optimizing GPU utilization for cost-efficiency

GPU resources are among the most expensive in the cloud computing landscape. When a GPU virtual machine spends its time downloading data, doing CPU processing, or loading data into memory, the GPU itself sits idle, and you pay for capacity you are not using.

GPU-optimized libraries and frameworks

Leverage GPU-optimized libraries and frameworks for Machine Learning, such as CUDA, cuDNN, and TensorRT. These tools are engineered to harness the full potential of GPUs and offer efficient implementations of common operations, ultimately boosting GPU utilization.

Efficient memory management

Maximizing GPU memory utilization is pivotal. Minimize unnecessary data transfers between the CPU and GPU by retaining data on the GPU whenever feasible. If memory constraints pose challenges, explore memory optimization techniques like data compression or employing smaller data types to reduce memory footprint.
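The effect of smaller data types is easy to estimate with simple arithmetic: memory is the number of elements times the bytes per element. The helper below is an illustrative calculation only, and the batch shape is an arbitrary example.

```python
# Size in bytes of one element for some common numeric types.
BYTES_PER_ELEMENT = {"float64": 8, "float32": 4, "float16": 2, "int8": 1}


def tensor_megabytes(shape, dtype):
    """Return the memory needed for a dense tensor of `shape`, in MiB."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)


# Example: a batch of 256 RGB images at 224x224 resolution.
batch_shape = (256, 3, 224, 224)
fp32_mb = tensor_megabytes(batch_shape, "float32")
fp16_mb = tensor_megabytes(batch_shape, "float16")
```

Halving the element size halves the footprint, which can make the difference between a batch fitting on the GPU or not. Whether a smaller type preserves model quality depends on the workload and has to be validated.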

Profiling and optimization

Profile your GPU utilization using specialized tools provided by your GPU vendor or relevant frameworks. This profiling allows you to pinpoint potential bottlenecks or areas where GPU resources are underused. With these insights, optimize your code, data pipeline, and model architecture to enhance GPU utilization in these identified areas, ensuring efficient resource allocation.

Streamlined data loading

Optimize your data loading process to ensure a continuous data flow to the GPU. Reduce data transfer overhead by preprocessing and preloading data onto the GPU in advance, facilitating uninterrupted GPU computations. This practice is critical for cost-efficient GPU usage.

Asynchronous operations

Make the most of asynchronous operations whenever applicable to keep the GPU engaged. Asynchronous data transfers, kernel launches, and computations can overlap, enabling the GPU to multitask, thereby enhancing overall utilization.
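The overlap between loading and computing can be illustrated without any GPU framework. The sketch below prefetches batches on a background thread while the consumer processes the current one; the function name and buffer size are illustrative. Deep learning frameworks ship their own mechanisms for this (for example, PyTorch's `DataLoader` with worker processes), which you would normally use instead.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream
    errors = []

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks when the buffer is full
        except Exception as exc:  # forward loader errors to the consumer
            errors.append(exc)
        finally:
            buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is sentinel:
            if errors:
                raise errors[0]
            return
        yield batch


# The consumer sees the batches in their original order.
loaded = list(prefetch(iter(range(5))))
```

While the consumer works on one batch, the producer is already loading the next ones, so the compute side waits less. If the consumer stops iterating early, the daemon producer thread simply stays blocked until the process exits.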

Efficient GPU utilization reduces costs and accelerates the execution of machine learning tasks, ultimately enhancing the productivity of your cloud-based GPU resources.

Cost-efficient infrastructure strategies

Spot instances and preemptible VMs

Some cloud providers offer spot instances or preemptible VMs at significantly reduced prices compared to on-demand instances. The provider can reclaim these instances anytime, but if your workload is flexible and fault-tolerant, utilizing these lower-cost options can yield substantial cost savings.
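Being fault-tolerant usually means reacting to the reclaim notice by saving progress. The sketch below assumes the platform delivers a SIGTERM signal to the training process before the instance goes away (Kubernetes does this when it terminates a pod); providers announce reclaims in different ways, so check how yours does it. `PreemptionGuard`, `run_steps`, and the `save_checkpoint` callback are hypothetical names.

```python
import signal


class PreemptionGuard:
    """Remember that a termination signal arrived so the loop can stop cleanly."""

    def __init__(self):
        self.stop_requested = False
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum=None, frame=None):
        self.stop_requested = True


def run_steps(total_steps, guard, save_checkpoint):
    """Run training steps; on a stop request, checkpoint and return the next step."""
    for step in range(total_steps):
        if guard.stop_requested:
            save_checkpoint(step)  # persist progress before the instance is reclaimed
            return step
        # ... one real training step would run here ...
    return total_steps
```

The returned step number tells the next attempt where to resume, which pairs naturally with checkpoint-based restarts.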

Optimize instance types

Cloud providers offer a range of instance types with different performance characteristics and costs. Analyze your workload requirements and choose the instance type that provides enough computational power without overprovisioning. Consider options like burstable or GPU instances if they align with your workload.

Opt for the right cloud provider and pricing model

Cloud providers offer varying pricing structures for machine learning services. Compare the pricing models, instance types, and available options to select the most cost-effective solution. Some providers also offer discounted pricing for long-term commitments or spot instances, which can significantly reduce costs.

Leverage auto-scaling

Take advantage of auto-scaling capabilities offered by cloud providers to adjust the number of instances automatically based on demand. Scale up or down your resources dynamically to match the workload, ensuring you only pay for what you need.
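The core of a scaling rule is a small calculation. The sketch below sizes a worker pool from the number of pending jobs; it illustrates the idea only and is not any provider's actual algorithm. All names and limits are hypothetical.

```python
import math


def desired_workers(pending_jobs, jobs_per_worker, min_workers=0, max_workers=10):
    """Return how many workers are needed for the current queue,
    clamped to the allowed range. `jobs_per_worker` must be positive."""
    needed = math.ceil(pending_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


idle = desired_workers(0, 4)     # nothing queued: scale to zero
normal = desired_workers(9, 4)   # 9 jobs at 4 per worker need 3 workers
capped = desired_workers(100, 4) # demand above the cap is limited to 10
```

The `max_workers` cap doubles as a cost guardrail: a runaway queue cannot scale the bill without bound.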

Optimize data and compute placement

Cloud data transfers come with a price tag, and misconfigurations can be financially taxing. For instance, moving data between AWS regions is considerably more expensive than keeping it within the same region, and exporting S3 data outside the AWS ecosystem incurs a substantial cost increase.

Consequently, ensuring that your workloads operate within the same Availability Zone as your data storage is paramount. Failing to colocate data and compute can also inflate your computational expenses, as your virtual machines idle while transferring data rather than harnessing their CPU/GPU resources efficiently.

Enhancing Machine Learning cost efficiency beyond cloud infrastructure


Managing engineering costs

In Machine Learning (ML), ML Engineers represent precious yet costly assets. To maximize your resources, you must ensure that each ML Engineer is equipped and trained to contribute to high-impact work consistently. This objective can be achieved through various vital strategies. Regularly reviewing and refining project roadmaps is crucial to deprioritize low-impact initiatives, focusing resources on the most valuable tasks. It’s also important to carefully select tools that maximize productivity and efficiency. Consider delegating specific tasks to specialized teams, such as leveraging platform and infrastructure teams for system-related work. Lastly, establish robust knowledge-sharing processes to facilitate the growth of less experienced engineers, fostering a culture of collaboration and continuous learning.

Controlling labeling costs

Data labeling is a pivotal step in the ML process, often relying on human effort and incurring substantial costs. Several strategies help reduce these costs. Firstly, make sure that only high-leverage data is labeled manually. This means focusing on rare events that are underrepresented in your training dataset, since these are the cases most likely to cause significant production failures. Labeling more data that your model already handles well is unnecessary and costly. Secondly, auto-labeling techniques can be a game-changer. While automated methods may not provide labels of the same quality as human labelers, they are highly effective for specific data types. Employing simpler models, algorithmic heuristics, or data mining techniques can substantially reduce the volume of data requiring manual labeling. These strategies not only reduce costs but also make the labeling process more efficient.
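One simple way to combine both strategies is to route samples by model confidence: accept the model's label when it is confident and send the rest to human labelers. The sketch below assumes you already have per-class probabilities from an existing model; the threshold and data are illustrative, and a confident prediction is not guaranteed to be correct, so auto-labels should be spot-checked.

```python
def split_for_labeling(predictions, confidence_threshold=0.9):
    """Split samples into auto-labeled ones and ones that need a human.

    `predictions` maps sample id -> {label: probability}.
    """
    auto_labels, needs_human = {}, []
    for sample_id, probs in predictions.items():
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= confidence_threshold:
            auto_labels[sample_id] = label
        else:
            needs_human.append(sample_id)  # uncertain: worth a manual label
    return auto_labels, needs_human


# Hypothetical model outputs, for illustration only.
predictions = {
    "a": {"cat": 0.97, "dog": 0.03},
    "b": {"cat": 0.55, "dog": 0.45},
    "c": {"cat": 0.08, "dog": 0.92},
}
auto_labels, needs_human = split_for_labeling(predictions)
```

Here only sample `b`, where the model is unsure, is sent for manual labeling.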


ML workloads are expensive due to their dependence on extensive datasets and robust computing resources. Large ML enterprises allocate entire teams to monitor and optimize costs meticulously.

Nonetheless, this doesn’t imply that cost control is out of reach for smaller-scale operations. Through meticulous planning, thoughtful deliberation, and diligent optimization, significant cost reduction can be achieved while advancing your model development and performance.

✔️ Do you want your cloud and ML/AI operations to be under control and your expenses to meet your expectations? Assess the capabilities and potential of an open source platform OptScale → https://hystax.com/introducing-optscale-public-release-an-open-source-powerhouse-for-finops-and-mlops/

Hystax OptScale offers an MLOps and FinOps platform for cloud and ML/AI enthusiasts that is fully available under Apache 2.0 on GitHub → https://github.com/hystax/optscale
