Please consider giving OptScale a Star on GitHub, it is 100% open source. It would increase its visibility to others and expedite product development. Thank you!

Live Webinar: VMware migration insights, smart VM replication, synthetic full backup, KubeVirt support, and more →

Case Study

How an IT startup facilitated and enhanced ML model management and experiment tracking

Executive Summary

A machine learning company (The Company) with a team of 83 engineers improved ML model training performance and reduced cloud costs by 27% within two months by using Hystax OptScale, an open source MLOps and FinOps platform.

The Goal

The Company aimed to gain complete visibility into the ML model training process and its output metrics, and thereby improve the efficiency of its machine learning operations, with a specific focus on simplifying model training and hyperparameter tuning.

The Challenge

The team struggled to track its progress on ML model training because it had no shared dashboard displaying training results. AWS cloud costs kept escalating because of the intensive computing power that ML model training requires. Without detailed insights into individual ML training performance metrics and overall ML/AI operations, the team also found it hard to manage, monitor, and optimize its cloud resources. In addition, hyperparameter tuning was less efficient than it could have been, and poor budget management often led to cloud spending limits being exceeded.

The Solution

The Company adopted the OptScale SaaS version, taking advantage of its features targeted at cost management (FinOps) and MLOps. OptScale’s ability to provide detailed cost information for each cloud resource and its cost optimization recommendation engine helped The Company efficiently manage and optimize its costs.

ML model training instrumentation

This OptScale feature made ML model training more efficient and tracked the cloud resources each training run used.

Model dashboard/leaderboard

OptScale's dashboard gave a comprehensive view of the training metrics for each ML model. It helped the team gain insights into model performance, make informed decisions, and adjust operations as needed.

Model training performance insights

The developers gained valuable insights from the performance metrics gathered for every stage of each ML model training session and implemented code improvements to reduce training time.
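Per-stage performance metrics of the kind described above can be illustrated with a minimal sketch. This is not OptScale's instrumentation API; `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real training work. The idea is simply to accumulate wall-clock time per named stage so that the slowest stage stands out as the first candidate for code improvements.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        # Measure the time spent inside the `with` block and add it to
        # the running total for this stage, even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def slowest(self):
        # Name of the stage with the largest accumulated time.
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
for epoch in range(2):
    with timer.stage("data_loading"):
        time.sleep(0.03)   # stand-in for reading a batch
    with timer.stage("forward_backward"):
        time.sleep(0.001)  # stand-in for the training step

print(timer.slowest())
```

In a real training loop, the totals would typically be sent to whatever tracking dashboard the team uses instead of being printed.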

Runsets for hyperparameter tuning

Using this feature, the team set up the framework and templates for running hyperparameter tuning sessions on Spot Instances. This approach made hyperparameter tuning more efficient and helped control costs by setting maximum budgets and durations for model training tasks.
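The effect of a maximum budget and duration on a tuning session can be sketched as follows. This is an illustration only, not how OptScale implements runsets: `RunLimits`, `plan_runs`, the hourly rate, and the run length are all assumed values chosen for the example. The sketch walks a hyperparameter grid and stops planning runs as soon as the next one would exceed either cap.

```python
import itertools
from dataclasses import dataclass


@dataclass
class RunLimits:
    max_budget_usd: float  # hypothetical spending cap for the whole sweep
    max_hours: float       # hypothetical cap on total training time


def plan_runs(grid, hourly_rate_usd, hours_per_run, limits):
    """Return the hyperparameter combinations that fit within both caps."""
    planned, spent, elapsed = [], 0.0, 0.0
    keys = sorted(grid)
    cost = hourly_rate_usd * hours_per_run  # assumed flat cost per run
    for values in itertools.product(*(grid[k] for k in keys)):
        # Stop before the run that would break either limit.
        if spent + cost > limits.max_budget_usd:
            break
        if elapsed + hours_per_run > limits.max_hours:
            break
        planned.append(dict(zip(keys, values)))
        spent += cost
        elapsed += hours_per_run
    return planned, spent, elapsed


# Illustrative numbers only: 6 combinations, each run costing 1.00 USD.
grid = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 64]}
limits = RunLimits(max_budget_usd=4.0, max_hours=100.0)
planned, spent, elapsed = plan_runs(grid, hourly_rate_usd=0.5,
                                    hours_per_run=2.0, limits=limits)
print(len(planned), spent, elapsed)
```

With these assumed figures the budget cap is the binding constraint: only four of the six combinations are scheduled before the 4.00 USD limit is reached.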

Recommendations for optimal cloud usage

OptScale provided valuable recommendations for the cloud capacity the team used for ML training.

Resource allocation

OptScale's detailed insights into resource utilization enabled the team to monitor and adjust resource allocation effectively for ML training sessions, leading to substantial cost savings and improved ML operations.

The Result

With OptScale, The Company significantly boosted its MLOps efficiency, notably in ML model training, performance and experiment tracking, and hyperparameter tuning. In addition, they achieved a 27% reduction in cloud costs in the first quarter of usage. This enhancement empowered the team to concentrate more on innovation and delivering high-quality ML solutions for their customers, resulting in increased productivity and client satisfaction.

To run ML/AI or any other workload with optimal performance and infrastructure cost using OptScale, a FinOps and MLOps open source platform, contact us today.
