A Machine Learning company (The Company) with a team of 83 engineers improved ML model training performance while reducing cloud costs by 27% within two months of usage by leveraging Hystax OptScale, an open-source MLOps and FinOps platform.
The Company aimed to gain complete visibility into the ML model training process and output metrics, thereby improving the efficiency of its Machine Learning operations, with a specific focus on simplifying model training and hyperparameter tuning.
The team struggled to track progress on ML model training because there was no shared dashboard displaying training results. They also grappled with constantly escalating AWS cloud costs driven by the intensive computing power that ML model training requires. In addition, the team had difficulty managing, monitoring, and optimizing cloud resources because they lacked detailed insight into individual ML training performance metrics and overall ML/AI operations. Hyperparameter tuning was not streamlined and left room for greater efficiency, and poor budget management often led to exceeded cloud spending limits.
The Company adopted the OptScale SaaS version, taking advantage of its features targeted at cost management (FinOps) and MLOps. OptScale’s ability to provide detailed cost information for each cloud resource and its cost optimization recommendation engine helped The Company efficiently manage and optimize its costs.
This OptScale capability facilitated efficient ML model training and tracking of the resources used in the cloud.
The OptScale dashboard provided a comprehensive view of various training metrics for each ML model. This view helped the team gain insights into model performance, make informed decisions, and adjust operations as needed.
The developers gained valuable insights from the performance metrics gathered for every stage of each ML model training session and implemented code improvements to reduce training time.
Using OptScale, the team set up a framework and templates for running hyperparameter tuning sessions on Spot Instances. This approach made hyperparameter tuning more efficient and helped control costs by setting maximum budgets and durations for model training tasks.
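To illustrate the idea of capping a tuning session by budget and duration, here is a minimal sketch in Python. It is not OptScale's API or The Company's actual setup: the names `Limits` and `tune`, the fixed per-trial time estimate, and the hourly Spot price are all assumptions made for the example. The sketch runs a plain random search and refuses to launch a new trial if that trial would push the session past either cap.

```python
import random
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical session limits; all values are illustrative assumptions."""
    max_budget_usd: float      # hard cap on spend for the whole session
    max_duration_hours: float  # hard cap on total training time
    hourly_rate_usd: float     # assumed Spot Instance price per hour
    hours_per_trial: float     # assumed (estimated) duration of one trial


def tune(objective, search_space, limits, seed=0):
    """Random search that never starts a trial that would exceed a cap.

    objective(params) returns a score to maximize.
    search_space maps each hyperparameter name to a list of candidate values.
    """
    if limits.hours_per_trial <= 0:
        raise ValueError("hours_per_trial must be positive")

    rng = random.Random(seed)
    trial_cost = limits.hours_per_trial * limits.hourly_rate_usd
    spent_usd = 0.0
    elapsed_hours = 0.0
    trials = 0
    best_params, best_score = None, float("-inf")

    while True:
        # Check both caps before launching the next trial.
        over_budget = spent_usd + trial_cost > limits.max_budget_usd
        over_time = elapsed_hours + limits.hours_per_trial > limits.max_duration_hours
        if over_budget or over_time:
            break

        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(params)

        spent_usd += trial_cost
        elapsed_hours += limits.hours_per_trial
        trials += 1
        if score > best_score:
            best_params, best_score = params, score

    return {
        "best_params": best_params,
        "best_score": best_score,
        "trials": trials,
        "spent_usd": spent_usd,
        "elapsed_hours": elapsed_hours,
    }


def toy_objective(params):
    """Stand-in for a real training run: peaks at learning_rate = 0.01."""
    return -(params["learning_rate"] - 0.01) ** 2


if __name__ == "__main__":
    space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
    limits = Limits(max_budget_usd=1.0, max_duration_hours=10.0,
                    hourly_rate_usd=0.5, hours_per_trial=0.5)
    print(tune(toy_objective, space, limits))
```

In a real session the per-trial time and Spot price would come from measurements rather than fixed estimates, and a trial on a Spot Instance can be interrupted, so a production setup would also need checkpointing and retry logic that this sketch leaves out.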
OptScale provided valuable recommendations for the cloud capacity the team used for ML training.
OptScale's detailed insights into resource utilization enabled the team to monitor and adjust resource allocation effectively for ML training sessions, leading to substantial cost savings and improved ML operations.
With OptScale, The Company significantly boosted its MLOps efficiency, notably in ML model training, performance and experiment tracking, and hyperparameter tuning. In addition, they achieved a 27% reduction in cloud costs in the first quarter of usage. This enhancement empowered the team to concentrate more on innovation and delivering high-quality ML solutions for their customers, resulting in increased productivity and client