The Company, a mobile advertising broker with more than 800 employees providing a leading mobile advertising platform, runs a complex IT infrastructure with high cloud costs: hundreds of ML engineers launch a significant number of ML processes. Leveraging AWS for hundreds of ML models, the company spent over $80M annually on its cloud environment.
OptScale helped reduce AWS cloud costs by 37% in four months by optimizing ML/AI workload performance, organizing experiment tracking, improving ML teams' KPIs, and delivering transparency into the company's cloud usage and costs.
The Company aimed to strengthen its ML operations by implementing MLOps and FinOps methodologies, gaining complete transparency into the ML model training process through a leaderboard and experiment tracking, and optimizing ML experiment performance and cost.
Running hundreds of ML experiments daily, ML teams faced the following challenges:
ML/AI model training is a complex process whose outcome depends on the chosen hyperparameter set and on the hardware or cloud resources used. Monitoring and comparing key metrics and indicators against established benchmarks or thresholds yields deeper insight and improves the ML/AI profiling process.
Without sufficient transparency into the ML process, the company struggled to identify bottlenecks in ML model training and to select the optimal configuration of cloud resources. This lack of visibility hindered its ability to maximize ML/AI training resource utilization and experiment outcomes, and to accurately plan and forecast resource requirements, leading to over- or under-provisioning of cloud resources.
ML models often require significant, complex cloud infrastructure for training and inference. Inefficient ML model and experiment management led to increased resource costs and longer processing times due to bottlenecks in specific resources such as GPU, I/O, CPU, or RAM. Without proper monitoring, the company struggled to identify bottlenecks, performance issues, and areas for improvement.
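The threshold-based monitoring described above can be sketched as follows. This is an illustrative example only: the metric names, sample format, and thresholds are hypothetical assumptions, not OptScale's actual API.

```python
# Illustrative sketch of threshold-based bottleneck detection: compare
# average resource utilization during a training run against limits,
# and flag the resources (GPU, CPU, I/O, RAM) that exceed them.

def flag_bottlenecks(samples, thresholds):
    """Return metrics whose average utilization meets or exceeds its threshold.

    samples: list of dicts, one per sampling interval,
             e.g. {"gpu": 97.0, "cpu": 35.0, "io": 10.0, "ram": 60.0}
    thresholds: dict mapping metric name -> max acceptable average (%)
    """
    flagged = {}
    for metric, limit in thresholds.items():
        avg = sum(s[metric] for s in samples) / len(samples)
        if avg >= limit:
            flagged[metric] = round(avg, 1)
    return flagged


# Made-up samples where the GPU is the bottleneck during training.
samples = [
    {"gpu": 97.0, "cpu": 35.0, "io": 10.0, "ram": 60.0},
    {"gpu": 99.0, "cpu": 40.0, "io": 12.0, "ram": 62.0},
]
thresholds = {"gpu": 90.0, "cpu": 85.0, "io": 80.0, "ram": 85.0}
print(flag_bottlenecks(samples, thresholds))  # -> {'gpu': 98.0}
```

In practice, a profiler would feed these samples from live hardware counters; the point here is only the compare-against-thresholds step that surfaces which resource constrains the experiment.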
OptScale allowed the company to run ML experiments with optimal performance, reduce infrastructure costs, and improve its KPIs (key performance indicators).
Using Hystax OptScale, the ML team multiplied the number of ML/AI experiments running in parallel, maximized training resource utilization and experiment outcomes, reduced model training time, and minimized cloud costs. The solution enabled ML/AI engineers to run automated experiments based on datasets and hyperparameter conditions within a defined infrastructure budget.
OptScale enabled ML teams to manage the lifecycle of models and experiment results through simplified cloud management and enhanced user experience.
To run ML/AI or any other workload with optimal performance and infrastructure cost using OptScale, a FinOps and MLOps open-source platform, contact us today.