Kubernetes, the open-source container orchestration platform, dominates the world of containerized applications, holding by far the largest market share. And there are good reasons for that. Kubernetes drastically extends the capabilities of containerization software such as Docker, simplifying the management of deployment, network routing, resource utilization, load balancing, the resiliency of running applications and much more.
However, Kubernetes will not work effectively on its own without proper preparation and additional configuration: no newly created cluster performs optimally by default. There are always subtle difficulties and nuances in implementing and operating Kubernetes, and suboptimal use of its advantages ultimately leads to wasted money. IT teams therefore need enough experience, methods and tools to identify misconfigurations and bottlenecks. At the same time, there is a global shortage of Kubernetes expertise on the market, because the popularity of K8s is currently outpacing the level of knowledge about it among technical specialists.
Top Kubernetes performance issues
Based on the research conducted by Circonus, the top four Kubernetes performance issues are:
- resource contention for clusters/nodes/pods,
- deployment problems,
- auto-scaling challenges,
- crash loops and job failures.
This comes as no surprise, as these issues largely stem from the peculiarities of the technology and from a lack of expertise and experience with the platform.
At the heart of Kubernetes is a scheduler that places containers on nodes. Simply put, it is like packing boxes of different sizes with items of different sizes and shapes. From that point of view, the scheduler needs to know the exact capacity of each node as well as the size of each container being placed on it. Failure to provide this information results in over-provisioned nodes and serious performance problems.
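The box-packing analogy can be made concrete with a toy first-fit sketch. This is not the real kube-scheduler algorithm (which weighs many more factors, such as affinity and taints); it only illustrates why the scheduler needs accurate CPU and memory requests for every pod. All names below are illustrative.

```python
def place(pods, nodes):
    """Assign each pod to the first node with enough free CPU and memory.

    pods:  {pod_name: {"cpu": millicores, "mem": MiB}} resource requests
    nodes: {node_name: {"cpu": millicores, "mem": MiB}} allocatable capacity
    """
    placement = {}
    # Work on a copy so the caller's capacity dicts are not mutated
    free = {name: dict(cap) for name, cap in nodes.items()}
    for pod, req in pods.items():
        for node, cap in free.items():
            if cap["cpu"] >= req["cpu"] and cap["mem"] >= req["mem"]:
                cap["cpu"] -= req["cpu"]
                cap["mem"] -= req["mem"]
                placement[pod] = node
                break
        else:
            placement[pod] = None  # unschedulable: no node has room left
    return placement

nodes = {"node-a": {"cpu": 2000, "mem": 4096}, "node-b": {"cpu": 1000, "mem": 2048}}
pods = {"web": {"cpu": 1500, "mem": 1024}, "db": {"cpu": 800, "mem": 2048}}
print(place(pods, nodes))
```

If a pod under-declares its requests, the sketch happily over-packs a node, which is exactly the over-provisioning failure mode described above.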
How to address Kubernetes performance issues
Monitoring Kubernetes metrics
The most efficient, and at the same time most challenging, way to tackle K8s performance issues is to increase the observability of the platform: understanding which of the collected metrics to keep an eye on is what lets you identify the root cause of a given issue. Kubernetes exposes numerous metrics, and the majority of them are an important source of insight into the platform regardless of how you actually run it.
Open-source monitoring systems like Prometheus can be a great help in observing your cluster and, ultimately, its costs. With the help of a standalone exporter program, node metrics can be translated into the appropriate format and sent to the Prometheus server. By installing such an exporter on every node of your cluster, you gain access to dozens of metric categories, the most important of which relate to CPU, disk, memory and network usage.
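As a hedged illustration, the snippet below collects a few PromQL queries over standard node-level exporter metrics in the four categories mentioned above, and builds a Prometheus HTTP API URL for them. Metric names follow common node_exporter conventions but can vary with exporter version and configuration; the `prometheus:9090` address is an assumption.

```python
from urllib.parse import urlencode

# Example PromQL over typical node_exporter metric names (may vary by setup)
QUERIES = {
    # Fraction of CPU time not spent idle, averaged per node
    "cpu_utilization": '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    # Memory actually available to applications
    "memory_available_bytes": "node_memory_MemAvailable_bytes",
    # Free space left on mounted filesystems
    "disk_available_bytes": "node_filesystem_avail_bytes",
    # Inbound network throughput per interface
    "network_receive_bps": "rate(node_network_receive_bytes_total[5m])",
}

def instant_query_url(base_url, name):
    """Build a Prometheus instant-query URL for one of the queries above."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": QUERIES[name]})

print(instant_query_url("http://prometheus:9090", "cpu_utilization"))
```

A dashboard tool like Grafana would typically issue such queries for you; the point here is simply which node-level signals are worth asking for.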
Even after narrowing the studied metrics to these four categories, it is still difficult to tell which indicators matter most. Since Kubernetes is a complex system, the way forward is to simplify through abstractions over the categories of interest. These abstractions will later help us analyze not only node metrics, but all Kubernetes metrics in general.
The most common methods for simplifying abstractions are:
- The USE Method, introduced in 2012 by Brendan Gregg, which targets the resources in your system:
Utilization — the average time that the resource was busy servicing work.
Saturation — the degree to which the resource has extra work which it can’t service, often queued.
Errors — the count of error events.
- The RED Method (2015), which defines the three key metrics you should measure for every microservice in your architecture:
(Request) Rate – the number of requests served.
(Request) Errors – the number of failed requests.
(Request) Duration – distributions of the amount of time each request takes.
- The Four Golden Signals (described in the Site Reliability Engineering book by Google), which are to some extent a fusion of the above methods:
Latency — the time it takes to service a request.
Traffic — a measure of how much demand is placed on your system, measured in a high-level system-specific metric.
Errors — the rate of requests that fail, either explicitly, implicitly, or by policy.
Saturation — how “full” your service is.
It turns out that it is not enough to have extensive information about the resources on the nodes in the Kubernetes cluster; it is just as important to be able to analyze it. For example, analyzing resources (such as CPU, disk, memory, and network) through the lens of utilization, saturation and errors (the USE method) can give us an understanding of how resources are being spent and allow us to further optimize and scale their use.
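As a minimal sketch of that analysis step, the function below summarizes one resource with the three USE figures, assuming we have already collected its busy time, sampled queue lengths, and error count (all inputs here are made-up example values):

```python
def use_summary(busy_seconds, window_seconds, queue_lengths, error_count):
    """Summarize a resource with Utilization, Saturation and Errors (USE)."""
    return {
        # Utilization: fraction of the observation window the resource was busy
        "utilization": busy_seconds / window_seconds,
        # Saturation: average amount of extra work queued for the resource
        "saturation": sum(queue_lengths) / len(queue_lengths),
        # Errors: count of error events in the window
        "errors": error_count,
    }

# Example: a disk busy 42s of a 60s window, with sampled request queue lengths
summary = use_summary(42.0, 60.0, [0, 2, 1, 3], error_count=0)
print(summary)
```

High utilization with rising saturation is the classic sign that a resource needs to be scaled up or have work moved off it.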
Once your IT team figures out which resources are underutilized and which are overutilized, they will be able to define the optimal storage limits, the optimal CPU and memory size for cluster nodes, and the optimal node pool for each workload, which in turn will allow them to analyze Kubernetes costs and performance.
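One simple, hedged way to turn utilization data into a sizing decision is to base a CPU request on a high percentile of observed usage plus some headroom, rather than on the peak or the average. The percentile and headroom values below are illustrative assumptions, not recommendations:

```python
def recommend_cpu_request(samples_millicores, percentile=0.95, headroom=1.2):
    """Suggest a CPU request (millicores) from historical usage samples.

    Takes the given percentile of observed usage and adds a safety margin,
    so a single short spike does not inflate the request.
    """
    ordered = sorted(samples_millicores)
    idx = int(percentile * (len(ordered) - 1))  # nearest-rank percentile index
    return int(ordered[idx] * headroom)

# Ten sampled usage readings in millicores, including one spike to 900m
usage = [120, 150, 180, 200, 210, 230, 250, 260, 300, 900]
print(recommend_cpu_request(usage))
```

The same percentile-plus-headroom idea applies to memory limits and node sizing; what changes is only which metric feeds the calculation.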
Sticking to best practices
Regardless of how successfully you monitor and analyze your Kubernetes resource usage, there are a number of best practices to follow to help you get the most out of the platform.
- Optimize your environment for Kubernetes.
Keep in mind that container tooling was originally designed for loosely coupled architectures of stateless applications that process data but do not store it. It is therefore a mistake to deploy stateful, data-storing applications, or to let monolithic applications run on Kubernetes, without first adapting their architecture.
- Use Kubernetes only when it’s necessary.
When moving to Kubernetes, remember that it can make more sense to keep databases and some applications in a virtual machine; migrating for the sake of migrating can seriously hurt performance.
- Have specialists who know how to work with Kubernetes.
Working with Kubernetes requires system administrators with hands-on experience with the platform, as successfully maintaining this ecosystem of components requires a high level of expertise.
- Adapt IT processes for Kubernetes implementation.
Kubernetes is fundamentally changing the distribution of roles and responsibilities within an IT team. Now, the proper implementation requires a shift to DevOps processes, and developers should accept this methodology and its tools.
Under DevOps, system administrators maintain the infrastructure, while developers support the application from planning and coding through launch, monitoring, and production. Developers can no longer ignore the infrastructure; they must understand how their code behaves in the context of all these new processes.
- Leverage additional tools that extend Kubernetes rather than relying solely on its out-of-the-box functionality.
Earlier we discussed Prometheus-based metrics monitoring in detail, but that is far from all the functionality that additional services can provide. Various tools let you optimize application data storage (Ceph, GlusterFS), log collection and storage (Fluentd, Elasticsearch, Loki), autoscaling (Metrics Server, Prometheus Adapter), security settings (Dex, Keycloak, Open Policy Agent) and much more.
Our product, OptScale, can help you correctly define a set of features and configurations, such as VM rightsizing, machine types, region selection, CI/CD job resource reflavoring, or pod affinity groups, to achieve better performance.