Navigating cloud downtime: steps to take when services are unavailable

Picture yourself as a cloud engineer, the mighty guardian of a company’s website, bravely fighting off digital dragons and pesky bugs. Just as you settle into a delightful Monday night meal, an urgent alert interrupts your feast: disaster strikes! The website has taken an unexpected nap. Oh, the horror! But fear not, brave warrior of the cloud, for you shall conquer this mischievous downtime with your wit and expertise. It seems the miscreant responsible is none other than the cloud provider’s authentication mechanism, playing hide-and-seek with your website’s accessibility. It’s time to don your virtual cape, summon your troubleshooting skills, and bring back the website’s online glory. Join us on this epic adventure as we unravel the secrets to defeating cloud downtime and restoring peace and laughter to the digital realm. Get ready to slay those technical gremlins and enjoy a hearty serving of victory!

Actions to follow when cloud services experience downtime

Swift investigation:

Upon receiving the alert, promptly shift into investigation mode. Conduct a thorough assessment to determine the cause and extent of the outage. Verify that the issue lies with the cloud provider and not within your infrastructure.

Understand common causes:

Cloud services can experience outages due to various factors. Software or configuration errors are a leading cause, as reported by the Uptime Institute. Other culprits include networking or connectivity issues and mechanical or electrical failures at data centers.

Address software and configuration errors:

Cloud downtime resulting from software or configuration errors can stem from flawed deployment packages or application misconfigurations. Learn from past incidents, such as the Slack outage in the winter of 2022, where a configuration change in a database triggered a widespread service disruption.

Tackle networking and connectivity issues:

Smooth cloud operations rely heavily on dependable networking and connectivity. Configuration problems, change management issues, and errors from third-party network providers are common culprits in this category. Take note of previous incidents, like the January 2022 outage on Google Cloud caused by a configuration error leading to increased latency.

Prepare for mechanical and electrical failures:

Mechanical or electrical failures, such as uninterruptible power supply (UPS) or utility/generator failures, can bring cloud services to a halt. Refer to past incidents, such as the AWS outage in July 2022, where a power outage in an availability zone resulted in widespread disruption.

  • Cloud downtime creates stress and anxiety for end-users, highlighting the need to minimize its impact.
  • Minimizing downtime is crucial to mitigate potential data loss, protect reputation, and prevent financial losses.
  • According to the Ponemon Institute, the average cost of an outage per minute is approximately $9,000.
  • Research from the Uptime Institute indicates that over half of the surveyed organizations experienced outage costs exceeding $100,000.
  • By following recommended steps and staying prepared, businesses can effectively navigate the challenges posed by cloud downtime.
  • Taking proactive measures helps reduce downtime’s adverse effects on operations and customer experience.
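
As a rough illustration of the per-minute figure quoted above, the cost of an outage can be estimated by multiplying its duration by a per-minute rate. This is only a minimal sketch: the function name is hypothetical, and the $9,000 default is simply the Ponemon average cited above, not a figure for any particular business.

```python
def estimate_outage_cost(minutes_down: float, cost_per_minute: float = 9_000.0) -> float:
    """Return a rough outage cost: duration multiplied by a per-minute rate."""
    if minutes_down < 0:
        raise ValueError("minutes_down must be non-negative")
    return minutes_down * cost_per_minute


# A 12-minute outage at the average rate quoted above.
print(estimate_outage_cost(12))  # 108000.0
```

At that rate, an outage of just over 11 minutes already passes the $100,000 mark mentioned above. Replace the default with your own revenue and productivity figures for a more realistic estimate.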

Mastering cloud downtime: 5 steps to navigate the storm

Step 1: assess the situation before the outage

Before an outage occurs, evaluate the benefits and challenges of implementing a multi-cloud strategy. Determine whether it fits your environment, architecture, and teams, as it can offer increased redundancy and protection against service disruptions.

Step 2: prepare for the worst: backup essential data

One vital precaution before an outage is to back up your essential data. This proactive measure ensures that your critical information stays protected even during an outage.

Depending on your cloud provider, various backup solutions are available to secure your data. For instance, Azure offers Azure Backup, a comprehensive solution capable of backing up data on VMs, SQL servers, Azure Blobs, and more. On the other hand, Google Cloud provides Google Cloud Backup and Disaster Recovery (DR) services, which offer data backup capabilities for GKE, VMs, and other crucial components. You establish a resilient safety net by diligently backing up your essential data beforehand. In the unfortunate event of data loss during an outage or if the outage persists for an extended period, you can rely on these backups to restore your information. This proactive approach minimizes the potential impact on your operations and enables a smoother recovery process.

Step 3: investigate locally: check for user errors

After experiencing an outage, the next step is to determine whether the issue lies solely within your environment or if it is more widespread. Several handy tools and resources are available to help you with this assessment.

To begin, you can visit Down Detector to input the website’s URL and check if other users are also reporting errors. This platform provides valuable insights into any potential widespread outages. Additionally, Down Detector often includes convenient links to the website’s support page and their social media accounts on platforms like Twitter or Facebook.

Another helpful tool for ruling out local connectivity issues and quickly verifying if a website is down is IsItDownRightNow.com. This website will not only inform you about the availability of the site you are checking but also provide information on the site’s response time.

If these tools do not reveal any issues and you want to verify the status of your cloud provider, consult the provider’s dedicated status page. For example, if you use Google Cloud, you can visit its status page to check for ongoing service issues or degradation. These status pages often offer updates on the situation, an estimated time until resolution, and details about the steps being taken to address the problem.

If your own internet connection is completely down or there is a power outage, consider visiting a local coffee shop or any other place with accessible Wi-Fi to check whether the cloud provider is experiencing an outage. Once you have confirmed that there are no local issues, you can proceed to the next step in our list of actions.
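
The checks described in this step can also be scripted. The sketch below is one possible way to do it, not a prescribed tool: `probe` and `locate_problem` are hypothetical names, and the idea of first probing a known-good reference site to rule out local connectivity problems is an assumption added for illustration. It uses only the Python standard library.

```python
import time
import urllib.request


def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen):
    """Send one HTTP GET and return (reachable, seconds_elapsed)."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            reachable = 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        reachable = False
    return reachable, time.monotonic() - started


def locate_problem(reference_ok: bool, site_ok: bool, provider_reports_incident: bool) -> str:
    """Turn three yes/no checks into a rough guess at where the fault lies."""
    if not reference_ok:
        return "local"  # even a known-good site is unreachable: check your own network
    if site_ok:
        return "none"  # the site answers normally
    if provider_reports_incident:
        return "provider"  # the status page confirms an incident
    return "own-infrastructure"  # site is down, provider looks healthy


# Example with the three answers already collected.
print(locate_problem(reference_ok=True, site_ok=False, provider_reports_incident=True))  # provider
```

In practice, `reference_ok` and `site_ok` would come from calling `probe` on a well-known site and on your own site, while `provider_reports_incident` comes from reading the provider’s status page.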

Step 4: seek support: contact your cloud provider

During a cloud outage, it is important to promptly contact your cloud provider to gather more information and report the issue. After ruling out any local connectivity issues, reaching the provider becomes vital in resolving the problem. When contacting the provider, be prepared to provide specific details about the situation, including the affected services, the error messages encountered, and the time the issue started. Each cloud provider has different contact methods, such as using the Azure Portal or tweeting Azure Support for Microsoft Azure, utilizing the support page for Google Cloud, or referring to the provider’s website or support site if using a different cloud service. It is crucial to exercise patience during this process, as cloud providers’ support teams work diligently to assist customers and address queries amidst an outage. Engaging with the cloud provider increases the chances of obtaining timely assistance and resolving the downtime.
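
To avoid scrambling for details while the outage is in progress, the information listed above can be captured in a small structure and rendered as plain text for a support ticket. This is a minimal sketch: the `OutageReport` class, its field names, and the sample values are hypothetical and not tied to any provider’s support form.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OutageReport:
    """Details a support team typically asks for during an outage."""
    affected_services: list[str]
    error_messages: list[str]
    started_at: datetime  # expected to be timezone-aware

    def to_text(self) -> str:
        """Render the report as plain text, with the start time in UTC."""
        started_utc = self.started_at.astimezone(timezone.utc)
        lines = [
            f"Issue started (UTC): {started_utc:%Y-%m-%d %H:%M}",
            "Affected services: " + ", ".join(self.affected_services),
            "Error messages:",
        ]
        lines += [f"  - {message}" for message in self.error_messages]
        return "\n".join(lines)


# Illustrative values only.
report = OutageReport(
    affected_services=["API Gateway"],
    error_messages=["HTTP 503 Service Unavailable"],
    started_at=datetime(2024, 1, 8, 19, 30, tzinfo=timezone.utc),
)
print(report.to_text())
```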

Step 5: understand your rights: review your cloud service agreement

Another crucial step in handling cloud downtime is thoroughly reviewing your provider’s cloud service agreement. This agreement holds vital information regarding the provider’s obligations and your rights as a customer.

First and foremost, examining the service level agreements (SLAs) outlined in the agreement is essential. An SLA serves as a commitment from the provider to ensure a certain level of availability for their services. For instance, if you are utilizing AWS and your API gateway service is affected by the outage, AWS offers three SLA tiers specifically for the API gateway service. Depending on the amount of downtime experienced by the service within a given month, you may be eligible for a partial or even a full service credit.

To illustrate, let’s consider a scenario where the API gateway service was down for three hours earlier in the month. In a 30-day month (720 hours), that works out to approximately 99.58% uptime. According to AWS’s SLA, you would be entitled to a 10% service credit as compensation for the downtime. Hence, it is crucial to thoroughly review and familiarize yourself with the specifics of your cloud service agreements to ensure you understand the guarantees and remedies available to you as a customer.
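
The arithmetic behind that example can be sketched in a few lines. Two caveats apply. First, the tier boundaries below are an assumption modeled on the three-tier structure described above; the only point taken from this article is that 99.58% uptime earns a 10% credit, so check the current agreement for the real thresholds. Second, providers define for themselves how uptime is measured, so counting hours of downtime against a 30-day month is a simplification. The function names are hypothetical.

```python
def monthly_uptime_percent(downtime_hours: float, days_in_month: int = 30) -> float:
    """Uptime as a percentage of all hours in the month."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


# ASSUMPTION: illustrative tiers as (minimum uptime %, credit %).
# Verify the real thresholds in your provider's current SLA.
CREDIT_TIERS = [(99.95, 0), (99.0, 10), (95.0, 25)]


def service_credit_percent(uptime_percent: float) -> int:
    """Return the credit for the first tier whose minimum uptime is met."""
    for minimum_uptime, credit in CREDIT_TIERS:
        if uptime_percent >= minimum_uptime:
            return credit
    return 100  # below every tier: full credit


uptime = monthly_uptime_percent(downtime_hours=3)
print(round(uptime, 2))                # 99.58
print(service_credit_percent(uptime))  # 10
```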

Embrace multi-cloud resilience: safeguard your data and ensure continuous operations

Cloud outages can be highly frustrating, especially for those relying heavily on cloud services for daily activities or business operations. While it’s essential to be prepared by following the steps and resources provided in this article, it’s vital to acknowledge that cloud outages can occur unexpectedly and at any time.

Consider deploying your application or service across multiple regions to safeguard your business from potential outages. This can be achieved through an active-active approach, where your application is simultaneously active in several regions, or an active-passive setup, where you can seamlessly switch to another region when an issue arises.
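
The difference between the two approaches shows up in how traffic is routed once region health is known. The following is a simplified sketch under stated assumptions: the function names and region names are hypothetical, health is reduced to a single true/false flag per region, and a real setup would rely on the provider’s load balancing or DNS failover features rather than hand-written code.

```python
from typing import Optional


def choose_active_passive(primary: str, standby: str, healthy: dict[str, bool]) -> Optional[str]:
    """Active-passive: serve from the primary, fall back to the standby only if needed."""
    if healthy.get(primary, False):
        return primary
    if healthy.get(standby, False):
        return standby
    return None  # neither region can serve traffic


def choose_active_active(regions: list[str], healthy: dict[str, bool]) -> list[str]:
    """Active-active: spread traffic across every region that is currently healthy."""
    return [region for region in regions if healthy.get(region, False)]


# Hypothetical health snapshot: the primary region is down.
health = {"region-a": False, "region-b": True}
print(choose_active_passive("region-a", "region-b", health))      # region-b
print(choose_active_active(["region-a", "region-b"], health))     # ['region-b']
```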

In addition to regional redundancy, developing a multi-cloud strategy can further protect your data and mitigate the risk of downtime. Utilizing multiple cloud providers lets you distribute your workload and data across different platforms. However, having the appropriate personnel and processes is crucial to execute and manage this strategy effectively. It is recommended to thoroughly review the advantages and disadvantages of adopting a multi-cloud approach to ensure it aligns with your business requirements.

In summary

To enhance your resilience against cloud outages:

  • Implement an architecture that allows your application or services to run from multiple regions in an active-active or active-passive style.
  • Consider developing a multi-cloud strategy to distribute your workload and data across cloud providers.
  • Ensure you have the expertise and processes to execute and manage a multi-cloud environment effectively.
  • Evaluate the pros and cons of going multi-cloud to determine if it aligns with your business needs.