A massive Google Cloud outage on Thursday, which affected numerous Google services and major platforms like Spotify and Discord, was caused by an invalid automated quota update within Google’s API management system. The three-hour disruption highlights the critical dependency many online services have on underlying cloud infrastructure. This report details the cause, the impact, and the responses from Google and affected partners like Cloudflare.
The Thursday Outage: Timeline and Impact
The disruption began around 10:49 ET and lasted until 3:49 ET, affecting millions of users worldwide. Beyond Google Cloud itself, Gmail, Google Calendar, Google Drive, and Google Meet experienced issues. The ripple effect extended to third-party services that rely on Google Cloud infrastructure, including Spotify, Discord, Snapchat, and the Cloudflare services that use the Workers KV store.
Unpacking the Technical Root Cause
Google has provided an initial analysis, attributing the incident to an “invalid automated quota update” that was distributed globally within its API management platform. This faulty update caused external API requests to be rejected, resulting in widespread 503 errors. The company stated that ineffective testing and error-handling systems contributed to the delay in identifying and fixing the issue.
To resolve the problem, Google bypassed the problematic quota check, restoring most regions within two hours. However, the us-central1 region faced a much longer recovery time due to an overloaded quota policy database. Some services experienced lingering effects, such as backlogs, for up to an hour or more after the primary issue was resolved.
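The failure mode and the fix described above can be illustrated with a minimal sketch. It is not Google's implementation: the `QuotaPolicy` record, its field names, and the `bypass_quota_check` flag are all hypothetical, chosen only to show how validating a policy before enforcing it, and failing open when it is invalid, keeps one bad update from rejecting every request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; the field names are illustrative only."""
    project: Optional[str]
    limit_per_minute: Optional[int]


def is_valid(policy: QuotaPolicy) -> bool:
    """Reject policies with blank or nonsensical fields before enforcing them."""
    return (
        bool(policy.project)
        and isinstance(policy.limit_per_minute, int)
        and policy.limit_per_minute > 0
    )


def check_request(policy: QuotaPolicy, used_this_minute: int,
                  bypass_quota_check: bool = False) -> int:
    """Return an HTTP status code for one incoming API request."""
    if bypass_quota_check:
        # Emergency switch (assumed): skip quota enforcement entirely.
        return 200
    if not is_valid(policy):
        # Fail open: an invalid policy should not turn into a blanket 503.
        return 200
    if used_this_minute >= policy.limit_per_minute:
        # Normal enforcement: the caller is over its limit.
        return 429
    return 200


# A well-formed policy is enforced; a malformed one is ignored.
print(check_request(QuotaPolicy("demo-project", 10), used_this_minute=10))
print(check_request(QuotaPolicy(None, None), used_this_minute=0))
```

Failing open trades strict quota enforcement for availability, which is a judgment call: it suits checks whose failure would otherwise block all traffic, and is less appropriate for checks that guard security boundaries.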
Google’s Response and Commitment
Google issued a public apology, stating, “We are deeply sorry for the impact to all of our users and their customers that this service disruption/outage caused. Businesses large and small trust Google Cloud with your workloads and we will do better.” The company is working on a full incident report to detail the event and preventive measures.
Cloudflare’s Dependency and Future Plans
Cloudflare also addressed the outage, clarifying that its issues were not security-related and that no data was lost. Its services that rely on the Workers KV key-value store were affected because part of the store's underlying storage infrastructure is backed by a third-party cloud provider, which a Cloudflare spokesperson confirmed to be Google Cloud.
The outage underscored Cloudflare’s dependency on external cloud infrastructure for critical services like Workers KV, which is essential for configuration, authentication, and asset delivery across many of its products.
Figure: Cloudflare Workers KV error rate chart
In response, Cloudflare announced plans to migrate KV’s central store to its own R2 object storage. This strategic move aims to reduce external dependencies and enhance the resilience of its services against similar cloud provider outages in the future.
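The general idea of not depending on a single storage provider can be sketched as a read path that tries backends in order of preference. This is not Cloudflare's design, which the company describes only as a migration to R2; the `BackendUnavailable` exception and the backend functions here are hypothetical stand-ins for real storage clients.

```python
from typing import Callable, Optional, Sequence


class BackendUnavailable(Exception):
    """Raised by a (hypothetical) storage backend that cannot serve a read."""


Backend = Callable[[str], Optional[str]]


def read_with_fallback(key: str, backends: Sequence[Backend]) -> Optional[str]:
    """Try each backend in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for fetch in backends:
        try:
            return fetch(key)
        except BackendUnavailable as exc:
            # Remember the failure and move on to the next backend.
            last_error = exc
    raise BackendUnavailable("all storage backends failed") from last_error


def external_store(key: str) -> Optional[str]:
    # Simulates a third-party provider during an outage.
    raise BackendUnavailable("external provider is returning errors")


def own_store(key: str) -> Optional[str]:
    # Simulates self-managed storage holding a replica of the data.
    return {"config/site": "v42"}.get(key)


print(read_with_fallback("config/site", [external_store, own_store]))
```

A fallback only helps if the second backend actually holds current data, so the harder engineering problem is replication and consistency rather than the read path shown here.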
Implications and What’s Next
This incident serves as a stark reminder of the interconnectedness of the digital world and the potential cascading effects of outages in core infrastructure like cloud computing. For businesses relying heavily on cloud services, such events highlight the importance of understanding dependencies, having robust disaster recovery plans, and potentially diversifying critical infrastructure.
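On the client side, one small and common piece of such planning is retrying 503 responses with exponential backoff, so that callers do not pile extra load onto a service that is already struggling. The sketch below is illustrative only: `send` is a hypothetical callable that performs one request and returns its HTTP status code, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable


def call_with_backoff(send: Callable[[], int],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it stops returning 503 or attempts run out."""
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Double the delay each attempt and add random jitter so that
            # many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return status


# Simulated service: two 503 responses, then success.
responses = iter([503, 503, 200])
print(call_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

Backoff does not make an hours-long outage invisible to users. It mainly limits the retry traffic that can slow recovery, and it lets short blips pass unnoticed.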
Google will likely release a detailed post-mortem with specific steps to prevent recurrence, focusing on improving testing, error handling, and global rollout procedures for system updates. Other affected companies are likely to reevaluate their reliance on a single cloud provider and may accelerate plans for multi-cloud strategies or move critical components to self-managed infrastructure, as Cloudflare is doing.
