Cloudflare Co-Founder and CEO Matthew Prince clarified that the outage was caused by an internal system issue.
Photo Credit: Reuters
Cloudflare called it the company’s worst outage since 2019
On Tuesday, Cloudflare suffered a system crash that triggered a massive outage affecting major websites such as X (formerly known as Twitter), ChatGPT, and Canva. The outage lasted nearly five hours, during which visitors saw only an HTTP 500 error, indicating an internal server error. Matthew Prince, the company's Co-Founder and CEO, has now shared a detailed breakdown of what caused the outage. Notably, it was not caused by a cyberattack or any other external malicious activity; it was triggered by an internal system flaw.
In a blog post, Prince called the incident “Cloudflare's worst outage since 2019.” Apologising “for the pain we caused the Internet today”, the CEO also detailed what triggered the outage.
The root cause was a permissions change in one of the company's database systems. This change caused a “feature file” used by its Bot Management system to double in size, exceeding the limit the software expected. The oversized file was distributed across the network and triggered failures in the proxy software reading it.
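To illustrate the failure mode, here is a minimal Python sketch of a loader that rejects an oversized feature file and keeps the last known good configuration instead of crashing. It is not Cloudflare's code: the limit of 200, the one-feature-per-line file format, and the names `parse_feature_file` and `load_or_keep` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical limit for illustration; the article does not state the real value.
MAX_FEATURES = 200


@dataclass
class FeatureConfig:
    features: list[str]


class FeatureFileTooLarge(Exception):
    """Raised when a candidate feature file exceeds the expected limit."""


def parse_feature_file(lines: list[str], max_features: int = MAX_FEATURES) -> FeatureConfig:
    # Assumed format: one feature name per non-empty line.
    names = [line.strip() for line in lines if line.strip()]
    if len(names) > max_features:
        raise FeatureFileTooLarge(
            f"{len(names)} features exceeds limit of {max_features}"
        )
    return FeatureConfig(features=names)


def load_or_keep(
    current: FeatureConfig,
    candidate_lines: list[str],
    max_features: int = MAX_FEATURES,
) -> FeatureConfig:
    """Adopt the candidate file if it is valid; otherwise keep the current config."""
    try:
        return parse_feature_file(candidate_lines, max_features)
    except FeatureFileTooLarge:
        # Reject the bad file and keep serving with the last known good version.
        return current


good_file = [f"feature_{i}" for i in range(120)]
doubled_file = good_file * 2  # 240 entries, over the assumed limit

config = load_or_keep(FeatureConfig(features=[]), good_file)
config_after_bad = load_or_keep(config, doubled_file)
```

In this sketch the doubled file is refused and `config_after_bad` is the same object as `config`, so the consumer keeps running on the previous file rather than failing when the limit is exceeded.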
The faulty feature file originated from a query running on a ClickHouse database cluster. Every five minutes, the file was regenerated, and because some cluster nodes had been updated while others had not, there was intermittent propagation of a “bad” version of the file. This dynamic caused the network to alternate between periods of failure and recovery, complicating the diagnosis.
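The alternating pattern can be reproduced with a small Python sketch. It assumes, purely for illustration, that each five-minute regeneration runs on the next node in round-robin order and that a node produces a bad file only if it has already received the change; the article does not describe how nodes were actually selected, and `simulate_regeneration` is a hypothetical name.

```python
from itertools import cycle, islice


def simulate_regeneration(node_updated: list[bool], cycles: int) -> list[str]:
    """Return 'good' or 'bad' for each five-minute regeneration cycle.

    node_updated[i] is True if node i has already received the change.
    Round-robin node selection is an assumption made for this sketch.
    """
    picks = islice(cycle(node_updated), cycles)
    return ["bad" if updated else "good" for updated in picks]


# Partially rolled-out cluster: the file flips between good and bad.
partial_rollout = simulate_regeneration([False, True, False], 5)

# Fully rolled-out cluster: every regeneration produces a bad file.
full_rollout = simulate_regeneration([True, True, True], 3)
```

With a partially updated cluster the output flips between good and bad files, which mirrors the failure-and-recovery cycles described above. Once every node is updated, every regeneration is bad and the failure becomes steady.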
Initial diagnostics by the Cloudflare team pointed to a potential large-scale Distributed Denial of Service (DDoS) attack due to the pattern of failures. However, once the investigation revealed the configuration-file issue, the team halted propagation of the bad file, rolled back to a known good version, and restarted the core proxy services. Core traffic returned to normal by about 14:30 UTC (8:00pm IST), and full recovery was reported at 17:06 UTC (10:36pm IST).
Several Cloudflare services were impacted as part of this event. The core content delivery network (CDN) and security services produced elevated HTTP 5xx errors. The Turnstile bot-challenge service failed to load. The Workers KV key-value store experienced high error levels due to front-end gateway failures. The Dashboard was mostly operational, but many users were unable to log in because Turnstile was unavailable. Email security features experienced degraded spam-detection accuracy due to a temporary loss of access to an IP-reputation source.
“An outage like today is unacceptable. We've architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we've had outages in the past, it's always led to us building new, more resilient systems,” said Prince.