Cloudflare is a service that acts as a cross between a reverse proxy and a content delivery network. It helps the websites that use it load faster for visitors anywhere in the world by routing their requests through Cloudflare’s global network. There’s a cybersecurity component to Cloudflare, too, since the service can stop malicious traffic before users experience its effects.
Brands ranging from Taco Bell to the New York state government use Cloudflare, so you can imagine the chaos that ensued when the service went down on July 2, 2019, for about a half-hour. Here are some crucial details about the outage.
Major sites stopped loading
Although the outage was short, it affected websites you probably know and may use often. For example, CoinDesk showed incorrect cryptocurrency prices during the incident. Perhaps the most ironic casualty was DownDetector, the site people commonly visit to check the status of their favorite sites. Because DownDetector itself relies on Cloudflare, it was knocked offline too and couldn’t report the disruption.
Not caused by a malicious attack
The outage wasn’t the work of attackers; it began with a routine deployment of new Web Application Firewall rules. The developers rolled out the rules in a simulated mode that logged matches without blocking customer traffic. They did so to measure the false positive rates of the rules before full-scale deployment.
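To make the idea of a "simulated mode" concrete, here is a minimal sketch in Python. The names (`FirewallRule`, `simulate`, `apply`) are invented for illustration and are not Cloudflare's actual API; the point is that a rule in simulate mode still evaluates against traffic and logs its matches, but never blocks a request.

```python
from dataclasses import dataclass
import re

@dataclass
class FirewallRule:
    """Hypothetical firewall rule; names are illustrative, not Cloudflare's."""
    name: str
    pattern: re.Pattern
    simulate: bool = True  # log-only mode, used to measure false positives

    def apply(self, request: str, log: list) -> bool:
        """Return True if the request should be blocked."""
        if self.pattern.search(request):
            log.append(f"rule {self.name} matched")
            return not self.simulate  # in simulate mode, log but never block
        return False

log = []
rule = FirewallRule("sqli-test", re.compile(r"union\s+select"), simulate=True)
blocked = rule.apply("GET /?q=union select 1", log)
print(blocked, log)  # False ['rule sqli-test matched'] — logged, not blocked
```

Note that even in simulate mode the rule's pattern still executes against every request, which is why a pathological pattern can exhaust CPUs despite never blocking anything.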
However, even running the firewall rules in simulation caused problems. That’s because one of the rules contained a regular expression (a pattern used to search text) that backtracked so heavily on certain inputs that CPU usage across Cloudflare’s servers worldwide spiked to 100%. After that, people trying to access sites that use Cloudflare started seeing 502 Bad Gateway errors.
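The failure mode here is known as catastrophic backtracking: a regex with nested quantifiers can force the engine to try an exponential number of ways to split the input when a match fails. The sketch below uses a deliberately simple toy pattern, not Cloudflare's actual rule, to show how a near-matching input makes the vulnerable form take dramatically longer than an equivalent flat pattern.

```python
import re
import time

# Toy example of catastrophic backtracking (not Cloudflare's actual rule):
# the nested quantifier in (a+)+ lets the engine partition a run of 'a's in
# exponentially many ways before concluding the match fails.
VULNERABLE = re.compile(r'(a+)+$')
SAFE = re.compile(r'a+$')  # matches the same strings, with no nesting

def time_match(pattern, text):
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# A string that almost matches: all 'a's, then a trailing 'b' that forces
# the engine to backtrack through every possible split.
bad_input = 'a' * 22 + 'b'

_, slow = time_match(VULNERABLE, bad_input)
_, fast = time_match(SAFE, bad_input)
print(f"vulnerable: {slow:.4f}s, safe: {fast:.6f}s")
```

Each extra `a` roughly doubles the vulnerable pattern's running time, which is how a single rule can pin a CPU at 100%. Engines with linear-time guarantees (such as RE2) avoid this class of failure by disallowing backtracking entirely.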
Triggered by exhausted CPUs
Cloudflare referred to the problem as “an unprecedented CPU exhaustion event,” clarifying that the company had never experienced global CPU exhaustion before. It’s worth pointing out, though, that Cloudflare normally rolls out changes of this kind progressively. In this case, the changes went out to every server worldwide at once, which is why the CPU exhaustion was global.
Cloudflare responded quickly
Once the Cloudflare team’s investigation revealed what was happening, it issued a “global kill” command that disabled all the new firewall rulesets and dropped CPU usage back to a normal level. Then, less than an hour later, the team evaluated the problematic rule, engineered and tested a fix, and reenabled the rulesets.
Cloudflare’s CEO also acknowledged that the problem was “a mistake on our part” and explained that a bug in the firewall application caused the CPU issue. He added that the public could expect a detailed blog post about what the company is doing to prevent a similar event from happening again.
A future improvement to testing processes
For now, the company’s blog acknowledges that the testing process in place before the downtime was not sufficient. It also states that the development team is working out better testing and deployment procedures.
Once more details are published, people can weigh in on whether the remedies are adequate. Moreover, the company’s long-term response could help customers decide whether to become or remain Cloudflare users.