
Cloudflare is a service that acts as a cross between a proxy server and a content delivery network. It helps the websites that use it load faster, no matter where in the world viewers are, by routing visitors’ requests through Cloudflare’s global network. There’s a cybersecurity component to Cloudflare, too, since the service can stop malicious traffic before users experience its effects.

Brands ranging from Taco Bell to the New York state government use Cloudflare, so you can imagine the chaos that ensued when the service went down on July 2, 2019, for about a half-hour. Here are some crucial details about the outage.

Major sites stopped loading

Although the outage was short, it affected websites you probably know and may use often. For example, CoinDesk showed incorrect cryptocurrency prices during the incident. Perhaps the most ironic casualty was DownDetector, a site people commonly visit to check the status of their favorite sites. DownDetector usually lets people know about disruptions, but it couldn’t do that during the Cloudflare issue, which rendered it non-functional as well.

Not caused by a malicious attack

Cloudflare published a blog post confirming that the outage was not caused by an attack, despite many rumors to the contrary. Instead, it occurred because of one misconfigured rule in Cloudflare’s Web Application Firewall (WAF). The company’s development team deployed the new firewall rules because it wanted to strengthen the existing blocks that prevent attacks based on inline JavaScript.

The developers rolled out the new firewall rules in a simulated mode that allowed them to log issues without blocking customer traffic. They did so to measure the false positive rates of the rules before full-scale deployment.
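To make the idea of a simulated mode concrete, here is a minimal Python sketch. The `Rule` class, the `Mode` names, and the sample pattern are all hypothetical and are not taken from Cloudflare's firewall. The only point is that a rule in simulate mode records a match without blocking the request, which makes it possible to estimate a false positive rate from known-benign traffic.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SIMULATE = "simulate"  # log matches, never block
    BLOCK = "block"        # reject matching requests


@dataclass
class Rule:
    """Hypothetical firewall rule; not Cloudflare's actual data model."""
    name: str
    pattern: str
    mode: Mode = Mode.SIMULATE
    hits: list = field(default_factory=list)

    def allows(self, body: str) -> bool:
        """Return True if the request may proceed."""
        if re.search(self.pattern, body):
            self.hits.append(body)             # always record the match
            return self.mode is Mode.SIMULATE  # only BLOCK mode rejects
        return True


def false_positive_rate(rule: Rule, benign_bodies: list) -> float:
    """Share of known-benign requests the rule would have flagged."""
    flagged = sum(1 for body in benign_bodies if re.search(rule.pattern, body))
    return flagged / len(benign_bodies)


if __name__ == "__main__":
    rule = Rule("inline-js", r"<script\b")  # assumed example pattern
    benign = ["name=alice", "q=<script> tag tutorial", "page=2", "id=7"]
    print(false_positive_rate(rule, benign))
```

Note that `allows` still evaluates the pattern in simulate mode. Only the blocking decision changes, so the cost of running the match is paid either way.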

However, even running the firewall rules in simulated mode caused problems. That’s because one of the rules contained a regular expression (a pattern that describes the text a rule should match) that drove CPU usage on Cloudflare’s machines worldwide to 100%. After that, people trying to access sites that use Cloudflare started seeing 502 Bad Gateway errors.
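A regular expression that can pin a CPU is usually one that backtracks heavily. This article does not quote the offending rule, so the sketch below uses a textbook pattern, `^(a+)+$`, as a stand-in. It is an assumption for illustration, not Cloudflare's expression. On input that almost matches, Python's backtracking `re` engine tries a rapidly growing number of ways to split the text, while the equivalent pattern `^a+$` gives up almost immediately.

```python
import re
import time

# Hypothetical stand-in patterns; both accept one or more "a" characters.
NESTED = re.compile(r"^(a+)+$")  # nested quantifiers invite heavy backtracking
FLAT = re.compile(r"^a+$")       # same set of accepted strings, no nesting


def timed_match(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (12, 16, 20):
        text = "a" * n + "!"  # the trailing "!" forces the match to fail
        _, slow = timed_match(NESTED, text)
        _, fast = timed_match(FLAT, text)
        print(f"n={n:2d}  nested={slow:.4f}s  flat={fast:.6f}s")
```

Running the script should show the time for the nested pattern climbing steeply as `n` grows, while the flat pattern stays negligible. Exact timings depend on the machine, so none are quoted here.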

Triggered by exhausted CPUs

Cloudflare referred to this problem as “an unprecedented CPU exhaustion event,” and clarified that the company had never experienced global CPU exhaustion before. It’s also worth pointing out that Cloudflare acknowledged it normally has processes in place that require changes to be rolled out progressively. In this case, though, Cloudflare made the changes worldwide all at once, which let the faulty rule exhaust CPUs across the entire network at the same time.
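A progressive rollout of the kind described above can be sketched as a loop over increasingly large stages with a health check after each one. The stage names, the 80% CPU threshold, and the `deploy`, `rollback`, and `read_cpu` callables are hypothetical placeholders, not Cloudflare's tooling.

```python
from typing import Callable, List

CPU_LIMIT = 80.0  # assumed alert threshold, in percent


def staged_rollout(
    stages: List[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    read_cpu: Callable[[str], float],
) -> List[str]:
    """Deploy stage by stage; undo everything and stop if CPU runs hot.

    Returns the stages that keep the change (empty after a rollback).
    """
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if read_cpu(stage) > CPU_LIMIT:
            # Unwind in reverse order, newest stage first.
            for applied in reversed(done):
                rollback(applied)
            return []
    return done


if __name__ == "__main__":
    cpu = {"canary": 35.0, "one-region": 100.0, "global": 40.0}
    result = staged_rollout(
        ["canary", "one-region", "global"],
        deploy=lambda s: print("deploy", s),
        rollback=lambda s: print("rollback", s),
        read_cpu=lambda s: cpu[s],
    )
    print(result)
```

In this example the change is rolled back once the `one-region` stage reports 100% CPU, so the `global` stage is never deployed.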

Cloudflare responded quickly

Once the Cloudflare team recognized what was happening, it issued a “global kill” command that disabled all the new firewall rulesets and brought CPU usage back down to normal levels. Then, less than an hour later, the team had evaluated the problematic rule, engineered and tested a fix, and re-enabled the rulesets.
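One common way to build a kill switch like this is a single flag that the rule engine checks before it evaluates anything. The sketch below is a hypothetical illustration in Python. The `RuleEngine` class and its method names are assumptions, not Cloudflare's implementation.

```python
import re
import threading


class RuleEngine:
    """Hypothetical rule engine with a global kill switch."""

    def __init__(self, patterns):
        self._patterns = [re.compile(p) for p in patterns]
        self._killed = threading.Event()  # set means "skip all rules"

    def global_kill(self):
        """Disable every rule at once."""
        self._killed.set()

    def reenable(self):
        """Turn rule evaluation back on."""
        self._killed.clear()

    def is_blocked(self, body: str) -> bool:
        if self._killed.is_set():
            return False  # fail open: no pattern is evaluated at all
        return any(p.search(body) for p in self._patterns)


if __name__ == "__main__":
    engine = RuleEngine([r"<script\b"])  # assumed example pattern
    print(engine.is_blocked("<script>x</script>"))
    engine.global_kill()
    print(engine.is_blocked("<script>x</script>"))
```

The trade-off in a design like this is that traffic passes unfiltered while the switch is on, so it suits a short emergency window rather than normal operation.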

Also, Cloudflare’s CEO mentioned that the problem was “a mistake on our part” and talked about how a bug in the firewall application caused the issue with the CPUs. He also said the public could expect a detailed blog post about what the company is doing to stop a similar event from happening again.

A future improvement to testing processes

For now, the company’s blog acknowledges that the testing process in place before the downtime was not sufficient. It also states that the development team is working out how to implement better testing and deployment procedures.

Once more details get published, people can start to weigh in on whether the remedies are adequate. Moreover, the company’s long-term response could help people decide whether they should become or remain Cloudflare customers.
