Cloudflare Outage: A Corrupted Bot Management File Brought Down Much of the Internet
On November 18, 2025, a corrupted bot management file disrupted a significant portion of the internet and caused widespread failures across online services. The failure originated at Cloudflare, a widely used proxy service that helps protect websites from malicious traffic. According to Cloudflare’s analysis, its software enforces a limit of 200 on the number of machine learning features that can be used at runtime. A bad file containing more than 200 features was propagated to Cloudflare’s servers, causing the software to panic and return errors.
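The failure mode described above, a hard limit that turns an oversized file into a crash, can be contrasted with a loader that rejects the file and keeps serving with the last good one. The following is a minimal sketch under stated assumptions, not Cloudflare's implementation: only the limit of 200 comes from the analysis, while `BotModule`, `validate_features`, `FeatureFileError`, and the feature counts of 60 and 250 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

MAX_FEATURES = 200  # runtime limit cited in Cloudflare's analysis


class FeatureFileError(ValueError):
    """Raised when a candidate feature file fails validation."""


def validate_features(features: list[str], limit: int = MAX_FEATURES) -> list[str]:
    """Return the features unchanged, or raise if the file exceeds the limit."""
    if len(features) > limit:
        raise FeatureFileError(
            f"feature file has {len(features)} entries, limit is {limit}"
        )
    return features


@dataclass
class BotModule:
    """Hypothetical consumer that keeps the last file that passed validation."""

    active: list[str] = field(default_factory=list)
    rejected: int = 0

    def reload(self, candidate: list[str]) -> bool:
        """Swap in the candidate if it is valid; otherwise keep the current file."""
        try:
            self.active = validate_features(candidate)
            return True
        except FeatureFileError:
            self.rejected += 1  # record the rejection instead of crashing
            return False


module = BotModule()
good = [f"feature_{i}" for i in range(60)]   # illustrative size, under the limit
bad = [f"feature_{i}" for i in range(250)]   # illustrative size, over the limit

module.reload(good)
module.reload(bad)
print(len(module.active), module.rejected)
```

In this sketch the oversized file is counted as rejected and the 60-feature file stays active, so requests would continue to be scored with slightly stale data rather than failing outright.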
The number of HTTP 5xx error responses served by the Cloudflare network is normally very low, but it soared after the bad file spread across the network. At first the error rate fluctuated instead of staying high. The file was regenerated every five minutes by a query running on a ClickHouse database cluster, and that cluster was being gradually updated to improve permissions management. Only the nodes that had already been updated produced bad data, so each five-minute run could yield either a good or a bad configuration file, which was then rapidly propagated across the network.
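The flapping between good and bad files during the gradual rollout can be illustrated with a small simulation. This is a sketch under assumptions: the node count, the rollout fractions, and the rule that an updated node always emits a bad file are illustrative simplifications, and `simulate` and `generate_file` are hypothetical names.

```python
import random


def generate_file(node_updated: bool) -> str:
    """Assumption: only nodes that already received the update emit a bad file."""
    return "bad" if node_updated else "good"


def simulate(updated_fractions: list[float], nodes: int = 10, seed: int = 0) -> list[str]:
    """Run one generation cycle per entry; each entry is the share of updated nodes."""
    rng = random.Random(seed)
    results = []
    for fraction in updated_fractions:
        updated = round(fraction * nodes)  # nodes 0..updated-1 have the change
        node = rng.randrange(nodes)        # node that happens to run this cycle's query
        results.append(generate_file(node < updated))
    return results


# One entry per five-minute cycle as the rollout progresses.
rollout = [0.0, 0.3, 0.5, 0.7, 1.0, 1.0]
print(simulate(rollout))
```

While the rollout is partial, the outcome of each cycle depends on which node runs the query, so good and bad files alternate unpredictably. Once the fraction reaches 1.0, every cycle produces a bad file and the system settles into the failing state.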
Uncovering the Root Cause
The fluctuation in errors initially led the Cloudflare team to suspect an attack. However, once every ClickHouse node was generating the bad configuration file, the fluctuation stopped and the network stabilized in the failing state. The team eventually traced the problem to the corrupted bot management file and worked to stop its generation and propagation. They manually inserted a known good file into the feature file distribution queue and forced a restart of their core proxy.
Lessons Learned and Future Improvements
Cloudflare’s analysis described the outage as the company’s worst since 2019. The company is taking steps to protect against similar failures in the future, including hardening the ingestion of Cloudflare-generated configuration files, enabling more global kill switches for features, and eliminating the ability for core dumps or other error reports to overwhelm system resources. While the team cannot promise that Cloudflare will never have another outage of the same scale, they note that previous outages have led to the development of more resilient systems.
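One of the listed measures, a global kill switch for a feature, can be sketched as follows. This is an illustrative outline and not Cloudflare's design: `KillSwitches`, `handle_request`, `broken_scorer`, and the feature name `bot_management` are hypothetical, and a production switch would be distributed across the network, not held in memory.

```python
from typing import Callable


class KillSwitches:
    """Hypothetical in-memory registry of globally disabled features."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def handle_request(path: str, switches: KillSwitches, score_bot: Callable[[str], int]) -> dict:
    """Serve the request; skip bot scoring when the feature is switched off."""
    response = {"path": path, "status": 200, "bot_score": None}
    if switches.is_enabled("bot_management"):
        response["bot_score"] = score_bot(path)
    return response


def broken_scorer(path: str) -> int:
    """Stand-in for a scoring module whose feature file failed to load."""
    raise RuntimeError("feature file could not be loaded")


switches = KillSwitches()
switches.disable("bot_management")  # operator flips the switch during the incident
print(handle_request("/index.html", switches, broken_scorer))
```

With the switch off, the request is served without a bot score even though the scorer is broken. The trade-off is that traffic passes unscored until the feature is re-enabled.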
For more information on the Cloudflare outage, read the company’s full post-incident report.