Cloudflare Down LIVE: Global Outage That Took Down ChatGPT, X, and Others Has Been Resolved
Cloudflare says the worldwide outage that knocked numerous well-known platforms, including X, ChatGPT, Claude, and Perplexity, offline has been resolved.
Cloudflare’s network began experiencing serious failures delivering core network traffic at 11:20 UTC on November 18, 2025 (all times in this post are UTC). Internet users trying to reach our customers’ websites were shown an error page indicating a failure within the Cloudflare network.
The problem was not caused, directly or indirectly, by a cyberattack or any other malicious activity. Instead, it was triggered by a change to the permissions on one of our database systems, which caused the database to output duplicate entries into a “feature file” used by our Bot Management system. As a result, the feature file doubled in size, and the larger-than-expected file then propagated to every machine in our network.
The software that routes traffic across our network runs on these machines and reads this feature file to keep our Bot Management system up to date with constantly evolving threats. The software had a limit on the size of the feature file, and the doubled file exceeded that limit, causing the software to fail.
After initially and mistakenly suspecting that the symptoms were the result of a hyper-scale DDoS attack, we identified the real cause, stopped the spread of the oversized feature file, and replaced it with an earlier version. By 14:30, core traffic was mostly flowing normally. Over the next few hours, we worked to reduce the additional load on different parts of our network as traffic rushed back online. As of 17:06, all of Cloudflare’s systems were operating normally.

We apologize to our customers and to the Internet at large. Given Cloudflare’s importance in the Internet ecosystem, any outage of any of our systems is unacceptable. That there was a period when our network was unable to route traffic is deeply painful to every member of our team. We know we let you down today.
This article provides a detailed account of the events and the systems and procedures that went wrong. Additionally, it is just the start of what we intend to do to ensure that a similar outage doesn’t occur in the future.
OUTAGE
The number of 5xx error HTTP status codes that the Cloudflare network serves is displayed in the chart below. This should normally be quite low, as it was until the outage began.
The volume before 11:20 represents the expected baseline of 5xx errors across our network. The spike, and the fluctuations that followed, show our system failing as the bad feature file was loaded. What is notable is that the system would then recover for a period. That is very unusual behavior for an internal error.
The explanation is that the file was being generated every five minutes by a query running on a ClickHouse database cluster that was being gradually updated to improve permissions management. Bad data was produced only when the query ran on a part of the cluster that had already been updated. As a result, every five minutes there was a chance that either a good or a bad set of configuration files would be generated and rapidly propagated across the network.
This fluctuation made it hard to understand what was happening: the whole network would recover and then fail again as sometimes good and sometimes bad configuration files were distributed. Initially, this led us to believe the cause might be an attack. Eventually, every ClickHouse node was generating the bad configuration file, and the fluctuation settled into the failing state.
Errors continued until the root cause was identified and a fix began rolling out at 14:30. We resolved the issue by stopping the generation and propagation of the bad feature file, manually inserting a known good file into the feature file distribution queue, and then forcing a restart of our core proxy.
The remaining long tail in the chart reflects our team restarting the remaining services that had entered a bad state, with 5xx error volume returning to normal at 17:06.
The following services were affected:
| Service/Product | Impact description |
| --- | --- |
| Core CDN and security services | HTTP 5xx status codes. The screenshot at the top of this post shows a typical error page served to end users. |
| Turnstile | Turnstile failed to load. |
| Workers KV | Workers KV returned a significantly elevated level of HTTP 5xx errors when requests to its “front end” gateway failed because of the core proxy failure. |
| Dashboard | Although the dashboard itself was largely functional, most users were unable to log in because Turnstile was unavailable on the login page. |
| Email Security | Email processing and delivery were unaffected, but we observed a temporary loss of access to an IP reputation source, which reduced spam-detection accuracy and prevented some new-domain-age detections from triggering; no critical customer impact was observed. We also saw failures in some Auto Move actions; all affected messages have since been reviewed and remediated. |
| Access | Authentication failed for most users from the start of the incident until the rollback began at 13:05. Existing Access sessions were unaffected. Every failed authentication attempt produced an error page, so none of those users reached the target application, although successful logins during this period were logged correctly. Any Access configuration updates attempted during this window either failed outright or propagated very slowly; all configuration updates have since been recovered. |
In addition to the HTTP 5xx errors, we also saw significant increases in our CDN’s response latency during the impact period. This was driven by our debugging and observability systems, which automatically annotate uncaught errors with additional debugging information and consumed large amounts of CPU in the process.
How Cloudflare handles requests, and what went wrong today
Every request to Cloudflare follows a well-defined path through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer and are then handed to our core proxy system, which we call FL for “Frontline.” Pingora then performs cache lookups or fetches data from the origin when needed.
We have previously written in more detail about how the core proxy works.
As a request passes through the core proxy, we run the various security and performance products in our network against it. The proxy applies each customer’s specific configuration and settings, from enforcing WAF rules and DDoS protection to routing traffic to the Developer Platform and R2. It does so through a collection of domain-specific modules that apply configuration and policy rules to the traffic flowing through our proxy.
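As a rough illustration of that flow (the trait shape and the module list here are assumptions made for the sketch, not Cloudflare’s actual FL2 interfaces), each request passes through a chain of modules, and a hard failure in any one of them becomes an error for the whole request:

```rust
// Illustrative sketch only: types and module set are assumptions, not Cloudflare's real code.
struct Request;        // an in-flight HTTP request
struct CustomerConfig; // the owning customer's settings and rules

#[derive(Debug)]
struct ModuleError(String);

trait ProxyModule {
    fn apply(&self, req: &mut Request, cfg: &CustomerConfig) -> Result<(), ModuleError>;
}

// The core proxy runs each domain-specific module (DDoS protection, WAF,
// Bot Management, ...) over the request in turn. If one module fails hard,
// the request cannot be served and the client sees a 5xx error.
fn handle(
    req: &mut Request,
    cfg: &CustomerConfig,
    modules: &[Box<dyn ProxyModule>],
) -> Result<(), ModuleError> {
    for module in modules {
        module.apply(req, cfg)?;
    }
    Ok(())
}
```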
The outage that occurred today was caused by one of the modules, Bot Management.

Cloudflare’s Bot Management includes, among other systems, a machine learning model that generates a bot score for every request crossing our network. Our customers use bot scores to control which bots are allowed to access their sites.
The model takes a “feature” configuration file as input. In this context, a feature is an individual trait that the machine learning model uses to predict whether a request is automated. The feature configuration file is a collection of these individual features.
This feature file is refreshed every few minutes and published across our entire network, which lets us react to changes in Internet traffic flows, including new types of bots and new bot attacks. Because bad actors change tactics quickly, it is critical that the file be deployed frequently and rapidly.
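Conceptually, the feature file is a versioned list of named model inputs that every server refreshes every few minutes. The shape below is an assumption for illustration only; the real file format is internal to Cloudflare:

```rust
// Illustrative only: the actual feature file format is not public.
struct Feature {
    name: String, // e.g. a hypothetical signal such as "header_order_fingerprint"
}

struct FeatureFile {
    version: u64,           // bumped on every refresh, roughly every five minutes
    features: Vec<Feature>, // normally around 60 entries; the bad file held over 200
}
```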
A change in the behavior of our underlying ClickHouse queries (described below) caused this file to contain a large number of duplicate “feature” rows, which changed the size of the previously predictable feature configuration file and caused the bots module to raise an error.
As a result, the core proxy system that handles traffic processing for our customers returned HTTP 5xx error codes for any traffic that depended on the bots module. Workers KV and Access, which rely on the core proxy, were affected as well.
Unrelated to this incident, we have been, and still are, migrating customer traffic to a new version of our proxy service, internally known as FL2. The problem affected both versions, although the impact differed.
Customers running on the new FL2 proxy engine saw HTTP 5xx errors. Customers on our older proxy engine, FL, did not see errors, but bot scores were generated incorrectly, and all traffic received a bot score of zero. Customers with rules deployed to block bots would have seen large numbers of false positives; customers who did not use our bot score in their rules were unaffected.
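To see why a score of zero produces false positives, here is a hedged sketch of how a bot-blocking rule might consume the score; the threshold, the score range, and the rule shape are assumptions of the sketch, not a real customer configuration:

```rust
// Sketch only: rule shape and threshold are illustrative, not a customer's actual config.
// Assumes a bot score where lower values mean "more likely automated".
fn should_block(bot_score: u8, block_below: u8) -> bool {
    bot_score < block_below
}

fn main() {
    let block_below = 30; // hypothetical customer rule: block likely bots
    // During the incident the old FL engine emitted a score of 0 for all traffic,
    // so every request, human or not, fell below the threshold and was blocked.
    assert!(should_block(0, block_below));
    // A typical human request with a healthy score would normally pass.
    assert!(!should_block(85, block_below));
}
```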
Another symptom threw us off and reinforced the impression that this might be an attack: Cloudflare’s status page went down. The status page is hosted entirely off Cloudflare’s infrastructure and has no dependency on it. Although this turned out to be a coincidence, it led some members of the team diagnosing the incident to believe an attacker might be targeting both our systems and our status page. Visitors to the status page saw an error message at the time.
In the internal incident chat room, we were concerned that this could be a continuation of the recent wave of high-volume Aisuru DDoS attacks.
The change in query behavior
As noted above, a change in the underlying query behavior caused the feature file to contain a large number of duplicate rows. The database system in question runs ClickHouse software.
It helps to understand how ClickHouse distributed queries work. A ClickHouse cluster consists of many shards. To query data from all shards, we use so-called distributed tables (powered by the Distributed table engine) in a database named default. The distributed engine queries the underlying tables in a database named r0, which is where the data is actually stored on each ClickHouse cluster shard.
Queries to the distributed tables run under a shared system account. As part of ongoing work to improve the security and reliability of our distributed queries, we are migrating them to run under the initial users’ accounts instead.
Previously, when ClickHouse users queried table metadata from ClickHouse system tables such as system.tables or system.columns, they would only see the tables in the default database.
Since users already had implicit access to the underlying tables in r0, we made a change at 11:05 to make that access explicit, so that users could also see the metadata of those tables. Ensuring that all distributed subqueries run under the initial user allows query limits and access grants to be evaluated more precisely, preventing one user’s bad subquery from affecting others.
With the change above, all users gained access to accurate metadata about the tables they can access. Unfortunately, existing logic assumed that the list of columns returned by a metadata query against ClickHouse’s system.columns table, filtered only by table name and not by database name, would contain only tables in the default database. Because we were gradually rolling out the explicit grants to users of the affected ClickHouse cluster, after the 11:05 change that query began returning “duplicate” columns: the additional rows were for the underlying tables stored in the r0 database. Crucially, the logic that generates the Bot Management feature file used exactly this kind of query to build each input “feature” for the file described at the start of this section.
Previously, the query returned a simple list of column names and types for the feature table, with one row per column. With the additional permissions granted to the user, however, the response now also included all of the corresponding columns from the r0 schema. This more than doubled the number of rows in the response, which ultimately inflated the number of rows (that is, features) in the final file output.
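As a rough sketch of that generation step (the table name and the query text are illustrative stand-ins, not the production query), the logic asks ClickHouse for column metadata filtered only by table name, so once the r0 tables became visible each column came back twice:

```rust
// Illustrative sketch; `http_requests_features` is a stand-in table name, not the real one.
// The metadata query filters on the table name but not on the database name:
const FEATURE_COLUMNS_QUERY: &str =
    "SELECT name, type FROM system.columns WHERE table = 'http_requests_features' ORDER BY name";

// Each row is a (column_name, column_type) pair returned by the query.
fn build_feature_names(rows: &[(String, String)]) -> Vec<String> {
    // Before 11:05 only the `default` database's table was visible, so each column
    // appeared once. After the explicit r0 grants, the same columns were returned a
    // second time for the underlying r0 tables, and with no deduplication the
    // feature count in the generated file more than doubled.
    rows.iter().map(|(name, _ty)| name.clone()).collect()
}
```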

Preallocation of Memory
Each module that runs in our proxy service has a number of limits in place to avoid unbounded memory consumption and to optimize performance by preallocating memory. In this case, the Bot Management system caps the number of machine learning features that can be used at runtime. That cap is currently set at 200, well above the roughly 60 features we use today. Again, the limit exists because, for performance reasons, we preallocate memory for the features.
When the bad file containing more than 200 features propagated to our servers, this limit was hit and the system panicked: the check in our FL2 Rust code returned an error that was then unwrapped rather than handled.
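The actual FL2 source is not reproduced here; the following is a minimal sketch of the failing pattern, assuming a hypothetical load_features helper and the 200-feature preallocation limit described above:

```rust
// Minimal sketch of the failing pattern, not Cloudflare's actual FL2 code.
const MAX_FEATURES: usize = 200; // memory is preallocated for at most this many features

#[derive(Debug)]
struct TooManyFeatures {
    got: usize,
    limit: usize,
}

fn load_features(names: Vec<String>) -> Result<Vec<String>, TooManyFeatures> {
    if names.len() > MAX_FEATURES {
        return Err(TooManyFeatures { got: names.len(), limit: MAX_FEATURES });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES); // preallocated for performance
    features.extend(names);
    Ok(features)
}

fn main() {
    // The oversized feature file carried well over 200 entries, so the check
    // returned an Err, and unwrapping it aborted the worker thread with the
    // panic shown below.
    let oversized: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();
    let _features = load_features(oversized).unwrap();
}
```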
In production, this produced the following panic, which ultimately led to the 5xx errors:
```
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
```
Additional effects of the incident
Several systems that depend on our core proxy were affected during the incident, including Workers KV and Cloudflare Access. At 13:04, the team reduced the impact on these systems by patching Workers KV to bypass the core proxy. Error rates then dropped for all downstream systems that rely on Workers KV, including Access itself.
The Cloudflare Dashboard was affected both by its internal use of Workers KV and by the use of Cloudflare Turnstile on our login page.
Because Turnstile was affected by this outage, users who did not already have an active dashboard session were unable to log in. This shows up in the graph below as reduced availability during two periods: 11:30 to 13:10, and 14:40 to 15:30.
The first period, from 11:30 to 13:10, was caused by the impact on Workers KV, on which some control plane and dashboard functions depend; this recovered at 13:10 once Workers KV bypassed the core proxy system. The second period of dashboard disruption came after the feature configuration data was restored: a backlog of login attempts began to overwhelm the dashboard, and the backlog, combined with retries, drove up latency and reduced dashboard availability. Scaling control plane concurrency restored availability at around 15:30.
Corrective actions and follow-up steps
Now that our systems are back online and operating normally, we have already begun work to make them more resilient to this kind of failure. Specifically, we are:
- Hardening the ingestion of Cloudflare-generated configuration files in the same way we would treat user-generated input (see the sketch after this list).
- Enabling more global kill switches for features.
- Preventing core dumps and other error reports from overwhelming system resources.
- Reviewing failure modes for error conditions across all core proxy modules.
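As a hedged sketch of the first item (names and limits are illustrative, not Cloudflare’s), internally generated feature files would be validated like untrusted input, with anything suspect rejected in favor of the last-known-good file, and a global kill switch able to force the same fallback:

```rust
// Sketch of the kind of hardening described above; names and limits are illustrative.
const MAX_FEATURES: usize = 200;

struct FeatureFile {
    features: Vec<String>,
}

// Treat an internally generated configuration file with the same suspicion as
// user-provided input. If it fails validation, or the global kill switch for the
// feature is engaged, the caller keeps serving with the last-known-good file
// instead of crashing the proxy.
fn accept_feature_file(candidate: &FeatureFile, kill_switch_engaged: bool) -> bool {
    !kill_switch_engaged
        && !candidate.features.is_empty()
        && candidate.features.len() <= MAX_FEATURES
}
```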
Today was Cloudflare’s worst outage since 2019. We have had outages that made our dashboard unavailable, and some that temporarily took newer features offline. But in the past six-plus years, we have not had another outage that stopped the majority of core traffic from flowing through our network.
An outage like today’s is unacceptable. We have architected our systems to be highly resilient to failure so that traffic always keeps flowing. Past outages have always pushed us to build new, more resilient systems.
On behalf of the entire Cloudflare team, I would like to apologize for the pain we caused the Internet today.
| Time (UTC) | Status | Description |
| --- | --- | --- |
| 11:05 | Normal | The database access control change was deployed. |
| 11:28 | Impact starts | The deployment reached customer environments, and the first errors were observed on customer HTTP traffic. |
| 11:32-13:05 | The team investigated elevated traffic levels and errors in the Workers KV service. | The initial symptom appeared to be a degraded Workers KV response rate, which affected other Cloudflare services downstream. Mitigations such as traffic manipulation and account limiting were attempted to bring Workers KV back to normal operating levels. The first automated test detected the issue at 11:31, and manual investigation began at 11:32. The incident call was created at 11:35. |
| 13:05 | Workers KV and Cloudflare Access bypass implemented; impact reduced. | While investigating, we used internal system bypasses for Workers KV and Cloudflare Access so that they fell back to an earlier version of our core proxy. Although the problem also existed in earlier versions of the proxy, the impact there was less severe, as described above. |
| 13:37 | Work focused on rolling back the Bot Management configuration file to a last-known-good version. | We were confident at this point that the Bot Management configuration file was the trigger for the incident. Teams pursued multiple workstreams to restore the service; the fastest was to restore a previous version of the file. |
| 14:24 | Creation and propagation of new Bot Management configuration files stopped. | We identified the Bot Management module as the source of the 500 errors, caused by a bad configuration file, and halted the automatic deployment of new Bot Management configuration files. |
| 14:24 | Test of the new file complete. | After observing a successful recovery with the previous version of the configuration file, we focused on accelerating the fix globally. |
| 14:30 | Main impact resolved. Downstream impacted services started observing reduced errors. | A known-good Bot Management configuration file was deployed globally, and most services began operating correctly again. |
| 17:06 | All services resolved. Impact ends. | All downstream services were restarted and all operations were fully recovered. |
Cloudflare’s connectivity cloud protects entire corporate networks, speeds up any website or Internet application, fends off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.
Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.
Learn more about our mission to help build a better Internet, and if you’re looking for a new career direction, check out our open positions.
About The Author
Suraj Koli is a content specialist focused on technical writing about cybersecurity and information security. He has written numerous articles on cybersecurity concepts, the latest trends in cyber awareness, and ethical hacking.