Cloudflare November 2025 Outage Analysis
A technical examination of the database permissions change that caused a six-hour global service disruption
|
On November 18, 2025, at 11:20 UTC, Cloudflare experienced a significant network failure affecting millions of websites worldwide. The incident lasted approximately six hours, with peak impact occurring between 11:28 UTC and 14:30 UTC.
The root cause was not a cyberattack or external threat. Rather, it originated from a database permissions modification that exposed an undocumented assumption in a SQL query, which subsequently triggered cascading failures across the entire network infrastructure.
This analysis examines the technical details of the incident, the investigation process, and the architectural decisions that contributed to the failure.
|
Timeline and Initial Symptoms
The incident began at 11:05 UTC with a database access control deployment to a ClickHouse cluster. The change was part of ongoing work to improve permissions management by making implicit database access explicit for enhanced security and query accountability.
At 11:28 UTC, the first errors appeared on customer HTTP traffic as the deployment reached production environments. The initial symptom manifested as elevated HTTP 5xx error rates across core CDN and security services. Automated testing detected the anomaly at 11:31 UTC, with manual investigation commencing at 11:32 UTC.
Between 11:32 UTC and 13:05 UTC, the engineering team focused on Workers KV, which appeared to be the primary source of degradation. Various mitigation strategies including traffic manipulation and account limiting were attempted without success, as these addressed symptoms rather than the underlying cause.
|
The Technical Root Cause
The Bot Management system at Cloudflare utilizes machine learning models to generate bot scores for network traffic. These models depend on a feature configuration file that is refreshed every few minutes and distributed globally across all network nodes.
The feature file generation process included a SQL query against the ClickHouse database. The query retrieved column metadata for the bot management feature table:
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
Prior to the permissions change, this query returned approximately 60 rows representing columns from the default database. The query lacked an explicit database name filter, operating under an implicit assumption that only default database tables would be visible to the user account.
After the permissions modification granted explicit access to the underlying r0 database tables, the query began returning duplicate entries for each column: one from the default database and one from r0. The result set grew to more than 200 rows, which in turn roughly doubled the size of the generated feature configuration file.
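To make the hidden dependency concrete, the sketch below shows how the generation step could have protected itself. The code is hypothetical (Cloudflare has not published its pipeline); the scoped query variant, names, and dedupe helper are assumptions for illustration. It scopes the metadata query to a single database and collapses duplicate column names before they can reach the feature file.

use std::collections::BTreeSet;

// Illustrative variant of the metadata query with an explicit database
// predicate; the original query relied on implicit visibility of the
// default database. (Assumed SQL, not Cloudflare's actual remediation.)
const SCOPED_QUERY: &str = "SELECT name, type \
    FROM system.columns \
    WHERE database = 'default' AND table = 'http_requests_features' \
    ORDER BY name";

// Collapse rows surfaced from more than one database into a single
// feature entry per column name.
fn dedupe_columns(rows: &[(String, String)]) -> Vec<(String, String)> {
    let mut seen = BTreeSet::new();
    rows.iter()
        .filter(|(name, _)| seen.insert(name.clone()))
        .cloned()
        .collect()
}

fn main() {
    // Simulated result set after the permissions change: each column now
    // appears twice, once from the default database and once from r0.
    let rows = vec![
        ("feature_a".to_string(), "Float64".to_string()),
        ("feature_a".to_string(), "Float64".to_string()),
        ("feature_b".to_string(), "Float64".to_string()),
        ("feature_b".to_string(), "Float64".to_string()),
    ];
    let unique = dedupe_columns(&rows);
    println!("query: {SCOPED_QUERY}");
    println!("{} raw rows -> {} unique features", rows.len(), unique.len());
}

Either guard alone would have kept the file at its expected size; the value of having both is that the output stays correct even if database visibility changes again.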
|
System Failure Mechanism
The proxy software responsible for processing network traffic preallocates memory for performance, including the buffers that hold Bot Management features. The Bot Management module enforced a hard limit of 200 features, well above the roughly 60 features typically observed in production.
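A minimal sketch of that kind of preallocated, hard-limited buffer follows; the type names, error message, and rejection behavior are illustrative assumptions rather than the FL2 implementation.

// Hard cap mirroring the 200-feature limit described above.
const MAX_FEATURES: usize = 200;

// A feature buffer whose backing storage is reserved once, up front, so
// appends on the hot path never reallocate.
struct FeatureBuffer {
    names: Vec<String>,
}

impl FeatureBuffer {
    fn new() -> Self {
        Self { names: Vec::with_capacity(MAX_FEATURES) }
    }

    // Appending beyond the preallocated limit is reported as an error
    // rather than silently growing the buffer.
    fn push(&mut self, name: String) -> Result<(), String> {
        if self.names.len() >= MAX_FEATURES {
            return Err(format!("feature limit of {MAX_FEATURES} exceeded"));
        }
        self.names.push(name);
        Ok(())
    }
}

fn main() {
    let mut buf = FeatureBuffer::new();
    // Simulate ingesting an oversized feature file with 250 entries;
    // the 201st push is rejected.
    for i in 0..250 {
        if let Err(e) = buf.push(format!("feature_{i}")) {
            println!("entry {i} rejected: {e}");
            break;
        }
    }
    println!("buffer holds {} features", buf.names.len());
}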
When the enlarged feature file exceeded this limit, the FL2 proxy (the Rust-based implementation) received an error value, and the code path called unwrap on that result, which panics when handed an Err:
thread fl2_worker_thread panicked:
called Result::unwrap() on an Err value
This panic caused the proxy worker thread to terminate, resulting in HTTP 5xx errors for requests requiring bot score evaluation. The problem compounded as the feature file updates propagated across the global network every five minutes.
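Whether the limit violation becomes an outage depends on how the caller treats that error. The sketch below contrasts the two strategies under the same check: an unwrap-style path that panics on any Err, and a fallback path that logs the rejection and keeps serving traffic with the last-known-good configuration. Names and structure are assumed for illustration and do not reflect Cloudflare's code.

// Hypothetical result of loading a newly generated feature file.
#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

#[derive(Clone, Debug)]
struct BotConfig {
    feature_count: usize,
}

fn load_new_config(feature_count: usize) -> Result<BotConfig, ConfigError> {
    const LIMIT: usize = 200;
    if feature_count > LIMIT {
        return Err(ConfigError::TooManyFeatures { got: feature_count, limit: LIMIT });
    }
    Ok(BotConfig { feature_count })
}

fn main() {
    let last_known_good = BotConfig { feature_count: 60 };

    // Panicking path, roughly what an unwrap does: any Err takes the
    // whole worker thread down with it.
    //
    //     let config = load_new_config(260).unwrap(); // would panic here
    //
    // Graceful path: reject the bad file, log it, and keep serving
    // traffic with the previous configuration instead of terminating.
    let config = match load_new_config(260) {
        Ok(cfg) => cfg,
        Err(err) => {
            eprintln!("rejecting new feature file: {err:?}; keeping last-known-good");
            last_known_good.clone()
        }
    };
    println!("active config has {} features", config.feature_count);
}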
The gradual rollout of database permissions meant some nodes generated correct files while others produced oversized files. This created intermittent failures where the system would recover briefly before failing again, significantly complicating diagnosis.
|
Investigation Challenges
Several factors complicated the investigation process. The intermittent nature of the failures suggested an external attack, particularly given recent industry incidents involving multi-terabit DDoS attacks. This hypothesis gained credibility when the Cloudflare status page, hosted independently of Cloudflare's infrastructure, coincidentally experienced unrelated availability issues.
At 13:05 UTC, the team implemented a bypass for Workers KV and Cloudflare Access, allowing these services to fall back to an earlier proxy version. This reduced error rates for dependent systems but did not resolve the core issue.
By 13:37 UTC, engineers identified the Bot Management configuration file as the trigger. The resolution involved halting automated file generation at 14:24 UTC, followed by manual insertion of a known-good configuration file into the distribution system at 14:30 UTC. Core services returned to normal operation, though downstream effects persisted until 17:06 UTC.
|
Impact on Services and Infrastructure
The failure affected multiple service layers. Core CDN and security services returned HTTP 5xx status codes to end users attempting to access customer websites. Turnstile, the bot detection system, failed to load entirely, preventing authentication on services including the Cloudflare dashboard.
Workers KV experienced significantly elevated error rates as front-end gateway requests failed due to the proxy issues. Cloudflare Access saw widespread authentication failures from the start of the incident until the rollback initiated at 13:05 UTC. Existing Access sessions remained functional throughout the incident.
Email Security observed temporary loss of IP reputation data, reducing spam detection accuracy. Some Auto Move actions failed, though all affected messages underwent subsequent review and remediation.
Systems running the older FL proxy engine exhibited different behavior. Rather than returning errors, these systems set all bot scores to zero, causing false-positive blocking for customers with bot prevention rules while leaving customers without such rules unaffected.
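The sketch below shows why that zero-score fallback was damaging: a sentinel score of zero is indistinguishable from highly bot-like traffic, whereas representing a missing score explicitly lets a blocking rule skip requests the module never actually evaluated. The rule shape and threshold are assumptions for illustration, not Cloudflare's rules engine.

// A simplified "block automated traffic" rule: block when the bot score
// is at or below the customer's configured threshold.
fn should_block(score: Option<u8>, threshold: u8) -> bool {
    match score {
        // Score unavailable: skip this rule rather than treating the
        // request as maximally bot-like.
        None => false,
        Some(s) => s <= threshold,
    }
}

fn main() {
    let threshold = 30;

    // Legacy behavior described above: the engine substituted a score of
    // zero when the module failed, so every request matched the rule.
    println!("sentinel 0 blocked: {}", should_block(Some(0), threshold));

    // Modeling "no score" explicitly distinguishes a failed evaluation
    // from genuinely bot-like traffic.
    println!("missing score blocked: {}", should_block(None, threshold));
}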
|
Secondary Performance Issues
Beyond the primary HTTP 5xx errors, the network experienced significant latency increases during the impact period. This resulted from debugging and observability systems automatically enhancing error contexts with additional diagnostic information.
The enhanced error processing consumed substantial CPU resources across affected nodes, demonstrating how error handling mechanisms themselves can become sources of resource contention during high-error-rate scenarios.
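A common way to bound that cost is to sample the expensive diagnostic capture while leaving cheap counters unconditional. The sketch below is illustrative only; the sampling rate and names are assumptions, not Cloudflare's observability tooling.

use std::sync::atomic::{AtomicU64, Ordering};

// Errors observed since startup, used only to drive sampling.
static ERROR_COUNT: AtomicU64 = AtomicU64::new(0);

// Capture full diagnostic context for at most one in every N errors so the
// observability path cannot dominate CPU during an error storm.
const SAMPLE_EVERY: u64 = 100;

fn record_error(message: &str) {
    let n = ERROR_COUNT.fetch_add(1, Ordering::Relaxed);
    if n % SAMPLE_EVERY == 0 {
        // Stand-in for the expensive part: building backtraces, attaching
        // request context, serializing debug payloads, and so on.
        eprintln!("[sampled #{n}] {message} (full diagnostic context captured)");
    }
    // The cheap metric increment above always runs regardless of sampling.
}

fn main() {
    for _ in 0..1_000 {
        record_error("bot module returned an error");
    }
    println!("total errors: {}", ERROR_COUNT.load(Ordering::Relaxed));
}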
|
Architectural Observations
This incident highlights several distributed systems design considerations. Implicit assumptions in database queries created hidden coupling between permission systems and application behavior. The SQL query operated correctly for an extended period because environmental constraints aligned with unstated expectations.
The use of panic-inducing error handling in critical-path code prevented graceful degradation. While memory limits serve important purposes in resource-constrained environments, the failure mode upon limit violation resulted in complete service disruption rather than degraded but continued operation.
Configuration data generated by internal systems received different validation treatment compared to external input. The feature file, being internally produced, operated under assumptions of correctness that proved invalid when upstream data characteristics changed.
The gradual deployment strategy, while generally beneficial for risk reduction, introduced diagnostic complexity through intermittent failure patterns that mimicked external attack signatures rather than internal configuration issues.
|
Planned Remediation Measures
Cloudflare outlined several technical improvements in response to this incident. Configuration file ingestion will receive hardening comparable to external user input processing, including comprehensive validation and bounds checking regardless of data origin.
Additional global feature toggles will enable rapid selective disablement of problematic components without requiring full configuration rollbacks. This provides faster mitigation pathways during similar incidents.
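Conceptually, such a toggle is a guard checked ahead of the risky module, as in the simplified sketch below. A real deployment would distribute the flag through a configuration service with audit and propagation guarantees rather than a process-local atomic; the names here are assumptions.

use std::sync::atomic::{AtomicBool, Ordering};

// Global kill switch for the bot-scoring module.
static BOT_SCORING_ENABLED: AtomicBool = AtomicBool::new(true);

fn handle_request(_path: &str) -> &'static str {
    if BOT_SCORING_ENABLED.load(Ordering::Relaxed) {
        // Normal path: evaluate bot features before serving.
        "served with bot score"
    } else {
        // Kill switch active: skip the failing module entirely instead of
        // rolling back the whole configuration pipeline.
        "served without bot score"
    }
}

fn main() {
    println!("/a -> {}", handle_request("/a"));
    // An operator flips the toggle during an incident.
    BOT_SCORING_ENABLED.store(false, Ordering::Relaxed);
    println!("/b -> {}", handle_request("/b"));
}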
Resource consumption by error reporting and diagnostic systems will receive constraints to prevent these auxiliary systems from overwhelming primary service capacity during high-error-rate conditions.
A comprehensive review of failure modes across core proxy modules will identify similar architectural patterns where error conditions trigger disproportionate impact rather than controlled degradation.
|
Historical Context
This represents the most significant Cloudflare network disruption since July 2, 2019, when a CPU-exhausting regular expression in a WAF rule deployment caused similarly widespread impact. Other recent incidents affected specific subsystems such as dashboard availability or individual features, but core traffic routing remained intact.
The June 2025 outage disrupted newer features while preserving legacy service functionality. The November 2025 incident differed in affecting fundamental traffic processing across the entire network, representing a more severe failure class.
|
Technical Takeaways
The incident demonstrates how reasonable infrastructure improvements can expose latent system assumptions. The database permissions change served legitimate security and operational objectives. The failure emerged not from the change itself but from undocumented dependencies on previous permission models.
Query construction that omits explicit scope constraints operates correctly until environmental conditions change. Database schema visibility, permission models, and access patterns represent environmental factors that may shift independently of application code.
Error handling mechanisms in high-throughput systems require careful consideration of failure modes. Panic-based error handling provides clear failure signals during development but can create cascading failures in production environments when errors occur at scale.
Resource limits serve important architectural purposes but benefit from monitoring proximity to thresholds and testing behavior at boundary conditions. Static limits set with comfortable headroom can become active constraints as system behavior evolves.
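A small sketch of that practice: warn once observed usage crosses a fixed fraction of the static limit, so shrinking headroom becomes visible long before the limit is hit. The limit, fraction, and names below are illustrative assumptions.

// Static limit and the fraction of it at which to start warning.
const FEATURE_LIMIT: usize = 200;
const WARN_FRACTION: f64 = 0.8;

fn check_headroom(in_use: usize) {
    let ratio = in_use as f64 / FEATURE_LIMIT as f64;
    if ratio >= WARN_FRACTION {
        println!("WARN: {in_use}/{FEATURE_LIMIT} features in use ({:.0}% of limit)", ratio * 100.0);
    } else {
        println!("OK: {in_use}/{FEATURE_LIMIT} features in use");
    }
}

fn main() {
    check_headroom(60);   // typical production usage
    check_headroom(170);  // headroom is eroding; alert well before the limit bites
}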
Configuration data validation requirements apply regardless of data source. Internal data generation does not eliminate the need for comprehensive input validation and safe handling of unexpected values or formats.
|
ResearchAudio Technical Analysis
Infrastructure and systems engineering research