What Happened: On November 18, 2025, a routine database change at Cloudflare caused their Bot Management feature file to exceed a size limit. Their core proxy crashed globally. X, ChatGPT, Spotify, Discord, and thousands of other services went down for up to 6 hours.
This worksheet focuses on CDN resilience specifically — practical steps you can take without expensive multi-CDN setups. It's a starting point, not a complete resilience strategy.
Let's address the hot takes first.
After every major outage, the same advice floods LinkedIn: "You HAVE to go multi-CDN now!" Same as after an AWS region goes down: "You HAVE to go multi-region!" Or multi-cloud.
These are expensive, complex decisions. Multi-CDN typically adds 30-50% to your CDN spend plus significant operational overhead — keeping configs in sync, managing certificates across providers, testing failover. For most companies, it's overkill.
You can build meaningful CDN resilience with much lower investment. That's what this worksheet covers.
1. Know What You're Running Through Your CDN
Before you can improve resilience, you need to know what breaks when your CDN goes down.
Checklist:
- I know which domains route through my CDN (the sketch after the table below can help)
- I know which services depend on CDN availability (not just caching)
- I know if my CDN is also handling DNS
- I understand what happens to traffic if I "grey-cloud" or bypass the CDN
Common things you might lose if you bypass your CDN:
- WAF rules — exposed to attacks you were blocking
- Caching — origin takes full load, latency increases
- DDoS protection — origin IP exposed, no scrubbing
- Rate limiting — API abuse, brute force attempts
- Bot management — scrapers, credential stuffing get through
- Origin IP concealment — once exposed, attackers can hit your origin directly, long after the outage ends
| Domain | What It Serves | What You Lose If Bypassed | Can Bypass? |
|---|---|---|---|
| www.example.com | Main website | WAF, caching, DDoS protection | Yes, but exposed to attacks + slow |
| api.example.com | Customer API | Rate limiting, origin IP hidden | Yes, if origin can handle load |
|  |  |  |  |
|  |  |  |  |
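A quick way to seed the Domain column above: check which of your hostnames currently resolve into your CDN provider's published IP ranges. A minimal sketch, assuming Cloudflare as the provider (it publishes its IPv4 proxy ranges at a public URL); the domain list is a placeholder for your own inventory.

```python
# Which of these hostnames resolve into Cloudflare's proxy ranges?
import ipaddress
import socket
import urllib.request

# Cloudflare publishes its proxy IPv4 ranges here.
CF_RANGES_URL = "https://www.cloudflare.com/ips-v4"

# Placeholder: substitute your own domain inventory.
DOMAINS = ["www.example.com", "api.example.com"]

def load_ranges(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return [ipaddress.ip_network(line.strip())
                for line in resp.read().decode().splitlines() if line.strip()]

def behind_cloudflare(domain, ranges):
    # gethostbyname returns a single IPv4 address; CNAME chains to other
    # CDN hostnames won't show up here and are worth checking by hand.
    ip = ipaddress.ip_address(socket.gethostbyname(domain))
    return any(ip in net for net in ranges)

ranges = load_ranges(CF_RANGES_URL)
for domain in DOMAINS:
    label = "proxied via Cloudflare" if behind_cloudflare(domain, ranges) else "direct / other"
    print(f"{domain}: {label}")
```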
2. Don't Get Locked Out of Your Own CDN
During the Cloudflare outage, the dashboard login used Turnstile (their CAPTCHA). Turnstile was down. So the GUI login was completely blocked — no amount of clicking would help.
The API was intermittently failing, not completely down. This is a crucial difference: API calls can be retried automatically. IaC tools like Terraform handle retries. You can script "keep trying until it works." With GUI login blocked, you had zero chance. With API access, you at least had a fighting chance.
Even if the API is flaky, submitting changes programmatically beats clicking a login button that's completely broken.
Checklist:
- CDN configuration is managed via Infrastructure as Code (Terraform, Pulumi)
- API tokens exist and are stored outside the CDN provider
- API tokens are valid (not expired) and have sufficient permissions
- Team can attempt emergency changes without relying on the dashboard
- Configuration is documented and exportable — ready to deploy elsewhere if needed
Example emergency actions you should be able to attempt via API/IaC:
- Bypass CDN proxy (grey-cloud DNS records; sketched below)
- Disable problematic security rules
- Purge cache
- Update origin settings
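As a concrete example of the first action, here's a minimal sketch of grey-clouding a record through Cloudflare's v4 API with blind retries — the "keep trying until it works" approach from above. The zone ID, record ID, and `CF_API_TOKEN` environment variable are placeholders for your own values.

```python
# Grey-cloud a DNS record via the Cloudflare v4 API, retrying through
# intermittent failures.
import json
import os
import time
import urllib.request

ZONE_ID = "your-zone-id"            # placeholder
RECORD_ID = "your-record-id"        # placeholder
TOKEN = os.environ["CF_API_TOKEN"]  # stored outside the provider, per the checklist

URL = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}"

def grey_cloud(max_attempts=30, delay=10.0):
    body = json.dumps({"proxied": False}).encode()
    for attempt in range(1, max_attempts + 1):
        req = urllib.request.Request(URL, data=body, method="PATCH", headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        })
        try:
            with urllib.request.urlopen(req, timeout=15) as resp:
                print(f"attempt {attempt}: HTTP {resp.status}, record unproxied")
                return
        except Exception as exc:
            # The API is flaky, not gone: log and keep trying.
            print(f"attempt {attempt} failed: {exc}; retrying in {delay}s")
            time.sleep(delay)
    raise SystemExit("gave up: the API never accepted the change")

grey_cloud()
```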
3. Know When It's Them, Not You
Teams lost 20+ minutes during the Cloudflare outage debating whether the problem was their code or their CDN. Set up monitoring that answers this instantly.
Checklist:
- Synthetic monitoring is set up
- Monitoring runs from outside my infrastructure (not through the CDN)
- Monitoring can hit my origin directly
- I can tell the difference between "origin down" and "CDN down" within minutes
The test: If your origin returns 200 OK but users see 500 errors, you know instantly it's a CDN/vendor issue.
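A minimal sketch of that test: probe the same health endpoint through the CDN and directly at the origin, then compare. The URLs and origin IP are placeholders, and this should run from outside your own infrastructure.

```python
# Probe the same endpoint two ways: through the CDN, and directly at the origin.
import ssl
import urllib.error
import urllib.request

CDN_URL = "https://www.example.com/healthz"  # placeholder, goes through the CDN
ORIGIN_IP = "203.0.113.10"                   # placeholder origin address

# The origin cert won't match a bare IP, so skip verification for this
# diagnostic probe only. The Host header still routes to the right vhost.
INSECURE = ssl.create_default_context()
INSECURE.check_hostname = False
INSECURE.verify_mode = ssl.CERT_NONE

def probe(url, host=None, ctx=None):
    """Return the HTTP status code, or -1 for a connection-level failure."""
    req = urllib.request.Request(url, headers={"Host": host} if host else {})
    try:
        with urllib.request.urlopen(req, timeout=10, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except Exception:
        return -1

cdn = probe(CDN_URL)
origin = probe(f"https://{ORIGIN_IP}/healthz", host="www.example.com", ctx=INSECURE)

if origin == 200 and cdn != 200:
    print(f"Likely CDN/vendor issue: origin {origin}, via CDN {cdn}")
elif origin != 200:
    print(f"Likely our problem: origin {origin}")
else:
    print("Both paths healthy")
```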
4. Have a CDN Outage Plan
Not a generic incident runbook. A specific plan for "my CDN is down."
Your plan should answer:
- How do we confirm the CDN is the problem?
- Who can make DNS/CDN changes?
- Do we bypass the CDN or wait it out?
- What's the security exposure if we bypass? (DDoS, WAF rules gone)
- How do we communicate to customers?
- Where is our status page hosted? (Not behind the same CDN, hopefully)
Key decision to make in advance:
If your CDN goes down, do you bypass it and accept the security exposure, or do you wait for recovery?
This depends on your traffic, threat model, and how long you can tolerate downtime. Decide now, not during the incident.
5. Build a Cheap Static Failover
If full multi-CDN isn't justified, at minimum have a way to communicate during an outage — or a bypass ready to deploy.
Real example: During this outage, Resend (an email API company) built a CloudFront bypass while Cloudflare was down. They didn't end up deploying it because Cloudflare recovered first, but they now have a runbook to switch to the fallback within 60 seconds. That's the right approach — have it ready, decide in the moment whether to use it.
Checklist:
- DNS is managed separately from CDN (e.g., Route53, not Cloudflare DNS)
- Static status page exists on separate infrastructure (S3, GitHub Pages, Netlify)
- DNS record ready to point status.yourdomain.com to the failover (see the sketch after this checklist)
- Process for activating failover is documented
- Failover has been tested in the last 6 months
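A minimal sketch of the DNS flip, assuming your DNS lives in Route53 (per the first item above) and boto3 is configured. The hosted zone ID and failover target are placeholders.

```python
# Repoint the status page at a static failover site hosted elsewhere.
import boto3

HOSTED_ZONE_ID = "Z0123456789ABC"       # placeholder
RECORD_NAME = "status.yourdomain.com."
FAILOVER_TARGET = "yourorg.github.io."  # e.g. a GitHub Pages status page

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "CDN outage: switch status page to static failover",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,  # low TTL so the switch takes effect quickly
                "ResourceRecords": [{"Value": FAILOVER_TARGET}],
            },
        }],
    },
)
print(f"{RECORD_NAME} -> {FAILOVER_TARGET}")
```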
You won't be fully operational, but you'll be communicative. That matters more than most teams realize.
6. When Multi-CDN Actually Makes Sense
Multi-CDN isn't always overkill. But it's a serious investment. Be honest about whether you need it — and whether you can actually pull it off.
Why multi-CDN is harder than it sounds
It's not just "add another CDN." Here's what you're signing up for:
- Configuration management — Each CDN has unique APIs and settings (caching rules, security policies, SSL certs). Replicating config consistently requires significant effort or specialized middleware.
- Feature parity — Different CDNs offer different features. Getting them to behave the same can require custom code or edge computing to bridge gaps.
- Traffic steering — Basic DNS failover is simple (a sketch follows this list). Intelligent steering based on real-user monitoring (RUM) and real-time performance data requires specialized platforms and expertise.
- Observability — Consolidating logs, metrics, and security events from multiple CDNs is painful. Data formats aren't uniform.
- Content sync — Ensuring content is simultaneously available across all CDNs requires automated workflows for consistency and version control.
- Ongoing expertise — Managing contracts, troubleshooting across vendors, continuous optimization. This isn't a one-time setup.
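For a sense of scale, the simple end of traffic steering, plain DNS failover with a health check, looks like this in Route53 terms (a boto3 sketch; the zone ID and CDN hostnames are placeholders). Everything smarter than this, RUM-based or performance-based steering, is where the specialized platforms come in.

```python
# Plain DNS failover in Route53: a health check on the primary CDN path,
# plus PRIMARY/SECONDARY records.
import boto3

HOSTED_ZONE_ID = "Z0123456789ABC"  # placeholder

route53 = boto3.client("route53")

# Health check against the site as served by the primary CDN.
hc = route53.create_health_check(
    CallerReference="primary-cdn-check-001",  # must be unique per new check
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "www.example.com",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# When the health check fails, Route53 answers with the SECONDARY record.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": hc["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "www.example.com.cdn-a.example."}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "www.example.com.", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "www.example.com.cdn-b.example."}],
        }},
    ]},
)
```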
Multi-CDN probably makes sense if:
- You have strict SLAs with financial penalties for downtime
- Revenue loss per hour significantly exceeds multi-CDN costs
- Your competitors stayed up during this outage (and that matters to your customers)
- You have the engineering capacity to maintain two CDN configurations
Multi-CDN is probably overkill if:
- Outages of a few hours are tolerable for your business
- Your customers also went down (because they use the same CDN)
- You don't have the team to manage the complexity above
- You haven't even done the basics in this worksheet yet
Check your competitors: Did they stay up during the Cloudflare outage? If yes, they might have multi-CDN and you're behind. If no, the market accepted the risk.
A Reality Check
When giants like Cloudflare go down, there's often not much you can do to fix it. You wait.
But preparation isn't about preventing their outages — it's about staying in control. Knowing what's affected. Communicating clearly. Taking what actions you can instead of sitting blind.
The steps in this worksheet won't prevent CDN outages. But they'll help you respond better when they happen — without the expensive overreaction.
This Worksheet Covers CDN Only
Many companies use Cloudflare for much more than CDN — DNS, Workers, KV, Access, WAF, bot management. Each adds complexity and different failure modes.
A proper resilience assessment looks at your full architecture, dependencies, and business context. That's a different conversation.
Visit incidentist.io