Black Friday Operational Playbook

Your guide to managing extreme load and operational stress

Introduction: The Six Black Friday Failure Modes

Black Friday is about extreme load exposing weaknesses in your operational readiness.

Based on 13+ years managing high-traffic events, here are the six issues that will test your team:

  1. Traffic Overload - Load balancers hit limits, site crashes
  2. Scaling Failures - Auto-scaling too slow or too aggressive
  3. Checkout Breakdown - Payment APIs can't handle volume
  4. AI Agent/Feature Overload - AI services slow down, costs explode, or rate limits are hit
  5. Alert Fatigue - On-call team drowns in noise, misses critical alerts
  6. Cyberattacks - DDoS, bot, and fraud attacks timed for when you're most vulnerable

This playbook gives you response plans for each.

Quick Glossary

Technical terms used in this playbook:

CDN (Content Delivery Network)
Network of servers that cache your content closer to users. Protects your actual servers from traffic spikes.
Cache Hit Rate
Percentage of requests served from cache without hitting your servers. Aim for 90%+ during events.
Auto-Scaling
Automatic addition or removal of servers based on traffic/load. Can be too slow or too aggressive if not tuned.
Graceful Degradation
Intentionally disabling non-critical features to keep critical paths (like checkout) working under load.
Rate Limiting
Limiting how many requests a single user/IP can make per minute. Protects against attacks and overload.
Circuit Breaker
Automatically stops sending requests to a failing service. Prevents cascading failures.
DDoS Attack
Distributed Denial of Service - overwhelming your site with fake traffic to take it down.
CAPTCHA
"Prove you're human" challenge to block automated bots from accessing your site.
War Room
Physical or virtual space where your team coordinates during high-traffic events. Central command center.
Runbook
Step-by-step instructions for handling specific scenarios. "When X happens, do Y, then Z."
Customer Journey
The critical path users take through your application (e.g., browse → add to cart → checkout → payment). Protect these paths first during degradation.
SLI/SLO
Service Level Indicator (SLI): Metric measuring service health (e.g., request success rate). Service Level Objective (SLO): Target for that metric (e.g., 99.9% success rate). These define what "working" means for your service.
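
A minimal sketch of what an SLI/SLO check looks like in practice (Python, with hypothetical request counts - plug in your real monitoring data):

```python
# Minimal sketch: compute a request-success SLI and compare it to a 99.9% SLO.
# The request counts below are hypothetical; feed in your real metrics.

def success_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

SLO_TARGET = 0.999  # 99.9% success rate

sli = success_sli(successful_requests=998_700, total_requests=1_000_000)
error_budget = 1.0 - SLO_TARGET            # allowed failure fraction
budget_spent = (1.0 - sli) / error_budget  # 1.0 means the budget is fully spent

print(f"SLI: {sli:.4%}, error budget spent: {budget_spent:.0%}")
if sli < SLO_TARGET:
    print("SLO breach - move to the heavier degradation levels described below")
```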

Challenge 1: Traffic Overload & Downtime

The Problem

Traffic spikes higher than expected. Load balancers reach capacity limits. Site slows down or crashes. Customers can't browse or checkout. Revenue stops.

Prevention (2 Weeks Before)

Understand Your Limits:

Capacity Headroom:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Response Playbook

At 50-60% Capacity:

  1. Communicate proactively to team: "traffic increasing, monitoring closely"
  2. Verify all monitoring is working correctly
  3. Check SLO budgets - are we within acceptable ranges?
  4. Prepare degradation plans and ensure team knows the steps

At 70% Capacity:

  1. Scale load balancers proactively
  2. Verify CDN is taking the load
  3. Alert team and stakeholders: "approaching capacity limits"
  4. Review customer journey SLIs - which paths are most critical?
  5. Prepare Level 1 degradation

At 80-85% Capacity or SLO degradation:

  1. Implement Level 1 degradation to protect critical customer journeys
  2. Disable resource-heavy features
  3. Maximize CDN caching
  4. Scale aggressively if possible
  5. Communicate to customers: "experiencing high traffic"

At 90%+ Capacity or SLO breach:

  1. Implement Level 2 degradation immediately
  2. Protect critical customer journeys at all costs (checkout, payment)
  3. Queue non-critical requests
  4. Show waiting page for browsing - "We're experiencing high traffic, please wait..." (but keep checkout path working)
  5. Executive escalation - major revenue impact
  6. Regular status updates to stakeholders every 15 minutes
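
The thresholds above are easier to act on if they're encoded ahead of time so nobody has to re-derive them under pressure. A minimal sketch in Python - the utilization and SLO inputs are assumed to come from your own monitoring:

```python
# Sketch: map current capacity utilization (and SLO state) to the response
# stage described above. Thresholds mirror the playbook; tune for your system.

def response_level(utilization: float, slo_breached: bool, slo_degraded: bool) -> str:
    """Return which stage of the overload playbook applies."""
    if utilization >= 0.90 or slo_breached:
        return "LEVEL_2"        # minimal experience, protect checkout/payment
    if utilization >= 0.80 or slo_degraded:
        return "LEVEL_1"        # disable heavy features, maximize CDN caching
    if utilization >= 0.70:
        return "PRE_SCALE"      # scale load balancers, warn stakeholders
    if utilization >= 0.50:
        return "WATCH"          # verify monitoring, prep degradation plans
    return "NORMAL"

# Example: 83% utilization with SLOs still healthy -> LEVEL_1
print(response_level(utilization=0.83, slo_breached=False, slo_degraded=False))
```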

Graceful Degradation for Overload

Key Principle: Protect critical customer journeys (browse → cart → checkout → payment) above all else. Degrade non-essential features to keep revenue-generating paths working.

Level 1 - Reduce Load While Preserving Customer Journeys:

Level 2 - Minimal Experience, Critical Paths Only:
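
Both levels are commonly implemented as centrally controlled feature flags that non-critical code paths check before doing expensive work. A minimal sketch - the specific features assigned to each level are illustrative assumptions, not a prescription:

```python
# Sketch: feature flags per degradation level. Which features belong in each
# level is an assumption here - map this to your own non-critical features.

DISABLED_BY_LEVEL = {
    "LEVEL_1": {"recommendations", "live_inventory_badges", "reviews_widget"},
    "LEVEL_2": {"recommendations", "live_inventory_badges", "reviews_widget",
                "search_suggestions", "wishlists"},  # browsing goes to a wait page
}

current_level = "NORMAL"  # updated by the war room or automation

def set_degradation_level(level: str) -> None:
    global current_level
    current_level = level

def is_enabled(feature: str) -> bool:
    """Checkout/payment paths never check flags - they must always work."""
    return feature not in DISABLED_BY_LEVEL.get(current_level, set())

# In a request handler:
set_degradation_level("LEVEL_1")
if is_enabled("recommendations"):
    pass  # render the recommendations carousel
else:
    pass  # render a static fallback block instead
```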

Challenge 2: Scaling Issues

The Problem

Auto-scaling is configured but doesn't work as expected. Either it scales too slowly, leaving you under-provisioned during the spike, or too aggressively, driving up costs and overwhelming downstream dependencies.

Prevention (2 Weeks Before)

Test Your Auto-Scaling:

Common Scaling Mistakes:

Detection (Day Of)

Signs Scaling Is Too Slow:

Signs Scaling Is Too Aggressive:

Response Playbook

If Scaling Too Slow:

  1. Manual scale immediately (don't wait for auto-scaling) - see the sketch after this list
  2. Implement degradation to buy time
  3. Adjust auto-scaling thresholds lower (scale earlier)
  4. Pre-scale more aggressively
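
For the manual-scale step, it helps to have the override command written down before the event. A minimal sketch assuming an AWS Auto Scaling group driven through boto3 - the group name and target capacity are placeholders:

```python
# Sketch: manually override auto-scaling instead of waiting for it.
# Assumes AWS Auto Scaling + boto3; the group name and numbers are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

def manual_scale(group_name: str, desired: int) -> None:
    """Push desired capacity up immediately, ignoring scaling cooldowns."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # act now, don't wait out the cooldown window
    )

# Example: jump the web tier straight to 40 instances.
manual_scale("web-tier-asg", desired=40)
```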

If Scaling Too Aggressive:

  1. Check what's triggering it (wrong metric?)
  2. Implement rate limiting on new capacity
  3. Adjust scaling policies if safe to do so
  4. May need to let it scale and pay the cost

If Scaling Metrics Are Wrong:

  1. Switch to manual scaling immediately
  2. Base decisions on actual bottleneck (DB connections, API limits)
  3. Don't fight auto-scaling during event - fix after

Manual Scaling Decision Framework

Manual scaling means adding servers or capacity by hand instead of waiting for automation. Sometimes you need to override the automation and make the call yourself.

When to Scale Up:

When NOT to Scale:

Challenge 3: Checkout Failures

The Problem

Payment processor APIs can't handle your transaction volume. The result: checkouts time out or fail, carts are abandoned, and revenue is lost at your peak sales moment.

This is often your #1 revenue risk because you can't control the payment processor's capacity.

Prevention (2 Weeks Before)

Understand Payment Capacity:

Payment Architecture:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Response Playbook

If Payment Processor Degrading:

  1. Increase timeout limits (give processor more time)
  2. Implement payment request queuing
  3. Slow down checkout flow slightly
  4. Contact processor for status/ETA
  5. Consider switching to backup processor

If Payment Processor Down:

  1. Queue all payment requests - see the sketch after this list
  2. Allow customers to "complete" their order (process payments later)
  3. Clear customer communication: "Order confirmed, payment processing"
  4. Switch to backup processor if available
  5. Manual payment processing as last resort
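
Queuing payments for later processing can be as simple as persisting the payment intent with an idempotency key and draining the queue once the processor recovers. A minimal sketch - charge_with_processor is a hypothetical stand-in for your real payment client:

```python
# Sketch: queue payment requests when the processor is down, drain them later.
# charge_with_processor() is a hypothetical stand-in for your payment client.
import queue
import uuid

payment_queue: "queue.Queue[dict]" = queue.Queue()

def queue_payment(order_id: str, amount_cents: int, currency: str) -> str:
    """Accept the order now, defer the actual charge."""
    idempotency_key = str(uuid.uuid4())  # prevents double-charging on retries
    payment_queue.put({
        "order_id": order_id,
        "amount_cents": amount_cents,
        "currency": currency,
        "idempotency_key": idempotency_key,
    })
    return idempotency_key

def drain_payments(charge_with_processor) -> None:
    """Run once the processor recovers; re-queue anything that still fails."""
    while not payment_queue.empty():
        payment = payment_queue.get()
        try:
            charge_with_processor(**payment)
        except Exception:
            payment_queue.put(payment)  # keep it for the next drain pass
            break
```

In production the queue should be durable (a database table or message broker), not in-memory, so queued orders survive a restart.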

Communication Templates:

Temporary Issue:
"Payment processing is slower than usual. Please don't refresh - your transaction is being processed."

Major Issue:
"We're experiencing high volume with our payment provider. Your order is saved and we'll process payment within 24 hours. You'll receive confirmation via email."

Don't Do This:

Challenge 4: AI Agent/Feature Overload

The Problem

AI services and features that work fine under normal load suddenly become bottlenecks during traffic spikes: inference latency climbs, costs explode, and provider rate limits start rejecting requests.

AI features can become your most expensive and fragile dependency during high-traffic events.

Prevention (2 Weeks Before)

Understand Your AI Dependencies:

Test AI Under Load:

Prepare Fallbacks:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Performance
  • AI inference latency (p50, p95, p99)
  • Timeout rate per AI feature
  • Queue depth for AI requests
  • Circuit breaker activation rate
Cost & Limits
  • Cost per minute/hour
  • Requests per minute vs. rate limit
  • Token usage rate
  • Rate limit errors
User Impact
  • Customer journey SLIs (with AI vs without)
  • Fallback activation rate
  • Feature availability percentage
  • Cache hit rate for AI responses

Response Playbook

If AI Latency Increasing:

  1. Increase AI response caching aggressively - see the caching sketch after this list
  2. Implement request queuing with lower priority for AI
  3. Switch to faster/simpler models if available
  4. Increase timeout limits slightly (but not too much)
  5. Use pre-computed results for common cases
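
For the caching step, even a short TTL cache keyed on the normalized prompt keeps repeated requests away from the provider. A minimal sketch - call_model is a hypothetical stand-in for your AI client:

```python
# Sketch: aggressive TTL caching for AI responses during the peak.
# call_model() is a hypothetical stand-in for your AI provider client.
import time
import hashlib

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 600  # longer than usual - stale-but-fast beats slow

def cached_ai_response(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # served without touching the provider
    response = call_model(prompt)          # only cache misses reach the AI service
    _cache[key] = (now, response)
    return response
```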

If Hitting Rate Limits:

  1. Activate fallback mechanisms immediately - a circuit-breaker sketch follows this list
  2. Implement aggressive caching
  3. Queue AI requests and process at sustainable rate
  4. Switch to rule-based alternatives
  5. Contact provider for emergency limit increase
  6. Consider disabling non-critical AI features
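
The fallback step pairs naturally with the circuit breaker from the glossary: after repeated failures or rate-limit errors, stop calling the AI service for a cooldown period and serve the rule-based alternative instead. A minimal sketch - call_model and rule_based_fallback are hypothetical placeholders for your own client and fallback:

```python
# Sketch: circuit breaker around an AI call, falling back to a rule-based
# alternative. call_model() and rule_based_fallback() are hypothetical.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = 0.0

    def _is_open(self) -> bool:
        if self.failures < self.failure_threshold:
            return False
        # Stay open until the cooldown expires, then allow a trial request.
        return time.time() - self.opened_at < self.cooldown_seconds

    def call(self, ai_call, fallback):
        if self._is_open():
            return fallback()              # don't even try the AI service
        try:
            result = ai_call()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:                  # timeouts, 429s, provider errors
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()

breaker = CircuitBreaker()
# recommendations = breaker.call(lambda: call_model(prompt), rule_based_fallback)
```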

If Costs Spiking:

  1. Review which AI features are generating most cost
  2. Disable non-critical AI features
  3. Implement stricter caching
  4. Switch to cheaper/simpler models
  5. Set hard cost limits if provider allows
  6. Accept this may be the cost of doing business during peak

Graceful Degradation for AI Features

Key Principle: AI features are usually enhancements, not core functionality. Protect critical customer journeys by degrading AI features first.

Level 1 - Optimize AI Usage:

Level 2 - Disable Non-Critical AI:

Don't Do This:

Challenge 5: Alert Fatigue

The Problem

Your on-call SRE gets 500 alerts during Black Friday. 490 are noise. They miss the 10 critical ones. Or they're so exhausted they stop responding effectively.

Alert fatigue is one of the biggest operational risks during high-traffic events.

Prevention (1 Week Before)

Alert Tuning:

Alert Categories:

CRITICAL - Page immediately:

HIGH - Notify but no page:

INFO - Log only:

Response Playbook

If Team Is Drowning in Alerts:

  1. Emergency alert suppression
  2. Focus only on customer-impacting issues
  3. Batch similar alerts - see the sketch after this list
  4. Assign an alert-triage role (someone filters alerts before they reach on-call)
  5. Accept some alerts will be missed
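
Batching can be done with a simple grouping window: page immediately on CRITICAL, collect everything else by alert name, and send one summary per window. A minimal sketch - the alert fields and the page/log/notify functions are hypothetical placeholders for your alerting stack:

```python
# Sketch: batch similar alerts and only page on CRITICAL ones.
# Alert fields and the page()/log()/notify() functions are hypothetical.
from collections import defaultdict

WINDOW_SECONDS = 300
batches: dict[str, list[dict]] = defaultdict(list)

def handle_alert(alert: dict, page, log) -> None:
    """alert = {'name': ..., 'severity': 'CRITICAL'|'HIGH'|'INFO', 'ts': ...}"""
    if alert["severity"] == "CRITICAL":
        page(alert)                        # customer impact: wake someone up
        return
    fingerprint = alert["name"]            # group repeats of the same alert
    batches[fingerprint].append(alert)
    log(alert)                             # everything else is reviewed in batches

def flush_batches(notify) -> None:
    """Run every WINDOW_SECONDS: one summary notification per alert name."""
    for fingerprint, alerts in batches.items():
        notify(f"{fingerprint}: {len(alerts)} occurrences in the last "
               f"{WINDOW_SECONDS // 60} minutes")
    batches.clear()
```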

Real-Time Alert Management:

Be thoughtful about what you suppress. Don't blindly mute alerts during high-traffic events.

Challenge 6: Cyberthreats

The Problem

Attackers know Black Friday is your most vulnerable time:

Common attacks:

Prevention (2 Weeks Before)

DDoS Protection:

Fraud Prevention:

Bot Management:

Detection (Day Of)

Signs of DDoS Attack:

Signs of Bot Activity:

Signs of Fraud:

Response Playbook

If Under DDoS Attack:

  1. Confirm it's an attack (not just high traffic)
  2. Enable DDoS mitigation at CDN/cloud level
  3. Implement geographic filtering if attack is localized
  4. Rate limit aggressively
  5. Contact security vendor/cloud provider for help
  6. Consider putting site in queue mode to protect backend

If Bot Attack:

  1. Implement CAPTCHA on affected paths
  2. Rate limit by IP/session more aggressively - see the sketch after this list
  3. Block obvious bot patterns
  4. May need to sacrifice some legitimate traffic to stop bots
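
Aggressive per-IP rate limiting usually happens at the CDN or WAF, but the underlying logic is simple. A minimal fixed-window sketch - in production the counters would live in a shared store such as Redis, not application memory:

```python
# Sketch: per-IP fixed-window rate limiting. Tighten the limit during an attack.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_windows: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))

def allow_request(ip: str) -> bool:
    now = time.time()
    window_start, count = _windows[ip]
    if now - window_start >= WINDOW_SECONDS:
        _windows[ip] = (now, 1)            # new window for this IP
        return True
    if count >= MAX_REQUESTS_PER_WINDOW:
        return False                       # over the limit: block or CAPTCHA
    _windows[ip] = (window_start, count + 1)
    return True
```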

If Fraud Spike:

  1. Increase fraud detection sensitivity
  2. Add manual review for suspicious orders
  3. Communicate with payment processor
  4. May need to slow down checkout for verification
  5. Don't process suspicious orders same-day

Critical Decision: DDoS vs. Legitimate Traffic

How to tell the difference:

Legitimate: Gradual ramp, matches user behavior patterns, normal conversion rates

Attack: Sudden spike, unusual sources, low/no conversion, repetitive patterns
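
These signals can be combined into a rough first-pass check. A minimal sketch - the multipliers are illustrative assumptions, not tuned values:

```python
# Sketch: rough heuristic for "attack or legitimate spike?". Thresholds are
# illustrative assumptions - tune against your own baseline traffic.

def looks_like_attack(requests_per_min: float,
                      baseline_requests_per_min: float,
                      conversion_rate: float,
                      baseline_conversion_rate: float) -> bool:
    sudden_spike = requests_per_min > 5 * baseline_requests_per_min
    no_conversions = conversion_rate < 0.2 * baseline_conversion_rate
    # Real shoppers convert; attack traffic mostly doesn't.
    return sudden_spike and no_conversions

# Example: 10x traffic with conversion collapsing -> investigate as an attack
print(looks_like_attack(50_000, 5_000, 0.001, 0.03))
```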

When in doubt:

The War Room: Bringing It All Together

War Room Roles

These are the ideal roles for a war room. For smaller teams, one person can cover multiple roles - but be mindful of capacity and stress. See guidance below on role consolidation and rotation.

Incident Commander
  • Makes final decisions
  • Manages communication
  • Coordinates team
Technical Lead
  • Monitors systems
  • Executes technical responses
  • Advises on trade-offs
Business Liaison
  • Communicates to stakeholders
  • Understands revenue impact
  • Approves degradation decisions
Customer Communication
  • Manages status page
  • Coordinates with support
  • Handles social media
Security Monitor
  • Watches for attacks
  • Manages security tools
  • Escalates threats

Role Consolidation & Team Capacity

For smaller teams: It's okay to consolidate roles. Common combinations:

Critical: Watch for overload. Black Friday can last 24-48 hours. People under stress make mistakes.

Plan for rotation:

Warning signs someone needs rotation:

Better to rotate early than wait until someone is completely drained.

Decision Framework Under Pressure

When something goes wrong during Black Friday, use this framework:

1. Assess Impact

2. Identify Root Cause Quickly

3. Choose Response

4. Communicate

5. Document

Pre-Event Checklist

2 Weeks Before:

1 Week Before:

48 Hours Before:

Day Of:

Post-Event Review

Key Questions

Traffic Overload:

Scaling:

Checkout:

Alerts:

Security: