Black Friday Operational Playbook

Your guide to managing extreme load and operational stress

Introduction: The Six Black Friday Failure Modes

Black Friday is about extreme load exposing weaknesses in your operational readiness.

Based on 13+ years managing high-traffic events, here are the six issues that will test your team:

  1. Traffic Overload - Load balancers hit limits, site crashes
  2. Scaling Failures - Auto-scaling too slow or too aggressive
  3. Checkout Breakdown - Payment APIs can't handle volume
  4. AI Agent/Feature Overload - AI services slow down, costs explode, or rate limits are hit
  5. Alert Fatigue - On-call team drowns in noise, misses critical alerts
  6. Cyberattacks - DDoS, bot, and fraud attacks timed for when you're most vulnerable

This playbook gives you response plans for each.

Quick Glossary

Technical terms used in this playbook:

CDN (Content Delivery Network)
Network of servers that cache your content closer to users. Protects your actual servers from traffic spikes.
Cache Hit Rate
Percentage of requests served from cache without hitting your servers. Aim for 90%+ during events.
Auto-Scaling
Automatic addition or removal of servers based on traffic/load. Can be too slow or too aggressive if not tuned.
Graceful Degradation
Intentionally disabling non-critical features to keep critical paths (like checkout) working under load.
Rate Limiting
Limiting how many requests a single user/IP can make per minute. Protects against attacks and overload.
Circuit Breaker
Automatically stops sending requests to a failing service. Prevents cascading failures.
DDoS Attack
Distributed Denial of Service - overwhelming your site with fake traffic to take it down.
CAPTCHA
"Prove you're human" challenge to block automated bots from accessing your site.
War Room
Physical or virtual space where your team coordinates during high-traffic events. Central command center.
Runbook
Step-by-step instructions for handling specific scenarios. "When X happens, do Y, then Z."
Customer Journey
The critical path users take through your application (e.g., browse → add to cart → checkout → payment). Protect these paths first during degradation.
SLI/SLO
Service Level Indicator (SLI): Metric measuring service health (e.g., request success rate). Service Level Objective (SLO): Target for that metric (e.g., 99.9% success rate). These define what "working" means for your service.
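
A minimal sketch of what an SLI/SLO check looks like in practice (Python, with hypothetical request counts - plug in your real monitoring data):

```python
# Minimal sketch: compute a request-success SLI and compare it to a 99.9% SLO.
# The request counts below are hypothetical; feed in your real metrics.

def success_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

SLO_TARGET = 0.999  # 99.9% success rate

sli = success_sli(successful_requests=998_700, total_requests=1_000_000)
error_budget = 1.0 - SLO_TARGET            # allowed failure fraction
budget_spent = (1.0 - sli) / error_budget  # 1.0 means the budget is fully spent

print(f"SLI: {sli:.4%}, error budget spent: {budget_spent:.0%}")
if sli < SLO_TARGET:
    print("SLO breach - move to the heavier degradation levels described below")
```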

Challenge 1: Traffic Overload & Downtime

The Problem

Traffic spikes higher than expected. Load balancers reach capacity limits. Site slows down or crashes. Customers can't browse or checkout. Revenue stops.

Prevention (2 Weeks Before)

Understand Your Limits:

Capacity Headroom:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Response Playbook

At 50-60% Capacity:

  1. Communicate proactively to team: "traffic increasing, monitoring closely"
  2. Verify all monitoring is working correctly
  3. Check SLO budgets - are we within acceptable ranges?
  4. Prepare degradation plans and ensure team knows the steps

At 70% Capacity:

  1. Scale load balancers proactively
  2. Verify CDN is taking the load
  3. Alert team and stakeholders: "approaching capacity limits"
  4. Review customer journey SLIs - which paths are most critical?
  5. Prepare Level 1 degradation

At 80-85% Capacity or SLO degradation:

  1. Implement Level 1 degradation to protect critical customer journeys
  2. Disable resource-heavy features
  3. Maximize CDN caching
  4. Scale aggressively if possible
  5. Communicate to customers: "experiencing high traffic"

At 90%+ Capacity or SLO breach:

  1. Implement Level 2 degradation immediately
  2. Protect critical customer journeys at all costs (checkout, payment)
  3. Queue non-critical requests
  4. Show waiting page for browsing - "We're experiencing high traffic, please wait..." (but keep checkout path working)
  5. Executive escalation - major revenue impact
  6. Regular status updates to stakeholders every 15 minutes
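
The thresholds above are easier to act on if they're encoded ahead of time so nobody has to re-derive them under pressure. A minimal sketch in Python - the utilization and SLO inputs are assumed to come from your own monitoring:

```python
# Sketch: map current capacity utilization (and SLO state) to the response
# stage described above. Thresholds mirror the playbook; tune for your system.

def response_level(utilization: float, slo_breached: bool, slo_degraded: bool) -> str:
    """Return which stage of the overload playbook applies."""
    if utilization >= 0.90 or slo_breached:
        return "LEVEL_2"        # minimal experience, protect checkout/payment
    if utilization >= 0.80 or slo_degraded:
        return "LEVEL_1"        # disable heavy features, maximize CDN caching
    if utilization >= 0.70:
        return "PRE_SCALE"      # scale load balancers, warn stakeholders
    if utilization >= 0.50:
        return "WATCH"          # verify monitoring, prep degradation plans
    return "NORMAL"

# Example: 83% utilization with SLOs still healthy -> LEVEL_1
print(response_level(utilization=0.83, slo_breached=False, slo_degraded=False))
```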

Graceful Degradation for Overload

Key Principle: Protect critical customer journeys (browse → cart → checkout → payment) above all else. Degrade non-essential features to keep revenue-generating paths working.

Level 1 - Reduce Load While Preserving Customer Journeys:

Level 2 - Minimal Experience, Critical Paths Only:
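
Both levels are commonly implemented as centrally controlled feature flags that non-critical code paths check before doing expensive work. A minimal sketch - the specific features assigned to each level are illustrative assumptions, not a prescription:

```python
# Sketch: feature flags per degradation level. Which features belong in each
# level is an assumption here - map this to your own non-critical features.

DISABLED_BY_LEVEL = {
    "LEVEL_1": {"recommendations", "live_inventory_badges", "reviews_widget"},
    "LEVEL_2": {"recommendations", "live_inventory_badges", "reviews_widget",
                "search_suggestions", "wishlists"},  # browsing goes to a wait page
}

current_level = "NORMAL"  # updated by the war room or automation

def set_degradation_level(level: str) -> None:
    global current_level
    current_level = level

def is_enabled(feature: str) -> bool:
    """Checkout/payment paths never check flags - they must always work."""
    return feature not in DISABLED_BY_LEVEL.get(current_level, set())

# In a request handler:
set_degradation_level("LEVEL_1")
if is_enabled("recommendations"):
    pass  # render the recommendations carousel
else:
    pass  # render a static fallback block instead
```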

Challenge 2: Scaling Issues

The Problem

Auto-scaling is configured but doesn't work as expected. Either it scales too slowly, leaving you under-provisioned during the spike, or too aggressively, driving up costs and overwhelming downstream dependencies.

Prevention (2 Weeks Before)

Test Your Auto-Scaling:

Common Scaling Mistakes:

Detection (Day Of)

Signs Scaling Is Too Slow:

Signs Scaling Is Too Aggressive:

Response Playbook

If Scaling Too Slow:

  1. Manual scale immediately (don't wait for auto-scaling) - see the sketch after this list
  2. Implement degradation to buy time
  3. Adjust auto-scaling thresholds lower (scale earlier)
  4. Pre-scale more aggressively
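
For the manual-scale step, it helps to have the override command written down before the event. A minimal sketch assuming an AWS Auto Scaling group driven through boto3 - the group name and target capacity are placeholders:

```python
# Sketch: manually override auto-scaling instead of waiting for it.
# Assumes AWS Auto Scaling + boto3; the group name and numbers are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

def manual_scale(group_name: str, desired: int) -> None:
    """Push desired capacity up immediately, ignoring scaling cooldowns."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # act now, don't wait out the cooldown window
    )

# Example: jump the web tier straight to 40 instances.
manual_scale("web-tier-asg", desired=40)
```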

If Scaling Too Aggressive:

  1. Check what's triggering it (wrong metric?)
  2. Implement rate limiting on new capacity
  3. Adjust scaling policies if safe to do so
  4. May need to let it scale and pay the cost

If Scaling Metrics Are Wrong:

  1. Switch to manual scaling immediately
  2. Base decisions on actual bottleneck (DB connections, API limits)
  3. Don't fight auto-scaling during event - fix after

Manual Scaling Decision Framework

Manual scaling means adding servers or capacity by hand instead of waiting for automation. Sometimes you need to override the automation and make the call yourself.

When to Scale Up:

When NOT to Scale:

Challenge 3: Checkout Failures

The Problem

Payment processor APIs can't handle your transaction volume. The result: checkouts time out or fail, carts are abandoned, and revenue is lost at your peak sales moment.

This is often your #1 revenue risk because you can't control the payment processor's capacity.

Prevention (2 Weeks Before)

Understand Payment Capacity:

Payment Architecture:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Response Playbook

If Payment Processor Degrading:

  1. Increase timeout limits (give processor more time)
  2. Implement payment request queuing
  3. Slow down checkout flow slightly
  4. Contact processor for status/ETA
  5. Consider switching to backup processor

If Payment Processor Down:

  1. Queue all payment requests - see the sketch after this list
  2. Allow customers to "complete" their order (process payments later)
  3. Clear customer communication: "Order confirmed, payment processing"
  4. Switch to backup processor if available
  5. Manual payment processing as last resort
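
Queuing payments for later processing can be as simple as persisting the payment intent with an idempotency key and draining the queue once the processor recovers. A minimal sketch - charge_with_processor is a hypothetical stand-in for your real payment client:

```python
# Sketch: queue payment requests when the processor is down, drain them later.
# charge_with_processor() is a hypothetical stand-in for your payment client.
import queue
import uuid

payment_queue: "queue.Queue[dict]" = queue.Queue()

def queue_payment(order_id: str, amount_cents: int, currency: str) -> str:
    """Accept the order now, defer the actual charge."""
    idempotency_key = str(uuid.uuid4())  # prevents double-charging on retries
    payment_queue.put({
        "order_id": order_id,
        "amount_cents": amount_cents,
        "currency": currency,
        "idempotency_key": idempotency_key,
    })
    return idempotency_key

def drain_payments(charge_with_processor) -> None:
    """Run once the processor recovers; re-queue anything that still fails."""
    while not payment_queue.empty():
        payment = payment_queue.get()
        try:
            charge_with_processor(**payment)
        except Exception:
            payment_queue.put(payment)  # keep it for the next drain pass
            break
```

In production the queue should be durable (a database table or message broker), not in-memory, so queued orders survive a restart.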

Communication Templates:

Temporary Issue:
"Payment processing is slower than usual. Please don't refresh - your transaction is being processed."

Major Issue:
"We're experiencing high volume with our payment provider. Your order is saved and we'll process payment within 24 hours. You'll receive confirmation via email."

Don't Do This:

Challenge 4: AI Agent/Feature Overload

The Problem

AI services and features that work fine under normal load suddenly become bottlenecks during traffic spikes: inference latency climbs, costs explode, and provider rate limits start rejecting requests.

AI features can become your most expensive and fragile dependency during high-traffic events.

Prevention (2 Weeks Before)

Understand Your AI Dependencies:

Test AI Under Load:

Prepare Fallbacks:

Detection (Day Of)

Warning Signs:

Monitor These Metrics:

Performance
  • AI inference latency (p50, p95, p99)
  • Timeout rate per AI feature
  • Queue depth for AI requests
  • Circuit breaker activation rate
Cost & Limits
  • Cost per minute/hour
  • Requests per minute vs. rate limit
  • Token usage rate
  • Rate limit errors
User Impact
  • Customer journey SLIs (with AI vs without)
  • Fallback activation rate
  • Feature availability percentage
  • Cache hit rate for AI responses

Response Playbook

If AI Latency Increasing:

  1. Increase AI response caching aggressively - see the caching sketch after this list
  2. Implement request queuing with lower priority for AI
  3. Switch to faster/simpler models if available
  4. Increase timeout limits slightly (but not too much)
  5. Use pre-computed results for common cases
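
For the caching step, even a short TTL cache keyed on the normalized prompt keeps repeated requests away from the provider. A minimal sketch - call_model is a hypothetical stand-in for your AI client:

```python
# Sketch: aggressive TTL caching for AI responses during the peak.
# call_model() is a hypothetical stand-in for your AI provider client.
import time
import hashlib

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 600  # longer than usual - stale-but-fast beats slow

def cached_ai_response(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # served without touching the provider
    response = call_model(prompt)          # only cache misses reach the AI service
    _cache[key] = (now, response)
    return response
```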

If Hitting Rate Limits:

  1. Activate fallback mechanisms immediately - a circuit-breaker sketch follows this list
  2. Implement aggressive caching
  3. Queue AI requests and process at sustainable rate
  4. Switch to rule-based alternatives
  5. Contact provider for emergency limit increase
  6. Consider disabling non-critical AI features
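
The fallback step pairs naturally with the circuit breaker from the glossary: after repeated failures or rate-limit errors, stop calling the AI service for a cooldown period and serve the rule-based alternative instead. A minimal sketch - call_model and rule_based_fallback are hypothetical placeholders for your own client and fallback:

```python
# Sketch: circuit breaker around an AI call, falling back to a rule-based
# alternative. call_model() and rule_based_fallback() are hypothetical.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = 0.0

    def _is_open(self) -> bool:
        if self.failures < self.failure_threshold:
            return False
        # Stay open until the cooldown expires, then allow a trial request.
        return time.time() - self.opened_at < self.cooldown_seconds

    def call(self, ai_call, fallback):
        if self._is_open():
            return fallback()              # don't even try the AI service
        try:
            result = ai_call()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:                  # timeouts, 429s, provider errors
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()

breaker = CircuitBreaker()
# recommendations = breaker.call(lambda: call_model(prompt), rule_based_fallback)
```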

If Costs Spiking:

  1. Review which AI features are generating most cost
  2. Disable non-critical AI features
  3. Implement stricter caching
  4. Switch to cheaper/simpler models
  5. Set hard cost limits if provider allows
  6. Accept this may be the cost of doing business during peak

Graceful Degradation for AI Features

Key Principle: AI features are usually enhancements, not core functionality. Protect critical customer journeys by degrading AI features first.

Level 1 - Optimize AI Usage:

Level 2 - Disable Non-Critical AI:

Don't Do This:

Challenge 5: Alert Fatigue

The Problem

Your on-call SRE gets 500 alerts during Black Friday. 490 are noise. They miss the 10 critical ones. Or they're so exhausted they stop responding effectively.

Alert fatigue is one of the biggest operational risks during high-traffic events.

Prevention (1 Week Before)

Alert Tuning:

Alert Categories:

CRITICAL - Page immediately:

HIGH - Notify but no page:

INFO - Log only:

Response Playbook

If Team Is Drowning in Alerts:

  1. Emergency alert suppression
  2. Focus only on customer-impacting issues
  3. Batch similar alerts - see the sketch after this list
  4. Assign an alert-triage role (someone filters alerts before they reach on-call)
  5. Accept some alerts will be missed
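
Batching can be done with a simple grouping window: page immediately on CRITICAL, collect everything else by alert name, and send one summary per window. A minimal sketch - the alert fields and the page/log/notify functions are hypothetical placeholders for your alerting stack:

```python
# Sketch: batch similar alerts and only page on CRITICAL ones.
# Alert fields and the page()/log()/notify() functions are hypothetical.
from collections import defaultdict

WINDOW_SECONDS = 300
batches: dict[str, list[dict]] = defaultdict(list)

def handle_alert(alert: dict, page, log) -> None:
    """alert = {'name': ..., 'severity': 'CRITICAL'|'HIGH'|'INFO', 'ts': ...}"""
    if alert["severity"] == "CRITICAL":
        page(alert)                        # customer impact: wake someone up
        return
    fingerprint = alert["name"]            # group repeats of the same alert
    batches[fingerprint].append(alert)
    log(alert)                             # everything else is reviewed in batches

def flush_batches(notify) -> None:
    """Run every WINDOW_SECONDS: one summary notification per alert name."""
    for fingerprint, alerts in batches.items():
        notify(f"{fingerprint}: {len(alerts)} occurrences in the last "
               f"{WINDOW_SECONDS // 60} minutes")
    batches.clear()
```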

Real-Time Alert Management:

Be thoughtful about what you suppress. Don't blindly mute alerts during high-traffic events.

Challenge 6: Cyberthreats

The Problem

Attackers know Black Friday is your most vulnerable time:

Common attacks:

Prevention (2 Weeks Before)

DDoS Protection:

Fraud Prevention:

Bot Management:

Detection (Day Of)

Signs of DDoS Attack:

Signs of Bot Activity:

Signs of Fraud:

Response Playbook

If Under DDoS Attack:

  1. Confirm it's an attack (not just high traffic)
  2. Enable DDoS mitigation at CDN/cloud level
  3. Implement geographic filtering if attack is localized
  4. Rate limit aggressively
  5. Contact security vendor/cloud provider for help
  6. Consider putting site in queue mode to protect backend

If Bot Attack:

  1. Implement CAPTCHA on affected paths
  2. Rate limit by IP/session more aggressively - see the sketch after this list
  3. Block obvious bot patterns
  4. May need to sacrifice some legitimate traffic to stop bots
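
Aggressive per-IP rate limiting usually happens at the CDN or WAF, but the underlying logic is simple. A minimal fixed-window sketch - in production the counters would live in a shared store such as Redis, not application memory:

```python
# Sketch: per-IP fixed-window rate limiting. Tighten the limit during an attack.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_windows: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))

def allow_request(ip: str) -> bool:
    now = time.time()
    window_start, count = _windows[ip]
    if now - window_start >= WINDOW_SECONDS:
        _windows[ip] = (now, 1)            # new window for this IP
        return True
    if count >= MAX_REQUESTS_PER_WINDOW:
        return False                       # over the limit: block or CAPTCHA
    _windows[ip] = (window_start, count + 1)
    return True
```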

If Fraud Spike:

  1. Increase fraud detection sensitivity
  2. Add manual review for suspicious orders
  3. Communicate with payment processor
  4. May need to slow down checkout for verification
  5. Don't process suspicious orders same-day

Critical Decision: DDoS vs. Legitimate Traffic

How to tell the difference:

Legitimate: Gradual ramp, matches user behavior patterns, normal conversion rates

Attack: Sudden spike, unusual sources, low/no conversion, repetitive patterns
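
These signals can be combined into a rough first-pass check. A minimal sketch - the multipliers are illustrative assumptions, not tuned values:

```python
# Sketch: rough heuristic for "attack or legitimate spike?". Thresholds are
# illustrative assumptions - tune against your own baseline traffic.

def looks_like_attack(requests_per_min: float,
                      baseline_requests_per_min: float,
                      conversion_rate: float,
                      baseline_conversion_rate: float) -> bool:
    sudden_spike = requests_per_min > 5 * baseline_requests_per_min
    no_conversions = conversion_rate < 0.2 * baseline_conversion_rate
    # Real shoppers convert; attack traffic mostly doesn't.
    return sudden_spike and no_conversions

# Example: 10x traffic with conversion collapsing -> investigate as an attack
print(looks_like_attack(50_000, 5_000, 0.001, 0.03))
```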

When in doubt:

The War Room: Bringing It All Together

War Room Roles

These are the ideal roles for a war room. For smaller teams, one person can cover multiple roles - but be mindful of capacity and stress. See guidance below on role consolidation and rotation.

Incident Commander
  • Makes final decisions
  • Manages communication
  • Coordinates team
Technical Lead
  • Monitors systems
  • Executes technical responses
  • Advises on trade-offs
Business Liaison
  • Communicates to stakeholders
  • Understands revenue impact
  • Approves degradation decisions
Customer Communication
  • Manages status page
  • Coordinates with support
  • Handles social media
Security Monitor
  • Watches for attacks
  • Manages security tools
  • Escalates threats

Role Consolidation & Team Capacity

For smaller teams: It's okay to consolidate roles. Common combinations:

Critical: Watch for overload. Black Friday can last 24-48 hours. People under stress make mistakes.

Plan for rotation:

Warning signs someone needs rotation:

Better to rotate early than wait until someone is completely drained.

Decision Framework Under Pressure

When something goes wrong during Black Friday, use this framework:

1. Assess Impact

2. Identify Root Cause Quickly

3. Choose Response

4. Communicate

5. Document

Pre-Event Checklist

2 Weeks Before:

1 Week Before:

48 Hours Before:

Day Of:

Post-Event Review

Key Questions

Traffic Overload:

Scaling:

Checkout:

Alerts:

Security: