Troubleshooting Large-Scale Platform Outages: A Runbook for On-Call Teams
2026-02-15

A concise on-call runbook for upstream outages (Cloudflare/AWS/X): diagnostics, mitigation commands, escalation paths, and customer comms templates.

When an upstream outage takes your platform offline, your on-call team must act fast and decisively.

Large-scale outages that originate at upstream providers like Cloudflare, AWS, or a major third-party API (e.g., X) are among the most heart-stopping incidents an on-call team faces. They're noisy, externally caused, and often require cross-organization coordination while you try to keep customers informed and systems stable. In 2026 we've seen a rise in correlated failures across edge/CDN providers and cloud regions, which makes a concise, well-practiced runbook for upstream outages a must-have for every SRE and platform on-call roster.

Quick summary (inverted pyramid)

  • Detect: Confirm outage source (upstream vs. your origin).
  • Contain: Apply immediate mitigations (bypass the CDN, switch DNS, route traffic directly to origin, degrade gracefully).
  • Escalate: Follow the provider-specific escalation matrix and internal chain of command.
  • Communicate: Publish status page and customer messages with cadence templates below.
  • Recover & Review: Revert temporary mitigations, run RCA, and automate fixes.

Why this matters in 2026

Recent events in early 2026 — including spikes of outage reports tied to Cloudflare and downstream platforms such as X — show that even the largest providers can cause broad customer impact. At the same time, architectural trends like multi-CDN, edge compute, API-first integrations, and automated DNS failovers have matured. That means you can (and should) operationalize pre-approved mitigations and automated fallbacks so your team doesn't reinvent the wheel under pressure.

When to use this runbook

Use this runbook when monitoring and checks indicate an outage that appears to originate from an upstream provider (CDN, cloud provider network, managed DNS or large API provider). Typical signals:

  • Simultaneous client errors (5xx) across geographically diverse clients.
  • Vendor status page indicates partial/full outage for CDN or cloud region.
  • Server-side traces show requests failing at the provider proxy layer (e.g., Cloudflare error pages, TLS handshake failures before your origin).
  • External observability tools (Synthetics, Real User Monitoring) show provider-specific errors.

Incident checklist (first 10 minutes)

  1. Declare an incident and set severity (P0/P1) based on SLO impact and user-facing degradation.
  2. Assign roles immediately: Incident Commander (IC), Communications Lead, Engineering Lead, and Provider Liaison.
  3. Open a shared incident bridge (Zoom/Teams/Slack Huddle) and a live document (Google Doc / Notion / Confluence).
  4. Run fast diagnostics to determine upstream involvement (commands below).
  5. Publish a short acknowledgement on your status page and social channels (template below).
  6. If upstream confirmed, escalate to provider support using the prioritized path below.
  7. Apply pre-approved mitigations (DNS failover, bypass CDN, origin direct), in coordination with engineering lead. Use low-risk changes first.

Fast diagnostics: Commands and what they tell you

Run these checks from multiple networks (on‑call laptop with mobile tether, cloud instances in different regions) to confirm if the problem is upstream. For an overview of what to monitor and faster detection patterns, see Network Observability for Cloud Outages: What To Monitor to Detect Provider Failures Faster.

1) DNS and connectivity

  dig +short A example.com @8.8.8.8
  dig +trace example.com
  nslookup example.com

Does the DNS resolve to a Cloudflare/third-party IP? If so, failures may be upstream.
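
A quick way to confirm whether the record is fronted by Cloudflare (a hedged sketch: assumes dig, whois and curl are available, and uses example.com / 8.8.8.8 as placeholders):

  # Resolve the record and see who announces the IP
  RESOLVED_IP=$(dig +short A example.com @8.8.8.8 | head -n1)
  whois "$RESOLVED_IP" | grep -iE 'orgname|netname|originas'
  # Cloudflare-fronted records typically show CLOUDFLARENET / AS13335 here.
  # Cloudflare also publishes its edge ranges if you prefer to compare CIDRs:
  curl -s https://www.cloudflare.com/ips-v4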

2) HTTP/TLS reachability (bypass DNS with --resolve)

  curl -v --resolve example.com:443:203.0.113.10 https://example.com/  # hits the origin IP directly
  curl -v https://example.com/ -H 'Host: example.com'                  # goes through the provider path

If direct origin works but going through the provider fails, it’s upstream. Instrumentation and edge-cloud telemetry patterns can speed root-cause confirmation — see Edge+Cloud Telemetry for high-throughput approaches.
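
A compact version of that comparison you can paste into incident notes (203.0.113.10 is a placeholder for your origin IP, as in the examples above):

  # Status code via the provider path vs. direct to origin
  echo "via provider: $(curl -s -o /dev/null -w '%{http_code}' https://example.com/)"
  echo "via origin:   $(curl -s -o /dev/null -w '%{http_code}' --resolve example.com:443:203.0.113.10 https://example.com/)"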

3) TCP-level checks

  traceroute example.com
  mtr -c 30 example.com
  tcpdump -n -i eth0 'host 203.0.113.10 and port 443'

4) CDN and provider diagnostics

  # Cloudflare: check zone config via API
  curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
    "https://api.cloudflare.com/client/v4/zones?name=example.com"

  # AWS Route53: list records
  aws route53 list-resource-record-sets --hosted-zone-id Z12345

  # Check AWS Health (requires a Business or Enterprise support plan):
  aws health describe-events --filter file://filters.json
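
An example filters.json for the AWS Health call above (a sketch; the keys follow the EventFilter shape, so adjust regions and categories to your footprint):

  {
    "eventTypeCategories": ["issue"],
    "eventStatusCodes": ["open", "upcoming"],
    "regions": ["us-east-1", "eu-west-1"]
  }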

Provider-specific mitigations (safe order)

Below are real commands and change patterns you should have pre-approved and tested in staging. Always use automation where possible so changes are reproducible and auditable.

Cloudflare: common mitigations

  • Bypass Cloudflare proxy (proxied → DNS-only) — Quick way to test origin health and restore traffic if Cloudflare is failing. Use the Cloudflare API to change DNS record proxied flag to false. See guidance on how to harden CDN configurations so bypassing is safe and auditable.
  # Set DNS record to DNS-only (proxied=false)
  curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
    --data '{"type":"A","name":"example.com","content":"203.0.113.10","proxied":false}'
  • Purge cache to avoid stale error responses:
  curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
    -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
    --data '{"purge_everything":true}'
  • Disable specific Cloudflare features like WAF rules or rate limiting that may be causing blocks; use API to toggle.
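
Feature toggles follow the same API pattern as the DNS change above. A hedged sketch that relaxes the zone-wide security level; the exact endpoints for WAF managed rulesets and rate limiting vary by plan, so confirm and pre-approve the ones that apply to your zone:

  curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/settings/security_level" \
    -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
    --data '{"value":"essentially_off"}'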

AWS: common mitigations

  • Use Route 53 failover / weighted routing to move traffic to an alternate region or provider. Prepare change-batch JSON templates ahead; explore multi-region and multi-provider routing discussed in cloud-native hosting patterns.
  # Apply the change via CLI
  aws route53 change-resource-record-sets --hosted-zone-id Z12345 --change-batch file://failover-change.json

  # Example failover-change.json (raise the backup record's weight and/or drop the
  # primary's weight to 0 so traffic actually shifts to the backup):
  # {"Comment":"Failover to backup","Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"example.com.","Type":"A","TTL":60,"SetIdentifier":"backup","Weight":100,"ResourceRecords":[{"Value":"198.51.100.10"}]}}]}
  • Use AWS CLI to check resource status (ELB, EC2 instance health) and confirm origin is healthy.
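
A minimal sketch for that origin-health check (the target-group ARN and instance ID are placeholders):

  # Confirm load balancer targets are healthy before and after any routing change
  aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN"
  # Spot-check instance reachability (system and instance status checks)
  aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0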

Generic upstream CDN or third-party API mitigations

  • Fallback to pre-warmed alternate providers (multi-CDN or secondary cloud) via DNS weight changes or traffic steering services. Use orchestration tooling and message-brokered pipelines to automate failover.
  • Serve cached, read-only content while write paths remain blocked; enable feature flags to degrade non-critical services. For caching strategies and safe cache purges, see the technical brief on caching strategies.
  • Throttle or queue writes with retry strategies so clients don’t experience silent data loss.
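
For the write-path retries, a minimal client-side sketch (assumes curl 7.71+ for --retry-all-errors; the endpoint and payload file are placeholders):

  curl --retry 5 --retry-delay 2 --retry-all-errors --max-time 30 \
    -X POST https://example.com/api/writes -d @payload.json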

Provider escalation matrix (template)

Every on-call team should maintain a short matrix per vendor that contains:

  • Support URL and phone number for your plan level (Basic, Business, Enterprise).
  • API endpoints for programmatic support where available.
  • Escalation targets: Account Manager, Technical Account Manager (TAM), Enterprise Support phone number, Slack/IRC channel, and trusted partner contact.

Example (Cloudflare & AWS):

  • Cloudflare: Support portal (support.cloudflare.com) → submit priority ticket → use Enterprise TAM contact (email/phone) → Cloudflare status @ cloudflarestatus.com/updates
  • AWS: Support Center (console.aws.amazon.com/support) → create Severity 1 case → call Enterprise Support phone → open AWS Support Slack if provisioned → check AWS Health Dashboard

Internal escalation chain (notify in order):

  1. On-call Engineer (first responder)
  2. Secondary On-call / Senior Engineer
  3. Engineering Manager / Platform Lead
  4. Head of SRE / CTO
  5. Customer Success & Legal (for SLA/regulatory impact)

Customer communications: templates and cadence

Quick, transparent updates reduce inbound pressure. Use consistent timestamps and a posted cadence. Suggested cadence: Acknowledgement immediately, updates every 15 minutes until stable, then every 30–60 minutes until resolved.

Initial status page / tweet / post (public information; TLP:CLEAR)

[Acknowledgement] [HH:MM UTC] We are investigating reports of errors and degraded performance affecting example.com. Early indicators show an issue with an upstream provider (Cloudflare/AWS). We have an incident open and our engineers are working on mitigation. Next update in 15 minutes.

15-minute update (progress)

[Update] [HH:MM UTC] Our engineers have applied temporary mitigations (DNS routed to origin, purged CDN cache). Some users may still experience intermittent errors while the changes propagate. We continue to coordinate with the upstream provider. Next update in 15 minutes.

Resolution message

[Resolved] [HH:MM UTC] The issue is resolved. We reverted emergency changes and are monitoring. A full post-incident report will be published within 72 hours. If you continue to see problems, contact support@example.com with [incident-id].

In-product banner / email (short)

We experienced an outage between [start] and [end] due to an upstream provider incident. Service has been restored. Read the incident postmortem: [link].

Sample support case template for providers

Use this when creating Priority/Severity 1 cases with upstream vendors:

Subject: SEV1 - example.com production traffic errors via Cloudflare (Incident: [id])

  Customer: Example Corp
  Environment: Production
  Start Time (UTC): 2026-01-16T08:12:00Z
  Impact: 100% of user requests returning 5xx via Cloudflare edge; origin is reachable via direct IP. Degraded for all regions.
  Steps to reproduce: curl -v https://example.com -> 520/524 errors; curl --resolve example.com:443:203.0.113.10 https://example.com works
  Recent changes: No deploys in last 2 hours
  Logs/Error payload: [attach stack traces, edge error pages]
  Request: Immediate escalation to Engineering/TAM; provide status and mitigation options.
  Contact: On-call IC John Doe +1-555-555-0100; IC Slack: #inc-1234
  

When to perform DNS failover vs. CDN bypass

  • CDN bypass is fast and low-risk for short outages and testing origin health. Use when provider-specific errors or edge behavior is suspected. See the CDN hardening checklist for safe bypass procedures (How to Harden CDN Configurations).
  • DNS failover is appropriate when routing must be moved off the provider entirely (multi-CDN or alternate origin). Use pre-warmed endpoints and low TTLs to reduce propagation time.
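
To verify a record's TTL has actually dropped before you rely on fast failover (the second column of the answer is the remaining TTL in seconds when querying a recursive resolver):

  dig +noall +answer example.com A @8.8.8.8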

Pre-incident preparation (what to automate now)

  • Automated runbook scripts that toggle proxied=false, update Route 53, or purge caches via a CI/CD pipeline with approval gating (see the sketch after this list). Save runbooks as code and integrate them with your developer platform (build devex platforms) for auditability.
  • Pre-warmed alternate origin(s) with valid TLS certs and tested signed URL or mTLS if used. Use tested remote-analysis hardware if needed (see the Nimbus Deck Pro review for remote telemetry & rapid analysis devices).
  • Lowered DNS TTL for critical records (60–120s) with documented rollback policy.
  • Multi-CDN and traffic steering agreements with fast-failover runbooks. Consider orchestration and message-broker approaches covered in edge message broker field reviews (Edge Message Brokers for Distributed Teams).
  • Provider escalation contacts (TAM, phone numbers) kept up-to-date in a secure vault accessible to on-call.
  • Regular chaos experiments that include simulated upstream provider failures, to validate runbooks. Combine these with security and testing programmes (for example, bug bounty lessons) to stress multiple dimensions of incident response.
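
As referenced in the first item above, a minimal sketch of a gated runbook script (names and IDs are placeholders; it reuses the Cloudflare proxied=false call from the mitigation section and refuses to run without explicit IC approval):

  #!/usr/bin/env bash
  # Gated mitigation: require explicit IC approval and leave an audit trail
  set -euo pipefail
  : "${APPROVED_BY:?Set APPROVED_BY=<incident-commander> to confirm IC approval}"
  echo "$(date -u +%FT%TZ) proxied=false approved by ${APPROVED_BY}" >> mitigation-audit.log

  # Toggle the record to DNS-only (same call as in the Cloudflare mitigation section)
  curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
    --data '{"type":"A","name":"example.com","content":"203.0.113.10","proxied":false}'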

Post-incident: actions and indicators

After recovery:

  1. Keep temporary mitigations (DNS changes, proxy toggles) in place only as long as necessary. Revert in a controlled window.
  2. Collect timelines: who changed what and when (audit logs from Cloudflare/AWS/Route53, CI/CD logs, git commits).
  3. Perform RCA focusing on root cause, contributing factors, and action items. Include impact to customers, SLA breaches, and communications effectiveness.
  4. Implement permanent fixes and automation (e.g., programmable failover configured in Terraform/CloudFormation).
  5. Update runbook with lessons and test changes in staging (blameless review).

Trends and practices to build into your response

  • Multi-CDN orchestration: Use steering platforms or DNS-based weighted routing for active-active failover. Recent adoption in 2025–2026 shows significantly reduced mean time to mitigate for edge provider incidents.
  • Edge-aware observability: Instrument eBPF-based network visibility at edge proxies and origin to detect provider-induced latency or packet loss. For high-throughput approaches, see Edge+Cloud Telemetry.
  • Automated, policy-driven runbooks: Save pre-approved remediation playbooks as code so runbook actions can be executed automatically under IC approval. Build these into your developer platform and CI/CD workflows (build devex platforms).
  • Provider performance SLOs: Negotiate SLOs and on-call expectations with your provider T&Cs and TAMs for faster escalations. Track provider trust and performance via vendor-trust frameworks.
  • AI-assisted triage: Use AI detection to correlate multi-source telemetry and suggest mitigations, but keep human IC in the loop for high-impact decisions. Complement AI tools with regular security testing and lessons learned from coordinated programmes like bug bounty write-ups (Bug Bounties Beyond Web).

Example incident timeline (concise)

  1. 08:12 UTC — Alerts trigger: 5xx spike, SLO breach. IC declared P0.
  2. 08:15 UTC — Quick diagnostics show CDN edge returning 520; origin reachable directly.
  3. 08:20 UTC — Cloudflare support ticket opened and TAM contacted. Status posted to customers.
  4. 08:25 UTC — Temporary mitigation: set proxied=false via Cloudflare API to route to origin. Cache purge performed.
  5. 08:45 UTC — Partial recovery observed; traffic gradually restored. Monitoring shows stable service at 09:10 UTC.
  6. 09:30 UTC — Revert proxied flag after coordination and validation. Postmortem scheduled.

Checklists you can paste into your incident tool

On-call quick checklist

  • Confirm upstream source with diagnostics (dig, curl --resolve). For the set of signals and metrics to monitor, consult Network Observability for Cloud Outages.
  • Open incident bridge and assign IC.
  • Post initial status. Contact provider support and TAM.
  • Apply safe mitigations (bypass CDN / route DNS to alternate).
  • Log all changes and notify customers every 15 minutes.
  • Escalate per internal matrix if no progress in 30 minutes.

Provider contact template (kept in secure vault)

TAM-Contact-List:
  - Cloudflare-TAM: tam@example.com; +1-555-100-200
  - AWS-Enterprise: ent-support@example.com; +1-555-111-222
  - DNS-Provider: dnsops@example.com; +1-555-333-444
  

Final notes: maintain calm, communicate clearly, and automate recovery

Upstream outages will continue to happen in 2026, and the difference between a contained incident and a full-blown customer nightmare is preparation. Use this runbook as a living document: test it regularly, automate low-risk steps, and keep escalation contacts current. The faster you can confirm the source, apply pre-approved mitigations, and communicate status to customers, the lower your overall impact.

Call to action

If your team doesn't yet have a tested upstream outage playbook, make it a Q1 priority. Start by cloning this runbook into your incident tooling, pre-authorize the low-risk API commands, and run a table-top or chaos experiment this month. Need a template tailored to your stack (Cloudflare + AWS + multi-CDN)? Contact smartstorage.host for a bespoke on-call runbook and automation pack that integrates with your CI/CD and status page. For additional reading on cache strategies, observability and vendor trust, see the links below.
