Incident Runbook Generator
Before going to production, to prepare for the most likely failure scenarios — write the runbook when you're calm, not at 2am when the site is down.
Submitted by @dotsystemsdevs
Prompt
Generate an incident runbook for [SERVICE NAME]. I need a practical, step-by-step guide that I (or anyone I trust) can follow during a real outage — at 3am, half-awake, under pressure.
Service name: [SERVICE NAME] What it does: [ONE SENTENCE] Tech stack: [YOUR STACK — include hosting, database, auth, external APIs] Monitoring tools: [e.g. Sentry, Vercel dashboard, Supabase dashboard, UptimeRobot] Deployment method: [e.g. Vercel auto-deploy from GitHub main branch] Database: [e.g. Supabase Postgres] Generate runbooks for the 5 most likely failure modes for this type of application. For each, use this exact structure:
INCIDENT: [SHORT NAME — e.g. "Site returns 500 on all pages"]
SYMPTOM (what a user sees): [Describe exactly what a user experiences — error message text, blank page, redirect loop, etc.] ALERT THAT FIRES: [What monitoring tool would catch this and what does the alert say — or if no alert would fire, flag that as a gap] SEVERITY: [P1 - Site down / P2 - Core feature broken / P3 - Degraded experience] DIAGNOSTIC STEPS (in order — do these before trying to fix anything): 1. [First thing to check — e.g. "Check Vercel deployment dashboard for failed builds"] 2. [Second thing to check — e.g. "Check Sentry for error volume spike in last 15 minutes"] 3. [Third thing to check — e.g. "Check Supabase dashboard for database connection errors or query timeouts"] MOST COMMON ROOT CAUSES (in order of likelihood): 1. [Most likely cause and how to confirm it] 2. [Second most likely cause and how to confirm it] 3. [Less likely but possible cause] MITIGATION / ROLLBACK PROCEDURE: [Step-by-step procedure to restore service. Be specific. Include exact commands, dashboard clicks, or config changes. If rollback means reverting a deployment, describe exactly how.] COMMUNICATION TEMPLATE (if users need to be notified): "[Short status message to post — include what is affected, that you're investigating, and when you'll update]" POST-INCIDENT ACTION (to prevent recurrence): [What to add, change, or monitor to prevent this from happening again]
After all 5 runbooks
GAPS TO FIX BEFORE GOING LIVE:
List any scenario where no alert would fire and the incident would only be discovered by a user complaint. These are monitoring blind spots that must be addressed.
RUNBOOK LOCATION:
Recommend where this runbook should be stored so it's accessible during an outage (not just in the codebase — somewhere reachable when the site is down).