# Infrastructure Incident Log

**Purpose:** Track infrastructure failures, root causes, and mitigations to prevent recurrence.

## Template
## YYYY-MM-DD: [Incident Title]
- **Severity:** Critical | High | Medium | Low
- **Duration:** [start time] - [end time] ([total duration])
- **Impact:** [what broke, who/what was affected]
- **Root Cause:** [technical explanation]
- **Detection:** [how we found out]
- **Resolution:** [what fixed it]
- **Prevention:** [changes to prevent recurrence]
- **Related:** [links to PRs, commits, docs]

## 2026-02-11: Container Restart Data Loss
- **Severity:** High
- **Duration:** N/A (data loss event, not outage)
- **Impact:** Lost 4+ hours of work (webhook server, Gmail Pub/Sub setup, venv, MEMORY.md updates, incomplete-threads.md)
- **Root Cause:** New files were created in the container's writable layer (not a mounted volume). Recreating the container destroyed the writable layer, and nothing had been committed to git before the restart.
- **Detection:** Files discovered missing after the container restart
- **Resolution:**
  - Rebuilt from scratch (30 min rebuild vs. 4+ hours of original work)
  - Old container (932c6ef0814a) had already been removed; files could not be recovered
- **Prevention:**
  - ✅ Created auto-commit script (`tools/auto-commit.sh`) (see the sketch after this list)
  - ✅ Added hourly cron job (ID: 5566acd6-385c-404f-8eaa-d0b7e9aaeb82)
  - ✅ Created pre-restart checklist (`tools/pre-restart-check.sh`)
  - ✅ Documented Protection Protocol (`PROTECTION-PROTOCOL.md`)
  - ✅ Added volume verification script (`tools/verify-volume.sh`)
  - ✅ Mandatory: `git add -A && git commit && git push` before any container operation
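The actual `tools/auto-commit.sh` is not reproduced in this log; the following is a minimal Python sketch of the same hourly auto-commit idea, assuming the workspace root is a git repository with a remote already configured. The repo path and commit-message format are illustrative, not the real script's.

```python
#!/usr/bin/env python3
"""Sketch: hourly auto-commit job (illustrative stand-in for tools/auto-commit.sh)."""
import subprocess
from datetime import datetime, timezone

REPO = "/workspace"  # assumption: workspace root is a git repo with a configured remote

def run(*args: str) -> subprocess.CompletedProcess:
    """Run a git subcommand inside the repo and capture its output."""
    return subprocess.run(["git", "-C", REPO, *args], capture_output=True, text=True)

def main() -> None:
    # Stage everything, including new files that would otherwise live only in the writable layer.
    run("add", "-A")
    # `git diff --cached --quiet` exits non-zero when staged changes exist, so commit only then.
    if run("diff", "--cached", "--quiet").returncode != 0:
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        run("commit", "-m", f"auto-commit: {stamp}")
        push = run("push")
        if push.returncode != 0:
            # Surface push failures in cron output instead of failing silently.
            print(push.stderr)

if __name__ == "__main__":
    main()
```

In practice this would run from the hourly cron job, and again manually before any container operation.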
- **Related:**
  - Protection Protocol: `/PROTECTION-PROTOCOL.md`
  - Auto-commit script: `/tools/auto-commit.sh`
  - Pre-restart check: `/tools/pre-restart-check.sh`

**Lesson:** Container writable layer is ephemeral. All critical data must be in mounted volumes OR committed to git. No exceptions.
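Along the same lines, here is a hedged sketch of the kind of check a volume-verification script might perform (the real `tools/verify-volume.sh` is not shown here): confirm that critical paths exist and sit on a filesystem other than the container's root. The path list is hypothetical, and comparing device IDs against `/` is only a heuristic.

```python
#!/usr/bin/env python3
"""Sketch: verify critical paths sit on a mounted volume, not the container's writable layer."""
import os
import sys

# Assumption: these are the paths that must survive a container recreate (illustrative only).
CRITICAL_PATHS = ["/workspace", "/workspace/tools", "/workspace/MEMORY.md"]

def device_of(path: str) -> int:
    """Return the device ID backing a path."""
    return os.stat(path).st_dev

def main() -> int:
    # Heuristic: a path on a separate mount usually reports a different device ID than "/".
    root_dev = device_of("/")
    bad = [p for p in CRITICAL_PATHS if not os.path.exists(p) or device_of(p) == root_dev]
    if bad:
        print("NOT on a mounted volume (or missing):", ", ".join(bad))
        return 1
    print("All critical paths are on mounted volumes.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```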

## 2026-02-13: Mid-Week Rebuild Temptation
- **Severity:** Low (prevented, not executed)
- **Duration:** N/A (avoided incident)
- **Impact:** Could have interrupted Quan's work week
- **Root Cause:** Considered installing ChromaDB for vector search, which requires an image rebuild
- **Detection:** Pre-rebuild ROI check revealed low immediate value vs. high disruption cost
- **Resolution:**
  - Deferred to Sunday rebuild window (9:45 PM PT)
  - Added to `pending-rebuilds.md`
  - Continued with file-based solutions (grep, memory search)
- **Prevention:**
  - ✅ Created Rebuild Discipline Protocol (`REBUILD-WINDOW.md`)
  - ✅ Established Sunday-only rebuild window
  - ✅ Added mutation layer classification (Image, Runtime, Infra)
  - ✅ Founder energy constraint: protect deep-work blocks
  - ✅ Expected ~5 rebuilds/year if disciplined
- **Related:**
  - Rebuild Window Protocol: `/REBUILD-WINDOW.md`
  - Pending rebuilds queue: `/pending-rebuilds.md`

**Lesson:** Infrastructure changes fragment founder attention. Batch mutations weekly. Protect deep-work blocks above all else.

## 2026-02-15: Quo Webhook Server 502 Error
- **Severity:** Medium
- **Duration:** ~30 min (investigation + fix)
- **Impact:** Inbound SMS not routing to Telegram (webhook failing)
- **Root Cause:** Webhook URL returning 502 Bad Gateway (server not responding); underlying cause was a Flask dependency error in the server script
- **Detection:** User reported missing SMS notifications
- **Resolution:**
  - Verified webhook server process running: `systemctl status quo-webhook`
  - Checked logs: `journalctl -u quo-webhook -f`
  - Discovered server script error (Flask dependency issue)
  - Rewrote server using stdlib HTTP server (no Flask dependency)
  - Updated systemd service: `quo-webhook.service`
  - Restarted service: `systemctl restart quo-webhook`
  - Tested webhook: SMS → Telegram working
- **Prevention:**
  - ✅ Removed Flask dependency (use stdlib for simple webhooks)
  - ✅ Added service health check to API monitoring (when implemented)
  - ✅ Documented webhook troubleshooting in TOOLS.md
  - 🔄 TODO: Add uptime monitoring for webhook endpoint (see the sketch after this list)
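The uptime-monitoring item above is still a TODO; a minimal cron-friendly probe could look like the sketch below. The health-check URL is a placeholder assumption, not the real webhook endpoint.

```python
#!/usr/bin/env python3
"""Sketch: cron-friendly uptime probe for the webhook endpoint (hypothetical URL)."""
import sys
import urllib.request

URL = "https://example.com/quo-webhook/health"  # placeholder; real endpoint not documented here
TIMEOUT_SECONDS = 10

def main() -> int:
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            if resp.status == 200:
                return 0
            print(f"Webhook unhealthy: HTTP {resp.status}")
    except OSError as exc:  # covers URLError, HTTPError, and socket timeouts
        print(f"Webhook unreachable: {exc}")
    return 1  # non-zero exit lets cron or a wrapper raise an alert

if __name__ == "__main__":
    sys.exit(main())
```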
- **Related:**
  - Quo webhook handler: `/tools/quo-webhook-handler-v2.py`
  - Systemd service: `quo-webhook.service` (host-managed)
  - TOOLS.md section: Quo (Business SMS/Phone)

**Lesson:** Minimize dependencies for critical services. Stdlib > frameworks for simple use cases.
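The rewritten handler (`/tools/quo-webhook-handler-v2.py`) is not reproduced here; the sketch below only illustrates the stdlib-over-Flask approach the lesson refers to. The port, the JSON payload handling, and the Telegram forwarding step are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: dependency-free webhook receiver using only the standard library."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the request body using the declared Content-Length.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # Placeholder: the real handler would forward the inbound SMS to Telegram here.
        print("webhook payload:", payload)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    # Port is an assumption; the real service binds wherever quo-webhook.service specifies.
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```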

## 2026-02-13 to 2026-02-17: Gmail OAuth Unauthorized (Day 3+ Offline)
- **Severity:** High
- **Duration:** ~4 days (Feb 13-17, 2026)
- **Impact:** Email triage blocked; API calls failing with `KeyError` or 401 Unauthorized
- **Root Cause:**
  - Access tokens expire after 1 hour
  - No automated refresh mechanism
  - Tokens last refreshed manually on Feb 16 (expired within hours)
- **Detection:**
  - User reported in INITIALIZATION.md ("Gmail OAuth offline Day 3")
  - Infrastructure domain initialization revealed the issue
  - Manual test (`gmail-test-account.py`) failed with `KeyError`
- **Resolution:**
  - Created `tools/gmail-refresh-all-tokens.py` (extracts refresh logic, handles 3 accounts)
  - Added cron job running every 45 minutes (ID: `42f43a41-d500-4736-8951-178d3b478952`)
  - Updated `tools/gmail-test-account.py` to handle the multi-account token format
  - Tested successfully: 76,597 messages in the quan@ztag.com account
  - Validated all 3 accounts refreshing (quan@ztag.com, quan@gantom.com, quan777@gmail.com)
- **Prevention:**
  - ✅ Automated token refresh (cron every 45 min, well before 1-hour expiration)
  - ✅ Multi-account support (all 3 Gmail accounts maintained)
  - 🔄 TODO: API health monitoring to detect token failures proactively
  - 🔄 TODO: Alert if refresh fails 3x consecutively
  - ✅ Documented in `oauth-health.md` and `system-status.md`
- **Related:**
  - Token refresh script: `/tools/gmail-refresh-all-tokens.py`
  - OAuth health tracking: `/working/infrastructure/oauth-health.md`
  - System status: `/working/infrastructure/system-status.md`
  - Cron job: `42f43a41-d500-4736-8951-178d3b478952`

**Lesson:** OAuth access tokens are short-lived (1 hour). For production reliability, automated refresh is mandatory, not optional. Cron-based refresh is simpler than a daemon for low-volume APIs.
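For reference, this is a minimal sketch of the refresh-token exchange against Google's standard OAuth token endpoint; the real `tools/gmail-refresh-all-tokens.py` handles three accounts and its own token-file format, neither of which is shown. The credentials file path and its JSON layout are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: refresh a Google OAuth access token from a stored refresh token."""
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's standard token endpoint

def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """POST the refresh grant and return the parsed token response (access_token, expires_in, ...)."""
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    with urllib.request.urlopen(urllib.request.Request(TOKEN_URL, data=body)) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Assumption: credentials live in a local JSON file keyed by account; the format is hypothetical.
    with open("/secrets/gmail-oauth.json") as fh:
        accounts = json.load(fh)
    for email, creds in accounts.items():
        tokens = refresh_access_token(creds["client_id"], creds["client_secret"], creds["refresh_token"])
        # A real job would persist the new access_token back to the token store.
        print(f"refreshed {email}; expires_in={tokens.get('expires_in')}s")
```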

## Incident Metrics

**Total Incidents:** 4
- Critical: 0
- High: 2 (data loss, Gmail OAuth offline)
- Medium: 1 (webhook failure)
- Low: 1 (prevented rebuild)

### Mean Time to Detect (MTTD)
- Data loss: Immediate (post-restart discovery)
- Rebuild temptation: Immediate (pre-execution ROI check)
- Webhook failure: Unknown (user reported; no automated monitoring)
- Gmail OAuth: ~3 days (user reported "Day 3" offline; no automated monitoring)

### Mean Time to Resolve (MTTR)
- Data loss: 30 min rebuild + 2h prevention measures
- Rebuild temptation: 15 min (defer decision + document)
- Webhook failure: 30 min (investigation + fix + test)
- Gmail OAuth: fixed Feb 17 once detected (refresh script + cron + validation); ~4 days offline overall

### Prevention Effectiveness
- Data loss: ✅ Auto-commit + pre-restart checks deployed
- Rebuild discipline: ✅ Protocol established, Sunday-only window
- Webhook monitoring: 🔄 TODO (API health monitoring not yet implemented)
- OAuth refresh: ✅ Automated cron refresh deployed; health monitoring still TODO

**Next Review:** Weekly (Sunday rebuild window)
**Monitoring:** Alert critical incidents to Infrastructure & Tech group immediately