# Infrastructure Incident Log

**Purpose:** Track infrastructure failures, root causes, and mitigations to prevent recurrence.

## Template
## YYYY-MM-DD: [Incident Title]
- **Severity:** Critical | High | Medium | Low
- **Duration:** [start time] - [end time] ([total duration])
- **Impact:** [what broke, who/what was affected]
- **Root Cause:** [technical explanation]
- **Detection:** [how we found out]
- **Resolution:** [what fixed it]
- **Prevention:** [changes to prevent recurrence]
- **Related:** [links to PRs, commits, docs]

## 2026-02-11: Container Restart Data Loss
- **Severity:** High
- **Duration:** N/A (data loss event, not outage)
- **Impact:** Lost 4+ hours of work (webhook server, Gmail Pub/Sub setup, venv, MEMORY.md updates, incomplete-threads.md)
- **Root Cause:** New files were created in the container's writable layer (not a mounted volume). Recreating the container destroyed the writable layer, and nothing had been committed to git before the restart.
- **Detection:** Files discovered missing after the container restart
- **Resolution:**
  - Rebuilt from scratch (30 min rebuild vs. 4+ hours of original work)
  - Old container (932c6ef0814a) had already been removed; files could not be recovered
- **Prevention:**
  - ✅ Created auto-commit script (`tools/auto-commit.sh`) (see the sketch after this list)
  - ✅ Added hourly cron job (ID: 5566acd6-385c-404f-8eaa-d0b7e9aaeb82)
  - ✅ Created pre-restart checklist (`tools/pre-restart-check.sh`)
  - ✅ Documented Protection Protocol (`PROTECTION-PROTOCOL.md`)
  - ✅ Added volume verification script (`tools/verify-volume.sh`)
  - ✅ Mandatory: `git add -A && git commit && git push` before any container operation
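The actual `tools/auto-commit.sh` is not reproduced in this log; the following is a minimal Python sketch of the same hourly auto-commit idea, assuming the workspace root is a git repository with a remote already configured. The repo path and commit-message format are illustrative, not the real script's.

```python
#!/usr/bin/env python3
"""Sketch: hourly auto-commit job (illustrative stand-in for tools/auto-commit.sh)."""
import subprocess
from datetime import datetime, timezone

REPO = "/workspace"  # assumption: workspace root is a git repo with a configured remote

def run(*args: str) -> subprocess.CompletedProcess:
    """Run a git subcommand inside the repo and capture its output."""
    return subprocess.run(["git", "-C", REPO, *args], capture_output=True, text=True)

def main() -> None:
    # Stage everything, including new files that would otherwise live only in the writable layer.
    run("add", "-A")
    # `git diff --cached --quiet` exits non-zero when staged changes exist, so commit only then.
    if run("diff", "--cached", "--quiet").returncode != 0:
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        run("commit", "-m", f"auto-commit: {stamp}")
        push = run("push")
        if push.returncode != 0:
            # Surface push failures in cron output instead of failing silently.
            print(push.stderr)

if __name__ == "__main__":
    main()
```

In practice this would run from the hourly cron job, and again manually before any container operation.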
- **Related:**
  - Protection Protocol: `/PROTECTION-PROTOCOL.md`
  - Auto-commit script: `/tools/auto-commit.sh`
  - Pre-restart check: `/tools/pre-restart-check.sh`

**Lesson:** Container writable layer is ephemeral. All critical data must be in mounted volumes OR committed to git. No exceptions.
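Along the same lines, here is a hedged sketch of the kind of check a volume-verification script might perform (the real `tools/verify-volume.sh` is not shown here): confirm that critical paths exist and sit on a filesystem other than the container's root. The path list is hypothetical, and comparing device IDs against `/` is only a heuristic.

```python
#!/usr/bin/env python3
"""Sketch: verify critical paths sit on a mounted volume, not the container's writable layer."""
import os
import sys

# Assumption: these are the paths that must survive a container recreate (illustrative only).
CRITICAL_PATHS = ["/workspace", "/workspace/tools", "/workspace/MEMORY.md"]

def device_of(path: str) -> int:
    """Return the device ID backing a path."""
    return os.stat(path).st_dev

def main() -> int:
    # Heuristic: a path on a separate mount usually reports a different device ID than "/".
    root_dev = device_of("/")
    bad = [p for p in CRITICAL_PATHS if not os.path.exists(p) or device_of(p) == root_dev]
    if bad:
        print("NOT on a mounted volume (or missing):", ", ".join(bad))
        return 1
    print("All critical paths are on mounted volumes.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```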

## 2026-02-13: Mid-Week Rebuild Temptation
- **Severity:** Low (prevented, not executed)
- **Duration:** N/A (avoided incident)
- **Impact:** Could have interrupted Quan's work week
- **Root Cause:** Considered installing ChromaDB for vector search, which requires an image rebuild
- **Detection:** Pre-rebuild ROI check revealed low immediate value vs. high disruption cost
- **Resolution:**
  - Deferred to Sunday rebuild window (9:45 PM PT)
  - Added to `pending-rebuilds.md`
  - Continued with file-based solutions (grep, memory search)
- **Prevention:**
  - ✅ Created Rebuild Discipline Protocol (`REBUILD-WINDOW.md`)
  - ✅ Established Sunday-only rebuild window
  - ✅ Added mutation layer classification (Image, Runtime, Infra)
  - ✅ Founder energy constraint: protect deep-work blocks
  - ✅ Expected ~5 rebuilds/year if disciplined
- **Related:**
  - Rebuild Window Protocol: `/REBUILD-WINDOW.md`
  - Pending rebuilds queue: `/pending-rebuilds.md`

**Lesson:** Infrastructure changes fragment founder attention. Batch mutations weekly. Protect deep-work blocks above all else.

## 2026-02-15: Quo Webhook Server 502 Error
- **Severity:** Medium
- **Duration:** ~30 min (investigation + fix)
- **Impact:** Inbound SMS not routing to Telegram (webhook failing)
- **Root Cause:** Webhook URL returning 502 Bad Gateway (server not responding); underlying cause was a Flask dependency error in the server script
- **Detection:** User reported missing SMS notifications
- **Resolution:**
  - Verified webhook server process running: `systemctl status quo-webhook`
  - Checked logs: `journalctl -u quo-webhook -f`
  - Discovered server script error (Flask dependency issue)
  - Rewrote server using stdlib HTTP server (no Flask dependency)
  - Updated systemd service: `quo-webhook.service`
  - Restarted service: `systemctl restart quo-webhook`
  - Tested webhook: SMS → Telegram working
- **Prevention:**
  - ✅ Removed Flask dependency (use stdlib for simple webhooks)
  - ✅ Added service health check to API monitoring (when implemented)
  - ✅ Documented webhook troubleshooting in TOOLS.md
  - 🔄 TODO: Add uptime monitoring for webhook endpoint (see the sketch after this list)
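The uptime-monitoring item above is still a TODO; a minimal cron-friendly probe could look like the sketch below. The health-check URL is a placeholder assumption, not the real webhook endpoint.

```python
#!/usr/bin/env python3
"""Sketch: cron-friendly uptime probe for the webhook endpoint (hypothetical URL)."""
import sys
import urllib.request

URL = "https://example.com/quo-webhook/health"  # placeholder; real endpoint not documented here
TIMEOUT_SECONDS = 10

def main() -> int:
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            if resp.status == 200:
                return 0
            print(f"Webhook unhealthy: HTTP {resp.status}")
    except OSError as exc:  # covers URLError, HTTPError, and socket timeouts
        print(f"Webhook unreachable: {exc}")
    return 1  # non-zero exit lets cron or a wrapper raise an alert

if __name__ == "__main__":
    sys.exit(main())
```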
- **Related:**
  - Quo webhook handler: `/tools/quo-webhook-handler-v2.py`
  - Systemd service: `quo-webhook.service` (host-managed)
  - TOOLS.md section: Quo (Business SMS/Phone)

**Lesson:** Minimize dependencies for critical services. Stdlib > frameworks for simple use cases.
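The rewritten handler (`/tools/quo-webhook-handler-v2.py`) is not reproduced here; the sketch below only illustrates the stdlib-over-Flask approach the lesson refers to. The port, the JSON payload handling, and the Telegram forwarding step are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: dependency-free webhook receiver using only the standard library."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the request body using the declared Content-Length.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # Placeholder: the real handler would forward the inbound SMS to Telegram here.
        print("webhook payload:", payload)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    # Port is an assumption; the real service binds wherever quo-webhook.service specifies.
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```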

## 2026-02-13 to 2026-02-17: Gmail OAuth Unauthorized (Day 3+ Offline)
- **Severity:** High
- **Duration:** ~4 days (Feb 13-17, 2026)
- **Impact:** Email triage blocked; API calls failing with `KeyError` or 401 Unauthorized
- **Root Cause:**
  - Access tokens expire after 1 hour
  - No automated refresh mechanism
  - Tokens last refreshed manually on Feb 16 (expired within hours)
- **Detection:**
  - User reported in INITIALIZATION.md ("Gmail OAuth offline Day 3")
  - Infrastructure domain initialization revealed the issue
  - Manual test (`gmail-test-account.py`) failed with `KeyError`
- **Resolution:**
  - Created `tools/gmail-refresh-all-tokens.py` (extracts refresh logic, handles 3 accounts)
  - Added cron job running every 45 minutes (ID: `42f43a41-d500-4736-8951-178d3b478952`)
  - Updated `tools/gmail-test-account.py` to handle the multi-account token format
  - Tested successfully: 76,597 messages in the quan@ztag.com account
  - Validated all 3 accounts refreshing (quan@ztag.com, quan@gantom.com, quan777@gmail.com)
- **Prevention:**
  - ✅ Automated token refresh (cron every 45 min, well before 1-hour expiration)
  - ✅ Multi-account support (all 3 Gmail accounts maintained)
  - 🔄 TODO: API health monitoring to detect token failures proactively
  - 🔄 TODO: Alert if refresh fails 3x consecutively
  - ✅ Documented in `oauth-health.md` and `system-status.md`
- **Related:**
  - Token refresh script: `/tools/gmail-refresh-all-tokens.py`
  - OAuth health tracking: `/working/infrastructure/oauth-health.md`
  - System status: `/working/infrastructure/system-status.md`
  - Cron job: `42f43a41-d500-4736-8951-178d3b478952`

**Lesson:** OAuth access tokens are short-lived (1 hour). For production reliability, automated refresh is mandatory, not optional. Cron-based refresh is simpler than a daemon for low-volume APIs.
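For reference, this is a minimal sketch of the refresh-token exchange against Google's standard OAuth token endpoint; the real `tools/gmail-refresh-all-tokens.py` handles three accounts and its own token-file format, neither of which is shown. The credentials file path and its JSON layout are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: refresh a Google OAuth access token from a stored refresh token."""
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's standard token endpoint

def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """POST the refresh grant and return the parsed token response (access_token, expires_in, ...)."""
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    with urllib.request.urlopen(urllib.request.Request(TOKEN_URL, data=body)) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Assumption: credentials live in a local JSON file keyed by account; the format is hypothetical.
    with open("/secrets/gmail-oauth.json") as fh:
        accounts = json.load(fh)
    for email, creds in accounts.items():
        tokens = refresh_access_token(creds["client_id"], creds["client_secret"], creds["refresh_token"])
        # A real job would persist the new access_token back to the token store.
        print(f"refreshed {email}; expires_in={tokens.get('expires_in')}s")
```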

## Incident Metrics

**Total Incidents:** 4
- Critical: 0
- High: 2 (data loss, Gmail OAuth offline)
- Medium: 1 (webhook failure)
- Low: 1 (prevented rebuild)

### Mean Time to Detect (MTTD)
- Data loss: Immediate (post-restart discovery)
- Rebuild temptation: Immediate (pre-execution ROI check)
- Webhook failure: Unknown (user reported; no automated monitoring)
- Gmail OAuth: ~3 days (user reported "Day 3" offline; no automated monitoring)

### Mean Time to Resolve (MTTR)
- Data loss: 30 min rebuild + 2h prevention measures
- Rebuild temptation: 15 min (defer decision + document)
- Webhook failure: 30 min (investigation + fix + test)
- Gmail OAuth: fixed Feb 17 once detected (refresh script + cron + validation); ~4 days offline overall

### Prevention Effectiveness
- Data loss: ✅ Auto-commit + pre-restart checks deployed
- Rebuild discipline: ✅ Protocol established, Sunday-only window
- Webhook monitoring: 🔄 TODO (API health monitoring not yet implemented)
- OAuth refresh: ✅ Automated cron refresh deployed; health monitoring still TODO

**Next Review:** Weekly (Sunday rebuild window)
**Monitoring:** Alert critical incidents to Infrastructure & Tech group immediately