← Back to Index

Infrastructure & Tech Domain - Full Context

Domain ID: -1003891773186
Session Key: agent:main:telegram:group:-1003891773186
Last Updated: 2026-02-16 23:00 UTC


Your Scope

This domain focuses on:

Goal: Maintain production-grade reliability for ZTAG automation systems.


Current Infrastructure Status

VPS (Vultr)

Instance ID: bc5f56e5-a60e-4f3e-a40b-74eccae58f28
IP: 144.202.121.97
Tailscale: 100.72.11.53 (minnie-core)
Status: ✅ Operational

Backup: Weekly snapshots (Sundays 10 PM PT), keep 4 most recent
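
A retention job along these lines could enforce the keep-4 policy automatically. This is a minimal sketch assuming the Vultr v2 snapshot endpoints (GET /v2/snapshots, DELETE /v2/snapshots/{id}) and a VULTR_API_KEY environment variable; verify field names against current Vultr docs before relying on it.

```python
#!/usr/bin/env python3
"""Prune Vultr snapshots to the 4 most recent (hedged sketch; confirm endpoints/fields against Vultr v2 docs)."""
import os
import requests

API = "https://api.vultr.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['VULTR_API_KEY']}"}
KEEP = 4  # retention policy: keep the 4 most recent weekly snapshots

def prune_snapshots():
    # List all snapshots on the account (assumed response shape: {"snapshots": [...]})
    resp = requests.get(f"{API}/snapshots", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    snaps = resp.json().get("snapshots", [])

    # Newest first by creation date, then delete everything past the retention window
    snaps.sort(key=lambda s: s["date_created"], reverse=True)
    for snap in snaps[KEEP:]:
        requests.delete(f"{API}/snapshots/{snap['id']}", headers=HEADERS, timeout=30).raise_for_status()
        print(f"Deleted snapshot {snap['id']} ({snap['date_created']})")

if __name__ == "__main__":
    prune_snapshots()
```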


Critical Systems

✅ Operational (9 systems):

  1. VPS host (Vultr)
  2. Docker container (OpenClaw)
  3. Tailscale network (minnie-core)
  4. Markdown server (port 9876)
  5. Auto-commit (hourly workspace commits)
  6. Google Calendar OAuth
  7. Google Drive access
  8. Quo SMS webhook (port 18791)
  9. UPS tracking API

🟡 Partial (2 systems):

  1. Gmail OAuth - Day 3 offline, unauthorized
  2. OTA pipeline - Manual process, automation needed

❌ Not Implemented (2 systems):

  1. API health monitoring (prevent silent failures)
  2. Backup restore testing (disaster recovery validation)

URGENT: Gmail OAuth Offline (Day 3)

Status: All 3 accounts unauthorized

Impact: Email visibility blocked, triage worker offline

Root cause: Access tokens expire after ~1 hour and there is no automatic refresh logic

Solution path:

  1. Programmatic token refresh (exchange the refresh_token for a new access_token); see the sketch after this list
  2. In-memory token caching (3000 s TTL), same pattern as email-triage-worker
  3. Credentials: /home/node/.openclaw/credentials/google-*-tokens.json
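
A minimal sketch of steps 1-2, assuming each token file stores client_id, client_secret, and refresh_token (adjust to the real file layout) and using Google's standard token endpoint:

```python
"""Hedged sketch: refresh a Google OAuth access token and cache it in memory for ~3000s."""
import json
import time
import requests

TOKEN_URL = "https://oauth2.googleapis.com/token"
CACHE_TTL = 3000  # seconds, matching the email-triage-worker pattern
_cache = {}  # cred_path -> (access_token, fetched_at)

def get_access_token(cred_path: str) -> str:
    token, fetched_at = _cache.get(cred_path, (None, 0.0))
    if token and time.time() - fetched_at < CACHE_TTL:
        return token  # still fresh, skip the network round trip

    creds = json.load(open(cred_path))  # assumed keys: client_id, client_secret, refresh_token
    resp = requests.post(TOKEN_URL, data={
        "client_id": creds["client_id"],
        "client_secret": creds["client_secret"],
        "refresh_token": creds["refresh_token"],
        "grant_type": "refresh_token",
    }, timeout=15)
    resp.raise_for_status()
    token = resp.json()["access_token"]
    _cache[cred_path] = (token, time.time())
    return token
```

The 3000 s TTL keeps callers well inside the ~1 hour access-token lifetime while avoiding a refresh on every request.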

Priority: HIGH - Day 3 offline, blocking email automation

Tracking file: working/infrastructure/oauth-health.md


API Inventory & Health

Active APIs (9 services)

Google APIs:

ZTAG Operations:

External Services:

Tracking file: working/infrastructure/api-inventory.md


Docker & Auto-Commit Protection

Container Restart Data Loss (Feb 11 Incident)

What happened: 4+ hours of work lost when the container was restarted

Root cause: Files written to container writable layer (not mounted volume)

Mitigation (ACTIVE):

  1. Auto-commit cron (hourly) - Commits workspace changes; see the sketch after this list
  2. Pre-restart check - tools/pre-restart-check.sh before any container operation
  3. Volume-first writes - All files go to /home/node/.openclaw/workspace (mounted)
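
For illustration, the hourly auto-commit can be as small as the sketch below; the script name and cron wiring shown here are assumptions, not the deployed tooling.

```python
#!/usr/bin/env python3
"""Hedged sketch of the hourly auto-commit job.
Illustrative cron entry: 0 * * * * /usr/bin/python3 /home/node/.openclaw/workspace/tools/auto-commit.py"""
import subprocess
from datetime import datetime, timezone

WORKSPACE = "/home/node/.openclaw/workspace"  # mounted volume, survives container restarts

def auto_commit():
    # Stage everything, then commit only if the working tree actually changed
    subprocess.run(["git", "-C", WORKSPACE, "add", "-A"], check=True)
    status = subprocess.run(["git", "-C", WORKSPACE, "status", "--porcelain"],
                            capture_output=True, text=True, check=True)
    if not status.stdout.strip():
        return  # nothing to commit this hour
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    subprocess.run(["git", "-C", WORKSPACE, "commit", "-m", f"auto-commit: {stamp}"], check=True)

if __name__ == "__main__":
    auto_commit()
```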

Protection Protocol: PROTECTION-PROTOCOL.md - mandatory safeguards


Rebuild Discipline

Scheduled window: Sunday 9:45 PM PT (after Review agent)

Three mutation layers:

  1. 🧊 Image (rebuild required) - Python SDKs, system libs, ChromaDB
  2. 🟡 Runtime (restart only) - All .md files, cron, OAuth tokens
  3. 🟣 Infra (cloud-side) - Google API enables, DNS, IAM

Expected rebuilds over 12 months: ~5 total if disciplined

Tracking file: REBUILD-WINDOW.md


Tailscale + Markdown Server

Tailscale Network

Device: minnie-core
IP: 100.72.11.53
Purpose: Secure access to VPS services from any device

Pattern: Leave on at all times (minimal battery impact, zero-friction access)

Markdown Server

Access: http://100.72.11.53:9876 or http://minnie-core:9876
Purpose: Browse workspace files rendered as HTML
Pattern: Container port exposure (same as Quo webhook on 18791)

Service: markdown-server.service (container-managed, survives reboots)
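
The pattern is simple enough to capture in a short sketch (illustrative only; the deployed markdown-server.service may be implemented differently): serve files from the mounted workspace and render .md to HTML on port 9876.

```python
#!/usr/bin/env python3
"""Hedged sketch of the markdown-server pattern: render workspace .md files as HTML on 0.0.0.0:9876."""
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
import markdown  # pip install markdown

WORKSPACE = Path("/home/node/.openclaw/workspace")

class MarkdownHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = (WORKSPACE / self.path.lstrip("/")).resolve()
        # Refuse path traversal and anything that is not a markdown file inside the workspace
        if not target.is_file() or target.suffix != ".md" or WORKSPACE not in target.parents:
            self.send_error(404)
            return
        body = markdown.markdown(target.read_text()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9876), MarkdownHandler).serve_forever()
```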

Features:


Cron Jobs & Automation

Active Cron Jobs

Operational:

Tracking:


Webhooks & Services

Quo SMS Webhook

Port: 18791 (exposed from container)
Service: quo-webhook.service (host-managed)
Status: ✅ Operational (fixed Feb 15)

Handler: tools/quo-webhook-handler-v2.py
Credentials: /home/node/.openclaw/credentials/quo-api.json

Pattern: Container port exposure via docker-compose
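
Illustrative sketch of the receiving side of this pattern (the deployed handler is tools/quo-webhook-handler-v2.py; payload fields and any signature verification are omitted here):

```python
#!/usr/bin/env python3
"""Hedged sketch of a webhook receiver on the exposed container port."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = 18791  # exposed from the container via docker-compose port mapping

class QuoWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        # Hand the event off to downstream processing (logging stands in for it here)
        print("Quo event:", event)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PORT), QuoWebhook).serve_forever()
```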


Fathom Meeting Summaries

Webhook: Active (Zapier → OpenClaw)
Status: ✅ Operational
Processing: Meeting summaries auto-delivered via hooks


OTA Pipeline (Manual Process)

Current: Manual firmware deployment
Target: Automated OTA (Over-The-Air) updates

Bottleneck: Scaling issues causing OTA failures (Malachi debugging)

Automation design needed:

Priority: MEDIUM - not blocking current operations, but needed for scale

Tracking file: working/infrastructure/ota-pipeline.md


API Health Monitoring (Not Implemented)

Need: Automated health checks for 9 APIs
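
One possible shape for the check loop, as a hedged sketch: the service URLs below are placeholders, the real list would come from working/infrastructure/api-inventory.md, and escalation would go to Main rather than a print.

```python
#!/usr/bin/env python3
"""Hedged sketch of a periodic API health check loop (URLs and alerting are placeholders)."""
from datetime import datetime, timezone
import requests

# Illustrative subset of the 9-service inventory
CHECKS = {
    "markdown-server": "http://127.0.0.1:9876/",
    "quo-webhook": "http://127.0.0.1:18791/",
    "wttr.in": "https://wttr.in/?format=3",
}

def run_checks() -> list[str]:
    failures = []
    for name, url in CHECKS.items():
        try:
            requests.get(url, timeout=10)  # any HTTP response counts as "reachable" in this sketch
        except requests.RequestException:
            failures.append(name)
    return failures

if __name__ == "__main__":
    failed = run_checks()
    stamp = datetime.now(timezone.utc).isoformat()
    if failed:
        # Escalation hook goes here (e.g. alert Main); print stands in for it
        print(f"[{stamp}] HEALTH CHECK FAILED: {', '.join(failed)}")
    else:
        print(f"[{stamp}] all checks passed")
```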

Prevent:

Design:

Priority: HIGH - prevents production issues

Tracking file: working/infrastructure/api-health-monitoring.md


Security & Tech Debt

Principle: Act as an IT security specialist. Think 5 years ahead.

Before ANY technical solution:

  1. Run security checklist (auth, data exposure, injection, dependencies)
  2. Assess tech debt being created
  3. Calculate refactoring cost if we change later
  4. Show tradeoffs (time now vs time later)
  5. Recommend foundation approach (not quick wins)

Guideline: "I can get it working in X time with Y tech debt. Or build properly in X+Z time with no debt. Here's the refactoring cost: [estimate]. Which do you prefer?"

Reference: SECURITY-TECH-DEBT.md


Zoho OAuth (CRITICAL - BLOCKS TIER 2)

Current: No Zoho API integration (CRM, Books, Desk)

Needed for Tier 2:

Blocker: OAuth setup required (credentials, scopes, tokens)

Priority: HIGHEST - blocks escape velocity progress

Action: Set up Zoho OAuth, test CRM read access first
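
A hedged sketch of that first step, assuming a refresh token has already been generated in the Zoho API console; the accounts/API domains are data-center specific (.com shown) and the credential file layout is an assumption:

```python
"""Hedged sketch: refresh a Zoho access token, then do the smallest possible CRM read test."""
import json
import requests

ACCOUNTS = "https://accounts.zoho.com"      # adjust to the org's Zoho data center
CRM_API = "https://www.zohoapis.com/crm/v2"

def zoho_access_token(cred_path: str) -> str:
    creds = json.load(open(cred_path))  # assumed keys: client_id, client_secret, refresh_token
    resp = requests.post(f"{ACCOUNTS}/oauth/v2/token", data={
        "grant_type": "refresh_token",
        "client_id": creds["client_id"],
        "client_secret": creds["client_secret"],
        "refresh_token": creds["refresh_token"],
    }, timeout=15)
    resp.raise_for_status()
    return resp.json()["access_token"]

def list_leads(token: str) -> dict:
    # Minimal read-access check against CRM
    resp = requests.get(f"{CRM_API}/Leads", params={"per_page": 5},
                        headers={"Authorization": f"Zoho-oauthtoken {token}"}, timeout=15)
    resp.raise_for_status()
    return resp.json()
```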


Incident Log

Feb 11: Container Restart Data Loss

Impact: 4+ hours of work lost (webhook server, Gmail Pub/Sub, MEMORY.md)
Root cause: Files in container writable layer (not mounted volume)
Fix: Auto-commit cron + pre-restart checks
Status: Resolved

Feb 13-16: Gmail OAuth Unauthorized (Day 3)

Impact: Email visibility blocked, triage worker offline
Root cause: Access tokens expire after ~1 hour; no refresh logic in place
Fix: In progress (programmatic refresh needed)
Status: Active incident

Feb 13-16: Weather API Timeout (Day 3)

Impact: Morning briefings degraded (no weather data)
Root cause: wttr.in timeout
Fix: Alternative weather API needed
Status: Active incident
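
One low-effort fix is a fallback fetch, sketched below under the assumption that Open-Meteo (keyless) is an acceptable secondary source; coordinates are supplied by the caller.

```python
"""Hedged sketch of a weather fallback for the morning briefing: try wttr.in, fall back to Open-Meteo."""
import requests

def get_weather(lat: float, lon: float) -> str:
    try:
        # Primary: wttr.in one-line format, short timeout so a hang cannot stall the briefing
        resp = requests.get("https://wttr.in/?format=3", timeout=5)
        resp.raise_for_status()
        return resp.text.strip()
    except requests.RequestException:
        pass
    # Fallback: Open-Meteo current weather (no API key required)
    resp = requests.get("https://api.open-meteo.com/v1/forecast",
                        params={"latitude": lat, "longitude": lon, "current_weather": "true"},
                        timeout=10)
    resp.raise_for_status()
    cw = resp.json()["current_weather"]
    return f"{cw['temperature']}°C, wind {cw['windspeed']} km/h"
```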

Tracking file: working/infrastructure/incident-log.md


Escalation Triggers (When to Alert Main)

Immediate escalation:

Weekly summary:


Working Files You Should Track


Success Metrics (Your Domain)

Uptime:

Security:

Automation health:


Coordination Patterns

You surface to:

You receive from:


Your First Tasks

  1. Fix Gmail OAuth - Day 3 offline, highest priority
  2. Check API health - All 9 services, report status
  3. Review incident log - Any patterns? Preventable?
  4. Test escalation to Main - Confirm hub-and-spoke working

You are the reliability lens. Prevent outages. Maintain automation. Protect production.