Date: 2026-02-16
Context: Analysis of what went wrong in Jan 2025 vs what's possible now
Goal: Maximize AI leverage without repeating past mistakes or destroying developer agency
The Jan 2025 AI experiment failed not because "AI is shit at coding" but because of workflow mistakes, broken down in the failure table below.
What's different now: agentic iteration, self-correction, tool use, and persistent project context (see the capability comparison table).
The path forward: Invert the loop. Human defines constraints → AI plans → Human approves → AI implements with iteration → Human validates. Developer understanding preserved. AI handles tedium.
From the Jan 14, 2025 meeting, Quan proposed:
"Treat the documentation as your code, not the code. The code should be basically treated as opaque as a binary. You shouldn't have to look at it anymore."
| Problem | What Happened | Root Cause |
|---|---|---|
| Dynamic Memory | AI used shared_ptr despite spec saying no dynamic allocation | No embedded constraints in prompt context |
| Spec Gaps | Implementation revealed missing details | AI couldn't ask clarifying questions |
| Merge Hell | Ferenc's manual fixes conflicted with AI code | No iteration workflow - generate once |
| No Understanding | Team couldn't debug the code | "Opaque binary" philosophy |
| 6 Month Cleanup | Devs spent half a year fixing | All of the above |
Source: Claude Code Overview, How Claude Code Works
| Capability | Jan 2025 | Feb 2026 |
|---|---|---|
| Agentic Iteration | No - generate once | Yes - iterate until task done |
| Self-Correction | No | Yes - reads errors, fixes |
| Multi-Agent | No | Yes - subagents in parallel |
| Tool Use | No | Yes - terminal, build, test |
| Project Context | Lost every session | CLAUDE.md persists constraints |
| Plan Mode | No | Yes - Research → Clarify → Plan → Build |
Source: OpenAI Codex 5.3, Codex Review 2026
Key advancement: the first model instrumental in its own creation; it debugged its own training, managed deployment, and diagnosed test results.
"GPT-5.3-Codex can take on long-running tasks that involve research, tool use, and complex execution. Much like a colleague, you can steer and interact with GPT-5.3-Codex while it's working, without losing context."
Source: Cursor AI Guide 2026, Best AI Coding Agents 2026
"Agent mode is a fully autonomous coding agent that plans, writes, tests, and debugs without hand-holding."
Key: Multiple agents working in parallel - one refactoring, one fixing tests, one doing UI polish.
Source: Embedder, ESP32 Development 2026
"Memory is now allocated once, during initialization — not continuously at runtime."
This is EXACTLY what Code5 needs, and what AI got wrong in Jan 2025.
Source: Writing a Good CLAUDE.md, Claude.md Research
The AI had no persistent context about Code5's constraints. Every generation started fresh, unaware of the memory rules, the framework, or the architecture. A CLAUDE.md file embeds that context persistently; for example:
```markdown
# Code5 ZTagger Firmware - Project Context

## CRITICAL CONSTRAINTS (READ FIRST)

**Memory Management:**
- **NO dynamic memory allocation at runtime** (no shared_ptr, no new/malloc after init)
- ALL memory allocated once during initialization
- Static allocation patterns ONLY
- Heap fragmentation is FATAL for long-running embedded

**Real-Time Requirements:**
- Deterministic timing required
- No blocking calls in critical paths
- All long-running work in FreeRTOS tasks
- ISR handlers must be minimal

**Framework:**
- ESP-IDF ONLY (no Arduino abstractions)
- FreeRTOS task model
- ESP-NOW for wireless (reliability > speed)

## Architecture Overview

Game Manager (core)
├── Task Manager (FreeRTOS)
├── Peripheral Tasks
│   ├── Display (LVGL)
│   ├── Haptics (motor control)
│   ├── Sound (synthesizer)
│   └── Light Bar (RGB)
├── Game Logic (state machines)
└── Communication (ESP-NOW, OTA, WiFi)

## Before Generating Code

1. **ALWAYS** ask clarifying questions if spec is ambiguous
2. **ALWAYS** confirm memory allocation strategy before implementing
3. **ALWAYS** include unit test with implementation
4. **NEVER** use dynamic allocation without explicit approval

## Testing Requirements

- Unit tests run in simulator before hardware
- BDD tests validate behavior
- Test must pass before PR

## Code Style

- Snake_case for functions and variables
- UPPER_CASE for constants and macros
- Prefix with module name (e.g., game_manager_, display_)
```
| Jan 2025 Failure | CLAUDE.md Prevention |
|---|---|
| Dynamic memory | "NO dynamic memory" in CRITICAL section |
| Spec gaps | "ALWAYS ask clarifying questions" |
| No tests | "ALWAYS include unit test" |
| Wrong framework | "ESP-IDF ONLY" |
| No understanding | Architecture overview provides context |
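To make the CRITICAL constraints concrete, here is a minimal sketch of what "allocate once at init, keep the ISR minimal, do the work in a FreeRTOS task" looks like in practice. It assumes ESP-IDF with static FreeRTOS allocation enabled; the module name, event struct, and the commented-out game_manager_register_hit() call are hypothetical.

```c
// Hypothetical sketch of the static-allocation + minimal-ISR pattern (ESP-IDF assumed).
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "esp_attr.h"

#define HIT_EVENT_QUEUE_LEN 16          // UPPER_CASE for constants

typedef struct {
    uint32_t timestamp_us;
    uint8_t  sensor_id;
} hit_event_t;

// All storage is static: no new/malloc/shared_ptr anywhere after init.
static StaticQueue_t hit_queue_ctrl;
static uint8_t       hit_queue_storage[HIT_EVENT_QUEUE_LEN * sizeof(hit_event_t)];
static QueueHandle_t hit_queue;

static StaticTask_t  hit_task_ctrl;
static StackType_t   hit_task_stack[4096];

// ISR stays minimal: capture the event, push it to a queue, return.
static void IRAM_ATTR hit_sensor_isr(void *arg)
{
    hit_event_t evt = { .timestamp_us = 0 /* capture elided */,
                        .sensor_id = (uint8_t)(uintptr_t)arg };
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(hit_queue, &evt, &woken);
    if (woken) {
        portYIELD_FROM_ISR();
    }
}

// Long-running work happens in a FreeRTOS task, never in the ISR.
static void hit_task(void *arg)
{
    (void)arg;
    hit_event_t evt;
    for (;;) {
        if (xQueueReceive(hit_queue, &evt, portMAX_DELAY) == pdTRUE) {
            // game_manager_register_hit(evt.sensor_id, evt.timestamp_us);  // hypothetical
        }
    }
}

// The only place anything is "allocated" - called once at startup.
void hit_module_init(void)
{
    hit_queue = xQueueCreateStatic(HIT_EVENT_QUEUE_LEN, sizeof(hit_event_t),
                                   hit_queue_storage, &hit_queue_ctrl);
    xTaskCreateStatic(hit_task, "hit_task",
                      sizeof(hit_task_stack) / sizeof(StackType_t),
                      NULL, 5, hit_task_stack, &hit_task_ctrl);
}
```

The specific code matters less than the shape: every queue, buffer, and task stack exists at compile time, so nothing can fragment the heap at runtime.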
Profile:
AI Integration Strategy:
| Use AI For | Why It Works for Malachi |
|---|---|
| CI/CD automation | His own idea, pure tooling |
| Test generation | Tests validate tests, low risk |
| Code review assist | Second opinion, he has final say |
| Documentation | Low risk, high leverage |
| Boilerplate | Boring stuff, well-defined |
Workflow That Preserves His Agency:
```
Malachi defines requirements
↓
AI generates PLAN (not code)
↓
Malachi reviews/modifies plan
↓
AI implements with iteration
↓
Malachi reviews final code
↓
Malachi decides merge
```
Key Principle: He stays the architect. AI is his tool, not a replacement.
Entry Point: His own CI/CD idea. Let him own it.
Profile:
AI Integration Strategy:
| Use AI For | Why It Works for Ryan |
|---|---|
| Full agent mode | He already validates output |
| LVGL iteration | Render → test → fix loop |
| Cross-referencing | His existing multi-tool approach |
| Bug hunting | Claude's strength he identified |
Workflow:
```
Ryan defines feature/fix
↓
Agent implements + tests
↓
Agent runs tests, iterates
↓
Ryan reviews final result
↓
Malachi approves for merge
```
Expansion Opportunity: Let agents do full iteration loops on display rendering where output is visually verifiable.
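A concrete example of a bounded, visually verifiable task to hand an agent: a score readout on the display. A minimal sketch assuming LVGL v8; the display_ functions and the label layout are hypothetical.

```c
// Hypothetical sketch of a small, visually verifiable display task (LVGL v8 assumed).
#include <stdio.h>
#include <stdint.h>
#include "lvgl.h"

static lv_obj_t *score_label;   // widget created once at init

// Called once during display init: create the label on the active screen.
void display_score_init(void)
{
    score_label = lv_label_create(lv_scr_act());
    lv_obj_align(score_label, LV_ALIGN_TOP_MID, 0, 8);
    lv_label_set_text(score_label, "SCORE 0");
}

// Called from game logic whenever the score changes.
void display_score_update(uint32_t score)
{
    char buf[16];
    snprintf(buf, sizeof(buf), "SCORE %u", (unsigned int)score);
    lv_label_set_text(score_label, buf);
}
```

Whether the agent got it right is settled by looking at the physical screen - exactly the human validation gate this plan relies on.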
Profile:
AI Integration Strategy:
| Use AI For | Why It Works for Juniors |
|---|---|
| Test writing | Learn by seeing tests written |
| Documentation | Low risk practice |
| Boilerplate | 10x velocity on boring stuff |
| Explanation | AI explains code to them |
| Pair programming | Agent as "senior dev pair" |
Workflow That Builds Understanding:
```
Malachi gives explicit requirement
↓
Junior prompts AI with requirement
↓
AI explains plan, asks questions
↓
Junior validates understanding
↓
AI implements + explains
↓
Junior learns from explanation
↓
Junior validates with Malachi
```
Key Principle: Agents accelerate THEIR learning, not replace them.
Entry Point: Have one junior use Cursor for test writing on a bounded task.
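For example, a first bounded task could be "write unit tests for the hit-scoring function." A minimal sketch assuming ESP-IDF's Unity test component; game_score_for_hit() and its scoring rules are hypothetical.

```c
// Hypothetical sketch of a bounded test-writing task (ESP-IDF Unity component assumed).
#include "unity.h"

// Hypothetical function under test: points awarded for a hit in a given distance band.
int game_score_for_hit(int distance_band);

TEST_CASE("close-range hit scores more than long-range hit", "[game_logic]")
{
    TEST_ASSERT_GREATER_THAN(game_score_for_hit(3), game_score_for_hit(1));
}

TEST_CASE("invalid distance band scores zero", "[game_logic]")
{
    TEST_ASSERT_EQUAL_INT(0, game_score_for_hit(-1));
}
```

The junior reviews what the agent wrote against the spec and learns the module's contract in the process.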
| Task | Risk Level | Benefit | Owner |
|---|---|---|---|
| CI/CD Pipeline | Very Low | Automation, testable | Malachi |
| Test Generation | Low | More coverage | Everyone |
| Documentation | Very Low | Time savings | Juniors |
| Boilerplate Code | Low | Velocity | Everyone |
| Code Formatting | None | Consistency | Automated |
| Task | Risk Level | Mitigation | Owner |
|---|---|---|---|
| Bug Triage | Medium | Human validates fix | Malachi/Ryan |
| Feature Implementation | Medium | Plan mode + review | Ryan/Juniors |
| Refactoring | Medium | Test coverage first | Ryan |
| Code Review Assist | Medium | Human has final say | Malachi |
| Task | Risk Level | When OK | Owner |
|---|---|---|---|
| Game Logic | High | Only with full test coverage | Team |
| State Machines | High | After Malachi validates design | Malachi |
| ESP-NOW Protocol | High | Only for boilerplate | Malachi |
| Task | Why Not | Who Owns |
|---|---|---|
| Architecture Decisions | Requires deep system knowledge | Malachi |
| Memory Strategy | ESP32-specific, determinism | Malachi |
| Unsupervised Generation | The "opaque binary" mistake | NO ONE |
| Critical Path Without Review | Too risky | Malachi |
Before any AI integration, ask:
"Will the developer understand this code well enough to debug it at 2am when production is down?"
If no → Don't use AI for it, or restructure the approach.
The Jan 2025 loop:
```
Spec (Markdown)
↓
AI Generates Code
↓
Hope It Works
↓
Manual Fix When Broken
↓
6 Months Cleanup
```
The Feb 2026 loop:
```
CLAUDE.md (Constraints)
↓
Human Defines Task
↓
AI Outputs PLAN
↓
Human Reviews Plan
↓
AI Implements + Tests
↓
AI Runs Tests, Iterates
↓
Human Reviews Final
↓
Human Decides Merge
```
| Aspect | Jan 2025 | Feb 2026 |
|---|---|---|
| Constraints | None | CLAUDE.md |
| Planning | None | Plan mode first |
| Iteration | Manual | Autonomous |
| Testing | Afterthought | In the loop |
| Understanding | Sacrificed | Preserved |
| Human Role | Fix AI mess | Validate AI work |
The ESP32/embedded context creates constraints that actually PROTECT against AI overreach:
| Pure Software | Embedded (Code5) |
|---|---|
| Deploy instantly | Flash to hardware |
| Test in CI | Test on physical device |
| Memory is cheap | Memory is precious |
| Latency tolerant | Real-time critical |
| Can hot-patch | Brick risk is real |
| Users can refresh | Devices in field |
1. Hardware Validation Cannot Be Faked
AI can generate code that compiles. It cannot see the display render, feel a haptic pattern, hear the sound output, or play a round on physical taggers.
This is why Malachi's acceptance testing vision is perfect:
"You would use a live feed camera to hear and watch the output on a ztagger device"
Physical validation = human judgment required. AI assists, humans validate.
2. Determinism Requirements
From meetings, Malachi emphasized:
"Mission-critical determinism requirements"
AI-generated code in Jan 2025 used shared_ptr → non-deterministic allocation → embedded disaster.
Rule: Any AI-generated code for Code5 must pass determinism review. This is a HUMAN judgment.
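Parts of that review can still be backed by measurement. A minimal sketch, assuming ESP-IDF's esp_timer_get_time() and a hypothetical game_manager_tick() with a hypothetical 2 ms budget, of an on-device timing check that makes violations visible:

```c
// Hypothetical sketch: measure worst-case tick time against a budget (ESP-IDF assumed).
#include <inttypes.h>
#include "esp_timer.h"
#include "esp_log.h"

#define GAME_TICK_BUDGET_US 2000        // hypothetical real-time budget for one tick

static const char *TAG = "determinism_check";

// Hypothetical entry point for one game-logic step.
extern void game_manager_tick(void);

void determinism_check_run(int iterations)
{
    int64_t worst_us = 0;
    for (int i = 0; i < iterations; i++) {
        int64_t start = esp_timer_get_time();
        game_manager_tick();
        int64_t elapsed = esp_timer_get_time() - start;
        if (elapsed > worst_us) {
            worst_us = elapsed;
        }
    }
    if (worst_us > GAME_TICK_BUDGET_US) {
        ESP_LOGW(TAG, "worst tick %" PRId64 " us exceeds %d us budget",
                 worst_us, GAME_TICK_BUDGET_US);
    } else {
        ESP_LOGI(TAG, "worst tick %" PRId64 " us within %d us budget",
                 worst_us, GAME_TICK_BUDGET_US);
    }
}
```

A human still decides whether the budget is right and whether the run covered the worst case; the tool only surfaces violations.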
3. Field Deployment Reality
Pure software: Bad code → user refreshes → fixed
Embedded: Bad OTA → bricked devices → customer disaster
From meetings (Feb 2026):
"OTA fails with 24+ devices"
This kind of bug requires physical testing with actual devices. AI can help debug logs, but humans must validate on hardware.
4. The 24-Tagger Problem
ESP-NOW at scale is not something AI has training data for. From Malachi's interview:
"Most ESP-NOW demos are only good for demos... if you've got 20 taggers running around..."
Domain expertise > AI capability for edge cases like this.
The constraints actually make AI safer to use:
| Constraint | Why It Helps |
|---|---|
| Must compile for ESP32 | Immediate feedback loop |
| Must run on device | Physical validation gate |
| Memory limits | Forces AI to write efficient code |
| Real-time requirements | Can measure timing, catch violations |
| Test harness exists | 13-15 devices can validate |
High Value, Hardware-Safe:
| Use Case | Why Safe for Embedded |
|---|---|
| Unit tests | Run in simulator first |
| Build scripts | Tooling, not runtime |
| Documentation | No hardware risk |
| Log analysis | Read-only, diagnostic |
| CI/CD | Automation of existing flows |
Requires Physical Validation:
| Use Case | Validation Method |
|---|---|
| Display code | Must see on actual screen |
| Haptic patterns | Must feel on device |
| Game logic | Must play the game |
| Multi-device | Must test with 20+ taggers |
| OTA | Must flash and verify |
The validation pipeline:
```
AI generates code
↓
Compiles for ESP32? (automated gate)
↓
Unit tests pass in simulator? (automated gate)
↓
Manual flash to device
↓
HUMAN validates on hardware
↓
Multi-device test
↓
HUMAN approves for production
```
Key: The hardware gates FORCE human involvement. This is a feature, not a bug.
Pure software companies face a risk: AI could theoretically replace their entire stack.
Code5/ZTAG has physical reality: devices in the field, hardware that must be flashed and validated, and a game that has to feel right in players' hands.
AI cannot replace the human judgment of "does this feel like a good game?"
This means you shouldn't fight the embedded constraints - use them. The physicality of Code5 means AI integration is inherently bounded: you can't over-adopt AI, because hardware validation will catch mistakes.
The Jan 2025 failure taught us that AI without constraints and iteration creates debt, not value.
The Feb 2026 opportunity is that AI WITH constraints and iteration can create leverage without destroying understanding.
The difference, as a rough heuristic:
AI Value = (Capability × Constraints × Iteration) / Supervision Required
Jan 2025: Low capability × No constraints × No iteration, with zero supervision up front and six months of cleanup afterward = Negative value
Feb 2026: High capability × CLAUDE.md × Agentic loops, with human validation built into the loop = High value