
Implementation Guide: Agentic Workflows with Malachi

Before You Start: Pre-Flight Checklist


Phase 1 Implementation (Months 1-2)

Task 1A: Boilerplate Generation Pilot

Objective: Auto-generate FreeRTOS task scaffolding; Malachi approves structure
Timeline: Week 1-2
Owner: [Pick one senior dev + one tool specialist]

Setup

  1. Choose a simple FreeRTOS task (e.g., new button handler, sensor poller)
  2. Create a prompt for Claude Code (a sketch of the kind of scaffold it might return follows this list):
You are generating FreeRTOS task scaffolding for ZTAG Code 5.

Task Name: [e.g., "ButtonHandler"]
Priority: [e.g., "5 (medium)"]
Dependencies: [e.g., "Reads GPIO pin 12, posts event to queue"]
Timeout Behavior: [e.g., "Watchdog: 1 second timeout"]

Generate:
1. Task function (FreeRTOS conventions)
2. Unit test skeleton (test framework: [Unity/Catch2])
3. Architecture comment explaining assumptions
4. TODO list for implementation

Constraints:
- Must match ZTAG Code 5 style guide
- Comments explain WHY, not WHAT
- No functionality—only scaffold
  3. Present generated code to Malachi:

    • Show scaffold + test template
    • Ask: "Does the structure make sense? Anything you'd change?"
    • Get written approval (comment in GitHub issue)
  4. Junior dev fills in logic:

    • Implement task behavior
    • Write unit tests to fill skeleton
    • Submit PR
  5. Code review:

    • Malachi: Check for correctness, test coverage, clarity
    • Merge if ≥80% test coverage + all tests pass
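As a reference point, here is a minimal sketch of the kind of scaffold the prompt might produce, assuming Unity as the test framework and ESP-IDF-style FreeRTOS includes. ButtonHandlerTask, button_evt_queue, and the pin/priority values are placeholders taken from the example prompt, not real Code 5 names:

    /* button_handler.c: illustrative scaffold only. Names and values come
     * from the example prompt, not from the actual Code 5 codebase. */
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "freertos/queue.h"

    /* Assumption: the event queue is created elsewhere during system init. */
    extern QueueHandle_t button_evt_queue;

    #define BUTTON_GPIO_PIN          12
    #define BUTTON_POLL_INTERVAL_MS  10

    /* WHY: polling keeps debounce logic in one place; revisit if the 1 s
     * watchdog budget or input latency becomes a problem. */
    void ButtonHandlerTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            /* TODO: read BUTTON_GPIO_PIN, debounce, post event to button_evt_queue */
            /* TODO: kick the watchdog; each loop must finish well under 1 s */
            vTaskDelay(pdMS_TO_TICKS(BUTTON_POLL_INTERVAL_MS));
        }
    }

    /* test_button_handler.c: Unity skeleton (assumes Unity is the framework). */
    #include "unity.h"

    void setUp(void)    { /* TODO: reset GPIO/queue fakes */ }
    void tearDown(void) { /* TODO: release fakes */ }

    void test_button_press_posts_event(void)
    {
        /* TODO: simulate a GPIO edge and assert an event lands on the queue */
        TEST_FAIL_MESSAGE("not implemented");
    }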

Success Criteria

What to Document


Task 1B: Test Coverage Analysis Pilot

Objective: AI flags untested code paths; Malachi decides which tests matter
Timeline: Week 2-3
Owner: [QA lead + tool specialist]

Setup

  1. Run code coverage on existing Code 5 module:

    # Using gcov/gcovr (or your project's coverage target)
    make test-coverage
    gcovr -r . > coverage.txt
    
  2. Create prompt for Claude Code:

    We have a C/FreeRTOS codebase (ZTAG Code 5) with partial test coverage.
    
    Analyze this coverage report and identify:
    1. Code paths with <50% coverage (priority high)
    2. Edge cases that aren't tested (e.g., timeouts, errors)
    3. Risky untested scenarios (e.g., concurrent access, boundary conditions)
    4. Suggested test cases to close gaps
    
    For each gap, explain why it matters and what test would cover it.
    
    [Paste coverage report]
    
  3. Generate report:

    • AI produces prioritized list of test gaps
    • Include: risk level, suggested test case, estimated time to implement
  4. Malachi reviews & decides:

    • Which tests are actually important (not all code paths need tests)
    • Which are "nice to have" vs. "must have"
    • Creates GitHub issues for approved tests
  5. Junior dev implements the approved tests (an example test sketch follows this list)
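As an illustration only, this is the shape of a test a flagged gap might turn into. The "queue full" scenario, sensor_post_reading(), and sensor_evt_queue are hypothetical names, and Unity is assumed as the framework:

    /* Hypothetical gap: the "event queue full" timeout branch is never
     * exercised. Sketch of a Unity test that would close it. */
    #include "unity.h"
    #include "freertos/FreeRTOS.h"
    #include "freertos/queue.h"

    extern QueueHandle_t sensor_evt_queue;
    extern int sensor_post_reading(int value, TickType_t timeout_ticks);

    void setUp(void)    {}
    void tearDown(void) {}

    void test_post_reading_reports_error_when_queue_full(void)
    {
        /* Fill the queue so the next post must hit the timeout branch. */
        while (uxQueueSpacesAvailable(sensor_evt_queue) > 0) {
            int dummy = 0;
            xQueueSend(sensor_evt_queue, &dummy, 0);
        }
        TEST_ASSERT_EQUAL_INT(-1, sensor_post_reading(42, pdMS_TO_TICKS(10)));
    }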

Success Criteria

What to Document


Task 1C: Documentation Sync Checker Pilot

Objective: AI flags where code and docs diverge; Malachi decides if docs or code is wrong
Timeline: Week 3-4
Owner: [Documentation lead + tool specialist]

Setup

  1. Gather comparison data:

    • Latest PRs merged to develop branch (last 2 weeks)
    • Functional Design Document (FDD)
    • Code comments in relevant modules
  2. Create prompt for Claude Code:

    I'm checking if code and documentation stay in sync.
    
    Recent changes (from PRs):
    [List recent changes: "PR #123: Changed MQTT reconnect timeout from 5s to 10s"]
    
    FDD/Documentation:
    [Paste relevant sections of design doc]
    
    Identify:
    1. Changes that SHOULD have updated docs but didn't
    2. Docs that SHOULD be updated but weren't
    3. Assumptions in code that aren't documented
    4. Discrepancies that indicate potential bugs
    
    For each, flag as:
    - "Code needs to be changed" (docs are correct)
    - "Docs need to be updated" (code is correct)
    - "Needs investigation" (unclear which is right)
    
  3. Generate a report with recommendations (the sketch after this list shows the kind of discrepancy to catch)

  4. Malachi reviews & decides:

    • For each discrepancy: Which was right? Update the other.
    • Create GitHub issues for needed doc updates
  5. Assign doc updates to team
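To make the PR #123 example in the prompt concrete, this is the kind of code-vs-doc mismatch the checker should surface. The constant name and FDD wording are hypothetical:

    /* Code (after PR #123): reconnect timeout is now 10 s.
     * Hypothetical constant name, for illustration only. */
    #define MQTT_RECONNECT_TIMEOUT_MS  10000

    /* FDD (not updated): "The device retries the MQTT connection after
     * 5 seconds." Flag as "Docs need to be updated" if 10 s is the intended
     * behavior, or "Code needs to be changed" if 5 s was a requirement. */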

Success Criteria

What to Document


Phase 1 Milestone Review (End of Week 4)

Meeting with Malachi:

Agenda:
1. Present Phase 1 results
   - Time saved (boilerplate, test analysis, doc sync)
   - Quality metrics (regressions: 0?)
   - Team feedback (was this helpful?)

2. Malachi's assessment
   - "Does this add value?"
   - "Any concerns?"
   - "Ready to expand to Phase 2?"

3. Decision
   - GO: Proceed to Phase 2 (expand tools to PR review, refactoring)
   - HOLD: Refine Phase 1 tasks, stay longer
   - NO-GO: Tools aren't working; pivot strategy

Success = Malachi says: "This freed up time on routine work. Let's expand it."


Phase 2 Implementation (Months 3-4)

Task 2A: PR Review Assistant

What it does: Before Malachi manually reviews a PR, automated checks run and post their findings as a PR comment (the prompt below covers style, test gaps, edge cases, and known bug patterns).

How to set up:

  1. Create GitHub Actions workflow (runs on every PR):

    name: Automated Code Review
    on: [pull_request]
    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Run agentic review
            run: |
              # Call Claude Code via API with PR diff + context
              # Generate report as GitHub comment
    
  2. Claude Code prompt:

    Review this PR for Code 5.
    
    Code context:
    - Main branch: [paste git diff]
    - Related tests: [paste test changes]
    - Module changes affect: [list impacted modules]
    
    Check for:
    1. Style violations (ZTAG conventions)
    2. Uncovered code paths (new logic without tests?)
    3. Edge cases (what if timeout? concurrent access?)
    4. Similar past bugs (patterns we've seen before)
    5. Assumptions that should be documented
    
    Format as:
    - ✅ Good: [specific praise]
    - ⚠️ Check: [something to verify]
    - ❌ Fix: [must address before merge]
    
  3. Report generated as PR comment (automated, no human needed)

  4. Malachi's review flow:

    • Reads agentic analysis (2-5 min)
    • Adds human expertise (architecture, long-term implications)
    • Approves or requests changes
  5. Junior dev addresses feedback

Key: Malachi still reviews; the AI just does the legwork first


Task 2B: Refactoring Proposals

What it does: AI suggests improvements to a module; Malachi decides if worth doing
Example: "MQTT module has coupling with game state. Could decouple via event queue."

How to set up:

  1. Select a refactoring candidate (pick a module Malachi mentioned improving)

  2. Create Cline task:

    Analyze this Code 5 module for refactoring opportunities.
    
    [Paste module code]
    
    Current architecture:
    - [How it's used]
    - [Known issues/limitations]
    
    Suggest:
    1. Three potential improvements (in order of impact)
    2. For each: benefit, risk, estimated effort
    3. Pro/con of each approach
    
    Don't implement—just analyze.
    
  3. Generate analysis document

  4. Malachi reviews & decides:

    • "Is this worth doing?"
    • "Which approach is best?"
    • "Creates GitHub issue with decision rationale"
  5. If approved: Team implements with review


Task 2C: Field Issue Pre-Diagnosis

What it does: When a customer reports a bug, AI pre-analyzes logs
Example: "Device hangs after 15 min" → AI checks for timeout patterns, memory leaks, watchdog resets

How to set up:

  1. Gather field data:

    • Device logs (syslog, app logs); a health-log sketch follows this list
    • Timing of failure
    • Reproducibility info
  2. Create Claude Code prompt:

    A customer reported: "[bug description]"
    
    Device logs:
    [Paste relevant logs]
    
    Timeline:
    [When did it happen? How to reproduce?]
    
    Possible causes (in order of likelihood):
    1. [Root cause 1]: [why likely]
    2. [Root cause 2]: [why likely]
    
    Tests to run to narrow down:
    - [Test 1]: [what would this tell us?]
    - [Test 2]
    
    If it's [cause X], the fix is [suggestion].
    
  3. Generate diagnostic report

  4. Malachi reviews & prioritizes:

    • "Which root cause to investigate first?"
    • "Run these tests"
    • "If confirmed, implement fix"
  5. Result: faster bug resolution (pre-analysis shortens the investigation)
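A hedged sketch of how the firmware could make those logs more diagnosable: a low-priority task that periodically records heap and stack headroom, so a hang after ~15 minutes can be correlated with a leak or stack exhaustion. The API calls are standard FreeRTOS/ESP-IDF; the task name and 30 s period are illustrative.

    /* health_log.c: periodic health snapshot (illustrative only). */
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "esp_log.h"

    static const char *TAG = "health";

    void HealthLogTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            /* Free heap trending down over minutes suggests a leak;
             * a shrinking stack high-water mark suggests overflow risk. */
            ESP_LOGI(TAG, "free heap: %u bytes, min stack headroom: %u words",
                     (unsigned)xPortGetFreeHeapSize(),
                     (unsigned)uxTaskGetStackHighWaterMark(NULL));
            vTaskDelay(pdMS_TO_TICKS(30 * 1000));   /* every 30 s */
        }
    }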


Phase 3 Setup (Months 5-6)

Task 3A: Automated Test Scaffolding for Games

What: When a new game is added, auto-generate unit test template
Benefit: Ensures consistency; the junior dev focuses on assertions, not boilerplate

Setup: Similar to Task 1A, but for game-specific test patterns

Task 3B: Dependency Compatibility Checker

What: When ESP-IDF version is bumped, AI flags deprecated APIs
Benefit: Proactive (catch incompatibilities before they break builds)

Setup: GitHub Actions + Claude Code analyzing codebase against new API docs
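A small illustration of the pattern this checker protects, assuming nothing about which APIs actually changed: version-dependent call sites can be guarded with ESP-IDF's version macros, and the AI pass would flag places that lack such guards or call APIs the new release deprecates.

    /* esp_idf_version.h provides these macros in recent ESP-IDF releases. */
    #include "esp_idf_version.h"

    #if ESP_IDF_VERSION >= ESP_IDF_VERSION_VAL(5, 0, 0)
        /* Use the newer API here (hypothetical placeholder). */
    #else
        /* Keep the legacy call path for older IDF versions.  */
    #endif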


Conversation Starters with Malachi

Opening

"I've been looking at our bottlenecks. You're blocking on:

That's not sustainable at scale. I want to explore tools that handle tedious work so you focus on decisions only you can make.

Would you be willing to try a structured 6-week pilot?"

If He's Skeptical

Him: "Will this actually help or create more work?"
You: "Fair question. Let's start with 3 small tasks—boilerplate generation, test gap analysis, doc sync checking. We measure: time saved, regressions, and whether you think it adds value. If not, we stop. Deal?"

Him: "What if the AI generates broken code?"
You: "Same risk as a junior. You review before it ships. AI just handles the legwork faster. Phase 1 succeeds only if zero regressions. If we break something, we know it's not ready."

Him: "I don't trust AI with our code."
You: "I don't trust AI either. That's why you're in control. It suggests, you decide. You gate it. You turn it off if it doesn't work."

If He's Interested

Him: "OK, what do you need from me?"
You: "For Phase 1:

  1. Review 3 boilerplate scaffolds I generate (15 min each)
  2. Approve list of test gaps we should cover (30 min)
  3. Flag doc-code discrepancies we catch (30 min)

Total: ~3 hours over 4 weeks. In return, you save ~10 hours on routine work. You focus on the hard decisions."

Him: "What tools are we using?"
You: "Claude Code (explains the why, good for architecture) + Cline (task-based workflows) + GitHub Actions (automated checks). You can customize the rules."


Weekly Check-ins (Phase 1)

Every Monday, 10 min sync with Malachi:

Week 1: "How did the boilerplate generation feel? Anything you'd change?"
Week 2: "Did the test gap analysis flag real gaps? False positives?"
Week 3: "Is doc-code sync checking finding real discrepancies?"
Week 4: "Before we wrap Phase 1, what's your gut feeling? Worth expanding?"

Document feedback → Feed into Phase 2 setup


Metrics to Collect

Time Tracking (Per Task)

Activity                   Before AI   After AI                                  Savings
Create FreeRTOS scaffold   60 min      30 min (gen) + 15 min (review)            15 min
Analyze test coverage      60 min      10 min (gen) + 30 min (Malachi decides)   20 min
Check doc-code sync        45 min      15 min (gen) + 20 min (Malachi decides)   10 min
Total per month            ~900 min    ~600 min                                  ~300 min (~5 hrs)

Quality Metrics

Adoption Metrics


Rollback Plan

If Phase 1 fails (regressions, poor quality, Malachi says "no"), here's the pivot:

  1. Stop using agentic tools (immediately)
  2. Analyze what went wrong:
    • Was the tool low quality? → Try a different tool
    • Was it the wrong task selection? → Try different tasks
    • Were Malachi's concerns valid? → Address them
  3. Adjust and retry (or accept agentic workflows aren't right for ZTAG now)

Success Stories to Celebrate

When Phase 1 wins happen, share them:

"We auto-generated 5 FreeRTOS scaffolds using Claude Code. 
Saved 2 hours on boilerplate. Zero quality issues. 
Malachi approved the process. Rolling into Phase 2."

This builds team confidence. When Malachi sees tools working, he'll advocate for them.


Key Reminders

  1. Malachi controls the dial. Never surprise him with automation.
  2. Show, don't tell. Demos beat arguments.
  3. Focus on tedious work. Boilerplate, analysis, not decision-making.
  4. Quality > speed. If a tool trades quality for speed, Malachi will reject it.
  5. Evidence-based. Track metrics. Let data drive Phase 2 decision.

Timeline Reference

Week 1:     Task 1A setup (boilerplate generation)
Week 2:     Task 1B setup (test coverage), 1A feedback to Malachi
Week 3:     Task 1C setup (doc sync), 1B feedback to Malachi
Week 4:     Phase 1 review with Malachi, decide on Phase 2
Weeks 5-8:  Phase 2 (PR review assistant, refactoring proposals, field issues)
Weeks 9-12: Phase 3 (test scaffolding, dependency checker)

Next Action

  1. Schedule 1:1 with Malachi (use "Opening" conversation starter)
  2. If he agrees, kick off Phase 1 Week 1
  3. If he's hesitant, ask specific concerns → address them → retry
  4. If he says no, respect it and document why (may try different approach later)

Good luck! 🚀