
Implementation Guide: Agentic Workflows with Malachi

Before You Start: Pre-Flight Checklist


Phase 1 Implementation (Months 1-2)

Task 1A: Boilerplate Generation Pilot

Objective: Auto-generate FreeRTOS task scaffolding; Malachi approves structure
Timeline: Week 1-2
Owner: [Pick one senior dev + one tool specialist]

Setup

  1. Choose a simple FreeRTOS task (e.g., new button handler, sensor poller)
  2. Create a prompt for Claude Code (a sketch of the kind of scaffold it might return follows this list):
You are generating FreeRTOS task scaffolding for ZTAG Code 5.

Task Name: [e.g., "ButtonHandler"]
Priority: [e.g., "5 (medium)"]
Dependencies: [e.g., "Reads GPIO pin 12, posts event to queue"]
Timeout Behavior: [e.g., "Watchdog: 1 second timeout"]

Generate:
1. Task function (FreeRTOS conventions)
2. Unit test skeleton (test framework: [Unity/Catch2])
3. Architecture comment explaining assumptions
4. TODO list for implementation

Constraints:
- Must match ZTAG Code 5 style guide
- Comments explain WHY, not WHAT
- No functionality—only scaffold
  3. Present generated code to Malachi:

    • Show scaffold + test template
    • Ask: "Does the structure make sense? Anything you'd change?"
    • Get written approval (comment in GitHub issue)
  4. Junior dev fills in logic:

    • Implement task behavior
    • Write unit tests to fill skeleton
    • Submit PR
  5. Code review:

    • Malachi: Check for correctness, test coverage, clarity
    • Merge if ≥80% test coverage + all tests pass
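As a reference point, here is a minimal sketch of the kind of scaffold the prompt might produce, assuming Unity as the test framework and ESP-IDF-style FreeRTOS includes. ButtonHandlerTask, button_evt_queue, and the pin/priority values are placeholders taken from the example prompt, not real Code 5 names:

    /* button_handler.c: illustrative scaffold only. Names and values come
     * from the example prompt, not from the actual Code 5 codebase. */
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "freertos/queue.h"

    /* Assumption: the event queue is created elsewhere during system init. */
    extern QueueHandle_t button_evt_queue;

    #define BUTTON_GPIO_PIN          12
    #define BUTTON_POLL_INTERVAL_MS  10

    /* WHY: polling keeps debounce logic in one place; revisit if the 1 s
     * watchdog budget or input latency becomes a problem. */
    void ButtonHandlerTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            /* TODO: read BUTTON_GPIO_PIN, debounce, post event to button_evt_queue */
            /* TODO: kick the watchdog; each loop must finish well under 1 s */
            vTaskDelay(pdMS_TO_TICKS(BUTTON_POLL_INTERVAL_MS));
        }
    }

    /* test_button_handler.c: Unity skeleton (assumes Unity is the framework). */
    #include "unity.h"

    void setUp(void)    { /* TODO: reset GPIO/queue fakes */ }
    void tearDown(void) { /* TODO: release fakes */ }

    void test_button_press_posts_event(void)
    {
        /* TODO: simulate a GPIO edge and assert an event lands on the queue */
        TEST_FAIL_MESSAGE("not implemented");
    }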

Success Criteria

What to Document


Task 1B: Test Coverage Analysis Pilot

Objective: AI flags untested code paths; Malachi decides which tests matter
Timeline: Week 2-3
Owner: [QA lead + tool specialist]

Setup

  1. Run code coverage on existing Code 5 module:

    # Using gcov/gcovr (or your project's coverage target)
    make test-coverage
    gcovr -r . > coverage.txt
    
  2. Create prompt for Claude Code:

    We have a C/FreeRTOS codebase (ZTAG Code 5) with partial test coverage.
    
    Analyze this coverage report and identify:
    1. Code paths with <50% coverage (priority high)
    2. Edge cases that aren't tested (e.g., timeouts, errors)
    3. Risky untested scenarios (e.g., concurrent access, boundary conditions)
    4. Suggested test cases to close gaps
    
    For each gap, explain why it matters and what test would cover it.
    
    [Paste coverage report]
    
  3. Generate report:

    • AI produces prioritized list of test gaps
    • Include: risk level, suggested test case, estimated time to implement
  4. Malachi reviews & decides:

    • Which tests are actually important (not all code paths need tests)
    • Which are "nice to have" vs. "must have"
    • Creates GitHub issues for approved tests
  5. Junior dev implements the approved tests (an example test sketch follows this list)
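As an illustration only, this is the shape of a test a flagged gap might turn into. The "queue full" scenario, sensor_post_reading(), and sensor_evt_queue are hypothetical names, and Unity is assumed as the framework:

    /* Hypothetical gap: the "event queue full" timeout branch is never
     * exercised. Sketch of a Unity test that would close it. */
    #include "unity.h"
    #include "freertos/FreeRTOS.h"
    #include "freertos/queue.h"

    extern QueueHandle_t sensor_evt_queue;
    extern int sensor_post_reading(int value, TickType_t timeout_ticks);

    void setUp(void)    {}
    void tearDown(void) {}

    void test_post_reading_reports_error_when_queue_full(void)
    {
        /* Fill the queue so the next post must hit the timeout branch. */
        while (uxQueueSpacesAvailable(sensor_evt_queue) > 0) {
            int dummy = 0;
            xQueueSend(sensor_evt_queue, &dummy, 0);
        }
        TEST_ASSERT_EQUAL_INT(-1, sensor_post_reading(42, pdMS_TO_TICKS(10)));
    }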

Success Criteria

What to Document


Task 1C: Documentation Sync Checker Pilot

Objective: AI flags where code and docs diverge; Malachi decides if docs or code is wrong
Timeline: Week 3-4
Owner: [Documentation lead + tool specialist]

Setup

  1. Gather comparison data:

    • Latest PRs merged to develop branch (last 2 weeks)
    • Functional Design Document (FDD)
    • Code comments in relevant modules
  2. Create prompt for Claude Code:

    I'm checking if code and documentation stay in sync.
    
    Recent changes (from PRs):
    [List recent changes: "PR #123: Changed MQTT reconnect timeout from 5s to 10s"]
    
    FDD/Documentation:
    [Paste relevant sections of design doc]
    
    Identify:
    1. Changes that SHOULD have updated docs but didn't
    2. Docs that SHOULD be updated but weren't
    3. Assumptions in code that aren't documented
    4. Discrepancies that indicate potential bugs
    
    For each, flag as:
    - "Code needs to be changed" (docs are correct)
    - "Docs need to be updated" (code is correct)
    - "Needs investigation" (unclear which is right)
    
  3. Generate a report with recommendations (the sketch after this list shows the kind of discrepancy to catch)

  4. Malachi reviews & decides:

    • For each discrepancy: Which was right? Update the other.
    • Create GitHub issues for needed doc updates
  5. Assign doc updates to team
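To make the PR #123 example in the prompt concrete, this is the kind of code-vs-doc mismatch the checker should surface. The constant name and FDD wording are hypothetical:

    /* Code (after PR #123): reconnect timeout is now 10 s.
     * Hypothetical constant name, for illustration only. */
    #define MQTT_RECONNECT_TIMEOUT_MS  10000

    /* FDD (not updated): "The device retries the MQTT connection after
     * 5 seconds." Flag as "Docs need to be updated" if 10 s is the intended
     * behavior, or "Code needs to be changed" if 5 s was a requirement. */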

Success Criteria

What to Document


Phase 1 Milestone Review (End of Week 4)

Meeting with Malachi:

Agenda:
1. Present Phase 1 results
   - Time saved (boilerplate, test analysis, doc sync)
   - Quality metrics (regressions: 0?)
   - Team feedback (was this helpful?)

2. Malachi's assessment
   - "Does this add value?"
   - "Any concerns?"
   - "Ready to expand to Phase 2?"

3. Decision
   - GO: Proceed to Phase 2 (expand tools to PR review, refactoring)
   - HOLD: Refine Phase 1 tasks, stay longer
   - NO-GO: Tools aren't working; pivot strategy

Success = Malachi says: "This freed up time on routine work. Let's expand it."


Phase 2 Implementation (Months 3-4)

Task 2A: PR Review Assistant

What it does: Before Malachi manually reviews a PR, automated checks run and post their findings as a PR comment (the prompt below covers style, test gaps, edge cases, and known bug patterns).

How to set up:

  1. Create GitHub Actions workflow (runs on every PR):

    name: Automated Code Review
    on: [pull_request]
    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Run agentic review
            run: |
              # Call Claude Code via API with PR diff + context
              # Generate report as GitHub comment
    
  2. Claude Code prompt:

    Review this PR for Code 5.
    
    Code context:
    - Main branch: [paste git diff]
    - Related tests: [paste test changes]
    - Module changes affect: [list impacted modules]
    
    Check for:
    1. Style violations (ZTAG conventions)
    2. Uncovered code paths (new logic without tests?)
    3. Edge cases (what if timeout? concurrent access?)
    4. Similar past bugs (patterns we've seen before)
    5. Assumptions that should be documented
    
    Format as:
    - ✅ Good: [specific praise]
    - ⚠️ Check: [something to verify]
    - ❌ Fix: [must address before merge]
    
  3. Report generated as PR comment (automated, no human needed)

  4. Malachi's review flow:

    • Reads agentic analysis (2-5 min)
    • Adds human expertise (architecture, long-term implications)
    • Approves or requests changes
  5. Junior dev addresses feedback

Key: Malachi still reviews; the AI just does the legwork first


Task 2B: Refactoring Proposals

What it does: AI suggests improvements to a module; Malachi decides if worth doing
Example: "MQTT module has coupling with game state. Could decouple via event queue."

How to set up:

  1. Select a refactoring candidate (pick a module Malachi mentioned improving)

  2. Create Cline task:

    Analyze this Code 5 module for refactoring opportunities.
    
    [Paste module code]
    
    Current architecture:
    - [How it's used]
    - [Known issues/limitations]
    
    Suggest:
    1. Three potential improvements (in order of impact)
    2. For each: benefit, risk, estimated effort
    3. Pro/con of each approach
    
    Don't implement—just analyze.
    
  3. Generate analysis document

  4. Malachi reviews & decides:

    • "Is this worth doing?"
    • "Which approach is best?"
    • "Creates GitHub issue with decision rationale"
  5. If approved: Team implements with review


Task 2C: Field Issue Pre-Diagnosis

What it does: When a customer reports a bug, AI pre-analyzes logs
Example: "Device hangs after 15 min" → AI checks for timeout patterns, memory leaks, watchdog resets

How to set up:

  1. Gather field data:

    • Device logs (syslog, app logs); a health-log sketch follows this list
    • Timing of failure
    • Reproducibility info
  2. Create Claude Code prompt:

    A customer reported: "[bug description]"
    
    Device logs:
    [Paste relevant logs]
    
    Timeline:
    [When did it happen? How to reproduce?]
    
    Possible causes (in order of likelihood):
    1. [Root cause 1]: [why likely]
    2. [Root cause 2]: [why likely]
    
    Tests to run to narrow down:
    - [Test 1]: [what would this tell us?]
    - [Test 2]
    
    If it's [cause X], the fix is [suggestion].
    
  3. Generate diagnostic report

  4. Malachi reviews & prioritizes:

    • "Which root cause to investigate first?"
    • "Run these tests"
    • "If confirmed, implement fix"
  5. Result: faster bug resolution (pre-analysis shortens the investigation)
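A hedged sketch of how the firmware could make those logs more diagnosable: a low-priority task that periodically records heap and stack headroom, so a hang after ~15 minutes can be correlated with a leak or stack exhaustion. The API calls are standard FreeRTOS/ESP-IDF; the task name and 30 s period are illustrative.

    /* health_log.c: periodic health snapshot (illustrative only). */
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "esp_log.h"

    static const char *TAG = "health";

    void HealthLogTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            /* Free heap trending down over minutes suggests a leak;
             * a shrinking stack high-water mark suggests overflow risk. */
            ESP_LOGI(TAG, "free heap: %u bytes, min stack headroom: %u words",
                     (unsigned)xPortGetFreeHeapSize(),
                     (unsigned)uxTaskGetStackHighWaterMark(NULL));
            vTaskDelay(pdMS_TO_TICKS(30 * 1000));   /* every 30 s */
        }
    }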


Phase 3 Setup (Months 5-6)

Task 3A: Automated Test Scaffolding for Games

What: When a new game is added, auto-generate unit test template
Benefit: Ensures consistency; the junior dev focuses on assertions, not boilerplate

Setup: Similar to Task 1A, but for game-specific test patterns

Task 3B: Dependency Compatibility Checker

What: When ESP-IDF version is bumped, AI flags deprecated APIs
Benefit: Proactive (catch incompatibilities before they break builds)

Setup: GitHub Actions + Claude Code analyzing codebase against new API docs
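A small illustration of the pattern this checker protects, assuming nothing about which APIs actually changed: version-dependent call sites can be guarded with ESP-IDF's version macros, and the AI pass would flag places that lack such guards or call APIs the new release deprecates.

    /* esp_idf_version.h provides these macros in recent ESP-IDF releases. */
    #include "esp_idf_version.h"

    #if ESP_IDF_VERSION >= ESP_IDF_VERSION_VAL(5, 0, 0)
        /* Use the newer API here (hypothetical placeholder). */
    #else
        /* Keep the legacy call path for older IDF versions.  */
    #endif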


Conversation Starters with Malachi

Opening

"I've been looking at our bottlenecks. You're blocking on:

That's not sustainable at scale. I want to explore tools that handle tedious work so you focus on decisions only you can make.

Would you be willing to try a structured 6-week pilot?"

If He's Skeptical

Him: "Will this actually help or create more work?"
You: "Fair question. Let's start with 3 small tasks—boilerplate generation, test gap analysis, doc sync checking. We measure: time saved, regressions, and whether you think it adds value. If not, we stop. Deal?"

Him: "What if the AI generates broken code?"
You: "Same risk as a junior. You review before it ships. AI just handles the legwork faster. Phase 1 succeeds only if zero regressions. If we break something, we know it's not ready."

Him: "I don't trust AI with our code."
You: "I don't trust AI either. That's why you're in control. It suggests, you decide. You gate it. You turn it off if it doesn't work."

If He's Interested

Him: "OK, what do you need from me?"
You: "For Phase 1:

  1. Review 3 boilerplate scaffolds I generate (15 min each)
  2. Approve list of test gaps we should cover (30 min)
  3. Flag doc-code discrepancies we catch (30 min)

Total: ~3 hours over 4 weeks. In return, you save ~10 hours on routine work. You focus on the hard decisions."

Him: "What tools are we using?"
You: "Claude Code (explains the why, good for architecture) + Cline (task-based workflows) + GitHub Actions (automated checks). You can customize the rules."


Weekly Check-ins (Phase 1)

Every Monday, 10 min sync with Malachi:

Week 1: "How did the boilerplate generation feel? Anything you'd change?"
Week 2: "Did the test gap analysis flag real gaps? False positives?"
Week 3: "Is doc-code sync checking finding real discrepancies?"
Week 4: "Before we wrap Phase 1, what's your gut feeling? Worth expanding?"

Document feedback → Feed into Phase 2 setup


Metrics to Collect

Time Tracking (Per Task)

Activity                   Before AI   After AI                                  Savings
Create FreeRTOS scaffold   60 min      30 min (gen) + 15 min (review)            15 min
Analyze test coverage      60 min      10 min (gen) + 30 min (Malachi decides)   20 min
Check doc-code sync        45 min      15 min (gen) + 20 min (Malachi decides)   10 min
Total per month            ~900 min    ~600 min                                  ~300 min (~5 hrs)

Quality Metrics

Adoption Metrics


Rollback Plan

If Phase 1 fails (regressions, poor quality, Malachi says "no"), here's the pivot:

  1. Stop using agentic tools (immediately)
  2. Analyze what went wrong:
    • Was the tool low quality? → Try a different tool
    • Was it the wrong task selection? → Try different tasks
    • Were Malachi's concerns valid? → Address them
  3. Adjust and retry (or accept agentic workflows aren't right for ZTAG now)

Success Stories to Celebrate

When Phase 1 wins happen, share them:

"We auto-generated 5 FreeRTOS scaffolds using Claude Code. 
Saved 2 hours on boilerplate. Zero quality issues. 
Malachi approved the process. Rolling into Phase 2."

This builds team confidence. When Malachi sees tools working, he'll advocate for them.


Key Reminders

  1. Malachi controls the dial. Never surprise him with automation.
  2. Show, don't tell. Demos beat arguments.
  3. Focus on tedious work. Boilerplate, analysis, not decision-making.
  4. Quality > speed. If a tool trades quality for speed, Malachi will reject it.
  5. Evidence-based. Track metrics. Let data drive Phase 2 decision.

Timeline Reference

Week 1:     Task 1A setup (boilerplate generation)
Week 2:     Task 1B setup (test coverage), 1A feedback to Malachi
Week 3:     Task 1C setup (doc sync), 1B feedback to Malachi
Week 4:     Phase 1 review with Malachi, decide on Phase 2
Weeks 5-8:  Phase 2 (PR review assistant, refactoring proposals, field issues)
Weeks 9-12: Phase 3 (test scaffolding, dependency checker)

Next Action

  1. Schedule 1:1 with Malachi (use "Opening" conversation starter)
  2. If he agrees, kick off Phase 1 Week 1
  3. If he's hesitant, ask specific concerns → address them → retry
  4. If he says no, respect it and document why (may try different approach later)

Good luck! 🚀