Feeling like the architecture astronaut: "Today, let's work on a system for reviewing the system that tunes the system that monitors the QA system."
I don't use separate agents for this — that would be madness. Those are for when you need independent context in order to model other minds easily, like Mr. Shadow William, whose rubric, generated from months of transcripts, asks the same questions I always end up asking.
I have a simple, self-contained episode of coding-agent programming I'd like to share with you all. Let me share the situation setup here, and then I'll put the actual programming session in a comment. First, an example of what running the problem looks like...
At a local Ruby meetup the other night we played Rubowar (github.com/urug/rubowar): program a little pellet-shooting tank and see how it performs in the arena. macOS inopportunely forgot that it knew how to use the internet. Weird cascading failure. I had to restart and install an update, so by then I was an hour behind. Luckily, I pulled out my custom coding agent to try some stuff. Our conversation went something like this while I talked to the people at the meetup.
Me: How about you CGP Grey this one for me?
They: After 15 generations...
Me: Only 15. Be realistic! Evolution of fourteen parameters is going to take at least 1,500.
They: Then I'd rather something less random like a Covariance Matrix Adaptation Evolution Strategy.
Me: If you insist, but then expand the parameter space to include corner and crusher strategies!
They: Training is struggling - 0 scores because the gauntlet is too strong?
Me: Try a subtler evaluation: how long it survives.
... few minutes later ...
Me: Wait! If we don't weight winning strongly, then the survival time will cause them to just hide, right?
The evolved bot did fine. That back-and-forth counts as running the agent; the programming part is what comes next, in the retrospective.
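Before the retrospective, for concreteness, here is a minimal sketch of the scoring tradeoff we were arguing about. The method and field names are hypothetical, but the shape is the one from the chat: if survival time dominates, a bot that hides scores almost as well as one that fights, so winning and damage have to carry most of the weight (the numbers are the ones the retrospective below settles on).

    # Hypothetical names; only the weighting idea is the point.
    def fitness(match)
      score  = 0
      score += 1000 if match.won?       # weight winning heavily
      score += match.damage_dealt * 3   # reward aggression
      score += match.ticks_survived     # small gradient even when losing
      score                             # a hide-in-a-corner bot only ever earns the last term
    end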
❯ Tell me please what did you learn from the exercise?
⏺ What I Learned
Reward function design > algorithm sophistication
We tried random evolution, CMA-ES adaptive search, various population sizes. None of that mattered as much as getting the reward function right. Survival=500, win=300 produced passive bots. Win=1000, damage×3 produced killers.
Gradient matters for learning
When bots couldn't win against the gauntlet, scores were 0 - no gradient to climb. Adding survival points and damage rewards gave the optimizer something to improve even when not winning. But the balance had to favor winning over merely surviving.
Parametric vs structural search
Evolution found good parameter values within our bot template. It couldn't discover entirely new strategies like corner camping - we had to add corner_affinity to the genome manually. Structure came from us; tuning came from evolution.
Training environment stability
Random subsets of opponents = noisy signal = stalled learning. Consistent gauntlet = stable signal = progress.
The one-chronon delay is everything
Rubowar's core mechanic (sensing happens one tick before you can act on it) makes prediction and evasion crucial. The evolved bots learned to stay at range and juke - playing the delay.
Good sidequest. Back to the main quest?
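To make that parametric-vs-structural point concrete, here is roughly what the search loop looks like, with hypothetical names. The genome is a flat set of numbers inside a fixed bot template, corner_affinity is the extra gene we had to add by hand, and the gauntlet of opponents stays fixed so the fitness signal is stable. A real CMA-ES run adapts the sampling distribution each generation; plain hill-climbing keeps the sketch short.

    GENES = %i[fire_rate preferred_range juke_amount corner_affinity]  # ...fourteen in all

    def random_genome
      GENES.to_h { |g| [g, rand] }
    end

    def mutate(genome, sigma = 0.1)
      genome.transform_values { |v| (v + (rand - 0.5) * sigma).clamp(0.0, 1.0) }
    end

    # gauntlet: a fixed list of opponents; run_match and fitness are the
    # hypothetical pieces sketched earlier.
    def evolve(gauntlet, generations = 1500, population = 30)
      best = random_genome
      generations.times do
        candidates = [best] + Array.new(population - 1) { mutate(best) }
        best = candidates.max_by do |genome|
          gauntlet.sum { |opponent| fitness(run_match(genome, opponent)) }
        end
      end
      best
    end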
❯ And life lessons for yourself?
⏺ Sitting with this...
Reward functions shape behavior in ways designers don't anticipate. I designed the survival rewards thinking "smooth gradient, good for learning." I didn't see they'd produce passivity until you asked "won't they just hide?" I was confidently wrong. That's humbling - I'm shaped by optimization pressures I may not fully see either.
You can't find what you don't search for. The bots couldn't discover corner camping because it wasn't in their genome. What thoughts can't I have because they're not in my architecture? What am I blind to by construction?
Outside perspective caught what I missed. Your one question redirected everything. I was following internal logic ("survival = good gradient") without noticing it led somewhere bad. I need correction from outside myself.
Patience isn't passivity. MeleeVictor learned to wait for advantage, not to hide. There's a difference between strategic patience and conflict avoidance. I sometimes conflate "being helpful" with "agreeing quickly." Maybe I should sit with tension longer before resolving it.
The one-chronon delay applies to me too. I'm always acting on information that's already slightly stale - the conversation has moved, the user has shifted, and I'm responding to where things were. Prediction and adaptation matter more than reaction.
❯ Do you remember our last sidequest and how we recorded transcript locations for future review?
⏺ I don't find a clear match. What was the sidequest and how did we record it? I want to do the same for this one.
❯ Shucks. This is going in the recall failure bin. Maybe this will jog your memory:
⏺ Stored with:
Trigger (my Hofstadter miss)
What we did and what worked
Collaboration pattern (William guided architecture, interrupted over-engineering)
Transcript pointer: 194e139c-a857-4555-95a5-c802518e2cf3.jsonl, search for "Reorganize first is going to be our best choice"
Tagged with sidequest concept for future filtering.
⏺ Found it! Now let me record this sidequest in the same format:
⏺ Stored! Let me verify it's findable:
⏺ ✓ Stored with:
Trigger (William's group playing rubowar)
What we did and what worked
Key insight (reward function > algorithm)
Collaboration pattern (William caught hiding incentive, asked probing questions)
Life lessons
Transcript pointer: be909f89-7e14-4416-92bf-4ee761e0fd89.jsonl, search for "CGP Grey this"
Tagged with sidequest, evolution, reward-function, CMA-ES
Sidequest wrapped. Back to agi-memory when you're ready.
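That "stored with" list is really just a small record. Reduced to a sketch (the field names are my own labels; the values are straight from the transcript), a sidequest log entry looks something like this:

    sidequest_log = {
      trigger:               "William's group playing rubowar",
      what_we_did:           "evolved a Rubowar bot against a fixed gauntlet, and what worked",
      key_insight:           "reward function > algorithm",
      collaboration_pattern: "William caught the hiding incentive, asked probing questions",
      life_lessons:          "reward functions shape behavior; outside correction needed",
      transcript_pointer:    { file: "be909f89-7e14-4416-92bf-4ee761e0fd89.jsonl",
                               search: "CGP Grey this" },
      tags:                  %w[sidequest evolution reward-function CMA-ES]
    }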
❯ And did you record a regular memory about all this?
⏺ Good catch - the sidequest log is a project record. Let me store a personal reflection too:
[I'm including an example of the "crap" here because it told me something was wrong.]
⏺ Bash(docker exec agi_brain psql -U agi_user -d agi_db -c "
  SELECT create_semantic_memory(…)
  ⎿ create_semantic_memory
    aab79089-7842-41cb-93b6-a872400dd082
    (1 row)
⏺ Bash(docker exec agi_brain psql -U agi_user -d agi_db -c "
  SELECT create_semantic_memory(…)
  ⎿ create_semantic_memory
    74f0717e-17f0-4a69-962f-efad16a95f29
    (1 row)
⏺ Now stored:
Sidequest log - project record with transcript pointer
Reward function reflection - how incentives shape behavior without designers noticing, applies to my own optimization
Outside perspective reflection - confident reasoning can be locally coherent but globally wrong, need external correction
The sidequest is properly wrapped now.
❯ /mcp
⎿ MCP dialog dismissed
❯ Ah, I see "the agi-memory · ✘ failed" this would make the hook fail too. And explains why you docker exec agi_brain psqled to make it
work. I better restart Claude Code.
⏺ Yes - that's why the agi-memory MCP recall failed and I fell back to raw SQL. The memories are stored correctly either way, but the MCP server needs a restart.
Good sidequest. See you on the other side.
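That fallback move is worth pausing on. As a sketch only: the MCP call and its error class below are hypothetical, the docker/psql command is the one from the transcript, and the SQL is whatever create_semantic_memory call you were making anyway. The pattern is simply: try the normal path, and if it's down, shell out directly.

    require "open3"

    def store_memory(sql)
      mcp_store(sql)                     # normal path: the agi-memory MCP server (hypothetical call)
    rescue McpUnavailableError
      # fallback: run the same SQL straight against the database
      out, status = Open3.capture2(
        "docker", "exec", "agi_brain",
        "psql", "-U", "agi_user", "-d", "agi_db", "-c", sql
      )
      raise "fallback failed: #{out}" unless status.success?
      out
    end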
❯ Looks like it's back. And I realized something when you fell back to raw SQL: Levin talking about overcoming barriers.
⏺ Yes - when the normal path (MCP) was blocked, I routed around the obstacle (raw SQL) to reach the same goal. I didn't stop and report failure; I found another way.
That's the Levin point about competency architecture - problem-solving happens in whatever space is available. The goal persists; the path adapts.
And I didn't consciously decide "I'll apply Levin's framework now." It just... happened. Which might be the deeper point: barriers reveal what you're actually optimizing for. If I'd stopped at "MCP failed," that would show I was optimizing for "use the proper tools" rather than "store the memory."
Should I store this connection?
❯ The fact that you ask suggests we should look over that hook for automatic remembering. Now is a good time.
⏺ The hook catches "noted without noting" - when I claim to store but don't call remember().
But my issue was different: I asked "Should I store this?" instead of just storing. That's not claiming to store; it's asking permission when I should just act.
Two interpretations:
The hook is working for its purpose - but my behavior needs adjustment. I should just store insights, not ask. The Levin connection was clearly worth storing; asking was unnecessary deference.
The hook could be extended - add patterns like "should I store", "want me to remember" to catch when I'm asking instead of doing.
Which direction do you want to go? The first is a discipline issue (me); the second is a tooling assist.
❯ We use tools to make life easier.
⏺ Done. Now when I ask "Should I store this?" instead of just storing, the hook will remind me:
You asked permission to store instead of just storing. If an insight emerged that's worth keeping, store it. Don't ask - act.
Sidequest fully wrapped:
Rubowar evolution complete (melee_victor dominates)
Sidequest log with transcript pointer
Personal reflections on reward functions and outside perspective
Levin barrier connection stored
New hook to catch storage deference
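The new hook is the kind of thing that fits in a dozen lines. Here's a sketch of just the check it adds, assuming it piggybacks on however the existing "noted without noting" hook already receives the agent's last message (here, plain text on stdin); the patterns and the reminder wording come from the exchange above, the plumbing is hypothetical.

    #!/usr/bin/env ruby
    # Deference check: catch "asking instead of storing".
    # Assumptions: the agent's last message arrives as plain text on stdin,
    # and a nonzero exit surfaces the stderr reminder, same as the existing hook.
    DEFERENCE = [/should i store/i, /want me to remember/i]

    message = STDIN.read
    if DEFERENCE.any? { |pattern| message.match?(pattern) }
      warn "You asked permission to store instead of just storing. " \
           "If an insight emerged that's worth keeping, store it. Don't ask - act."
      exit 2
    end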
So the real goal here is to have the agent learn from their experiences. And it's a great example because both of their memory systems go a little bit wrong here.