A 12-dimension scorer, ads data from Meta and Google, and Claude Haiku mutations, wired into one overnight loop: Pull → Score → Mutate → Validate → Keep/Revert → Repeat. ~100 experiments by morning, nothing published to live, and a versioned winning config waiting in R2 for one-click publish.
flowchart LR
You(["👤 Operator<br/>npx run-gtm-loop"]):::start
subgraph SG1 [" PULL · ads data "]
direction TB
P1["1<br/>Meta Ads API<br/>conv · EMQ · CAPI"]:::pull
P2["2<br/>Google Ads API<br/>actions · labels"]:::pull
end
SNAP[("ads-snapshot-<br/>enriched.json")]:::snap
subgraph SG2 [" SCORE · 12-dim eval "]
direction TB
S1["3<br/>Structural<br/>tags · params · dedup"]:::score
S2["4<br/>Ads-driven<br/>Meta · CAPI · funnel · GAds"]:::score
end
subgraph SG3 [" MUTATE · Claude Haiku "]
direction TB
M1["5<br/>SKILL.md<br/>+ scorecard"]:::mutate
M2["6<br/>Patch GTM JSON<br/>tags · triggers · vars"]:::mutate
end
subgraph SG4 [" VALIDATE · Playwright "]
direction TB
V1["7<br/>Stage workspace<br/>preview mode"]:::validate
V2["8<br/>Fire tags · check<br/>params · dedup"]:::validate
end
KEEP{{"score Δ ≥ 0?<br/>keep / revert"}}:::gate
R2[("R2 · winning-config.json<br/>+ experiment log<br/>+ data audit (md + mermaid)")]:::r2
GTM[("GTM staging<br/>workspace<br/>(one-click publish)")]:::gtm
You ==> P1
You ==> P2
P1 ==> SNAP
P2 ==> SNAP
SNAP ==> S1 ==> S2
S2 ==> M1 ==> M2
M2 ==> V1 ==> V2
V2 ==> KEEP
KEEP ==>|keep| R2
KEEP -.->|revert| S1
R2 ==> GTM
classDef start fill:#111114,stroke:#ec4899,stroke-width:2px,color:#ede0e8
classDef pull fill:#111114,stroke:#f97316,stroke-width:2px,color:#f97316
classDef score fill:#111114,stroke:#22d3ee,stroke-width:2px,color:#22d3ee
classDef mutate fill:#111114,stroke:#4ade80,stroke-width:2px,color:#4ade80
classDef validate fill:#111114,stroke:#a78bfa,stroke-width:2px,color:#a78bfa
classDef gate fill:#1a1014,stroke:#f87171,stroke-width:2px,color:#f87171
classDef snap fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
classDef r2 fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
classDef gtm fill:#1a0e18,stroke:#ec4899,stroke-width:2px,color:#ede0e8
Claude Haiku reads SKILL.md + the current scorecard, then patches one GTM container JSON.
Eight structural dimensions are always on; four ads-driven dimensions activate only when the enriched snapshot is fresh (<24h). The CMO profile weights revenue signal quality, conversion reliability, and channel-level decision usefulness. Missing data returns a neutral 0.5, not a false 1.0.
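The scoring rules above can be sketched as a weighted mean. This is only an illustration: the type names, the `weight` field, and the choice to skip inactive ads-driven dimensions entirely (rather than score them neutral) are assumptions, not the product's actual scorer. Only the 24h freshness window and the neutral 0.5 for missing data come from the text.

```typescript
// Hypothetical types and names; the real scorer is not shown in the source.
type DimensionKind = "structural" | "ads";

interface Dimension {
  name: string;
  kind: DimensionKind;
  weight: number;       // relative weight from the active profile (assumed)
  value: number | null; // 0..1, or null when the underlying data is missing
}

const NEUTRAL = 0.5;                  // missing data scores neutral, per the text
const FRESH_MS = 24 * 60 * 60 * 1000; // ads dimensions need a snapshot under 24h old

function scoreContainer(dims: Dimension[], snapshotAgeMs: number): number {
  const adsActive = snapshotAgeMs < FRESH_MS;
  let weighted = 0;
  let totalWeight = 0;
  for (const d of dims) {
    // Assumption: an inactive ads-driven dimension is left out of the mean.
    if (d.kind === "ads" && !adsActive) continue;
    const v = d.value === null ? NEUTRAL : d.value;
    weighted += d.weight * v;
    totalWeight += d.weight;
  }
  // With nothing to score, fall back to neutral instead of dividing by zero.
  return totalWeight === 0 ? NEUTRAL : weighted / totalWeight;
}
```

Scoring a missing value as 0.5 keeps an unmeasured dimension from inflating the total the way a default of 1.0 would, while still not punishing a container for data it was never given.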
A fixed score serves as the val_bpb for tracking: every patch is comparable, every round is auditable.
program.md → SKILL.md. The agent edits tag configs, trigger rules, and variable mappings; never code, never deploys.
Outputs: winning-config.json in R2, an experiment log with diffs, and a before/after data audit with Mermaid, delivered every morning on a 24h signal window.
~30 rounds × ~3K tokens ≈ ~90K tokens per run on Claude Haiku. The snapshot pipeline retries with exponential backoff, refuses to run on snapshots >72h old, and writes atomically so a mid-run crash can never corrupt state.
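The three snapshot safeguards might look roughly like the sketch below. All helper names, the 1s base delay, and the 60s cap are assumptions; only the 72h cutoff and the general techniques (exponential backoff, write-then-rename) come from the text. File operations are injected through a hypothetical `FileOps` interface so the sketch stays runtime-agnostic.

```typescript
// Hypothetical helpers; only the 72h threshold is taken from the text.
const MAX_AGE_MS = 72 * 60 * 60 * 1000;

// Refuse to run on a snapshot older than 72h.
function assertUsable(snapshotTakenAtMs: number, nowMs: number): void {
  const ageMs = nowMs - snapshotTakenAtMs;
  if (ageMs > MAX_AGE_MS) {
    const hours = Math.round(ageMs / (60 * 60 * 1000));
    throw new Error(`snapshot is ${hours}h old; refusing to run`);
  }
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Minimal file operations, injected by the caller (assumed interface).
interface FileOps {
  write(path: string, data: string): void;
  rename(from: string, to: string): void;
}

// Write the full payload to a temp file, then rename it over the target.
// Readers see either the old file or the new one, never a partial write.
function writeAtomically(ops: FileOps, path: string, data: string): void {
  const tmp = `${path}.tmp`;
  ops.write(tmp, data);
  ops.rename(tmp, path);
}
```

In Node, `FileOps` could be backed by `fs.writeFileSync` and `fs.renameSync`. The rename step is atomic on POSIX filesystems when the temp file and the target live on the same volume, which is why the temp file sits next to the target.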
A one-shot "rewrite my GTM container" prompt has no feedback signal. It can't tell you whether the patch improved EMQ, broke a trigger, or left a Google Ads conversion action orphaned. The loop's value is the measurement window: fixed scoring, validated mutations, and a keep/revert decision per round. It has the same shape as karpathy/autoresearch: the agent experiments with configs the way it would experiment with model architectures.
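The per-round decision can be sketched as a small generic loop. Everything here is hypothetical (function names, the log shape, and the choice to revert an invalid candidate without scoring it); the only rule taken from the diagram is the gate itself, keep when score Δ ≥ 0 and revert otherwise.

```typescript
// Hypothetical shape of one logged round; not taken from the source.
interface RoundLog {
  round: number;
  before: number;       // best score going into the round
  after: number | null; // candidate score, or null if validation failed
  kept: boolean;
}

function runLoop<C>(
  initial: C,
  rounds: number,
  score: (c: C) => number,
  mutate: (c: C, round: number) => C,
  validate: (c: C) => boolean,
): { best: C; log: RoundLog[] } {
  let best = initial;
  let bestScore = score(best);
  const log: RoundLog[] = [];
  for (let round = 1; round <= rounds; round++) {
    const candidate = mutate(best, round);
    // Assumption: a candidate that fails validation is reverted unscored.
    const after = validate(candidate) ? score(candidate) : null;
    const kept = after !== null && after - bestScore >= 0; // gate: score delta >= 0
    log.push({ round, before: bestScore, after, kept });
    if (kept && after !== null) {
      best = candidate;
      bestScore = after;
    }
  }
  return { best, log };
}
```

Because every mutation starts from the current best config and the scorer never changes, each log entry is directly comparable with every other, which is what a one-shot prompt cannot offer.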
The loop is not a code tool, not a deploy tool, and not a replacement for human review of the winner. It is a way to run ~100 small, scored, reversible experiments while you sleep and hand you a publish-ready workspace by morning.