Autoresearch · GTM · Overnight Loop

Karpathy's experimentation loop, pointed at GTM containers.

A 12-dimension scorer, ads data from Meta + Google, and Claude Haiku mutations — wired into one overnight loop. Pull → Score → Mutate → Validate → Keep/Revert → Repeat. ~100 experiments by morning, never published to live, and a versioned winning config waiting in R2 for one-click publish.

flowchart LR
    You(["👤 Operator
npx run-gtm-loop"]):::start

    subgraph SG1 [" PULL · ads data "]
      direction TB
      P1["1
Meta Ads API
conv · EMQ · CAPI"]:::pull
      P2["2
Google Ads API
actions · labels"]:::pull
    end

    SNAP[("ads-snapshot-
enriched.json")]:::snap

    subgraph SG2 [" SCORE · 12-dim eval "]
      direction TB
      S1["3
Structural
tags · params · dedup"]:::score
      S2["4
Ads-driven
Meta · CAPI · funnel · GAds"]:::score
    end

    subgraph SG3 [" MUTATE · Claude Haiku "]
      direction TB
      M1["5
SKILL.md
+ scorecard"]:::mutate
      M2["6
Patch GTM JSON
tags · triggers · vars"]:::mutate
    end

    subgraph SG4 [" VALIDATE · Playwright "]
      direction TB
      V1["7
Stage workspace
preview mode"]:::validate
      V2["8
Fire tags · check
params · dedup"]:::validate
    end

    KEEP{{"score Δ ≥ 0?
keep / revert"}}:::gate

    R2[("R2 · winning-config.json
+ experiment log
+ data audit (md + mermaid)")]:::r2

    GTM[("GTM staging
workspace
(one-click publish)")]:::gtm

    You ==> P1
    You ==> P2
    P1 ==> SNAP
    P2 ==> SNAP
    SNAP ==> S1 ==> S2
    S2 ==> M1 ==> M2
    M2 ==> V1 ==> V2
    V2 ==> KEEP
    KEEP ==>|keep| R2
    KEEP -.->|revert| S1
    R2 ==> GTM

    classDef start    fill:#111114,stroke:#ec4899,stroke-width:2px,color:#ede0e8
    classDef pull     fill:#111114,stroke:#f97316,stroke-width:2px,color:#f97316
    classDef score    fill:#111114,stroke:#22d3ee,stroke-width:2px,color:#22d3ee
    classDef mutate   fill:#111114,stroke:#4ade80,stroke-width:2px,color:#4ade80
    classDef validate fill:#111114,stroke:#a78bfa,stroke-width:2px,color:#a78bfa
    classDef gate     fill:#1a1014,stroke:#f87171,stroke-width:2px,color:#f87171
    classDef snap     fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
    classDef r2       fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
    classDef gtm      fill:#1a0e18,stroke:#ec4899,stroke-width:2px,color:#ede0e8

Pull · Orange: Live ads data. Meta + Google APIs feed conversion counts, EMQ, CAPI dedup, and conversion actions.

Score · Cyan: 12 dimensions. 8 structural (always on) + 4 ads-driven (with enriched snapshot). One number to beat.

Mutate · Green: Claude Haiku reads SKILL.md + the current scorecard, then patches one GTM container JSON.

Validate · Purple: Playwright drives a staging preview. Tags fire, parameters land, dedup holds. No live traffic.

Persist · Amber: Winning JSON + experiment log + before/after data audit, all versioned in R2 for rollback.

Gate · Red: Score Δ < 0 reverts in-loop. Final publish to live GTM stays a one-click human decision.
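The keep/revert cycle in the diagram can be sketched as a small driver. This is a minimal illustration, not the project's implementation: `runLoop`, `LoopDeps`, and `RoundLog` are hypothetical names, and the scorer, mutator, and validator are injected as plain synchronous functions so the gate logic is easy to follow (a real run would presumably await API and browser calls).

```typescript
// Hypothetical shapes; none of these names come from the project itself.
type Container = Record<string, unknown>;

interface LoopDeps {
  score: (c: Container) => number;                     // single signal-quality score
  mutate: (c: Container, round: number) => Container;  // returns a patched copy
  validate: (c: Container) => boolean;                 // staging preview check
}

interface RoundLog {
  round: number;
  before: number;
  after: number | null; // null when the candidate failed validation
  kept: boolean;
}

function runLoop(seed: Container, rounds: number, deps: LoopDeps) {
  let best = seed;
  let bestScore = deps.score(seed);
  const log: RoundLog[] = [];

  for (let round = 1; round <= rounds; round++) {
    const candidate = deps.mutate(best, round);

    // Candidates that fail validation are never scored,
    // so they cannot reach the winning slot.
    if (!deps.validate(candidate)) {
      log.push({ round, before: bestScore, after: null, kept: false });
      continue;
    }

    const after = deps.score(candidate);
    const kept = after - bestScore >= 0; // gate from the diagram: score Δ ≥ 0 keeps
    log.push({ round, before: bestScore, after, kept });

    if (kept) {
      best = candidate;
      bestScore = after;
    }
    // Otherwise `best` is left untouched, which is the in-loop revert.
  }

  return { best, bestScore, log };
}
```

Because `best` only changes on a kept round, a revert costs nothing: the next mutation simply starts again from the last winning container.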

12-dimension scorecard · CMO profile

From 84.3% → 91.2% in five rounds.

Eight structural dimensions are always on; four ads-driven dimensions activate when the enriched snapshot is fresh (<24h). The CMO profile weights revenue signal quality, conversion reliability, and channel-level decision usefulness — missing data returns a neutral 0.5, not a false 1.0.

Tag coverage: 100%

Parameter completeness: 100%

Deduplication: 100%

Trigger quality: 100%

Folder organization: 100%

Meta Ads alignment: 100%

CAPI coverage: 99.4%

Naming conventions: 97.6%

Variable hygiene: 87.5%

Google Ads alignment: 80%

Funnel integrity: 70%

Consent settings: 60%
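To make the "one number" concrete, here is a hedged sketch of how twelve dimension scores could collapse into a single value. The CMO profile's actual weights are not given here, so the sketch assumes equal weights; with that assumption, the plain average of the twelve values listed above comes to 91.2%, matching the headline figure. The function name `aggregate` and the `null`-for-missing convention are illustrative choices, not the project's API.

```typescript
// Scores are fractions in 0..1; null means the dimension had no data.
type DimensionScores = Record<string, number | null>;

// Per the text: missing data scores a neutral 0.5, not a false 1.0.
const NEUTRAL = 0.5;

function aggregate(
  scores: DimensionScores,
  weights: Record<string, number> = {}, // assumption: unlisted dimensions weigh 1
): number {
  let weighted = 0;
  let total = 0;
  for (const [name, value] of Object.entries(scores)) {
    const w = weights[name] ?? 1;
    weighted += w * (value ?? NEUTRAL);
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}

// The twelve values from the scorecard above.
const scorecard: DimensionScores = {
  tagCoverage: 1,
  parameterCompleteness: 1,
  deduplication: 1,
  triggerQuality: 1,
  folderOrganization: 1,
  metaAdsAlignment: 1,
  capiCoverage: 0.994,
  namingConventions: 0.976,
  variableHygiene: 0.875,
  googleAdsAlignment: 0.8,
  funnelIntegrity: 0.7,
  consentSettings: 0.6,
};

const overall = aggregate(scorecard);
console.log(`${(overall * 100).toFixed(1)}%`); // prints "91.2%"
```

The neutral default matters for the ads-driven dimensions: a dimension with no data pulls the average toward 0.5 instead of silently inflating it.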

Concrete value of running the loop

Six things autoresearch buys you.

PULL · 01

Real signals, not vibes

The optimizer sees what's actually firing in Meta + Google: conversion counts, EMQ, CAPI dedup. It catches revenue leakage where the browser pixel fires but sGTM never forwards the matching server event.
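The leakage case described here can be sketched as a set difference. Meta deduplicates browser and server events on the pair of event name and event ID, so a browser event with no server-side counterpart under that key is one that was never forwarded. The types and function names below are hypothetical, and the input arrays stand in for whatever the enriched snapshot actually contains.

```typescript
// Hypothetical event shape: just the two fields used as the dedup key.
interface TrackedEvent {
  eventName: string;
  eventId: string;
}

const dedupKey = (e: TrackedEvent): string => `${e.eventName}:${e.eventId}`;

// Browser events that have no server event with the same name + ID.
function findUnforwarded(
  browser: TrackedEvent[],
  server: TrackedEvent[],
): TrackedEvent[] {
  const serverKeys = new Set(server.map(dedupKey));
  return browser.filter((e) => !serverKeys.has(dedupKey(e)));
}

// Share of browser events that also arrived server-side.
// Returns null when there is nothing to measure, so the caller can
// score the dimension as neutral instead of a false 100%.
function capiCoverage(
  browser: TrackedEvent[],
  server: TrackedEvent[],
): number | null {
  if (browser.length === 0) return null;
  return 1 - findUnforwarded(browser, server).length / browser.length;
}
```

For example, three browser events against two matching server events would yield one unforwarded event and a coverage of two thirds.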

SCORE · 02

One number to beat

12 dimensions collapse to a single signal-quality score. It plays the role that val_bpb (validation bits per byte) plays in Karpathy's loop: every patch is comparable, every round is auditable.

MUTATE · 03

Claude touches one file

Haiku rewrites the GTM container JSON, guided by program.md → SKILL.md. Tag configs, trigger rules, variable mappings — never code, never deploys.
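One way to enforce "tags, triggers, variables, nothing else" is a guard that rejects any candidate differing from the base outside those arrays. The sketch below assumes the usual GTM export layout, where `tag`, `trigger`, and `variable` sit under `containerVersion`; the guard itself (`patchStaysInBounds`) is a hypothetical helper, not something the project is known to ship.

```typescript
type Json = Record<string, unknown>;

// The only parts of the export a mutation is allowed to change.
const MUTABLE_KEYS = ["tag", "trigger", "variable"] as const;

// Copy of the container with the mutable arrays removed.
function stripMutable(container: Json): Json {
  const version: Json = { ...((container.containerVersion ?? {}) as Json) };
  for (const key of MUTABLE_KEYS) delete version[key];
  return { ...container, containerVersion: version };
}

// True when base and candidate are identical outside the mutable arrays.
// Caveat: JSON.stringify is key-order sensitive, so this assumes the
// candidate was produced by editing a clone of the base.
function patchStaysInBounds(base: Json, candidate: Json): boolean {
  return (
    JSON.stringify(stripMutable(base)) ===
    JSON.stringify(stripMutable(candidate))
  );
}
```

A guard like this would run before validation, so an out-of-bounds patch is discarded without spending a staging preview on it.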

VALIDATE · 04

Playwright before promotion

Each candidate is fired in staging preview — tags, parameters, dedup all checked. Bad patches revert before they ever touch the winning slot.
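The three checks named here (tags fire, parameters land, dedup holds) reduce to simple assertions over the hits captured during a preview session. The sketch below covers only that checking step and assumes the hits have already been collected, for instance from a Playwright `page.on("request", ...)` listener that parses outgoing tag requests. `Hit`, `Expectation`, and `validateHits` are hypothetical names.

```typescript
// One captured tag hit: the event it reported and its parameters.
interface Hit {
  event: string;
  params: Record<string, string>;
}

// What a passing candidate must produce for a given event.
interface Expectation {
  event: string;
  requiredParams: string[];
  dedupParam?: string; // parameter whose values must be unique per event
}

// Returns a list of human-readable problems; empty means the candidate passes.
function validateHits(hits: Hit[], expectations: Expectation[]): string[] {
  const problems: string[] = [];

  for (const exp of expectations) {
    const matching = hits.filter((h) => h.event === exp.event);

    // 1. The tag must fire at least once.
    if (matching.length === 0) {
      problems.push(`${exp.event}: tag did not fire`);
      continue;
    }

    // 2. Every required parameter must be present and non-empty.
    for (const hit of matching) {
      for (const param of exp.requiredParams) {
        if (!hit.params[param]) {
          problems.push(`${exp.event}: missing parameter ${param}`);
        }
      }
    }

    // 3. The dedup parameter must not repeat across hits.
    const dedup = exp.dedupParam;
    if (dedup) {
      const ids = matching.map((h) => h.params[dedup]).filter(Boolean);
      if (new Set(ids).size !== ids.length) {
        problems.push(`${exp.event}: duplicate ${dedup}`);
      }
    }
  }

  return problems;
}
```

Returning a list instead of a boolean keeps the experiment log useful: a reverted round can record exactly which check failed.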

PERSIST · 05

Wake up to deliverables

Staging workspace, winning-config.json in R2, experiment log with diffs, before/after data audit with Mermaid — every morning, on a 24h signal window.

GATE · 06

Never publishes to live

Loop revert is automatic. The publish-to-live decision is a human one-click against a versioned, validated workspace. Rollback to any previous night.

Setup

Two commands. Overnight budget, $0.15.

# 1. refresh ads data: Meta + Google Ads APIs into one snapshot
npx tsx scripts/refresh-ads-snapshot.ts

# 2. run the optimization loop: score · mutate · validate · keep/revert
npx tsx scripts/run-gtm-loop.ts

# standalone eval against a seed export
npx tsx evals/eval_gtm_signal_quality.ts content/gtm-templates/HRE/seed/shopify-ecom-web.json \
  --enriched-snapshot data/signals/ads-snapshot-enriched.json

~30 rounds × ~3K tokens ≈ ~90K tokens per run on Claude Haiku. The snapshot pipeline retries with exponential backoff, refuses to run on snapshots >72h old, and writes atomically so a mid-run crash can never corrupt state.
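The freshness and retry rules can be sketched as follows. The 72h refusal threshold comes from the paragraph above and the 24h threshold from the scorecard section; treating the range in between as "stale but usable, structural dimensions only" is an interpretation, not something the source states. All names (`classifySnapshot`, `backoffDelays`, `withBackoff`) and the default retry settings are hypothetical, and the atomic write step is not shown.

```typescript
const HOUR_MS = 60 * 60 * 1000;

type Freshness = "fresh" | "stale" | "refuse";

// fresh:  under 24h, ads-driven dimensions can activate
// stale:  24h to 72h, assumed usable for structural scoring only
// refuse: over 72h, the pipeline should not run on this snapshot
function classifySnapshot(takenAt: Date, now: Date = new Date()): Freshness {
  const ageHours = (now.getTime() - takenAt.getTime()) / HOUR_MS;
  if (ageHours > 72) return "refuse";
  if (ageHours < 24) return "fresh";
  return "stale";
}

// Waits between attempts, doubling each time: base, 2×base, 4×base, ...
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from(
    { length: Math.max(0, attempts - 1) },
    (_, i) => baseMs * 2 ** i,
  );
}

// Retries an async call with exponential backoff, rethrowing the last error.
async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 4,   // assumed default
  baseMs = 500,   // assumed default
): Promise<T> {
  const delays = backoffDelays(attempts, baseMs);
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const wait = delays[i];
      if (wait !== undefined) {
        await new Promise<void>((resolve) => setTimeout(resolve, wait));
      }
    }
  }
  throw lastError;
}
```

With the assumed defaults, a call that keeps failing is attempted four times with waits of 500 ms, 1 s, and 2 s between attempts before the error surfaces.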

Why a loop and not a one-shot prompt

Same shape as Karpathy's autoresearch — different domain.

A one-shot "rewrite my GTM container" prompt has no feedback signal. It can't tell you whether the patch improved EMQ, broke a trigger, or left a Google Ads conversion action orphaned. The loop's value is the measurement window: fixed scoring, validated mutations, and a keep-or-revert decision per round. Same shape as karpathy/autoresearch: the agent experiments with configs the way it would experiment with model architectures.

The loop is not a code tool, not a deploy tool, and not a replacement for human review of the winner. It's a way to run ~100 small, scored, reversible experiments while you sleep and hand you a publish-ready workspace by morning.