apply · CMO profile karpathy/autoresearch
Autoresearch · GTM · Overnight Loop

Karpathy's experimentation loop, pointed at GTM containers.

A 12-dimension scorer, ads data from Meta + Google, and Claude Haiku mutations — wired into one overnight loop. Pull → Score → Mutate → Validate → Keep/Revert → Repeat. ~100 experiments by morning, never published to live, and a versioned winning config waiting in R2 for one-click publish.

flowchart LR
    You(["👤 Operator<br/>npx run-gtm-loop"]):::start

    subgraph SG1 [" PULL · ads data "]
        direction TB
        P1["1<br/>Meta Ads API<br/>conv · EMQ · CAPI"]:::pull
        P2["2<br/>Google Ads API<br/>actions · labels"]:::pull
    end

    SNAP[("ads-snapshot-<br/>enriched.json")]:::snap

    subgraph SG2 [" SCORE · 12-dim eval "]
        direction TB
        S1["3<br/>Structural<br/>tags · params · dedup"]:::score
        S2["4<br/>Ads-driven<br/>Meta · CAPI · funnel · GAds"]:::score
    end

    subgraph SG3 [" MUTATE · Claude Haiku "]
        direction TB
        M1["5<br/>SKILL.md<br/>+ scorecard"]:::mutate
        M2["6<br/>Patch GTM JSON<br/>tags · triggers · vars"]:::mutate
    end

    subgraph SG4 [" VALIDATE · Playwright "]
        direction TB
        V1["7<br/>Stage workspace<br/>preview mode"]:::validate
        V2["8<br/>Fire tags · check<br/>params · dedup"]:::validate
    end

    KEEP{{"score Δ ≥ 0?<br/>keep / revert"}}:::gate
    R2[("R2 · winning-config.json<br/>+ experiment log<br/>+ data audit (md + mermaid)")]:::r2
    GTM[("GTM staging<br/>workspace<br/>(one-click publish)")]:::gtm

    You ==> P1
    You ==> P2
    P1 ==> SNAP
    P2 ==> SNAP
    SNAP ==> S1 ==> S2
    S2 ==> M1 ==> M2
    M2 ==> V1 ==> V2
    V2 ==> KEEP
    KEEP ==>|keep| R2
    KEEP -.->|revert| S1
    R2 ==> GTM

    classDef start fill:#111114,stroke:#ec4899,stroke-width:2px,color:#ede0e8
    classDef pull fill:#111114,stroke:#f97316,stroke-width:2px,color:#f97316
    classDef score fill:#111114,stroke:#22d3ee,stroke-width:2px,color:#22d3ee
    classDef mutate fill:#111114,stroke:#4ade80,stroke-width:2px,color:#4ade80
    classDef validate fill:#111114,stroke:#a78bfa,stroke-width:2px,color:#a78bfa
    classDef gate fill:#1a1014,stroke:#f87171,stroke-width:2px,color:#f87171
    classDef snap fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
    classDef r2 fill:#17171b,stroke:#fbbf24,stroke-width:2px,color:#ede0e8
    classDef gtm fill:#1a0e18,stroke:#ec4899,stroke-width:2px,color:#ede0e8

Pull · Orange: Live ads data. Meta + Google APIs feed conversion counts, EMQ, CAPI dedup, and conversion actions.
Score · Cyan: 12 dimensions. 8 structural (always on) + 4 ads-driven (with enriched snapshot). One number to beat.
Mutate · Green: Claude Haiku reads SKILL.md + the current scorecard, then patches one GTM container JSON.
Validate · Purple: Playwright drives a staging preview. Tags fire, parameters land, dedup holds. No live traffic.
Persist · Amber: Winning JSON + experiment log + before/after data audit, all versioned in R2 for rollback.
Gate · Red: Score Δ < 0 reverts in-loop. Final publish to live GTM stays a one-click human decision.
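
The keep/revert gate described above can be sketched as a small hill-climbing loop. This is a minimal illustration, not the project's code: `Container`, `LoopSteps`, `RoundLog`, and `runLoop` are hypothetical names, and the injected `score`, `mutate`, and `validate` functions stand in for the 12-dimension scorer, the Claude Haiku patch step, and the Playwright staging check.

```typescript
// Hypothetical types: stand-ins for the real GTM container JSON and loop steps.
type Container = Record<string, unknown>;

interface LoopSteps {
  score: (c: Container) => number;      // single signal-quality score, 0..1
  mutate: (c: Container) => Container;  // proposes one patched candidate
  validate: (c: Container) => boolean;  // staging-only check, no live traffic
}

interface RoundLog {
  round: number;
  before: number;
  after: number | null; // null when the candidate failed validation
  kept: boolean;
}

function runLoop(
  seed: Container,
  steps: LoopSteps,
  rounds: number,
): { best: Container; bestScore: number; log: RoundLog[] } {
  let best = seed;
  let bestScore = steps.score(seed);
  const log: RoundLog[] = [];

  for (let round = 1; round <= rounds; round++) {
    const candidate = steps.mutate(best);
    // Candidates that fail validation are never scored and never kept.
    const after = steps.validate(candidate) ? steps.score(candidate) : null;
    // Gate from the diagram: keep when score delta >= 0, otherwise revert.
    const kept = after !== null && after - bestScore >= 0;
    log.push({ round, before: bestScore, after, kept });
    if (kept && after !== null) {
      best = candidate;
      bestScore = after;
    }
  }
  return { best, bestScore, log };
}
```

Because a reverted round leaves `best` untouched, the next mutation always starts from the last kept configuration, which is what makes each round's score comparable to the one before it.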

From 84.3% → 91.2% in five rounds.

Eight structural dimensions are always on; four ads-driven dimensions activate when the enriched snapshot is fresh (<24h). The CMO profile weights revenue signal quality, conversion reliability, and channel-level decision usefulness — missing data returns a neutral 0.5, not a false 1.0.

Tag coverage: 100%
Parameter completeness: 100%
Deduplication: 100%
Trigger quality: 100%
Folder organization: 100%
Meta Ads alignment: 100%
CAPI coverage: 99.4%
Naming conventions: 97.6%
Variable hygiene: 87.5%
Google Ads alignment: 80%
Funnel integrity: 70%
Consent settings: 60%
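
As a worked check on the numbers above: an unweighted mean of the twelve listed values comes to about 91.2%, which matches the headline figure. The sketch below shows that arithmetic together with the neutral-0.5 fallback for missing data. The equal weighting and the names `aggregate`, `NEUTRAL`, and `scorecard` are assumptions for illustration; the CMO profile's actual weights are not given here.

```typescript
// Assumption: equal weights. The real CMO profile weighting is not stated in this section.
const NEUTRAL = 0.5; // missing data scores neutral, not a false 1.0

// Scores are fractions in 0..1; null means the dimension had no data.
function aggregate(scores: Record<string, number | null>): number {
  const values = Object.values(scores).map((s) => s ?? NEUTRAL);
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const scorecard: Record<string, number | null> = {
  "Tag coverage": 1.0,
  "Parameter completeness": 1.0,
  "Deduplication": 1.0,
  "Trigger quality": 1.0,
  "Folder organization": 1.0,
  "Meta Ads alignment": 1.0,
  "CAPI coverage": 0.994,
  "Naming conventions": 0.976,
  "Variable hygiene": 0.875,
  "Google Ads alignment": 0.8,
  "Funnel integrity": 0.7,
  "Consent settings": 0.6,
};

console.log((aggregate(scorecard) * 100).toFixed(1) + "%"); // prints "91.2%"
```

The neutral fallback matters for the ads-driven dimensions: a dimension with no data pulls the mean toward the middle instead of inflating it, so a stale or missing snapshot cannot make a container look better than it is.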

Six things autoresearch buys you.

PULL · 01
Real signals, not vibes
The optimizer sees what's actually firing in Meta + Google — conversion counts, EMQ, CAPI dedup. Catches revenue leakage where browser pixel fires but sGTM never forwards.
SCORE · 02
One number to beat
Twelve dimensions collapse into a single signal-quality score, the tracking analogue of autoresearch's val_bpb: every patch is comparable, every round is auditable.
MUTATE · 03
Claude touches one file
Haiku rewrites the GTM container JSON, guided by program.md → SKILL.md. Tag configs, trigger rules, variable mappings — never code, never deploys.
VALIDATE · 04
Playwright before promotion
Each candidate is fired in staging preview — tags, parameters, dedup all checked. Bad patches revert before they ever touch the winning slot.
PERSIST · 05
Wake up to deliverables
Staging workspace, winning-config.json in R2, experiment log with diffs, before/after data audit with Mermaid — every morning, on a 24h signal window.
GATE · 06
Never publishes to live
Loop revert is automatic. The publish-to-live decision is a human one-click against a versioned, validated workspace. Rollback to any previous night.
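
The revenue-leakage case from the Pull card (the browser pixel fires but sGTM never forwards) comes down to comparing event IDs across the two sides. The sketch below is one way to express that check. `TrackedEvent`, `findLeakedEvents`, and `forwardingCoverage` are hypothetical names, and the event shape is an assumption, since the real snapshot schema is not shown here.

```typescript
// Hypothetical event shape; the real ads-snapshot schema is not shown in this document.
interface TrackedEvent {
  eventId: string;
  name: string;
  value?: number;
}

// Returns browser events whose eventId never appears server-side,
// i.e. the pixel fired but nothing was forwarded.
function findLeakedEvents(browser: TrackedEvent[], server: TrackedEvent[]): TrackedEvent[] {
  const forwarded = new Set(server.map((e) => e.eventId));
  return browser.filter((e) => !forwarded.has(e.eventId));
}

// Share of browser events with a server-side counterpart (1 = full coverage).
// With no browser events there is nothing to measure, so this returns the
// neutral 0.5 that the scorer uses for missing data.
function forwardingCoverage(browser: TrackedEvent[], server: TrackedEvent[]): number {
  if (browser.length === 0) return 0.5;
  const leaked = findLeakedEvents(browser, server).length;
  return (browser.length - leaked) / browser.length;
}

const browserEvents: TrackedEvent[] = [
  { eventId: "a1", name: "purchase", value: 120 },
  { eventId: "a2", name: "purchase", value: 45 },
  { eventId: "a3", name: "add_to_cart" },
];
const serverEvents: TrackedEvent[] = [
  { eventId: "a1", name: "purchase", value: 120 },
  { eventId: "a3", name: "add_to_cart" },
];

// Event "a2" fired in the browser but was never forwarded.
console.log(findLeakedEvents(browserEvents, serverEvents).map((e) => e.eventId));
```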

Two commands. Overnight budget, $0.15.

# 1. refresh ads data: Meta + Google Ads APIs into one snapshot
npx tsx scripts/refresh-ads-snapshot.ts

# 2. run the optimization loop: score · mutate · validate · keep/revert
npx tsx scripts/run-gtm-loop.ts

# standalone eval against a seed export
npx tsx evals/eval_gtm_signal_quality.ts content/gtm-templates/HRE/seed/shopify-ecom-web.json \
  --enriched-snapshot data/signals/ads-snapshot-enriched.json

~30 rounds × ~3K tokens ≈ ~90K tokens per run on Claude Haiku. The snapshot pipeline retries with exponential backoff, refuses to run on snapshots >72h old, and writes atomically so a mid-run crash can never corrupt state.
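
The three snapshot safeguards named above (exponential backoff, the 72h staleness cutoff, atomic writes) can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the function names, the 500 ms base delay, the 30 s cap, and the four-attempt limit are all hypothetical. Only the 72h limit comes from the text.

```typescript
import { writeFileSync, renameSync } from "node:fs";

const MAX_AGE_HOURS = 72; // from the text: snapshots older than this are refused

// True when the snapshot is too old to run the loop on.
function isStale(fetchedAt: Date, now: Date = new Date()): boolean {
  const ageHours = (now.getTime() - fetchedAt.getTime()) / 3_600_000;
  return ageHours > MAX_AGE_HOURS;
}

// Exponential backoff: base, 2x base, 4x base, ... up to a cap.
// The base and cap values are assumptions.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries an async call, sleeping between failed attempts; rethrows the last error.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Atomic write: write a temp file next to the target, then rename over it.
// A crash mid-write leaves the temp file behind but never a half-written target.
function writeAtomic(path: string, data: unknown): void {
  const tmp = `${path}.${process.pid}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  renameSync(tmp, path);
}
```

The rename step is what provides the crash safety: on POSIX filesystems a rename within the same directory replaces the target in one step, so readers see either the old snapshot or the new one.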

Same shape as Karpathy's autoresearch — different domain.

A one-shot "rewrite my GTM container" prompt has no feedback signal. It can't tell you whether the patch improved EMQ, broke a trigger, or left a Google Ads conversion action orphaned. The loop's value is the measurement window: fixed scoring, validated mutations, and a keep/revert decision per round. As in karpathy/autoresearch, the agent experiments with configs the way it would experiment with model architectures.

The loop's not a code tool, not a deploy tool, not a replacement for a human review of the winner. It's a way to make ~100 small, scored, reversible experiments run while you sleep — and hand you a publish-ready workspace by morning.