Autonomous GTM Experimentation
Built on the karpathy/autoresearch loop pattern, this playbook applies autonomous feedback loops to GTM assets — emails, ads, landing pages, nurture flows — tested against revenue-linked metrics. Replace manual A/B testing with agent-driven loops that compound ICP-specific learnings across channels.
Goal: Replace manual, low-velocity GTM testing with autonomous experimentation loops that compound learnings across channels and drive revenue-linked outcomes at 100x the velocity of traditional A/B testing.
Complexity: High
Tools: 7
Context
The Problem
GTM teams run campaigns, not experiments. When they do test, it's 1-2 manual A/B tests per month — a human writes a hypothesis, a developer sets it up, a week passes before there's enough data, another human decides what to do next. By the end of the year you've run 30 experiments. A competitor running autoresearch loops has run 3,000.
The AI SDR wave made this worse by promising autonomy without architecture. Tools that claim to "do outbound for you" optimize for booked meetings, not SQLs. 70% of AI SDR users quit within three months because pipeline never moves.
What breaks:
- Optimizing the wrong metric — reply rates, opens, and click-throughs go up while SQLs stay flat, because no one wired the feedback loop to revenue
- Statistical noise masquerading as signal — B2B volumes are low; decisions made on 50-100 events that need 200-500 to mean anything
- Bad data at scale — siloed tools with inconsistent identity resolution mean autonomous agents personalize on fragments and scale the wrong decisions across every channel
- Autonomy without strategy — AI SDR stacks with no human layer misidentify ICPs, send robotic sequences, and collapse pipeline while the monthly invoice keeps clearing
Why it matters:
The AI SDR market is growing from $4.12B (2025) to $15.01B by 2030 at 29.5% CAGR. Most of that spend will produce exactly the results the Reddit threads document: $2,000/month tools that book zero demos and extract two-year contracts. The teams that win aren't the ones who buy the most autonomous agents — they're the ones who build the right loops.
The Solution
The autoresearch pattern — originally built by Andrej Karpathy for ML model optimization — is a 630-line feedback loop: modify one variable, run a fixed experiment, measure against a single metric, keep what wins, discard what doesn't, repeat. Karpathy's script ran ~700 experiments in two days and found 20 improvements a human expert missed. Shopify's CEO pointed it at their Liquid templating engine and got 93 automated commits, 53% faster rendering, and 61% fewer memory allocations.
The GTM version replaces the training script with a GTM asset (email, ad, landing page, nurture flow) and the model accuracy metric with a revenue-linked outcome (reply rate, CVR, SQL rate). The loop runs on real traffic, logs everything, and compounds learnings across channels.
Level 1: First Loop (Week 1-2)
Start with cold email. One ICP segment, one metric, no full autonomy yet.
- Choose one ICP segment (e.g., RevOps leaders at 50-500 FTE SaaS companies, UK-based)
- Primary metric: reply rate. Guardrails: spam complaints, unsubscribe rate
- Stack: Clay for list and signals, Instantly or Lemlist for sending, Claude or MindStudio to generate variants
- Take your current best-performing subject + opener as the baseline
- Generate 3 challenger variants using an LLM prompt embedding your ICP, offer, and brand guardrails — test one variable at a time (subject only, or opener only, never both)
- Send each variant to 100+ prospects in the same segment over 48 hours; keep sending the baseline in parallel
- Measure positive reply rate only — not opens, not total replies
- Promote a challenger to new baseline only if it beats by +30% relative lift with at least 20 total replies
- Log hypothesis, what changed, and outcome in a JSON file — this is your experiment journal
By the end of Week 2 you have a working loop, a minimal memory system, and ground truth on what sample size your audience actually needs.
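The journal entry from the last step can be sketched as an append-only JSON Lines record. The field names below are illustrative, not a required schema — the point is one entry per experiment, logging hypothesis, the single variable changed, and the outcome.

```python
import json
from datetime import date

# One experiment = one journal entry. Field names are illustrative.
entry = {
    "id": "email-2026-03-014",
    "channel": "cold_email",
    "hypothesis": "A question-form subject lifts positive replies for the RevOps ICP",
    "variable_changed": "subject",          # exactly one variable per experiment
    "baseline": {"sends": 112, "positive_replies": 6},
    "challenger": {"sends": 108, "positive_replies": 9},
    "outcome": "promoted",                  # promoted | reverted
    "logged": date.today().isoformat(),
}

# An append-only JSON Lines file keeps the journal trivially queryable
# (DuckDB and most warehouses can read .jsonl directly).
with open("experiment_journal.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```

Later levels only add fields (e.g. a measurement window, a governance tier); the append-only shape stays the same.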
Level 2: Full System — The Autonomous GTM Lab (Week 2-4)
Build the reusable architecture that applies the core loop pattern to every channel with automated execution and shared memory.
The Core Loop (every channel, every time):
- Define the objective function — one primary metric + 1-2 guardrails (never optimize for anything you wouldn't report to your CEO)
- Define the action space — enumerate exactly which fields the agent can touch; freeze everything else
- Set the measurement window — channel-specific (48h email, 3-7d ads, 1-3w landing pages, 7d nurture)
- Agent proposes hypothesis + one variant, with rationale drawn from the experiment journal
- Execute via API — no manual deployment
- Measure against baseline using the same data source as always
- Keep if it beats baseline; revert if it doesn't; log either way
- Generate next hypothesis from memory (last N journal entries)
- Loop
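The keep-or-revert step reduces to a single decision function using the Level 1 thresholds (+30% relative lift, 20-reply floor). The function name and signature are mine, not part of the playbook:

```python
def should_promote(baseline_rate: float, challenger_rate: float,
                   challenger_events: int,
                   min_events: int = 20,
                   min_relative_lift: float = 0.30) -> bool:
    """Keep the challenger only if it clears both the sample floor
    and the relative-lift bar; otherwise keep the baseline."""
    if challenger_events < min_events:
        return False  # not enough signal yet; keep measuring
    if baseline_rate <= 0:
        return challenger_rate > 0  # any replies beat a dead baseline
    lift = (challenger_rate - baseline_rate) / baseline_rate
    return lift >= min_relative_lift

# 4% baseline vs 6% challenger with 24 replies: +50% relative lift, promote.
assert should_promote(0.04, 0.06, challenger_events=24)
# Same lift but only 12 replies: below the sample floor, keep the baseline.
assert not should_promote(0.04, 0.06, challenger_events=12)
```

Keeping this rule in one place means every channel's executor calls the same promotion logic with channel-specific thresholds.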
Channel architecture:
- Cold email: Primary metric = positive reply rate. Agent touches subject, opener, CTA, send time. 48h window, 100 sends per variant, 20 total replies minimum. Stack: Clay + Instantly/Lemlist + agent.
- Google Ads: Primary metric = CPA or ROAS. Agent touches headlines and descriptions only (no budgets). 3-7 day window, 400 conversions per variant for 20-30% lift detection.
- Landing pages: Primary metric = CVR (visit to next action). Agent touches H1, subheadline, primary CTA text, social proof block. 1-3 week window, 200-500 visitors per variant.
- Email nurture: Primary metric = conversion to next stage. Agent touches subject, preview text, CTA, send timing. 7 day window, 50 triggered per variant.
- LinkedIn content: Primary metric = click-to-site rate. Agent touches hook (first line), format, CTA, length, post time. 48h window, 500 impressions per variant.
- SEO meta: Primary metric = organic CTR. Agent touches title tag, meta description (fixed URL set). 2-4 week window, 1,000 GSC impressions per variant.
Safety architecture:
Every loop has three layers of protection:
- Budget caps — per-experiment spend ceilings for ads (10-20% of channel budget), plus hard monthly limits with auto-pause. Agent never touches budget settings.
- Rollback thresholds — auto-revert when primary metric drops >30% vs control or any guardrail (spam rate, unsubscribe rate, CPC ceiling) trips. For ads: rollback after two consecutive measurement windows of underperformance.
- HOTL governance tiers:
  - Tier 0 (auto-deploy): subject lines, body copy variants, send timing, minor CTA text
  - Tier 1 (human approval queue): offers, pricing page copy, anything mentioning competitors
  - Tier 2 (no autonomous changes): contracts, legal language, security claims, pricing
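The rollback rules above can be sketched as one check. Thresholds follow the text (>30% drop, two consecutive windows for ads); the function itself is a hypothetical sketch, and a real implementation would also handle the per-guardrail detail (e.g. whether a spam-rate trip bypasses the two-window grace for ads):

```python
def should_rollback(primary_drop_vs_control: float,
                    guardrail_tripped: bool,
                    consecutive_bad_windows: int,
                    is_ads: bool = False) -> bool:
    """Auto-revert on a >30% primary-metric drop vs control or any
    tripped guardrail; ads get a second measurement window first."""
    breach = primary_drop_vs_control > 0.30 or guardrail_tripped
    if not breach:
        return False
    required_windows = 2 if is_ads else 1
    return consecutive_bad_windows >= required_windows

# Email variant down 35% vs control after one window: revert immediately.
assert should_rollback(0.35, False, consecutive_bad_windows=1)
# Ads variant down 35% for one window: wait for a second window.
assert not should_rollback(0.35, False, consecutive_bad_windows=1, is_ads=True)
```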
Level 3: Multi-Channel Lab (Week 4-6)
Once two or more single-channel loops are running and producing clean journal data, introduce the planner-executor-evaluator architecture that Meta used in their Ranking Engineer Agent (REA), which doubled model accuracy and let three engineers do the work of six.
- Planner agent — reads business objectives and the cross-channel journal, allocates experiment budget by channel based on current confidence and impact potential
- Executor agents — one per channel, each running the core loop within the Planner's constraints
- Evaluator agent — aggregates pipeline and revenue outcomes across channels, identifies cross-channel patterns, flags conflicts, updates the Planner
Cross-channel compounding in practice: timeline hooks consistently outperform problem hooks in cold email for RevOps ICs → ads loop seeds new headlines with timeline framing for the same retargeting segment → landing page loop tests timeline-framed H1 for the same ICP. Learning generated once, applied everywhere.
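That compounding step can be sketched as a small Evaluator-to-Planner handoff: promoted hypotheses from one channel become candidate hypotheses for another channel targeting the same ICP. Function and field names are illustrative:

```python
def seed_cross_channel(journal: list[dict], source: str, target: str) -> list[str]:
    """Turn promoted hypotheses from one channel into candidate
    hypotheses for another (same ICP segment assumed)."""
    return [
        f"[{target}] re-test winning framing from {source}: {e['hypothesis']}"
        for e in journal
        if e["channel"] == source and e["outcome"] == "promoted"
    ]

journal = [
    {"channel": "cold_email", "outcome": "promoted",
     "hypothesis": "timeline hooks beat problem hooks for RevOps ICs"},
    {"channel": "cold_email", "outcome": "reverted",
     "hypothesis": "emoji in subject lifts replies"},
]

seeds = seed_cross_channel(journal, "cold_email", "google_ads")
assert len(seeds) == 1 and "timeline hooks" in seeds[0]
```

Only promoted entries cross channels; reverted hypotheses stay in the journal as negative evidence so no executor re-proposes them.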
Expected Metrics
| Metric | Expected movement |
|---|---|
| Experiment velocity | <5 to 50-200+ per channel per week |
| Cold email reply rate | 2-4% → 8-12% in 4-6 weeks (vendor-reported, MindStudio) |
| Landing page CVR | +15-40% over 8-12 weeks (vendor-reported, MindStudio) |
| Ad CPA | -20-30% over 8-16 weeks (vendor-reported) |
Traditional Experimentation vs. Autonomous GTM Lab
| Aspect | Traditional | Our Approach |
|---|---|---|
| Experiments per period | 1-2 per month; manual setup and analysis | 10-200+ micro-experiments per week, all logged |
| Metric alignment | Often CTR and CVR; revenue linkage ad hoc | Primary metrics are SQLs, pipeline, and CAC with hard guardrails |
| Data ownership | You own data; experiments sit in vendor silos | Data and journals live in your warehouse or DuckDB |
| Customization | Manual — you design tests one at a time; logic lives in your head | Systematic — open program.md prompts and per-channel schemas; agent iterates within your defined action space |
| Cost model | $5k-80k/year for enterprise tools | Engineering and infrastructure time; no vendor lock-in |
| Transparency | Fragmented — results split across tool dashboards, no unified experiment record | Full audit trail — every hypothesis, variant, metric, and outcome in a queryable experiment journal |
| Human role | Human designs and analyzes every test | Human sets strategy and guardrails; agents execute within constraints |
Tools & Data
Required (Minimum Viable)
Recommended (Full System)
Competitor Landscape
| Tool | Approach | Best For | Limitation |
|---|---|---|---|
| Landbase | AI SDR platform — agentic outbound sequences with fixed workflows and 40M+ campaign training data | Teams wanting turnkey outbound without engineering | Black-box, no experiment journal, no human-configurable loops. ~$3,000/month; vendor-reported claims not independently audited |
| Warmly | Signal-based AI GTM — visitor de-anonymization + automated outbound triggers for B2B website traffic | Website-traffic-driven outbound automation | Channel automation, not systematic experimentation with memory. Sales-led pricing |
| MindStudio | No-code agent builder with scheduling and integrations. Most explicit autoresearch implementation guide in the market | Teams wanting visual GTM loop builders — closest to what this playbook describes | Platform dependency; free + ~$20/month Individual; enterprise custom |
| Vect AI | 69 SaaS growth strategies as autonomous blueprints executed by agents | Pre-codified growth playbook execution | Blueprints are pre-designed, not iterative loops with shared memory. Sales-led pricing |
| Traditional A/B testing tools (VWO, Optimizely) | Statistical rigor for website and app tests — excellent test harnesses and statistical engines | Web experimentation with manual hypothesis design | No autonomous hypothesis generation; still human-driven. VWO Starter ~$314/month; Optimizely $50k-200k+/year |
| Google PMax / Meta Advantage+ | Platform automation — black-box budget and creative optimization within platform walls | Broad reach optimization within walled gardens | PMax is blind, hungry, and confused when fed weak creative or wrong goals; you cannot inspect or override its logic |
| Custom build (warehouse + agents) | Full control; no vendor lock-in — exactly what this playbook describes | Teams with engineering capacity wanting permanent data and logic ownership | Higher initial build cost; ~$0-500/month in infrastructure |
Industry Benchmarks
| Metric | Benchmark | Source |
|---|---|---|
| Autoresearch loop efficiency | ~700 experiments in 2 days, ~20 improvements, 11% model speedup | Fortune / Karpathy, Mar 2026 |
| Shopify Liquid autoresearch | 93 automated commits, 53% faster parse+render, 61% fewer allocations | Simon Willison / WecoAI, Mar 2026 |
| Meta REA autonomous experimentation | 2x average model accuracy; 3 engineers delivered work of 6+ | Meta Engineering Blog, Mar 2026 |
| Cold email loop performance | Reply rates from 2-4% to 8-12% in 4-6 weeks | MindStudio, 2026 |
| Landing page loop performance | 15-40% CVR uplift over 8-12 weeks | MindStudio, 2026 |
| AI SDR market growth | $4.12B (2025) to $15.01B (2030) at 29.5% CAGR | MarketsandMarkets / GlobeNewswire, Oct 2025 |
| AI SDR churn rate | 70% of users quit within 3 months | r/gtmengineering, 2026 |
| Multi-agent system inquiries | 1,445% surge from Q1 2024 to Q2 2025 | Gartner, via VirtualAssistantVA |
| B2B experiment velocity (traditional) | Most teams run 20-30 experiments/year | Eric Siu / Fortune framing, 2026 |
Emerging Trends
- karpathy/autoresearch applied to GTM (March 2026) — Andrej Karpathy's open-source autoresearch loop (https://github.com/karpathy/autoresearch) ran ~700 ML experiments in 2 days and found 20 improvements a human expert missed. GTM practitioners are now adapting the same pattern — modify one variable, deploy, measure against a single metric, keep what wins — to cold email, ad copy, and landing pages. This is the architectural foundation this playbook builds on. Why it matters: enables 100x experiment velocity over manual A/B testing by removing humans from the iteration loop while keeping them in the strategy and guardrails layer.
- Shopify Liquid autoresearch (March 2026) — Tobi Lütke pointed the autoresearch pattern at Shopify's Liquid templating engine. Result: 93 automated commits, 53% faster parse-and-render, 61% fewer memory allocations. First major production validation that autoresearch loops deliver compounding gains on real engineering assets. Why it matters: proof that autoresearch produces measurable, compounding improvements on real production systems, not just ML benchmarks.
- Meta Ranking Engineer Agent (REA) (March 2026) — Meta's autonomous experimentation system doubled average model accuracy and let 3 engineers deliver the output of 6+ across 8 ranking models. The planner-executor-evaluator architecture this playbook uses at Level 3 is derived from Meta's REA design. Why it matters: validates the multi-agent orchestration pattern at enterprise scale; 2x output with half the headcount is the benchmark for what autonomous GTM labs should target.
Team Responsibilities
| Role | Responsibility |
|---|---|
| GTM Engineer | Loop design, API integrations, program.md prompts, scheduling, and experiment orchestration. The person who builds and maintains the system. |
| Marketing Ops | Channel configurations, compliance, deliverability, brand guardrails, and alignment between live campaigns and loops. The person who stops the agent from doing something embarrassing. |
| Data Engineer | Clean data pipelines, experiment journal schema, warehouse/DuckDB integration, and coverage monitoring. Without this role, loops break silently. |
Failure Patterns
| Pattern | What Happens | Why | Prevention |
|---|---|---|---|
| Optimizing Reply Rate, Not Revenue | Reply rates go up; SQL and pipeline stay flat; agent keeps improving the wrong thing | Objective function was set to a proxy metric with no feedback loop to CRM pipeline | Set primary metric as SQL or SQO creation rate; require pipeline linkage before any variant gets promoted |
| $2,000/month AI SDR, Zero Demos | Contract signed, tool deployed, zero meetings booked, two-year lock-in begins | Black-box workflows, no ICP validation, no experiment transparency, misaligned vendor incentives | Open experiment journal from day one; no black-box agents; ICP defined and owned by your team in Clay before any loop runs |
| 70% Quit AI SDR Tools in 3 Months | Hype cycle ends, revenue never moves, teams cancel and lose trust in AI GTM entirely | Tools promised full autonomy; delivered automation without intelligence; no transparency on what the agent actually tried | Start with one channel, show pipeline impact before scaling, log every experiment so you can explain every decision |
| Over-Fitting to Noise in B2B | Variant that looked good at 80 sends gets promoted; underperforms at full volume; wasted weeks | No minimum sample thresholds; frequentist thinking applied to tiny B2B audiences | Hard minimum sample gates per channel; sequential testing or Bayesian logic; only run bold single-variable tests |
| Stale or Siloed Data at Scale | Agent personalizes using company size data from 18 months ago; sends enterprise copy to a company that laid off 200 people | No unified identity layer; disconnected data sources with different refresh cadences | Require unified identity and events (DuckDB or CDP) as a prerequisite; build data freshness checks into every loop config |
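The "Over-Fitting to Noise" row can be made concrete with the textbook two-proportion power calculation (normal approximation, 95% confidence, 80% power). It shows why small B2B samples mislead: detecting a modest relative lift on a low baseline rate often demands far more volume than a typical outbound segment supplies, which is exactly why the playbook favors bold single-variable tests and sequential or Bayesian logic over naive thresholds. The function below is a rough sketch, not a substitute for a proper statistics engine:

```python
def min_sample_per_variant(baseline_rate: float, relative_lift: float,
                           z_alpha: float = 1.96,  # 95% confidence, two-sided
                           z_beta: float = 0.84    # 80% power
                           ) -> int:
    """Approximate per-variant sample size to detect a relative lift on a
    conversion-style rate (two-proportion normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)
    return int(numerator / (p2 - p1) ** 2) + 1

# A +30% lift on a 4% reply rate needs thousands of sends per variant;
# bolder swings (bigger expected lifts) need far fewer.
n_modest = min_sample_per_variant(0.04, 0.30)
n_bold = min_sample_per_variant(0.04, 0.50)
assert n_bold < n_modest
```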
ICP Fit Notes
Best fit
- Series A-C B2B SaaS with $2M-$50M ARR, measurable inbound and outbound volume (hundreds of leads/month), and a 5+ person GTM team
- PLG or hybrid PLG/Sales motions where website, in-app, email, and sales touchpoints generate thousands of measurable events per month
- Teams already running some experimentation (VWO, Optimizely, Statsig) but stuck at low velocity because every test requires a developer and a human review cycle
Also works for
- High-velocity mid-market SaaS with heavy paid acquisition and a strong analytics foundation already in place
- Later-stage companies modernizing their GTM stack away from channel silos toward experiment-led operations
Insight: Teams that already know what channels convert their ICP but not why see the fastest return. The autoresearch lab turns that implicit, undocumented knowledge into an explicit, compounding playbook that doesn't leave when a senior marketer does.
Implementation Checklist
Phase 1: Foundation (Week 1)
- Audit GTM data: confirm CRM, analytics, and messaging events share consistent identity (email or domain)
- Map your current funnel metrics to a clear hierarchy: primary (SQLs/pipeline), secondary (CTR/reply rate), guardrails (spam, unsubscribes, CPA ceiling)
- Choose first channel — cold email if you have an active outbound motion; landing page if you have 1,000+ monthly visitors to a key URL
- Stand up experiment journal: DuckDB table or JSON store with the experiment schema
- Configure API access for your chosen tools (Clay, PostHog, email platform or CMS)
Phase 2: First Loop (Week 2)
- Write channel-specific program.md: hypothesis format, action space definition, guardrail thresholds, and measurement window
- Run the first 10 experiments manually — generate variants with LLM, deploy via API, measure, log
- Enforce minimum sample thresholds before promoting any winner
- Review journal entries with GTM and RevOps lead to confirm metrics and safety logic
- Adjust action space, guardrails, or prompts based on what the first 10 experiments taught you
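A minimal program.md sketch for the first channel, assuming a cold email loop. The sections and thresholds mirror Level 1 and the safety architecture; the exact format is illustrative, not prescribed:

```markdown
# program.md — cold email loop (illustrative)

## Objective
Primary: positive reply rate.
Guardrails: spam complaint rate, unsubscribe rate (auto-revert on any trip).

## Action space
Change exactly ONE of: subject, opener, cta, send_time.
Everything else is frozen.

## Hypothesis format
"Changing <field> from <old> to <new> will lift positive replies for
<ICP segment> because <rationale drawn from journal entries>."

## Measurement
Window: 48h. Minimum: 100 sends per variant, 20 total replies.
Promote only at >= +30% relative lift vs baseline; otherwise revert and log.
```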
Phase 3: Second Channel + Automation (Week 3-4)
- Add a second channel loop sharing the same experiment journal
- Automate loop execution via MindStudio, GitHub Actions, or custom worker
- Implement HOTL workflow for Tier 1 changes: approval queue with Slack notifications
- Run weekly journal review to extract human-readable ICP learnings by segment
- Integrate experiment outcomes into Revenue Intelligence dashboard (play_029)
Phase 4: Multi-Channel Lab (Week 5-6)
- Introduce Planner and Evaluator agents to coordinate across channels
- Wire cross-channel hypothesis sharing (email winners seed ad headline candidates)
- Build GTM Lab dashboard: experiment velocity, win rate, and pipeline impact per channel
- Write governance charter: autonomy tiers, escalation paths, compliance rules
- Publish program.md files for each active channel to your internal knowledge base
Sources
- 1. Fortune — The Karpathy Loop (2026): autoresearch pattern and 700-experiment, 20-improvement benchmark
- 2. Simon Willison / WecoAI — Shopify Liquid autoresearch (2026): 93 automated commits, 53% faster rendering, 61% fewer allocations
- 3. Meta Engineering Blog — REA autonomous experimentation (Mar 2026): 2x model accuracy, 3 engineers delivering work of 6+ — https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/
- 4. karpathy/autoresearch — original repo (2026): reference architecture for autonomous feedback loops — https://github.com/karpathy/autoresearch
- 5. MindStudio — Autonomous Marketing Optimization Agent (2026): GTM loop templates and cold email/landing page optimization guides
- 6. Treasure Data — Agentic Marketing (2025): 20-40% campaign performance improvement benchmark
- 7. BCG — CMOs who move first in agentic marketing (2025): strategic framing for autonomous marketing adoption
- 8. MarketsandMarkets / GlobeNewswire — AI SDR market (Oct 2025): $4.12B to $15.01B by 2030 at 29.5% CAGR
- 9. VirtualAssistantVA / Gartner — 1,445% multi-agent inquiry surge (2026)
- 10. Reddit r/SaaS — $2,000/month AI SDR, zero demos (2026): practitioner failure case
- 11. Reddit r/gtmengineering — 70% AI SDR churn in 3 months (2026): practitioner-reported adoption failure
- 12. Statsig — B2B SaaS experimentation guide (2025): statistical rigor for low-volume B2B testing
- 13. Eric Siu / LinkedIn — 36,500 experiments framing (2026): velocity comparison for autoresearch vs manual testing
- 14. Agentic Foundry — Human-on-the-loop governance (2026): HOTL tier framework for autonomous marketing
- 15. Oracle — The Agentic Marketing Era (2025): enterprise framing for autonomous marketing systems
- 16. WecoAI — awesome-autoresearch (2026): community reference collection — https://github.com/WecoAI/awesome-autoresearch
- 17. zkarimi22 — autoresearch-anything (2026): generalized autoresearch pattern
When NOT to Use
- Low-volume GTM — if you cannot reach 200-500 visitors per landing page variant or 100+ email sends per variant within a reasonable window, statistical noise overwhelms signal
- No clean baseline metrics — if you do not reliably track SQLs, pipeline stage, and revenue back to specific campaigns and channels, there is no signal to optimize against
- Enterprise-only, long sales cycles — if your average sales cycle is 6-18 months and you close 5-10 deals per quarter, you do not have enough events for any feedback loop
- No API access to your GTM channels — autonomous experimentation requires programmatic variant deployment and metric retrieval
- Compliance-sensitive industries — financial services, healthcare, and legal, where copy changes carry non-trivial legal or reputational risk, need humans reviewing every public-facing change
- No data engineering capacity — without someone who can maintain clean identity resolution, event pipelines, and experiment journal integrity, autonomous loops will silently amplify data quality problems