Mazorda

Autonomous GTM Experimentation

Built on the karpathy/autoresearch loop pattern, this playbook applies autonomous feedback loops to GTM assets — emails, ads, landing pages, nurture flows — tested against revenue-linked metrics. Replace manual A/B testing with agent-driven loops that compound ICP-specific learnings across channels.

Goal: Replace manual, low-velocity GTM testing with autonomous experimentation loops that compound learnings across channels and drive revenue-linked outcomes at 100x the velocity of traditional A/B testing.

Complexity: High

Tools: 7

Context

The Problem

GTM teams run campaigns, not experiments. When they do test, it's 1-2 manual A/B tests per month — a human writes a hypothesis, a developer sets it up, a week passes before there's enough data, another human decides what to do next. By the end of the year you've run 30 experiments. A competitor running autoresearch loops has run 3,000.

The AI SDR wave made this worse by promising autonomy without architecture. Tools that claim to "do outbound for you" optimize for booked meetings, not SQLs. 70% of AI SDR users quit within three months because pipeline never moves.

What breaks:

  • Optimizing the wrong metric — reply rates, opens, and click-throughs go up while SQLs stay flat, because no one wired the feedback loop to revenue
  • Statistical noise masquerading as signal — B2B volumes are low; decisions made on 50-100 events that need 200-500 to mean anything
  • Bad data at scale — siloed tools with inconsistent identity resolution mean autonomous agents personalize on fragments and scale the wrong decisions across every channel
  • Autonomy without strategy — AI SDR stacks with no human layer misidentify ICPs, send robotic sequences, and collapse pipeline while the monthly invoice keeps clearing
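The sample-size point above can be made concrete. A rough sketch of the per-arm sample needed to tell two reply rates apart, using the standard normal approximation for a two-proportion test (the function name is illustrative; z-values are hard-coded for a two-sided alpha of 0.05 and 80% power):

```python
import math

def sample_size_per_arm(p_base, p_variant):
    """Approximate per-arm sample size for a two-proportion z-test.

    Standard normal approximation; z-values are hard-coded for
    alpha=0.05 (two-sided) and power=0.8.
    """
    z_alpha, z_beta = 1.96, 0.84  # valid only for these defaults
    p_bar = (p_base + p_variant) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p_base * (1 - p_base)
                                + p_variant * (1 - p_variant))) ** 2
    return math.ceil(num / (p_base - p_variant) ** 2)

# Distinguishing a 3% from a 5% reply rate takes ~1,500 sends per arm;
# decisions made on 50-100 events are mostly noise.
print(sample_size_per_arm(0.03, 0.05))
```

The asymmetry is the point: the smaller the lift you want to detect, the quadratically larger the sample you need, which is why low-volume B2B loops should only run bold single-variable tests.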

Why it matters:

The AI SDR market is growing from $4.12B (2025) to $15.01B by 2030 at 29.5% CAGR. Most of that spend will produce exactly the results the Reddit threads document: $2,000/month tools that book zero demos and extract two-year contracts. The teams that win aren't the ones who buy the most autonomous agents — they're the ones who build the right loops.

Resolution

The Solution

The autoresearch pattern — originally built by Andrej Karpathy for ML model optimization — is a 630-line feedback loop: modify one variable, run a fixed experiment, measure against a single metric, keep what wins, discard what doesn't, repeat. Karpathy's script ran ~700 experiments in two days and found 20 improvements a human expert missed. Shopify's CEO pointed it at their Liquid templating engine and got 93 automated commits, 53% faster rendering, and 61% fewer memory allocations.

The GTM version replaces the training script with a GTM asset (email, ad, landing page, nurture flow) and the model accuracy metric with a revenue-linked outcome (reply rate, CVR, SQL rate). The loop runs on real traffic, logs everything, and compounds learnings across channels.

Level 1: First Loop (Week 1-2)

Start with cold email. One ICP segment, one metric, no full autonomy yet.

  • Choose one ICP segment (e.g., RevOps leaders at 50-500 FTE SaaS companies, UK-based)
  • Primary metric: reply rate. Guardrails: spam complaints, unsubscribe rate
  • Stack: Clay for list and signals, Instantly or Lemlist for sending, Claude or MindStudio to generate variants
  1. Take your current best-performing subject + opener as the baseline
  2. Generate 3 challenger variants using an LLM prompt embedding your ICP, offer, and brand guardrails — test one variable at a time (subject only, or opener only, never both)
  3. Send each variant to 100+ prospects in the same segment over 48 hours; keep sending the baseline in parallel
  4. Measure positive reply rate only — not opens, not total replies
  5. Promote a challenger to the new baseline only if it beats the current baseline by at least a +30% relative lift, with at least 20 total replies
  6. Log hypothesis, what changed, and outcome in a JSON file — this is your experiment journal

By the end of Week 2 you have a working loop, a minimal memory system, and ground truth on what sample size your audience actually needs.
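Steps 5 and 6 above can be sketched in a few lines. This is a minimal, assumed implementation (field names and the journal filename are illustrative, not a prescribed schema):

```python
import json
from pathlib import Path

JOURNAL = Path("experiment_journal.jsonl")  # hypothetical filename

def should_promote(baseline_rate, variant_rate, total_replies,
                   min_lift=0.30, min_replies=20):
    """Promote only on a +30% relative lift with at least 20 total replies."""
    if total_replies < min_replies or baseline_rate == 0:
        return False
    return (variant_rate - baseline_rate) / baseline_rate >= min_lift

def log_experiment(hypothesis, change, baseline_rate, variant_rate, replies):
    """Append one experiment record as a JSON line -- the experiment journal."""
    entry = {
        "hypothesis": hypothesis,
        "change": change,
        "baseline_reply_rate": baseline_rate,
        "variant_reply_rate": variant_rate,
        "total_replies": replies,
        "promoted": should_promote(baseline_rate, variant_rate, replies),
    }
    with JOURNAL.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

print(should_promote(0.03, 0.045, 24))  # True: +50% lift, enough replies
print(should_promote(0.03, 0.045, 12))  # False: sample too small
```

The gate matters more than the logging: a variant that clears +30% on 12 replies is noise, not signal.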

Level 2: Full System — The Autonomous GTM Lab (Week 2-4)

Build the reusable architecture that applies the core loop pattern to every channel with automated execution and shared memory.

The Core Loop (every channel, every time):

  1. Define the objective function — one primary metric + 1-2 guardrails (never optimize for anything you wouldn't report to your CEO)
  2. Define the action space — enumerate exactly which fields the agent can touch; freeze everything else
  3. Set the measurement window — channel-specific (48h email, 3-7d ads, 1-3w landing pages, 7d nurture)
  4. Agent proposes hypothesis + one variant, with rationale drawn from the experiment journal
  5. Execute via API — no manual deployment
  6. Measure against baseline using the same data source as always
  7. Keep if it beats baseline; revert if it doesn't; log either way
  8. Generate next hypothesis from memory (last N journal entries)
  9. Loop
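The nine steps above reduce to a short control loop. A sketch, where `propose`, `deploy`, `measure`, and `is_better` are stand-ins for your LLM call, channel API, analytics query, and promotion rule:

```python
def run_loop(baseline, journal, propose, deploy, measure, is_better, n_iters=10):
    """Core loop: propose -> deploy -> measure -> keep or revert -> log."""
    for _ in range(n_iters):
        variant = propose(baseline, journal[-20:])   # hypothesis from recent memory
        deploy(variant)                              # execute via API, no manual step
        result = measure(variant)                    # same data source as the baseline
        kept = is_better(result, measure(baseline))
        journal.append({"variant": variant, "result": result, "kept": kept})
        if kept:
            baseline = variant                       # keep what wins
        else:
            deploy(baseline)                         # revert what loses
    return baseline, journal

# Toy run: a "variant" is a number one higher than the baseline,
# and higher always measures better, so the baseline climbs each iteration.
best, log = run_loop(0, [], lambda b, mem: b + 1,
                     lambda v: None, lambda v: v, lambda a, b: a > b)
print(best)  # 10
```

Everything channel-specific (what a variant is, how it deploys, what "better" means) lives in the injected callables; the loop itself never changes.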

Channel architecture:

  • Cold email: Primary metric = positive reply rate. Agent touches subject, opener, CTA, send time. 48h window, 100 sends per variant, 20 total replies minimum. Stack: Clay + Instantly/Lemlist + agent.
  • Google Ads: Primary metric = CPA or ROAS. Agent touches headlines and descriptions only (no budgets). 3-7 day window, 400 conversions per variant for 20-30% lift detection.
  • Landing pages: Primary metric = CVR (visit to next action). Agent touches H1, subheadline, primary CTA text, social proof block. 1-3 week window, 200-500 visitors per variant.
  • Email nurture: Primary metric = conversion to next stage. Agent touches subject, preview text, CTA, send timing. 7 day window, 50 triggered per variant.
  • LinkedIn content: Primary metric = click-to-site rate. Agent touches hook (first line), format, CTA, length, post time. 48h window, 500 impressions per variant.
  • SEO meta: Primary metric = organic CTR. Agent touches title tag, meta description (fixed URL set). 2-4 week window, 1,000 GSC impressions per variant.
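The channel parameters above are just data, which is how the "define the action space, freeze everything else" rule gets enforced in practice. A sketch with three of the six channels (key names are illustrative):

```python
# Per-channel loop configs; values taken from the channel architecture above.
CHANNEL_CONFIGS = {
    "cold_email": {
        "primary_metric": "positive_reply_rate",
        "action_space": ["subject", "opener", "cta", "send_time"],
        "window_hours": 48,
        "min_sample": 100,   # sends per variant
        "min_events": 20,    # total replies before any promotion
    },
    "google_ads": {
        "primary_metric": "cpa",
        "action_space": ["headlines", "descriptions"],  # never budgets
        "window_hours": 7 * 24,
        "min_sample": 400,   # conversions per variant
    },
    "landing_pages": {
        "primary_metric": "cvr",
        "action_space": ["h1", "subheadline", "cta_text", "social_proof"],
        "window_hours": 21 * 24,
        "min_sample": 200,   # visitors per variant (lower bound)
    },
}

def allowed_change(channel, field):
    """Reject any proposal outside the declared action space."""
    return field in CHANNEL_CONFIGS[channel]["action_space"]

print(allowed_change("google_ads", "budget"))  # False: agents never touch spend
```

Keeping the action space in config rather than in prompts means a proposal outside it fails a hard check, instead of relying on the agent to behave.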

Safety architecture:

Every loop has three layers of protection:

  • Budget caps — per-experiment spend ceilings for ads (10-20% of channel budget), plus hard monthly limits with auto-pause. Agent never touches budget settings.
  • Rollback thresholds — auto-revert when primary metric drops >30% vs control or any guardrail (spam rate, unsubscribe rate, CPC ceiling) trips. For ads: rollback after two consecutive measurement windows of underperformance.
  • HOTL (human-on-the-loop) governance tiers:
    - Tier 0 (auto-deploy): subject lines, body copy variants, send timing, minor CTA text
    - Tier 1 (human approval queue): offers, pricing page copy, anything mentioning competitors
    - Tier 2 (no autonomous changes): contracts, legal language, security claims, pricing
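The rollback rules above can be sketched as one decision function (argument names are illustrative):

```python
def should_rollback(primary_delta_pct, guardrails, consecutive_bad_windows=1,
                    is_ads=False):
    """Auto-revert rules sketched from the safety layers above.

    primary_delta_pct: relative change vs control (e.g. -0.35 means -35%).
    guardrails: {name: tripped_bool} for spam rate, unsubscribes, CPC ceiling.
    """
    if any(guardrails.values()):
        return True                      # any guardrail trip reverts immediately
    if primary_delta_pct < -0.30:        # >30% drop vs control
        # ads wait for two consecutive bad measurement windows
        return consecutive_bad_windows >= 2 if is_ads else True
    return False

print(should_rollback(-0.35, {"spam": False}))                  # True
print(should_rollback(-0.35, {"spam": False}, 1, is_ads=True))  # False: wait one more window
```

Note the asymmetry: guardrail trips revert unconditionally, while primary-metric drops get the channel-appropriate patience.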

Level 3: Multi-Channel Lab (Week 4-6)

Once two or more single-channel loops are running and producing clean journal data, introduce the planner-executor-evaluator architecture that Meta used in their Ranking Engineer Agent (REA), which doubled model accuracy and let three engineers do the work of six.

  • Planner agent — reads business objectives and the cross-channel journal, allocates experiment budget by channel based on current confidence and impact potential
  • Executor agents — one per channel, each running the core loop within the Planner's constraints
  • Evaluator agent — aggregates pipeline and revenue outcomes across channels, identifies cross-channel patterns, flags conflicts, updates the Planner

Cross-channel compounding in practice: timeline hooks consistently outperform problem hooks in cold email for RevOps ICs → ads loop seeds new headlines with timeline framing for the same retargeting segment → landing page loop tests timeline-framed H1 for the same ICP. Learning generated once, applied everywhere.
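Cross-channel seeding is a journal query: pull the tags of promoted experiments in one channel and hand them to another channel's hypothesis prompt. A minimal sketch (the `tags`/`framing` fields are assumed, not a fixed schema):

```python
def winning_patterns(journal, channel, tag_key="framing"):
    """Collect tags (e.g. 'timeline' vs 'problem' framing) from promoted
    experiments in one channel, to seed hypotheses in another."""
    return {e["tags"][tag_key] for e in journal
            if e["channel"] == channel and e.get("promoted")}

journal = [
    {"channel": "cold_email", "promoted": True,  "tags": {"framing": "timeline"}},
    {"channel": "cold_email", "promoted": False, "tags": {"framing": "problem"}},
]
print(winning_patterns(journal, "cold_email"))  # {'timeline'}: seeds the ads loop
```

The Planner's job at Level 3 is essentially this query run continuously, weighted by confidence and impact per channel.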

Expected Metrics

  • Experiment velocity: <5 to 50-200+ per channel per week
  • Cold email reply rate: 2-4% → 8-12% in 4-6 weeks (vendor-reported, MindStudio)
  • Landing page CVR: +15-40% over 8-12 weeks (vendor-reported, MindStudio)
  • Ad CPA: -20-30% over 8-16 weeks (vendor-reported)

Traditional Experimentation vs. Autonomous GTM Lab

  • Experiments per period. Traditional: 1-2 per month, with manual setup and analysis. Our approach: 10-200+ micro-experiments per week, all logged.
  • Metric alignment. Traditional: often CTR and CVR, with revenue linkage ad hoc. Our approach: primary metrics are SQLs, pipeline, and CAC, with hard guardrails.
  • Data ownership. Traditional: you own the data, but experiments sit in vendor silos. Our approach: data and journals live in your warehouse or DuckDB.
  • Customization. Traditional: manual; you design tests one at a time and the logic lives in your head. Our approach: systematic; open program.md prompts and per-channel schemas, with the agent iterating within your defined action space.
  • Cost model. Traditional: $5k-80k/year for enterprise tools. Our approach: engineering and infrastructure time, with no vendor lock-in.
  • Transparency. Traditional: fragmented; results split across tool dashboards with no unified experiment record. Our approach: full audit trail; every hypothesis, variant, metric, and outcome in a queryable experiment journal.
  • Human role. Traditional: a human designs and analyzes every test. Our approach: humans set strategy and guardrails; agents execute within constraints.

Tools & Data

Required (Minimum Viable)

  • Clay: B2B data enrichment, list building, and signal routing; Starter plan ~$149/month for 2,000 credits
  • PostHog: product analytics, feature flags, and experiment tracking; free up to ~1M events/month, usage-based after
  • Outbound platform (Instantly / Lemlist): cold email execution; Instantly ~$37-47/month or Lemlist Email Pro ~$63-79/user/month annually
  • Customer.io: email automation for nurture loop execution; Essentials ~$100/month for 5,000 profiles

Recommended (Full System)

  • PostHog Experiments: built-in A/B testing and feature flags on the same PostHog instance already in Required; covers web and landing page loops without an additional platform. Note: PostHog handles on-site measurement; the autoresearch loop pattern here adds the cross-channel orchestration layer (email, ads, nurture) and agent-driven hypothesis generation that PostHog alone does not provide.
  • MindStudio: visual agent builder for scheduling and running autoresearch loops across channels; free + paid plans from ~$20/month
  • Google Ads API + Meta Marketing API: for autonomous ad copy iteration; API access is free, spend is on media
  • Statsig: statistical experimentation engine for high-volume landing page and pricing tests requiring sequential or Bayesian significance; ~$150/month at moderate scale

Competitor Landscape

  • Landbase: AI SDR platform; agentic outbound sequences with fixed workflows and 40M+ campaign training data. Best for teams wanting turnkey outbound without engineering. Limitation: black-box, no experiment journal, no human-configurable loops; ~$3,000/month, vendor-reported claims not independently audited.
  • Warmly: signal-based AI GTM; visitor de-anonymization plus automated outbound triggers for B2B website traffic. Best for website-traffic-driven outbound automation. Limitation: channel automation, not systematic experimentation with memory; sales-led pricing.
  • MindStudio: no-code agent builder with scheduling and integrations; the most explicit autoresearch implementation guide on the market. Best for teams wanting visual GTM loop builders; closest to what this playbook describes. Limitation: platform dependency; free + ~$20/month Individual, enterprise custom.
  • Vect AI: 69 SaaS growth strategies as autonomous blueprints executed by agents. Best for pre-codified growth playbook execution. Limitation: blueprints are pre-designed, not iterative loops with shared memory; sales-led pricing.
  • Traditional A/B testing tools (VWO, Optimizely): statistical rigor for website and app tests; excellent test harnesses and statistical engines. Best for web experimentation with manual hypothesis design. Limitation: no autonomous hypothesis generation, still human-driven; VWO Starter ~$314/month, Optimizely $50k-200k+/year.
  • Google PMax / Meta Advantage+: platform automation; black-box budget and creative optimization within platform walls. Best for broad reach optimization within walled gardens. Limitation: PMax is blind, hungry, and confused when fed weak creative or wrong goals; you cannot inspect or override its logic.
  • Custom build (warehouse + agents): full control, no vendor lock-in; exactly what this playbook describes. Best for teams with engineering capacity wanting permanent data and logic ownership. Limitation: higher initial build cost; ~$0-500/month in infrastructure.

Industry Benchmarks

  • Autoresearch loop efficiency: ~700 experiments in 2 days, ~20 improvements, 11% model speedup (Fortune / Karpathy, Mar 2026)
  • Shopify Liquid autoresearch: 93 automated commits, 53% faster parse+render, 61% fewer allocations (Simon Willison / WecoAI, Mar 2026)
  • Meta REA autonomous experimentation: 2x average model accuracy; 3 engineers delivered the work of 6+ (Meta Engineering Blog, Mar 2026)
  • Cold email loop performance: reply rates from 2-4% to 8-12% in 4-6 weeks (MindStudio, 2026)
  • Landing page loop performance: 15-40% CVR uplift over 8-12 weeks (MindStudio, 2026)
  • AI SDR market growth: $4.12B (2025) to $15.01B (2030) at 29.5% CAGR (MarketsandMarkets / GlobeNewswire, Oct 2025)
  • AI SDR churn rate: 70% of users quit within 3 months (r/gtmengineering, 2026)
  • Multi-agent system inquiries: 1,445% surge from Q1 2024 to Q2 2025 (Gartner, via VirtualAssistantVA)
  • B2B experiment velocity (traditional): most teams run 20-30 experiments/year (Eric Siu / Fortune framing, 2026)

Emerging Trends

karpathy/autoresearch applied to GTM (March 2026): Andrej Karpathy's open-source autoresearch loop (https://github.com/karpathy/autoresearch) ran ~700 ML experiments in 2 days and found 20 improvements a human expert missed. GTM practitioners are now adapting the same pattern (modify one variable, deploy, measure against a single metric, keep what wins) to cold email, ad copy, and landing pages. This is the architectural foundation this playbook builds on. Why it matters: it enables 100x experiment velocity over manual A/B testing by removing humans from the iteration loop while keeping them in the strategy and guardrails layer.

Shopify Liquid autoresearch (March 2026): Tobi Lütke pointed the autoresearch pattern at Shopify's Liquid templating engine. Result: 93 automated commits, 53% faster parse-and-render, 61% fewer memory allocations. This is the first major production validation that autoresearch loops deliver compounding gains on real engineering assets. Why it matters: proof-of-concept that autoresearch produces measurable, compounding improvements on real production systems, not just ML benchmarks.

Meta Ranking Engineer Agent (REA) (March 2026): Meta's autonomous experimentation system doubled average model accuracy and let 3 engineers deliver the output of 6+ across 8 ranking models. The planner-executor-evaluator architecture this playbook uses at Level 3 is derived from Meta's REA design. Why it matters: it validates the multi-agent orchestration pattern at enterprise scale; 2x output with half the headcount is the benchmark for what autonomous GTM labs should target.

Team Responsibilities

  • GTM Engineer: loop design, API integrations, program.md prompts, scheduling, and experiment orchestration. The person who builds and maintains the system.
  • Marketing Ops: channel configurations, compliance, deliverability, brand guardrails, and alignment between live campaigns and loops. The person who stops the agent from doing something embarrassing.
  • Data Engineer: clean data pipelines, experiment journal schema, warehouse/DuckDB integration, and coverage monitoring. Without this role, loops break silently.

Failure Patterns

  • Optimizing reply rate, not revenue. What happens: reply rates go up while SQLs and pipeline stay flat; the agent keeps improving the wrong thing. Why: the objective function was set to a proxy metric with no feedback loop to CRM pipeline. Prevention: set the primary metric to SQL or SQO creation rate; require pipeline linkage before any variant gets promoted.
  • $2,000/month AI SDR, zero demos. What happens: contract signed, tool deployed, zero meetings booked, two-year lock-in begins. Why: black-box workflows, no ICP validation, no experiment transparency, misaligned vendor incentives. Prevention: open experiment journal from day one; no black-box agents; ICP defined and owned by your team in Clay before any loop runs.
  • 70% quit AI SDR tools in 3 months. What happens: the hype cycle ends, revenue never moves, teams cancel and lose trust in AI GTM entirely. Why: tools promised full autonomy but delivered automation without intelligence, with no transparency on what the agent actually tried. Prevention: start with one channel, show pipeline impact before scaling, and log every experiment so you can explain every decision.
  • Over-fitting to noise in B2B. What happens: a variant that looked good at 80 sends gets promoted, then underperforms at full volume; weeks are wasted. Why: no minimum sample thresholds; frequentist thinking applied to tiny B2B audiences. Prevention: hard minimum sample gates per channel; sequential testing or Bayesian logic; only run bold single-variable tests.
  • Stale or siloed data at scale. What happens: the agent personalizes using company-size data from 18 months ago and sends enterprise copy to a company that laid off 200 people. Why: no unified identity layer; disconnected data sources with different refresh cadences. Prevention: require unified identity and events (DuckDB or CDP) as a prerequisite; build data freshness checks into every loop config.

ICP Fit Notes

Best fit

  • Series A-C B2B SaaS with $2M-$50M ARR, measurable inbound and outbound volume (hundreds of leads/month), and a 5+ person GTM team
  • PLG or hybrid PLG/Sales motions where website, in-app, email, and sales touchpoints generate thousands of measurable events per month
  • Teams already running some experimentation (VWO, Optimizely, Statsig) but stuck at low velocity because every test requires a developer and a human review cycle

Also works for

  • High-velocity mid-market SaaS with heavy paid acquisition and a strong analytics foundation already in place
  • Later-stage companies modernizing their GTM stack away from channel silos toward experiment-led operations

Insight: Teams that already know what channels convert their ICP but not why see the fastest return. The autoresearch lab turns that implicit, undocumented knowledge into an explicit, compounding playbook that doesn't leave when a senior marketer does.

Implementation Checklist

Phase 1: Foundation (Week 1)

  • Audit GTM data: confirm CRM, analytics, and messaging events share consistent identity (email or domain)
  • Map your current funnel metrics to a clear hierarchy: primary (SQLs/pipeline), secondary (CTR/reply rate), guardrails (spam, unsubscribes, CPA ceiling)
  • Choose first channel — cold email if you have an active outbound motion; landing page if you have 1,000+ monthly visitors to a key URL
  • Stand up experiment journal: DuckDB table or JSON store with the experiment schema
  • Configure API access for your chosen tools (Clay, PostHog, email platform or CMS)

Phase 2: First Loop (Week 2)

  • Write channel-specific program.md: hypothesis format, action space definition, guardrail thresholds, and measurement window
  • Run the first 10 experiments manually — generate variants with LLM, deploy via API, measure, log
  • Enforce minimum sample thresholds before promoting any winner
  • Review journal entries with GTM and RevOps lead to confirm metrics and safety logic
  • Adjust action space, guardrails, or prompts based on what the first 10 experiments taught you

Phase 3: Second Channel + Automation (Week 3-4)

  • Add a second channel loop sharing the same experiment journal
  • Automate loop execution via MindStudio, GitHub Actions, or custom worker
  • Implement HOTL workflow for Tier 1 changes: approval queue with Slack notifications
  • Run weekly journal review to extract human-readable ICP learnings by segment
  • Integrate experiment outcomes into Revenue Intelligence dashboard (play_029)

Phase 4: Multi-Channel Lab (Week 5-6)

  • Introduce Planner and Evaluator agents to coordinate across channels
  • Wire cross-channel hypothesis sharing (email winners seed ad headline candidates)
  • Build GTM Lab dashboard: experiment velocity, win rate, and pipeline impact per channel
  • Write governance charter: autonomy tiers, escalation paths, compliance rules
  • Publish program.md files for each active channel to your internal knowledge base


Sources

  1. Fortune — The Karpathy Loop (2026): autoresearch pattern and 700-experiment, 20-improvement benchmark
  2. Simon Willison / WecoAI — Shopify Liquid autoresearch (2026): 93 automated commits, 53% faster rendering, 61% fewer allocations
  3. Meta Engineering Blog — REA autonomous experimentation (Mar 2026): 2x model accuracy, 3 engineers delivering work of 6+ (https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/)
  4. karpathy/autoresearch — original repo (2026): reference architecture for autonomous feedback loops (https://github.com/karpathy/autoresearch)
  5. MindStudio — Autonomous Marketing Optimization Agent (2026): GTM loop templates and cold email/landing page optimization guides
  6. Treasure Data — Agentic Marketing (2025): 20-40% campaign performance improvement benchmark
  7. BCG — CMOs who move first in agentic marketing (2025): strategic framing for autonomous marketing adoption
  8. MarketsandMarkets / GlobeNewswire — AI SDR market (Oct 2025): $4.12B to $15.01B by 2030 at 29.5% CAGR
  9. VirtualAssistantVA / Gartner — 1,445% multi-agent inquiry surge (2026)
  10. Reddit r/SaaS — $2,000/month AI SDR, zero demos (2026): practitioner failure case
  11. Reddit r/gtmengineering — 70% AI SDR churn in 3 months (2026): practitioner-reported adoption failure
  12. Statsig — B2B SaaS experimentation guide (2025): statistical rigor for low-volume B2B testing
  13. Eric Siu / LinkedIn — 36,500 experiments framing (2026): velocity comparison for autoresearch vs manual testing
  14. Agentic Foundry — Human-on-the-loop governance (2026): HOTL tier framework for autonomous marketing
  15. Oracle — The Agentic Marketing Era (2025): enterprise framing for autonomous marketing systems
  16. WecoAI — awesome-autoresearch (2026): community reference collection (https://github.com/WecoAI/awesome-autoresearch)
  17. zkarimi22 — autoresearch-anything (2026): generalized autoresearch pattern

When NOT to Use

  • Low volume GTM — if you cannot reach 200-500 visitors per landing page variant or 100+ email sends per variant within a reasonable window, statistical noise overwhelms signal
  • No clean baseline metrics — if you do not reliably track SQLs, pipeline stage, and revenue back to specific campaigns and channels, there is no signal to optimize against
  • Enterprise-only, long sales cycles — if your average sales cycle is 6-18 months and you close 5-10 deals per quarter, you do not have enough events for any feedback loop
  • No API access to your GTM channels — autonomous experimentation requires programmatic variant deployment and metric retrieval
  • Compliance-sensitive industries — financial services, healthcare, legal where copy changes carry non-trivial legal or reputational risk need humans reviewing every public-facing change
  • No data engineering capacity — without someone who can maintain clean identity resolution, event pipelines, and experiment journal integrity, autonomous loops will silently amplify data quality problems

Tools & Tech

Clay
PostHog
Claude / LLM
Customer.io