Build a Self-Improving Data Pipeline with Multi-Agent Quality Gates

Intermediate · 24m read · Full-stack developers

A practical Anthropic SDK walkthrough of Meta's Autodata closed-loop pattern — four subagents (Generator, Evaluator, Filter, Optimizer), exact acceptance thresholds, and a TypeScript implementation that turns inference compute into higher-quality training data.

Primary Focus

ai development

AI Tools Covered

anthropic-sdk · synthetic-data · multi-agent

What You'll Learn

  • The bottleneck is data quality, not compute
  • The empirical gap
  • When the closed loop is worth it
  • Roles and naming
  • Generator — the Challenger
  • Evaluator — Weak Solver and Strong Solver

Guide Curriculum

Why Multi-Agent Quality Gates Beat One-Pass Synthetic Data

3 lessons
  • The bottleneck is data quality, not compute (1m)
  • The empirical gap (1m)
  • When the closed loop is worth it (1m)

The Four-Subagent Loop

5 lessons
  • Roles and naming (1m)
  • Generator — the Challenger (2m)
  • Evaluator — Weak Solver and Strong Solver (1m)
  • Filter — Verifier/Judge plus the acceptance gate (3m)
  • Wiring the inner loop (1m)

Acceptance Thresholds — How to Set and Adjust Them

3 lessons
  • What each threshold prevents (2m)
  • Adjusting for a different domain (1m)
  • Verifying the thresholds are doing work (1m)

Meta-Optimization — The Outer Loop

3 lessons
  • What the Optimizer actually optimizes (2m)
  • A minimal Optimizer in TypeScript (2m)
  • What NOT to put in the outer loop (1m)

Practical Takeaway — When This Pattern Earns Its Keep

3 lessons
  • When to reach for Autodata (1m)
  • When something simpler wins (1m)
  • Cost shape before you commit (2m)

This guide includes:

5 modules with 17 lessons
24m estimated reading time

About the Author

@hiram-clark

Hiram Clark is the founder and managing editor of vybecoding.ai and sets editorial direction for the guides and news published here. Articles are drafted with AI assistance and edited before publication. He works hands-on with the AI development tools, workflows, and infrastructure covered on the site.

Full Guide Content

Module 1: Why Multi-Agent Quality Gates Beat One-Pass Synthetic Data

1.1 The bottleneck is data quality, not compute

Module objectives
  • Understand the failure mode of single-pass synthetic data generation
  • See the empirical gap that motivates a closed loop
  • Identify when this pattern earns its compute cost

For most teams sitting on top of frontier LLMs, the headache is no longer "can the model do this?" It's "can we feed the model enough high-signal examples to specialize it?" Hand-labeled data is expensive and slow. Generic synthetic data — the kind a single LLM call produces from a one-shot prompt — looks abundant but is full of trivial, leaky, or impossibly hard examples that train weak behavior.

Meta's RAM (Reasoning, Alignment, Memory) team published Autodata in May 2026 as a direct response. The framing: treat the data scientist's actual workflow — generate, inspect, score, refine — as a closed loop run by AI agents instead of by a human. The output is a stream of examples that have already been quality-screened and difficulty-tuned before they ever touch a training run.

1.2 The empirical gap

The Autodata paper compared two pipelines on 2,117 QA pairs derived from 10,000+ CS papers:

| Approach | Weak Solver Score | Strong Solver Score | Score Gap |
|---|---|---|---|
| CoT Self-Instruct (one-shot) | 71.4% | 73.3% | 1.9 pp |
| Agentic Self-Instruct (Autodata) | 43.7% | 77.8% | 34.1 pp |

The CoT pipeline produces examples that both solvers handle nearly equally well — meaning the data is not separating capability levels. Autodata produces examples where strong models clearly outperform weak ones, which is exactly the signal RL training needs. When the team trained Qwen-3.5-4B on the Agentic data with GRPO (~1 epoch), it beat the CoT-trained variant on both in-distribution and out-of-distribution test sets.

1.3 When the closed loop is worth it

This pattern is not free. Each accepted example may take 3–5 retry rounds, and each round runs the Generator once, the weak solver 3 times, and the strong solver 3 times, plus one verifier call per solver answer. That is at least an order of magnitude more inference per accepted example than one-shot generation.
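To make the cost claim concrete, here is a minimal sketch that counts API calls per accepted example. It assumes the call pattern used by this guide's own Module 2 code: one Generator call per round, N runs per solver, and one Verifier call per solver answer. The name `callsPerAcceptedExample` is hypothetical, and call counts are only a rough proxy for cost because token volumes differ per call.

```typescript
// Hypothetical helper: counts API calls, not tokens or dollars.
function callsPerAcceptedExample(rounds: number, runsPerSolver = 3): number {
  const generatorCalls = 1;              // one Challenger call per round
  const solverCalls = 2 * runsPerSolver; // weak + strong solver runs
  const verifierCalls = solverCalls;     // one Verifier call per solver answer
  return rounds * (generatorCalls + solverCalls + verifierCalls);
}

console.log(callsPerAcceptedExample(3)); // 39
console.log(callsPerAcceptedExample(5)); // 65
```

Against a single call for one-shot generation, 39 to 65 calls is where the order-of-magnitude figure comes from.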

Use this pattern when:

  • You are building training data for fine-tuning, RL, or evaluation suites where data quality dominates compute cost
  • You need to control the difficulty distribution explicitly (no trivial, no impossible)
  • You can afford to spend inference compute as a substitute for human annotators

Skip it when:

  • You only need a few hundred examples and can hand-label them faster than you can debug the loop
  • A single judge rubric is enough — if you don't need difficulty calibration, just generate + verify is simpler
  • Your task has a verifier that's already as expensive as the generator (you'll spend most of the loop in the verifier)

Module 2: The Four-Subagent Loop

2.1 Roles and naming

Module objectives
  • Map this guide's role names onto the source paper's terminology
  • Implement each subagent against the Anthropic Messages API
  • Wire the Filter to the acceptance gate

This guide frames Autodata as a Generator → Evaluator → Filter → Optimizer pipeline. The Meta paper uses different names. The mapping is one-to-many but consistent:

| Pipeline role | Source paper agent(s) | Job |
|---|---|---|
| Generator | Challenger LLM | Produce a candidate (input, response, rubric) given source material |
| Evaluator | Weak Solver + Strong Solver | Run the candidate through two capability levels and score both |
| Filter | Verifier/Judge + acceptance gate | Score answers against the rubric and apply numeric thresholds |
| Optimizer | Meta-optimization outer loop | Mutate the agent harness itself based on downstream performance |

The Generator/Evaluator/Filter triple is the inner loop (per-example). The Optimizer is the outer loop (per-batch, slow).

2.2 Generator — the Challenger

The Challenger reads source material (a paper, a document, a code file) and produces a structured triple: a self-contained question, a reference answer, and a rubric the Verifier will score against.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";

const CHALLENGER_SYSTEM = `You are a data-scientist agent generating training examples from a research paper.
Output JSON only. The rubric MUST be answerable only with knowledge from the source paper
(self-test: would a domain expert who has not read this paper be able to answer correctly? If yes, reject your own draft).
Use positive-only criteria. Each criterion has an integer weight 1..7. No criterion may exceed weight 7.`;

type Triple = {
  question: string;
  reference_answer: string;
  rubric: { criterion: string; weight: number }[];
};

export async function generateCandidate(
  paperText: string,
  feedback: string | null
): Promise<Triple> {
  const userBlocks: Anthropic.Messages.ContentBlockParam[] = [
    {
      type: "text",
      text: paperText,
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      text: feedback
        ? `Generate a new candidate. Targeted feedback from prior attempt:\n${feedback}`
        : "Generate a candidate triple {question, reference_answer, rubric}.",
    },
  ];

  const resp = await client.messages.create({
    model: MODEL,
    max_tokens: 2048,
    system: [
      { type: "text", text: CHALLENGER_SYSTEM, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userBlocks }],
  });

  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return JSON.parse(text) as Triple;
}

Two things to notice. First, the system prompt and the paper text are both wrapped in cache_control: { type: "ephemeral" }. Across the 3–5 retries that a single example may take, the paper does not change — caching the prefix is what makes this loop affordable. Second, the rubric format is locked to { criterion, weight }[] with weights capped at 7. Both constraints came directly out of Autodata's meta-optimization phase: free-form rubrics blew up parsing, and unbounded weights collapsed to a single criterion.
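One fragile spot in `generateCandidate` is the bare `JSON.parse(text) as Triple`: a cast does not check anything at runtime, and models sometimes add a sentence of preamble or a Markdown fence around the JSON. The sketch below shows one way to validate the output before it enters the loop. `parseTriple` is a hypothetical helper, not part of the SDK or the paper, and the outermost-braces slicing is an assumption about how stray text usually appears.

```typescript
type RubricItem = { criterion: string; weight: number };
type Triple = { question: string; reference_answer: string; rubric: RubricItem[] };

// Hypothetical helper: parse and validate raw Challenger output.
function parseTriple(raw: string): Triple {
  // Tolerate preamble text or a Markdown fence by slicing out the outermost {...}.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found");
  const data: unknown = JSON.parse(raw.slice(start, end + 1)); // throws on malformed JSON
  if (typeof data !== "object" || data === null) throw new Error("triple must be an object");

  const obj = data as Record<string, unknown>;
  const question = obj["question"];
  const referenceAnswer = obj["reference_answer"];
  const rubricRaw = obj["rubric"];
  if (typeof question !== "string" || typeof referenceAnswer !== "string") {
    throw new Error("question and reference_answer must be strings");
  }
  if (!Array.isArray(rubricRaw) || rubricRaw.length === 0) {
    throw new Error("rubric must be a non-empty array");
  }

  // Enforce the rubric shape from the system prompt: integer weights in 1..7.
  const rubric = rubricRaw.map((item: unknown, i: number): RubricItem => {
    if (typeof item !== "object" || item === null) throw new Error(`rubric[${i}] must be an object`);
    const entry = item as Record<string, unknown>;
    const criterion = entry["criterion"];
    const weight = entry["weight"];
    if (typeof criterion !== "string") throw new Error(`rubric[${i}].criterion must be a string`);
    if (typeof weight !== "number" || !Number.isInteger(weight) || weight < 1 || weight > 7) {
      throw new Error(`rubric[${i}].weight must be an integer in 1..7`);
    }
    return { criterion, weight };
  });
  return { question, reference_answer: referenceAnswer, rubric };
}
```

A thrown error here can be treated like any other rejection: catch it in the loop and pass the message back to the Generator as feedback for the next round.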

2.3 Evaluator — Weak Solver and Strong Solver

The Evaluator runs the candidate question through two solvers at different capability levels, three times each. The two solvers can be different models, or the same model with different inference modes (e.g., regular vs extended thinking, or with/without tool use).

async function runSolver(model: string, question: string): Promise<string> {
  const resp = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });
  return resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
}

export async function evaluateCandidate(
  question: string,
  weakModel: string,
  strongModel: string
): Promise<{ weak: string[]; strong: string[] }> {
  const N = 3;
  const [weak, strong] = await Promise.all([
    Promise.all(Array.from({ length: N }, () => runSolver(weakModel, question))),
    Promise.all(Array.from({ length: N }, () => runSolver(strongModel, question))),
  ]);
  return { weak, strong };
}

N = 3 matches the paper. Three runs per solver gives you a stable mean without exploding cost. Lowering it to 1 makes acceptance noisy; raising it to 5 or more adds at least two-thirds to the solver and verifier cost for marginal gain.

2.4 Filter — Verifier/Judge plus the acceptance gate

The Filter has two jobs. The Verifier scores each answer against the rubric. The acceptance gate then applies fixed numeric thresholds.

const VERIFIER_SYSTEM = `You score answers against a rubric. For each criterion, output a 0..1 score.
Final answer score = sum(criterion_score * weight) / sum(weight). Output strict JSON only.`;

async function scoreAnswer(
  rubric: Triple["rubric"],
  question: string,
  answer: string
): Promise<number> {
  const resp = await client.messages.create({
    model: MODEL,
    max_tokens: 512,
    system: [
      { type: "text", text: VERIFIER_SYSTEM, cache_control: { type: "ephemeral" } },
      { type: "text", text: JSON.stringify(rubric), cache_control: { type: "ephemeral" } },
    ],
    messages: [
      { role: "user", content: `QUESTION: ${question}\nANSWER: ${answer}\nReturn {"score": number}.` },
    ],
  });
  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return (JSON.parse(text) as { score: number }).score;
}

export type AcceptanceResult =
  | { accepted: true }
  | { accepted: false; reason: string };

export type Thresholds = {
  weakAvg: number;     // upper bound — reject if exceeded
  maxWeak: number;     // upper bound — reject if exceeded
  strongAvgLo: number; // lower bound — reject if below
  strongAvgHi: number; // upper bound (exclusive) — reject if at or above
  gap: number;         // lower bound on strong−weak difference
};

export const DEFAULT_THRESHOLDS: Thresholds = {
  weakAvg: 0.65,
  maxWeak: 0.75,
  strongAvgLo: 0.60,
  strongAvgHi: 0.95,
  gap: 0.20,
};

export async function filterCandidate(
  triple: Triple,
  evals: { weak: string[]; strong: string[] },
  t: Thresholds = DEFAULT_THRESHOLDS
): Promise<AcceptanceResult> {
  const weakScores = await Promise.all(
    evals.weak.map((a) => scoreAnswer(triple.rubric, triple.question, a))
  );
  const strongScores = await Promise.all(
    evals.strong.map((a) => scoreAnswer(triple.rubric, triple.question, a))
  );

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const weakAvg = avg(weakScores);
  const strongAvg = avg(strongScores);
  const maxWeak = Math.max(...weakScores);
  const minWeak = Math.min(...weakScores);
  const gap = strongAvg - weakAvg;

  if (minWeak === 0) return { accepted: false, reason: "weak score has zero" };
  if (weakAvg > t.weakAvg) return { accepted: false, reason: `weak_avg ${weakAvg.toFixed(2)} > ${t.weakAvg} — too easy` };
  if (maxWeak > t.maxWeak) return { accepted: false, reason: `max_weak ${maxWeak.toFixed(2)} > ${t.maxWeak} — too easy` };
  if (strongAvg < t.strongAvgLo) return { accepted: false, reason: `strong_avg ${strongAvg.toFixed(2)} < ${t.strongAvgLo} — too hard` };
  if (strongAvg >= t.strongAvgHi) return { accepted: false, reason: `strong_avg ${strongAvg.toFixed(2)} >= ${t.strongAvgHi} — trivial for strong` };
  if (gap < t.gap) return { accepted: false, reason: `gap ${gap.toFixed(2)} < ${t.gap} — insufficient difficulty separation` };

  return { accepted: true };
}

The thresholds come straight from the Autodata paper: weak_avg ≤ 0.65, max_weak ≤ 0.75, no zero scores, strong_avg ≥ 0.60 and < 0.95, gap ≥ 0.20. They are tuned for a CS-research QA task and are the right starting point — but they are not magic. The Optimizer, in Module 4, is what makes them domain-specific.
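The Verifier prompt above asks the model to do the weighted-average arithmetic itself. One alternative, offered here as an assumption rather than something taken from the paper, is to have the Verifier return only per-criterion scores and compute the final score in code, which removes one source of arithmetic error. `weightedScore` and the `perCriterion` array are hypothetical names.

```typescript
type RubricItem = { criterion: string; weight: number };

// Final score = sum(criterion_score * weight) / sum(weight), as in the Verifier prompt.
function weightedScore(rubric: RubricItem[], perCriterion: number[]): number {
  if (rubric.length === 0 || rubric.length !== perCriterion.length) {
    throw new Error("need exactly one score per rubric criterion");
  }
  let weighted = 0;
  let totalWeight = 0;
  rubric.forEach((item, i) => {
    const s = Math.min(1, Math.max(0, perCriterion[i] ?? 0)); // clamp to 0..1
    weighted += s * item.weight;
    totalWeight += item.weight;
  });
  return weighted / totalWeight;
}

// Illustrative rubric, not from the paper.
const exampleRubric: RubricItem[] = [
  { criterion: "names the proposed method", weight: 5 },
  { criterion: "states the main result", weight: 3 },
  { criterion: "mentions the dataset", weight: 2 },
];
console.log(weightedScore(exampleRubric, [1, 0.5, 0])); // 0.65
```

The worked case is (5 × 1 + 3 × 0.5 + 2 × 0) / 10 = 0.65, which sits exactly on the weak_avg boundary.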

2.5 Wiring the inner loop

export async function runInnerLoop(
  paperText: string,
  weakModel: string,
  strongModel: string,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
  maxRounds = 5
): Promise<Triple | null> {
  let feedback: string | null = null;
  for (let round = 0; round < maxRounds; round++) {
    const triple = await generateCandidate(paperText, feedback);
    const evals = await evaluateCandidate(triple.question, weakModel, strongModel);
    const verdict = await filterCandidate(triple, evals, thresholds);
    if (verdict.accepted) return triple;
    feedback = verdict.reason;
  }
  return null; // budget exhausted, drop this seed
}

That's the closed loop in roughly 60 lines. Generator proposes, Evaluator runs both solvers, Filter scores and either accepts or feeds the rejection reason back to the Generator for the next round.


Module 3: Acceptance Thresholds — How to Set and Adjust Them

3.1 What each threshold prevents

Module objectives
  • Read each numeric threshold and understand the failure mode it prevents
  • Adjust thresholds when porting Autodata to a non-CS domain
  • Verify thresholds are doing useful work, not just rejecting everything

| Threshold | What it prevents | Symptom if removed |
|---|---|---|
| weak_avg ≤ 0.65 | Trivial examples both models can solve | Score gap collapses; weak data |
| max_weak ≤ 0.75 | Outlier easy examples that slip past the average | A few "gimme" questions inflate weak signal |
| no zeros on weak | Verifier mis-scoring or broken question | Garbage examples accepted because weak "failed" |
| strong_avg ≥ 0.60 | Impossible examples even strong models can't answer | Training on noise: model learns to output nothing |
| strong_avg < 0.95 | Trivial-for-strong (memorized or templated) | Strong model is solving from prior knowledge, not the source |
| gap ≥ 0.20 | Examples that don't separate capability levels | RL signal is flat |

Read these as a difficulty band, not a quality gate. The Verifier is the quality gate (does the answer match the rubric?). The numeric thresholds enforce that accepted examples actually distinguish strong from weak models — which is what you need them to do during training.
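To see the band in action without spending API calls, the sketch below applies the same checks as `filterCandidate` to hand-picked verifier scores. `gateVerdict` and `PAPER_THRESHOLDS` are hypothetical names, and the score arrays are made up for illustration, not taken from the paper.

```typescript
type Thresholds = {
  weakAvg: number;
  maxWeak: number;
  strongAvgLo: number;
  strongAvgHi: number;
  gap: number;
};

const PAPER_THRESHOLDS: Thresholds = {
  weakAvg: 0.65,
  maxWeak: 0.75,
  strongAvgLo: 0.6,
  strongAvgHi: 0.95,
  gap: 0.2,
};

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Returns null when the candidate is accepted, otherwise a short rejection label.
function gateVerdict(
  weak: number[],
  strong: number[],
  t: Thresholds = PAPER_THRESHOLDS
): string | null {
  if (weak.length === 0 || strong.length === 0) throw new Error("need at least one score per solver");
  const weakAvg = mean(weak);
  const strongAvg = mean(strong);
  if (Math.min(...weak) === 0) return "zero weak score";
  if (weakAvg > t.weakAvg) return "too easy (weak_avg)";
  if (Math.max(...weak) > t.maxWeak) return "too easy (max_weak)";
  if (strongAvg < t.strongAvgLo) return "too hard";
  if (strongAvg >= t.strongAvgHi) return "trivial for strong";
  if (strongAvg - weakAvg < t.gap) return "insufficient gap";
  return null;
}

console.log(gateVerdict([0.4, 0.5, 0.3], [0.8, 0.9, 0.7]));   // null (accepted)
console.log(gateVerdict([0.7, 0.7, 0.7], [0.9, 0.9, 0.9]));   // too easy (weak_avg)
console.log(gateVerdict([0.5, 0.6, 0.55], [0.65, 0.7, 0.6])); // insufficient gap
console.log(gateVerdict([0.2, 0.3, 0.1], [0.5, 0.4, 0.55]));  // too hard
```

Keeping the gate as a pure function like this also makes it cheap to unit-test each row of the table before wiring in the Verifier.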

3.2 Adjusting for a different domain

Suppose you want to apply Autodata to customer-support transcripts instead of CS papers. The CS paper thresholds were tuned for tasks where strong models reach 60–95%. If your domain is harder (medical reasoning, legal drafting), you'll find every candidate getting filtered out by strong_avg ≥ 0.60. If your domain is easier (intent classification), nothing will pass weak_avg ≤ 0.65.

A defensible adjustment procedure:

  1. Calibrate first — generate 20 candidates with no acceptance gate, just record weak_avg, strong_avg, and gap.
  2. Find your strong-solver ceiling — the 75th percentile of strong_avg is roughly where you should set the upper bound.
  3. Find your weak-solver floor — the 25th percentile of weak_avg is roughly where the lower bound should sit.
  4. Fix the gap last — the gap threshold should be 1–2× the standard deviation of (strong_avg − weak_avg) across your calibration set.
  5. Lock thresholds, then run the meta-optimizer — see Module 4.
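The statistics named in steps 2 to 4 can be computed with a few lines of code. The sketch below assumes a calibration set stored as `{ weakAvg, strongAvg }` rows, uses linear-interpolation percentiles and the sample (n − 1) standard deviation, and leaves the choice of which `Thresholds` field each statistic feeds to you. All names here are hypothetical.

```typescript
type CalibrationRow = { weakAvg: number; strongAvg: number };

// Linear-interpolation percentile (p in 0..100), computed on a sorted copy.
function percentile(xs: number[], p: number): number {
  if (xs.length === 0) throw new Error("empty sample");
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo]! + (sorted[hi]! - sorted[lo]!) * (rank - lo);
}

// Sample standard deviation (n - 1 denominator).
function stdDev(xs: number[]): number {
  if (xs.length < 2) throw new Error("need at least two values");
  const m = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// Summarize an ungated calibration run (step 1) into the three statistics of steps 2-4.
function calibrationSummary(rows: CalibrationRow[]) {
  const gaps = rows.map((r) => r.strongAvg - r.weakAvg);
  return {
    strongP75: percentile(rows.map((r) => r.strongAvg), 75), // step 2
    weakP25: percentile(rows.map((r) => r.weakAvg), 25),     // step 3
    gapStdDev: stdDev(gaps),                                  // step 4: gap threshold is 1-2x this
  };
}
```

With only 20 calibration candidates these estimates are noisy, so treat the output as a starting point and re-check it against the acceptance-rate targets in the next lesson.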

3.3 Verifying the thresholds are doing work

A common failure mode is thresholds that are too loose (everything passes) or too tight (acceptance rate drops to single digits and inference cost dominates). Track a small dashboard:

  • Acceptance rate (target: 30–60% after the meta-optimizer warms up)
  • Distribution of rejection reasons (should be spread across reasons, not all "too easy")
  • Average rounds per accepted example (target: 1.5–3; >4 means thresholds are too tight)

If acceptance rate is below 10%, your thresholds are starving the pipeline. Above 80% means you're not filtering — likely calibrated for too-easy data.
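A minimal sketch of that dashboard follows. It assumes you log one record per seed document with the final outcome, the number of rounds used, and the rejection reasons collected along the way. It also assumes acceptance rate is measured per seed, which the text above does not specify. `SeedRun` and `summarizeRuns` are hypothetical names.

```typescript
type SeedRun = {
  accepted: boolean;          // did any round pass the gate?
  rounds: number;             // rounds used before acceptance or giving up
  rejectionReasons: string[]; // one label per rejected round
};

function summarizeRuns(runs: SeedRun[]) {
  const accepted = runs.filter((r) => r.accepted);
  const acceptanceRate = runs.length === 0 ? 0 : accepted.length / runs.length;
  const avgRoundsPerAccepted =
    accepted.length === 0
      ? null
      : accepted.reduce((sum, r) => sum + r.rounds, 0) / accepted.length;

  // Histogram of rejection labels across all rounds of all seeds.
  const reasonCounts: Record<string, number> = {};
  for (const run of runs) {
    for (const reason of run.rejectionReasons) {
      reasonCounts[reason] = (reasonCounts[reason] ?? 0) + 1;
    }
  }
  return { acceptanceRate, avgRoundsPerAccepted, reasonCounts };
}

// Illustrative log, not real data.
const summary = summarizeRuns([
  { accepted: true, rounds: 2, rejectionReasons: ["too easy"] },
  { accepted: true, rounds: 1, rejectionReasons: [] },
  { accepted: false, rounds: 5, rejectionReasons: ["too easy", "too easy", "too hard", "insufficient gap", "too easy"] },
  { accepted: true, rounds: 3, rejectionReasons: ["too hard", "insufficient gap"] },
]);
console.log(summary.acceptanceRate);       // 0.75
console.log(summary.avgRoundsPerAccepted); // 2
console.log(summary.reasonCounts);
```

In this made-up log, "too easy" accounts for half of all rejections, which is the kind of skew the second bullet asks you to watch for.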


Module 4: Meta-Optimization — The Outer Loop

4.1 What the Optimizer actually optimizes

Module objectives
  • Understand why an outer optimization loop is part of Autodata's contribution, not a footnote
  • Implement an evolution-based optimizer that mutates the agent harness itself
  • Know what NOT to put in the outer loop

The inner loop produces examples. The outer loop produces better inner loops. Specifically, it mutates the agent harness — the prompts, the rubric format, the scoring rules — based on the inner loop's downstream performance.

Meta's outer loop is evolution-based. Each iteration:

  1. Select — Boltzmann sampling at temperature T=0.1 over the current pool of harnesses. Low temperature = elitist; you usually pick the best, sometimes explore.
  2. Evaluate — run the selected harness on a training set of papers and measure pass rate.
  3. Analyze — an LLM agent reads the failed trajectories and proposes mutations.
  4. Implement — a code-editing agent applies the diff to the harness.
  5. Re-evaluate — measure the mutant's pass rate on a held-out validation set.
  6. Accept or reject — only add the mutant to the pool if validation strictly exceeds parent.

The strict-better-than-parent gate is what prevents drift. Without it, the optimizer happily adds neutral or slightly worse mutants and the population walks off a cliff.
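To get a feel for how elitist T=0.1 is in the Select step, the sketch below computes Boltzmann selection probabilities for a pool of harness scores. The function name is hypothetical, and the two-harness pool reuses the 12.8% and 42.4% pass rates purely as illustrative inputs; it is not a claim about what Meta's pool looked like at any point.

```typescript
// Boltzmann (softmax) selection probabilities: p_i is proportional to exp(score_i / T).
function boltzmannProbabilities(scores: number[], T = 0.1): number[] {
  if (scores.length === 0) throw new Error("empty pool");
  const max = Math.max(...scores); // subtract the max for numerical stability
  const weights = scores.map((s) => Math.exp((s - max) / T));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => w / total);
}

const pool = [0.128, 0.424]; // pass rates as fractions
console.log(boltzmannProbabilities(pool, 0.1).map((p) => p.toFixed(3))); // [ '0.049', '0.951' ]
console.log(boltzmannProbabilities(pool, 1.0).map((p) => p.toFixed(3))); // [ '0.427', '0.573' ]
```

At T=0.1 the better harness is picked about 95% of the time; at T=1 the choice is close to a coin flip, which is why low temperature reads as "usually the best, sometimes explore".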

Meta ran this for 233 iterations on 50 training papers and 25 validation papers, accepted 126 mutants, and pushed validation pass rate from 12.8% to 42.4%. The optimizer discovered:

  • Paper-specific insight enforcement, with a self-test that the question be unanswerable from prior knowledge alone
  • Context-leak prevention: the source must appear in the setup but never in the rubric criteria
  • Positive-only rubrics with weights capped at 7
  • Strict JSON rubric format, eliminating parsing failures

Three of these four are now in the Generator system prompt in Module 2. They are not invented; they are what evolution found.

4.2 A minimal Optimizer in TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const ANALYZER_MODEL = "claude-opus-4-7";

// `Thresholds` is the same shape exported from Module 2 (Filter).
// The Optimizer mutates these values; the Filter consumes them via filterCandidate(triple, evals, harness.thresholds).
type Thresholds = {
  weakAvg: number;
  maxWeak: number;
  strongAvgLo: number;
  strongAvgHi: number;
  gap: number;
};

type Harness = {
  challengerSystem: string;
  verifierSystem: string;
  thresholds: Thresholds;
};

type EvalResult = { passRate: number; failureTrajectories: string[] };

async function evaluateHarness(h: Harness, papers: string[]): Promise<EvalResult> {
  // For each paper, call runInnerLoop from Module 2 — passing `h.thresholds` so the
  // Filter uses the harness's current values, and patching `CHALLENGER_SYSTEM`/`VERIFIER_SYSTEM`
  // with `h.challengerSystem`/`h.verifierSystem` (left abstract here for brevity).
  // Count accepted vs. attempted, collect rejection trajectories.
  return { passRate: 0, failureTrajectories: [] };
}

async function proposeMutation(parent: Harness, failures: string[]): Promise<Harness> {
  const resp = await client.messages.create({
    model: ANALYZER_MODEL,
    max_tokens: 4096,
    system:
      "You optimize a synthetic-data agent harness. Read failure trajectories and propose ONE diff to challengerSystem, verifierSystem, or thresholds. Output strict JSON of the full new Harness.",
    messages: [
      {
        role: "user",
        content: `PARENT_HARNESS: ${JSON.stringify(parent)}\nFAILURES: ${failures.slice(0, 10).join("\n---\n")}`,
      },
    ],
  });
  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return JSON.parse(text) as Harness;
}

function boltzmannPick(pool: { h: Harness; score: number }[], T = 0.1): Harness {
  const scores = pool.map((p) => p.score);
  const max = Math.max(...scores);
  const weights = scores.map((s) => Math.exp((s - max) / T));
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < pool.length; i++) {
    r -= weights[i];
    if (r <= 0) return pool[i].h;
  }
  return pool[pool.length - 1].h;
}

export async function metaOptimize(
  seed: Harness,
  trainPapers: string[],
  valPapers: string[],
  iterations: number
): Promise<Harness> {
  const seedScore = (await evaluateHarness(seed, valPapers)).passRate;
  const pool = [{ h: seed, score: seedScore }];

  for (let i = 0; i < iterations; i++) {
    const parent = boltzmannPick(pool);
    const trainEval = await evaluateHarness(parent, trainPapers);
    const mutant = await proposeMutation(parent, trainEval.failureTrajectories);
    const valEval = await evaluateHarness(mutant, valPapers);

    const parentScore = pool.find((p) => p.h === parent)!.score;
    if (valEval.passRate > parentScore) {
      pool.push({ h: mutant, score: valEval.passRate });
    }
  }
  return pool.sort((a, b) => b.score - a.score)[0].h;
}

This is a stripped-down version of what Meta ran. The shape is what matters: select with Boltzmann, evaluate, propose-via-LLM, validate, gate on strict improvement.

4.3 What NOT to put in the outer loop

The temptation is to let the optimizer mutate everything. Don't.

  • Don't optimize the acceptance threshold numbers themselves on a small validation set. They will overfit. Use the calibration procedure from Module 3 instead, then freeze them per domain and let the optimizer mutate prompts and rubric format.
  • Don't optimize the model identifiers (e.g., switching from Sonnet to Opus mid-loop). That's a deployment decision, not a harness mutation, and it confounds your before/after comparison.
  • Don't optimize the inner loop topology (e.g., adding a fifth subagent). Save that for the next paper.

The outer loop's job is to find local-but-meaningful improvements to the prompts and rubric format. Bigger structural changes belong in a separate review.
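The `proposeMutation` prompt in 4.2 still allows the analyzer to edit thresholds, so the advice above needs enforcing somewhere. One option, sketched here as an assumption rather than anything from the paper, is a guard that discards mutants which touch the frozen thresholds or change nothing at all. `isAllowedMutation` is a hypothetical name.

```typescript
type Thresholds = {
  weakAvg: number;
  maxWeak: number;
  strongAvgLo: number;
  strongAvgHi: number;
  gap: number;
};

type Harness = {
  challengerSystem: string;
  verifierSystem: string;
  thresholds: Thresholds;
};

// Hypothetical guard: true only if thresholds are untouched and at least one prompt changed.
function isAllowedMutation(parent: Harness, mutant: Harness): boolean {
  const keys: (keyof Thresholds)[] = ["weakAvg", "maxWeak", "strongAvgLo", "strongAvgHi", "gap"];
  const thresholdsFrozen = keys.every((k) => parent.thresholds[k] === mutant.thresholds[k]);
  const promptChanged =
    parent.challengerSystem !== mutant.challengerSystem ||
    parent.verifierSystem !== mutant.verifierSystem;
  return thresholdsFrozen && promptChanged;
}
```

In `metaOptimize`, a check like this would sit between `proposeMutation` and the validation run, so a disallowed mutant is skipped before it costs a full evaluation.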


Module 5: Practical Takeaway — When This Pattern Earns Its Keep

5.1 When to reach for Autodata

Module objectives
  • Recognize the small set of problems where Autodata is unambiguously the right call
  • Identify simpler patterns that work for the long tail
  • Understand the cost shape before you commit

Three conditions. If all three are true, this is your pattern:

  1. You are training (or evaluating) a model, not just generating prose. Autodata's whole point is producing examples with a tunable difficulty gap, which only matters if those examples land in a training run or an eval suite.
  2. Quality dominates volume. You'd rather have 1,000 hard, calibrated examples than 100,000 mid examples. If you need volume, this loop is too slow.
  3. You can spend inference compute liberally. A single accepted example costs roughly 10× a one-shot generation or more: 3–5 rounds, each with 1 generator call, 6 solver calls, and 6 verifier calls. The meta-optimizer adds to that.

5.2 When something simpler wins

| Situation | Better pattern |
|---|---|
| You need ~100 hand-crafted examples for a smoke test | Just write them |
| You need ~10K examples and you don't care about difficulty distribution | Single-pass generation + dedupe + simple verifier |
| You need eval data, not training data | Generation + human review (eval data is high-stakes) |
| Your domain has a cheap automatic verifier (compiler, unit test) | Generation + verifier loop, skip the weak/strong split |

The weak/strong solver split is what makes Autodata expensive. If you only need acceptance/rejection (not difficulty calibration), a single solver against a verifier is enough.

5.3 Cost shape before you commit

Rough order-of-magnitude per accepted example, using Sonnet 4.6 prices and 4-page paper inputs:

  • Generator: ~3 calls (one per round), each with ~10K input tokens (cached after the first call) and ~2K output tokens
  • Evaluator: 6 solver calls per round, each with ~1K input tokens and ~1K output tokens
  • Filter: 6 verifier calls per round, each with ~3K input tokens (rubric cached) and ~200 output tokens

With prompt caching on the paper text and the rubric, the dominant cost is the 6 solver calls in the Evaluator. Without caching, the Generator dominates and the loop is roughly 3× more expensive — which is why every example in this guide wraps long, reused content in cache_control: { type: "ephemeral" }.

The meta-optimizer adds another order of magnitude. Meta ran 233 iterations, each evaluating harnesses on 50 training papers and 25 validation papers with an inner loop per paper. Plan for it as a one-time investment per domain, not a per-batch cost.

