Build a Self-Improving Data Pipeline with Multi-Agent Quality Gates
A practical Anthropic SDK walkthrough of Meta's Autodata closed-loop pattern — four subagents (Generator, Evaluator, Filter, Optimizer), exact acceptance thresholds, and a TypeScript implementation that turns inference compute into higher-quality training data.
Guide Curriculum
Why Multi-Agent Quality Gates Beat One-Pass Synthetic Data
- The bottleneck is data quality, not compute
- The empirical gap
- When the closed loop is worth it
The Four-Subagent Loop
- Roles and naming
- Generator — the Challenger
- Evaluator — Weak Solver and Strong Solver
- Filter — Verifier/Judge plus the acceptance gate
- Wiring the inner loop
Acceptance Thresholds — How to Set and Adjust Them
- What each threshold prevents
- Adjusting for a different domain
- Verifying the thresholds are doing work
Meta-Optimization — The Outer Loop
- What the Optimizer actually optimizes
- A minimal Optimizer in TypeScript
- What NOT to put in the outer loop
Practical Takeaway — When This Pattern Earns Its Keep
- When to reach for Autodata
- When something simpler wins
- Cost shape before you commit
About the Author
Hiram Clark is the founder and managing editor of vybecoding.ai and sets editorial direction for the guides and news published here. Articles are drafted with AI assistance and edited before publication. He works hands-on with the AI development tools, workflows, and infrastructure covered on the site.
Full Guide Content
Module 1: Why Multi-Agent Quality Gates Beat One-Pass Synthetic Data
1.1 The bottleneck is data quality, not compute
Module objectives
- Understand the failure mode of single-pass synthetic data generation
- See the empirical gap that motivates a closed loop
- Identify when this pattern earns its compute cost
For most teams sitting on top of frontier LLMs, the headache is no longer "can the model do this?" It's "can we feed the model enough high-signal examples to specialize it?" Hand-labeled data is expensive and slow. Generic synthetic data — the kind a single LLM call produces from a one-shot prompt — looks abundant but is full of trivial, leaky, or impossibly hard examples that train weak behavior.
Meta's RAM (Reasoning, Alignment, Memory) team published Autodata in May 2026 as a direct response. The framing: treat the data scientist's actual workflow — generate, inspect, score, refine — as a closed loop run by AI agents instead of by a human. The output is a stream of examples that have already been quality-screened and difficulty-tuned before they ever touch a training run.
1.2 The empirical gap
The Autodata paper compared two pipelines on 2,117 QA pairs derived from 10,000+ CS papers:
| Approach | Weak Solver Score | Strong Solver Score | Score Gap |
|---|---|---|---|
| CoT Self-Instruct (one-shot) | 71.4% | 73.3% | 1.9 pp |
| Agentic Self-Instruct (Autodata) | 43.7% | 77.8% | 34.1 pp |
The CoT pipeline produces examples that both solvers handle nearly equally well — meaning the data is not separating capability levels. Autodata produces examples where strong models clearly outperform weak ones, which is exactly the signal RL training needs. When the team trained Qwen-3.5-4B on the Agentic data with GRPO (~1 epoch), it beat the CoT-trained variant on both in-distribution and out-of-distribution test sets.
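The score gap column is the strong-solver score minus the weak-solver score, in percentage points. A minimal sketch of that arithmetic using the two rows above; the helper name scoreGapPp is illustrative and does not come from the Autodata paper.

```typescript
// Illustrative helper: gap between strong and weak solver scores, in percentage points.
function scoreGapPp(weakPct: number, strongPct: number): number {
  // Round to one decimal place to hide floating-point noise in the subtraction.
  return Math.round((strongPct - weakPct) * 10) / 10;
}

const cotGap = scoreGapPp(71.4, 73.3); // CoT Self-Instruct row
const agenticGap = scoreGapPp(43.7, 77.8); // Agentic Self-Instruct row

console.log({ cotGap, agenticGap }); // 1.9 and 34.1
```

A gap near zero means the two solvers are indistinguishable on that data; the larger the gap, the more an example says about capability.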
1.3 When the closed loop is worth it
This pattern is not free. Each accepted example may take 3–5 retry rounds, and each round runs the weak solver 3 times and the strong solver 3 times, plus one verifier call per solver answer. That is at least an order of magnitude more inference per accepted example than one-shot generation.
Use this pattern when:
- You are building training data for fine-tuning, RL, or evaluation suites where data quality dominates compute cost
- You need to control the difficulty distribution explicitly (no trivial, no impossible)
- You can afford to spend inference compute as a substitute for human annotators
Skip it when:
- You only need a few hundred examples and can hand-label them faster than you can debug the loop
- A single judge rubric is enough — if you don't need difficulty calibration, just generate + verify is simpler
- Your task has a verifier that's already as expensive as the generator (you'll spend most of the loop in the verifier)
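To make the cost trade-off concrete, here is a rough call-count model for one accepted example. It assumes each round makes one generator call, N weak and N strong solver calls, and one verifier call per solver answer, which is how the Filter in Module 2 is written. The function name is illustrative.

```typescript
// Illustrative call-count model for one accepted example.
// Assumption: per round = 1 generator call + 2N solver calls + 2N verifier calls.
function callsPerAcceptedExample(rounds: number, runsPerSolver = 3): number {
  const generatorCalls = 1;
  const solverCalls = 2 * runsPerSolver; // weak + strong
  const verifierCalls = solverCalls; // one score per solver answer
  return rounds * (generatorCalls + solverCalls + verifierCalls);
}

const callsAt3Rounds = callsPerAcceptedExample(3);
const callsAt5Rounds = callsPerAcceptedExample(5);

console.log(callsAt3Rounds, callsAt5Rounds); // 39 65
```

Call count is only a proxy for cost, since token volume per call differs by stage, but it shows why the loop only pays off when data quality matters more than throughput.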
Module 2: The Four-Subagent Loop
2.1 Roles and naming
Module objectives
- Map this guide's role names onto the source paper's terminology
- Implement each subagent against the Anthropic Messages API
- Wire the Filter to the acceptance gate
This guide frames Autodata as a Generator → Evaluator → Filter → Optimizer pipeline. The Meta paper uses different names. The mapping is one-to-many but consistent:
| Pipeline role | Source paper agent(s) | Job |
|---|---|---|
| Generator | Challenger LLM | Produce a candidate (input, response, rubric) given source material |
| Evaluator | Weak Solver + Strong Solver | Run the candidate through two capability levels and score both |
| Filter | Verifier/Judge + acceptance gate | Score answers against the rubric and apply numeric thresholds |
| Optimizer | Meta-optimization outer loop | Mutate the agent harness itself based on downstream performance |
The Generator/Evaluator/Filter triple is the inner loop (per-example). The Optimizer is the outer loop (per-batch, slow).
2.2 Generator — the Challenger
The Challenger reads source material (a paper, a document, a code file) and produces a structured triple: a self-contained question, a reference answer, and a rubric the Verifier will score against.
```typescript
import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";

const CHALLENGER_SYSTEM = `You are a data-scientist agent generating training examples from a research paper.
Output JSON only. The rubric MUST be answerable only with knowledge from the source paper
(self-test: would a domain expert who has not read this paper be able to answer correctly? If yes, reject your own draft).
Use positive-only criteria. Each criterion has an integer weight 1..7. No criterion may exceed weight 7.`;

type Triple = {
  question: string;
  reference_answer: string;
  rubric: { criterion: string; weight: number }[];
};

export async function generateCandidate(
  paperText: string,
  feedback: string | null
): Promise<Triple> {
  const userBlocks: Anthropic.Messages.ContentBlockParam[] = [
    {
      // The paper is identical across retries, so mark it as a cacheable prefix.
      type: "text",
      text: paperText,
      cache_control: { type: "ephemeral" },
    },
    {
      // Only this block changes between rounds.
      type: "text",
      text: feedback
        ? `Generate a new candidate. Targeted feedback from prior attempt:\n${feedback}`
        : "Generate a candidate triple {question, reference_answer, rubric}.",
    },
  ];

  const resp = await client.messages.create({
    model: MODEL,
    max_tokens: 2048,
    system: [
      { type: "text", text: CHALLENGER_SYSTEM, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userBlocks }],
  });

  // Concatenate all text blocks in the response.
  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");

  // JSON.parse throws if the model wraps the JSON in prose; callers may want to catch and retry.
  return JSON.parse(text) as Triple;
}
```
Two things to notice. First, the system prompt and the paper text are both wrapped in cache_control: { type: "ephemeral" }. Across the 3–5 retries that a single example may take, the paper does not change — caching the prefix is what makes this loop affordable. Second, the rubric format is locked to { criterion, weight }[] with weights capped at 7. Both constraints came directly out of Autodata's meta-optimization phase: free-form rubrics blew up parsing, and unbounded weights collapsed to a single criterion.
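The rubric constraints described above (a strict { criterion, weight }[] shape and integer weights from 1 to 7) can also be checked in code rather than trusted to the prompt alone. A minimal sketch, assuming a hypothetical parseTriple helper that is not part of the original guide or the Anthropic SDK:

```typescript
type RubricItem = { criterion: string; weight: number };
type ParsedTriple = { question: string; reference_answer: string; rubric: RubricItem[] };

// Hypothetical helper: parse the Challenger's raw text and enforce the rubric
// constraints (non-empty rubric, integer weights in 1..7). Throws on any violation.
function parseTriple(raw: string): ParsedTriple {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("not a JSON object");
  const { question, reference_answer, rubric } = data as Record<string, unknown>;
  if (typeof question !== "string" || question.trim() === "") throw new Error("missing question");
  if (typeof reference_answer !== "string") throw new Error("missing reference_answer");
  if (!Array.isArray(rubric) || rubric.length === 0) {
    throw new Error("rubric must be a non-empty array");
  }
  const items = rubric.map((item: unknown, i: number): RubricItem => {
    if (typeof item !== "object" || item === null) throw new Error(`rubric[${i}] is not an object`);
    const { criterion, weight } = item as Record<string, unknown>;
    if (typeof criterion !== "string" || criterion.trim() === "") {
      throw new Error(`rubric[${i}] has no criterion`);
    }
    if (typeof weight !== "number" || !Number.isInteger(weight) || weight < 1 || weight > 7) {
      throw new Error(`rubric[${i}] weight must be an integer in 1..7`);
    }
    return { criterion, weight };
  });
  return { question, reference_answer, rubric: items };
}

const sampleTriple = parseTriple(
  '{"question":"Q?","reference_answer":"A","rubric":[{"criterion":"cites the ablation","weight":5}]}'
);
console.log(sampleTriple.rubric.length); // 1
```

A validation failure can be treated like any other rejection: pass the error message back to the Generator as feedback for the next round.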
2.3 Evaluator — Weak Solver and Strong Solver
The Evaluator runs the candidate question through two solvers at different capability levels, three times each. The two solvers can be different models, or the same model with different inference modes (e.g., regular vs extended thinking, or with/without tool use).
```typescript
// Ask one solver model the question once and return its answer text.
async function runSolver(model: string, question: string): Promise<string> {
  const resp = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });
  return resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
}

export async function evaluateCandidate(
  question: string,
  weakModel: string,
  strongModel: string
): Promise<{ weak: string[]; strong: string[] }> {
  const N = 3; // runs per solver
  // All 2N solver calls run concurrently.
  const [weak, strong] = await Promise.all([
    Promise.all(Array.from({ length: N }, () => runSolver(weakModel, question))),
    Promise.all(Array.from({ length: N }, () => runSolver(strongModel, question))),
  ]);
  return { weak, strong };
}
```
N = 3 matches the paper. Three runs per solver gives you a stable mean without exploding cost. Lowering it to 1 makes acceptance noisy; raising it to 5 or 6 increases solver and verifier cost by roughly 1.7–2× for marginal gain.
2.4 Filter — Verifier/Judge plus the acceptance gate
The Filter has two jobs. The Verifier scores each answer against the rubric. The acceptance gate then applies fixed numeric thresholds.
```typescript
const VERIFIER_SYSTEM = `You score answers against a rubric. For each criterion, output a 0..1 score.
Final answer score = sum(criterion_score * weight) / sum(weight). Output strict JSON only.`;

// Score one answer against the rubric; returns a weighted score in 0..1.
async function scoreAnswer(
  rubric: Triple["rubric"],
  question: string,
  answer: string
): Promise<number> {
  const resp = await client.messages.create({
    model: MODEL,
    max_tokens: 512,
    system: [
      { type: "text", text: VERIFIER_SYSTEM, cache_control: { type: "ephemeral" } },
      { type: "text", text: JSON.stringify(rubric), cache_control: { type: "ephemeral" } },
    ],
    messages: [
      { role: "user", content: `QUESTION: ${question}\nANSWER: ${answer}\nReturn {"score": number}.` },
    ],
  });
  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return (JSON.parse(text) as { score: number }).score;
}

export type AcceptanceResult =
  | { accepted: true }
  | { accepted: false; reason: string };

export type Thresholds = {
  weakAvg: number; // upper bound — reject if exceeded
  maxWeak: number; // upper bound — reject if exceeded
  strongAvgLo: number; // lower bound — reject if below
  strongAvgHi: number; // upper bound (exclusive) — reject if at or above
  gap: number; // lower bound on strong−weak difference
};

export const DEFAULT_THRESHOLDS: Thresholds = {
  weakAvg: 0.65,
  maxWeak: 0.75,
  strongAvgLo: 0.60,
  strongAvgHi: 0.95,
  gap: 0.20,
};

export async function filterCandidate(
  triple: Triple,
  evals: { weak: string[]; strong: string[] },
  t: Thresholds = DEFAULT_THRESHOLDS
): Promise<AcceptanceResult> {
  // One verifier call per solver answer.
  const weakScores = await Promise.all(
    evals.weak.map((a) => scoreAnswer(triple.rubric, triple.question, a))
  );
  const strongScores = await Promise.all(
    evals.strong.map((a) => scoreAnswer(triple.rubric, triple.question, a))
  );

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const weakAvg = avg(weakScores);
  const strongAvg = avg(strongScores);
  const maxWeak = Math.max(...weakScores);
  const minWeak = Math.min(...weakScores);
  const gap = strongAvg - weakAvg;

  // Rules are checked in order; the first failure becomes the feedback for the next round.
  if (minWeak === 0) return { accepted: false, reason: "weak score has zero" };
  if (weakAvg > t.weakAvg) return { accepted: false, reason: `weak_avg ${weakAvg.toFixed(2)} > ${t.weakAvg} — too easy` };
  if (maxWeak > t.maxWeak) return { accepted: false, reason: `max_weak ${maxWeak.toFixed(2)} > ${t.maxWeak} — too easy` };
  if (strongAvg < t.strongAvgLo) return { accepted: false, reason: `strong_avg ${strongAvg.toFixed(2)} < ${t.strongAvgLo} — too hard` };
  if (strongAvg >= t.strongAvgHi) return { accepted: false, reason: `strong_avg ${strongAvg.toFixed(2)} >= ${t.strongAvgHi} — trivial for strong` };
  if (gap < t.gap) return { accepted: false, reason: `gap ${gap.toFixed(2)} < ${t.gap} — insufficient difficulty separation` };
  return { accepted: true };
}
```
The thresholds come straight from the Autodata paper: weak_avg ≤ 0.65, max_weak ≤ 0.75, no zero scores, strong_avg ≥ 0.60 and < 0.95, gap ≥ 0.20. They are tuned for a CS-research QA task and are the right starting point, but they are not magic. The calibration procedure in Module 3 is what makes them domain-specific.
2.5 Wiring the inner loop
```typescript
export async function runInnerLoop(
  paperText: string,
  weakModel: string,
  strongModel: string,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
  maxRounds = 5
): Promise<Triple | null> {
  let feedback: string | null = null;
  for (let round = 0; round < maxRounds; round++) {
    const triple = await generateCandidate(paperText, feedback);
    const evals = await evaluateCandidate(triple.question, weakModel, strongModel);
    const verdict = await filterCandidate(triple, evals, thresholds);
    if (verdict.accepted) return triple;
    feedback = verdict.reason; // steer the next draft with the rejection reason
  }
  return null; // budget exhausted, drop this seed
}
```
That's the whole closed loop. The Generator proposes, the Evaluator runs both solvers, and the Filter scores the answers and either accepts the candidate or feeds the rejection reason back to the Generator for the next round.
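The feedback path is easier to test if the agents are injected rather than called directly. The sketch below is a dependency-injected variant of the same loop with stub agents, so it runs without any API calls. The Agents type, runLoop, and the stubs are hypothetical names introduced for illustration; judge stands in for the Evaluator and Filter together.

```typescript
// Hypothetical agent interfaces so the loop can be exercised offline.
type Candidate = { question: string };
type Verdict = { accepted: true } | { accepted: false; reason: string };
type Agents = {
  generate: (feedback: string | null) => Promise<Candidate>;
  judge: (c: Candidate) => Promise<Verdict>; // stands in for Evaluator + Filter
};

async function runLoop(agents: Agents, maxRounds = 5) {
  const history: string[] = []; // rejection reasons, in order
  let feedback: string | null = null;
  for (let round = 1; round <= maxRounds; round++) {
    const candidate = await agents.generate(feedback);
    const verdict = await agents.judge(candidate);
    if (verdict.accepted) return { candidate, rounds: round, history };
    feedback = verdict.reason;
    history.push(verdict.reason);
  }
  return { candidate: null, rounds: maxRounds, history };
}

// Stub agents: the first draft is rejected as too easy; the generator reacts to feedback.
const stubAgents: Agents = {
  generate: async (feedback) => ({ question: feedback ? "hard question" : "easy question" }),
  judge: async (c) =>
    c.question === "hard question"
      ? { accepted: true }
      : { accepted: false, reason: "weak_avg too high, too easy" },
};

// Resolves with the accepted candidate after two rounds and one recorded rejection.
const loopResult = runLoop(stubAgents);
loopResult.then((r) => console.log(r.rounds, r.history));
```

Keeping the rejection history around also gives the outer loop in Module 4 something to read when it analyzes failures.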
Module 3: Acceptance Thresholds — How to Set and Adjust Them
3.1 What each threshold prevents
Module objectives
- Read each numeric threshold and understand the failure mode it prevents
- Adjust thresholds when porting Autodata to a non-CS domain
- Verify thresholds are doing useful work, not just rejecting everything
| Threshold | What it prevents | Symptom if removed |
|---|---|---|
| weak_avg ≤ 0.65 | Trivial examples both models can solve | Score gap collapses; weak data |
| max_weak ≤ 0.75 | Outlier easy examples that slip past the average | A few "gimme" questions inflate weak signal |
| no zeros on weak | Verifier mis-scoring or broken question | Garbage examples accepted because weak "failed" |
| strong_avg ≥ 0.60 | Impossible examples even strong models can't answer | Training on noise — model learns to output nothing |
| strong_avg < 0.95 | Trivial-for-strong (memorized or templated) | Strong model is solving from prior knowledge, not the source |
| gap ≥ 0.20 | Examples that don't separate capability levels | RL signal is flat |
Read these as a difficulty band, not a quality gate. The Verifier is the quality gate (does the answer match the rubric?). The numeric thresholds enforce that accepted examples actually distinguish strong from weak models — which is what you need them to do during training.
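To see the band in action without any model calls, the gate can be written as a pure function over already-computed scores. This is a sketch that mirrors the rule order of the Filter in Module 2; checkBand and the sample score arrays are illustrative, not from the paper.

```typescript
type Band = { weakAvg: number; maxWeak: number; strongAvgLo: number; strongAvgHi: number; gap: number };
const PAPER_BAND: Band = { weakAvg: 0.65, maxWeak: 0.75, strongAvgLo: 0.6, strongAvgHi: 0.95, gap: 0.2 };

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Pure acceptance gate: returns null when accepted, else the first failing rule.
function checkBand(weak: number[], strong: number[], t: Band = PAPER_BAND): string | null {
  const weakAvg = mean(weak);
  const strongAvg = mean(strong);
  if (Math.min(...weak) === 0) return "zero weak score";
  if (weakAvg > t.weakAvg) return "weak_avg too high";
  if (Math.max(...weak) > t.maxWeak) return "max_weak too high";
  if (strongAvg < t.strongAvgLo) return "strong_avg too low";
  if (strongAvg >= t.strongAvgHi) return "strong_avg too high";
  if (strongAvg - weakAvg < t.gap) return "gap too small";
  return null;
}

const inBand = checkBand([0.4, 0.5, 0.45], [0.8, 0.85, 0.75]); // null: accepted
const easyOutlier = checkBand([0.6, 0.8, 0.5], [0.9, 0.9, 0.9]); // "max_weak too high"
const flatGap = checkBand([0.5, 0.6, 0.55], [0.7, 0.65, 0.7]); // "gap too small"

console.log(inBand, easyOutlier, flatGap);
```

The second case shows why max_weak exists: the weak average (about 0.63) passes, but one run scoring 0.8 is enough to reject the example.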
3.2 Adjusting for a different domain
Suppose you want to apply Autodata to customer-support transcripts instead of CS papers. The CS paper thresholds were tuned for tasks where strong models reach 60–95%. If your domain is harder (medical reasoning, legal drafting), you'll find every candidate getting filtered out by strong_avg ≥ 0.60. If your domain is easier (intent classification), nothing will pass weak_avg ≤ 0.65.
A defensible adjustment procedure:
- Calibrate first — generate 20 candidates with no acceptance gate, just record weak_avg, strong_avg, and gap.
- Find your strong-solver ceiling — the 75th percentile of strong_avg is roughly where you should set the upper bound.
- Find your weak-solver floor — the 25th percentile of weak_avg is roughly where the lower bound should sit.
- Fix the gap last — the gap threshold should be 1–2× the standard deviation of (strong_avg − weak_avg) across your calibration set.
- Lock thresholds, then run the meta-optimizer — see Module 4.
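The statistics in this procedure can be computed with a few lines of code. The sketch below makes two assumptions the guide does not specify: percentiles use linear interpolation between ranks, and the standard deviation is the sample (n − 1) version. All names are illustrative, and mapping the resulting numbers onto specific threshold fields is left to you.

```typescript
type CalibrationRow = { weakAvg: number; strongAvg: number };

// Percentile with linear interpolation between closest ranks (p in 0..1).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Sample standard deviation (divides by n - 1).
function stdDev(values: number[]): number {
  const m = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((a, b) => a + (b - m) ** 2, 0) / (values.length - 1);
  return Math.sqrt(variance);
}

function calibrationStats(rows: CalibrationRow[], gapMultiplier = 1) {
  const gaps = rows.map((r) => r.strongAvg - r.weakAvg);
  return {
    strongP75: percentile(rows.map((r) => r.strongAvg), 0.75),
    weakP25: percentile(rows.map((r) => r.weakAvg), 0.25),
    gapThreshold: gapMultiplier * stdDev(gaps),
  };
}

// Toy calibration set of 5 rows; the procedure above suggests 20.
const calibrationRows: CalibrationRow[] = [
  { weakAvg: 0.2, strongAvg: 0.6 },
  { weakAvg: 0.3, strongAvg: 0.7 },
  { weakAvg: 0.4, strongAvg: 0.8 },
  { weakAvg: 0.5, strongAvg: 0.9 },
  { weakAvg: 0.6, strongAvg: 0.7 },
];
const calibration = calibrationStats(calibrationRows);
console.log(calibration);
```

With only 20 calibration candidates these estimates are noisy, so treat them as starting points and re-check them against the acceptance-rate dashboard.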
3.3 Verifying the thresholds are doing work
A common failure mode is thresholds that are too loose (everything passes) or too tight (acceptance rate drops to single digits and inference cost dominates). Track a small dashboard:
- Acceptance rate (target: 30–60% after the meta-optimizer warms up)
- Distribution of rejection reasons (should be spread across reasons, not all "too easy")
- Average rounds per accepted example (target: 1.5–3; >4 means thresholds are too tight)
If acceptance rate is below 10%, your thresholds are starving the pipeline. Above 80% means you're not filtering — likely calibrated for too-easy data.
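A sketch of how these three dashboard numbers could be computed from a run log. The SeedLog shape is hypothetical, with one record per seed document. It assumes acceptance rate is measured per seed and that average rounds is taken over accepted seeds only; adjust both if you track per-candidate statistics instead.

```typescript
// Hypothetical log record: one entry per seed document processed by the inner loop.
type SeedLog = { accepted: boolean; rounds: number; rejectionReasons: string[] };

function summarize(logs: SeedLog[]) {
  const accepted = logs.filter((l) => l.accepted);
  // Histogram of rejection reasons across all rounds of all seeds.
  const reasonCounts: Record<string, number> = {};
  for (const log of logs) {
    for (const reason of log.rejectionReasons) {
      reasonCounts[reason] = (reasonCounts[reason] ?? 0) + 1;
    }
  }
  const acceptedRounds = accepted.reduce((sum, l) => sum + l.rounds, 0);
  return {
    acceptanceRate: logs.length ? accepted.length / logs.length : 0,
    avgRoundsPerAccepted: accepted.length ? acceptedRounds / accepted.length : 0,
    reasonCounts,
  };
}

const dashboard = summarize([
  { accepted: true, rounds: 2, rejectionReasons: ["too easy"] },
  { accepted: true, rounds: 1, rejectionReasons: [] },
  { accepted: false, rounds: 5, rejectionReasons: ["too easy", "too easy", "too hard", "gap", "too easy"] },
  { accepted: true, rounds: 3, rejectionReasons: ["gap", "too hard"] },
]);
console.log(dashboard);
```

In this toy log the acceptance rate is 75% and "too easy" accounts for half of all rejections, which by the guidance above would suggest data calibrated for a too-easy domain.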
Module 4: Meta-Optimization — The Outer Loop
4.1 What the Optimizer actually optimizes
Module objectives
- Understand why an outer optimization loop is part of Autodata's contribution, not a footnote
- Implement an evolution-based optimizer that mutates the agent harness itself
- Know what NOT to put in the outer loop
The inner loop produces examples. The outer loop produces better inner loops. Specifically, it mutates the agent harness — the prompts, the rubric format, the scoring rules — based on the inner loop's downstream performance.
Meta's outer loop is evolution-based. Each iteration:
- Select — Boltzmann sampling at temperature T=0.1 over the current pool of harnesses. Low temperature = elitist; you usually pick the best, sometimes explore.
- Evaluate — run the selected harness on a training set of papers and measure pass rate.
- Analyze — an LLM agent reads the failed trajectories and proposes mutations.
- Implement — a code-editing agent applies the diff to the harness.
- Re-evaluate — measure the mutant's pass rate on a held-out validation set.
- Accept or reject — only add the mutant to the pool if validation strictly exceeds parent.
The strict-better-than-parent gate is what prevents drift. Without it, the optimizer happily adds neutral or slightly worse mutants and the population walks off a cliff.
Meta ran this for 233 iterations on 50 training papers and 25 validation papers, accepted 126 mutants, and pushed validation pass rate from 12.8% to 42.4%. The optimizer discovered:
- Paper-specific insight enforcement, with a self-test that the question be unanswerable from prior knowledge alone
- Context-leak prevention: the source must appear in the setup but never in the rubric criteria
- Positive-only rubrics with weights capped at 7
- Strict JSON rubric format, eliminating parsing failures
Three of these four are now in the Generator system prompt in Module 2. They are not invented; they are what evolution found.
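To get a feel for how elitist the Select step is at T=0.1, the selection probabilities can be computed directly. This sketch assumes the standard Boltzmann (softmax) form, with probability proportional to exp(score / T), and uses validation pass rates as scores; the example pool is made up.

```typescript
// Selection probabilities under Boltzmann (softmax) sampling: p_i ∝ exp(score_i / T).
function boltzmannProbs(scores: number[], T = 0.1): number[] {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const weights = scores.map((s) => Math.exp((s - max) / T));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => w / total);
}

// Three harnesses with validation pass rates of 40%, 30%, and 20%.
const selectionProbs = boltzmannProbs([0.4, 0.3, 0.2]);
console.log(selectionProbs.map((p) => p.toFixed(3))); // 0.665, 0.245, 0.090
```

At this temperature a 10-point lead in pass rate makes a harness about e ≈ 2.7 times as likely to be picked, so the best harness dominates selection while weaker ones still get occasional turns.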
4.2 A minimal Optimizer in TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const ANALYZER_MODEL = "claude-opus-4-7";

// `Thresholds` is the same shape exported from Module 2 (Filter).
// The harness carries these values; the Filter consumes them via
// filterCandidate(triple, evals, harness.thresholds).
type Thresholds = {
  weakAvg: number;
  maxWeak: number;
  strongAvgLo: number;
  strongAvgHi: number;
  gap: number;
};

type Harness = {
  challengerSystem: string;
  verifierSystem: string;
  thresholds: Thresholds;
};

type EvalResult = { passRate: number; failureTrajectories: string[] };

async function evaluateHarness(h: Harness, papers: string[]): Promise<EvalResult> {
  // For each paper, call runInnerLoop from Module 2 — passing `h.thresholds` so the
  // Filter uses the harness's current values, and patching `CHALLENGER_SYSTEM`/`VERIFIER_SYSTEM`
  // with `h.challengerSystem`/`h.verifierSystem` (left abstract here for brevity).
  // Count accepted vs. attempted, collect rejection trajectories.
  return { passRate: 0, failureTrajectories: [] };
}

async function proposeMutation(parent: Harness, failures: string[]): Promise<Harness> {
  const resp = await client.messages.create({
    model: ANALYZER_MODEL,
    max_tokens: 4096,
    // Note: section 4.3 recommends freezing thresholds after calibration. If you follow
    // that advice, drop "or thresholds" from this prompt.
    system:
      "You optimize a synthetic-data agent harness. Read failure trajectories and propose ONE diff to challengerSystem, verifierSystem, or thresholds. Output strict JSON of the full new Harness.",
    messages: [
      {
        role: "user",
        // Cap the prompt at the first 10 failure trajectories.
        content: `PARENT_HARNESS: ${JSON.stringify(parent)}\nFAILURES: ${failures.slice(0, 10).join("\n---\n")}`,
      },
    ],
  });
  const text = resp.content
    .filter((b): b is Anthropic.Messages.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return JSON.parse(text) as Harness;
}

// Sample one harness with probability proportional to exp(score / T).
function boltzmannPick(pool: { h: Harness; score: number }[], T = 0.1): Harness {
  const scores = pool.map((p) => p.score);
  const max = Math.max(...scores);
  const weights = scores.map((s) => Math.exp((s - max) / T));
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < pool.length; i++) {
    r -= weights[i];
    if (r <= 0) return pool[i].h;
  }
  return pool[pool.length - 1].h;
}

export async function metaOptimize(
  seed: Harness,
  trainPapers: string[],
  valPapers: string[],
  iterations: number
): Promise<Harness> {
  const seedScore = (await evaluateHarness(seed, valPapers)).passRate;
  const pool = [{ h: seed, score: seedScore }];
  for (let i = 0; i < iterations; i++) {
    const parent = boltzmannPick(pool);
    const trainEval = await evaluateHarness(parent, trainPapers);
    const mutant = await proposeMutation(parent, trainEval.failureTrajectories);
    const valEval = await evaluateHarness(mutant, valPapers);
    const parentScore = pool.find((p) => p.h === parent)!.score;
    // Strict improvement gate: neutral or worse mutants never enter the pool.
    if (valEval.passRate > parentScore) {
      pool.push({ h: mutant, score: valEval.passRate });
    }
  }
  // Return the best harness by validation pass rate.
  return pool.sort((a, b) => b.score - a.score)[0].h;
}
```
This is a stripped-down version of what Meta ran. The shape is what matters: select with Boltzmann, evaluate, propose-via-LLM, validate, gate on strict improvement.
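The evaluateHarness stub above leaves the per-paper work abstract. One way to fill it in is to inject the per-paper runner, which also makes the function testable offline. The sketch assumes a hypothetical runner that reports whether a paper yielded an accepted example along with its rejection reasons; runInnerLoop from Module 2 would need a thin wrapper to expose that. Pass rate here is accepted papers divided by attempted papers.

```typescript
type HarnessPrompts = { challengerSystem: string; verifierSystem: string };
type HarnessEval = { passRate: number; failureTrajectories: string[] };
// Hypothetical per-paper runner: acceptance flag plus the rejection reasons seen on the way.
type RunOutcome = { accepted: boolean; rejectionReasons: string[] };
type InnerLoopRunner = (h: HarnessPrompts, paper: string) => Promise<RunOutcome>;

async function evaluateHarnessWith(
  run: InnerLoopRunner,
  h: HarnessPrompts,
  papers: string[]
): Promise<HarnessEval> {
  // Runs every paper concurrently; a real pipeline may need to limit concurrency.
  const outcomes = await Promise.all(papers.map((p) => run(h, p)));
  const acceptedCount = outcomes.filter((o) => o.accepted).length;
  const failureTrajectories = outcomes
    .filter((o) => !o.accepted)
    .map((o) => o.rejectionReasons.join(" -> "));
  return {
    passRate: papers.length ? acceptedCount / papers.length : 0,
    failureTrajectories,
  };
}

// Stub runner for an offline check: papers containing "good" are accepted.
const stubRunner: InnerLoopRunner = async (_h, paper) =>
  paper.includes("good")
    ? { accepted: true, rejectionReasons: [] }
    : { accepted: false, rejectionReasons: ["weak_avg too high", "gap too small"] };

const harnessEvalDemo = evaluateHarnessWith(
  stubRunner,
  { challengerSystem: "", verifierSystem: "" },
  ["good paper", "bad paper", "good paper 2", "bad paper 2"]
);
harnessEvalDemo.then((r) => console.log(r.passRate, r.failureTrajectories.length));
```

The joined rejection reasons are what the analyzer model reads, so richer trajectories (scores, the rejected question) give it more to work with.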
4.3 What NOT to put in the outer loop
The temptation is to let the optimizer mutate everything. Don't.
- Don't optimize: the acceptance threshold numbers themselves on a small validation set. They will overfit. Use the calibration procedure from Module 3 instead, then freeze them per domain and let the optimizer mutate prompts and rubric format.
- Don't optimize: the model identifiers (e.g., switching from Sonnet to Opus mid-loop). That's a deployment decision, not a harness mutation, and it confounds your before/after comparison.
- Don't optimize: the inner loop topology (e.g., adding a fifth subagent). Save that for the next paper.
The outer loop's job is to find local-but-meaningful improvements to the prompts and rubric format. Bigger structural changes belong in a separate review.
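One way to enforce the first rule mechanically is to overwrite whatever thresholds a proposed mutant carries with the parent's calibrated values before it is evaluated. A minimal sketch; freezeThresholds is a hypothetical helper, and the type names are local to this example.

```typescript
type FrozenThresholds = {
  weakAvg: number;
  maxWeak: number;
  strongAvgLo: number;
  strongAvgHi: number;
  gap: number;
};
type MutableHarness = {
  challengerSystem: string;
  verifierSystem: string;
  thresholds: FrozenThresholds;
};

// Keep only the prompt changes from a proposed mutant; thresholds stay as calibrated.
function freezeThresholds(parent: MutableHarness, mutant: MutableHarness): MutableHarness {
  return { ...mutant, thresholds: { ...parent.thresholds } };
}

const parentHarness: MutableHarness = {
  challengerSystem: "v1 challenger prompt",
  verifierSystem: "v1 verifier prompt",
  thresholds: { weakAvg: 0.65, maxWeak: 0.75, strongAvgLo: 0.6, strongAvgHi: 0.95, gap: 0.2 },
};
// A mutant that improved the prompt but also loosened the gap threshold.
const proposedMutant: MutableHarness = {
  ...parentHarness,
  challengerSystem: "v2 challenger prompt",
  thresholds: { ...parentHarness.thresholds, gap: 0.05 },
};

const safeMutant = freezeThresholds(parentHarness, proposedMutant);
console.log(safeMutant.challengerSystem, safeMutant.thresholds.gap); // v2 challenger prompt 0.2
```

Without a guard like this, loosening a threshold is the cheapest way for a mutant to raise its pass rate, which is exactly the overfitting the rule warns about.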
Module 5: Practical Takeaway — When This Pattern Earns Its Keep
5.1 When to reach for Autodata
Module objectives
- Recognize the small set of problems where Autodata is unambiguously the right call
- Identify simpler patterns that work for the long tail
- Understand the cost shape before you commit
Three conditions. If all three are true, this is your pattern:
- You are training (or evaluating) a model, not just generating prose. Autodata's whole point is producing examples with a tunable difficulty gap, which only matters if those examples land in a training run or an eval suite.
- Quality dominates volume. You'd rather have 1,000 hard, calibrated examples than 100,000 mid examples. If you need volume, this loop is too slow.
- You can spend inference compute liberally. A single accepted example costs at least 10× a one-shot generation: 3–5 rounds, each with 1 generator call, 6 solver calls, and 6 verifier calls. The meta-optimizer comes on top of that.
5.2 When something simpler wins
| Situation | Better pattern |
|---|---|
| You need ~100 hand-crafted examples for a smoke test | Just write them |
| You need ~10K examples and you don't care about difficulty distribution | Single-pass generation + dedupe + simple verifier |
| You need eval data, not training data | Generation + human review (eval data is high-stakes) |
| Your domain has a cheap automatic verifier (compiler, unit test) | Generation + verifier loop, skip the weak/strong split |
The weak/strong solver split is what makes Autodata expensive. If you only need acceptance/rejection (not difficulty calibration), a single solver against a verifier is enough.
5.3 Cost shape before you commit
Rough order-of-magnitude token counts per accepted example, assuming Sonnet 4.6 and 4-page paper inputs:
- Generator: ~3 calls (one per round), each with ~10K input tokens (cached after the first) and ~2K output tokens
- Evaluator: 6 solver calls per round, each with ~1K input and ~1K output tokens
- Filter: 6 verifier calls per round, each with ~3K input (rubric, cached) and ~200 output tokens
With prompt caching on the paper text and the rubric, the dominant cost is the 6 solver calls in the Evaluator. Without caching, the Generator dominates and the loop is roughly 3× more expensive — which is why every example in this guide wraps long, reused content in cache_control: { type: "ephemeral" }.
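Whether the cache is actually being hit can be checked from the usage object on each Messages API response, which reports cache writes and cache reads separately from uncached input tokens. A small sketch; the helper name is illustrative and the sample numbers are made up. Prompts shorter than the model's minimum cacheable length are not cached at all, so a short system prompt on its own may never register a hit.

```typescript
// Subset of the Messages API `usage` object relevant to prompt caching.
type CacheUsage = {
  input_tokens: number; // uncached input tokens
  cache_creation_input_tokens?: number | null; // tokens written to the cache
  cache_read_input_tokens?: number | null; // tokens served from the cache
};

// Fraction of all prompt tokens that were served from cache for one response.
function cacheHitFraction(usage: CacheUsage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}

// Made-up numbers: round 1 writes the paper to the cache, round 2 reads it back.
const firstRound = cacheHitFraction({
  input_tokens: 50,
  cache_creation_input_tokens: 10_000,
  cache_read_input_tokens: 0,
});
const secondRound = cacheHitFraction({
  input_tokens: 80,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 10_000,
});
console.log(firstRound, secondRound.toFixed(3)); // 0 and 0.992
```

In the Generator from Module 2 this would be called as cacheHitFraction(resp.usage) after each request. A fraction that stays near zero across retries means the prefix is changing or has expired between rounds.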
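The meta-optimizer adds another order of magnitude. Meta ran 233 iterations, each evaluating a harness on 50 training papers and its mutant on 25 validation papers, with a full inner loop per paper. That is roughly 233 × 75 ≈ 17,500 inner-loop runs. Plan for it as a one-time investment per domain, not a per-batch cost.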
Related guides on vybecoding.ai
- Production Claude: Building with the Anthropic SDK Beyond the Chat Interface — the prerequisite for everything in this guide. Covers the Messages API, tool use, and prompt-caching mechanics in more detail.
- Anthropic's Three-Tier Routing Strategy — Opus 4.7, Sonnet 4.6, and Haiku 4.5 — directly relevant to choosing weak vs strong solver models in Module 2. The same routing logic applies to picking the analyzer/implementer in the meta-optimizer.
Citations
- Marktechpost summary: Meta Introduces Autodata: An Agentic Framework That Turns AI Models into Autonomous Data Scientists for High-Quality Training Data Creation (2026-05-01)
- Primary technical source: Meta RAM team — Autodata blog
- Anthropic SDK reference: Prompt caching — Claude API Docs