Intelligence Per Token: Why Local AI Needs Loops, Not One-Shots

Author: vybecoding

Intermediate | 22m read | Full-stack developers

A small local model that runs on your own desk can do real work — but only if you stop prompting it like ChatGPT. This guide breaks down intelligence per token and the five-step agentic loop that turns a 3B-active model into a reliable builder, with the exact specify-implement-review-patch-verify cycle and the hardware and tools to run it.

Primary Focus

AI & Machine Learning

AI Tools Covered

local-ai, ai-agents, qwen

What You'll Learn

✓ Why One-Shot Prompts Work on Frontier Models and Fail on Local Ones
✓ The Tetris Test
✓ It's Not the Model, It's the Method
✓ Not All Tokens Carry the Same Intelligence
✓ Trading More Tokens for Less Intelligence-Per-Token
✓ "PhD-Level Idiots" — What Frontier Models Actually Are

Guide Curriculum

The One-Shot Failure (3 lessons)

• Why One-Shot Prompts Work on Frontier Models and Fail on Local Ones (2m)
• The Tetris Test (1m)
• It's Not the Model, It's the Method (1m)

What Intelligence Per Token Means (3 lessons)

• Not All Tokens Carry the Same Intelligence (1m)
• Trading More Tokens for Less Intelligence-Per-Token (1m)
• "PhD-Level Idiots" — What Frontier Models Actually Are (1m)

The 5-Step Agentic Loop (6 lessons)

• The Loop, Not the Prompt (1m)
• Step 1 — Specify Before You Build (1m)
• Step 2 — Implement Narrowly (1m)
• Step 3 — Review Adversarially (1m)
• Step 4 — Patch Narrowly (1m)
• Step 5 — Verify Deterministically (1m)

Practical Examples with Local Models (3 lessons)

• Choosing a Local Model and Hardware (1m)
• The External Verifier Trick (1m)
• Awareness — Giving Small Models Your Context (1m)

Building Your First Local Agent Loop (3 lessons)

• The Tools You Need (1m)
• Research Before You Build — Reuse, Don't Reinvent (1m)
• A Worked Walkthrough — Your First Loop (1m)

Local vs Frontier Workflows (3 lessons)

• When to Use Local vs Frontier (1m)
• The "Good Enough" Threshold (1m)
• A Hybrid Strategy You Can Adopt Today (1m)

Preview: First Lesson

The One-Shot Failure

Why One-Shot Prompts Work on Frontier Models and Fail on Local Ones

When you use a frontier chatbot, you've been trained to expect a particular ritual: you type a request, the model thinks, and a finished answer comes back. For simple software that often works on the first try — a hosted model with trillions of parameters can one-shot a small program and have it run. That habit is the problem. The exact same prompt handed to a local model — something like Qwen3.6 or Gemma 4 running on your own machine — usually produces output that looks right but doesn't actually work. The game opens but won't restart, the script runs but mishandles an edge case, the feature is there but the wiring is wrong.

The instinct at that point is to blame the model: "local AI is stupid." But the model isn't being asked to do the same job. The frontier model succeeded because it could absorb the entire task in one pass. A smaller model can't hold that much in a single shot — and when you prompt it as if it can, you're setting it up to fail.

Do this: The next time a local model gives you broken output, resist the urge to swap models or rewrite the prompt for the tenth time. The fix is almost never a better one-shot prompt — it's a different process, which the rest of this guide builds.

This guide includes:

6 modules with 21 lessons

22m estimated reading time

About the Author

✨ Vibe Coder

@hiram-clark

Hiram Clark is the founder of vybecoding.ai and editor of every guide and news article published on the site. He reviews all AI-drafted content for accuracy before publication and is personally accountable for factual errors. He works hands-on with the AI development tools, workflows, and infrastructure covered here.

Full Guide Content

Module 1: The One-Shot Failure

1.1 Why One-Shot Prompts Work on Frontier Models and Fail on Local Ones

1.2 The Tetris Test

A concrete example makes the gap obvious. Ask a top frontier coding model to "write me a Tetris game" and you'll likely get a playable result immediately. Ask a local model the same thing and you'll get a Tetris — the pieces fall, the board renders — but something is off. Maybe there's no sound. Maybe there's no way to restart once you lose. Maybe a rotation near the wall crashes it. The model produced the shape of the answer without the finish.

This isn't a Tetris-specific quirk. It's what happens any time you ask a smaller model to deliver a whole, working artifact in one response. The bigger and more interconnected the task, the wider the gap between "looks done" and "is done."

Do this: Use a tiny, throwaway project — a Tetris or Snake clone — as your own benchmark. Run the same one-shot prompt on a frontier model and a local model and compare. Feeling the difference firsthand is what makes the rest of this material click.

1.3 It's Not the Model, It's the Method

The lesson Igor Kudryk draws from this — in the video this guide is based on — is that the value of a local model is locked behind how you use it. The weights are capable; the one-shot interface is what wastes them. A local model used like a frontier chatbot will reliably disappoint. The same model used inside a structured loop can build things you'd struggle to build yourself.

That reframe is the whole point of this guide. Everything that follows is about replacing the single prompt with a process that plays to a small model's strengths and works around its limits.

Do this: Adopt a mental rule before the next module: a local model is a worker you supervise in a loop, not an oracle you query once. Hold that distinction and the techniques below will feel natural.

Module 2: What Intelligence Per Token Means

2.1 Not All Tokens Carry the Same Intelligence

The core concept is intelligence per token. A language model produces text one token at a time, and the central insight is that not every token is equally smart. A token generated by a small model carries less reasoning, less context, and less reliability than a token from a frontier model trained at vastly larger scale. They look identical on the page — both are just words — but the density of "intelligence" packed into each one is different.

Once you accept that, the strategy follows logically. If each individual token from your local model is worth less, you can compensate by spending more of them — using a larger number of cheaper tokens to reach a result that a frontier model could reach with fewer, denser ones.

Do this: Stop measuring a model only by "is it smart enough?" and start asking "can I reach the result by spending more tokens in a smart structure?" For most everyday tasks, the answer is yes.

2.2 Trading More Tokens for Less Intelligence-Per-Token

Spending more tokens is not the same as rambling. The trade only pays off if those extra tokens are organized. Telling a weak model to "think harder" or generating ten redundant drafts doesn't close the gap — it just produces more mediocre output. What closes the gap is structure: breaking the work into stages, checking each stage, and feeding the results forward.

The cost is time. A local model working through a structured loop will take longer than a frontier model answering in one shot. That's the deal you're accepting — more wall-clock time and more tokens in exchange for not needing a trillion-parameter cloud model. For a huge share of real tasks, that's a trade worth making, especially when the local stack is free and runs on hardware you already own.

Do this: Budget time, not just tokens. If a task would take a frontier model thirty seconds, expect the local loop to take several minutes. Plan for it instead of being surprised by it.

2.3 "PhD-Level Idiots" — What Frontier Models Actually Are

Kudryk offers a sharp framing for the whole intelligence question: even frontier models are, in his words, "PhD-level idiots." They hold an enormous amount of knowledge but have intelligence that is, in many situations, closer to an idiot's — they get lost, lose the thread, and need correction. With that much knowledge, they can still accomplish a great deal, and in narrow domains like coding they perform much better than that label suggests. But the deeper point is that you are the intelligence in the system.

This matters because it sets the right expectation for local models too. You are not waiting for the model to be brilliant. You are supplying the judgment, the structure, and the verification. The model supplies knowledge and labor. Internalize that and you stop over-trusting any model — local or frontier.

Do this: In every loop, assume the model has knowledge but lacks judgment. Keep the judgment — what to build, what counts as done, what's actually broken — firmly in your own hands.

Module 3: The 5-Step Agentic Loop

3.1 The Loop, Not the Prompt

The heart of this method is a five-step loop you run instead of a single prompt: specify, implement narrowly, review adversarially, patch narrowly, verify deterministically. It is not "one shot." It is not even "one shot, then fix." It's a disciplined cycle where each step has a job and a stopping condition. The next five lessons take each step in turn.

Do this: Write these five words on a sticky note: specify, implement, review, patch, verify. They are the skeleton of every local-model build you'll do.
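As a rough sketch of how the steps fit together for a single block, the cycle can be written as a small driver. It assumes the spec is already agreed. `LoopSteps`, `run_block`, and the stub functions are hypothetical names made up for illustration, not part of any harness; in practice each callable would wrap a model call or a real test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopSteps:
    """Hypothetical hooks; each would wrap a model call or a real test."""
    implement: Callable[[str], str]           # block description -> code
    review: Callable[[str, str], list[str]]   # (block, code) -> findings
    patch: Callable[[str, str], str]          # (code, one finding) -> code
    verify: Callable[[str], bool]             # code -> pass/fail


def run_block(block: str, steps: LoopSteps, max_rounds: int = 3) -> tuple[str, bool]:
    """Implement one block, then review, patch, and verify until it passes."""
    code = steps.implement(block)
    for _ in range(max_rounds):
        for finding in steps.review(block, code):
            code = steps.patch(code, finding)  # one narrow fix per finding
        if steps.verify(code):
            return code, True                  # stopping condition: verified
    return code, False                         # rounds exhausted: hand back to a human


# Stub steps so the sketch runs without a model.
stubs = LoopSteps(
    implement=lambda block: f"{block}: draft",
    review=lambda block, code: [] if "restart" in code else ["no restart"],
    patch=lambda code, finding: code + " + restart",
    verify=lambda code: "restart" in code,
)

result = run_block("game over screen", stubs)
print(result)
```

The `max_rounds` cap is an assumption added here so the loop always terminates; a block that never verifies is returned with `False` rather than retried forever.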

3.2 Step 1 — Specify Before You Build

Before any code is written, you and the model agree on what you're building. You write the specification together: what the software is, what elements it must contain, how it should behave. No code yet — just a shared, explicit agreement on the target. This step is cheap and it prevents the most expensive failure mode, which is a model confidently building the wrong thing.

A specification also gives every later step something to check against. "Review adversarially" and "verify deterministically" are only meaningful if there's an agreed definition of done to compare to.

Do this: Open every project by asking the model to draft a short spec and not to write code. Read it, correct it, approve it. Only then move on.
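One way to make the approved spec something later steps can check against is to hold it as data with an explicit approval gate. This is a minimal sketch under assumptions: the `Element` and `Spec` classes, their fields, and the example "pressing R" criterion are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    name: str        # what the element is
    behavior: str    # how it should behave
    done_when: str   # observable condition that review and verify compare against


@dataclass
class Spec:
    title: str
    elements: list[Element] = field(default_factory=list)
    approved: bool = False   # set by a human after reading and correcting the draft

    def require_approved(self) -> None:
        """Gate: refuse to start implementing until the spec has been approved."""
        if not self.approved:
            raise RuntimeError(f"Spec '{self.title}' is not approved; no code yet.")


spec = Spec("Tetris", [
    # Hypothetical element; the acceptance condition is an illustrative choice.
    Element("restart", "start a fresh game after game over",
            "pressing R resets the board and the score"),
])
spec.approved = True       # the human sign-off
spec.require_approved()    # passes silently once approved
```

Calling `require_approved()` at the start of the implementation step turns "only then move on" into something the tooling enforces rather than something you have to remember.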

3.3 Step 2 — Implement Narrowly

Now you build — but in pieces, not all at once. Divide the work into small elements and have the model implement them one block at a time. A small model handles a single, well-scoped element far more reliably than a whole interconnected system. Narrow implementation keeps each chunk inside the model's effective working capacity.

This is the direct antidote to the one-shot failure from Module 1. The Tetris that broke on a single prompt becomes a Tetris built as discrete parts: board, piece movement, rotation, line-clearing, scoring, restart — each implemented and checked before the next.

Do this: Never ask for the whole thing. Slice the spec into the smallest sensible blocks and request them individually.
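Slicing can be as simple as generating one request per block. The sketch below uses the Tetris parts listed above; the function name `block_prompts` and the exact prompt wording are assumptions for illustration, not a tested prompt.

```python
def block_prompts(spec_summary: str, blocks: list[str]) -> list[str]:
    """Return one narrowly scoped implementation prompt per block, in order."""
    prompts = []
    for index, block in enumerate(blocks, start=1):
        prompts.append(
            f"Spec:\n{spec_summary}\n\n"
            f"Implement block {index} of {len(blocks)}: {block}.\n"
            "Implement only this block. Do not write or change any other block."
        )
    return prompts


tetris_blocks = ["board", "piece movement", "rotation",
                 "line-clearing", "scoring", "restart"]
prompts = block_prompts("A playable Tetris clone.", tetris_blocks)
print(prompts[0])
```

Each prompt carries the whole spec for context but asks for exactly one block, so the model never sees a request for the whole game.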

3.4 Step 3 — Review Adversarially

After a block is implemented, don't assume it's done — go looking for what's wrong. Reviewing adversarially means actively hunting for breakage: what doesn't work, what's missing, and whether what you agreed on in the spec was actually implemented. The mindset is skeptical by design. You're not admiring the output; you're trying to break it.

This step is where a second model earns its keep (more on that in Module 4), but even a single model prompted to critique its own work against the spec will surface problems that a "looks good" glance never would.

Do this: Phrase the review as an attack: "Find everything broken or missing in this block, and check it against the spec." Default to assuming something is wrong until proven otherwise.
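A review is easier to act on when the reviewer's reply comes back as a list. The sketch below builds the attack-style prompt and parses a reply into individual findings. The bullet-per-finding reply format and the `NO FINDINGS` marker are assumptions introduced here; a real model may need firmer instructions to follow them.

```python
def build_review_prompt(spec: str, block: str, code: str) -> str:
    """Frame the review as an attack on one block, checked against the spec."""
    return (
        "You are a skeptical reviewer. Assume something is wrong.\n"
        f"Spec:\n{spec}\n\nBlock under review: {block}\n\nCode:\n{code}\n\n"
        "Find everything broken or missing in this block, "
        "and check it against the spec.\n"
        "Reply with one finding per line, each starting with '- '. "
        "Reply with 'NO FINDINGS' only if you could not break it."
    )


def parse_findings(reply: str) -> list[str]:
    """Keep only the '- ' bullet lines from the reply; ignore everything else."""
    findings = []
    for line in reply.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            findings.append(stripped[2:].strip())
    return findings


# Hypothetical reviewer reply, used only to exercise the parser.
reply = (
    "Mostly plausible, but:\n"
    "- no restart after game over\n"
    "- rotation next to the wall raises IndexError\n"
)
findings = parse_findings(reply)
print(findings)
```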

3.5 Step 4 — Patch Narrowly

When the review surfaces problems, fix them one at a time — not the whole thing at once. Patching narrowly mirrors implementing narrowly: a small model applying a single targeted fix is far more reliable than one asked to "fix everything." Broad rewrites reintroduce bugs and undo working code. Narrow patches keep the working parts working.

Do this: Take the review's findings as a list and address them individually. One fix, then re-check, then the next. Resist the "just regenerate the whole file" shortcut.
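The "one fix, then re-check, then the next" rhythm can be sketched as a loop over the findings. Everything here is illustrative: `patch_narrowly`, `demo_patch`, and `still_works` are hypothetical names, and discarding a patch that fails the re-check is an added assumption about how to keep working parts working.

```python
from typing import Callable


def patch_narrowly(
    code: str,
    findings: list[str],
    patch: Callable[[str, str], str],   # (code, one finding) -> patched code
    check: Callable[[str], bool],       # re-check that the block still works
) -> tuple[str, list[str]]:
    """Apply one fix per finding; keep a fix only if the re-check still passes."""
    unresolved = []
    for finding in findings:
        candidate = patch(code, finding)
        if check(candidate):
            code = candidate             # accept the narrow fix
        else:
            unresolved.append(finding)   # reject it; the last good version is kept
    return code, unresolved


def demo_patch(code: str, finding: str) -> str:
    """Stub patcher standing in for a model call."""
    if finding == "rewrite everything":
        return "# regenerated file"      # a broad rewrite that loses working code
    return code + f"\n# fixed: {finding}"


def still_works(code: str) -> bool:
    """Stub re-check: the movement function must survive every patch."""
    return "def move" in code


patched, unresolved = patch_narrowly(
    "def move(): ...", ["no restart", "rewrite everything"], demo_patch, still_works
)
print(patched)
print(unresolved)
```

In the stub run, the targeted fix is accepted and the broad rewrite is rejected because it drops the working `move` function.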

3.6 Step 5 — Verify Deterministically

The final and most important step: prove it works, don't eyeball it. A model saying "the code looks correct" is not verification — Kudryk is blunt that "it looks good" is not an answer. Deterministic verification means actually running the thing. Open the browser, load the game, play it. Does the piece move? Can you restart? Does the score update? You check observable, repeatable facts, not the model's opinion of its own output.

Deterministic verification is what closes the loop. If verification fails, you go back to review and patch. If it passes, the block is genuinely done. This is the step that separates a working agentic loop from a pile of plausible-looking code.

Do this: Define a concrete pass/fail test for every block before you build it, and actually execute it. "It compiles" is not verification. "I loaded it and the restart button works" is.
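The checks described above are manual: load the game and play it. Where a block's logic can be called directly, the same pass/fail questions can also be written as executable checks. The sketch below is that automated variant, not the method from the video; the tiny game-state functions and the 100-point line score are hypothetical stand-ins for a real block.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run every named check; an exception counts as a failure, not a crash."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


# Hypothetical block under test: score and restart logic for a small game.
def new_game() -> dict:
    return {"score": 0, "over": False}


def clear_line(state: dict) -> dict:
    return {**state, "score": state["score"] + 100}


def restart(state: dict) -> dict:
    return new_game()


checks = {
    "score updates after a cleared line":
        lambda: clear_line(new_game())["score"] == 100,
    "restart resets the game":
        lambda: restart({"score": 300, "over": True}) == new_game(),
}
results = run_checks(checks)
passed = all(results.values())
print(results, passed)
```

Because each check is a repeatable fact about behavior, the result is the same on every run, which is what distinguishes it from asking the model whether the code looks correct.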

Module 4: Practical Examples with Local Models

4.1 Choosing a Local Model and Hardware

The loop needs a model and a machine to run it. Two strong open-weight options as of mid-2026 are Qwen3.6-35B-A3B (Apache 2.0, released April 2026) and Gemma 4's 26B A4B (released spring 2026). Both are Mixture-of-Experts models: Qwen3.6 has 35B total parameters but only ~3B active per token, and Gemma 4 26B activates roughly 4B of its total. That sparsity is the trick — they reason like much larger models but run at the speed of much smaller ones, around 30–60 tokens per second on capable consumer hardware.

For hardware, Kudryk runs on an ASUS Ascent GX10, a desk-side machine built on NVIDIA's GB10 Grace Blackwell chip with 128GB of unified memory — the same reference design as NVIDIA's DGX Spark. You don't need that exact box; the point is that 128GB of unified memory lets you load large models comfortably. A model needs to stay above roughly 30 tokens per second to feel interactive rather than frustrating.

Do this: Match the model to your hardware first. If you can't run a 35B MoE at usable speed, drop to a smaller Gemma 4 variant. A fast small model in a good loop beats a slow large one.
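To see what those speeds mean for a loop, a back-of-the-envelope estimate helps. The sketch below divides an assumed number of generated tokens by the decode speed. The token counts per step are invented for illustration, and the estimate ignores prompt processing and the time you spend reading and testing.

```python
def loop_seconds(tokens_by_step: dict[str, int], tokens_per_second: float) -> float:
    """Generation time only: total output tokens divided by decode speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return sum(tokens_by_step.values()) / tokens_per_second


# Hypothetical output-token counts for one pass through the loop on one block.
one_pass = {"implement": 3000, "review": 1200, "patch": 1200}

for speed in (30, 60):   # the range quoted above for capable consumer hardware
    minutes = loop_seconds(one_pass, speed) / 60
    print(f"{speed} tok/s -> {minutes:.1f} min")
# prints "30 tok/s -> 3.0 min" and "60 tok/s -> 1.5 min"
```

Under these assumed counts, a single pass over one block costs a few minutes of generation at 30 tokens per second, and a project with several blocks multiplies that.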

4.2 The External Verifier Trick

One of the most useful techniques is using a second, smarter model as a watchdog. Kudryk had a frontier model (a Codex-class GPT model) watch what his local Qwen model was doing while it coded — an external verifier observing the work and flagging errors, gaps, and bad strategy in real time. The local model does the bulk labor; the frontier model audits.

This is the "review adversarially" step, supercharged. The frontier model's denser intelligence-per-token is spent only where it's most valuable — judging quality — rather than doing all the work. Over a few sessions of this, Kudryk distilled the watchdog's feedback into a reusable protocol that let the local model work at a higher level on its own.

Do this: If you have access to a frontier model, don't use it to build — use it to review. Let the cheap local model produce, and spend frontier tokens only on verification and strategy.

4.3 Awareness — Giving Small Models Your Context

A small model knows less than a frontier one, but you can hand it the missing context. Kudryk calls this awareness: sharing the data the model needs to operate in your specific world. For strategy and writing tasks, that means your own material — your goals, your values, your past decisions, even your mistakes — so the output is aligned with you rather than a generic average. He's careful to note this only applies to judgment-heavy work; for pure coding, your personal values are noise, and what you supply instead is technical context and research.

The principle generalizes: when a small model lacks knowledge for a task, the fix is often to give it the knowledge rather than reach for a bigger model. Awareness can close gaps that more parameters would otherwise be needed to fill.

Do this: Before a strategy or writing task, assemble a short context pack — your goals, constraints, and relevant background — and feed it in. For coding tasks, swap that for technical docs and prior art.
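A context pack can be assembled mechanically once you have the raw material. The sketch below picks personal context for strategy and writing tasks and technical context for coding tasks, following the split described above. The function name, the task-kind labels, and the section layout are assumptions for illustration.

```python
def build_context_pack(task_kind: str,
                       personal: dict[str, str],
                       technical: dict[str, str]) -> str:
    """Pick the context that fits the task and format it as a prompt preamble."""
    if task_kind in ("strategy", "writing"):
        source = personal     # goals, constraints, relevant background
    elif task_kind == "coding":
        source = technical    # technical docs and prior art; personal values are noise
    else:
        raise ValueError(f"unknown task kind: {task_kind!r}")
    sections = [f"## {title}\n{body.strip()}"
                for title, body in source.items() if body.strip()]
    return "\n\n".join(sections)


# Hypothetical material, only to show the shape of the output.
personal = {"Goals": "Ship a weekly newsletter.",
            "Constraints": "Two hours per issue."}
technical = {"Rendering notes": "The game draws on a 2D canvas each frame."}

print(build_context_pack("writing", personal, technical))
print(build_context_pack("coding", personal, technical))
```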

Module 5: Building Your First Local Agent Loop

5.1 The Tools You Need

A working local loop has three roles: a harness that gives the model tools and runs the loop, a model to do the work, and a verifier to check it. For the harness, open-source options include Hermes Agent from Nous Research (MIT-licensed, model-agnostic, with built-in tools, sub-agents, and persistent memory) and OpenCode, an open-source terminal coding agent that runs any model you point it at. For the model, use one of the MoE options from Lesson 4.1. For the verifier, use either a second model or, better, deterministic tests you run yourself.

These pieces are all free and open source. The harness is what turns a chat model into something that can take action: run commands, edit files, open a browser, and loop until a task is actually done.

Do this: Pick one harness and install it before writing any prompts. The harness is the part that makes the loop real; without it you're back to one-shot chatting.

5.2 Research Before You Build — Reuse, Don't Reinvent

A powerful move that runs before the loop: have the model do research first. For a coding task, that means finding existing open-source code — ideally permissively licensed, like MIT — and assembling from proven building blocks instead of writing everything from scratch. Battle-tested code used by thousands of people is more reliable than anything a model writes fresh, and reusing it saves both time and tokens.

The same applies to knowledge tasks: have the model research the topic to build the awareness it needs before producing output. Kudryk frames it as giving the model the lay of the land first, so the actual work starts from understanding rather than guessing. This is the best strategy even for frontier models — there's no reason to reinvent the wheel when a solid one already exists.

Do this: Add a research step before "specify." Ask the model to find existing, well-licensed solutions and summarize how they work, then build from those rather than from a blank page.

5.3 A Worked Walkthrough — Your First Loop

Put it together on a small project. Pick something concrete — say, a Snake game. Research: ask the model to look up how Snake works and find reference implementations. Specify: agree on the elements — grid, snake movement, food, growth, collision, score, restart. Implement narrowly: build the grid first, alone. Review adversarially: hunt for what's broken in just that block. Patch narrowly: fix one issue at a time. Verify deterministically: run it and confirm the grid renders correctly. Then repeat the implement-review-patch-verify cycle for movement, then food, then collision, and so on.

By the end you'll have a working game built the way a small model can actually build — and you'll have felt why each step exists. The same skeleton scales to real software, with the caveat that bigger systems need real architecture, not just more loops.

Do this: Build one full small project end-to-end with the loop before applying it to real work. The muscle memory matters more than the project.
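The walkthrough order can be written down as an explicit checklist before you start. This is a minimal sketch: `build_plan` is a hypothetical helper, and the block list is the Snake element list from the walkthrough above.

```python
def build_plan(blocks: list[str]) -> list[str]:
    """Research and specify once, then run the four inner steps per block, in order."""
    plan = ["research", "specify"]
    for block in blocks:
        for step in ("implement", "review", "patch", "verify"):
            plan.append(f"{step}: {block}")
    return plan


snake_blocks = ["grid", "snake movement", "food", "growth",
                "collision", "score", "restart"]
plan = build_plan(snake_blocks)
for line in plan[:6]:
    print(line)
```

Seven blocks produce a 30-item checklist: two one-time steps plus four steps per block. Seeing that number up front is a useful reminder of the time trade the loop asks for.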

Module 6: Local vs Frontier Workflows

6.1 When to Use Local vs Frontier

The honest position is not "local replaces frontier." There's still a real intelligence gap — frontier models represent trillions of parameters and billions in training investment against a local model's tens of billions of parameters. For genuinely novel, high-stakes, or deeply complex problems, that gap matters and frontier models still win.

But for the large majority of everyday tasks — a routine email, a summary, a standard utility, a common automation — a local model in a good loop is more than enough. The skill is knowing which bucket a task falls into.

Do this: Before starting, classify the task: is this a common job where "good enough" is genuinely good enough, or a rare, high-ceiling problem? Route the first kind to local, the second to frontier.
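The classification can be made explicit with a few yes/no questions. The sketch below routes a task to frontier only when one of the escalation reasons named above applies. The `Task` fields and the `route` function are hypothetical, and real tasks will need more judgment than three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    novel: bool = False            # genuinely new problem, little prior art
    high_stakes: bool = False      # a wrong answer is costly
    deeply_complex: bool = False   # large, tightly interconnected problem


def route(task: Task) -> str:
    """Return 'frontier' when any escalation reason applies, otherwise 'local'."""
    if task.novel or task.high_stakes or task.deeply_complex:
        return "frontier"
    return "local"


for task in (Task("routine email"),
             Task("standard utility script"),
             Task("new scheduling algorithm", novel=True)):
    print(f"{task.name}: {route(task)}")
```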

6.2 The "Good Enough" Threshold

Kudryk uses an audio analogy: local versus frontier is like MP3 versus high-resolution audio. For background music, an MP3 is perfect — most people can't even hear the difference, and the task doesn't demand it. For critical listening on good equipment, the difference is real and audible. AI is the same. Writing a quick reply or a routine summary doesn't need frontier-grade output; you just need it done. Reserve the high-resolution option for the moments that actually warrant it.

The trap is feeling like you always need the best. That feeling is usually wrong, and acting on it wastes money and time on tasks where good enough was, in fact, good enough.

Do this: Default to local for routine work and only escalate to frontier when you can articulate a concrete reason the extra quality matters for this specific task.

6.3 A Hybrid Strategy You Can Adopt Today

The practical end state is hybrid. Kudryk runs local models daily and still keeps frontier subscriptions — using each where it's strongest. A common split: let the cheap local model do the bulk implementation, and spend scarce frontier tokens on the high-leverage parts — strategy, architecture, and verification. The external-verifier trick from Module 4 is exactly this hybrid in miniature.

There's also a timing argument: frontier models are currently subsidized, priced below what they cost to run, because providers want your usage and data. That makes now an unusually good moment to spend those cheap frontier tokens building durable infrastructure — scaffolding, protocols, reusable skills — that your local models can then operate inside cheaply and privately for the long term.

Do this: Set up a deliberate split. Decide which parts of your workflow run local and which run frontier, and use the frontier budget on judgment-heavy steps rather than bulk labor.

Source

Based on Igor Kudryk's video, "Your Local AI is 'Stupid' Because You're Using it Like ChatGPT" (https://www.youtube.com/watch?v=NC2mE7C4s2c). Model and tool details verified against primary sources: Qwen3.6-35B-A3B, Gemma 4 26B A4B, Hermes Agent (Nous Research), OpenCode, NVIDIA DGX Spark, and OpenAI Whisper.