Subquadratic Launches SubQ, a 12M-Token Model With ~1,000x Less Attention Compute

By Hiram Clark, vybecoding.ai
June 20, 2026 · 5 min read

Subquadratic, a 13-person Miami startup that left stealth in early May 2026, is asking developers to take a set of numbers on faith, because so far it is the only party that has measured them. Its model, SubQ, pairs a new attention design called SSA with a context window of 12 million tokens (about 50 average novels in a single prompt), and the company says that combination needs roughly 1,000 times less attention compute than a standard transformer would at that length. The figures are striking. They are also almost entirely self-reported.

This article separates what is well-corroborated across multiple outlets, what is a single-source company claim, and what remains an open question no outside lab has answered yet.

What SSA actually changes

A standard transformer uses "dense" attention: every token compares itself against every other token. That is the famous quadratic cost — double the context, quadruple the work. It is the reason most frontier models cap context around 128K tokens, with 1M as the current high-water mark.
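To make the quadratic cost concrete, the short sketch below counts token-to-token comparisons for dense attention at a few context lengths. It is a back-of-the-envelope illustration only: it assumes one comparison per ordered pair of tokens and ignores heads, layers, and causal masking, and the function name is made up for this example.

```python
def dense_attention_pairs(context_tokens: int) -> int:
    """Comparisons needed when every token attends to every token."""
    return context_tokens * context_tokens


# Compare a few context lengths against a 128K-token baseline.
baseline = dense_attention_pairs(128_000)
for n in (128_000, 256_000, 1_000_000, 12_000_000):
    pairs = dense_attention_pairs(n)
    print(f"{n:>12,} tokens: {pairs:.3e} comparisons, "
          f"{pairs / baseline:,.1f}x the 128K cost")
```

Doubling the input from 128K to 256K tokens quadruples the count, which is the scaling behaviour the paragraph above describes.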

SSA — which Subquadratic and The New Stack expand as "Subquadratic Selective Attention," though some coverage (36Kr, DataCamp) calls it "Sparse Attention" — instead picks the positions in the sequence that matter for the current token and ignores the rest. Subquadratic describes it as skipping roughly 99% of attention interactions through content-based selection, so cost grows close to linearly with context length rather than quadratically.
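The idea of content-based selection can be illustrated with a generic top-k sparse attention toy for a single query. This is not Subquadratic's method, since the coverage summarized here does not say how SSA chooses positions; every function name below is hypothetical. The toy also scores every key before picking the top k, so it shows what "attend only to the selected positions" means without being subquadratic itself.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _scores(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Scaled dot-product score of the query against each key."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]


def _weighted_average(scores: List[float], values: List[Vector]) -> List[float]:
    """Softmax the scores, then average the value vectors with those weights."""
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def dense_attention(query: Vector, keys: Sequence[Vector],
                    values: Sequence[Vector]) -> List[float]:
    """Attend to every position."""
    return _weighted_average(_scores(query, keys), list(values))


def topk_selective_attention(query: Vector, keys: Sequence[Vector],
                             values: Sequence[Vector], k: int) -> List[float]:
    """Attend only to the k highest-scoring positions and ignore the rest."""
    scores = _scores(query, keys)
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return _weighted_average([scores[i] for i in keep],
                             [values[i] for i in keep])
```

With `k` equal to the number of keys, the selective version reduces to dense attention; with a small `k`, the softmax and weighted sum touch only the selected positions.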

The efficiency figures the company reports are internally consistent and have been repeated across several writeups:

  • Attention-compute reduction versus dense attention: about 62.5x at 1 million tokens, climbing to roughly 1,000x at 12 million tokens.
  • End-to-end speed versus dense attention with FlashAttention-2: about 52x faster at 1M tokens, 23x at 512K, 13.2x at 256K, and 7.2x at 128K.

Note the distinction the headline number hides: the ~1,000x figure is attention FLOP reduction at 12M tokens, while the ~52x figure is wall-clock speedup at 1M tokens. They describe different metrics at different scales, and conflating them is the easiest way to overstate the result.
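One way to keep the two metrics apart is to ask how quickly each multiplier grows with context length. The sketch below computes the log-log slope between consecutive reported points. The numbers are the company-reported figures quoted above; the helper name `growth_exponents` and the reading of "1M" as 1,000,000 tokens (and "128K" as 128,000) are assumptions made for illustration. A slope of 1.0 is what an exactly quadratic baseline compared against an exactly linear alternative would produce.

```python
import math
from typing import Dict, List

# Company-reported multipliers, keyed by context length in tokens (assumed decimal).
ATTENTION_COMPUTE_REDUCTION: Dict[int, float] = {1_000_000: 62.5, 12_000_000: 1000.0}
END_TO_END_SPEEDUP: Dict[int, float] = {
    128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.0,
}


def growth_exponents(figures: Dict[int, float]) -> List[float]:
    """Log-log slope between each pair of consecutive (length, multiplier) points."""
    points = sorted(figures.items())
    return [
        math.log(y2 / y1) / math.log(n2 / n1)
        for (n1, y1), (n2, y2) in zip(points, points[1:])
    ]


for label, figures in (
    ("attention-compute reduction", ATTENTION_COMPUTE_REDUCTION),
    ("end-to-end speedup", END_TO_END_SPEEDUP),
):
    slopes = ", ".join(f"{e:.2f}" for e in growth_exponents(figures))
    print(f"{label}: slope(s) {slopes}")
```

The two series need not share a slope: the first counts attention operations only, while the second is wall-clock time for the whole model, including everything outside attention.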

The benchmarks, and the gap nobody has explained

Subquadratic shipped three products at launch: a 12M-token API; SubQ Code, a command-line agent that loads an entire repository in one pass for what the company pitches as "whole-artifact reasoning" (planning across every file at once instead of over a few retrieved snippets); and, per some reports, a deep-research tool called SubQ Search. A third-party testing service reportedly confirmed several benchmark runs, but no independent research group has reproduced the architecture from scratch.

On the published numbers, the picture is mixed rather than dominant:

  • RULER 128K (long-context retrieval): SubQ around 95–97%, edging out Claude Opus 4.6 at ~94.8%.
  • SWE-Bench Verified (coding): SubQ ~81.8% — credible, but below Opus and GPT-5.5, which report high-80s.
  • MRCR v2 (multi-needle retrieval): this is where the story wobbles. The company quotes a lab score near 83, but DataCamp reports a production score of 65.9% — a roughly 17-point gap between the lab figure and the deployed model that, as of writing, has no public explanation.
  • At the full 12M-token length, Subquadratic claims "over 90%" on needle-in-a-haystack retrieval — a scale at which no competing frontier model has even been benchmarked, which means there is currently nothing to compare it against.

Cost, funding, and the things sources disagree on

The cost claim is the loudest: on the RULER 128K test, Subquadratic says a full run costs about $8 on SubQ versus roughly $2,600 on Opus, a ~300x gap. The accompanying "about 5% of Opus" framing does not follow from those two numbers ($8 is roughly 0.3% of $2,600), so it presumably describes a different comparison. Both are company-reported figures with no published per-token pricing behind them, and SubQ remains in private beta behind a waitlist, so independent buyers cannot yet check them.

Even the basic corporate facts vary by outlet, which is itself a useful signal about how early this is:

  • Funding: eWeek reports $25M in seed money, backed by former SoftBank Vision Fund partner Javier Villamizar and Tinder co-founder Justin Mateen; other coverage cites $29M and a ~$500M valuation.
  • Future context: the company says it is targeting a larger window by Q4 — reported as 50M tokens by The New Stack and DataCamp, but 100M by eWeek.
  • The team: widely reported as 13 people, including roughly 11 PhDs from Meta, Google, Oxford, Cambridge, and Adobe, with SubQ reportedly built on top of an open-source base model rather than trained from scratch — the innovation sits in the attention mechanism, not a new foundation model.

Why developers should hold the applause for one cycle

There is a direct precedent worth remembering. In 2024, Magic.dev raised around $500M on claims of a 100-million-token context window aimed at coding, and despite the funding it saw limited real-world adoption. Big context numbers have outrun shipped, verified utility before.

If SSA holds up under independent reproduction, the second-order effect is the interesting one: a model that can ingest a whole codebase or document corpus in one pass weakens the case for some retrieval pipelines that exist mainly to work around short context windows. But "weakens the case for some" is not "RAG is dead." Retrieval still does jobs long context does not (cross-session memory, access control, auditability, and live index updates among them), and that argument deserves its own treatment rather than a victory lap.
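Whether a given codebase would fit in one pass is easy to estimate roughly. The sketch below walks a directory and converts character counts to tokens using a 4-characters-per-token rule of thumb. That ratio, the default file extensions, and both function names are assumptions for illustration; a real tokenizer will give different counts, and nothing here reflects how SubQ Code actually loads a repository.

```python
import os
from typing import Sequence

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary


def estimate_tokens(root: str,
                    extensions: Sequence[str] = (".py", ".md", ".txt")) -> int:
    """Approximate token count of all matching text files under root."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(tuple(extensions)):
                path = os.path.join(dirpath, name)
                # Ignore undecodable bytes so one odd file does not abort the walk.
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_window(root: str, window_tokens: int = 12_000_000) -> bool:
    """True if the estimated token count is within the context window."""
    return estimate_tokens(root) <= window_tokens
```

Calling `fits_in_window(".")` from a repository root gives a first-order answer, which is useful mainly for deciding whether the one-pass approach is even in range before comparing it with a retrieval pipeline.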

For now, the honest summary is narrow and specific: a small team has published an attention design that, on its own three benchmarks, trades a little coding accuracy for a large efficiency and context-length win, with one unexplained lab-versus-production gap and zero outside reproduction. That is genuinely worth watching. It is not yet worth rewriting your architecture around.

Sources

  • The New Stack — "The context window has been shattered: Subquadratic debuts a 12-million-token window"
  • eWeek — "Subquadratic Launches SubQ, a 12M-Token AI Model for Long-Context Tasks"
  • DataCamp — "SubQ AI Explained: How Good Is the 12M Context Window LLM?"
  • 36Kr — "13 People Overthrow Transformer: New Architecture SSA Cuts Computing Power by a Thousand Times"
