The bottleneck for AI-written quantum code isn't intelligence — it's documentation.

Experiment · 2026-02-10 · AI x Quantum Research Team

Can AI Write Quantum Code? We Tested 151 Tasks and Then Gave It the Manual

From 63% to 80%. The bottleneck isn't intelligence — it's documentation.

benchmark · RAG · Context7 · Qiskit · LLM · Gemini · quantum coding · API staleness

The Question

Quantum programs are short but dense — a handful of gates on a handful of qubits can encode algorithms that no classical computer can efficiently simulate. Writing them correctly requires knowing both the physics and the SDK. Can today's AI models do it?

We tested this with the Qiskit HumanEval benchmark: 151 quantum programming tasks, each graded by automated code execution. No partial credit. The model gets a function signature and docstring, writes the body, and the code either passes the test suite or it doesn't.
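The grading loop is essentially exec-and-assert: define the function from the prompt plus the model's completion, then run the tests. A minimal sketch of that idea (the task, completion, and tests below are hypothetical stand-ins, not actual benchmark items or harness code):

```python
# Minimal sketch of a pass/fail grader in the Qiskit HumanEval style.
# Task, completion, and tests are hypothetical stand-ins.

TASK_PROMPT = '''
def bell_counts_sum(shots: int) -> int:
    """Return the total number of counts across `shots` runs."""
'''

# A model "completion": the function body appended after the docstring.
COMPLETION = """
    return shots  # trivially correct for this toy task
"""

TEST_CODE = """
assert bell_counts_sum(1024) == 1024
assert bell_counts_sum(1) == 1
"""

def grade(prompt: str, completion: str, test_code: str) -> bool:
    """Execute prompt + completion, then the tests. Pass/fail, no partial credit."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)   # define the function
        exec(test_code, namespace)             # run the test suite
        return True
    except Exception:
        return False

print(grade(TASK_PROMPT, COMPLETION, TEST_CODE))  # → True
```

Any exception — syntax error, wrong answer, missing import — counts as a failure, which is what makes the error messages themselves a useful signal later.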

We ran two frontier models — Claude Opus 4.6 and Gemini 3 Flash — with no retrieval, no chain-of-thought, no retries. Just the raw prompt and one shot. Then we gave them access to up-to-date documentation and measured what changed.

The headline: 63% baseline → 71% with documentation → 80% with a multi-model ensemble. The failures tell the more interesting story.

Baseline Results

Model              Pass@1           Basic (79)   Intermediate (67)   Difficult (5)
Claude Opus 4.6    63.6% (96/151)   67.1%        62.7%               20%
Gemini 3 Flash     62.3% (94/151)   65.8%        61.2%               20%

Both models are within 1.4 percentage points of each other. The basic-to-intermediate drop is surprisingly small — these models don't just know simple gate sequences; they can construct meaningful quantum algorithms. The cliff happens at "difficult" tasks requiring multi-step reasoning with precise API calls.

For context: QUASAR (agentic RL, 4B params) achieves 99% circuit validity, but validity is a much looser metric than our functional correctness. QCoder with o3 reached 78% on a related benchmark, versus 40% for human experts.

Why They Fail

Of Gemini's 57 baseline failures, we classified every error:

Error Type                             Count   What It Means
Wrong answer (real quantum mistakes)   13      Code runs, logic is wrong
Syntax errors                          11      Malformed Python
Deprecated Qiskit API calls            18      SamplerV2, wrong methods, missing imports
Account/runtime                        6       Trying to use IBM Runtime (requires auth)
Other                                  9       Misc runtime failures
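Classification of this kind can be largely keyed on the exception text each failure produced. A rough sketch of the idea (the categories and regex patterns here are illustrative, not the exact rules behind the table above):

```python
import re

# Illustrative patterns only -- not the exact classification rules used above.
CATEGORIES = [
    ("syntax", [r"SyntaxError", r"IndentationError"]),
    ("api_staleness", [r"has no attribute", r"cannot import name",
                       r"ImportError", r"DeprecationWarning"]),
    ("account_runtime", [r"IBMRuntimeError", r"AccountNotFoundError", r"credentials"]),
    ("wrong_answer", [r"AssertionError"]),
]

def classify(error_message: str) -> str:
    """Map a raw error message to a coarse failure category."""
    for category, patterns in CATEGORIES:
        if any(re.search(p, error_message) for p in patterns):
            return category
    return "other"

print(classify("AttributeError: 'Sampler' object has no attribute 'run'"))  # api_staleness
print(classify("AssertionError: expected {'00': 512, '11': 512}"))          # wrong_answer
```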

Only 23% of failures (13/57) are genuine quantum mistakes. The dominant failure mode is API staleness: the models were trained on Qiskit 1.x, but the benchmark runs on Qiskit 2.x, which introduced major breaking changes. The models understand quantum computing but generate code for an API that no longer exists.

The Fix: Give Them the Manual

We tested two documentation strategies. The first — a static 335-line cheatsheet covering every Qiskit 2.x breaking change, prepended to every prompt — did nothing. Same pass rate, 17x more tokens. Dumping a comprehensive migration guide on every task just adds noise.

The second strategy worked. Context7 dynamically retrieves only the documentation relevant to each specific task. Instead of "here's everything that changed," it's "here's how SamplerV2 results work" — exactly when the model needs it.
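The difference between the two strategies is what lands in the prompt. A toy sketch of per-task retrieval (the doc index and overlap scoring here are hypothetical stand-ins for what Context7 does):

```python
import re

# Toy doc index: hypothetical stand-in for Context7's per-library documentation.
DOC_SNIPPETS = {
    "sampler_v2": "SamplerV2: to get counts, use result[0].data.<register>.get_counts()",
    "transpile": "Use generate_preset_pass_manager(...) for transpilation in 2.x",
    "runtime_auth": "QiskitRuntimeService requires a saved account token",
}

def retrieve(task_text: str, k: int = 1) -> list[str]:
    """Score each snippet by word overlap with the task; return the top k."""
    task_words = set(re.findall(r"[a-z0-9_]+", task_text.lower()))
    scored = sorted(
        DOC_SNIPPETS.values(),
        key=lambda snip: len(task_words & set(re.findall(r"[a-z0-9_]+", snip.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(task_text: str) -> str:
    """Prepend only the retrieved snippets, not a full migration guide."""
    docs = "\n".join(retrieve(task_text))
    return f"# Relevant docs:\n{docs}\n\n# Task:\n{task_text}"

print(build_prompt("Run a circuit with SamplerV2 and return the counts."))
```

The cheatsheet strategy is this same function with the entire index concatenated in every time; the pass rate suggests the model mostly ignores it.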

Configuration                       Pass@1    Change
Baseline (no docs)                  62–64%    —
+ Static cheatsheet                 62%       +0pp
+ Context7 (dynamic RAG)            68–71%    +11–14% relative
3-run ensemble (2×Gemini + Opus)    79.5%     +25–28% relative

Dynamic retrieval targets the exact API the model is struggling with. The improvement is concentrated in basic and intermediate tasks — exactly those most likely to fail from a wrong import path or deprecated method call. Difficult tasks, which require multi-step algorithmic reasoning, don't improve. RAG helps with API recall, not quantum reasoning.

RAG isn't purely additive — a few tasks regress when retrieved docs steer the model away from a correct answer it already had. Run-to-run variance is roughly ±3pp even at temperature=0, due to sampling nondeterminism and minor differences in retrieved snippets.
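The ensemble number counts a task as solved if any of the three runs passes it. Given per-run results (format simplified here from the real JSON files), the computation is just a set union:

```python
# Simplified per-task results from three runs (task_id -> passed).
# Real result files also carry error messages, generated code, and token counts.
runs = [
    {"t1": True,  "t2": False, "t3": True,  "t4": False},  # e.g. Gemini run 1
    {"t1": True,  "t2": True,  "t3": False, "t4": False},  # e.g. Gemini run 2
    {"t1": False, "t2": True,  "t3": True,  "t4": False},  # e.g. Opus
]

def ensemble_pass_rate(runs: list[dict]) -> float:
    """A task passes the ensemble if any single run passes it."""
    tasks = runs[0].keys()
    solved = {t for t in tasks if any(r[t] for r in runs)}
    return len(solved) / len(tasks)

print(f"{ensemble_pass_rate(runs):.1%}")  # → 75.0%
```

Because the per-run variance is real, the union across runs recovers tasks each individual run drops, which is where the +25–28% relative gain comes from.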

What's Still Broken

Across the two best Context7 runs (Opus and Gemini run 1, both at 107/151), 34 tasks fail for both models — neither can solve them even with documentation. These are the hard floor:

Failure Mode                             Count   %
Logic/algorithm error                    14      41%
API staleness (uncovered by Context7)    9       26%
Other runtime errors                     11      32%

At the hard floor, the dominant failure flips: it's no longer stale documentation but genuine quantum reasoning errors. The models produce incorrect circuits or misinterpret the task requirements. The remaining 9 API failures are edge cases that Context7's index simply doesn't cover yet — gaps in documentation, not model capability.

What This Means

  • AI can already write most quantum code. 80% of 151 tasks are solvable with current frontier models, current documentation tools, and a simple multi-run strategy. No fine-tuning required.
  • SDK stability matters for AI. Qiskit's breaking changes between 1.x and 2.x created a "knowledge wall" that even frontier models can't cross without external help. SDKs designed with stable interfaces and machine-readable migration guides will get better AI-generated code.
  • Retrieval precision beats retrieval volume. A comprehensive cheatsheet did nothing; targeted per-task documentation retrieval improved scores by 11–14%. For fast-evolving domains, dynamic RAG should be standard infrastructure.

What's Next

The 79.5% ensemble ceiling comes from single-shot generation — no retries, no error feedback. About half the hard-floor failures produce assertion errors with informative messages. Letting the model see the error and try again is how developers actually work, and it's the basis for our next benchmark: an agentic evaluation where the model gets multiple attempts and tool access.
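The agentic loop we have in mind is the standard generate-run-repair cycle. A minimal sketch with a stubbed model (`generate_fn` stands in for a real LLM call; the stub "fixes" its code once it sees the error message):

```python
# Minimal generate-run-repair loop. `generate_fn` stands in for a real LLM call.
def attempt_with_feedback(generate_fn, test_code: str, max_attempts: int = 3):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate_fn(feedback)
        namespace: dict = {}
        try:
            exec(code, namespace)
            exec(test_code, namespace)
            return attempt  # solved on this attempt
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"  # shown to the model next round
    return None  # unsolved within the budget

def stub_model(feedback: str) -> str:
    # First try is wrong; once an AssertionError is visible, emit the fix.
    if "AssertionError" in feedback:
        return "def double(x):\n    return 2 * x\n"
    return "def double(x):\n    return x\n"

print(attempt_with_feedback(stub_model, "assert double(3) == 6"))  # → 2
```

The interesting measurement is how many of the 34 hard-floor tasks a real model recovers once it can read its own assertion errors.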

Raw Data

Every result file is a JSON with per-task pass/fail, error messages, generated code, and token counts. The canonical runs cited in this post:

Configuration                Result File
Claude Opus 4.6 baseline     claude-opus-4-6_20260210_110315
Gemini 3 Flash baseline      gemini-3-flash_20260209_234106
Gemini + Static RAG          rag_gemini-3-flash_20260210_144039
Gemini + Context7 (run 1)    rag_context7_gemini-3-flash_20260210_163102
Opus + Context7              rag_context7_claude-opus-4-6_20260210_184650
Gemini + Context7 (run 2)    rag_context7_gemini-3-flash_20260210_204214

All benchmark code and results: github.com/JDerekLomas/quantuminspire/tree/main/benchmark_results

Sources & References