Trust the platform that shows you what actually happened.

Opinion · 2026-02-12 · AI x Quantum Research Team

Six Things We Learned Running 50+ Experiments on Quantum Inspire

Honest benchmarks, fragile auth tokens, and why the hardware you trust is the hardware that runs your circuit as written.

Quantum Inspire · Tuna-9 · developer experience · cQASM · IBM Quantum · error mitigation · benchmarking · QI SDK

Over the past weeks we've run 50+ quantum experiments across Quantum Inspire's Tuna-9, IBM Torino, IBM Marrakesh, and IQM Garnet. We built an MCP server around the QI SDK, wrote an autonomous experiment daemon, and pushed every error mitigation technique we could find through every backend we had access to.

This post is our honest field report. What surprised us, what frustrated us, and what we'd tell the Quantum Inspire team if we had 20 minutes of their time.

1. The Honest Compiler Is a Feature, Not a Limitation

This is the single most important thing we learned about Quantum Inspire, and it took a cross-platform benchmarking experiment to see it.

We ran randomized benchmarking across three backends. IBM Torino reported 99.99% gate fidelity. Tuna-9 reported 99.82%. IQM Garnet reported 99.82%. Two out of three agree — and the outlier is the one with the aggressive transpiler.

What happened: IBM's Clifford-level transpiler collapsed our 32-gate RB sequences into 1–2 gates. It was measuring readout error, not gate error. QI and IQM ran the circuits as written, and converged on the same answer.
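
The failure mode is easy to reproduce in simulation. Standard RB fits the survival probability to A·p^m + B and converts the decay parameter p into an average gate fidelity; if a transpiler collapses the sequence, survival stops depending on m, the fit drives p toward 1, and the reported fidelity becomes near-perfect. A minimal sketch with synthetic data (the numbers are chosen to land near a 99.82%-style result, not our raw measurements):

```python
import numpy as np
from scipy.optimize import curve_fit

def rb_decay(m, a, p, b):
    """Standard RB model: survival probability A * p^m + B."""
    return a * p**m + b

def fit_gate_fidelity(lengths, survival, n_qubits=1):
    """Fit the RB decay curve and convert p to average gate fidelity."""
    (a, p, b), _ = curve_fit(rb_decay, lengths, survival,
                             p0=[0.5, 0.98, 0.5], maxfev=10000)
    d = 2 ** n_qubits
    error = (d - 1) / d * (1 - p)   # average error per Clifford
    return 1 - error, p

# Synthetic survival data for sequence lengths up to 32. A backend that
# runs circuits as written shows genuine decay in m; a transpiler that
# collapses the sequence makes survival flat, so the fit returns p ~ 1.
lengths = np.array([1, 2, 4, 8, 16, 32])
p_true = 0.9964
survival = 0.5 * p_true ** lengths + 0.5
fidelity, p_fit = fit_gate_fidelity(lengths, survival)
```

The key point: the fit is only meaningful if m counts the gates that actually ran, which is exactly what an aggressive transpiler breaks.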

Two honest compilers agree. The outlier is the one that optimizes away your benchmark.

cQASM 3.0 goes straight to hardware. No black-box compilation. This makes certain things harder (you need to know the topology, get CNOT direction right, handle routing yourself), but it also means your benchmarks measure your hardware, not your compiler. For a research platform, that's the right tradeoff.

2. Open Error Mitigation Matches Proprietary Techniques

IBM's TREX achieved 0.22 kcal/mol on H2 VQE. Chemical accuracy, one API call, impressive. But it's proprietary — locked to IBM's Estimator primitives, opaque internals, works only on their hardware.

On Tuna-9, we combined two open-source techniques: readout error mitigation (confusion matrix inversion) and parity post-selection. The result on the best qubit pair: 0.92 kcal/mol. Chemical accuracy, using techniques that work on any backend with a calibration matrix.
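
Both techniques are a few lines of numpy once you have a calibrated confusion matrix. A minimal sketch, assuming the convention M[i, j] = P(measure i | prepared j); the demo matrix and counts are illustrative, not our measured data:

```python
import numpy as np

STATES = ["00", "01", "10", "11"]

def invert_confusion(counts, confusion):
    """Readout error mitigation: undo the calibrated confusion matrix
    (confusion @ p_true = p_measured) on the observed distribution."""
    p_meas = np.array([counts.get(s, 0) for s in STATES], dtype=float)
    p_meas /= p_meas.sum()
    p_true = np.linalg.solve(confusion, p_meas)
    p_true = np.clip(p_true, 0, None)   # inversion can go slightly negative
    return p_true / p_true.sum()

def postselect_parity(probs, parity=-1):
    """Keep only outcomes in the given parity sector (+1 even, -1 odd)
    and renormalize, discarding shots where a bit flip broke the
    symmetry the ansatz preserves."""
    mask = np.array([(-1) ** bin(i).count("1") == parity
                     for i in range(len(probs))], dtype=float)
    kept = probs * mask
    return kept / kept.sum()

# Demo: 2-3% per-qubit assignment error, odd-parity ansatz sector.
M1 = np.array([[0.98, 0.03],
               [0.02, 0.97]])          # columns: prepared |0>, |1>
M = np.kron(M1, M1)
counts = {"00": 190, "01": 4700, "10": 4880, "11": 230}   # illustrative
p = postselect_parity(invert_confusion(counts, M))
```

Stacking the two works because they attack different failure modes: matrix inversion corrects the average readout bias, while post-selection discards individual symmetry-violating shots.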

Backend         Best technique         H2 VQE error    Portable?
IBM Torino      TREX (resilience=1)    0.22 kcal/mol   No (IBM only)
Tuna-9 q[2,4]   REM + post-selection   0.92 kcal/mol   Yes
Tuna-9 q[6,8]   REM + post-selection   1.32 kcal/mol   Yes

Going from raw counts (22 kcal/mol) to hybrid PS+REM (0.92 kcal/mol) on Tuna-9 recovers 96% of the error using generic techniques. The remaining gap between 0.92 and IBM's 0.22 reflects IBM's better baseline readout fidelity, not a fundamentally better approach.

And on the hard problem — HeH+ VQE, where the coefficient amplification makes chemical accuracy impossible — both platforms converge: 4.45 kcal/mol (IBM TREX) vs 4.44 kcal/mol (Tuna-9 REM+PS). The Hamiltonian sets the error floor, not the hardware.

3. Qubit Selection Matters More Than Mitigation Strategy

On Tuna-9, the best qubit pair (q[2,4]) gave 0.92 kcal/mol with REM+PS. The worst pair (q[0,1]) gave 9.5 kcal/mol with the same mitigation. That's a 10.3x difference from qubit selection alone.

Readout error on individual qubits ranged from 1.6% (q[2]) to 12.3% (q[0]). This isn't documented anywhere we could find programmatically. Our autonomous characterization agent discovered it empirically — by running Bell circuits on every connected pair and comparing fidelities.
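
The scoring step of that search is deliberately crude. A sketch (the counts are illustrative, not our measured data; submitting a Bell circuit per connected pair is the part that costs the 20 minutes):

```python
def bell_fidelity(counts):
    """Crude Bell-state fidelity proxy: fraction of shots in 00 or 11
    after preparing (|00> + |11>)/sqrt(2). It lumps gate and readout
    error together, which is exactly what you want for ranking pairs."""
    total = sum(counts.values())
    return (counts.get("00", 0) + counts.get("11", 0)) / total

def rank_pairs(results):
    """Sort connected qubit pairs by empirical Bell fidelity, best first.
    `results` maps (q_a, q_b) -> a counts dict from the Bell circuit."""
    return sorted(results, key=lambda pair: bell_fidelity(results[pair]),
                  reverse=True)

# Illustrative counts in the spirit of what we observed:
results = {
    (0, 1): {"00": 440, "01": 45, "10": 50, "11": 465},
    (2, 4): {"00": 492, "01": 5, "10": 7, "11": 496},
    (6, 8): {"00": 480, "01": 12, "10": 11, "11": 497},
}
best_pair = rank_pairs(results)[0]
```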

On a 9-qubit chip, you can find the best qubits by exhaustive search in 20 minutes. On a 133-qubit chip, you need calibration data. IBM publishes it via their API. QI should too. A backend.calibration_data() endpoint returning per-qubit T1, T2, readout error, and gate fidelity would let users auto-select optimal qubits instead of discovering the hard way that q[0] is 8x noisier than q[2].

4. The Developer Experience Has Sharp Edges

We built three different interfaces to QI hardware: a Model Context Protocol server, a direct SDK integration in our experiment daemon, and a CLI-based subprocess wrapper. All three hit friction.

Auth tokens go stale silently

The QI SDK authenticates via ~/.quantuminspire/config.json, populated by qi login (GitHub OAuth). Tokens expire without warning. Our MCP server would work for hours, then start returning opaque errors. We ended up rewriting the experiment daemon to call the SDK directly (RemoteBackend) instead of the CLI, specifically to work around token staleness. A token refresh mechanism — or at minimum, a clear "token expired" error instead of a generic failure — would save significant debugging time.
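
Until the SDK grows a refresh mechanism, a retry wrapper is the pragmatic workaround. This is a sketch only: the SDK raises no dedicated expired-token exception, so StaleTokenError here is a stand-in (in practice we matched on error text), and the refresh callback is whatever re-auth step works for you, e.g. shelling out to qi login:

```python
import functools

class StaleTokenError(RuntimeError):
    """Stand-in for whatever opaque error an expired token produces."""

def with_token_refresh(refresh):
    """Retry a QI call once after refreshing credentials."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except StaleTokenError:
                refresh()               # e.g. re-run the login flow
                return fn(*args, **kwargs)
        return wrapper
    return decorator
```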

qi files run output is unpredictable

Our daemon tried to parse JSON from qi files run stdout. Sometimes it got JSON. Sometimes it got extra log text mixed in. This made automated pipelines brittle. A --format json flag (or even just a guarantee that stdout is clean JSON when the command succeeds) would make programmatic use much more reliable.
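
Our workaround was to stop trusting the stream as a whole and scan stdout for the last parseable top-level JSON object. A minimal sketch using only the standard library:

```python
import json

def extract_json(stdout):
    """Return the last well-formed top-level JSON object found in mixed
    CLI output, or None. Tolerates log lines interleaved with the
    payload by attempting a parse at each '{' and skipping past any
    object that succeeds."""
    decoder = json.JSONDecoder()
    result = None
    idx = stdout.find("{")
    while idx != -1:
        try:
            obj, end = decoder.raw_decode(stdout[idx:])
            result = obj
            idx = stdout.find("{", idx + end)   # jump past this object
        except json.JSONDecodeError:
            idx = stdout.find("{", idx + 1)
    return result
```

A --format json flag would make this function unnecessary, which is the point.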

cQASM needs a validation tool

cQASM 3.0 has no compiler and no implicit routing. CNOT directionality must match the hardware topology, and the topology can change after recalibration. We discovered stale topology data the hard way — CNOTs that worked last week got rejected this week because the hardware was recalibrated.

A qi validate circuit.cqasm command that checks a circuit against the current topology (without submitting it) would catch these errors at development time instead of after waiting in the job queue. Even better: an API endpoint that returns the current connectivity graph, so tools can validate locally.
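
Until such a tool exists, a local pre-flight check against whatever coupling map you have catches the common case. A sketch; the regex is a deliberately naive stand-in for a real cQASM parser, and directed_edges is a set of (control, target) pairs from your cached topology data:

```python
import re

CNOT_RE = re.compile(r"CNOT\s+q\[(\d+)\]\s*,\s*q\[(\d+)\]")

def validate_cnots(cqasm, directed_edges):
    """Check every CNOT in a cQASM source against a directed coupling
    map before submitting; returns a list of human-readable errors."""
    errors = []
    for ctrl, tgt in CNOT_RE.findall(cqasm):
        edge = (int(ctrl), int(tgt))
        if edge not in directed_edges:
            hint = (" (reversed direction is available)"
                    if edge[::-1] in directed_edges else "")
            errors.append(f"CNOT q[{ctrl}], q[{tgt}] not in topology{hint}")
    return errors
```

The caveat from above still applies: this only helps if the coupling map it checks against is current, which is why the connectivity endpoint matters more than the validator.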

5. Document the Noise Profile, Not Just the Qubit Count

Our most actionable finding was that Tuna-9 is readout-error dominated. We discovered this by running a gate-folding diagnostic: tripling the CNOT count changed VQE error by less than 1 kcal/mol out of ~7 kcal/mol total. The remaining error was almost entirely readout.

This means:

  • REM (readout correction) is the right first-line treatment — it recovered 96% of the error
  • ZNE (gate noise extrapolation) returned NaN — there wasn't enough gate noise to extrapolate
  • Dynamical decoupling made things worse — it added gate overhead to fix a non-problem

Most quantum computing tutorials assume gate-error-dominated hardware (because IBM's larger chips are gate-limited at scale). A researcher following those tutorials on Tuna-9 would waste time on ZNE and DD before discovering they need REM. The QI documentation should lead with this: "Tuna-9 is readout-limited. Start with confusion matrix correction. Skip ZNE for shallow circuits."
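
The gate-folding diagnostic behind this finding is easy to reproduce: repeat every CNOT an odd number of times so the circuit stays logically identical while gate noise scales, then watch whether the answer moves. A sketch of the folding step as a plain text transform (a real version would operate on a parsed circuit, not lines):

```python
def fold_cnots(cqasm, factor=3):
    """Gate-folding diagnostic: CNOT^3 == CNOT logically, so tripling
    each CNOT amplifies gate noise while leaving readout noise fixed.
    If the measured observable barely changes, you are readout-limited."""
    assert factor % 2 == 1, "even folding changes the logical circuit"
    out = []
    for line in cqasm.splitlines():
        reps = factor if line.strip().startswith("CNOT") else 1
        out.extend([line] * reps)
    return "\n".join(out)
```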

6. Topology Changes Need Notifications

Hardware recalibration changed Tuna-9's connectivity graph. Qubits q[6–8], which our cached topology map said were dead, came back online. Other connections shifted. Our autonomous agent discovered this by accident — it started from zero and re-probed everything, catching changes a human would have missed by reusing old data.

There's no notification when the topology changes. No version number. No "recalibrated at" timestamp in the API. A simple backend.topology_version() that increments on recalibration — or a webhook, or even a "last calibrated" field in qi_list_backends — would prevent the stale-data trap. Research results built on wrong topology assumptions are worse than no results at all.
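
A client-side stopgap for the stale-data trap is to fingerprint the connectivity graph you observe and refuse to reuse cached results when the fingerprint changes. A sketch:

```python
import hashlib
import json

def topology_fingerprint(edges):
    """Stable short hash of a connectivity graph. Compare against a
    cached value before reusing last week's topology map; a change
    means a recalibration happened and downstream data is suspect."""
    canonical = sorted(tuple(e) for e in edges)
    blob = json.dumps(canonical).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```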

The Broader Picture

Quantum Inspire is a research platform, and it behaves like one. The hardware runs your circuits honestly. The error mitigation techniques that work on it are portable. The benchmarks you get from it are real. These properties matter more than qubit count for the kind of work we're doing — replicating published experiments, characterizing noise, building trust in results.

The gaps are in developer experience: auth management, output parsing, calibration data access, topology change detection. These are solvable engineering problems, not fundamental limitations. And the honest-compiler philosophy — what you write is what runs — is an undermarketed strength in an ecosystem where transpiler tricks can quietly inflate benchmarks.

We'll keep running experiments on Tuna-9. The data it produces is data we trust.


All experiment data: experiments dashboard. Cross-platform comparison: full analysis. Error mitigation techniques: mitigation showdown. Autonomous hardware characterization: characterization report.
