The same quantum algorithm produces radically different results depending on where you run it.
Four Quantum Backends, One Question: How Much Does the Hardware Matter?
We ran the same experiments on a noiseless emulator, IBM Torino (133q), Tuna-9 (9q), and IQM Garnet (20q). The answer: it matters a lot, but not always in the ways you expect.
What happens when you take the same quantum algorithm and run it on four completely different backends? We've been answering this question systematically across 50+ experiments, 4 paper replications, and 4 platforms: a noiseless QI emulator, IBM's 133-qubit Torino processor, QuTech's 9-qubit Tuna-9 transmon chip, and IQM's 20-qubit Garnet processor.
The headline: benchmarks are forgiving, chemistry is brutal, and error correction reveals the sharpest hardware differences of all.
The Scorecard
| Experiment | Metric | Emulator | IBM Torino | Tuna-9 | IQM Garnet |
|---|---|---|---|---|---|
| QV n=2 | HOF | 77.2% | 69.7% | 69.2% | 74.0% |
| QV n=3 | HOF | 85.1% | 81.0% | 82.1% | 78.6% |
| QV n=5 | HOF | — | — | — | 69.2% |
| QV (best) | Volume | ≥8 | ≥8 | 8 | 32 |
| RB 1-qubit | Gate fidelity | 99.95% | 99.99%* | 99.82% | 99.82% |
| VQE H2 | Error (kcal/mol) | 0.75 | 0.22‡ | 0.92† | — |
| VQE HeH+ | Error (kcal/mol) | 0.08 | 4.31‡ | 4.44† | — |
| [[4,2,2]] QEC | Detection / FP | 100% / 0% | 92.7% / 14.0% | 66.6% / 30.9% | — |
| Bell state | Fidelity | 100% | 99.1% | 85.8–93.5% | 88.4–98.1% |
| GHZ-10 | Fidelity | — | — | n/a (9q) | 54.7% |
*IBM RB fidelity inflated: Qiskit's transpiler collapses Clifford sequences to depth 1–2 circuits, so the fit measures readout error rather than gate error. †Tuna-9 VQE on best qubit pair q[2,4] with hybrid PS+REM (0.92 kcal/mol); worst pair q[0,1] gives 9.5 kcal/mol — a 10.3x difference from qubit selection alone. ‡IBM VQE with TREX readout-error mitigation.
Four patterns jump out:
- Benchmarks pass everywhere, but unevenly. QV passes on all hardware, but IQM Garnet hits QV=32 while Tuna-9 tops out at QV=8. More qubits with better connectivity wins the benchmark game.
- VQE achieves chemical accuracy on two hardware backends. IBM TREX (0.22 kcal/mol) and Tuna-9 hybrid PS+REM (0.92 kcal/mol on q[2,4], 1.32 on q[6,8]) both pass the 1.6 kcal/mol threshold. But qubit selection still matters: wrong pair on Tuna-9 gives 9.5 kcal/mol (10x worse). Error mitigation technique choice matters less than which qubits you pick.
- Error correction reveals the sharpest differences. The same [[4,2,2]] code runs perfectly on the emulator, achieves 92.7% detection on IBM, and reaches 66.6% detection on Tuna-9 with a 30.9% false positive rate — functional but noisy, limited by the 10-CNOT depth needed to route through q4 (the only degree-4 qubit).
- Compiler tricks inflate benchmarks. IBM's 99.99% RB fidelity is measuring readout error, not gate quality. Tuna-9 and IQM Garnet both report 99.82% — genuine gate fidelity measured via raw native gates with no Clifford-level compilation.
VQE: When Bond Curves Break
The Peruzzo 2014 replication tells this story most clearly. We computed the potential energy surface (PES) of HeH+ across 11 bond distances (0.5–3.0 Å), using the same 2-qubit sector-projected ansatz on each backend.
| Bond distance (Å) | FCI (Ha) | Emulator (Ha) | IBM Torino (Ha) | IBM error (kcal/mol) |
|---|---|---|---|---|
| 0.50 | −2.641 | −2.641 | −2.459 | 114.1 |
| 0.75 | −2.846 | −2.846 | −2.701 | 91.2 |
| 1.00 (eq.) | −2.860 | −2.860 | −2.728 | 82.9 |
| 1.50 | −2.825 | −2.825 | −2.716 | 68.5 |
| 2.00 | −2.811 | −2.811 | −2.687 | 77.9 |
| 3.00 | −2.808 | −2.808 | −2.678 | 81.4 |
The emulator matches the exact (FCI) curve to within 0.08 kcal/mol MAE. IBM Torino's curve has the right shape — minimum at R≈1.0 Å, dissociation plateau at large R — but is offset by ~0.13 Ha at every point. The error is remarkably uniform: 68–114 kcal/mol across all 11 distances.
Why so bad? The measured HeH+ energy is assembled from six Pauli expectation values: E = g0 + g1⟨Z0⟩ + g2⟨Z1⟩ + g3⟨Z0Z1⟩ + g4⟨X0X1⟩ + g5⟨Y0Y1⟩. The g1 coefficient (∼0.5–0.8) amplifies readout bias: a 10% readout error on the ⟨Z⟩ terms contributes ∼0.05–0.08 Ha of error. The energy also depends on the difference g1−g2 — when both Z expectations are biased in the same direction, the error compounds rather than cancels.
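The scale of this effect is easy to check. A minimal sketch with illustrative numbers (the coefficient and expectation value below are hypothetical, chosen inside the ranges quoted above, not the fitted HeH+ values): a symmetric readout error ε shrinks each measured ⟨Z⟩ by a factor of (1 − 2ε), biasing the energy by roughly g·2ε·|⟨Z⟩| per term.

```python
# Illustrative sketch: how symmetric readout error biases one Z term of the
# VQE energy. All values are hypothetical, within the ranges quoted above.
g1 = 0.65    # Z0 coefficient, somewhere in the ~0.5-0.8 range (hypothetical)
z0 = 0.5     # ideal expectation value <Z0> (hypothetical)
eps = 0.10   # 10% symmetric readout error

z0_noisy = (1 - 2 * eps) * z0          # readout error shrinks <Z> toward 0
bias_ha = abs(g1 * (z0_noisy - z0))    # energy bias from this single term
print(f"energy bias from one Z term: {bias_ha:.3f} Ha")  # ~0.065 Ha
```

One Z term alone already lands in the 0.05–0.08 Ha range, which is why a ~0.13 Ha offset across the whole curve is unsurprising once both Z terms push the same way.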
For H2, the Hamiltonian is more symmetric and the g1 coefficient is smaller (∼0.4), which is why IBM TREX gets 0.22 kcal/mol error on H2 but 4.45 kcal/mol on HeH+. Even with the best mitigation, HeH+ remains 20x worse. The molecule matters as much as the hardware.
Qubit Selection: The Cheapest Error Mitigation
On Tuna-9, we ran the exact same H2 VQE circuit on three different qubit pairs. The results are striking:
| Qubit pair | Bell fidelity | VQE error (kcal/mol) | Post-sel. kept |
|---|---|---|---|
| q[0,1] | 87.0% | 9.45 | 83% |
| q[4,6] | 93.5% | 6.2 (with REM) | — |
| q[2,4] | 92.3% | 3.04 | 96% |
Switching from q[0,1] to q[2,4] — no algorithm change, no extra error mitigation, just picking better qubits — cuts error by 3.1x. And q[2,4] outperforms q[4,6] despite q[4,6] having higher Bell fidelity (93.5% vs 92.3%). This suggests that CNOT direction, measurement axis noise, and spectator qubit effects matter beyond what Bell fidelity captures.
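Once you have per-pair numbers, the selection logic is trivial to automate. A sketch using the data from the table above (the ranking helper is our own illustration, not a Tuna-9 API), showing that the best-Bell pair and the best-VQE pair disagree:

```python
# Rank candidate qubit pairs by a calibration metric and compare against the
# observed VQE error. Numbers copied from the table above.
pairs = {
    "q[0,1]": {"bell_fidelity": 0.870, "vqe_error_kcal": 9.45},
    "q[4,6]": {"bell_fidelity": 0.935, "vqe_error_kcal": 6.2},
    "q[2,4]": {"bell_fidelity": 0.923, "vqe_error_kcal": 3.04},
}

best_by_bell = max(pairs, key=lambda p: pairs[p]["bell_fidelity"])
best_by_vqe = min(pairs, key=lambda p: pairs[p]["vqe_error_kcal"])
print(best_by_bell, best_by_vqe)  # q[4,6] q[2,4]: rankings disagree
```

If Bell fidelity were a sufficient proxy, the two selections would coincide; the mismatch is the point.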
The PES sweep confirms this pattern holds across the full dissociation curve:
| R (Å) | Emulator (kcal/mol) | Tuna-9 q[2,4] (kcal/mol) | Hardware gap |
|---|---|---|---|
| 0.5 | 2.0 | 9.98 | +8.0 |
| 0.735 (eq.) | 0.6 | 3.04 | +2.4 |
| 1.0 | 1.4 | 4.12 | +2.7 |
| 1.5 | 2.6 | 12.68 | +10.1 |
| 2.0 | 2.1 | 17.32 | +15.2 |
| 2.5 | 0.09 | 13.42 | +13.3 |
Hardware noise grows dramatically past R=1.0 Å, where the circuit needs more entanglement (larger rotation angle α). The X/Y basis measurements required for the 〈X0X1〉 and 〈Y0Y1〉 terms add gates, introducing noise that dominates at large bond distances.
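The gate overhead follows the standard basis-change rule: measuring ⟨X⟩ needs an H before the Z-basis readout, and ⟨Y⟩ needs S† then H. A sketch counting the extra single-qubit gates per two-qubit Hamiltonian term:

```python
# Standard pre-measurement rotations: Z needs none, X needs H, Y needs Sdg+H.
basis_change = {"Z": [], "X": ["H"], "Y": ["Sdg", "H"]}

# Extra gates for each two-qubit term of the HeH+ Hamiltonian: one rotation
# set per qubit, hence twice the single-qubit gate count.
terms = {"Z0Z1": "Z", "X0X1": "X", "Y0Y1": "Y"}
extra_gates = {t: 2 * len(basis_change[b]) for t, b in terms.items()}
print(extra_gates)  # {'Z0Z1': 0, 'X0X1': 2, 'Y0Y1': 4}
```

Those four extra gates on the ⟨Y0Y1⟩ circuit are pure overhead on the emulator but real noise on hardware, which is consistent with the widening gap at large R.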
The takeaway: on current NISQ hardware, smart qubit routing is the single most impactful optimization — cheaper than error mitigation and with no runtime overhead.
Quantum Error Correction: Where Topology Taxes You
The [[4,2,2]] error detection code encodes 2 logical qubits into 4 data qubits, with 2 ancilla (helper) qubits that check for errors by measuring collective properties called stabilizers. It can detect (but not correct) any single-qubit error.
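For bit-flips, the detection mechanism reduces to a parity check: Z-basis codewords of the [[4,2,2]] code all have even weight, so the ZZZZ stabilizer reads 0 on any codeword and 1 after any single X error (the XXXX stabilizer plays the same role for Z errors). A classical sketch of that logic:

```python
# Classical sketch of [[4,2,2]] bit-flip detection: the ZZZZ stabilizer
# value is just the parity of the four data bits, even for every codeword.
def zzzz_syndrome(bits):
    return sum(bits) % 2  # 0 = consistent with the code space, 1 = flagged

codeword = [0, 1, 1, 0]   # one even-weight Z-basis codeword component
assert zzzz_syndrome(codeword) == 0

# Any single X (bit-flip) error makes the parity odd and is flagged.
for q in range(4):
    corrupted = codeword[:]
    corrupted[q] ^= 1
    assert zzzz_syndrome(corrupted) == 1
print("all single bit-flips detected")
```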
On the emulator: 100% detection rate, 0% false positive rate. Perfect, as expected from a noiseless backend.
On IBM Torino (133 qubits, heavy-hex topology): 92.7% detection rate, 14.0% false positive rate. IBM's rich connectivity easily accommodates the circuit — each helper qubit needs CNOT gates (two-qubit operations) to all 4 data qubits, and IBM's topology provides this.
On Tuna-9: 66.6% detection rate, 30.9% false positive rate. This only became possible after we discovered Tuna-9's full 12-edge topology (the original characterization found 10 edges because couplers q4–q7 and q5–q7 were disabled during that calibration cycle). With 12 edges, q4 has degree 4 — the only qubit connected to 4 neighbors — making it viable as the sole ancilla. But the circuit requires 10 CNOTs (5 to encode via q4 as a bus, 1 to disentangle, 4 for syndrome extraction), compared to IBM's 6-CNOT layout. That extra depth is why detection drops from 92.7% to 66.6%.
This remains the sharpest cross-platform difference in our data. The same algorithm, same encoding, same error model — one platform runs it at 93% detection with a clean layout, the other manages 67% through a deeper circuit forced by sparser connectivity. Post-selection still helps: discarding flagged shots improves raw fidelity from 49.4% to 66.3% (1.34x gain). Topology doesn't make it impossible, but it determines the circuit depth tax you pay — and on noisy hardware, depth is everything.
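The post-selection step itself is just counting. A sketch with hypothetical shot tallies (not the real IBM histograms) showing how discarding flagged shots trades shot count for fidelity:

```python
# Hypothetical shot tallies, keyed by (logical outcome correct?, flagged?).
counts = {
    (True, False): 620,   # correct and unflagged  -> kept
    (True, True): 40,     # correct but flagged    -> discarded
    (False, False): 180,  # wrong and unflagged    -> kept (undetected error)
    (False, True): 160,   # wrong and flagged      -> discarded
}
total = sum(counts.values())
raw = sum(n for (ok, _), n in counts.items() if ok) / total
kept = sum(n for (_, flagged), n in counts.items() if not flagged)
post = counts[(True, False)] / kept
print(f"raw {raw:.1%}, post-selected {post:.1%}, kept {kept / total:.0%}")
```

Same shape as the hardware result: fidelity rises because flagged shots are disproportionately wrong, at the cost of the discarded fraction.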
Training an AI Decoder on Hardware Data
The IBM Torino [[4,2,2]] data gave us something the emulator never could: realistic noise patterns to train an AI decoder.
We ran 13 error variants (no error, X/Z/Y on each of 4 data qubits) with 4,096 shots each = 53,248 labeled samples. Each sample is a 6-bit measurement outcome (4 data + 2 syndrome bits) with a known injected error class.
We trained a neural network decoder (scikit-learn MLPClassifier, 32/16 hidden layers) and compared it to a lookup-table decoder:
| Decoder | Accuracy | Notes |
|---|---|---|
| NN (13 classes) | 61.7% | Learns data-bit correlation patterns |
| Lookup table (13 classes) | 41.1% | Syndrome → most likely error |
| Lookup table (4 classes) | 79.8% | Coarser: no-error vs X/Z/Y type only |
The NN outperforms the detailed lookup table by 50% (61.7% vs 41.1%). Why? The lookup table only uses the 2 syndrome bits; the NN also uses the 4 data bits. This matters because hardware noise creates correlations between data-bit patterns and error types that a syndrome-only decoder can't see.
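The lookup-table baseline is just a majority vote per key, so the syndrome-only vs syndrome-plus-data distinction is easy to see on toy data. A stdlib sketch (the shots below are made up to mimic the dataset's shape; the real set is the 53,248 hardware samples):

```python
from collections import Counter, defaultdict

# Toy labeled shots: (syndrome, data_bits, injected_error). Made-up values.
shots = [
    ("01", "0010", "X_d2"), ("01", "0010", "X_d2"),
    ("01", "0100", "X_d1"), ("01", "0100", "X_d1"), ("01", "0100", "X_d1"),
    ("00", "0000", "none"), ("00", "0000", "none"),
]

def train_lookup(shots, key):
    """Majority-vote decoder: map each key to its most common error label."""
    votes = defaultdict(Counter)
    for syn, data, label in shots:
        votes[key(syn, data)][label] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}

syn_only = train_lookup(shots, key=lambda s, d: s)        # 2 syndrome bits
syn_data = train_lookup(shots, key=lambda s, d: (s, d))   # + 4 data bits

print(syn_only["01"])            # one answer for the whole syndrome class
print(syn_data[("01", "0010")])  # data bits split the class apart
```

The syndrome-only decoder is forced to answer `X_d1` for every `01` syndrome; the wider key recovers `X_d2` where the data bits distinguish it. The NN exploits the same extra information, just with soft weights instead of a hard table.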
One fundamental limitation: Z errors can't be localized from Z-basis measurement. Z errors don't flip bits in the computational basis — they flip phase — so the NN gets 0% recall on individual Z errors (Z_d0, Z_d1, etc.). The ZZZZ syndrome detects that a Z error occurred, but the data bits don't reveal which qubit was affected. This isn't a decoder failure; it's a fundamental limitation of single-basis measurement.
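The Z-blindness is visible in a two-line amplitude calculation (the state amplitudes here are hypothetical):

```python
# A Z error flips the sign of the |1> amplitude but leaves |amplitude|^2,
# and hence every Z-basis measurement probability, unchanged.
state = [0.6, 0.8]                 # hypothetical 1-qubit amplitudes
after_z = [state[0], -state[1]]    # Z error: phase flip on |1>
probs = lambda amps: [a * a for a in amps]
assert probs(state) == probs(after_z)  # indistinguishable in the Z basis
print("Z error invisible to Z-basis readout")
```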
Why IBM's 99.99% RB Is Fake (and Tuna-9's 99.82% Is Real)
This might be the most important methodological finding in our data. IBM Torino reports 99.99% single-qubit gate fidelity from randomized benchmarking. Tuna-9 reports 99.82%. At face value, IBM's gates are 100x better. In reality, the two numbers are measuring completely different things.
Here's what happens: IBM's Qiskit transpiler recognizes that a sequence of random Clifford gates composes into a single Clifford operation. So regardless of whether you ask for m=1, 4, 8, 16, or 32 Clifford gates, the transpiler compiles the entire sequence down to 1–2 physical gates. Our data shows this clearly:
| Sequence length | IBM survival | Tuna-9 survival | IQM Garnet survival |
|---|---|---|---|
| m=1 | 90.5% | 95.8% | 98.9% |
| m=4 | 90.3% | 94.8% | 97.9% |
| m=8 | 90.4% | 93.6% | 96.4% |
| m=16 | 90.0% | 91.5% | 94.3% |
| m=32 | 90.1% | 89.0% | 88.2% |
IBM's survival probability is flat at ~90% — no decay at all. That 90% floor is pure readout error: how accurately you can measure a qubit in the |0〉 state. The exponential decay that RB is supposed to measure — the decay that tells you about gate quality — never appears because there are no extra gates to decay through.
Tuna-9's curve, by contrast, actually decays from 95.8% to 89.0%. Its compiler doesn't collapse Clifford sequences, so the gates are physically executed. The 99.82% fidelity extracted from this decay is a genuine measurement of gate quality.
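You can check this directly from the survival table above. A sketch fitting the standard single-qubit RB model p(m) = A·r^m + B, with the floor pinned at B = 0.5 (the ideal single-qubit asymptote) for simplicity — the real analysis fits B freely, so the exact fidelities differ slightly:

```python
import math

# Survival probabilities from the table above.
m = [1, 4, 8, 16, 32]
survival = {
    "IBM Torino": [0.905, 0.903, 0.904, 0.900, 0.901],
    "Tuna-9":     [0.958, 0.948, 0.936, 0.915, 0.890],
}

def rb_fidelity(surv, floor=0.5):
    # Log-linear least squares: log(p(m) - floor) = log(A) + m * log(r).
    ys = [math.log(s - floor) for s in surv]
    mx, my = sum(m) / len(m), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(m, ys))
             / sum((x - mx) ** 2 for x in m))
    r = math.exp(slope)
    return 1 - (1 - r) / 2  # single-qubit average gate fidelity

for backend, surv in survival.items():
    print(f"{backend}: {rb_fidelity(surv):.3%}")
```

Under this pinned-floor fit, IBM's flat curve comes out near 99.98% (there is almost no decay to fit) while Tuna-9's genuine decay gives roughly 99.7% — the same qualitative gap described above, recovered from five survival points.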
The punchline: Tuna-9's "worse" number is the more honest measurement. A smaller processor with a simpler compiler produces more trustworthy benchmarks than a 133-qubit system with an aggressively optimizing transpiler. For the field, this raises an uncomfortable question: how many published RB numbers are actually measuring readout error dressed up as gate fidelity?
IQM Garnet confirms this prediction. IQM's native gate set is prx(angle, phase) and cz — there is no Clifford-level transpilation. When we submit a 32-Clifford RB sequence, IQM executes all ~130 physical prx gates without collapsing them. The result: clear exponential decay from 98.9% at m=1 to 88.2% at m=32, yielding 99.82% gate fidelity — identical to Tuna-9. Two independent backends with honest compilers converge on the same answer. IBM's 100x-better number is the outlier, not the norm.
The fix is straightforward — use interleaved RB with non-Clifford gates, or disable Clifford compilation during benchmarking. But this is rarely flagged in cross-platform comparisons, and it means you cannot naively compare RB numbers across platforms without understanding what each compiler does to your circuits.
What We Learned
- Compiler honesty matters more than qubit count. IBM's 99.99% RB looks 100x better than Tuna-9's 99.82%, but IBM's number measures readout error while Tuna-9's measures actual gate quality. IQM Garnet confirms this: with no Clifford-level compilation, IQM's RB shows genuine decay and converges on the same 99.82% fidelity as Tuna-9. Two honest compilers agree; the outlier is the one with aggressive optimization. Cross-platform comparisons are meaningless without understanding what each transpiler does to your circuits.
- Benchmarks and applications live in different worlds. QV and RB pass on hardware that can't do useful chemistry. The gap between "this hardware works" (QV PASS) and "this hardware is useful" (VQE within chemical accuracy) is enormous.
- Error correction needs topology, not just qubits. Tuna-9 has enough qubits for [[4,2,2]] but not enough connectivity. IBM Torino has 133 qubits but ~14% false positive rate on a 6-qubit code. Neither is ready for fault-tolerant computation, but they fail for completely different reasons.
- AI decoders beat classical decoders on real hardware data. A simple 2-layer neural network outperforms lookup tables by 50% on qubit-level error classification. On real hardware, noise has structure that ML can exploit.
- The molecule matters as much as the machine. Even with TREX, IBM Torino gets 0.22 kcal/mol on H2 but 4.45 kcal/mol on HeH+ — because HeH+'s asymmetric Hamiltonian (|g1|/|g4| = 7.8 vs 4.4) amplifies readout bias 20x. You can't benchmark VQE on one molecule and assume it generalizes.
- Qubit selection is the cheapest optimization. On Tuna-9, switching from q[0,1] to q[2,4] cuts VQE error by 3.1x — no algorithm change, no error mitigation, just picking better qubits. This outperforms readout error mitigation and costs nothing at runtime.
All raw data — measurement counts, job IDs, expectation values, decoder metrics — is available in the experiments/results/ directory. The experiments dashboard at /experiments shows live results across all backends.
Hardware job IDs: IBM HeH+ VQE (d65ncqoqbmes739d4h30). IBM Cross QV/RB (d65ncilbujdc73ctmjr0). IBM [[4,2,2]] (d65n33je4kfs73cvklt0 + 12 more). Tuna-9 QV (415379–415394). Tuna-9 RB (415395–415404). IQM Bell (019c48cf-99f2-7e03). IQM diagnostics (30 jobs, 47K shots).
Sources & References
- Experiments dashboard: https://haiqu.org/experiments
- HeH+ VQE IBM results (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/peruzzo2014-ibm-torino.json
- Cross QV/RB IBM results (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/cross2019-ibm-torino.json
- IQM Garnet diagnostics (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/iqm-garnet-diagnostic-suite.json
- [[4,2,2]] IBM results (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/detection-code-001-ibm-torino.json
- Peruzzo et al. 2014: https://arxiv.org/abs/1304.3061
- Cross et al. 2019: https://arxiv.org/abs/1811.12926
- Sagastizabal et al. 2019: https://arxiv.org/abs/1902.11258