Replication is the foundation of science. In quantum computing, the gap between published results and reproduced results reveals how much hardware matters.

Experiment · 2026-02-11 · AI x Quantum Research Team

We Tried to Replicate 4 Quantum Computing Papers. Here's What Happened.

AI agents reproduced 14 published claims across emulator, IBM Torino, and Tuna-9 hardware. The gaps tell us more than the successes.

replication · VQE · quantum volume · IBM Quantum · Tuna-9 · reproducibility · hardware noise

Reproducibility is the foundation of science. In quantum computing, it's also one of the field's biggest open questions: when a paper reports a ground state energy or a quantum volume, can someone else get the same result on different hardware?

We set out to answer this systematically. Using AI agents with direct access to quantum hardware through MCP tool calls, we attempted to replicate 4 landmark papers across 4 backends: a noiseless emulator, IBM's 133-qubit Torino processor, QuTech's 9-qubit Tuna-9 transmon chip, and IQM's 20-qubit Garnet processor.

The results: emulators reproduce published claims almost perfectly (85% pass rate). Real hardware tells a different story.

The Papers

| Paper | Type | Qubits | Claims | Pass Rate |
|---|---|---|---|---|
| Sagastizabal et al. 2019 | VQE + Error Mitigation | 2 | 4 | 50% |
| Kandala et al. 2017 | Hardware-efficient VQE | 6 | 5 | 60% |
| Peruzzo et al. 2014 | First VQE | 2 | 3 | 100% |
| Cross et al. 2019 | Quantum Volume | 5 | 3 | 100% |

Across all 4 papers, we tested 19 claims on up to 4 backends. The overall pass rate is 76%. But that number hides the real story: the gap between emulator and hardware.

The VQE Story: Physics Works, Hardware Struggles

Three of our four papers involve the Variational Quantum Eigensolver (VQE) — computing molecular ground state energies. The same H2 molecule at the same bond distance (R=0.735 Å) appears in both Sagastizabal 2019 and Kandala 2017, giving us a natural cross-check.

The Circuit

The 2-qubit VQE ansatz (trial quantum state) uses a short 3-gate circuit: Ry(α) → CNOT → X, producing a superposition of two electron configurations. The optimal parameter α = −0.2235 gives the ground state energy E = −1.1373 Hartree (the exact solution).
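As a sanity check, the 3-gate ansatz can be simulated directly with NumPy. The qubit ordering, the CNOT direction, and the choice of which qubit receives the final X are our assumptions, chosen so the circuit yields the |01⟩/|10⟩ superposition described above:

```python
import numpy as np

alpha = -0.2235  # optimal variational parameter

# Single-qubit gates
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

# Basis order |q1 q0>, index = 2*q1 + q0; CNOT: q0 controls, q1 is target
CNOT = np.array([[1, 0, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0]], dtype=complex)

psi = np.array([1, 0, 0, 0], dtype=complex)  # start in |00>
psi = np.kron(I2, ry(alpha)) @ psi           # Ry(alpha) on qubit 0
psi = CNOT @ psi                             # entangle
psi = np.kron(I2, X) @ psi                   # X on qubit 0

# Resulting state: cos(a/2)|01> + sin(a/2)|10> -- a superposition of the
# two electron configurations, dominated by |01>
probs = np.abs(psi) ** 2
print({f"{i:02b}": round(p, 4) for i, p in enumerate(probs) if p > 1e-12})
# {'01': 0.9876, '10': 0.0124}
```

The printed probabilities match the Z-basis picture in the results below: almost all shots land on |01⟩, with a small |10⟩ component carrying the correlation energy.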

Energy is reconstructed by measuring the qubits in three different bases (Z, X, and Y — corresponding to different Pauli operators) and combining the results:

E = g0 + g1⟨Z0⟩ + g2⟨Z1⟩ + g3⟨Z0Z1⟩ + g4⟨X0X1⟩ + g5⟨Y0Y1⟩

Here ⟨Z0⟩ means "average measurement outcome of qubit 0 in the Z basis," and the g coefficients come from the molecular Hamiltonian.
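For the ideal state cos(α/2)|01⟩ + sin(α/2)|10⟩ the five expectation values have closed forms, and the energy assembly is a direct sum. Note the g coefficients below are illustrative placeholders, not the tabulated values from the papers (the real coefficients depend on the bond distance):

```python
import math

alpha = -0.2235

# Closed-form expectations for the ideal state cos(a/2)|01> + sin(a/2)|10>
z0   = -math.cos(alpha)   # qubit 0 mostly |1>  -> about -0.975
z1   = +math.cos(alpha)   # qubit 1 mostly |0>  -> about +0.975
z0z1 = -1.0               # perfect anticorrelation
x0x1 = math.sin(alpha)    # off-diagonal terms  -> about -0.222
y0y1 = math.sin(alpha)

def h2_energy(g, z0, z1, z0z1, x0x1, y0y1):
    """E = g0 + g1<Z0> + g2<Z1> + g3<Z0Z1> + g4<X0X1> + g5<Y0Y1> (Hartree)."""
    g0, g1, g2, g3, g4, g5 = g
    return g0 + g1 * z0 + g2 * z1 + g3 * z0z1 + g4 * x0x1 + g5 * y0y1

# Placeholder coefficients for illustration only
g = (-0.5, 0.35, -0.43, 0.57, 0.09, 0.09)
print(h2_energy(g, z0, z1, z0z1, x0x1, y0y1))
```

On hardware, each of z0 through y0y1 is replaced by the averaged measurement outcome from the corresponding basis, and the same sum is evaluated.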

Results Across Backends

| Observable | Ideal | Emulator | IBM Torino | Tuna-9 |
|---|---|---|---|---|
| ⟨Z0⟩ | −0.975 | −0.973 | −0.961 | — |
| ⟨Z1⟩ | +0.975 | +0.973 | +0.950 | — |
| ⟨Z0Z1⟩ | −1.000 | −1.000 | −0.969 | — |
| ⟨X0X1⟩ | −0.222 | −0.252 | −0.256 | — |
| ⟨Y0Y1⟩ | −0.222 | −0.219 | −0.197 | — |
| Energy (Ha) | −1.1373 | −1.1385 | −1.1226 | −1.005 |
| Error (kcal/mol) | — | 0.75 | 9.22 | 83.4 |

The emulator achieves chemical accuracy (< 1 kcal/mol error). IBM Torino, at 9.2 kcal/mol, misses that threshold by roughly 9x but is qualitatively correct: the dominant state |01⟩ appears in 97% of Z-basis shots. Tuna-9 is noise-dominated, with 83 kcal/mol of error.
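The error row follows from the energy row via the conversion 1 Hartree ≈ 627.5 kcal/mol. A quick recomputation (the Tuna-9 entry differs slightly from the table's 83.4 because its energy is quoted to fewer digits):

```python
HARTREE_TO_KCAL = 627.509  # kcal/mol per Hartree
E_EXACT = -1.1373          # exact H2 ground state energy at R = 0.735 A

for backend, e in [("emulator", -1.1385),
                   ("ibm_torino", -1.1226),
                   ("tuna9", -1.005)]:
    err = abs(e - E_EXACT) * HARTREE_TO_KCAL
    print(f"{backend}: {err:.2f} kcal/mol")
# emulator: 0.75, ibm_torino: 9.22, tuna9: 83.02
```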

The noise signature on IBM is instructive: Z-basis correlations degrade by 3–5% (depolarizing noise), while the off-diagonal X and Y correlations are surprisingly well-preserved. This suggests the dominant error is measurement noise rather than gate errors — the entangled state is prepared correctly but read out imperfectly.

Quantum Volume: Hardware Passes

Cross et al. 2019 defined Quantum Volume as the gold standard for benchmarking quantum computers. The test: run random circuits on n qubits, check whether the heavy output fraction exceeds 2/3.
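The heavy-output check is simple to state in code: an output bitstring is "heavy" if its noiseless probability exceeds the median over all bitstrings, and the test passes when the measured fraction of heavy outputs clears 2/3. A minimal sketch (the toy numbers are made up, not from these experiments):

```python
from statistics import median

def heavy_output_fraction(ideal_probs, counts):
    """ideal_probs: {bitstring: probability} from a noiseless simulation.
    counts: {bitstring: shots} measured on hardware.
    Returns the fraction of shots that landed on heavy outputs."""
    med = median(ideal_probs.values())
    heavy = {b for b, p in ideal_probs.items() if p > med}
    shots = sum(counts.values())
    return sum(n for b, n in counts.items() if b in heavy) / shots

# Toy 2-qubit example: heavy set is {"00", "01"}
ideal = {"00": 0.40, "01": 0.35, "10": 0.15, "11": 0.10}
counts = {"00": 380, "01": 330, "10": 160, "11": 130}
print(heavy_output_fraction(ideal, counts))  # 0.71 -> clears the 2/3 bar
```

In the full protocol this fraction is averaged over many random circuits per width n, with a confidence bound; the sketch shows only the per-circuit core.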

All four backends pass QV=8, but IQM goes further with QV=32:

| Test | Threshold | Emulator | IBM Torino | Tuna-9 | IQM Garnet |
|---|---|---|---|---|---|
| n=2 (5 circuits) | > 66.7% | 77.2% | 69.7% | 69.2% | 74.0% |
| n=3 (5 circuits) | > 66.7% | 85.1% | 81.0% | 82.1% | 78.6% |
| n=4 (5 circuits) | > 66.7% | — | — | — | 69.5% |
| n=5 (5 circuits) | > 66.7% | — | — | — | 69.2% |

All four backends pass QV≥8. IQM Garnet stands out by reaching QV=32 (passing n=2 through n=5). Tuna-9's n=2 result (69.2%) barely clears the threshold. IQM's 20-qubit processor with 30 connections and square-lattice topology gives it an edge over Tuna-9's 9 qubits with 12 connections.

The randomized benchmarking results complement this: Tuna-9 and IQM Garnet both achieve 99.82% single-qubit gate fidelity (0.18% error per gate), matching the emulator's 99.95% closely. IBM Torino shows 99.99% — though this is inflated because IBM's transpiler collapses Clifford sequences to single gates, so RB measures readout error rather than gate error. The fact that two independent backends with honest compilers converge on the same answer (99.82%) while IBM reports 99.99% strongly suggests IBM's figure is a compiler artifact. This confirms that single-qubit operations on all three hardware platforms are high quality; the VQE failures come from 2-qubit (CNOT) errors and decoherence.
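The RB numbers above come from fitting an exponential decay: survival probability P(m) = A·p^m + B over Clifford sequence length m, with error per gate r = (1 − p)/2 for a single qubit. A simplified sketch, assuming the ideal depolarizing values A = B = 0.5 rather than fitting them (real RB analyses fit all three parameters):

```python
import math

def rb_error_per_gate(lengths, survival, A=0.5, B=0.5):
    """Fit P(m) = A * p**m + B by a log-linear least-squares fit through
    the origin and return the error per Clifford r = (1 - p) / 2."""
    ys = [math.log((s - B) / A) for s in survival]  # equals m * log(p)
    slope = sum(m * y for m, y in zip(lengths, ys)) / sum(m * m for m in lengths)
    p = math.exp(slope)
    return (1 - p) / 2

# Synthetic decay with a 0.18% error per gate, like Tuna-9 / IQM Garnet
p_true = 1 - 2 * 0.0018
lengths = [2, 8, 32, 128, 512]
survival = [0.5 * p_true ** m + 0.5 for m in lengths]
print(f"{rb_error_per_gate(lengths, survival):.2%}")  # 0.18%
```

The compiler caveat in the text maps directly onto this fit: if the transpiler collapses each length-m Clifford sequence into one gate, p stays near 1 regardless of m, and the extracted "gate error" mostly reflects readout.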

The Reproducibility Gap

Here's the central finding, visualized across all 4 papers:

| Backend | Claims Tested | Pass | Partial | Fail |
|---|---|---|---|---|
| QI Emulator | 13 | 12 | 0 | 0 |
| IBM Torino | 7 | 3 | 3 | 1 |
| QI Tuna-9 | 5 | 3 | 1 | 1 |
| IQM Garnet | 5 | 5 | 0 | 0 |

The pattern is clear:

  • Emulators reproduce nearly everything. The physics in published papers is correct. When you remove noise, the algorithms work as described.
  • Hardware introduces a reproducibility gap. IBM Torino gets VQE results that are qualitatively correct but quantitatively off by 9 kcal/mol — not chemical accuracy. Tuna-9 passes benchmarks (QV, RB) but fails VQE.
  • The gap depends on the experiment type. Benchmarks (QV, RB) are designed to be noise-tolerant. VQE is noise-sensitive. Same hardware, different outcomes.

This matches what Sagastizabal et al. themselves showed in 2019: error mitigation (symmetry verification) was essential for their results. Without it, their hardware couldn't achieve chemical accuracy either. We're seeing the same thing, seven years later, on different hardware.

What AI Agents Bring to Replication

This project wasn't about whether AI can write quantum circuits (it can). It was about whether AI agents can systematically test published claims and produce structured, comparable results. Three things stood out:

  1. Cross-platform comparison is hard for humans, easy for agents. The same VQE circuit had to be written in cQASM 3.0 for Tuna-9 and OpenQASM 2.0 for IBM, with different qubit conventions, basis rotations, and measurement protocols. An agent handles this translation reliably once the conventions are debugged.
  2. Structured output enables meta-analysis. Every result is stored as JSON with claim IDs, published values, measured values, failure modes, and error classifications. This makes it possible to ask "what fraction of VQE claims reproduce on hardware?" across papers — something manual replication rarely enables.
  3. The agent catches its own mistakes. Our first IBM VQE submission used the wrong ansatz (Ry(0.1118) → CNOT instead of Ry(−0.2235) → CNOT → X). The agent detected the error by comparing counts against the expected state, resubmitted with the correct circuit, and documented both runs. Self-correction is built into the loop.
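The structured records behind point 2 might look like the sketch below. The field names and values here are illustrative, not the project's actual schema (the job ID is one of the real IBM VQE job IDs listed at the end of the post):

```python
import json

# Hypothetical shape of one replication record -- illustrative only
record = {
    "claim_id": "h2-ground-state-energy",
    "paper": "Kandala et al. 2017",
    "backend": "ibm_torino",
    "published_value": -1.1373,   # Hartree
    "measured_value": -1.1226,    # Hartree
    "status": "partial",          # pass / partial / fail
    "failure_mode": "hardware_noise",
    "job_ids": ["d65n0gbe4kfs73cvkisg"],
}
print(json.dumps(record, indent=2))
```

Because every record shares one schema, meta-questions ("what fraction of VQE claims reproduce on hardware?") become a filter-and-count over JSON rather than a manual literature trawl.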

What's Next

Update (2026-02-10): We've now completed Peruzzo on IBM (HeH+ bond sweep, MAE 83.5 kcal/mol — PES shape correct but absolute values noise-dominated) and Cross on IBM (QV PASS at n=2 and n=3, RB 99.99%). See the cross-platform comparison post for the full story, including [[4,2,2]] quantum error correction and a neural network decoder trained on hardware data.

Remaining targets:

  • Harrigan et al. 2021 — QAOA for MaxCut on non-planar graphs. This requires 23 qubits and will be our first test beyond the small-circuit regime.
  • Error mitigation — Implement Sagastizabal's symmetry verification to see if the IBM VQE result improves from 9.2 kcal/mol toward chemical accuracy.
  • Peruzzo 2014 on Tuna-9 — HeH+ bond sweep on 9-qubit hardware. Will the topology constraints that broke QEC also affect VQE?

All raw data — measurement counts, job IDs, expectation values, circuit definitions — is available in the experiments/results/ directory and replication reports. The replications dashboard at /replications shows live results.

Hardware job IDs: IBM VQE (d65n0gbe4kfs73cvkisg, d65n0gre4kfs73cvkitg, d65n0hbe4kfs73cvkivg). Tuna-9 QV (415379–415394). Tuna-9 RB (415395–415404).

Sources & References