More mitigation does not always mean better results. The winning strategy is often the simplest one.
We Tested 15 Error Mitigation Strategies. Only One Achieved Chemical Accuracy.
IBM's TREX (readout error correction) hit 0.22 kcal/mol. Tuna-9's best combo (readout mitigation + post-selection) averaged 2.52 kcal/mol. Zero-noise extrapolation made things worse. Here's what actually works for near-term quantum chemistry.
After running 50+ VQE experiments across two quantum backends, we had a nagging question: we know the hardware errors are ~7-10 kcal/mol, but where exactly is the error coming from, and what actually fixes it?
We systematically tested every error mitigation technique available to us — from simple post-selection to IBM's advanced TREX readout correction to zero-noise extrapolation — and ranked them by effectiveness. The results surprised us.
The IBM Mitigation Ladder
On IBM Torino, we ran the same H2 VQE circuit (bond distance R=0.735 Å, using a 2-qubit ansatz — the trial quantum state structure that VQE optimizes) through IBM's Estimator API with progressively more mitigation layers. Each technique adds cost (more shots, more QPU time) but is supposed to reduce error.
| Rank | Technique | Energy (Ha) | Error (kcal/mol) | QPU time |
|---|---|---|---|---|
| 1 | TREX (resilience=1) | −1.1377 | 0.22 | 14s |
| 2 | TREX + DD | −1.1352 | 1.33 | 14s |
| 3 | Offline PS (Run 1) | −1.1347 | 1.66 | 5s |
| 4 | SamplerV2 + DD + Twirl + PS | −1.1317 | 3.50 | 14s |
| 5 | TREX + 16K shots | −1.1313 | 3.77 | 23s |
| 6 | Offline PS (weighted mean) | −1.1311 | 3.91 | — |
| 7 | TREX + DD + Twirl | −1.1214 | 10.0 | 14s |
| 8 | ZNE linear [1,2,3] | −1.1168 | 12.84 | 20s |
| 9 | Raw (resilience=0) | −1.0956 | 26.2 | 5s |
| 10 | ZNE exponential [1,2,3,5] | NaN | NaN | 23s |
FCI reference: −1.1373 Ha. Chemical accuracy threshold: 1.0 kcal/mol.
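The error column can be reproduced from the energy column with a unit conversion. The sketch below assumes the standard factor of about 627.509 kcal/mol per Hartree; the function names are ours, chosen for illustration. Because the table energies are rounded to four decimals, recomputed errors can differ from the published column in the last digit.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509   # standard conversion factor
FCI_REFERENCE_HA = -1.1373          # reference energy quoted above
CHEMICAL_ACCURACY_KCAL = 1.0        # threshold quoted above


def error_kcal_per_mol(energy_ha, reference_ha=FCI_REFERENCE_HA):
    """Absolute deviation from the reference, converted to kcal/mol."""
    return abs(energy_ha - reference_ha) * HARTREE_TO_KCAL_PER_MOL


def is_chemically_accurate(energy_ha):
    """True when the deviation is below the 1.0 kcal/mol threshold."""
    return error_kcal_per_mol(energy_ha) < CHEMICAL_ACCURACY_KCAL


# Three rows from the ladder: TREX, TREX + DD, and the raw run.
for label, energy in [("TREX", -1.1377), ("TREX + DD", -1.1352), ("Raw", -1.0956)]:
    print(label, round(error_kcal_per_mol(energy), 2), is_chemically_accurate(energy))
```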
Key finding: more mitigation ≠ better
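The best technique is the simplest advanced option: TREX alone at 0.22 kcal/mol, well within chemical accuracy. TREX (Twirled Readout Error eXtinction) mitigates readout errors by randomly flipping qubits with X gates just before measurement and undoing the flips in classical post-processing. This turns biased readout error into a simple rescaling of each expectation value, which a calibration run can divide out. That is exactly the kind of correction our noise analysis called for: readout error is the dominant noise source.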
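To make the idea behind twirled readout mitigation concrete, here is a minimal single-qubit sketch. It is not IBM's implementation: it works with exact expectation values rather than sampled shots, and the error rates `p01` and `p10` are hypothetical. It shows that averaging over "flip" and "no flip" shots rescales ⟨Z⟩ by a single factor, which a twirled calibration run on |0⟩ then removes.

```python
def z_expectation(p_zero, p01, p10, twirl):
    """Noisy <Z> for a qubit that is truly |0> with probability p_zero.

    p01 = P(read 1 | true 0), p10 = P(read 0 | true 1).
    With twirl=True, half the shots get an X gate before readout and the
    recorded bit is flipped back classically.
    """
    def noisy_z(prob_zero):
        # <Z> after asymmetric readout error, without any twirling
        return prob_zero * (1 - 2 * p01) - (1 - prob_zero) * (1 - 2 * p10)

    plain = noisy_z(p_zero)
    if not twirl:
        return plain
    # X before readout swaps the populations; the classical flip negates Z.
    flipped = -noisy_z(1 - p_zero)
    return 0.5 * (plain + flipped)


def twirled_readout_mitigate(p_zero, p01, p10):
    """Divide the twirled estimate by the twirled calibration on |0>."""
    calibration = z_expectation(1.0, p01, p10, twirl=True)  # equals 1 - p01 - p10
    return z_expectation(p_zero, p01, p10, twirl=True) / calibration


# Hypothetical example: true <Z> = 2 * 0.7 - 1 = 0.4
print(z_expectation(0.7, 0.02, 0.08, twirl=False))   # biased estimate
print(twirled_readout_mitigate(0.7, 0.02, 0.08))     # corrected estimate
```

The calibration factor depends only on the readout error rates, not on the circuit, which is why this correction adds no gates to the circuit body.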
But adding dynamical decoupling (DD — extra pulses inserted during idle time to fight decoherence) to TREX makes it worse (1.33 kcal/mol). Adding DD and Pauli twirling (randomizing gate sequences to average out systematic errors) makes it 45x worse (10.0 kcal/mol). Why? These techniques add extra gates to suppress coherent errors (systematic errors that accumulate predictably) — but our circuit is only 3 gates deep. The overhead of the mitigation exceeds the error it's trying to fix.
ZNE (zero-noise extrapolation) is the worst-performing mitigation technique; only the unmitigated run is worse. The linear extrapolation gives 12.84 kcal/mol, and the exponential fit fails entirely (returns NaN). This confirms what we found on Tuna-9: CNOT gate noise is not the dominant error source on either backend. ZNE deliberately amplifies gate noise and extrapolates back to zero, but when gate noise is already small compared to readout error, there is no useful trend to extrapolate.
The lesson: match the mitigation to the noise. Readout-dominated errors need readout correction (TREX, confusion matrix inversion). Gate-dominated errors need gate-level mitigation (ZNE, DD). Applying gate-level fixes to readout-dominated circuits just adds overhead.
Tuna-9: Offline REM Reanalysis
We had 21 Tuna-9 VQE results with raw measurement counts, plus a readout calibration (confusion matrix) for q[2,4]. Could we retroactively improve the results by applying readout error mitigation (REM) offline?
We tested 5 strategies on every result:
- Raw — no mitigation at all
- PS — parity post-selection only (discard even-parity Z-basis shots)
- REM — confusion matrix inversion only
- REM+PS — apply REM to raw counts, then post-select
- PS+REM — post-select first, then apply REM
| Strategy | Mean error | Median | Min | Max | Wins |
|---|---|---|---|---|---|
| Raw | 32.45 | 31.11 | 13.48 | 88.28 | 0 |
| PS only | 8.30 | 8.50 | 2.79 | 17.32 | 0 |
| REM only | 8.62 | 6.55 | 0.00 | 39.02 | 3 |
| REM+PS | 2.52 | 2.39 | 0.13 | 7.60 | 13 |
| PS+REM | 3.90 | 3.56 | 0.05 | 10.32 | 5 |
All values in kcal/mol. N=21 experiments. "Wins" = number of experiments where this strategy gave the lowest error.
REM+PS wins 62% of the time
The combination of confusion matrix correction followed by parity post-selection is the clear winner on Tuna-9. It cuts the mean error from 8.30 (PS alone) to 2.52 kcal/mol — a 70% improvement. Several individual runs hit chemical accuracy: 0.13, 0.18, and 0.27 kcal/mol.
Why does ordering matter? REM first corrects the measurement bias across all four 2-qubit states (00, 01, 10, 11). This shifts probability from over-counted states to under-counted ones. Then post-selection removes any remaining parity violations. If you post-select first, you throw away shots before the readout correction can redistribute them — you lose information.
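The REM-then-PS ordering described above can be sketched in a few lines. This is an illustration under stated assumptions, not our analysis script: it assumes uncorrelated readout errors (so the 2-qubit confusion matrix is a tensor product of two 2x2 matrices), and the counts and error rates in the example are hypothetical.

```python
def single_qubit_inverse(p01, p10):
    """Invert the confusion matrix [[1-p01, p10], [p01, 1-p10]].

    p01 = P(read 1 | prepared 0), p10 = P(read 0 | prepared 1).
    The returned matrix is indexed [true_bit][measured_bit].
    """
    det = 1.0 - p01 - p10
    return [[(1.0 - p10) / det, -p10 / det],
            [-p01 / det, (1.0 - p01) / det]]


def apply_rem(probs, inv_a, inv_b):
    """Apply the tensor-product inverse to a 2-qubit distribution."""
    corrected = {}
    for a_true in (0, 1):
        for b_true in (0, 1):
            corrected[f"{a_true}{b_true}"] = sum(
                inv_a[a_true][a_meas] * inv_b[b_true][b_meas]
                * probs.get(f"{a_meas}{b_meas}", 0.0)
                for a_meas in (0, 1) for b_meas in (0, 1)
            )
    return corrected


def post_select_odd_parity(probs):
    """Keep only 01 and 10, clip negative quasi-probabilities, renormalize."""
    kept = {s: max(probs.get(s, 0.0), 0.0) for s in ("01", "10")}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


def rem_then_ps(counts, inv_a, inv_b):
    """REM on the full 4-state distribution first, then parity post-selection."""
    shots = sum(counts.values())
    probs = {s: c / shots for s, c in counts.items()}
    return post_select_odd_parity(apply_rem(probs, inv_a, inv_b))


# Hypothetical counts and error rates, for illustration only.
counts = {"00": 60, "01": 480, "10": 430, "11": 30}
inv_q0 = single_qubit_inverse(p01=0.02, p10=0.08)
inv_q1 = single_qubit_inverse(p01=0.03, p10=0.06)
print(rem_then_ps(counts, inv_q0, inv_q1))
```

Note that the inversion step sees all four outcomes, including the even-parity shots that post-selection would otherwise have discarded first.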
REM alone (mean 8.62) is actually worse on average than PS alone (8.30), even though its median is lower (6.55 vs. 8.50): a few large outliers (max 39.02) drag the mean up. The confusion matrix correction can introduce artifacts when applied without the parity constraint, because it redistributes probability to all four states, including the wrong-parity ones. Post-selection cleans this up.
Bond distance matters
| R (Å) | N | Raw (kcal/mol) | PS (kcal/mol) | REM+PS (kcal/mol) | Improvement vs. PS |
|---|---|---|---|---|---|
| 0.500 | 1 | 38.10 | 9.98 | 5.05 | 49% |
| 0.735 (eq.) | 16 | 36.28 | 7.30 | 2.15 | 71% |
| 1.000 | 1 | 13.48 | 4.12 | 3.64 | 12% |
| 1.500 | 1 | 17.06 | 12.68 | 3.40 | 73% |
| 2.000 | 1 | 18.69 | 17.32 | 3.91 | 77% |
| 2.500 | 1 | 13.72 | 13.42 | 2.39 | 82% |
REM+PS improves every bond distance. The biggest improvement is at large R (2.0–2.5 Å), where PS alone barely helps because the X/Y basis errors dominate — and REM corrects those too. The smallest improvement is at R=1.0, where the circuit already benefits from near-optimal PS performance.
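The improvement column is the relative reduction in error when going from PS alone to REM+PS. A short sketch that recomputes it from the PS and REM+PS columns of the table (the function name is ours):

```python
def improvement_percent(ps_error, rem_ps_error):
    """Relative error reduction of REM+PS compared with PS alone, in percent."""
    return 100.0 * (ps_error - rem_ps_error) / ps_error


# (PS, REM+PS) errors in kcal/mol, keyed by bond distance in angstroms.
table = {
    0.500: (9.98, 5.05),
    0.735: (7.30, 2.15),
    1.000: (4.12, 3.64),
    1.500: (12.68, 3.40),
    2.000: (17.32, 3.91),
    2.500: (13.42, 2.39),
}

for distance, (ps, rem_ps) in table.items():
    print(f"R={distance:.3f}: {improvement_percent(ps, rem_ps):.0f}%")
```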
ZNE fold factor interaction
We had 12 experiments with ZNE gate folding (1, 3, or 5 CNOT insertions). Does REM interact with the ZNE signal?
| CNOT folds | N | PS (kcal/mol) | REM+PS (kcal/mol) |
|---|---|---|---|
| 1 (baseline) | 13 | 8.65 | 2.55 |
| 3 | 4 | 8.62 | 3.25 |
| 5 | 4 | 6.86 | 1.68 |
The trend is noisy with small N, but REM+PS at fold=5 (1.68 kcal/mol) has the lowest group mean in the Tuna-9 reanalysis. This hints that ZNE might have a mild effect once readout error is removed, but we'd need more data to confirm.
Cross-Platform Comparison
How do the best techniques compare across backends?
| Backend | Best technique | Error (kcal/mol) | Chem. accuracy? |
|---|---|---|---|
| Emulator | None needed | 0.75 | Yes |
| IBM Torino | TREX | 0.22 | Yes |
| Tuna-9 q[2,4] | Hybrid PS+REM | 0.92 | Yes |
| Tuna-9 q[6,8] | Hybrid PS+REM | 1.32 | No (close) |
| Tuna-9 q[4,6] | Z-PS+REM | 6.2 | No |
| Tuna-9 q[0,1] | PS only | 9.45 | No |
Two hardware backends now achieve chemical accuracy. IBM's TREX (0.22 kcal/mol) is a proprietary built-in. But the open-source approach, hybrid post-selection plus confusion matrix inversion, also achieves it on Tuna-9 q[2,4] (0.92 kcal/mol) and comes close on an independent qubit pair, q[6,8] (1.32 kcal/mol, just above the 1.0 kcal/mol threshold). The key: use post-selection for the Z-basis (catches parity leakage) and confusion matrix inversion for the X/Y-basis (corrects readout bias). This hybrid strategy works on any backend with calibration data.
Why ZNE Failed on Both Backends
Zero-noise extrapolation assumes that gate noise increases monotonically with circuit depth. You run the circuit at multiple noise levels (by inserting extra identity-equivalent gate pairs), then extrapolate back to zero noise.
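As a minimal sketch of the extrapolation step, assuming a plain least-squares line through the measured values (real ZNE implementations offer other extrapolators), the function below returns the intercept at zero noise together with the slope. The energies in the example are hypothetical and only illustrate the two regimes.

```python
def linear_zne(scale_factors, values):
    """Fit a least-squares line and return (value at scale 0, slope)."""
    n = len(scale_factors)
    mean_x = sum(scale_factors) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scale_factors)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(scale_factors, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical energies (Ha) at noise scale factors 1, 2, 3.
gate_dominated = [-1.10, -1.08, -1.06]      # error grows with added gates
readout_dominated = [-1.10, -1.102, -1.099]  # essentially flat

print(linear_zne([1, 2, 3], gate_dominated))     # clear slope, useful intercept
print(linear_zne([1, 2, 3], readout_dominated))  # near-zero slope, intercept ~ input
```

When the slope is indistinguishable from shot noise, the intercept is just the unscaled value plus extra statistical error, which is the situation described in the following paragraphs.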
On Tuna-9, we ran 12 experiments with 1, 3, and 5 CNOT folds. The PS-only error did not grow with fold count: 8.65, 8.62, and 6.86 kcal/mol. Extra CNOTs added no measurable noise (the error was actually lowest at 5 folds), so the signal ZNE needs to extrapolate simply isn't there.
On IBM Torino, ZNE with DD+twirling gave 12.84 kcal/mol (linear) and NaN (exponential). The base error with DD+twirling is already 10.0 kcal/mol, far worse than the 0.22 kcal/mol of TREX alone, so ZNE started from a degraded baseline.
The root cause is the same on both backends: our VQE circuit is only 3 native gates deep (Ry, CNOT, X). Gate noise contributes <20% of total error. The dominant errors are readout bias, state preparation imperfections, and T1/T2 decoherence (energy loss and phase scrambling) during measurement. None of these scale with gate count, so ZNE's extrapolation has no signal to amplify.
ZNE would likely work better on deeper circuits (QAOA with multiple layers, Trotterized dynamics) where gate noise dominates. For shallow VQE, it's the wrong tool.
What We Learned
- Match mitigation to noise type. Readout-dominated circuits need readout correction (TREX, confusion matrices). Gate-dominated circuits need gate-level mitigation (ZNE, DD). Mismatching wastes QPU time and can make results worse.
- Technique stacking can backfire. On IBM, TREX alone (0.22 kcal/mol) beats TREX+DD (1.33) beats TREX+DD+Twirl (10.0). Each additional layer adds overhead that exceeds its benefit for shallow circuits.
- Hybrid strategy beats uniform application. PS for Z-basis + REM for X/Y-basis (0.92 kcal/mol) beats REM-everywhere (2.52 kcal/mol) or PS-everywhere (3.04 kcal/mol). Each technique targets a different error source: PS catches parity-violating leakage, REM corrects systematic readout bias on rotated bases.
- IBM's TREX is genuinely impressive — but not the only path. Chemical accuracy on real hardware from a single API parameter is a major engineering achievement. But the open-source hybrid PS+REM approach also achieves it on Tuna-9 (0.92 kcal/mol), proving the result is portable across platforms.
- Simple techniques close most of the gap. Going from raw (22 kcal/mol) to PS (3.0) to hybrid PS+REM (0.92 kcal/mol) on Tuna-9 removes 96% of the error using techniques that work on any backend with a confusion matrix.
Data: the IBM mitigation ladder, the Tuna-9 REM reanalysis, and the readout calibration (confusion matrices) are linked under Sources & References.
Sources & References
- IBM mitigation ladder (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/vqe-mitigation-ladder-001-ibm.json
- Tuna-9 REM reanalysis (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/tuna9-rem-reanalysis.json
- Readout calibration data (JSON): https://github.com/JDerekLomas/quantuminspire/blob/main/experiments/results/readout-cal-tuna9-q24-001.json
- Experiments dashboard: https://haiqu.org/experiments
- IBM Qiskit Runtime Primitives: https://docs.quantum.ibm.com/api/qiskit-ibm-runtime