Verification of the Δχ² ≈ −74 claim (WILL RG vs GR 1PN, S2 star)
Date: 2026-06-13 · Data: Do et al. 2019 astrometry (46 epochs) + combined RV (82 points, 6 instruments), N = 174 · Document under review: RG_vs_GR_Delta74.txt
Verdict
The question asked was whether the Δχ² ≈ −74 is a real success or the product of a mistake. The answer is neither: it is not a coding mistake, and it is not evidence about gravity. Throughout, Δχ² = χ²(WILL RG) − χ²(GR), so negative values favor WILL RG. The number is arithmetically genuine. I reproduce it independently, and against a true numerically integrated GR solution the gap is even slightly larger (Δχ² = −80.7). But when both models are given the standard instrumental-systematics terms that every published S2 analysis includes (astrometric reference-frame zero-point and drift, per-instrument RV offsets), the gap collapses to Δχ² = +0.5, i.e. zero within noise, with GR marginally ahead. The advantage came from the rosette parametrization’s greater ability to absorb unmodeled reference-frame systematics, not from better physics. The single largest share of the advantage (37.5 of ~81 χ² units) lives in the 1995–2006 speckle-era astrometry, where no relativistic signal of any kind is detectable; almost none of it concentrates at the 2018 pericenter passage, where the actual relativistic physics happens.
| Model | χ², document setup (9 params) | χ², + standard systematics (18 params) |
|---|---|---|
| WILL RG (verbatim pipeline) | 728.62 | 228.28 |
| GR, time-linear ω(t) (document’s baseline) | 800.81 | 225.07 |
| GR, direct integration of 1PN EOM | 809.28 | 227.78 |
χ²/dof falls from 4.4–4.9 to 1.44–1.46 across the three models. A 9-parameter fit that leaves 500–580 χ² units of unmodeled systematics in the residuals cannot support an 80-unit inference between two gravity models; once the systematics are modeled, no preference remains. All three models independently converge to the same systematic values (zero-point ≈ 1 mas; frame drift ≈ 0.08–0.21 mas/yr, the known NIRC2 reference-frame drift scale, cf. Plewa et al. 2015; and instrument RV offsets of 10–95 km/s), confirming that these are real features of the data.
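The arithmetic behind the table can be re-derived in a few lines. The sketch below only copies the χ² values from the table and assumes dof = N − (number of parameters) with N = 174; the function and variable names are mine, not the document's.

```python
# chi^2 values copied from the table: (9-parameter fit, 18-parameter fit)
CHI2 = {
    "WILL RG": (728.62, 228.28),
    "GR time-linear": (800.81, 225.07),
    "GR integrated 1PN": (809.28, 227.78),
}
N_DATA = 174  # my reading of the header: 46 epochs x 2 sky coordinates + 82 RVs


def reduced_chi2(chi2, n_params, n_data=N_DATA):
    """chi^2 per degree of freedom, with dof = n_data - n_params."""
    return chi2 / (n_data - n_params)


def delta_chi2(model, baseline, column):
    """chi2(model) - chi2(baseline); negative values favour `model`."""
    return CHI2[model][column] - CHI2[baseline][column]


for name, (chi2_9, chi2_18) in CHI2.items():
    print(f"{name:18s} chi2/dof: {reduced_chi2(chi2_9, 9):.2f} -> "
          f"{reduced_chi2(chi2_18, 18):.2f}")

for base in ("GR time-linear", "GR integrated 1PN"):
    print(f"WILL RG vs {base}: {delta_chi2('WILL RG', base, 0):+.2f} -> "
          f"{delta_chi2('WILL RG', base, 1):+.2f}")
```

The second loop makes the sign change explicit: the gap is negative (WILL RG ahead) in the 9-parameter column and positive (GR ahead) in the 18-parameter column.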
What I reproduced from the document
Every number in the document checks out arithmetically. Running the document’s own scripts (re-implemented and verified against the originals): GR time-linear at the reported parameters gives χ² = 800.809 (doc: 800.81); the WILL pipeline gives 728.624 (doc script output: 728.62). Re-optimization with multi-start confirms both are converged. The document’s hybrid diagnostic is also confirmed: swapping the precession magnitude between theories changes nothing (f_prec agrees to ~3×10⁻⁵ relative: 0.00058191 vs 0.00058192); swapping the redshift formula changes almost nothing (the GR additive form is in fact 1.0 unit better than WILL’s multiplicative chain); only the ω-parametrization (phase-coupled vs time-linear) moves χ² by ~73. The document’s decomposition section was honest and correct as far as it went. What it did not do is ask whether either 9-parameter model is an adequate description of the data in the first place — and neither is.
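To make the phase-coupled versus time-linear distinction concrete, the sketch below asks how much of one orbit's apsidal advance each idealized form deposits within ±6 months of pericenter. It assumes a pure Kepler clock, a phase-coupled advance proportional to true anomaly, and a period of about 16.05 yr (my assumption; the period is not quoted in this report). It is an illustration of the two parametrizations, not the document's pipeline, and the phase-coupled number need not match figures measured on an integrated orbit.

```python
import math

ECC = 0.887           # S2 eccentricity quoted in this report
PERIOD_YR = 16.05     # ASSUMPTION: approximate S2 period, not stated in this report
HALF_WINDOW_YR = 0.5  # window of +/- 6 months around pericenter


def true_anomaly(dt_yr, e=ECC, period=PERIOD_YR):
    """True anomaly (rad) a time dt_yr after pericenter, for 0 <= dt_yr <= period/2."""
    mean_anom = 2.0 * math.pi * dt_yr / period
    ecc_anom = math.pi  # Newton on Kepler's equation converges from pi
    for _ in range(100):
        step = (ecc_anom - e * math.sin(ecc_anom) - mean_anom) / (
            1.0 - e * math.cos(ecc_anom))
        ecc_anom -= step
        if abs(step) < 1e-14:
            break
    return 2.0 * math.atan2(math.sqrt(1.0 + e) * math.sin(ecc_anom / 2.0),
                            math.sqrt(1.0 - e) * math.cos(ecc_anom / 2.0))


# Phase-coupled: advance grows with true anomaly, so the share inside the
# window is (2 * nu_window) / (2 * pi).
frac_phase_coupled = true_anomaly(HALF_WINDOW_YR) / math.pi

# Time-linear: advance grows with time, so the share is (window length) / period.
frac_time_linear = 2.0 * HALF_WINDOW_YR / PERIOD_YR

print(f"phase-coupled share near pericenter: {frac_phase_coupled:.1%}")
print(f"time-linear share near pericenter:   {frac_time_linear:.1%}")
```

Both forms accumulate the same total advance per orbit; they differ only in when it is applied, which is why swapping the precession magnitude changes nothing while swapping the ω-parametrization does.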
A correction to my own earlier working hypothesis
In the first phase of this verification I hypothesized that the document’s GR baseline was a strawman: that the Damour–Deruelle quasi-Keplerian solution makes GR’s pericenter advance phase-locked, so the rosette form is GR and the gap would vanish against a proper GR implementation. The direct integration falsified the second half of that hypothesis, and I state this plainly. I integrated the exact 1PN harmonic-gauge equations of motion (DOP853, validated below), where no parametrization choice exists — the trajectory is whatever GR produces. Results:
First, true GR fit to the real data gives χ² = 809.28, worse than the document’s time-linear baseline (800.81), not better. The document’s crude-looking baseline was, numerically, a fair stand-in for GR on this dataset — within 8.5 χ² units of the real thing — and the document’s headline gap survives the strawman test. I was wrong to expect otherwise, and the document deserves that acknowledgment.
Second, the resolution of the apparent paradox: fitting both analytic forms to noise-free synthetic observables generated by integrated GR, the time-linear model reproduces true GR to 36 μas RMS on the sky (mismatch χ² = 1.3), while the single-eccentricity rosette misses it by 440 μas RMS (mismatch χ² = 21). Yet the osculating apsidal direction of the same integrated orbit (Laplace–Runge–Lenz vector) advances as a phase-locked staircase — 67% of each orbit’s advance accumulates within ±6 months of pericenter, versus 7% for a linear ramp (see parametrization_test.png). Both facts are true simultaneously because the osculating elements are not the observable: the full Damour–Deruelle solution carries three distinct eccentricities (radial, temporal, angular) and 1PN periodic terms, and at S2’s eccentricity (e ≈ 0.887) these O(β²) structures matter at the hundreds-of-μas level. The single-eccentricity rosette r = p/(1+e cos((1−f)ν)), driven by a single-eccentricity Kepler clock, is therefore not GR’s trajectory — it is a third curve, distinct from both GR and the time-linear model, and through this particular 24-year data window the time-linear form happens to track true GR more closely. So the document’s “Geometric Incompatibility” section is moot in both directions: phase-locked apsidal motion is not foreign to GR (it is GR’s own osculating behavior), and the rosette’s extra fitting freedom is not “what GR would do if it could” — it is simply a different 9-parameter family whose flexibility, on this dataset, aligns with the systematics.
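The osculating apsidal direction referred to above can be read off any instantaneous state with the Newtonian Laplace–Runge–Lenz vector A = v × L − GM r̂, which points at the osculating pericenter and has magnitude GM·e. The helper below is a minimal planar sketch; the function name and the scaled units (GM = 1, a = 1) are my own choices, not taken from gr_integrated_lib.py.

```python
import math


def osculating_apsis(x, y, vx, vy, gm=1.0):
    """Return (pericenter angle in rad, osculating eccentricity) for a planar state."""
    h = x * vy - y * vx        # specific angular momentum (z component)
    r = math.hypot(x, y)
    # A = v x (h z_hat) - gm * r_hat, written out in components
    a_x = vy * h - gm * x / r
    a_y = -vx * h - gm * y / r
    return math.atan2(a_y, a_x), math.hypot(a_x, a_y) / gm


# Usage: a Kepler orbit with e = 0.887 whose pericenter sits at 0.3 rad.
e, a, omega = 0.887, 1.0, 0.3
r_peri = a * (1.0 - e)
v_peri = math.sqrt((1.0 + e) / r_peri)  # Newtonian pericenter speed for GM = 1
state = (r_peri * math.cos(omega), r_peri * math.sin(omega),
         -v_peri * math.sin(omega), v_peri * math.cos(omega))
angle, ecc = osculating_apsis(*state)
print(angle, ecc)
```

Evaluated along a 1PN trajectory, this angle is what traces the staircase; evaluated along a pure Kepler orbit it stays constant.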
Integrator validation
Since the verdict leans on the integration, it was validated independently: tolerance convergence (rtol 10⁻¹⁰ vs 10⁻¹²) changes sky positions by 0.0005 μas and RVs by 0.17 m/s; cross-method comparison (DOP853 vs RK45) agrees to 0.012 μas; the 1PN energy returns to its value at successive pericenters to 1.6×10⁻¹² relative (the periodic 4×10⁻⁵ excursion is the expected 2PN-order residual of the truncated invariant, not drift); and the measured apsidal advance per orbit, with Brent-refined pericenter times, is 0.0036074 rad against the analytic 6πGM/(c²p) = 0.0036070 rad, with p = a(1−e²). That is agreement to 1.2×10⁻⁴ relative, i.e. the integrator delivers the correct GR precession. The fit to real data was confirmed from two independent starting basins (converging to 809.279 in both), and all three nuisance fits are stable under jittered restarts.
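The apsidal-advance check can be illustrated with a small stand-alone sketch. It integrates the test-particle 1PN harmonic-gauge equations of motion and compares the pericenter-to-pericenter rotation with 6πGM/(c²p). Everything here is a simplification of mine and not the report's code: a fixed-step RK4 replaces DOP853, linear interpolation replaces Brent refinement, the units are scaled (GM = 1, a = 1), and the orbit is an illustrative e = 0.5 with GM/(c²p) = 10⁻³ rather than S2's. Because the initial conditions use Newtonian elements, only percent-level agreement with the analytic value should be expected from this sketch.

```python
import math

GM = 1.0                   # scaled units (assumption): GM = 1
A, ECC = 1.0, 0.5          # illustrative orbit, NOT S2's (e ~ 0.887)
P = A * (1.0 - ECC**2)     # Newtonian semi-latus rectum
INV_C2 = 1.0e-3 * P / GM   # 1/c^2, chosen so that GM/(c^2 p) = 1e-3


def deriv(s):
    """Time derivative of (x, y, vx, vy) under the test-particle 1PN EOM."""
    x, y, vx, vy = s
    r = math.hypot(x, y)
    v2 = vx * vx + vy * vy
    rv = x * vx + y * vy
    k = GM / r**3
    radial = -1.0 + INV_C2 * (4.0 * GM / r - v2)  # coefficient of position
    along_v = 4.0 * INV_C2 * rv                   # coefficient of velocity
    return (vx, vy, k * (radial * x + along_v * vx),
            k * (radial * y + along_v * vy))


def rk4_step(s, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = deriv(s)
    k2 = deriv(shifted(k1, 0.5 * dt))
    k3 = deriv(shifted(k2, 0.5 * dt))
    k4 = deriv(shifted(k3, dt))
    return tuple(si + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))


def pericenter_angles(n_passes=2, steps_per_orbit=20000):
    """Position angles (rad) at successive pericenter passages after the start."""
    r0 = A * (1.0 - ECC)
    v0 = math.sqrt(GM * (1.0 + ECC) / r0)   # Newtonian pericenter speed
    s = (r0, 0.0, 0.0, v0)                  # start at pericenter on the +x axis
    dt = 2.0 * math.pi * math.sqrt(A**3 / GM) / steps_per_orbit
    angles, rv_prev = [], 0.0
    for _ in range(10 * n_passes * steps_per_orbit):  # generous step cap
        s_new = rk4_step(s, dt)
        rv_new = s_new[0] * s_new[2] + s_new[1] * s_new[3]
        if rv_prev < 0.0 <= rv_new:         # r.v crosses zero upward: pericenter
            f = -rv_prev / (rv_new - rv_prev)   # linear interpolation in the step
            x = s[0] + f * (s_new[0] - s[0])
            y = s[1] + f * (s_new[1] - s[1])
            angles.append(math.atan2(y, x))
            if len(angles) == n_passes:
                break
        s, rv_prev = s_new, rv_new
    return angles


angles = pericenter_angles()
measured = angles[1] - angles[0]
analytic = 6.0 * math.pi * GM * INV_C2 / P
print(f"measured advance per orbit: {measured:.6f} rad")
print(f"analytic 6*pi*GM/(c^2 p):   {analytic:.6f} rad")
```

An adaptive high-order method with root-refined pericenter times, as used in the report, is what makes the much tighter 10⁻⁴-level comparison meaningful; the fixed-step version is only meant to show the shape of the test.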
What this means for the document’s claims
The claim “WILL RG achieves Δχ² ≈ −74 with the same parameter count” is true as arithmetic and survives replacement of the baseline by true GR. The claim that this demonstrates “superior empirical fidelity” of relational phase mechanics does not survive: the advantage is entirely absorbed by standard systematic terms, is concentrated in data segments with no relativistic content, and reverses sign once those terms are included. The “Geometric Incompatibility” derivation is internally a correct statement about affine reparametrization but draws the wrong conclusion from it; its own case 1 (a reparametrization leaves the physical curve unchanged) describes the relationship between the Damour–Deruelle phase-locked form and the geodesic, while the empirically distinct rosette curve owes its χ² advantage to systematics, not to physics GR “cannot execute.” The G-free derivation of R_s from (T, β) is sound but is a property of Kepler’s third law plus the definition of β, available to GR in identical form. The ~140× speed claim compares an unvectorized loop against a vectorized one: my vectorized time-linear GR evaluates in milliseconds, the same order as the WILL pipeline.
What genuinely survives: R.O.M. reproduces 1PN phenomenology at S2 precision from a compact closed algebra — that is a real consistency success and worth presenting as such. The honest hybrid decomposition in the document found the right mechanism (parametrization, not magnitude). And the framework’s actual point of departure from GR — the (3β² − 2β⁴) advance versus GR’s 2PN structure, an O(β⁴) difference — is a falsifiable prediction; it is simply ~4 orders of magnitude below what S2 data can see.
Recommendations
If this section of the paper is to be kept, the Δχ² ≈ −74 result should be reframed from “RG beats GR” to “two 9-parameter models without systematic terms differ in how they absorb reference-frame errors; with standard systematics both describe the data equally well (|Δχ²| = 0.5 over 156 dof).” To make a genuine empirical case for RG over GR on orbital data one would need: the same standard systematics model applied to both sides; GRAVITY-era interferometric astrometry (30–100 μas, where the published f_SP = 1.10 ± 0.19 already constrains the advance to about 20%); comparison against integrated GR rather than any analytic stand-in; and ultimately regimes where O(β⁴) differences are visible, such as relativistic pulsar binaries, EHT-scale physics, or gravitational-wave inspirals.
Limitations of this verification, for completeness: my systematics model (4 astrometric + 5 RV offset parameters) is a minimal version of published treatments (no correlated-noise model, no per-epoch frame solutions). This cannot rescue the claim, since both models received identical freedom and identical data. I did not re-run the MCMC, because the question is deterministic and all optima were shown stable under multi-start; the document’s own MCMC established the robustness of the no-systematics comparison, which I reproduce.
Files
models.py (faithful re-implementations + hybrids), run_core.py (reproduction + hybrid decomposition), gr_integrated_lib.py / gr_integrated.py (1PN integrator, synthetic-data test, staircase analysis), validate_followups.py (staged: validation, multi-start, profiled-systematics fits, decomposition), verdict_summary.png (scoreboard + where the advantage lives), parametrization_test.png (staircase vs ramp; which analytic form reproduces GR).