Building the Exodus Tracker Index: turning 25 noisy signals into one usable number

The original idea behind Exodus Tracker was small and specific: watch private jets on OpenSky and flag when the herd starts moving in a way that doesn’t match the baseline. The world’s wealthy tend to leave a country a little before the news cycle catches up — bizjet activity is one of the earliest, leakiest crisis signals you can pick up without paying anyone for data.

That idea worked. Then it kept growing. Today the system synthesises 25 separately sourced indicators (aviation, FX, equities, fixed income, credit, prediction markets, news tone, civil unrest, internet outages, AIS shipping, USDT P2P premiums, refugee flows, seismic events, Wikipedia pageviews, plus a physical-hazard layer of humanitarian relief demand, health-security alerts, and climate hazards) into a single composite called the Exodus Tracker Index (ETI). One number, 0 to 100, refreshed every five minutes. On top of that sits a causal model that says which indicators are worth pulling on if you want to actually move the system. Run in reverse, the same model says what the matrix would have to look like to behave a certain way, and whether it holds up against its own history.

This post is the long version of how that math works, and how you wring useful signal out of 25 wildly different data sources without ending up with a dashboard nobody trusts.

Why composite indices usually fail

Most “global risk indices” you’ll see published in the wild fail in one of three predictable ways.

They use raw z-scores everywhere. Z-scores feel rigorous because they involve standard deviations, but they punish exactly the events you most want to capture: tail events. The 4σ FX move that signals an actual currency crisis is the same data point that inflates next month’s rolling mean and standard deviation, so the follow-on moves score lower than they should. Worse, a z-score that fires on a calm-market move of 1.5σ says nothing about whether 1.5σ is a meaningful deviation for this signal: the threshold is calibrated to the volatility of the past month, not the past decade.

They sum apples and oranges with arbitrary weights. A 10% drop in MSCI gets averaged with the number of armed conflicts in ACLED, multiplied by some weight pulled out of someone’s head, and the resulting number drifts. Once it drifts you have to renormalise. Once you renormalise you lose comparability across time.

They silently degrade when sources go offline. Half the inputs vanish during a real crisis (rate limits, outages, scraping bans), the dashboard keeps producing numbers, and nobody tells you that today’s “calm reading” is actually a coverage artefact.

ETI is built to avoid all three of those failures. The trick is to do the per-signal calibration once, by hand, against historical priors — and only do statistical work over things that are already on a comparable [0, 1] scale.

Step 1 — per-signal calibration with piecewise-linear curves

Every one of the 25 signals goes through its own scoring function, all of which return a number in [0, 1]. None of them use z-scores naively. Most are piecewise-linear functions chosen so that the breakpoints match historically meaningful levels for that specific signal.

Concrete example — the VIX scoring curve:

VIX 10 → 0.0
VIX 15 → 0.2     (post-2010 baseline)
VIX 20 → 0.4     (elevated)
VIX 30 → 0.7     (stressed: banking-stress territory)
VIX 40 → 1.0     (severe: March 2020 territory)

The numbers aren’t from a model. They come from looking at what VIX actually did during the events you’d want this index to flag — COVID, the 2023 regional bank stress, August 2024. The curve makes “0.4” mean “elevated like spring 2022” rather than “0.4 standard deviations above last month.”
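A minimal sketch of what such a scoring function could look like, using the VIX breakpoints listed above. The names `Breakpoint`, `VIX_CURVE`, and `scorePiecewise` are hypothetical, and clamping outside the calibrated range is an assumption rather than something the index is known to do.

```typescript
// A breakpoint maps a raw reading to a stress score in [0, 1].
type Breakpoint = { x: number; score: number };

// Breakpoints copied from the VIX curve above; they must be sorted by x.
const VIX_CURVE: Breakpoint[] = [
  { x: 10, score: 0.0 },
  { x: 15, score: 0.2 },
  { x: 20, score: 0.4 },
  { x: 30, score: 0.7 },
  { x: 40, score: 1.0 },
];

// Hypothetical helper: linear interpolation between neighbouring breakpoints.
// Assumption: readings outside the calibrated range clamp to the end scores.
function scorePiecewise(curve: Breakpoint[], value: number): number {
  if (curve.length === 0) throw new Error("curve needs at least one breakpoint");
  const first = curve[0];
  const last = curve[curve.length - 1];
  if (value <= first.x) return first.score;
  if (value >= last.x) return last.score;
  for (let i = 1; i < curve.length; i++) {
    const lo = curve[i - 1];
    const hi = curve[i];
    if (value <= hi.x) {
      const t = (value - lo.x) / (hi.x - lo.x); // position inside the segment
      return lo.score + t * (hi.score - lo.score);
    }
  }
  return last.score; // not reached for a sorted curve
}

// VIX 25 sits halfway between the 20 and 30 breakpoints, so roughly 0.55.
const vix25 = scorePiecewise(VIX_CURVE, 25);
```

Because the curve is plain data, a reviewer who disagrees with a breakpoint can change one row without touching the interpolation logic.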

The same approach is used for everything else:

HY−IG credit spread breakpoints at 200 / 400 / 600 / 900 / 1200 bps — matching historical credit-stress regimes from “calm” to “GFC-territory”
MOVE index at 60 / 100 / 130 / 170 / 220 — calibrated against COVID and 2008
Yield curve (T10Y2Y): inverted = stress (1.0 at −1.0pp), flat = mild (0.55 at 0), steep = calm (0 at +1.0pp)
Energy intraday Δ: +2% / +5% / +10% / +15% — 2% is unusual for crude, 10% is a true shock
USDT P2P premium: 1% / 2% / 5% / 10% / 20% — 1% is bid-ask friction, 20% is the panic-buying premium seen in Argentina and Turkey during currency crises
GDELT crisis-news deviation: +25% / +50% / +100% above 24h baseline
UNHCR refugee outflows YoY: 10% / 25% / 50% / 100% / 200% — Ukraine 2022 hit ~1000% in months
Seismic count M ≥ 6.5 in 24h: 0 / 1 / 2 / 3+ — the qualitative jump from “background” to “active sequence” matters more than absolute count

For the noisier signals (FX, CDS proxy, attention) we do use rolling z-scores, but only as an input to another piecewise curve — the z gets mapped through a curve that knows that anything below ~1.5σ on a noisy signal is just chatter, not a flag.
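As a rough illustration of the z-score-into-a-curve idea, the sketch below computes a rolling z-score and passes it through a dead-zone curve. Only the ~1.5σ dead zone comes from the text; the 4σ saturation point, the use of the absolute value, and the names `rollingZ` and `zToScore` are assumptions made for the example.

```typescript
// Rolling z-score of the latest value against a trailing window of history.
function rollingZ(window: number[], latest: number): number {
  const n = window.length;
  if (n < 2) return 0; // not enough history to say anything
  const mean = window.reduce((sum, v) => sum + v, 0) / n;
  const variance = window.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
  const sd = Math.sqrt(variance);
  return sd === 0 ? 0 : (latest - mean) / sd;
}

// Assumed curve: anything inside the dead zone is chatter and scores 0,
// then the score rises linearly and saturates at 1.
// deadZone = 1.5 is from the text; saturation = 4 is illustrative only.
function zToScore(z: number, deadZone = 1.5, saturation = 4): number {
  const magnitude = Math.abs(z);
  if (magnitude <= deadZone) return 0;
  return Math.min(1, (magnitude - deadZone) / (saturation - deadZone));
}
```

The point of the second function is that a 1σ wobble on a noisy series contributes nothing, while the z-score itself stays available for inspection.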

The point of all this is to make the scoring function legible to a human. A reviewer can look at the VIX curve and disagree with the 30 → 0.7 breakpoint based on their own historical reading; they can’t meaningfully argue with a z-score that produced 0.62 because of a moving average. Calibration is in the open.

Step 2 — composition with renormalisation, with a cap

Once every component returns a 0–1 stress score, ETI is a weighted sum. The 25 weights sum to 1.0:

ETI = 100 × Σᵢ wᵢ · scoreᵢ

Private aviation still carries the heaviest single weight at 11% (the original signal), followed by FX volatility at 9%, then a 6% tier shared by named jets aloft, military aircraft, internet disruptions, and prediction markets. Market fear and crisis news sit at 5%; maritime chokepoints and energy at 4%; then a long tail of 1–3% slots — credit spreads, the CDS proxy, the yield curve, Fed stress, civil unrest, crypto flight, safe-haven demand, travel advisories, attention, seismic, refugee displacement, and the physical-hazard trio (humanitarian relief demand, climate hazards, health security) that the index grew into in 2026. The weights were rebalanced as the index widened from 13 components to 25 — every addition pushes the originals down, which is the right behaviour: no single signal should dominate a composite.

The interesting math happens when sources go offline. The naive fix is renormalisation: if 80% of weight is currently available, divide the running sum by 0.8 so the score still spans 0–100. That works fine for one missing component, and breaks badly when a correlated set of sources goes down — say the OpenSky outage that kills bizjet + military + named jets all at once (33% of weight gone). Naive renorm would then boost the remaining 67% by ~50%, which is methodologically wrong: a quiet reading on one signal shouldn’t be amplified to compensate for a missing signal somewhere else.

The fix is a capped renormaliser:

renormFactor = min(RENORM_CAP, totalWeight / availableWeight)

RENORM_CAP = 1.3, meaning we’ll renormalise up to a 30% boost, then stop. Once coverage falls below 1/1.3 ≈ 77% of total weight, the score degrades gracefully rather than fabricating a coverage-corrected reading. The dashboard surfaces a “coverage capped” badge so the user can tell why the number is what it is.
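A small sketch of the capped renormaliser under stated assumptions: `Component` and `compositeEti` are hypothetical names, an offline source is modelled as a `null` score, and the example weights are illustrative rather than the production table.

```typescript
// null score = the source is offline for this refresh (assumed representation).
type Component = { weight: number; score: number | null };

const RENORM_CAP = 1.3; // value from the text

function compositeEti(components: Component[]): { eti: number; coverage: number; capped: boolean } {
  let totalWeight = 0;
  let availableWeight = 0;
  let weightedSum = 0;
  for (const c of components) {
    totalWeight += c.weight;
    if (c.score !== null) {
      availableWeight += c.weight;
      weightedSum += c.weight * c.score;
    }
  }
  if (availableWeight === 0) return { eti: 0, coverage: 0, capped: true };
  const naiveFactor = totalWeight / availableWeight;
  const renormFactor = Math.min(RENORM_CAP, naiveFactor);
  return {
    eti: 100 * (weightedSum / totalWeight) * renormFactor,
    coverage: availableWeight / totalWeight,
    capped: naiveFactor > RENORM_CAP, // what a "coverage capped" badge could key off
  };
}

// Illustrative outage: 67% of weight reporting a mild 0.3, 33% offline.
const duringOutage = compositeEti([
  { weight: 0.67, score: 0.3 },
  { weight: 0.33, score: null },
]);
```

In this example naive renormalisation would report about 30, while the capped version reports about 26 and marks the reading as capped.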

Step 3 — correlation dampening (the second hard problem)

If you do nothing else, your composite double-counts panic. VIX up + worst-day equity drop + FX volatility spike + MOVE up are all reading the same underlying risk-off event. Treating them as four independent signals inflates the score the moment any single shock fires.

ETI groups correlated components into three explicit clusters:

Market-panic cluster: markets (VIX), FX volatility, bond volatility (MOVE)
Rates & credit cluster: yield curve, sovereign CDS proxy, HY-IG spread, Fed stress
OpenSky aviation cluster: bizjet, military, named jets aloft

Within each cluster, only the strongest member counts at full weight. The second-strongest is dampened to 50%, third+ to 25%. So if FX, VIX, and MOVE all fire at once, the index reads the strongest of the three as a panic signal, the second as half-confirmation, and treats anything past that as noise — instead of triple-counting one event.

Critically, the dampening factor applies to the numerator only — it doesn’t redistribute weight to other components. So damping pulls the score down without secretly inflating unrelated indicators via renormalisation.
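The dampening rule can be sketched as follows. The 1.0 / 0.5 / 0.25 factors are from the text; ranking cluster members by raw score (rather than by weighted contribution) is an assumption, as are the names `Scored` and `dampenedSum` and the example weight for bond volatility.

```typescript
type Scored = { id: string; weight: number; score: number; cluster?: string };

// Multiplier by rank inside a cluster: strongest, second, third and beyond.
const DAMPENING = [1.0, 0.5, 0.25];

// Returns the dampened numerator Σ wᵢ · scoreᵢ · factorᵢ.
// Weights themselves are never changed, so nothing is redistributed.
function dampenedSum(components: Scored[]): number {
  const clusters = new Map<string, Scored[]>();
  let sum = 0;
  for (const c of components) {
    if (c.cluster === undefined) {
      sum += c.weight * c.score; // unclustered components count in full
      continue;
    }
    const members = clusters.get(c.cluster) ?? [];
    members.push(c);
    clusters.set(c.cluster, members);
  }
  clusters.forEach((members) => {
    // Assumption: "strongest" means highest score, not highest weight × score.
    const ranked = members.slice().sort((a, b) => b.score - a.score);
    ranked.forEach((c, rank) => {
      const factor = DAMPENING[Math.min(rank, DAMPENING.length - 1)];
      sum += c.weight * c.score * factor;
    });
  });
  return sum;
}

// One risk-off event firing across the market-panic cluster, plus energy.
const numerator = dampenedSum([
  { id: "markets", weight: 0.05, score: 0.8, cluster: "marketPanic" },
  { id: "fx", weight: 0.09, score: 0.6, cluster: "marketPanic" },
  { id: "bondVol", weight: 0.03, score: 0.4, cluster: "marketPanic" }, // weight is illustrative
  { id: "energy", weight: 0.04, score: 0.5 },
]);
```

Here the undampened sum would be 0.126; with dampening it comes out at 0.09, and the denominator used for renormalisation is untouched.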

Step 4 — score → alert level

A single threshold table maps the composite to a five-level alert:

score < 15  → 1  Calm        (all sidecars quiet)
score < 30  → 2  Elevated    (one or two above baseline)
score < 50  → 3  Pronounced  (multiple above baseline simultaneously)
score < 75  → 4  Severe      (several flashing at once)
score ≥ 75  → 5  Extreme     (this has rarely happened)

These thresholds were picked by backtesting against the 13-component v1 of the index on the past 18 months of data. Level 3+ corresponded to actually-newsworthy stress regimes; Level 5 has never fired in our backtests, which is the right answer for “this has rarely happened.”
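For concreteness, the threshold table can be written as data plus a lookup. This is a sketch using the five rows above; `THRESHOLDS` and `alertFor` are hypothetical names.

```typescript
type AlertLevel = { level: 1 | 2 | 3 | 4 | 5; label: string };

// Upper bounds are exclusive, matching the "score < N" rows in the table.
const THRESHOLDS: Array<{ below: number; alert: AlertLevel }> = [
  { below: 15, alert: { level: 1, label: "Calm" } },
  { below: 30, alert: { level: 2, label: "Elevated" } },
  { below: 50, alert: { level: 3, label: "Pronounced" } },
  { below: 75, alert: { level: 4, label: "Severe" } },
];

function alertFor(score: number): AlertLevel {
  for (const t of THRESHOLDS) {
    if (score < t.below) return t.alert;
  }
  return { level: 5, label: "Extreme" }; // score ≥ 75
}
```

Keeping the table as data means a re-run of the backtest only has to change four numbers.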

How each of the 25 sources actually works

The interesting part of the project is not the math at the top — it’s the source-by-source mechanics underneath, because every single one has its own gotcha. A speed-run:

OpenSky bizjet ADS-B. OAuth2 client_credentials against Keycloak (HTTP Basic was retired in March 2026). GET /states/all returns ~10K state vectors. We classify each one twice — once via looksLikeBizJet() against a callsign-prefix list (NetJets, Flexjet, VistaJet…) combined with a cruise-altitude/speed envelope, once via classifyMilitary() against a curated callsign list (RCH = Reach C-17, SPAR = SAM VIP, NIGHT = E-4B Nightwatch…). One upstream call, two indicators, a third — named-jet watchlist — pulled by icao24. The shared 60s cache means even a high-traffic dashboard hits OpenSky no more than once a minute per process.

FX with crisis-currency coverage. Frankfurter (ECB-backed, 14 currencies) gives us history for z-score baselines. open.er-api.com fills in RUB, ARS, EGP, NGN, PKR — the currencies that actually move in crises but that ECB doesn’t cover. Two providers, one summary, current-only fallback for the long-tail.

Market fear via VIX and equities. CBOE’s own delayed-quotes JSON at cdn.cboe.com — the canonical free source. Stooq’s /q/l/ latest-tick CSV covers the regional equity indices (S&P, KOSPI, Hang Seng, BVSP, SENSEX, etc.). Yahoo Finance is rate-limited but works for the MOVE index, which Stooq doesn’t carry. Twelve Data’s free tier doesn’t include indices — don’t bother.

Polymarket. GraphQL/REST hybrid at gamma-api.polymarket.com?tag_id=2. We weight movers by category: conflict ×2, leadership ×1.5, geopolitics ×1, elections ×0.5. The 5pp threshold for what counts as a “mover” prevents micro-liquidity slippage from generating phantom signal.

GDELT — themes + tone. Every 15 minutes, GDELT publishes a fresh GKG slice (~50K articles globally). We process the file twice: once for the legacy curated LEGACY_CRISIS_THEMES list (kept stable so the long-running deviation series stays comparable across time), once for a tiered V2 taxonomy bucketed into econ_stress, conflict, social, leadership plus six curated GCAM emotional dimensions (negative/anger/fear/optimism/uncertainty/hostility) and the top-5 persons/orgs by mention count. The component score takes the max of theme-deviation and tone-weighted deviation — both spiking is one event, not two.

GDELT Events 2.0 for civil unrest. Replaced ACLED in May 2026 (ACLED API access got expensive and slow; GDELT Events is no-auth and pushes a new CSV every 15 minutes). CAMEO codes 14, 17, 18, 19, 20 filter the firehose to protest/violence/coercion. Weighted Goldstein severity gives us “is this a march or a riot.”

AIS chokepoints. AISStream.io WebSocket. A 45-second sample every 15 minutes is enough to count unique MMSIs inside each of five chokepoint bboxes (Suez, Bab el-Mandeb, Hormuz, Bosphorus, Panama). We close the socket between samples: leaving it open adds no signal and pegs a CPU.

Cloudflare Radar. radar/annotations/outages — the operator-blessed source of truth for which countries are currently dark. Weighted by scope (nation-scale > regional > ASN-scale).

FRED for the entire rates stack. One key, one API, six derived signals: T10Y2Y curve, T10Y3M fallback, US10Y, foreign 10Y per OECD country (the CDS proxy — real CDS quotes are paywalled, but foreign10Y − US10Y correlates well enough on the z-score and is free), BAMLH0A0HYM2 minus BAMLC0A0CM for the HY-IG OAS spread, WALCL + WPC + RPONTSYD + RRPONTSYD + TREAST + MBST for the Fed-stress composite. The Fed-stress sub-composite itself is a 60/40 blend of WALCL z90d and Discount Window z90d.

USGS seismic. The significant-events GeoJSON, filtered to M ≥ 6.5. UNIQUE constraint on event_id means we can re-ingest hourly with no dedup logic. The scoring function is presence-driven (0/1/2/3+ events) with a magnitude bonus, because the qualitative jump from “background” to “active sequence” matters more than absolute frequency.

Crypto capital flight. Yadio (parallel-market USD→fiat in a single batch call) + DefiLlama for stablecoin supply across all chains. The component score is the worst USDT P2P premium across Argentina, Turkey, Nigeria — when residents are willing to pay a 10%+ premium for tether, capital flight is already happening.

UNHCR refugee flows. Annual data, monthly revision sweep, multi-month lag. This is a confirming signal, not a leading one — a country with surging outflows is confirmation that other crisis indicators were right. Scoring is YoY % growth in top origins.

Wikipedia + Google Trends. Two-source attention layer. Wikipedia is the REST API endpoint for daily pageviews on ~21 curated crisis-relevant articles. Google Trends is the unofficial interest-over-time endpoint, polled hourly and cached hard. Google rate-limits aggressively and the parse breaks roughly twice a quarter, so the sidecar degrades to “unavailable” cleanly rather than producing nonsense. The component score is the max z-score across both sources.

US State Dept advisories. Scraped from the official RSS. L4 (“Do Not Travel”) counts ×4 vs L3 ×1.5. The operational signal is L4, because that’s the list of countries the US is preparing to evacuate from. Peacetime baseline is ~12 L4 + ~25 L3 (weighted ~85, i.e. 12 × 4 + 25 × 1.5).

Physical-hazard feeds (humanitarian / health / climate). The three newest signals, all no-auth, share one normalised table — external_signal_observations, keyed by (source, entity_id, timestamp) with a source discriminator that routes rows to the right summariser on read, so a new feed drops in without a schema migration. Humanitarian relief demand comes from ReliefWeb + GDACS disaster alerts; health-security from WHO/CDC outbreak notices; climate hazards from NASA FIRMS active-fire data, NHC tropical-cyclone advisories, and SWPC space-weather alerts. Country name → ISO, ISO → region, and signal-age → severity all run through one shared external-utils helper rather than re-deriving the maps per feed. These are physical-infrastructure risk: slower-moving than the financial signals, but they’re what’s left when the markets have already priced everything in.

The unifying pattern: every sidecar follows a DB-first read pattern. Cron writes to SQLite every five minutes; the page reads from SQLite first, then in-memory cache, then upstream. A request never blocks on an upstream call unless every cache layer missed.
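The read order described above could be sketched like this. Everything here is hypothetical scaffolding: the `Layer` type, the `readDbFirst` name, and the idea that each layer returns `null` on a miss are assumptions, and real layers would wrap SQLite, an in-memory map, and an HTTP client.

```typescript
type Reading = { value: number; fetchedAt: number };

// Each layer resolves to null on a miss (assumed convention).
type Layer = (key: string) => Promise<Reading | null>;

// Try layers in order and stop at the first hit, so the upstream call
// only happens when every cache layer has missed.
async function readDbFirst(
  key: string,
  layers: Array<{ name: string; read: Layer }>,
): Promise<{ reading: Reading | null; source: string }> {
  for (const layer of layers) {
    const reading = await layer.read(key);
    if (reading !== null) return { reading, source: layer.name };
  }
  return { reading: null, source: "unavailable" };
}
```

A caller would pass the layers in the order sqlite, memory, upstream; the returned `source` makes it easy to log how often a request actually falls through.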

Historical analog finder

Once you have a 25-dimensional score vector representing “what the system looks like right now,” the cosine-similarity question writes itself. Given the current vector v, find the past eti_snapshots rows where cos(v, vᵢ) ≥ 0.5 and tᵢ < now − 24h (so the rolling window doesn’t trivially win). Rank descending, return top 3.

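function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // an all-zero vector matches nothing
}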

It’s not prediction. It’s “today most resembles 2026-03-15 (ETI 41, level 3, top contributors: bizjet, FX volatility, prediction markets).” Surfacing the historical analog with its top three contributors lets the user reason from a concrete past pattern rather than from an abstract score.
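Putting the filter, threshold, and ranking together, a sketch of the analog search might look like the following. The 0.5 similarity floor, the 24-hour exclusion, and the top-3 cut are from the text; the `Snapshot` shape and the names `cosine` and `findAnalogs` are assumptions, and the real query runs against the `eti_snapshots` table rather than an in-memory array.

```typescript
type Snapshot = { takenAt: number; eti: number; vector: number[] }; // takenAt in ms since epoch

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

function findAnalogs(
  current: number[],
  history: Snapshot[],
  now: number,
  minSimilarity = 0.5,
  topK = 3,
): Array<{ snapshot: Snapshot; similarity: number }> {
  const cutoff = now - 24 * 60 * 60 * 1000; // ignore the last 24h so "now" cannot match itself
  return history
    .filter((s) => s.takenAt < cutoff)
    .map((s) => ({ snapshot: s, similarity: cosine(current, s.vector) }))
    .filter((m) => m.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}
```

Returning the whole snapshot alongside the similarity is what lets the UI show the analog's date, ETI, and top contributors rather than a bare score.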

Performance: at one snapshot per five minutes the table grows ~290 rows/day, and the naive full-sweep JSON.parse of components_json started to drag, so the “precompute into a dedicated table” idea we’d filed under “later” got built. A typed eti_vectors table (one REAL column per component) lets the sweep read native floats and finish in single-digit milliseconds; cold reads opportunistically backfill it. The columns are append-only. Slot order is load-bearing, because reordering would silently corrupt every historical vector.

Each match now also carries its forward outcomes — what ETI actually did 6h, 24h, 7d, and 30d after that analog — aggregated into a median-plus-range ensemble: “across the closest analogs, ETI moved +X (range a…b) over the next 7 days.” Still not a forecast model; it only reports what those past states actually led to, and longer horizons return null until the archive reaches that far back.
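The median-plus-range ensemble is simple enough to sketch directly. The `Ensemble` type and `forwardEnsemble` name are hypothetical; a `null` delta stands for an analog whose horizon is not yet covered by the archive, and the convention of averaging the two middle values for an even count is an assumption.

```typescript
type Ensemble = { median: number; min: number; max: number; n: number } | null;

// deltas[i] = ETI change after the chosen horizon for analog i,
// or null when the archive does not reach that far.
function forwardEnsemble(deltas: Array<number | null>): Ensemble {
  const known = deltas
    .filter((d): d is number => d !== null)
    .sort((a, b) => a - b);
  if (known.length === 0) return null; // horizon not covered by any analog
  const mid = Math.floor(known.length / 2);
  const median =
    known.length % 2 === 1 ? known[mid] : (known[mid - 1] + known[mid]) / 2;
  return { median, min: known[0], max: known[known.length - 1], n: known.length };
}

// Three analogs, one of which has no 7d outcome yet.
const sevenDay = forwardEnsemble([4, null, -2, 7]);
```

Reporting `n` next to the median keeps a two-analog ensemble from reading as confidently as a twenty-analog one.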

System temperature — is the calm real?

ETI tells you how high the stress is. It doesn’t tell you whether a low reading is trustworthy — and the most dangerous regime is the quiet one where money is already moving but nothing has been priced or talked about yet. July 2007 looked calm on every screen.

System temperature is a sibling sub-index that reads the same 25-component vector ETI does — never mutating it, level-based so it needs no history — and splits the components into four registers:

behavioral — revealed action: where money and people actually move (bizjet flight, crypto flight, shipping, capital)
narrative — what’s being said: crisis-news tone, public attention, prediction-market odds
priced — what markets have already priced: volatility, credit, the curve, the safe-haven bid
exogenous — physical hazards (seismic, climate, humanitarian), excluded from the comparison because nobody “prices” an earthquake

The complacency gap is behavioral − mean(narrative, priced). Positive means action is running ahead of perception — the It’s-All-Good regime, where the repricing hasn’t happened yet. Negative means fear without action — narrative and markets more alarmed than anyone’s actual behaviour, often a fade. The reading collapses to one of four words — calm, complacent, fearful, aligned — and that single word changes how you read everything else on the page: a low ETI that’s complacent is not the same all-clear as a low ETI that’s aligned.
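A sketch of the gap and the four-word reading follows. The gap formula is from the text. The classification thresholds (`gapBand`, `calmLevel`) and the rule separating "calm" from "aligned" are assumptions invented for the example, since the text does not spell them out.

```typescript
// Register levels in [0, 1]; exogenous is left out of the comparison.
type Registers = { behavioral: number; narrative: number; priced: number };
type Word = "calm" | "complacent" | "fearful" | "aligned";

// complacency gap = behavioral − mean(narrative, priced)
function complacencyGap(r: Registers): number {
  return r.behavioral - (r.narrative + r.priced) / 2;
}

// Assumed rule: a gap beyond ±gapBand picks complacent/fearful; otherwise the
// registers agree, and the overall level decides between calm and aligned.
function classify(r: Registers, gapBand = 0.1, calmLevel = 0.2): Word {
  const gap = complacencyGap(r);
  if (gap > gapBand) return "complacent"; // action running ahead of perception
  if (gap < -gapBand) return "fearful"; // perception running ahead of action
  const level = Math.max(r.behavioral, r.narrative, r.priced);
  return level < calmLevel ? "calm" : "aligned";
}
```

Note that the function only reads current levels, which matches the "needs no history" property described above.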

The AEC leverage layer — turning a thermometer into a steering wheel

The composite ETI tells you how stressed the system is. It doesn’t tell you which indicator is worth pulling on if you want to actually change anything. That’s a different question, and it needs different math.

We layered on a port of the Algorithm for Effective Controls (AEC) — originally a Fortran routine (mils.f90 / ml03.f90) from the Cognition system, reimplemented from scratch in TypeScript. The idea: given a signed causal-adjacency matrix M where M[i][j] is the influence from node i to node j, ask “which nodes are the system’s effective levers, and which are the most responsive indicators?”

The math, in five steps:

1.  A = I − d·M               (propagation operator; d = damping factor)
2.  B = Aᵀ·A                  (symmetric positive semi-definite — stable)
3.  x = dominant eigenvector of B (subject to sign constraints)
4.  u = A·x                   (impact vector)
5.  X1[i] = x[i]² / Σ x[j]²   (Effectiveness Index — sums to 1)
    Y1[i] = u[i]² / Σ u[j]²   (Controllability Index — sums to 1)

The translation to plain English:

High X1 (Effectiveness): “this indicator responds strongly to system shocks.” It’s a sensitive thermometer — watch it.
High Y1 (Controllability): “this indicator has high leverage over the rest of the system.” It’s a steering wheel — if you push it, more things move.
Gear ratio ||x||² / ||u||² is the system’s amplification factor.
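The five steps can be sketched with a plain power iteration. This covers only the unconstrained case: the sign constraints, the staged solver, and the production value of the damping factor d are not reproduced here, and `aecIndices` and its helpers are hypothetical names.

```typescript
type Matrix = number[][];

// Matrix-vector product m·v.
const matVec = (m: Matrix, v: number[]): number[] =>
  m.map((row) => row.reduce((sum, a, j) => sum + a * v[j], 0));

// Transpose of a square matrix.
const transpose = (m: Matrix): Matrix => m.map((_, j) => m.map((row) => row[j]));

const sumSquares = (v: number[]): number => v.reduce((s, a) => s + a * a, 0);

function aecIndices(M: Matrix, d: number, iterations = 500) {
  const n = M.length;
  // Step 1: A = I − d·M
  const A = M.map((row, i) => row.map((m, j) => (i === j ? 1 : 0) - d * m));
  const At = transpose(A);
  // Steps 2 and 3: power iteration on B = Aᵀ·A, applied as Aᵀ·(A·x).
  // A uniform start can miss the dominant eigenvector in degenerate cases.
  let x = M.map(() => 1 / Math.sqrt(n));
  for (let k = 0; k < iterations; k++) {
    const next = matVec(At, matVec(A, x));
    const len = Math.sqrt(sumSquares(next));
    if (len === 0) break;
    x = next.map((v) => v / len);
  }
  // Step 4: u = A·x
  const u = matVec(A, x);
  // Step 5: normalised squared components, each summing to 1.
  const sx = sumSquares(x);
  const su = sumSquares(u);
  return {
    X1: x.map((v) => (v * v) / sx),
    Y1: u.map((v) => (v * v) / su),
    gearRatio: sx / su,
  };
}
```

Squaring the components is what makes X1 and Y1 insensitive to the overall sign of the eigenvector, which is why parity tests can compare them directly.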

The constrained-solve step (3) is the interesting one, and we held the port to a hard bar: it had to match the original numerically. Here’s the wrinkle most people get backwards — the reference is single-precision Fortran (REAL), and JavaScript’s number is IEEE-754 double, so reimplementing in TypeScript doesn’t lose precision, it gains it. A committed golden-fixture suite runs the reference numpy implementation across a spread of matrices and asserts our output matches to ~1e-16 — machine epsilon — on the sign-invariant indices. That parity is the licence to not ship a Fortran toolchain into a Next.js app.

The solver itself is staged, fastest-path-first: unconstrained power iteration, then a sign-flip of the eigenvector when that alone resolves the inequality constraints (N+ ≥ 0, N− < 0), then a Gram-Schmidt projection onto any N0 equalities. Genuinely hard mixed-constraint cases fall through to the active-set CG solver — the faithful port of the Fortran — now wired and guarded: its result is adopted only if it’s non-degenerate and actually satisfies the constraints; otherwise we return the best-effort iterate flagged converged: false. An earlier version stubbed that branch entirely; the guard means a hard scenario gets either a real answer or an honest “couldn’t,” never a confident wrong one.

The seed matrix has 86 hand-encoded edges authored from standard macro priors:

Reinhart & Rogoff on sovereign-stress propagation (fx → cds)
Krugman-style currency-crisis transmission (gdelt → fx → markets)
The Helbling-Terrones panic backbone (markets → bondVol → creditSpreads)
Geopolitical → energy (gdelt → energy, military → energy) — Brent reactions to Middle East events
Civil unrest dynamics (acled → radar via state-driven internet shutdowns, acled → displacement → travelAdvisories)

Every edge carries a weight ∈ [-1, +1], a confidence ∈ [0, 1], and a one-line rationale that surfaces on hover in the dashboard.

Eight scenarios as constraint sets

The dashboard offers eight named scenarios: whatDrivesEti, investWhereRecovery, giveWhereImpact, escalationMilitary, escalationEnergy, deEscalationGdelt, civilUnrestSurge, financialContagion, plus a custom slot. Each one compiles to a {constraints?, shock?, iterate?} triple. Different constraint sets produce different X1/Y1 rankings — the math is the same; the question is what you’re asking.

Pure linear cascade (no eigen-solve) is exposed separately as propagate(matrix, shock, iterate), which applies Mᵀ to the shock vector iterate times. That’s what powers the counterfactual playground: drop a unit shock onto any single component, see where it propagates after one, two, three steps. It is instantaneous, numerically stable, and the supported path for what-if scenarios.
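A minimal sketch of such a cascade, assuming the M[i][j] = influence of node i on node j convention defined earlier. The signature mirrors the one named in the text, but the body is an illustration: in particular, returning only the state after the final step (rather than an accumulated total) is an assumption.

```typescript
// Applies Mᵀ to the shock vector `iterate` times.
// (Mᵀ·s)[j] = Σᵢ M[i][j] · s[i], i.e. node j collects inflow from every node i.
function propagate(matrix: number[][], shock: number[], iterate: number): number[] {
  let state = shock.slice();
  for (let step = 0; step < iterate; step++) {
    const next = state.map(() => 0);
    for (let i = 0; i < matrix.length; i++) {
      for (let j = 0; j < matrix[i].length; j++) {
        next[j] += matrix[i][j] * state[i];
      }
    }
    state = next;
  }
  return state;
}

// Toy chain: node 0 → node 1 (0.5), node 1 → node 2 (0.4).
const chain = [
  [0, 0.5, 0],
  [0, 0, 0.4],
  [0, 0, 0],
];
const afterOne = propagate(chain, [1, 0, 0], 1); // shock reaches node 1
const afterTwo = propagate(chain, [1, 0, 0], 2); // shock reaches node 2, attenuated
```

With edge weights inside [-1, +1] and a sparse graph, each step attenuates the shock, which is what keeps the playground numerically well behaved.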

Hand-authoring 86 edges from textbook priors gets you a decent v1. Keeping them current with real-world propagation patterns is a separate problem. Every Monday at 06:00 UTC, a cron fires /api/cron/matrix-refine, which:

Builds context from the current matrix + last 7d of briefings + last 7d of chains
Calls Claude (the newest available Opus, resolved live and weekly-cached) with web_search_20260209 enabled (max 8 uses)
Forces a structured submit_matrix_patch tool call returning JSON: edges_added[], edges_adjusted[], edges_retired[], each requiring ≥1 URL citation
Clamps every adjustment to a sign-locked band — at most ±0.2 and within [0.65×, 1.35×] of the prior magnitude — so the model can shrink or strengthen an edge but never flip its sign or lurch. This is enforced in code, not just requested in the prompt (an earlier version only asked for ≤0.2 and nothing stopped a sign flip); clamps are audit-logged. Patches are also capped at ≤ 8 changes / run.
Writes a new causal_matrices row with promoted=0 (canary mode)
Computes topLevers(old) vs topLevers(new) — if the top-3 Y1 node-IDs are unchanged, auto-promote; otherwise stay in canary for manual review at /admin/matrix

Every edge change lands in matrix_edge_updates with before/after weights, Claude’s reasoning, and the citation URLs. The full audit trail is queryable — you can ask “why did edge X change three weeks ago” and get an answer with citations.
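The clamp in the refinement steps above can be sketched as a pure function. The ±0.2 step limit, the [0.65×, 1.35×] band, and the [-1, +1] weight range are from the text; the name `clampAdjustment`, the way the two limits are intersected, and the `clamped` flag are assumptions.

```typescript
const MAX_STEP = 0.2; // largest absolute change per run
const BAND_LO = 0.65; // lower multiple of the prior magnitude
const BAND_HI = 1.35; // upper multiple of the prior magnitude

// Returns the weight actually written, plus whether the proposal was altered
// (which is what an audit log entry could record).
function clampAdjustment(prior: number, proposed: number): { weight: number; clamped: boolean } {
  const sign = prior < 0 ? -1 : 1;
  const priorMag = Math.abs(prior);
  // Work on magnitudes so the result can never cross zero (sign lock).
  const lo = Math.max(priorMag * BAND_LO, priorMag - MAX_STEP);
  const hi = Math.min(priorMag * BAND_HI, priorMag + MAX_STEP, 1);
  const proposedMag = sign * proposed; // negative if the proposal flipped the sign
  const mag = Math.min(hi, Math.max(lo, proposedMag));
  const weight = sign * mag;
  return { weight, clamped: weight !== proposed };
}
```

For example, a prior of 0.8 with a proposed value of -0.5 comes back as roughly 0.6: the sign flip is refused and the shrink is limited to the 0.2 step.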

Cost: roughly $0.40 per call, ~$2/month at the weekly cadence.

Backtest validator — does the matrix actually predict?

The hardest question about any causal matrix is: does it predict anything? The backtest validator answers that quantitatively. For each historical pair (snapshot at T, snapshot at T+horizon), the matrix predicts a component-delta vector via propagate(M, scores_T, 1). The actual delta is scores_T+h − scores_T. We compute:

Pearson r (predicted vs actual, across all components × all pairs)
Sign-hit-rate — fraction of cells where sign(predicted) == sign(actual). 50% is random; > 65% is meaningful
MAE per cell — raw error magnitude
ETI-score MAE — same but converted to ETI points via the per-component weights

A per-component breakdown ships in the same result — Pearson r, MAE, and sign-hit-rate for each of the 25 nodes, sorted by error — so you can see which edges are pulling their weight and which are noise, not just one number for the whole matrix. (That was on the “would revisit” list in the first version of this post; it got built.)
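The cell-level metrics listed above could be computed along these lines. This is a sketch over two flat arrays of predicted and actual deltas; `scoreCells` is a hypothetical name, and counting a (0, 0) pair as a sign hit is an assumption.

```typescript
type Metrics = { pearsonR: number; signHitRate: number; mae: number };

const sgn = (v: number): number => (v > 0 ? 1 : v < 0 ? -1 : 0);

// predicted[k] and actual[k] are the deltas for one (pair, component) cell.
function scoreCells(predicted: number[], actual: number[]): Metrics {
  const n = Math.min(predicted.length, actual.length);
  if (n === 0) return { pearsonR: 0, signHitRate: 0, mae: 0 };
  let sumP = 0, sumA = 0;
  for (let k = 0; k < n; k++) {
    sumP += predicted[k];
    sumA += actual[k];
  }
  const meanP = sumP / n;
  const meanA = sumA / n;
  let cov = 0, varP = 0, varA = 0, hits = 0, absErr = 0;
  for (let k = 0; k < n; k++) {
    const dp = predicted[k] - meanP;
    const da = actual[k] - meanA;
    cov += dp * da;
    varP += dp * dp;
    varA += da * da;
    if (sgn(predicted[k]) === sgn(actual[k])) hits++; // assumption: 0 vs 0 is a hit
    absErr += Math.abs(predicted[k] - actual[k]);
  }
  const denom = Math.sqrt(varP * varA);
  return {
    pearsonR: denom === 0 ? 0 : cov / denom, // undefined correlation reported as 0
    signHitRate: hits / n,
    mae: absErr / n,
  };
}
```

Running the same function once over all cells and once per component gives both the headline numbers and the per-node breakdown.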

Surfaced as a [Backtest] button per matrix row in the admin log. The query params (horizonHours, lookbackDays, maxPairs, toleranceMinutes) make it trivial to compare v23 vs v22 on identical windows — that’s the actual “is v23 better than v22” answer, not a vibe check.

Running the model in reverse — design, and validation against data

The AEC layer runs in one direction: matrix → structure. Feed it the causal matrix, get back which nodes are levers and which are thermometers. Two questions it can’t answer that way turned out to be worth building separately.

The first is design — “what would the matrix have to look like for indicator X to be the dominant lever?” Targets in, matrix out. The second is trust — “forget the textbook priors: what does our own history say the edges should be?” That’s fitting the matrix to data.

Both share one idea that keeps them honest: a sign-locked bound. Any reconstructed or fitted edge is held to [0.65×, 1.35×] of its prior magnitude and can never flip sign. An edge the textbook says is fx → cds, positive stays positive; the math can only argue about how positive, within a band. That single constraint is what lets you run aggressive optimisation against a load-bearing model without it driving off a cliff.

IMRA — the inverse of AEC

IMRA (Inverse Matrix Reconstruction) takes target response and impact vectors and reconstructs a matrix that produces them, inside the band. Mechanically it’s a per-row, box-constrained, regularised least-squares — strongly convex, solved by projected gradient, so it converges to a unique optimum.

The reference we ported it from — the same Fortran lineage as AEC — had two bugs we only found by holding the port to that machine-precision bar:

The bounds were transposed. Each element z[i][j] was bounded by a band derived from M[j][i] — the wrong edge. On any non-symmetric matrix (i.e. every real one) that makes the true solution infeasible, which is why the reference solver never converged. We bound each element by its own prior.
The solver had no iteration cap and flip-flopped between active sets forever. The first thing our fixture run did was hang. We swapped the heuristic active-set for the projected-gradient box-QP above; for self-consistent targets it now recovers the original matrix exactly — residual 0, in-band, no sign flips.

The lesson generalises: porting old numerical code is a correctness exercise, not a transcription one. “Match the reference to 1e-16, with tests” is precisely the discipline that surfaces where the reference was quietly wrong.

DCML — distance to a regime

DCML answers “how few bounded, sign-locked edge changes would make a chosen indicator the dominant responder?” — a way to ask how far the matrix sits from a regime where, say, prediction markets lead, and to see exactly which edges would have to move.

The original (Fortran ml04) is a stochastic learning loop that random-inits the matrix each iteration. We didn’t want stochastic (an admin tool should give the same answer twice), so we kept the goal and the bound model but optimised the objective directly: greedy coordinate descent on the dominance gap over the sign-locked edge box. Monotonic (the gap can’t increase), deterministic, reproducible. If the band can’t get there, it says so and reports the residual gap, which is itself the “distance” answer.

Against the live matrix it’s concrete: target prediction markets, and it reports that within the band the regime is reachable — the dominance gap goes from +0.39 to −0.002, the node becomes the top responder, and it lists the 69 edges that move to get there.

Validating the matrix against its own history

This is the one that pays for the rest. The matrix is seeded from textbook macro priors and nudged weekly by Claude reading the news — both qualitative sources. Neither has ever checked whether the system’s own recorded behaviour agrees.

The validator fits the matrix that best reproduces the observed one-step cascades in the snapshot history, bounded to the sign-locked band of the live matrix, then diffs the two and flags the divergent edges.

The modelling decision here is the kind of thing that’s easy to get subtly wrong. It would be tempting to reuse IMRA’s eigenstructure machinery — but IMRA’s operator (A = I − d·M, applied row-wise) is not the operator the system runs forward. The forward cascade — what propagate computes, what the backtest scores — is the column-wise inflow Δs ≈ Mᵀ·s. Fitting anything else would be optimising a model we don’t use. So the validator fits that: minimise ‖Δs − Mᵀ·s‖² over the band, which decomposes per output column into n small box-constrained ridge problems sharing one Gram matrix. Fast — and it reuses exactly the parts of IMRA that were right (the bound model, the box solver) on the objective that’s actually ours.
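To make the per-column decomposition concrete, here is a sketch of a box-constrained ridge fit solved by projected gradient. It assumes the objective ‖Δs − Mᵀ·s‖² plus a ridge pull toward the prior, and bounds every edge to [0.65×, 1.35×] of its prior. The names `band` and `fitToHistory`, the ridge default, the step-size bound, and the fixed iteration count are all assumptions, not the production solver.

```typescript
type Matrix = number[][];

// Sign-locked band for one edge. A zero prior stays exactly zero.
function band(prior: number): [number, number] {
  const a = 0.65 * prior;
  const b = 1.35 * prior;
  return a <= b ? [a, b] : [b, a];
}

// states[t] = score vector at time t, deltas[t] = observed change by t + horizon.
function fitToHistory(prior: Matrix, states: number[][], deltas: number[][],
                      ridge = 0.01, iterations = 2000): Matrix {
  const n = prior.length;
  const fitted = prior.map((row) => row.slice());
  // Gram matrix G = Σₜ sₜ·sₜᵀ, shared by every column problem.
  const G = prior.map((_, i) =>
    prior.map((_, k) => states.reduce((acc, s) => acc + s[i] * s[k], 0)));
  const trace = G.reduce((acc, row, i) => acc + row[i], 0);
  if (trace + ridge === 0) return fitted; // no data and no ridge: nothing to fit
  const step = 1 / (2 * (trace + ridge)); // safe step, since trace(G) ≥ λmax(G)
  for (let j = 0; j < n; j++) {
    // c = Σₜ sₜ · Δsₜ[j]: right-hand side for the inflows into node j.
    const c = prior.map((_, i) =>
      states.reduce((acc, s, t) => acc + s[i] * deltas[t][j], 0));
    let m = prior.map((row) => row[j]); // column j of M
    for (let k = 0; k < iterations; k++) {
      const prev = m;
      m = prev.map((mi, i) => {
        const gram = G[i].reduce((acc, g, q) => acc + g * prev[q], 0);
        const grad = 2 * (gram - c[i]) + 2 * ridge * (mi - prior[i][j]);
        const [lo, hi] = band(prior[i][j]);
        return Math.min(hi, Math.max(lo, mi - step * grad)); // project onto the box
      });
    }
    m.forEach((mi, i) => { fitted[i][j] = mi; });
  }
  return fitted;
}
```

An edge whose fitted value lands exactly on `lo` or `hi` is one where the data would have moved further than the band allows, which is the signal worth surfacing to a reviewer.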

Run against production history it’s pointed: the data consistently wants to shrink several of the strongest hand-coded edges (military → crisis news, fx → crypto flow, military → energy), cutting in-sample magnitude error by ~30%. Those edges pin at the lower bound of the band, meaning the data would shrink them further if the band allowed it. The honest reading isn’t “those edges are wrong”; it’s “the hand-authored matrix runs a bit hot, and here’s the evidence.”

And the caveat is baked into the tool: that fit is in-sample. Shrinking weights to fit the data you already have always lowers in-sample error — it doesn’t prove the new weights predict better. So the validator is a second opinion that feeds the weekly refinement and a human, not an autopilot. The next honest step is a train/test split — fit on one window, score on a held-out one — before any shrinkage earns its way into the live matrix. The model is allowed to learn from its own history; it isn’t allowed to overfit to it without showing its work.

The dashboard view

All of that math ends up on one page. The ETI gauge at the top, regional sub-scores, the trend chart, historical analogs, leverage points, the live causal graph.

The leverage panel is the part that gets used most. “Military aircraft is the strongest current lever (Y1 = 0.585) — pushing that node moves the rest of the system the most. Travel advisories is the most responsive indicator (X1 = 0.233) — when something happens, that’s where the shock will surface first.” That’s a different question than “how stressed is the system right now,” and it’s the question you actually need answered if you’re trying to act on the index, not just watch it.

The things this design got right

Looking back at twelve months of iteration:

Hand-calibrated per-signal curves beat z-scores everywhere. Every time we tried to replace one with a “rigorous” rolling statistical method, the breakpoints stopped corresponding to anything historically meaningful, and the index drifted. The calibration is the model; pretending it’s purely statistical buys nothing.
Correlation dampening is non-optional. Without it, a single risk-off event triple-counts and the score saturates immediately, breaking the alert-level mapping. With it, the index actually distinguishes a panic event from a multi-source crisis.
DB-first read pattern with five-minute crons is the right cadence. Faster reads don’t add signal; slower reads miss the cascade. Crucially, page-level reads never hit upstreams — the cron is the only thing that does, so traffic spikes don’t take out the data sources.
Capped renormalisation beats both naive renorm and no-renorm. Both alternatives produce systematically wrong numbers under correlated outages. The cap is ugly, principled, and the right answer.
Surfacing the analog matters more than the score. A 31/100 reading is abstract. “Today most resembles 2026-03-15 — top contributors then were military aircraft, maritime chokepoints, prediction markets” is concrete. Users reason about the analog; the score is just a way of finding it.
Auditable causal-matrix refinement. Letting Claude touch the matrix is interesting; making every change land in a queryable audit table with before/after weights and citation URLs — inside hard, sign-locked caps — is what makes it trustworthy. The matrix is a living model; the audit log is the version history.
A reimplemented numerical core, held to machine-precision parity, paid for itself. Re-porting the AEC math into TypeScript behind a golden-fixture suite (match the reference to ~1e-16, in CI) wasn’t just defensive housekeeping — it’s what caught the two latent bugs in the inverse solver, and it’s what let us later run the model in reverse and against data with any confidence. The precision bar is what made the new math trustworthy, not just the old.

What we’d revisit

The first version of this post ended with two things to revisit — precomputing the analog vectors, and per-component backtest scoring. Both got built; they’re described above. That’s the honest way to keep a section like this alive: ship the items, then replace them. Two new ones have taken their place:

Out-of-sample backtesting for the data validator. Today it fits and scores on the same history window, so its “the matrix runs hot” finding is in-sample — suggestive, not conclusive. The fix is a train/test split: fit the bounded matrix on an early window, score it on a held-out later one, and only trust a shrinkage that survives the holdout. It’s the obvious gate before any data-derived edge change reaches the live matrix.
Per-edge vs. global-scale calibration. The validator’s divergences all pinned at the band floor uniformly, which smells less like “these specific edges are wrong” and more like “the whole matrix is scaled a touch hot” (consistent with the modest Pearson correlation in the backtest). The right response might be a single global damping term rather than per-edge surgery: a different, safer knob. We don’t know yet; that’s what the out-of-sample work is for.
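Both items can be sketched together. The snippet below is a hedged illustration, not something the system runs today: it fits a single global damping factor alpha on an early window by least squares (minimising ‖Δs − alpha·Mᵀ·s‖², which has the closed form alpha = Σ⟨Mᵀs, Δs⟩ / Σ⟨Mᵀs, Mᵀs⟩) and then scores it on a held-out later window. The names Sample, predictDelta, fitGlobalScale and holdoutCheck, and the 70/30 split, are assumptions made for the example.

```typescript
type Vec = number[];
type Mat = number[][];
interface Sample { s: Vec; ds: Vec } // state and observed change at one time step

// Column-wise inflow prediction: (M^T s)_j = sum_i M[i][j] * s[i].
function predictDelta(M: Mat, s: Vec): Vec {
  return M[0].map((_, j) => s.reduce((acc, si, i) => acc + M[i][j] * si, 0));
}

// Closed-form least-squares fit of one scalar alpha in ds ~ alpha * M^T s.
function fitGlobalScale(M: Mat, train: Sample[]): number {
  let num = 0;
  let den = 0;
  for (const { s, ds } of train) {
    predictDelta(M, s).forEach((pj, j) => {
      num += pj * ds[j];
      den += pj * pj;
    });
  }
  return den === 0 ? 1 : num / den; // no usable signal: leave the matrix unscaled
}

// Sum of squared errors of the alpha-scaled prediction on a data window.
function sse(M: Mat, alpha: number, data: Sample[]): number {
  let total = 0;
  for (const { s, ds } of data) {
    predictDelta(M, s).forEach((pj, j) => {
      total += (ds[j] - alpha * pj) ** 2;
    });
  }
  return total;
}

// Chronological split: fit on the early window, score on the later one.
// history must already be ordered oldest to newest; it is never shuffled.
function holdoutCheck(M: Mat, history: Sample[], trainFraction = 0.7) {
  const cut = Math.floor(history.length * trainFraction);
  const train = history.slice(0, cut);
  const test = history.slice(cut);
  const alpha = fitGlobalScale(M, train);
  return {
    alpha,
    baselineSse: sse(M, 1, test),   // the live matrix as-is
    scaledSse: sse(M, alpha, test), // the damped matrix on unseen data
  };
}
```

Under this sketch, a global damping term would only earn a place in the live matrix if scaledSse comes in clearly below baselineSse on the held-out window. An alpha fitted and scored on the same window proves nothing new.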

Why this architecture works

The pattern that made everything fit together is the same one that made the original bizjet pipeline work: treat the index as a layered problem rather than a single composite calculation.

The bottom layer is 25 sidecars, each one a self-contained module with its own upstream, its own cache, its own DB-first read path, and its own degraded-mode behaviour. They don’t know about each other. They don’t know about ETI. They each return a *Summary shape that says “here’s the latest reading, here’s the source confidence, here’s whether the upstream is alive.”

The middle layer is the ETI composition function — pure, deterministic, takes the 25 summaries and returns a score. No I/O, no state, fully unit-testable. It’s the only place where weights and thresholds live, so changing the methodology means changing one file.

The top layer is the AEC math, which doesn’t know how the underlying scores were computed — it just consumes the score vector and the causal matrix and returns leverage rankings. Same purity, same testability.

That separation is what lets the project be both ambitious and stable. Adding a 26th sidecar means writing one module and adding one line to the weights table. Adjusting the methodology means changing the composition function and rerunning the analog finder. Tightening the causal model means changing the matrix and re-backtesting. The pieces compose, the layers don’t leak, and the dashboard can keep telling you something useful even when half the data sources are angry.
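As a rough sketch of what that layering looks like in types, the snippet below models the bottom and middle layers and gives the top layer only as a signature. Everything here is illustrative: the SidecarSummary fields, the weights, the composeIndex name and especially the maxScaleUp cap of 1.5 are hypothetical. The real composition also applies the calibration curves and correlation dampening described earlier, which this stand-in leaves out.

```typescript
// Bottom layer: every sidecar returns the same small shape and knows nothing else.
interface SidecarSummary {
  id: string;
  score: number | null;   // calibrated 0..100 reading, null when there is no reading
  confidence: number;     // 0..1 source confidence (unused in this sketch)
  upstreamAlive: boolean; // false when the sidecar is in degraded mode
}

// Middle layer: the one weights table. A new sidecar is one new line here.
// These ids and values are made up for the example.
const WEIGHTS: Record<string, number> = {
  aviation: 0.5,
  fx: 0.5,
};

// Pure composition: no I/O, no state. Missing sidecars are renormalised away,
// but the scale-up is capped so a mass outage cannot inflate the survivors.
function composeIndex(
  summaries: SidecarSummary[],
  weights: Record<string, number>,
  maxScaleUp = 1.5 // hypothetical cap, not the production value
): number {
  let weighted = 0;
  let liveWeight = 0;
  let totalWeight = 0;
  for (const [id, w] of Object.entries(weights)) {
    totalWeight += w;
    const s = summaries.find(x => x.id === id);
    if (s && s.upstreamAlive && s.score !== null) {
      weighted += w * s.score;
      liveWeight += w;
    }
  }
  if (liveWeight === 0 || totalWeight === 0) return 0;
  const scale = Math.min(totalWeight / liveWeight, maxScaleUp);
  return Math.min(100, Math.max(0, (weighted / totalWeight) * scale));
}

// Top layer, signature only: it sees a score vector and a causal matrix,
// never the sidecars or the weights.
type LeverageFn = (scores: number[], causalMatrix: number[][]) => number[];
```

Because composeIndex is a pure function of its arguments, a unit test can feed it hand-built summaries, including dead upstreams, without touching a database or a network.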

That’s the whole game with a project like this: each piece does one thing, the math layer above is honest about what it knows, and the user sees one number on top of all of it that they can actually trust.