The original idea behind Exodus Tracker was small and specific: watch private jets on OpenSky and flag when the herd starts moving in a way that doesn’t match the baseline. The world’s wealthy tend to leave a country a little before the news cycle catches up — bizjet activity is one of the earliest, leakiest crisis signals you can pick up without paying anyone for data.
That idea worked. Then it kept growing. Today the system synthesises 22 independently sourced indicators (aviation, FX, equities, fixed income, credit, prediction markets, news tone, civil unrest, internet outages, AIS shipping, USDT P2P premiums, refugee flows, seismic events, Wikipedia pageviews) into a single composite called the Exodus Tracker Index (ETI). One number, 0 to 100, refreshed every five minutes. On top of that sits a causal model that says which indicators are worth pulling on if you want to actually move the system, not just watch it.
This post is the long version of how that math works, and how you wring useful signal out of 22 wildly different data sources without ending up with a dashboard nobody trusts.
Why composite indices usually fail
Most “global risk indices” you’ll see published in the wild fail in one of three predictable ways.
They use raw z-scores everywhere. Z-scores feel rigorous because they involve standard deviations, but they punish exactly the events you most want to capture: tail events. The 4σ FX move that signals an actual currency crisis is the same data point that drags the rolling mean up next month and breaks the next z-score’s interpretability. Worse, a z-score that fires on a calm-market move of 1.5σ doesn’t tell you anything about whether 1.5σ is a meaningful deviation in this signal — the historical realism of the threshold is baked into the volatility of the past month, not the past decade.
They sum apples and oranges with arbitrary weights. A 10% drop in MSCI gets averaged with the number of armed conflicts in ACLED, multiplied by some weight pulled out of someone’s head, and the resulting number drifts. Once it drifts you have to renormalise. Once you renormalise you lose comparability across time.
They silently degrade when sources go offline. Half the inputs vanish during a real crisis (rate limits, outages, scraping bans), the dashboard keeps producing numbers, and nobody tells you that today’s “calm reading” is actually a coverage artefact.
ETI is built to avoid all three of those failures. The trick is to do the per-signal calibration once, by hand, against historical priors — and only do statistical work over things that are already on a comparable [0, 1] scale.
Step 1 — per-signal calibration with piecewise-linear curves
Every one of the 22 signals goes through its own scoring function, all of which return a number in [0, 1]. None of them use z-scores naively. Most are piecewise-linear functions chosen so that the breakpoints match historically meaningful levels for that specific signal.
Concrete example — the VIX scoring curve:
VIX 10 → 0.0
VIX 15 → 0.2 (post-2010 baseline)
VIX 20 → 0.4 (elevated)
VIX 30 → 0.7 (stressed — March 2020 territory)
VIX 40 → 1.0 (severe — banking-stress territory)
The numbers aren’t from a model. They come from looking at what VIX actually did during the events you’d want this index to flag — COVID, the 2023 regional bank stress, August 2024. The curve makes “0.4” mean “elevated like spring 2022” rather than “0.4 standard deviations above last month.”
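A minimal sketch of how a curve like this can be evaluated, assuming straight-line interpolation between neighbouring breakpoints and clamping outside the first and last one. The names piecewiseScore, Breakpoint and VIX_CURVE are illustrative rather than the project's actual identifiers; only the breakpoint values come from the table above.

```typescript
// One breakpoint: [input level, stress score]. Curves are sorted by level.
type Breakpoint = readonly [level: number, score: number];

// VIX breakpoints copied from the table above.
const VIX_CURVE: readonly Breakpoint[] = [
  [10, 0.0],
  [15, 0.2],
  [20, 0.4],
  [30, 0.7],
  [40, 1.0],
];

// Linear interpolation between neighbouring breakpoints.
// Assumption: values outside the curve clamp to the first / last score.
function piecewiseScore(value: number, curve: readonly Breakpoint[]): number {
  const first = curve[0];
  const last = curve[curve.length - 1];
  if (value <= first[0]) return first[1];
  if (value >= last[0]) return last[1];
  for (let i = 1; i < curve.length; i++) {
    const [x0, y0] = curve[i - 1];
    const [x1, y1] = curve[i];
    if (value <= x1) return y0 + ((value - x0) / (x1 - x0)) * (y1 - y0);
  }
  return last[1]; // not reached for a sorted curve
}
```

With these breakpoints a VIX of 25 sits halfway between the 20 and 30 breakpoints, so it scores halfway between 0.4 and 0.7, which is 0.55.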
The same approach is used for everything else:
- HY−IG credit spread breakpoints at 200 / 400 / 600 / 900 / 1200 bps — matching historical credit-stress regimes from “calm” to “GFC-territory”
- MOVE index at 60 / 100 / 130 / 170 / 220 — calibrated against COVID and 2008
- Yield curve (T10Y2Y): inverted = stress (1.0 at −1.0pp), flat = mild (0.55 at 0), steep = calm (0 at +1.0pp)
- Energy intraday Δ: +2% / +5% / +10% / +15% — 2% is unusual for crude, 10% is a true shock
- USDT P2P premium: 1% / 2% / 5% / 10% / 20% — 1% is bid-ask friction, 20% is the panic-buying premium seen in Argentina and Turkey during currency crises
- GDELT crisis-news deviation: +25% / +50% / +100% above 24h baseline
- UNHCR refugee outflows YoY: 10% / 25% / 50% / 100% / 200% — Ukraine 2022 hit ~1000% in months
- Seismic count M ≥ 6.5 in 24h: 0 / 1 / 2 / 3+ — the qualitative jump from “background” to “active sequence” matters more than absolute count
For the noisier signals (FX, CDS proxy, attention) we do use rolling z-scores, but only as an input to another piecewise curve — the z gets mapped through a curve that knows that anything below ~1.5σ on a noisy signal is just chatter, not a flag.
The point of all this is to make the scoring function legible to a human. A reviewer can look at the VIX curve and disagree with the 30 → 0.7 breakpoint based on their own historical reading; they can’t meaningfully argue with a z-score that produced 0.62 because of a moving average. Calibration is in the open.
Step 2 — composition with renormalisation, with a cap
Once every component returns a 0–1 stress score, ETI is a weighted sum. The 22 weights sum to 1.0:
ETI = 100 × Σᵢ wᵢ · scoreᵢ
Private aviation gets the heaviest single weight at 13% (the original signal), followed by FX volatility at 10%, internet disruptions at 7%, prediction markets at 7%, named jets aloft at 7%, military aircraft at 6%, market fear at 5%, crisis news at 5%, then a long tail of 2–4% slots for travel advisories, energy, attention, seismic, and the rest.
The interesting math happens when sources go offline. The naive fix is renormalisation: if 80% of weight is currently available, divide the running sum by 0.8 so the score still spans 0–100. That works fine for one missing component, and breaks badly when a correlated set of sources goes down: say an OpenSky outage that kills bizjet + military + named jets all at once (13% + 6% + 7% = 26% of weight gone). Naive renorm would then boost the remaining 74% by ~35%, which is methodologically wrong: a quiet reading on one signal shouldn’t be amplified to compensate for a missing signal somewhere else.
The fix is a capped renormaliser:
renormFactor = min(RENORM_CAP, totalWeight / availableWeight)
RENORM_CAP = 1.3 — meaning we’ll renormalise up to a 30% boost, then stop. Below that coverage threshold (~77%), the score degrades gracefully rather than fabricating a coverage-corrected reading. The dashboard surfaces a “coverage capped” badge so the user can tell why the number is what it is.
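The capped renormaliser is small enough to sketch end to end. This is an illustration under assumptions, not the production composition function: composeEti and ComponentReading are hypothetical names, an offline source is represented as a null score, and returning 0 when nothing at all is available is a choice made here for the sketch.

```typescript
const RENORM_CAP = 1.3; // value quoted in the text

interface ComponentReading {
  weight: number;       // share of the composite; all weights sum to 1.0
  score: number | null; // 0..1 stress score, null when the source is offline
}

function composeEti(components: ComponentReading[]): { eti: number; capped: boolean } {
  let totalWeight = 0;
  let availableWeight = 0;
  let weightedSum = 0;
  for (const c of components) {
    totalWeight += c.weight;
    if (c.score === null) continue; // offline: contributes no weight and no score
    availableWeight += c.weight;
    weightedSum += c.weight * c.score;
  }
  // Assumption: with no live sources there is nothing to renormalise.
  if (availableWeight === 0) return { eti: 0, capped: false };

  const naiveFactor = totalWeight / availableWeight;
  const renormFactor = Math.min(RENORM_CAP, naiveFactor);
  return {
    eti: 100 * weightedSum * renormFactor,
    capped: naiveFactor > RENORM_CAP, // drives the "coverage capped" badge
  };
}
```

With only 60% of weight reporting, the uncapped factor would be 1 / 0.6 ≈ 1.67; the cap holds it at 1.3 and the result is flagged as capped.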
Step 3 — correlation dampening (the second hard problem)
If you do nothing else, your composite double-counts panic. VIX up + worst-day equity drop + FX volatility spike + MOVE up are all reading the same underlying risk-off event. Treating them as four independent signals inflates the score the moment any single shock fires.
ETI groups correlated components into three explicit clusters:
- Market-panic cluster: markets (VIX), FX volatility, bond volatility (MOVE)
- Rates & credit cluster: yield curve, sovereign CDS proxy, HY-IG spread, Fed stress
- OpenSky aviation cluster: bizjet, military, named jets aloft
Within each cluster, only the strongest member counts at full weight. The second-strongest is dampened to 50%, third+ to 25%. So if FX, VIX, and MOVE all fire at once, the index reads the strongest of the three as a panic signal, the second as half-confirmation, and treats anything past that as noise — instead of triple-counting one event.
Critically, the dampening factor applies to the numerator only — it doesn’t redistribute weight to other components. So damping pulls the score down without secretly inflating unrelated indicators via renormalisation.
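A sketch of the within-cluster dampening, assuming "strongest" means the highest 0–1 score in the cluster. dampenCluster and ClusterMember are hypothetical names. The function only returns the dampened numerator contributions and never touches the weights, which is the numerator-only behaviour described above.

```typescript
// Multipliers by rank inside a cluster: strongest, second, third and beyond.
const DAMPENING = [1.0, 0.5, 0.25];

interface ClusterMember {
  id: string;
  weight: number; // share of the composite
  score: number;  // 0..1 stress score
}

// Returns each member's contribution to the numerator after dampening.
// Weights are left untouched, so nothing is redistributed to other components.
function dampenCluster(members: ClusterMember[]): Record<string, number> {
  // Assumption: rank by score, highest first.
  const ranked = members.slice().sort((a, b) => b.score - a.score);
  const contributions: Record<string, number> = {};
  ranked.forEach((m, rank) => {
    const factor = DAMPENING[Math.min(rank, DAMPENING.length - 1)];
    contributions[m.id] = m.weight * m.score * factor;
  });
  return contributions;
}
```

If VIX, FX and MOVE all fire, the highest-scoring one contributes weight × score in full, the second contributes half of that product, and the third a quarter.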
Step 4 — score → alert level
A single threshold table maps the composite to a five-level alert:
score < 15 → 1 Calm (all sidecars quiet)
score < 30 → 2 Elevated (one or two above baseline)
score < 50 → 3 Pronounced (multiple above baseline simultaneously)
score < 75 → 4 Severe (several flashing at once)
score ≥ 75 → 5 Extreme (this has rarely happened)
These thresholds were picked by backtesting against the 13-component v1 of the index on the past 18 months of data. Level 3+ corresponded to actually-newsworthy stress regimes; Level 5 has never fired in our backtests, which is the right answer for “this has rarely happened.”
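The threshold table translates directly into a lookup. The sketch below uses the thresholds and labels from the table; ALERT_TABLE and alertLevel are illustrative names, and rejecting non-finite scores is an assumption added so that a NaN never silently maps to level 5.

```typescript
// Upper bounds are exclusive, matching "score < 15", "score < 30", and so on.
const ALERT_TABLE = [
  { below: 15, level: 1, label: "Calm" },
  { below: 30, level: 2, label: "Elevated" },
  { below: 50, level: 3, label: "Pronounced" },
  { below: 75, level: 4, label: "Severe" },
  { below: Infinity, level: 5, label: "Extreme" },
];

function alertLevel(score: number): { level: number; label: string } {
  // Assumption: a non-finite composite is a bug upstream, not an alert.
  if (!isFinite(score)) throw new RangeError("score must be a finite number");
  for (const row of ALERT_TABLE) {
    if (score < row.below) return { level: row.level, label: row.label };
  }
  const last = ALERT_TABLE[ALERT_TABLE.length - 1];
  return { level: last.level, label: last.label };
}
```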
How each of the 22 sources actually works
The interesting part of the project is not the math at the top — it’s the source-by-source mechanics underneath, because every single one has its own gotcha. A speed-run:
OpenSky bizjet ADS-B. OAuth2 client_credentials against Keycloak (HTTP Basic was retired in March 2026). GET /states/all returns ~10K state vectors. We classify each one twice — once via looksLikeBizJet() against a callsign-prefix list (NetJets, Flexjet, VistaJet…) combined with a cruise-altitude/speed envelope, once via classifyMilitary() against a curated callsign list (RCH = Reach C-17, SPAR = SAM VIP, NIGHT = E-4B Nightwatch…). One upstream call, two indicators, a third — named-jet watchlist — pulled by icao24. The shared 60s cache means even a high-traffic dashboard hits OpenSky no more than once a minute per process.
FX with crisis-currency coverage. Frankfurter (ECB-backed, 14 currencies) gives us history for z-score baselines. open.er-api.com fills in RUB, ARS, EGP, NGN, PKR — the currencies that actually move in crises but that ECB doesn’t cover. Two providers, one summary, current-only fallback for the long-tail.
Market fear via VIX and equities. CBOE’s own delayed-quotes JSON at cdn.cboe.com — the canonical free source. Stooq’s /q/l/ latest-tick CSV covers the regional equity indices (S&P, KOSPI, Hang Seng, BVSP, SENSEX, etc.). Yahoo Finance is rate-limited but works for the MOVE index, which Stooq doesn’t carry. Twelve Data’s free tier doesn’t include indices — don’t bother.
Polymarket. GraphQL/REST hybrid at gamma-api.polymarket.com?tag_id=2. We weight movers by category: conflict ×2, leadership ×1.5, geopolitics ×1, elections ×0.5. The 5pp threshold for what counts as a “mover” prevents micro-liquidity slippage from generating phantom signal.
GDELT — themes + tone. Every 15 minutes, GDELT publishes a fresh GKG slice (~50K articles globally). We process the file twice: once for the legacy curated LEGACY_CRISIS_THEMES list (kept stable so the long-running deviation series stays comparable across time), once for a tiered V2 taxonomy bucketed into econ_stress, conflict, social, leadership plus six curated GCAM emotional dimensions (negative/anger/fear/optimism/uncertainty/hostility) and the top-5 persons/orgs by mention count. The component score takes the max of theme-deviation and tone-weighted deviation — both spiking is one event, not two.
GDELT Events 2.0 for civil unrest. Replaced ACLED in May 2026 (ACLED API access got expensive and slow; GDELT Events is no-auth and pushes a new CSV every 15 minutes). CAMEO codes 14, 17, 18, 19, 20 filter the firehose to protest/violence/coercion. Weighted Goldstein severity gives us “is this a march or a riot.”
AIS chokepoints. AISStream.io WebSocket. A 45-second sample every 15 minutes is enough to count unique MMSIs inside each of five chokepoint bboxes (Suez, Bab el-Mandeb, Hormuz, Bosphorus, Panama). We close the socket between samples: leaving it open adds no signal and pegs a CPU.
Cloudflare Radar. radar/annotations/outages — the operator-blessed source of truth for which countries are currently dark. Weighted by scope (nation-scale > regional > ASN-scale).
FRED for the entire rates stack. One key, one API, six derived signals: T10Y2Y curve, T10Y3M fallback, US10Y, foreign 10Y per OECD country (the CDS proxy — real CDS quotes are paywalled, but foreign10Y − US10Y correlates well enough on the z-score and is free), BAMLH0A0HYM2 minus BAMLC0A0CM for the HY-IG OAS spread, WALCL + WPC + RPONTSYD + RRPONTSYD + TREAST + MBST for the Fed-stress composite. The Fed-stress sub-composite itself is a 60/40 blend of WALCL z90d and Discount Window z90d.
USGS seismic. The significant-events GeoJSON, filtered to M ≥ 6.5. UNIQUE constraint on event_id means we can re-ingest hourly with no dedup logic. The scoring function is presence-driven (0/1/2/3+ events) with a magnitude bonus, because the qualitative jump from “background” to “active sequence” matters more than absolute frequency.
Crypto capital flight. Yadio (parallel-market USD→fiat in a single batch call) + DefiLlama for stablecoin supply across all chains. The component score is the worst USDT P2P premium across Argentina, Turkey, Nigeria — when residents are willing to pay a 10%+ premium for tether, capital flight is already happening.
UNHCR refugee flows. Annual data, monthly revision sweep, multi-month lag. This is a confirming signal, not a leading one — a country with surging outflows is confirmation that other crisis indicators were right. Scoring is YoY % growth in top origins.
Wikipedia + Google Trends. Two-source attention layer. Wikipedia is read through its REST API endpoint for daily pageviews on ~21 curated crisis-relevant articles. Google Trends is read through the unofficial interest-over-time endpoint with hourly polling and heavy caching. Google rate-limits aggressively and the parse breaks roughly twice a quarter, so the sidecar degrades to “unavailable” cleanly rather than producing nonsense. The component score is the max z-score across both sources.
US State Dept advisories. Scraped from the official RSS. L4 (“Do Not Travel”) counts ×4 vs L3 ×1.5 — the operational signal is L4, because that’s the list of countries the US is preparing to evacuate from. Peacetime baseline is ~12 L4 + ~25 L3 (weighted ~60).
The unifying pattern: every sidecar follows a DB-first read pattern. Cron writes to SQLite every five minutes; the page reads from SQLite first, then in-memory cache, then upstream. A request never blocks on an upstream call unless every cache layer missed.
Historical analog finder
Once you have a 13-dimensional score vector representing “what the system looks like right now,” the cosine-similarity question writes itself. Given the current vector v, find the past eti_snapshots rows where cos(v, vᵢ) ≥ 0.5 and tᵢ < now − 24h (so the rolling window doesn’t trivially win). Rank descending, return top 3.
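// Euclidean (L2) norm of a vector.
const norm = (v: number[]): number => Math.sqrt(v.reduce((s, x) => s + x * x, 0));

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  const denom = norm(a) * norm(b);
  // An all-zero vector has no direction: return 0 rather than NaN.
  return denom === 0 ? 0 : dot / denom;
}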
It’s not prediction. It’s “today most resembles 2026-03-15 (ETI 41, level 3, top contributors: bizjet, FX volatility, prediction markets).” Surfacing the historical analog with its top three contributors lets the user reason from a concrete past pattern rather than from an abstract score.
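The filter-and-rank step around that similarity function can be sketched as follows. The 0.5 floor, the 24-hour exclusion and the top-3 cut come from the description above; the Snapshot shape, findAnalogs, and the in-memory array standing in for the eti_snapshots table are assumptions for illustration.

```typescript
interface Snapshot {
  timestamp: number; // epoch milliseconds
  eti: number;
  scores: number[];  // per-component 0..1 scores, same order in every row
}

// Cosine similarity with a guard for all-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function findAnalogs(
  current: number[],
  history: Snapshot[],
  now: number,
  minSimilarity = 0.5,
  topK = 3,
): { snapshot: Snapshot; similarity: number }[] {
  return history
    .filter((s) => s.timestamp < now - DAY_MS) // skip the last 24h
    .map((s) => ({ snapshot: s, similarity: cosine(current, s.scores) }))
    .filter((r) => r.similarity >= minSimilarity)
    .sort((x, y) => y.similarity - x.similarity) // most similar first
    .slice(0, topK);
}
```

Because every score is in [0, 1], the cosine between two snapshots is never negative, so the 0.5 floor is doing real filtering work rather than just discarding anti-correlated rows.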
Performance: at one snapshot per 5 minutes the table grows ~290 rows/day. A full sweep, with JSON.parse dominating the cost, runs in under 100ms on six months of data. If the row count gets large enough to matter, precomputing into a dedicated eti_vectors table is a 30-line fix, and not worth doing yet.
The AEC leverage layer — turning a thermometer into a steering wheel
The composite ETI tells you how stressed the system is. It doesn’t tell you which indicator is worth pulling on if you want to actually change anything. That’s a different question, and it needs different math.
We layered on a port of the Algorithm for Effective Controls (AEC) — originally a Fortran routine (mils.f90 / ml03.f90) from the Cognition system, reimplemented from scratch in TypeScript. The idea: given a signed causal-adjacency matrix M where M[i][j] is the influence from node i to node j, ask “which nodes are the system’s effective levers, and which are the most responsive indicators?”
The math, in five steps:
1. A = I − d·M (propagation operator; d = damping factor)
2. B = Aᵀ·A (symmetric positive semi-definite — stable)
3. x = dominant eigenvector of B (subject to sign constraints)
4. u = A·x (impact vector)
5. X1[i] = x[i]² / Σ x[j]² (Effectiveness Index — sums to 1)
Y1[i] = u[i]² / Σ u[j]² (Controllability Index — sums to 1)
The translation to plain English:
- High X1 (Effectiveness): “this indicator responds strongly to system shocks.” It’s a sensitive thermometer — watch it.
- High Y1 (Controllability): “this indicator has high leverage over the rest of the system.” It’s a steering wheel — if you push it, more things move.
- Gear ratio ||x||² / ||u||² is the system’s amplification factor.
The constrained-solve step (3) is the interesting one. Our port of the original Cognition active-set CG solver matches the Fortran outputs to ~1e-6 on well-conditioned matrices but can collapse to zero on stiff matrices where B is close to I. We replaced the default path with a stabilised solver: unconstrained power iteration first, then a deterministic sign flip of the eigenvector when that alone satisfies the inequality constraints (N+ ≥ 0, N− < 0), then a Gram-Schmidt projection onto any N0 equalities. The legacy active-set CG path is still exported so the admin tooling can compare both solvers side-by-side when investigating edge cases.
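To make the five steps concrete, here is a minimal sketch of the unconstrained path only: plain power iteration for step 3, with no sign constraints and no Gram-Schmidt projection. It is not the stabilised solver described above. aecIndices, matVec and normalisedSquares are hypothetical names, and the uniform start vector is an assumption that fails if it happens to be orthogonal to the dominant eigenvector.

```typescript
type Matrix = number[][];

function matVec(A: Matrix, v: number[]): number[] {
  return A.map((row) => row.reduce((s, a, j) => s + a * v[j], 0));
}

// Squared entries rescaled so they sum to 1 (the form of X1 and Y1).
function normalisedSquares(v: number[]): number[] {
  const total = v.reduce((s, a) => s + a * a, 0);
  return v.map((a) => (total === 0 ? 0 : (a * a) / total));
}

function aecIndices(M: Matrix, d: number, iterations = 500) {
  const n = M.length;
  // Step 1: A = I - d·M
  const A: Matrix = M.map((row, i) => row.map((m, j) => (i === j ? 1 : 0) - d * m));
  // Step 2: B = Aᵀ·A, so B[i][j] = Σ_k A[k][i] · A[k][j]
  const B: Matrix = A.map((_, i) =>
    A.map((_, j) => A.reduce((s, row) => s + row[i] * row[j], 0)),
  );
  // Step 3: dominant eigenvector of B by power iteration (unconstrained).
  let x: number[] = A.map(() => 1 / Math.sqrt(n));
  for (let it = 0; it < iterations; it++) {
    const y = matVec(B, x);
    const len = Math.sqrt(y.reduce((s, a) => s + a * a, 0));
    if (len === 0) break;
    x = y.map((a) => a / len);
  }
  // Step 4: impact vector u = A·x
  const u = matVec(A, x);
  // Step 5: X1 and Y1 as normalised squares of x and u
  return { x, u, X1: normalisedSquares(x), Y1: normalisedSquares(u) };
}
```

On a two-node toy matrix with a single edge 0 → 1 of weight 1 and d = 0.5 (both values picked purely for illustration), node 0 ends up with the larger Y1 and node 1 with the larger X1, which matches the lever versus thermometer reading above.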
The seed matrix has 68 hand-encoded edges authored from standard macro priors:
- Reinhart & Rogoff on sovereign-stress propagation (fx → cds)
- Krugman-style currency-crisis transmission (gdelt → fx → markets)
- The Helbling-Terrones panic backbone (markets → bondVol → creditSpreads)
- Geopolitical → energy (gdelt → energy, military → energy): Brent reactions to Middle East events
- Civil unrest dynamics (acled → radar via state-driven internet shutdowns, acled → displacement → travelAdvisories)
Every edge carries a weight ∈ [-1, +1], a confidence ∈ [0, 1], and a one-line rationale that surfaces on hover in the dashboard.
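One way to picture the edge records and how they turn into the adjacency matrix M, as a sketch under assumptions: the CausalEdge field names and buildAdjacency are hypothetical, and the sketch writes the raw weight into M[from][to] without scaling by confidence, since the text does not say whether confidence enters the math.

```typescript
interface CausalEdge {
  from: string;
  to: string;
  weight: number;     // in [-1, +1]
  confidence: number; // in [0, 1]
  rationale: string;  // one-line explanation shown on hover
}

// Builds M where M[i][j] is the influence from node i to node j.
function buildAdjacency(nodes: string[], edges: CausalEdge[]): number[][] {
  const M = nodes.map(() => nodes.map(() => 0));
  for (const e of edges) {
    const i = nodes.indexOf(e.from);
    const j = nodes.indexOf(e.to);
    if (i < 0 || j < 0) throw new Error(`unknown node in edge ${e.from} -> ${e.to}`);
    if (Math.abs(e.weight) > 1) throw new RangeError(`weight out of range on ${e.from} -> ${e.to}`);
    if (e.confidence < 0 || e.confidence > 1) {
      throw new RangeError(`confidence out of range on ${e.from} -> ${e.to}`);
    }
    // Assumption: confidence is metadata only and does not scale the weight.
    M[i][j] = e.weight;
  }
  return M;
}
```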
Eight scenarios as constraint sets
The dashboard offers eight named scenarios: whatDrivesEti, investWhereRecovery, giveWhereImpact, escalationMilitary, escalationEnergy, deEscalationGdelt, civilUnrestSurge, financialContagion, plus a custom slot. Each one compiles to a {constraints?, shock?, iterate?} triple. Different constraint sets produce different X1/Y1 rankings — the math is the same; the question is what you’re asking.
Pure linear cascade (no eigen-solve) is exposed separately as propagate(matrix, shock, iterate), which iterates Mᵀ · shock n times. That’s what powers the counterfactual playground — drop a unit shock onto any single component, see where it propagates after one, two, three steps. Instantaneous, numerically stable, the supported path for what-if scenarios.
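A sketch of the linear cascade, assuming the same M[i][j] = influence from i to j convention, so one step is next = Mᵀ · current. The function is named propagateShock here to make clear it is an illustration and not the project's propagate; returning the state after every step is also an assumption.

```typescript
// Applies next[j] = Σ_i M[i][j] · current[i] (that is, Mᵀ · current) `iterate` times
// and returns the state after each step.
function propagateShock(M: number[][], shock: number[], iterate: number): number[][] {
  const steps: number[][] = [];
  let current = shock;
  for (let s = 0; s < iterate; s++) {
    const next: number[] = M.map(() => 0);
    for (let i = 0; i < M.length; i++) {
      for (let j = 0; j < M.length; j++) {
        next[j] += M[i][j] * current[i];
      }
    }
    steps.push(next);
    current = next;
  }
  return steps;
}
```

On a three-node chain gdelt → fx → markets with illustrative weights 0.6 and 0.5, a unit shock on gdelt shows up as 0.6 on fx after one step and 0.3 on markets after two, then dies out because the chain has no further edges.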
Weekly Claude refinement loop
Hand-authoring 68 edges from textbook priors gets you a decent v1. Keeping them current with real-world propagation patterns is a separate problem. Every Monday at 06:00 UTC, a cron fires /api/cron/matrix-refine, which:
- Builds context from the current matrix + last 7d of briefings + last 7d of chains
- Calls Claude (Opus 4.7) with web_search_20260209 enabled (max 8 uses)
- Forces a structured submit_matrix_patch tool call returning JSON: edges_added[], edges_adjusted[], edges_retired[], each requiring ≥1 URL citation
- Caps the patch at ≤ 8 changes / run, ≤ 0.2 weight delta per edge (stability tradeoff)
- Writes a new causal_matrices row with promoted=0 (canary mode)
- Computes topLevers(old) vs topLevers(new): if the top-3 Y1 node-IDs are unchanged, auto-promote; otherwise stay in canary for manual review at /admin/matrix
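The guard rails in that loop can be sketched as two small checks. Everything below is illustrative: the patch shape only mirrors the three array names mentioned above, and patchViolations, topLeverIds and canAutoPromote are hypothetical helpers. Two assumptions are baked in: the 0.2 weight-delta cap is applied to adjusted edges only, and "unchanged" top-3 means the same three node IDs regardless of order.

```typescript
const MAX_CHANGES_PER_RUN = 8;
const MAX_WEIGHT_DELTA = 0.2;

interface Cited { edge: string; citations: string[] }
interface Addition extends Cited { weight: number }
interface Adjustment extends Cited { before: number; after: number }
interface MatrixPatch {
  edges_added: Addition[];
  edges_adjusted: Adjustment[];
  edges_retired: Cited[];
}

// Returns human-readable problems; an empty array means the patch passes.
function patchViolations(patch: MatrixPatch): string[] {
  const problems: string[] = [];
  const all: Cited[] = [...patch.edges_added, ...patch.edges_adjusted, ...patch.edges_retired];
  if (all.length > MAX_CHANGES_PER_RUN) problems.push(`too many changes: ${all.length}`);
  for (const c of all) {
    if (c.citations.length < 1) problems.push(`${c.edge}: missing citation`);
  }
  for (const a of patch.edges_adjusted) {
    // Small epsilon so a delta of exactly 0.2 survives floating-point noise.
    if (Math.abs(a.after - a.before) > MAX_WEIGHT_DELTA + 1e-9) {
      problems.push(`${a.edge}: weight delta above ${MAX_WEIGHT_DELTA}`);
    }
  }
  return problems;
}

// IDs of the k nodes with the highest Y1.
function topLeverIds(y1: Record<string, number>, k = 3): string[] {
  return Object.keys(y1).sort((a, b) => y1[b] - y1[a]).slice(0, k);
}

// Auto-promote only when the top-3 lever set is the same before and after.
function canAutoPromote(oldY1: Record<string, number>, newY1: Record<string, number>): boolean {
  const before = topLeverIds(oldY1);
  const after = topLeverIds(newY1);
  return before.length === after.length && before.every((id) => after.indexOf(id) !== -1);
}
```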
Every edge change lands in matrix_edge_updates with before/after weights, Claude’s reasoning, and the citation URLs. The full audit trail is queryable — you can ask “why did edge X change three weeks ago” and get an answer with citations.
Cost: roughly $0.40 per call, ~$2/month at the weekly cadence.
Backtest validator — does the matrix actually predict?
The hardest question about any causal matrix is: does it predict anything? The backtest validator answers that quantitatively. For each historical pair (snapshot at T, snapshot at T+horizon), the matrix predicts a component-delta vector via propagate(M, scores_T, 1). The actual delta is scores_T+h − scores_T. We compute:
- Pearson r (predicted vs actual, across all components × all pairs)
- Sign-hit-rate: fraction of cells where sign(predicted) == sign(actual). 50% is random; > 65% is meaningful
- MAE per cell: raw error magnitude
- ETI-score MAE: same but converted to ETI points via the per-component weights
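The first three metrics are straightforward to compute once predicted and actual deltas are flattened into two equal-length arrays (one entry per component × pair cell). A minimal sketch, with backtestMetrics as a hypothetical name. It treats a cell where both values are exactly zero as a sign hit and returns r = 0 when either series has no variance; both are choices made for this illustration. The weight-based ETI-score MAE is left out.

```typescript
interface BacktestMetrics {
  pearsonR: number;
  signHitRate: number;
  mae: number;
}

function backtestMetrics(predicted: number[], actual: number[]): BacktestMetrics {
  const n = predicted.length;
  if (n === 0 || n !== actual.length) {
    throw new Error("predicted and actual must be non-empty and the same length");
  }
  const meanP = predicted.reduce((s, v) => s + v, 0) / n;
  const meanA = actual.reduce((s, v) => s + v, 0) / n;

  let cov = 0, varP = 0, varA = 0, hits = 0, absErr = 0;
  for (let i = 0; i < n; i++) {
    const dp = predicted[i] - meanP;
    const da = actual[i] - meanA;
    cov += dp * da;
    varP += dp * dp;
    varA += da * da;
    if (Math.sign(predicted[i]) === Math.sign(actual[i])) hits++;
    absErr += Math.abs(predicted[i] - actual[i]);
  }
  const denom = Math.sqrt(varP * varA);
  return {
    pearsonR: denom === 0 ? 0 : cov / denom, // assumption: flat series gives r = 0
    signHitRate: hits / n,
    mae: absErr / n,
  };
}
```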
Surfaced as a [Backtest] button per matrix row in the admin log. The query params (horizonHours, lookbackDays, maxPairs, toleranceMinutes) make it trivial to compare v23 vs v22 on identical windows — that’s the actual “is v23 better than v22” answer, not a vibe check.
The dashboard view
All of that math ends up on one page. The ETI gauge at the top, regional sub-scores, the trend chart, historical analogs, leverage points, the live causal graph.
The leverage panel is the part that gets used most. “Military aircraft is the strongest current lever (Y1 = 0.585) — pushing that node moves the rest of the system the most. Travel advisories is the most responsive indicator (X1 = 0.233) — when something happens, that’s where the shock will surface first.” That’s a different question than “how stressed is the system right now,” and it’s the question you actually need answered if you’re trying to act on the index, not just watch it.
The things this design got right
Looking back at twelve months of iteration:
- Hand-calibrated per-signal curves beat z-scores everywhere. Every time we tried to replace one with a “rigorous” rolling statistical method, the breakpoints stopped corresponding to anything historically meaningful, and the index drifted. The calibration is the model; pretending it’s purely statistical buys nothing.
- Correlation dampening is non-optional. Without it, a single risk-off event triple-counts and the score saturates immediately, breaking the alert-level mapping. With it, the index actually distinguishes a panic event from a multi-source crisis.
- DB-first read pattern with five-minute crons is the right cadence. Faster refreshes don’t add signal; slower ones miss the cascade. Crucially, page-level reads only fall through to an upstream when every cache layer misses; in normal operation the cron is the only thing that calls upstreams, so traffic spikes don’t take out the data sources.
- Capped renormalisation beats both naive renorm and no-renorm. Both alternatives produce systematically wrong numbers under correlated outages. The cap is ugly, principled, and the right answer.
- Surfacing the analog matters more than the score. A 31/100 reading is abstract. “Today most resembles 2026-03-15 — top contributors then were military aircraft, maritime chokepoints, prediction markets” is concrete. Users reason about the analog; the score is just a way of finding it.
- Auditable causal-matrix refinement. Letting Claude touch the matrix is interesting; making every change land in a queryable audit table with before/after weights and citation URLs is what makes it trustworthy. The matrix is a living model; the audit log is the version history.
What we’d revisit
If we were starting over today, two things would change:
- Precompute analog vectors. The full-table cosine-similarity sweep is fine for 30K rows, painful at 300K. A dedicated eti_vectors table indexed on (timestamp DESC) with one float32 array per row would cut the query to a few milliseconds and remove a future scaling cliff.
- Per-component backtest scoring. The matrix-level backtest gives one Pearson r across all components; what we actually want is per-component skill so we can see which edges are pulling their weight and which are noise. The data is already there; the visualisation isn’t.
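For the precomputed analog vectors, the "one float32 array per row" part could look like the sketch below. It assumes the vector is stored as a raw byte blob, which is one plausible reading and not something the design above specifies; encodeVector and decodeVector are hypothetical helpers.

```typescript
// Packs a score vector into 4 bytes per component (float32, platform byte order).
function encodeVector(scores: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(scores).buffer);
}

// Unpacks a blob written by encodeVector back into plain numbers.
function decodeVector(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) throw new RangeError("blob length must be a multiple of 4");
  // Copy first so the float view starts at byte offset 0 and is properly aligned.
  const copy = blob.slice();
  return Array.from(new Float32Array(copy.buffer, 0, copy.byteLength / 4));
}
```

Float32 keeps about seven significant digits, which is more than enough for 0–1 scores feeding a cosine similarity, and it skips the JSON.parse cost that dominates the current sweep.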
Why this architecture works
The pattern that made everything fit together is the same one that made the original bizjet pipeline work: treat the index as a layered problem rather than a single composite calculation.
The bottom layer is 22 sidecars, each one a self-contained module with its own upstream, its own cache, its own DB-first read path, and its own degraded-mode behaviour. They don’t know about each other. They don’t know about ETI. They each return a *Summary shape that says “here’s the latest reading, here’s the source confidence, here’s whether the upstream is alive.”
The middle layer is the ETI composition function — pure, deterministic, takes the 22 summaries and returns a score. No I/O, no state, fully unit-testable. It’s the only place where weights and thresholds live, so changing the methodology means changing one file.
The top layer is the AEC math, which doesn’t know how the underlying scores were computed — it just consumes the score vector and the causal matrix and returns leverage rankings. Same purity, same testability.
That separation is what lets the project be both ambitious and stable. Adding a 23rd sidecar means writing one module and adding one line to the weights table. Adjusting the methodology means changing the composition function and rerunning the analog finder. Tightening the causal model means changing the matrix and re-backtesting. The pieces compose, the layers don’t leak, and the dashboard can keep telling you something useful even when half the data sources are angry.
That’s the whole game with a project like this: each piece does one thing, the math layer above is honest about what it knows, and the user sees one number on top of all of it that they can actually trust.


