Overfitting

The single biggest reason backtests don't survive contact with reality. Here's the checklist for catching it before you deploy.

What it is

Overfitting happens when your strategy parameters were chosen, explicitly or implicitly, to fit the in-sample data so well that they describe the noise rather than the signal. The classic version is too many parameters: a 12-input rule will almost always “explain” the past better than a 2-input rule, but its out-of-sample performance is usually worse because the extra inputs were fitting random fluctuations.

The subtler version is selection bias. You don't need to overfit a single strategy to overfit your process — testing 1,000 candidate strategies and reporting the best one's Sharpe is overfitting in expectation, even if each individual candidate looks clean. The seminal warning is Bailey, Borwein, López de Prado & Zhu (2014) — “Pseudo-Mathematics and Financial Charlatanism”.
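The selection effect is easy to reproduce. The sketch below is a toy simulation, not anything from Quantis: it generates N return series of pure Gaussian noise (true Sharpe of zero), then reports only the best one. The function names, the 1% per-period volatility and the 252-period sample are all illustrative assumptions.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio: mean divided by sample standard deviation."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def best_of_n_noise_sharpe(n_strategies, n_periods, seed=0):
    """Best Sharpe among n_strategies pure-noise series (true Sharpe is 0).

    Assumption: returns are i.i.d. Gaussian with 1% per-period volatility.
    """
    rng = random.Random(seed)
    return max(
        sharpe([rng.gauss(0.0, 0.01) for _ in range(n_periods)])
        for _ in range(n_strategies)
    )


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        best = best_of_n_noise_sharpe(n, n_periods=252)
        # Annualise assuming 252 periods per year.
        print(f"N={n:>4}  best per-period Sharpe={best:.3f}  "
              f"annualised={best * 252 ** 0.5:.2f}")
```

Because every candidate here has zero edge by construction, any Sharpe the "winner" shows is produced entirely by the search.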

The five diagnostics

Each of these flags a different overfitting failure mode. You want them all green before deploying.

1. Walk-forward Sharpe degradation

Compare in-sample Sharpe to walk-forward OOS Sharpe. A drop of less than 30% is normal — OOS is harder than in-sample by construction. A drop of 50%+ means the parameters were fit to in-sample noise that didn't survive into the OOS window.

A drop to negative or near-zero is the unambiguous overfit signature. Don't deploy.
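The thresholds above can be turned into a small check. This is a minimal sketch: the function names are hypothetical, the "near-zero" cutoff of 0.1 is an assumed value, and the "borderline" label for drops between 30% and 50% is an assumption, since the text does not classify that range.

```python
def sharpe_degradation(in_sample_sharpe, oos_sharpe):
    """Fractional drop from in-sample to OOS Sharpe (0.3 means a 30% drop)."""
    if in_sample_sharpe <= 0:
        raise ValueError("in-sample Sharpe must be positive to measure a drop")
    return (in_sample_sharpe - oos_sharpe) / in_sample_sharpe


def classify_walk_forward(in_sample_sharpe, oos_sharpe, near_zero=0.1):
    """Map an in-sample / OOS Sharpe pair onto the bands described in the text.

    near_zero is an assumed cutoff for "negative or near-zero" OOS Sharpe.
    """
    if oos_sharpe <= near_zero:
        return "overfit: do not deploy"
    drop = sharpe_degradation(in_sample_sharpe, oos_sharpe)
    if drop < 0.30:
        return "normal"
    if drop >= 0.50:
        return "likely overfit"
    return "borderline"  # assumed label for the unclassified 30-50% range


if __name__ == "__main__":
    for is_sr, oos_sr in [(1.5, 1.2), (1.0, 0.6), (1.5, 0.6), (1.5, -0.1)]:
        print(is_sr, oos_sr, classify_walk_forward(is_sr, oos_sr))
```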

2. PSR-DSR gap

PSR (the Probabilistic Sharpe Ratio) is honest about sampling noise; DSR (the Deflated Sharpe Ratio) additionally adjusts for selection across candidates. A wide gap (50+ percentage points) means the variant search drove the headline Sharpe: you would have found something that looks this good even from a pool of nothing-burgers. Narrow gaps (under 10 points) mean the variant search added almost nothing because the underlying strategy was already strong.
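To make the gap concrete, here is a rough sketch of both statistics using the formulas published by Bailey and López de Prado, as best I can reproduce them; it is not necessarily how Quantis computes them. Assumptions: Sharpe ratios are per-period (not annualised), kurtosis is the raw value (3 for a normal distribution), and trial_sr_std (the standard deviation of Sharpe estimates across the candidates you tried) is an input you supply.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def psr(sr, n_obs, benchmark_sr=0.0, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe Ratio: P(true Sharpe > benchmark_sr)."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _STD_NORMAL.cdf((sr - benchmark_sr) * math.sqrt(n_obs - 1) / denom)


def expected_max_sharpe(n_trials, trial_sr_std):
    """Approximate expected best Sharpe among n_trials zero-edge candidates."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    return trial_sr_std * (
        (1.0 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )


def dsr(sr, n_obs, n_trials, trial_sr_std, skew=0.0, kurt=3.0):
    """Deflated Sharpe Ratio: PSR measured against the best-of-N noise level."""
    hurdle = expected_max_sharpe(n_trials, trial_sr_std)
    return psr(sr, n_obs, hurdle, skew, kurt)


def psr_dsr_gap_points(sr, n_obs, n_trials, trial_sr_std, skew=0.0, kurt=3.0):
    """Gap between PSR and DSR in percentage points."""
    return 100.0 * (psr(sr, n_obs, 0.0, skew, kurt)
                    - dsr(sr, n_obs, n_trials, trial_sr_std, skew, kurt))


if __name__ == "__main__":
    # Illustrative inputs: daily Sharpe of 0.1 over 252 days, 1,000 trials.
    print(f"{psr_dsr_gap_points(0.1, 252, 1000, 0.05):.1f} points")
```

The only difference between the two numbers is the hurdle: PSR tests against zero, DSR tests against the Sharpe you would expect the luckiest of N useless candidates to show.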

3. CPCV fraction-positive

The fraction of CPCV (combinatorial purged cross-validation) paths with positive Sharpe is a robustness pulse. A real edge wins across most slices of history (target: above 75%); a fitted edge wins on a few lucky slices and loses on the rest. Below 60% is a strong overfit signal even if the median Sharpe is positive.
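Given one Sharpe ratio per CPCV path, the check itself is a few lines. The sketch below assumes the path Sharpes have already been computed elsewhere; the function names, the "inconclusive" label for the 60-75% range, and the sample numbers are illustrative assumptions.

```python
import statistics


def fraction_positive(path_sharpes):
    """Share of CPCV paths whose Sharpe is strictly positive."""
    if not path_sharpes:
        raise ValueError("need at least one path")
    return sum(1 for s in path_sharpes if s > 0) / len(path_sharpes)


def classify_cpcv(path_sharpes):
    """Apply the 75% target and 60% warning level from the text."""
    frac = fraction_positive(path_sharpes)
    if frac > 0.75:
        return "robust"
    if frac < 0.60:
        return "strong overfit signal"
    return "inconclusive"  # assumed label for the 60-75% range


if __name__ == "__main__":
    # Made-up path Sharpes: the median is positive, yet only 6 of 11 paths win.
    paths = [1.8, 1.2, 0.9, 0.4, 0.2, 0.1, -0.3, -0.6, -0.8, -1.1, -1.4]
    print(statistics.median(paths), fraction_positive(paths), classify_cpcv(paths))
```

The sample data shows why the median alone is not enough: it is positive here even though barely more than half the paths make money.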

4. Parameter sensitivity

Vary each parameter ±20% and re-backtest. A robust strategy degrades smoothly — Sharpe might drop from 1.0 to 0.7 but stays positive across the neighbourhood. An overfit strategy collapses: changing the lookback from 60 to 65 days takes the Sharpe from 1.5 to -0.2. The single best Sharpe was sitting on a knife edge of parameter space.

The fix isn't to lock in the knife edge — it's to back off to a robust plateau even if it costs you 20% of the headline Sharpe. The 20% you give up in-sample is what you would have given up out-of-sample anyway, with interest.
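A sensitivity sweep along these lines can be sketched as follows. Here backtest stands for any callable you provide that maps a parameter dict to a Sharpe ratio; it is a hypothetical interface, and the two toy backtests are fabricated purely to mirror the numbers in the text (1.0 to 0.7 on a plateau, 1.5 to -0.2 on a knife edge).

```python
def sensitivity_sweep(backtest, params, rel_shift=0.20, steps=4):
    """Re-run backtest while moving one parameter at a time across +/- rel_shift.

    Returns {param_name: [(value, sharpe), ...]}. Integer parameters are
    rounded back to integers after shifting.
    """
    results = {}
    for name, base in params.items():
        rows = []
        for i in range(-steps, steps + 1):
            value = base * (1 + rel_shift * i / steps)
            if isinstance(base, int):
                value = round(value)
            rows.append((value, backtest({**params, name: value})))
        results[name] = rows
    return results


def is_plateau(rows, min_sharpe=0.0):
    """True if Sharpe stays above min_sharpe across the whole neighbourhood."""
    return all(sharpe > min_sharpe for _, sharpe in rows)


def toy_robust(p):
    """Toy backtest: Sharpe of 1.0 at lookback 60, easing to 0.7 at +/-20%."""
    return 1.0 - 0.3 * abs(p["lookback"] - 60) / 12


def toy_knife_edge(p):
    """Toy backtest: Sharpe of 1.5 at exactly 60, -0.2 everywhere else."""
    return 1.5 if p["lookback"] == 60 else -0.2


if __name__ == "__main__":
    for bt in (toy_robust, toy_knife_edge):
        rows = sensitivity_sweep(bt, {"lookback": 60})["lookback"]
        print(bt.__name__, is_plateau(rows), rows)
```

Varying one parameter at a time keeps the number of re-runs small; it will miss interactions between parameters, so treat a pass as necessary rather than sufficient.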

5. IC decay shape

A factor with positive in-sample IC (information coefficient) but no decay (IC stays flat at all forward horizons) is suspicious, because real edge usually decays smoothly as the horizon extends. A factor with positive in-sample IC and negative IC at adjacent horizons is almost certainly an artefact of the construction window. Real factors don't flip sign across a 5-day shift.
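The three shapes described above can be told apart mechanically once you have an IC per forward horizon. This is a rough sketch: the function name, the return labels and the 10% flatness tolerance are assumptions, and the IC values would come from your own factor analysis.

```python
def classify_ic_decay(ic_by_horizon, flat_tolerance=0.10):
    """Classify the shape of IC across forward horizons.

    ic_by_horizon: iterable of (horizon_days, ic) pairs.
    flat_tolerance: assumed cutoff; the curve counts as flat when its range
    is within this fraction of the largest absolute IC.
    """
    ics = [ic for _, ic in sorted(ic_by_horizon)]
    if len(ics) < 2:
        raise ValueError("need IC at two or more horizons")
    # Sign flip between neighbouring horizons: likely a construction artefact.
    if any(a * b < 0 for a, b in zip(ics, ics[1:])):
        return "sign_flip"
    # Essentially no change across horizons: suspicious lack of decay.
    if max(ics) - min(ics) <= flat_tolerance * max(abs(ic) for ic in ics):
        return "flat"
    # IC never increases as the horizon extends: the expected shape.
    if all(a >= b for a, b in zip(ics, ics[1:])):
        return "smooth_decay"
    return "irregular"


if __name__ == "__main__":
    print(classify_ic_decay([(1, 0.05), (5, 0.04), (10, 0.03), (20, 0.01)]))
    print(classify_ic_decay([(1, 0.05), (5, 0.05), (10, 0.049), (20, 0.05)]))
    print(classify_ic_decay([(1, 0.05), (5, -0.03), (10, 0.04)]))
```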

The anti-patterns to avoid

Testing many variants, reporting the best. The fundamental sin. Even with disciplined OOS testing, the act of selection inflates expected best-of-N performance. DSR exists to quantify this — respect it.
Walking forward, then peeking. You run walk-forward, see the OOS Sharpe is bad, tweak parameters, re-run. The second walk-forward is no longer out-of-sample for the new parameters; you just selected on it. Each peek burns the remaining OOS budget.
In-sample parameter tuning at fine granularity. Tuning lookbacks across all integers from 5 to 252 lets you pick whichever happens to have hit a 2-sigma noise event. Coarse grids (e.g. 20, 60, 120, 252) leave less room for noise-fitting and force you to justify each choice.
Look-ahead bias. Using the next bar's data, end-of-day prints to trade at the open, or fundamentals before their reporting date: all are silent overfits. Code review the data pipeline before trusting any backtest.
Survivorship bias. Backtesting on a universe constructed today (e.g. “current S&P 500”) bakes in survivor selection. Use point-in-time index membership.
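Of the anti-patterns above, look-ahead bias is the easiest to demonstrate in code. The sketch below uses a deliberately absurd signal (the sign of the same bar's return) to show how a missing one-bar lag manufactures a perfect track record. The function name and the sample returns are made up for illustration.

```python
def strategy_returns(signals, asset_returns, lag=1):
    """Per-bar strategy returns, holding the signal observed lag bars earlier.

    lag=1 trades on information available before the bar starts.
    lag=0 reproduces the look-ahead bug: the position uses the same bar's data.
    """
    if len(signals) != len(asset_returns):
        raise ValueError("signals and asset_returns must be the same length")
    if lag < 0:
        raise ValueError("lag must be non-negative")
    out = []
    for t, ret in enumerate(asset_returns):
        # No position until a sufficiently old signal exists.
        position = signals[t - lag] if t - lag >= 0 else 0.0
        out.append(position * ret)
    return out


if __name__ == "__main__":
    asset_returns = [0.01, -0.02, 0.015, -0.005, 0.01]
    # A signal that can only be known after the bar closes.
    signals = [1 if r > 0 else -1 for r in asset_returns]
    print("with look-ahead:", strategy_returns(signals, asset_returns, lag=0))
    print("properly lagged:", strategy_returns(signals, asset_returns, lag=1))
```

With lag=0 every bar is a winner; with lag=1 the same signal has no such luck, which is the gap a pipeline review is meant to catch.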

What to watch in the result card

All five diagnostics green simultaneously. Walk-forward stable, PSR-DSR gap narrow, CPCV fraction-positive > 75%, IC decay smooth, parameters in a robust neighbourhood. This is rare; it's also the bar.
Composite overfit score. Quantis rolls these into a single trust pill. A green pill means all five passed; an amber pill means at least one failed and you should look at which.