A short note recording a recurring observation rather than a new finding. Of the published trading papers that have crossed this desk with reported Sharpe ratios above three, none has reproduced cleanly on independently collected data. One was a daily macro-sentiment model whose classifier could not separate the next session at better than coin-flip once it was scored on data its authors had not seen. Another was a higher-frequency residual stat-arb paper whose proxy curve was so far from the published shape that the local copy carries the label “abstract-only reproduction.” A third, lower-frequency, behaved a little better, but only once the universe was trimmed back to the few instruments the original paper had access to.
The pattern is not “papers are wrong.” The pattern is that headline numbers above three are usually the upper end of a search distribution. The reproducer is not running the same search, on the same vendor’s data, with the same look-ahead conventions, and should not expect to land on the same point. The discipline this enforces is mundane. Any new paper that crosses the desk with a reported Sharpe above three is now budgeted one to two weeks of replication before any decision is made on whether it joins the research programme. That budget has paid for itself often enough that it no longer counts as a cost.
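The "upper end of a search distribution" point can be made concrete with a toy simulation. The sketch below is illustrative only and is not a reconstruction of any paper discussed here. It assumes 1,000 strategy variants, one year of daily data, and i.i.d. normal returns with zero true edge; the names `annualised_sharpe` and `best_of_search` and all parameter values are hypothetical choices made for the example.

```python
import numpy as np


def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of per-period returns along the last axis."""
    mean = returns.mean(axis=-1)
    std = returns.std(axis=-1, ddof=1)
    return np.sqrt(periods_per_year) * mean / std


def best_of_search(n_variants=1000, n_days=252, daily_vol=0.01, seed=0):
    """Pick the best in-sample variant from a search with no true edge.

    Assumption: every variant is pure noise (mean-zero normal returns),
    so any Sharpe the winner shows in-sample comes from selection alone.
    """
    rng = np.random.default_rng(seed)
    in_sample = rng.normal(0.0, daily_vol, size=(n_variants, n_days))
    # Fresh draws stand in for independently collected data.
    out_sample = rng.normal(0.0, daily_vol, size=(n_variants, n_days))

    sharpe_in = annualised_sharpe(in_sample)
    sharpe_out = annualised_sharpe(out_sample)

    winner = int(np.argmax(sharpe_in))
    return sharpe_in[winner], sharpe_out[winner], sharpe_in


best_in, same_out, all_in = best_of_search()
print(f"best in-sample Sharpe over the search: {best_in:.2f}")
print(f"same variant on fresh data:            {same_out:.2f}")
print(f"median in-sample Sharpe of the search: {np.median(all_in):.2f}")
```

Under these assumptions each one-year Sharpe estimate has a standard error of roughly one, so the maximum over a large search sits well out in the right tail even though no variant has any edge, while the same variant scored on fresh draws is just another sample centred on zero. Real searches are correlated and smaller than this one, so the size of the effect here should not be read as an estimate for any particular paper.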
The same care is applied in reverse to anything written here. A clean number in a back-test is the start of a question, not the end of one.