
A question came up in the Rithmm community recently that deserves a real answer. The app shows the MLB models running against roughly 1,450 games, but the backtesting documentation references 3+ years of data. Those two things seem to conflict. Here's what's actually going on.
The total backtest pool for MLB is approximately 3,000 games. That pool gets split into two separate groups before any analysis begins. The first group is used exclusively to calibrate the recommendation windows, meaning the ranges the models use to identify where there's value in a given matchup. The second group is used exclusively to calculate the ROI and win rate numbers you see reported in the app.
The 1,450 figure represents roughly one half of that split. It's the set used for performance measurement, not the full dataset.
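To make the split concrete, here is a minimal sketch in Python. It assumes a simple shuffle-and-halve rule over a hypothetical pool of 3,000 game IDs; the function name `split_pool` and the seed are illustrative, and Rithmm's actual splitting procedure may differ.

```python
import random


def split_pool(game_ids, seed=0):
    """Shuffle game IDs and split them into two disjoint halves.

    Assumption: a plain random 50/50 split. The real procedure is not
    described in detail, so treat this as an illustration only.
    """
    shuffled = list(game_ids)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    calibration = shuffled[:midpoint]   # used only to tune recommendation windows
    measurement = shuffled[midpoint:]   # used only to report ROI and win rate
    return calibration, measurement


# Hypothetical pool of 3,000 game IDs.
pool = range(3000)
calibration, measurement = split_pool(pool)
```

Each game lands in exactly one of the two groups, so the game count shown next to the performance numbers is about half of the total pool.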
If you use the same data to both build a strategy and measure how well it performed, the results will almost always look good, because the strategy was tuned on those exact games. That's not a meaningful performance signal. It's circular reasoning dressed up as a track record.
Keeping the calibration set and the measurement set completely separate ensures that the ROI numbers reflect performance on games the models were never optimized against. That distinction is what makes the reported numbers trustworthy.
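The separation can be sketched in code. This example assumes a recommendation window is just a minimum "edge" threshold, bets are flat one-unit stakes, and odds are decimal. The `Game` fields and the helpers `calibrate_window` and `measure` are hypothetical names for illustration, not Rithmm's implementation.

```python
from dataclasses import dataclass


@dataclass
class Game:
    edge: float   # hypothetical model edge for the pick
    won: bool     # whether the pick won
    odds: float   # decimal odds for the pick


def roi(games):
    """Profit per unit staked, assuming a flat 1-unit stake on every game."""
    if not games:
        return 0.0
    profit = sum((g.odds - 1.0) if g.won else -1.0 for g in games)
    return profit / len(games)


def win_rate(games):
    """Fraction of picks that won."""
    if not games:
        return 0.0
    return sum(g.won for g in games) / len(games)


def calibrate_window(calibration, candidate_thresholds):
    """Choose the threshold with the best ROI, looking at calibration games only."""
    return max(
        candidate_thresholds,
        key=lambda t: roi([g for g in calibration if g.edge >= t]),
    )


def measure(measurement, threshold):
    """Report performance on games that played no part in choosing the threshold."""
    selected = [g for g in measurement if g.edge >= threshold]
    return {"bets": len(selected), "win_rate": win_rate(selected), "roi": roi(selected)}


# Tiny made-up data sets, for illustration only.
calibration_games = [
    Game(0.01, False, 2.0), Game(0.03, True, 2.0),
    Game(0.05, True, 2.0), Game(0.02, False, 2.0),
]
measurement_games = [
    Game(0.04, True, 2.0), Game(0.06, False, 2.0),
    Game(0.01, True, 2.0), Game(0.05, True, 1.5),
]

threshold = calibrate_window(calibration_games, [0.0, 0.03])
report = measure(measurement_games, threshold)
```

In this toy data the chosen threshold looks perfect on the calibration games (every selected pick won), while the measurement games give a more modest result. That gap is the reason the reported figures come from the second set only.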
The backtest doesn't pull strictly from the most recent seasons. Rithmm samples across the last several years so that recent seasons are represented in both halves of the split. The goal is a dataset that reflects how the game plays today while still having enough historical depth to build reliable patterns. For most sports Rithmm covers, the backtest spans roughly three years. For MLB specifically, given the high volume of games per season, meaningful depth is reached with fewer full seasons.
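One way to get every season into both halves is to split within each season rather than across the whole pool. The sketch below assumes that approach; the function name `split_by_season`, the season labels, and the per-season game counts are made up for illustration, and the actual sampling scheme may work differently.

```python
import random
from collections import defaultdict


def split_by_season(games, seed=0):
    """Split each season's games in half so both groups cover every season.

    `games` is an iterable of (game_id, season) pairs.
    """
    by_season = defaultdict(list)
    for game_id, season in games:
        by_season[season].append(game_id)

    rng = random.Random(seed)
    calibration, measurement = [], []
    for season in sorted(by_season):
        ids = by_season[season]
        rng.shuffle(ids)
        midpoint = len(ids) // 2
        calibration.extend(ids[:midpoint])
        measurement.extend(ids[midpoint:])
    return calibration, measurement


# Hypothetical sample: 100 games from each of three seasons.
sample = [(f"{season}-{i}", season) for season in (2022, 2023, 2024) for i in range(100)]
cal_ids, meas_ids = split_by_season(sample)
```

With a plain shuffle over the whole pool, one half could end up heavier on older games by chance. Splitting season by season avoids that.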
When you see a pattern with a reported win rate and ROI in the app, those figures were measured on games that had no role in shaping the recommendation windows. The methodology is designed to give you an accurate picture of historical performance, not an optimistic one.
