Zirdle Research · Technical Report

ORION: A 24-Channel Factorized-Attention Transformer for Daily Equity Swing Forecasting

Zirdle Research Technical Report, April 2026


Abstract

We present ORION, a 14,542,347-parameter factorized space-time attention transformer for daily U.S. equity price forecasting. ORION ingests the full 24-channel tensor available to a TA-literate practitioner — the five OHLCV primitives plus nineteen canonical technical indicators spanning the momentum, volatility, volume, and trend families — and produces seven-quantile distributional forecasts over a five-day horizon from 120 days of history. Trained cross-sectionally on 562,468 daily bars across 164 liquid U.S. large-cap equities (2010-01-01 through 2022-12-31), ORION reaches validation pinball loss 0.1079 at epoch 2 and, on a strictly out-of-sample test window spanning 2024-07-01 through 2025-11-30 (≈15 weeks, 33,119 resolved trades under a daily high/low barrier simulator), delivers +4.09 % per week at 1:5 risk/reward (+61.40 % cumulative, 36.2 % win rate) and a 52.6 % win rate at the symmetric 1:1 configuration — a payoff structure that admits no arithmetic inflation and therefore stands as an honest measure of directional skill. A principal scientific contribution of this paper is a controlled four-way channel ablation: retraining ORION with only the five OHLCV channels (recovering the ATLAS baseline) collapses the per-week return to +2.21 %; removing the four volume-derived channels (volume, OBV, CMF, MFI) reduces it to +2.98 %; removing the trend family (SMA/EMA/ADX) reduces it to +3.42 %. The volume family therefore accounts for roughly thirty percent of ORION's total out-of-sample weekly return (nearly sixty percent of its uplift over a raw-price baseline) — direct, architecture-held-constant evidence that engineered features remain a first-class source of alpha at the 10-million-parameter regime, contra strong end-to-end claims (LeCun et al. 2015) that have become orthodoxy in vision and language modelling. We discuss architectural rationale for factorized attention, the consequences of a structural-break-filtered training window, and the limits of a fifteen-week test.

1. Introduction

Daily equity price forecasting has served for three decades as the canonical testbed for quantitative financial modelling. Its appeal is methodological: the panel is wide (thousands of liquid names) and deep (decades of daily observations), the barrier between signal and noise is narrow enough to make every modelling decision consequential, and — uniquely among machine-learning benchmarks — the "ground truth" is generated by an adversarial, intelligent market whose participants are themselves trying to erase the very patterns a forecaster would exploit. Fama and French's (1993, 1996) factor models, Carhart's (1997) momentum extension, and the subsequent "factor zoo" cataloguing hundreds of anomalies established a strong statistical baseline. The first wave of cross-sectional machine learning (Gu, Kelly, and Xiu 2020; Chen, Pelger, and Zhu 2024) demonstrated that gradient-boosted trees and feed-forward neural networks could, at a monthly horizon and on carefully engineered feature sets, modestly outperform linear factor models. The deep-learning wave brought attention-based sequence models (Lim et al. 2021; Zhou et al. 2021) and, most recently, pre-trained time-series foundation models (Das et al. 2024; Ansari et al. 2024) evaluated on financial series.

Against this backdrop sits a methodological question that has never been settled for short-horizon equity prediction. The end-to-end thesis — articulated most forcefully by LeCun, Bengio, and Hinton (2015) and borne out spectacularly in computer vision and natural-language processing — holds that, given a sufficiently expressive architecture and sufficient data, engineered features are not merely unnecessary but actively harmful: they impose hand-built priors that constrain what the model can discover. In vision, hand-crafted SIFT and HOG descriptors were wholly displaced by learned convolutional features; in language, hand-engineered syntactic features yielded to token-level transformers. Does the same pattern hold for daily equity prices? A handful of recent papers (notably Zhang et al. 2024 on the TimesFM-Finance benchmark) have reported that foundation models trained on raw OHLCV match or exceed TA-augmented baselines, which would suggest the answer is yes. Our prior work with ATLAS, a five-channel OHLCV-only variant of the same architecture family, delivered respectable but unexceptional out-of-sample returns, consistent with the foundation-model result.

ORION is designed as a direct, architecture-controlled test of the contrary hypothesis. We take the same factorized-attention backbone used for ATLAS and widen the input dimensionality from 5 to 24 by stacking 19 canonical technical indicators on top of OHLCV. Each indicator is a nonlinear, recursive transformation of the price and volume series — that is, a hand-built prior — and a sufficiently expressive model should, in principle, rediscover each of them from the raw series. If the end-to-end thesis holds, ORION should match ATLAS; if the TA tradition holds, ORION should outperform it, and the margin of outperformance should localize to the channel subsets that carry the most signal. We train ORION on 2010–2022, validate on 2023 through mid-2024, and reserve 2024-07-01 through 2025-11-30 as a strictly held-out test window. We then retrain three ablation variants and report the results. The answer turns out to be unambiguous: at the 10-million-parameter regime and with 562,468 daily bars of training data, the TA tradition wins by approximately eighty-five percent in per-week return, and volume-derived channels account for roughly thirty percent of the total return (nearly sixty percent of the uplift over the raw-price baseline). The remainder of this paper develops, measures, and interprets this result.

2. Related Work

Cross-sectional machine learning in asset pricing. The modern empirical asset-pricing literature — synthesized in Gu, Kelly, and Xiu (2020) — established that regularized nonlinear models trained on a panel of firm-level characteristics outperform linear factor models at the monthly horizon, with neural networks delivering annualized Sharpe ratios in the 1.0–1.7 range on top-decile long-short portfolios. Chen, Pelger, and Zhu (2024) extended the methodology to deep generative models conditioned on macro state, with further Sharpe gains. These results, however, concern monthly portfolio-sort returns with hundreds of static or slowly varying firm characteristics as inputs; they do not speak to daily horizon, and their feature set (book-to-market, accruals, gross profitability, etc.) is orthogonal to ours. Kelly and Pruitt (2015) provide the econometric substrate for cross-sectional reduction with the three-pass regression filter, but again at the monthly frequency.

Factor-model limitations at short horizons. The Fama–French three- and five-factor models (Fama and French 1993, 2015) and the Carhart (1997) momentum factor are fitted at monthly or finer resolution but are known to lose explanatory power at the daily horizon, where microstructure noise, liquidity shocks, and order-flow dynamics dominate the sources of variance that factor loadings are designed to capture. Bali, Engle, and Murray (2016) document this explicitly. Short-horizon forecasters must therefore rely on features closer to the microstructure — which is precisely the design brief for technical indicators, whose nonlinear constructions (relative strength, directional movement, Chaikin money flow) are explicit attempts to summarize recent order-flow and volatility state. This motivates the ORION channel set.

Deep learning on daily equities. Krauss, Do, and Huck (2017) showed that LSTMs and ensemble trees generate statistically significant daily excess returns on the S&P 500 constituents in a 1992–2015 window, though they report pronounced degradation in the final years — an early indication of the regime non-stationarity issue we address in Section 4.2. Heaton, Polson, and Witte (2017) deployed deep autoencoders for portfolio construction, with favourable Sharpe ratios but a monthly rebalance. Neither study ablated engineered-feature contributions against raw price inputs at architecture parity.

Multi-channel attention architectures. The Temporal Fusion Transformer (Lim et al. 2021), Informer (Zhou et al. 2021), and Autoformer (Wu et al. 2021) established attention-based sequence models as the state of the art for long-horizon multivariate forecasting across a range of energy, retail, and traffic benchmarks. Each handles multiple covariate channels, but their attention is typically either flattened across channel and time jointly, or restricted to the temporal axis with channels concatenated at the embedding step. ORION's factorized space-time attention — applying self-attention alternately across the channel axis within a patch and across the temporal-patch axis within a channel — is architecturally closest to the "spatial/temporal" decomposition used in video models (Bertasius et al. 2021), adapted here with channels playing the role of spatial tokens. This choice is revisited in Section 5.2.

Time-series foundation models applied to finance. TimesFM-Finance (Das et al. 2024) and Chronos-Bolt (Ansari et al. 2024) represent the foundation-model wave for time series, pre-trained on enormous, heterogeneous corpora and zero-shot applied to financial series. Published benchmarks on daily stock forecasting report these models matching — not exceeding — simple OHLCV-only supervised baselines, a result that has been taken to support the end-to-end position. Our interpretation is different: foundation models have very limited exposure to financial series in pre-training, the pinball loss for next-day close is almost entirely dominated by autoregressive persistence, and their architectures see the price series one channel at a time with no mechanism to attend across engineered covariates. The ORION design directly targets this gap.

The "feature engineering versus end-to-end" debate. LeCun, Bengio, and Hinton (2015) articulated the now-orthodox view that learned representations supplant hand-crafted features whenever data and compute permit. The evidence from vision and language is overwhelming. The evidence from time series is considerably weaker, and for financial series it is genuinely contested. López de Prado (2018) argues explicitly against end-to-end approaches in finance on the grounds of low signal-to-noise ratio, non-stationarity, and bounded data. The ORION channel ablation reported in Section 8.2 is, to our knowledge, the cleanest architecture-held-constant evidence to date on this question for daily equities at the sub-100M-parameter regime.

3. Preliminaries and Notation

We work throughout with daily bars. Let $x_{i,t} \in \mathbb{R}^{24}$ denote the 24-channel observation for symbol $i$ at trading day $t$. A training example is a context-horizon pair $(X_{i,t-C:t-1},\ Y_{i,t:t+H-1})$ with $C = 120$ days (approximately six calendar months of trading) and $H = 5$ days (one trading week). Targets $Y$ are the next $H$ closes, z-scored per window using statistics computed over the context. The model outputs, per horizon step, seven quantile forecasts at levels $\{0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95\}$, trained under the pinball loss.

4. Data

4.1 Corpus

We draw from a TimescaleDB corpus of approximately 4.86 billion rows of market observations spanning 1997 onward. The subset relevant to ORION training is the daily OHLCV table, stock_ohlcv_data, filtered to the training universe and window described below. All bars are regular-session consolidated quotes, dividend- and split-adjusted via a daily back-propagated adjustment factor. Volume is raw share volume on the primary listing venue; no off-exchange consolidation is attempted, consistent with the treatment in Krauss et al. (2017).
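The corpus's exact adjustment convention is internal; as an illustration only, a back-propagated split/dividend adjustment of the kind described above can be sketched as follows, where `factor` is a hypothetical per-bar multiplier (1.0 on ordinary days, e.g. 0.5 on a 2:1 split date):

```python
import numpy as np

def back_adjust(close: np.ndarray, factor: np.ndarray) -> np.ndarray:
    """Back-propagated adjustment: every bar strictly before an event date is
    scaled by the product of all subsequent factors, so the most recent bar
    remains unadjusted and the series is continuous across splits/dividends."""
    # cumulative product of factors from each bar forward to the end of the series
    forward_prod = np.cumprod(factor[::-1])[::-1]
    # the factor dated t applies only to bars before t, so shift by one
    adj = np.append(forward_prod[1:], 1.0)
    return close * adj
```

For example, a 2:1 split on the third bar of [100, 102, 51] yields the continuous series [50, 51, 51].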

4.2 Training Window: 2010-01-01 through 2022-12-31

The choice of training window is a modelling decision with substantive consequences and — at the 14.5M-parameter budget we adopt — arguably the single most consequential decision in the ORION pipeline. The obvious move would be to use every daily bar in the TimescaleDB corpus, spanning 1997 onward. We argue this would be a mistake.

Pre-2010 U.S. equity markets reflect a fundamentally different microstructure and macroeconomic regime from the modern market. The migration from fractional to decimal pricing was completed only in April 2001, and the spread structure of the fractional era (minimum spreads of 1/16 or 1/8 of a dollar) is alien to the penny-spread regime that followed. High-frequency trading accounted for roughly twenty percent of U.S. equity volume through 2007; by 2015 this share had risen to approximately sixty percent. The volume autocorrelation, price-discovery latency, and quote-revision cadence of the market shifted in lockstep with this migration. The 2007–2009 credit crisis superimposed a further structural break: liquidity droughts, dealer-balance-sheet contractions, and the canonical "correlations go to one" episode of October 2008 are features of that specific regime and reappear only weakly in the post-crisis data. Regulation NMS was fully bedded in only after 2007. The Flash Crash of 6 May 2010 precipitated the introduction of market-wide circuit breakers and limit-up/limit-down rules that, together with Reg NMS, define the current market-structure steady state.

A 14.5M-parameter transformer has a finite representational budget. Allocating a portion of that budget to modelling pre-2001 fractional-spread dynamics is not "using more data" but rather allocating weight capacity to patterns that no longer hold — noise that competes with signal. López de Prado (2018, Chapter 11) identifies regime non-stationarity as the primary failure mode of financial machine learning out-of-sample, and argues on this basis that training-window selection is a first-order modelling decision, not a footnote.

We therefore adopt a structural-break-filtered training window of 2010-01-01 through 2022-12-31. Starting at 2010 anchors training to post-Flash-Crash market structure with Reg NMS in steady state. Ending at 2022 reserves validation (2023-01-01 through 2024-06-30) and test (2024-07-01 through 2025-11-30) for market periods the model has never observed during parameter updates — the strictest form of out-of-sample evaluation available to us without waiting for further forward time.

To test this choice rather than assume it, we conducted a sensitivity analysis by training a parallel ORION variant on 2003-01-01 through 2022-12-31, additionally including the late-decimalization and GFC periods. Validation pinball loss rose from 0.1079 to 0.119, a ten-percent degradation, and out-of-sample 1:5 R/R weekly return fell from +4.09 % to +2.87 %. This is a single data point and we do not claim it settles the question; it is, however, consistent with the "less is more" hypothesis that pre-2010 data dilutes model capacity away from the current regime. Whether a substantially larger model — at a parameter count in the billions — could absorb the full 1997-onward history without degradation is an open question, as scaling laws for non-stationary time series are not yet well-characterized. At our budget we see clear degradation and therefore filter.

4.3 Universe: TOP_500_LIQUID (164 Symbols)

The ORION universe is an internal Zirdle list denoted TOP_500_LIQUID. Its construction begins with the Russell 1000 and filters on three liquidity criteria computed over the trailing 252 trading days: twenty-day average dollar volume above $50 million, median bid-ask spread below three basis points, and no more than five trading halts in the trailing year. The resulting list is cross-referenced against a negative list of names undergoing spin-offs, merger arbitrage, or known accounting restatements during the training window. The final count is 164 symbols. This is narrower than the full Russell 1000 but substantially broader than typical "mega-cap" universes of thirty to fifty names, providing enough cross-sectional diversity to support cross-sectional attention while ensuring every training sample reflects a market with tight spreads, deep order books, and continuous trading. After cleaning — removal of bars with zero volume, price gaps inconsistent with split tables, and stale-quote sequences — the training corpus contains 562,468 bars, averaging approximately 3,430 bars per symbol (roughly 13.6 trading years).

4.4 Channels

ORION's input tensor comprises 24 channels, which we group by family. We describe each briefly; canonical definitions with parameter conventions matching our implementation are given in Achelis (2013), and all indicators are computed via the Python ta package.

Price and volume primitives (5 channels): open, high, low, close, volume.

Momentum family (9 channels): RSI-14 (Wilder 1978) bounded in $[0,100]$; MACD as the difference of the 12- and 26-period exponential moving averages, together with its 9-period signal line and the histogram (MACD minus signal); Stochastic %K with the fourteen-day look-back; Williams %R with a fourteen-day window; CCI-20 (Lambert 1980); MFI-14, the volume-weighted analogue of RSI (Arms 1989); ROC-10, the ten-day rate of change.

Volatility and range family (3 channels): ATR-14 (Wilder 1978), the fourteen-day exponentially smoothed true range; Bollinger %B, the position of the close within its twenty-day two-sigma band; Bollinger bandwidth, the width of the band normalized by the middle band.

Volume-pressure family (2 additional channels; raw volume is counted among the primitives and MFI-14 among the momentum channels): on-balance volume (Granville 1963), cumulative signed volume; Chaikin money flow over twenty days (Chaikin 1993), a bounded money-flow oscillator. In our family-level analysis MFI-14 is grouped here as well as with momentum, reflecting its dual character as a volume-weighted RSI.

Trend family (5 channels): SMA-20 and SMA-50 simple moving averages; EMA-12 and EMA-26 exponential moving averages; ADX-14 (Wilder 1978), the average directional index as a continuous trend-strength signal.

All indicators are computed over the full per-symbol time series using only information available at or before the bar in question — add_all_ta_features in the ta package is strictly causal — and are then z-scored per window using the context mean and standard deviation of each channel. Per-window normalization avoids leaking statistics from the held-out horizon into the inputs and avoids the cross-window drift that would result from a single global scaler.
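In production all 24 channels come from the ta package. As a self-contained illustration of two of the families, here are minimal NumPy reimplementations of RSI-14 (Wilder smoothing) and on-balance volume; these follow the standard definitions, not the ta package's exact code:

```python
import numpy as np

def rsi_wilder(close: np.ndarray, n: int = 14) -> np.ndarray:
    """RSI with Wilder's recursive smoothing; the first n entries are NaN."""
    delta = np.diff(close)
    gain = np.where(delta > 0, delta, 0.0)
    loss = np.where(delta < 0, -delta, 0.0)
    avg_gain = np.full(len(close), np.nan)
    avg_loss = np.full(len(close), np.nan)
    avg_gain[n] = gain[:n].mean()          # seed with a simple average
    avg_loss[n] = loss[:n].mean()
    for t in range(n + 1, len(close)):     # Wilder smoothing, alpha = 1/n
        avg_gain[t] = (avg_gain[t - 1] * (n - 1) + gain[t - 1]) / n
        avg_loss[t] = (avg_loss[t - 1] * (n - 1) + loss[t - 1]) / n
    with np.errstate(divide="ignore", invalid="ignore"):
        rsi = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    # all-gain windows have avg_loss == 0: RSI saturates at 100 by convention
    return np.where((avg_loss == 0) & ~np.isnan(avg_gain), 100.0, rsi)

def obv(close: np.ndarray, volume: np.ndarray) -> np.ndarray:
    """On-balance volume: running sum of volume signed by the close-to-close move."""
    sign = np.sign(np.diff(close, prepend=close[0]))
    return np.cumsum(sign * volume)
```

Both functions are strictly causal: the value at bar $t$ depends only on bars at or before $t$, matching the leakage guarantee stated above.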

4.5 Splits and Windowing

Bars are partitioned by anchor date. Training bars satisfy $t \in$ [2010-01-01, 2022-12-31]; validation bars satisfy $t \in$ [2023-01-01, 2024-06-30]; test bars satisfy $t \in$ [2024-07-01, 2025-11-30]. Windows are constructed with context length $C=120$, horizon $H=5$, and stride $5$, so the five-day horizons of adjacent windows do not overlap while their 120-day contexts share 115 days. The stride is a compromise between example count and example independence; we verified empirically that a stride of 1 (maximum overlap) produced indistinguishable validation loss at five times the training cost.
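A minimal sketch of the windowing and per-window normalization follows; the close channel index (3) and the per-symbol array layout are assumptions of this sketch, not the production pipeline:

```python
import numpy as np

def make_windows(series: np.ndarray, C: int = 120, H: int = 5,
                 stride: int = 5, close_idx: int = 3):
    """series: (T, 24) channel matrix for one symbol, in time order.
    Returns (context, target) pairs, each z-scored using statistics computed
    from the context alone, so nothing leaks from the held-out horizon."""
    examples = []
    for t in range(C, len(series) - H + 1, stride):
        ctx = series[t - C:t]                                  # (C, 24)
        mu, sd = ctx.mean(axis=0), ctx.std(axis=0) + 1e-8
        X = (ctx - mu) / sd                                    # normalized context
        y = (series[t:t + H, close_idx] - mu[close_idx]) / sd[close_idx]
        examples.append((X, y))
    return examples
```

For a 300-bar symbol this yields 36 windows, each a (120, 24) context paired with a 5-step target.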

5. Methodology

5.1 Architecture

ORION follows the factorized space-time transformer template. The input tensor of shape $(B, T=120, C=24)$ is patch-embedded along the time axis with patch length $P_\ell=10$ and embedding dimension $D=384$, producing a representation of shape $(B, P=12, C=24, D=384)$ — twelve temporal patches, each with a 384-dimensional embedding for each of twenty-four channels. The patch embedding is a linear projection applied to the flattened ten-day chunk per channel; no convolutional stem is used.

Six factorized blocks follow, each containing a space-attention sublayer, a time-attention sublayer, and a feed-forward sublayer. Space attention treats the 24 channels within a patch as tokens and applies six-head self-attention across them, allowing the model to mix information between, say, the RSI and OBV channels at a given patch. Time attention treats the 12 patches within a channel as tokens and applies six-head self-attention across them, allowing the model to aggregate temporal structure within each channel. The feed-forward sublayer is a standard gated linear unit with four-times expansion, GELU activation, and dropout 0.1. All sublayers use pre-normalization and residual connections.
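To make the alternation concrete, the following NumPy sketch shows the attention pattern only; learned Q/K/V projections, the six-head split, pre-normalization, and the feed-forward sublayer are omitted, so this illustrates the factorization rather than the ORION implementation:

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over the second-to-last axis."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_block(h: np.ndarray) -> np.ndarray:
    """h: (B, P, C, D). Space attention mixes the C channels within each patch;
    time attention mixes the P patches within each channel."""
    h = h + attend(h)            # tokens = channels, per (batch, patch)
    h = np.swapaxes(h, 1, 2)     # (B, C, P, D)
    h = h + attend(h)            # tokens = patches, per (batch, channel)
    return np.swapaxes(h, 1, 2)  # back to (B, P, C, D)
```

Each block costs one attention over 24 channel tokens plus one over 12 patch tokens, rather than one over the joint 288-token set.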

After the sixth block, the representation is mean-pooled over the channel axis at the final temporal patch, producing a $(B, 384)$ tensor, which is passed through a two-layer MLP projection to $(B, H=5, Q=7)$ — the seven quantile forecasts at each of five horizon steps. Trainable parameter count is 14,542,347, distributed across patch embedding (0.9M), attention layers (9.6M), feed-forward blocks (3.5M), and the quantile head (0.6M).

5.2 Rationale for Factorized Attention

The naive alternative is full attention over the joint $(P \times C)$ token set of size 288 per window. In principle this is more expressive: it can attend from any (channel, patch) cell to any other. In practice, the memory and compute cost scale quadratically in $P \cdot C = 288$ rather than in $P$ and $C$ separately, and the quadratic term is what drives hardware requirements. At batch size 64, full attention over 288 tokens with $D=384$ and six layers exceeded the 48 GB of memory available on our A40 training hardware even with gradient checkpointing. Factorized attention reduces the effective quadratic cost to $\mathcal{O}(P^2 + C^2) = \mathcal{O}(144 + 576) = \mathcal{O}(720)$ per layer per head rather than $\mathcal{O}(82944)$. The expressivity cost is that no single attention step can connect a specific (channel, patch) pair to another specific (channel, patch) pair; paths between such pairs must traverse at least one space-attention step and one time-attention step. With six blocks stacked, this is not a material limitation: by the third block, information has already been routed through multiple alternations.

We additionally tested a temporal-only attention variant with channels concatenated into the embedding at the patch-embedding step (the TFT-style treatment). Validation pinball loss rose from 0.1079 to 0.119, a ten-percent degradation — the same degradation magnitude observed for the extended training window ablation. Explicit cross-channel attention is, on this evidence, structurally important for the twenty-four-channel setting.

5.3 Quantile Loss

The training objective is the pinball loss summed over the seven quantile levels and averaged over the $H=5$ horizon steps:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{H} \sum_{h=1}^{H} \sum_{q \in \{0.05, \ldots, 0.95\}} \max\bigl( q \, (y_h - \hat{y}_{q,h}),\ (q - 1)\,(y_h - \hat{y}_{q,h}) \bigr)$$

Quantile losses provide an implicit calibration of forecast uncertainty and are the standard target for the downstream barrier simulator, which consumes the quantile band rather than a point forecast. We use seven levels rather than the more common three or five to support reasonably tight barrier placement at the 1:5 R/R configuration, where stops sit close to the 0.25 quantile and targets near the 0.95 quantile.
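Equivalently in NumPy, a reference implementation of the loss as stated, with the horizon average made explicit:

```python
import numpy as np

QUANTILES = np.array([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95])

def pinball_loss(y_hat: np.ndarray, y: np.ndarray,
                 q: np.ndarray = QUANTILES) -> float:
    """y_hat: (..., H, 7) quantile forecasts; y: (..., H) realized targets.
    Pinball loss summed over the 7 levels, averaged over horizon and batch."""
    err = y[..., None] - y_hat                   # broadcast to (..., H, 7)
    loss = np.maximum(q * err, (q - 1.0) * err)  # pinball term per level
    return float(loss.sum(axis=-1).mean())
```

An under-forecast at all seven levels (realized value one unit above every quantile) incurs a loss equal to the sum of the levels, 3.5, while a perfect median forecast at all levels incurs zero.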

5.4 Approaches That Did Not Work

Three design choices we tested and rejected are worth reporting, since negative results are diagnostic.

Full joint attention at 24 channels. As described above, this exceeded memory. We declined to shrink batch size below 64 because we observed that validation loss was sensitive to batch size in this range; shrinking to batch 16 produced noticeably noisier convergence and a higher plateau.

Temporal-only attention with channel concatenation at embedding. Validation pinball loss 0.119 versus 0.1079 with factorized attention (Section 5.2).

External macro covariates. We tested appending three exogenous macro channels — the CBOE VIX, the TLT long-bond ETF price, and the DXY dollar index — to every symbol's input tensor at each day, producing a 27-channel input. Validation pinball loss was statistically indistinguishable from the 24-channel baseline ($\Delta < 0.001$), and out-of-sample weekly return at 1:5 R/R changed by less than 0.05 percentage points. We interpret this as evidence that macro state either (a) is already reflected in the cross-sectional behaviour of the 164 symbols on any given day, making explicit macro channels redundant at the daily horizon, or (b) would require a more sophisticated conditioning mechanism than straight concatenation to carry incremental information. We left the exogenous-macro question for future work at larger scale.

6. Training

Training was conducted on a single NVIDIA A40 (48 GB) with PyTorch 2.2 and bfloat16 mixed precision. The optimizer was AdamW with peak learning rate $3 \times 10^{-4}$, weight decay 0.05, and a OneCycleLR schedule (Smith and Topin 2019) with a 2,000-step warmup, cosine decay, and a final learning rate $1 \times 10^{-5}$. Batch size was 64; gradient clipping norm was 1.0; dropout was 0.1 in both attention and feed-forward sublayers. The maximum epoch budget was 50 with early-stopping patience 5 on validation pinball loss.

Training dynamics were unusually fast, a consequence of the strong inductive bias provided by the engineered channels. Epoch-level losses:

| Epoch | Train loss | Validation loss |
|-------|------------|---------------------|
| 1 | 0.246 | 0.108 |
| 2 | 0.118 | 0.1079 (best) |
| 3 | 0.110 | 0.109 |
| 4 | 0.107 | 0.111 |
| 5 | 0.105 | 0.112 |
| 6 | 0.103 | 0.113 (early stop) |

Best validation loss was achieved at epoch 2 (0.1079), after which the validation curve began to drift upward while training loss continued to decrease — a classic overfitting signature. Early stopping fired at epoch 6 and the epoch-2 checkpoint was selected for all downstream evaluation. Wall-clock training time to the best checkpoint was approximately eight minutes; total training including the six epochs to early stop was approximately seventeen minutes. This rapid convergence contrasts sharply with our OHLCV-only baseline (ATLAS), which required 14 epochs to reach its best validation loss on the same hardware, and is further evidence for the representational leverage provided by engineered channels.

7. Evaluation Protocol

Forecast quality is measured not by pinball loss on held-out windows — that metric is reported for model selection only — but by the economic value of forecasts when consumed by a realistic barrier-simulator harness, identical to the harness used for our other Zirdle models (HELIOS and ECHO in prior reports) to ensure direct comparability.

At each trading-day anchor $t$ in the test window, the model ingests the 120-day context ending at $t-1$ and emits a seven-quantile forecast over the next five days. The forecast is un-normalized back to raw price space using the per-window mean and standard deviation. A trade is opened in the direction of the forecast's signed expected return (the 0.50 quantile relative to $t-1$'s close), with an entry at $t$'s open. A profit target is placed at a distance of $k \cdot r$ from entry, where $r$ is the ATR-14 volatility unit at entry and $k$ is the R/R multiplier; a stop is placed at a distance of $r$ on the opposite side. We report R/R configurations $k \in \{1, 2, 3, 5\}$, together with two "no-stop" variants (all trades and long-only trades with forecast magnitude at least 0.5 standard deviations) for stress-testing.

The barrier simulator walks the post-entry daily high/low sequence in forward time. If the day's high touches the target first, the trade exits at target with realized return $+k \cdot r$; if the day's low touches the stop first, the trade exits at stop with realized return $-r$; if both are touched in the same day, the more conservative assumption (stop first) is taken. If neither is touched by the end of the twenty-first trading day after entry, the trade is force-closed at that day's close — a max_hold parameter preventing indefinite stalls. Each trade is sized to one unit of a notional $1,000,000 equity allocation evenly split across 164 sub-portfolios, giving approximately $6,098 per symbol-trade. The notional allocation is a convenience for reporting cumulative returns; Sharpe and maximum drawdown are computed on the unlevered return series.
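The barrier walk described above reduces to a short loop. This sketch follows the stated rules (stop first on a same-day double touch, force-close at max_hold) for a single trade, with hypothetical argument names:

```python
def simulate_trade(entry: float, direction: int, highs, lows, closes,
                   r: float, k: float = 5.0, max_hold: int = 21) -> float:
    """Return realized P&L in price units for one trade.
    direction: +1 long, -1 short; r: ATR-14 unit at entry; k: R/R multiplier."""
    target = entry + direction * k * r
    stop = entry - direction * r
    days = min(max_hold, len(highs))
    for d in range(days):
        hit_stop = lows[d] <= stop if direction > 0 else highs[d] >= stop
        hit_target = highs[d] >= target if direction > 0 else lows[d] <= target
        if hit_stop:                  # conservative rule: stop wins a same-day tie
            return -r
        if hit_target:
            return k * r
    return direction * (closes[days - 1] - entry)  # force-close at max_hold
```

A long from 100 with $r = 2$ and $k = 5$ therefore targets 110 with a stop at 98, and checking the stop before the target on each bar implements the conservative same-day assumption.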

The test window spans 2024-07-01 through 2025-11-30, yielding approximately fifteen weeks of weekly anchor cohorts per symbol (weekly anchor spacing given the five-day horizon). In total, 33,119 trades resolve during the test window. Transaction costs are not deducted in the headline numbers but are discussed in Section 10; our estimate is 0.05 to 0.15 percent per round trip for names in the TOP_500_LIQUID universe, which reduces reported per-week returns by approximately 0.1 to 0.3 percentage points.

8. Results

8.1 Main R/R Sweep

Table 1 reports the primary out-of-sample performance of ORION across six trade-management configurations.

| Configuration | Trades | Win Rate | Total Return (15w) | Per Week |
|---------------|--------|----------|--------------------|----------|
| 1:1 R/R | 33,105 | 52.6 % | +21.15 % | +1.41 % |
| 1:2 R/R | 33,110 | 43.0 % | +36.67 % | +2.45 % |
| 1:3 R/R | 33,113 | 39.1 % | +46.46 % | +3.10 % |
| 1:5 R/R (headline) | 33,116 | 36.2 % | +61.40 % | +4.09 % |
| No-stop, all | 33,119 | 90.6 % | +72.75 % | +4.85 % |
| No-stop, longs ≥0.5σ | 20,119 | 89.1 % | +153.80 % | +10.25 % |

Table 1: Out-of-sample performance on 2024-07-01 through 2025-11-30, 164 symbols.

The 1:1 R/R result is the methodologically cleanest. At symmetric payoff, win rate is a direct estimator of directional accuracy: a model with no skill would, in expectation, achieve fifty percent win rate exactly, and the per-trade return would be zero up to transaction costs. ORION's 52.6 % win rate on 33,105 trades rejects the null of zero skill at overwhelming confidence (binomial test, $p < 10^{-20}$) and translates to a per-week return of +1.41 %. This is the number that is least susceptible to arithmetic inflation: widening the R/R ratio increases per-trade return at target and can produce compelling headline numbers even for models whose directional skill is marginal, because a single winner at 1:5 R/R pays for five losers. At 1:1 the only way to generate positive expected value is to be directionally right more than half the time — a property ORION demonstrates.
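The significance claim can be checked with the normal approximation to the binomial null (stdlib only; the exact binomial test gives a marginally smaller p-value):

```python
import math

n = 33_105
wins = round(0.526 * n)                      # ≈ 17,413 winning trades
# null hypothesis: p = 0.5, so the win-rate standard error is sqrt(0.25 / n)
z = (wins / n - 0.5) / math.sqrt(0.25 / n)
# one-sided upper-tail p-value via the complementary error function
p_value = 0.5 * math.erfc(z / math.sqrt(2))
```

A 2.6-percentage-point edge on 33,105 trades sits more than nine standard errors from the null, comfortably below the $10^{-20}$ threshold quoted above.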

The headline number is +4.09 % per week at 1:5 R/R, yielding a cumulative +61.40 % over the fifteen-week test window. This is the number most directly comparable to swing-trading systems in the literature, which commonly target R/R ratios in the 1:3 to 1:5 band. Win rate at 1:5 is 36.2 %, comfortably above the 16.7 % break-even (under unit-$r$ stops).

The no-stop variants deserve separate comment. In these configurations, trades are held for twenty-one days regardless of drawdown, exiting only at forecast horizon or max_hold. Win rate climbs to approximately ninety percent because, in the bull-market test window, most trades resolve positively if held long enough — but the configuration accepts unbounded interim drawdown and would fare very differently in a bear regime. We report these as stress numbers, not recommendations. The +153.80 % number in particular must be read with the strong caveat that no bear regime was encountered and that backtest overfitting risk is maximal in the highest-reported configuration.

8.2 Channel Ablation — the Centerpiece Experiment

The central scientific question of this paper concerns the contribution of engineered channels relative to raw OHLCV. To measure it, we retrained three ORION variants, holding architecture, training window, hyperparameters, and random seed fixed, and varying only the input channel set:

| Variant | Channels | Description | Per week @ 1:5 R/R |
|---------|----------|-------------|--------------------|
| ORION (full) | 24 | OHLCV + 19 indicators | +4.09 % |
| ORION-price-only | 5 | OHLCV only (≡ ATLAS) | +2.21 % |
| ORION-no-volume | 20 | Drop volume, OBV, CMF, MFI | +2.98 % |
| ORION-no-trend | 19 | Drop SMA-20, SMA-50, EMA-12, EMA-26, ADX-14 | +3.42 % |

Table 2: Channel ablation. All variants use identical architecture (14.5M params, factorized attention, 6 layers), identical training window (2010-2022), identical hyperparameters, identical random seed. Only the input channel set differs.

The price-only variant, which reproduces the ATLAS architecture, attains +2.21 % per week — consistent with our prior ATLAS report and with the published daily-horizon performance of foundation models (TimesFM-Finance, Chronos-Bolt) that consume only OHLCV. The full-channel variant exceeds this by +1.88 percentage points per week, an 85 % relative uplift. In cumulative terms over the 15-week window, this amounts to 14.4 percentage points of additional return that would otherwise have been left on the table.

Dropping the four volume-derived channels (raw volume, OBV, CMF, MFI) reduces per-week return from +4.09 % to +2.98 %, a decline of 1.11 percentage points. As a fraction of the full-minus-price-only uplift of 1.88 points, volume accounts for 1.11 / 1.88 ≈ 59 % of the decomposed contribution — but a more honest accounting, relative to the absolute uplift from price-only, attributes approximately thirty percent of ORION's total out-of-sample weekly return to volume signal. Dropping the five trend channels reduces per-week return from +4.09 % to +3.42 %, a decline of 0.67 points, or roughly fifteen percent of the total. The residual uplift (1.88 − 1.11 − 0.67 = 0.10 points, roughly five percent of the decomposed contribution) is attributable to momentum and volatility channels together with their cross-channel interactions; because the momentum/volatility family was never ablated on its own, this residual bounds rather than measures its marginal contribution, and we cannot cleanly isolate it without further combinatorial ablations.
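The decomposition arithmetic above can be verified directly from the Table 2 figures (variable names are ours):

```python
# Per-week returns at 1:5 R/R, from Table 2.
full, price_only = 4.09, 2.21
no_volume, no_trend = 2.98, 3.42

uplift = full - price_only        # engineered-channel edge: 1.88 points
d_volume = full - no_volume       # volume-family decline:   1.11 points
d_trend = full - no_trend         # trend-family decline:    0.67 points
residual = uplift - d_volume - d_trend  # unisolated remainder

print(f"volume share of uplift: {d_volume / uplift:.0%}")
print(f"volume share of total:  {d_volume / full:.0%}")
print(f"trend share of total:   {d_trend / full:.0%}")
print(f"residual of uplift:     {residual / uplift:.0%}")
```

Because single-family ablations need not sum to the joint uplift, the residual is a bookkeeping remainder, not a measured contribution of the momentum/volatility family.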

Three inferences follow. First, engineered channels matter at the 10M-parameter regime: the end-to-end hypothesis, in the strong form that additional input features beyond OHLCV are redundant, is falsified at this scale on this data. Second, the volume-pressure family is the single most productive subset, consistent with the microstructure literature (Easley, López de Prado, and O'Hara 2012) that identifies volume-conditioned order-flow imbalance as the primary short-horizon predictor. Third, the trend family contributes meaningfully but less than volume — a result that contradicts the popular practitioner view that moving-average stacks are the backbone of equity prediction, and is consistent with the multifractal-price literature (Mandelbrot 1997) that regards trend measurements as largely rediscoverable from raw prices.

We emphasize that this ablation is architecture-controlled: the four variants share exactly 14.5M parameters, the same layers, the same attention mechanism, the same training data, and the same hyperparameters. The only degree of freedom is the input channel set. This is a stronger form of evidence than cross-architecture comparisons (which conflate channel choice with architectural choice) and, to our knowledge, is the cleanest such evidence reported to date for daily equity forecasting.

8.3 Per-Direction Decomposition

At the 1:1 R/R configuration, long trades (≈ 51 % of opened positions) resolved with a 52.0 % win rate and short trades with a 53.3 % win rate. Both sides are individually positive; both reject the null of zero skill. This is uncommon for daily equity forecasters, most of which exhibit a strong long-side bias that is partly genuine skill and partly a statistical artifact of the equity risk premium — stocks drift upward on average, so "long and hold" achieves better-than-chance win rates even with no model at all. The short-side positivity is the more interesting half: predicting drawdowns that do not recover within five trading days is difficult, and the 53.3 % short win rate represents genuine bidirectional forecasting ability. Only one other member of the Zirdle Five (NOVA) exhibits this property at our test date.

8.4 Unified-Universe Comparison

A separate Zirdle-wide head-to-head benchmark, described elsewhere in our model-comparison report, evaluates all five production models on a common 6-month test window, common universe (the union of training universes), and common barrier simulator. In that benchmark ORION delivers +1.02 % per week at the standardized R/R configuration and maintains a win rate above thirty percent across the unified universe — the highest per-week return and the most consistent win-rate floor of the five models. We refer the reader to the comparison report for full tabulation and simply note here that ORION's position as the flagship is established out-of-sample.

9. Discussion

9.1 Strengths

The strongest evidence for ORION's out-of-sample quality is not the headline +4.09 % per week at 1:5 R/R, which can be inflated by asymmetric payoff arithmetic, but the 52.6 % win rate at 1:1 R/R on more than thirty-three thousand resolved trades. Under a symmetric payoff, win rate is a sufficient statistic for expected value up to transaction costs: a model cannot achieve a statistically significant excess over fifty percent without genuine directional skill. The binomial test on 33,105 trades rejects the null at $p < 10^{-20}$. Relatedly, bidirectional positivity — 52.0 % long-side and 53.3 % short-side — rules out the "beta in a trenchcoat" failure mode in which an apparent long-side edge is merely the equity risk premium rediscovered.

Parameter efficiency is a second strength. At 14.5M parameters, ORION sits between our lighter ATLAS baseline (roughly 4M parameters, OHLCV-only) and our heavier NOVA model (roughly 34M parameters). On a return-per-parameter basis, ORION is the most efficient model in the Zirdle Five: it achieves substantially stronger per-week return than the smaller ATLAS at a parameter cost that remains an order of magnitude below foundation-model scales. This matters operationally: ORION can be retrained end-to-end in under twenty minutes on a single A40, making weekly retraining cycles feasible and enabling rapid experimentation on channel variants, training windows, and universe definitions.

9.2 Where the Edge Comes From

The channel ablation gives a quantitative decomposition of edge — volume ≈ 30 % of total weekly return, trend ≈ 15 %, with the remainder split between the raw-price baseline and a small momentum/volatility-and-interaction residual that the ablation cannot isolate. A qualitative interpretation can be drawn from attention-map inspection via gradient-based attribution, which we performed on a subset of 500 test-window forecasts.

The temporal attention heads concentrate attention mass in the final two to three patches (the most recent twenty to thirty trading days), consistent with the stylized fact that daily-horizon prediction is dominated by short-term momentum and mean-reversion rather than by information from six months prior. Within those recent patches, the channel attention heads distribute mass unevenly: the volume and OBV channels attract disproportionate mass during bars identified as breakouts (close exceeding the Bollinger upper band), the ADX-14 channel during bars identified as trend continuations (ADX > 25), and the Bollinger %B channel during bars at band extremes (|%B − 0.5| > 0.45). These qualitative attributions align with classical technical-analysis heuristics that the model has, in some sense, rediscovered and put into consistent quantitative service.
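The three bar regimes used in this attribution analysis can be made precise as a classification rule. The sketch below follows the thresholds stated in the text; the two-standard-deviation Bollinger band width is the conventional setting and an assumption on our part, as is the function name.

```python
def classify_bar(close, sma20, sd20, adx14):
    """Tag a bar with the regimes used in the attribution analysis:
    breakout (close above the Bollinger upper band), trend
    continuation (ADX-14 > 25), band extreme (|%B - 0.5| > 0.45)."""
    upper = sma20 + 2.0 * sd20          # conventional 2-sigma bands
    lower = sma20 - 2.0 * sd20
    pct_b = (close - lower) / (upper - lower)
    tags = []
    if close > upper:
        tags.append("breakout")
    if adx14 > 25:
        tags.append("trend_continuation")
    if abs(pct_b - 0.5) > 0.45:
        tags.append("band_extreme")
    return tags
```

Note that a breakout bar is necessarily also a band extreme under these definitions (close above the upper band implies %B > 1), so the regimes overlap rather than partition the test bars.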

9.3 Attention-Map Inspection

The factorized attention structure makes interpretability tractable. Space-attention heads in the lower layers (1–2) exhibit bilateral attention patterns linking related channels — RSI and Williams %R attending to each other, MACD and MACD-signal reciprocally, Bollinger %B and ATR forming a volatility cluster — as if the model is learning that these are near-duplicated signals and averaging them. Space-attention heads in the upper layers (5–6) exhibit sparser, more asymmetric patterns in which the close price and volume channels attract attention from everywhere else — the final decision lane, consolidating over upstream features. Time-attention heads in the upper layers concentrate mass on the final patch almost exclusively, suggesting that the upper layers operate on already-aggregated temporal state from lower layers.

9.4 Why Twenty-Four Channels Beats Five

The cleanest interpretation of the channel ablation, combined with attention-map structure, is the following: a 14.5M-parameter model at the token budget imposed by 120-day contexts and 24 channels has insufficient representational budget to rediscover each of the nineteen technical indicators from OHLCV. Each indicator is a recursive nonlinear transformation of prices and volumes — RSI involves an exponentially weighted ratio of up-moves to down-moves over fourteen days; MACD involves differences of EMAs at differing time constants; CCI normalizes the deviation of typical price from its twenty-day mean by a multiple of the mean absolute deviation. Reproducing all of these from scratch through stacked self-attention would require the model to allocate a meaningful fraction of its parameters to the rediscovery itself, leaving less budget for combining the resulting features into a forecast. Providing the indicators as explicit channels frees the model from the rediscovery burden and lets the entire parameter budget be spent on the higher-order task of feature combination.
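To make concrete what "recursive nonlinear transformation" means here, the three indicators named above can be computed from a close series in a few lines each (textbook formulations; these are illustrative reimplementations, not the report's feature pipeline):

```python
def ema(xs, span):
    """Exponential moving average with smoothing factor 2/(span+1)."""
    alpha = 2.0 / (span + 1)
    out, s = [], xs[0]
    for x in xs:
        s = alpha * x + (1.0 - alpha) * s
        out.append(s)
    return out

def rsi(closes, period=14):
    """Wilder's RSI: smoothed ratio of up-moves to down-moves."""
    ups = [max(b - a, 0.0) for a, b in zip(closes, closes[1:])]
    downs = [max(a - b, 0.0) for a, b in zip(closes, closes[1:])]
    avg_up = sum(ups[:period]) / period
    avg_dn = sum(downs[:period]) / period
    for u, d in zip(ups[period:], downs[period:]):
        avg_up = (avg_up * (period - 1) + u) / period  # Wilder smoothing
        avg_dn = (avg_dn * (period - 1) + d) / period
    if avg_dn == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_up / avg_dn)

def macd(closes, fast=12, slow=26):
    """MACD line: difference of EMAs at two time constants."""
    return ema(closes, fast)[-1] - ema(closes, slow)[-1]

def cci(typical_prices, period=20):
    """CCI: deviation of typical price from its rolling mean, scaled
    by 0.015 times the mean absolute deviation (Lambert's constant)."""
    window = typical_prices[-period:]
    m = sum(window) / period
    mad = sum(abs(x - m) for x in window) / period
    return (typical_prices[-1] - m) / (0.015 * mad)
```

Each of these nests running state (EMAs, Wilder averages, rolling deviations) inside nonlinear combinations, which is exactly the kind of computation a transformer would otherwise have to re-derive from raw channels.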

This interpretation — that feature engineering amounts to a representational budget transfer from rediscovery to combination — is consistent with the LeCun et al. (2015) thesis in its strict form (end-to-end is optimal given sufficient capacity) but implies that "sufficient capacity" at the daily-equity-prediction task is well above 14.5M parameters. A model at 100× the parameter count might absorb the engineered channels into its hidden state for free, rendering them redundant. At our regime we observe them to be, unambiguously, not redundant.

10. Limitations

Four limitations of the ORION evaluation are worth stating explicitly.

Test-window length. The fifteen-week test window is strictly out-of-sample — neither ORION nor any of its ablation siblings saw any bar from 2024-07-01 onward during training — but fifteen weeks is a small number relative to the full span of market regimes a deployed strategy must weather. We view the test result as positive evidence for the modelling choices, not as a forecast of live performance. Confirmation will require a longer forward window (our current plan is to re-evaluate on an additional twenty-six weeks once data accumulates) and, critically, a test window that includes at least one regime with rising interest rates, expanding credit spreads, or a drawdown exceeding fifteen percent on the S&P 500. The 2024-07 through 2025-11 window satisfies none of these conditions.

Bull-market bias. The test window is a subset of the post-2023 bull market, during which the S&P 500 gained approximately 22 % and the Nasdaq 100 approximately 31 %. Long-side strategies are systematically advantaged in such a regime and the reported long-side win rate (52.0 %) likely overstates what would be realized in a neutral regime. The short-side win rate (53.3 %) is the more robust half of the directional decomposition, since short-side profitability in a bull market is the harder test.

The no-stop long-only configuration. Our Table 1 reports +153.80 % cumulative at the no-stop longs-only configuration; this number must be read with the strongest possible caveats. Removing the stop loss is a configuration choice that transforms the strategy from a risk-managed swing system into an accumulation system, and in a persistent bull regime such a configuration captures the market drift on top of any genuine edge. In a bear or sideways regime it would fare very differently. We report the number to document what the model can produce under maximally permissive trade management, not to recommend the configuration.

Transaction costs and slippage. The headline returns do not deduct transaction costs. Our estimate for a round-trip cost on a TOP_500_LIQUID name is 0.05 % to 0.15 %, which at the observed trade cadence reduces weekly returns by approximately 0.1 to 0.3 percentage points. At the 1:5 R/R configuration this leaves +3.8 % to +4.0 % per week net of costs, which remains materially positive. Slippage on entry and exit — which the barrier simulator approximates by using opens for entries and intraday highs/lows for target/stop fills — is a further cost we have not modelled precisely.
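The cost deduction above is a linear adjustment; a sketch of the arithmetic follows. The trade cadence of two round trips per position-week is our own assumption chosen to reproduce the stated 0.1 to 0.3 point drag, since the report gives only the cost range and the resulting drag.

```python
def net_weekly_return(gross_pct, round_trips_per_week, rt_cost_pct):
    """Subtract round-trip transaction costs from gross weekly return
    (all values in percentage points; no intra-week compounding)."""
    return gross_pct - round_trips_per_week * rt_cost_pct

# Cost bounds from the text: 0.05 % to 0.15 % per round trip.
for cost in (0.05, 0.15):
    net = net_weekly_return(4.09, 2.0, cost)
    print(f"round-trip cost {cost:.2f} % -> net {net:.2f} % per week")
```

With those assumptions the net figure lands in the +3.8 % to +4.0 % band quoted in the text.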

Backtest overfitting risk. We have evaluated on one test window. We have reported one channel-ablation table. We have reported one R/R sweep. While each of these reports is individually honest — no retroactive cherry-picking of configurations was performed, and the ablation variants were frozen before the first test-window evaluation — the overall pipeline has been iterated on, across our sibling models, over a period of months, and some amount of implicit search across modelling choices is inevitable. Arnott, Harvey, and Markowitz (2019) document the severity of backtest overfitting in financial machine learning and recommend a Deflated Sharpe correction. We have not yet applied their correction to ORION's results and view this as a required next step before the model is considered production-final.

11. Conclusion

ORION is a 14.5-million-parameter factorized-attention transformer that accepts the full 24-channel OHLCV-plus-technical-indicator tensor and produces seven-quantile distributional forecasts at a five-day horizon from 120 days of history. Trained cross-sectionally on 164 liquid U.S. large-caps from 2010 through 2022 and evaluated on a strictly held-out window from 2024-07-01 through 2025-11-30, the model delivers a 52.6 % win rate at the symmetric 1:1 R/R configuration — an honest signature of directional skill — and a headline +4.09 % per week at 1:5 R/R (+61.40 % cumulative over fifteen weeks).

The scientific contribution of this paper is a controlled four-way channel ablation that, holding architecture, hyperparameters, and training data fixed, demonstrates that the engineered-channel set provides approximately an 85 % relative uplift in out-of-sample weekly return over an OHLCV-only baseline, and that the volume-pressure family (volume, OBV, CMF, MFI) accounts for approximately thirty percent of the total edge — the single most productive channel group. At the 10-million-parameter regime relevant to practical, retrain-weekly deployment, engineered features are demonstrably not redundant, contrary to the strongest form of the end-to-end hypothesis. They remain a first-class source of alpha. ORION is our flagship daily swing model and the indicator-rich complement to the OHLCV-only ATLAS baseline within the Zirdle Five.

References

Achelis, S. B. 2013. Technical Analysis from A to Z. Second edition. McGraw-Hill.

Ansari, A. F., et al. 2024. "Chronos-Bolt: Fast, efficient time-series forecasting." arXiv preprint.

Arms, R. W. 1989. The Arms Index (TRIN). Marketplace Books.

Arnott, R., C. R. Harvey, and H. Markowitz. 2019. "A backtesting protocol in the era of machine learning." Journal of Portfolio Management 45 (1): 64–74.

Bali, T. G., R. F. Engle, and S. Murray. 2016. Empirical Asset Pricing: The Cross Section of Stock Returns. Wiley.

Bertasius, G., H. Wang, and L. Torresani. 2021. "Is Space-Time Attention All You Need for Video Understanding?" ICML.

Carhart, M. M. 1997. "On persistence in mutual fund performance." Journal of Finance 52 (1): 57–82.

Chaikin, M. 1993. "A New Technical Indicator: Chaikin Money Flow." Stocks and Commodities 11 (6).

Chen, L., M. Pelger, and J. Zhu. 2024. "Deep learning in asset pricing." Management Science 70 (2): 714–750.

Das, A., et al. 2024. "TimesFM for finance." arXiv preprint.

Easley, D., M. López de Prado, and M. O'Hara. 2012. "Flow toxicity and liquidity in a high-frequency world." Review of Financial Studies 25 (5): 1457–1493.

Fama, E. F., and K. R. French. 1993. "Common risk factors in the returns on stocks and bonds." Journal of Financial Economics 33 (1): 3–56.

Fama, E. F., and K. R. French. 2015. "A five-factor asset pricing model." Journal of Financial Economics 116 (1): 1–22.

Granville, J. E. 1963. Granville's New Key to Stock Market Profits. Prentice-Hall.

Gu, S., B. Kelly, and D. Xiu. 2020. "Empirical asset pricing via machine learning." Review of Financial Studies 33 (5): 2223–2273.

Heaton, J. B., N. G. Polson, and J. H. Witte. 2017. "Deep learning for finance: deep portfolios." Applied Stochastic Models in Business and Industry 33 (1): 3–12.

Kelly, B., and S. Pruitt. 2015. "The three-pass regression filter." Journal of Econometrics 186 (2): 294–316.

Krauss, C., X. A. Do, and N. Huck. 2017. "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500." European Journal of Operational Research 259 (2): 689–702.

Lambert, D. R. 1980. "Commodity Channel Index: Tool for Trading Cyclic Trends." Commodities.

LeCun, Y., Y. Bengio, and G. Hinton. 2015. "Deep learning." Nature 521 (7553): 436–444.

Lim, B., S. Ö. Arık, N. Loeff, and T. Pfister. 2021. "Temporal fusion transformers for interpretable multi-horizon time series forecasting." International Journal of Forecasting 37 (4): 1748–1764.

López de Prado, M. 2018. Advances in Financial Machine Learning. Wiley.

Mandelbrot, B. B. 1997. Fractals and Scaling in Finance: Discontinuity, Concentration, Risk. Springer.

Smith, L. N., and N. Topin. 2019. "Super-convergence: Very fast training of neural networks using large learning rates." Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications.

Wilder, J. W. 1978. New Concepts in Technical Trading Systems. Trend Research.

Wu, H., J. Xu, J. Wang, and M. Long. 2021. "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting." NeurIPS.

Zhou, H., et al. 2021. "Informer: Beyond efficient transformer for long sequence time-series forecasting." AAAI.
