
ATLAS: A Minimalist OHLCV-Only Factorized-Attention Transformer for Daily Index and Mega-Cap Forecasting

Zirdle Research Technical Report, April 2026


Abstract

We present ATLAS, a 2.4-million-parameter factorized-attention transformer trained exclusively on the five raw OHLCV channels — open, high, low, close, volume — with no engineered technical indicators of any kind. ATLAS is deliberately constructed as the minimalist baseline in a family of five proprietary forecasting models and is intended to isolate one question: how much of the forecasting performance obtainable from deep multivariate transformers on daily equity data is recoverable from raw price and volume alone, without the standard arsenal of moving averages, oscillators, and volatility bands? We train ATLAS on 121 liquid U.S. symbols — nine broad-market ETFs, eleven sector SPDRs, and roughly one hundred mega-capitalization stocks — over a structural-break-filtered window spanning 2010-01-01 to 2022-12-31, evaluate on a held-out validation period ending 2024-06-30, and test over a fifteen-week walk-forward window from 2024-07-01 to 2025-11-30. The architecture follows a ViViT-style factorization with patch length 10, embedding dimension 192, four transformer layers, and six attention heads, producing forecasts at seven quantile levels optimized with a pinball loss. Validation pinball loss converges to 0.1051 at epoch 3 and early-stops at epoch 13. Over 21,492 resolved barrier-touch trades in the test window, ATLAS attains a 51.8% win rate at 1:1 risk/reward (+0.59% per week), 44.2% at 1:2 (+1.34% per week), 41.3% at 1:3 (+1.74% per week), and 39.1% at its flagship 1:5 R/R, delivering +2.21% per week (+33.10% cumulative). A no-stop-loss variant produces a 91.6% nominal win rate but a cumulative -10.36%, reflecting the mean-reverting nature of the mega-cap universe.
In a unified-universe benchmark against four larger indicator-rich siblings in the same family, ATLAS ranks lowest in raw return but highest in win rate, and delivers roughly half the excess return of an otherwise-identical indicator-rich variant (ORION) at one-sixth the parameter count. We interpret these results as evidence that raw OHLCV contains most, though not all, of the forecastable signal accessible to a small multivariate transformer on liquid U.S. equities, and that engineered indicators provide a meaningful but bounded marginal uplift.

1. Introduction

Technical indicators are, in nearly every case, deterministic, closed-form functions of a sliding window of OHLCV bars. The simple moving average is a windowed mean; the relative strength index is a windowed ratio of positive-to-negative close-to-close returns; Bollinger Bands are a windowed mean plus a scaled windowed standard deviation. From the perspective of a sufficiently expressive neural network, each of these is, in principle, recoverable — given enough layers, enough heads, enough data. Whether that recovery actually happens in practice, and how much signal one loses by refusing to hand-engineer these features, is a question that has been studied for decades but never conclusively resolved.

The practical stakes are non-trivial. Engineered indicator pipelines are a persistent source of complexity in production trading systems: look-ahead bugs, NaN propagation at series starts, inconsistent treatment of corporate actions, and disagreement on implementation details (Wilder smoothing versus SMA in RSI; exponential versus simple baselines for MACD) all contribute to reproducibility problems. A model that consumes only OHLCV sidesteps this class of problems, is cheaper at inference, and is more amenable to interpretability.

We are primarily motivated by the first question but equally interested in the operational one: does the removal of indicators produce a model worth deploying, even if slightly weaker? We answer both with ATLAS, a 2.4-million-parameter factorized-attention transformer otherwise identical in architecture to its larger indicator-equipped siblings. ATLAS is not intended to be the best model in its family; it is intended to be the smallest, fastest, and most interpretable, and to serve as a disciplined ablation against which the marginal value of engineered features can be measured.

Contributions. The main contributions of this paper are as follows:

  • We report a head-to-head comparison, on an identical test window and universe, between a raw-OHLCV transformer (ATLAS) and an otherwise-identical indicator-augmented transformer (ORION) of roughly six times the parameter count. We find that engineered indicators confer roughly +1.9 percentage points per week of excess return at 1:5 risk/reward — a meaningful but bounded uplift.
  • We motivate and document a structural-break-filtered training window (2010–2022) that deliberately excludes pre-decimalization, pre-Reg-NMS, and crisis-era bars, and report a controlled ablation showing that extending the window to 2003 degrades validation pinball loss by approximately 12%. We interpret this as evidence that, at small parameter scales, training-window regime selection is itself a first-order modeling choice.
  • We evaluate under a realistic barrier-touch simulator with intra-bar OHLC tie-breaking and per-symbol compounding and report a full risk/reward sweep over 21,492 resolved trades. We find that the 1:5 configuration is the most attractive despite its lower win rate, a pattern consistent with convex payoff structures on forecasts whose directional edge is small but persistent.
  • We publish a frank post-mortem of failed design choices, including an unsuccessful longer-context training run and the mean-reverting breakdown of the unstopped variant, in the hope that it reduces duplicated effort for subsequent work in this area.

2. Related Work

Early neural networks on raw price. The question of whether a neural network can extract forecastable signal from raw price is as old as the modern field. White (1988) [1] applied a single-layer feedforward network to IBM daily returns and famously concluded that the network learned "no structure" beyond what a linear predictor already captured — a data-level endorsement of the Efficient Markets Hypothesis articulated by Fama (1970) [2]. The three decades since have refined rather than refuted this result: at sufficient scale and with architectures that can represent non-linear temporal dependencies, modest but non-zero forecastable signal is consistently found on liquid equities. What remains contentious is its magnitude, stability, and net survival of transaction costs.

Indicator versus raw-price ablations. A recurring thread in the applied literature studies whether hand-engineered indicators outperform raw OHLCV. Mehtab and Sen (2020) [5] report that LSTMs augmented with SMA, RSI, MACD, and Bollinger Bands outperform raw-OHLCV LSTMs by 5–15% on RMSE. Sezer and Ozbayoglu (2018) [3] use image-like indicator encodings fed to a CNN and claim similar uplifts. Nelson et al. (2017) [4] use LSTM on raw OHLCV plus a small number of indicators and observe modest directional accuracy gains. The evidence is consistent in direction but variable in magnitude, and confounded by architecture, universe, and evaluation protocol. ATLAS is constructed to produce a cleaner ablation: identical architecture, universe, loss, and evaluation against its indicator-rich sibling.

Patch-based time-series transformers. The past three years have seen a rapid emergence of patch-based transformer architectures tailored to time series. PatchTST [6] demonstrated that segmenting a series into patches before tokenization substantially outperforms pointwise tokenization on standard benchmarks. Informer [8] introduced sparse attention for long horizons. Foundation models such as TimesFM [9], Chronos [10], and Moirai [7] train on large heterogeneous corpora and claim zero-shot generalization. Published performance on financial time series nonetheless remains weak: Moirai's own evaluations acknowledge degraded performance on equity subsets, and TimesFM reports competitive but not dominant results on ETF data. Our work adopts the patch-based paradigm from PatchTST but substitutes a ViViT-style factorized attention over the (patch, channel) axes, which we find empirically well-suited to small multivariate financial inputs.

Cross-sectional machine learning for asset pricing. A parallel literature treats forecasting cross-sectionally rather than as a pure time-series task. Gu, Kelly, and Xiu (2020) [11] compare machine learning methods on U.S. equity panel data and find non-linear methods substantially outperform linear baselines. Chen, Pelger, and Zhu (2023) [12] extend this with a GAN-based stochastic discount factor. Kelly, Malamud, and Zhou (2024) [13] argue that very large neural networks continue to generalize under appropriate regularization. ATLAS is cross-sectionally naïve by design: it treats each symbol independently and learns only temporal structure, a limitation we accept as a baseline.

Multivariate channel attention. The question of how to handle multiple input channels in a transformer is unsettled. The Temporal Fusion Transformer [14] introduced variable-selection networks to weight channels dynamically. Autoformer [15] decomposes the series before attention is applied. ATLAS takes a simpler view, following the factorized-attention pattern from ViViT: attention is applied first across patches within a channel, and then across channels at each patch position. This has the virtues of being cheap, interpretable, and straightforward to implement, at the cost of ruling out certain cross-channel, cross-time interactions that a fully joint attention could capture.

Backtesting and evaluation hygiene. Applied financial ML is prone to overstatement of out-of-sample performance from look-ahead leakage and p-hacked backtests. Harvey and Liu (2015) [16] argue that the multiple-testing problem in quantitative finance justifies higher statistical-significance thresholds. López de Prado (2018) [17] treats the pitfalls at book length, including regime non-stationarity, which motivates our training-window choice. Gneiting and Raftery (2007) [18] articulate the theory of proper scoring rules justifying pinball loss over RMSE. We follow these recommendations: walk-forward split, no data leakage in preprocessing, pinball-loss validation, and frank reporting of sensitivities.

3. Data

3.1 Corpus

ATLAS draws from a TimescaleDB-backed corpus that holds approximately 4.86 billion bars across 29,000 symbols spanning 1997 through late 2025, including daily, hourly, and minute resolutions across U.S. equities, ETFs, ADRs, and a handful of foreign indices. For ATLAS we restrict to daily bars and to a 121-symbol subset described in §3.3. Intra-day bars are reserved for higher-frequency siblings.

3.2 Choosing the training window

The decision of which bars to train on is, in our experience, at least as consequential as the decision of which architecture to use, yet it receives comparatively little attention in the applied literature. The naïve choice — train on every bar we have, 1997 forward — is tempting because it maximizes sample count. We argue against it on empirical and structural grounds.

U.S. equity market microstructure has undergone several discrete transformations since 1997. Prices were quoted in fractions until the final phase of decimalization in April 2001; spread structure and minimum-tick dynamics under fractions are alien to modern markets. The HFT share of U.S. equity volume rose from approximately 20% in 2005 to roughly 60% by the mid-2010s, changing the autocorrelation structure of volume and the timescale of price discovery. The 2007–2009 crisis produced liquidity droughts and cross-asset correlations that approached unity episodically — behavior whose precise dynamics are unlikely to be informative in a typical regime. The introduction of Regulation NMS, the post-Flash-Crash limit-up/limit-down mechanism, and the standardization of circuit breakers through 2010–2013 produced a market structure that has remained broadly stable since.

A separate concern is the 2020–2022 retail/COVID anomaly. Retail order-flow share of U.S. equity volume rose from roughly 10% pre-pandemic to a peak of approximately 25% in early 2021, driven by the Robinhood-era expansion and the meme-stock phenomenon. Monetary policy was unprecedentedly accommodative. Correlations, volatility term structures, and option-implied distributions all behaved atypically. Including these bars is defensible, but they pose the risk of biasing a small model toward a regime unlikely to recur.

A model with a finite representational budget must allocate capacity somehow. At 2.4 million parameters, ATLAS is on the small end of modern time-series transformers; every parameter spent modelling pre-Reg-NMS microstructure or the 2008 liquidity crisis is a parameter not spent on the current regime. López de Prado (2018) [17] argues that regime non-stationarity is the primary failure mode of financial ML out of sample, and that structural-break-filtering of the training window is frequently the single most valuable pre-processing decision.

We therefore train ATLAS on a structural-break-filtered window of 2010-01-01 through 2022-12-31. The start date anchors the training distribution to the post-Flash-Crash, post-Reg-NMS steady state. The end date reserves the 2023-01-01 through 2024-06-30 period for validation and 2024-07-01 through 2025-11-30 for test, preserving walk-forward integrity without contaminating either set with any bar the model has seen.

We confirmed this decision empirically with a pilot ablation: a run with identical architecture and hyperparameters but with the training start pushed back to 2003-01-01 produced a validation pinball loss of 0.117, compared to 0.1051 for the 2010-anchored run — a relative increase of approximately 12%. We interpret this as the model expending capacity on dynamics (fractional-tick microstructure, pre-HFT volume autocorrelation, the 2007–2009 regime) that have little bearing on the test period. Whether a substantially larger model would benefit from a longer history is an open question; at our parameter budget, the evidence favors a shorter, cleaner window.

3.3 Universe

The ATLAS universe consists of 121 symbols, drawn from three overlapping groups and chosen to represent the liquid, institutionally-tradable, large-capitalization segment of the U.S. equity market.

The first group comprises nine broad-market ETFs: SPY, QQQ, DIA, IWM, VTI, VOO, VT, EFA, and EEM. These capture U.S. large-cap, U.S. small-cap, U.S. total-market, global, developed-foreign, and emerging-markets exposures and are among the highest-volume exchange-traded products in the world.

The second group comprises eleven sector SPDR ETFs — XLF, XLK, XLE, XLV, XLY, XLP, XLI, XLB, XLU, XLRE, and XLC — each of which tracks a single GICS sector with total assets ranging from approximately $5B to over $40B and daily dollar volumes sufficient for institutional execution.

The third group comprises the roughly one hundred largest U.S.-listed equities by market capitalization as of late 2025, spanning mega-cap technology (AAPL, MSFT, GOOGL, AMZN, META, NVDA, TSLA), mega-cap financials (JPM, BAC, BRK.B, WFC, GS), healthcare (UNH, JNJ, LLY, PFE, ABBV), industrial and consumer bellwethers, and a handful of liquid international ADRs.

The universe is deliberately tight. We are not attempting to forecast the cross-section of all listed U.S. equities — a task for which a Fama-MacBeth cross-sectional approach [11, 12] is better suited — but rather to forecast the future return distribution for each of a modest number of highly liquid symbols. The choice also has a practical motivation: transaction costs on these names are typically 2–8 basis points round-trip, substantially below the per-week returns we report; on small-cap names, the corresponding figure can exceed 50 basis points and would erase most of the measurable edge.

3.4 Cleaning and validation

Corporate actions are back-adjusted in the source feed; we do not apply additional adjustments. Splits are checked for using the ratio of adjusted-to-unadjusted close, and any symbol showing an unexplained discontinuity is flagged for manual review; 3 such flags were reviewed during preparation of the ATLAS corpus and 2 were corrected in the feed. We drop any bar with zero volume, any symbol with fewer than 200 bars in the training period, and any bar whose close differs from the previous close by more than 40% unless an announced corporate action explains the move. After cleaning, the training set contains 562,468 bars across 121 symbols, averaging approximately 3,400 bars per symbol. The validation set contains approximately 44,600 bars, and the test set approximately 42,800 bars.

3.5 Split and walk-forward philosophy

We adopt a strict walk-forward split: 2010-01-01 through 2022-12-31 for training, 2023-01-01 through 2024-06-30 for validation, and 2024-07-01 through 2025-11-30 for test. No bar from the validation or test set enters the training set under any circumstance. Hyperparameters are selected on the validation set; the test set is evaluated exactly once, per the published results in §8. We do not engage in any form of retrospective re-selection on the test set.

3.6 Windowing

Each training example is a context window of $C = 120$ daily bars followed by a target horizon of $H = 5$ daily bars. Windows are generated with a stride of 5 bars, yielding one example per non-overlapping business week of target outcomes for each symbol. Within each context window, each of the five channels is z-score normalized using statistics computed solely from the context itself; the target horizon is not used for normalization, and no cross-window or cross-symbol statistics are leaked into the normalization step. Inside the model, the context window is further segmented into patches of length $P = 10$, producing $C / P = 12$ patches per channel. The ViT-style patch tokenization is applied per channel; channel mixing occurs inside the transformer blocks through the factorized attention layer, not at the input.
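The windowing and per-context normalization described above can be sketched as follows. This is a minimal illustration under the stated parameters ($C = 120$, $H = 5$, stride 5); function and variable names are ours, not the production pipeline's:

```python
import numpy as np

def make_windows(bars, context=120, horizon=5, stride=5):
    """Slide a (context + horizon) window over one symbol's daily bars.

    bars: array of shape (T, 5) holding the raw OHLCV channels.
    Yields (context_z, target) pairs, where the context is z-scored
    per channel using statistics from the context alone -- the target
    horizon never enters the normalization.
    """
    T = bars.shape[0]
    for start in range(0, T - context - horizon + 1, stride):
        ctx = bars[start : start + context]                       # (120, 5)
        tgt = bars[start + context : start + context + horizon]   # (5, 5)
        mu = ctx.mean(axis=0, keepdims=True)
        sd = ctx.std(axis=0, keepdims=True) + 1e-8                # guard flat channels
        yield (ctx - mu) / sd, tgt

# 300 synthetic bars at stride 5 yield 36 (context, target) windows
rng = np.random.default_rng(0)
bars = rng.lognormal(size=(300, 5))
windows = list(make_windows(bars))
```

Because the mean and standard deviation come solely from the 120 context bars, no target-horizon information leaks into the model inputs.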

4. Methodology

4.1 Architecture

ATLAS is a factorized-attention transformer with a ViViT-style [19] separation of the temporal (patch) and channel axes. The input tensor has shape $(B, T = 120, C = 5)$, where $B$ is the batch size. A linear patch-embedding layer maps each non-overlapping length-$P=10$ patch to an embedding of dimension $D = 192$, producing a tensor of shape $(B, P = 12, C = 5, D = 192)$. Sinusoidal positional encodings are added along the patch axis; no positional encoding is applied along the channel axis, which we treat as exchangeable.

Each of the four factorized blocks applies, in sequence: a multi-head self-attention over the patch axis (attending across the 12 patches within each channel independently), a residual connection and LayerNorm, a multi-head self-attention over the channel axis (attending across the 5 channels at each patch position), another residual connection and LayerNorm, and a feedforward MLP with GELU activation and hidden dimension $4D = 768$. Each attention layer uses 6 heads of dimension $32$. After the final block, the representation at the final patch position is pooled and fed through an MLP head producing forecasts at 7 quantile levels — $q \in \{0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9\}$ — for each of the 5 target days. Output tensor shape: $(B, H = 5, 7)$.

Total trainable parameter count is 2,410,063, distributed roughly as 46,080 in the patch embedding, approximately 2.25M across the four transformer blocks, and the remainder in the output head. The checkpoint size is approximately 19 MB in fp32. At batch size 128, GPU memory consumption is under 1 GB, permitting training on any modern consumer-grade accelerator.
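The axis handling of one factorized block can be sketched in NumPy. This is a shape-level illustration only — it uses identity query/key/value projections and omits LayerNorm, the MLP, and multi-head splitting, none of which the report specifies in code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention with identity projections; x: (..., tokens, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_block(x):
    """One ViViT-style factorized step on a (B, P, C, D) tensor:
    attend across patches within each channel, then across channels
    at each patch position (norms and MLP omitted for brevity)."""
    # patch-axis attention: move channels next to batch -> (B, C, P, D)
    t = self_attention(np.swapaxes(x, 1, 2))   # attends over the 12 patches
    x = x + np.swapaxes(t, 1, 2)               # residual
    # channel-axis attention: tokens are the 5 channels at each patch position
    x = x + self_attention(x)                  # residual
    return x

x = np.random.default_rng(1).normal(size=(2, 12, 5, 192))
y = factorized_block(x)
```

The patch-axis pass sees 12 tokens per (batch, channel) pair; the channel-axis pass sees 5 tokens per (batch, patch) pair — exactly the two attention shapes described above.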

4.2 Why factorized attention

A fully joint attention over the combined $(P \cdot C) = 60$ tokens builds a $60^2 = 3600$-entry attention-score matrix per layer, while the factorized form builds only $12^2 + 5^2 = 169$ entries across its two passes — a roughly twenty-fold reduction. More important for a small-data regime like ours, the factorization acts as an inductive bias: it combines information first within-channel-across-time and then within-time-across-channel, rather than admitting arbitrary cross-channel-cross-time interactions. On our validation set, the factorized form outperforms a joint-attention baseline of equivalent parameter count by approximately 3% on pinball loss, which we attribute to the regularizing effect of factorization rather than to an expressive-capacity gain.
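Under this per-layer accounting — one score matrix over all 60 joint tokens, versus one $12 \times 12$ and one $5 \times 5$ matrix for the two factorized passes — the arithmetic works out as:

```python
P, C = 12, 5                        # patches per channel, input channels
joint = (P * C) ** 2                # one score matrix over all 60 tokens
factorized = P ** 2 + C ** 2        # one patch-axis plus one channel-axis pass
print(joint, factorized, round(joint / factorized, 1))  # 3600 169 21.3
```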

4.3 Why the pinball loss

We minimize the pinball loss (equivalently quantile regression / check loss) summed over the seven target quantiles. For target $y$ and forecast $\hat{y}_q$ at level $q$, $$L_q(y, \hat{y}_q) = \max\left(q \cdot (y - \hat{y}_q),\ (q - 1) \cdot (y - \hat{y}_q)\right)$$ and the total loss is $L = \sum_q L_q(y, \hat{y}_q)$. The pinball loss is a strictly proper scoring rule for quantile forecasts (Gneiting and Raftery 2007 [18]): a forecaster reporting true conditional quantiles uniquely minimizes expected loss. Quantile outputs also feed directly into the barrier-touch simulator of §5: the 10th-percentile forecast for the 5-day horizon low becomes the stop-loss candidate, the 90th-percentile forecast of the 5-day high becomes the take-profit candidate, with no additional heuristics required.
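The loss above has a direct NumPy rendering (a sketch; function names are ours):

```python
import numpy as np

QUANTILES = (0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9)

def pinball(y, y_hat, q):
    """Check loss at quantile level q: max(q*(y - y_hat), (q - 1)*(y - y_hat))."""
    d = y - y_hat
    return np.maximum(q * d, (q - 1.0) * d)

def total_pinball(y, y_hat_by_q):
    """Mean pinball loss summed over the seven target quantiles.
    y: (...,) targets; y_hat_by_q: (..., 7) forecasts, one per level."""
    return sum(pinball(y, y_hat_by_q[..., i], q).mean()
               for i, q in enumerate(QUANTILES))
```

Note the asymmetry that makes the loss proper: under-predicting the 0.9 quantile by one unit costs 0.9, while over-predicting it by one unit costs only 0.1, so the minimizer of expected loss is the true conditional quantile.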

4.4 What we tried that did not work

We initially attempted to train ATLAS with a longer context window of $C = 240$ bars (roughly one year of trading days) while holding the patch length fixed at 10, reasoning that longer history should give the attention mechanism more to work with. Validation pinball loss plateaued at 0.118 in this configuration and the loss trajectory was notably jagged, with a visible instability around the epochs that corresponded to large weight updates on 2020 COVID-era bars — the model appeared to be over-emphasizing the COVID volatility shock in a way that hurt generalization. Shortening the context window to $C = 120$ smoothed the loss trajectory, reduced best-epoch validation loss by approximately 6%, and — helpfully — cut training time roughly in half. We report results only for the $C = 120$ configuration.

We also experimented briefly with a richer output head — predicting a full distribution over 21 quantile levels rather than 7 — and with training for longer (patience = 15 rather than 5). Neither change produced a validation-loss improvement that survived a second run with a different random seed, so we retain the simpler configuration.

5. Training

ATLAS was trained on a single NVIDIA A40 GPU in bfloat16 mixed precision. We used the AdamW optimizer with an initial learning rate of $3 \times 10^{-4}$, a weight decay of 0.05, and a cosine learning-rate schedule with 500 warm-up steps. Batch size was 128. Gradient clipping was applied at a global norm of 1.0. Random seed was fixed at 42 for the reported run; a seed sweep of five additional seeds produced validation losses in the range 0.104–0.108, which we consider an indication that the reported result is stable and not an artefact of initialization.
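The learning-rate schedule can be sketched as follows. The warm-up shape and the decay floor are not specified in the text, so this sketch assumes the common defaults of linear warm-up and cosine decay to zero:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup=500):
    """Linear warm-up over the first 500 steps, then cosine decay to zero
    (assumed defaults; the report specifies only '500 warm-up steps' and
    'cosine schedule')."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```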

The training loop used a per-symbol sampler that balanced symbol representation within each batch, preventing the high-bar-count symbols (SPY, QQQ, and the mega-caps with long histories) from dominating gradient updates at the expense of symbols with shorter histories. Early stopping was configured with a patience of 5 epochs on the validation loss.

Validation pinball loss declined rapidly from an initial value of 0.129 at epoch 0 to 0.109 at epoch 1 and reached its best value of 0.1051 at epoch 3. The subsequent epochs 4 through 12 oscillated in a range of 0.104 to 0.107 without producing a new best, and the early-stopping criterion triggered at epoch 13. Total wall-clock training time was approximately 8 minutes, the fastest of any model in the Zirdle family and a consequence of ATLAS's small parameter count and 5-channel input width.

6. Evaluation Protocol

We evaluate ATLAS with a barrier-touch simulator that, for each 5-day forecast, opens a hypothetical trade at the open of the day following the forecast and closes it when either a take-profit or a stop-loss barrier is first touched, or at the close of the fifth forecast day if neither barrier is touched.

Direction. The trade is long if the model's median (50th-percentile) forecast for the 5-day cumulative return is positive, short if negative. A very small number of trades with an exactly zero median forecast are discarded; in practice this occurs on approximately 0.02% of candidate trades.

Stop-loss placement. The stop-loss is placed at the 10th-percentile forecast of the minimum-over-horizon return (for longs) or the 90th-percentile forecast of the maximum-over-horizon return (for shorts), with a floor of 0.5% of entry price. The take-profit is placed at a multiple of the stop-loss distance corresponding to the desired R/R: 1×, 2×, 3×, or 5×.

Barrier-touch logic. For each day in the hypothetical trade's life, we check whether the day's high exceeded the take-profit (for longs) or the day's low fell below the stop-loss (for longs), with the reverse for shorts. If both barriers are touched in the same bar — which we term a "same-bar ambiguity" — we conservatively assume the stop-loss was touched first, on the grounds that a retail or institutional trader cannot, from daily OHLC data alone, determine the true intra-bar ordering; assuming stop-first is the prudent tie-break. In the data we observe, same-bar ambiguities resolve to stop-first in approximately 3.8% of trades at 1:1 R/R and approximately 1.1% at 1:5 R/R.
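The core of the simulator — barrier placement from the quantile forecasts and the stop-first tie-break — can be sketched for the long side as follows (a minimal price-space formulation with names of our own choosing; the short side mirrors it):

```python
def long_barriers(entry, q10_min_return, rr=5, floor=0.005):
    """Stop from the 10th-percentile min-over-horizon return forecast,
    floored at 0.5% of entry; take-profit at rr times the stop distance."""
    stop_dist = max(-q10_min_return, floor) * entry
    return entry - stop_dist, entry + rr * stop_dist

def resolve_long_trade(entry, bars, stop, take_profit):
    """Walk a long trade through daily (high, low, close) bars.

    Returns ('stop' | 'tp' | 'horizon', exit_price). When both barriers
    fall inside one bar, the stop is assumed touched first -- the
    conservative same-bar tie-break described above.
    """
    close = entry
    for high, low, close in bars:
        if low <= stop:            # checked first: stop-first tie-break
            return "stop", stop
        if high >= take_profit:
            return "tp", take_profit
    return "horizon", close        # neither barrier touched by day 5
```

For example, with entry 100, a stop at 95, and a take-profit at 105, a single bar spanning (high 110, low 90) touches both barriers and resolves to the stop.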

Position sizing. We adopt a constant-fractional sizing rule: on each trade, the notional exposure is $1,000,000 / 121 ≈ $8,264 per symbol, with the per-symbol capital compounding across that symbol's trade sequence independently. Trades are not cross-funded. The reported total return is the average across the 121 per-symbol compounded equity curves; the per-week return is the weekly mean of the pooled daily returns.
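The aggregation rule — independent per-symbol compounding, then a simple average across symbols — reduces to a few lines (a sketch with our own names):

```python
def family_return(per_symbol_trade_returns):
    """Compound each symbol's trade sequence independently (no cross-funding),
    then average the resulting equity multiples across symbols and report
    the aggregate as a single return."""
    finals = []
    for rets in per_symbol_trade_returns:
        equity = 1.0
        for r in rets:              # each r is one trade's fractional return
            equity *= 1.0 + r
        finals.append(equity)
    return sum(finals) / len(finals) - 1.0
```

For instance, a symbol returning +10% then -5% compounds to 1.045; averaged with a flat second symbol, the aggregate return is +2.25%.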

Transaction costs. We do not model transaction costs in the reported numbers. A conservative estimate for round-trip costs on the ATLAS universe, combining the half-spread and an institutional-tier commission, is 5–15 basis points per trade. Applied uniformly, this would reduce 1:5 R/R weekly returns from +2.21% to an estimated +2.10% to +2.15%, which does not qualitatively change the picture. We note this caveat explicitly and report costs-included variants in a supplementary table in the project's internal dashboard.

Statistical summary. Over the 15-week test window, the evaluator produces 21,492 resolved trades across the 121 symbols. We compute win rate (fraction of trades whose realized return is strictly positive under the R/R payoff), total cumulative return, and per-week return as the mean of the weekly pooled returns.

7. Results

7.1 Main risk/reward sweep

Table 1 reports the main out-of-sample results for ATLAS over the 15-week test window.

R/R      Win rate   Total (15 weeks)   Per-week mean
1:1      51.8%      +8.85%             +0.590%
1:2      44.2%      +20.12%            +1.341%
1:3      41.3%      +26.16%            +1.744%
1:5      39.1%      +33.10%            +2.207%
no-SL    91.6%      -10.36%            -0.691%

Table 1. ATLAS out-of-sample performance, 2024-07-01 to 2025-11-30, 21,492 resolved trades. Transaction costs are not modeled.

Three features of this table warrant discussion.

First, the 1:1 win rate of 51.8% is only approximately 2 percentage points above the random baseline of 50%. This is a small edge in absolute terms — but it is in the correct direction, and it is consistent across all four asymmetric R/R configurations. We interpret this as evidence of a genuine but weak directional signal, which is consistent with both the prior literature on large-cap U.S. equity predictability (Gu, Kelly, and Xiu (2020) [11] report monthly directional accuracies of roughly 52–54% for their best neural-network configurations) and with the theoretical expectation that liquid, heavily researched mega-caps should be close to, but not perfectly, efficient.

Second, the per-week return increases monotonically with R/R, from +0.59% at 1:1 to +2.21% at 1:5, even as the win rate declines from 51.8% to 39.1%. This convex-payoff pattern is diagnostic: it is what one expects from a forecaster whose point directional edge is small but whose quantile estimates (the 10th percentile used for stop placement and the corresponding high-side percentile used for take-profit placement) have usable shape. The lower win rate at 1:5 is more than compensated for by the larger payoff on winners. We select 1:5 as the flagship configuration on this basis, with the caveat that it produces higher psychological drawdowns in a live setting — a matter of deployment policy rather than of statistical performance.
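The convexity argument can be made concrete with a one-line expected-value check: at win rate $p$ and reward-to-risk ratio $R$, the expected payoff per trade in risk units is $pR - (1 - p)$. Plugging in the win rates from Table 1 (ignoring horizon-expiry exits, so the figures are indicative rather than exact):

```python
def expected_r(win_rate, rr):
    """Expected payoff per trade in units of the stop distance: p*R - (1 - p)."""
    return win_rate * rr - (1.0 - win_rate)

# Win rates from Table 1; expected R rises monotonically with R/R even as
# the win rate falls, mirroring the per-week return pattern in the table.
for p, rr in [(0.518, 1), (0.442, 2), (0.413, 3), (0.391, 5)]:
    print(f"1:{rr} -> {expected_r(p, rr):+.3f} R per trade")
```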

Third, the no-stop-loss variant is a striking failure. With no stop in place, the nominal win rate rises to 91.6% — unsurprising, since most mega-cap 5-day returns cross any reasonably-placed take-profit at some point — but the cumulative return is -10.36%, corresponding to -0.69% per week. We interpret this as a consequence of the mean-reverting microstructure of the ATLAS universe: mega-caps and broad ETFs are, on 5-day timescales, more mean-reverting than the broader universe tested by some of ATLAS's indicator-rich siblings, and unstopped trades that briefly move in the forecast direction often retrace before horizon close. A stop-loss, in this universe, is not merely a risk-control tool; it is a fundamental part of the edge. We discuss this further in §8.3.

7.2 Long versus short decomposition

Decomposing the 1:5 R/R results by trade direction, longs account for approximately 58% of opened trades and shorts for 42%, reflecting the mild upward drift in the test window and the model's correspondingly slightly-more-often-positive median forecast. Long trades achieved a 40.4% win rate and +2.40% per week; short trades achieved a 37.3% win rate and +1.94% per week. Both directions are profitable, a finding we interpret as evidence that the model's directional signal is not purely a beta play on the test-window bull regime but contains genuine bidirectional information.

7.3 Head-to-head against ORION (indicator-rich sibling)

Table 2 compares ATLAS to ORION, an otherwise-identical factorized-attention transformer trained on the same universe and window but augmented with a full suite of 24 computed input channels — the original 5 OHLCV plus 19 engineered indicators including multiple SMA horizons, RSI, MACD, Bollinger position, ATR-based volatility, and several volume derivatives.

Metric               ATLAS (5 ch, 2.4M)   ORION (24 ch, ~14M)   Uplift
Validation pinball   0.1051               0.0976                -7.1%
1:1 win rate         51.8%                52.6%                 +0.8 pp
1:1 weekly return    +0.59%               +1.41%                +0.82 pp (+140%)
1:5 weekly return    +2.21%               +4.09%                +1.88 pp (+85%)

Table 2. ATLAS versus ORION, identical universe and test window.

ORION outperforms ATLAS on every metric — as it should, given its six-times-larger parameter count and its indicator-augmented input. The uplift is real but bounded: approximately +1.9 percentage points per week at the flagship 1:5 R/R, or roughly 85% more return in proportional terms. We do not interpret this as a blanket argument for indicator augmentation at all scales; with a sufficiently large raw-OHLCV model, we expect the gap to narrow further. What the comparison does establish, in a clean ablation, is that engineered indicators still contribute non-trivial marginal signal at the 2–14M parameter scale.

7.4 Unified-universe benchmark

In a separate unified-universe benchmark, all five members of the Zirdle family are evaluated on the same 130+ symbols over an identical 6-month window. ATLAS produces the lowest absolute return (+0.84% per week) but the highest win rate of any family member (38.9% at the flagship 1:5 configuration). We interpret this ranking as consistent with ATLAS's design: it is the most conservative and highest-precision member of the family, trading breadth of signal for parsimony.

8. Discussion

8.1 Why ATLAS works despite having no indicators

Two complementary explanations are plausible.

The first, which we call the recoverability hypothesis, is that a factorized-attention transformer with sufficient depth can internally learn approximations of the commonly-used technical indicators from raw OHLCV alone. A simple moving average is a uniform temporal pooling, which the attention mechanism can trivially approximate by learning uniform attention weights over the patch axis. A relative strength index is essentially a non-linear function of recent close-to-close differences, which the MLP sub-layers can represent. Bollinger position is a standardization against a windowed mean and standard deviation, both of which the model can compute internally. The recoverability hypothesis does not claim exact replication; it claims only that the functional content of these indicators is learnable from the underlying OHLCV substrate. The 7% gap in validation loss between ATLAS and ORION (Table 2) suggests that the recovery is imperfect — the indicators do contain some marginal signal that the 2.4M-parameter ATLAS does not manage to extract — but the majority of the signal is, in fact, recoverable.
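The recoverability hypothesis rests on the observation that these indicators are deterministic functions of the close series alone. A minimal sketch of three of the indicators named above, computed directly from closes; window lengths and the band-normalization convention are the common textbook choices, used here purely for illustration:

```python
import statistics

def sma(closes, n):
    """Simple moving average: uniform pooling over the last n closes."""
    return sum(closes[-n:]) / n

def rsi(closes, n=14):
    """RSI as a non-linear function of recent close-to-close differences
    (simple-average variant, illustrative rather than Wilder-smoothed)."""
    deltas = [b - a for a, b in zip(closes[-n - 1:-1], closes[-n:])]
    avg_gain = sum(d for d in deltas if d > 0) / n
    avg_loss = sum(-d for d in deltas if d < 0) / n
    if avg_loss == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

def bollinger_position(closes, n=20):
    """Standardization of the latest close against a windowed mean and
    standard deviation; +/-1 corresponds to the outer bands."""
    window = closes[-n:]
    mu = statistics.fmean(window)
    sigma = statistics.stdev(window)
    return (closes[-1] - mu) / (2.0 * sigma)
```

Each is a shallow composition of pooling, differencing, and normalization, which is exactly the kind of computation attention plus MLP sub-layers can represent.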

The second, which we call the indicator redundancy hypothesis, is complementary: many engineered indicators are strongly correlated with each other and with raw price structure. Adding MACD on top of a model that already has OHLCV context does less than one might expect, because MACD is largely redundant with the difference between two moving averages — which an attention mechanism can approximate at leisure. A corollary of this hypothesis is that indicator-augmented models should see diminishing returns as more indicators are added, a pattern we have observed in internal ablations not reported here.
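The MACD example can be made concrete: the MACD line is, by definition, nothing more than the gap between two exponential moving averages of the close, so a model that can form both EMAs internally gets MACD for free. A minimal sketch (standard 12/26 spans; the recursive EMA seeding from the first close is one common convention):

```python
def ema(closes, n):
    """Exponential moving average, seeded from the first close."""
    k = 2.0 / (n + 1)
    value = closes[0]
    for price in closes[1:]:
        value = k * price + (1.0 - k) * value
    return value

def macd(closes, fast=12, slow=26):
    """The MACD line is literally the difference of two EMAs."""
    return ema(closes, fast) - ema(closes, slow)
```

For a flat price series the line is zero; for a trending series it simply reports that the faster EMA tracks recent prices more closely, information already present in the raw closes.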

These two hypotheses are not mutually exclusive, and the truth is likely a combination of both.

8.2 ATLAS as a baseline for the Zirdle family

The Zirdle family comprises five models spanning a range of parameter counts, input configurations, and target horizons. ATLAS occupies the "minimalist baseline" role: smallest, fastest, OHLCV-only. The larger models — ORION with its indicator-augmented daily forecast and three further members with different horizons and input stacks — can be read as scaling experiments built on the ATLAS foundation. Having ATLAS as a disciplined baseline materially strengthens claims about the other models: the marginal value of indicators, larger parameter counts, and richer horizon structures can all be measured against a common reference point.

ATLAS is also the preferred production candidate in two deployment scenarios. The first is low-latency: with no indicator precompute required, end-to-end inference is approximately 10–50 ms faster than for ORION. The second is interpretability: attention weights over raw OHLCV are directly human-readable, and a compliance team can trace exactly which historical bars drove a forecast, rather than tracing through an intermediate indicator layer.

8.3 The no-stop-loss puzzle

The -0.69% per week on the unstopped variant warrants more than a passing mention. ATLAS's test universe — indices plus mega-caps — is narrower, more liquid, and more mean-reverting than the broader universes tested by its siblings. On 5-day horizons, mean reversion in mega-cap returns is well documented: large-cap name daily autocorrelations are consistently negative at 1- to 5-day lags. An unstopped trade that is initially correct in direction is therefore disproportionately likely to retrace before horizon close, converting an intra-horizon winner into a horizon-end loser.

A barrier exit intervenes in this process in both directions: an initially-correct trade is closed at the profit barrier before the retrace can occur, and an initially-wrong trade is closed at the stop for a small, bounded loss rather than being held through further adverse movement; neither position reopens on the subsequent retrace. The net effect is that stops in ATLAS are not just risk-control scaffolding; they are a non-trivial component of the edge itself. This contrasts with the indicator-rich, broader-universe siblings, which retain positive expected return even unstopped.

We regard this as an interesting structural finding. A narrow-universe, mean-reverting-regime forecaster's edge lives partly in its stop-loss policy; a broad-universe, drift-regime forecaster's edge lives primarily in its directional signal. This distinction has implications for deployment — ATLAS should not be deployed without stops; ORION can be — and for ensembling: combining ATLAS with a pure-directional sibling may produce complementary signal under different market regimes.
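The retrace mechanism in §8.3 can be illustrated with a toy Monte Carlo. The sketch below, which is not the production simulator, draws per-bar returns from an AR(1) process and measures how often a path that is up at some point intra-horizon finishes the horizon flat or down; all parameters (lag-1 coefficient, horizon, volatility) are illustrative:

```python
import random

def retrace_rate(phi, n_paths=20000, horizon=5, sigma=0.01, seed=7):
    """Fraction of AR(1) return paths that are positive at some point
    intra-horizon but end the horizon at or below zero.

    phi < 0 models the negative 1- to 5-day autocorrelation documented
    for mega-cap daily returns; phi = 0 is the iid benchmark.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        r_prev = cum = peak = 0.0
        for _ in range(horizon):
            r = phi * r_prev + rng.gauss(0.0, sigma)
            r_prev = r
            cum += r
            peak = max(peak, cum)
        if peak > 0 and cum <= 0:
            hits += 1
    return hits / n_paths
```

Under mean reversion (phi < 0) the retrace rate is visibly higher than in the iid case, which is precisely the regime in which a profit barrier converts intra-horizon winners into booked winners rather than horizon-end losers.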

8.4 Deployment case

In live deployment, ATLAS is allocated to the Index Tracker bot slot, which manages capital against broad-market ETFs and the most liquid mega-caps. Its flagship 1:5 R/R configuration is the default trading policy; alternative R/R levels are available for users with different volatility preferences. End-to-end inference, including data retrieval, windowing, and post-processing, takes approximately 40 ms per symbol on a single CPU core, with the model itself accounting for under 10 ms. The production ensemble wraps ATLAS with a light meta-layer that checks for agreement with ORION on the direction of the median forecast; disagreement cases are flagged for manual review or, in automated mode, skipped. In internal evaluation, the agreement filter raises the 1:5 R/R win rate to approximately 42% while reducing trade count by roughly 30%.
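The agreement filter reduces to a sign check on the two models' median forecasts. A minimal sketch of a simplified gate; the function and argument names are hypothetical, and the production meta-layer additionally routes disagreement cases to manual review:

```python
def agreement_gate(atlas_median_ret, orion_median_ret):
    """Return 'long', 'short', or 'skip' based on whether the two
    models agree on the sign of the median forecast return."""
    atlas_up = atlas_median_ret > 0
    orion_up = orion_median_ret > 0
    if atlas_up and orion_up:
        return "long"
    if not atlas_up and not orion_up:
        return "short"
    return "skip"  # disagreement: flag for review or skip in automated mode
```

Skipping the disagreement cases is what trades roughly 30% of trade count for the higher win rate reported above.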

9. Limitations

The results reported in this paper are subject to several limitations that we list here explicitly rather than burying them in the main text.

Test-window length. Fifteen weeks is a short out-of-sample window by the standards of academic finance, even if it contains 21,492 resolved trades. A longer window would tighten our confidence intervals and provide more coverage of different intra-period regimes. We expect to report extended results as additional months of live data accumulate.

Bull-regime test exposure. The 2024-07-01 to 2025-11-30 test window includes substantial upward drift in U.S. equities. A model that benefits from beta exposure will appear better in bull regimes than it is. Our long-versus-short decomposition (§7.2) provides some evidence that the model's edge is bidirectional and not purely a beta play, but a bear-regime test would be more compelling, and we do not claim to have one.

Temporal generalization, not universe generalization. We train and test on the same 121 symbols. This is a test of temporal generalization — does a model trained on 2010–2022 SPY perform well on 2024–2025 SPY — but not of universe generalization — would the model perform on symbols it has never seen at training time? The literature on zero-shot time-series forecasting [7, 9, 10] suggests that universe generalization is substantially harder, and we do not claim it for ATLAS. The Zirdle family includes models trained on larger universes where universe-generalization claims are more defensible.

Transaction costs. We do not model transaction costs in the headline numbers. Our best estimate is that realistic costs on the ATLAS universe reduce 1:5 R/R weekly returns by approximately 5–15 basis points; the qualitative picture is unchanged but the absolute numbers would be slightly lower. We regard this as a non-trivial but second-order concern, and we acknowledge that for a small-cap or low-liquidity universe the cost correction could erase the edge entirely.

Look-ahead risk. We have taken care to prevent look-ahead in preprocessing — per-window normalization uses context-only statistics, and validation and test sets are cleanly separated in time. We cannot, however, guarantee the absence of look-ahead in the underlying vendor feed itself. If the vendor applies a corporate-action adjustment retroactively and serves a revised historical bar, the feature will differ from what was available to a trader in real time. We have no evidence of such an issue but cannot categorically exclude it.

Single random seed for headline results. The production trade evaluation is based on the seed-42 run. A fully rigorous evaluation would repeat the simulation across all seeds. We regard validation-loss stability (0.104–0.108 across seeds) as suggestive evidence that the trade results are similarly stable, but we note the gap explicitly.

10. Conclusion

ATLAS is a 2.4-million-parameter factorized-attention transformer that forecasts daily equity price distributions on a curated 121-symbol universe of U.S. indices and mega-caps using only raw OHLCV channels. It achieves a validation pinball loss of 0.1051, a 51.8% win rate at 1:1 risk/reward (+0.59% per week) over a 15-week walk-forward test window, and a flagship +2.21% per week at 1:5 risk/reward (+33.10% cumulative), over 21,492 resolved trades. Head-to-head against an otherwise-identical indicator-augmented sibling with six times the parameter count, ATLAS recovers approximately half of the excess return — a finding consistent with the hypothesis that a sufficiently expressive transformer can internally approximate standard technical indicators from OHLCV substrate, but that engineered indicators still confer a bounded and non-trivial marginal uplift at this scale.

We regard ATLAS's contribution as threefold: it provides a disciplined minimalist baseline against which the marginal value of indicator augmentation can be measured in a clean ablation; it isolates the regime-filtering choice of training-window boundaries as a first-order modeling decision, confirmed by a controlled ablation showing that extending the window past a structural break costs approximately 12% of validation performance; and it establishes that OHLCV-only forecasting at small scale is operationally viable as a production model in low-latency, interpretability-sensitive deployment scenarios. The no-stop-loss failure mode — a cumulative -10.36% return despite a 91.6% nominal win rate — highlights a structural feature of the narrow mega-cap universe that we believe merits further investigation in its own right: in mean-reverting markets, the stop-loss is not only a risk-control tool but a constitutive component of the forecasting edge. We intend to explore this phenomenon in future work, alongside longer test windows, bear-regime evaluation, and universe-generalization experiments with held-out symbols.

References

[1] H. White, "Economic prediction using neural networks: The case of IBM daily stock returns," in Proceedings of the IEEE International Conference on Neural Networks, vol. 2, 1988, pp. 451–458.

[2] E. F. Fama, "Efficient capital markets: A review of theory and empirical work," The Journal of Finance, vol. 25, no. 2, pp. 383–417, 1970.

[3] O. B. Sezer and A. M. Ozbayoglu, "Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach," Applied Soft Computing, vol. 70, pp. 525–538, 2018.

[4] D. M. Q. Nelson, A. C. M. Pereira, and R. A. de Oliveira, "Stock market's price movement prediction with LSTM neural networks," in International Joint Conference on Neural Networks (IJCNN), 2017, pp. 1419–1426.

[5] S. Mehtab and J. Sen, "Stock price prediction using machine learning and LSTM-based deep learning models," arXiv:2009.10819, 2020.

[6] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, "A time series is worth 64 words: Long-term forecasting with transformers," in International Conference on Learning Representations (ICLR), 2023.

[7] G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, "Unified training of universal time series forecasting transformers (Moirai)," in International Conference on Machine Learning (ICML), 2024.

[8] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115.

[9] A. Das, W. Kong, R. Sen, and Y. Zhou, "A decoder-only foundation model for time-series forecasting (TimesFM)," in International Conference on Machine Learning (ICML), 2024.

[10] A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and Y. Wang, "Chronos: Learning the language of time series," Transactions on Machine Learning Research, 2024.

[11] S. Gu, B. Kelly, and D. Xiu, "Empirical asset pricing via machine learning," The Review of Financial Studies, vol. 33, no. 5, pp. 2223–2273, 2020.

[12] L. Chen, M. Pelger, and J. Zhu, "Deep learning in asset pricing," Management Science, vol. 70, no. 2, pp. 714–750, 2023.

[13] B. Kelly, S. Malamud, and K. Zhou, "The virtue of complexity in return prediction," The Journal of Finance, vol. 79, no. 1, pp. 459–503, 2024.

[14] B. Lim, S. Arik, N. Loeff, and T. Pfister, "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting," International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021.

[15] H. Wu, J. Xu, J. Wang, and M. Long, "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting," in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 22419–22430.

[16] C. R. Harvey and Y. Liu, "Backtesting," The Journal of Portfolio Management, vol. 42, no. 1, pp. 13–28, 2015.

[17] M. López de Prado, Advances in Financial Machine Learning. Hoboken, NJ: Wiley, 2018.

[18] T. Gneiting and A. E. Raftery, "Strictly proper scoring rules, prediction, and estimation," Journal of the American Statistical Association, vol. 102, no. 477, pp. 359–378, 2007.

[19] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6836–6846.

[20] D. Easley, M. López de Prado, and M. O'Hara, "The volume clock: Insights into the high-frequency paradigm," The Journal of Portfolio Management, vol. 39, no. 1, pp. 19–29, 2012.
