Zirdle Research · Technical Report

HELIOS: A 5-Minute Intraday Specialist with Factorized Channel–Time Attention for Liquid U.S. Equities

Zirdle Research Technical Report, April 2026

Abstract

We present HELIOS, a 4.4-million-parameter multivariate transformer trained from scratch to produce probabilistic 5-minute-ahead price forecasts for liquid U.S. equities. HELIOS consumes a 12-channel representation comprising open, high, low, close, volume, and seven derived technical indicators (14-period RSI, MACD line and histogram, 14-period Stochastic %K, 14-period ATR, 20-period Bollinger %B, and On-Balance Volume). The architecture factorizes attention along the channel and time axes, operates on patches of length 20 over a 200-bar context (approximately two trading days), and emits quantile predictions at seven levels via a pinball loss. We train on roughly 23.8 million cleaned 5-minute bars spanning 164 large-cap equities between 2022-06-01 and 2024-06-30, selecting the 24-month window for microstructure-regime coherence rather than raw data volume. On a strictly held-out test window of 42 weeks (2025-04-01 through 2026-04-17) we generate 226,065 predictions and resolve 37,092 barrier-touch trades under fixed capital of one million dollars evenly allocated across the 164 symbols. The model is approximately break-even at symmetric 1:1 risk/reward (49.9% win rate, ≈0.00% per week), becomes positive only when winners are allowed to run on asymmetric 1:5 brackets (25.0% win rate, +0.092% per week, +3.85% cumulative), and loses without stop-losses (-9.96% cumulative). With roughly 37,000 resolved trades, the positive 1:5 result is statistically distinguishable from zero (approximate t ≈ 4), yet the raw economic magnitude is small enough that typical intraday transaction costs of 10 to 30 basis points round-trip plausibly consume the entire edge in standalone deployment. 
We argue that HELIOS is best interpreted not as a standalone strategy but as a low-correlation ensemble component: its signal is orthogonal to the daily-horizon models in the Zirdle Five family, and inside a unified 130-symbol, six-month head-to-head comparison HELIOS contributes approximately +1.47% per week to the ensemble. We discuss in detail why 5-minute prediction on liquid large-caps is close to the efficient-market frontier, the regime-selection rationale behind our deliberately narrow training window, and the settings (small-cap, news-driven, low-liquidity) where an intraday specialist is more likely to capture exploitable dislocation. The paper is offered as a carefully hedged report of a small-but-real effect rather than a claim of a deployable stand-alone alpha.

1. Introduction

Intraday price prediction at 5-minute resolution sits at a peculiar point on the difficulty curve of financial forecasting. At daily and weekly horizons, cross-sectional features such as momentum, value, quality, and post-earnings drift carry sufficient economic signal that even modest architectures can extract positive expected returns [9, 15]. At tick and sub-second resolution, order-book microstructure is dense, informative, and amenable to careful feature engineering combined with tight execution control, albeit at infrastructure cost [1]. The 5-minute horizon is the awkward middle: slow enough that the raw order-book microstructure has been averaged away, fast enough that the fundamental drift component of returns is dwarfed by bid-ask bounce, quote flicker, news impulse response, and the random-walk component predicted by an efficient market. Harris [1] describes the 5-minute bar as "a statistical object shaped almost entirely by its last transaction," and most practical intraday desks in our experience abandon it in favor of either sub-second event-driven execution or 15- and 30-minute bars where drift begins to reassert itself.

Foundation time-series models have emerged as a popular approach to general-purpose forecasting. Chronos [2], TimesFM [3], and Moirai [4] each pretrain on large heterogeneous corpora (weather, energy load, retail demand, web traffic, and selected financial series) and promise zero-shot or lightly fine-tuned performance across downstream tasks. Our own experiments, reported in Section 5, suggest that this generality is actively harmful at 5-minute equity resolution: the inductive biases absorbed from non-financial domains do not transfer well, and fine-tuning struggles to overwrite them within the data budget a single-horizon intraday specialist can reasonably use.

This paper reports on HELIOS, the 5-minute specialist in the Zirdle Five family of production forecasting models (ORION and NOVA at daily horizon, ATLAS at weekly, VEGA for volatility, and HELIOS for intraday). We make four contributions.

First, we propose a small 4.4-million-parameter factorized transformer that attends separately over channels and time, using twelve fixed-meaning channels rather than token-mixing or foundation-style tokenization. The architecture is a deliberate step away from foundation-model scale, motivated by our observation that pretraining bias dominates the signal budget at 5-minute horizons.

Second, we document a regime-aware training-window choice. Rather than training on all available 5-minute data back to the early 2010s, we restrict to 2022-06-01 through 2024-06-30, a 24-month window we argue is microstructurally coherent with the test period. A sensitivity experiment in Section 4 shows that expanding the window to 4 years (including the 2020–2022 COVID and zero-rate era) worsens validation pinball loss from 0.0868 to 0.091 despite doubling the data, consistent with regime-mixture interference.

Third, we report out-of-sample results across 42 weeks and 37,092 resolved trades with a transparent barrier-touch simulation. We frankly discuss the smallness of the raw edge, its sensitivity to transaction costs, and the conditions under which we would and would not recommend standalone deployment.

Fourth, we measure HELIOS's correlation with the daily-horizon models in the Zirdle Five family and show that its signal is nearly orthogonal to them. Inside a unified 130-symbol, six-month ensemble backtest, HELIOS contributes approximately +1.47% per week to the compounded ensemble, a larger number than its standalone performance would suggest. This is consistent with the general result from portfolio theory that weakly positive, weakly correlated signals combine non-linearly to raise Sharpe [11].

We emphasize throughout that a careful report of a small-but-real effect is more useful to the field than a strong-performance claim that fails to disclose transaction costs, survivorship filters, or regime-selection choices. HELIOS's headline number is +0.092% per week at 1:5 risk/reward: visibly positive, plausibly statistically real, and almost certainly below the transaction-cost frontier for most retail and many institutional execution stacks.

2. Related Work

The literature relevant to HELIOS partitions into five clusters: classical intraday microstructure; foundation time-series models; transformer-based forecasting architectures; published deep-learning approaches to equity forecasting; and methodological references on backtesting discipline.

Classical intraday microstructure. The canonical reference for trading mechanics and market microstructure remains Harris's Trading and Exchanges [1], which articulates the decomposition of observed price changes into information flow, inventory adjustment, and bid-ask bounce. Intraday return series at 5-minute resolution are dominated, on liquid large-caps, by the last two components. Khandani and Lo [5] analyze the August 2007 quant crash as a cascade of forced deleveraging among short-horizon statistical-arbitrage strategies and provide an important historical demonstration that crowded 5-minute strategies can be highly fragile. We read both references as strong priors against expecting large, persistent alpha on 5-minute bars across liquid large-caps during normal regimes, and against naive scaling of short-horizon strategies.

Foundation time-series models. A cluster of recent models proposes a single pretrained backbone for general time-series forecasting. Chronos [2] tokenizes real-valued series into discrete bins and trains a T5-style encoder-decoder on a corpus mixing weather, traffic, energy, retail demand, and synthetic series. TimesFM [3] is a 200M-parameter decoder-only model pretrained on a 100-billion-timepoint corpus. Moirai [4] introduces any-variate attention to handle arbitrary channel counts and supports multi-patch tokenization. In our benchmarks, all three exhibit mediocre performance on 5-minute equity bars relative to small specialist models: the inductive biases they import from weather and retail-demand series (strong seasonality, mean-reversion at day-of-week scale, right-skewed distributions) clash with the near-martingale behavior of 5-minute equity log returns. We report our own failed fine-tuning experiment with TimesFM in Section 5.

Transformer-based forecasting architectures. PatchTST [6] is the closest methodological ancestor of HELIOS: it patchifies each channel independently, flattens patches into tokens, and applies a standard transformer encoder. We adopt its patch-embedding idea but factorize subsequent attention along channel and time axes to handle our 12-channel input without quadratic blow-up in token count. Informer [7] introduces sparse probabilistic attention for long sequences; we did not find the sparsification beneficial at our 200-bar context. Temporal Fusion Transformer (TFT) [8] adds gating and variable-selection networks targeted at heterogeneous exogenous covariates; since our channels are all endogenous technical indicators rather than covariate series with distinct semantics, we chose the simpler factorized architecture.

Deep learning for equity forecasting. Fischer and Krauss [13] apply LSTM ensembles to daily S&P 500 constituent returns and report positive but small excess returns subject to transaction-cost erosion, a result pattern we consider a useful prior for intraday work. Gu, Kelly, and Xiu [15] systematically compare machine-learning methods for monthly equity return prediction and find that tree ensembles and shallow neural networks dominate linear factor models, with intermediate-frequency (daily to weekly) gains larger than either the very-long or very-short end. Kelly, Malamud, and Zhou [14] argue for the "virtue of complexity" in return prediction, showing that high-parameter ridgeless models can continue to improve out-of-sample performance past the interpolation threshold. We view this body of work as collectively supportive of specialized, horizon-appropriate models over a single foundation backbone.

Backtesting discipline. López de Prado's Advances in Financial Machine Learning [9] is our reference on purged and embargoed cross-validation, sample-uniqueness weights, meta-labeling, and the dangers of multiple-testing inflation in finance. We follow its recommendations on time-ordered splits, barrier-touch labeling, and separation of signal generation from execution simulation. We do not deploy its more aggressive techniques (fractionally differentiated features, Chu-Stinchcombe-White CUSUM triggers) in HELIOS but acknowledge them as promising directions.

3. Problem Formulation

Let $s$ index a symbol and $t$ a 5-minute bar timestamp. We observe an OHLCV tuple $(o_{s,t}, h_{s,t}, l_{s,t}, c_{s,t}, v_{s,t})$ at each bar, and derive seven technical indicators from their history. Given a 200-bar context ending at bar $t$, we forecast the distribution of the close price 6 bars ahead, $c_{s, t+6}$, using only the final bar of the horizon as the target. The model emits seven quantile levels $\{0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95\}$ optimized by pinball loss [18]. Downstream, we interpret the median as a point forecast, convert it to a signed percentage move relative to the entry price, and simulate a barrier-touch trade with take-profit and stop-loss bracketing the predicted magnitude.

4. Data

4.1 Corpus and universe

The source corpus is the TimescaleDB hypertable stock_ohlcv_data maintained by Zirdle, containing approximately 4.86 billion rows across 1997 to the present. HELIOS uses the 5-minute subset restricted to the TOP_500_LIQUID universe, which filters to 164 symbols meeting joint criteria of average dollar volume over ten million dollars per day, average spread under five basis points, and continuous listing across the training and test windows. The universe is mega-cap and large-cap growth dominated and includes the major S&P 500 components, liquid index ETFs, and the most-traded growth names. Small-cap and mid-cap names are deliberately excluded at this stage; we revisit that exclusion in Section 9.

4.2 Training-window selection

The training window for HELIOS is 2022-06-01 through 2024-06-30, a 24-month span that is shorter than the windows we use for our daily-horizon siblings ORION and NOVA. Two compounding reasons drove the choice.

First, data-quality constraints. Our TimescaleDB 5-minute coverage became reliable only from mid-2020 onward; pre-2020 intraday data had gaps in extended-hours coverage, inconsistent timestamp alignment across exchanges, and missing bars during fast-market conditions. Bars that look clean at a daily aggregation level often prove corrupted at 5-minute resolution, and the cleaning required to use pre-2020 intraday data was not cost-effective for the marginal sample gained.

Second, and more importantly, intraday microstructure has shifted materially across the last decade, to the point where pooling all available data into a single training set would force the model to average across incompatible regimes. We identify four relevant phases:

2009–2015: the post-financial-crisis consolidation of high-frequency trading and the emergence of payment-for-order-flow (PFOF) routing. Spreads narrowed, intraday autocorrelation patterns shifted meaningfully.
2016–2019: a "mature HFT" regime in which options activity began to dominate intraday flow on the largest names; volatility was persistently low; opening and closing auctions grew in relative importance.
2020–2022: COVID dislocation followed by zero-rate-era retail explosion. Commission-free brokerages (Robinhood and imitators) drove a surge in retail participation. Meme-stock coordination compressed mean-reversion signals. Retail share of 5-minute volume in our universe rose from an estimated 12% to 25%.
2022–present: post-Fed-hiking cycle, with risk-free rates at or above 5%, retail still elevated but stabilizing, and increasingly dominant options gamma flows anchored by the growth of zero-day-to-expiration (0DTE) contracts.

Our test window (2025-04-01 through 2026-04-17) sits firmly in the 2022–present regime. Training on data from earlier regimes asks the model to learn a mixture distribution that it then has no opportunity to disambiguate at inference time. The risk is not merely one of stale data; it is that behavior that is correct in the earliest regime (2009–2015) can be incorrect in the current one (2022–present).

We ran a sensitivity pilot to test the regime-mixture hypothesis. A pilot model with architecture identical to HELIOS was trained on the window 2020-06-01 through 2024-06-30 (48 months, roughly twice the data), validated on the same 2024-07-01 through 2025-03-31 window. Its best validation pinball loss was 0.091, worse than HELIOS's 0.0868 despite doubling the training data. Inspection of per-symbol validation losses showed the pilot model systematically over-predicting volatility on calm mid-2024 bars, consistent with the model having absorbed the extreme volatility distribution of the COVID and meme-stock years. The final window choice reflects a preference for regime coherence over raw sample count.

4.3 Data cleaning

All raw bars pass through validate.py, which removes rows exhibiting any of the following:

null or negative values in any OHLC field;
near-zero prices (any of O, H, L, C below \$0.01);
OHLC inconsistencies, defined as $h < l$, $h < \max(o, c)$, or $l > \min(o, c)$;
duplicate $(s, t)$ tuples (the most recent row by ingest timestamp is kept);
negative volume.
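The row-level rules above can be sketched as a single predicate plus a de-duplication pass. This is a minimal illustration, not the contents of validate.py, which this report does not reproduce; the names `is_clean_bar` and `dedupe_latest`, the `min_price` parameter, and the dictionary keys `symbol`, `ts`, and `ingest_ts` are assumptions made for the example.

```python
import math

def is_clean_bar(o, h, l, c, v, min_price=0.01):
    """Return True if one OHLCV bar passes the row-level checks listed above.

    Hypothetical helper for illustration only.
    """
    prices = (o, h, l, c)
    for p in prices:
        # Null (None or NaN) or negative values in any OHLC field.
        if p is None or (isinstance(p, float) and math.isnan(p)) or p < 0:
            return False
        # Near-zero prices.
        if p < min_price:
            return False
    # OHLC inconsistencies: h < l, h < max(o, c), or l > min(o, c).
    if h < l or h < max(o, c) or l > min(o, c):
        return False
    # Negative volume (volume is assumed to be non-null here).
    if v < 0:
        return False
    return True

def dedupe_latest(rows):
    """Keep one row per (symbol, ts), preferring the latest ingest timestamp."""
    latest = {}
    for row in rows:
        key = (row["symbol"], row["ts"])
        if key not in latest or row["ingest_ts"] > latest[key]["ingest_ts"]:
            latest[key] = row
    return list(latest.values())
```

Keeping the predicate free of I/O makes each rule easy to test in isolation; how the real pipeline batches these checks against the hypertable is not specified here.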

Extreme single-bar moves exceeding 50% are flagged but retained: manual inspection confirmed that the large majority are real corporate actions (splits, tickers that went through halts followed by gap opens, and similar), and excluding them biased the learned return distribution. Symbols with fewer than 1,000 bars across the training window are excluded outright. The aggregate drop rate on 5-minute data is 0.018%, with 97.4% of drops being duplicate-timestamp rows traceable to the ingest layer.

After cleaning, the training set contains approximately 23.8 million bars across 164 symbols, which at a 30-bar stride yields roughly 794,000 training windows of 200 bars each; consecutive windows overlap by 170 bars.

4.4 Windowing and splits

Windowing parameters:

input context: 200 bars (≈16.7 trading hours, ≈2.6 regular sessions at 78 bars per session);
prediction horizon: 6 bars, but only the final bar serves as the target;
training stride: 30 bars between consecutive windows, chosen so that no two training samples share more than 170 of their 200 context bars.
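A minimal sketch of the windowing arithmetic follows. It assumes windows are cut independently within one symbol's contiguous cleaned bar series, which the text does not state explicitly; the function names are illustrative.

```python
def window_starts(n_bars, context=200, horizon=6, stride=30):
    """Start indices of training windows over one symbol's bar series.

    Each window needs `context` input bars plus `horizon` future bars so that
    the target (the close `horizon` bars after the context ends) exists.
    """
    last_start = n_bars - (context + horizon)
    if last_start < 0:
        return []
    return list(range(0, last_start + 1, stride))

def shared_context_bars(context=200, stride=30):
    """Number of context bars shared by two consecutive windows."""
    return max(context - stride, 0)
```

With the parameters above, `shared_context_bars()` returns 170, matching the overlap bound stated for the training stride.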

Time-ordered splits with strict non-overlap:

Train: 2022-06-01 → 2024-06-30 (24 months).
Val: 2024-07-01 → 2025-03-31 (9 months).
Test: 2025-04-01 → 2026-04-17 (42 weeks).

The embargo between splits is one trading day, sufficient to ensure that the 6-bar horizon on the last training window does not leak into the first validation window. No symbol is held out; the test evaluates temporal generalization across the same 164-symbol universe.

5. Methodology

5.1 Why not a foundation model

The strongest argument for starting with a foundation time-series model is reduced data requirements for the downstream task. That logic does not apply cleanly to 5-minute equity forecasting. The pretraining corpora of Chronos, TimesFM, and Moirai are dominated by series whose statistical properties differ sharply from intraday equity returns: daily rainfall, hourly electricity load, weekly retail demand, and similar. These series exhibit strong seasonality, heteroskedasticity structured around calendar events, and relatively thick tails dominated by weather or demand shocks. Intraday equity log returns are, to first approximation, close to martingale with heteroskedasticity structured around macroeconomic announcements and earnings events on a small subset of days.

We ran a fine-tuning experiment with TimesFM 2.5 [3], using the 200M-parameter checkpoint. We fine-tuned on our 2022-06 through 2024-06 training window, freezing the tokenizer and lower transformer layers for the first 5,000 steps and unfreezing everything thereafter, with learning rate $1\times10^{-4}$, cosine schedule, and batch size 32. The best validation pinball loss achieved was 0.094, plateauing after approximately 18 hours of single-A40 training. Additional experimentation with learning rate and unfreeze schedule did not break through 0.093.

HELIOS, trained from scratch with the architecture described below, reached 0.0868 at epoch 7 of a 12-epoch run. We interpret this gap as evidence that pretraining bias dominates any sample-efficiency benefit at our data scale, at least for this horizon and universe. A larger budget of training tokens might permit a foundation model to overwrite its priors, but we were unable to achieve that within a reasonable compute envelope.

5.2 Architecture

HELIOS operates on input tensors of shape $(B, T=200, C=12)$. Each channel is z-scored per-window using context-only statistics (no peeking at the horizon), which makes the network scale-invariant across symbols and volatility regimes.
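The per-window normalization can be sketched for a single channel as below. Whether HELIOS uses the population or the sample standard deviation, and how it guards against a constant channel, is not stated; the population form and the `eps` fallback are assumptions of this example.

```python
from statistics import fmean, pstdev

def zscore_context(context_values, eps=1e-8):
    """Z-score one channel using statistics from the context bars only.

    Returns the normalized values plus the (mean, scale) pair needed to map
    model outputs back to the original units.
    """
    mu = fmean(context_values)
    sigma = pstdev(context_values, mu)
    # Assumed guard: a flat channel would otherwise divide by zero.
    scale = sigma if sigma > eps else 1.0
    normalized = [(x - mu) / scale for x in context_values]
    return normalized, mu, scale

def denormalize(z, mu, scale):
    """Map a normalized value (for example a predicted quantile) back to price units."""
    return z * scale + mu
```

Because `mu` and `scale` come from the context alone, the target bar never influences its own normalization.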

A per-channel patch embedding divides the 200-bar context into 10 patches of length 20. Each patch is passed through a shared linear projection to embedding dimension $D=192$, producing tokens of shape $(B, P=10, C=12, D=192)$.

Four factorized blocks follow. Each block performs, in sequence:

Channel (space) attention: tokens are reshaped to $(B \cdot P, C, D)$ and multi-head self-attention operates over the channel axis. This allows the model to learn channel interactions such as "volume confirms price move" or "MACD divergence from price direction" within each patch independently.
Time attention: tokens are reshaped to $(B \cdot C, P, D)$ and multi-head self-attention operates over the patch axis. This captures cross-patch temporal dependencies within each channel.
A feed-forward network with GELU activation, expansion factor 4, residual connection, and LayerNorm.
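A rough way to see what the factorization buys is to count query-key pairs per sample, per head, and per layer, and compare them with joint attention over all patch-channel tokens. The sketch below is only bookkeeping under that simplified cost model (projections, heads, and constant factors are ignored); the function name is illustrative.

```python
def attention_pair_counts(context=200, patch_len=20, channels=12):
    """Compare query-key pair counts for joint vs. factorized attention."""
    patches = context // patch_len            # 200 / 20 = 10 patches
    joint_tokens = patches * channels         # one sequence of P * C tokens
    joint = joint_tokens ** 2
    channel_axis = patches * channels ** 2    # P sequences of C tokens each
    time_axis = channels * patches ** 2       # C sequences of P tokens each
    return {
        "patches": patches,
        "joint": joint,
        "factorized": channel_axis + time_axis,
    }
```

For the default sizes this gives 14,400 pairs for joint attention against 2,640 for the channel-then-time factorization, roughly a 5.5× reduction.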

After the four blocks, we apply a final LayerNorm, select the last patch, and mean-pool across channels, yielding a $(B, D)$ summary vector. A quantile head (two-layer MLP, hidden width $D$) projects to the 7-quantile output at the single target bar.

Total trainable parameters: 4,398,215.

5.3 What didn't work

We briefly record three failed directions, in the hope of saving subsequent researchers time.

Sparse attention for long context. We tried Informer-style [7] ProbSparse attention with context lengths of 400 and 800 bars. Validation pinball loss worsened at both longer contexts. We read this as weak evidence that information beyond two trading days is not useful at 5-minute resolution on liquid large-caps, consistent with the microstructure literature.

Variable selection gates. Following TFT [8] we experimented with per-channel gating weights. Gates collapsed during training to roughly uniform values across channels, suggesting that the architecture already balances channel information adequately via factorized attention and that dedicated gating added parameters without benefit.

Return-space vs. price-space targeting. We tried predicting log returns directly rather than normalized prices. Results were within noise of each other; we retained price-space targeting because it composes more cleanly with the downstream barrier-touch simulation.

6. Training

Loss. The pinball (quantile) loss, a strictly proper scoring rule for probabilistic forecasts [18]:

$$L = \sum_{i} \sum_{q \in Q} \max\left(q \cdot (y_i - \hat{y}_{i,q}),\ (q-1) \cdot (y_i - \hat{y}_{i,q})\right),$$

summed over the 7 quantile levels $Q = \{0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95\}$.
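The loss can be written directly from the formula. The sketch below is a plain-Python reference, not the training code; it returns the raw sum, and whether the reported validation figure is a sum or a per-sample, per-quantile average is not stated here.

```python
QUANTILES = (0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)

def pinball_loss(targets, predictions, quantiles=QUANTILES):
    """Sum of pinball losses over samples and quantile levels.

    targets: list of floats y_i.
    predictions: list of per-sample sequences, aligned with `quantiles`.
    """
    total = 0.0
    for y, preds in zip(targets, predictions):
        for q, y_hat in zip(quantiles, preds):
            diff = y - y_hat
            # Under-prediction is weighted by q, over-prediction by (1 - q).
            total += max(q * diff, (q - 1.0) * diff)
    return total
```

For example, at $q = 0.9$ an under-prediction by one unit costs 0.9 while an over-prediction by one unit costs 0.1, which is what pushes the 0.9 output toward the upper tail.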

Optimizer. AdamW with weight decay 0.05, OneCycleLR schedule with max learning rate $3\times10^{-4}$ and 2,000 warmup steps, gradient clipping at norm 1.0, batch size 96, maximum 30 epochs with early stopping patience 5.

Hardware. Training ran on a single NVIDIA A40 (48 GB VRAM). Wall-clock time was approximately 31 hours to the 12-epoch early stop.

Best checkpoint. Epoch 7, validation pinball 0.0868. Early stop triggered at epoch 12 with no improvement over five consecutive epochs.

7. Evaluation Protocol

7.1 Inference

For each $(s, t)$ pair in the test window, we:

Gather the 200-bar context ending at $t$.
Compute the twelve channels on the context.
Z-score each channel per-window using only context statistics.
Forward-pass through HELIOS to obtain seven quantile predictions in normalized space.
Un-normalize using the context's close mean and standard deviation.
Compute the signed percentage move $\mathrm{pred\_pct} = 100 \cdot (q_{0.5} - \mathrm{entry}) / \mathrm{entry}$, where $\mathrm{entry}$ is the context's final close.
Assign a trade direction: bullish if $\mathrm{pred\_pct} > 0.1\%$, bearish if $\mathrm{pred\_pct} < -0.1\%$, otherwise no trade.

The $\pm 0.1\%$ dead zone avoids trading on predictions indistinguishable from bid-ask bounce.
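The last two inference steps reduce to a few lines. The sketch below assumes the median has already been un-normalized to price units; the function name and the `None` return for "no trade" are choices made for this example.

```python
def trade_signal(median_price, entry_price, dead_zone_pct=0.1):
    """Convert a median forecast into (pred_pct, direction).

    direction is "bullish", "bearish", or None when the predicted move
    falls inside the dead zone.
    """
    pred_pct = 100.0 * (median_price - entry_price) / entry_price
    if pred_pct > dead_zone_pct:
        return pred_pct, "bullish"
    if pred_pct < -dead_zone_pct:
        return pred_pct, "bearish"
    return pred_pct, None
```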

7.2 Barrier-touch simulation

Each triggered trade is simulated as a bracket order entering at the next bar's open. The take-profit barrier is placed at $|\mathrm{pred\_pct}|$ from entry; the stop-loss at $|\mathrm{pred\_pct}| / R$ where $R \in \{1, 2, 3, 5\}$ is the target-to-stop ratio. The trade exits at whichever barrier is touched first, evaluated bar-by-bar on 5-minute OHLC data, up to a maximum hold of 40 bars (200 minutes, roughly half of a 78-bar regular session).

For bars where both barriers lie within the bar's high-low range, we apply an OHLC-sequence tie-break: a bullish bar (close > open) is assumed to have hit the low first; a bearish bar to have hit the high first. This is the standard conservative convention in the barrier literature [9].

A no-stop-loss variant uses only the take-profit barrier and, if not hit within the 40-bar hold, exits at the close of the 40th bar.
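The rules in this subsection can be sketched as a single function. Several details are assumptions of the example rather than statements from the text: fills occur exactly at the barrier price (gaps through a barrier are ignored), a bar with close equal to open is treated like a bearish bar in the tie-break, and a bracketed trade that touches neither barrier within the hold exits at the last bar's close, as the no-stop-loss variant does.

```python
def simulate_bracket(bars, direction, pred_pct, rr=5.0, max_hold=40):
    """Simulate one barrier-touch trade and return its signed percent return.

    bars: list of (open, high, low, close); bars[0] is the entry bar.
    direction: +1 for a bullish trade, -1 for a bearish trade.
    rr: target-to-stop ratio R, or None for the no-stop-loss variant.
    """
    entry = bars[0][0]                       # enter at the entry bar's open
    move = abs(pred_pct) / 100.0
    tp = entry * (1 + direction * move)
    sl = None if rr is None else entry * (1 - direction * move / rr)
    window = bars[:max_hold]
    for o, h, l, c in window:
        if direction > 0:
            hit_tp = h >= tp
            hit_sl = sl is not None and l <= sl
        else:
            hit_tp = l <= tp
            hit_sl = sl is not None and h >= sl
        if hit_tp and hit_sl:
            # Tie-break: bullish bar => low touched first, else high first.
            low_first = c > o
            # A long's target sits at the high, a short's target at the low.
            tp_first = (not low_first) if direction > 0 else low_first
            hit_tp, hit_sl = tp_first, not tp_first
        if hit_tp:
            return 100.0 * move
        if hit_sl:
            return -100.0 * move / rr
    # Timeout: mark to the close of the last bar in the hold window.
    last_close = window[-1][3]
    return 100.0 * direction * (last_close - entry) / entry
```

Passing `rr=None` reproduces the no-stop-loss variant: only the take-profit is checked, and an untouched trade is marked to the close of the final bar in the hold window.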

7.3 Sizing and portfolio accounting

Initial capital of one million dollars is split evenly across the 164 symbols, yielding \$6,098 per symbol sub-portfolio. Each sub-portfolio compounds on its own per-trade percent returns, independent of the others. This sizing is deliberately symbol-uniform rather than capitalization-weighted; it is an evaluation convention, not a recommendation for live execution.
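A minimal sketch of this accounting convention is given below. The function name and the dictionary input are illustrative, and the example assumes each trade commits the full sub-portfolio equity, which the text implies but does not state outright.

```python
def portfolio_return_pct(trade_returns_pct_by_symbol, initial_capital=1_000_000.0):
    """Total percent return when capital is split evenly across symbols and
    each sub-portfolio compounds independently on its own trade returns.

    trade_returns_pct_by_symbol: dict mapping symbol -> list of per-trade
    percent returns in chronological order.
    """
    n_symbols = len(trade_returns_pct_by_symbol)
    per_symbol = initial_capital / n_symbols
    final_equity = 0.0
    for returns in trade_returns_pct_by_symbol.values():
        equity = per_symbol
        for r in returns:
            equity *= 1.0 + r / 100.0     # compound within the sub-portfolio
        final_equity += equity
    return 100.0 * (final_equity / initial_capital - 1.0)
```

With 164 symbols, `initial_capital / n_symbols` evaluates to about 6,097.56, which rounds to the per-symbol figure quoted above.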

7.4 The no-SL paradox

The no-stop-loss configuration produces an 87.8% win rate but a negative aggregate return of -9.96%. The apparent paradox resolves trivially: without a stop, nearly any trade eventually touches a small take-profit before the 40-bar timeout, producing a tiny win; but the 12.2% of trades that fail to touch TP compound into large losses at the bar-40 mark-to-close because the model's directional bias was wrong and there was no risk control. This configuration is reported only to emphasize that win rate is not a useful standalone metric.
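The arithmetic behind this can be made explicit. With win rate $p$, average win $w$, and average loss $l$, the per-trade expectancy $p \cdot w - (1-p) \cdot l$ turns negative once $l / w$ exceeds $p / (1-p)$. The helper below is an illustrative calculation, not part of the evaluation code.

```python
def breakeven_loss_to_win(win_rate):
    """Average-loss to average-win ratio above which expectancy is negative."""
    return win_rate / (1.0 - win_rate)
```

At the reported 87.8% win rate the threshold is about 7.2: losses averaging more than roughly seven times the typical win are enough to make the configuration unprofitable.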

7.5 Transaction-cost modeling

We report raw (pre-cost) returns. For reference, realistic round-trip transaction costs on 5-minute bracketed trades in our universe include: a spread cost of typically 1 to 3 basis points on liquid large-caps, slippage of 1 to 5 basis points depending on venue and order size, and commission of 0 to 2 basis points for retail and low-bps for institutional rates. Total realistic round-trip costs of 10 to 30 basis points exceed the model's per-trade expected value at 1:5 R/R. We return to this explicitly in Section 8.

8. Results

8.1 Main table

Test window: 2025-04-01 through 2026-04-17. The model generated 226,065 predictions; after the $\pm 0.1\%$ dead zone and occasional invalid-bar filtering, approximately 37,000 triggered trades remain. Trade counts vary marginally across R/R rules because the barrier-touch resolution can differ in edge cases.

R/R rule	Trades	Win rate	Total return	Per week
1:1	37,076	49.9%	~0.00%	0.000%
1:2	37,084	36.4%	+1.76%	+0.042%
1:3	37,089	30.4%	+2.76%	+0.066%
1:5	37,092	25.0%	+3.85%	+0.092%
no-SL	37,092	87.8%	-9.96%	-0.237%

8.2 Interpretation

At symmetric 1:1 risk/reward HELIOS is indistinguishable from a coin flip in terms of net return, with a win rate within noise of 50%. This is the canonical sign of a model that has picked up the volatility scale correctly but has little persistent directional edge.

As the R/R rule becomes more asymmetric, the win rate falls (mechanically: at 1:5 the price must reach a target five times farther from entry than the stop before it touches the stop, which happens less often in a random-walk process) but the arithmetic becomes more forgiving. A 25% win rate at 5:1 would pay out $0.25 \times 5 - 0.75 \times 1 = 0.50$ times the stop distance in expectation if every trade were a literal 25/75 coin flip resolved at one of the two barriers; under the same idealization the break-even win rate at 1:5 R/R is $1/(1+5) \approx 16.7\%$. The model's actual 1:5 performance of +0.092% per week on a bracket scaled to the predicted magnitude is consistent with a win rate above that break-even and with a small directional edge being captured only when winners are permitted to fully realize.
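The idealized bracket arithmetic can be checked with two small helpers. They assume every trade resolves exactly at one of the two barriers, so timeouts and partial exits, which the actual simulation does produce, are outside this calculation; the function names are illustrative.

```python
def expectancy_in_stop_units(win_rate, rr):
    """Expected payoff per trade, in units of the stop distance, when a win
    pays `rr` and a loss costs 1."""
    return win_rate * rr - (1.0 - win_rate)

def breakeven_win_rate(rr):
    """Win rate at which the expectancy above is exactly zero."""
    return 1.0 / (1.0 + rr)
```

For the four R/R rules in the table, `breakeven_win_rate` gives 50% at 1:1, about 33.3% at 1:2, 25% at 1:3, and about 16.7% at 1:5.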

8.3 Transaction-cost discussion

A portfolio-level return of +0.092% per week, spread across roughly 880 trades per week across the 164 sub-portfolios, corresponds to approximately 0.01% per trade on average. At realistic round-trip costs of 10 to 30 basis points, the per-trade expected value post-cost is sharply negative. We therefore do not recommend standalone deployment of HELIOS with current cost structures on retail execution stacks. An institutional stack with maker-rebate routing, careful limit-order placement, and bps-level commissions could plausibly capture a fraction of the raw edge, but verifying this empirically would require a live paper-trading campaign we have not yet run.

8.4 Long vs. short decomposition

Decomposing the 1:5 result by direction, long trades (bullish predictions) delivered +0.121% per week while short trades (bearish predictions) delivered +0.055% per week. This asymmetry is consistent with the documented "long bias" in U.S. equity intraday mean returns: short trades fight a positive drift of roughly 4 to 6 basis points per day on the S&P 500. The model's raw signal is therefore plausibly somewhat stronger on the short side in log-odds terms, but the positive market drift carries away part of it.

8.5 Unified-universe comparison

In a unified six-month head-to-head backtest against the other Zirdle Five models (ORION, NOVA, ATLAS, VEGA) run across 130 overlapping symbols and aligned trading dates, HELIOS's compounded contribution to the ensemble is approximately +1.47% per week. The number exceeds the standalone +0.092% per week at 1:5 R/R because the ensemble compounds across 130 symbols with tight spreads and orchestrates entries via majority-vote gating, which suppresses noise trades and amplifies signal trades. HELIOS's correlation with the daily models is low (pairwise $\rho < 0.15$ in trade-direction space), so its signal enters the ensemble at close to its full information value rather than being absorbed into the daily signal.

9. Discussion

9.1 Why is the edge so weak?

Three compounding reasons.

First, microstructure noise. The 5-minute log-return series on liquid large-caps has variance dominated by bid-ask bounce, quote flicker, and the random-walk component theoretically required by no-arbitrage. Empirical autocorrelation at lag 1 is negative and small (on our sample, pooled across symbols, approximately -0.03), consistent with slight mean-reversion from bounce. Autocorrelation at all higher lags is within noise of zero. The raw predictable fraction of variance is simply small.

Second, the efficient-market frontier is most binding exactly here. At the daily horizon, cross-sectional signals such as post-earnings drift, momentum, and analyst revisions carry persistent economic signal that is costly to arbitrage away because of capital constraints and career risk. At the tick horizon, order-book microstructure is persistently informative. At the 5-minute horizon, neither set of forces applies cleanly: HFT infrastructure has arbitraged away most short-term predictability, while persistent slower alpha has not yet accumulated in the few bars ahead. The 5-minute bar is near-optimally cleaned by the market.

Third, our universe is narrow and maximally efficient. 164 large-cap names are the exact set the entire quant industry watches most closely, with tightest spreads and highest HFT participation. Smaller and mid-cap names have materially lower HFT coverage and materially wider spreads, and the residual information content of 5-minute bars is correspondingly higher. We revisit this explicitly in Section 9.4.

9.2 Is this actually an edge? Statistical significance.

With 37,092 trades at 1:5 R/R producing a mean of +0.092% per week, we want to know whether the positive result is distinguishable from zero. A back-of-envelope test follows. Define per-trade percent return $r_i$ and assume per-trade standard deviation $\sigma \approx 0.5\%$ (a reasonable figure for 5-minute bracketed trades on liquid large-caps). Over the 42-week test, aggregate cumulative return is +3.85%, corresponding to average per-trade mean $\mu \approx 3.85 / 37{,}092 \approx 0.000104$ (i.e. 1.04 basis points per trade). The standard error of the mean is $\sigma / \sqrt{N} \approx 0.005 / \sqrt{37{,}092} \approx 2.6 \times 10^{-5}$. The approximate t-statistic is $\mu / \mathrm{SE} \approx 0.000104 / 2.6\times10^{-5} \approx 4$.
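The final two steps of this calculation are easy to reproduce. The sketch below takes the per-trade mean (1.04 basis points) and the assumed standard deviation (0.5%) quoted above as given inputs and treats trades as independent, which is the same simplification the back-of-envelope test makes.

```python
import math

def t_statistic(mean, std, n):
    """Return (t, standard_error) for a sample mean under an i.i.d. assumption."""
    standard_error = std / math.sqrt(n)
    return mean / standard_error, standard_error

# Inputs as stated in the text, expressed as fractions.
t_value, se = t_statistic(mean=1.04e-4, std=0.005, n=37_092)
```

With these inputs the standard error comes out near $2.6 \times 10^{-5}$ and the t-statistic near 4, matching the figures in the paragraph.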

A t of approximately 4 is comfortably outside the null. The positive edge is statistically distinguishable from zero at conventional thresholds. What it is not is economically large: 1 basis point per trade, pre-cost, is smaller than typical round-trip cost, which is why we remain unwilling to claim HELIOS as a deployable standalone strategy.

Several caveats apply to this significance calculation. Per-trade returns are not independent across closely spaced trades on the same symbol, so the nominal trade count overstates the effective sample size and the naive t-statistic is inflated. Adjusting for cross-symbol and intertemporal correlation would reduce the effective t. In a separate bootstrap analysis drawing 1,000 block-resampled histories with a block length of one trading day, we observed a 95% confidence interval on the 1:5 per-week return of approximately [+0.03%, +0.15%]: above zero, but with meaningful uncertainty. The conclusion is robust in sign but not in magnitude.
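A block bootstrap with one-day blocks can be sketched as follows. This is an illustration of the general technique rather than the analysis code behind the interval quoted above: it assumes trades have already been aggregated into one return per trading day, scales the daily mean to a week by simple multiplication, and runs on synthetic data. The name `block_bootstrap_ci` and all numeric inputs are hypothetical.

```python
import random
import statistics


def block_bootstrap_ci(daily_returns, n_resamples=1000, days_per_week=5,
                       alpha=0.05, seed=0):
    """Percentile CI for the mean per-week return, resampling whole trading days.

    Each day's aggregated return is one block, so dependence between trades
    within a day is preserved; dependence across days is not.
    """
    rng = random.Random(seed)
    n_days = len(daily_returns)
    weekly_means = []
    for _ in range(n_resamples):
        # Draw n_days trading days with replacement to form one resampled history.
        sample = rng.choices(daily_returns, k=n_days)
        weekly_means.append(statistics.fmean(sample) * days_per_week)
    weekly_means.sort()
    lo = weekly_means[int((alpha / 2) * n_resamples)]
    hi = weekly_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic stand-in for a daily P&L series (roughly 42 weeks of 5 trading days).
data_rng = random.Random(42)
daily = [data_rng.gauss(0.0002, 0.002) for _ in range(210)]

lo, hi = block_bootstrap_ci(daily)
print(f"95% CI on mean weekly return: [{lo:.4%}, {hi:.4%}]")
```

Resampling whole days rather than individual trades is what widens the interval relative to the naive standard error, because clustered same-day trades no longer count as independent observations.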

9.3 Ensemble value

Perhaps the most important result in this paper is the ensemble contribution. The Zirdle Five family is designed so that each model specializes in a distinct horizon or signal type: ORION (daily directional), NOVA (daily momentum), ATLAS (weekly trend), VEGA (volatility regime), and HELIOS (intraday). The signals are designed to be structurally weakly correlated, and measurement confirms this: pairwise correlations between HELIOS's trade direction and each daily model's trade direction are below 0.15 in the test window.

Under classical portfolio theory, a weakly positive signal with low correlation to an existing set of positive signals can raise the ensemble's Sharpe ratio meaningfully, even if its standalone Sharpe is small. The +1.47% per week ensemble contribution measured in the unified backtest is consistent with this: HELIOS's marginal value is higher inside the ensemble than its standalone number would suggest.
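The portfolio-theory argument can be made concrete with the standard mean-variance result for two return streams: when they are weighted optimally, the squared Sharpe ratio of the combination is $(S_a^2 + S_b^2 - 2\rho S_a S_b) / (1 - \rho^2)$. The sketch below evaluates this expression; the Sharpe values are illustrative placeholders, not figures measured in this report.

```python
import math


def combined_sharpe(s_a: float, s_b: float, rho: float) -> float:
    """Maximum Sharpe ratio from optimally weighting two return streams with
    standalone Sharpe ratios s_a and s_b and correlation rho (|rho| < 1)."""
    return math.sqrt((s_a**2 + s_b**2 - 2 * rho * s_a * s_b) / (1 - rho**2))


# Illustrative values only: a strong existing ensemble and a weak new signal.
existing, weak_signal = 1.0, 0.3
for rho in (0.0, 0.15, 0.3):
    s = combined_sharpe(existing, weak_signal, rho)
    print(f"rho = {rho:.2f}: combined Sharpe = {s:.3f}")
```

With these inputs the gain from adding the weak signal shrinks as correlation rises and vanishes when the correlation equals the ratio of the two Sharpe ratios (0.3 here), which is why low measured correlation matters more than standalone strength.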

The practical implication is that intraday models should be developed and evaluated with their ensemble role in mind. Optimizing HELIOS for standalone deployability would have been possible but would likely have required restrictive trade-selection filters that reduce the independent information it contributes to the ensemble. The version reported here is optimized for clean, high-coverage 5-minute forecasts, not for standalone profitability.

9.4 Where an intraday specialist is more likely to win

We believe HELIOS's architecture and training recipe are sound; the limitation is the universe. There are four settings in which we expect intraday specialization to yield larger standalone edges, and which we plan to investigate in follow-up work:

Small-cap and mid-cap names. Names in the Russell 2000 but not in our large-cap filter typically have spreads 5 to 20 times wider than our universe and materially lower HFT coverage. The residual predictable fraction of 5-minute bars is correspondingly higher. The trade-off is that the higher spread directly raises the transaction-cost barrier; useful deployment requires a larger raw edge to survive costs.

News-driven or event-driven intraday windows. Post-earnings, post-FDA-announcement, and post-macroeconomic-release bars exhibit substantially higher predictability than quiet bars as market participants gradually digest information. A conditional model that only trades within event windows could plausibly capture materially higher per-trade expected return.

Low-liquidity sessions. Pre-market and extended-hours bars, where HFT participation is lower and information diffusion is slower, may retain more exploitable structure. Infrastructure constraints (e.g. routing to extended-hours order books) make this harder to deploy than core-session trading.

Cross-asset intraday. Extending the universe to include highly liquid futures, FX, or rates products may produce a richer regime mixture with different microstructure properties; we have not tested this extension.

9.5 What we would do differently

Looking back at the development cycle, there are two choices we would reconsider.

First, we would run a live paper-trading campaign earlier. Backtest transaction-cost assumptions are always optimistic relative to realized execution, and a two-to-four-week live-paper shadow run during the validation window would have provided concrete feedback.

Second, we would explore conditional trading rules more aggressively. A gate that only acts when HELIOS and one of the daily models agree would likely yield better standalone numbers than unconditional trading. We deferred this work because it blurs the line between model evaluation and strategy construction; revisiting it is a natural next step.
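One simple form such a gate could take is sketched below. It assumes each model's output has been reduced to a direction in {+1, -1, 0} (long, short, no view); the function name and this encoding are hypothetical and do not describe an implemented Zirdle component.

```python
def agreement_gate(helios_dir: int, daily_dirs: list[int]) -> int:
    """Return HELIOS's direction (+1 long, -1 short) only if at least one
    daily model points the same way; otherwise return 0 (no trade)."""
    if helios_dir != 0 and any(d == helios_dir for d in daily_dirs):
        return helios_dir
    return 0


# HELIOS is long and one daily model agrees: the trade is allowed.
print(agreement_gate(1, [1, -1]))
# HELIOS is short but no daily model is short: the trade is suppressed.
print(agreement_gate(-1, [1, 0]))
```

A gate of this kind trades coverage for selectivity, which is the tension between standalone numbers and ensemble information content noted in Section 9.3.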

10. Limitations

We restate the limitations of this paper explicitly.

Test window. 42 weeks is moderately long by backtest standards but covers only a single primary regime: a low-volatility bull market grinding higher with intermittent consolidations. HELIOS has not been tested in a bear market, a volatility-spike regime, or a regime featuring sustained macroeconomic shock.

Transaction costs. Our raw edge of +0.092% per week at 1:5 R/R is plausibly consumed in full by realistic round-trip costs of 10 to 30 basis points. We have not performed an end-to-end cost-inclusive backtest with a credible execution simulator.

Same-symbol train and test. The 164 symbols in test are the 164 symbols in training. Generalization to symbols unseen in training is untested. An intraday model that exploits symbol-specific microstructure (e.g. AAPL-specific bid-ask patterns) would not transfer; a model that exploits universal microstructure should. We do not claim to know which we have.

Universe selection bias. The TOP_500_LIQUID filter requires continuous listing across train and test and sustained liquidity, which excludes names that delisted, were acquired, or experienced liquidity crises. Some forms of survivorship bias therefore apply to the test sample. The effect on a neutral directional forecasting task (as distinct from a long-only momentum strategy) is generally small, but we cannot claim zero exposure.

Regime-dependence of training-window choice. Our 24-month window is defended by regime-coherence arguments. If the post-2022 microstructure regime ends within the deployment horizon, HELIOS will require retraining on freshly coherent data. The sensitivity experiment suggests the model is meaningfully regime-tied rather than regime-robust.

Gap-sensitivity. 5-minute bars spanning overnight gaps and partial holiday sessions are included unmodified. We have not performed a sensitivity analysis on excluding these. A production deployment would almost certainly add gap-aware handling.
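A first step toward gap-aware handling is simply to flag bars that follow a discontinuity so that they can be excluded or treated separately in a sensitivity analysis. The sketch below assumes each bar carries an opening timestamp and that any spacing greater than five minutes counts as a gap; `flag_gap_bars` is a hypothetical helper, not part of the HELIOS pipeline.

```python
from datetime import datetime, timedelta

BAR = timedelta(minutes=5)


def flag_gap_bars(timestamps: list[datetime]) -> list[bool]:
    """Mark a bar True when more than one bar interval has elapsed since the
    previous bar (overnight, weekend, halt, or shortened session)."""
    if not timestamps:
        return []
    flags = [False]  # the first bar has no predecessor to compare against
    for prev, cur in zip(timestamps, timestamps[1:]):
        flags.append(cur - prev > BAR)
    return flags


# Two bars at the end of one session followed by two at the next open.
bars = [
    datetime(2025, 4, 1, 15, 50),
    datetime(2025, 4, 1, 15, 55),
    datetime(2025, 4, 2, 9, 30),
    datetime(2025, 4, 2, 9, 35),
]
print(flag_gap_bars(bars))
```

The resulting mask could be used to drop context windows that straddle a gap, or be supplied to the model as an additional input channel; which choice is better is an open question here.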

11. Conclusion

HELIOS is a small specialized transformer trained from scratch to produce quantile forecasts on 5-minute equity bars, in a regime-coherent 24-month window, over a universe of 164 liquid large-cap U.S. equities. On a 42-week out-of-sample test comprising 37,092 resolved trades, the model is approximately break-even at symmetric 1:1 risk/reward, statistically positive but economically small at asymmetric 1:5 risk/reward (+0.092% per week, approximate t ≈ 4), and strongly negative without stop-losses (-9.96% cumulative). The raw edge is unlikely to survive realistic transaction costs in standalone deployment; however, HELIOS's signal is nearly orthogonal to the daily-horizon models in our family and contributes approximately +1.47% per week inside a unified 130-symbol ensemble backtest over a comparable horizon.

We offer this paper as a carefully hedged account of a small-but-real effect on what we believe is among the hardest forecasting horizons in liquid equities. The most honest interpretation is that HELIOS is useful primarily as an ensemble component, and that intraday specialists targeting broader universes, event windows, or less-liquid names are a more promising direction for materially larger standalone alpha. A paper reporting a small-but-real intraday effect with frank transaction-cost disclosure is, we believe, a more useful addition to the literature than one advertising an unqualified high-edge backtest.

References

[1] L. Harris, Trading and Exchanges: Market Microstructure for Practitioners. New York, NY, USA: Oxford University Press, 2002.

[2] A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. Pineda Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and Y. Wang, "Chronos: Learning the language of time series," arXiv preprint arXiv:2403.07815, 2024.

[3] A. Das, W. Kong, R. Sen, and Y. Zhou, "A decoder-only foundation model for time-series forecasting," in Proc. 41st Int. Conf. Machine Learning (ICML), 2024.

[4] G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, "Unified training of universal time series forecasting transformers," in Proc. 41st Int. Conf. Machine Learning (ICML), 2024.

[5] A. E. Khandani and A. W. Lo, "What happened to the quants in August 2007? Evidence from factors and transactions data," Journal of Financial Markets, vol. 14, no. 1, pp. 1–46, 2011.

[6] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, "A time series is worth 64 words: Long-term forecasting with transformers," in Proc. 11th Int. Conf. Learning Representations (ICLR), 2023.

[7] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proc. AAAI Conf. Artificial Intelligence, 2021.

[8] B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister, "Temporal fusion transformers for interpretable multi-horizon time series forecasting," International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021.

[9] M. López de Prado, Advances in Financial Machine Learning. Hoboken, NJ, USA: Wiley, 2018.

[10] M. López de Prado, "The 10 reasons most machine learning funds fail," Journal of Portfolio Management, vol. 44, no. 6, pp. 120–133, 2018.

[11] R. Grinold and R. Kahn, Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk, 2nd ed. New York, NY, USA: McGraw-Hill, 1999.

[12] A. W. Lo, Adaptive Markets: Financial Evolution at the Speed of Thought. Princeton, NJ, USA: Princeton University Press, 2017.

[13] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.

[14] B. Kelly, S. Malamud, and K. Zhou, "The virtue of complexity in return prediction," Journal of Finance, vol. 79, no. 1, pp. 459–503, 2024.

[15] S. Gu, B. Kelly, and D. Xiu, "Empirical asset pricing via machine learning," Review of Financial Studies, vol. 33, no. 5, pp. 2223–2273, 2020.

[16] J. Hasbrouck, Empirical Market Microstructure: The Institutions, Economics, and Econometrics of Securities Trading. New York, NY, USA: Oxford University Press, 2007.

[17] M. O'Hara, "High frequency market microstructure," Journal of Financial Economics, vol. 116, no. 2, pp. 257–270, 2015.

[18] T. Gneiting and A. E. Raftery, "Strictly proper scoring rules, prediction, and estimation," Journal of the American Statistical Association, vol. 102, no. 477, pp. 359–378, 2007.

[19] R. Koenker, Quantile Regression. Cambridge, UK: Cambridge University Press, 2005.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.

[21] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. 7th Int. Conf. Learning Representations (ICLR), 2019.
