
S&P 500 Deep Learning Forecasting System

A research-grade forecasting system that evaluates Temporal Fusion Transformers against LSTM and ARIMAX baselines for S&P 500 return prediction using mixed-frequency market and macroeconomic data, then extends TFT with regime-aware attention and interpretability diagnostics.

Financial ML · 2025 · ML Research Engineer / Time-Series Modeling
Financial ML · Time-Series Forecasting · Temporal Fusion Transformer · Regime-Aware Attention · Mixed-Frequency Data · Quantile Forecasting · Model Interpretability · Experiment Design

Highlights

  • Built a mixed-frequency financial forecasting pipeline spanning January 1991 to October 2025, combining daily market variables such as VIX, Treasury yields, and yield spreads with monthly CPI-derived inflation releases.
  • Aligned macroeconomic variables using ALFRED vintage dates to reduce look-ahead bias, then evaluated models across chronological train, validation, and test windows covering COVID-19, post-pandemic, and Fed-tightening regimes.
  • Implemented and compared TFT, LSTM, and ARIMAX baselines, then added domain-specific TFT modifications including directional-diversity penalties, regime-conditional outputs, multi-task classification, and VIX-conditioned attention gates.
  • Found that regime-aware attention improved weekly directional accuracy from 57.9% to 59.4% and Sharpe ratio from 1.05 to 1.22, while learned gates dampened attention in low-volatility regimes and amplified it in high-volatility regimes.
  • Used attention gates, variable-selection weights, gradient-flow plots, prediction variance, and directional-bias diagnostics to explain when the model learned useful regime structure and when the output layer collapsed.

Key metrics

  • Data window: 1991-2025 (daily S&P 500 + mixed-frequency macro indicators)
  • Weekly accuracy: 59.4% (regime-aware TFT directional accuracy)
  • Sharpe ratio: 1.22 (weekly long-only strategy after adding regime-aware attention)
  • Attention gates: 4 params (2 VIX regimes x 2 attention heads)
  • Bear-market VIX weight: 0.74 (regime-aware variable-selection network (VSN) focus during the 2022 stress period)
  • Baselines: TFT / LSTM / ARIMAX (transformer, recurrent, and statistical comparisons)

Media

Project cover summarizing the mixed-frequency forecasting problem, model family, and regime-aware evaluation loop.
System architecture: mixed-frequency features enter a TFT encoder, then branch into quantile regression, optional classification, mixture-of-experts output, and regime-aware attention gates.
Regime-aware attention results: weekly directional accuracy improved from 57.9% to 59.4%, Sharpe increased from 1.05 to 1.22, and high-volatility gates amplified attention while low-volatility gates dampened it.
Variable-selection interpretation: baseline feature weights shift gradually, while regime-aware attention focuses more sharply on VIX in the 2022 bear market and yield spread in the 2024-2025 bull-market setting.

Tech stack

Python · PyTorch · PyTorch Forecasting · Temporal Fusion Transformer · LSTM · ARIMAX · Pandas · NumPy · Matplotlib · ALFRED / FRED · VIX Regime Features · Quantile Loss

Overview

This project investigates whether Temporal Fusion Transformers can forecast S&P 500 returns when market data arrives at different temporal resolutions. Daily variables such as VIX update every trading day. Macroeconomic variables such as CPI-derived inflation arrive monthly and grow stale until the next release.

The portfolio version emphasizes the engineering story: building the dataset correctly, avoiding look-ahead bias, comparing strong baselines, diagnosing model collapse, and adding regime-aware mechanisms that make the model behavior more interpretable.

Problem and constraints

  • Financial returns are noisy, and index prices behave close to a random walk. Even small directional improvements must therefore be treated cautiously and evaluated across multiple market regimes.
  • Mixed-frequency features create a stale-information problem: a 30-day-old macro release should not be treated with the same freshness as yesterday's volatility signal.
  • Market non-stationarity creates regime shifts where relationships between volatility, yields, inflation, and returns change across bull, bear, crisis, and tightening periods.
  • The project is framed as an evaluation and interpretability system, not a claim of deployable trading alpha.

Data pipeline

The dataset spans January 1991 to October 2025 and combines S&P 500 returns with daily market indicators and lower-frequency macroeconomic releases. Features include VIX, 10-year Treasury yield, 10Y-2Y yield spread, and CPI-derived inflation with publication lag handling.

To reduce look-ahead bias, macroeconomic records were aligned using ALFRED vintage dates so each historical prediction only sees data that would have been available at that point in time.
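Below is a minimal sketch of this point-in-time join. It assumes the vintage data has already been reduced to one row per release, keyed by its first-publication date. The column names `release_date` and `cpi_yoy` and the function `align_macro_to_daily` are hypothetical. pandas `merge_asof` with `direction="backward"` attaches the latest release published on or before each trading day. The same join also yields a days-since-release staleness feature:

```python
import pandas as pd


def align_macro_to_daily(daily: pd.DataFrame, releases: pd.DataFrame) -> pd.DataFrame:
    """Attach to each trading day the latest macro release published on or before it.

    daily:    one row per trading day with a datetime column "date".
    releases: one row per release with a datetime "release_date" (the ALFRED
              vintage / first-publication date) and the released "cpi_yoy" value.
    Column names are hypothetical.
    """
    daily = daily.sort_values("date")
    releases = releases.sort_values("release_date")
    # direction="backward" matches the last release_date <= date, so no
    # release is visible before the day it was published.
    merged = pd.merge_asof(
        daily, releases, left_on="date", right_on="release_date", direction="backward"
    )
    # Staleness feature: age of the macro information in calendar days.
    merged["days_since_release"] = (merged["date"] - merged["release_date"]).dt.days
    return merged


# Illustrative inputs (three monthly CPI releases, four trading days).
daily = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-10", "2024-01-11", "2024-01-12", "2024-02-14"])}
)
releases = pd.DataFrame(
    {
        "release_date": pd.to_datetime(["2023-12-12", "2024-01-11", "2024-02-13"]),
        "cpi_yoy": [3.1, 3.4, 3.1],
    }
)
out = align_macro_to_daily(daily, releases)
print(out[["date", "cpi_yoy", "days_since_release"]])
```

`allow_exact_matches` defaults to `True`, so a release is visible on its own publication day. Passing `allow_exact_matches=False` is the stricter choice if forecasts are formed before the release time.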

  • Chronological split: train on 1991-2015, validate on 2015-2020, test on 2020-2025.
  • Validation covers COVID-era disruption; testing covers post-pandemic conditions and 2022-2023 Fed-tightening regimes.
  • VIX-based regime labels are used for interpretability, attention gating, and rolling-window robustness analysis.
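The split and regime labeling can be sketched under two assumptions. Dates use half-open intervals, so each boundary day lands in exactly one window. The binary high/low VIX label uses a threshold fitted on the training window only. The exact cutoff dates and the median-threshold rule are assumptions, not the project's documented settings. The example frame is synthetic:

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, val_start: pd.Timestamp, test_start: pd.Timestamp, date_col: str = "date"):
    """Split by date using half-open windows: [.., val_start), [val_start, test_start), [test_start, ..)."""
    dates = pd.to_datetime(df[date_col])
    train = df[dates < val_start]
    val = df[(dates >= val_start) & (dates < test_start)]
    test = df[dates >= test_start]
    return train, val, test


def label_vix_regime(df: pd.DataFrame, threshold: float, vix_col: str = "vix") -> pd.Series:
    """Binary regime label: 1 = high volatility (VIX above threshold), 0 = low volatility."""
    return (df[vix_col] > threshold).astype(int)


# Synthetic monthly data; the cutoffs below are illustrative only.
df = pd.DataFrame(
    {
        "date": pd.date_range("2019-01-01", periods=6, freq="MS"),
        "vix": [25.0, 15.0, 14.0, 18.0, 40.0, 22.0],
    }
)
train, val, test = chronological_split(df, pd.Timestamp("2019-03-01"), pd.Timestamp("2019-05-01"))

# Fit the regime threshold on the training window only, then apply it everywhere.
threshold = train["vix"].median()
regimes = label_vix_regime(df, threshold)
print(threshold, regimes.tolist())
```

Fitting the threshold on training data keeps the regime definition itself free of look-ahead. A fixed level such as VIX 20 would be an equally simple alternative.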

Modeling approach

  • Temporal Fusion Transformer: variable-selection networks weight the input features, an LSTM encoder/decoder captures local temporal structure, and interpretable multi-head attention handles longer-range dependencies.
  • Quantile forecasting: predicts seven quantiles (0.02, 0.10, 0.25, 0.50, 0.75, 0.90, 0.98) using quantile loss to represent asymmetric uncertainty.
  • Baselines: ARIMAX for statistical comparison and a three-layer LSTM baseline for recurrent deep learning comparison on the same prediction task.
  • Training controls: constrained the TFT hidden size after larger hidden dimensions produced degenerate outputs, with almost uniformly positive return predictions and very low output variance.
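For reference, the quantile (pinball) loss over those seven levels can be written directly in PyTorch. These levels are also the default quantiles of PyTorch Forecasting's `QuantileLoss`. This is a standalone sketch of the loss, not the library's implementation:

```python
import torch

QUANTILES = (0.02, 0.10, 0.25, 0.50, 0.75, 0.90, 0.98)


def quantile_loss(pred: torch.Tensor, target: torch.Tensor, quantiles=QUANTILES) -> torch.Tensor:
    """Pinball loss averaged over the batch and all quantile levels.

    pred:   (batch, n_quantiles) forecast quantiles
    target: (batch,) realized returns
    """
    q = torch.tensor(quantiles, dtype=pred.dtype, device=pred.device)  # (n_quantiles,)
    err = target.unsqueeze(-1) - pred  # > 0 when the realized return is above the forecast
    # Under-forecast costs q * err; over-forecast costs (1 - q) * |err|.
    return torch.maximum(q * err, (q - 1) * err).mean()


# Example: a flat zero forecast when the realized return is +1.
loss = quantile_loss(torch.zeros(1, 7), torch.tensor([1.0]))
print(loss.item())  # approximately 0.5, the mean of the quantile levels
```

Under-forecasting a quantile costs q per unit of error, and over-forecasting costs 1 - q. The 0.98 output is therefore pushed toward the upper tail, while the 0.02 output tracks the lower tail.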

Regime-aware attention

The central architectural experiment adds lightweight VIX-conditioned attention gates. Each attention head is multiplied by a learned regime-specific gate, letting the model specialize attention behavior under low- and high-volatility conditions with only four learned parameters for two regimes and two heads.
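Below is a sketch of such a gate module. It assumes the gates are sigmoid-squashed logits initialized at zero, so every gate starts at 0.5. It also assumes they are applied to per-head attention outputs shaped `(batch, heads, time, d_head)`. The class name and tensor layout are hypothetical:

```python
import torch
import torch.nn as nn


class RegimeAttentionGate(nn.Module):
    """Scales each attention head's output by a learned, regime-specific gate.

    Holds n_regimes * n_heads scalars (2 x 2 = 4 by default). The sigmoid
    parameterization and zero initialization are assumptions of this sketch.
    """

    def __init__(self, n_regimes: int = 2, n_heads: int = 2):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_regimes, n_heads))

    def forward(self, head_outputs: torch.Tensor, regime: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, n_heads, time, d_head); regime: (batch,) int64 labels
        gates = torch.sigmoid(self.gate_logits)[regime]  # (batch, n_heads)
        return head_outputs * gates[:, :, None, None]


gate = RegimeAttentionGate(n_regimes=2, n_heads=2)
print(sum(p.numel() for p in gate.parameters()))  # 4 learned scalars

head_outputs = torch.ones(3, 2, 5, 4)  # (batch, heads, time, d_head)
regime = torch.tensor([0, 1, 1])       # e.g. 0 = low VIX, 1 = high VIX
gated = gate(head_outputs, regime)     # every gate is 0.5 at initialization
```

Under this parameterization, gate values are bounded in (0, 1). "Dampened" and "amplified" then describe movement relative to the 0.5 starting point and to the other regime, not relative to an ungated head.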

The resulting gates learned intuitive behavior. Low-volatility gates settled at about 0.46 and high-volatility gates at about 0.57. Attention is therefore dampened in calm markets and amplified in stressed ones, with high-volatility attention weighted roughly 24% more than low-volatility attention. In the weekly setting, this lifted directional accuracy from 57.9% to 59.4% and improved the Sharpe ratio from 1.05 to 1.22.

  • Low-volatility regime: attention is dampened, suggesting the model relies less on sharp historical re-weighting when markets are calmer.
  • High-volatility regime: attention is amplified, suggesting the model searches harder for regime-specific temporal signals during stress periods.
  • Feature-weight analysis showed VIX becoming dominant during the 2022 bear market and yield spread becoming more important in later bull-market windows.
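Regime-level feature summaries like these can be produced by averaging per-timestep variable-selection weights within labeled windows. Here is a minimal pandas sketch, with synthetic weights and hypothetical column and window names:

```python
import pandas as pd


def summarize_selection_weights(weights: pd.DataFrame, window_col: str = "window"):
    """Average per-timestep variable-selection weights within each labeled window
    and report the dominant feature per window."""
    summary = weights.groupby(window_col).mean()
    dominant = summary.idxmax(axis=1)
    return summary, dominant


# Synthetic per-timestep weights (each row sums to 1, as softmax selection weights do).
weights = pd.DataFrame(
    {
        "window": ["2022_bear", "2022_bear", "2024_bull", "2024_bull"],
        "vix": [0.7, 0.8, 0.2, 0.1],
        "yield_spread": [0.2, 0.1, 0.5, 0.6],
        "cpi_yoy": [0.1, 0.1, 0.3, 0.3],
    }
)
summary, dominant = summarize_selection_weights(weights)
print(dominant.to_dict())  # {'2022_bear': 'vix', '2024_bull': 'yield_spread'}
```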

Failure analysis and diagnostics

A major learning from the project was that validation loss alone was not enough. Multiple experiments converged to similar validation loss while producing very different downstream behavior, including prediction collapse.

Gradient-flow diagnostics showed cases where encoder and decoder layers continued learning useful structure while the output layer collapsed early, indicating a mismatch between representation learning and the final prediction objective.
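One way to produce this kind of gradient-flow trace is to log the gradient norm of each top-level submodule after every backward pass. The sketch below uses a hypothetical toy forecaster with an LSTM `encoder` and a linear `output_layer`, not the project's TFT:

```python
import torch
import torch.nn as nn


def grad_norms_by_module(model: nn.Module) -> dict:
    """L2 norm of all gradients in each top-level submodule; call after loss.backward()."""
    norms = {}
    for name, child in model.named_children():
        grads = [p.grad.detach().flatten() for p in child.parameters() if p.grad is not None]
        norms[name] = torch.cat(grads).norm().item() if grads else 0.0
    return norms


class ToyForecaster(nn.Module):
    """Hypothetical stand-in: LSTM encoder followed by a 7-quantile output layer."""

    def __init__(self, n_features: int = 4, hidden: int = 8, n_quantiles: int = 7):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.output_layer = nn.Linear(hidden, n_quantiles)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(x)
        return self.output_layer(h[:, -1])  # last time step -> quantile forecasts


torch.manual_seed(0)
model = ToyForecaster()
loss = model(torch.randn(16, 10, 4)).pow(2).mean()
loss.backward()
norms = grad_norms_by_module(model)
print(norms)
```

Plot these norms over training steps. An `output_layer` norm that decays toward zero while the `encoder` norm stays active is the kind of disconnect described above.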

  • Tracked prediction variance, directional bias, gradient norms, quantile loss, directional accuracy, hit rate, Sharpe ratio, RMSE, MAE, and max drawdown.
  • Added directional-diversity and anti-collapse penalties to discourage batches where almost all predictions share the same sign.
  • Used multi-task classification to test whether the encoder could learn VIX-based regime structure even when daily direction prediction remained noisy.
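The exact form of these penalties is not specified here. One plausible formulation, offered as an assumption, has two terms. A tanh soft sign makes the directional term differentiable, and a hinge on the batch standard deviation handles anti-collapse. The temperature and `min_std` values are hypothetical:

```python
import torch


def directional_diversity_penalty(pred: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Near 0 when batch predictions split between up and down, near 1 when all share one sign."""
    soft_sign = torch.tanh(pred / temperature)  # differentiable stand-in for sign()
    return soft_sign.mean().pow(2)


def anti_collapse_penalty(pred: torch.Tensor, min_std: float = 1e-3) -> torch.Tensor:
    """Positive only when the batch's prediction spread falls below min_std."""
    return torch.relu(min_std - pred.std())


same_sign = torch.full((8,), 0.05)          # every forecast says "up"
balanced = torch.tensor([0.05, -0.05] * 4)  # half up, half down
print(directional_diversity_penalty(same_sign).item())  # close to 1
print(directional_diversity_penalty(balanced).item())   # 0.0
print(anti_collapse_penalty(torch.zeros(8)).item())     # about 1e-3 (zero spread)
```

Both terms would be added to the quantile loss with small weights. Equity returns have a positive long-run mean, so forcing a perfectly balanced sign mix can itself bias the model. The directional weight therefore needs tuning on validation data.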

Results and interpretation

  • Weekly regime-aware attention improved directional accuracy from 57.9% to 59.4%, a 1.5 percentage-point lift in a very noisy forecasting task.
  • The same weekly setting improved Sharpe ratio from 1.05 to 1.22, suggesting that the regime-conditioned attention mechanism improved the quality of selected signals, not just the raw classification metric.
  • Variable-selection weights became more regime-adaptive. VIX rose to about 0.74 during the 2022 bear market, compared with about 0.31 for the baseline TFT, while yield spread dominated the 2024 bull-market window at about 0.54.
  • The project ultimately shows both promise and limits: transformer encoders can learn meaningful regime structure, but noisy financial targets require careful evaluation, interpretation, and failure-mode analysis.
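The headline metrics can be computed from weekly forecasts and realized returns as shown below. This sketch makes several assumptions:

  • Strategy rule: hold the index when the forecast is positive, otherwise hold cash at zero return.
  • Risk-free rate: zero.
  • Transaction costs: none.
  • Annualization: sqrt(52).

The sample arrays are synthetic and are not meant to reproduce the reported 1.22 Sharpe:

```python
import numpy as np


def directional_accuracy(pred: np.ndarray, actual: np.ndarray) -> float:
    """Share of periods where forecast and realized return have the same sign."""
    return float(np.mean(np.sign(pred) == np.sign(actual)))


def long_only_returns(pred: np.ndarray, actual: np.ndarray) -> np.ndarray:
    """Hold the index when the forecast is positive; otherwise hold cash at 0% return."""
    return np.where(pred > 0, actual, 0.0)


def annualized_sharpe(returns: np.ndarray, periods_per_year: int = 52) -> float:
    """Mean over sample std of per-period returns, scaled by sqrt(periods per year)."""
    return float(np.mean(returns) / np.std(returns, ddof=1) * np.sqrt(periods_per_year))


def max_drawdown(returns: np.ndarray) -> float:
    """Worst peak-to-trough drop of the compounded equity curve, as a negative fraction."""
    equity = np.concatenate(([1.0], np.cumprod(1.0 + returns)))  # start from 1.0 of capital
    running_peak = np.maximum.accumulate(equity)
    return float(np.min(equity / running_peak - 1.0))


# Synthetic weekly forecasts and realized returns.
pred = np.array([0.01, -0.02, 0.03, 0.01])
actual = np.array([0.02, 0.01, 0.01, -0.01])
strategy = long_only_returns(pred, actual)
print(directional_accuracy(pred, actual))  # 0.5
print(annualized_sharpe(strategy), max_drawdown(strategy))
```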

Engineering contribution

  • Designed reproducible data preprocessing and chronological evaluation workflows for mixed-frequency financial time series.
  • Implemented TFT, LSTM, and ARIMAX experiments with comparable inputs and evaluation metrics.
  • Built diagnostics to expose output collapse, gradient disconnect, prediction diversity, and regime-specific attention behavior.
  • Produced clear visualizations for attention gates, variable-selection weights, rolling regimes, and architecture-level modifications.

Next steps

  • Add multi-horizon and multi-target forecasting across S&P 500 constituents rather than a single index-level target.
  • Compare additional architectures such as Informer, Autoformer, and non-stationarity-aware time-series models on the same vintage-aligned dataset.
  • Add a Streamlit or Next.js demo that allows users to inspect a forecast window, attention weights, feature weights, and regime classification side-by-side.
  • Turn the notebook/report artifacts into a fully reproducible experiment package with locked data snapshots and automated figure generation.
