
Lesson 06

Transformers for Time Series

Attention mechanisms, originally designed for natural language, are now state-of-the-art for long-horizon time series forecasting. Self-attention captures arbitrary-range dependencies without the vanishing gradient problems of RNNs.

Scaled dot-product attention

LSTMs process sequences step-by-step, making long-range dependencies hard to learn due to vanishing gradients. Transformers attend to every position simultaneously — a token at step t can directly attend to step t−100 with equal ease.

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

  Q   = query matrix (what am I looking for?)
  K   = key matrix (what do I offer to others?)
  V   = value matrix (what information do I contain?)
  d_k = key dimension (scaling by √d_k prevents vanishing softmax gradients)

Each row of the output is a weighted sum of the values V, weighted by how well that query matches each key.
Attention weight matrix — each row shows what each position attends to (brighter = stronger)
Reading the attention matrix. Each row is a softmax over all key positions. Row t shows which past positions step t attends to most. Brighter cells = stronger attention. Notice diagonal patterns (attending to nearby positions) and occasional long-range spikes.
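The formula above can be sketched directly in NumPy — a minimal, single-head version with the row-wise softmax written out (the function name and shapes are illustrative, not from a specific library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k): each query scored against each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values + the attention map

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # 5 queries, d_k = 8
K = rng.standard_normal((5, 8))   # 5 keys
V = rng.standard_normal((5, 4))   # 5 values, d_v = 4
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)         # (5, 4) (5, 5)
```

The returned `weights` matrix is exactly the attention map shown in the heatmap above: row t is a probability distribution over which positions step t attends to.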

Positional encoding

Unlike RNNs, transformers have no built-in notion of order — the architecture is permutation-invariant. Positional encodings are added to the input embeddings to inject sequence position information.

Sinusoidal positional encoding (Vaswani et al. 2017):

  PE(pos, 2i)   = sin( pos / 10000^(2i/d_model) )
  PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )

  pos     : position in sequence (0, 1, 2, ...)
  i       : dimension index (0 to d_model/2 − 1)
  d_model : embedding dimension
Sinusoidal PE heatmap — positions (x) × dimensions (y) — blue=positive, dark=negative
Key property. PE(pos+k) can be expressed as a linear function of PE(pos), allowing the model to learn relative positional attention naturally. Each dimension oscillates at a different frequency — low dimensions capture slow (large-scale) positional structure, high dimensions capture fine (local) structure.
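The sin/cos pair above vectorizes cleanly; here is a minimal NumPy sketch (the helper name is illustrative). Even columns hold the sines, odd columns the cosines, and each column pair oscillates at its own frequency:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]              # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) dimension indices
    angle = pos / (10000.0 ** (2 * i / d_model)) # each column i gets a different frequency
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)   # (128, 64)
```

Plotting `pe` as an image reproduces the heatmap described above: low dimension indices vary slowly across positions, high indices vary rapidly.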

PatchTST — patches for time series

PatchTST (Nie et al. 2023) treats the time series as a sequence of patches (short local windows), not individual steps. This reduces sequence length, preserves local temporal structure, and dramatically reduces computation.

PatchTST pipeline:

  Input: univariate series x ∈ ℝ^L (L = lookback length)
  1. Patching: split into patches of length p with stride s,
     giving P = ⌊(L − p) / s⌋ + 2 patches per series (the +2 comes from end-padding)
  2. Embedding: linear projection ℝ^p → ℝ^d_model
  3. Transformer encoder over the P patch embeddings
  4. Flatten + linear head → ℝ^T (T = forecast horizon)
Patch extraction diagram — overlapping windows mapped to transformer input
Why patches work. Treating each time step as a token means a 512-bar lookback yields 512 tokens, and attention cost grows quadratically in token count. With patch_len=16, stride=8, you get 64 patches — 8× fewer tokens, so attention is roughly 64× cheaper. Patches also group semantically local information, matching how financial patterns (candlestick patterns, volatility bursts) span multiple bars.
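The patching step (step 1 of the pipeline) can be sketched as follows — a simplified NumPy version with the end-padding that produces the ⌊(L − p)/s⌋ + 2 patch count (function name and padding-by-repetition are illustrative assumptions, not the reference implementation):

```python
import numpy as np

def make_patches(x, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches, PatchTST-style."""
    # Pad the end by repeating the last value so the final window is complete;
    # this yields P = floor((L - patch_len) / stride) + 2 patches.
    x = np.concatenate([x, np.full(stride, x[-1])])
    n = (len(x) - patch_len) // stride + 1
    return np.stack([x[k * stride : k * stride + patch_len] for k in range(n)])

x = np.arange(512, dtype=float)   # a 512-bar lookback
patches = make_patches(x)
print(patches.shape)              # (64, 16): 64 patches of length 16
```

Each row of `patches` would then be linearly projected to d_model and fed to the encoder as one token.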

Practical considerations

Channel independence vs mixing. PatchTST treats each feature independently (CI mode). Mixing channels (sharing attention across features) can help when features are correlated but risks learning spurious cross-series dependencies. Start with CI mode; add channel mixing only if diagnostics suggest it helps.
Lookback window length matters far more for transformers than for ARIMA. Typical values: L = 96 to 336 bars. Larger L gives more context but increases memory and attention cost quadratically. With patches, L = 336 is tractable (≈42 patches at stride 8). For daily data, L = 252 (one year) is a natural starting point.
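To see how lookback length translates into token count, the patch-count formula from the pipeline above is a one-liner (helper name illustrative):

```python
def num_patches(L, patch_len=16, stride=8):
    """P = floor((L - patch_len) / stride) + 2; the +2 reflects end-padding."""
    return (L - patch_len) // stride + 2

for L in (96, 336, 512):
    print(L, num_patches(L))   # 96 -> 12, 336 -> 42, 512 -> 64
```

Even the largest typical lookback stays at a few dozen tokens, which is why L = 336 is tractable for patched transformers.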
When to use which model. For horizons ≤ 10 steps and datasets < 1000 bars: ARIMA or GARCH. For horizons 10–100 steps and datasets > 5000 bars: PatchTST or similar. For volatility forecasting specifically: GARCH almost always outperforms transformers at ≤ 10-step horizons. Transformers shine at multi-step mean forecasting with long lookbacks.