Lesson 06

Feature Engineering

Raw OHLCV data should not be fed directly to an HMM: price and volume levels drift over time, so they do not cluster cleanly by regime. We compute three features that capture the distinct statistical signature of each market regime: direction, volatility, and participation.

Why these three features?

Each feature captures a different dimension of market behaviour. Together, they give the multivariate Gaussian emission enough axes to distinguish Bull Run from Bear/Crash from Chop.

Feature 1: returns = (Close[t] − Close[t-1]) / Close[t-1]
Feature 2: range = (High[t] − Low[t]) / Close[t] (normalised)
Feature 3: volume_change = (Volume[t] − Volume[t-1]) / Volume[t-1]
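
The three definitions above can be sketched in pandas as follows. The column names (Open, High, Low, Close, Volume), the function name compute_features, and the three-bar sample are assumptions made for illustration; the lesson does not prescribe an implementation.

```python
import pandas as pd

def compute_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Compute returns, normalised range, and volume change per bar.

    Assumes columns named Open, High, Low, Close, Volume
    (an assumption of this sketch, not something the lesson specifies).
    """
    prev_close = ohlcv["Close"].shift(1)
    prev_volume = ohlcv["Volume"].shift(1)

    features = pd.DataFrame(index=ohlcv.index)
    # Feature 1: simple return relative to the previous close.
    features["returns"] = (ohlcv["Close"] - prev_close) / prev_close
    # Feature 2: intrabar range normalised by the bar's own close.
    features["range"] = (ohlcv["High"] - ohlcv["Low"]) / ohlcv["Close"]
    # Feature 3: relative change in volume versus the previous bar.
    features["vol_chg"] = (ohlcv["Volume"] - prev_volume) / prev_volume
    # The first bar has no predecessor, so its returns and vol_chg are NaN.
    return features.dropna()

# Hypothetical three-bar sample.
bars = pd.DataFrame({
    "Open": [100.0, 101.0, 103.0],
    "High": [102.0, 104.0, 104.0],
    "Low": [99.0, 100.0, 98.0],
    "Close": [101.0, 103.0, 99.0],
    "Volume": [1000.0, 1200.0, 1800.0],
})
feats = compute_features(bars)
print(feats)
```

Because the first bar has no previous close or volume, n bars yield n - 1 feature rows.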

Returns capture direction and magnitude. Bull runs have consistently positive returns. Bear regimes have negative, often large returns. Choppy markets have returns near zero with no persistent sign.

Normalised range, (High−Low)/Close, captures intrabar volatility without price-level dependence. A high range signals uncertainty, often alongside wide bid-ask spreads or a potential regime transition. Bear/Crash periods have characteristically high range.

Volume change captures market participation. Bull runs typically show rising or elevated volume confirming the move. Chop shows declining volume as conviction fades. Volume spikes in bear regimes often signal capitulation.

From OHLCV to feature matrix

The computed features feed directly into the HMM training matrix, one row per bar. Each row of the synthetic sample table carries the raw bar and its derived features: Time, Open, High, Low, Close, Volume, returns, range, vol_chg.

Feature time series

Visualised over 150 bars (three synthetic regimes). Notice how each feature's values cluster differently in Bull, Chop, and Bear periods.

[Figure: three panels showing Feature 1 (Returns), Feature 2 (Normalised Range), and Feature 3 (Volume Change), with regime coloring from lesson 02.]
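
To make the per-regime differences concrete, here is a minimal sketch that draws 150 synthetic feature rows (50 per regime) and compares their averages. Every number in REGIMES is a hypothetical value chosen for illustration, and the features are sampled directly rather than derived from OHLCV bars, so this is not how the lesson's own plots were produced.

```python
import numpy as np
import pandas as pd

# Hypothetical per-regime parameters (assumptions for illustration only):
# mean return, return volatility, typical normalised range, mean volume change.
REGIMES = {
    "Bull": {"mu": 0.004, "sigma": 0.006, "range": 0.010, "vol_drift": 0.01},
    "Chop": {"mu": 0.000, "sigma": 0.004, "range": 0.006, "vol_drift": -0.01},
    "Bear": {"mu": -0.006, "sigma": 0.015, "range": 0.030, "vol_drift": 0.02},
}

def synth_features(bars_per_regime: int = 50, seed: int = 0) -> pd.DataFrame:
    """Draw synthetic feature rows for each regime, labelled by regime name."""
    rng = np.random.default_rng(seed)
    frames = []
    for name, p in REGIMES.items():
        n = bars_per_regime
        frames.append(pd.DataFrame({
            "regime": name,
            "returns": rng.normal(p["mu"], p["sigma"], n),
            # Range is non-negative by definition, so take the absolute value.
            "range": np.abs(rng.normal(p["range"], p["range"] / 4, n)),
            "vol_chg": rng.normal(p["vol_drift"], 0.05, n),
        }))
    return pd.concat(frames, ignore_index=True)

feats = synth_features()
# Per-regime means: the rows should separate along all three axes.
summary = feats.groupby("regime")[["returns", "range", "vol_chg"]].mean()
print(summary)
```

Changing the seed plays the same role as regenerating the sample: the individual rows change, but the per-regime averages stay separated.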

Preprocessing: clipping outliers

Before training, we clip each feature to its 1st–99th percentile. Without this, a single flash crash can produce an extreme return that dominates the Gaussian fit and corrupts the entire model.

features[col] = features[col].clip(features[col].quantile(0.01), features[col].quantile(0.99))
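
The effect of clipping can be checked on a small example. The sketch below assumes the features live in a pandas DataFrame; the helper name clip_outliers, the simulated returns, and the injected -40% bar are all hypothetical.

```python
import numpy as np
import pandas as pd

def clip_outliers(features: pd.DataFrame,
                  lower: float = 0.01,
                  upper: float = 0.99) -> pd.DataFrame:
    """Clip every column to its own [lower, upper] quantile range."""
    lo = features.quantile(lower)
    hi = features.quantile(upper)
    # axis=1 aligns the per-column bounds with the DataFrame's columns.
    return features.clip(lower=lo, upper=hi, axis=1)

rng = np.random.default_rng(42)
returns = rng.normal(0.0, 0.01, 500)
returns[250] = -0.40  # injected flash-crash bar (hypothetical value)
features = pd.DataFrame({"returns": returns})

clipped = clip_outliers(features)
print("std before:", features["returns"].std())
print("std after: ", clipped["returns"].std())
```

A single extreme bar inflates the sample standard deviation of the whole series; after clipping, the estimate falls back towards the scale of the ordinary bars, which is what the Gaussian fit should see.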

Why not standardise? A z-score computed over a rolling window rescales every stretch of data to roughly unit variance, which removes the scale information the HMM uses to distinguish regimes. A Bear/Crash bar has genuinely larger magnitude than a Chop bar, and that difference must survive preprocessing. Clipping outliers is sufficient to keep the EM algorithm stable without losing the regime-distinguishing scale.
