Turn the Bellman equation into an algorithm. Q-learning iteratively improves Q-value estimates using experience — no model of the environment required.
Model-free learning
We don't know T(s'|s,a) — the transition probabilities. But we can sample (s, a, r, s') tuples and use them to update Q estimates. This is model-free RL: learn directly from experience, without ever building a model of how the environment works.
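The sampling idea can be sketched in a few lines. This is a minimal illustration with a hypothetical toy environment (the `ToyEnv` class and its `reset`/`step` interface are invented for this example, not a specific library API); the point is that the agent only ever sees sampled tuples, never T(s'|s,a):

```python
import random

# Model-free experience collection. We never consult the transition
# probabilities T(s'|s,a); we only sample transitions by acting.
class ToyEnv:
    """A hypothetical two-state environment, purely for illustration."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # Action 1 moves right with reward +1; action 0 stays, reward 0.
        s_next = min(self.s + a, 1)
        r = 1.0 if a == 1 else 0.0
        done = (s_next == 1)
        self.s = s_next
        return s_next, r, done

def collect_transitions(env, policy, n_steps):
    """Sample (s, a, r, s') tuples — the raw material for Q updates."""
    transitions = []
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        transitions.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return transitions

batch = collect_transitions(ToyEnv(), lambda s: random.choice([0, 1]), 10)
```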
Q(s,a) ← Q(s,a) + α · δ
δ = R + γ · max_a' Q(s',a') − Q(s,a)

Here R + γ · max_a' Q(s',a') is the TD target, Q(s,a) is the current estimate, and their difference δ is the TD error.
α = learning rate (how much to shift Q toward the target)
γ = discount factor (how much to value future rewards)
The TD error δ is the surprise — how much better or worse the outcome was than expected. Positive δ means 'this was better than I thought, do it more'. Negative δ means 'this was worse than expected, do it less'. The agent learns by repeatedly correcting its own predictions.
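The update rule above is a one-liner in code. A sketch of the tabular update, with illustrative hyperparameter values:

```python
# Tabular Q-learning update — a direct transcription of the update rule.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD update; returns the TD error (the 'surprise')."""
    td_target = r + gamma * max(Q[s_next])   # R + γ · max_a' Q(s',a')
    delta = td_target - Q[s][a]              # TD error δ
    Q[s][a] += alpha * delta                 # shift estimate toward target
    return delta

# 5 price states × 3 actions, initialised to zero (matches the toy market).
Q = [[0.0] * 3 for _ in range(5)]
delta = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table, the first reward of +1 produces δ = 1.0 (pure surprise) and nudges Q[0][1] up to α · δ = 0.1.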
Interactive Q-table
A simplified market with 5 price states and 3 actions. Each cell in the Q-table shows the estimated value of taking that action in that state. Watch the values update as the agent learns. Green cells mean 'good to do here', red cells mean 'avoid'.
Early training is mostly exploration (high ε). As ε decays, the agent exploits its learned Q values more. The reward curve should trend upward as Q values improve — the agent moves from random actions toward informed decisions.
[Chart: exploration rate (ε, purple) and cumulative reward per episode (green), updated as training steps run]
Convergence guarantee. For finite MDPs, Q-learning converges to Q* — the optimal Q-function — provided every state–action pair is visited sufficiently often and the learning rate decays appropriately. In practice convergence is slow for large problems, so we use neural networks to generalise.
Why Q-tables break
Our toy market has 5 × 3 = 15 Q-values. A real trading state might include: a 20-bar return history, 5 indicators, current position, and time in trade = 20 + 5 + 1 + 1 = 27 continuous dimensions. Even discretised to just 10 bins each, that's 10²⁷ states — far more than any agent could ever visit, never mind estimate accurately.
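The arithmetic behind that blow-up, spelled out:

```python
# State-count arithmetic for the hypothetical trading state described above:
# 27 continuous features, each discretised to 10 bins.
n_features = 20 + 5 + 1 + 1        # return history + indicators + position + time
n_states = 10 ** n_features        # 10 bins per feature → 10^27 discrete states
table_cells = n_states * 3         # one Q-value per (state, action) pair
print(n_features, n_states)        # a table this size is hopeless
```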
This is the curse of dimensionality. The Q-table is only practical for tiny toy problems. For real trading with continuous state spaces we need a function approximator — a neural network — that can generalise across unseen states. That is the DQN, coming next.