
Lesson 04

Deep Q-Networks

Replace the Q-table with a neural network. DQN extends Q-learning to continuous, high-dimensional state spaces using function approximation, experience replay, and a target network.

Function approximation

Instead of a lookup table, use a neural network Q(s,a;θ) parameterised by weights θ. The network takes a state vector as input and outputs Q-values for all actions simultaneously.

Neural network architecture — input state to Q-values for all actions
Weight sharing. The same neural network processes all states. This allows the agent to generalise — states that look similar produce similar Q-values, even if they've never been seen before. This is the key advantage over Q-tables.
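A minimal numpy sketch of such a network, assuming a hypothetical 4-feature market state and the three actions from the diagram (Hold, Buy, Sell); the layer sizes and random weights are illustrative, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: a 4-feature market state, 3 actions
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3

# One shared set of weights theta serves every state
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Forward pass: state vector -> one Q-value per action."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                      # Q(s, a; theta) for every a at once

s = rng.normal(size=STATE_DIM)   # a state the network has never seen
q = q_values(s)                  # shape (3,): Q(Hold), Q(Buy), Q(Sell)
best = int(np.argmax(q))         # greedy action index
```

Because the same weights process every state, nearby state vectors map to nearby Q-values, which is exactly the generalisation a lookup table cannot provide.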

Two stability tricks

Experience Replay

Naive Q-learning updates from consecutive (s,a,r,s') transitions. Consecutive transitions are highly correlated — the network overfits to recent experience and 'forgets' the past. Experience replay solves this: store transitions in a large buffer and train on mini-batches sampled uniformly at random, which decorrelates updates and reuses old experience many times.

Circular replay buffer — blue = just written  ·  yellow = current batch  ·  green/red = reward sign  ·  fading = older

Each step t stores one transition tuple (s_t, a_t, r_t, s_{t+1}, done_t) in the buffer.
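A circular buffer like the one above can be sketched with the standard library alone; the capacity and batch size match the figure, and the integer "transitions" are placeholders for real (s, a, r, s', done) tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer; the oldest transitions are evicted first."""
    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)   # deque drops old entries automatically

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=1000)
for t in range(20):                         # write 20 toy transitions
    buf.push(t, t % 3, float(t), t + 1, False)
batch = buf.sample(4)                       # fresh random mini-batch of 4
```

`deque(maxlen=...)` gives the circular-overwrite behaviour for free: once the buffer is full, each `push` silently evicts the oldest transition.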

Target Network

Standard DQN uses two networks: an online network Q(s,a;θ) and a target network Q̂(s,a;θ⁻). The target weights θ⁻ are a delayed copy of θ, updated only every C steps. This prevents the Bellman target from shifting at every gradient step — stabilising training.

Loss = E[(r + γ · max_{a'} Q̂(s',a'; θ⁻) − Q(s,a; θ))²]
                 ↑ target network (frozen θ⁻)      ↑ online network (trained θ)
θ⁻ ← τ·θ + (1−τ)·θ⁻    ← soft update every step (τ = 0.005)
or hard copy θ⁻ ← θ every C = 1000 steps
Without a target network, the Bellman target changes after every gradient step, like trying to hit a moving target. Freezing θ⁻ for C steps keeps the target stationary long enough for the online network to converge toward it.

The DQN training loop

All the pieces come together in one loop.

# Initialise networks and buffer
Initialise online Q(s,a;θ), target Q̂(s,a;θ⁻ = θ), replay buffer B
For each episode:
  s ← env.reset()
  For each step t:
    a ← ε-greedy(Q(s;θ))                  # select action
    s', r, done ← env.step(a)             # execute action
    B.push(s, a, r, s', done)             # store transition
    if |B| ≥ batch_size:
      Sample mini-batch from B
      y = r + γ · (1 − done) · max_{a'} Q̂(s';θ⁻)   # compute target
      L = MSE(Q(s,a;θ), y)                # compute loss
      θ ← θ − α·∇_θ L                     # gradient step
    if t mod C == 0: θ⁻ ← θ               # update target
    s ← s'
Batch size matters. Using 32–128 transitions per gradient step balances stability and speed. Too small: noisy gradients. Too large: stale experience. 64 is a common default.
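The target computation at the heart of the loop can be vectorised over a whole mini-batch. This sketch uses a tiny batch of 4 with made-up rewards and target-network Q-values (all numbers are illustrative); the (1 − done) factor zeroes the bootstrap term for terminal transitions:

```python
import numpy as np

GAMMA = 0.99

# Illustrative sampled mini-batch: rewards, terminal flags, and the
# target network's Q-values for each next state (4 states x 3 actions)
r      = np.array([1.0, -0.5, 0.2, 0.0])
done   = np.array([0.0,  0.0, 1.0, 0.0])   # 1.0 = episode ended at s'
q_next = np.array([[0.5, 1.0, -0.2],
                   [0.1, 0.0,  0.3],
                   [2.0, 2.0,  2.0],
                   [0.0, 0.4,  0.1]])

# y = r + gamma * (1 - done) * max_a' Qhat(s', a'; theta_minus)
y = r + GAMMA * (1.0 - done) * q_next.max(axis=1)
# the terminal transition (row 2) keeps only its reward: y[2] == 0.2
```

One max per row, one multiply, one add — the whole batch of targets in three vectorised operations, which is why larger batches cost little extra time per gradient step.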

DQN vs plain Q-learning

The two approaches differ most in how they scale. Q-learning with a table is exact and provably convergent on small problems; DQN trades those guarantees for the ability to handle real-world state spaces.

Aspect           | Q-learning                     | DQN
State space      | Discrete, small                | Continuous, large
Storage          | Q-table (grows with states)    | Fixed-size neural network
Generalisation   | None (each state independent)  | Yes (similar states → similar Q)
Stability tricks | Not needed                     | Replay + target network required
Convergence      | Guaranteed (finite MDP)        | Empirical (no guarantee)
Typical use      | Toy problems                   | Real trading environments