Replace the Q-table with a neural network. DQN extends Q-learning to continuous, high-dimensional state spaces using function approximation, experience replay, and a target network.
Function approximation
Instead of a lookup table, use a neural network Q(s,a;θ) parameterised by weights θ. The network takes a state vector as input and outputs Q-values for all actions simultaneously.
[Figure: neural network architecture — a state vector is fed in, and the network outputs one Q-value per action (Hold, Buy, Sell); the best action is the argmax.]
Weight sharing. The same neural network processes all states. This allows the agent to generalise — states that look similar produce similar Q-values, even if they've never been seen before. This is the key advantage over Q-tables.
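To make the idea concrete, here is a minimal sketch of such a network as a tiny two-layer MLP in NumPy. The state dimension, hidden size, and the Hold/Buy/Sell action set are illustrative assumptions; a real DQN would use a deep-learning framework.

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3  # assumed sizes; actions: Hold, Buy, Sell

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(0, 0.1, (STATE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(0, 0.1, (HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}

def q_values(state, p):
    """One forward pass: state vector in, one Q-value per action out."""
    h = np.maximum(0.0, state @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]                    # linear output head

s = rng.normal(size=STATE_DIM)   # an arbitrary state vector
q = q_values(s, params)          # Q(s, ·; θ) for all three actions at once
best = int(np.argmax(q))         # greedy action index
```

Because every state passes through the same weights, nearby states produce nearby Q-values — the generalisation the paragraph above describes.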
Two stability tricks
Experience Replay
Naive Q-learning updates from consecutive (s,a,r,s') transitions. Consecutive transitions are highly correlated — the network overfits to recent experience and 'forgets' the past. Experience replay solves this.
[Figure: circular replay buffer — each new experience tuple overwrites the oldest entry once the buffer is full; training draws a fresh random mini-batch at every step.]
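A replay buffer needs only two operations: push a transition, and sample a random mini-batch. A minimal sketch using Python's standard library (the class and its capacity are illustrative, not a fixed API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity circular buffer; random sampling breaks temporal correlation."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest entries drop off automatically

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # overfill to show the circular behaviour
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(32)               # 32 transitions drawn from across history
```

`deque(maxlen=…)` gives the circular overwrite for free: once full, appending silently evicts the oldest tuple.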
Target Network
Standard DQN uses two networks: the online network Q(s,a;θ), which is trained, and the target network Q(s,a;θ⁻), a delayed copy refreshed every C steps. This prevents the target from shifting at every gradient step, stabilising training.
Loss = E[(r + γ · max_{a'} Q(s',a'; θ⁻) − Q(s,a; θ))²]

Here Q(s',a';θ⁻) comes from the frozen target network and Q(s,a;θ) from the trained online network. The target weights are refreshed either by a soft update every step, θ⁻ ← τ·θ + (1−τ)·θ⁻ (τ = 0.005), or by a hard copy θ⁻ ← θ every C = 1000 steps.
Without a target network, the Bellman target changes every gradient step — like trying to hit a moving target. Freezing θ⁻ for C steps makes the target stationary long enough for the online network to converge toward it.
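Both update rules are one-liners over the parameter dictionaries. A sketch, assuming parameters are stored as NumPy arrays keyed by name:

```python
import numpy as np

def soft_update(theta, theta_bar, tau=0.005):
    """Polyak averaging, applied every step: θ⁻ ← τ·θ + (1−τ)·θ⁻."""
    return {k: tau * theta[k] + (1 - tau) * theta_bar[k] for k in theta}

def hard_update(theta):
    """Full copy, applied every C steps: θ⁻ ← θ."""
    return {k: v.copy() for k, v in theta.items()}

theta     = {"W": np.ones(3)}   # online weights (toy values)
theta_bar = {"W": np.zeros(3)}  # target weights
theta_bar = soft_update(theta, theta_bar)  # moves 0.5% of the way toward θ
```

With τ = 0.005 the target trails the online network smoothly instead of jumping every C steps; both variants appear in practice.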
The DQN training loop
All pieces together in one loop.
# Initialise networks and buffer
Initialise Q(s,a;θ) and Q̂(s,a;θ⁻ = θ), replay buffer B
For each episode:
    s ← env.reset()
    For each step t:
        a ← ε-greedy(Q(s;θ))                  # select action
        s', r, done ← env.step(a)             # execute action
        B.push(s, a, r, s', done)             # store transition
        if |B| ≥ batch_size:
            Sample mini-batch from B
            y = r + γ · max_{a'} Q̂(s',a';θ⁻)  # compute target
            L = MSE(Q(s,a;θ), y)              # compute loss
            θ ← θ − α·∇L                      # gradient step
        if t mod C == 0: θ⁻ ← θ               # update target
        s ← s'
Batch size matters. Using 32–128 transitions per gradient step balances stability and speed. Too small: noisy gradients. Too large: stale experience. 64 is a common default.
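The inner-loop target and loss can be checked numerically. A sketch assuming the mini-batch Q-values have already been computed by the online and target networks (all arrays here are random stand-ins):

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(1)

batch, n_actions = 4, 3
r    = rng.uniform(-1, 1, batch)              # rewards
done = np.array([0, 0, 1, 0], dtype=float)    # episode-end flags
q_next_target = rng.normal(size=(batch, n_actions))  # Q̂(s',·;θ⁻)
q_online      = rng.normal(size=(batch, n_actions))  # Q(s,·;θ)
a = rng.integers(0, n_actions, batch)         # actions taken

# y = r + γ·max_a' Q̂(s',a';θ⁻); terminal states contribute no bootstrap term
y = r + gamma * (1 - done) * q_next_target.max(axis=1)
q_sa = q_online[np.arange(batch), a]          # Q(s,a;θ) for the actions taken
loss = np.mean((y - q_sa) ** 2)               # MSE minimised by the gradient step
```

Note the `(1 − done)` mask: on terminal transitions the target collapses to the reward alone, a detail the pseudocode above leaves implicit.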
DQN vs plain Q-learning
The two approaches differ most in how they scale. Q-learning with a table is exact and provably convergent on small problems; DQN trades those guarantees for the ability to handle real-world state spaces.