Replace the Q-table with a neural network. DQN extends Q-learning to continuous, high-dimensional state spaces using function approximation, experience replay, and a target network.
Function approximation
Instead of a lookup table, use a neural network Q(s,a;θ) parameterised by weights θ. The network takes a state vector as input and outputs Q-values for all actions simultaneously.
[Figure: neural network architecture — a state vector is fed in, and the network outputs one Q-value per action (Hold, Buy, Sell); the best action is the argmax.]
Weight sharing. The same neural network processes all states. This allows the agent to generalise — states that look similar produce similar Q-values, even if they've never been seen before. This is the key advantage over Q-tables.
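To make the idea concrete, here is a minimal sketch of such a network as a tiny two-layer MLP in NumPy. The state dimension, hidden size, and the Hold/Buy/Sell action set are illustrative assumptions; a real DQN would use a deep-learning framework.

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3  # assumed sizes; actions: Hold, Buy, Sell

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(0, 0.1, (STATE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(0, 0.1, (HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}

def q_values(state, p):
    """One forward pass: state vector in, one Q-value per action out."""
    h = np.maximum(0.0, state @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]                    # linear output head

s = rng.normal(size=STATE_DIM)   # an arbitrary state vector
q = q_values(s, params)          # Q(s, ·; θ) for all three actions at once
best = int(np.argmax(q))         # greedy action index
```

Because every state passes through the same weights, nearby states produce nearby Q-values — the generalisation the paragraph above describes.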
Two stability tricks
Experience Replay
Naive Q-learning updates from consecutive (s,a,r,s') transitions. Consecutive transitions are highly correlated — the network overfits to recent experience and 'forgets' the past. Experience replay solves this.
[Figure: circular replay buffer — each new experience tuple overwrites the oldest entry once the buffer is full; training draws a fresh random mini-batch at every step.]
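A replay buffer needs only two operations: push a transition, and sample a random mini-batch. A minimal sketch using Python's standard library (the class and its capacity are illustrative, not a fixed API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity circular buffer; random sampling breaks temporal correlation."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest entries drop off automatically

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # overfill to show the circular behaviour
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(32)               # 32 transitions drawn from across history
```

`deque(maxlen=…)` gives the circular overwrite for free: once full, appending silently evicts the oldest tuple.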
Target Network
Standard DQN uses two networks: the online network Q(s,a;θ), which is trained, and the target network Q(s,a;θ⁻), a delayed copy refreshed every C steps. This prevents the target from shifting at every gradient step, stabilising training.
Loss = E[(r + γ · max_{a'} Q(s',a'; θ⁻) − Q(s,a; θ))²]

Here Q(s',a';θ⁻) comes from the frozen target network and Q(s,a;θ) from the trained online network. The target weights are refreshed either by a soft update every step, θ⁻ ← τ·θ + (1−τ)·θ⁻ (τ = 0.005), or by a hard copy θ⁻ ← θ every C = 1000 steps.
Without a target network, the Bellman target changes every gradient step — like trying to hit a moving target. Freezing θ⁻ for C steps makes the target stationary long enough for the online network to converge toward it.
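Both update rules are one-liners over the parameter dictionaries. A sketch, assuming parameters are stored as NumPy arrays keyed by name:

```python
import numpy as np

def soft_update(theta, theta_bar, tau=0.005):
    """Polyak averaging, applied every step: θ⁻ ← τ·θ + (1−τ)·θ⁻."""
    return {k: tau * theta[k] + (1 - tau) * theta_bar[k] for k in theta}

def hard_update(theta):
    """Full copy, applied every C steps: θ⁻ ← θ."""
    return {k: v.copy() for k, v in theta.items()}

theta     = {"W": np.ones(3)}   # online weights (toy values)
theta_bar = {"W": np.zeros(3)}  # target weights
theta_bar = soft_update(theta, theta_bar)  # moves 0.5% of the way toward θ
```

With τ = 0.005 the target trails the online network smoothly instead of jumping every C steps; both variants appear in practice.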
The DQN training loop
All pieces together in one loop.
# Initialise networks and buffer
Initialise Q(s,a;θ) and Q̂(s,a;θ⁻ = θ), replay buffer B
For each episode:
    s ← env.reset()
    For each step t:
        a ← ε-greedy(Q(s;θ))                  # select action
        s', r, done ← env.step(a)             # execute action
        B.push(s, a, r, s', done)             # store transition
        if |B| ≥ batch_size:
            Sample mini-batch from B
            y = r + γ · max_{a'} Q̂(s',a';θ⁻)  # compute target
            L = MSE(Q(s,a;θ), y)              # compute loss
            θ ← θ − α·∇L                      # gradient step
        if t mod C == 0: θ⁻ ← θ               # update target
        s ← s'
Batch size matters. Using 32–128 transitions per gradient step balances stability and speed. Too small: noisy gradients. Too large: stale experience. 64 is a common default.
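The inner-loop target and loss can be checked numerically. A sketch assuming the mini-batch Q-values have already been computed by the online and target networks (all arrays here are random stand-ins):

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(1)

batch, n_actions = 4, 3
r    = rng.uniform(-1, 1, batch)              # rewards
done = np.array([0, 0, 1, 0], dtype=float)    # episode-end flags
q_next_target = rng.normal(size=(batch, n_actions))  # Q̂(s',·;θ⁻)
q_online      = rng.normal(size=(batch, n_actions))  # Q(s,·;θ)
a = rng.integers(0, n_actions, batch)         # actions taken

# y = r + γ·max_a' Q̂(s',a';θ⁻); terminal states contribute no bootstrap term
y = r + gamma * (1 - done) * q_next_target.max(axis=1)
q_sa = q_online[np.arange(batch), a]          # Q(s,a;θ) for the actions taken
loss = np.mean((y - q_sa) ** 2)               # MSE minimised by the gradient step
```

Note the `(1 − done)` mask: on terminal transitions the target collapses to the reward alone, a detail the pseudocode above leaves implicit.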
DQN vs plain Q-learning
The two approaches differ most in how they scale. Q-learning with a table is exact and provably convergent on small problems; DQN trades those guarantees for the ability to handle real-world state spaces.