Lesson 07

The Full System

All pieces assembled. A complete DQN trading agent — environment, network, replay buffer, training loop — with Python code ready to run.

Architecture overview

End-to-end data flow through the DQN trading system

Simulated training progress

Typical DQN training on 2 years of hourly BTC data (≈17,500 bars, 175 episodes of 100 bars). The reward improves as the agent discovers that momentum + trend-following beats random action.

Training curves — episode reward (green) and ε decay (purple dashed)
Final performance metrics
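The dashed ε curve in the training chart typically follows an exponential schedule. A minimal sketch, where the start, floor, and decay values are illustrative assumptions rather than the lesson's exact settings:

```python
def epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.98):
    """Exponentially decay the exploration rate per episode, with a floor."""
    return max(eps_end, eps_start * decay ** episode)

# Early episodes explore almost uniformly; after ~150 episodes the
# schedule has hit its floor and the agent is greedy ~95% of the time.
print(epsilon(0), epsilon(150))
```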
Training time. On a laptop CPU, 200 episodes × 100 steps × one DQN forward/backward pass takes about 5–15 minutes. With GPU acceleration and 252-step episodes it trains in 1–3 minutes. The bottleneck is usually data loading, not network computation.

Code


  
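As a starting point, here is a minimal sketch of the kind of environment the system needs, assuming a long/flat action space and log-return rewards. The class name and interface are illustrative, not the lesson's actual environment file:

```python
import numpy as np

class TradingEnv:
    """Minimal long/flat trading environment over a price series.
    Actions: 0 = flat, 1 = long. Reward = position * next log return."""

    def __init__(self, prices, window=10):
        self.prices = np.asarray(prices, dtype=float)
        self.window = window
        self.reset()

    def reset(self):
        self.t = self.window
        self.position = 0
        return self._state()

    def _state(self):
        # Observation: the last `window` log returns, plus current position.
        rets = np.diff(np.log(self.prices[self.t - self.window:self.t + 1]))
        return np.append(rets, self.position).astype(np.float32)

    def step(self, action):
        self.position = action
        self.t += 1
        reward = self.position * np.log(self.prices[self.t] / self.prices[self.t - 1])
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done
```

Usage: `s = env.reset()`, then repeatedly `s, r, done = env.step(action)` until `done`.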

  
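The Q-network itself can be a small MLP mapping a state vector to one Q-value per action. A sketch assuming PyTorch and an illustrative hidden size:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP that maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```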

  
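The replay buffer only needs to store transitions and sample uniform minibatches; a fixed-size deque is the standard minimal implementation. A sketch (capacity is an illustrative default):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)  # old transitions are evicted first

    def push(self, *transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        # Transpose rows into columns: states, actions, rewards, next_states, dones.
        return map(list, zip(*batch))

    def __len__(self):
        return len(self.buf)
```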

  
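The training loop ties the pieces together: act ε-greedily, store the transition, take one TD step per environment step, and periodically sync the target network. The sketch below is self-contained and deliberately simplified, with assumptions throughout: random-walk prices stand in for BTC data, the run is only 5 episodes, and all hyperparameters are illustrative:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0); random.seed(0); np.random.seed(0)
prices = 100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 600)))  # synthetic bars
WINDOW, GAMMA, EPS, BATCH = 10, 0.99, 0.1, 32

def state(t, position):
    rets = np.diff(np.log(prices[t - WINDOW:t + 1]))
    return torch.tensor(np.append(rets, position), dtype=torch.float32)

online = nn.Sequential(nn.Linear(WINDOW + 1, 32), nn.ReLU(), nn.Linear(32, 2))
target = nn.Sequential(nn.Linear(WINDOW + 1, 32), nn.ReLU(), nn.Linear(32, 2))
target.load_state_dict(online.state_dict())
opt = torch.optim.Adam(online.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)

for episode in range(5):                 # a real run uses hundreds of episodes
    t, pos, total = WINDOW, 0, 0.0
    while t < len(prices) - 1:
        s = state(t, pos)
        a = random.randrange(2) if random.random() < EPS else online(s).argmax().item()
        pos, t = a, t + 1
        r = pos * np.log(prices[t] / prices[t - 1])
        buffer.append((s, a, r, state(t, pos), t >= len(prices) - 1))
        total += r
        if len(buffer) >= BATCH:         # one TD update per environment step
            s_b, a_b, r_b, s2_b, d_b = zip(*random.sample(buffer, BATCH))
            s_b, s2_b = torch.stack(s_b), torch.stack(s2_b)
            a_b = torch.tensor(a_b)
            r_b = torch.tensor(r_b, dtype=torch.float32)
            d_b = torch.tensor(d_b, dtype=torch.float32)
            q = online(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r_b + GAMMA * (1 - d_b) * target(s2_b).max(1).values
            loss = nn.functional.smooth_l1_loss(q, y)
            opt.zero_grad(); loss.backward(); opt.step()
    target.load_state_dict(online.state_dict())  # sync target net each episode
    print(f"episode {episode}: reward {total:+.4f}")
```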

Getting started

Expected results. A well-trained DQN on BTC data typically achieves positive reward in 60–70% of evaluation episodes after 300–500 training episodes. Performance varies significantly with market conditions — always backtest on held-out data.
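That 60–70% figure is a hit rate over evaluation episodes. A sketch of how you might measure it on a held-out chronological split; the 80/20 split ratio is an illustrative choice, not the lesson's:

```python
import numpy as np

def positive_episode_rate(episode_rewards):
    """Fraction of evaluation episodes that ended with positive total reward."""
    r = np.asarray(episode_rewards, dtype=float)
    return float((r > 0).mean())

# Chronological split: never shuffle a time series before splitting,
# or the test set leaks information from the "future".
n_bars = 17_500
split = int(0.8 * n_bars)              # train on the first 80% of bars
train_idx, test_idx = range(split), range(split, n_bars)
```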
Hyperparameter sensitivity. DQN is notoriously sensitive to hyperparameters. Start with the defaults in train.py. If training diverges (reward collapses), try: lower lr (1e-4), larger replay buffer (100k), or stronger gradient clipping.
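As a concrete reference, here is one plausible set of defaults plus the stability tweaks suggested above. Every value is an assumption; check train.py for the actual defaults:

```python
# Illustrative defaults (assumed, not the lesson's exact values).
DEFAULTS = dict(
    lr=5e-4, gamma=0.99, buffer_size=50_000, batch_size=64,
    eps_start=1.0, eps_end=0.05, eps_decay=0.98,
    target_sync_every=10, grad_clip=10.0,
)

# If training diverges: lower lr, larger buffer, stronger clipping
# (interpreted here as a smaller max gradient norm).
STABLE = {**DEFAULTS, "lr": 1e-4, "buffer_size": 100_000, "grad_clip": 1.0}
```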
Next steps. Try Double DQN (already implemented), Dueling DQN (separate advantage + value heads), Prioritised Experience Replay (sample important transitions more), or PPO (a policy gradient method that often outperforms DQN on continuous action spaces).
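For reference, the Double DQN variant mentioned above decouples action selection (online net) from action evaluation (target net), which reduces the max-operator's overestimation bias. A sketch in PyTorch; the function name and signature are illustrative:

```python
import torch

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """y = r + gamma * (1 - done) * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        # Online net picks the best next action...
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        # ...target net scores that action.
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```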