The environment is the market. Designing it correctly — observations, actions, and rewards — is as important as the agent architecture itself.
Gym-compatible interface
Our environment follows the OpenAI Gym interface: reset() → returns initial observation, step(action) → returns (observation, reward, done, info). This standardisation means any RL algorithm can plug in.
```python
class TradingEnv:
    observation_space: Box(shape=(10,))   # 10 continuous features
    action_space: Discrete(3)             # 0 = Hold, 1 = Buy, 2 = Sell

obs = env.reset()                           # start new episode -> shape (10,)
obs, reward, done, info = env.step(action)  # repeat until done=True (episode ends
                                            # after N bars or the portfolio is ruined)
```
Standardisation matters. By following the Gym interface, we can swap in any RL library (Stable-Baselines3, RLlib, CleanRL) without changing the environment code. The environment is completely decoupled from the agent.
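As a concrete illustration of that decoupling, here is a minimal sketch of a Gym-style environment skeleton. It is not the full implementation: the observation is mostly stubbed out, and mapping Sell to a short position (−1) is an assumption, since the text only names the three actions.

```python
import numpy as np

class TradingEnv:
    """Minimal Gym-style skeleton (a sketch; helper names are illustrative)."""

    def __init__(self, prices, episode_len=252):
        self.prices = np.asarray(prices, dtype=float)
        self.episode_len = episode_len

    def reset(self):
        self.t = 0
        self.position = 0                       # flat at episode start
        return self._observe()                  # shape (10,)

    def step(self, action):
        # 0 = Hold (keep position), 1 = Buy (go long), 2 = Sell
        # (Sell -> short, -1, is an assumption; it could also mean "flatten")
        self.position = {0: self.position, 1: 1, 2: -1}[action]
        self.t += 1
        reward = self.position * self._bar_return()
        done = self.t >= self.episode_len
        return self._observe(), reward, done, {}

    def _bar_return(self):
        return (self.prices[self.t] - self.prices[self.t - 1]) / self.prices[self.t - 1]

    def _observe(self):
        obs = np.zeros(10, dtype=np.float32)    # real features per the table below
        obs[5] = self.position                  # position feature
        obs[7] = 1.0 - self.t / self.episode_len  # steps_left feature
        return obs
```

Any agent that speaks the `reset`/`step` protocol can drive this loop; the environment never needs to know which algorithm is on the other side.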
Observation space
The observation is a fixed-length vector the agent sees at each step. It must contain enough information for the agent to make good decisions, but not so much that training is slow.
| Feature | Index | Description | Range |
|---|---|---|---|
| ret_1 | 0 | 1-bar return | ≈ [−0.1, +0.1] |
| ret_5 | 1 | 5-bar return | ≈ [−0.2, +0.2] |
| ret_20 | 2 | 20-bar return | ≈ [−0.4, +0.4] |
| vol_10 | 3 | 10-bar rolling volatility | [0, 0.05] |
| rsi_14 | 4 | RSI normalised to [−1, +1] | [−1, +1] |
| position | 5 | Current position (−1, 0, 1) | {−1, 0, 1} |
| upnl | 6 | Unrealised P&L | ≈ [−0.3, +0.3] |
| steps_left | 7 | Fraction of episode remaining | [0, 1] |
| max_dd | 8 | Max drawdown this episode | [−1, 0] |
| since_trade | 9 | Bars since last trade (normalised) | [0, 1] |
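The market features in the table can be computed from a raw price series. The sketch below shows one plausible implementation for the first five, assuming pandas; the exact formulas (simple-moving-average RSI, rescaling RSI via `rsi/50 − 1`) are assumptions, not confirmed by the text.

```python
import numpy as np
import pandas as pd

def market_features(prices: pd.Series) -> np.ndarray:
    """Compute ret_1, ret_5, ret_20, vol_10, rsi_14 for the latest bar (a sketch)."""
    ret = prices.pct_change()
    ret_1  = prices.pct_change(1).iloc[-1]
    ret_5  = prices.pct_change(5).iloc[-1]
    ret_20 = prices.pct_change(20).iloc[-1]
    vol_10 = ret.rolling(10).std().iloc[-1]          # 10-bar rolling volatility

    # Classic RSI with a 14-bar simple moving average of gains/losses
    delta = prices.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rsi = 100 - 100 / (1 + gain / loss)              # RSI in [0, 100]
    rsi_norm = rsi.iloc[-1] / 50.0 - 1.0             # rescaled to [-1, +1]

    return np.array([ret_1, ret_5, ret_20, vol_10, rsi_norm], dtype=np.float32)
```

The remaining five features (position, unrealised P&L, steps left, max drawdown, bars since last trade) come from the environment's own bookkeeping rather than the price series.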
Normalisation is critical. Neural networks work best when inputs are near zero with unit variance. We normalise each feature to [−1, +1] or clip at ±3σ. Un-normalised inputs cause exploding or vanishing gradients.
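A minimal sketch of the clipping approach, assuming the feature means and standard deviations are estimated on the training set only (to avoid look-ahead):

```python
import numpy as np

def normalise(x, mean, std, clip_sigma=3.0):
    """Standardise a feature to roughly zero mean / unit variance, then clip at ±3σ."""
    z = (x - mean) / (std + 1e-8)   # epsilon guards against zero-variance features
    return np.clip(z, -clip_sigma, clip_sigma)
```

A feature five standard deviations from its mean is capped at 3, so a single extreme bar cannot blow up the gradient.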
Episode walkthrough
Let's step through one episode bar by bar. The agent uses a random policy here; real DQN training uses learned Q-values.
[Figure: price chart with position overlay; green background = long position]
Episode length. We use 252-bar episodes: one trading year of daily data, or roughly ten days of hourly data in a round-the-clock market. Longer episodes give the agent more context but slow training; too short and the agent never sees a full market cycle.
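One common way to generate 252-bar episodes from a longer history is to slice a random contiguous window per episode; the random start offset is an assumption here, since the text only fixes the episode length.

```python
import numpy as np

def sample_episode(prices, episode_len=252, rng=None):
    """Pick a random contiguous episode_len-bar window from the full price history."""
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(prices) - episode_len)  # leave room for a full window
    return prices[start:start + episode_len]
```

Randomising the start keeps successive episodes from being identical, so the agent sees many different market regimes from the same dataset.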
Reward calculation in detail
```python
# At each step t:
price_return     = (price[t] - price[t-1]) / price[t-1]
position_return  = position[t] * price_return                  # 0 if flat, ±return if long/short
transaction_cost = abs(position[t] - position[t-1]) * lam_tc   # lam_tc = 0.001
reward = (position_return
          - transaction_cost
          - lam_dd * max(0, max_portfolio - portfolio[t]) / max_portfolio)
```
Transaction costs matter enormously. Even λtc=0.001 (0.1% per trade) can eliminate the profitability of a high-frequency strategy. The agent must learn to hold positions long enough to cover costs.
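The arithmetic behind that claim is worth spelling out. A round trip (open a position, then close it) moves the position by two units, so it costs 2 × λtc = 0.2%. The trading frequency below is an illustrative assumption:

```python
lam_tc = 0.001                    # 0.1% per unit of position change
round_trip_cost = 2 * lam_tc      # open then close: 0.2% per round trip
trades_per_year = 252             # one round trip per daily bar (illustrative)
annual_cost = round_trip_cost * trades_per_year   # ~50% of capital per year
```

A strategy that flips position every bar must therefore earn well over 0.2% per round trip just to break even, which is why the agent has to learn to hold.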
Drawdown penalty. Penalising the current drawdown from peak portfolio value teaches the agent to protect capital. Without it, the agent may happily ride a 50% drawdown expecting a recovery.
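The penalty requires tracking the running portfolio peak. A minimal sketch of that bookkeeping (the helper name is illustrative):

```python
def update_drawdown(portfolio_value, peak):
    """Update the running peak and return the current drawdown from it."""
    peak = max(peak, portfolio_value)
    drawdown = (peak - portfolio_value) / peak   # 0 at a new peak, positive below it
    return peak, drawdown
```

A portfolio that peaked at 1.2 and has fallen to 0.9 has a drawdown of 0.3 / 1.2 = 25%, and the agent is penalised in proportion at every step it stays underwater.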