The environment is the market. Designing it correctly — observations, actions, and rewards — is as important as the agent architecture itself.
Gym-compatible interface
Our environment follows the OpenAI Gym interface: reset() → returns initial observation, step(action) → returns (observation, reward, done, info). This standardisation means any RL algorithm can plug in.
```python
class TradingEnv:
    observation_space: Box(shape=(10,))   # 10 continuous features
    action_space: Discrete(3)             # 0 = Hold, 1 = Buy, 2 = Sell

obs = env.reset()                           # start new episode -> shape (10,)
obs, reward, done, info = env.step(action)  # repeat until done=True (episode ends
                                            # after N bars or the portfolio is ruined)
```
Standardisation matters. By following the Gym interface, we can swap in any RL library (Stable-Baselines3, RLlib, CleanRL) without changing the environment code. The environment is completely decoupled from the agent.
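As a concrete illustration of that decoupling, here is a minimal sketch of a Gym-style environment skeleton. It is not the full implementation: the observation is mostly stubbed out, and mapping Sell to a short position (−1) is an assumption, since the text only names the three actions.

```python
import numpy as np

class TradingEnv:
    """Minimal Gym-style skeleton (a sketch; helper names are illustrative)."""

    def __init__(self, prices, episode_len=252):
        self.prices = np.asarray(prices, dtype=float)
        self.episode_len = episode_len

    def reset(self):
        self.t = 0
        self.position = 0                       # flat at episode start
        return self._observe()                  # shape (10,)

    def step(self, action):
        # 0 = Hold (keep position), 1 = Buy (go long), 2 = Sell
        # (Sell -> short, -1, is an assumption; it could also mean "flatten")
        self.position = {0: self.position, 1: 1, 2: -1}[action]
        self.t += 1
        reward = self.position * self._bar_return()
        done = self.t >= self.episode_len
        return self._observe(), reward, done, {}

    def _bar_return(self):
        return (self.prices[self.t] - self.prices[self.t - 1]) / self.prices[self.t - 1]

    def _observe(self):
        obs = np.zeros(10, dtype=np.float32)    # real features per the table below
        obs[5] = self.position                  # position feature
        obs[7] = 1.0 - self.t / self.episode_len  # steps_left feature
        return obs
```

Any agent that speaks the `reset`/`step` protocol can drive this loop; the environment never needs to know which algorithm is on the other side.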
Observation space
The observation is a fixed-length vector the agent sees at each step. It must contain enough information for the agent to make good decisions, but not so much that training is slow.
| Feature | Index | Description | Range |
|---|---|---|---|
| ret_1 | 0 | 1-bar return | ≈ [−0.1, +0.1] |
| ret_5 | 1 | 5-bar return | ≈ [−0.2, +0.2] |
| ret_20 | 2 | 20-bar return | ≈ [−0.4, +0.4] |
| vol_10 | 3 | 10-bar rolling volatility | [0, 0.05] |
| rsi_14 | 4 | RSI normalised to [−1, +1] | [−1, +1] |
| position | 5 | Current position (−1, 0, 1) | {−1, 0, 1} |
| upnl | 6 | Unrealised P&L | ≈ [−0.3, +0.3] |
| steps_left | 7 | Fraction of episode remaining | [0, 1] |
| max_dd | 8 | Max drawdown this episode | [−1, 0] |
| since_trade | 9 | Bars since last trade (normalised) | [0, 1] |
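The market features in the table can be computed from a raw price series. The sketch below shows one plausible implementation for the first five, assuming pandas; the exact formulas (simple-moving-average RSI, rescaling RSI via `rsi/50 − 1`) are assumptions, not confirmed by the text.

```python
import numpy as np
import pandas as pd

def market_features(prices: pd.Series) -> np.ndarray:
    """Compute ret_1, ret_5, ret_20, vol_10, rsi_14 for the latest bar (a sketch)."""
    ret = prices.pct_change()
    ret_1  = prices.pct_change(1).iloc[-1]
    ret_5  = prices.pct_change(5).iloc[-1]
    ret_20 = prices.pct_change(20).iloc[-1]
    vol_10 = ret.rolling(10).std().iloc[-1]          # 10-bar rolling volatility

    # Classic RSI with a 14-bar simple moving average of gains/losses
    delta = prices.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rsi = 100 - 100 / (1 + gain / loss)              # RSI in [0, 100]
    rsi_norm = rsi.iloc[-1] / 50.0 - 1.0             # rescaled to [-1, +1]

    return np.array([ret_1, ret_5, ret_20, vol_10, rsi_norm], dtype=np.float32)
```

The remaining five features (position, unrealised P&L, steps left, max drawdown, bars since last trade) come from the environment's own bookkeeping rather than the price series.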
Normalisation is critical. Neural networks work best when inputs are near zero with unit variance. We normalise each feature to [−1, +1] or clip at ±3σ. Un-normalised inputs cause exploding or vanishing gradients.
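A minimal sketch of the clipping approach, assuming the feature means and standard deviations are estimated on the training set only (to avoid look-ahead):

```python
import numpy as np

def normalise(x, mean, std, clip_sigma=3.0):
    """Standardise a feature to roughly zero mean / unit variance, then clip at ±3σ."""
    z = (x - mean) / (std + 1e-8)   # epsilon guards against zero-variance features
    return np.clip(z, -clip_sigma, clip_sigma)
```

A feature five standard deviations from its mean is capped at 3, so a single extreme bar cannot blow up the gradient.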
Episode walkthrough
Let's step through one episode bar by bar. The agent uses a random policy here; real DQN training uses learned Q-values.
[Figure: price chart with position overlay; green background = long position]
Episode length. We use 252-bar episodes: one trading year of daily data, or roughly ten days of hourly data in a round-the-clock market. Longer episodes give the agent more context but slow training; too short and the agent never sees a full market cycle.
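One common way to generate 252-bar episodes from a longer history is to slice a random contiguous window per episode; the random start offset is an assumption here, since the text only fixes the episode length.

```python
import numpy as np

def sample_episode(prices, episode_len=252, rng=None):
    """Pick a random contiguous episode_len-bar window from the full price history."""
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(prices) - episode_len)  # leave room for a full window
    return prices[start:start + episode_len]
```

Randomising the start keeps successive episodes from being identical, so the agent sees many different market regimes from the same dataset.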
Reward calculation in detail
```python
# At each step t:
price_return     = (price[t] - price[t-1]) / price[t-1]
position_return  = position[t] * price_return                  # 0 if flat, ±return if long/short
transaction_cost = abs(position[t] - position[t-1]) * lam_tc   # lam_tc = 0.001
reward = (position_return
          - transaction_cost
          - lam_dd * max(0, max_portfolio - portfolio[t]) / max_portfolio)
```
Transaction costs matter enormously. Even λtc=0.001 (0.1% per trade) can eliminate the profitability of a high-frequency strategy. The agent must learn to hold positions long enough to cover costs.
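The arithmetic behind that claim is worth spelling out. A round trip (open a position, then close it) moves the position by two units, so it costs 2 × λtc = 0.2%. The trading frequency below is an illustrative assumption:

```python
lam_tc = 0.001                    # 0.1% per unit of position change
round_trip_cost = 2 * lam_tc      # open then close: 0.2% per round trip
trades_per_year = 252             # one round trip per daily bar (illustrative)
annual_cost = round_trip_cost * trades_per_year   # ~50% of capital per year
```

A strategy that flips position every bar must therefore earn well over 0.2% per round trip just to break even, which is why the agent has to learn to hold.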
Drawdown penalty. Penalising the current drawdown from peak portfolio value teaches the agent to protect capital. Without it, the agent may happily ride a 50% drawdown expecting a recovery.
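The penalty requires tracking the running portfolio peak. A minimal sketch of that bookkeeping (the helper name is illustrative):

```python
def update_drawdown(portfolio_value, peak):
    """Update the running peak and return the current drawdown from it."""
    peak = max(peak, portfolio_value)
    drawdown = (peak - portfolio_value) / peak   # 0 at a new peak, positive below it
    return peak, drawdown
```

A portfolio that peaked at 1.2 and has fallen to 0.9 has a drawdown of 0.3 / 1.2 = 25%, and the agent is penalised in proportion at every step it stays underwater.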