
Lesson 02

Reward Design

The reward function defines what the agent optimises. A poorly designed reward leads to unexpected — and often catastrophic — trading behaviour.

Why reward design is the hardest part

Goodhart's Law applied to RL: when a measure becomes a target, it ceases to be a good measure. The agent will find any loophole in your reward function with surgical precision. It has no common sense — only the reward signal.

Reward hacking examples. Reward = number of trades executed → the agent churns the portfolio thousands of times per episode. Reward = penalise losses only → the agent never opens a position (a guaranteed 0% return). Reward = raw P&L → the agent takes on extreme leverage until a single crash wipes out the account.
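To make the second failure mode concrete, here is a minimal sketch (the per-step returns are hypothetical illustration numbers, not from the lesson) showing that under a reward that only penalises losses, a do-nothing policy dominates a profitable but occasionally losing one:

```python
# Illustrative sketch: a "penalise losses only" reward lets a do-nothing
# policy dominate. Per-step returns below are hypothetical example values.
def loss_only_reward(step_return: float) -> float:
    """Reward = 0 for gains, negative for losses (no upside signal)."""
    return min(step_return, 0.0)

active_returns = [0.02, -0.01, 0.03, -0.02, 0.04]   # net +6% over the episode
idle_returns   = [0.0] * 5                          # never opens a position

active_score = sum(loss_only_reward(r) for r in active_returns)
idle_score   = sum(loss_only_reward(r) for r in idle_returns)

print(active_score)  # negative: penalised despite being profitable overall
print(idle_score)    # zero: "optimal" under this reward
```

The idle policy scores strictly higher than the profitable one, so an agent optimising this signal learns to do nothing.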

Three reward designs

The choice of reward function is one of the most consequential design decisions in RL trading. Here are three common approaches, each with different trade-offs between simplicity, stability, and risk management.

Raw P&L. r(t) = (V(t) − V(t−1)) / V(t−1). Simple and direct but rewards risk-taking: a 10× leveraged win is 10× as good as a prudent win, even if the risk of ruin is enormous.
Sharpe-adjusted. Reward the agent proportionally to its risk-adjusted return. At each episode end, compute r_sharpe = mean(returns) / std(returns) × √252. Encourages consistent profits over volatile windfalls.
Combined. r(t) = r_pnl(t) − λdd · dd(t) − λtc · |Δpos(t)|. Combines per-step P&L with a drawdown penalty and transaction cost penalty. λ values are hyperparameters you tune.
r_raw(t)      = ΔV(t) / V(t−1)                        ← per-step return
r_sharpe      = μ_ret / σ_ret × √N                    ← episode Sharpe (annualised)
r_combined(t) = r_raw(t)
              − λ_dd × max_drawdown(t)                ← penalise drawdown
              − λ_tc × |position(t) − position(t−1)|  ← penalise churning
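The three formulas above can be sketched in Python. This is a minimal illustration, not a tuned implementation: the λ values and the example inputs are assumptions for demonstration, and `drawdown` is passed in rather than computed from a price history.

```python
import math

def r_raw(v_now: float, v_prev: float) -> float:
    """Per-step return: change in portfolio value, normalised."""
    return (v_now - v_prev) / v_prev

def r_sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Episode-end Sharpe: annualised mean/std of per-step returns."""
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)  # population std
    return mean / std * math.sqrt(periods_per_year) if std > 0 else 0.0

def r_combined(v_now, v_prev, drawdown, pos_now, pos_prev,
               lam_dd=0.5, lam_tc=0.005):
    """Per-step P&L minus drawdown and transaction-cost penalties.
    lam_dd / lam_tc are illustrative hyperparameter values, not tuned."""
    return (r_raw(v_now, v_prev)
            - lam_dd * drawdown
            - lam_tc * abs(pos_now - pos_prev))

# Example step: value rises 1%, 5% drawdown so far, position flips 0 → 1.
# Reward = 0.01 − 0.5 × 0.05 − 0.005 × 1 = −0.02: penalties dominate the gain.
print(r_combined(101.0, 100.0, drawdown=0.05, pos_now=1, pos_prev=0))
```

Note how a profitable step can still receive a negative reward once the penalties are applied; this is exactly the pressure that steers the agent away from drawdown-heavy, high-churn behaviour.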

Interactive reward builder

A random agent takes random Buy/Hold/Sell actions. Adjust the reward components to see how the cumulative reward signal changes. The agent learns from whatever signal you provide — make it count.

[Interactive: price chart with agent actions (▲ buy, ▼ sell, • hold); cumulative reward curve; sliders for the reward-component weights]
The reward landscape shapes learning. With high transaction cost penalty, the agent learns to hold positions longer. With high drawdown penalty, it learns to exit quickly when losing. Tune these before training.
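The transaction-cost effect can be seen in isolation with a small sketch (the position sequences and λ_tc value are hypothetical): a churning policy accumulates a far larger penalty than one that enters once and holds.

```python
# Illustrative: a transaction-cost penalty punishes churning.
# Position sequences and lam_tc are hypothetical example values.
def tc_penalty(positions, lam_tc):
    """Total transaction-cost penalty over an episode: λ_tc × Σ|Δposition|."""
    return lam_tc * sum(abs(b - a) for a, b in zip(positions, positions[1:]))

churner = [0, 1, 0, 1, 0, 1]   # flips position every bar: 5 trades
holder  = [0, 1, 1, 1, 1, 1]   # enters once, then holds: 1 trade

print(tc_penalty(churner, lam_tc=0.005))   # 5 flips penalised
print(tc_penalty(holder,  lam_tc=0.005))   # single entry, 5× smaller
```

Raising λ_tc widens this gap, which is why a high transaction-cost penalty pushes the agent towards holding positions longer.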

Sparse vs dense rewards

A sparse reward gives feedback only at the end of an episode (e.g., final P&L). A dense reward gives a signal every step. Dense rewards are much easier to learn from but risk encoding your biases into every step.

Our DQN uses dense rewards — a combined per-step signal. This is critical because a trading episode can be 252 bars (one trading year). Without per-step feedback, the agent cannot attribute a good or bad outcome to the individual decisions that caused it.
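The contrast can be sketched as follows (a minimal illustration with hypothetical per-step returns; the dense signal here is just the raw return, not the full combined reward):

```python
def dense_rewards(step_returns):
    """Dense: one reward per step, so credit assignment is immediate."""
    return list(step_returns)

def sparse_rewards(step_returns):
    """Sparse: zero everywhere, total episode P&L only at the final step."""
    rewards = [0.0] * len(step_returns)
    rewards[-1] = sum(step_returns)
    return rewards

episode = [0.01, -0.02, 0.03]        # hypothetical per-step returns
print(dense_rewards(episode))        # feedback at every bar
print(sparse_rewards(episode))       # one signal at episode end
```

With the sparse scheme, every decision before the final bar receives zero feedback; over a 252-bar episode the agent would have to untangle which of 252 actions produced the single terminal reward.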