Reinforcement learning frames every decision problem as an agent interacting with an environment. The Markov Decision Process is the mathematical foundation — states, actions, transitions, and rewards.
The RL feedback loop
The agent observes the current state, chooses an action, the environment transitions to a new state, and the agent receives a reward. This cycle repeats until the episode ends. The agent's goal is to maximise the total discounted reward Σ_t γ^t · r_t.
The Markov property. The transition to the next state depends only on the current state and action — not on any history before it: P(s_{t+1} | s_t, a_t, s_{t-1}, ...) = P(s_{t+1} | s_t, a_t). This simplification makes the problem tractable.
The five MDP components
A Markov Decision Process is defined by five components. Together they fully specify any sequential decision problem.
MDP = (S, A, T, R, γ)
S : state space — what the agent can observe
A : action space — the choices available each step
T(s'|s,a): transition — probability of landing in s' from (s,a)
R(s,a,s'): reward function — immediate signal after each transition
γ ∈ [0,1): discount factor — how much future rewards are worth today
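The five-tuple above maps naturally onto a small data structure. A minimal sketch, where the concrete states, actions, and reward values are illustrative assumptions (a two-state flat/long market, not the toy MDP from later in the document):

```python
from dataclasses import dataclass

# Container mirroring the tuple (S, A, T, R, γ) defined above.
@dataclass(frozen=True)
class MDP:
    states: list       # S: state space
    actions: list      # A: action space
    transition: dict   # T: (s, a) -> {s': probability}
    reward: dict       # R: (s, a, s') -> immediate reward
    gamma: float       # γ: discount factor in [0, 1)

# Illustrative two-state example: "flat" and "long".
mdp = MDP(
    states=["flat", "long"],
    actions=["hold", "buy"],
    transition={
        ("flat", "hold"): {"flat": 1.0},
        ("flat", "buy"):  {"long": 1.0},
        ("long", "hold"): {"long": 1.0},
        ("long", "buy"):  {"long": 1.0},
    },
    reward={
        ("flat", "buy", "long"): -0.1,  # assumed transaction cost on entry
        ("long", "hold", "long"): 0.5,  # assumed per-step gain while long
    },
    gamma=0.99,
)

# Sanity check: probabilities out of every (s, a) pair must sum to 1.
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in mdp.transition.values())
```

Any sequential decision problem that fits this container — however large S and A become — is an MDP.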
S — State Space. In trading: [price returns, position, unrealised P&L, volatility, step count]. Anything the agent needs to make decisions.
A — Action Space. Three discrete choices: 0 = Hold, 1 = Buy (go long), 2 = Sell/Close. Simple but sufficient for long-only trading.
R — Reward. Change in portfolio value minus transaction costs. Positive when the position made money, negative when it lost.
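The reward definition above — change in portfolio value minus transaction costs — is a one-liner. A minimal sketch with illustrative numbers:

```python
# Reward = change in portfolio value minus transaction costs.
# The portfolio values and cost below are illustrative.
def step_reward(pv_before: float, pv_after: float, cost: float) -> float:
    return (pv_after - pv_before) - cost

print(step_reward(10_000.0, 10_050.0, 5.0))  # → 45.0  (position made money)
print(step_reward(10_000.0, 9_980.0, 0.0))   # → -20.0 (position lost money)
```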
Discount factor γ. A reward one step in the future is worth γ times a reward now. γ = 0.99 gives an effective planning horizon of roughly 1/(1−γ) = 100 steps; γ = 0 is a myopic agent that only values the immediate reward. For trading, γ = 0.99 works well — trades play out over many bars.
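The horizon intuition is easy to verify: the weight on a reward t steps away is γ^t, and with γ = 0.99 that weight is still about 0.37 at t = 100 before falling toward zero.

```python
# Weight γ^t placed on a reward t steps in the future, for γ = 0.99.
gamma = 0.99
for t in (1, 10, 100, 500):
    print(t, round(gamma**t, 4))
```

The weight decays slowly over the first ~1/(1−γ) steps and is negligible well beyond that, which is the sense in which γ = 0.99 "cares about" rewards roughly 100 steps ahead.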
A toy trading MDP
Consider a simplified 1D market with 5 discrete price levels. The agent holds one of three positions: Flat, Long, or Short. At each step it takes one of three actions, and different actions lead to different state transitions and rewards.
Toy market — 5 price levels, 3 positions, 3 actions
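The toy market above can be sketched as a tiny environment. The price dynamics (a random ±1 walk clipped to the 5 levels), the reward scale, and the convention that Sell/Close opens a short from flat are all assumptions made for illustration — the original demo is interactive:

```python
import random

# Toy MDP sketch: 5 price levels (0..4), positions Flat/Long/Short,
# actions 0 = Hold, 1 = Buy, 2 = Sell/Close.
FLAT, LONG, SHORT = 0, 1, -1

class ToyMarket:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.price = 2          # start mid-range
        self.position = FLAT

    def step(self, action):
        old_price = self.price
        # Assumed dynamics: price takes a random ±1 step, clipped to 0..4.
        self.price = max(0, min(4, self.price + self.rng.choice([-1, 1])))
        # Reward: P&L on the position held over this bar.
        reward = self.position * (self.price - old_price)
        if action == 1:         # Buy: go long
            self.position = LONG
        elif action == 2:       # Sell/Close: close a long, or short from flat
            self.position = SHORT if self.position == FLAT else FLAT
        return (self.price, self.position), reward

env = ToyMarket()
state, r = env.step(1)   # buy
state, r = env.step(0)   # hold; reward is now the one-bar price change
```

Each `step` call is one turn of the MDP loop: observe (price, position), act, transition, receive a reward.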
In real trading the state space is enormous — price history windows, multiple technical indicators, current position size, portfolio value, time of day. The MDP framework handles all of these uniformly.
Value functions
Rather than asking 'what's the best action right now?', RL asks 'what's the best policy — a mapping from every state to the best action?' Value functions answer this.
Q(s,a) is the key quantity. It answers: 'If I'm in state s and take action a, what total discounted reward can I expect?' The DQN (Deep Q-Network) is a neural network that approximates the optimal Q*(s,a) directly from raw observations.
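A tabular sketch of what the DQN approximates with a network: Q(s,a) is nudged toward the Bellman target r + γ·max_a′ Q(s′,a′) after each transition. The states, rewards, and learning rate below are illustrative assumptions:

```python
from collections import defaultdict

# Tabular Q-learning update toward the Bellman target
#   target = r + γ · max_a' Q(s', a').
# DQN replaces this table with a neural network over raw observations.
gamma, alpha = 0.99, 0.1
Q = defaultdict(float)   # Q[(state, action)] -> value estimate, default 0

def q_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = (0, 1, 2)      # Hold, Buy, Sell/Close
q_update("flat", 1, -0.1, "long", actions)  # buying cost a small fee
q_update("long", 0, 0.5, "long", actions)   # holding the position paid off
```

With all estimates starting at zero, the first update moves Q("flat", Buy) to −0.01 and the second moves Q("long", Hold) to 0.05 — the table slowly absorbs which state-action pairs pay off.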