Reinforcement learning frames every decision problem as an agent interacting with an environment. The Markov Decision Process is the mathematical foundation — states, actions, transitions, and rewards.
The RL feedback loop
The agent observes the current state, chooses an action, the environment transitions to a new state, and the agent receives a reward. This cycle repeats until the episode ends. The agent's goal is to maximise the total discounted reward Σ_t γ^t · r_t.
The Markov property. The transition to the next state depends only on the current state and action — not on any history before it: P(s_{t+1} | s_t, a_t, s_{t-1}, ...) = P(s_{t+1} | s_t, a_t). This simplification makes the problem tractable.
The five MDP components
A Markov Decision Process is defined by five components. Together they fully specify any sequential decision problem.
MDP = (S, A, T, R, γ)
S : state space — what the agent can observe
A : action space — the choices available each step
T(s'|s,a): transition — probability of landing in s' from (s,a)
R(s,a,s'): reward function — immediate signal after each transition
γ ∈ [0,1): discount factor — how much future rewards are worth today
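The five-tuple above maps naturally onto a small data structure. A minimal sketch, where the concrete states, actions, and reward values are illustrative assumptions (a two-state flat/long market, not the toy MDP from later in the document):

```python
from dataclasses import dataclass

# Container mirroring the tuple (S, A, T, R, γ) defined above.
@dataclass(frozen=True)
class MDP:
    states: list       # S: state space
    actions: list      # A: action space
    transition: dict   # T: (s, a) -> {s': probability}
    reward: dict       # R: (s, a, s') -> immediate reward
    gamma: float       # γ: discount factor in [0, 1)

# Illustrative two-state example: "flat" and "long".
mdp = MDP(
    states=["flat", "long"],
    actions=["hold", "buy"],
    transition={
        ("flat", "hold"): {"flat": 1.0},
        ("flat", "buy"):  {"long": 1.0},
        ("long", "hold"): {"long": 1.0},
        ("long", "buy"):  {"long": 1.0},
    },
    reward={
        ("flat", "buy", "long"): -0.1,  # assumed transaction cost on entry
        ("long", "hold", "long"): 0.5,  # assumed per-step gain while long
    },
    gamma=0.99,
)

# Sanity check: probabilities out of every (s, a) pair must sum to 1.
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in mdp.transition.values())
```

Any sequential decision problem that fits this container — however large S and A become — is an MDP.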
S — State Space. In trading: [price returns, position, unrealised P&L, volatility, step count]. Anything the agent needs to make decisions.
A — Action Space. Three discrete choices: 0 = Hold, 1 = Buy (go long), 2 = Sell/Close. Simple but sufficient for long-only trading.
R — Reward. Change in portfolio value minus transaction costs. Positive when the position made money, negative when it lost.
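The reward definition above — change in portfolio value minus transaction costs — is a one-liner. A minimal sketch with illustrative numbers:

```python
# Reward = change in portfolio value minus transaction costs.
# The portfolio values and cost below are illustrative.
def step_reward(pv_before: float, pv_after: float, cost: float) -> float:
    return (pv_after - pv_before) - cost

print(step_reward(10_000.0, 10_050.0, 5.0))  # → 45.0  (position made money)
print(step_reward(10_000.0, 9_980.0, 0.0))   # → -20.0 (position lost money)
```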
Discount factor γ. A reward one step in the future is worth γ times a reward now. γ = 0.99 gives an effective planning horizon of roughly 1/(1−γ) = 100 steps; γ = 0 is a myopic agent that only values the immediate reward. For trading, γ = 0.99 works well — trades play out over many bars.
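The horizon intuition is easy to verify: the weight on a reward t steps away is γ^t, and with γ = 0.99 that weight is still about 0.37 at t = 100 before falling toward zero.

```python
# Weight γ^t placed on a reward t steps in the future, for γ = 0.99.
gamma = 0.99
for t in (1, 10, 100, 500):
    print(t, round(gamma**t, 4))
```

The weight decays slowly over the first ~1/(1−γ) steps and is negligible well beyond that, which is the sense in which γ = 0.99 "cares about" rewards roughly 100 steps ahead.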
A toy trading MDP
Consider a simplified 1D market with 5 discrete price levels. The agent holds one of three positions: Flat, Long, or Short. At each step it takes one of three actions, and different actions lead to different state transitions and rewards.
Toy market — 5 price levels, 3 positions, 3 actions
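The toy market above can be sketched as a tiny environment. The price dynamics (a random ±1 walk clipped to the 5 levels), the reward scale, and the convention that Sell/Close opens a short from flat are all assumptions made for illustration — the original demo is interactive:

```python
import random

# Toy MDP sketch: 5 price levels (0..4), positions Flat/Long/Short,
# actions 0 = Hold, 1 = Buy, 2 = Sell/Close.
FLAT, LONG, SHORT = 0, 1, -1

class ToyMarket:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.price = 2          # start mid-range
        self.position = FLAT

    def step(self, action):
        old_price = self.price
        # Assumed dynamics: price takes a random ±1 step, clipped to 0..4.
        self.price = max(0, min(4, self.price + self.rng.choice([-1, 1])))
        # Reward: P&L on the position held over this bar.
        reward = self.position * (self.price - old_price)
        if action == 1:         # Buy: go long
            self.position = LONG
        elif action == 2:       # Sell/Close: close a long, or short from flat
            self.position = SHORT if self.position == FLAT else FLAT
        return (self.price, self.position), reward

env = ToyMarket()
state, r = env.step(1)   # buy
state, r = env.step(0)   # hold; reward is now the one-bar price change
```

Each `step` call is one turn of the MDP loop: observe (price, position), act, transition, receive a reward.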
In real trading the state space is enormous — price history windows, multiple technical indicators, current position size, portfolio value, time of day. The MDP framework handles all of these uniformly.
Value functions
Rather than asking 'what's the best action right now?', RL asks 'what's the best policy — a mapping from every state to the best action?' Value functions answer this.
Q(s,a) is the key quantity. It answers: 'If I'm in state s and take action a, what total discounted reward can I expect?' The DQN (Deep Q-Network) is a neural network that approximates the optimal Q*(s,a) directly from raw observations.
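A tabular sketch of what the DQN approximates with a network: Q(s,a) is nudged toward the Bellman target r + γ·max_a′ Q(s′,a′) after each transition. The states, rewards, and learning rate below are illustrative assumptions:

```python
from collections import defaultdict

# Tabular Q-learning update toward the Bellman target
#   target = r + γ · max_a' Q(s', a').
# DQN replaces this table with a neural network over raw observations.
gamma, alpha = 0.99, 0.1
Q = defaultdict(float)   # Q[(state, action)] -> value estimate, default 0

def q_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = (0, 1, 2)      # Hold, Buy, Sell/Close
q_update("flat", 1, -0.1, "long", actions)  # buying cost a small fee
q_update("long", 0, 0.5, "long", actions)   # holding the position paid off
```

With all estimates starting at zero, the first update moves Q("flat", Buy) to −0.01 and the second moves Q("long", Hold) to 0.05 — the table slowly absorbs which state-action pairs pay off.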