
Lesson 05

Exploration Strategies

A purely greedy agent exploits its current knowledge and never discovers better strategies. Exploration is how the agent escapes local optima — but too much exploration wastes time on known-bad actions.

The explore-exploit dilemma

Every step the agent faces a choice: exploit (pick the best known action) or explore (try something different that might be better). This fundamental tension defines much of RL research.

In trading this is concrete. Should the agent try a new position sizing strategy it hasn't seen reward data for? Or stick with the strategy that worked last week? Explore too little and you miss better strategies. Explore too much and you rack up losses on bad experiments.

ε-greedy

The simplest strategy. With probability ε, pick a random action. With probability 1−ε, pick the best known action. ε decays over training as the agent learns.
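The selection rule itself fits in a few lines. A minimal sketch (function name and the use of a plain Q-value list are illustrative, not from the lesson's codebase):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action; otherwise pick the
    action with the highest estimated Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```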

[Chart: ε decay schedules over 1000 training steps]

In our DQN we use exponential decay: ε = ε_min + (ε_max − ε_min) × exp(−steps_done / ε_decay). This gives dense exploration at the start and almost pure exploitation by the end.
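The decay formula translates directly to code. A sketch with hypothetical hyperparameter values (the lesson's actual ε_max, ε_min, and ε_decay constants are not given):

```python
import math

EPS_MAX, EPS_MIN, EPS_DECAY = 1.0, 0.05, 200  # hypothetical values

def epsilon_at(steps_done):
    """Exponential decay from EPS_MAX toward EPS_MIN as training progresses."""
    return EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-steps_done / EPS_DECAY)
```

Early in training `epsilon_at` is near 1.0 (dense exploration); after many multiples of EPS_DECAY steps it asymptotes to EPS_MIN.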

Multi-armed bandit demo

Four arms have hidden reward distributions. The agent can only learn by pulling — each pull returns a noisy sample. Run both strategies and compare how they explore before revealing the truth.

How to use: Click Run 10× on each strategy a few times, then hit Reveal true means to see how close each came. The solid line is the current estimate; the shaded band is the 95% confidence interval. Narrower band = more pulls = more certain.
ε-greedy — 30% random pulls, 70% exploit best known
UCB — always pulls the arm with the highest potential upside
UCB (Upper Confidence Bound) picks the arm whose reward could be the highest, accounting for uncertainty. Arms with fewer pulls have wider confidence intervals, so UCB naturally tries them first. This is 'optimism in the face of uncertainty' — and it wastes fewer pulls than random exploration.
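The UCB1 rule can be sketched as follows: each arm's score is its empirical mean plus an uncertainty bonus that shrinks as the arm is pulled more (the exploration constant `c` is a tunable assumption, not a value from the lesson):

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """Pick the arm maximizing mean + sqrt(c * ln(t) / n).

    counts[a]: pulls of arm a so far; means[a]: empirical mean reward;
    t: total pulls so far. Untried arms get priority (infinite-width interval)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # optimism: never-pulled arms are tried first
    scores = [m + math.sqrt(c * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Note how an arm with a slightly lower mean but far fewer pulls can still win: its wider confidence bonus represents the chance that its true mean is higher.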

Noisy exploration and practical tips

Beyond ε-greedy, two other approaches are widely used. NoisyNets add learnable Gaussian noise directly to the network weights — the agent can modulate its own exploration. Boltzmann (softmax) exploration picks actions proportional to exp(Q/T), where temperature T controls randomness. For trading, ε-greedy with careful annealing tends to work reliably and is easy to tune.
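Boltzmann exploration is also only a few lines. A sketch (subtracting the max Q before exponentiating is a standard numerical-stability trick, not something the lesson specifies):

```python
import math
import random

def boltzmann_select(q_values, temperature):
    """Sample an action with probability proportional to exp(Q / T).

    High T -> near-uniform random; low T -> near-greedy."""
    m = max(q_values)  # shift for numerical stability; doesn't change probabilities
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]
```

Unlike ε-greedy, which explores uniformly at random, Boltzmann exploration spends its exploration budget on actions whose Q-values are close to the best, rather than on clearly bad ones.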

NoisyNets. Add learnable Gaussian noise to network weights. The network can adjust its own noise level — high noise in uncertain states, low noise in well-understood ones. No ε to tune.
Practical advice for trading. Start with ε=1.0 (pure random). Anneal to ε=0.05 over the first 50% of training. Never go to ε=0 completely — keep 5% exploration to handle non-stationarity as market regimes shift.
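That schedule (anneal 1.0 → 0.05 over the first half of training, then hold) can be sketched as a simple linear ramp; the function name and signature are illustrative:

```python
def annealed_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05,
                     anneal_frac=0.5):
    """Linearly anneal eps_start -> eps_end over the first anneal_frac of
    training, then hold at eps_end (never zero, so the agent keeps probing
    as market regimes shift)."""
    anneal_steps = int(total_steps * anneal_frac)
    if step >= anneal_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * step / anneal_steps
```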