Imagine teaching a child to ride a bicycle. You don't hand them a textbook on balance physics or a labelled dataset of successful rides. You let them try, fall, adjust, and try again. Success — staying upright — is its own reward. This trial-and-error process, guided by feedback from the environment, is the intuition behind reinforcement learning (RL).

The RL Framework

Reinforcement learning formalises this intuition as an interaction between an agent and an environment. At each time step, the agent observes the current state of the environment, selects an action, receives a reward signal, and transitions to a new state. The agent's goal is to learn a policy — a mapping from states to actions — that maximises the expected cumulative discounted reward over time.
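The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard API: the CorridorEnv class, its reset/step methods, the always_right policy, and the discount factor of 0.9 are all assumptions made for the example.

```python
class CorridorEnv:
    """Hypothetical toy environment: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left; walls clamp the move
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def run_episode(env, policy, max_steps=100):
    """Observe state, select action, receive reward, move to the next state."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def always_right(state):
    """A fixed policy (a mapping from states to actions) used for illustration."""
    return 1


rewards = run_episode(CorridorEnv(), always_right)
print(rewards, discounted_return(rewards, gamma=0.9))
```

With this fixed policy the agent needs four steps to reach the goal, so the single reward of 1.0 is discounted three times. A learning algorithm would replace always_right with a policy that is adjusted to increase this quantity in expectation.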

This framework, known formally as a Markov Decision Process (MDP), is remarkably general. The "environment" can be a board game, a robotic arm, a simulated financial market, or a data centre cooling system. The "reward" can be winning a game, lifting a weight, earning profit, or reducing energy consumption. The power of RL lies in this generality: the same algorithmic framework can be applied to radically different problems.

Value-Based Methods: Q-Learning

Q-learning, introduced by Christopher Watkins in 1989, learns the value of taking a specific action in a specific state: the Q-function. The Q-function Q(s, a) represents the expected cumulative discounted reward of taking action a in state s and following the optimal policy thereafter. Once the Q-function is known, the optimal policy is simply to take the action with the highest Q-value in each state.
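A minimal sketch of tabular Q-learning follows. The five-state corridor, the epsilon-greedy exploration rule, and the hyperparameters (alpha, gamma, epsilon, number of episodes) are assumptions chosen for illustration; only the update rule itself is the standard Q-learning step.

```python
import random

N_STATES, N_ACTIONS = 5, 2      # hypothetical corridor: states 0..4, 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Toy deterministic dynamics; reward 1.0 on reaching the goal state."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(q[state])
                ties = [a for a in range(N_ACTIONS) if q[state][a] == best]
                action = rng.choice(ties)
            nxt, reward, done = step(state, action)
            # target: immediate reward plus discounted value of the best next action
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q = q_learning()
# the greedy policy takes the action with the highest Q-value in each state
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Reading the policy off the table in the last two lines is exactly the step described above: once the Q-values are learned, acting well reduces to an argmax per state.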

Deep Q-Networks (DQN), introduced by DeepMind in 2013 and extended in a 2015 Nature paper, brought Q-learning to high-dimensional state spaces by using a neural network to approximate the Q-function. Applied to Atari video games, with raw pixel inputs as the state, the 2015 agent reached performance comparable to a human tester (at least 75% of the human score) on 29 of 49 games without any game-specific programming. This demonstrated that RL agents could learn complex, abstract strategies from raw sensory data, a landmark result that attracted enormous attention from the broader AI community.

Policy Gradient Methods

An alternative approach learns the policy directly, without first estimating a value function. Policy gradient methods parameterise the policy as a neural network and optimise its parameters directly to maximise expected reward, using gradient ascent. The REINFORCE algorithm computes gradients by sampling trajectories — sequences of states, actions, and rewards — and updating the policy to increase the probability of actions that led to high rewards.
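The REINFORCE update can be sketched on the simplest possible case, a two-armed bandit in which every trajectory is a single action followed by a single reward. The bandit, its payout probabilities, the softmax parameterisation (a table of action preferences standing in for a neural network), and the learning rate are all assumptions made for this illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical two-armed bandit with one-step episodes."""
    rng = random.Random(seed)
    pay_prob = [0.2, 0.8]       # assumed chance that each arm pays a reward of 1.0
    theta = [0.0, 0.0]          # policy parameters: one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        # sample a (one-step) trajectory from the current policy
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < pay_prob[action] else 0.0
        # gradient of log pi(action) w.r.t. theta[k] is 1[k == action] - probs[k]
        for k in range(2):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log   # gradient ascent on expected reward
    return softmax(theta)


probs = reinforce_bandit()
print(probs)
```

Actions that were followed by a reward have their probability increased, which is the mechanism the paragraph describes. In a multi-step problem the reward in the update would be replaced by the discounted return observed from that step onward.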

Actor-critic methods combine value-based and policy gradient approaches: the actor is the policy network that selects actions; the critic is a value network that estimates how good each state (or state-action pair) is, and its estimates guide the actor's updates. Proximal Policy Optimisation (PPO) and Soft Actor-Critic (SAC) are modern actor-critic algorithms that are widely used in practice, PPO largely for its training stability and SAC for its sample efficiency.
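A minimal sketch of the actor-critic idea, assuming a one-step temporal-difference critic and tables in place of neural networks. It is not PPO or SAC; the corridor environment, learning rates, and episode cap are illustrative assumptions.

```python
import math
import random

N_STATES, GOAL = 5, 4           # hypothetical corridor with states 0..4
GAMMA = 0.9


def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward 1.0 at the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def policy_probs(theta, state):
    """Actor: softmax over the two action preferences stored for this state."""
    m = max(theta[state])
    exps = [math.exp(p - m) for p in theta[state]]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic(episodes=2000, actor_lr=0.1, critic_lr=0.1, max_steps=200, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor parameters
    value = [0.0] * N_STATES                        # critic estimates of V(s)
    for _episode in range(episodes):
        state = 0
        for _t in range(max_steps):
            probs = policy_probs(theta, state)
            action = 0 if rng.random() < probs[0] else 1
            nxt, reward, done = step(state, action)
            # TD error: how much better the outcome was than the critic expected
            bootstrap = 0.0 if done else GAMMA * value[nxt]
            td_error = reward + bootstrap - value[state]
            value[state] += critic_lr * td_error            # critic update
            for k in range(2):                              # actor update
                grad_log = (1.0 if k == action else 0.0) - probs[k]
                theta[state][k] += actor_lr * td_error * grad_log
            if done:
                break
            state = nxt
    return theta, value


theta, value = actor_critic()
right_probs = [policy_probs(theta, s)[1] for s in range(GOAL)]
print([round(p, 2) for p in right_probs])
```

The critic's TD error takes the place of the sampled return used by REINFORCE, which typically lowers the variance of the policy update at the cost of some bias from the critic's own estimation error.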

AlphaGo and AlphaZero

DeepMind's AlphaGo demonstrated RL at its most dramatic. Go, an ancient Chinese board game with roughly 10^170 possible board positions, had been considered a uniquely human game, one that computers would not master for decades. AlphaGo combined deep learning (CNNs to evaluate board positions) with Monte Carlo Tree Search and RL to learn from millions of self-play games. In 2016, it defeated 18-time world champion Lee Sedol four games to one.

AlphaZero, its successor, generalised this approach: starting from random play with only the rules of the game, it mastered Go, chess, and shogi to superhuman levels within hours to days. AlphaZero's solutions are often strikingly alien — unconventional moves that human grandmasters initially dismissed as mistakes, then later came to appreciate as deep strategic insights.

RL in Financial Applications

Financial trading is a natural domain for RL: it involves sequential decisions under uncertainty, with delayed rewards and complex state dynamics. RL agents have been applied to optimal execution (minimising market impact when trading large orders), portfolio allocation (dynamically adjusting weights to maximise risk-adjusted returns), and options hedging (learning to hedge derivative positions in the presence of transaction costs and discrete rebalancing).

A key advantage of RL over classical methods in finance is its ability to incorporate transaction costs and market impact directly into the reward function, leading to policies that are genuinely executable rather than optimal only in theory. Challenges include designing realistic simulation environments for training, avoiding overfitting to historical data, and ensuring that learned policies remain stable across changing market regimes.
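One way to build trading costs into the reward is sketched below for the portfolio allocation case. The step_reward function, the proportional cost model, and the cost rate of 10 basis points are assumptions for illustration; real cost and market-impact models are usually more elaborate.

```python
def step_reward(prev_weights, new_weights, asset_returns, cost_rate=0.001):
    """Net one-period reward: gross portfolio return minus a proportional trading cost.

    cost_rate is an assumed cost per unit of turnover (0.001 = 10 basis points).
    """
    gross = sum(w * r for w, r in zip(new_weights, asset_returns))
    turnover = sum(abs(n - p) for n, p in zip(new_weights, prev_weights))
    return gross - cost_rate * turnover


# Rebalancing from 50/50 to 70/30 before a period with returns of +1% and -2%
rebalance = step_reward([0.5, 0.5], [0.7, 0.3], [0.01, -0.02])
# Holding the 50/50 portfolio over the same period incurs no trading cost
hold = step_reward([0.5, 0.5], [0.5, 0.5], [0.01, -0.02])
print(rebalance, hold)
```

Because the cost term scales with turnover, an agent trained on this reward is penalised for rebalancing too often, which is what pushes the learned policy toward trades that remain worthwhile after costs.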