A medical AI system predicts that a patient has a 73% probability of a particular condition. Should the physician order a biopsy? The answer depends critically on how reliable that 73% estimate is — whether it represents genuine calibrated uncertainty or simply the output of an overconfident neural network that has seen few similar cases. Standard deep learning models are notoriously poorly calibrated: they often produce extreme probability estimates with high confidence even on inputs far from their training distribution. Bayesian machine learning addresses this fundamental limitation.
The Bayesian Framework
Bayesian methods treat model parameters as probability distributions rather than point estimates. Instead of learning a single set of weights that minimises a loss function, Bayesian inference computes the posterior distribution over weights: given the training data, what is the probability distribution over model parameters? Predictions are then made by averaging over this posterior — integrating the predictions of all plausible models weighted by their posterior probability.
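As a minimal sketch of posterior averaging, consider a toy problem that is not from the text above: inferring the bias of a coin from observed flips. The grid over the parameter, the uniform prior, and the function names are all illustrative assumptions; real models have far too many parameters for a grid, which is why approximate inference is needed.

```python
# Toy illustration (hypothetical example): Bayesian inference for a coin's
# bias theta, with the posterior represented on a discrete grid.

def grid_posterior(heads, tails, grid_size=101):
    """Posterior over theta on an evenly spaced grid, assuming a uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalised posterior = prior (constant) * Bernoulli likelihood.
    unnorm = [(t ** heads) * ((1 - t) ** tails) for t in thetas]
    total = sum(unnorm)
    return thetas, [u / total for u in unnorm]

def posterior_predictive(thetas, weights):
    """P(next flip is heads | data): average theta under the posterior."""
    return sum(t * w for t, w in zip(thetas, weights))

thetas, post = grid_posterior(heads=7, tails=3)
p_heads = posterior_predictive(thetas, post)   # averages over all plausible thetas
point_estimate = 7 / 10                        # what a single best-fit value would predict
```

With 7 heads in 10 flips, the posterior-averaged prediction is approximately 0.667, slightly less extreme than the point estimate of 0.7, because the average also gives weight to other plausible values of the bias.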
This framework separates two types of uncertainty that a single point-estimate model cannot tell apart. Aleatoric uncertainty (from the Latin alea, "dice") is irreducible noise inherent in the data, such as a measurement sensor with limited precision; collecting more data does not remove it. Epistemic uncertainty (from the Greek episteme, "knowledge") reflects the model's ignorance due to limited data, and it shrinks as more relevant data arrives. A model should therefore be uncertain in regions where training data is sparse or absent. Bayesian methods quantify both kinds; a standard point-estimate model can at best represent noise in its outputs and has no mechanism for expressing epistemic uncertainty, so it tends to understate its total uncertainty.
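One common way to separate the two kinds of uncertainty for a classifier is an entropy decomposition over posterior samples: the entropy of the averaged prediction is the total uncertainty, the average entropy of the individual predictions is the aleatoric part, and the difference (the mutual information) is the epistemic part. The sketch below assumes we already have class probabilities from several posterior samples for one input; the numbers are invented for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(sample_probs):
    """Split predictive uncertainty for one input.

    sample_probs: one class-probability list per posterior sample.
    Returns (total, aleatoric, epistemic).
    """
    n = len(sample_probs)
    k = len(sample_probs[0])
    mean = [sum(s[c] for s in sample_probs) / n for c in range(k)]
    total = entropy(mean)                                  # entropy of the averaged prediction
    aleatoric = sum(entropy(s) for s in sample_probs) / n  # average entropy of each sample
    epistemic = total - aleatoric                          # disagreement between samples
    return total, aleatoric, epistemic

# Hypothetical case 1: every sample says "coin flip". Noise, not ignorance.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Hypothetical case 2: samples are individually confident but contradict each other.
disagree = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]]

total_a, alea_a, epi_a = decompose(agree)
total_d, alea_d, epi_d = decompose(disagree)
```

Both cases produce the same averaged prediction of 50/50, so a single model reporting only that average could not tell them apart. The decomposition attributes the first case almost entirely to aleatoric uncertainty and the second almost entirely to epistemic uncertainty.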
Gaussian Processes
Gaussian Processes (GPs) are the canonical Bayesian nonparametric approach to regression and classification. A GP places a prior distribution directly over functions: the function values at any finite set of input points are assumed to be jointly Gaussian, with a covariance structure defined by a kernel function. The kernel encodes assumptions about the function's smoothness, periodicity, and length scale.
Given training data, Bayesian updating yields the posterior GP: a distribution over functions consistent with both the prior and the observations. For regression with Gaussian observation noise, predictions at new points are themselves Gaussian, so the model provides not just a point estimate but a full predictive distribution with analytically computable mean and variance; classification requires approximate inference because the likelihood is no longer Gaussian. With common stationary kernels, the predictive variance automatically widens in regions with sparse training data, which is exactly the desired behaviour.
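The closed-form GP regression posterior can be sketched in a few dozen lines. The version below is a dependency-free illustration for scalar inputs, assuming a zero-mean prior, a squared-exponential kernel with fixed hyperparameters, and a small Gaussian noise term; all of these choices, and the function names, are assumptions made for the example. In practice one would use a linear-algebra library and fit the hyperparameters.

```python
import math

def rbf(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L x = b."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_lower_transposed(L, b):
    """Back substitution: solve L^T x = b."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-4):
    """Posterior mean and variance of the latent function at one test input."""
    n = len(x_train)
    # Kernel matrix of the training inputs, with noise variance on the diagonal.
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in x_train]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # K^{-1} y
    v = solve_lower(L, k_star)                                  # L^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(vi * vi for vi in v)
    return mean, var

# Hypothetical data: three noisy-free observations of sin(x).
x_train = [-1.0, 0.0, 1.0]
y_train = [math.sin(x) for x in x_train]
mean_near, var_near = gp_predict(x_train, y_train, 0.5)   # between observations
mean_far, var_far = gp_predict(x_train, y_train, 5.0)     # far from all observations
```

At the test point between observations the predictive variance is small, while at the distant point it returns to roughly the prior variance and the mean reverts toward the prior mean of zero. The Cholesky factorisation used here is also the step responsible for the cubic cost of exact GP inference.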
Exact GP inference is computationally expensive: time scales cubically, and memory quadratically, with the number of training points. Sparse GP approximations and stochastic variational inference methods have extended their applicability to datasets with millions of points. In finance, GPs are used for return prediction, volatility surface fitting, and Bayesian optimisation of hyperparameters and strategy parameters.
Bayesian Neural Networks
Bayesian neural networks (BNNs) apply the Bayesian framework to deep learning: placing prior distributions over neural network weights and performing approximate inference. Exact Bayesian inference is computationally intractable for large networks; approximate methods include variational inference (fitting a tractable approximate posterior), Monte Carlo Dropout (using dropout at test time to sample from an approximate posterior), and deep ensembles (training multiple networks with different random initialisations and treating them as samples from the posterior).
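The mechanics of Monte Carlo Dropout can be illustrated with a deliberately tiny network. The weights below are made-up constants rather than trained values, so the resulting numbers carry no meaning; the sketch only shows the procedure of keeping dropout active at prediction time and summarising many stochastic forward passes.

```python
import random
import statistics

# Hypothetical fixed weights for a 1-input, 4-hidden-unit, 1-output network.
W1 = [0.8, -0.5, 1.2, 0.3]
B1 = [0.0, 0.1, -0.2, 0.05]
W2 = [0.6, -0.4, 0.9, 0.2]

def forward(x, rng, drop_prob=0.5):
    """One stochastic forward pass with dropout left ON at test time."""
    keep = 1.0 - drop_prob
    out = 0.0
    for w1, b1, w2 in zip(W1, B1, W2):
        h = max(0.0, w1 * x + b1)        # ReLU hidden unit
        if rng.random() < keep:          # keep this unit with probability `keep`
            out += w2 * h / keep         # inverted-dropout rescaling
    return out

def mc_dropout_predict(x, n_samples=200, seed=0):
    """Mean and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

mean, std = mc_dropout_predict(1.0)
```

Each pass uses a different random subset of hidden units, so the spread of the outputs acts as an approximate measure of uncertainty over the weights. In a deep learning framework the same effect is obtained by leaving the dropout layers in training mode during prediction.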
Deep ensembles are surprisingly effective in practice: a collection of 5–10 independently trained networks, combined through averaging, often provides well-calibrated uncertainty estimates that are competitive with more principled Bayesian approaches. Training and inference cost grow linearly with the number of members, but the method needs no change to the training procedure and parallelises trivially. This has made ensembles a common practical choice for uncertainty quantification in production ML systems.
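For regression, one way to combine ensemble members is to treat them as an equally weighted mixture. The sketch below assumes each member outputs a predictive mean and variance for the same input (an assumption about the network design, not something stated above); the member outputs are invented numbers.

```python
def combine_ensemble(means, variances):
    """Mean and variance of an equally weighted mixture of member predictions."""
    m = len(means)
    mu = sum(means) / m
    # Law of total variance: average member variance + variance of member means.
    var = sum(v + mu_i ** 2 for mu_i, v in zip(means, variances)) / m - mu ** 2
    return mu, var

# Hypothetical outputs of five independently trained networks for one input.
member_means = [1.0, 1.2, 0.8, 1.1, 0.9]
member_vars = [0.04, 0.04, 0.04, 0.04, 0.04]

mu, var = combine_ensemble(member_means, member_vars)
```

The combined variance exceeds the average member variance by exactly the spread of the member means, so disagreement between networks shows up directly as extra predictive uncertainty.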
Calibration and Reliability
Calibration, the alignment between predicted probabilities and empirical frequencies, is a measurable property. For a well-calibrated binary classifier, about 80% of the cases to which it assigns 80% probability should turn out positive. Calibration plots (reliability diagrams) and metrics such as Expected Calibration Error (ECE) measure this alignment. Standard neural networks are typically overconfident; temperature scaling, a simple post-processing step that divides logits by a single scalar learned on a held-out validation set, substantially improves calibration at negligible computational cost and without changing the predicted class. It is a practical first step for any production ML system.
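Both ideas fit in a short sketch: ECE with equal-width confidence bins, and temperature scaling fitted by minimising negative log-likelihood. The grid search over candidate temperatures, the bin count, and the toy logits are assumptions chosen for brevity; gradient-based fitting on a real validation set is the more usual approach.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(accuracy - avg_conf)
    return ece

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                 # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(logits_list, labels, candidates=None):
    """Pick the candidate temperature with the lowest negative log-likelihood."""
    candidates = candidates or [0.5 + 0.1 * i for i in range(46)]   # 0.5 to 5.0
    def nll(t):
        return -sum(math.log(softmax(z, t)[y]) for z, y in zip(logits_list, labels))
    return min(candidates, key=nll)

# Hypothetical validation set: the model is about 98% confident in class 0
# on every example, but is right only 7 times out of 10.
val_logits = [[4.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

def evaluate(temperature):
    probs = [softmax(z, temperature) for z in val_logits]
    confs = [max(p) for p in probs]
    hits = [p.index(max(p)) == y for p, y in zip(probs, val_labels)]
    return expected_calibration_error(confs, hits)

ece_before = evaluate(1.0)
T = fit_temperature(val_logits, val_labels)
ece_after = evaluate(T)   # same data reused here only to keep the example short
```

On this toy data the fitted temperature is well above 1, which softens the overconfident probabilities toward the observed 70% accuracy and sharply reduces the ECE. Because dividing all logits by a positive scalar preserves their ordering, the predicted classes and the accuracy are unchanged.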
Applications in High-Stakes Settings
Uncertainty quantification is most valuable in high-stakes, safety-critical applications. In medicine, well-calibrated uncertainty allows AI systems to flag cases where they lack confidence for human review, rather than making overconfident predictions. In finance, uncertainty estimates enable more sophisticated risk management: positions can be sized in proportion to the model's confidence, and extreme positions avoided in low-data regimes. In autonomous systems, epistemic uncertainty can trigger conservative behaviour (slowing down, requesting human intervention) in unfamiliar situations — a crucial safety property.
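Two of these decision rules can be sketched as follows. The function names, thresholds, and the particular shrinkage formula are hypothetical choices for illustration only; appropriate values depend on the application and would need to be validated against real outcomes.

```python
def route_prediction(probability, epistemic_uncertainty, max_uncertainty=0.15):
    """Defer to a human when epistemic uncertainty exceeds a chosen threshold."""
    if epistemic_uncertainty > max_uncertainty:
        return "defer"
    return "positive" if probability >= 0.5 else "negative"

def position_size(signal, predictive_std, max_position=1.0, risk_scale=0.1):
    """Shrink a position as predictive uncertainty grows (hypothetical rule)."""
    confidence = risk_scale / (risk_scale + predictive_std)   # 1 when std is 0, tends to 0 as std grows
    raw = signal * confidence * max_position
    return max(-max_position, min(max_position, raw))         # cap the position size

# The 73% prediction from the opening example, under two uncertainty levels.
confident_case = route_prediction(0.73, epistemic_uncertainty=0.05)
uncertain_case = route_prediction(0.73, epistemic_uncertainty=0.30)

full_size = position_size(signal=1.0, predictive_std=0.0)
half_size = position_size(signal=1.0, predictive_std=0.1)
```

The same 73% estimate leads to an automated decision in one case and a referral for human review in the other, which is the practical difference that a reliable uncertainty estimate makes.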