Federated Learning: Privacy-Preserving Machine Learning

Many of the most valuable datasets for machine learning are also the most sensitive. Medical records contain information we share only with our doctors. Financial transactions reveal our spending habits, income, and vulnerabilities. Mobile device data tracks our movements, communications, and daily routines. Privacy regulations and institutional data governance policies increasingly restrict the centralisation of such data — creating a fundamental tension between the data hunger of ML and the privacy rights of individuals.

Federated learning resolves this tension through a simple but powerful insight: instead of bringing data to the model, bring the model to the data.

How Federated Learning Works

In conventional centralised training, data from all sources is collected in one place and used to train a model. In federated learning, the training data never leaves the device or institution where it resides. Instead, a central server sends the current model to each participating device or institution (called a client). Each client trains the model locally on its own data and sends back only the updated model weights (or gradients), not the underlying data. The server then aggregates these updates to produce an improved global model, and the process repeats over many rounds. In the widely used FedAvg algorithm, the aggregation step is a weighted average in which each client's update is weighted by the number of local training examples it holds.
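
The round structure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the one-dimensional linear model, the function names (`local_train`, `fed_avg`, `run_federated`), the learning rate, and the synthetic client data are all assumptions made for the example.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent on one client's private data.

    The model is a hypothetical linear regressor y = w*x + b, chosen only
    to keep the sketch short.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]

def fed_avg(client_weights, client_sizes):
    """Average client models, weighting each by its number of local examples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def run_federated(clients, rounds=50):
    """Server loop: broadcast the model, collect local updates, aggregate."""
    global_weights = [0.0, 0.0]
    for _ in range(rounds):
        # Only model weights travel back to the server; `data` stays local.
        updates = [local_train(list(global_weights), data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = fed_avg(updates, sizes)
    return global_weights

def make_client(n, rng):
    """Synthetic private dataset drawn from y = 3x + 1 (illustrative only)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]

rng = random.Random(0)
clients = [make_client(n, rng) for n in (20, 50, 200)]
final_weights = run_federated(clients)
print(final_weights)  # should approach [3.0, 1.0]
```

Because the three synthetic clients share the same underlying relationship, the averaged model converges to it. With real clients whose data distributions differ, convergence is typically slower and less clean.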

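This architecture provides a degree of privacy by design: an honest-but-curious server sees only model updates, not raw data. That protection is not absolute. Researchers have shown that individual training examples can sometimes be reconstructed from gradient updates, particularly when an update is computed from a small batch of data, so practical deployments increasingly combine federated learning with additional privacy techniques.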

Cross-Device vs. Cross-Silo Federated Learning

Federated learning is applied in two very different settings. Cross-device federated learning involves training across millions of mobile devices: smartphones, tablets, and IoT sensors. Google pioneered this approach for training keyboard next-word prediction models on Android devices, so that the shared model improves from local usage patterns without raw user data leaving the device. The engineering challenges are substantial: devices may be unavailable (offline, low battery), communication bandwidth is limited, and the number of participants can be enormous. As a result, each training round typically involves only a small sample of the devices that happen to be reachable at that moment.
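
To make the availability problem concrete, the following sketch shows one way a server might choose the cohort for a single round. The function name `select_clients`, the 30% availability rate, and the cohort size of 50 are hypothetical values chosen for illustration, not figures from any real deployment.

```python
import random

def select_clients(availability, target, rng):
    """Pick up to `target` clients from those currently reachable.

    `availability` maps a client id to True if the device is online and
    eligible (for example charging and on an unmetered network).
    """
    available = [cid for cid, is_up in availability.items() if is_up]
    # If fewer than `target` devices are reachable, use all of them.
    return rng.sample(available, min(target, len(available)))

rng = random.Random(42)
# Hypothetical fleet: each device is reachable with probability 0.3 this round.
availability = {f"device-{i}": rng.random() < 0.3 for i in range(1000)}
cohort = select_clients(availability, target=50, rng=rng)
print(len(cohort), "clients selected for this round")
```

A real system would also need to handle devices that drop out after being selected, which this sketch ignores.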

Cross-silo federated learning trains across a small number of institutions, such as banks, hospitals, or research centres, that each hold proprietary datasets. A consortium of banks might jointly train a fraud detection model without sharing customer transaction data; multiple hospitals might collaborate on a cancer detection model without pooling patient records. The participants are fewer, more reliable, and more powerful computationally, but they may have competing incentives and may not fully trust one another or the coordinating server.

Privacy Guarantees: Differential Privacy

Basic federated learning provides practical privacy benefits, but sophisticated adversaries can still potentially reconstruct sensitive information from model updates. Differential privacy (DP) provides a formal mathematical guarantee: a training procedure is ε-differentially private if adding or removing any one individual's data changes the probability of any possible outcome by at most a factor of e^ε. A small ε therefore means an observer can learn very little about whether a given individual participated. In federated learning this is typically achieved by clipping each client's update to a maximum norm and adding calibrated noise before or during aggregation. The trade-off is model accuracy: more noise means stronger privacy (smaller ε) but larger degradation in model quality. Finding the right privacy-accuracy balance for a given application is a key engineering challenge.
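
A common recipe for adding calibrated noise is to clip each client's update to a fixed L2 norm and then add Gaussian noise scaled to that norm. The sketch below illustrates the idea under stated assumptions: the function names and the `clip_norm` and `noise_multiplier` parameters are hypothetical, and translating a given noise level into a concrete ε requires a privacy accounting step that is not shown here.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale an update down so its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Average clipped updates after adding Gaussian noise to their sum.

    Clipping bounds how much any single client can move the sum, and the
    noise standard deviation is set relative to that bound.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    dim = len(updates[0])
    sigma = noise_multiplier * clip_norm
    noisy_sum = [
        sum(u[i] for u in clipped) + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]
    return [s / len(updates) for s in noisy_sum]

rng = random.Random(7)
client_updates = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.2]]
print(dp_average(client_updates, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Raising `noise_multiplier` strengthens the privacy guarantee and makes the averaged update noisier, which is the privacy-accuracy trade-off in miniature.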

Secure aggregation protocols — cryptographic techniques that allow the server to compute the sum of client updates without seeing individual updates — provide additional protection against a compromised server. Combined with differential privacy and trusted execution environments, these techniques are moving federated learning towards deployments that provide strong, provable privacy guarantees.
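
The core idea behind many secure aggregation schemes is pairwise masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum while each individual submission looks random. The toy sketch below illustrates only that cancellation. It assumes integer-quantised updates, uses a single seeded generator as a stand-in for the pairwise key agreement a real protocol would use, and ignores client dropouts; all names are hypothetical.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value

def pairwise_masks(num_clients, dim, seed):
    """Build one mask per client such that all masks sum to zero mod MODULUS.

    In a real protocol each pair of clients would derive its shared mask
    from a key exchange; the seeded generator here is only a stand-in.
    """
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks

def mask_update(update, mask):
    """What a client actually sends: its update plus its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel, leaving the true total."""
    dim = len(masked_updates[0])
    return [sum(m[k] for m in masked_updates) % MODULUS for k in range(dim)]

updates = [[5, 1, 0], [2, 7, 3], [4, 4, 4]]
masks = pairwise_masks(len(updates), dim=3, seed=123)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(masked)
print(total)  # [11, 12, 7], the element-wise sum of the original updates
```

The server recovers the correct sum without ever handling an unmasked individual update. Making this robust to clients that disconnect mid-round is where most of the complexity of real protocols lies.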

Applications in Finance and Healthcare

Financial services represent one of the highest-potential applications for cross-silo federated learning. Credit scoring models trained across multiple banks' datasets would see far more diverse borrower profiles than any single institution's data contains — potentially reducing bias and improving predictive accuracy for underserved populations. Anti-money laundering models that learn from transaction patterns across multiple institutions could identify complex money-laundering networks that would be invisible to any single bank.

In healthcare, federated learning enables collaborative training of diagnostic AI across hospital networks without patient data leaving any individual hospital's firewall, a property that can make it considerably easier to work within GDPR and HIPAA constraints. The MELLODDY consortium, a collaboration between ten pharmaceutical companies together with academic and technology partners, used federated learning to train drug discovery models on combined chemical datasets without any company sharing proprietary compound data, demonstrating the commercial viability of privacy-preserving ML collaboration.