Training a large neural network from scratch requires enormous quantities of labelled data, substantial computational resources, and significant time. For most organisations and most tasks, these resources are simply unavailable. Transfer learning addresses this constraint by reusing knowledge extracted from one learning problem to accelerate or improve performance on a related problem. It is one of the most practically important techniques in applied machine learning today.

The Intuition Behind Transfer Learning

The intuition is familiar from human experience. A radiologist who has spent years interpreting chest X-rays does not start from scratch when tasked with analysing a new type of scan — they transfer the visual pattern recognition skills, anatomical knowledge, and diagnostic reasoning developed through prior experience. Similarly, a neural network trained on millions of images learns low-level visual features (edges, textures, colours) and higher-level representations (shapes, parts, objects) that are useful across a wide range of vision tasks.

The key insight is that the internal representations learned by large models on large datasets are generalisable: they capture genuine statistical structure of the domain, not just the specifics of the training task. This is what makes transfer learning work: features that reflect real structure in the data remain useful when the task changes.

Computer Vision Transfer Learning

In computer vision, transfer learning is now the default approach for almost all applied tasks. The standard methodology is to take a model pre-trained on ImageNet (typically a ResNet, EfficientNet, or Vision Transformer) and fine-tune it on the target dataset. The pre-trained weights provide a strong initialisation: the early layers (convolutional filters in a ResNet or EfficientNet, patch embeddings and attention blocks in a Vision Transformer) already encode general visual features, and only the task-specific head needs to be learned from scratch.

When the target dataset is small (hundreds or thousands of examples), it is common to "freeze" the early layers, keeping their pre-trained weights fixed, and train only the later layers and the classification head. This reduces the risk of overfitting and preserves the general features learned from large-scale pre-training. With larger datasets, the entire network can be fine-tuned jointly, typically with a much lower learning rate than would be used for training from scratch, so that the updates refine the pre-trained weights rather than overwrite them.
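
The freezing idea can be illustrated with a deliberately tiny, framework-free sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical stand-in for pre-trained layers (a fixed feature map, not a real network), the six data points are invented, and the head is a plain logistic regression trained by stochastic gradient descent. In a framework such as PyTorch, the same effect is usually obtained by disabling gradients on the backbone parameters (setting `requires_grad` to `False`).

```python
import math

def frozen_backbone(x):
    """Hypothetical stand-in for pre-trained layers: a fixed feature map.

    Nothing in this function is ever updated during training, which is
    what "freezing" means in practice.
    """
    return [x[0] + x[1], x[0] - x[1], x[0] * x[1]]

def predict(head, features):
    """Task-specific head: logistic regression on the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.1, epochs=200):
    """Train only the head; the backbone is used purely as a feature extractor."""
    head = {"weights": [0.0, 0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_backbone(x)          # no update reaches the backbone
            error = predict(head, feats) - y    # gradient of log loss w.r.t. z
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head

# Invented toy dataset: label is 1 when x[0] + x[1] is positive.
data = [((1.0, 0.5), 1), ((0.2, 0.9), 1), ((-1.0, -0.3), 0),
        ((-0.4, -0.8), 0), ((0.7, -0.1), 1), ((-0.6, 0.2), 0)]

head = train_head(data)
predictions = [predict(head, frozen_backbone(x)) > 0.5 for x, _ in data]
```

Only four numbers (three weights and a bias) are learned here, which is why a handful of examples is enough; the same reasoning explains why freezing helps when the target dataset is small.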

NLP Transfer Learning: The BERT Paradigm

The impact of transfer learning in NLP has been even more dramatic. Before BERT (2018), training an NLP model for a specific task typically required large task-specific labelled datasets and substantial computational cost. BERT changed the equation: a model pre-trained on 3.3 billion words of text could be fine-tuned, in some cases with only a few thousand labelled examples, to reach results that were state-of-the-art or close to it at the time on question answering, sentiment analysis, named entity recognition, and a wide range of other tasks.

The fine-tuning process for BERT-style models is straightforward: add a task-specific head (a classification layer for sentiment, a span-extraction layer for question answering) on top of the pre-trained model, and train the combined system on the target task for a small number of epochs. The pre-trained representations provide an extraordinary warm start that drastically reduces the amount of labelled data required.
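
For question answering, a span-extraction head typically produces two scores per token: one for "the answer starts here" and one for "the answer ends here". The sketch below shows only the decoding step that turns those scores into an answer span. The function name `best_span`, the `max_len` limit, and the logit values are all assumptions for illustration; in a real system the logits would come from the fine-tuned model.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end, score) maximising start + end score with start <= end.

    max_len caps the span length in tokens (an illustrative default).
    """
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_logits):
        last = min(i + max_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Made-up per-token scores standing in for the output of a fine-tuned model.
tokens = ["the", "model", "was", "released", "in", "2018", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 1.5, 4.0, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.1, 0.3, 3.5, 0.9]

start, end, _ = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
```

The constraint that the start must not come after the end matters: picking the best start and best end independently can yield an invalid span.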

Domain Adaptation

Standard transfer learning works best when the source and target domains are similar. When there is a significant distribution shift (for instance, applying a model trained on general English text to legal or medical text), direct fine-tuning may be insufficient. A common domain adaptation technique addresses this by continuing to pre-train the model on domain-specific unlabelled text before task-specific fine-tuning. FinBERT, BioBERT, LegalBERT, and SciBERT are examples of BERT-style models pre-trained on domain-specific corpora; some continue training from general-domain BERT weights, and some are released in variants trained from scratch on in-domain text. Such models have been reported to outperform general-domain models on tasks within their own domains.
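
Unlabelled domain text can serve as training data because BERT-style pre-training is self-supervised: some tokens are hidden and the model is trained to recover them. The sketch below shows that idea in simplified form. It is an illustration under assumptions, not the exact BERT recipe: it always substitutes a `[MASK]` token, whereas the original recipe selects roughly 15% of tokens and sometimes substitutes a random token or leaves the token unchanged. The function name `mask_tokens` and the sample sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build (inputs, labels) for a simplified masked-language-modelling step.

    labels holds the original token at masked positions and None elsewhere,
    so the loss would be computed only where a token was hidden.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# Invented example of in-domain (legal) text; no human labels are needed.
domain_text = "the lessee shall indemnify the lessor against all claims".split()
inputs, labels = mask_tokens(domain_text, mask_prob=0.3, seed=1)
```

Because the targets are derived from the text itself, any sufficiently large in-domain corpus can be used for this stage, and labelled data is needed only for the final task-specific fine-tuning.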

Low-Data Regimes and Few-Shot Learning

An extreme case of transfer learning is few-shot learning: achieving good performance with just a handful of labelled examples. Large language models like GPT-4 can be prompted with a few examples directly in the input, which demonstrate the task before the model is asked to apply it to a new instance. This "in-context learning" requires no gradient updates: the model's weights stay fixed, and adaptation happens entirely within the forward pass as the model conditions on the examples in its context. While the mechanisms underlying in-context learning are still being researched, it represents a remarkable form of transfer that can reduce the labelled data needed for a new task to a handful of examples, lowering the barrier to high-performance ML for organisations without large labelled datasets.
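
In practice, in-context learning comes down to how the prompt is assembled. The sketch below builds a few-shot prompt for sentiment classification. The function name `build_few_shot_prompt`, the "Review:" and "Sentiment:" field labels, and the example reviews are all hypothetical; the resulting string would be sent to whichever language model is in use, a step deliberately left out here.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labelled demonstrations, and an unlabelled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The final item has no label; the model is expected to complete it.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

# Invented demonstrations; these are the only "training data" involved.
examples = [
    ("The battery lasts all day and the screen is superb.", "positive"),
    ("It stopped working after a week.", "negative"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Setup was painless and support answered within minutes.",
)
```

The demonstrations sit in the input rather than in a training set, so changing the task means editing the prompt instead of retraining the model.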