You trained a model. Accuracy: 98.1%. You test it on real data, and the accuracy drops to 62%.
And then you dare to blame it on the well-known randomness of real-world data, because you don’t want to believe that the culprit is not the data but your own model.
Let me introduce you to the four suspects behind this: overfitting, data leakage, bias, and variance.
As I walk you through each concept, it will become more and more clear that building a machine learning model is not just about making it learn; it’s about making sure it learns the right things for the right reasons.
Overfitting: When Your Model Memorises Instead of Learning
Imagine that you are preparing for an exam by memorising the answers to every question from previous years’ question papers. You will definitely ace the practice set, because you have memorised each and every question. But will you be able to perform when the real exam arrives with slightly different phrasing? You learnt the questions, not the subject, and the subject is what you were actually expected to learn.
This is what overfitting is! A model becomes so precisely tuned to the training data (including its noise, quirks, and random flukes) that it loses the ability to generalise. It hasn’t learned the underlying pattern, as it was expected to; it has memorised the dataset.
But how do you spot an overfitting model? Easy! If your training accuracy makes you feel invincible, but your validation accuracy suddenly humbles you, chances are your model is overfitting.
| Scenario | Train Acc. | Val Acc. | Diagnosis |
|---|---|---|---|
| Healthy | 93% | 91% | Good generalisation |
| Mild overfit | 97% | 85% | Watch closely |
| Severe overfit | 99.8% | 62% | Model is memorising |
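The table above can be turned into a quick sanity check. This is a minimal sketch, assuming accuracies are given in percent; the function name `diagnose_gap` and the cut-offs of 5 and 20 points are hypothetical values chosen only to reproduce the three rows, not standard thresholds.

```python
def diagnose_gap(train_acc, val_acc, mild_gap=5.0, severe_gap=20.0):
    """Classify the train/validation accuracy gap (both values in percent).

    The thresholds are illustrative assumptions, not established rules.
    """
    gap = train_acc - val_acc
    if gap < mild_gap:
        return "Good generalisation"
    if gap < severe_gap:
        return "Watch closely"
    return "Model is memorising"


# The three scenarios from the table
for train_acc, val_acc in [(93.0, 91.0), (97.0, 85.0), (99.8, 62.0)]:
    print(train_acc, val_acc, diagnose_gap(train_acc, val_acc))
```

In practice the right thresholds depend on the task and dataset size, so treat the gap as a prompt to investigate rather than a verdict.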
How to fix it?
- More data: Often the most effective cure. More examples force the model to find real patterns.
- Regularisation (L1/L2): Penalises large weights, discouraging the model from fitting noise.
- Dropout: Randomly deactivates neurons during training, preventing them from relying too heavily on one another.
- Early stopping: Halt training when validation loss starts rising, even if training loss keeps falling.
- Simpler model: Sometimes the best fix is a less powerful model that can’t memorise as easily.
- Cross-validation: Evaluate on multiple folds to get a more reliable performance estimate. This does not cure overfitting by itself, but it exposes it early and guides model selection.
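Early stopping from the list above can be sketched without any framework. This is a minimal sketch, assuming you already record one validation loss per epoch; the function `early_stopping_epoch`, the `patience` parameter, and the loss values are hypothetical and only illustrate the idea.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here
            return best_epoch, epoch
    # Never triggered: training ran to the last recorded epoch
    return best_epoch, len(val_losses) - 1


# Validation loss falls, then starts rising (made-up numbers)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

You would then restore the model weights saved at `best_epoch` rather than keeping the weights from the final epoch.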
Data Leakage: When Your Model Knows Too Much
Data leakage is among the most deceitful problems in ML because it can be nearly invisible. It happens when information that would not be available at prediction time somehow bleeds into the training process. It’s like a student who somehow gets a copy of the exam’s question paper: he aces that subject but fails the rest. That is what raises suspicion.
Common sources of Leakage
- Target leakage: The training data contains a feature that is derived from the label, or is only known once the label is known (e.g., using “days_in_hospital” to predict “hospitalised”).
- Temporal leakage: The model is trained on future data to predict past events.
- Preprocessing leakage: You fit a scaler or imputer on the entire dataset before the train/test split, letting test statistics influence training.
- Duplicate leakage: The same record, or a near-duplicate, appears in both the train and test sets.
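Two of the sources above, preprocessing leakage and duplicate leakage, are easy to show in a few lines. This is a minimal sketch using only the standard library; the helper names (`fit_scaler`, `apply_scaler`, `shared_rows`) and the toy values are assumptions made for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardise values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


def shared_rows(train_rows, test_rows):
    """Return exact duplicate rows that appear in both splits."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


feature = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = feature[:4], feature[4:]

# Leaky: statistics computed on train + test together
leaky_mean, leaky_std = fit_scaler(feature)

# Safer: fit on the training split only, then reuse on the test split
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

# Duplicate check on toy rows
train_rows = [[1, "a"], [2, "b"], [3, "c"]]
test_rows = [[3, "c"], [4, "d"]]
duplicates = shared_rows(train_rows, test_rows)
print(mean, leaky_mean, duplicates)
```

The leaky mean is pulled far away from the training mean by the two test values, which is exactly the test information the model should never see. Note that `shared_rows` only catches exact matches; near-duplicates need a fuzzier comparison.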
Bias & Variance: Underthinking vs Overthinking
Now, bias and variance are two sides of the same coin. They describe two different yet connected sources of error in a machine learning model.
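One common way to make this connection precise is the bias-variance decomposition. The sketch below assumes squared-error loss and data generated as $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has zero mean and variance $\sigma^2$, and $\hat{f}_D$ is the model trained on a particular training set $D$. These symbols are introduced here for illustration and do not come from the rest of this article.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Bias measures how far the average prediction is from the truth, variance measures how much the prediction moves when the training set changes, and the noise term is the part no model can remove.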
Bias: Oversimplifying
Bias occurs when we try to solve a complex problem with an overly simple model. This forces the model to make assumptions and lose sight of important relationships in your data. A high-bias model makes strong assumptions about the data; for example, it assumes that the relationship is linear when it’s actually curved. It gets things wrong in the same direction, every time.
It’s like a student who studies only the basic definitions for an exam that is designed to test deep understanding and problem-solving. No matter how the questions are asked, the student keeps making the same kind of mistakes because they never truly understood the subject in the first place.
Signs of high bias:
- Poor performance on both training and test data
- Model is too simple for the complexity of the problem
- Training loss plateaus at a high value very quickly
Fix: Use a more complex model, add more features, reduce regularisation.
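The linear-versus-curved example can be reproduced with a tiny least-squares fit. This is a minimal sketch, assuming a noise-free toy dataset where y = x²; the helpers `fit_line` and `mse` and all data values are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Curved ground truth: y = x^2
train_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
test_x = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
train_y = [x * x for x in train_x]
test_y = [x * x for x in test_x]

# High bias: a straight line in x cannot follow the curve
slope, intercept = fit_line(train_x, train_y)
line_train_mse = mse(train_x, train_y, slope, intercept)
line_test_mse = mse(test_x, test_y, slope, intercept)

# Fix: add a more expressive feature (x squared) and fit again
train_sq = [x * x for x in train_x]
test_sq = [x * x for x in test_x]
slope_sq, intercept_sq = fit_line(train_sq, train_y)
sq_train_mse = mse(train_sq, train_y, slope_sq, intercept_sq)
sq_test_mse = mse(test_sq, test_y, slope_sq, intercept_sq)

print(line_train_mse, line_test_mse, sq_train_mse, sq_test_mse)
```

The straight line ends up flat (slope 0) with a large error on both splits, which matches the first sign of high bias. Once the squared feature is added, the error on both splits drops to zero on this toy data.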
Variance: Overcomplicating
Variance is the model’s sensitivity to fluctuations in the training data. A high-variance model is so responsive to its training set that it captures noise as if it were signal. High variance is what leads to overfitting.
Now, when a student tries to memorise everything without actually understanding the concepts and their applications, they may perform extremely well on familiar practice questions but struggle the moment the exam asks something slightly different. Instead of learning patterns, the student simply remembers answers, and that is exactly how a high-variance model behaves.
Signs of high variance:
- Great performance on training data, poor on test data (sound familiar?)
- Large swings in performance across different validation folds
- Model is too complex for the amount of data available
Fix: Regularisation, more training data, ensemble methods (bagging), simpler architecture.
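The signs and fixes above can be seen in a toy example. This is a minimal sketch, assuming a synthetic one-dimensional dataset in which every fourth training label is deliberately flipped to act as noise, and a hand-written k-nearest-neighbour classifier. The data, `knn_predict`, and `accuracy` are illustrative assumptions. With k=1 the model memorises the noisy labels; with k=5 it averages over neighbours, which is a simple form of smoothing.

```python
def true_label(x):
    """Ground truth: class 1 from x = 20 upwards, class 0 below."""
    return 1 if x >= 20 else 0


# Training set: 40 points, every fourth label flipped to simulate noise
train_x = [float(i) for i in range(40)]
train_y = [true_label(x) ^ (1 if i % 4 == 0 else 0) for i, x in enumerate(train_x)]

# Test set: clean labels, points slightly offset from the training points
test_x = [i + 0.25 for i in range(40)]
test_y = [true_label(x) for x in test_x]


def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    votes = sum(train_y[j] for j in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(xs, ys, k):
    """Fraction of points in (xs, ys) that the k-NN model labels correctly."""
    return sum(knn_predict(x, k) == y for x, y in zip(xs, ys)) / len(xs)


for k in (1, 5):
    print(k, accuracy(train_x, train_y, k), accuracy(test_x, test_y, k))
```

On this toy data, k=1 scores 100% on the training set but 75% on the test set, because it reproduces every flipped label. With k=5, training accuracy falls to 77.5% (the model no longer fits the noise) while test accuracy rises to 97.5%.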
| Problem | What it means | Symptom | Primary fix |
|---|---|---|---|
| Overfitting | Model memorises training data, can’t generalise | Train accuracy ↑↑, test accuracy ↓↓ | Regularise, add data, simplify |
| Data Leakage | Future/forbidden info enters training | Suspiciously perfect metrics | Strict train/test pipeline discipline |
| High Bias | Model too simple; systematic errors | Both train & test errors high | More complex model, more features |
| High Variance | Model too sensitive to training data | Train error ↓↓, test error ↑↑ | Regularise, ensemble, more data |
Conclusion
At the end of the day, machine learning is not about chasing the highest accuracy score on a training set. A model that memorises instead of learning, leaks information it should never see, oversimplifies complex relationships, or overreacts to noise may look impressive on paper, but collapses in the real world. Overfitting, data leakage, bias, and variance are warnings that your model may not truly understand the data at all. The real goal of machine learning is not to build a model that performs perfectly on what it has already seen, but one that can confidently handle what it hasn’t.