
Model Evaluation Metrics
Here are the most common evaluation metrics used when fine-tuning and evaluating transformer models on classification tasks.
Recall
Out of all actual positive cases, how many were correctly predicted?
Precision
Out of all the cases predicted as positive, how many were correct?
F1 Score
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two. It is useful when you want to keep both false positives and false negatives in check.
Accuracy
Out of all predictions made, how many were correct?
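
To make the formulas concrete, here is a minimal sketch in plain Python that computes all four metrics from confusion-matrix counts. The function name and the toy counts at the bottom are illustrative, not from any particular library:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute recall, precision, F1, and accuracy from confusion-matrix counts."""
    # Recall: out of all actual positives, how many were found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: out of all predicted positives, how many were correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Accuracy: out of all predictions, how many were correct?
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}

# Illustrative counts: recall ≈ 0.667, precision = 0.8, f1 ≈ 0.727, accuracy = 0.94.
# Note the high accuracy despite a modest recall.
print(classification_metrics(tp=8, fp=2, fn=4, tn=86))
```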
Table of Metrics
Metric | Formula | Intuition |
---|---|---|
Recall | TP / (TP + FN) | Out of all actual positives, how many were found? |
Precision | TP / (TP + FP) | Out of all predicted positives, how many were correct? |
F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance between precision and recall. |
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness of predictions. |

Here TP = true positives, FP = false positives, FN = false negatives, and TN = true negatives.
- Recall is critical when missing a positive case has severe consequences (e.g., medical diagnoses).
- Precision is important in scenarios where false positives are costly (e.g., spam filtering).
- F1 Score balances precision and recall, which makes it useful for imbalanced datasets.
- Accuracy is a general measure but can be misleading in highly imbalanced datasets, as the sketch below shows.
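
As a quick illustration of the last two points, here is a sketch using scikit-learn's metric functions (assuming scikit-learn is installed; the toy labels are made up). On an imbalanced dataset, a model that always predicts the majority class scores high accuracy while recall, precision, and F1 collapse to zero:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Imbalanced toy data: only 2 positives out of 20 examples.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20  # the model predicts the majority class everywhere

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.9 -- looks great
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```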