Model Evaluation Metrics
Here are the most common evaluation metrics used when fine-tuning and evaluating transformer models on classification tasks.
Recall
Out of all actual positive cases, how many were correctly predicted?
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$
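As a quick illustration, here is a minimal pure-Python sketch that counts true positives and false negatives for a pair of toy label lists (the values of `y_true` and `y_pred` are invented for this example) and applies the formula above:

```python
# Toy example: 1 = positive, 0 = negative (labels invented for illustration)
y_true = [1, 1, 1, 0, 0, 1]   # actual labels
y_pred = [1, 0, 1, 0, 1, 1]   # model predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives  = 3
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives = 1

recall = tp / (tp + fn)
print(recall)  # 3 / (3 + 1) = 0.75
```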
Precision
Out of all the cases predicted as positive, how many were correct?
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$
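Precision and recall can diverge sharply for the same set of predictions. The hypothetical sketch below (toy labels again) uses a classifier that flags every example as positive: it achieves perfect recall but poor precision.

```python
# Invented toy labels; the classifier predicts "positive" for everything
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives  = 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives = 4
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives = 0

print(tp / (tp + fp))  # precision = 2/6 ~ 0.33: most positive calls are wrong
print(tp / (tp + fn))  # recall    = 2/2 = 1.0: every actual positive was found
```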
F1 Score
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two. It is useful when you want to balance false positives and false negatives.
$$
\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
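A small sketch of why the harmonic mean is used: unlike a simple average, it collapses when either precision or recall is low. The 0.9/0.1 values below are hypothetical, chosen only to show the effect.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # 0.9  -- balanced precision and recall give a high F1
print(f1(0.9, 0.1))  # 0.18 -- far below the arithmetic mean of 0.5
```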
Accuracy
Out of all predictions made, how many were correct?
$$
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}
$$
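All four formulas can be checked against a library implementation. The sketch below assumes scikit-learn is available (it is not mentioned in these notes) and reuses the toy labels from the recall example:

```python
# Assumes scikit-learn is installed; labels reuse the recall example above
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

print(recall_score(y_true, y_pred))     # TP / (TP + FN)    = 3/4 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP)    = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean     = 0.75
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 4/6 ~ 0.67
```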
Table of Metrics
| Metric | Formula | Intuition |
|---|---|---|
| Recall | $\frac{TP}{TP + FN}$ | Out of all actual positives, how many were found? |
| Precision | $\frac{TP}{TP + FP}$ | Out of all predicted positives, how many were correct? |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Balance between precision and recall. |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness of predictions. |
- Recall is critical when missing a positive case has severe consequences (e.g., medical diagnoses).
- Precision is important in scenarios where false positives are costly (e.g., spam filtering).
- F1 Score balances precision and recall, useful for imbalanced datasets.
- Accuracy is a general measure but can be misleading on highly imbalanced datasets, as the sketch below illustrates.
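To make the last point concrete, here is a sketch on a hypothetical imbalanced dataset (95 negatives, 5 positives, invented for illustration) with a model that always predicts the majority class: accuracy looks excellent while recall and F1 reveal that no positives are ever found.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # majority-class baseline: never predicts the positive class

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks excellent
print(recall_score(y_true, y_pred))               # 0.0  -- misses every positive
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- F1 exposes the failure
```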