
Model Evaluation Metrics

Here are the most common evaluation metrics used when fine-tuning and evaluating transformer models on classification tasks.

Recall

Out of all actual positive cases, how many were correctly predicted?
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$
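For instance, using made-up counts purely for illustration: if a model finds 8 of the actual positives (TP = 8) but misses 2 (FN = 2), then
$$
\text{Recall} = \frac{8}{8 + 2} = 0.8
$$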

Precision

Out of all the cases predicted as positive, how many were correct?
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$
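Continuing the same toy counts, if the model also raises 4 false alarms (FP = 4), then
$$
\text{Precision} = \frac{8}{8 + 4} \approx 0.67
$$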

F1 Score

The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two. It is useful when both false positives and false negatives matter.
$$
\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
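Plugging the toy precision and recall from above into the formula:
$$
\text{F1} = 2 \cdot \frac{0.67 \cdot 0.8}{0.67 + 0.8} \approx 0.73
$$
Note that the harmonic mean pulls the score toward the lower of the two values.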

Accuracy

Out of all predictions made, how many were correct?
$$
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}
$$
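With the same toy counts plus 86 true negatives (100 predictions in total):
$$
\text{Accuracy} = \frac{8 + 86}{8 + 86 + 4 + 2} = 0.94
$$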

Table of Metrics

| Metric | Formula | Intuition |
| --- | --- | --- |
| Recall | $\frac{TP}{TP + FN}$ | Out of all actual positives, how many were found? |
| Precision | $\frac{TP}{TP + FP}$ | Out of all predicted positives, how many were correct? |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Balance between precision and recall. |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness of predictions. |
  • Recall is critical when missing a positive case has severe consequences (e.g., medical diagnoses).
  • Precision is important in scenarios where false positives are costly (e.g., spam filtering).
  • F1 Score balances precision and recall, useful for imbalanced datasets.
  • Accuracy is a general measure but can be misleading in highly imbalanced datasets.
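
In practice you rarely compute these by hand. The sketch below is a minimal example, assuming scikit-learn is installed and using toy arrays in place of real model predictions; the `compute_metrics` function and the toy data are illustrative, not part of any specific library API.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(labels: np.ndarray, preds: np.ndarray) -> dict:
    """Return accuracy, precision, recall, and F1 for binary predictions."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels and predictions standing in for real model output
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
preds = np.array([1, 0, 1, 0, 1, 0, 1, 0])
print(compute_metrics(labels, preds))
# accuracy, precision, recall, and F1 all come out to 0.75 for this toy split
```

A function with this shape slots into most evaluation loops; for example, the Hugging Face Trainer accepts a similar `compute_metrics` callback, though it passes logits and label ids rather than hard predictions, so an `argmax` step is needed first.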