Model Training Evaluation Metrics


Here are the most common evaluation metrics used when training and fine-tuning transformer models for classification tasks.

Recall

Out of all actual positive cases, how many were correctly predicted?
$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$
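
As a quick worked example with hypothetical counts (reused for the metrics below, not taken from the text itself): suppose a binary classifier evaluated on 100 examples yields TP = 8, FP = 4, FN = 2, and TN = 86. Then:

$$\text{Recall} = \frac{8}{8 + 2} = 0.80$$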

Precision

Out of all the cases predicted as positive, how many were correct?
$$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$$
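
With the same hypothetical counts (TP = 8, FP = 4):

$$\text{Precision} = \frac{8}{8 + 4} \approx 0.67$$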

F1 Score

The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two. It is useful when you want to weigh false positives and false negatives equally.
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
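
Continuing the hypothetical example (Precision ≈ 0.67, Recall = 0.80):

$$F_1 = \frac{2 \cdot 0.67 \cdot 0.80}{0.67 + 0.80} \approx 0.73$$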

Accuracy

Out of all predictions made, how many were correct?
$$\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}$$
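
With the same hypothetical counts (TP = 8, TN = 86, 100 predictions in total):

$$\text{Accuracy} = \frac{8 + 86}{100} = 0.94$$

Note that accuracy (0.94) looks far better than F1 (≈ 0.73) here because the negative class dominates; this is the imbalance effect noted in the bullet points below.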

Table of Metrics

| Metric | Formula | Intuition |
| --- | --- | --- |
| Recall | TP / (TP + FN) | Out of all actual positives, how many were found? |
| Precision | TP / (TP + FP) | Out of all predicted positives, how many were correct? |
| F1 Score | 2 · Precision · Recall / (Precision + Recall) | Balance between precision and recall. |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness of predictions. |
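
Below is a minimal sketch of these formulas in code, computed directly from confusion-matrix counts; the function name and the example counts are illustrative, not from the text above.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute recall, precision, F1, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}

# Hypothetical counts from the worked examples above: TP=8, FP=4, FN=2, TN=86
print(classification_metrics(tp=8, fp=4, fn=2, tn=86))
# -> {'recall': 0.8, 'precision': 0.666..., 'f1': 0.727..., 'accuracy': 0.94}
```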
  • Recall is critical when missing a positive case has severe consequences (e.g., medical diagnoses).
  • Precision is important in scenarios where false positives are costly (e.g., spam filtering).
  • F1 Score balances precision and recall, making it useful for imbalanced datasets.
  • Accuracy is a general measure but can be misleading in highly imbalanced datasets.
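
In practice, these metrics are rarely computed by hand when fine-tuning transformers. Below is a minimal sketch of a compute_metrics hook in the style used with the Hugging Face Trainer, assuming a binary classification head and relying on scikit-learn's metric helpers; the binary averaging mode and integer labels are assumptions, not details from the text above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer-style hook: eval_pred carries model logits and integer labels."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # highest-scoring class per example
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"  # assumption: binary task; use "macro" or "weighted" for multi-class
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical usage: Trainer(model=model, args=args, ..., compute_metrics=compute_metrics)
```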