
Model Evaluation Metrics

Here are the most common evaluation metrics used when fine-tuning and evaluating transformer models on classification tasks.

Recall

Out of all actual positive cases, how many were correctly predicted?
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$
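For instance, using made-up counts purely for illustration: if a model finds 8 of the actual positives (TP = 8) but misses 2 (FN = 2), then
$$
\text{Recall} = \frac{8}{8 + 2} = 0.8
$$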

Precision

Out of all the cases predicted as positive, how many were correct?
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$
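Continuing the same toy counts, if the model also raises 4 false alarms (FP = 4), then
$$
\text{Precision} = \frac{8}{8 + 4} \approx 0.67
$$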

F1 Score

The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two. It is useful when both false positives and false negatives matter.
$$
\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
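Plugging the toy precision and recall from above into the formula:
$$
\text{F1} = 2 \cdot \frac{0.67 \cdot 0.8}{0.67 + 0.8} \approx 0.73
$$
Note that the harmonic mean pulls the score toward the lower of the two values.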

Accuracy

Out of all predictions made, how many were correct?
$$
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}
$$
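With the same toy counts plus 86 true negatives (100 predictions in total):
$$
\text{Accuracy} = \frac{8 + 86}{8 + 86 + 4 + 2} = 0.94
$$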

Table of Metrics

| Metric | Formula | Intuition |
| --- | --- | --- |
| Recall | $\frac{TP}{TP + FN}$ | Out of all actual positives, how many were found? |
| Precision | $\frac{TP}{TP + FP}$ | Out of all predicted positives, how many were correct? |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Balance between precision and recall. |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness of predictions. |
  • Recall is critical when missing a positive case has severe consequences (e.g., medical diagnoses).
  • Precision is important in scenarios where false positives are costly (e.g., spam filtering).
  • F1 Score balances precision and recall, useful for imbalanced datasets.
  • Accuracy is a general measure but can be misleading in highly imbalanced datasets.
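
In practice you rarely compute these by hand. The sketch below is a minimal example, assuming scikit-learn is installed and using toy arrays in place of real model predictions; the `compute_metrics` function and the toy data are illustrative, not part of any specific library API.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(labels: np.ndarray, preds: np.ndarray) -> dict:
    """Return accuracy, precision, recall, and F1 for binary predictions."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels and predictions standing in for real model output
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
preds = np.array([1, 0, 1, 0, 1, 0, 1, 0])
print(compute_metrics(labels, preds))
# accuracy, precision, recall, and F1 all come out to 0.75 for this toy split
```

A function with this shape slots into most evaluation loops; for example, the Hugging Face Trainer accepts a similar `compute_metrics` callback, though it passes logits and label ids rather than hard predictions, so an `argmax` step is needed first.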