Classification metrics#
We focus our discussion on threshold-independent metrics.
Recall that the threshold is the number $t \in [0,1]$ above which we assign an observation to the positive class: if the model's score is at least $t$, we predict 1, otherwise 0.
With a fixed threshold, we can build confusion matrices and all the metrics which derive from them: accuracy, precision, recall, F1, MCC, among others.
There are two reasons to focus on metrics which do not depend on a threshold:
In the modelling pipeline, finding a continuous score model comes first - only after that comes defining a threshold. One usually picks the best model in a threshold-independent way; if the application requires a binary output (which is not always the case), a good threshold is chosen after the model has been selected.
There is no universally "best" threshold - it is the result of a calculation involving trade-offs (between false positives and false negatives, for example), which usually depends heavily on specific business needs.
Our sample data#
Throughout this section, we will use the following simulated data with 5000 data points and 10 features, of which only 8 are informative:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dummy classification problem
X, y = make_classification(n_samples=5000, n_features=10, n_informative=8, n_redundant=1, n_repeated=1,
                           random_state=10)  # for reproducibility
# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
The base ingredient: confusion matrices#
All metrics will be built from a single fundamental element: the confusion matrix.
Basically, assume for a second you have an estimator $\hat{Y}$ which outputs hard classes (0 or 1); we will see shortly how to get one from a score-based model. Now, given an observation with true label $Y$, there are four possibilities:
$Y = 0$ and $\hat{Y} = 0$: right prediction. This is called a true negative (TN) (since we also call the class 0 the "negative class")
$Y = 0$ and $\hat{Y} = 1$: wrong prediction. This is called a false positive (FP) (since we falsely predicted the class 1, also known as the positive class)
$Y = 1$ and $\hat{Y} = 0$: wrong prediction. This is called a false negative (FN)
$Y = 1$ and $\hat{Y} = 1$: right prediction. This is called a true positive (TP)
These 4 possibilities can be stored in a 2x2 matrix called the confusion matrix. We illustrate below how to build one in scikit-learn.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 120
# assume we have a series of observed and predicted labels
y_real = [1,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0]
y_pred = [1,1,0,1,0,1,1,0,1,0,1,1,1,0,0,1,0,0]
# first we build a confusion matrix object...
matrix = confusion_matrix(y_real, y_pred)
#... which we then plot
ConfusionMatrixDisplay(matrix).plot()
plt.show()

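If you prefer the raw counts, the four entries can be unpacked directly. A small sketch of this, reusing the same y_real and y_pred as above; in scikit-learn's convention, rows index the true class and columns the predicted class, so ravel() returns TN, FP, FN, TP in that order:
# unpack the four entries of the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
print("TN: {}, FP: {}, FN: {}, TP: {}".format(tn, fp, fn, tp))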
A quick disclaimer: predict vs. predict_proba#
In scikit-learn (and scikit-learn-friendly libraries such as XGBoost and LightGBM), all classifiers contain at least two methods: predict and predict_proba.
Their difference is straightforward: predict returns the predicted classes (0 or 1 in the binary case), whereas predict_proba returns a float between 0 and 1: the score.
For most models, the output of predict_proba is not a probability, but simply a score. We will discuss this further in the calibration section.
To exemplify this, let’s use a simple logistic regression model and see how it works.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(X_train, y_train)
x = X_train[25,:].reshape(1,-1) # take one training sample for us to predict
model.predict(x)
array([1])
model.predict_proba(x)
array([[0.12682433, 0.87317567]])
What predict_proba does is output two values: the probability (or rather, the score) that the observation belongs to class 0 (first component) and to class 1 (second component).
Since we are in a binary case, we can just extract the last component via model.predict_proba(x)[:,1], since the 0th component is fully determined by it (the two components sum to 1). In the multiclass case (with more than two classes), it is important to keep all components.
What predict does is take the result of predict_proba and binarize it - essentially by just applying a threshold of 0.5! In order to keep things within our control, we recommend simply using the predicted score via predict_proba and then feeding it to different metrics.
y_train_pred = model.predict_proba(X_train)[:,1]
y_test_pred = model.predict_proba(X_test)[:,1]
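As a quick sanity check (a small sketch, not something you need in the pipeline), we can verify that predict is indeed just a 0.5 threshold applied to these scores; for logistic regression this should print True:
# predict should coincide with thresholding the class-1 score at 0.5
manual_pred = (y_test_pred > 0.5).astype(int)
print("Same predictions as predict? {}".format(np.array_equal(manual_pred, model.predict(X_test))))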
How well did our model do?#
Looking into the documentation, you might be tempted to use the model's .score function:
print(">> Don't use this function")
print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))
>> Don't use this function
Train score: 0.6762857142857143
Test score: 0.668
But we recommend never using the score function - since under the hood it calculates accuracy with the predict function, which we don't recommend.
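The sketch below makes this explicit: for scikit-learn classifiers, score is plain accuracy computed on the output of predict, so it should reproduce the test score printed above:
from sklearn.metrics import accuracy_score
# model.score is accuracy computed on the hard 0/1 predictions
print("Test accuracy: {}".format(accuracy_score(y_test, model.predict(X_test))))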
Instead, in what follows we will present three useful threshold-independent metrics: the ROC AUC, average precision, and lift/delinquency curves.
We recommend using all of them whenever you assess a model.
Ratios of the confusion matrix#
For now, fix a confusion matrix (this amounts to fixing an estimator $\hat{Y}$, that is, a score model together with a threshold).
In most references, you will see these rates defined in terms of elements of the confusion matrix. Here, in order to keep up with the probabilistic language we have been using, we will define them a bit differently.
[True positive rate / sensitivity / TPR] The TPR of an estimator is $\mathrm{TPR} = P(\hat{Y} = 1 \mid Y = 1)$.
[False positive rate, FPR] The FPR of an estimator is $\mathrm{FPR} = P(\hat{Y} = 1 \mid Y = 0)$.
Similar definitions can be made for the true negative rate and false negative rate, although these tend to be less commonly used.
[Calculating TPR from the confusion matrix] TPR can be approximated from the confusion matrix via $\mathrm{TPR} \approx \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
Proof: from the definition of conditional probability,
$$\mathrm{TPR} = P(\hat{Y}=1 \mid Y=1) = \frac{P(\hat{Y}=1, Y=1)}{P(Y=1)}.$$
We can break the denominator into $P(Y=1) = P(Y=1, \hat{Y}=1) + P(Y=1, \hat{Y}=0)$ (see note below on why). Then
$$\mathrm{TPR} = \frac{P(\hat{Y}=1, Y=1)}{P(\hat{Y}=1, Y=1) + P(\hat{Y}=0, Y=1)}.$$
But we have seen before that each entry of the confusion matrix, divided by the total number of observations $n$, approximates the corresponding joint probability: $P(\hat{Y}=1, Y=1) \approx \mathrm{TP}/n$ and $P(\hat{Y}=0, Y=1) \approx \mathrm{FN}/n$. The factors of $n$ cancel, giving $\mathrm{TPR} \approx \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
Note: this comes from the general identity $P(A) = \sum_i P(A \cap B_i)$, where the events $B_i$ are pairwise disjoint and together "reconstruct" the sample space in the sense that $\bigcup_i B_i = \Omega$. Convince yourself of that by drawing a few Venn diagrams.
[Calculating FPR from the confusion matrix] Similar to TPR, one can calculate $\mathrm{FPR} \approx \mathrm{FP}/(\mathrm{FP}+\mathrm{TN})$.
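As a small sketch, here is how these approximations look in code, reusing the toy y_real and y_pred from the confusion-matrix example above:
# approximate TPR and FPR from the confusion-matrix counts
tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
print("TPR: {0:.3f}".format(tp / (tp + fn)))
print("FPR: {0:.3f}".format(fp / (fp + tn)))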
The ROC curve and the ROC AUC#
The area under the ROC curve (ROC AUC) is, perhaps surprisingly, both the most used and the least understood metric in machine learning. In what follows we give a mathematically sound treatment of it. The main results will be highlighted, in case you want to skip the heavy math.
scikit-learn makes everything seem easy: calculating the ROC AUC of our model is literally one line of code:
from sklearn.metrics import roc_auc_score
print("Train ROC AUC: {0:.4f}".format(roc_auc_score(y_train, y_train_pred)))
print("Test ROC AUC: {0:.4f}".format(roc_auc_score(y_test, y_test_pred)))
Train ROC AUC: 0.7614
Test ROC AUC: 0.7458
But it is another thing to understand what is going on under the hood. We explain this below.
Constructing the ROC curve#
From now on, we drop the hats on estimators and scores to keep the notation light.
Notice that the whole discussion so far has considered a fixed estimator - that is, a score model together with a fixed threshold $t$, each choice of $t$ yielding its own TPR and FPR.
Now, we let $t$ vary and track the (FPR, TPR) pairs we obtain along the way.
Definition. The ROC curve of a score model is the parametric curve $t \mapsto (\mathrm{FPR}(t), \mathrm{TPR}(t))$ traced as the threshold $t$ varies.
Before we use scikit-learn to plot it, let us intuitively understand what behaviors to expect from this curve. Consider the expression defining the classifier at threshold $t$: predict the positive class whenever the score $p$ satisfies $p \geq t$, and the negative class otherwise.
When $t = 0$, every score passes the threshold, so the classifier predicts everything as positive. It therefore:
Gets all members of the positive class correctly, so the true positive rate is maximal
But it gets all members of the negative class wrongly. There are no true negatives nor false negatives.
Hence it is easy to check that this classifier sits at $(\mathrm{FPR}, \mathrm{TPR}) = (1, 1)$.
Now go to the other extremum, when $t$ exceeds every score: everything is predicted as negative, there are no positive predictions at all, and the classifier sits at $(\mathrm{FPR}, \mathrm{TPR}) = (0, 0)$. Intermediate thresholds trace a path between these two corners.
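We can check these two extremes numerically with scikit-learn's roc_curve, which returns the swept (FPR, TPR) pairs together with the corresponding thresholds (a small sketch):
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_pred)
# thresholds are returned in decreasing order: the first point corresponds to
# "predict nothing as positive" and the last to "predict everything as positive"
print("First point (FPR, TPR): ({0:.2f}, {1:.2f})".format(fpr[0], tpr[0]))
print("Last point  (FPR, TPR): ({0:.2f}, {1:.2f})".format(fpr[-1], tpr[-1]))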
from sklearn.metrics import RocCurveDisplay
fig, ax = plt.subplots(figsize=(5,5))
RocCurveDisplay.from_predictions(y_test, y_test_pred, ax=ax, label='Test ROC')
RocCurveDisplay.from_predictions(y_train, y_train_pred, ax=ax, label='Train ROC')
plt.legend()
plt.show()

This is the ROC curve. It is a non-decreasing function in the (FPR, TPR) plane, ranging from (0,0) to (1,1). Notice how the test and train ROC curves are similar, but not identical.
There are four unique properties satisfied by the ROC curve which we will discuss going forward:
Interpretable area under the curve (ROC AUC)
Invariance under class imbalance
Existence of a universal baseline
Convexity
1. Interpretable area under the curve (ROC AUC)#
The area under the ROC curve (usually called ROC AUC or just AUC) is the most used metric in machine learning, yet many data scientists don’t know its interpretation. We explain it (and prove why it is so) in what follows.
Randomly take a point of the positive ($Y = 1$) class, and calculate its score $p^+$
Randomly take a point of the negative ($Y = 0$) class, and calculate its score $p^-$
Claim:
$$\text{ROC AUC} = P(p^+ \geq p^-),$$
that is: the ROC AUC measures how likely it is that a point of the positive class scores higher than a point of the negative class.
This result is proven in the Appendix of this chapter.
Hence, in our example (where both train and test AUCs are close to 75%): there is a probability of roughly 75% of a point in the positive class scoring higher than a point in the negative class. This is quite good - about 3 out of 4 times we will correctly sort them under the score.
Numerically testing this claim: let us run a series of samplings to check if we obtain a similar fraction to the ROC AUC we calculated above.
import pandas as pd
df = pd.DataFrame({'score': y_test_pred, 'label': y_test})
print(df.head().round(3).to_string())
n_samples = 10000
# randomly sample scores from class 1 and 0
class_1_score_samples = df[df['label']==1]['score'].sample(n_samples, replace=True, random_state=0)
class_0_score_samples = df[df['label']==0]['score'].sample(n_samples, replace=True, random_state=1)
# check how many times class 1 score higher
total = (class_1_score_samples.values >= class_0_score_samples.values).sum()
print("-> Percentage of times score for class 1 was higher than class 0: {0:.1f}%".format(100* total/n_samples))
score label
0 0.059 0
1 0.214 0
2 0.750 1
3 0.636 1
4 0.253 0
-> Percentage of times score for class 1 was higher than class 0: 74.5%
As we can see, this works - the percentage of times was very close to the 74.6% test ROC AUC we calculated!
Properties:
This characterization of the ROC AUC allows one to extract interesting insights on how the AUC will behave under some transformations. For example: given the ROC AUC of a classifier with score $p$, how will it change if we replace the score by $10p$? Or by $p^2$?
Answer: it will stay the same. Since the ROC AUC only depends on how the scores of positive points compare to those of negative points, any strictly increasing transformation of the scores preserves that ordering and leaves the AUC unchanged.
In particular this is true for functions like $f(p) = 10p$ or $f(p) = p^2$ (which is increasing for non-negative scores):
print("Original test ROC AUC: {0:.4f}".format(roc_auc_score(y_test, y_test_pred)))
print("Applying 10x: {0:.4f}".format(roc_auc_score(10*y_test, 10*y_test_pred)))
print("Applying square: {0:.4f}".format(roc_auc_score(y_test**2, y_test_pred**2)))
Original test ROC AUC: 0.7458
Applying 10x: 0.7458
Applying square: 0.7458
This also explains why ROC AUC is the best metric when one is interested in sorting values based on the score. This is particularly the case in credit scoring, where one usually takes a list of potential individuals, scores them using a classification model, and sorts them (from good scores to bad scores) in order to shortlist those which are creditworthy. Notice that a person’s absolute score does not matter - what matters is how high it scores on the list compared to others. In this sense, a good model will have high AUC, because it ranks points well.
ROC AUC is blind to intra-class score performance. Suppose we have the following model (based on a real-life buggy model I once built):
For members of the positive class, it mostly predicts a random score between 70%-100%
For members of the negative class, it mostly predicts a score of 0%
This model will have very high AUC, because there is a high chance that a point in the positive class scores higher than one in the negative class. However, this model does a terrible job regarding scoring within each class, since it is essentially random (for the positive class) and all identical to zero (for the negative class)
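A minimal simulation of this pathology (the synthetic buggy_scores below are illustrative, not coming from our logistic regression): the AUC comes out essentially perfect even though the scores are useless within each class.
# buggy scorer: positives get random scores in [0.7, 1.0], negatives get exactly 0
rng = np.random.default_rng(0)
buggy_scores = np.where(y_test == 1, rng.uniform(0.7, 1.0, size=len(y_test)), 0.0)
print("Buggy model ROC AUC: {0:.3f}".format(roc_auc_score(y_test, buggy_scores)))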
2. Invariance under class imbalance#
As we discussed in Chapter 1, class imbalance is one of the biggest sources of complexity in classification problems. We claim that the ROC AUC is “invariant under” class imbalance. What does it mean?
First, let us clarify one thing. Physicists reading this probably feel like invariance under something is a positive thing, but this is not necessarily true in machine learning.
Invariance under class imbalance means that the ROC AUC (and the ROC curve more generally) does not change if we change the relative proportion of the positive and negative classes. To see this, notice how TPR and FPR are both probabilities conditional on a single class ($Y=1$ and $Y=0$, respectively): neither depends on the ratio of positives to negatives in the data.
The good thing about this is that AUC analysis works the same for balanced or imbalanced problems…
The bad part is that if you blindly run AUC analysis alone, you are blinding yourself to class imbalance issues. You might think that a model is very good when it actually is not!
As a spoiler: metrics such as precision do depend on class imbalance, as we will see further down.
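To see this empirically, we can artificially unbalance the test set by keeping only a fraction of the negative class and recomputing the AUC (a sketch; the numbers will move slightly due to sampling, not because of the imbalance itself):
# keep all positives but only 10% of the negatives
rng = np.random.default_rng(42)
pos_idx = np.where(y_test == 1)[0]
neg_idx = np.where(y_test == 0)[0]
keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=len(neg_idx) // 10, replace=False)])
print("Original AUC:   {0:.4f}".format(roc_auc_score(y_test, y_test_pred)))
print("Unbalanced AUC: {0:.4f}".format(roc_auc_score(y_test[keep], y_test_pred[keep])))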
3. Existence of a universal baseline#
It is common knowledge that the diagonal line in the ROC plane is the baseline corresponding to a random classifier. More formally, consider a classifier that ignores its input and predicts the positive class with some fixed probability $p$. Its TPR is $P(\hat{Y}=1 \mid Y=1) = p$, and its FPR is $P(\hat{Y}=1 \mid Y=0) = p$ as well.
Thus this random classifier corresponds to the point $(p, p)$ in the ROC plane; letting $p$ range over $[0,1]$ traces out exactly the 45-degree diagonal.
In practice, a random classifier will only sit exactly on the diagonal in the limit of an infinitely large sample. Below we show the test ROC for the logistic regression model, an actual random classifier, and the theoretical random classifier. Notice how the "real" random classifier is noisy.
fig, ax = plt.subplots(figsize=(5,5))
# completely random classifier, generating random scores between [0,1]
np.random.seed(123)
y_pred_random = np.random.rand(*y_test_pred.shape)
# theoretical random classifier
p = np.linspace(0,1)
RocCurveDisplay.from_predictions(y_test, y_test_pred, ax=ax, label='Logistic regression')
RocCurveDisplay.from_predictions(y_test, y_pred_random, ax=ax, label='Random classifier')
plt.plot(p,p,linestyle='--', label='Theoretical random clf.')
plt.legend()
plt.show()

Because of this baseline, the ROC AUC is effectively bounded between 0.5 (the area of the triangle below the 45-degree line) and 1: a model scoring below 0.5 can always be turned into one scoring above it by reversing its score ordering.
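A quick sketch of the "flip" argument: reversing the score ordering (for example, by negating the scores) maps an AUC of $a$ to $1 - a$, so anything below 0.5 can be turned into something above it.
# negating the scores reverses their ordering, mapping AUC a to 1 - a
print("AUC of the model:         {0:.4f}".format(roc_auc_score(y_test, y_test_pred)))
print("AUC with reversed scores: {0:.4f}".format(roc_auc_score(y_test, -y_test_pred)))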
4. Convexity#
Convexity is a geometrical property of sets which will allow us to construct new (and better) classifiers based on old ones.
Suppose we have two estimators A and B, represented in the FPR/TPR plane as below (A in red, B in blue). Recall that an estimator is a function that outputs either 0 or 1.
Convexity allows us to build a whole family of estimators in the line segment between A and B (represented in a dotted line).
The procedure on how to do it (which is not as important as knowing we can do it!) follows below:
[Theorem] Let $A$ and $B$ be two classifiers with rates $(\mathrm{FPR}_A, \mathrm{TPR}_A)$ and $(\mathrm{FPR}_B, \mathrm{TPR}_B)$, and let $\lambda \in [0,1]$. Define the randomized classifier $C_\lambda$ which, for each observation, outputs $A$'s prediction with probability $\lambda$ and $B$'s prediction with probability $1-\lambda$.
Then the classifier $C_\lambda$
attains a FPR of $\lambda\,\mathrm{FPR}_A + (1-\lambda)\,\mathrm{FPR}_B$ and a TPR of $\lambda\,\mathrm{TPR}_A + (1-\lambda)\,\mathrm{TPR}_B$; as $\lambda$ ranges over $[0,1]$, these points sweep the whole segment between $A$ and $B$.
Proof: we do the calculation for FPR; that of TPR follows analogously. By definition, the FPR of $C_\lambda$ is
$$P(C_\lambda = 1 \mid Y = 0) = \lambda\,P(A = 1 \mid Y = 0) + (1-\lambda)\,P(B = 1 \mid Y = 0) = \lambda\,\mathrm{FPR}_A + (1-\lambda)\,\mathrm{FPR}_B,$$
as claimed.
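We can verify this numerically with a small sketch: take two hard classifiers A and B obtained from different thresholds on our logistic-regression scores, randomize between them, and compare the observed rates with the convex combination. The helper rates, the thresholds 0.3 and 0.7, and the weight lam are illustrative choices; expect small deviations due to sampling noise.
def rates(y_true, y_hat):
    # (FPR, TPR) of a hard 0/1 classifier, read off its confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
    return fp / (fp + tn), tp / (tp + fn)

# two hard classifiers obtained from different thresholds on the score
A = (y_test_pred >= 0.3).astype(int)
B = (y_test_pred >= 0.7).astype(int)
lam = 0.4
# randomized classifier: use A's prediction with probability lam, otherwise B's
rng = np.random.default_rng(0)
mix = np.where(rng.random(len(y_test)) < lam, A, B)
fpr_A, tpr_A = rates(y_test, A)
fpr_B, tpr_B = rates(y_test, B)
fpr_mix, tpr_mix = rates(y_test, mix)
print("FPR - expected: {0:.3f}, observed: {1:.3f}".format(lam*fpr_A + (1-lam)*fpr_B, fpr_mix))
print("TPR - expected: {0:.3f}, observed: {1:.3f}".format(lam*tpr_A + (1-lam)*tpr_B, tpr_mix))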
The importance of this result is that it gives an optimal boundary for a set of classifiers. Consider the situation below, where a new classifier lies below the segment connecting A and B: by randomizing between A and B we can always find a classifier with the same FPR but a higher TPR, so the classifier below the segment is dominated.
Because of this, when given a classifier whose ROC curve is not convex, we can always do at least as well by replacing the curve with its convex hull - every point on the hull is attainable by randomizing between classifiers we already have.
We can build a function which creates the convex hull and also computes its AUC as below:
def hull_roc_auc(y_true, y_score):
    """
    Computes coordinates (FPR, TPR) and ROC AUC for the convex hull
    of a ROC curve built from a ground truth y_true (0s and 1s) and
    a vector of scores y_score
    """
    from sklearn.metrics import roc_curve, auc
    from scipy.spatial import ConvexHull
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # add artificial vertex at (1,0) so the hull closes below the curve
    fpr, tpr = np.append(fpr, [1]), np.append(tpr, [0])
    points = np.array([fpr, tpr]).T
    hull = ConvexHull(points)
    # get vertices and remove artificial vertex
    vertices = np.array([points[v] for v in hull.vertices if not np.array_equal(points[v], np.array([1., 0.]))])
    # sort by FPR so the area computation receives monotonically ordered x values
    vertices = vertices[np.argsort(vertices[:, 0])]
    fpr_hull, tpr_hull = vertices[:, 0], vertices[:, 1]
    # hull AUC
    hull_auc = auc(fpr_hull, tpr_hull)
    return hull_auc, fpr_hull, tpr_hull
# Calculate variables for hull - train/test
hull_auc_train, fpr_hull_train, tpr_hull_train = hull_roc_auc(y_train, y_train_pred)
hull_auc_test, fpr_hull_test, tpr_hull_test = hull_roc_auc(y_test, y_test_pred)
## Plot
fig, ax = plt.subplots(figsize=(8,4), ncols=2)
# original ROC
original_auc_train = roc_auc_score(y_train, y_train_pred)
RocCurveDisplay.from_predictions(y_train, y_train_pred, ax=ax[0],
                                 label=f'Log. Reg (AUC = {round(original_auc_train, 3)})')
original_auc_test = roc_auc_score(y_test, y_test_pred)
RocCurveDisplay.from_predictions(y_test, y_test_pred, ax=ax[1],
                                 label=f'Log. Reg (AUC = {round(original_auc_test, 3)})')
# convex hull
ax[0].plot(fpr_hull_train, tpr_hull_train, label=f"Hull ROC (AUC = {round(hull_auc_train,3)})", marker='.', color='black')
ax[1].plot(fpr_hull_test, tpr_hull_test, label=f"Hull ROC (AUC = {round(hull_auc_test,3)})", marker='.', color='black')
# legends/labels
ax[0].legend(); ax[1].legend()
ax[0].set_title("Train"); ax[1].set_title("Test")
plt.tight_layout()
plt.show()

Is this cheating?#
No. It is a remarkable property of the "ROC space" (FPR vs TPR) that linear interpolation between classifiers yields a new, perfectly legitimate classifier. Since we do not have access to the "true" distribution of the data, the classifiers we build are estimates anyway; the randomized classifiers on the hull are just as valid as the ones we started from.
Takeaways#
The ROC curve represents a machine learning classifier in the TPR / FPR plane
If it is not convex, it can be made convex by connecting points via line segments. This is equivalent to building new classifiers as probabilistic samplings of the endpoint classifiers
The area under the ROC curve (ROC AUC) represents the likelihood that a point of the positive class scores higher than one in the negative class. It is effectively bounded between 0.5 and 1.0
The ROC curve (and thus the ROC AUC) are invariant under rebalancing of the positive / negative classes. This is both good and bad
Appendix#
Proof of probabilistic interpretation of the ROC AUC#
Here, we prove the claim that, if $p^+$ denotes the score of a point drawn at random from the positive class and $p^-$ the score of a point drawn independently at random from the negative class, then $\text{ROC AUC} = P(p^+ \geq p^-)$.
Proof: by definition,
$$\text{ROC AUC} = \int_0^1 \mathrm{TPR}\; d(\mathrm{FPR}).$$
It is natural to parameterize FPR and TPR via the threshold $t$, for which
$$\mathrm{TPR}(t) = P(p^+ \geq t) = \int_t^1 f_+(s)\, ds, \qquad \mathrm{FPR}(t) = P(p^- \geq t) = \int_t^1 f_-(s)\, ds,$$
where $f_+$ and $f_-$ are the probability density functions of $p^+$ and $p^-$. As $t$ runs from 1 down to 0, FPR runs from 0 to 1.
Plugging these back into the definition of ROC AUC, and noting that $d(\mathrm{FPR}) = -\mathrm{FPR}'(t)\, dt = f_-(t)\, dt$, gives
$$\text{ROC AUC} = \int_0^1 \mathrm{TPR}(t)\, f_-(t)\, dt = \int_0^1 \left[ \int_t^1 f_+(s)\, ds \right] f_-(t)\, dt,$$
where we have used the fundamental theorem of calculus in the last equality. Now one can equivalently write these iterated integrals as
$$\iint_{s \geq t} f_+(s)\, f_-(t)\, ds\, dt = P(p^+ \geq p^-),$$
where we identify $f_+(s)\, f_-(t)$ as the joint density of the independent pair $(p^+, p^-)$, and the region of integration as the event $\{p^+ \geq p^-\}$. This completes the proof.
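As a numerical sanity check of the starting identity $\text{ROC AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})$, we can integrate the empirical ROC curve with the trapezoidal rule (scikit-learn's auc helper) and compare with roc_auc_score (a sketch):
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_test_pred)
# trapezoidal integration of TPR against FPR should match roc_auc_score
print("Integral of TPR d(FPR): {0:.4f}".format(auc(fpr, tpr)))
print("roc_auc_score:          {0:.4f}".format(roc_auc_score(y_test, y_test_pred)))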
Further properties of the ROC curve#
There is an interesting characterization of the ROC curve based on the cumulative distribution functions (CDFs) of the variables $p^+$ and $p^-$ defined above.
Let $F_+(t) = P(p^+ \leq t)$ be the CDF of $p^+$; then $\mathrm{TPR}(t) = P(p^+ \geq t) = 1 - F_+(t)$.
Similarly, $\mathrm{FPR}(t) = 1 - F_-(t)$, where $F_-$ is the CDF of $p^-$.
We can explicitly write the ROC curve as a function of $x = \mathrm{FPR}$: inverting $x = 1 - F_-(t)$ gives $t = F_-^{-1}(1-x)$, and hence $\mathrm{TPR} = 1 - F_+\big(F_-^{-1}(1-x)\big)$.
To simplify notation, call $\mathrm{ROC}(x)$ this function.
From either expression, one can take the derivative and see that (calling $f_\pm$ the PDFs corresponding to $F_\pm$)
$$\mathrm{ROC}'(x) = \frac{f_+\big(F_-^{-1}(1-x)\big)}{f_-\big(F_-^{-1}(1-x)\big)} \geq 0,$$
since the PDFs are always non-negative, and so is their ratio. So we see that the ROC curve is necessarily non-decreasing.
It has, however, no obligation of being concave (= curved face-down), even if that is how we usually draw it. Taking a second derivative of the expression above yields
$$\mathrm{ROC}''(x) = \frac{f_+(t)\, f_-'(t) - f_+'(t)\, f_-(t)}{f_-(t)^3},$$
where all quantities are calculated at $t = F_-^{-1}(1-x)$; the sign of this expression depends on the score densities, so nothing forces the curve to be concave.
References:#
Peter Flach, Meelis Kull, Precision-Recall-Gain Curves: PR Analysis Done Right, NIPS 2015
https://arxiv.org/pdf/1809.04808.pdf on the concavity of ROC curves