Behold: The Confusion Matrix

Not a “confusing” matrix anymore

Vijayasri Iyer
4 min read · Nov 23, 2021

A confusion matrix is one of the primary tools for evaluating the performance of a classification model. Used extensively in binary classification tasks, it extends easily to the multi-class setting as well. Let's go over all of the important terminology and formulas you need to work with a confusion matrix.

The Confusion Matrix

Consider this example: you are using a classification model to find out whether the 10 apples you bought from the market are good or rotten by feeding images of the apples into the model. Let's assume rotten apples have the label "Yes" and good apples have the label "No". From inspecting the apples earlier, you know that exactly 6 of the apples in your basket are rotten. When you plot the results from your classifier into a confusion matrix, this is the result:

Two representations of confusion matrix

Tricky? I bet. These are two widely used representations of the confusion matrix. The one on the left (actual labels on top) appears in a lot of academic course material, while the one on the right (predicted labels on top) is used in machine learning libraries like scikit-learn. Either way, once you know the meanings of the values in the matrix along with their associated metrics, navigating both becomes easy.

True Positive (TP): The "rotten" apples that are classified as "rotten" by the model. (TP = 5)

True Negative (TN): The "good" apples that are classified as "good" by the model. (TN = 2)

False Positive (FP): The "good" apples that are misclassified as "rotten" by the model. You discard good apples thinking they are rotten. (Also known as a Type-1 error.) (FP = 2)

False Negative (FN): The "rotten" apples that are misclassified as "good" by the model. In this case, you end up eating rotten apples thinking they are good. (Also known as a Type-2 error.) (FN = 1)
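
If you use scikit-learn, these four values can be read straight off the matrix it returns. Here is a minimal sketch, assuming hypothetical y_true/y_pred arrays chosen to reproduce the apple counts above:

    from sklearn.metrics import confusion_matrix

    # Hypothetical labels for the 10 apples: 1 = rotten ("Yes"), 0 = good ("No").
    y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 6 rotten, 4 good
    y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]   # the model's predictions

    # scikit-learn puts actual labels on the rows and predicted labels on the
    # columns, so for labels [0, 1] the matrix is [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, tn, fp, fn)   # 5 2 2 1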

Metrics

Following is a list of metrics that can be derived from the confusion matrix.

P (Number of actually positive/rotten apple samples) = TP + FN

N (Number of actually negative/good apple samples) = TN + FP

  • Precision = TP/(TP + FP)
  • Sensitivity/Recall/True Positive Rate (TPR) = TP/P = TP/(TP+FN)
  • Specificity/True Negative Rate (TNR) = TN/N = TN/(TN+FP)
  • False Negative Rate (FNR) = FN/P = FN/(TP+FN)
  • False Positive Rate (FPR) = FP/N = FP/(TN+FP)
  • Accuracy = (TP + TN)/(P+N)
  • F1-score = 2*(Precision*Recall)/(Precision + Recall) = 2TP/(2TP + FP + FN)
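
To make the formulas concrete, here is a small sketch that simply plugs in the apple counts from above (TP = 5, TN = 2, FP = 2, FN = 1):

    tp, tn, fp, fn = 5, 2, 2, 1
    p, n = tp + fn, tn + fp                 # actual positives (6) and negatives (4)

    precision = tp / (tp + fp)              # 0.714
    recall = tp / p                         # sensitivity / TPR: 0.833
    specificity = tn / n                    # TNR: 0.5
    fnr = fn / p                            # 0.167
    fpr = fp / n                            # 0.5
    accuracy = (tp + tn) / (p + n)          # 0.7
    f1 = 2 * precision * recall / (precision + recall)   # 0.769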

For any classification task, we want to maximize the TPR and TNR (sensitivity and specificity), since they measure the proportion of correctly classified positive and negative samples, and minimize the FPR and FNR, which measure the proportion of misclassified samples. However, in many real-world situations we have to choose between reducing the number of FP cases and reducing the number of FN cases, since it may not be possible to minimize both. In our example, we would want to reduce the FN, since we don't want to accidentally eat a rotten apple! A detailed description of Type-1 and Type-2 errors can be found here.
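
One common way to trade one error type for the other, assuming the model outputs a probability of "rotten" rather than a hard label, is to move the decision threshold. The sketch below uses made-up scores; lowering the threshold removes the FN at the cost of an extra FP:

    import numpy as np

    # Made-up predicted probabilities of "rotten" for the 10 apples (1 = rotten).
    y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
    y_score = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.45, 0.2, 0.42, 0.55, 0.65])

    for threshold in (0.5, 0.4):
        y_pred = (y_score >= threshold).astype(int)
        fn = int(np.sum((y_true == 1) & (y_pred == 0)))
        fp = int(np.sum((y_true == 0) & (y_pred == 1)))
        print(f"threshold={threshold}: FN={fn}, FP={fp}")
    # threshold=0.5: FN=1, FP=2
    # threshold=0.4: FN=0, FP=3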

Two other metrics used for evaluating the classifier are accuracy and the F1-score. Accuracy takes into account both TP and TN and uses the total number of samples as the denominator. Since every sample is weighted equally, the majority class dominates this score, so accuracy may not be a good fit for imbalanced datasets. The F1-score, on the other hand, is the harmonic mean of precision and recall and takes into account both FP and FN. Hence, the F1-score is usually a better indicator of the robustness of a classifier than the accuracy score. To read more about the different types of F1-score metrics, click here.
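
A minimal sketch of why this matters, using made-up imbalanced data and scikit-learn: a classifier that always predicts the majority class gets 95% accuracy but an F1-score of 0.

    from sklearn.metrics import accuracy_score, f1_score

    # Imbalanced toy data: 95 negatives, 5 positives.
    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100   # a useless classifier that always predicts "negative"

    print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
    print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- reveals the problem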

Points to remember

  • Always pay attention to the labels of your dataset/task. Your TP, FP, TN, and FN all depend on the labels you assign to the output. For instance, if we reverse the labels from (rotten apples = 1, good apples = 0) to (good apples = 1, rotten apples = 0), our matrix will look completely different.
  • The FP and FN will also differ based on the task your model is performing and the labels attached to that task. In the reversed-label scenario, we would be looking to reduce the false positives instead, as the sketch below shows.
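
Here is a small sketch, reusing the hypothetical apple arrays from earlier with scikit-learn, of how flipping which class counts as "positive" reshuffles the four cells:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 1 = rotten
    y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

    # Original labelling: rotten = 1 (positive), good = 0 (negative).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, tn, fp, fn)   # 5 2 2 1 -> eating a rotten apple is a FN

    # Reversed labelling: good = 1 (positive), rotten = 0 (negative).
    y_true_rev = [1 - y for y in y_true]
    y_pred_rev = [1 - y for y in y_pred]
    tn, fp, fn, tp = confusion_matrix(y_true_rev, y_pred_rev).ravel()
    print(tp, tn, fp, fn)   # 2 5 1 2 -> the same mistake now shows up as a FP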

In upcoming posts, we will discuss more classical machine learning algorithms and the metrics used for tasks like classification and regression. Till then, stay tuned!

References

  1. https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
  2. https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
  3. https://towardsdatascience.com/accuracy-recall-precision-f-score-specificity-which-to-optimize-on-867d3f11124
  4. https://towardsdatascience.com/the-two-variations-of-confusion-matrix-get-confused-never-again-8d4fb00df308
  5. https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262

