Inference Insight Reports
The inference insight report is a guide that helps you evaluate your trained model on a given test dataset. The provided metrics and figures depend on the task being performed. We also provide some case studies that showcase the model’s raw predictions. By analyzing the mistakes the model makes, we can identify possible weaknesses of the model and adjust future training accordingly.
Note: It is not necessary for the test dataset to originate from the same data source as the training dataset. Ideally, the test dataset should be rigorous (e.g. contain both easy and hard samples) so that the model’s true performance can be assessed.
The definitions of the deep learning metrics are given in https://console.deepq.ai/docs/console/account-management/deep-learning-metrics-explained.
Multi-Class Classification
In multi-class classification, we provide the ROC curves (left), the Precision-Recall (PR, center) curves, and the Confusion Matrix (right).
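If you want to reproduce similar curves outside the console, the sketch below shows one way to do it with scikit-learn, treating each class one-vs-rest; the class names, labels, and predicted probabilities are hypothetical placeholders, not console output.

```python
# A minimal sketch of one-vs-rest ROC/PR curves and a confusion matrix for a
# multi-class classifier. All values below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, confusion_matrix

class_names = ["cat", "dog", "fish"]        # hypothetical classes
y_true = np.array([0, 1, 2, 1, 0, 2])       # ground-truth class indices
y_prob = np.array([                          # per-class predicted probabilities
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.6, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# One-vs-rest ROC and PR curves, one pair per class.
for i, name in enumerate(class_names):
    is_class = (y_true == i).astype(int)
    fpr, tpr, _ = roc_curve(is_class, y_prob[:, i])
    precision, recall, _ = precision_recall_curve(is_class, y_prob[:, i])
    print(f"{name}: {len(fpr)} ROC points, {len(precision)} PR points")

# Confusion matrix from the argmax predictions.
print(confusion_matrix(y_true, y_prob.argmax(axis=1)))
```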
Multi-Label Classification
In multi-label classification, we provide the ROC curves and the Precision-Recall (PR) curves (see multi-class classification above for example plots).
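For reference, a minimal multi-label sketch follows, assuming a multi-hot ground-truth matrix and independent per-label scores; the label names and values are hypothetical.

```python
# A minimal multi-label sketch: each label gets its own binary indicator column
# and its own ROC/PR curve. All values below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

label_names = ["opacity", "nodule"]                      # hypothetical labels
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])      # multi-hot ground truth
y_score = np.array([[0.9, 0.2], [0.7, 0.8], [0.3, 0.6], [0.1, 0.1]])

for i, name in enumerate(label_names):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_score[:, i])
    precision, recall, _ = precision_recall_curve(y_true[:, i], y_score[:, i])
    auc = roc_auc_score(y_true[:, i], y_score[:, i])
    print(f"{name}: AUC={auc:.3f}, {len(fpr)} ROC points, {len(precision)} PR points")
```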
Object Detection
In object detection, we show the PR curves (below left) and the FROC curves (below right).
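Both of these curves are built on IoU-based matching between predicted and ground-truth boxes. The sketch below illustrates one common matching rule (greedy matching at IoU ≥ 0.5, one match per ground-truth box); the boxes and threshold are hypothetical, and the console's exact matching rules may differ.

```python
# A minimal sketch of the IoU-based matching behind PR and FROC curves:
# a prediction counts as a True Positive if it overlaps an unmatched
# ground-truth box with IoU >= 0.5, otherwise it is a False Positive.
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

gt_boxes = [(10, 10, 50, 50), (60, 60, 90, 90)]          # hypothetical ground truth
pred_boxes = [(12, 12, 48, 52), (100, 100, 120, 120)]    # sorted by confidence

matched, tp, fp = set(), 0, 0
for pred in pred_boxes:
    best_idx = max(range(len(gt_boxes)), key=lambda i: iou(pred, gt_boxes[i]))
    if iou(pred, gt_boxes[best_idx]) >= 0.5 and best_idx not in matched:
        matched.add(best_idx)
        tp += 1
    else:
        fp += 1
fn = len(gt_boxes) - len(matched)
print(f"TP={tp} FP={fp} FN={fn}")
```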
Semantic Segmentation
In semantic segmentation, there are no relevant curves to display.
The model performance table provides quantifiable values that evaluate the model’s performance. Below we show several example performance tables:
For every task except object detection, the model performance table is organized by class, with the average performance across classes on the last row. By comparing the per-class performance with the average, you can identify classes where the model might be under-performing. Each row provides several metrics; focus on the metrics that matter most for your use case.
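As a reference for how such a table can be reproduced for a classification task, the sketch below computes per-class precision, recall, and F1 with scikit-learn and appends a macro average row; the classes, labels, and metric columns are hypothetical and may not match the report exactly.

```python
# A minimal sketch of a per-class performance table with a macro average row.
# All values below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

class_names = ["cat", "dog", "fish"]        # hypothetical classes
y_true = np.array([0, 1, 2, 1, 0, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 2, 1])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0
)
for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:<6} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"{'avg':<6} precision={precision.mean():.2f} "
      f"recall={recall.mean():.2f} f1={f1.mean():.2f}")
```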
The Data and Prediction Statistics figures and tables section provides a comparison between the ground truth label distribution and the model’s predicted label distribution. This is useful when we want to identify whether the model has potentially overfit/underfit to a certain class during training. We show an example below:
Explanation: We can see that the test dataset is roughly evenly distributed, but the majority of our model’s predictions are concentrated on cats and dogs. It’s possible this is caused by our training dataset containing a large number of cat and dog images. We can also see that our model rarely predicts fish. It’s possible that the fish images in our test set look very different from the fish images in the training set, or that there are simply not many fish images in our training set.
Note: We cannot use this comparison to conclude that a model is well trained (even in the case where the predicted and ground truth label distributions are equivalent). This comparison can only be used to identify potential overfitting or underfitting.
The explanation for each of these plots is given in the Training Insight Report tutorial. We can perform the same comparison across the different tasks in a similar manner, as shown above.
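For a classification task, this comparison boils down to counting labels on both sides. The sketch below shows the idea; the class names and values are hypothetical.

```python
# A minimal sketch of the ground-truth vs. predicted label distribution
# comparison. All values below are hypothetical placeholders.
from collections import Counter

class_names = ["cat", "dog", "fish", "bird"]   # hypothetical classes
y_true = ["cat", "dog", "fish", "bird", "cat", "dog", "fish", "bird"]
y_pred = ["cat", "dog", "cat", "dog", "cat", "dog", "dog", "cat"]

truth_counts, pred_counts = Counter(y_true), Counter(y_pred)
for name in class_names:
    print(f"{name:<5} ground truth={truth_counts[name]} predicted={pred_counts[name]}")
```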
The case studies section of our inference report samples some of the model’s predictions and categorizes them as True Positive, False Positive, or False Negative for each class.
The definitions of True Positive, False Positive, and False Negative differ between tasks. We use the following figure to explain:
We provide 18 samples of True Positive, False Positive, and False Negative cases for each class. By analyzing each of these predictions and their ground truths, we can pinpoint the errors (False Positives and False Negatives) and identify possible areas of improvement for our model.
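For a classification task, the bucketing behind these case studies can be sketched as follows; the sample IDs and labels are hypothetical, and the console may apply its own rules when selecting which samples to display.

```python
# A minimal sketch of bucketing classification predictions into True Positive,
# False Positive, and False Negative cases per class before sampling.
# All values below are hypothetical placeholders.
from collections import defaultdict

samples = [  # (sample_id, ground_truth, prediction)
    ("img_001", "cat", "cat"),
    ("img_002", "dog", "cat"),
    ("img_003", "fish", "fish"),
    ("img_004", "cat", "dog"),
]

cases = defaultdict(lambda: {"TP": [], "FP": [], "FN": []})
for sample_id, truth, pred in samples:
    if truth == pred:
        cases[truth]["TP"].append(sample_id)
    else:
        cases[pred]["FP"].append(sample_id)   # predicted class got a wrong hit
        cases[truth]["FN"].append(sample_id)  # true class was missed

for name, buckets in cases.items():
    print(name, dict(buckets))
```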