Identifying Potential Sources of Bias in Deep Learning Models for Embryo Assessment
Presented at: ASRM Scientific Congress, 2021
Authors: Kevin Loewke, Justina Hyunjii Cho, Paxton Maeder-York, Oleksii Barash, Marcos Meseguer, Nikica Zaninovic, Kathleen A. Miller, Denny Sakkas, Michael Levy, Matthew (Tex) VerMilyea
Materials and Methods:
Historical, de-identified images of blastocyst-stage embryos were collected from 11 IVF clinics in the United States between 2015 and 2020. Each clinic captured a single image of each embryo using its existing ICSI microscope, stereo zoom microscope, or time-lapse microscope. Approximately 8,000 images were matched to a known outcome (positive clinical pregnancy, negative clinical pregnancy, or PGT-aneuploid). We trained a series of deep convolutional neural networks (CNNs) to rank embryo images according to their likelihood of having a positive or negative outcome. Experiments were performed using different techniques for combining images from the 11 clinical sites, including naive and balanced methods. For each experiment, the aggregated data were split into 70% train and 30% test. The area under the receiver operating characteristic curve (AUC) was used to evaluate the ability of the models to rank embryos according to their likelihood of achieving a positive outcome. Total and per-site AUCs, as well as total and per-site inference probabilities, were evaluated for each experiment to identify and reduce potential sources of bias in the training data.
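The total and per-site AUC comparison described above can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the `(site, label, score)` record layout, and the toy scores are all assumptions made for illustration. The toy data are constructed so that scores track the site rather than the outcome, which is enough to make the pooled AUC look better than the AUC within either site.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None if a class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def total_and_per_site_auc(records):
    """records: list of (site, label, score) tuples (hypothetical layout)."""
    by_site = defaultdict(list)
    for site, label, score in records:
        by_site[site].append((label, score))
    total = auc([r[1] for r in records], [r[2] for r in records])
    per_site = {
        site: auc([y for y, _ in rows], [s for _, s in rows])
        for site, rows in by_site.items()
    }
    return total, per_site


# Toy data (invented): site A is mostly positive and scored high,
# site B is mostly negative and scored low, regardless of the embryo.
records = [
    ("A", 1, 0.9), ("A", 1, 0.7), ("A", 1, 0.8), ("A", 0, 0.8),
    ("B", 0, 0.1), ("B", 0, 0.3), ("B", 0, 0.2), ("B", 1, 0.2),
]
total, per_site = total_and_per_site_auc(records)
print(total)     # 0.75
print(per_site)  # {'A': 0.5, 'B': 0.5}
```

In this toy case the pooled AUC is 0.75 while each site's AUC is at chance (0.5), because the scores separate the sites rather than the outcomes within a site. Reporting both metrics side by side is what exposes this kind of discrepancy.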
Results:
A naive approach that pooled data from all clinical sites achieved the highest total AUC on the test set (0.75) but also the lowest per-site AUC (0.51). Investigation of this discrepancy revealed two strong sources of bias, which artificially inflated the total AUC and significantly limited per-site performance: the unique optical signature of each type of microscope, and the presence of foreign objects, such as a holding micropipette, in the image. If a certain optical signature or foreign object appeared more often in the positive training class than in the negative training class, the CNN models learned this bias and gave a higher score to those images regardless of embryo morphology. With these insights, a new dataset was prepared that balanced the ratio of positive to negative samples for each type of microscope and for each group containing foreign objects. This balanced dataset provided a non-inflated total AUC of 0.72 and significantly raised the lowest per-site AUC from 0.51 to 0.61.
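One way to balance positive-to-negative ratios within each group is to downsample the larger class in every group. The abstract does not say how the balancing was actually done, so the sketch below is only an assumption: it targets a 1:1 ratio, and the field names (`microscope`, `has_pipette`, `label`) and the `balance_by_group` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_group(samples, group_key, seed=0):
    """Downsample each group to equal numbers of positives and negatives.

    Assumption: a 1:1 target ratio. Groups missing one class are dropped.
    """
    rng = random.Random(seed)
    groups = defaultdict(lambda: {0: [], 1: []})
    for sample in samples:
        groups[group_key(sample)][sample["label"]].append(sample)

    balanced = []
    for by_label in groups.values():
        n = min(len(by_label[0]), len(by_label[1]))
        balanced.extend(rng.sample(by_label[1], n))  # positives
        balanced.extend(rng.sample(by_label[0], n))  # negatives
    rng.shuffle(balanced)
    return balanced


def make(microscope, has_pipette, label, count):
    """Build toy metadata records (no image data) for the example."""
    return [
        {"microscope": microscope, "has_pipette": has_pipette, "label": label}
        for _ in range(count)
    ]


# Toy data (invented): ICSI images with a pipette skew positive,
# time-lapse images without one skew negative.
samples = (
    make("icsi", True, 1, 6) + make("icsi", True, 0, 2)
    + make("time_lapse", False, 1, 1) + make("time_lapse", False, 0, 5)
)
balanced = balance_by_group(
    samples, lambda s: (s["microscope"], s["has_pipette"])
)
counts = Counter(
    (s["microscope"], s["has_pipette"], s["label"]) for s in balanced
)
print(len(balanced))  # 6
```

After balancing, neither the microscope type nor the presence of a pipette predicts the label in the training set, so a model has less incentive to use them as shortcuts. The cost of downsampling is discarded data; reweighting or oversampling would be alternatives under the same assumption.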
Conclusions:
There has been significant recent interest in using deep learning for analyzing images of embryos at the blastocyst stage. The black-box nature of deep learning models, such as CNNs, makes it difficult to recognize when potential sources of bias have been introduced during the training process. We performed a series of experiments that identified and reduced two sources of bias, and improved per-site performance of the CNN. Future work will continue to search for other sources of bias and address them accordingly.
Impact Statement:
Naive approaches to preparing training data for deep learning models for embryo ranking can create bias in the models. Our work illustrates the need for careful preparation of training data and monitoring of different metrics to identify and reduce potential sources of bias.