Towards reliable use of artificial intelligence to classify otitis media using otoscopic images: Addressing bias and improving data quality
- Yixi Xu,
- Al-Rahim Habib,
- Graeme Crossland,
- Hemi Patel,
- Chris Perry,
- Kris Bock,
- Tony Lian,
- Bill Weeks,
- Rahul Dodhia,
- Juan M. Lavista Ferres,
- Narinder Pal Singh
Importance
Ear disease is a major contributor to global hearing loss, and recurrent otitis media is a leading
preventable cause in children, where it can impair development. Artificial intelligence (AI) shows
promise for early diagnosis via otoscopic image analysis, but dataset biases and inconsistencies limit
the generalizability and reliability of the resulting models.
Objective
To evaluate biases in public otoscopic image datasets and assess their impact on diagnostic performance
and generalizability of AI models for middle ear conditions.
Design, Setting, and Participants
This retrospective study systematically evaluated three public otoscopic image datasets (Chile: 880
images; Ohio, USA: 454 images; Türkiye: 956 images) using quantitative and qualitative methods. Two
counterfactual experiments were performed: (1) obscuring clinically relevant features to assess model
reliance on non-clinical artifacts, and (2) evaluating the impact of hue, saturation, and value on
diagnostic predictions. Model performance was assessed internally and externally to determine the
extent of bias and generalizability.
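To make the masking procedure concrete, the sketch below occludes a centered circular region of an otoscopic image, standing in for the clinically relevant area, at increasing sizes. The circular geometry, radius fractions, and file path are illustrative assumptions rather than the study's exact protocol.

```python
# Minimal sketch of Counterfactual Experiment I (illustrative, not the
# study's exact protocol): black out a centered circle, where the tympanic
# membrane typically appears, so a classifier sees only peripheral context.
import numpy as np
from PIL import Image

def mask_center(image: Image.Image, radius_frac: float) -> Image.Image:
    """Occlude a centered circle whose radius is radius_frac of the half-diagonal."""
    arr = np.array(image)
    h, w = arr.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    radius = radius_frac * np.hypot(h, w) / 2
    inside = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
    arr[inside] = 0  # remove the clinically relevant pixels
    return Image.fromarray(arr)

# A model relying on clinical features should degrade as the mask grows
# (cf. the Türkiye dataset's AUC falling from 0.86 to 0.65); one that holds
# steady is likely exploiting non-clinical artifacts.
image = Image.open("otoscopy_example.png").convert("RGB")  # placeholder path
masked_variants = [mask_center(image, f) for f in (0.0, 0.25, 0.5, 0.75)]
```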
Main Outcomes and Measures
The primary outcome was the effect of dataset biases on AI model performance, measured by changes in
the area under the receiver operating characteristic curve (AUC) between internal and external
validation. Secondary outcomes included the identification of redundant and stylistically biased images.
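For reference, the internal-versus-external AUC comparison can be computed as below; the labels and scores are synthetic placeholders (not study data), with domain shift simulated as noisier external scores.

```python
# Illustrative internal vs. external AUC comparison on synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                # 0 = normal, 1 = abnormal
internal = labels + rng.normal(0.0, 0.45, size=500)  # scores well aligned with labels
external = labels + rng.normal(0.0, 1.6, size=500)   # weaker alignment after domain shift
print(f"internal AUC = {roc_auc_score(labels, internal):.2f}")
print(f"external AUC = {roc_auc_score(labels, external):.2f}")
```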
Results
Quantitative analysis revealed significant biases in the Chile and Ohio, USA datasets. In
Counterfactual Experiment I, models trained on these datasets achieved high internal performance
(AUC > 0.90) but generalized poorly externally, attributable to dataset-specific artifacts. The
Türkiye dataset showed fewer biases: AUC decreased from 0.86 to 0.65 as masking increased, suggesting
greater reliance on clinically meaningful features. Counterfactual Experiment II identified common
artifacts in the Chile and Ohio, USA datasets; a logistic regression model trained only on clinically
irrelevant features from the Chile dataset achieved high internal (AUC = 0.89) and external (Ohio,
USA: AUC = 0.87) performance. Qualitative analysis identified substantial redundancy (61% of the Chile
dataset consisted of near-duplicate images) and stylistic biases in the Ohio, USA dataset that
correlated with clinical outcomes.
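A minimal sketch of the shortcut model described above, assuming the clinically irrelevant features are per-image mean hue, saturation, and value; the file paths and labels are placeholders, and the study's exact feature set may differ.

```python
# Sketch of a shortcut classifier trained only on global color statistics
# (mean hue/saturation/value per image), which carry no anatomical content.
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def hsv_features(path: str) -> np.ndarray:
    """Mean hue, saturation, and value of an image: style, not anatomy."""
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=float)
    return hsv.reshape(-1, 3).mean(axis=0)

# Placeholder file lists and labels; substitute the Chile (train) and
# Ohio, USA (external test) images to reproduce this style of analysis.
train_paths, y_train = ["chile_0001.png", "chile_0002.png"], np.array([0, 1])
test_paths, y_test = ["ohio_0001.png", "ohio_0002.png"], np.array([0, 1])

X_train = np.stack([hsv_features(p) for p in train_paths])
X_test = np.stack([hsv_features(p) for p in test_paths])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("external AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

If such a model approaches the performance of an image-based classifier (here, internal AUC = 0.89 and external AUC = 0.87), the deep model's apparent accuracy likely rests on the same non-clinical signal.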
Conclusions and Relevance
Dataset biases significantly compromise the reliability and generalizability of AI-based otoscopic
diagnostic models. Addressing these biases through standardized imaging protocols, inclusion of
diverse datasets, and improved labeling methods is crucial for developing robust AI solutions,
expanding access to high-quality healthcare, and enhancing diagnostic accuracy.