Skin Tone Bias in AI Dermatology Tools: Are We Building Inclusive Systems?
Introduction: The Rise of AI in Dermatology
Artificial intelligence (AI) is rapidly transforming dermatology by assisting clinicians in diagnosing skin diseases and detecting cancers at earlier stages [2]. Landmark research, such as the 2017 Stanford study demonstrating dermatologist-level accuracy in skin lesion classification, has accelerated AI’s integration into clinical practice [1]. AI-based tools, including smartphone apps, have gained regulatory approval in Europe, helping to address the global shortage of dermatologists and to broaden access to care [3,4]. However, AI’s reliance on machine learning datasets raises critical ethical concerns about bias, fairness, and patient safety, particularly across diverse patient populations. Although skin cancer incidence is statistically higher among people with lighter skin, ensuring AI accuracy across all skin types is essential to avoid worsening existing diagnostic disparities and to deliver equitable, high-quality dermatological care [5].
Differential Accuracy Across Skin Tones
Emerging research highlights troubling disparities in AI diagnostic accuracy across skin tones, particularly for darker skin types (Fitzpatrick IV–VI). For example, a 2022 evaluation using the Diverse Dermatology Images (DDI) dataset revealed substantial limitations in widely cited AI models. Stanford’s DeepDerm, initially acclaimed for its high accuracy, showed a sensitivity of 0.69 on lighter skin tones but only 0.23 on darker skin, a roughly three-fold disparity [6]. Another notable algorithm, ModelDerm, showed a similarly large drop in sensitivity (0.41 on lighter skin versus 0.12 on darker skin) [6].
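To make the reported gap concrete, the sketch below shows one way per-group sensitivity could be computed from a labelled evaluation set. It is a minimal illustration, not the evaluation code from the cited studies, and the record fields ("fitzpatrick", "label", "prediction") are assumed names.

```python
# Minimal sketch of a per-skin-tone sensitivity audit (illustrative only).
# Field names ("fitzpatrick", "label", "prediction") are assumptions,
# not the schema used by the studies cited above.

from collections import defaultdict

def sensitivity_by_group(records):
    """Return sensitivity (true-positive rate) per Fitzpatrick group."""
    tp = defaultdict(int)  # malignant lesions the model flagged
    fn = defaultdict(int)  # malignant lesions the model missed
    for r in records:
        if r["label"] == 1:  # only truly malignant cases enter sensitivity
            bucket = tp if r["prediction"] == 1 else fn
            bucket[r["fitzpatrick"]] += 1
    groups = set(tp) | set(fn)
    return {g: tp[g] / (tp[g] + fn[g]) for g in groups if tp[g] + fn[g] > 0}

# Toy run: a sensitivity of 0.23 on darker skin, as reported for DeepDerm,
# would mean roughly three of every four malignancies on darker skin are missed.
if __name__ == "__main__":
    toy = [
        {"fitzpatrick": "I-II", "label": 1, "prediction": 1},
        {"fitzpatrick": "I-II", "label": 1, "prediction": 0},
        {"fitzpatrick": "V-VI", "label": 1, "prediction": 1},
        {"fitzpatrick": "V-VI", "label": 1, "prediction": 0},
        {"fitzpatrick": "V-VI", "label": 1, "prediction": 0},
    ]
    print(sensitivity_by_group(toy))  # e.g. {'I-II': 0.5, 'V-VI': 0.333...}
```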
These findings suggest that the disparities arise primarily from inadequate representation of darker skin tones in training datasets. Notably, when researchers supplemented training data with images of diverse skin tones, the accuracy gaps narrowed substantially, underscoring dataset composition as a pivotal factor [6].
Key Findings on Dataset Bias
Severe underrepresentation of darker skin in public dermatological image databases is a recurring issue. Of the roughly 106,000 clinical images analyzed by Wen et al., only 11 depicted darker skin, and none were recorded as coming from African, African-Caribbean, or South Asian populations [7].
Such dataset imbalances directly impact AI generalizability, risking diagnostic inaccuracies and exacerbating healthcare disparities. Notably, between 2011 and 2015, the five-year melanoma survival rate in the United States was 66% for Black patients, compared to 90% for non-Hispanic White patients [8].
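One practical response to such imbalances is to audit a dataset’s demographic metadata before any model is trained on it. The following is a minimal sketch assuming a hypothetical metadata.csv with an illustrative fitzpatrick_type column; it is not tied to any specific dataset’s actual schema.

```python
# Sketch of a pre-training demographic audit (hypothetical file and columns).
import csv
from collections import Counter

def audit_skin_type_coverage(metadata_path="metadata.csv"):
    """Tally images per Fitzpatrick label, counting missing labels explicitly."""
    counts = Counter()
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            # An absent skin-type label is itself a finding worth surfacing.
            counts[row.get("fitzpatrick_type") or "unlabeled"] += 1

    total = sum(counts.values())
    for skin_type, n in sorted(counts.items()):
        print(f"{skin_type:>10}: {n:6d} images ({100 * n / total:.1f}%)")
    return counts
```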
AI Dermatology Models: Real-World Implications of Bias
The experiences with Google’s DermAssist and Stanford’s DeepDerm illustrate significant pitfalls in AI dermatology. Although DermAssist demonstrated high initial accuracy, its underlying dataset included only 2.7% Fitzpatrick type V images and a single type VI image [9]. Recognizing these limitations, Google has committed to greater dataset diversity, underscoring the importance of transparency in AI development [9].
DeepDerm’s evaluation on the diverse DDI dataset similarly exposed marked drops in accuracy on darker skin, highlighting the gap between benchmark performance and real-world applicability [6]. Additionally, a 2020 BMJ systematic review identified notable reliability concerns in commercial skin cancer detection apps, particularly in recognizing atypical presentations common in skin of color [10].
Dataset Diversity and Health Equity
At its core, skin tone bias in AI dermatology originates from inadequate dataset diversity and transparency. A 2021 scoping review found that only 10% of dermatology AI studies reported skin tone data, and fewer still detailed patient ethnicity [6]. Furthermore, most FDA-approved AI dermatology tools provide little transparency regarding their dataset demographics, hindering critical evaluation [7].
Improving dataset diversity directly translates into more accurate diagnoses for conditions disproportionately affecting darker skin, such as acral lentiginous melanoma, thereby addressing clinical disparities [8]. Experts advocate for representative datasets and standardized demographic datasheets, facilitating thorough algorithm assessment and ensuring equitable patient outcomes [7].
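To picture what a standardized demographic datasheet could record, here is a small illustrative schema; the fields are assumptions about content such a datasheet might capture, not a published or regulatory standard.

```python
# Illustrative schema for a dermatology dataset "datasheet" (assumed fields).
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DermatologyDatasheet:
    dataset_name: str
    collection_period: str                       # e.g. "2015-2021"
    total_images: int
    images_per_fitzpatrick: dict = field(default_factory=dict)  # {"V-VI": 300, ...}
    images_without_skin_type: int = 0
    ethnicity_recorded: bool = False
    diagnosis_verification: str = ""             # e.g. "biopsy", "clinical consensus"

    def to_json(self) -> str:
        """Serialize the datasheet so it can ship alongside the model."""
        return json.dumps(asdict(self), indent=2)

# Documenting an imbalanced dataset makes the gap visible to reviewers and clinicians.
sheet = DermatologyDatasheet(
    dataset_name="example-derm-set",       # hypothetical dataset
    collection_period="2015-2021",
    total_images=10_000,
    images_per_fitzpatrick={"I-II": 6_200, "III-IV": 3_500, "V-VI": 300},
    diagnosis_verification="biopsy",
)
print(sheet.to_json())
```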
Future Directions: Embracing Diversity for Inclusive AI Systems
Ensuring that AI dermatology tools are representative and accurate for all populations requires intentional and proactive approaches to dataset inclusivity:
- Expanded Dataset Representation: Increasing the diversity of dermatology image databases by actively incorporating images across all Fitzpatrick skin types, geographical regions, and patient demographics (an interim rebalancing sketch follows this list).
- Transparent Data Practices: Encouraging clear and detailed documentation of the composition of datasets, enabling clinicians and researchers to assess the applicability of AI tools across diverse patient groups.
- Collaboration in Research and Development: Strengthening interdisciplinary collaboration between dermatologists, AI developers, and researchers to identify and fill gaps in current datasets, promoting equitable diagnostic performance.
- Educational Integration: Including discussions on dataset diversity and potential AI biases within medical education curricula to enhance clinician awareness and engagement in promoting equitable care.
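As an interim step toward the first point above, the sketch below assigns inverse-frequency sampling weights so that underrepresented Fitzpatrick groups are drawn more often during training. The "fitzpatrick" metadata field is an assumption, and rebalancing of this kind supplements rather than replaces collecting genuinely diverse images.

```python
# Sketch: inverse-frequency sampling across Fitzpatrick groups (assumed field).
from collections import Counter
import random

def inverse_frequency_weights(samples, group_key="fitzpatrick"):
    """One weight per sample, inversely proportional to its group's frequency."""
    counts = Counter(s[group_key] for s in samples)
    return [1.0 / counts[s[group_key]] for s in samples]

def draw_training_batch(samples, batch_size=32, seed=0):
    """Draw a batch in which rarer skin-type groups appear more often."""
    weights = inverse_frequency_weights(samples)
    return random.Random(seed).choices(samples, weights=weights, k=batch_size)
```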
Conclusion
AI’s transformative potential in dermatology hinges upon equitable implementation. Proactively addressing skin tone biases through enhanced dataset diversity, transparency, and collaborative research ensures these promising technologies genuinely improve outcomes for all patients, regardless of skin color.
References
1. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056
2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44-56. doi:10.1038/s41591-018-0300-7
3. European Commission. Medical Device Regulation. https://health.ec.europa.eu/medical-devices-sector_en
4. UK Medicines and Healthcare products Regulatory Agency (MHRA). Artificial intelligence as a medical device (AIaMD). 2023. https://www.gov.uk/government/publications/software-and-artificial-intelligence-ai-as-a-medical-device/software-and-artificial-intelligence-ai-as-a-medical-device
5. American Academy of Dermatology Association. Skin cancer rates by skin tone. https://www.aad.org/media/stats-skin-cancer
6. Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 2021;157(11):1362-1369. doi:10.1001/jamadermatol.2021.3129
7. Wen D, Khan SM, Ji Xu A, et al. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. doi:10.1016/S2589-7500(21)00252-1
8. Centers for Disease Control and Prevention (CDC). Melanoma incidence and mortality. https://www.cdc.gov/skin-cancer/statistics/index.html
9. Liu Y, Jain A, Eng C, et al. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26(6):900-908. doi:10.1038/s41591-020-0842-3
10. Freeman K, Dinnes J, Chuchu N, et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. BMJ. 2020;368:m127. doi:10.1136/bmj.m127