Artificial Intelligence for the Classification of Pigmented Skin Lesions in Populations with Skin of Color: A Systematic Review

Background: While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology. Objective: The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions. Methods: PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of AI model, accuracy, and quality of the studies were reviewed. Results: Twenty-two eligible articles were identified. The majority of studies were trained on datasets obtained from Chinese (7/22), Korean (5/22), and Japanese populations (3/22). Seven studies used diverse datasets containing Fitzpatrick skin type I–III in combination with at least 10% from black Americans, Native Americans, Pacific Islanders, or Fitzpatrick IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, where dermatologists’ classification is compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed. Conclusions: While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed were obtained from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain in the number of AI models developed in populations with skin of color (particularly Fitzpatrick type IV–VI) compared with those of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.

© 2023 The Author(s). Published by S. Karger AG, Basel

Introduction

Skin cancer is the most common malignancy worldwide, with melanoma representing the deadliest form. While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at a later stage and have a poorer prognosis when compared to Caucasian populations [1–3]. Even when diagnosed at the same stage, Hispanic, Native, Asian, and African Americans have significantly shorter survival times than Caucasian Americans (p < 0.05) [4]. Skin cancers in people with skin of color often present differently from those in Caucasian skin and are often underrepresented in dermatology training [5, 6].

The use of artificial intelligence (AI) algorithms for image analysis and detection of skin cancer has the potential to decrease healthcare disparities by removing unintended clinician bias and improving accessibility and affordability [7]. Skin lesion classification by AI algorithms to date has performed equivalently to [8] and, in some cases, better than dermatologists [9]. Human-computer collaboration can increase diagnostic accuracy further [10]. However, most AI advances have used homogenous datasets [11–15] collected from countries with predominantly European ancestry [16]. Exclusion of skin of color in training datasets poses the risk of incorrect diagnosis or missing skin cancers entirely [8] and risks widening racial disparities that already exist in dermatology [8, 17].

While multiple reviews have compared AI-based model performances for skin cancer detection [18–20], the use of AI in populations with skin of color has not been evaluated. The objective of this study was to systematically review the current literature on AI models for the classification of pigmented skin lesion images in populations with skin of color.

Methods

Literature Search

The systematic review follows the PRISMA guidelines [21]. A protocol was registered with PROSPERO (International Prospective Register of Systematic Reviews) and can be accessed at https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281347.

A PubMed search in March 2021 used search terms relating to artificial intelligence, skin cancer, and skin lesions (search strings in online suppl. eTable; for all online suppl. material, see www.karger.com/doi/10.1159/000530225). No date range was applied, language was restricted to English, and only original research was included. Covidence software was used for screening administration. Titles and abstracts were screened by two independent reviewers (Y.L. 100% and B.B.S. 20%) using the eligibility criteria described in Table 1. The remaining articles were assessed for eligibility by reviewing the methods or full text. Disagreements were resolved following discussion with a third independent reviewer (C.P.).

Table 1.

Inclusion and exclusion criteria used for screening and assessing eligibility of articles

Inclusion criteria

1. Any computer modeling or use of AI on diagnosis of skin conditions
2. Datasets provide information on the population (racial or Fitzpatrick skin type breakdown), or datasets obtained from countries with a predominantly skin of color population
3. Uses dermoscopic, clinical, 3D, or other photographic images of the skin surface
4. Includes the assessment of malignant and/or non-malignant pigmented skin lesions

Exclusion criteria

1. No population description of the training datasets (demographic, racial, or ethnicity breakdown), or from a country with a predominantly Caucasian population of European ancestry
2. Dataset description with >90% Caucasian population, or fair skin type, or Fitzpatrick skin type I–III
3. Solely used images from ISIC [56], PH2 [13], IAD [57], ISBI [58], HAM10000 [12], MED-NODE [14], ILSVRC [59], DermaQuest [60], DERMIS [61], DERM IQA [62], DermNet NZ [63], or other datasets known to be of predominantly European ancestry

Data Extraction and Synthesis

Data extraction was performed by author Y.L. using a standardized form and confirmed by V.K. The following parameters were recorded: reference, ethnicity/ancestry/race, lesion number, sex, age, location, skin condition, public availability of dataset, number of images, type of images, methods of confirmation, deep learning system, model output, comparison with human input, and any missing data reported. Algorithm performance was recorded as accuracy, sensitivity, specificity, and/or area under the receiver operating characteristic curve (AUC). A narrative synthesis of the extracted data was used to present findings, as a meta-analysis was not feasible due to heterogeneity of study designs, AI systems, skin lesions, and outcomes.
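To make these performance measures concrete, the sketch below derives accuracy, sensitivity, specificity, and AUC from the predictions of a hypothetical binary (benign vs. malignant) classifier. It is illustrative only, not code from any reviewed study; Python with NumPy and scikit-learn is assumed, and all values are invented.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels (1 = malignant, 0 = benign) and model probabilities;
# these values are illustrative only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6])
y_pred = (y_score >= 0.5).astype(int)  # binary call at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct calls
sensitivity = tp / (tp + fn)                # malignant lesions correctly flagged
specificity = tn / (tn + fp)                # benign lesions correctly cleared
auc = roc_auc_score(y_true, y_score)        # threshold-independent discrimination

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, AUC={auc:.2f}")
```

Note that accuracy depends on the chosen decision threshold (0.5 here), whereas AUC summarizes discrimination across all thresholds, which is why several of the reviewed studies report both.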

Quality Assessment

Quality was assessed using the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR Derm) Consensus Guidelines [22]. This 25-point checklist offers comprehensive recommendations on factors critical to the development, performance, and application of image-based AI algorithms in dermatology [22].

Author Y.L. performed the quality assessment of all included studies, and author B.B.S. assessed 20%. The inter-rater agreement rate was 87%, with disagreements resolved via a third independent reviewer (V.K.). Each criterion was evaluated as fully, partially, or not addressed and scored 1, 0.5, or 0, respectively, using the scoring rubric in online supplementary eTable 2.
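A minimal sketch of how such a checklist score can be tallied under the fully/partially/not scoring described above. The criterion names and ratings below are hypothetical placeholders for illustration, not the actual rubric wording in supplementary eTable 2.

```python
# Each CLEAR Derm criterion is rated fully / partially / not addressed and
# scored 1, 0.5, or 0. The criterion names and ratings below are hypothetical
# placeholders, not the actual rubric in supplementary eTable 2.
SCORES = {"fully": 1.0, "partially": 0.5, "not": 0.0}

ratings = {
    "dataset description": "partially",
    "dataset variety": "not",
    "benchmarking": "not",
    "healthcare application": "fully",
    # ... one rating per criterion, 25 in total
}

total = sum(SCORES[rating] for rating in ratings.values())
print(f"Quality score: {total} of 25 criteria points")
```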

Results

The database search identified 993 articles, including 13 duplicates. After screening titles/abstracts, 535 records were excluded, and the remaining 445 records were screened by methods, with 63 articles reviewed by full text. Forward and backward citations search revealed no additional articles. A total of 22 studies were included in the final review (PRISMA flow diagram in online supplementary eFig. 1).

Study Design

All 22 studies were performed between 2002 and 2021 [23–32], with 11 (50%) studies published between 2020 and 2021 [33–44]. An overview of study characteristics is displayed in Table 2. The median number of total images used in each study for all datasets combined was 5,846 (range: 212–185,192). The median dataset size for training, testing, and validation was 4,732 images (range: 247–22,608), 362 (range: 100–40,331), and 1,258 (range: 109–14,883), respectively.

Table 2.

Overview of study characteristics

(Population: ethnicity/ancestry/race/location, followed by available dataset information. Confirmation of diagnosis: H = histology; C = clinical diagnosis; NA = not available.)

Piccolo et al. (2002) [23]
- Population: Fitzpatrick I–V; lesion n = 341; patient n = 289; F 65% (n = 188); average age 33.6
- Public dataset: no
- Images: dermoscopy; total 341 (training 0, testing 341, validation 0)
- Confirmation: all – H
- System: DEM-MIPS (artificial neural network designed to evaluate different colorimetric and geometric parameters)
- Output: binary (melanoma, non-melanoma)

Iyatomi et al. (2008) [24]
- Population: Italian, Austrian, Japanese; dataset information NA
- Public dataset: no
- Images: dermoscopy; total 1,258 (training 247, testing NA, validation 1,258)
- Confirmation: datasets A and B – H; dataset C – H + C
- System: back-propagation artificial neural network (ANN)
- Output: binary (malignant or benign); malignancy risk score (0–100)

Chang et al. (2013) [25]
- Population: Taiwanese; lesion n = 769; patient n = 676; F 56% (n = 380); average age 47.6
- Public dataset: no
- Images: clinical; total 1,899 (training NA, testing NA, validation NA)
- Confirmation: benign – C; malignant – H
- System: computer-aided diagnosis (CADx) system built on 91 conventional features of shape, texture, and color (developed in MATLAB)
- Output: binary (benign or malignant)

Chen et al. (2016) [26]
- Population: American Indian, Alaska Native, Asian, Pacific Islander, black or African American, Caucasian; community dataset; patient n = 1,900; F 52.3% (n = 993); age: >50% under 35
- Public dataset: partially (DermNet NZ – yes; community – no)
- Images: clinical; total 12,000 (training 11,780, testing 337, validation NA)
- Confirmation: community dataset (benign and malignant) – C; DermNet – H
- System: patented image-search algorithm that builds on proven computer vision methods from the field of content-based image retrieval (CBIR)
- Output: binary (melanoma and non-melanoma)

Yang et al. (2017) [27]
- Population: Korean; patient n = 110; F 50% (n = 55)
- Public dataset: no
- Images: dermoscopy; total 297 (training 0, testing 297, validation 0)
- Confirmation: all – H
- System: three-stage algorithm: pre-processing, stripe pattern detection, and automatic discrimination (MATLAB)
- Output: binary (LM, nevus)

Han et al. (2018) [28]
- Population: Korean, Caucasian; dataset information NA
- Public dataset: partially (MED-NODE, Atlas, Edinburgh, Dermofit – yes; others – no)
- Images: clinical; total 182,044 (training 178,875, testing 1,276, validation 22,728)
- Confirmation: ASAN – C + H; five other datasets used with unclear validation
- System: CNN; Microsoft ResNet-152, Google Inception
- Output: 12-class skin tumor types

Yu et al. (2018) [29]
- Population: Korean; lesion n = 275
- Public dataset: no
- Images: dermoscopy; total 724 (training 372, testing 362, validation 109)
- Confirmation: all – H
- System: CNN; modified VGG-16
- Output: binary (melanoma/non-melanoma)

Zhang et al. (2018) [30]
- Population: Chinese; lesion n = 1,067
- Public dataset: no
- Images: dermoscopy; total 1,067 (training 4,867, testing 1,142, validation NA)
- Confirmation: benign and malignant – C; H where three dermatologists disagreed
- System: CNN; GoogLeNet Inception v3
- Output: four-class classifier (BCC, SK, melanocytic nevus, and psoriasis)

Fujisawa et al. (2019) [31]
- Population: Japanese; patient n = 2,296
- Public dataset: no
- Images: clinical; total 6,009 (training 4,867, testing 1,142, validation NA)
- Confirmation: melanocytic nevus, split nevus, lentigo simplex – C; others – PE
- System: DCNN; GoogLeNet
- Output: two-class classifier (benign vs. malignant); four-class classifier (malignant epithelial, malignant melanocytic, benign epithelial, benign melanocytic); 14-class classification; 21-class classification

Jinnai et al. (2019) [38]
- Population: Japanese; patient n = 3,551
- Public dataset: no
- Images: dermoscopy; total 5,846 (training 4,732, testing 666, validation NA)
- Confirmation: malignant – H; benign tumor – C
- System: faster region-based CNN (FRCNN); VGG-16
- Output: binary (benign/malignant); six-class classification (6 skin conditions)

Zhao et al. (2019) [32]
- Population: Chinese; dataset information NA
- Public dataset: no
- Images: clinical; total 4,500 (training 3,375, testing 1,125, validation NA)
- Confirmation: benign – C; malignant – PE
- System: CNN; Xception
- Output: risk (low/high/dangerous)

Cho et al. (2020) [33]
- Population: Korean; patient n = 404
- Public dataset: no
- Images: clinical; total 2,254 (training 1,629, testing 625, validation NA)
- Confirmation: benign – C; malignant – H
- System: DCNN; Inception-ResNet-V2
- Output: binary (benign or malignant)

Han et al. (2020) [36]
- Population: Korean, Caucasian; ASAN, Normal, SNU datasets: patient n = 28,222, F 55% (n = 15,522), average age 40; MED-NODE, Web, Edinburgh datasets: NA
- Public dataset: partially (Edinburgh – yes; SNU – upon request)
- Images: clinical; total 224,181 (training 220,680, testing 2,441, validation 3,501)
- Confirmation: ASAN – C + H; Edinburgh – H; MED-NODE – H; SNU – C + H; Web – image finding
- System: CNN; SENet, SE-ResNet-50, Visual Geometry Group (VGG-19)
- Output: binary (malignant, non-malignant); binary treatment outputs (steroids, antibiotics, antivirals, antifungals); multiclass classification (134 skin disorders)

Han et al. (2020) [35]
- Population: Korean; patient n = 673; lesion n = 673; F 54% (n = 363); average age 58
- Public dataset: no
- Images: clinical; total 185,192 (training 182,348, testing NA, validation 2,844)
- Confirmation: all – H
- System: CNN; SENet, SE-ResNeXt-50, SE-ResNet-50
- Output: risk of malignancy

Han et al. (2020) [34]
- Population: Korean; patient n = 9,556; lesion n = 10,426; F 55% (n = 5,255); average age 52
- Public dataset: no
- Images: clinical; total 40,331 (training 1,106,886ᵃ, testing NA, validation 40,331)
- Confirmation: all – H
- System: region-based CNN (RCNN); SENet, SE-ResNeXt-50
- Output: binary (malignant, non-malignant); 32-class classification

Huang et al. (2020) [37]
- Population: Chinese; lesion n = 1,225
- Public dataset: no
- Images: clinical; total 3,299 (training 2,474, testing 825, validation NA)
- Confirmation: all – PE
- System: CNN; Inception V3, Inception-ResNet V2, DenseNet 121, ResNet 50
- Output: binary (SK/BCC)

Li et al. (2020) [44]
- Population: Chinese; patient n = 106
- Public dataset: no
- Images: dermoscopy and clinical; total 212 (training 200,000ᵃ, testing 212, validation NA)
- Confirmation: all – H
- System: Youzhi AI software (Shanghai Maise Information Technology Co., Ltd., Shanghai, China)
- Output: binary (benign or malignant); 14-class classification

Liu et al. (2020) [39]
- Population: Fitzpatrick types I–VI; patient n = 15,640; lesion n = 20,676
- Public dataset: no
- Images: clinical; total 79,720 (training 64,837, testing NA, validation 14,483)
- Confirmation: benign – C; malignant – H
- System: deep learning system (DLS); Inception-v4
- Output: 26-class classification (primary output); 419-class classification (secondary output)

Wang et al. (2020) [40]
- Population: Chinese, with Fitzpatrick type IV; dataset information NA
- Public dataset: no
- Images: dermoscopy; total 10,307 (training 8,246, testing 1,031, validation 1,031)
- Confirmation: BCC – C + H; others – C
- System: CNN; GoogLeNet Inception v3
- Output: binary classification (psoriasis and others); multiclass classification

Huang et al. (2021) [41]
- Population: Taiwanese, Caucasian; KCGMH dataset: patient n = 1,222, F 52.4% (n = 640), average age 62; HAM10000: NA
- Public dataset: partially (KCGMH – no; HAM10000 – yes)
- Images: clinical; total 1,287 (training 1,031, testing 128, validation 128)
- Confirmation: all – H
- System: CNN; DenseNet 121 (binary classification), EfficientNet B4 (five-class classification)
- Output: binary (benign/malignant); 5-class classification (BCC, BK, MM, NV, SCC); 7-class classification (AK, BCC, BKL, SK, DF, MM, NV)

Minagawa et al. (2021) [42]
- Population: Caucasian, Japanese; patient n = 50
- Public dataset: partially (ISIC – yes; Shinshu – no)
- Images: dermoscopy; total 12,948 (training 12,848, testing 100, validation NA)
- Confirmation: benign – C; malignant – H
- System: DNN; Inception-ResNet-V2
- Output: 4-class classification (MM/BCC/MN/BK)

Yang et al. (2021) [43]
- Population: Chinese; dataset information NA
- Public dataset: no
- Images: clinical; total 12,816 (training 10,414, testing 300, validation 2,102)
- Confirmation: all – C
- System: DCNN; DenseNet-96, ResNet-152, ResNet-99, converged network (DenseNet–ResNet fusion)
- Output: 6-class classification (nevi, melasma, café-au-lait, SK, and acquired nevi)

The majority of studies (15/22, 68%) analyzed clinical images (i.e., wide-field or regional images), while seven studies analyzed dermoscopy images [23, 24, 27, 29, 30, 40, 42], and one study included both [44]. All but one study included both malignant and benign pigmented skin lesions, with one investigating only benign pigmented facial lesions [43].

Histopathology was used as the ground truth for all malignant lesions in 15 studies and partially in two studies [24, 26], while one study only used histopathology to resolve clinician disagreements [23]. Seven studies used histopathology as the ground truth for benign lesions [23, 27, 29, 34, 35, 41, 44]. In nine studies, ground truth was established by consensus of experienced dermatologists [25, 30–32, 38–40, 42, 43]. Other studies used a mix of both [24, 26, 33, 36] or did not clearly define it [28, 37].

The number of pigmented skin lesion classifications used for AI model evaluation ranged from binary outcomes (e.g., benign vs. malignant) to classification of up to 419 skin conditions [39]. While most studies (19/22, 86%) evaluated lesions across all body sites, one study exclusively analyzed the lips/mouth [33], another assessed only facial skin lesions [43], and one study specifically addressed acral melanoma [29].
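To illustrate the difference between these output designs: in the CNN-based studies, a binary and a multiclass model can share the same backbone and differ only in the width of the final classification layer. The sketch below is a generic PyTorch/torchvision example using ResNet-50 as a stand-in for the ResNet/SENet/Inception variants listed in Table 2; it is not a reproduction of any reviewed study's model.

```python
import torch.nn as nn
from torchvision import models

def lesion_classifier(num_classes: int) -> nn.Module:
    """ImageNet-pretrained ResNet-50 with its final fully connected layer
    replaced to emit `num_classes` logits."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

binary_model = lesion_classifier(2)       # e.g., benign vs. malignant
multiclass_model = lesion_classifier(12)  # e.g., a 12-class skin tumor classifier
```

Because the backbone is shared, the practical differences between binary and multiclass studies lie less in architecture than in label granularity, class balance, and the amount of training data available per class.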

Population

Homogenous datasets were collected from the Chinese/Taiwanese (n = 8, 36%) [25, 30, 32, 37, 40, 41, 43, 44], Korean (n = 5, 23%) [27–29, 33–35], and Japanese populations (n = 3, 14%) [31, 38, 42]. Seven studies (32%) included populations from Caucasians/Fitzpatrick skin type I–III [23, 24, 26, 28, 36, 39, 42], with at least 10% American Indian [26], Alaska Native [26], black or African American [26], Pacific Islander [26], Native American [26], or Fitzpatrick IV–VI [23, 39] in the training and/or test set (Table 2).

The majority of studies did not specify the sex distribution (n = 13, 59%) or participant age (n = 15, 68%). Seven studies included age specification, ranging from 18 to older than 85 years [23, 25, 26, 34–36, 41].

Outcome and Performance

The classification algorithms produced either a diagnostic output, a categorical risk output (e.g., low, medium, or high), or a combination of both. An overview of AI model performance is provided in Table 3. The majority of studies (20/22, 91%) used a diagnostic model, either with binary classification of benign or malignant [23–27, 29, 33, 35, 37], multiclass classification of specific lesion diagnosis [28, 30, 32, 39, 42, 43], or both [31, 34, 36, 38, 40, 41, 44]. One study used categorical risk as the outcome [32]. Another study reported both a diagnostic model and a risk categorical model [24].

Table 3.

Measures of output and performance for AI models included in the review

(Measures are accuracy (%), sensitivity (%), specificity (%), and AUC; n/a = not reported.)

Binary classification models
- Piccolo et al. (2002) [23]: accuracy n/a; sensitivity 92; specificity 74; AUC n/a
- Iyatomi et al. (2008) [24]: accuracy n/a; sensitivity 86; specificity 86; AUC 0.93
- Chang et al. (2013) [25]: accuracy 91; sensitivity 86; specificity 88; AUC 0.95
- Chen et al. (2016) [26]: accuracy 91; sensitivity 90; specificity 92; AUC n/a
- Yang et al. (2017) [27]: accuracy 99.7; sensitivity 100; specificity 99; AUC n/a
- Yu et al. (2018) [29]: accuracy 82; sensitivity 93; specificity 72; AUC 0.80
- Cho et al. (2020) [33]: accuracy n/a; sensitivity 76 (dataset 1), 70 (dataset 2); specificity 80 (dataset 1), 76 (dataset 2); AUC 0.83 (dataset 1), 0.77 (dataset 2)
- Huang et al. (2020) [37]: accuracy 86; sensitivity n/a; specificity n/a; AUC 0.92
- Han et al. (2020) [35]: accuracy n/a; sensitivity 77; specificity 91; AUC 0.91
- Fujisawa et al. (2019) [31]: accuracy 93; sensitivity 96; specificity 90; AUC n/a
- Jinnai et al. (2019) [38]: accuracy 92; sensitivity 83; specificity 95; AUC n/a
- Han et al. (2020) [36]: accuracy n/a; sensitivity n/a; specificity n/a; AUC 0.93 (Edinburgh dataset), 0.94 (SNU dataset)
- Han et al. (2020) [34]: accuracy n/a; sensitivity 63 (Top 1); specificity 90 (Top 1); AUC 0.86
- Li et al. (2020) [44]: accuracy 86; sensitivity 75; specificity 93; AUC n/a
- Wang et al. (2020) [40]: accuracy 77; sensitivity n/a; specificity n/a; AUC n/a

Multiclass classification models
- Han et al. (2018) [28]: accuracy n/a; sensitivity 86 (ASAN dataset), 85 (Edinburgh dataset); specificity 86 (ASAN dataset), 81 (Edinburgh dataset)
- Zhang et al. (2018) [30]: accuracy 87 (dataset A), 87 (dataset B); sensitivity n/a; specificity n/a
- Fujisawa et al. (2019) [31]: accuracy 77; sensitivity n/a; specificity n/a
- Jinnai et al. (2019) [38]: accuracy 87; sensitivity 86; specificity 87
- Liu et al. (2020) (26-class model) [39]: accuracy 71 (Top 1), 93 (Top 3); sensitivity 58 (Top 1), 88 (Top 3); specificity n/a
- Han et al. (2020) [36]: accuracy Top 1: 57 (Edinburgh), 45 (SNU); Top 3: 84 (Edinburgh), 69 (SNU); Top 5: 92 (Edinburgh), 78 (SNU); sensitivity n/a; specificity n/a
- Han et al. (2020) [34]: accuracy 43 (Top 1), 62 (Top 3); sensitivity n/a; specificity n/a
- Li et al. (2020) [44]: accuracy 73; sensitivity n/a; specificity n/a
- Wang et al. (2020) [40]: accuracy 82; sensitivity n/a; specificity n/a
- Minagawa et al. (2021) [42]: accuracy 90; sensitivity n/a; specificity n/a
- Yang et al. (2021) [43]: accuracy 88/77/90/87 (algorithms A/B/C/D); sensitivity 83/63/81/80 (A/B/C/D); specificity 98/90/99/98 (A/B/C/D)
- Huang et al. (2021) [41]: accuracy 72 (5-class, KCGMH dataset), 86 (7-class, HAM10000 dataset); sensitivity n/a; specificity n/a

Risk categorical classification
- Zhao et al. (2019) [32]: accuracy 83; sensitivity 93 (benign), 85 (low risk), 86 (high risk); specificity 88 (benign), 85 (low risk), 91 (high risk); AUC 0.96 (benign), 0.92 (low risk), 0.95 (high risk)

The AI models using binary classification (16/22) reported an accuracy ranging from 70% to 99.7%. Of these studies, 6/16 reported ≥90% accuracy [25–27, 31, 38, 41], three studies reported between 80 and 90% accuracy [29, 37, 44], and one study reported <80% accuracy [40]. Twelve AI models reported sensitivity and specificity as a measure of performance, which ranged from 58 to 100% and 72 to 99%, respectively. Eight studies provided an area under the curve (AUC), with 5/8 reporting values >0.9 [24, 25, 35–37], and the remaining three models scoring between 0.77 and 0.86 [29, 33, 34].

For the 13 studies using multiclass output (i.e., >2 diagnoses), accuracy of models ranged from 43% to 93%. Six of these studies (6/13) scored <80% accuracy [31, 34, 36, 39, 41, 44], six others scored between 80 and 90% accuracy [30, 32, 38, 40, 42, 43], and one provided sensitivity and specificity of 86% and 86%, respectively, as a measure of performance [28].
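Several of these multiclass models report "Top 1" and "Top 3" accuracy (Table 3), i.e., whether the correct diagnosis appears among the model's one or three highest-ranked predictions. A minimal NumPy sketch of this metric, with toy values for illustration only:

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring
    predictions. probs: (n_samples, n_classes); labels: (n_samples,)."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k best classes
    return float(np.any(top_k == labels[:, None], axis=1).mean())

# Toy example: 3 samples, 4 candidate diagnoses (values are illustrative).
probs = np.array([[0.10, 0.60, 0.20, 0.10],
                  [0.20, 0.30, 0.25, 0.25],
                  [0.05, 0.05, 0.10, 0.80]])
labels = np.array([1, 2, 3])
print(top_k_accuracy(probs, labels, k=1))  # ~0.67: samples 1 and 3 hit at top 1
print(top_k_accuracy(probs, labels, k=3))  # 1.0: all true classes within top 3
```

Top-3 accuracy is always at least as high as Top-1, which explains the consistent gap between the paired figures reported by Han et al. [34, 36] and Liu et al. [39].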

Reader Studies

Reader studies, where the performance of AI models and clinician classification is compared, were performed in 14/22 studies, with results provided in Table 4 [23, 25, 29, 31–39, 42, 44]. Six studies compared AI outcomes to classification by experts, e.g., dermatologists [25, 32, 34, 36, 42, 44]. Eight studies compared outcomes for both experts and non-experts, e.g., dermatology residents and general practitioners [23, 29, 31, 33, 35, 37–39].

Table 4.

Reader studies between AI models and human experts (e.g., dermatologists), and non-experts (e.g., dermatology residents, GPs)

Piccolo et al. (2002) [23]
- AI: sensitivity 92%; specificity 74%
- Experts: sensitivity 92%; specificity 99%
- Non-experts: sensitivity 69%; specificity 94%

Chang et al. (2013) [25]
- AI: accuracy 91% (melanoma), 83% (non-melanoma); sensitivity 86%; specificity 88%
- Experts: accuracy 81%; sensitivity 83%; specificity 86%

Yu et al. (2018) [29]
- AI: accuracy 82%; sensitivity 93%; specificity 72%; AUC 0.80
- Experts: accuracy 81%; sensitivity 97%; specificity 67%; AUC 0.80
- Non-experts: accuracy 65%; sensitivity 45%; specificity 84%; AUC 0.65

Huang et al. (2020) [37]
- AI: sensitivity 90%; AUC 0.94
- Experts: sensitivity 85%; specificity 90%
- Non-experts: sensitivity 66%; specificity 72%

Han et al. (2020) [35]
- AI: sensitivity 89%; specificity 78%; AUC 0.92
- Experts: sensitivity 95%; specificity 72%; ROC 0.91
- Non-experts: accuracy 94% (dermatology residents), 77% (non-dermatology clinicians); sensitivity 69% (residents), 65% (non-dermatology clinicians); AUC 0.88 (residents), 0.73 (non-dermatology clinicians)

Fujisawa et al. (2019) [31]
- AI: accuracy 92% (binary), 75% (multiclass)
- Experts: accuracy 85% (binary), 60% (multiclass)
- Non-experts: accuracy 74% (binary), 42% (multiclass)

Jinnai et al. (2019) [38]
- AI: accuracy 92%; sensitivity 83%; specificity 95%
- Experts: accuracy 87%; sensitivity 86%; specificity 87%
- Non-experts: accuracy 85%; sensitivity 84%; specificity 86%

Zhao et al. (2019) [32]
- AI: sensitivity 90% (benign), 90% (low risk), 75% (high risk)
- Experts: sensitivity 61% (benign), 50% (low risk), 64% (high risk)

Cho et al. (2020) [33]
- AI: sensitivity 76% (dataset 1), 70% (dataset 2); specificity 80% (dataset 1), 76% (dataset 2); AUC 0.83 (dataset 1), 0.77 (dataset 2)
- Experts: sensitivity 90% without algorithm, 90% with algorithm; specificity 58% without, 61% with
- Non-experts: sensitivity, dermatology residents 80% without algorithm, 85% with; non-dermatology clinicians 65% without, 74% with; specificity, dermatology residents 53% without, 71% with; non-dermatology clinicians 46% without, 49% with; AUC, dermatology residents 0.33 without, 0.42 with; non-dermatology clinicians 0.11 without, 0.23 with

Han et al. (2020) [36]
- AI (multiclass): accuracy Top 1: 45%, Top 3: 69%, Top 5: 78%
- Experts (multiclass): accuracy without algorithm Top 1: 50%, Top 3: 67%; with algorithm Top 1: 53%, Top 3: 74%; (binary) accuracy 77% without algorithm, 85% with algorithm

Han et al. (2020) [34]
- AI: binary sensitivity 67%, specificity 87%; multiclass accuracy Top 1: 50%, Top 3: 70%
- Experts: binary sensitivity 66%, specificity 67%; multiclass accuracy Top 1: 38%, Top 3: 53%

Li et al. (2020) [44]
- AI: accuracy 73% (binary), 86% (multiclass)
- Experts: accuracy 83% (binary), 74% (multiclass)

Liu et al. (2020) [39]
- AI: accuracy Top 1: 66%, Top 3: 90%
- Experts: accuracy Top 1: 63%, Top 3: 75%
- Non-experts: primary care physicians Top 1: 44%, Top 3: 60%; nurse practitioners Top 1: 40%, Top 3: 55%

Minagawa et al. (2021) [42]
- AI: accuracy 71%
- Experts: accuracy 90%

In reader studies comparing binary classification between AI and experts (n = 11), one study reported similar diagnostic accuracy/specificity [29], three showed higher accuracy for AI models [25, 31, 38], and two reported higher accuracy in experts [42, 44]. Five studies reported specificity, sensitivity, and AUC instead of accuracy, with varying outcomes [23, 32, 33, 35, 37]. For reader studies between AI and non-experts (n = 7), AI showed higher accuracy, specificity, sensitivity, and AUC in most studies [23, 29, 31, 33, 35, 37–39].
