We designed a protocol according to the guidelines outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (see Additional file 1). This protocol defined the research question, eligibility criteria, information sources, and search terms to be used, as well as the methods for the subsequent synthesis of the sensitivity and specificity results reported in the selected studies.
Search strategy

We searched the PubMed and Web of Science databases for studies published in the twenty-first century (between 2001 and May 31, 2023) to account for the improved high-resolution thermal cameras developed in the late 1990s. We used the keywords breast and mammary along with cancer-related terms (cancer, carcinoma, malignan*, neoplasm*) and various ways to denote thermography (thermograph*, thermogram*, thermolog*, infrared imag*, infra-red imag*, thermal imag*), combined with the Boolean operators AND and OR (see Additional file 2 for the complete search queries). Additional articles were identified by searching for the names of commercial systems and by examining the references included in the retrieved systematic reviews.
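For illustration only, a generic Boolean query combining these terms might be structured as sketched below; this is a reconstruction from the keywords listed above, not the exact query used, which is reported in Additional file 2.

```
(breast OR mammary)
AND (cancer OR carcinoma OR malignan* OR neoplasm*)
AND (thermograph* OR thermogram* OR thermolog*
     OR "infrared imag*" OR "infra-red imag*" OR "thermal imag*")
```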
Study selection

Articles were included in the analysis if they met the following criteria:
Topic and technology

The focus of our review is the clinical evaluation of long-wave, digital infrared breast thermography [45]. Studies on a different topic or disease were excluded, as well as those using a different thermography method, including the following:
Imagers operating in a spectral range other than the long-wave infrared band (i.e., 8–14 µm), such as the near-infrared or the microwave bands, because infrared emissions from human skin peak at about 10 µm at normal room temperature [45];
Contact thermography, in which the surface temperature is measured by placing either a heat-sensitive liquid crystal-coated film (liquid crystal thermography, LCT) or multiple temperature sensors directly on the skin;
Active thermography, in which an external agent is used to enhance the contrast between the target and the surrounding tissue, either by exciting the tissue with antennas or by administering fluorescent dyes that bind to cancer cells and emit radiation when excited with lasers. We did include, however, studies that applied cold stress by means of cool air or another cooling method to induce a vascular response and thus identify unresponsive blood vessels affected by malignancy. This type of imaging protocol is known as dynamic, as opposed to static imaging, in which images are acquired after a period of acclimation so that the skin reaches thermal equilibrium with the room temperature.
Application of breast thermography

We included only studies that evaluated the diagnostic ability of breast thermography, either for screening asymptomatic patients or for diagnosing patients with symptoms or with abnormal findings on a previous imaging technique, i.e., mammography and/or ultrasonography. Consequently, we excluded publications that studied thermography for other breast cancer applications, such as monitoring ongoing treatment, estimating the prognosis of a malignant lesion, guiding hyperthermia treatment, or predicting treatment-related complications, such as skin toxicity after radiation therapy. We also excluded studies that focused on other aspects of thermal image processing, including segmentation, blood vessel detection, feature extraction, or lesion localization.
Document type

Because our goal is to numerically review the effectiveness of infrared thermography in detecting malignant lesions, we were interested in clinical studies. Other types of documents were excluded because they lacked numerical results (i.e., narrative overviews and opinion articles, such as comments or letters to the editor), a detailed description of the methodology (i.e., conference abstracts, summaries, and posters), or objectivity (i.e., literature review articles). Articles that described a device, an imaging protocol, or an algorithm without including any experimentation to evaluate it quantitatively were also excluded.
Population

We included studies with patients attending either routine screening or follow-up tests, with no restrictions on age, sex, or breast density. Animal studies and articles that used phantoms or computer simulations were excluded, as well as studies that used images obtained from an external database instead of collecting their own data. This also includes studies that mixed their own data with those from external sources. Studies with a setting other than screening or diagnosis were also excluded, such as those that examined patients with a known diagnosis. In addition, for the sake of statistical significance, we considered only studies that included at least five cancer cases.
Evaluation metrics

We required that studies report data on sensitivity (defined as the number of true positives divided by the sum of true positives and false negatives) and specificity (defined as the number of true negatives divided by the sum of true negatives and false positives), or at least provide the information necessary to calculate them. We excluded studies that considered non-malignant tumors as positives, as well as articles that did not use standard tests to establish the ground truth diagnosis, i.e., cancer must be confirmed by biopsy, whereas healthy and benign cases should be diagnosed with at least one standard imaging test, either mammography or ultrasound.
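For clarity, these two definitions amount to the following minimal R sketch, using hypothetical counts from a single study:

```r
# Hypothetical confusion-matrix counts from a single study
TP <- 42; FN <- 8   # true positives, false negatives
TN <- 90; FP <- 10  # true negatives, false positives

sensitivity <- TP / (TP + FN)  # 42 / 50  = 0.84
specificity <- TN / (TN + FP)  # 90 / 100 = 0.90
```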
Screening and data extraction

One reviewer screened article titles and abstracts, applying the inclusion and exclusion criteria. The full texts of the remaining articles were retrieved and read for eligibility screening. Documents that did not provide enough information to determine the recruitment process, the reference tests, or the criterion for positivity (i.e., using ambiguous labels like abnormal, sick, or unhealthy without describing them) were excluded. Any doubts were resolved by discussion with the other authors.
Data extracted from each article included the topic or technology (thermography or other), the body part or disease (breast cancer or other), the cancer-related task (diagnosis, screening, or other), the type of article (clinical study or other), the goal of image processing (classification or other), the subject type (human or other), the source of the data (internal or external), patient recruitment (e.g., asymptomatic women undergoing routine screening, patients with a palpable lump, patients scheduled for breast biopsy due to suspicious mammographic findings, etc.), what is considered to be positive (only biopsy-proven cancer or other breast diseases too), the reference tests for both positive and negative cases, the sample size, the number of positive cases, the size of the subset used for evaluation (i.e., the test set if machine learning algorithms were used), the imaging protocol (static or dynamic), the technical characteristics of the thermal camera (spectral range and thermal and spatial resolution), the funding source, and the evaluation metrics (sensitivity and specificity).
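As an illustration of how such fields can be organized for analysis, a hypothetical extraction table covering a subset of the items above might look as follows; the column names, study labels, and values are our own and do not come from the review's data:

```r
# Hypothetical extraction schema (a subset of the fields listed above);
# study labels and values are illustrative, not actual review data
extraction <- data.frame(
  study       = c("Study A", "Study B"),
  recruitment = c("routine screening", "scheduled for biopsy"),
  positive    = c("biopsy-proven cancer", "biopsy-proven cancer"),
  protocol    = c("static", "dynamic"),
  n           = c(150, 92),   # sample size
  n_cancer    = c(30, 25),    # number of positive cases
  sensitivity = c(0.83, 0.92),
  specificity = c(0.71, 0.88)
)
```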
To analyze trends in sensitivity and specificity over time, we plotted publication years against the corresponding sensitivity and specificity values reported in the selected studies. Linear regression models were fitted to each dataset to evaluate the relationships between the publication year and the respective metrics. The equations of the fitted regression lines, along with the p values for the slopes, were extracted to assess the statistical significance of any observed trends. A positive or negative trend was considered statistically significant if the p value of the slope was less than 0.05.
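This trend test reduces to a standard linear regression; a minimal sketch in R, where `studies` is a hypothetical data frame with one row per study:

```r
# Hypothetical data frame: one row per study
studies <- data.frame(
  year        = c(2004, 2008, 2012, 2016, 2020),
  sensitivity = c(0.82, 0.90, 0.78, 0.95, 0.88)
)

# Fit a linear regression of sensitivity on publication year
fit <- lm(sensitivity ~ year, data = studies)

# Slope estimate and its p value; the trend is significant if p < 0.05
summary(fit)$coefficients["year", c("Estimate", "Pr(>|t|)")]
```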
Risk of bias in individual studies

The methodological quality of the selected studies was assessed independently by the same reviewer who screened the documents using the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool [46]. It covers four key domains, each focusing on different aspects of study quality: (1) patient selection, which assesses whether the patients enrolled in the study were representative of those who would typically receive the diagnostic test and whether the selection process avoided bias; (2) index test, which evaluates whether the diagnostic test under investigation was performed and interpreted in a manner that avoided bias and maintained applicability to the review question; (3) reference standard, which examines the reliability and applicability of the reference standard used to verify the results of the index test; and (4) flow and timing, which assesses the time intervals between the index test and the reference standard and the patient flow through the study to ensure consistency and reduce bias. The tool helps reviewers systematically identify potential biases, determine the relevance of study findings to clinical practice, and guide evidence synthesis.
Meta-analysis

For each study that met the eligibility criteria, we computed the confusion matrix with the number of true positive (TP), false negative (FN), true negative (TN), and false positive (FP) cases. If a study compared the results obtained with different algorithms or interpretation criteria, we selected the one with the highest F-score, which is defined as follows:
$$F = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

and

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}.$$
The F-score ranges from 0 to 1, with values closer to 1 reflecting a better balance between the correct detection of positive instances (sensitivity or recall) and the accuracy of positive predictions (precision or positive predictive value, PPV). We chose to use the F-score instead of other metrics that combine sensitivity and specificity because of the high class imbalance present in the studies’ datasets and the clinical implications of missing positive cases (i.e., cancer patients). In the context of cancer detection, the primary goal of a diagnostic test is to identify as many true positive cases as possible (high recall) while ensuring that false positives remain at a manageable level (high precision). The F-score effectively balances these two factors.
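A minimal R sketch of this selection rule, applied to hypothetical confusion matrices of two competing algorithms from the same study:

```r
# F-score from a confusion matrix, following the definitions above
f_score <- function(TP, FN, FP) {
  recall    <- TP / (TP + FN)  # sensitivity
  precision <- TP / (TP + FP)  # positive predictive value
  2 * precision * recall / (precision + recall)
}

# Hypothetical results of two competing algorithms from one study
algorithms <- data.frame(
  name = c("algorithm 1", "algorithm 2"),
  TP = c(40, 45), FN = c(10, 5), TN = c(85, 80), FP = c(15, 20)
)
algorithms$F <- with(algorithms, f_score(TP, FN, FP))

# Keep the row with the highest F-score for the meta-analysis
algorithms[which.max(algorithms$F), ]
```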
In machine learning studies, the data are split into a training set, used to build the model, and a test set, used to evaluate it. In the meta-analysis, we selected the results obtained on the test set because they measure the expected effectiveness in clinical practice.
We conducted two meta-analyses of proportions, one for sensitivity and one for specificity, to statistically combine the results of the selected studies [47]. For this purpose, we compared two methods: inverse-variance and generalized linear mixed models (GLMMs). The inverse-variance method uses the inverse of the variance of each study’s effect size (sensitivity and specificity) as weights, giving more weight to studies with more precise (less variable) estimates. GLMMs extend generalized linear models by incorporating both fixed effects and random effects, allowing for the analysis of data with complex, hierarchical structures, and accounting for variability at multiple levels. Because proportions can be skewed, especially when they are close to 0 or 1, transformations are often applied to stabilize variances and normalize the data before pooling [48]. We compared the following combinations of transformations and meta-analysis methods (a code sketch follows the list):
Inverse-variance with no transformation
Inverse-variance with logit transformation
Inverse-variance with arcsine transformation
GLMMs with logit transformation.
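A sketch of how these four combinations can be run with the "meta" package, assuming hypothetical vectors TP and FN of per-study counts for the sensitivity analysis; "PRAW", "PLOGIT", and "PAS" are the package's codes for the raw, logit, and arcsine transformations, respectively:

```r
library(meta)

# Hypothetical per-study counts for the sensitivity meta-analysis
TP <- c(38, 25, 60, 12, 45)  # true positives per study
FN <- c( 7,  5, 15,  3, 10)  # false negatives per study

# The four transformation/method combinations listed above
m_raw   <- metaprop(TP, TP + FN, sm = "PRAW",   method = "Inverse")
m_logit <- metaprop(TP, TP + FN, sm = "PLOGIT", method = "Inverse")
m_asin  <- metaprop(TP, TP + FN, sm = "PAS",    method = "Inverse")
m_glmm  <- metaprop(TP, TP + FN, sm = "PLOGIT", method = "GLMM")

summary(m_logit)  # pooled sensitivity with confidence interval
```

The specificity analysis follows the same pattern with TN and FP in place of TP and FN.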
Meta-bias assessment

Differences between the studies in terms of methodological factors, such as the purpose of the examination, the imaging protocol followed, or the thermal sensitivity of the infrared camera, may lead to differences in the results obtained. Heterogeneity, i.e., the variability between studies, was measured with the \(I^2\) statistic [49], defined as follows:
$$I^2 = \frac{Q - (k - 1)}{Q} \times 100\%,$$
where \(Q\) is the chi-squared statistic of the reported proportions and \(k\) is the number of studies in the meta-analysis, so there are \(k-1\) degrees of freedom. \(I^2\) describes the percentage of variability in the estimation of sensitivity or specificity that is due to differences between studies rather than to sampling errors (chance). Heterogeneity was considered statistically significant when the p value was less than 0.05 and/or \(I^2\) > 50%.
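For instance, with hypothetical values Q = 40 over k = 11 studies:

```r
# Worked example with hypothetical values
Q <- 40                                 # chi-squared statistic
k <- 11                                 # number of studies (10 df)
I2 <- max(0, (Q - (k - 1)) / Q) * 100   # truncated at 0 by convention
I2                                      # 75%: above the 50% threshold
```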
We performed a subgroup analysis stratifying by imaging protocol (static or dynamic) and type of interpretation to explore possible sources of heterogeneity (see Additional file 3: Tables S2 and S3). However, the results were so similar to those of the overall analysis that we decided not to include them in the article. We also attempted a cumulative analysis, motivated by the recent improvement in the resolution of thermal cameras, but it did not yield informative results.
Publication bias was assessed by visual inspection of funnel plots [50]. In the absence of publication bias, the plot should resemble a symmetrical inverted funnel. By contrast, asymmetry in the funnel plot may indicate the presence of publication bias, where smaller studies with nonsignificant or negative results are less likely to be published.
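With a fitted meta object, such as m_logit from the sketch above, the funnel plot is produced directly:

```r
# Funnel plot of the pooled sensitivity analysis; marked asymmetry
# would suggest that small studies with poor results went unpublished
funnel(m_logit)
```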
The statistical computations and visualization of results were performed using the “meta” package in the R programming language, a user-friendly tool for meta-analysis [51].
Meta-analysis of diagnostic test accuracy

Although not initially considered in the pre-defined protocol, we included a bivariate analysis of sensitivity and specificity to account for the correlation between them. For this analysis, we used MetaDTA (Meta-analysis of Diagnostic Test Accuracy), an online tool specifically designed for the meta-analysis of diagnostic test accuracy studies [52]. MetaDTA employs advanced statistical methods tailored for bivariate data. Among other features, MetaDTA produces summary receiver operating characteristic (SROC) plots that graphically summarize the diagnostic accuracy across multiple studies by plotting the sensitivity against 1 − specificity, i.e., the false positive rate (FPR), of each study. The resulting curve represents the overall trade-off between sensitivity and specificity.
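MetaDTA is an interactive web tool, but the same kind of bivariate model and SROC plot can be reproduced in R; the following is a minimal sketch with the mada package (our substitution for illustration, not the tool used in the review), fitted to hypothetical per-study counts:

```r
library(mada)

# Hypothetical per-study confusion matrices
dta <- data.frame(
  TP = c(38, 25, 60, 12, 45),
  FN = c( 7,  5, 15,  3, 10),
  FP = c(12, 20, 30,  5, 18),
  TN = c(80, 70, 90, 40, 85)
)

fit <- reitsma(dta)          # bivariate model of sensitivity and FPR
summary(fit)                 # pooled estimates and their correlation

plot(fit, sroclwd = 2)       # SROC curve
points(fpr(dta), sens(dta))  # individual study estimates
```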