Since the early stages of the worldwide COVID-19 pandemic, caused by the SARS-CoV-2 virus, a broad variety in the clinical evolution of patients has been observed: from asymptomatic cases and mild illness to critical cases and fatal respiratory failure. Such differences suggest the existence of distinct population groups that respond in notably disparate manners.
COVID-19 has attracted massive attention from the scientific community, which has applied a wide spectrum of techniques and approaches to improve our understanding of the behaviour of the disease, its transmission, diagnosis, therapy and prognosis. Machine learning-based models provided predictions of severity and mortality, which facilitated hospital resource allocation and aided clinical decision making. In addition, several works in the literature have been devoted to discovering heterogeneous ‘phenotypes’ (i.e. clusters, in data science terminology) underlying the population, and to associating them with eventual clinical outcomes: e.g. mortality, need for admission to intensive care units (ICU) or for mechanical ventilation, survival time and/or length of in-hospital stay.
This work aims to contribute towards the understanding of clinical phenotypes in COVID-19, obtained for a Spanish cohort of hospitalized patients with SARS-CoV-2 pneumonia, and to relate such phenotypes to two different clinical outcomes: severity in patients’ evolution and mortality.
Related works
Wang et al. [1] examined n=20572 cases positive for COVID-19, of which 3548 required hospitalization. The study enrolled patients in the USA from March to October 2020 and incorporated data about patients’ demographics (age, sex), comorbidities and a selection of 17 biomarkers from routine blood tests. Using Latent Class Analysis (LCA) for clustering, the authors found 7 distinct phenotypes across the entire cohort, as well as 5 subphenotypes for the hospitalized population. Among the latter, the first subphenotype (14% prevalence) was formed by younger patients, with elevated counts of white blood cells (WBC) and platelets, mild anaemia and normal ranges of C-reactive protein (CRP), creatinine and albumin. The second subphenotype (21% prevalence) comprised middle-aged individuals with no or few comorbidities, lymphopenia and elevated CRP. The third (20%) also comprised middle-aged individuals, but with more comorbidities, a hyperinflammatory response and markedly high CRP, WBC and platelets. The fourth subphenotype (25%) comprised older patients, with the highest presence of comorbidity, leukopenia and lymphopenia. The fifth (20%) was also formed by older individuals, with a hyperinflammatory response and kidney dysfunction, high creatinine, anaemia, lymphopenia, hypoalbuminemia, elevated CRP, etc. In terms of clinical outcomes, subphenotypes 3 and 5 were associated with higher likelihoods of ICU admission and/or in-hospital death than 1 and 2, whereas 4 and 5 had more unfavourable survival than the others, despite subphenotype 3 being admitted to the ICU more often.
Su et al. [2] analyzed n=14418 patients from 5 hospitals based in the USA (16.3% treated in the emergency department, 83.7% hospitalized), for an enrolment period spanning March to June 2020. The authors collected sociodemographic data (age, sex, race/ethnicity), 9 comorbidities and 30 biomarkers, selecting 23 variables after data quality assessment. Via hierarchical agglomerative clustering, they discovered 4 underlying subphenotypes. Subphenotype I (33% prevalence) tended to include younger patients, more females and fewer comorbidities. II (37%) had more males, more abnormal markers of inflammation (CRP, interleukin IL-6, lactate dehydrogenase LDH, erythrocyte, etc.) and hepatic dysfunction (ferritin, alanine, bilirubin). III (18%) encompassed older patients, with more frequent black ethnicity, renal dysfunction (blood urea nitrogen BUN, creatinine) and abnormal hematologic markers (D-dimer, hemoglobin). IV (12%) also had older patients, more males, higher comorbidity and more abnormal values across all biomarkers. The authors reported that these subphenotypes were strong predictors of various clinical outcomes, most notably 60-day mortality. Interestingly, there was also an association with the patient’s socioeconomic status. I had the most favourable outcomes (in terms of rates of death, need for mechanical ventilation and ICU admission), whereas II and III showed intermediate situations, and IV was the most unfavourable.
Lusczek et al. [3] enrolled n=1022 in-hospital patients from 14 centres in the USA, from March to August 2020. The authors collected 33 variables within the first 72 hours after admission: demographics (age, body mass index BMI), 9 comorbidity categories, vital signs (heart and respiratory rates, blood pressure, oxygenation SpO2), and laboratory analyses. An ensemble consensus clustering, based on k-means, suggested the presence of 3 phenotypes, with statistically significant interactions with comorbidity, complications and hazard of death. Phenotype I (23% prevalence) was termed ‘adverse’: it included older patients, with more comorbidities (cardiac, hematologic, renal, although fewer respiratory), and altered LDH, neutrophils, D-dimer, aspartate aminotransferase AST, CRP, etc. It was associated with the most unfavourable clinical outcomes in terms of mortality, mechanical ventilation and ICU admission. Phenotype II was the most common (60%) and represented an intermediate situation, with less hepatic disease than I or III but more comorbidity in general (e.g. metabolic and autoimmune). Phenotype III (17%) was ‘favourable’: it had more females and more neutropenia, and also more frequent smoking and/or alcohol abuse. Despite its very high rate of respiratory comorbidity, it showed the best clinical outcomes (lowest mortality), and the authors hypothesized that these patients were more predisposed to long-term sequelae.
Gutiérrez et al. [4] conducted a clustering study with an internal cohort for phenotype derivation and internal validation (n=4035 patients from 127 hospitals in Spain, belonging to the first COVID-19 pandemic wave in the country, February to April 2020; 66% of them for derivation, 34% for validation), alongside external validation (n=2226). Their dataset encompassed 69 variables per patient: age, sex, race/ethnicity, 16 comorbidities, 6 prior medication treatments, 7 COVID-19 symptoms, laboratory data and chest radiological findings. Through a two-step cluster analysis, in which the optimal number of clusters was found by maximizing the Silhouette score, the authors identified 3 phenotypes. Phenotype A (19% prevalence) had younger individuals, less frequently males, with mild symptoms, normal inflammatory patterns (CRP, IL-6, ferritin, LDH) and higher lymphocytes. B (73%) showed cases with more symptoms (fever, cough), often without pulmonary infiltrations in chest X-ray but more interstitial, obesity, lymphocytopenia, and moderately elevated inflammatory parameters. Patients in C (7%) suffered more obesity, frequent comorbidities (hypertension, diabetes, chronic heart/lung/kidney diseases), poorer oxygenation, and even higher inflammatory biomarkers than B (neutrophils, D-dimer, procalcitonin, CRP). In turn, these phenotypes showed statistically remarkable differences in 30-day mortality rates: 3.7% for A in the external validation cohort, 23.7% for B, and 51.4% for C.
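The criterion of choosing the number of clusters by maximizing the Silhouette score can be illustrated with a minimal sketch. This is not the two-step procedure of [4]: k-means from scikit-learn is swapped in as a simple stand-in, the data are synthetic blobs rather than clinical records, and the helper name `best_k_by_silhouette` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values, seed=0):
    """Return (best_k, scores), where scores maps each k to its mean Silhouette."""
    scores = {}
    for k in k_values:
        # Assumption: k-means stands in for the clustering algorithm of interest.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # The selected k is the one with the highest mean Silhouette score.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for a patient-by-variable matrix (NOT real clinical data):
# three well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=0,
)

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print("selected k:", best_k)
for k, s in scores.items():
    print(k, round(s, 3))
```

The Silhouette score is only defined for two or more clusters, so this criterion cannot indicate that a cohort is homogeneous (k=1).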
Ranard et al. [5] examined another USA cohort with n=528 hospitalized patients (March to July 2020), employing age and around 40 laboratory values (median and inter-quartile range throughout each patient’s hospitalization) as their input data. The authors trained a range of clustering algorithms, namely k-means, Birch, Gaussian Mixture Models and agglomerative hierarchical clustering, obtaining 4 phenotypes (‘endotypes’). Endotype 1 (25.6% prevalence) had the highest rate of women, the lowest hypertension and diabetes, but the highest chronic obstructive pulmonary disease; it encompassed the cases with the lowest inflammatory status (ferritin, IL-6, CRP, LDH), the lowest infectious status (WBC, procalcitonin), and the lowest coagulopathy (prothrombin time and partial thromboplastin time). Endotype 2 (18.9%) showed the most aggravated comorbidities (hypertension, diabetes, chronic kidney and renal diseases, heart failure), moderate inflammatory and infectious statuses, and low coagulopathy. Endotype 3 (32.0%) had low comorbidity, moderate inflammatory and infectious statuses, but high coagulopathy. Finally, endotype 4 (23.5%) had the fewest women, high comorbidity, high inflammatory and infectious statuses, and high coagulopathy. The authors reported evidence of statistical differences in mortality, increasing from 1 to 4, and in the rate of intubation: below average for 1 and 2, above average for 3 and 4.
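As an illustration of running the four algorithm families named above on the same data, the following sketch fits each of them with scikit-learn and measures how far their partitions agree via the adjusted Rand index. The agreement check is our own assumption about how such a comparison might be done, not the procedure reported in [5], and the data are synthetic.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for standardized laboratory features (NOT real patient data).
centers = [[-8, -8], [-8, 8], [8, -8], [8, 8]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

k = 4  # number of clusters requested from every algorithm
models = {
    "k-means": KMeans(n_clusters=k, n_init=10, random_state=0),
    "Birch": Birch(n_clusters=k),
    "GMM": GaussianMixture(n_components=k, n_init=5, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=k),
}

# Every estimator above exposes fit_predict, returning one label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

# Pairwise agreement between partitions; the adjusted Rand index ignores
# how each algorithm happens to number its clusters.
names = list(labels)
agreement = {
    (a, b): adjusted_rand_score(labels[a], labels[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, ari in agreement.items():
    print(pair, round(ari, 3))
```

On real clinical data the algorithms will usually disagree to some extent; low pairwise agreement is a signal that the cluster structure depends on the method chosen.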
Teng et al. [6] considered n=483 hospitalized patients in the USA, enrolled between February and May 2020. The authors collected information on demographics (age, sex, race/ethnicity, BMI), 8 comorbidities, 8 laboratory variables and 8 types of medications during admission. With these, they found two phenotypes in their overall cohort via LCA. Cluster C1 (40% prevalence) encompassed older patients, fewer males, fewer individuals of non-white ethnicity, more comorbidities (hypertension, coronary, chronic heart failure, diabetes, kidney, pre-existing respiratory conditions, etc.), higher creatinine and pro-natriuretic peptide (pro-BNP), but lower inflammatory markers (CRP, alanine). Conversely, patients in cluster C2 (60%) were younger, more obese and had higher inflammatory markers (CRP, alanine). In terms of the observed clinical outcomes, these two clusters did not differ significantly in length of stay, but they did in in-hospital death: 25.4% for C1 versus 9.0% for C2. Subsequently, the authors derived an additional clustering for the subpopulation of 75 deceased cases, although the resulting two subphenotypes (C1’, C2’) were statistically comparable to the overall ones (C1, C2).
Epsi et al. [7] investigated symptom clusters with n=1273 USA military patients from different pandemic waves (March 2020 to March 2022), relating these symptoms to various clinical progressions (including failure to return to usual health and/or prolonged COVID-19). Methodologically, they exploited linear Principal Component Analysis (PCA) and k-means clustering, with the optimal k chosen by gap statistics. The authors reported three clusters. ‘Nasal’ (34% prevalence), characterized by runny nose and sneezing, showcased intermediate comorbidity (40% of cases with non-zero Charlson comorbidity index), and had a hospitalization rate (11.9%) lower than the overall average. ‘Sensory’ (35%), characterized by loss of smell and/or taste, had individuals younger than in the other two clusters, with the lowest presence of comorbidity (28% non-zero Charlson), and also low hospitalization (10.5%). The ‘Respiratory/systemic’ cluster (31%), characterized by upper and lower respiratory symptoms (cough, trouble breathing) and/or systemic symptoms (e.g. body ache), entailed the worst comorbidity (47% non-zero Charlson), which translated to the highest hospitalization (36.3%) and other unfavourable outcomes: failure to return to usual health and/or prolonged COVID-19 (beyond 6 months).
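The combination of PCA, k-means and the gap statistic can be sketched as follows. This is a minimal illustration under our own assumptions, not the pipeline of [7]: the data are synthetic, the reference distribution is uniform over the bounding box of the projected data, and the helper names `log_wk` and `gap_statistic` are hypothetical. The gap for k clusters is Gap(k) = E*[log W_k] - log W_k, where W_k is the within-cluster sum of squares and E* is the average over reference datasets; the selected k is the smallest one satisfying Gap(k) >= Gap(k+1) - s_{k+1}.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA


def log_wk(X, k, seed):
    """Log of the within-cluster sum of squares of a k-means fit with k clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k_max=6, n_refs=20, seed=0):
    """Return (selected_k, gaps) using uniform reference data in the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        # Reference dispersion: same sample size, no cluster structure.
        ref = np.array(
            [log_wk(rng.uniform(lo, hi, size=X.shape), k, seed) for _ in range(n_refs)]
        )
        gaps.append(ref.mean() - log_wk(X, k, seed))
        errs.append(ref.std() * np.sqrt(1.0 + 1.0 / n_refs))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}.
    for i in range(k_max - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1, gaps
    # Fallback if the rule never triggers: take the largest gap.
    return int(np.argmax(gaps)) + 1, gaps


# Synthetic stand-in for a symptom matrix (NOT real data): three groups in six
# dimensions, separated along the first two dimensions only.
centers = [[-6, -6, 0, 0, 0, 0], [0, 6, 0, 0, 0, 0], [6, -6, 0, 0, 0, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Linear PCA down to two components, then the gap statistic on the projection.
Z = PCA(n_components=2).fit_transform(X)
best_k, gaps = gap_statistic(Z)
print("selected k:", best_k)
```

Unlike the Silhouette score, the gap statistic is defined for k=1, so it can also indicate that no cluster structure is present.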
With a particular focus on the characterization of ICU patients, Chen et al. [8] recruited n=504 ICU cases in China, from January to March 2020. The authors collected 26 clinical variables: age, comorbidities, vital signs (heart and respiratory rates, blood pressure, oxygenation, etc.), and laboratory results within the first 24 h after ICU admission. Both consensus k-means clustering and LCA agreed on a two-phenotype model: the former determining k by gap statistics, the latter by minimization of the Akaike information criterion (AIC) for parsimony. In addition, 5 out of the 26 variables (neutrophil-to-lymphocyte ratio NLR, SpO2/FiO2, LDH, tumour necrosis factor TNF-\(\alpha\), and urea nitrogen) were selected according to their informativeness, i.e. feature importance, as judged by various supervised machine learning classifiers of bagging and boosting types. The so-termed ‘hyperactive’ cluster (36% prevalence), when compared against the ‘hypoactive’ one (64%), encompassed older patients, with more comorbidities, elevated heart and respiratory rates, higher Sequential Organ Failure Assessment (SOFA) score, elevated inflammation markers (e.g. WBC, NLR, CRP, IL-6, TNF-\(\alpha\)), and more extreme laboratory values regarding organ dysfunction (platelets, bilirubin, creatinine, urea nitrogen, LDH, SpO2/FiO2, etc.). Moreover, these two clusters showed significant differences across all clinical outcomes of interest: not only in 28-day mortality (74.3% for ‘hyperactive’ versus 10.8% for ‘hypoactive’) but also in the frequency of acute respiratory distress, septic shock, acute cardiac and/or kidney injury and coagulopathy.
For Spain, Rodríguez et al. [9] studied a cohort formed by n=2022 ICU patients (February to May 2020). The authors investigated the association between phenotype and mortality risk. Having collected 42 clinical variables at ICU admission (age, sex, 13 comorbidities, APACHE II score for severity of illness, SOFA score for severity of organ dysfunction, 6 types of treatment and 8 laboratory measurements), they selected 25 of these variables as the most informative in relation to ICU mortality. By applying Partitioning Around Medoids (PAM) techniques, the authors found 3 phenotypes. Phenotypes A (‘mild’) and B (‘moderate’) showcased younger patients than C (‘severe’), both with lower severity (APACHE II, SOFA) and better inflammatory (LDH), renal (ferritin) and hematologic markers (D-dimer). Between A and B, the main differences were in D-dimer and in the presence of shock. Moreover, their C cluster was reported to entail significant differences in clinical evolution with respect to the other two: particularly, higher ICU mortality (20.3% for A, 25.5% for B, and 45.4% for C).
In the Netherlands, Siepel et al. [10] collected data from n=2438 patients admitted to ICU, from February 2020 to March 2021 (the first and second COVID-19 pandemic waves in the country). They used 41 explanatory variables (demographics, clinical observations, medication, lab tests, vital signs and recordings of life support devices at the ICU) to describe the time-dependent evolution in the clinical status of patients. The authors conducted 21 day-by-day analyses. At admission and until ICU day 4, two clusters were reported to exist: ‘mild’ (38.2% prevalence) and ‘severe’ (61.8%). From then onwards, and until day 15, the ‘severe’ one split into ‘mild’ (38.2% prevalence) and ‘severe’ (36.3%). Throughout day 21, only 8.2% of the initial ‘mild’ cluster and only 4.6% of the initial ‘severe’ remained assigned to the same phenotype. This behaviour highlighted the suitability of time-dependent analyses. Besides, the authors pointed out that the heterogeneity appeared to be driven by inflammation biomarkers and dead space ventilation.