Development of a phenotype ontology for autism spectrum disorder by natural language processing on electronic health records

Workflow and results summary

The workflow for identifying ASD phenotype terms from high-quality (HQ) clinical text is shown in Fig. 1. Our analysis includes four steps: (1) patient cohort selection, (2) data quality control on the raw clinical notes, (3) identification of ASD phenotype terms by NLP and statistical analysis, and (4) quantification of individual ASD patients’ phenotypes using our terminology set, followed by dimensionality reduction for patient clustering analysis. After cohort selection and rigorous quality control, 8499 ASD patients with 56,958 HQ clinical notes from the psychiatric department were selected. The final ASD terminology set contained 3336 phenotype terms, which were further organized into a 5-layer ontology structure based on DSM-5 criteria and our collaborating clinicians’ design. The subsequent analysis showed that our ASD phenotype terminology set distinguishes ASD patients from non-ASD psychiatric patients better than an existing published ASD vocabulary developed by Lingren et al. [9]. Furthermore, we demonstrated that our terminology set can be used to cluster ASD patients into subgroups and to quantitatively map an individual ASD patient’s phenotypic characteristics to DSM-5 criteria.

Fig. 1

Workflow of ASD phenotype ontology development

Patients and clinical notes availability summary

Initially, we identified 33,230 ASD patients with 3,611,649 clinical notes by ICD-9 and ICD-10-CM codes (see “Methods” for details). In the same way, we queried our EHR database to create two comparison cohorts, a non-ASD psychiatric cohort and a nonpsychiatric cohort, enabling statistical comparisons between these groups. The initial numbers of patients and clinical notes, as well as the ICD-9 and ICD-10-CM codes used, are shown in Table S1. The age and gender distributions of individuals with ASD are shown in Fig. 2. Among these patients, 26,020 are male, 7218 are female, and 24 are of unknown gender in the study cohort. The age at diagnosis was calculated as the date of diagnosis minus the date of birth. After QC, the final number of patients, the number of HQ psychiatric notes, and the gender distributions for the ASD cohort and the two control cohorts are shown in Table S2. In total, we validated 56,958 HQ clinical notes from 8499 individuals with ASD, 41,753 HQ notes from 8177 individuals with non-ASD psychiatric conditions, and 21,028 HQ notes from 8482 individuals without any ASD or psychiatric problems. The distribution of the number of HQ psychiatric notes per ASD individual can be found in Fig. S1.
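The age-at-diagnosis calculation described above can be sketched as follows. This is a minimal illustration that assumes age is reported in completed years; the text does not state the unit or rounding rule, and the function name `age_at_diagnosis` is hypothetical.

```python
from datetime import date


def age_at_diagnosis(birth: date, diagnosis: date) -> int:
    """Return age in completed years at the diagnosis date.

    Assumption: completed years are used; the source only says the age is
    the date of diagnosis minus the date of birth.
    """
    if diagnosis < birth:
        raise ValueError("diagnosis date precedes date of birth")
    years = diagnosis.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the diagnosis year.
    if (diagnosis.month, diagnosis.day) < (birth.month, birth.day):
        years -= 1
    return years
```

Diagnosis dates very close to the birth date yield an age of 0, which is the kind of record a reviewer may want to inspect for retrospective code assignment.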

Fig. 2

Gender and age distribution in the ASD patient cohort. Some patients were diagnosed at a very early age, which may represent an artifact of retrospective assignment of ICD codes in EHRs

Statistical analysis for ASD terms comparison in case and control groups

Following the same NLP protocol as in our previously published paper on ASD literature mining [14], we used CLAMP [13] to extract phenotypic named entities from unstructured documents. Initially, CLAMP recognized 4,794,554 biomedical named entities (NEs), which were mapped to 80,350 UMLS CUIs (concept unique identifiers) based on their similarity to the unique biomedical concepts in the UMLS database. We removed NEs that could not be mapped to any UMLS CUI. In addition, since not all mapped UMLS CUIs are relevant to disease phenotypes, we kept only NEs belonging to 12 semantic types (e.g., activity, individual behavior, and mental process; see Table 1), given that most NEs belong to irrelevant UMLS semantic categories such as lab test, body system, and molecular and chemical types.

We then calculated the frequency of each NE in the three cohorts and performed odds-ratio (OR) analysis for the ASD cohort against each of the two control cohorts, the non-ASD psychiatric cohort and the nonpsychiatric cohort. Because this was an initial filtering step and we wanted a relatively large candidate pool, we used a relatively loose selection criterion. We chose 1.5 as the OR cutoff: our hypothesis is that if the odds of an NE appearing in the ASD cohort are at least 50% higher than in a non-ASD cohort, the NE is potentially associated with ASD. A list of 8661 NEs passed the initial cutoff of OR > 1.5. NEs referring to ASD communication characteristics, such as “nonverbal/nonverbal communications,” had the largest odds ratios in the ASD vs. nonpsychiatric group comparison. In contrast, NEs describing restricted/repetitive behavior features, such as “autistic behavior” and “stereotyped phrases, repetitive language,” had the largest odds ratios in the ASD vs. non-ASD psychiatric group comparison. Thus, impairments of social interaction and communication behaviors, as well as stereotyped behaviors, are the most significant and consistent psychiatric symptoms in the EHRs of ASD patients compared to the non-ASD population. This observation, made purely from EHR clinical notes, is consistent with the DSM-5 diagnostic criteria for ASD.
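The odds-ratio filter can be illustrated with a short sketch. It assumes the OR is computed at the patient level from a 2x2 table (patients with or without the term, in the ASD cohort and in a control cohort); the text does not spell out the exact table, the zero-cell correction is an added assumption, and all counts below are hypothetical.

```python
OR_CUTOFF = 1.5  # cutoff stated in the text


def odds_ratio(case_with, case_total, ctrl_with, ctrl_total):
    """Odds ratio of a term appearing in the case cohort vs. a control cohort."""
    a = case_with                 # case patients whose notes contain the term
    b = case_total - case_with    # case patients without the term
    c = ctrl_with                 # control patients with the term
    d = ctrl_total - ctrl_with    # control patients without the term
    if 0 in (a, b, c, d):
        # Assumption: Haldane-Anscombe correction (add 0.5 to every cell)
        # to avoid division by zero; the source does not mention this.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


# Hypothetical counts: term -> (ASD patients with term, control patients with term)
ASD_TOTAL, CTRL_TOTAL = 1000, 1000
counts = {"nonverbal": (300, 100), "headache": (110, 100)}

kept = [
    term
    for term, (asd_with, ctrl_with) in counts.items()
    if odds_ratio(asd_with, ASD_TOTAL, ctrl_with, CTRL_TOTAL) > OR_CUTOFF
]
```

In this toy example, "nonverbal" has an OR of about 3.86 and is kept, while "headache" has an OR of about 1.11 and is dropped.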

NLP embedding analysis to further identify ASD specific terms

Filtering NEs solely by the odds ratio computed from CLAMP output is not optimal: the cutoff is arbitrary, so some false-positive terms may have been included, and some true-positive terms excluded, in the initial list of 8661 ASD terms. Because manual evaluation of all these terms is labor-intensive, we used semiautomated approaches to recognize novel ASD terms and exclude false-positive ones automatically, based on a limited number of true-positive ASD terms labeled by human experts. In collaboration with three clinicians specialized in ASD, we manually examined and verified 1102 ASD terms collected from PubMed literature searches and organized them into four main categories based on term type (Table 2).

Table 2 Clinician curated classification categories and examples

Using these 1102 terms as the “seed” list, we performed two further steps of NLP analysis to ensure high-quality terms: (1) using a BioBERT NER model [17] trained on ASD patients’ clinical notes, we identified and verified 302 novel ASD terms from clinical notes that were not captured in the initial list. These 302 terms were added to the “seed” list, which was then treated as the new gold standard list; (2) using the BioSent2Vec embedding model [18], we converted all these ASD terms into 700-dimensional vectors and calculated the cosine similarities between the initial 8661 ASD NEs and the new gold standard ASD terms. Among these 8661 NEs, 1943 showed high similarity to the gold standard terms, so we consider them true-positive ASD NEs with high confidence.
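The similarity-based filtering step can be sketched as below. This is only an illustration: the embeddings are toy 3-dimensional vectors rather than BioSent2Vec output, the term names are hypothetical, and the threshold of 0.8 is an assumption, since the text does not report the similarity cutoff used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(candidate_vec, gold_vectors):
    """Return (gold term, similarity) for the most similar gold standard term."""
    scored = ((term, cosine_similarity(candidate_vec, vec))
              for term, vec in gold_vectors.items())
    return max(scored, key=lambda pair: pair[1])


def filter_candidates(candidates, gold_vectors, threshold):
    """Keep candidates whose best gold standard match reaches the threshold."""
    kept = {}
    for term, vec in candidates.items():
        gold_term, sim = best_match(vec, gold_vectors)
        if sim >= threshold:
            kept[term] = (gold_term, sim)
    return kept


# Toy embeddings (hypothetical; real vectors would come from the embedding model).
gold = {"gold term 1": [1.0, 0.0, 0.0], "gold term 2": [0.0, 1.0, 0.0]}
candidates = {"candidate close to 1": [0.9, 0.1, 0.0],
              "unrelated candidate": [0.0, 0.0, 1.0]}

kept = filter_candidates(candidates, gold, threshold=0.8)  # threshold is an assumption
```

Keeping the best-matching gold term alongside each retained candidate also gives a natural place to record the similarity score reported for each term.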

The final list of 3336 high-confidence NEs is shown in Table S3. The newly recognized ASD phenotype terms have IDs starting with “ASD,” while the gold standard terms have IDs starting with “S.” Each ASD NE carries statistical information such as the percentage of patients whose notes contain the term in each cohort and the odds ratio for the ASD vs. control group comparison. The correlation score for each ASD NE mapped to the DSM-5 criteria is the cosine similarity value from the embedding analysis. To obtain the statistical data for the newly recognized terms, we ran CLAMP on the clinical notes again using its dictionary look-up function.

Patient clustering

To show the utility of our terminology set in characterizing ASD phenotypes, we selected the 2000 ASD patients and 2000 non-ASD psychiatric patients whose clinical notes contain the most ASD phenotypic concept information. We used CLAMP to extract ASD terms that map to the terminology set from the psychiatric notes of each patient, so that each patient is represented by a list of standard ASD terms. A binary data matrix with 4000 rows (patients) and 3336 columns (terms) was generated based on whether a particular term is present or absent in the patient’s notes. We then used the TF-IDF (term frequency-inverse document frequency) method to transform the data matrix and performed nonnegative matrix factorization (NMF) to cluster patients. Next, we used a t-SNE plot to visualize the ASD and non-ASD psychiatric patient clusters. Figure 3 compares the patient clusters obtained using our terminology set and Lingren’s ASD list. As shown in Fig. 3a, ASD patients are mostly clustered in the lower half of the plot, while the non-ASD psychiatric patients are clustered in the upper half. We also observed four distinct subgroups of ASD patients, while a subset of ASD patients is mixed with non-ASD psychiatric patients. In contrast, no clear cluster pattern can be observed using Lingren’s ASD list, as shown in Fig. 3b. Of note, because Lingren’s list contains very few ASD terms, only a fraction of the patients in the ASD and non-ASD groups have features from Lingren’s list. For example, only 191 ASD terms from Lingren’s list were found, in 1960 patients of the ASD group and 999 patients of the non-ASD psychiatric group, whereas 1196 ASD terms from our list were found in both the ASD and non-ASD psychiatric groups of 2000 patients each. This suggests that our terminology set is more effective in identifying ASD patients’ phenotypes and therefore separates ASD patients from general psychiatric patients better.
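The matrix pipeline described above (binary patient-by-term matrix, TF-IDF weighting, then NMF) can be sketched in plain Python. This is a toy illustration under several assumptions: the IDF uses one common smoothed variant, NMF is fitted with multiplicative updates for the squared error, each patient is assigned to the component with the largest weight, and the 4 x 4 matrix is hypothetical. The text does not specify these details, and the t-SNE visualization step is omitted.

```python
import math
import random


def tfidf(binary_rows):
    """Weight a binary patient-by-term matrix by a smoothed inverse document frequency."""
    n_docs, n_terms = len(binary_rows), len(binary_rows[0])
    df = [sum(row[j] for row in binary_rows) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in binary_rows]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Factorize v (n x m) into nonnegative w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def cluster_patients(binary_rows, rank):
    """Assign each patient to the NMF component with the largest weight."""
    w, _ = nmf(tfidf(binary_rows), rank)
    return [max(range(rank), key=row.__getitem__) for row in w]


# Hypothetical toy data: patients 0-1 share terms 0-1, patients 2-3 share terms 2-3.
toy = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
labels = cluster_patients(toy, rank=2)
```

On the full 4000 x 3336 matrix, a numerical library would be the practical choice; the pure-Python loops here are only meant to make each step explicit.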

Fig. 3

Comparison of t-SNE clustering analysis for top 2000 ASD patients and 2000 psychiatric (non-ASD) patients using our terminology set (a) and using Lingren’s terminology set (b). Since not all the patients contain the ASD vocabulary developed by Lingren et al., we only analyzed patients containing these terms. Results showed that our terminology set separates ASD patients from general psychiatric (non-ASD) patients much better than Lingren’s list. From the t-SNE plot, we can see ASD patients can be further divided into 4 subgroups; however, one group of ASD patients (cluster 4) is mixed with non-ASD psychiatric patients

We further examined the subgroups of the 2000 ASD patients and explored how these subgroups map to the DSM-5 guidelines. The relationship between clusters and DSM-5 guidelines is determined by their NE similarities. The descriptive sentence of each DSM-5 criterion was broken down into a list of ASD NEs (Table S4), and these NEs were then converted to high-dimensional vectors using the BioSent2Vec embedding approach. To determine the relationships between ASD patients’ phenotypes and the DSM-5 guidelines, we mapped each phenotype term to the criterion with which it had the highest cosine similarity. Table S5 displays examples of ASD phenotype terms matched to each DSM-5 criterion for ASD.

To show that real-world patient notes contain key diagnostic information, we generated a radar plot displaying the mapping from summarized patient phenotypes to the individual DSM-5 criteria. If a patient’s notes contain any phenotypic term that is a child of the parent category “social interaction” in the ontology, we assign this patient to the DSM-5 A1 criterion “social interaction.” Each value in the radar plot is the percentage of patients who match a given DSM-5 criterion. As shown in the radar plot in Fig. 4a, nearly all patients from the four clusters had ASD phenotype terms under DSM-5 criteria A1 (social interaction), A2 (social communication), A3 (social relationship), and B2 (ritualized behaviors). Meanwhile, the percentage of patients with phenotype terms under DSM-5 criteria B1 (repetitive behaviors), B3 (fascination and preoccupation), and B4 (unusual sensory and comorbidities) varied across clusters. This is consistent with the DSM-5 diagnostic criteria for ASD, under which subjects should manifest symptoms in all of A1, A2, and A3 and in at least two of B1, B2, B3, and B4. Figure 4b shows how individual patients from different subgroups map to DSM-5 criteria based on the phenotype terms extracted from their clinical notes.
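The per-cluster percentages and the DSM-5 pattern check can be sketched as follows. The term names and the term-to-criterion mapping are hypothetical placeholders; in the study, the mapping comes from the ontology and the embedding similarity analysis.

```python
CRITERIA_A = ("A1", "A2", "A3")
CRITERIA_B = ("B1", "B2", "B3", "B4")


def patient_criteria(terms, term_to_criterion):
    """Set of DSM-5 criteria matched by any term found in a patient's notes."""
    return {term_to_criterion[t] for t in terms if t in term_to_criterion}


def criterion_percentages(cluster_patients, term_to_criterion):
    """Percentage of patients in a cluster matching each criterion (radar plot values)."""
    matched = [patient_criteria(terms, term_to_criterion) for terms in cluster_patients]
    n = len(matched)
    if n == 0:
        return {c: 0.0 for c in CRITERIA_A + CRITERIA_B}
    return {c: 100.0 * sum(c in m for m in matched) / n
            for c in CRITERIA_A + CRITERIA_B}


def meets_dsm5_pattern(matched):
    """True if all of A1-A3 and at least two of B1-B4 are matched."""
    all_a = all(c in matched for c in CRITERIA_A)
    n_b = sum(c in matched for c in CRITERIA_B)
    return all_a and n_b >= 2


# Hypothetical mapping and a two-patient cluster.
mapping = {"tA1": "A1", "tA2": "A2", "tA3": "A3", "tB1": "B1", "tB2": "B2"}
cluster = [["tA1", "tA2", "tA3", "tB1", "tB2"], ["tA1", "tA2"]]

percentages = criterion_percentages(cluster, mapping)
flags = [meets_dsm5_pattern(patient_criteria(p, mapping)) for p in cluster]
```

Matching terms in notes is not a diagnosis; the flag only indicates whether the extracted terms cover the same criteria pattern that DSM-5 requires.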

Fig. 4

Mapping subgroup of ASD patients to DSM-5 guideline. a The percentage of subgroups of ASD patients in each cluster that maps to DSM-5 individual criteria. b As an illustrative example, we quantified individual patient’s ASD characteristics to DSM-5 guideline for patients in cluster 1 and cluster 4

Better ontology structure facilitates ASD phenotype interpretation

The ASD phenotype ontology is a data tree with five levels (Fig. 5A). The first level contains only the root node, “ASD.” The second level consists of 3 nodes: “social interaction” represents criterion A in DSM-5, “repetitive behavior” represents criterion B, and “ASD and comorbidities” covers the different names of ASD and its comorbidities and represents criterion E. The nodes in level 3 are the sub-criteria in the DSM-5 domain for ASD. In level 4, the nodes are ASD phenotype terms extracted from the DSM-5 guideline by CLAMP. The nodes in the last level are our ASD phenotype terminology set learned from ASD patients’ clinical notes; this is the most important level because it provides a richer characterization of ASD. Each node has the following properties: the CUI, one of the standard names in the UMLS database, the semantic type(s) of the CUI, the category of the DSM-5 guideline, and the odds ratios of the term as used in ASD vs. control EHR notes. The complete ontology is stored in both an RDF and an XML file. These file formats can be imported into Protégé, an open-source ontology editor and knowledge management system, where the ontology structure can be viewed easily (Fig. 5B).
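The five-level tree and its node properties can be modeled with a small data structure. This is a sketch only: the class and field names are hypothetical, the CUI and odds ratio are placeholder values rather than real entries, and the RDF/XML serialization used in the study is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyNode:
    name: str                                  # standard UMLS name or category label
    cui: Optional[str] = None                  # UMLS concept unique identifier
    semantic_types: List[str] = field(default_factory=list)
    dsm5_category: Optional[str] = None        # DSM-5 criterion the node falls under
    odds_ratio: Optional[float] = None         # ASD vs. control odds ratio
    children: List["OntologyNode"] = field(default_factory=list)

    def add(self, child: "OntologyNode") -> "OntologyNode":
        """Attach a child node and return it so that calls can be chained."""
        self.children.append(child)
        return child


def depth(node: OntologyNode) -> int:
    """Number of levels in the subtree rooted at this node."""
    return 1 + max((depth(c) for c in node.children), default=0)


# Level 1: root. Level 2: the three top categories named in the text.
root = OntologyNode("ASD")
social = root.add(OntologyNode("social interaction", dsm5_category="A"))
root.add(OntologyNode("repetitive behavior", dsm5_category="B"))
root.add(OntologyNode("ASD and comorbidities", dsm5_category="E"))

# Levels 3-5 along one branch; names and values below are placeholders.
a1 = social.add(OntologyNode("A1", dsm5_category="A1"))
dsm_term = a1.add(OntologyNode("term extracted from DSM-5 text", dsm5_category="A1"))
dsm_term.add(OntologyNode(
    "term learned from clinical notes",
    cui="C0000000",                            # placeholder, not a real CUI
    semantic_types=["Individual Behavior"],
    dsm5_category="A1",
    odds_ratio=2.0,                            # placeholder value
))
```

Calling `depth(root)` on this toy tree returns 5, matching the five levels described above.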

Fig. 5

Five levels of ASD phenotype ontology developed in our study. A Example of ASD phenotype ontology. B Examples of our ASD phenotype ontology displayed in the Protégé software for ontology analysis
